Design Better SQL Schemas





Design Better SQL Schemas

Most PHP applications use an SQL database. Here are some hints to help you avoid common problems.

PHP applications usually use MySQL databases for the back end. I've worked on a bunch of my own applications, as well as with open source application databases and some commercial ones. In my travels, I have seen a few common problems appear repeatedly; here are a few of those problems, along with easy solutions.

Bad Primary Keys

To find a unique record in a database table, you need a primary key. This is usually a unique, nonrepeating integer that starts at 1. All databases have the ability to handle this for you, but it seems that some engineers aren't aware of it.

Take the simple schema in Figure. You have an author table with an id and a name.

SQL without a primary key
DROP TABLE IF EXISTS author;
CREATE TABLE author (
		id INT,
		name TEXT
		);

But who ensures that the ID is unique? Often the PHP code that uses a table like this will first do a SELECT to find the maximum value of the ID field, and then create a new record with that value plus 1. But that takes an extra SQL statement and assumes the PHP developer remembers to take this step. It's much better to let the database handle this (rather routine) task.

A much better version of the schema from Figure is shown in Figure.

Adding an auto-incrementing ID field
DROP TABLE IF EXISTS author;
CREATE TABLE author (
		id INT NOT NULL AUTO_INCREMENT,
		name TEXT,
		PRIMARY KEY( id )
		);

Now the ID field is specified as an auto-incrementing integer that cannot be null. It's also identified as the primary key.

To insert a record into this type of table, follow this recipe:

	INSERT INTO author VALUES ( 0, "Brad Phillips" );

MySQL replaces the 0 value for the ID with an auto-incrementing value. To find out the value of the ID from the most recent insert, use this SELECT statement:

	SELECT LAST_INSERT_ID();

This version of the table is also faster than the first version because the primary key specification creates an index that speeds up the look-up process.

If you find that the code that inserts records into the database is first doing a SELECT to find the largest ID value, and then running INSERT with that value plus 1, you know the code is not auto-incrementing primary keys. This is not only a performance hit, but it's also a problem on high-traffic sites where records could easily be added with duplicate primary keys because of the separation in time between the SELECT that finds the largest primary key value and the INSERT that creates a new record.

Misunderstanding Relational Databases

Relational databases, such as Oracle and MySQL, are different from in-memory data structures that you would develop in structured programming languages (e.g., PHP, C, and Java) or object-oriented languages. In particular, programming languages can have a data structure that includes an array. Data structures like this don't translate directly over to relational databases.

The sample SQL in Figure shows a schema where a text field is used as an array of IDs.

Approximating an array in a programming language
DROP TABLE IF EXISTS author;
CREATE TABLE author (
		id INT,
		name TEXT
		);

DROP TABLE IF EXISTS book;
CREATE TABLE book (
		id INT,
		name TEXT,
		authors TEXT
		);

Figure shows the relationship between these two tables (that is, the lack of a real relationship).

The name and book tables


Here is an example of how you might use these tables:

	INSERT INTO author VALUES ( 1, "Brad Phillips" );
	INSERT INTO author VALUES ( 2, "Don Charles" );
	INSERT INTO author VALUES ( 3, "Brad silver" );
	INSERT INTO book VALUES ( 1, "MySQL in a bucket", "1,2" );
	INSERT INTO book VALUES ( 2, "Databases for Delinquents", "3" );

The books are added with comma-delimited lists of author IDs. Hardly takes advantage of the database's structure, does it.

To find out which authors belong to which books, the script first queries the book record, then splits the authors text field using preg_split(), and finally does another set of queries against those IDs (hopefully this makes you cringe as much as it does me).

Now let's talk about how this should be done. Figure shows a corrected schema with three tables. One is for authors, another is for books, and a third table relates the first two.

Using a join table to couple two tables together
DROP TABLE IF EXISTS author;
CREATE TABLE author (
		id INT NOT NULL AUTO_INCREMENT,
		name TEXT,
		PRIMARY KEY( id )
		);
DROP TABLE IF EXISTS book;
CREATE TABLE book (
		id INT NOT NULL AUTO_INCREMENT,
		name TEXT,
		PRIMARY KEY( id )
		);
DROP TABLE IF EXISTS book_author;
CREATE TABLE book_author (
		book_id INT,
		author_id INT
		);

The third table in this schema is critical. It relates the two tables, author and book, by the id field in each. For each author of a book, there will be a record in the book_author table. Figure shows the relationship among these three tables.

The author and book tables now properly related


Here is how the initial data is loaded:

	INSERT INTO author VALUES ( 0, "Brad Phillips" );
	INSERT INTO author VALUES ( 0, "Don Charles" );
	INSERT INTO author VALUES ( 0, "Brad silver" );
	INSERT INTO book VALUES ( 0, "MySQL in a bucket" );
	INSERT INTO book VALUES ( 0, "Databases for Delinquents" );
	INSERT INTO book_author VALUES ( 1, 1 );
	INSERT INTO book_author VALUES ( 1, 2 );
	INSERT INTO book_author VALUES ( 2, 3 );

The last set of INSERT statements assigns the first two authors to the first book and the third author to the last book.

Here is a query that returns a table with one line for every book and author combination:

	SELECT
			a.name AS author,
			b.name AS book,
			a.id AS author_id,
			b.name AS book_id
	FROM
			author AS a,
			book AS b,
			book_author AS ba
	WHERE
			a.id = ba.author_id AND
			b.id = ba.book_id;

And here is a similar query (augmented with another conditional to specify a single book):

	SELECT
			a.name AS author,
			b.name AS book,
			a.id AS author_id,
			b.name AS book_id
	FROM
			author AS a,
			book AS b,
			book_author AS ba
	WHERE
			a.id = ba.author_id AND
			b.id = ba.book_id AND
			ba.book_id = 1;

This query will return all of the authors of a single book without subqueries or string parsing. The results are a faster query and a database schema that takes advantage of a database's strengths (relationships) rather than its weaknesses (text parsing).

Use Not Null Fields

Databases can provide a lot of validation on the data stored in them. But even rudimentary validations go unused by programmers who don't know that the validations exist. Take the table in Figure, for instance.

A table that omits some implied requirements of the data it stores
DROP TABLE IF EXISTS user;
CREATE TABLE user (
		id INT,
	first TEXT,
	last TEXT,
	username TEXT,
	password TEXT,
	description TEXT
	);

What do we know about this table? Well, we know that the first, last, username, and password fields must never be empty. Wouldn't it be great if the database could help us to ensure that? Of course, it actually cancheck out Figure.

Small changes that make for dramatic results
DROP TABLE IF EXISTS user;
CREATE TABLE user (
		id INT NOT NULL AUTO_INCREMENT,

		first TEXT NOT NULL,
		last TEXT NOT NULL,
		username TEXT NOT NULL,
		password TEXT NOT NULL,
		description TEXT,
		PRIMARY KEY ( id )
		);

Here, I have not only upgraded the primary key so that it's automatically generated, but also added NOT NULL specifications to the fields that must not be empty. This will ensure that inserts will fail if the data is invalid. Why rely on good programming when your database can enforce these rules on its own?

See Also



 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows