
(Attention conservation notice: 2300 words about the foundations of databases. It would have been much less if I hadn’t felt obliged to explain everything from the ground up to people who know nothing about databases. I particularly invite comment from that part of my audience. How did I do?)
I’m glad I read this book when I did, before I knew very much about SQL or about databases generally. It’s an attempt to get people to think abstractly about databases, and therefore think more expansively about what databases can do for us. A database, in C.J. Date’s telling — and he’s got nearly (I emphasize nearly) all the weight of logic and evidence behind him — is a mathematical object that nearly exists in a Platonic realm; it only comes down to earth periodically to represent a collection of employees or a parts database or whatnot.
So then. What do most people think of when they think of a database? Under the influence of SQL, they imagine tables that look like so:
| Employee ID |
First name |
Last name |
Title |
Salary |
| 1 |
Jim |
Smitd |
CEO |
$9E9 |
| 2 |
Mark |
Johnson |
Janitor |
$10,000 |
Then we might have another table that lists addresses. “Why not put addresses in the same table as the employee-ID/first-name/last-name/salary table?” you might ask? Well, multiple employees can live at a single address (think of a husband and wife working at the same company), and a single employee can have multiple addresses (think vacation home and regular home, though naturally you might wonder why an employee database would list vacation homes). This is what’s called a “many-to-many relation,” and it needs to be treated in a special way. The specialness comes down to a desire not to introduce duplicate data into the database: you should only have to enter an address once, and you don’t want to risk mistyping the husband’s or the wife’s address. This is called “normalization,” and in a few specific forms it’s a tool to help keep the database consistent.
There are other accepted ways to keep databases consistent. One is to attach a type to each field. A “start date” field, for instance, must be of type DATETIME; “janitor” is not a valid start date. This is here to prevent humans from making mistakes; it would be all too easy to think you’re typing in the “job type” field, when in fact you’re typing in the “start date” field, and make some computer code further along down the line explode. True, you could write computer code to perform this same check, but enforcing a type for each field is the accepted way to do this sort of thing. [1]. And leaving this up to the application programmer is a recipe for an inconsistent database.
We define other constraints on the data as well. Let’s say, for instance, that we add a “Manager” field to our employee record. A manager is another type of employee, so this field should just include the manager’s employee ID. We add the constraint that the number under “manager’s employee ID” must exist under the “employee ID” field somewhere else in the database. This is called a “foreign-key constraint.” It’s another way to prevent errors. And when we point to managers by way of their employee IDs, we prevent duplicate records: I should only enter a manager’s full name once, and from then on only refer to the manager by way of his employee ID. Entering an employee record twice means twice the opportunities to introduce typing errors.
Now we have a collection of tables, constrained in certain ways. Here’s where C.J. Date takes this perspective and reframes it in a really compelling way: a database isn’t just a collection of data; it is a collection of true propositions. We’ve gone through all this effort to constrain our data because we want our database to express only truths. A row containing employee ID 10, with name “Andy Goldsworthy”, address “123 3rd St, Cambridge MA 02138″, and manager ID 129, is in fact a statement: “Employee ID 10 has name Andy Goldsworthy, etc.” If you’ve entered a typo, you have introduced a false statement into this database, and everything else is potentially broken: from a false statement, anything follows, and your database has now brought falsehood upon the land. You have darkened your neighbor’s door, and you shall be forced to atone.
If you think that this overstates the severity of the mistake you’ve just made, you are clearly not C.J. Date. The man takes his databases very seriously, because he views them as logical objects that just happen to be instantiated on physical computers.
This focus on the logic of databases leads him to despise certain parts of the modern database toolkit. The idea of a null value is his most-despised enemy, for one. A null indicates that we don’t know the answer to something. Let’s take the “salary” field in our employee database, for instance. Maybe this is currently unknown, because HR manages the database and hasn’t yet gotten that information from the person responsible for salary negotiations. HR needs to complete a report on employeee compensation, though, so it needs to compute amounts like the average employee salary. The average is the total salary for all employees divided by the number of employees. But the total salary is unknown, because certain employees’ salaries are unknown. If you’re not careful, you’ll end up with an average that’s too low: you’ll divide a salary total for all those with non-null salaries by the total count of all employees. And you can’t expect query languages to help you here: sometimes you’ll want to divide by the real total, and sometimes by the non-null total.
The use of nulls, though, seems largely unavoidable. You could split the salary field off into a table containing just employee ID and salary, for instance, and include no row at all when the salary is unknown. But eventually you’ll need to put this salary back with the rest of your employees’ information, using what we call a “join” (specifically, we’re “joining on employee ID”). In that joined table, sometimes you’ll want all employees to show up, whether or not their salaries are defined (an “outer join”), and sometimes you’ll only want those with salaries to show up (“inner join”). In the former case, you’re going to end up with nulls.
Or at least, C.J. Date doesn’t really give us much help in getting around this problem. It’s not clear what he would advocate in this situation. His theoretical cast of mind tells him that a database is a collection of true statements, period, full stop. Since a null introduces the possibility of false statements, it is to be rejected. It must be noted that he has a great many other practical reasons for rejecting nulls, among them that they make query optimization harder and require the user to go through hoops — like being careful about defining what “average” means, as above — that they just shouldn’t have to go through.
Viewing a database as a collection of true statements, it follows that the database should at no point contain false assertions; it must be carried from one true statement to another. This principle leads Date to reject many principles that normally guide database design. For one, he rejects the idea that a database should only be in a consistent state across transaction boundaries; every single statement that changes the state of a database, whether inside a transaction or not, must change it to another true state.
To explain what this means, we’ll use the classic example of database transactions: a bank. When you transfer money from your checking to your savings account, typically something like the following is supposed to happen within the bank’s computers:
START TRANSACTION;
Deposit $500 in savings;
Deduct $500 from checking;
COMMIT TRANSACTION;
(Semicolons indicate the end of a statement.)
If someone trips over the database’s power cable between the deposit and the deduction, a badly designed system will have just given you 500 extra dollars; such a system is said to not be “transactional.” A properly designed system, on the other hand, will start back up when the power cable goes back in, will note that the database is in an inconsistent state, and will “roll back the transaction” by withdrawing that $500 from savings. And even though that money may have ended up in savings before being rolled back, no one in the world would ever see it; so far as anyone viewing the database knows, no money leaves checking until it reaches savings. This is called “consistency across a transaction boundary”: inside the transaction, the checking account may temporarily have too much money, but outside the transaction the database contains a true picture of the world.
C.J. Date rejects this kind of consistency, again because a database is a collection of true statements about the world. Consistency, then, must be enforced at statement boundaries rather than transaction boundaries.
How would statement-level consistency actually be enforced? Date uses a new bit of syntax to enforce consistency at statement boundaries when the transaction involves multiple tables; instead of the transactional syntax from above, our checking-to-savings transfer would be written like this:
deposit $500 in savings, deduct $500 from checking;
This syntax puzzles me, because it doesn’t actually solve the problem that Date was trying to solve. The problem is that, in a real physical computer system, some actions happen after other actions. In our case, the deduction happens at a different time from the deposit, no matter what the syntax says. So there will be a time T when savings contains 500 more dollars than checking. Date’s syntax serves to mask this problem rather than solve it. The fundamental reason that he thinks this syntax solves his problem is that his head is focused on the logical basis of databases; that logical basis is timeless, in that collections of statements transform from one state to another instantaneously. His Platonic form of a database is literally timeless. He’s less concerned with physical databases, in which transactions take finite time to complete.
Date’s syntax woes pervade all of Database in Depth. He really, really, really hates SQL. He hates it because it provides limited syntactic support for fairly common use cases. The trouble is that he doesn’t provide a real alternative to SQL. Yes, he provides his own language called Tutorial D (bolding is [sic] from throughout Database in Depth), but Tutorial D is undercooked even by Date’s own standards: at points throughout Database in Depth, he invents his own ad hoc syntax, which tells us that the language certainly wasn’t ready for prime time before he started writing the book.
Even if Tutorial D were entirely ready for action, Date makes clear that he’s never implemented it in a real computer system. As its name implies, it is a language for use in the classroom. So far as I know, Tutorial D is not in use in any commercial database system anywhere. Certainly Date gives no reason to think that it has actually been implemented anywhere, which lets SQL win by default.
Anyone who reads Database in Depth from a relative newbie perspective will, I trust, be puzzled for the same reasons that I was: if SQL is so bad, then why hasn’t something like Tutorial D supplanted it? Is it another case of path dependence, where the world’s earlier mistaken adoption of SQL makes it harder to cast off SQL in the future (think “Microsoft Windows”)? Or is SQL the best real-world instantiation of the relational algebra that we could hope for? Likewise with nulls, and transactions, and all the other things to which Date is religiously opposed: if they’re so bad, why do we keep using them? Date never really says, which lets SQL win by default. Likewise, if Tutorial D is so expressive, and allows so many SQL constructs to be expressed in far fewer words, then how come no one — including Date himself! — has implemented the language in a commercial product or deployed it in a large database?
So it’s best to read Database In Depth as a summary of the relational algebra, and perhaps less of a good idea to read it as a damning verdict levied against SQL. You’ll learn the relational basis of all the SQL that you know and love. You’ll learn that foreign-key constraints and all the rest of a database’s integrity checks are just syntactic sugar on top of the relational algebra. You’ll learn how to reduce any of these sugary bits to their relational axioms. You’ll learn how a database query optimizer can transform your query into a faster one and know, beyond a shadow of a doubt, that the transformed query will return the same result as your original. You’ll learn the difficulties that SQL causes for those same query optimizers by not acting like a full relational algebra. Those of you with a mathematical cast of mind who like to understand how your systems work under the covers should read and reread Database in Depth.
[1] — This gets back to my question about inserting an IP address into MySQL. People raised the reasonable point that SQL itself shouldn’t be defining lots of different types; there are millions of reasonable types, and it’s not the language’s responsibility to provide all types to all people. While true, I think this misses the point that a modern language provides an extensible type system: C++ and Java don’t define a primitive date/time type, but they do allow programmers to compose their own date/time type from the more primitive types that the language makes available. It should be possible, in any modern language, to enforce a particular type for a particular field, and the type enforcement should take relatively few characters. Python, I’m looking at you here: I shouldn’t have to do
def some_func(foo, bar):
if not isinstance(foo, str):
raise TypeError("First arg to some_func must be a string")
if not isinstance(bar, int):
raise TypeError("Second arg to some_func must be an integer")
[rest of function]
I should be able to do
def some_func(string foo, int bar):
[rest of function]
Of course, just to be clear, Python does allow user-defined types. You can create your own classes to represent any complicated real-world thing. It just doesn’t allow shorthand type enforcement the way SQL does.
It looks some variants of SQL do, indeed, allow user-defined types. See, e.g., PostgreSQL, though this is the first I’ve read about it.