Decades of Architecture: NoSQL

There is this whole motion afloat, trying to declare SQL bankrupt and do without. Instead of SQL, one hears, there are going to be much better databases in the future. Dozens of projects are floating around, each with a different notion of what "better storage" mean, all aiming at being better data stores for the Internet.

Now, it is clear that SQL databases have their supreme annoyances, and the need for reform is clear. What pretty much all NoSQL project have in common, though, is that they look at the wrong problems and try to solve them with a more theological than philosophical or architectural approach.

Let's look at the deficiencies of SQL first. There are three main classes of problems:

SQL the language itself
SQL the data store format
SQL database scalability

Each of these classes brings a completely different set of considerations into play - and different reasons why SQL needs to change.

SQL-the-language

The main problem I have found with SQL as a language has always been the dynamic, interpreted nature of its command strings. You don't issue orders to a database, you pass it a string, and the database interprets it. As everyone knows that has dealt with interpretation in Poetry 101, that's dangerous and rarely unique.

The main issue is that it's very easy to get the escaping rules for content wrong and end up with a query that is not what you meant. In particular, that's dangerous when you mindlessly send user input to a database, which is precisely what a SQL injection is all about.

The forms of remedy that databases have used so far are meant to mitigate, not to prevent the issue. I think the problem is that most database developers think of developers that expose themselves to SQL vulnerability as fundamentally incompetent and are not willing to admit that the problem is in the language itself, that makes such exploits so easy.

The solution to the problem, instead, comes from changing the language itself. Instead of sending a string that needs to be interpreted, the database should require a structure that needs to be executed.

SQL-the-data-store

SQL is a little like the "C" of databases: something that is complete in that it can do everything that you need, but that is needlessly complicated for standard things you might want to do. In many respects, SQL constructs remind me of the absurd complexity of C "for" statements. When you want to iterate over a collection, it is silly to use a control structure that allows you all sorts of fancy.

There are two immediately obvious issues with SQL:

You can have only a single column of a given name
You cannot use a table as a data type

OK, I probably lost most of you now. What do these two mean?

Imagine your typical application. You have a user table, and it needs to store the user's email address. Now, users can have multiple email addresses. What do you do in SQL? You have two main options:

You create a column for as many email addresses as you want to allow
You create a separate table for email addresses and link it back to the user table

In the first case, you end up with tables that have columns like EMAIL1, EMAIL2, etc. In the second, whenever you want to add email to your searches, you have to perform a join with a table from which you match the email address by user ID. Something like:
SELECT user.username, email.address FROM user, email WHERE user.first = 'Marco' and user.id = email.userid;
Notice that you have to do this every single time you want user information, and the email table becomes an adjunct of the user table.

Instead, the rational thing to do would be to tell the database to worry about this crap, and to just allow the user to specify that a particular structure is present multiple times, or that a particular other table contains records that we are interested in.

For instance, there could be a field attribute 'hasmany' that allows you to specify that a table has a particular attribute multiple times. When a query comes in for this attribute, all the different values are considered. Instead of searching for an email address in the many fields, like this:
SELECT user.id FROM user WHERE email1 = 'marcog@example.edu' OR email2 = 'marcog@example.edu' OR email3 = 'marcog@example.edu';
you search in the field email that has multiple possible occurrences.

At the same time, consider the case of an address. That is a complex field, made up of multiple subfields (ZIP code, street address, city, state, apartment number, etc.). Instead of creating those in the user table (and then again in every other table that requires addresses), we can create an address table and link to it. Again, though, when we search, we don't want to have to do a join, we'd like to have the database do that for us.

There is a FOREIGN KEY constraint in SQL, but it is just that - a constraint. Instead, we'd like to have the external table as a data type. Right now, we specify the same type of the key (typically some form of int) in the foreign key column and the database allows only values that correspond to keys or the NULL. Instead, we'd like to simply say that the type of the column is the other table. The column definition would change from:
address_id INT FOREIGN KEY,
to
address TABLE addresses,

The new type is structured, so that when you ask for address.zip it is clear what you mean.

Hierarchical Data

Another thing that SQL is notoriously bad for is the storage and retrieval of hierarchies. That's a problem with SQL and recursion, I presume, and it would be easily fixed.

Suppose you have an employee table in which you store the hierarchical relationships between employees. Every employee has a manager (except for the CEO), and reports. The only functionality that SQL offers is the foreign key constraint into the same table, which is way too weak to be of help. We cannot ask questions like, 'who all is under the CFO?' Instead, we have to ask who is under the CFO, then who is under those who are under the CFO, then who is under this last set, etc. We need to repeat that for as many times as we have reports.

If SQL databases were aware of hierarchies, they could do the work without bothering us with complex queries. Even better, since the data they'd have to look at is such a small subset of the total data in the table, a specialized hierarchy index would speed up queries enormously. At the same time, it's really easy to figure out what hierarchical data means - if you have a foreign key into the same table, then it's hierarchical. Real easy to do.

Full Text Search

Another common annoyance of the SQL data store is that it doesn't provide a full view of the contents. When you want to know something, you have to ask every single column. Now, the murmur in the crowd will tell me, 'Why, Marco, but that's precisely what SQL databases are meant to be! If you want full text search, go create a full text index!'

Things are not as simple as that. Fact is that you want and have a structured store, but sometimes you just want to have a complete view of the record.

So far, if you wanted to do a full text search, you had to do one of the following:

Create a full-text index into the database or table, using an engine that allowed that
Dump the database or table and create an index of the dump
Create queries that joined all the possible fields in the database

All of these are inadequate. The first one limits you to specific engines. The second one relies on a laborious process, without guarantee that you can actually find the record associated with a particular location in the dump. The third one is too labor intensive.

The amazing issue is that databases know exactly where data is stored. For them to find a particular piece of information in the raw data is fairly easy, and it's a real shame that they don't allow for full text searching and indexing as a matter of course.

SQL Scalability

In my experience on the web, the first thing that causes headaches from a scalability perspective is the database server. Those, correspondingly, require the highest level of attention, the most effort, and the biggest hardware cost. Given that they don't really do that much, it's a real shame that they would be the consistent bottleneck. Fortunately, Internet architectures have several possibilities for improvement on current designs.

First, it should be though noted that most developers are not very proficient with SQL and database design. As a matter of fact, databases are a mode of thinking that is so far from software development, that it makes perfect sense for someone to specialize in them.

Unfortunately, this database administrator is also the person least likely to understand the shortcomings of database designs and the least likely to have the key insights to the changes required and desirable. All of them, indeed, stem from the difference between the enterprise design of database and their Internet usage. Follow me for a minute.

ACID Semantics

The greatest care in databases has traditionally been given to ACID. You can look up what that stands for, essentially it means that at any precise point in time, a database will always give the same answer for a specific request. That's extremely important when money is concerned, and relatively important in enterprise settings when you want to know for sure that nobody is going to get two different answers for the same request.

On the Internet, for virtually all applications that don't involve buying and selling, you are much better off relaxing that requirement. In general, it's not particularly important if an update arrives instantly or in two minutes, and it's not tragic if you happen to get a different response if you happen to hit a different server.

Strangely, ACID is one of the highest cost factors in databases, and it's only inertia that kept this kind of semantical effort afloat for all types of data. Giving up on it partially allows for a series of performance improvements that can make scalability much easier and cheaper.

Read/Write Mixes and Replication

Depending on the application and the implementation, on the Internet people typically read data much more frequently than they write it. Since writing is more expensive (to a database) than reading, it makes perfect sense for anyone that wants to scale up to distribute the reads and concentrate the writes.

To do this, database servers have to be set up in replication mode. One server is the master, a set of servers is the read replica that receives updates from the master. MySQL does replication in a particularly transparent manner: it creates a binary log of transactions and replays that log from the master to all slaves. In essence, the master tells all slaves what it did, and all the slaves copy it verbatim.

Some replication systems are more focused on speed and transmit file differences instead of transactions. Other systems focus on efficiency and transmit the smallest possible set of changes. In any case, the result is always the same: replicas and master are generally not synchronized, which means data can be different on the master and the replicas. This is the so-called replication delay.

Query Caching

Indexes into the database speed up standard queries by factors. The availability of full text indexes speeds up those by factors, as well. Individual queries, though, would benefit from their own caching. Most database servers implement query caching, but they cannot know which queries should be cached and so keep in memory everything that has been asked, which means the data that is likely to be asked again can be removed from the cache so that useless data can be stored.

Instead, queries should allow for a cache hint that tells the server they should be kept around, since the results are going to be required again. What kind of queries would that be? Typically, those that involve large numbers of items that need to be scrolled through. If your query needs to be paginated, then it needs to be cached, since the user will ask for more than you are presenting.

Optionally, the hint could be formulated in a negative fashion, like in the C compiler hint discardable.

Summary

Wow, this was quite a lengthy blog post. In summary, I believe that the movement to drop SQL in favor of different types of databases is unnecessary and destructive. Instead, a set of incredibly important improvements and extensions to the SQL language and to the semantics of databases would serve the new use case of Internet databases very well.

Decades of Architecture

Monday, April 12, 2010

NoSQL - No: NuSQL

5 comments:

Followers

Blog Archive

About Me