Sunday, October 24, 2010

The Curse of the Competent


As long as the Internet economy was happily bubbling ahead, the Peter Principle reigned supreme: Peter joined a startup and got promoted quickly. Invariably, he would land in a position where he not only started failing, but dragged the entire startup down with his incompetence. Which of course didn't quite matter, since the Peter Principle was (and is) democratic and made everybody else look just as stupid as Peter himself.

The economy turned sour, Peter's startup died like everybody else's, and now he scrambles to find a new job. Of course, he wants a stable company, with lots of cash reserves, good management, and excellent products. Since he is good, he lands The Job, invariably three hierarchy levels below where he was, at maybe half the salary. But who complains?

Around Peter are all the people who were there before him. They are a little afraid, since they know they kept their positions because their company did well, not because they were better than poor Peter. The Russian roulette of startups. His manager might not even have been able to get hired at Peter's old company. But that really doesn't matter.

Peter works hard, as he's used to. He solves any problem in half the time anyone else takes. Which doesn't make him a popular guy on the block. In particular, his manager knows her every action will be followed by the critical eye of someone who used to supervise people like her. And that's when Peter is hit by the Curse of the Competent.

Wednesday, October 13, 2010

The 10 Most Common Mistakes by Tech Job Interviewees (and How to Work Around Them)

The job market is very tight, and even if you are the smartest guy around, with tons of experience and no salary requirements to speak of, you might find it a tad difficult to find a new position. That's particularly true for those who are just starting out, or who have been out of the game for a while. Unless someone is actively pushing your name somewhere, you'll find it harder and harder to land that job.

Having witnessed literally hundreds of interviews, I can speak with some authority on what works and what doesn't from an interviewer's perspective. While there is outstanding advice out there for general interviewing ("Check your resume for spelling mistakes!", "Nobody has ever been refused a job for wearing nice clothes!", "Do not become confrontational!"), the advice is scant for the technical portion of your interview. Here are my top 10 mistakes, and how to avoid them.

Tuesday, October 5, 2010

Femininity - the Missing Half of Science and Technology?

I am a man, in the most stereotypical way imaginable. I suffer from all the symptoms of the condition - the hair slowly starting to grow where it shouldn't, the quick temper ready to flare up for virtually no reason, but most of all, the way I think.

You could call me sexist for saying this, but there seem to be marked differences in the way male and female brains work. These differences seem to relate to evolutionary advantages, and they seem to confirm a stereotypical notion of gender roles in the incipient human community. I am at the butt end of evolution, but I can see how we got to me.

Let's start with the unproven assumption that men were the hunters and foragers, while women were the nurturers. Let's add the assumption that humans, like many primates, were naturally inclined to form societies. What does that yield?

Saturday, July 3, 2010

The Role and Importance of Quality Assurance (QA)

There is a moment when the young and enthusiastic learn that seat-of-the-pants development is quick but eventually leads to catastrophe. You can tell which stage engineers are at by asking them what they think of QA: if they think it's an occupation for the lesser divinities of programming, they aren't there yet; if they have enough experience, they will think of QA engineers as demi-gods whose verdict makes or breaks months of coding.

Having been at it for decades, I am of course a very, very strong proponent of mandatory QA. To me, this last step in the development process fulfills three main goals:
  1. Interface stability and security: Making sure that the code does what it is supposed to do, especially in boundary conditions that developers typically overlook. The most common scenario is empty data (null pointers, etc.) where the code assumes there is an object; testing code for SQL injections is another invaluable example. This has nothing to do with the functionality of the code, but with its ability to behave properly in unusual conditions (see the sketch after this list).
  2. Performance and stress testing: Checking how the code behaves under realistic scenarios, not just in the simple case the developer faces. Instead of 5 users, make 500,000 run concurrently on the software and see what it does. Instead of 100 messages, see what the system does with 100,000,000. Instead of running on a souped-up developer machine with a 25" display, look at your software from the point of view of a user with a $200 netbook.
  3. User experience and acceptance: Ensuring the flows make sense from the end user's perspective. Put yourself in the user's shoes and try performing some common tasks. See what happens when you try something normal, but atypical. For instance, try adding an extension to a phone number and see whether the software rejects the input.
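
To make the first goal concrete, here is a minimal sketch of boundary-condition tests in Python; the function under test, parse_phone, is hypothetical and stands in for any input-handling code:

import unittest

def parse_phone(raw):
    # Hypothetical function under test: normalizes a phone number
    # to digits only, rejecting empty or missing input.
    if raw is None or raw.strip() == "":
        raise ValueError("empty phone number")
    digits = "".join(ch for ch in raw if ch.isdigit())
    if not digits:
        raise ValueError("no digits in phone number")
    return digits

class BoundaryTests(unittest.TestCase):
    def test_none_input(self):
        # The classic overlooked case: no object where one is assumed.
        with self.assertRaises(ValueError):
            parse_phone(None)

    def test_empty_input(self):
        with self.assertRaises(ValueError):
            parse_phone("")

    def test_hostile_input(self):
        # Input that would break naive string-built SQL must not slip through.
        with self.assertRaises(ValueError):
            parse_phone("'; DROP TABLE users; --")

if __name__ == "__main__":
    unittest.main()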

Wednesday, June 2, 2010

Security Matters

When I was little, I recall watching this popular science program in which Peter Ustinov popularized the theory of relativity. There was a rocket, a man on the rocket, a launch of the man and the rocket, and a transmission from the really fast man on the really fast rocket. The man on the really fast rocket saw earthlings slow down. Conversely, the earthlings saw the man talk really fast. Makes sense?

No, it doesn't. A cursory understanding of relativity tells you that the effect must be symmetrical: each observer sees the other's time slow down, not just one of them. So both the man on the rocket and the earth station would have observed an apparent slowdown in the other's time. Of course, that seems confusing, since once the man on the really fast rocket returns to earth, time will have passed much faster on earth than for the man - but the reason for that is not speed, but acceleration.
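
For the record, the symmetry is built into the time dilation formula itself: in either observer's frame, a clock moving at speed v appears slowed by the same factor,

  Δt_observed = Δt_proper / √(1 − v²/c²)

so neither observer is privileged until acceleration breaks the symmetry.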

This intro serves to explain something that has been bothering me for a while: the way people misunderstand information security concepts and continually use the wrong thing for the right purpose. It's really not hard, since there are only a few, very distinct concepts - yet people get them wrong all the time. It's a little as if people took "security" as a one-size-fits-all umbrella, and doing something secure meant doing everything under the umbrella.


Wednesday, May 26, 2010

Web 3.0

People have been thinking about the next generation of the Web ever since Web 2.0 landed and found its incarnation in Friendster, MySpace, Facebook, and Twitter. No conclusive Web 3.0 road map has ever convinced me, though, so I started thinking about my own.

I started looking at what made Web 1.0 and then Web 2.0 and decided that the trend could be extrapolated from there. The approach, I thought, should be Hegelian: every manifestation of the Web should solve a problem, create a new problem, and find its own solution in the next one.

What made Web 1.0? The problem we were having was information - mainly, the availability of information anywhere. People were paid for information back in the day; there was a huge industry of gathering, sifting, sorting, and selling it. Web 1.0 was all about that information - easier ways to distribute it, easier ways to connect with it, easier ways to share it.


Tuesday, April 27, 2010

Open Source and Interfaces

One of the things that complicates the development of software in the open source world is the enormous variety of different interfaces you have to deal with. This is eminently an architectural problem: interfaces need to be defined ahead of coding, and if you just start developing your own project, you have no need for uniformity across projects.

Initially, of course, there is no standard and hence no interface. Later, separate projects come up, and they make a point of having different interfaces, the better to spread and create incompatible ecosystems. That was the whole enmity between KDE and Gnome, or the many different brands of SQL server implementations.

There is nothing wrong with competing projects, especially nowadays. It dawned on people that copying from successful efforts is a good idea, and the availability of source code makes it easy to move not only ideas, but also implementations. So it came to be that the outstanding networking code in BSD became the forefather of the networking layers in many UNIX-ish operating systems.


Monday, April 12, 2010

NoSQL - No: NuSQL

There is this whole movement afloat, trying to declare SQL bankrupt and do without it. Instead of SQL, one hears, there are going to be much better databases in the future. Dozens of projects are floating around, each with a different notion of what "better storage" means, all aiming at being better data stores for the Internet.

Now, it is clear that SQL databases have their supreme annoyances, and the need for reform is real. What pretty much all NoSQL projects have in common, though, is that they look at the wrong problems and try to solve them with an approach that is more theological than philosophical or architectural.

Let's look at the deficiencies of SQL first. There are three main classes of problems:
  1. SQL the language itself
  2. SQL the data store format
  3. SQL database scalability
Each of these classes brings a completely different set of considerations into play - and different reasons why SQL needs to change.

SQL-the-language

The main problem I have found with SQL as a language has always been the dynamic, interpreted nature of its command strings. You don't issue orders to a database; you pass it a string, and the database interprets it. As everyone who has dealt with interpretation in Poetry 101 knows, that's dangerous and rarely unique.

The main issue is that it's very easy to get the escaping rules for content wrong and end up with a query that is not what you meant. In particular, that's dangerous when you mindlessly send user input to a database, which is precisely what a SQL injection is all about.

The remedies databases have offered so far are meant to mitigate the issue, not to prevent it. I think the problem is that most database developers consider application developers who expose themselves to SQL vulnerabilities fundamentally incompetent, and are not willing to admit that it is the language itself that makes such exploits so easy.

The solution, then, comes from changing the language itself. Instead of accepting a string that needs to be interpreted, the database should require a structure that can be executed.
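
The closest existing approximation of that idea is the parameterized query, where the statement skeleton and the data travel separately and the data is never interpreted as SQL. A minimal sketch in Python, using the standard library's sqlite3 module (table and data are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO user (email) VALUES ('marcog@example.edu')")

# Hostile input: bound as a parameter, it is just data, never part
# of the command, so it cannot terminate the string and smuggle in SQL.
raw = "' OR '1'='1"
rows = conn.execute("SELECT id FROM user WHERE email = ?", (raw,)).fetchall()
print(rows)  # [] - the injection attempt matches nothing

This is exactly the "structure, not string" discipline argued for above, applied at the client library level rather than in the language itself.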

SQL-the-data-store

SQL is a little like the "C" of databases: complete, in that it can do everything you need, but needlessly complicated for the standard things you might want to do. In many respects, SQL constructs remind me of the absurd complexity of C "for" statements: when all you want is to iterate over a collection, it is silly to use a control structure that allows you all sorts of fancy footwork.

There are two immediately obvious issues with SQL:
  1. You can have only a single column of a given name
  2. You cannot use a table as a data type
OK, I probably lost most of you now. What do these two mean?

Imagine your typical application. You have a user table, and it needs to store the user's email address. Now, users can have multiple email addresses. What do you do in SQL? You have two main options:
  1. You create a column for as many email addresses as you want to allow
  2. You create a separate table for email addresses and link it back to the user table
In the first case, you end up with tables that have columns like EMAIL1, EMAIL2, etc. In the second, whenever you want to add email to your searches, you have to perform a join with a table from which you match the email address by user ID. Something like:
SELECT user.username, email.address FROM user, email WHERE user.first = 'Marco' and user.id = email.userid;
Notice that you have to do this every single time you want user information, and the email table becomes an adjunct of the user table.

Instead, the rational thing to do would be to tell the database to worry about this crap, and to just allow the user to specify that a particular structure is present multiple times, or that a particular other table contains records that we are interested in.

For instance, there could be a field attribute 'hasmany' that allows you to specify that a table has a particular attribute multiple times. When a query comes in for this attribute, all the different values are considered. Instead of searching for an email address in the many fields, like this:
SELECT user.id FROM user WHERE email1 = 'marcog@example.edu' OR email2 = 'marcog@example.edu' OR email3 = 'marcog@example.edu';
you search in the field email that has multiple possible occurrences.

At the same time, consider the case of an address. That is a complex field, made up of multiple subfields (ZIP code, street address, city, state, apartment number, etc.). Instead of creating those in the user table (and then again in every other table that requires addresses), we can create an address table and link to it. Again, though, when we search, we don't want to have to do a join, we'd like to have the database do that for us.

There is a FOREIGN KEY constraint in SQL, but it is just that - a constraint. Instead, we'd like to have the external table as a data type. Right now, we give the foreign key column the same type as the key (typically some form of int), and the database allows only values that correspond to existing keys, or NULL. Instead, we'd like to simply say that the type of the column is the other table. The column definition would change from:
address_id INT FOREIGN KEY,
to
address TABLE addresses,

The new type is structured, so that when you ask for address.zip it is clear what you mean.

Hierarchical Data

Another thing SQL is notoriously bad at is the storage and retrieval of hierarchies. That's a problem with SQL and recursion, I presume, and it could easily be fixed.

Suppose you have an employee table in which you store the hierarchical relationships between employees. Every employee has a manager (except for the CEO) and may have reports. The only functionality SQL offers is the foreign key constraint into the same table, which is far too weak to be of help. We cannot ask questions like, 'Who all is under the CFO?' Instead, we have to ask who is under the CFO, then who is under those who are under the CFO, then who is under this last set, and so on, repeating as many times as the hierarchy is deep.

If SQL databases were aware of hierarchies, they could do the work without bothering us with complex queries. Even better, since the data they'd have to look at is such a small subset of the total data in the table, a specialized hierarchy index would speed up these queries enormously. At the same time, it's really easy to figure out which data is hierarchical - if a table has a foreign key into itself, it's hierarchical. Real easy to do.
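
To see the pain, here is what that repetition looks like from application code - a minimal sketch using Python's sqlite3 module, with an illustrative employee(id, name, manager_id) table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, "CEO", None), (2, "CFO", 1), (3, "Controller", 2), (4, "Clerk", 3),
])

def all_reports(conn, manager_id):
    # One round trip per hierarchy level: ask who is under the current
    # set, then who is under those, until no new rows come back.
    result, frontier = [], [manager_id]
    while frontier:
        placeholders = ",".join("?" * len(frontier))  # builds "?,?,..." only
        rows = conn.execute(
            "SELECT id, name FROM employee WHERE manager_id IN (%s)" % placeholders,
            frontier).fetchall()
        result.extend(rows)
        frontier = [r[0] for r in rows]
    return result

print(all_reports(conn, 2))  # everyone under the CFO, one query per level

Some engines (PostgreSQL, for one) already offer recursive common table expressions (WITH RECURSIVE) that push this loop into the database, which is exactly the direction argued for here.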

Full Text Search

Another common annoyance of the SQL data store is that it doesn't provide a full view of the contents. When you want to know something, you have to ask every single column. Now, the murmur in the crowd will tell me, 'Why, Marco, but that's precisely what SQL databases are meant to be! If you want full text search, go create a full text index!'

Things are not as simple as that. The fact is that you want, and have, a structured store - but sometimes you just want a complete view of the record.

So far, if you wanted to do a full text search, you had to do one of the following:
  1. Create a full-text index into the database or table, using an engine that allowed that
  2. Dump the database or table and create an index of the dump
  3. Create queries that joined all the possible fields in the database
All of these are inadequate. The first limits you to specific engines. The second relies on a laborious process, with no guarantee that you can map a location in the dump back to the record it came from. The third is too labor-intensive.

The amazing thing is that databases know exactly where data is stored. For them, finding a particular piece of information in the raw data is fairly easy, and it's a real shame that they don't allow full text searching and indexing as a matter of course.
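
As a taste of the first option, here is a minimal sketch in Python against SQLite's FTS5 extension (assuming a build that includes it; the table and data are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
# An FTS5 virtual table indexes every column for full text search.
conn.execute("CREATE VIRTUAL TABLE user_fts USING fts5(username, email, city)")
conn.execute("INSERT INTO user_fts VALUES ('marco', 'marcog@example.edu', 'San Francisco')")

# One query searches the whole record; no per-column OR chains needed.
rows = conn.execute(
    "SELECT username FROM user_fts WHERE user_fts MATCH ?", ("marcog*",)).fetchall()
print(rows)  # [('marco',)]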

SQL Scalability

In my experience on the web, the first thing that causes headaches from a scalability perspective is the database server. Database servers, correspondingly, require the highest level of attention, the most effort, and the biggest hardware cost. Given that they don't really do that much, it's a real shame that they are the consistent bottleneck. Fortunately, Internet architectures have several possibilities for improvement over current designs.

First, it should be noted, though, that most developers are not very proficient with SQL and database design. As a matter of fact, databases are a mode of thinking so far removed from software development that it makes perfect sense for someone to specialize in them.

Unfortunately, this database administrator is also the person least likely to understand the shortcomings of database designs, and the least likely to have the key insights into the changes that are required and desirable. All of these changes, indeed, stem from the difference between the enterprise design of databases and their Internet usage. Follow me for a minute.

ACID Semantics

The greatest care in databases has traditionally been given to ACID. You can look up what that stands for; essentially, it means that at any precise point in time, a database will always give the same answer to a specific request. That's extremely important when money is concerned, and relatively important in enterprise settings, where you want to know for sure that nobody is going to get two different answers to the same request.

On the Internet, for virtually all applications that don't involve buying and selling, you are much better off relaxing that requirement. In general, it's not particularly important whether an update arrives instantly or in two minutes, and it's not tragic if you get a different response when you hit a different server.

Strangely, ACID is one of the highest cost factors in databases, and it's only inertia that has kept this kind of semantic effort afloat for all types of data. Giving it up, even partially, allows for a series of performance improvements that can make scalability much easier and cheaper.

Read/Write Mixes and Replication

Depending on the application and the implementation, people on the Internet typically read data much more frequently than they write it. Since writing is more expensive (to a database) than reading, it makes perfect sense for anyone who wants to scale up to distribute the reads and concentrate the writes.

To do this, database servers have to be set up in replication mode. One server is the master; a set of servers are the read replicas that receive updates from the master. MySQL does replication in a particularly transparent manner: it creates a binary log of transactions and replays that log from the master to all slaves. In essence, the master tells all slaves what it did, and the slaves copy it verbatim.

Some replication systems are more focused on speed and transmit file differences instead of transactions. Other systems focus on efficiency and transmit the smallest possible set of changes. In any case, the result is always the same: replicas and master are generally not synchronized, which means data can be different on the master and the replicas. This is the so-called replication delay.
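
From the application side, the split looks roughly like this - a minimal sketch in Python, assuming ordinary DB-API connections, one to the master and any number to replicas:

import random

class ReplicatedDB:
    """Route writes to the master, spread reads across replicas."""

    def __init__(self, master, replicas):
        self.master = master      # DB-API connection to the master
        self.replicas = replicas  # list of DB-API connections to replicas

    def execute_write(self, sql, params=()):
        # All mutations go to the single master.
        cur = self.master.cursor()
        cur.execute(sql, params)
        self.master.commit()
        return cur

    def execute_read(self, sql, params=()):
        # Reads are spread across replicas. Because of replication
        # delay, a read may return slightly stale data - acceptable
        # for most non-transactional Internet workloads.
        conn = random.choice(self.replicas) if self.replicas else self.master
        cur = conn.cursor()
        cur.execute(sql, params)
        return cur.fetchall()

The one wrinkle: a user who just wrote something may not see it on a replica yet, so reads that must be fresh can be routed to the master explicitly.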

Query Caching

Indexes into the database speed up standard queries by factors. The availability of full text indexes speeds those up by factors, as well. Individual queries, though, would benefit from their own caching. Most database servers implement query caching, but they cannot know which queries should be cached, so they keep in memory everything that has been asked - which means the data that is likely to be asked again may be evicted from the cache to make room for useless data.

Instead, queries should allow for a cache hint that tells the server the results should be kept around, since they are going to be required again. What kind of queries would those be? Typically, those that involve large numbers of items to be scrolled through. If your query needs to be paginated, then it needs to be cached, since the user will ask for more than you are presenting.

Optionally, the hint could be formulated in a negative fashion, like the 'discardable' hint some C compilers offer.
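
For the record, MySQL's query cache already accepts coarse hints of both forms (the SQL_CACHE and SQL_NO_CACHE keywords in a SELECT). At the application level, the pinning idea is a few lines - a minimal sketch in Python, with keep as the hypothetical hint discussed above:

class QueryCache:
    """Cache query results, evicting unpinned entries first."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self.entries = {}  # (sql, params) -> (rows, pinned)

    def get(self, sql, params=()):
        hit = self.entries.get((sql, params))
        return hit[0] if hit else None

    def put(self, sql, params, rows, keep=False):
        # keep=True is the hint: paginated queries, for instance,
        # will be asked again, so they should survive eviction.
        if len(self.entries) >= self.max_entries:
            self._evict()
        self.entries[(sql, params)] = (rows, keep)

    def _evict(self):
        # Drop an unpinned entry if there is one; otherwise drop anything.
        for key in list(self.entries):
            if not self.entries[key][1]:
                del self.entries[key]
                return
        self.entries.pop(next(iter(self.entries)))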

Summary

Wow, this was quite a lengthy blog post. In summary, I believe that the movement to drop SQL in favor of different types of databases is unnecessary and destructive. Instead, a set of incredibly important improvements and extensions to the SQL language and to the semantics of databases would serve the new use case of Internet databases very well.

Sunday, April 11, 2010

MVC I: Hierarchical Views

I thought I'd start this architecture blog with a post on one of the things that has traditionally given me the most heartburn: the implementation of views in MVC architectures.

As a refresher, MVC is by now the standard architecture for most applications, web or not. It stands for Model, View, Controller. The model is the abstract representation of the data you are handling (for instance, invoices). The view is a particular representation of that data (for instance, the invoice edit screen). The controller is what ties the two together and adds user input: it fetches the data by instantiating the model and passes it to the view, which turns the data into a page.

The problem I have seen over and over is that MVC implementations tend to be controller-centric. The controllers compose the activities of the application. They tell the models which data to fetch, how to alter the data, and how to instantiate the session and other user parameters. They then select the view to display and invoke it, passing it the data they found along the way.

As a result of this mode of thinking, views are typically considered subservient to the controllers. Every controller has its own set of views, and when a particular view isn't working precisely for the problem at hand, the template is copied, a new folder created, and the copy manipulated.

The main problem with this approach is that this is not how graphic designers and users think. These two groups cannot see the underlying structures; they think in terms of objects on pages. To them, the navigation bar at the top is a constant - an object - and seeing it change for no reason is surprising. Which, in applications, always means bad.

Some view implementations try to take that into account by adding "snippets" of various kinds - subtemplates, macros, shortcuts, include files, and the like. But, really, what the implementation should look like is object-oriented: if the user thinks of a portion of the page as a navigation bar, then that's exactly what the code should say.

The advantage of going about things this way is that when the graphic designer decides a vertical navigation bar would be better than a horizontal one, you change the implementation only in the file that defines the navigation bar object. This way, one change to the views (from the users' perspective) is reflected in one change in the code.
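
A minimal sketch of the idea in Python (all class names hypothetical): views form a class hierarchy, the navigation bar is an object in that hierarchy, and switching it from horizontal to vertical is one change in one place:

class View:
    """Base class: every visible object on a page is a View."""
    def render(self):
        raise NotImplementedError

class NavigationBar(View):
    def __init__(self, links):
        self.links = links
    def render(self):
        # Horizontal by default: one row of links.
        items = " | ".join('<a href="%s">%s</a>' % (url, label)
                           for label, url in self.links)
        return '<div class="nav">%s</div>' % items

class VerticalNavigationBar(NavigationBar):
    def render(self):
        # Same object to the user, different layout: one link per line.
        items = "<br/>".join('<a href="%s">%s</a>' % (url, label)
                             for label, url in self.links)
        return '<div class="nav vertical">%s</div>' % items

class Page(View):
    def __init__(self, nav, body):
        self.nav, self.body = nav, body
    def render(self):
        return "<html><body>%s%s</body></html>" % (self.nav.render(), self.body)

# Swapping the navigation style is one change, in one place:
nav = NavigationBar([("Home", "/"), ("Archive", "/archive")])
print(Page(nav, "<p>Hello</p>").render())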

There is no penalty but planning in doing so. Modern processors are perfectly capable of doing the replacement at incredibly high speed, and the page generation can be cached for performance (even though caching is typically unnecessary, since fetching the data is usually far more expensive). The code is both easier to write and easier to maintain, and subclassing gives you a good feel for how inconsistent your application is becoming.

Which is why I wonder why I have yet to encounter a popular framework that starts its business with class hierarchies for views.

Saturday, April 10, 2010

printf("Hello, World!");

I've been working on the architecture of Internet systems since 1994. It's been a wonderful time, one that has seen software architects move collectively from obscure geekdom to running the development departments of the biggest Internet juggernauts. The best of times and the worst of times, frequently very close to each other, sometimes even coinciding.

I find that the world of software needs a lot of architectural attention. You won't agree with all my ideas of how things should work, but I guess you'll always have an opinion about them. The more people think about architecture, the more they actually do architecture, and that can only be a good thing.

So, comment liberally, disagree or agree with me, and just use these humble thoughts as a springboard for your own imagination. That's all I want out of this, and it would be a lot.