Monday, January 31, 2011

Comments in response to Don Anderson's 3n+1 benchmark

I am thrilled that Don Anderson has responded to a benchmarking problem I proposed, which aims at apples-to-apples benchmarks of databases and platforms.

I did virtually no tuning when I posted a solution using FIS GT.M.  Now that Anderson has posted results with tuning, I am motivated to do some performance tuning to see how much better I can get my program to run.  I will post results when I have them.
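For readers coming to this fresh, the 3n+1 rule at the heart of the benchmark is the Collatz iteration: halve an even number, otherwise triple it and add one, counting steps until reaching 1. A minimal Python sketch of just that rule (the actual benchmark program is written in M and stores intermediate results in the database, which this sketch omits):

```python
def collatz_steps(n):
    """Count the steps for n to reach 1 under the 3n+1 rule:
    halve if even, otherwise triple and add one."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

# For example, 7 -> 22 -> 11 -> 34 -> 17 -> 52 -> 26 -> 13 -> 40
# -> 20 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, i.e. 16 steps.
print(collatz_steps(7))
```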

Incidentally, I used the benchmark to compare several Linux file systems - in a nutshell, from fastest to slowest, the typical ranking was ext2, ext4, xfs, jfs, ext3 and btrfs.

Some Comments and Corrections to Anderson's Observations

Apropos transaction processing and ACID transactions: since many NoSQL databases don't support them, I did not require transaction processing in the problem statement, but I did require the database to be operated for recoverability (understanding that apples-to-apples comparisons of recoverability algorithms are not straightforward).  GT.M supports both fully ACID transactions (TP) and non-TP updates with slightly relaxed Durability; the latter are also fully journaled and recoverable - the run-time system ensures that no database block is written to disk until the journal records for that update have been written to the journal file and an fsync() performed.  To ensure serialization, a TP commit also commits any prior non-TP updates.  The JNLFLUSH that Anderson writes about is not needed for normal operation.  My benchmark program uses a combination of the two: results are updated using TP commits, but intermediate computations use non-TP updates.  So the database is indeed operated in a recoverable mode.
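The ordering described above - journal records written and fsync()'d before the corresponding database block is touched - is the classic write-ahead discipline. As an illustration of the ordering only (this is not GT.M's implementation; the file names and record format here are made up), a Python sketch:

```python
import os
import tempfile

def apply_update(journal_path, db_path, record, block):
    """Write-ahead discipline: make the journal record durable
    (write + fsync) before writing the database block."""
    with open(journal_path, "ab") as jnl:
        jnl.write(record)
        jnl.flush()
        os.fsync(jnl.fileno())   # journal record hits disk first ...
    with open(db_path, "ab") as db:
        db.write(block)          # ... only then the database block

# Demonstrate the ordering with throwaway files (names are illustrative).
tmp = tempfile.mkdtemp()
jnl_path = os.path.join(tmp, "benchmark.mjl")
db_path = os.path.join(tmp, "benchmark.dat")
apply_update(jnl_path, db_path, b"SET ^result(7)=16\n", b"<updated block>")
journal_contents = open(jnl_path, "rb").read()
```

If a crash lands between the two writes, replaying the journal reconstructs the lost block - which is why even the non-TP updates remain recoverable.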

Apropos his comments about "a stopwatch being started after the runners hear the starting gun", what he is evidently referring to are these two lines of code:

  Lock -l1 ; Release lock so processes can run

  Set startat=$HOROLOG ; Get starting time

The situation at that point is that several worker processes have been spawned and are waiting for the parent process to release lock l1.  Ideally, the lock release and the starting of the timer would happen at the same instant.  The reason I release the lock first is that interprocess communication is involved before the workers can start their work, whereas the timer is started on the very next line of the same process - in other words, this order gets the recorded start time closer to when the workers actually start.  But even if I were to reverse the lines, it would affect times by at most milliseconds, because the parent process is not acquiring a lock: it is releasing a lock it owns, which is a near-instantaneous operation.

Extending Anderson's analogy, the starting gun of a race starts the stopwatch.  That's what we have here.
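The gate-then-clock pattern is easy to see in any language. Here is a Python sketch using threads and an Event in place of worker processes and lock l1 (an analogy to the M lines above, not a translation):

```python
import threading
import time

results = []

def worker(gate, idx):
    gate.wait()                        # block until the parent opens the gate
    results.append((idx, time.monotonic()))

gate = threading.Event()               # plays the role of lock l1
workers = [threading.Thread(target=worker, args=(gate, i)) for i in range(4)]
for w in workers:
    w.start()                          # workers spawn, then block on the gate

gate.set()                             # release the gate first (Lock -l1) ...
startat = time.monotonic()             # ... then read the clock (Set startat=$HOROLOG)

for w in workers:
    w.join()
```

Releasing first and reading the clock on the very next line keeps the recorded start as close as possible to the moment the workers actually begin.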

A Few Minor Nits

Anderson writes "GT.M (the language formally known as MUMPS)".  Actually, the formal name of the language (commonly used in health care applications, just as SQL is commonly used in business applications) is M (ISO/IEC 11756:1999), but it is informally known as MUMPS (as in Massachusetts General Hospital Utility Multi-Programming System).  GT.M is a FOSS implementation of a transaction-processing database engine with an implementation of M.  It is easy to use the database from C, as well as Python, Perl and PHP.  There are API-compatible clients for other databases (go to M/Gateway to see some).  Indeed, the Universal NoSQL white paper by George James and Rob Tweed discusses how the GT.M engine can support all four common NoSQL use cases.

To access current GT.M documentation, go to the GT.M home page and click on the User Documentation tab.  The link in Anderson's post is to old documentation on Source Forge - which remains there so as not to break old web pages that may point to it.

GT.M has functionality for creating LMS (Logical Multi Site) application configurations as described in the user documentation, with a 20,000 kilometer distance limit between instances.

When Anderson says, "I used 3 threads, as that seemed to be the sweet spot", he misses one aspect of the benchmarking problem.  In order to simulate a database under load (dealing with concurrency has costs), the problem statement requires at least two processes per core.  I suspect that going from 3 threads to 4 threads (since his computer has a dual-core CPU) would not affect his results much.  I did not tune my results for the number of cores, by the way - my results include runs with 8 worker processes on a dual-core CPU.  Someday, I'd like to see if I can come up with an apples-to-apples way to compare database engines under heavy load - how a system performs as it is pushed to the breaking point tells us more about its robustness than how it runs with resources to spare.
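The two-processes-per-core requirement is trivial to express in code; a Python sketch (the function name min_workers is mine, not part of the benchmark program):

```python
import os

def min_workers(processes_per_core=2):
    """The benchmark rules require at least two worker processes
    per core, to keep the database under concurrent load."""
    cores = os.cpu_count() or 1   # os.cpu_count() can return None
    return processes_per_core * cores

# On a dual-core machine this gives 4 workers, not 3.
```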


  1. GT.M can also be accessed via JavaScript (using Node.js) - see


  2. I have a few comments.

    First: I do not know what you mean by "recoverable". If the transactions are not durable, then you cannot recover them. Recoverable in the database literature means that you can recover the state as it was after all transactions that committed, in the face of any "stop", including total system crash. The log file must be persisted when a (write) transaction commits. Omitting this can speed things up a lot; real durability isn't cheap if you have to actually write the log to rotating magnetic storage. (Less so if you can use flash, or your disk has a battery-backed RAM buffer, or you're copying the data to main memory in two servers with independent failure modes, or something like that.)

    Is this really a database benchmark? You use the word "NoSQL" in the name. But since this is just a computation that ends promptly, getting the job done does not need durability anyway. It just runs to completion. That is, unless you want the application to complete even if a server crashes; but then measuring the execution time would not make sense. I don't even see why there's any need for a log at all.

    Second: This benchmark appears to be operating on key/value pairs that are very small, containing small character strings (naming integers) rather than, say, images. Don Anderson says that the record sizes are typically 20-40 bytes. Some of the "NoSQL" systems are clearly designed for much larger K/V pairs. I am pretty sure that Riak (with BitCask) is like this, as well as Google's BigTable and the clones thereof (HBase in Hadoop, and Cassandra).

    Third: Many of the NoSQL systems are specifically designed to use replication, to the point where using them without replication is simply not done, and therefore not optimized for. This might make a difference.

    Fourth: About tuning - this is a problem that always arises in benchmark comparisons. I have been involved in many such comparisons in my career, especially the "OO7" benchmark for object-oriented database systems. "Apples to apples" gets to be very hard to define fairly. This isn't your fault; it's an inherent problem in comparative benchmarks.

    It's a good idea to have well-specified benchmarks, and this is a good start!

    -- Dan Weinreb

  3. I responded to Dan Weinreb's comments with another post, because it was too long for a reply.

    -- Bhaskar