Saturday, June 12, 2010

3n+1 NoSQL/Key-value/Schema-free/Sche... - version 2

Problem statement of the second version of the benchmark.  An earlier version of this benchmark problem statement can be found at
http://ksbhaskar.blogspot.com/2010/02/3n1-nosqlkey-valueschema-freesche.html

Motivation

This is an attempt to create a benchmark for apples to apples comparisons of databases that are generally described as NoSQL, key-value, schema-free or schema-less databases.  The goals are to create a benchmark that:

  • is simple to understand and code in most programming languages;
  • performs a non-trivial amount of representative activity - reads and updates, access contention, recovery, etc.; and
  • generates results that can be easily replicated and verified by others.

Since many popular databases of the genre being discussed do not support ACID (Atomic, Consistent, Isolated, Durable) transactions, the benchmark does not require transaction processing, although there is no prohibition on using such functionality if available.

If your database supports strings with international character sets with a multi-byte encoding such as UTF-8, please do demonstrate their usage, since it will help show your database in a more attractive light.

Background

Consider a sequence of positive integers n0, n1, n2 ...  ni+1 is generated from ni as follows:

  • if ni is even, ni+1=ni/2
  • if ni is odd, ni+1=3*ni+1

If n0=1, this leads to the cycle 1, 4, 2, 1 ...  The Collatz conjecture says that for any positive integer starting point n0, the sequence eventually leads to 1.  The sequence starting with 4 and ending at 1 has three steps.  The sequence starting with 6 and ending with 1 (6, 3, 10, 5, 16, 8, 4, 2, 1) has 8 steps, and the highest integer encountered in the sequence is 16.

Problem

Write a program that reads from its standard input three positive integers i, j (0<i<j<10,000,000,001), k (>0), and an optional fourth integer l (>0), each separated by a single space.  It should implement a multiple parallel execution stream (multi-process or multi-threaded) solution that uses a database to compute and report the number of steps in the largest Collatz sequence of any integer in the range, as well as the largest integer encountered when computing those sequences.  The input range should be divided into blocks of integers, such that each execution stream looks for the next block not already claimed, claims it, computes and stores results for the integers in it, then moves to the next available block until the entire range is computed.  The program should also collect and report statistics on database updates and reads, as well as overall update and read rates.

More specifically, on its standard output, the program should print the following separated by spaces:

  • the inputs, i and j; representing the range of integers over which the Collatz sequence should be computed
  • the number of parallel execution streams, which must be at least twice the number of CPUs or cores (which is reported in /proc/cpuinfo on Linux); if the input number k greater than or equal to this number, simply print it and use it, otherwise print it in the form k0->k1 where k0 is input number, and k1 is the actual number of execution streams used (i.e., twice the numbers of CPUs or cores)
  • the size of blocks of integers that each execution stream will work on at a time; if this is not specified, or is greater than the range divided by the number of parallel execution streams, it should be forced to the range divided by the number of parallel execution streams (one block may be smaller than the others); if the input number l is greater than zero and not greater than the range divided by the number of streams, simply print it, otherwise print it in the form l0->l1 where l0 is the input number and l1 is the actual block size
  • the maximum number of steps in the sequence for any integer i through j
  • the largest integer reached when computing the steps for any integer in the sequence i through j (caveat: as this will significantly overflow a 32-bit integer, you must use something larger)
  • the number of elapsed (wall clock) seconds (fractional seconds optional) it took to solve the problem
  • the number of updates performed on the database (which will be much higher than j-i+1 since two or more execution streams can potentially update the same node at the same time and because the program will work on integers larger than j); the number of updates to the nodes where the Collatz sequences are tracked should be counted; counting database updates for coordination and control is optional
  • the number of reads performed on the database (which will be much higher than j-i+1); the number of reads to the nodes where the Collatz sequences are tracked should be counted, those for coordination and control are optional
  • the update rate calculated by the number of updates divided by the elapsed time; and
  • the read rate calculated by the number of reads divided by the elapsed time

The code does not need to validate inputs.  Here are required elements of the solution:

  • Your solution must use a database configured to be recoverable.  In other words, if the system were to crash in the middle of solving a problem, rebooting the computer, recovering the database and restarting the program with the same input i and j uses the partial results in the database computed prior to the crash.  Note that the 3n+1 problem does not have the concept of atomic transactions to which a database can supply ACID properties.  However, the concepts of Consistency and Durability are meaningful - when a new longest 3n+1 sequence is computed, it must be made Durable, and the lengths of the 3n+1 sequences of each integer in its sequence must also be made Durable.  A recovery mechanism of replicating to another system is acceptable as long as the replication is synchronous - i.e., an update is not considered committed to the database on the originating instance until it is received and accepted by the replicating instance.  When submitting results, please document the recovery technique used (forward recovery, which requires a backup to be restored and updates applied from a journal file; backward recovery which recovers the database in place from a journal file; etc.).

  • For convenience, your program is allowed to accept multiple lines of input, processing each line in turn.  However, between lines, you must empty, clear, re-initialize (whatever term your database engine uses) the data in the database, so that the computation for each line starts out with a database with no data.  Of course, this means that in the event of a crash, only the data for the specific i and j being worked is recovered, and resuming and completing the problem only refers to the solution for that i and j.

  • In order to factor in the effect of dealing with access contention and concurrency control, your solution must use at least the greater of k or two concurrent execution streams (threads, processes, or both) per CPU core to solve the problem.  So, if your CPU has two cores, and k is 3, you must use at least 4 execution streams.

  • The problem involves only integer values with integer keys in the database.  This is not representative of real world applications, which more commonly need to use string keys and values.  The program should convert the integer keys and values for the number of steps to be stored as strings rather than as integers.  For example, instead of creating an entry for a node with a subscript of 11 with a value of 15, you would create an entry with a subscript of "one one" with a value of "one five".  Except for nodes storing the number of steps in the Collatz seqence for each integer, you are free to use numerics for all other purposes, including program flow control.

Submission

Post your results on the web.  You should include:

  • The computing platform on which you ran your benchmark (details of CPU, RAM, disk type, operating system).
  • Your program.  Your program must either be released under a FOSS license (either determined to be GPL compatible by the FSF or OSI approved) or it must be released with no claim of Copyright (which puts it in the public domain).
  • A summary of your database configuration (in order to keep it concise and readable, list only information that would be relevant to someone attempting to reproduce your results).
  • Except for the operating system, licenses for the software stack you used ("proprietary" suffices as a description of any license that is not one on the FSF or OSI lists).  If your database engine has different versions available under different licenses, you must specify the license of the version you used.
  • Results for as many of the following inputs as you have the patience to run (you may use higher values for k if that gives better results, and you may specify values of l that optimize throughput):
    • 1 100000 4
    • 1 100000 8
    • 1 1000000 4
    • 1 1000000 8
    • 1 10000000 8
    • 1 10000000 16
    • 1 100000000 8
    • 1 100000000 16
    • 1 1000000000 16
    • 1 1000000000 32
  • Results for any other sets of inputs that you choose to run that you think presents your database well.
  • Any other information that you think adds value to the understanding of the above.

Revision History

Version
Date
Summary
0.1June 12, 2010Initial draft, based on previous Collatz benchmark problem.

K.S. Bhaskar
bhaskar at bhaskars dot com


3 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. This is an interesting benchmark, and you've done a good job of defining it carefully, an important thing for performance measurement. The NoSQL world badly needs some benchmark results to back up all the scalability claims. However, I prefer the Yahoo benchmark as a better measure for NoSQL and Web 2.0 applications. I'm a great fan of keeping a benchmark simple, but I believe yours is too simple. We need a more realistic balance of reads and writes, a measure of scaling with additional servers, etc. Anyway, I'm adding a NoSQL "call for benchmarks" in the paper I'm publishing, to try to get something going...

    ReplyDelete
  3. Rick, thanks for your comments. You are right that this is a simple problem - but it certainly has some interesting subtleties that will become apparent if you spend an hour or so to solve it.

    I would like to get a better balance of reads and updates - I would really like to limit updates to 5% of accesses and no more than 10% (it's concurrent updating that makes databases interesting), but I have not yet found the right problem with that update rate and which is simple to program.

    I thought about including replication in the benchmark, but ultimately left it out because there are many models of replication. Suggestions for making the benchmark more sophisticated are welcome.

    ReplyDelete