Sunday, February 14, 2010

3n+1 NoSQL/Key-Value/Schema-Free/Schema-Less Database Benchmark

Motivation

This is an attempt to create a benchmark for apples to apples comparisons of databases that are generally described as NoSQL, key-value, schema-free or schema-less databases.  The goals are to create a benchmark that:

  • is simple to understand and code - ideally, a page of code or less in most programming languages;
  • performs a non-trivial amount of representative activity - reads and updates, access contention, recovery, etc.; and
  • generates results that can be easily replicated and verified by others.

Since many popular databases of the genre being discussed do not support ACID (Atomic, Consistent, Isolated, Durable) transactions, the benchmark uses simple reads and updates rather than requiring transaction processing.

If your database supports strings, it is strongly recommended that you implement and present the enhanced benchmark with strings described below, even if you also implement and present an integer-only benchmark.  Also, if your database supports strings with international character sets with a multi-byte encoding such as UTF-8, please do demonstrate their usage, since it will help show your database in a more attractive light.

Background

Consider a sequence of positive integers n0, n1, n2 ...  ni+1 is generated from ni as follows:

  • if ni is even, ni+1=ni/2
  • if ni is odd, ni+1=3*ni+1

If n0=1, this leads to the cycle 1, 4, 2, 1 ...  The Collatz conjecture says that for any positive integer starting point n0, the sequence eventually leads to 1.  The sequence starting with 4 and ending at 1 has three steps.  The sequence starting with 6 and ending with 1 (6, 3, 10, 5, 16, 8, 4, 2, 1) has 8 steps.

Problem

Write a program that reads from its standard input three positive integers i, j and k (0<i<j<10,000,000,001), separated by a single space, and on its standard output, prints the following separated by spaces:

  • the inputs, i, j & k;
  • the actual number of execution streams;
  • the maximum number of steps for any integer in the sequence i through j;
  • the largest integer reached when computing the steps for any integer in the sequence i through j;
  • the number of elapsed (wall clock) seconds it took to solve the problem;
  • the minimum number of updates performed on the database (the actual number may be higher if two or more execution streams updated the same node);
  • the minimum number of reads performed on the database (the actual number may be higher);
  • the update rate calculated by the minimum number of updates divided by the elapsed time; and
  • the read rate calculated by the minimum number of reads divided by the elapsed time.

The code does not need to validate inputs.  Here are required elements of the solution:

  • Your solution must use a database configured to be recoverable.  In other words, if the system were to crash in the middle of solving a problem, rebooting the computer, recovering the database and restarting the program with the same input i and j uses the partial results in the database prior to the crash.  A recovery mechanism of replicating to another system is acceptable as long as the replication is synchronous - i.e., an update is not considered committed to the database on the originating instance until it is received and accepted by the replicating instance.  When submitting results, please document the recovery technique used (forward recovery, which requires a backup to be restored and updates applied from a journal file; backward recovery which recovers the database in place from a journal file; etc.).

  • For convenience, your program is allowed to accept multiple lines of input, processing each line in turn.  However, between lines, you must empty, clear, re-initialize (whatever term your database engine uses) the data in the database, so that the computation for each line starts out with a database with no data.  Of course, this means that in the event of a crash, only the data for the specific i and j being worked is recovered, and resuming and completing the problem only refers to the solution for that i and j.

  • In order to factor in the effect of dealing with access contention and concurrency control, your solution must use at least the greater of k or four concurrent execution streams (threads, processes, or both) per CPU core to solve the problem.  So, if your CPU has two cores, and k is 6, you must use at least 8 execution streams.  When each stream completes its allocated work, it must check to make sure that no other stream has unfinished work, and if it finds unfinished work, it should work on it.  Thus, if the problem is 1 through 1,000,000 and there are four streams assigned 1 through 250,000; 250,001 through 500,000; 501,000 through 750,000 and 750,001 through 1,000,000; after completing the work on its 250,000 numbers, each stream must check the other 750,000 numbers and work on any number that it finds still waiting to be worked on.  In this example, this means that solving the problem involves at least 1,000,000 database updates and at least 4,000,000 database reads.

Submission

Post your results either directly to the NoSQL list, or on the web with a URL in a post on the NoSQL list.  You should include:

  • The computing platform on which you ran your benchmark (details of CPU, RAM, disk type, operating system).
  • Your program.  Your program must either be released under a FOSS license (either determined to be GPL compatible by the FSF or OSI approved) or it must be released with no claim of Copyright (which puts it in the public domain).
  • A summary of your database configuration (in order to keep it concise and readable, list only information that would be relevant to someone attempting to reproduce your results).
  • Except for the operating system, licenses for the software stack you used ("proprietary" suffices as a description of any license that is not one on the FSF or OSI lists).  If your database engine has different versions available under different licenses, you must specify the license of the version you used.
  • Results for as many of the following inputs as you have the patience to run:
    • 1 100000 4
    • 1 100000 8
    • 1 1000000 4
    • 1 1000000 8
    • 1 10000000 8
    • 1 10000000 16
    • 1 100000000 8
    • 1 100000000 16
    • 1 1000000000 16
    • 1 1000000000 32
  • Results for any other sets of inputs that you choose to run that you think presents your database well.
  • Any other information that you think adds value to the understanding of the above.

Enhancement to Use Strings

The benchmark as formulated above stores integer keys and values in the database, which is not representative of real world applications that more commonly need to use keys and values that are strings.  The enhanced benchmark requires that keys and values for the number of steps be stored as strings rather than as integers.  For example, instead of creating an entry for step(11) with a value of 15, you would create an entry step("one one") with a value of "one five".  You are free to use numerics for all other purposes, including program flow control.

Revision History

Version
Date
Summary
0.1November 21, 2009Initial Draft
0.2December 1, 2009Added section on Enhancement to Use Strings
0.3February 19, 2010
  • Add recommendation to provide enhanced benchmark using strings, with international character support, if supported by the database.
  • Change "cycle length" to "steps" to correspond to Wikipedia page on the Collatz conjecture.  The number of steps is one smaller than the cycle length.
  • Enhance output to print largest integer reached during the computation, estimate number of reads and number of updates, and read and update rates.
  • Clean up / clarify wording to improve readability.

K.S. Bhaskar
bhaskar at bhaskars dot com


No comments:

Post a Comment