Saturday, February 12, 2011

From 44 seconds to 27 seconds - Simple Tuning of the 3n+1 Benchmark on GT.M

The bottom line first: with a little simple tuning, I was able to improve the performance of my benchmark of the 3n+1 problem for inputs 1 through 1,000,000 on my laptop.
  • For four worker processes:
    • from 44 to 27 seconds
    • from 72,026 to 109,311 reads/second
    • from 49,298 to 74,828 updates/second
  • For eight worker processes:
    • from 248 to 30 seconds
    • from 12,782 to 109,285 reads/second
    • from 8,750 to 74,802 updates/second

In response to Don Anderson's report of his results on the 3n+1 database benchmark version 2, this post documents some simple tuning of the implementation I previously reported.  At that time, I did only minimal tuning (GT.M has fewer knobs to turn and buttons to push than most database engines, because when we add tuning knobs and buttons we endeavor to make them self-tuning).

Notes
  • I did not attempt to tune:
    • The number of parallel worker processes - one objective of the benchmark is to test the database with access contention, which is simulated with two and four worker processes per CPU (core).  [Studying database operation without access contention is like studying a highway without traffic.]
    • The database block size - GT.M supports database block sizes in multiples of 512 bytes, from 512 to 65,024 bytes.  Smaller block sizes are computationally more efficient; larger block sizes (at least those which are multiples of the native page size of the underlying file system - 4KB on my laptop) are more efficient for IO.  I just used the popular 4KB database block size from GT.M's gdedefaults file (see the sketch after this list).
    • The access method - GT.M supports the Buffered Global (BG) access method, which uses a pool of buffers in shared memory as a cache for database blocks, and the Mapped Memory (MM) access method, which simply maps database files into the virtual address space of the processes that access them, letting the operating system manage caching (MM also provides recoverability).  BG is more commonly used, and that is what I benchmarked.
    • A few other parameters that are not self-tuning, most of which are commonly left at their defaults.  For example, there is an option to write a full block's worth of bytes even if the actual data is less than a full block - on some IO subsystems this may result in a faster and simpler write operation instead of a read-modify-write operation - I just used the default, which writes only the actual data.
  • I ran the tests on both jfs, which is my preferred file system, and ext4, which is popular these days for Linux.
  • I ran the test for inputs of 1 through 100,000 as well as 1 through 1,000,000.  I am reporting only the results for the larger problem - running the smaller problem first just gives me a quick check that everything is set up correctly.
  • The test program is available and can be downloaded from http://sourceforge.net/projects/fis-gtm/files/Benchmarking/threeen1/threeen1f.tgz
  • A spreadsheet with all results is available at http://tinyurl.com/Bhaskar3nplus1Results - the benchmark results reported here are the runs dated February 5, 2011.
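
For reference, the block size and access method discussed in the notes above live in the global directory, which is maintained with GDE (the Global Directory Editor).  A rough sketch of setting them, assuming a default GT.M installation and that the gdedefaults command file is accessible from the working directory - the qualifier names are taken from the GT.M Administration and Operations Guide, so check them against your version before relying on them:

    $gtm_dist/mumps -run GDE      # start GDE to edit the global directory
    GDE> @gdedefaults
    GDE> change -segment DEFAULT -access_method=BG -block_size=4096
    GDE> exit
    $gtm_dist/mupip create        # create the database file the global directory describes

The @gdedefaults line executes the defaults file shipped with GT.M (the source of the 4KB block size, 1,000 global buffers and 128 journal buffers used as the baseline below); the change command is shown only to make the block size and access method explicit, since gdedefaults already supplies these values.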

    Benchmark System

    I ran the benchmark on a System 76 laptop with the following specifications (the system referred to as Ardmore in the spreadsheet):
    • CPU: Intel Core2 Duo T7500 at 2.20GHz
    • RAM: 4GB 667MHz DDR2
    • Disk: Seagate ST9200420AS 200GB 7200rpm with SATA interface
    • OS: 64-bit Ubuntu 10.10 with Linux kernel 2.6.35-25-generic
    • File system: jfs & ext4
    • Database: GT.M V5.4-001 64-bit version

    Baseline

    I first ran the test with default settings from the gdedefaults file in the GT.M distribution: a database block size of 4KB with 1,000 global buffers (corresponding to a database cache of 4MB) and 128 journal buffers (each of 512 bytes, corresponding to 64KB of journal file buffers).  This yielded the following run times in seconds:
    • Four worker processes - jfs: 44; ext4: 66
    • Eight worker processes - jfs: 248; ext4: 104

    I believe the reason eight workers were so much worse than four is that they churned the database blocks cached in the global buffer pool.  The database grows to 138MB, and the 3n+1 problem has only very limited locality of reference, so more concurrent processes result in a buffer pool that is too small for the working set.


    Increase database buffer pool to 5,000 buffers

    With 5,000 buffers (a 20MB buffer pool), performance improves dramatically with eight workers, improves slightly with four workers on ext4, but decreases with four workers on jfs.  I do not have a good explanation for the anomalous decrease - unless increasing the buffer pool affects overall memory usage (for example, by reducing the file buffer cache - an unlikely scenario when going from 4MB to 20MB on a machine with 4GB of RAM), increasing the buffer pool should never reduce throughput.

    With GT.M's daemonless database engine, concurrent processes cooperate to manage the database: just one process accessing the database does not give optimal performance, and adding processes increases performance (only up to a point of course).  In this case, eight processes perform better than four.
    • Four worker processes - jfs: 81; ext4: 61
    • Eight worker processes - jfs: 52; ext4: 40
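
    Changing the number of global buffers on an existing database file is a MUPIP SET operation.  A sketch, assuming the database file is named mumps.dat and is not in use (qualifier names per the GT.M Administration and Operations Guide):

        mupip set -global_buffers=5000 -file mumps.dat    # 5,000 x 4KB = 20MB buffer pool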

    Increase journal buffers to 2,000

    I increased the number of journal buffers to 2,000 (1MB), expecting performance to be about the same, and if anything slightly better (there should be no penalty for increasing the number of journal buffers).  I was a little surprised that two of the times got worse and one improved (since the resolution of the reported times is one second, the difference between 39 and 40 seconds is not meaningful).  I don't yet have a good explanation for this anomaly.
    • Four worker processes - jfs: 75; ext4: 66
    • Eight worker processes - jfs: 66; ext4: 39
    In my opinion, this configuration (5,000 global buffers & 2,000 journal buffers) reflects a common scenario for small to medium database applications where the in-memory cache size is perhaps 10-20% of the database size.
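
    The journal buffer count is likewise set through MUPIP SET's -journal qualifier.  A sketch under the same assumptions as above (journal buffers are 512-byte units, so 2,000 of them is about 1MB; this keeps before-image journaling enabled - again, check the option names against your version's documentation):

        mupip set -journal="enable,on,before,buffer_size=2000" -file mumps.dat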

    Increase database buffer pool to 35,000 buffers

    To replicate Don Anderson's test, where he had a RAM cache large enough for the entire database, I increased the number of database buffers to 35,000 (at 4KB per buffer, roughly 140MB of cache for the 138MB database).  The database was still operated so as to be recoverable from the journal files to exactly the same extent as before.
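
    Only the global buffer count changes for this step; journaling settings stay as they were.  A sketch under the same assumptions as above:

        mupip set -global_buffers=35000 -file mumps.dat    # cache sized to hold essentially the whole database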

    This gave me the best results.  I ran each test three times to verify that the results were similar each time.
    • Four worker processes - jfs: 27, 27, 29; ext4: 30, 30, 34
    • Eight worker processes - jfs: 29, 29, 30; ext4: 30, 31, 36
    The median update and read rates were:
    • Four worker processes
      • jfs: 80,332 updates/second & 117,369 reads/second
      • ext4: 72,308 updates/second & 105,642 reads/second
    • Eight worker processes
      • jfs: 74,828 updates/second & 109,311 reads/second
      • ext4: 70,008 updates/second & 102,266 reads/second

    In conclusion

    A little effort spent tuning the database provides a good improvement over the defaults.  It is also worth noting that even with the increased contention of eight parallel worker processes (four processes per core), performance was very good except in the cases where they were churning a too-small buffer pool.

    8 comments:

    1. Thanks for your very interesting work. It might be interesting to see what the results are if you use the MM (memory mapped) access method. My experience is that MM performs better than BG. Maybe you will find the time to repeat the tests with MM.

      Kind regards

    2. Thank you, Lothar. I do hope to run the benchmark with MM. Also, I will clean up my instructions so that anyone can run the benchmark on their own system.

    3. Strange, I had spent part of the last weekend doing exactly the same stuff :-)

      Regarding MM, on my testing machine (slow single core CPU, jfs, 2GB RAM) it seems to be significantly slower (40-300 percent!) than BG with the same setup (except for before/after image journaling).

      I will re-test this later with better hardware, although I somewhat prefer to run performance tests on slower machines, as the time differences are more obvious.

    4. Jirka, you can't do before image journaling with MM because GT.M cannot ensure that the before image journal records reach the disk before the corresponding updated database blocks. So, MM only supports "nobefore" journaling - which is of course fully recoverable using forward recovery.

      At one time, I too thought that MM should always be faster than BG, but I have found that sometimes it can be much faster, and at other times slower. I don't yet have a good explanation for this, other than that in the case of MM the operating system controls the buffering, whereas in the case of BG, GT.M controls the buffering.
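
      For anyone who wants to try this, a rough sketch of switching an existing database file (assumed here to be mumps.dat, and not in use) to MM with nobefore journaling - check the qualifier spellings against the Administration and Operations Guide for your GT.M version:

          mupip set -journal="enable,on,nobefore" -file mumps.dat    # MM requires nobefore journaling
          mupip set -access_method=MM -file mumps.dat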

    5. Bhaskar, my test originally used BIJ as it would allow testing of replication as well. When I started to play with MM, I switched to after image journaling to retain the 'apples to apples' style.

      Regarding the MM performance, I have some preliminary results. The problem is related to journaling - the journal file I/O is apparently throttled by one or more processes (one CPU is in 'Wait for I/O' state), while the other processes are blocked (CPUs are idle). It happens only in MM mode; with BG it works fine. When journaling is disabled, the performance is good.

      I'll email you some more detailed data.

    6. Thanks for the insight, Jirka. I'll look for your e-mail.

    7. Hi Bhaskar,

      I first noticed the recent 3n+1 benchmark on Don's libdb blog, and then landed on your blog from there.

      I was very intrigued by your 3n+1 post, so I took your GT.M benchmark and Don's routine and ported both to InterSystems Caché (basically, I created Caché ObjectScript routines). I just wanted to share my results with you and the rest of the readers of your blog.

      I've provided all the gory details of my test and configuration on the intersystems-public-cache Google Group - you can read all about it here: https://groups.google.com/group/intersystems-public-cache/browse_thread/thread/5b3d9b75236d7f83?hl=en.

      In summary, I found the following:
      1. Running Don's version of the tests, Caché on a 2.66 GHz Intel i7 Mac with 8GB of memory and 256MB of database cache took 37 seconds for the 1-1,000,000 range running 3 jobs. This appears to be roughly twice as fast as Don's test with BDB.
      2. Running your tests (on the same hardware that I used for Don's test), I found:
      - For 4 worker processes: my test completed roughly 22% faster with ~25% more updates/sec and ~27% more reads/sec.
      - For 8 worker processes: my test completed roughly 36% faster with ~34% more updates/sec and ~6% more reads/sec.

      Let me know if you have specific questions about my tests – happy to provide details.

      Thanks,

      Vik

    8. Thanks, Vik. It looks like the results mostly reflect the CPU+RAM speeds since for these benchmarks, the databases will pretty much run out of the file system cache.
