I am responding to Dan Weinreb's comments on my post Comments in response to Don Anderson's 3n+1 benchmark with another blog entry, because this response was too long to post as a reply.
Thanks for the comments, Dan. Let me start by echoing and responding to your closing comment, because that is a topic near and dear to my heart - it was the paucity of well-specified benchmarks for side-by-side comparisons that started me off down this path some months ago. I am currently seeking benchmark problems that meaningfully show how a platform (the combination of database engine and OS) degrades when pushed to its limits by ever-increasing workloads.
By way of background in response to your other comments, let us first look at the benchmark problem. I admit it is slightly contrived to use a database, but the 3n+1 problem has the virtue of being easily understood and programmed. Also, using a recoverable database is not unreasonable for long-running computational problems.
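To make the computation concrete, here is a minimal Python sketch of the 3n+1 step counting at the heart of the benchmark. The names are mine, and a dictionary stands in for the database of previously computed sequence lengths; this illustrates the problem, not the threeen1f program itself.

```python
# Minimal sketch of the 3n+1 (Collatz) computation. A dictionary stands in
# for the database; names are illustrative, not those of the benchmark code.
lengths = {1: 1}  # sequence length keyed by starting integer

def sequence_length(n):
    """Return the 3n+1 sequence length of n, caching lengths for every integer visited."""
    path = []
    while n not in lengths:
        path.append(n)
        n = n // 2 if n % 2 == 0 else 3 * n + 1
    base = lengths[n]
    # Walking back up the path, each integer is one step longer than the one after it.
    for i, m in enumerate(reversed(path), start=1):
        lengths[m] = base + i
    return lengths[path[0]] if path else base

# Find the starting value below 1,000 with the longest sequence.
longest = max(range(1, 1000), key=sequence_length)
print(longest, lengths[longest])
```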
For this problem, the concept of a transaction is not well defined, unlike, say, transferring money from my checking account to my savings account, where the application requirements clearly delineate the computational boundaries to which ACID semantics must apply. In the case of the 3n+1 problem, Consistency can mean that for the longest sequence found so far, the lengths for the integers in that sequence should be recorded.
So, I believe my concept of recoverability is the same as yours. If the system crashes, Durability requires that the database must be restorable to its last Consistent state. For the 3n+1 problem, Atomicity and Isolation are not meaningful.
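To sketch what that last Consistent state amounts to for this problem (an illustrative fragment with invented names, not the benchmark code), the record holder and the lengths along its sequence are what must be committed together:

```python
# Illustrative sketch only: the consistent state is the current record
# (start value and sequence length) plus the recorded lengths of the
# integers in that sequence. 'commit' stands in for whatever mechanism
# the database engine uses to make the state durable.
current_record = {"start": 1, "length": 1}

def note_sequence(start, length, lengths_along_sequence, commit):
    """Persist a new consistent state only when a new longest sequence is found."""
    global current_record
    if length > current_record["length"]:
        current_record = {"start": start, "length": length}
        # A crash after this commit keeps the new record; a crash before it
        # simply leaves the previous consistent state in place.
        commit({"record": current_record, "lengths": lengths_along_sequence})
```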
A real world implementation of Schrödinger's cat is pipelining and parallel execution in CPUs, where the black box view is well defined but the glass box view (such as the state of the program counter) is fuzzy (i.e., trying to get a clear glass box view forces the system to give you a black box view). But the hardware guarantees a Consistent state when an external event mandates it, such as an interrupt that requires state to be saved for future resumption, or when a debugger attaches to a process.
GT.M implements pipelining in the database engine (as I am sure other engines do, but GT.M is the only one I am familiar with). Journal buffers are filled and database blocks in the cache are dirtied. Journal buffers are written out and "hardened" with an fsync() before the corresponding database blocks are written out. The glass box view of the database is fuzzy. Discovery of a new longest sequence is like an external event, which requires and results in a Consistent state being made Durable.
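That ordering discipline can be sketched generically; the following is not GT.M code (the function names and file handling are invented for illustration), it only shows the journal being hardened with fsync() before the corresponding database blocks are written:

```python
import os

def commit_journal_then_database(journal_path, journal_records, flush_dirty_blocks):
    """Generic write-ahead sketch: harden the journal first, then write database blocks."""
    with open(journal_path, "ab") as journal:
        for record in journal_records:
            journal.write(record)        # fill the journal buffer
        journal.flush()
        os.fsync(journal.fileno())       # harden the journal on disk first
    flush_dirty_blocks()                 # only now write out the dirtied database blocks
```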
With that background, in response to your question as to whether this is a database benchmark: yes, this is a database benchmark. It measures the performance of a database that is operated in a recoverable mode for the problem solution.
With respect to your comment on the sizes of keys, this problem as discussed does use keys in the tens of bytes. But note that it's easy to make the keys longer, should one wish to benchmark longer keys - since integers are converted to strings by replacing each digit with a string, using longer strings for each digit would produce longer keys. Similarly, the values associated with each key are also easily made longer. GT.M also supports longer keys and values than the benchmark uses.
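As an illustration of that key-lengthening idea (the digit-to-word mapping below is invented, not the benchmark's actual encoding), mapping each decimal digit to a longer fixed string stretches every key by a constant factor:

```python
# Invented digit-to-string mapping for illustration: longer per-digit strings
# yield proportionally longer keys.
WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def to_key(n):
    """Convert an integer to a string key, one digit at a time."""
    return "".join(WORDS[d] for d in str(n))

print(to_key(27))  # "twoseven" - a longer string per digit gives a longer key
```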
Indeed, we have used this same program to benchmark file systems by using an upper limit of 40,000,000, during the course of which the largest integer encountered is the fifteen-digit integer 474,637,698,851,092, which the unchanged threeen1f program represents as a string of 74 UTF-8 characters / 124 bytes.
You mention replication. GT.M also allows logical multi-site (LMS) application configurations to be deployed. Every database engine makes different engineering trade-offs with respect to CAP limitations. Depending on the engineering trade-offs made between consistency, availability and performance, replication can certainly affect database performance. I would like to come up with a problem that allows apples-to-apples comparisons with respect to replication, but have not yet had that light-bulb moment of a problem whose solutions can overcome the different engineering trade-offs. GT.M has at least one LMS configuration operating with instances in England, France and California, and at least for the trade-offs made by GT.M, replication does not materially affect performance.
Your comment with respect to tuning is spot on. Even in an apples-to-apples comparison, no two apples are ever exactly alike!
If I understand correctly, what you're saying is that "recoverable" means that if there is a "stop" failure, the system will be left in some consistent state, reflecting the results of a contiguous-from-the-start execution of steps, but not necessarily all of the steps whose execution was seen as completed by the client side. In other words, it's all the way it should be except that you can lose the effects of the N latest transactions.
Yes, I agree that it's easy to make a variant of the benchmark with larger records. The cases I was thinking of mainly involved larger values more than larger keys, but it's easy to tune either one.
In general, this is one of the challenges with benchmarks: there are so many parameters. Or to put it another way, there are so many different questions to ask. There is no easy solution, and I didn't mean to imply that your benchmark is doing anything wrong. I was only pointing out that anyone interpreting the benchmark should keep in mind what it's testing, and consider the results in the context of what the system under test is optimized for.
I agree completely that the replication issue is another case where it's hard to know exactly what to do.
Thank you very much for your thoughtful reply!
Dan, sorry for the delayed reply - I thought I had it set for automated alerts if someone commented, but evidently I hadn't.
Apropos "lose the effects of the N latest transactions": in the event of a "stop", in-flight transactions (those which have not been committed) simply didn't happen.
The analogy here with hardware pipelining and parallel execution is the case where an execution unit may have completed a computation when an interrupt occurs, but that interim result may not have been transferred to memory ("committed") as part of the state that is saved and from which the CPU will resume after processing the interrupt.