April 16, 2005

The Twilight Zone

Those who are deep into transactional database work, as everyone in payment systems and the like is, know there is a deep, dim and ghostly place that we all fear. I've just walked through that place, and as soon as I saw it, I knew I was staring at the Twilight Zone.

The Twilight Zone is a special nightmare for database engineers. It is when your transactional set forks into two; both are correct because they are transactions, after all, but both places are wrong because of the other place. Worse, the further time passes, the more chance of more forks, more and more places, all in the same zone. It is when the time-space continuum of your data fractures and spreads out in an infinite tree of possibilities.

I've always known it existed. When you've travelled so many databases, so many scenarios, you realise that the perfect database doesn't exist. Software is meant to fail, and getting it right today just means it will really go wrong tomorrow. For nine years, tomorrow never came, until one day in Vienna, I discovered a whole issuance of newly minted gold, Euro and sterling had just ... vanished into another space. It took me over two days of isolating and isolation before I realised where I was. And where I was.

(A brief digression for the non-digerati: database software does transactions, which are like records or receipts or sales or somethings that have special characteristics: they happen once and once only, if they happen at all, and if they happen, they happen forever. We call them atomic, because they either do or they don't happen, we can't divide them into half-happens. We do this because when we move money from one place to another, we want to make darn sure it either moves or it doesn't. No halfway house. And no going back, once we got there. We actually care so much about this that we don't really care which it is - happens or not happens!)
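For the digerati, here is a minimal sketch of that all-or-nothing behaviour, using Python's built-in sqlite3 module. The `accounts` table and the `transfer` and `balances` helpers are illustrative assumptions, not the code of the payment system described here.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("unknown account or insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False  # the debit above was rolled back; nothing half-happened

def balances(conn):
    """Return the current balances as a dict of name -> amount."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical demo data: an in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()
```

Calling `transfer(conn, "alice", "bob", 500)` on this data returns `False` and leaves both balances untouched, which is the "no halfway house" property in miniature.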

So when my fresh gold decided it had happened and not happened, I was sucked into the Twilight Zone. The reason it exists is quite fundamental: transactional software is perfect in theory, but implementations are flawed. No matter how much care you take, changes occur, features get added, bugs need to be fixed; step by small baby step, the logical beauty of your original design flits and dances towards the forking point. With all software, everywhere, no matter the manufacturer's guarantee, there will always be the possibility of so many bugs and so many patches and so many engineers who didn't understand, all one day coming together to split your state into the twilight zone.

This is why space shuttles blow up. Why Titanics sink, dams collapse, power grids shut down, and stock exchanges melt down. It's not because of a lack in the quality of the people or the software, it's because of the complexity of the system. Fundamentally, if you get it right, someone will build a better system on yours that is 99% right, and reliant on yours 101%. And the next person will layer their opus magnum over that great work and get that 98% right... and so it goes on until the mother of all meltdowns occurs.

Specifically, what happened was that an event notification - a new feature added so as to enable chat broadcasts via payments - had a dodgy forwarding address. Which would have been fine, but the change to fix that broke. Which wasn't picked up in testing, because it didn't break in quite that way, but was picked up by a recovered transaction which did break in exactly that way, which in turn failed and then went on to block another transaction in recovery. (Long-time hackers will see a chain of bugs here, one tripping another in a cascade.)

This last transaction was a minting transaction. That means it created value, which was the sterling I mentioned earlier (or gold, or Euro, I forget). Which, by a series of other unfortunate events, caused yet another whole chain of transactions to fail in weird ways and Shazam! We entered the twilight zone where half the world thought they had a bucket of dosh, and the other half did not.

Fixing the bugs is obvious, boring, and won't be discussed further. The real issues are more systemic: it is going to happen and happen again. So infrequently that its very rarity makes it much more traumatic for its lack of precedent. It is very hard to create procedures and policies to deal with something that hasn't happened in living memory, would be fixed immediately if we knew how it was going to happen, and is so not-going-to-happen that the guarantee doesn't permit it. Nor its solution, nor even the admittance of the failure.

So how do we deal with the twilight zone? Well, like quantum physics, the notion is to look at the uncertain states and attempt to collapse them into one place. With luck this is possible, simply by re-running all the transactions and hoping that it all works out. With bad luck however, there would be a clash between transactions that resulted in leaving the twilight zone the wrong way, and being splintered forever. Simply put, if I had given money to you in one place, and to your sister in another place, when the two places collapsed into one then the time-space of accounting would rip asunder and swallow us all, because money can't exist in two states at once. It would be light and day together for evermore. At the least, permanent migraines.
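The collapse described above can be sketched as a replay of both forks onto one set of opening balances. This is only an illustration under stated assumptions: the `Tx` record, the `"MINT"` pseudo-account and the `collapse` function are hypothetical names, and replaying fork A before fork B is an arbitrary choice that decides which of two clashing transactions wins.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tx:
    txid: str
    src: str      # "MINT" (hypothetical) marks newly issued value
    dst: str
    amount: int

def collapse(opening, fork_a, fork_b):
    """Replay the union of two forked histories onto the opening balances.

    Returns (balances, clashes), where clashes lists the transactions that
    could not be honoured once the two places were merged into one.
    """
    balances = dict(opening)
    seen = set()
    clashes = []
    for tx in list(fork_a) + list(fork_b):
        if tx.txid in seen:          # recorded in both forks: apply once only
            continue
        seen.add(tx.txid)
        if tx.src != "MINT":
            if balances.get(tx.src, 0) < tx.amount:
                clashes.append(tx)   # the same money was spent in both places
                continue
            balances[tx.src] -= tx.amount
        balances[tx.dst] = balances.get(tx.dst, 0) + tx.amount
    return balances, clashes

# The unlucky case: the same 10 units given to you in one place
# and to your sister in the other.
unlucky = collapse(
    {"me": 10},
    [Tx("a1", "me", "you", 10)],
    [Tx("b1", "me", "sister", 10)],
)

# The lucky case: only a minting differs, so the forks merge cleanly.
lucky = collapse({"me": 10}, [Tx("m1", "MINT", "me", 5)], [])
```

In the unlucky case the second payment lands in `clashes` rather than in the balances; deciding what to do with it is the governance question, not a coding one.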

Which leads me to our special benefit and our own fatal curse: the signed receipt. In our transactions, the evidence is a digitally signed receipt that is distributed to all the accounts' users. This means we as issuers of contractual value are locked into each and every transaction. Even if we wanted to fiddle with the database and back out a few transactions to pretend your sister doesn't exist, it won't work because the software knows about the signed transactions. This trick is that which I'd suggest to other databases, and that's why we signed the receipts in the first place; we never wanted that to work, and now it doesn't. Stuck, we are.

It does however mean that the simple tactical phase is a good starting point: re-run all the transactions, and live with the potentially broken accounts, the accounting time-space rent asunder if so discovered. How we'd deal with that is a nice little question for our final exam in post-graduate governance.

My walk through the twilight zone was then guided by a strategy: find all the signed receipts, and re-run them. Every one, and hope it worked out! Luck was indeed on my side this time, as it was a minting that had failed, so the two places were cleanly separated in the zone. I had to fix countless interlocking bugs, make yet more significant feature changes, and conduct days' worth of testing. Even after I had done all this, and had watched the thrilling sight of 10 transactions reborn in my preferred space, I still had only the beginnings of a systemic solution to the problem of walking the twilight zone.
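A rough sketch of that strategy, collecting receipts, checking each signature, and re-running every valid one exactly once, might look like the following. It rests on loud assumptions: an HMAC with a shared demo key stands in for a real digital signature, integer `txid` values are assumed to give the replay order, and `make_receipt`, `rerun` and the `"MINT"` pseudo-account are hypothetical names, not the actual system's format.

```python
import hashlib
import hmac
import json

ISSUER_KEY = b"demo-key"  # assumption: stand-in for the issuer's signing key

def sign(body):
    """Sign the canonical JSON form of a receipt body (HMAC as a stand-in)."""
    payload = json.dumps(body, sort_keys=True).encode()
    return hmac.new(ISSUER_KEY, payload, hashlib.sha256).hexdigest()

def make_receipt(txid, src, dst, amount):
    body = {"txid": txid, "src": src, "dst": dst, "amount": amount}
    return {"body": body, "sig": sign(body)}

def rerun(receipts):
    """Re-run every validly signed receipt once, in txid order.

    Returns (ledger, rejected). Balances are allowed to go negative:
    the broken accounts are surfaced rather than hidden.
    """
    ledger, applied, rejected = {}, set(), []
    for r in sorted(receipts, key=lambda r: r["body"]["txid"]):
        body = r["body"]
        if not hmac.compare_digest(sign(body), r["sig"]):
            rejected.append(body["txid"])   # tampered with, or not ours
            continue
        if body["txid"] in applied:         # same receipt held by several users
            continue
        applied.add(body["txid"])
        if body["src"] != "MINT":
            ledger[body["src"]] = ledger.get(body["src"], 0) - body["amount"]
        ledger[body["dst"]] = ledger.get(body["dst"], 0) + body["amount"]
    return ledger, rejected

# Hypothetical receipts gathered from users, out of order and with a duplicate.
mint = make_receipt(1, "MINT", "alice", 100)
pay = make_receipt(2, "alice", "bob", 40)
forged = make_receipt(3, "MINT", "mallory", 1)
forged["body"]["amount"] = 10**6            # altered after signing

ledger, rejected = rerun([pay, mint, mint, forged])
```

Because each receipt is applied at most once, the same set of receipts can be re-run as often as needed without changing the outcome.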

How to do that is definitely a tricky problem. Here are my requirements so far: even though it should never happen, it must be a regular occurrence. Even though the receipts are scattered far and wide, and are unobtainable to the server, we must acquire the receipts back. And, even though we cannot collapse the states back when they have forked too far, we must re-engineer the states for collapse.

I have the essence of a solution. But it will have to remain on the drawing board, awaiting the next dim opportunity; as no-one willingly walks into the Twilight Zone.

Posted by iang at April 16, 2005 09:47 AM

Wow - reading that made the hair on the back of my neck stand up! I've been in similar predicaments and it ain't pretty! Best wishes on your speedy escape from 'The Twilight Zone'!

Posted by: Wren at April 16, 2005 07:15 AM

That's how I felt when I finally saw what had really happened! I just had to stop at that point ... hit the beer fridge and leave it until the next day. Luckily I was able to just shut down the affected issues, heaven knows what one would do in a busy system. Which is why I'm thinking on the systemic solution... and not rushing it :)

Posted by: Iang at April 16, 2005 12:02 PM

A lot of food for thought, indeed. Makes one ask "can it happen to me?", but this question is meaningless. Of course it can't, but it couldn't have happened to you either, and yet it did. That's the whole point. But I'll try, nevertheless:
Our payment system guards against the dark forces of evil by making the set of signed receipts and the transaction records actually the same thing; the signed receipts constitute the recorded database. There are no other records. Transactions happen as their signed receipts enter the public records. Thus, signed receipts are distributed not only to all the account's users but to everybody.
This, of course, implies that the signed receipts should be devoid of unencrypted private information, which is quite a challenge by itself, and I have no idea how to formalize this requirement and how to verify it. I hope that I have solved this problem, but I cannot state it with any certainty.
I am perfectly satisfied, however, that in our system there can be no inconsistency between the signed receipts that the users have and the state of the system that the issuer (the minting service, using Ian's terminology) sees, because the two are the same thing. This is certainly part of a systemic solution, but is this enough?
The conservative nature of contractual value can be verified by anyone at all times. This was one of the most important design criteria, though for different reasons (to guard against malicious issuers -- the architects of our system are Hungarians, whose grandparents have witnessed the hyperinflation of 1946 that ended with the exchange rate of 1:4e29).
We also have a fail-stop procedure (for contingencies such as the compromise of the issuer's private signature key), after which the users have to prove title to contractual value using signed receipts from before the triggering of the fail-stop mode.
Thus, if the mint is by accident restarted from an earlier backup with some transactions missing, we have both a proof (two different receipts with the same serial number) and a procedure to follow. Our wallet application reports such inconsistencies (if noticed) to the issuer, triggering the fail-stop. Doing so is in the best interest of the clients, so there are no tragedy of commons issues here; users have no reason to disable the verification code in the wallet application.
Am I right in my assertion that we will always have a regular way out of the twilight zone?

Posted by: Daniel A. Nagy at April 16, 2005 11:52 PM

again some background ... financial transactions mapping to database transactions

Posted by: Lynn at April 18, 2005 11:01 AM

If The Twilight Zone exists, it's to myself an issue, I don't know, don't think, if I can fail to dwell on, as a result of its unsolved 'existence', for your own sake, please receive a few more mails of mine, also that my labor situation still deserves to be controlled by myself, so that I can of course tell & e.g. help us both etc. find out & so on, greetings, arentved@in.com, there to be continued.

Posted by: Joram Arentved at November 24, 2009 09:18 PM