Those that are deep into transactional database work, as everyone in payment systems and the like is, know there is a deep dim and ghostly place that we all fear. I've just walked that through that place, and as soon as I saw it, I know I was staring at the Twilight Zone.
The Twilight Zone is a special nightmare for database engineers. It is when your transactional set forks into two; both are correct because they are transactions, after all, but both places are wrong because of the other place. Worse, the further time passes, the more chance of more forks, more and more places, all in the same zone. It is when the time-space continuum of your data fractures and spreads out in an infinite tree of possibilities.
I've always known it existed. When you've travelled so many databases, so many scenarios, you realise that the perfect database doesn't exist. Software is meant to fail, and getting it right today just means it will really go wrong tomorrow. For nine years, tomorrow never came, until one day in Vienna, I discovered a whole issuance of newly minted gold, Euro and sterling had just ... vanished into another space. It took me over two days of isolating and isolation before I realised where I was. And where I was.
(A brief digression for the non-digerati: database software does transactions, which are like records or receipts or sales or somethings that have special characteristics: they happen once and once only, if they happen at all, and if they happen, they happen forever. We call them atomic, because they either do or they don't happen, we can't divide them into half-happens. We do this because when we move money from one place to another, we want to make darn sure it either moves or it doesn't. No halfway house. And no going back, once we got there. We actually care so much about this that we don't really care which it is - happens or not happens!)
So when my fresh gold decided it had happened and not happened, I was sucked into the Twilight Zone. The reason it exists is quite fundamental: transactional software is perfect in theory, but implementations are flawed. No matter how much care you take, changes occur, features get added, bugs need to be fixed; step by small baby step, the logical beauty of your original design flits and dances towards the forking point. With all software, everywhere, no matter the manufacturer's guarantee, there will always be the possibility of so many bugs and so many patches and so many engineers who didn't understand, all one day coming together to split your state into the twilight zone.
This is why space shuttles blow up. Why Titanics sink, dams collapse, power grids shut down, and stock exchanges melt down. It's not because of a lack in the quality of the people or the software, it's because of the complexity of the system. Fundamentally, if you got it right, someone will build a better system on yours that is 99% right, and reliant on yours 101%. And the next person will layer their opus magnum over that great work and get that 98% right... and so it goes on until the mother of all meltdowns occur.
Specifically, what happened was an event notification - a new feature added in so as to enable chat broadcasts via payments - had a dodgy forwarding address. Which would have been fine, but the change to fix that broke. Which wasn't picked up in testing, because it didn't break in quite that way, but was picked up by a recovered transaction which did look it in exactly that way, which in turn failed and then went on to block another transaction in recovery. (Long time hackers will see a chain of bugs here, one tripping another in a cascade.)
This last transaction was a minting transaction. That means, it created value, which was the sterling I mentioned earlier (or gold, or Euro, I forget). Which, by a series of other unfortunate events caused yet another whole chain of transactions to fail in weird ways and Shazam! We entered the twilight zone where half the world thought they had a bucket of dosh, and the other half did not.
Fixing the bugs is obvious, boring, and won't be discussed further. The real issues are more systemic: it is going to happen and happen again. So infrequently that its very rarity makes it much more traumatic for its lack of precedent. It is very hard to create procedures and policies to deal with something that hasn't happened in living memory, would be fixed immediately if we knew how it was going to happen, and is so not-going-to-happen that the guarantee doesn't permit it. Nor its solution, nor even the admittance of the failure.
So how do we deal with the twilight zone? Well, like quantum physics, the notion is to look at the uncertain states and attempt to collapse them into one place. With luck this is possible, simply by re-running all the transactions and hoping that it all works out. With bad luck however, there would be a clash between transactions that resulted in leaving the twilight zone the wrong way, and being splintered forever: Simply put if I had given money to you in one place, and to your sister in another place, when the two places collapsed into one then the time-space of accounting would rip asunder and swallow us all, because money can't exist in two states at once. It would be light and day together for evermore. At the least, permanent migraines.
Which leads me to our special benefit and our own fatal curse: the signed receipt. In our transactions, the evidence is a receipt, digitally signed that is distributed to all the accounts' users. This means we as issuers of contractual value are locked into each and every transaction. Even if we wanted to fiddle with the database and back out a few tranasctions to pretend your sister doesn't exist, it won't work because the software knows about the signed transactions. This trick is that which I'd suggest to other databases, and that's why we signed the receipts in the first place; We never wanted that to work, and now it doesn't. Stuck, we are.
It does however mean that the simple tactical phase is a good starting point: re-run all the transactions, and live with the potentially broken accounts, the accounting time-space rent asunder if so discovered. How we'd deal with that is a nice little question for our final exam in post-graduate governance.
My walk through the twilight zone was then guided by a strategy: find all the signed receipts, and re-run them. Every one, and hope it worked out! Luck was indeed on my side this time, as it was a minting that had failed, so the two places were cleanly separated in the zone. I had to fix countless interlocking bugs, make yet more significant feature changes, and conduct days worth of testing. Even after I had done all this, and had watched the thrilling sight of 10 transactions reborn in my preferred space, I still had only the beginnings of a systemic solution to the problem of walking the twilight zone.
How to do that is definately a tricky problem. Here are my requirements so far: even though it should never happen, it must be a regular occurrence. Even though the receipts are scattered far and wide, and are unobtainable to the server, we must acquire the receipts back. And, even though we cannot collapse the states back when they have forked too far, we must re-engineer the states for collapse.
I have the essence of a solution. But it will have to remain on the drawing board, awaiting the next dim opportunity; as no-one willingly walks into the Twilight Zone.Posted by iang at April 16, 2005 09:47 AM | TrackBack