Tuesday, January 5, 2010

The Curious Case of The Y2.01K Bug

I love a good computer bug story.

The 1st of January, 2010 saw a real doozy come to light: The Curious Case of the Bank of Queensland "Y2.01K" Bug.

The Age published an article on it over here. Bank of Queensland customers were surprised to find their transactions failing to clear on the 1st of January 2010. It seems the Bank's systems believed the current date was the 1st of January 2016, not the 1st of January 2010. Credit card transactions were naturally being rejected as expired - correct behaviour for a system that thought it was operating 6 years into the future.

It's easy to laugh and point the finger at poor programming, but date-related bugs can look very innocuous - tricky little time bombs, ticking away, waiting for the right temporal conditions to blow up and bring down your system. I know, because I've been bitten by them before. My first job was working for a company that supplied airport operational systems, and I had the pleasure of missing the first half of Les Misérables one Sunday afternoon as I was called in to fix a system I had created that had failed to correctly adjust local time for daylight saving. Like most bugs it turned out to be a single mistyped key - a plus where I should have used a minus in a block of date logic - but it gave me an appreciation of how systems that go through very extensive testing (at the developer/consultant side and at the acceptance/client side) can miss these bugs.

So back to BoQ and their Y2.01K bug. What the hell went wrong? How can the year go from 2009 to 2016? Are the programmers just seriously incompetent or what? Well, without knowing the codebase or the system, I think we can make some educated guesses.

First off, is it likely somebody mistyped something like:

year = year + 7 instead of year = year + 1 ?

This would be an interesting bug because depending on the type of font used in a code-review, it could be a bug that passed a peer code review session. 7's and 1's can look very similar in many fonts.

This is unlikely, as it would manifest itself at the first New Year the system is operational. I suspect this system has been around for a lot longer - just by the fact it has taken many days to fix (I'm still not sure if it is even fixed). This is probably a system that is several years old, and the original developers are being tracked down to help debug.

So, not a simple mistype - then what? How does a system jump forward 7 years instead of 1? Why is it doing date incrementing logic anyway instead of just using the system provided date?

I think the clue to this lies in the year it jumped to: "2016". That's a very suspicious year. Particularly the "16" bit. You see, in Hexadecimal (Base 16 - a more workable mathematical modelling of the binary numbering system computers are based on), the value "10" is equivalent to decimal "16".

Hexadecimal counting goes 1,2,3,4,5,6,7,8,9 then A,B,C,D,E,F before rolling over to "10" which is the 16th number in the sequence.

To repeat: that is "10" Hexadecimal = "16" Decimal. What happened at BoQ? It went from 2009 to 2016 instead of 2010.
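The arithmetic is easy to check for yourself. Here is a minimal sketch in Python (my choice of language for illustration, nothing to do with whatever BoQ actually runs) showing the same two digits read in each base:

```python
# The same two digits stand for different quantities depending on the base.
digits = "10"

as_decimal = int(digits, 10)  # read as base 10: ten
as_hex = int(digits, 16)      # read as base 16: sixteen

print(as_decimal, as_hex)                 # 10 16
print(format(10, "X"))                    # A  (decimal ten written in hexadecimal)
print(2000 + as_decimal, 2000 + as_hex)   # 2010 2016
```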

Ahhh... I suspect this might be a highly plausible candidate for the bug. I'm assuming the system has somewhere in its bowels a "current year" field. I would further assume that it doesn't store the century, just the year as an offset from the year 2000. Hence 2009 would be stored as "9".

Here's a theory. Whatever system routine is called to get and set the value of this date takes and receives a Hexadecimal value. Perhaps it's a real low level routine, getting close to the bits and bytes of the computer rather than dealing in unnatural (to a computer) decimal base numbers.

Let's say there is some pseudocode logic in the system that goes like this:

boqSystemYear = getBoqSystemYear()
currentYear = getOperatingSystemDate().extractYear()

if currentYear > boqSystemYear then setBoqSystemYear(currentYear)

The operating system function returns the year as a decimal value, which on the 1st of January 2010 would come back as "10".

The low level BoQ system year, if we assume it works in hexadecimal, would currently be set to "9" (which is the same in both hexadecimal and decimal). The code picks up that it needs to be updated to decimal "10" but forgets (or is unaware that it needs) to convert it to hexadecimal (10 decimal = A hexadecimal). Instead it sets it to "10" hexadecimal, which is 16 decimal.

BOOOOOOOOOOOOOOOM!!! The ticking time bomb goes off. The system now thinks it is 2016 instead of 2010 and chaos ensues.
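To make the theory concrete, here is a small Python sketch of it. Everything in it is hypothetical: BoqYearStore, roll_forward_buggy and roll_forward_fixed are names I made up to stand in for the guessed-at low level routine and the code that calls it. It is an illustration of the hypothesis, not of BoQ's actual system.

```python
from datetime import date


class BoqYearStore:
    """Hypothetical low-level store: keeps the year as an offset from 2000
    and expects that offset to be passed in as hexadecimal digits."""

    def __init__(self, hex_digits: str) -> None:
        self._offset = int(hex_digits, 16)

    def set_year(self, hex_digits: str) -> None:
        self._offset = int(hex_digits, 16)

    def year(self) -> int:
        return 2000 + self._offset


def roll_forward_buggy(store: BoqYearStore, today: date) -> None:
    """Passes decimal digits to a routine that reads them as hexadecimal."""
    if today.year > store.year():
        store.set_year(str(today.year - 2000))  # "10" is read as hex, i.e. 16


def roll_forward_fixed(store: BoqYearStore, today: date) -> None:
    """Converts the offset to hexadecimal digits before handing it over."""
    if today.year > store.year():
        store.set_year(format(today.year - 2000, "X"))  # 10 becomes "A"


buggy = BoqYearStore("9")
roll_forward_buggy(buggy, date(2010, 1, 1))
print(buggy.year())  # 2016

fixed = BoqYearStore("9")
roll_forward_fixed(fixed, date(2010, 1, 1))
print(fixed.year())  # 2010
```

Note that under these assumptions the buggy version behaves perfectly for every rollover up to 2009, because the digits 1 to 9 mean the same thing in both bases. The bug only shows itself once the offset needs two digits.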

Here we have a hypothesis that fits the observed facts. Without knowing the codebase we can do no more, but it does highlight how these innocuous-looking temporal bugs can occur. Is this the result of sloppy coding? Hard to say. No programmer or even team of programmers works completely in isolation, building a system from the lowest levels up. They work on top of layers of code created over many decades by thousands of other programmers. Layers that provide operating systems, database systems, date & time manipulation routines and so on. It could be that this hexadecimal/decimal bug is the result of poorly documented routines that did not advertise what format they expect their input in or provide their output as. Mind you, this usually occurs when numbers are transferred around the place as strings, which is a bit of an "anti-pattern" in itself because of these sorts of bugs.

I'm not a tester, but I wonder how many systems go through a range of what I would call "Temporal Testing" - testing the system at particular points in time. From my experience the focus tends to be on functional test plans and meeting performance KPIs, but not so much on rolling the system through a range of dates. Note that this is different from putting a range of date data into the system - this involves moving the system clock forward in time to test how the system behaves operationally in the future. It would be easy to identify a range of boundary dates to simulate future behaviour on - New Year changeovers, daylight saving changes, leap years, month end/start dates (check all those 30/31 day months are correctly done).
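As a rough sketch of how such a list of boundary dates might be generated, here is some Python using only the standard library. The function name boundary_dates is my own invention, and I have left out daylight saving changeovers because they depend on the time zone rules of wherever the system runs.

```python
import calendar
from datetime import date


def boundary_dates(first_year: int, last_year: int) -> list[date]:
    """Collect calendar dates whose rollover is worth simulating
    by moving the system clock (hypothetical helper)."""
    dates = set()
    for year in range(first_year, last_year + 1):
        # Month starts and ends cover the 28/29/30/31 day months,
        # and also New Year's Eve and New Year's Day.
        for month in range(1, 13):
            last_day = calendar.monthrange(year, month)[1]
            dates.add(date(year, month, 1))
            dates.add(date(year, month, last_day))
        # The 28th of February matters every year: the next day is
        # either the 1st of March or the leap day.
        dates.add(date(year, 2, 28))
    return sorted(dates)


for d in boundary_dates(2010, 2012):
    print(d.isoformat())
```

Each date in the list would then be fed to whatever mechanism sets the clock on the test platform, with the test suite run just before and just after midnight.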

Any others? Are there test systems / test beds out there that do this sort of temporal range testing?

6 comments:

  1. Poor test planning and poor review of said test plan. More than likely due to the plan being written by a generic tester rather than someone with experience in banking and finance. This fuckup has "consultancy" written all over it: "Let's hire these guys, they write software" rather than getting people with the relevant skill set.

    Testing banking and finance apps requires a slightly different mindset. There's a bunch of legal stuff to watch out for, plus international time/date stuff that can send your pretty little app into a tizz, and of course year switches.

    Basically they didn't test over as broad a range as they should have. I'd also venture that unit testing wasn't thorough enough, but it's still up to QA to catch this sort of stupidity.

  2. Fair points. I'm curious as to whether people take this concept of Temporal Testing seriously - actually changing the low level Operating System date/time settings and seeing what happens to the system under test.

    If the bug is the result of erroneous conversion from low-level operating system routines, then no amount of unit-testing will help. In fact, one could argue that unit-testing gives a false sense of confidence that the code is "tested" - How many times have we heard "we don't need testers because we unit-test"?

    Some bugs require as real a test bed as possible to put the system through its paces.

    As I said, these bugs can be innocuous .... let's say this system went live in the lead up to Y2K - 10 years ago. The test plan tested key dates like New Years Eve, Leap Years etc.

    i.e. thinking like...

    They tested the rollover from 2000 to 2001.
    They tested the rollover from 2001 to 2002.
    They tested the rollover from 2002 to 2003.
    They tested the rollover ... ah stuff it, that's enough testing - year rollovers are working fine.

    2010-2016 mix up seems obvious in hindsight, but perhaps not so obvious at the time?

    I'm genuinely curious though - are there test tools out there that can take a system and run a standard set of automated tests on a target platform set to a particular date and time (and then automate the roll forward).

    For example, if a test suite takes 10mins to run then you run a test for each date in the future that the system is likely to be operational for - say the next 20 years, with say 2 iterations per day - at midnight and midday.

    At what point does the cost of these time-bombs outweigh the cost of doing such automated temporal testing?

  3. Or here is another thought. If you had a really good automated test suite and you know that at worst any really bad bug could be fixed in 30 days, you could always keep a live staging area but with its clock set to 30 days in the future, running a test suite every day.

    If a time-bomb bug is triggered by the test suite, then you have 30 days warning before it hits your live production system. Wonder if anybody goes to this much trouble?

  4. Maybe in the future they will. In Germany the same problem hit people, with an estimated 1/3 (!) of all cards not working anymore, at a time when retail is expecting one of the highs of the year in the post-Christmas sales.
    What's more, it is still not entirely fixed (they promised to have it fixed by today, but it doesn't look good). It is not just one bank, but affects all banks, since it is the company that fixed the chip in the cards that's the culprit. And oh: Europeans travel a lot. If you happen to have spent your New Year's holidays in the Alps or wherever, then you are probably stuck without decent means of getting money and that fix will just have to "wait longer". They don't know when they will be able to fix the readers abroad. So how about that for cost and trouble making?

  5. Ash,
    Changing the "low level Operating System date/time settings", aka system time, used to be seen as a viable option, except that after doing so one has numerous files on the machine that could be time-stamped into the future. Some of these will be data files, but a lot more will be system runtime stuff: database & webserver logs, control files, etc.

    It's not fun to have to rebuild a larger UNIX system. Add to that, if the site uses Microsoft's Active Directory as the network login system, any machine that has its system time more than 5 minutes ahead will be locked out of the domain.

    So, if you want to run a system 30 days ahead or 3 years ahead, whatever, they should be using Time Machine.

    www solution-soft.com / timemachine.shtml

  6. Yes, good points. What I was getting at with the "system +30day" testing was actually taking a self-contained network (i.e. Active Directory controllers and all, client browsers for testing etc, firewalls, the whole lot!) and running it all at +30 days.

    Otherwise you won't get those truly bizarre bugs that are a result of different systems on the network interacting together.

    Time Machine looks a pretty good tool for this sort of testing.
