The most expensive bug in history...

 

The development team got a call from the CEO: they needed to implement a new feature in SMARS (their order router) so that it could take trades from dark pools in addition to the regulated markets.

 

The development team had 30 days to comply with the new rule. Although the CEO himself was not in favour of the new feature, he had to go ahead with it because it meant retaining existing business.

 

The release date for the new feature was set to 9:30 AM EST on August 1, 2012.

 

The plan was to deploy the new system behind a feature flag the week before the deadline; when the market opened on August 1, they'd simply turn it on.

 

At 9:30 AM EST on August 1, the Knight developers did just that: they enabled the feature flag, and SMARS began to route orders through to the RLP (the NYSE's Retail Liquidity Program). They were live!

 

But something was wrong. Their charts showed anomalous spikes in trading activity on the open markets. At 9:34 AM, the NYSE called: Knight was executing so many trades that trading volumes for the entire market were double their normal level.

 

To make matters worse, the trades they were making didn't make sense. SMARS appeared to be buying high and selling low. At the current rate, they were losing thousands of dollars per second.

 

Alerted to the problem, Knight's Chief Information Officer called the top operations engineers together to try to identify the root cause. The rogue orders seemed to be originating from the new RLP router code, but no one could pinpoint the bug.

 

Twenty minutes had screamed by since the market opened, and the unintended trades executed by SMARS already totalled well into the billions of dollars. It was time to roll back and ask questions later.

 

With a shaky sense of relief, the operations team scrambled to check out the last known stable version of SMARS and deploy it to their 8 production servers.

 

To their horror, as soon as the router restarted, trading volumes on the NYSE spiked again: they were now executing even more trades than before.

 

At 9:58 AM, the Knight developers shut down SMARS entirely. It had been 8 minutes since rolling back the RLP code, and 28 minutes since the market opened.

 

They had just cost their company $460 million.

 

But what actually happened?

When the developers of Knight's high-frequency trading algorithm replaced some unused legacy code, they repurposed a feature flag that had previously been used to enable it.

 

The deployment succeeded on 7 of their 8 servers, but the deploy to the 8th server failed silently, so one server was still running the legacy code. When they enabled the feature flag, 7 servers operated as expected; the 8th executed the legacy code, which should never have run in production.

 

Instead of re-deploying the new code to the 8th server, they decided to roll back to the last known good state. Unfortunately, they didn't know that the problem was the feature flag, and it didn't cross their minds to turn it off. When the old system was re-deployed, every server began to run the legacy code, dramatically compounding their losses.
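The failure mode described above can be reproduced in a toy simulation. This is only a sketch under stated assumptions: the build labels, the server names, and the code-path names are hypothetical and have nothing to do with Knight's actual code. It models a single flag whose meaning depends on which build a server is running.

```python
from dataclasses import dataclass

NEW_BUILD = "new"        # hypothetical label: build containing the RLP code
LEGACY_BUILD = "legacy"  # hypothetical label: build still containing the dead code


@dataclass
class Server:
    name: str
    build: str

    def handle_order(self, flag_on: bool) -> str:
        """Return the code path an order takes on this server."""
        if not flag_on:
            return "normal"
        # The same flag means different things depending on the build.
        return "rlp" if self.build == NEW_BUILD else "legacy_dead_code"


def count_paths(servers, flag_on):
    """Count how many servers take each code path."""
    counts = {}
    for server in servers:
        path = server.handle_order(flag_on)
        counts[path] = counts.get(path, 0) + 1
    return counts


# The deploy succeeds on 7 servers and silently fails on the 8th.
servers = [Server(f"srv{i}", NEW_BUILD) for i in range(1, 8)]
servers.append(Server("srv8", LEGACY_BUILD))
after_flag_on = count_paths(servers, flag_on=True)

# Roll every server back to the old build while the flag stays on.
for server in servers:
    server.build = LEGACY_BUILD
after_rollback = count_paths(servers, flag_on=True)

# Turning the flag off is what actually stops the dead code.
after_flag_off = count_paths(servers, flag_on=False)

print(after_flag_on)   # {'rlp': 7, 'legacy_dead_code': 1}
print(after_rollback)  # {'legacy_dead_code': 8}
print(after_flag_off)  # {'normal': 8}
```

In this model the rollback makes things worse because it changes the code but not the flag, which is the trap described above.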

 

 

Where did they go wrong?

 

The Knight developers should never have allowed dead code to remain in their app for so long. Had they been more proactive, they could have easily avoided catastrophe. Reusing a feature flag was a dumb mistake that just shouldn't have been made. The developers weren't entirely to blame, though: if there's one certainty in life, it's that we will make mistakes.
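One way to make flag reuse hard to do by accident is to keep a record of retired flag names and reject any attempt to register them again. The sketch below is a hypothetical in-memory registry, not a description of any tooling Knight had; the class name, method names, and flag names are all invented for illustration.

```python
class FlagRegistry:
    """Hypothetical registry that refuses to reuse retired flag names."""

    def __init__(self):
        self._active = {}      # flag name -> description
        self._retired = set()  # names that must never be used again

    def register(self, name, description):
        if name in self._retired:
            raise ValueError(f"flag {name!r} was retired and cannot be reused")
        if name in self._active:
            raise ValueError(f"flag {name!r} is already registered")
        self._active[name] = description

    def retire(self, name):
        # Raises KeyError if the flag was never registered.
        del self._active[name]
        self._retired.add(name)


registry = FlagRegistry()
registry.register("legacy_algo", "turns on the old algorithm")  # invented name
registry.retire("legacy_algo")

try:
    registry.register("legacy_algo", "routes orders to the new feature")
    reuse_blocked = False
except ValueError:
    reuse_blocked = True

print(reuse_blocked)  # True
```

A new feature would then be forced onto a fresh flag name, so a server still running old code would simply ignore it.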

 

 

When they deployed SMARS, they didn't have an automated deployment pipeline and instead relied on their engineers to deploy the new code manually; as a result, they missed an important step on the fateful 8th server. When that first mistake led to a crisis, their monitoring was inadequate, and they had no documented incident response procedures that could have stopped them from making an even worse mistake under pressure.
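A simple guard against this kind of partial deploy is to compare the version every server reports against the version that was supposed to ship, and to refuse to enable the flag if any server disagrees. The following is a minimal sketch, assuming each host can report its running version somehow (for example through a health endpoint); the function names, host names, and version strings are hypothetical.

```python
def hosts_out_of_sync(reported_versions, expected):
    """Return the sorted names of hosts not running the expected version."""
    return sorted(
        host for host, version in reported_versions.items()
        if version != expected
    )


def enable_flag_if_consistent(reported_versions, expected):
    """Refuse to enable the flag unless every host runs the expected version."""
    stale = hosts_out_of_sync(reported_versions, expected)
    if stale:
        raise RuntimeError(
            "refusing to enable flag; stale hosts: " + ", ".join(stale)
        )
    return True


# Hypothetical fleet: 7 hosts updated, the 8th silently left behind.
reported = {f"srv{i}": "2.0" for i in range(1, 8)}
reported["srv8"] = "1.9"

print(hosts_out_of_sync(reported, expected="2.0"))  # ['srv8']
```

With a check like this wired into the release step, the stale 8th server would have blocked the flag from being turned on rather than being discovered in production.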

 

 

 
