(Mostly) Recent interesting software crashes (not the exhaustive list)

Posted on May 17, 2020

The software industry really is an interesting place to work. Especially over the last five to ten years, a lot has been happening around software, and most of it is really good: driverless cars, autonomous delivery drones, AI that predicts and generates faces, ML eating the world (hoping to cure diseases and predicting the right pair of shoes I’m going to buy), jumping soldier robots that are about to be contracted to the highest-bidding army, and countless others.

But there’s one more trend that’s catching my attention more and more frequently: silly, epic software crashes! Most people in the industry can cite four or five incidents right off the top of their heads, and a quick search of technology sites brings up tens more quite easily. To be honest, I don’t know any other profession or discipline that fucks up this frequently, and it is certainly interesting to observe these crashes as an insider these days.

When I first started my career, after a year of a very low-paid (but paid) internship at the same company, I was bewildered by the low quality of the software running in major banks in Turkey. And I contributed my share of low-quality code and bugs, of course. This even led me to believe that I was just not cut out for this profession, and I started getting depressed. Then I found out that what I had learnt at university was not going to be enough for me to succeed in this software world, so I decided to hone my skills and jump ship whenever I saw an opportunity to build better software (and possibly earn more). This journey introduced me to functional programming, the soundness of static typing, the completeness of dynamic typing, test-driven development, DevOps practices, OCD and other niceties of the software profession.

While this proved to be a better strategy than staying stuck in a non-functional company and feeling depressed, I was still producing bugs, the companies I worked for had tremendous (yet preventable and silly) bugs and crashes, and more importantly I realized that everyone else in the industry was doing buggy work and causing crashes all around! This gave me a bit of schadenfreude, but I started to think about the main things that are wrong with us, the industries we work for, the way we do business and the companies that employ us. Before I reach any conclusions (which will hopefully be another post :) ), I decided to write up a fun list of mostly recent crashes (not off the top of my head), so here goes nothing:

1 In September 2018, the Tory party in the UK held its annual party conference. For the occasion, they ordered a bespoke conference app that shows the schedule and speakers and lets journalists register and ask questions. But this app had a fatal flaw: it disclosed everyone’s contact information on day one of the conference (if you could guess Boris Johnson’s email, I mean)! Because, wait for it, the app didn’t have an email-and-password login. It had only an email login. Since many politicians had registered with publicly available email addresses, a lot of people logged in as these politicians and got their contact information. Think about it: the telephone numbers and emails of all the prominent politicians in the UK! They managed to fix it in thirty minutes (reportedly), but the damage was probably done. The sad fact is that the app was developed by CrowdComms, which asserts on its website that it powers the events industry! Information on the story: https://www.theguardian.com/politics/2018/sep/29/tory-conference-app-flaw-reveals-private-data-of-senior-mps

2 More or less around the same time, Facebook announced that they had managed to get themselves hacked (I mean, Cambridge Analytica basically hacked the whole of Facebook, but that’s another long story). Facebook. For 50 million users. Including Mark Zuckerberg. Hackers reportedly stole access tokens, which were invalidated once Facebook figured out what had happened, but those tokens might still have been usable in third-party services! The problem was caused by the ‘View as’ tool, which I had never heard of before this incident and don’t want to hear about in the future, but it basically lets you stalk yourself as one of your friends on Facebook. Facebook’s people managed to chain three bugs together in this feature, so that in the end hackers could get our access tokens just by viewing their profiles as us. Move fast and break things! What a shame! More information on the story: https://www.wired.co.uk/article/facebook-hack-data-breach-news-what-to-do

Photo by Francois Van

3 Just a month before this incident with Facebook, British Airways got hacked and lost the information of 380,000 passengers, including credit card details, among them one top-up card of yours truly. They let hackers break into their infrastructure and website, and once that was done, they slept on it for 15 days! Only after 15 days did BA realize that one of the scripts on their website was not one they had put there on purpose, and ring the alarm bell. But it was a bit late by then, innit? The hack was done by changing a JavaScript library hosted on BA infrastructure and adding a nice function at the end that reads your payment information and sends it to their beautiful little server in Romania. This means that the hackers had gained access to BA servers in order to change a file that was used on the payment pages. More information on the story: https://www.theregister.co.uk/2018/09/11/british_airways_website_scripts/

4 There is a bank called TSB in the UK (by the way, they have a sister bank in Ireland called “Permanent TSB”, go figure :)) which had intermittent failures throughout 2018. They have a story, actually: first they had a database meltdown while migrating their customer accounts from the legacy Lloyds Bank system to their new owner’s. (Migrating a bank’s customer data, they must be crazy.) This failed, obviously; come Monday morning, people weren’t able to use their accounts for a while. This created such outrage that MPs actually questioned their CEO. Anyway, he survived until another outage that lasted two weekend days plus Monday, just to be sure, and it is reported that he’s getting a nice severance package on his way out of the door (they say £1.7m, not bad for a rainy day). More information on the story: https://www.theregister.co.uk/2018/11/19/tsb_names_debbie_crosbie_as_ceo/
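A side note on the British Airways story above: one general way to notice that a hosted script has been altered is to compare its hash against a known-good value, which is the idea behind the browser feature called Subresource Integrity. The Python sketch below is only an illustration under assumptions. The function names and the sample script content are made up, and nothing here describes what BA actually ran or how they eventually spotted the change.

```python
import base64
import hashlib
import hmac


def integrity_value(content: bytes) -> str:
    """Return a 'sha384-<base64 digest>' string for the given file content."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def is_unchanged(content: bytes, expected: str) -> bool:
    """True only if the content still hashes to the recorded known-good value."""
    return hmac.compare_digest(integrity_value(content), expected)


# Hypothetical example: record the hash at release time, re-check it later.
released = b"function pay() { /* legitimate code */ }"
known_good = integrity_value(released)

# Simulate an attacker appending a function to the end of the file.
tampered = released + b"\nsendCardDetailsElsewhere();"

print(is_unchanged(released, known_good))  # the original file still matches
print(is_unchanged(tampered, known_good))  # the appended code changes the hash
```

A check like this only helps if the known-good hash is stored somewhere the attacker cannot also change, and if someone actually looks at the result more often than once every 15 days.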

This is not a recent one, but it is still fun to talk about. In December 1998, NASA launched the Mars Climate Orbiter to Mars (obviously), and it arrived in September 1999. It was designed to enter orbit at an altitude of around 226 kilometers. On the day of the maneuver, communication with the orbiter was unfortunately lost somewhere around 110 kilometers altitude. The investigation found that the error came from the fact that one piece of software, which calculated the force of the spacecraft’s thrusters, was producing output in pound-force units, while the trajectory-calculating software was expecting input in newtons. Thus the orbiter made all the wrong decisions, and it probably disintegrated somewhere around 80 kilometers above Mars. It was caused by nothing but a difference in data type! Just the data type! Why would anyone produce non-metric output that’s going to be used in a spacecraft? Why? But anyway, it happened.
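This kind of mix-up can be made harder to write by giving each unit its own type, so that a value in pound-force cannot be handed to code expecting newtons without an explicit conversion. What follows is a minimal Python sketch under assumptions: the class and function names are hypothetical and have nothing to do with NASA’s actual software. Only the conversion factor (1 lbf = 4.4482216152605 N) is a standard definition.

```python
from dataclasses import dataclass

NEWTONS_PER_POUND_FORCE = 4.4482216152605  # standard definition of 1 lbf in N


@dataclass(frozen=True)
class Newtons:
    value: float


@dataclass(frozen=True)
class PoundsForce:
    value: float

    def to_newtons(self) -> Newtons:
        """Convert explicitly; this is the only way to get Newtons from lbf."""
        return Newtons(self.value * NEWTONS_PER_POUND_FORCE)


def plan_trajectory(thrust: Newtons) -> float:
    """Hypothetical consumer that insists on newtons; returns the raw value."""
    if not isinstance(thrust, Newtons):
        raise TypeError(f"expected Newtons, got {type(thrust).__name__}")
    return thrust.value


thrust = PoundsForce(10.0)
print(plan_trajectory(thrust.to_newtons()))  # explicit conversion is accepted

try:
    plan_trajectory(thrust)  # wrong unit: rejected instead of silently used
except TypeError as err:
    print(err)
```

The runtime check is there because Python does not enforce type hints by itself; a static type checker such as mypy would flag the second call before the code ever ran.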

Another one, from my workplace: we use magnetic cards to enter the office because we’re practical and safe, right? A couple of weeks ago, we had to wait three hours to get in because the door system had broken down. We had to call in emergency guys to grind, cut and break the metal doors to get in. And you know what happened? The electricity went down over the weekend, and when that happened, the software on the door tried to fall back on the batteries as a fail-safe. But it turns out that we had forgotten to renew our subscription for the support of the door software, so no one had brought in any new batteries in the meantime, and the current batteries had just died of boredom. Since the fail-safe software didn’t know how to survive a battery-not-found exception, it died quietly, even though the power was back on within 30 minutes. At least the company reimbursed the extra morning coffee and croissant that day. So, still safe, but not very practical.
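A fail-safe that crashes on its own failure case is not much of a fail-safe. As a rough Python sketch of the alternative, the controller can treat a missing battery as just another state and go back to normal as soon as mains power returns. Everything here is hypothetical, including the BatteryNotFound exception and the function names; it is not the real door software.

```python
from typing import Callable


class BatteryNotFound(Exception):
    """Hypothetical error raised when no usable backup battery is present."""


def choose_power_source(mains_on: bool, read_battery_level: Callable[[], float]) -> str:
    """Pick a power source without ever letting a battery error escape."""
    if mains_on:
        return "mains"
    try:
        level = read_battery_level()
    except BatteryNotFound:
        return "none"  # degraded, but the controller itself keeps running
    return "battery" if level > 0 else "none"


def dead_battery() -> float:
    """Stand-in for a battery reader when the battery is missing or dead."""
    raise BatteryNotFound("no battery detected")


# Simulated weekend: mains on, power cut with a dead battery, mains back on.
for mains_on in (True, False, True):
    print(choose_power_source(mains_on, dead_battery))
```

The point of the design is that the decision is re-evaluated on every pass, so a dead battery during the outage does not stop the door from working once the electricity is back.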

I think I could go on with failures of all kinds, but it would be more of the same and not fun anymore (though I’d love to hear/read your own stories). So the big question is: why do these stupid and funny errors keep happening? What are we doing wrong with software? Are we just not mentally equipped to write software? Does the world even understand software? Is there any way we could scale software? Even though these questions look a bit depressing, I think they are important, guiding and enlightening questions that we should think about (for our future’s sake).

Photo by Marc-Olivier Jodoin

Erkin Ünlü

Software Engineer
