Windows blue screen of death with the CrowdStrike logo as the error message

Friday morning, we all awoke to what looked like a worldwide failure of Microsoft Windows machines. Banks, airlines, hospitals (and even Starbucks!) were all affected by an outage of critical machines running Windows. Microsoft would later confirm somewhere around eight million Windows machines were affected by a bad update pushed by CrowdStrike, a company many of us had never heard of. CrowdStrike is a cybersecurity company out of Texas whose software apparently runs on some mission-critical machines. But this post isn’t even really about CrowdStrike. It’s more about the reaction on social media and the idea that something like this happening is a uniquely Windows problem.

My social feeds have been inundated with some variation of:

  • “Wow, Windows is really terrible if a 3rd party app can bring down the entire system”
  • “This wouldn’t have happened on a Mac”
  • “Buy a Mac, Problem solved”

On one hand, still being into the whole Mac vs. PC rivalry feels very early 2000s, but I get the idea of poking fun. It’s like sports, where it’s (mostly) all in good fun. But there was something about the level of gloating I saw on social media… the uninformed smugness… that really irked me. Not just because this was in the midst of an event that was having very real-world consequences for millions of people. It was more the idea that something like this could never happen to any other operating system, and that this is more proof that [insert name of your favorite operating system] is better than Windows, and Windows is terrible. And folks, that’s not how software works. Not by a long shot.

All Software Has Bugs

Software of any real complexity has bugs. This is the reality of modern software development. Linux and macOS obviously handled this update better than Windows (assuming it was pushed out to those clients as well), but there’s nothing to say that they’re bulletproof.

The Mac has had its share of serious bugs. In fact, this episode reminds me of the OCSP incident that left macOS users essentially unable to launch third-party apps for the better part of an afternoon. MacBooks weren’t boot looping, but I can tell you I wasn’t able to get much work done that afternoon.

I haven’t run Linux in some time, but here’s a good article about some of the more notable bugs users have encountered in various distributions over the years.

One of the events that helped accelerate BlackBerry’s demise in the phone market was a massive service outage in 2011. This stuff unfortunately happens.

People Are Still At The Heart of This

Outside of the aforementioned real-world impact, I also thought about the engineers who will eventually be held responsible. I’ve seen lots of commentary speculating that CrowdStrike “clearly” must not have tested this update, and that this is all the result of gross negligence. And look, I don’t work at CrowdStrike, and I don’t know what their processes are internally. Something was clearly missed. But it’s highly unlikely that this update wasn’t tested. It’s more likely that this update was tested, like other updates before it, and a combination of bad data and Windows settings produced this unforeseen result. Again, there’s clearly something being overlooked in their testing process and it needs improvement. There’s also the question of why this update wasn’t rolled out using a staggered release pattern. Eight million affected devices is significant, and a less aggressive release strategy could have made this much less of a catastrophe.
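
For what it’s worth, here’s roughly what I mean by a staggered release. This is a toy Python sketch, not anything CrowdStrike (or anyone else) actually runs. The stage percentages, the failure threshold, and names like `staged_rollout` and `apply_update` are all made up for illustration.

```python
import hashlib

# Hypothetical stages: cumulative percentage of the fleet that has the update.
STAGES = [1, 10, 50, 100]


def bucket(device_id: str) -> int:
    """Hash a device ID to a stable number so cohorts don't reshuffle between runs."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def staged_rollout(devices, apply_update, max_failure_rate=0.01):
    """Push an update one stage at a time, halting if a stage fails too often.

    `apply_update(device_id)` is a stand-in that returns True when the device
    comes back healthy. Returns (devices_attempted, halted).
    """
    ordered = sorted(devices, key=bucket)
    attempted = 0
    for percent in STAGES:
        # Round up so even a tiny fleet gets a non-empty first stage.
        cutoff = (len(ordered) * percent + 99) // 100
        cohort = ordered[attempted:cutoff]
        if not cohort:
            continue
        failures = sum(1 for device in cohort if not apply_update(device))
        attempted = cutoff
        if failures / len(cohort) > max_failure_rate:
            return attempted, True  # stop before the next, larger stage
    return attempted, False


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(1000)]
    # An update that breaks every machine only reaches the first 1% stage.
    print(staged_rollout(fleet, lambda device: False))  # (10, True)
```

The point isn’t the specific numbers. It’s that a bad update gets caught while it has only touched a small slice of machines, instead of all of them at once.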

At the end of the day, no engineer is trying to push bugs or bad/malicious code (except maybe Jia Tan). People make mistakes. Things happen. I don’t work on software that affects anywhere near this number of people, but I get upset if even a small handful of users are mildly inconvenienced by some bug I was a part of. That’s the stuff that keeps me up at night. I can’t imagine the guilt one might feel for essentially bricking eight million machines, many of which are mission critical. So while I feel for the affected users and companies, I also feel for the person or team that will ultimately have to answer for this in the coming days and weeks.

We’re so used to dealing with large, faceless behemoth companies like Apple and Microsoft, we forget that companies of all sizes are just people. Often doing the best they can. But people make mistakes.

Conclusion

Your favorite operating system may not be getting the blame today, but let’s not pretend that an event like this couldn’t happen to any operating system. That’s certainly not how it should be, but it’s the reality as software gets more complex over time, not less.
