CrowdStrike and Digital Operational Resilience

In this blog we outline two steps to mitigate the next CrowdStrike-like incident. Take control of your ICT risk. It’s your risk and you can’t outsource it!

Figure: Concentric rings grouping systems for targeted software updates. Deployment rings are used to identify issues before they become critical.

Last week the second batch of EU Digital Operational Resilience policies was published. We were part way through reviewing them, about to start on Threat Led Penetration Testing, when the worldwide IT outage caused by CrowdStrike struck. The irony was not lost on us that a major cybersecurity incident such as this should disrupt our review of new policies designed to strengthen digital operational resilience.

According to Microsoft, the incident affected eight and a half million computers running a Microsoft operating system with CrowdStrike Falcon installed (Helping our customers through the CrowdStrike outage - The Official Microsoft Blog). Joe Tidy at the BBC reported that it could take days to get back to normal and that the outage affected the airline industry, banks, healthcare and shops (IT problems will take 'some time' to fix, says CrowdStrike boss - BBC News). This isn’t a quick fix because it involves manual updates to each affected machine, which cannot be fully automated.

Pondering this incident over the weekend has allowed us to digest its enormity and how little control companies appear to have. We say “appear” because companies have a lot of control; they just fail to think about the problem logically or critically and take the necessary action. Instead, companies choose the easy route and place the resilience of their firm in the hands of a third party, which has worked well, until it didn’t.

Incidents like CrowdStrike have happened in the past and will happen in the future. For CrowdStrike, we suspect there will be lots of legal discussions about the outage, but critical or systemic firms and industries should take note and make a more practical and measured response. In this blog we outline two steps to mitigate the next CrowdStrike-like incident.

Technology stack diversification

Companies need to look at their critical services, and the software that supports them, with a critical eye. You simply cannot afford to have any of your critical services reliant on a single operating system or a single supporting piece of software such as CrowdStrike. Critical services usually span multiple servers for high availability or disaster recovery purposes. It is essential that these servers have independent technology stacks, so that no single issue in any one piece of software can stop you from meeting your minimum service level for your clients or customers.
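As a rough illustration of the check described above, the sketch below lists the components that every server behind a critical service has in common. The inventory, service names, server names and vendor labels are all made up for the example; in practice the data would come from your own asset register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    components: frozenset[str]  # operating system plus supporting software


def shared_components(servers: list[Server]) -> set[str]:
    """Return the components installed on every server behind one service."""
    if not servers:
        return set()
    common = set(servers[0].components)
    for server in servers[1:]:
        common &= server.components
    return common


# Hypothetical inventory: critical service -> servers that support it.
inventory = {
    "payments": [
        Server("pay-primary", frozenset({"Windows Server", "EDR vendor A"})),
        Server("pay-dr", frozenset({"Windows Server", "EDR vendor A"})),
    ],
    "settlement": [
        Server("set-primary", frozenset({"Windows Server", "EDR vendor A"})),
        Server("set-dr", frozenset({"Linux", "EDR vendor B"})),
    ],
}

for service, servers in inventory.items():
    common = shared_components(servers)
    if common:
        print(f"{service}: every server depends on {', '.join(sorted(common))}")
    else:
        print(f"{service}: independent technology stacks")
```

Any component reported for a service is one that a single faulty update could take down on the primary and the disaster recovery server at the same time.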

The same applies to workstations, so maybe it’s time to think about having emergency workstations on standby that are not all based on the same technology stack. If your workstations are all based on Windows 11, your emergency workstations could run Linux or macOS.

Deployment Rings for software updates

As for CrowdStrike and the other Endpoint Detection and Response (EDR) / antivirus (AV) vendors, they need to provide customers with improved control over deployment timings. As with the deployment rings used to manage Microsoft updates and other software updates, companies need to be in control of which systems receive an update first and in which order. Deployment rings put companies in control of their own Information and Communications Technology (ICT) risk. They allow for a gradual and controlled rollout, enabling technical staff to stop deployment if things go wrong.

In the graphic above you start with a relatively small pilot group, for example members of the IT team. The wider pilot should include people from other teams so that you can spot incompatibilities with any of the software deployed to support the business teams. Then you start the wider rollout to more workstations and work your way through the server estate, starting with development and test services and moving to User Acceptance Testing (UAT). The final two rings are the most important: 1) production “non-critical servers” together with “critical secondary servers” (e.g. disaster recovery systems), and 2) “critical primary servers”, which you deploy to only once you have a high level of confidence.
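The ring order above can be sketched in a few lines of Python. This is a minimal illustration rather than a real deployment tool: the ring names follow the paragraph above, while roll_out, deploy and healthy are hypothetical names standing in for whatever your update tooling and monitoring provide.

```python
# Ring order taken from the paragraph above; the labels are illustrative.
RINGS = [
    "pilot (IT team)",
    "wider pilot (business teams)",
    "wider workstation rollout",
    "development and test",
    "UAT",
    "production non-critical and critical secondary servers",
    "critical primary servers",
]


def roll_out(rings, deploy, healthy):
    """Deploy ring by ring and stop at the first ring that fails its check.

    Returns (rings completed, ring that halted the rollout or None).
    """
    completed = []
    for ring in rings:
        deploy(ring)
        if not healthy(ring):
            return completed, ring  # later rings never receive the update
        completed.append(ring)
    return completed, None


# Simulate a faulty update that is caught in the wider pilot.
deployed = []
completed, halted_at = roll_out(
    RINGS,
    deploy=deployed.append,
    healthy=lambda ring: ring != "wider pilot (business teams)",
)
print("Completed:", completed)
print("Halted at:", halted_at)
```

In this simulation only the two pilot rings ever receive the update, so the faulty release never reaches the server estate.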

There could be an expedited version of the deployment rings above for urgent security updates that must be deployed rapidly.
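One way to picture an expedited version is to keep every ring but shorten the time spent observing each one before moving on. The sketch below assumes that approach; the ring labels are abbreviated, and the soak times and the factor of 12 are placeholders, not recommendations.

```python
from datetime import datetime, timedelta

# Placeholder soak times: hours to observe each ring before moving on.
STANDARD_SOAK_HOURS = {
    "pilot": 24,
    "wider pilot": 48,
    "workstations": 72,
    "development, test and UAT": 48,
    "non-critical and critical secondary servers": 72,
    "critical primary servers": 0,
}


def expedite(soak_hours, factor, floor_hours=1):
    """Shrink each non-zero soak time by `factor`, never below `floor_hours`."""
    return {
        ring: max(floor_hours, hours // factor) if hours else 0
        for ring, hours in soak_hours.items()
    }


def schedule(soak_hours, start):
    """Return the time at which deployment to each ring begins, in ring order."""
    times = {}
    current = start
    for ring, hours in soak_hours.items():
        times[ring] = current
        current += timedelta(hours=hours)
    return times


start = datetime(2024, 7, 22, 9, 0)
standard = schedule(STANDARD_SOAK_HOURS, start)
urgent = schedule(expedite(STANDARD_SOAK_HOURS, factor=12), start)
for ring in STANDARD_SOAK_HOURS:
    print(f"{ring}: standard {standard[ring]}, expedited {urgent[ring]}")
```

The urgent schedule still passes through every ring in the same order, so technical staff keep the ability to stop the rollout before it reaches the critical primary servers.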

The benefit of this approach for firms like CrowdStrike is obvious: the customers are in control of deployment, and the legal liability is at worst shared and at best, arguably, sits mostly with the customer. This applies to all software that is currently pushed directly by the software vendor.

Conclusion

It is interesting to see the responses to this incident, with some throwing their arms in the air and complaining about CrowdStrike. Instead of complaining, take control of your ICT risk. It’s your risk and you can’t outsource it!

How IT Security Locksmith can help

If you would like to understand more about Digital Operational Resilience and Cybersecurity, why not register for our board-level course today?

About IT Security Locksmith

IT Security Locksmith is a cybersecurity company that specialises in board-level training and consultancy.
