The delicate software supply chain

The great CrowdStrike patch failure of ‘24 (which is still unfolding as I’m writing this) is unlikely to have been a cyber-attack, but we won’t know for sure until CrowdStrike finishes its post-incident review (PIR) and publishes the results.

It’s an interesting look into how modern software is built, and at this stage it might be a bit unfair to pin everything on CrowdStrike since we don’t know all the details. Testing updates can be complex – especially when you control only your part of the puzzle – because all modern software is built on top of other software; nobody builds from scratch any more.

Because of this layering, the underlying software everybody uses needs to be updated regularly: it contains vulnerabilities, it’s just that nobody has discovered them yet. When a new exploit is discovered and released into the wild, all the software running the old version needs to be upgraded; otherwise, your company could be targeted and hacked.

Patch Management Standards

We need to patch the software, but why do we accept automatic updates? Because it’s a risk-versus-reward decision. The longer you wait to apply a patch, the longer you’re exposed to the vulnerability after its discovery. Also, many companies must stay compliant with published cybersecurity standards in order to operate. Even if they don’t have to comply, compliance is often a good thing, since it shows business partners and customers that they take security seriously.

When organisations write their internal security policies, they don’t write them from scratch – why would you do that when you can “leverage” the work of others who have come before you?  You can take what is already there and tailor the standards to the level that suits the organisational risk appetite.

Which standard you choose to leverage depends on where you are in the world and the industry you’re in, but in Australia, some of the more popular standards are: 

  • NIST 800-53 – US-based, but also very popular in the rest of the world.
  • ISO 27001/2 – The international standard.
  • Essential 8 – Published by the Australian Cyber Security Centre; it outlines the eight security controls the ACSC considers most critical.
  • ASD ISM – The Information Security Manual, published by the Australian Signals Directorate.  This is essentially Australia’s official version of NIST 800-53.
  • APRA CPG 234 – Prudential guidance published by the Australian Prudential Regulation Authority (APRA). Following it (and being able to demonstrate that you do) is required for Australian financial institutions to keep their banking licence.

There are many others as well, but these are the more popular ones. All these standards are similar in most ways but have varying levels of detail. CPG 234 is quite high-level and open to a wide range of interpretations, while NIST 800-53 is quite detailed and prescriptive.

What the Standards Say

NIST 800-53 control SI-2 (3) recommends that the organisation employ automated mechanisms to apply flaw remediation to information system components, and the supporting guidance notes:

Organization-defined time periods for updating security-relevant software and firmware may vary based on a variety of risk factors, including the security category of the system, the criticality of the update (i.e., severity of the vulnerability related to the discovered flaw), the organizational risk tolerance, the mission supported by the system, or the threat environment. Some types of flaw remediation may require more testing than other types. Organizations determine the type of testing needed for the specific type of flaw remediation activity under consideration and the types of changes that are to be configuration-managed. In some situations, organizations may determine that the testing of software or firmware updates is not necessary or practical, such as when implementing simple malicious code signature updates. In testing decisions, organizations consider whether security-relevant software or firmware updates are obtained from authorized sources with appropriate digital signatures.

CPG 234 says that patching should be “timely”.

ISO 27001 does not prescribe a “should” or a “must” for automated patching, but it provides a framework and best practices for consistently and effectively managing information security risks, including those related to automated patching.

Patching makes up a full quarter of the “Essential 8”, taking two spots (patching operating systems and patching applications). At a high level, it states that software should be scanned for vulnerabilities daily, and that patches must be applied within 48 hours for internet-facing systems if a vulnerability is discovered and an exploit is available; you’re allowed longer if there’s no current exploit.
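To make that decision logic concrete, here is a minimal Python sketch. Only the 48-hour figure comes from the guidance described above; the function name and the fallback window are placeholders you would set from your own policy, not anything the standard prescribes.

```python
from datetime import timedelta

def patch_deadline(internet_facing: bool, exploit_available: bool,
                   fallback_days: int = 14) -> timedelta:
    """Return the maximum time allowed before a patch must be applied.

    The 48-hour case reflects the Essential 8 guidance described above
    (internet-facing system with a known exploit). The fallback window is
    a placeholder; set it from your own policy and risk appetite.
    """
    if internet_facing and exploit_available:
        return timedelta(hours=48)
    return timedelta(days=fallback_days)

# Example: an internet-facing server with a public exploit must be patched
# within 48 hours; without a known exploit, the (assumed) longer window applies.
print(patch_deadline(internet_facing=True, exploit_available=True))   # 2 days, 0:00:00
print(patch_deadline(internet_facing=True, exploit_available=False))  # 14 days, 0:00:00
```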

If you want to be as compliant as possible with every standard, it’s easier to err on the side of caution and make patching a “must” activity with a relatively short timeframe.

Large organisations will have many scanning tools that produce thousands or even tens of thousands of “vulnerabilities” that “must” be patched.  Manual patching takes a lot of time and is frequently ignored in favour of more important things.

The end result of all this is that if a software vendor offers an automated update option, it’s almost always turned on (especially if it’s security software), because manually scanning, identifying the affected systems, researching a fix, testing it, putting it through change management, and applying it is always going to take far longer than 48 hours, which means you’re no longer compliant with policy.
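To illustrate the arithmetic behind that claim, here is a rough sketch; every step duration below is a hypothetical placeholder rather than a measured figure, but even optimistic estimates blow well past a 48-hour window.

```python
from datetime import timedelta

# Hypothetical durations for each manual remediation step; substitute your
# own organisation's figures. Even generous estimates exceed 48 hours.
manual_steps = {
    "scan and identify affected systems": timedelta(hours=8),
    "research the fix":                   timedelta(hours=8),
    "test the fix":                       timedelta(hours=24),
    "raise and approve a change":         timedelta(hours=48),
    "apply the fix fleet-wide":           timedelta(hours=24),
}

sla = timedelta(hours=48)
total = sum(manual_steps.values(), timedelta())

print(f"Total manual effort: {total}, SLA: {sla}")
print("Compliant" if total <= sla else "Non-compliant with a 48-hour policy")
```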

Supply Chain Security

In an enterprise software agreement, if a company buys software from CrowdStrike and CrowdStrike offers an automated patching function, CrowdStrike is expected to test the software before deployment. 

This responsibility would be written into the contract, and CrowdStrike might offer a SOC 2 report (the audit outcome from an external auditor) as evidence that they have done so.  In some cases, the purchaser will also interview CrowdStrike and ask questions about their business processes, or even ask for a screen share to validate that certain security controls are in place.  They might talk to the engineer who designed the testing process and see whether the answers align with what the company has claimed.

But at the end of the day, it comes down to trust.  The purchaser has no real way to verify the vendor’s claims.  Short of perhaps hiring an ex-engineer who is willing to talk and spill company secrets, they have to trust that CrowdStrike is doing the right thing.  It is a hole in the third- and fourth-party vendor management security “control” that you can drive a truck through, really. And there’s no good way to solve this problem besides trusting your vendors.

Patch Testing

As I wrote in the opening paragraph, although CrowdStrike is responsible for testing the patches it releases, it might not be all CrowdStrike’s fault. I know nothing about their testing process. They also have to contend with other vendors and their release cycles. They don’t make the operating system—Microsoft does, and Microsoft has its own automated patches applied continuously.

It’s possible that CrowdStrike’s update was tested on a version of Windows that hadn’t been released yet.  Then Microsoft releases their patch, CrowdStrike releases theirs soon after, the two don’t work well together, and suddenly everything goes south.  I also don’t know what CrowdStrike’s release process is like.  Did a junior developer release a new patch without following proper process? Has CrowdStrike outsourced testing overseas? Perhaps they have replaced key staff with AI to speed up the process and it’s backfired?

They probably have some reasonably robust release processes, but clearly, a control failed at CrowdStrike, otherwise we wouldn’t be where we are.

Supply Chain Attacks

As I wrote previously, this is unlikely to be a cyber attack.  But given what I’ve outlined, if a threat actor wanted to distribute malware, this type of hack would be the perfect way to do it.  And I’m not singling out CrowdStrike here: any vendor that offers automatic patching, has a large footprint across many companies, and deploys software that runs at an administrative level is a large, juicy target for this type of attack.

Why spend all your time hacking a single bank when you can hack all the banks in the world at once? CrowdStrike isn’t the only vendor in this privileged position; we’ve seen this type of attack before with SolarWinds.

Now What?

I recommend a few different approaches and next steps for consideration:

For Vendors

Automated Testing.  Don’t rely too much on automation to do the heavy lifting (on the back end).  Using test automation to do more testing in less time with more coverage? That’s a great idea.  Using automated testing (or AI) to replace testing teams and cut overheads? That’s a bad idea.  AI is useful, but it can’t yet think “outside the box” like a human can, because it can’t think much beyond its pre-programmed scenarios.  The testing process should be bulletproof for software that runs at such a privileged level.

Governance. Ensure that there’s only one way to deploy to production, enforce testing prior to deployment, and lock it down so that it cannot be bypassed. This requires a lot of thinking about insider threats and building processes that make it impossible for any one person to release code on their own. It does tend to make the release process more complicated, which often annoys developers; there’s a delicate balance to strike between usability and security, but it’s possible to do.
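As a minimal sketch of that “no one person can release alone” gate, here is what the core check might look like; the class and function names are hypothetical, not any particular CI system’s API.

```python
from dataclasses import dataclass, field

@dataclass
class ReleaseRequest:
    """Hypothetical record of a proposed production deployment."""
    author: str
    tests_passed: bool
    approvers: set = field(default_factory=set)

def may_deploy(request: ReleaseRequest, min_approvers: int = 2) -> bool:
    """Single gate for production deployments.

    Enforces that tests have passed and that at least `min_approvers`
    people other than the author have approved, so no one person can
    ship code on their own.
    """
    independent_approvers = request.approvers - {request.author}
    return request.tests_passed and len(independent_approvers) >= min_approvers

# Example: the author approving their own change doesn't count.
req = ReleaseRequest(author="alice", tests_passed=True, approvers={"alice", "bob"})
print(may_deploy(req))  # False: only one independent approver

req.approvers.add("carol")
print(may_deploy(req))  # True: two independent approvers and green tests
```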

Transparency.  What might help give assurance is if the whole vendor QA and deployment process were transparent and could somehow be verified by customers before accepting a patch.  But perhaps this is wishful thinking.  I think there will be more calls for oversight of this process once everybody gets back up and running again.

For Their Customers

Keep Auto Update On.  Don’t fall back to manual processes.  Once the vendor has developed and tested the patches properly, updates must remain automated and continuous.  We can’t go back to the dark ages of “Patch Tuesday” (as all exploits will simply be released on Wednesday) or patching only at night to reduce the impact (it’s always daytime somewhere in the world – which is why this happened in Australia at 3pm).

If you want to go back to a manual process, you need to weigh the risk of not patching, along with the significant workload increase, against the slight chance of a patch breaking something. In general, and in my opinion, it’s not worth turning off automatic patching.  You may have a special use case, such as business-critical systems that can’t afford any downtime; for these scenarios, you need to perform your own risk assessment.
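One way to frame that risk assessment is a back-of-the-envelope expected-cost comparison. All of the probabilities and dollar figures below are made-up placeholders you would replace with your own estimates; the point is the structure of the comparison, not the numbers.

```python
# Back-of-the-envelope comparison: expected annual cost of automatic
# patching vs a manual process. All figures are hypothetical placeholders.

def expected_cost(prob_incident: float, incident_cost: float, overhead: float) -> float:
    """Expected annual cost = chance of an incident x its cost, plus fixed overhead."""
    return prob_incident * incident_cost + overhead

# Automatic: small chance a bad patch slips through, no extra staff overhead.
auto = expected_cost(prob_incident=0.01, incident_cost=500_000, overhead=0)

# Manual: larger exposure window to exploits, plus the staff time to patch by hand.
manual = expected_cost(prob_incident=0.05, incident_cost=2_000_000, overhead=150_000)

print(f"Automatic patching: ~${auto:,.0f}/year expected")
print(f"Manual patching:    ~${manual:,.0f}/year expected")
```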

Proper Planning.  We should all refresh our memory on NIST SP 800-40r4 – the Guide to Enterprise Patch Management Planning.  It tells you that you should have defined risk response scenarios in the event of patch failure, and it contains a list of questions to ask vendors during procurement to evaluate their software lifecycle maturity.

For Everybody

Offshoring.  Companies often choose to offshore to countries with a low cost base. That low cost base can also come with higher attrition and turnover, and a lower skill level, which means not all staff may understand what they are doing or why. A small miscommunication between teams due to language, or a failed handover due to time differences, can very easily cause a problem like this to happen.

Of course it can happen onshore as well, but it is less likely. Offshoring certainly has its place, especially for lower-value or repetitive work. But offshoring business-critical functions brings with it a much higher risk of problems occurring and increases the incident response time if something eventually does go wrong.

Conclusion

Don’t be too hard on them just yet.  People make mistakes, and it’s very easy to pile on with all the detractors.  Feature-wise, CrowdStrike is still the best EDR product on the market, with a large user base, which is why this failure had such a large impact.

Hopefully, this high-profile incident will be the catalyst that sparks a solution. CrowdStrike could take this as an opportunity to be a (dare I say it) thought leader and show the rest of us how to fix these issues across the industry, preventing this, or worse, from happening again.
