Crowdstruck - Some things to think about
Written by AP · July 24, 2024

“Turning and turning in the widening gyre
The falcon cannot hear the falconer;
Things fall apart; the centre cannot hold;
Mere anarchy is loosed upon the world…”
                                                 - WB Yeats

The CrowdStrike Falcon incident has impacted millions of Windows-based computers worldwide and has had cascading impacts across a wide swathe of industries. It’s scary (but ironically not too surprising) that a single update to a single software asset can have such a catastrophic domino effect. This in turn leads us to think about how systems are maintained and updated in the real world, and how we might mitigate some of these situations in the systems we use and work with.

What’s CrowdStrike, and why were so many systems affected?

Of course, if there’s something momentous in the tech world, there’s a good XKCD about it…

What is CrowdStrike Falcon in the first place? It belongs to a class of security tools called EDRs (Endpoint Detection and Response). In simple terms, you can think of EDRs as a modern, more sophisticated evolution of ye-olde antivirus software. Gartner defines EDRs as “solutions that record and store endpoint-system-level behaviors, use various data analytics techniques to detect suspicious system behavior, provide contextual information, block malicious activity, and provide remediation suggestions to restore affected systems.” These are installed on your systems (servers, end-user desktops/laptops, etc.) and monitor various things your system is doing. For example, an EDR could look at:

  • Software that’s being run, and by which users and processes
  • Users and user activity
  • Web activity, sites visited, scripts run and what’s being downloaded
  • Any installations or executables being run

All of this is captured by an agent running on your systems, which exports the information to a data lake or other central location, where data-analytics and AI tooling scan the captured data for patterns of threats and potential issues. The results can then be fed back to the agent to block potentially harmful activity earlier in the cycle. This is a proactive approach to security compared to antivirus, which tends to be more reactive, and it works better against the threat models seen today.
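
To make the capture-and-export loop concrete, here is a deliberately tiny Python sketch of the general idea: snapshot what is running and ship the events somewhere central for analysis. It is purely illustrative, uses the third-party psutil package, prints instead of uploading, and is in no way how Falcon itself is implemented.

```python
# A toy illustration of an EDR-style "capture and export" loop.
# NOT how CrowdStrike Falcon works internally -- just the general idea.
# Requires the third-party 'psutil' package (pip install psutil).
import json
import time

import psutil


def snapshot_processes():
    """Capture one snapshot of running processes: who is running what."""
    events = []
    for proc in psutil.process_iter(attrs=["pid", "name", "username"]):
        event = dict(proc.info)            # pid, name, username for this process
        event["captured_at"] = time.time()
        events.append(event)
    return events


def export_events(events):
    """In a real agent this would stream to a central data lake / analytics
    backend for threat detection; here we just serialise to stdout."""
    print(json.dumps(events, default=str))


if __name__ == "__main__":
    export_events(snapshot_processes())
```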

In order to do this, most EDRs (and their evolution, XDRs) depend on having low-level access to systems, often at the operating-system kernel level. This allows them to detect sophisticated malware that might bypass application-level security scanning.

CrowdStrike is one of the top players in the EDR market and is deployed in many enterprise-scale organisations (and on millions of individual computers) worldwide. The CrowdStrike Falcon Blue-Screen-of-Death (BSOD) that affected Windows systems worldwide was, in effect, a failure in a piece of software with access to the low-level innards of the operating system, causing the OS to crash and effectively crippling these systems.

So how did this all come about?

This is something where details will become clearer in the days to come. It was ostensibly due to a defective “Rapid Response Content” update to the CrowdStrike Falcon sensor agent that runs on Windows systems. (Source) CrowdStrike narrowed it down to defective validation checks that allowed content templates to pass validation despite containing problematic content data.

But while the post-mortem plays out and the chain of events that led to this is uncovered, this is not likely to be the last time something like this happens. And that has to do with how a lot of modern software systems are wired together.

The tangled web of software dependencies

Any modern software system is an increasingly complex, interwoven Rube Goldberg machine of dependencies, where potential failures and cascading effects are possible from seemingly innocuous changes. Most software stacks are composed of a myriad of internal (to the organisation that owns the product, platform or system) and external software libraries, tooling and systems. Each of those dependencies can have its own set of dependencies, and so on. It’s turtles all the way down. To get an idea of the complexity of a modern microservice-based stack, we can look at the Netflix example below.

The Netflix API service dependency graph (Source). Red dots are Netflix libraries, black dots are OSS, and light red/grey dots are transitive dependencies.
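
To get a feel for how quickly a handful of direct dependencies fans out into a much larger transitive set, here is a small, self-contained Python sketch over an invented dependency graph; the package names are made up, and real package resolvers are of course far more involved.

```python
# A minimal sketch of why dependency trees balloon: resolving the transitive
# closure of a tiny, hypothetical dependency graph. Package names are invented.
from collections import deque

DEPENDS_ON = {
    "our-service": ["web-framework", "payments-sdk"],
    "web-framework": ["http-client", "templating"],
    "payments-sdk": ["http-client", "crypto-lib"],
    "http-client": ["tls-lib"],
    "templating": [],
    "crypto-lib": ["tls-lib"],
    "tls-lib": [],
}


def transitive_dependencies(root):
    """Breadth-first walk collecting every direct and transitive dependency."""
    seen, queue = set(), deque(DEPENDS_ON.get(root, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(DEPENDS_ON.get(dep, []))
    return seen


print(transitive_dependencies("our-service"))
# Two direct dependencies fan out into six packages we implicitly depend on --
# a failure in any one of them (e.g. 'tls-lib') can ripple up to 'our-service'.
```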

Resilience, control and mitigation

As companies that deliver software products or services to users, there are two lenses through which we need to look at mitigation strategies for scenarios like the CrowdStrike incident:

  1. How do we avoid a situation where users of our product, platform or system are impacted the way users of CrowdStrike were impacted?
  2. How do we mitigate the impact on our systems when a dependency like CrowdStrike has a failure?

Both these scenarios have a fair amount of overlap, and a failure of the nature described in point 2 will often lead to the failure described in point 1. We can look at a few options that help navigate the tangled undergrowth in the jungle that is dependency management.

SBOMs

Having visibility and control over a system’s “Software Bill of Materials” or SBOM is considered a vital part of both ensuring security as well as maintaining stability in a software system.

It’s also important to have predictable, deterministic control over the versioning and upgrading of the SBOM for a software system. This becomes much trickier in today’s world of labyrinthine dependency maps and versioning across the SBOM. With microservices, you have a distributed graph of dependencies that needs to be maintained and inspected across hundreds of services.

In these scenarios, there are some good tools available to view and document your SBOM; notable ones include Syft and Tern, while OWASP’s CycloneDX and the Linux Foundation’s SPDX are common industry-standard SBOM formats that such tools can generate. These are great for producing SBOMs that comply with regulatory standards where required. A rough example of generating and inspecting an SBOM is sketched below.
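
The sketch shells out to the Syft CLI for an example container image and lists the components from the resulting CycloneDX JSON; it assumes Syft is installed, the image name is only a placeholder, and the field handling is a starting point rather than a complete tool.

```python
# A sketch: generate a CycloneDX-format SBOM with Syft and summarise it.
# Assumes the 'syft' CLI is installed; 'alpine:3.19' is only an example target.
import json
import subprocess

TARGET = "alpine:3.19"  # any container image or directory Syft can scan

raw = subprocess.run(
    ["syft", TARGET, "-o", "cyclonedx-json"],   # emit CycloneDX JSON to stdout
    capture_output=True, text=True, check=True,
).stdout

sbom = json.loads(raw)
components = sbom.get("components", [])
for component in components:
    print(f'{component.get("name")}=={component.get("version")}')
print(f"Total components: {len(components)}")
```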

Dependency Management and CI/CD Automation

Most modern software stacks and languages allow you to lock the dependency configuration of your services to specific versions, and to take a deterministic approach to changes that is tracked in version control. Alongside this, tooling like Dependabot and Renovate can be integrated directly with your services’ CI tooling to scan every build for outdated or vulnerable dependencies. This becomes even more powerful when combined with a good set of smoke and regression tests run on services before roll-out. Maintaining your software assets with a deterministic “everything-as-code” approach has a lot of cascading advantages; a minimal example of a version-pinning gate is sketched below.
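
The sketch is written for a Python requirements.txt; every ecosystem has its own lockfile equivalent (npm ci, Cargo.lock, Go modules, Gradle version locking), so treat it as illustrative only.

```python
# A minimal CI gate (a sketch, not a full parser) that fails the build if any
# dependency in a Python requirements.txt is not pinned to an exact version.
import re
import sys

PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==\S+")


def unpinned_requirements(path="requirements.txt"):
    bad = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.split("#", 1)[0].strip()   # drop comments / whitespace
            if not line or line.startswith("-"):   # skip blanks and pip options
                continue
            if not PINNED.match(line):
                bad.append(line)
    return bad


if __name__ == "__main__":
    offenders = unpinned_requirements()
    if offenders:
        print("Unpinned dependencies found:", ", ".join(offenders))
        sys.exit(1)                                # fail the pipeline
    print("All dependencies are pinned.")
```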

It’s also vital to maintain test code, test data and configuration with the same discipline as application code. This would help prevent instances like the CrowdStrike case, where the test automation for the “Rapid Response Content” update returned a false negative. It’s also important to periodically review tests and ensure that watermelon tests (green on the outside, red on the inside) are fixed or weeded out of the test suite.
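
As a concrete (and entirely hypothetical) illustration, the sketch below shows a content-template validator alongside the kind of test that keeps the validator itself honest; the schema and field names are invented, and the point is simply that the gate must genuinely inspect the payload and must itself be tested so it cannot quietly become a watermelon.

```python
# A hypothetical content-validation gate, loosely inspired by the idea of checking
# a "Rapid Response Content" style payload before it ships. The schema is invented.
def validate_content_template(template: dict) -> list:
    """Return a list of problems; an empty list means the template passes."""
    problems = []
    fields = template.get("fields")
    if not isinstance(fields, list) or not fields:
        return ["template has no fields"]
    for i, field in enumerate(fields):
        if field is None:
            problems.append(f"field {i} is null")              # uninitialised content
        elif not isinstance(field, str) or not field.strip():
            problems.append(f"field {i} is empty or not a string")
    return problems


# Tests that assert the validator actually *rejects* bad content guard against
# the false-negative failure mode: a validator that waves everything through.
assert validate_content_template({"fields": ["rule-1", "rule-2"]}) == []
assert validate_content_template({"fields": ["rule-1", None]}) != []
assert validate_content_template({"fields": []}) != []
```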

Managing cloud-based black-box changes

However, even when dependencies are mapped and maintained in a structured, sane manner, there is still room for chaos. Dependencies that automatically update themselves “over the wire”, with little control in the hands of the user, are a potentially uncontrollable point of failure. This is what happened with the CrowdStrike update: the Windows systems with CrowdStrike installed received automatic over-the-wire updates and crashed as a result.

In mission-critical systems, it is usually preferable to have governance over when and how updates are applied, and where possible, automatic updates without administrative intervention or control should be avoided. This allows for staged updates of dependencies that can be tested thoroughly before roll-out to production systems, as sketched below.
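
In the sketch, apply_update and healthy are hypothetical stand-ins for real deployment and monitoring tooling, and the ring sizes and soak time are purely illustrative.

```python
# A sketch of a ring-based (staged) rollout: update a small ring first, let it soak,
# check health, and only then widen the blast radius.
import time

RINGS = [
    ["canary-01", "canary-02"],                    # ring 0: a handful of test systems
    ["batch-a-01", "batch-a-02", "batch-a-03"],    # ring 1: a small production slice
    ["prod-01", "prod-02", "prod-03", "prod-04"],  # ring 2: the rest of the fleet
]

SOAK_SECONDS = 5   # illustrative; in practice this is hours or days


def apply_update(host):
    print(f"updating {host}")      # placeholder for the real update mechanism


def healthy(host):
    return True                    # placeholder for real health checks / telemetry


def rollout(rings=RINGS):
    for ring_number, hosts in enumerate(rings):
        for host in hosts:
            apply_update(host)
        time.sleep(SOAK_SECONDS)   # let the change soak before judging it
        if not all(healthy(host) for host in hosts):
            print(f"ring {ring_number} unhealthy -- halting rollout and rolling back")
            return False
        print(f"ring {ring_number} healthy -- promoting to the next ring")
    return True


if __name__ == "__main__":
    rollout()
```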

However, when it comes to end-user devices like laptops or airport check-in/booking kiosks, this needs to be balanced against the complexity, cost and effort of rolling out changes across a fleet of devices used by people with varying levels of technical know-how across an organisation. Organisations use systems-management / device-management tooling with varying levels of control, granularity (and intrusiveness) to manage their software and hardware assets. Ironically, CrowdStrike Falcon itself could fall into a subset of this broad category of centralised systems-management suites. Where possible, it’s good to test updates on a few test systems or devices before rolling changes out to the organisation.

Building for resiliency from the ground up

While the measures mentioned above can help with managing dependencies and preventing failure to an extent, it is also important to plan for failure in our software systems and to focus on resiliency rather than the absolute avoidance of failure (which is nearly impossible in today’s rapidly evolving, interconnected systems).

It goes without saying that some systems need a much higher investment in stability and failure avoidance (healthcare life-support systems, for example). So one needs to figure out what takes priority for their software system.

A simplified matrix of software platform categories, plotting their risk appetite against how frequently they tend to push feature updates.

The example matrix in the image above illustrates a high-level view of example systems based on category. It is not scientific or based on real data, but simply an illustrative chart to give the reader a sense of where types of systems fall along those axes. It may make sense to look at your organisation’s software suite, dive into more granular layers, and figure out where individual components of your product or platform fall. That would be a starting point for further classification.

In actual practice, there are various levels of categorisation and risk classification applied to software systems as part of IT service management, and not all change is equal. So for any change in a software asset, it’s also important to be aware of (see the sketch after this list):

  • Importance and Urgency of change
  • Risk associated with the change
    • Impact of failure
    • Probability of failure
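
As referenced above, here is a toy way of combining impact and probability into a single rating that decides how much ceremony a change gets; the weights, thresholds and labels are illustrative, not a formal standard.

```python
# A toy risk-classification helper: combine the impact and probability of failure
# into a rating that drives how much ceremony a change needs. Illustrative only.
IMPACT = {"low": 1, "medium": 2, "high": 3}
PROBABILITY = {"unlikely": 1, "possible": 2, "likely": 3}


def risk_rating(impact: str, probability: str) -> str:
    score = IMPACT[impact] * PROBABILITY[probability]
    if score >= 6:
        return "high risk: staged rollout, change review, rollback rehearsed"
    if score >= 3:
        return "medium risk: automated tests plus a canary before full rollout"
    return "low risk: standard pipeline checks"


# e.g. a kernel-level sensor update pushed to every Windows machine in the fleet:
print(risk_rating(impact="high", probability="possible"))   # -> high risk
```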

It’s important to recognise that software systems can fail despite the best efforts and intentions. We should build systems that make it quick to detect when something fails, and why, and that let us recover quickly. So design your software systems to optimise for resilience, with a focus on MTTD (Mean Time To Detect) and MTTR (Mean Time To Recover).
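
As a quick illustration of those two metrics, the sketch below computes MTTD and MTTR from made-up incident timestamps, using the detection-to-recovery convention for MTTR (conventions vary).

```python
# A small sketch of MTTD (mean time to detect) and MTTR (mean time to recover),
# computed from invented incident timestamps.
from datetime import datetime
from statistics import mean

INCIDENTS = [
    {"started": "2024-01-05 02:00", "detected": "2024-01-05 02:25", "recovered": "2024-01-05 04:10"},
    {"started": "2024-02-11 14:00", "detected": "2024-02-11 14:05", "recovered": "2024-02-11 14:50"},
]


def _minutes_between(earlier, later):
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds() / 60


mttd = mean(_minutes_between(i["started"], i["detected"]) for i in INCIDENTS)
mttr = mean(_minutes_between(i["detected"], i["recovered"]) for i in INCIDENTS)
print(f"MTTD: {mttd:.0f} minutes, MTTR: {mttr:.0f} minutes")
```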

Some pragmatic, relatively achievable measures can go a long way in making systems resilient:

  • Architecting our systems so that components are loosely coupled and can gracefully handle failure in their dependencies, with the ability to fall back on a “Plan B” in case of failure where needed (a minimal sketch follows this list)
  • Having observability and alerting that
    • Enables us to identify potential failures before they happen, in order to prevent them
    • Enables us to quickly detect failures and anomalies
    • Enables us to quickly identify causes, and potential knock-on effects on other systems
  • Building systems that can be set up in a repeatable, deterministic way, using practices like Infrastructure-as-Code and Configuration-as-Code
  • Having a comprehensive Continuous Delivery mechanism that ensures automated, deterministic roll-out and roll-back capability, with release strategies like rainbow / blue-green deployments and progressive roll-outs, along with automated testing to ensure functionality, stability, security and performance
  • Making smaller, incremental changes where feasible that make it easy to constrain the scope and impact of change, as well as the effort to roll-out and roll-back changes
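
To illustrate the “Plan B” idea from the first bullet above, here is a minimal fallback wrapper in Python; a production system would typically reach for a proper circuit-breaker or retry library, and the “recommendation service” here is invented.

```python
# A minimal sketch of graceful degradation: call a dependency, but fall back to a
# safe default (cached or reduced-functionality data) instead of letting the
# failure cascade.
def with_fallback(primary, fallback):
    """Wrap a call so that any failure of the primary is served by the fallback."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception as exc:            # in real code, catch specific exceptions
            print(f"primary failed ({exc!r}); serving fallback")
            return fallback(*args, **kwargs)
    return call


def fetch_recommendations(user_id):
    raise TimeoutError("recommendation service unavailable")   # simulated outage


def cached_recommendations(user_id):
    return ["editor's picks"]               # stale-but-safe default content


get_recommendations = with_fallback(fetch_recommendations, cached_recommendations)
print(get_recommendations(user_id=42))      # -> ["editor's picks"]
```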

Wrapping things up

Designing systems that avoid failure, and that bounce back quickly when failure does occur, is not a trivial problem to solve. It’s important to stay on top of the various components of the organisation’s software platform and their dependencies, and to roll out changes in a controlled, deterministic manner with the right amount of visibility. While not necessarily simple, the practices and approaches mentioned above go a long way towards ensuring the stability of software systems when rolling out changes. It also makes a huge difference if these practices are driven by automation and ingrained in the organisational culture, so that they occur naturally among teams, practically as habits or conditioned reflexes.
