One of the best articles on incident response I’ve read

Mar 13

The "How NOT to f-up your security incident response" article in The Register this week summarised many of the common mistakes I’ve seen in a couple of decades of incident response. Jessica Lyons interviews several industry veterans, and I found myself shouting “yes, yes!” at my screen.

The article nails it and mirrors my experience of being dropped in as an Incident Handler to organisations that hadn't taken the right preparedness steps ahead of an incident and ended up doing incident reaction, not incident response. Some of the points raised in the article really stood out to me:

Confirmation bias is a prevalent problem among both SOC Analysts and Digital Forensics & Incident Response teams. That's why I always taught my teams to use the principle of the negative hypothesis: form a hypothesis, but hunt for evidence that negates your proposition rather than confirms it. Doing this often uncovers evidence that extends the scope and makes you come up with a new hypothesis.
Organisations that set the expectation with senior management that recovery time after a destructive cybersecurity incident will be the same as for a business continuity/disaster recovery event are often forced to skip essential investigation, remediation and testing steps before going back into production in order to achieve what they’ve promised. As a result, they suffer repeated outages due to reinfection. Sometimes stopping for a few hours or days can save weeks or months of repeated recoveries.
In the article, Mandiant's CTO Charles Carmakal calls out "not properly scoping out the investigation and being too narrowly focused". I often see this with organisations that use "automagical" solutions to simply recover backup images that "don't have Indicators of Compromise in them", not realising that the IoC is the signpost, not the root cause. I also see it when organisations limit the scope to systems that have been encrypted, which is often the very last stage of the attack, and ignore the points of initial entry and persistence mechanisms scattered around other systems.
The other observation from Carmakal is that the "people who are directly involved in the incident response aren't the same one doing the hands-on remediation activities…Then you have a lost-in-translation situation, where, if the guidance was verbal and not written on paper, you're very likely to miss some of the important nuances". This is often the case when Security Operations does the investigation and IT Operations does the remediation. This is exactly why, when I helped create the Cohesity Clean Room solution, this shared responsibility model, often with iterations back and forth when remediations aren’t achievable, was taken into consideration to ensure systems are secure before recovery back into production.
Jake Williams of Hunter Strategy makes a good point about recovering and cleaning volume-level backups versus rebuilding systems and recovering data: “you're taking a huge risk, a personal risk and a career risk, by not rebuilding…People try to clean malware off of systems rather than rebuilding systems. But you just can't ever deem a system clean once a threat actor has been on it.” Investigation and remediation slow down the ultimate recovery of systems back into production, but as I mentioned above, premature recovery will result in reinfection. So how do organisations achieve the very aggressive recovery targets the board and regulators are demanding, yet ensure that systems are recovered to a secure state? The answer is to have the ability to rebuild systems to a secure state by maintaining install images, configurations and the means to rebuild (which can be anything from simple scripts, through Ansible playbooks, to Terraform configurations) on an immutable vaulted store that can be mounted in minutes. At Cohesity we use the Digital Jump Bag for this functionality.
CrowdStrike’s VP of Global Digital Forensics & Incident Response nailed my last takeaway perfectly: “practice makes perfect”. You don’t want the first time your SOC team, Incident Responders and IT Operations run through an end-to-end incident to be the first time your organisation suffers one. This is why I created the Cohesity Clean Room solution so that it does not touch production systems until the very last step. Organisations are free to test the people, process and technology aspects of their end-to-end incident and crisis response plans without any impact on production systems. This not only builds muscle memory, it also allows for the continual improvement of processes and the identification of opportunities where automation could add the greatest improvements in effectiveness and efficiency. Taking a continual improvement approach allows organisations to start building their cyber resiliency right now, with pragmatic steps that fit their current level of maturity, culture and resource constraints, and then gradually improve the level of resilience within the organisation.