I’m giving a presentation on trends in cyber security operations at this year’s SIGS Conference in Zurich, Switzerland. To register, head here.
The Geneva SOC Forum was an interesting event, with some excellent presentations. The first was by Carine Allaz on her experiences of establishing and running a Security Operations Centre for a private bank in Switzerland. A lot of the lessons she learnt are common to many of the organisations we’ve seen that have undertaken the same journey. She mentioned that setting up a SOC had given her her first grey hair; I pointed out that getting involved in building over 90 for our customers had made mine go grey, then completely fall out…
The second speaker was Jonathan Sinclair. His presentation focused on demonstrating business value and the development of meaningful use cases – again spot-on with my experiences and something that businesses continually get wrong, limiting their return on investment in security operations.
This was followed by the SOC Jeopardy session we facilitated. In the session we asked the 70+ attendees, from SOCs all over the Geneva and Lucerne area, a selection of 11 questions around business-alignment, technology, people or process. These questions were a subset of the more than 250 used in our Practice’s Security Operations Maturity Assessment, which is used to construct build or improvement roadmaps for our customers.
The next step was to compare the room’s results with the average we see across the hundreds of assessments we’ve conducted across the globe; then to discuss the impact that the different maturity levels would have on the effectiveness and efficiency of their SOCs; and finally to discuss the constraints and challenges they may face in achieving the more mature levels.
On the whole, the level of maturity seen in the organisations in Switzerland was at least as high as, if not higher than, the average across the globe. Some really insightful questions came from the audience and the two other speakers were exceptional. It is a shame that non-commercial events like this, which bring together SOC managers and operational staff with their peers to discuss best practices, do not exist in most other countries.
I’ll be speaking at the 12th SOC Forum in Geneva, Switzerland in December and offering insights into key findings from our annual State of Security Operations report.
If you are a CISO for a Swiss company or work in a SOC based in Switzerland you can join the Security Operation Center Professional Special Interest Group and attend.
More details on the exact date to follow.
I know, I know, it’s hard to get the staff for a SOC and once you’ve got them you don’t want to lose them, but this can be one of your biggest mistakes and lost opportunities.
Bit of a spoiler here for the remaining two SOC mistakes: I’ll cover the acquisition of talent in a later post; here I’m going to mainly talk about career progression and retention.
Retention of staff within a SOC can be one of the most challenging aspects of security operations management – constant turnover means that critical institutional knowledge that takes time to build is walking out of the door; offers from competitors can continually drive your staffing costs up; and attempting to keep someone whose heart is no longer in the job within your SOC can be incredibly corrosive on the morale of other staff members.
The first question is: why do they want to leave? You should ensure that there is a way of obtaining free-text feedback of the reasons for leaving and there should be a formal exit interview, not only with HR but also with someone from the SOC management team. Understanding why staff are leaving is the first step in attempting to stem the tide.
Often salary and benefits aren’t the only reason people want to move on. Many SOCs we’ve assessed don’t leverage well-built use cases that optimise automation, instead relying on the analyst to perform repetitive manual tasks over and over again, which can be really demotivating. So can use case content that offers little opportunity for tuning, leaving the analyst seeing the same false positives again and again but worried about implementing suppression in case they miss something (see SOC Mistake #7: On Use Cases, You Model Your Defences, Not Your Attackers). Similarly, an Analyst may feel that their job has little impact on the organisation because the SOC Manager is focused on technology, not on how their function supports the business, and can’t illustrate the value of what their staff are doing (see SOC Mistake #8: You don’t speak the language of business, you speak the language of security). I’ve seen staff leave SOCs for better money elsewhere and then return to their original organisation because the new company has the problems I’ve just highlighted, but their previous employer does not.
The other major reason people move on is because they’ve grown out of their current position. This can be challenging because as you move up the SOC staffing ladder there are fewer and fewer positions available. If you have a SOC that is proving value to the business and has nailed the foundations of security operations – good log management and leveraging SIEM to detect the known-knowns – you’re ready to move onto more advanced concepts such as using Hunt Teams to drive analytics to find the unknown-unknowns. You’re ready to embrace not just tactical Indicators of Compromise as Threat Intelligence, but to build a team focused around adversary characterisation through the collection, analysis and dissemination of strategic, operational and tactical threat intelligence. Measurable closed-loop feedback mechanisms and workflows can now be created between your different teams so the Cyber Threat Intelligence function can inform the Hunt Teams and Content Authors of new adversaries and techniques; the Hunt Teams can drive automation of their discoveries into SIEM Use Cases to maximise efficiency; and new artefacts discovered by the SOC Analysts can be fed into the Hunt Team for historical analysis…
All of these teams need staff – Cyber Threat Intelligence; Content Authors; SIEM and Log Management Engineering; Hunt Team Data Scientists, Analysts and Analytics Platform Engineers; and the owners of the event sources. You can create formalised career progression plans along a number of tracks – technical, analytical and managerial – and then focus on developing measurable progress against these plans. Supporting job rotation not only helps build bridges between the different teams for the roles they are doing now, it also allows a junior member of staff to plan where their next role might sit within the organisation and then follow the appropriate career progression path.
The career progression should not be limited to roles within the SOC. I know many SOC Managers would like to keep their staff, but if they’re going to leave for a role outside of security operations, it would be best to keep them within the organisation. I’ve seen Analysts leave to become developers, or go into IT architecture or operations positions. How much easier would it be to build bridges between information security and the operational parts of the business if there were internal evangelists for the security function now working within those teams who intimately understand the threats the organisation faces and how attacks can, and do, manifest themselves?
So if you follow some of the guidance in my earlier posts and offer structured career progression paths, you may lose fewer staff, but you have to accept that people do move on. So how do you deal with the challenge of staffing your SOC? You grow, not buy, and this will be the topic of my next post.
Every year at RSA, InfoSec Europe, Black Hat and the plethora of other conferences, the halls are full of vendors pitching the latest security products that are the Silver Bullet to all the ills you’re currently suffering. All vendors make concerted efforts to part you from your finite infosec budgets by knowing your pain points and offering a seemingly one-stop-solution to at least one of your pain points – some vendors are even promising to eliminate all of them, cyber snake oil for the 21st century.
Slick demos, based on highly selective inputs, and well-crafted pitches all try to convince your organisation to part with its cash. I know, I was a part of this well-oiled machine for over a decade. It’s not like the infosec vendor community is bad; we all had a belief that we were making the world better – and to a certain degree we were – but the reality is that technology on its own is not going to solve anything. It’s an old maxim, but you need people, process and technology in order to be successful.
Last year saw the rise of Machine Learning and Analytics, positioned as delivering a panacea of end-to-end automation: trawling through your log files in an automated way and creating actionable alerts without human intervention. Those of us who’ve been using analytics in infosec for over a decade acknowledge that analytics can improve the automation of a cyber security operations centre, but it can’t automate everything. Besides, we’ve seen these kinds of claims before: in 2001, Security Information & Event Management (SIEM) products came to the market and were sold in the same way: aggregate all your logs into this one platform and magic will happen. Fifteen years on and the reality is that most organisations aren’t even doing SIEM right today.
As my ex-colleague Anton Chuvakin, now a senior analyst for Gartner in the SIEM space, points out, thinking that organisations who have failed to implement log management and SIEM well – and lord knows I can tell you most haven’t – can suddenly make something as complicated as big data analytics successful is naive.
The reality is that many organisations don’t have a shortage of security products, or even people. Some of the reasons that they are unable to deliver quality, value-for-money services are that they lack:
- Having a risk-based approach that identifies which risks, and the relative impact of each, the business needs to address to bring risk down to an acceptable level.
The concept of information, or cyber, “security” should really be replaced with that of “risk”. In this day and age, risk is a part of conducting business and C-level executives are experts in making decisions based on the balance of risk. The challenge is that many in information “security” live within a little technical bubble where the grey area of risk acceptance doesn’t gel well with their binary view that either something is “secure” or it isn’t. Many who’ve floated to the top in infosec started as firewall or Intrusion Detection System administrators, where either something is permitted to pass or it isn’t, is an attack or isn’t. The concept of accepting certain information risk, because the cost of mitigation exceeds the value mitigated, isn’t something that gels well with someone whose thinking is conditioned that way.
C-level executives don’t live in such a discrete world and are used to seeing quantitative analysis, or at least a qualitative SWAG estimate, of the risk and the various options for dealing with the risk, and that includes acceptance. I’ve seen many people in the information security industry bemoan that they’ve “failed” when the business does not act as they recommended, even though they have presented a plan that includes the inherent risk; the costs and plan for treating the risk with a control; and the residual risk remaining after the control has mitigated the risk. As long as they’ve done a good job in preparation, they’ve done their job.
At the same time the CISO is pitching for funds for the infosec budget, the Sales Manager will be presenting plans for expansion into a new territory, the CIO plans to move to a cloud service provider, or Facilities plans to upgrade the catering area – you’re just another Subject Matter Expert presenting plans for an Executive Board to weigh against other activities competing for the limited budgets and attention a business can allocate. In our narrow silo of information security, it’s sometimes hard to see the broader picture.
I remember one particular instance when I was making a pitch for a £500,000 investment that would mitigate around £1M in risk – a good investment you might think. I thought I’d done a very good job until my request was turned down. In speaking to the CFO later, it turns out that the same £500,000 could have funded 5 additional sales executives who were, on average, bringing in £200,000 in revenue each. With our high customer retention rate, that would result in many millions of operational cash flow from subscription revenue being available to the business without the need to dig into our runway from investors. I’d also produced quite a lot of empirical data – imagine if a CISO had come to the table without any form of quantitative, or even qualitative, data and had made that pitch versus a sales manager who can demonstrate the per capita average value of each salesperson?
Information security needs to articulate risk in the same manner, using the same language and the same metrics, as the other aspects of operational risk – including fraud, safety, disasters, market risk and competitive risk. We also need to demonstrate the value of the investments in terms of enabling new opportunities and sources of revenue – we need to position ourselves as a profit centre, rather than a cost centre.
Senior executives will make decisions based on a balance of risk in each of these areas. If they can’t comprehend the value proposition you’re presenting – or understand the technical terms you’re using – someone else’s budget is going to benefit from your inability to show the impact of what you’re proposing.
- Having properly integrated solutions, based on vendors’ products but with the appropriate skills and processes: people, process AND technology. Deploying a product isn’t the end of a journey, it’s the start of it. In fact, the preparations should start long before the courier turns up at the door with the product under their arm: planning the integration of the technology with incumbent platforms; ensuring the right skills exist within the business to support the technology; ensuring the right processes are in place to maintain the technology; and ensuring performance metrics are collected to prove its business value. The vendor has little or no context of your true business operations – they may support hundreds, thousands or even millions of customers in many different verticals.
The vendor may give you a normalised blueprint of what good looks like, but it’s the customer’s infosec team that is ultimately responsible for turning that into an architecture, integration plan and operational model that will protect the organisation not just at implementation time, but also on an ongoing basis as adversaries adapt and new vulnerabilities are found.
All too often vendors’ products are deployed and, when you go back and check how often the architecture or content has been updated, or even reviewed, the last time may actually have been when the professional services team from that vendor first deployed the solution.
- Having a framework for integrating the different types of risk each vendor’s product mitigates into an end-to-end framework. Those of us who have lived in the worlds of ISO/IEC 27002, NIST, COBIT or ISF are used to an integrated framework of controls that complement each other and integrate with one another. If a preventative control here fails to stop an attack, it can be detected by a detective control here and responded to using a responsive control there.
Even if an organisation is utilising a control framework, often the integration and feedback loops between the different controls are missing. For instance, if your Security Operations Centre analysts are detecting attacks at an internal reconnaissance stage, are they feeding back the control failures that allowed the exploit in in the first place?
Integrated frameworks aren’t just about a jigsaw of related controls; they’re about having the right processes and swimlanes around the controls to make them work properly in that integrated fashion.
- Having an integrated architecture to support the framework. A framework is more of a concept; the architecture is how the different vendors’ products interface together. The output of one vendor’s product may well be the input to the next: does this output need to be transformed in order for the consumer to use it? How do we secure the controls themselves? How do we collect metrics on performance and health? How do we update any signatures on the platform? Instead of adjusting an existing architecture piecemeal every time there is an opportunity to deploy a new blinky box, build an extensible architecture and a process for the evaluation and integration of any new technological components. Having standards around health monitoring, metrics, etc. allows you to evaluate the conformity of the vendor’s product to your standard architecture before money changes hands.
- Having a way to measure the vendor’s performance. Deploying a vendor’s product and “hoping for the best” doesn’t demonstrate value to the business. Understanding the value of the risk mitigation that the product brought should have been a part of your value proposition when making the case for the investment (see 1 above) – how do you measure and demonstrate this to the business? CISOs tend to have more credibility when they can say “Our inherent risk of X is £Y, if we invest in Z costing £A, the residual risk will be £B”, and can then come back to the business and say: “You invested £A in Z and we expected the risk to be £B, we actually managed to achieve £B – £100K”. You can’t always do this level of granular assessment on controls, but most people don’t even try, and it needs you to build a metrics program based around risk.
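The “£Y, £A, £B” conversation above can be sketched as a small record that keeps the promised and the measured residual risk side by side. This is only an illustrative sketch: the class, its field names and the figures are all made up for the example, and how you actually quantify risk is a much bigger topic.

```python
from dataclasses import dataclass


@dataclass
class ControlInvestment:
    """One funded control; every field name here is hypothetical, figures in GBP."""
    name: str                  # "Z" in the text
    inherent_risk: float       # "Y": risk before the control
    cost: float                # "A": cost of the control
    expected_residual: float   # "B": residual risk promised in the business case
    measured_residual: float   # residual risk actually observed afterwards

    def expected_benefit(self) -> float:
        """Risk reduction promised, net of the cost of the control."""
        return self.inherent_risk - self.expected_residual - self.cost

    def actual_benefit(self) -> float:
        """Risk reduction achieved, net of the cost of the control."""
        return self.inherent_risk - self.measured_residual - self.cost

    def beat_plan_by(self) -> float:
        """Positive when the control reduced risk further than promised."""
        return self.expected_residual - self.measured_residual


# Made-up figures: residual risk came in 100K below the promised level.
z = ControlInvestment("Z", inherent_risk=1_000_000, cost=250_000,
                      expected_residual=400_000, measured_residual=300_000)
print(f"{z.name}: promised net benefit £{z.expected_benefit():,.0f}, "
      f"achieved £{z.actual_benefit():,.0f}, "
      f"beat plan by £{z.beat_plan_by():,.0f}")
```

Keeping the promised figure next to the measured one is the point: it is what lets you go back to the board with evidence rather than hope.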
No-one in infosec should see a demo from a vendor without subjecting what they’re seeing to some form of critical analysis. How would we manage this product? How is it updated, both in terms of the platform itself and its detection capability? How does it integrate with other products we use? Does it need another UI to manage? Who is going to maintain the platform? How are alerts from the platform managed? What is the process for tuning the platform’s logic? What skills does this platform need to manage it? Does the platform introduce any form of privacy considerations?
In the meantime, we’ll all continue to enjoy lunch paid for by vendors, take the swag off the stalls, smile at the marketing drones and buy the machine that goes bing.
Building an effective and efficient Security Operations Centre can take a matter of years. Yes, you can build a foundational level of capability in several months (and it’s what companies used to pay me to do in my previous role), but it takes time for processes to be tuned and become muscle memory for the analysts; to tweak the detection logic to filter out false positives; and to optimise the triage and intrusion analysis playbooks, all while minimising the risk of a real attack sneaking in as a false negative. These activities all take time.
The transition from the foundational baseline, implemented during the project to stand up the initial capability, to the BAU capability is achieved through the implementation of repeatable and measurable processes, along with the metrics that provide the required telemetry for the SOC manager to make operational improvements: for instance “get analyst X who demonstrates competence to train analyst Y, who shows a weakness”; “retire use case A because the operational overhead of dealing with both real events and false positives exceeds the value of the mitigated risk to the business”; and “reinforce processes to analyst B, who isn’t following the event’s playbooks and isn’t demonstrating improvements in effectiveness or efficiency”. The reality is that most security operation centres aren’t even putting in the processes, let alone the measurement to make them better – they’re deploying technology, putting people in front of the technology’s UI and ending there. I’ll discuss metrics more in a later Mistake in this series.
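To make the “retire use case A” decision concrete, here is a rough sketch of the comparison. The function names, the cost model (analyst time only) and all of the numbers are assumptions for illustration; a real SOC would pull these from its own case management metrics.

```python
def annual_overhead(alerts_per_month: float,
                    minutes_per_alert: float,
                    analyst_cost_per_hour: float) -> float:
    """Yearly analyst cost of handling every alert (real events and false positives)."""
    hours_per_year = alerts_per_month * 12 * minutes_per_alert / 60
    return hours_per_year * analyst_cost_per_hour


def should_retire(overhead: float, mitigated_risk: float) -> bool:
    """Retire the use case when it costs more to run than the risk it mitigates."""
    return overhead > mitigated_risk


# Hypothetical use case A: noisy, and only mitigates a small risk.
overhead_a = annual_overhead(alerts_per_month=400, minutes_per_alert=15,
                             analyst_cost_per_hour=40)
# Hypothetical use case B: quiet, and mitigates a large risk.
overhead_b = annual_overhead(alerts_per_month=50, minutes_per_alert=10,
                             analyst_cost_per_hour=40)

print("A:", overhead_a, should_retire(overhead_a, mitigated_risk=30_000))
print("B:", overhead_b, should_retire(overhead_b, mitigated_risk=200_000))
```

Without per-use-case alert counts and handling times none of these inputs exist, which is why the measurement has to come before the improvement.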
Even with the right processes and measurement, it can take 2 – 3 years for a SOC to reach its optimum maturity level (depending on what the business has defined that as) – during this time there will be new threat actors, new vulnerabilities, new systems implemented within the organisation and new defences. All of these movements of the goalposts need to be taken into account – putting a stake in the ground on day one of the SOC build and saying we’re going to build this capability, using these event sources with these use cases over a 3 year period, is an honourable endeavour, but by the time you finish in 2019, you’ll have a SOC fit for 2016. This is one of the reasons that when I was at HP we encouraged customers to see their requirements as fluid and likely to change over time, whether this was due to discovering their baseline level of maturity was well off the mark after we started work, the nature of their infrastructure or business changing, or budgets getting cut – the reality is the end point is a moving target and the best way to get there is in small incremental steps.
I used to work for someone who’d come from an Audit background. She couldn’t see the requirement for many of our documents to be living, breathing working documents: for instance our Attack Vectors and associated Indicators of Compromise and Contextual Enrichment events – these would change on a daily basis based on the availability of new event sources, new vulnerabilities and to compensate for adaptive behaviour by threat actors creating new attack vectors. Revision history, role-based access restrictions and change control are important – but she wanted locked-down documentation, reviewed and changed on – at most – a quarterly basis, whereas I wanted a Wiki (which allowed approval of changes, rollback and complete revision history) allowing constant improvement as a part of BAU.
With this in mind, consideration needs to be given not just to the deployment of a SIEM and log management solution, along with recruiting and training some analysts, but also to the ongoing operation of the SOC and the continual investment in time, effort and money needed to keep the SOC on target.
This is a mistake we see a lot in Security Operations Centres that have SIEM Use Cases that have been built using a bottom-up approach. I discussed this in my post SOC Mistake #7: On Use Cases, You Model Your Defences, Not Your Attackers, where SIEM Use Cases are arrived at by looking at what event sources are easy to obtain or are available, rather than what is needed to maximise the efficiency and effectiveness of a SIEM’s capability to detect attacks against your key line-of-business infrastructure.
Often in Security Operations Centres that have been built this way, rules cannot be built in a granular enough way to provide the analyst with enough context to determine whether something is a false positive or a real attack without significant digging around. To deal with event volume the Security Operations Centre has to invest significant cost in hiring additional Level 1 Analysts to perform event triage. Another problem with bottom-up rules is that they can be extremely tricky to tune: they usually have simple correlations relying on two or three different log sources and simple logic, and often tuning them for one scenario may detune the detection capability of another.
In contrast a top-down approach should provide multiple opportunities to detect the attack along the attack chain, so if one component of the staged rule is causing mis-fires into the SIEM it is possible to tune anywhere along the staged rules comprising the attack chain. With this approach you can start to tune out false positives (alarms where there is no real event) without introducing excessive false negatives (missing a real event). The business impact and threat assessment you undertook as a part of your Use Case Workshop should drive what the tolerable level of false negatives is: you compare the operational cost of the additional staffing to handle the false positives you have to keep in because you can’t tune them out without introducing false negatives, against the likelihood and impact of the event if you miss it – of course you can’t make these kinds of judgements if you haven’t taken a top-down approach. I’ll talk more in a later post about collecting and analysing metrics that tell you that you should stop tuning and just can a Use Case and start again with a different approach to it.
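As a toy illustration of why a staged rule gives you more room to tune, the sketch below fires when an asset has been seen in at least two distinct stages of the attack chain, and lets you mute a noisy stage without losing the rule altogether. The stage names, the event format and the threshold are assumptions made up for this example; it is not how any particular SIEM implements staged rules.

```python
from collections import defaultdict

# Hypothetical, simplified attack chain used only for this sketch.
ATTACK_CHAIN = ["reconnaissance", "exploitation", "lateral_movement", "exfiltration"]


def staged_alerts(events, min_stages=2, muted_stages=frozenset()):
    """Return {asset: [stages seen, in chain order]} for assets seen in enough stages.

    events is an iterable of (asset, stage) pairs; muted_stages are stages
    that have been tuned out because they misfire too often.
    """
    seen = defaultdict(set)
    for asset, stage in events:
        if stage in ATTACK_CHAIN and stage not in muted_stages:
            seen[asset].add(stage)
    return {
        asset: [s for s in ATTACK_CHAIN if s in stages]
        for asset, stages in seen.items()
        if len(stages) >= min_stages
    }


events = [
    ("web01", "reconnaissance"),
    ("web01", "exploitation"),
    ("web01", "exfiltration"),
    ("hr-laptop", "reconnaissance"),   # a lone scan: not enough to alert
]

print(staged_alerts(events))
# Tuning out the noisy reconnaissance stage still leaves two chances to detect web01.
print(staged_alerts(events, muted_stages={"reconnaissance"}))
```

A bottom-up rule built on a single correlation has no equivalent of `muted_stages`: suppress the noisy part and the whole detection goes with it.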
Going back to the “Big Picture”, in these bottom-up Security Operations Centres you often see SIEM events in the triage console that resemble the raw events from the event sources, i.e. they don’t resemble ‘actionable intelligence’. In fact, in one multinational company that paid a significant amount for their SIEM infrastructure we saw the SIEM platform only receiving events from a single device type, from the same manufacturer – they might as well have just dumped the Analyst in front of the management console of the device and forgotten SIEM altogether. What constitutes ‘actionable intelligence’ will differ depending on which SIEM vendor’s marketing glossies you’re reading, but to me it is enough information for a Level 1 Analyst to conduct initial triage without having to use a large number of investigatory tools: to be able to triage false positives, determine the likely impact of the event on the organisation and determine the level of skill and possible motivations of the attacker.
A Use Case built using the top-down approach will provide this information. The process of building these kinds of Use Cases involves the modelling of vulnerabilities, threats and controls in the people, processes, applications, data, networks, compute and storage for each line-of-business. The resulting alert comes armed with the information about where in the attack chain the attack has been detected and all of the event information up to the point of detection (or beyond if the rule also triggered a higher level of proactive monitoring, such as full packet capture, logging of keystrokes or even redirecting the attacker to a tarpit to gather further information on their intent, tools, techniques and procedures). This information allows the analyst conducting the triage, at a glance, to make initial determinations around impact, capability and scope of the attack.
The SIEM platform, ideally, should provide integrated tools for further analysis such as the retrieval and visualisation of related historical logs to look for anomalies, correlations, affinity groups and context; as well as the ability to look up source IPs, packet captures or executables against threat intelligence sources, and beyond that to query the configuration management or identity management servers to understand the use of, and recent configuration changes to, the machines involved, as well as the rights of the users involved. In fact in HP ArcSight this data can be automatically brought in to enrich the event before it is even opened by the Level 1 Analyst to make them more operationally efficient.
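The idea of enriching an event before the analyst opens it can be sketched in a few lines. To be clear, this is not the ArcSight API or any vendor’s interface: the three lookup tables below are plain dictionaries standing in for a threat intelligence feed, a configuration management database and an identity management system, and every field name is made up for the example.

```python
def enrich_event(event, threat_intel, cmdb, identities):
    """Return a copy of the event with context attached for the triage analyst."""
    enriched = dict(event)
    enriched["source_reputation"] = threat_intel.get(event["source_ip"], "unknown")
    enriched["asset"] = cmdb.get(event["target_host"], {})
    enriched["user_rights"] = identities.get(event["user"], [])
    return enriched


# Stand-in data sources (all values are illustrative).
threat_intel = {"203.0.113.7": "known command-and-control"}
cmdb = {"fin-db01": {"owner": "Finance", "criticality": "high",
                     "last_change": "firewall rule added"}}
identities = {"asmith": ["finance-read", "db-admin"]}

raw_event = {"source_ip": "203.0.113.7", "target_host": "fin-db01", "user": "asmith"}
alert = enrich_event(raw_event, threat_intel, cmdb, identities)
print(alert)
```

With that context attached up front, the Level 1 Analyst can judge impact and scope at a glance instead of opening three other tools to find it.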
So what is the “Big Picture”? Well, the answer to that is understanding the Who? What? When? How? and, most difficult, Why? of the attack. Faced with a huge deluge of rule fires that require significant effort to investigate, and of which a large proportion end up being false positives which you “never seem to be able to tune out”, when something that looks like a real attack is found the Analyst will often run around with their hair on fire. Often they’ll escalate without answering these basic questions, and when a C-level exec has been got out of bed they’ll ask relevant questions that often the SOC can’t answer – Who? What? When? How? and Why?
Before an incident is declared, the function of a Security Operations Centre in a large organisation is to answer those questions – to prepare, to detect and to investigate. They should be able to prioritise the incidents to be dealt with by understanding the capability of the adversary, the impact of the incident and the scope of systems involved. This is the information they should be passing to the incident responders to allow them to contain and eradicate; finally, IT operations works with the SOC to eliminate the vulnerabilities or apply additional controls and then recover (increasing logging to detect if the machine is attacked again). During the whole process the Security Operations Centre should be working iteratively with the incident response team and IT operations.
Bad Use Cases that provide no context of the attack, bad integration of intrusion detection tools and a lack of knowledge of the context of systems and users, coupled with a lack of analytical skills in Analysts, result in a focus on the individual events, not the scope and impact of a potential incident or breach.
One story we frequently tell is of a SOC we knew of that hadn’t reached out to the IT Operations department to win their buy-in for obtaining logs. Due to the adversarial relationship between IT Ops and the infosec department, the infosec team relied on the logs that they could obtain easily, i.e. the ones from the systems they had ownership of – namely intrusion detection, firewall and anti-virus. Now everyone who works in information security reading this blog knows just how effective these technologies are in 2015, so nothing was triggering the correlations on the SIEM platform (the customer had also just deployed the default content from the vendor, not tuning it to their available resources). Funnily enough, the SIEM didn’t detect a large, very public breach that the customer suffered.
Questions were asked by the CEO about why he wasn’t notified and then why the SIEM product they’d spent so much money on had “failed”. At least as a result of the incident the information security team got carte blanche access to whatever logs they wanted – great, right? Well, no. The small SOC then on-boarded every single log source they could lay their hands on using a bottom-up approach. The result was chaos – masses of events bleeding into the console providing no answers to the contextual questions and, in an overcompensation for not notifying the CEO of the original incident, the SOC team called him out-of-hours over a dozen times in one month over incidents that they had panicked over as they hadn’t been able to truly understand what was happening.
Use Cases – this is simply the most misunderstood subject around both security operations and Security Information & Event Management (SIEM).
SIEM is one of the most mis-sold and mis-bought items of information security technology. The most common type of Request For Information we get from customers lists the different types of event sources and the Events Per Second (EPS) for each event source. The customer often believes that a SIEM is magical, that it understands everything about their business and their threats. All they need to do is merely pump their raw events into it and actionable security intelligence pops out of the other end.
This means that procurement of SIEM is largely down to whether the SIEM supports the event source and whether the architecture supports the event volume. This technical bias when purchasing a SIEM means that the Request For Information rarely looks at the operational processes involved in actually using a SIEM, and therefore at the effectiveness and efficiency of the SIEM.
This approach ends up aligning with the defensive infrastructure of the organisation, rather than trying to detect the behaviour of an attacker. The SIEM will typically provide you details of a correlated event or anomaly which is the starting point, not the end point, of intrusion analysis – if your SOC staff are simply escalating alerts to the CERT, they are simply acting as ‘click monkeys’ and not providing any value to the intrusion detection/investigation/response capability. The more accurate the modelling of the method of attack, and the more context the SIEM has around the users, assets and known attacks and attacker behaviour, the more the technology is doing the heavy lifting.
Ideally what you are looking to achieve is something along the lines of:
- Level 1: elimination of false positives; prioritisation based on the criticality or sensitivity of the assets involved in the case; initial investigation of the incident to answer the When? Who? How? What? Why? questions before escalation (if required) to the next level for more specialist investigation.
- Level 2: more detailed investigation of the incident, which may involve escalation to the Incident Response Team (IRT). This then becomes an interactive process: the Level 2 Analyst provides details of compromised systems and suspicious users to the Incident Response Team, who perform more detailed forensic analysis. That analysis may yield one or more additional Indicators-of-Compromise (IoC), which the Incident Response Team passes back to the Level 2 Analyst to search the log management platform and SIEM for additional systems or users involved in the incident, which can in turn be passed back to the Incident Response Team. The aim is to allow the SOC to specialise in the technical skills required for log analysis and SIEM, and the Incident Response Team in network and host forensics.
The whole point of a SIEM Use Case is to drive the detection of the attack through correlation or anomaly detection rules in the SIEM platform, to detect incidents through post-event analytics and to drive the intrusion analysis process to answer the core questions:
- When? When did this attack start? Is it a stand-alone action, or is it part of a wider campaign or a stage in a broader attack? Is it still ongoing? Is it better to escalate immediately and kick off the containment and eradication, or to observe with your finger over the trigger to gain more information around intent and capabilities? With actors that would typically be classified as Advanced Persistent Threats (APTs), rather like the boy with his finger in the dyke, responding to the symptom and closing off one vulnerability will simply mean they’ll try again and come back through another avenue. Gaining a better understanding of the capabilities, tools and ultimate intent of the attacker will help you ensure that your preventative controls are optimised to prevent future attacks from that threat actor.
- Who? Attack attribution is always challenging. Just because you can geo-locate an attacker’s IP address doesn’t mean that’s the source of the attack – they can be pivoting through other compromised hosts and networks, or they could be utilising some form of anonymising network, such as The Onion Router (TOR). Was this an external party, an insider or someone from your supply chain? If an internal party is involved, it is sometimes necessary to increase the priority of the incident, because internal knowledge of assets and provisioned access to internal resources greatly increase the potential speed of attack execution and the likelihood of success. Threat Intelligence can be used to profile the Tools, Techniques & Procedures (TTPs) that specific groups use and this can be compared to what you are seeing – again, the decision to continue to observe, or immediately escalate, can be driven by the potential impact on the business of continuing to allow the threat actor access to your resources, compared with the benefit of gaining greater accuracy in attribution and confidence in understanding their capabilities.
- How? Intrusion analysis should allow you to build a timeline of the attack. For instance, which users (if any) were involved? Which systems were compromised? What data was accessed? Again, this timeline can be compared with Threat Intelligence to identify a potential modus operandi that can be used to attribute the attack to a particular group, which will give additional insight into intent and motivation.
- What? What assets have been compromised? What data has been accessed?
- Why? This is ultimately the most difficult question to answer. What was the ultimate intent of the attacker? As I’ve mentioned previously, if you can utilise known TTPs to attribute the attack to a specific group, you can look at what the outcome and intent of other attacks by that group was and use this to inform why they would be attacking your organisation. Otherwise you may have to guess based on what you’re seeing in the attack, such as the nature of the assets they are accessing. The decision to continue to observe rather than immediately contain/eradicate should be taken based on the potential risk, but the use of tools such as honeypots and tarpits can allow your analysts to observe behaviour without risk to live production systems and data.
Even those organisations that do drive their Use Cases from attempting to model an attack often have a tendency to gravitate towards using logs such as firewall, anti-virus and Intrusion Detection/Prevention Systems (IDS/IPS). Why is this? Well, that’s because these logs are typically from devices that are owned and configured by the Information Security Engineering team, rather than IT Operations. In many organisations the relationship between IT Operations, whose main focus is on continuity and performance of services, and Information Security, whose focus is on the security of data, can be challenging.
Logs from firewall, anti-virus and IDS/IPS systems are becoming increasingly less relevant in modern attacks. Attackers today will attempt to make their initial exploitation attempts, command & control and exfiltration look as much like normal day-to-day traffic as possible – the traditional ACCEPT/DENY rules of a firewall based on IP address and port of source and destination just don’t cut the mustard anymore. Even ‘next generation’ firewalls cannot provide the level of visibility needed into packets encrypted over Secure Sockets Layer (SSL), for instance. Just like next generation firewalls, IDS/IPS cannot look for signatures of known attacks on networks where the traffic is encrypted by technologies such as SSL without the break-out of the encryption. Finally, traditional anti-virus solutions have been notorious for not stopping modern attacks. It’s fairly simple to find the vendor names listed in the technical skills of current and ex-employees of your organisation on LinkedIn. Then it’s just a case of finding an exploit that the vendor’s product can’t detect using a resource like VirusTotal and you’ve got your in.
This is by far one of the most common failings of Security Operations. I’ve reviewed the maturity of several large global Security Operations Centres and they appear to be doing a reasonable job of the prediction, detection and investigation of information security incidents – but none of this is visible to the rest of the organisation who funds their operational budgets.
It is common to find someone who has started life as an operational information security person, maybe originally a firewall or Intrusion Detection System administrator, whose career ultimately takes them to SOC Manager. Their life has been steeped in the operational reports produced by technical controls such as firewalls, Intrusion Detection Systems and anti-virus solutions. These reports are meaningful to them, although I’d argue about the contextual value you can gain out of an individual control’s report, but count-based metrics such as ‘37 unauthorised access attempts across the business’ or ‘300,000 blocked spam emails’ are pretty meaningless to senior management.
I’m reminded of Monty Python’s Spanish Inquisition Scene set in Jarrow in 1911 where Graham Chapman enters and says to the mill owner: “One on’t cross beams gone owt askew on treddle” to which the mill owner, both unaccustomed to the regional dialect and technical jargon says “Pardon?“.
Graham Chapman’s character is looking to the mill owner for support and direction, but he’s presenting the problem in the operational language he understands. If he’d said “A vital piece of the manufacturing equipment in our rail sleeper production has become mis-aligned halting production” instead of “I didn’t expect some kind of Spanish Inquisition” the problem would have got sorted and Cardinals Ximinez, Fang and Biggles would have never appeared.
It is the job of the Security Operations Centre, just like the rest of the information security function, to present meaningful decision support management information around information risk to the management and it’s the management’s responsibility to make decisions on risk based on it. If we never provide information in a language or format they can utilise, we’re always going to be seen as these strange people who live in the basement and occasionally come into the boardroom and start speaking something that sounds like Klingon to the C-level execs.
The other issue is that information security risk isn’t the only risk that companies need to consider, even if we do try to treat ourselves like a special little snowflake. An organisation’s risk function has to balance liquidity risk, currency risk, supply chain risk, asset risk, competition risk, pricing risk and capital availability, to name but a few. We’re often so wrapped up in our own little worlds that are so important to what we do, and we vent when decisions don’t go “our way”, forgetting that the C-level suite are running a company whose main business probably isn’t information security.
A classic example I can illustrate happened to me. I came into a budget meeting armed with a risk assessment and a budgeted control suite – risk to the business was around 800K, cost of controls was about 200K and residual risk would have been around 200K. Job done, the average quantitative risk wonk would say (it was a much more in-depth risk analysis than I am demonstrating here for the sake of brevity), but the issue was that if that 200K was invested in three more sales people they’d bring in much more than 800K in revenue to the organisation, which at the time was exhibiting a 98% customer retention rate and whose future growth and operational costs were funded out of customer subscriptions. When you took into account the balance of information security risk vs. opportunity risk, my project was a bad call.
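The arithmetic behind that call can be sketched in a few lines of Python. This is only an illustration using the rounded figures quoted above; the variable and function names are made up for the example, the 800K revenue figure is treated as a floor (“much more than 800K”), and setting revenue directly against risk reduction is a deliberate simplification.

```python
# Rounded figures from the example above, in thousands (currency unspecified).
INHERENT_RISK = 800        # estimated risk to the business before controls
CONTROL_COST = 200         # cost of the proposed control suite
RESIDUAL_RISK = 200        # estimated risk remaining once controls are in place
SALES_REVENUE_FLOOR = 800  # the same spend on sales staff returns more than this


def net_security_benefit(inherent, residual, cost):
    """Risk reduction bought by the controls, minus what they cost."""
    return (inherent - residual) - cost


def net_sales_benefit(revenue, cost):
    """Revenue from the alternative investment, minus the same spend."""
    return revenue - cost


security = net_security_benefit(INHERENT_RISK, RESIDUAL_RISK, CONTROL_COST)
sales = net_sales_benefit(SALES_REVENUE_FLOOR, CONTROL_COST)

print(f"Net benefit of the control suite: {security}K")
print(f"Net benefit of the sales hires:   at least {sales}K")
```

Even on the most conservative reading of the sales figure, the alternative use of the money comes out ahead, which is the opportunity-risk point the board was making.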
So the presentation of risk in a language the business understands and that allows a normalised comparison with other forms of risk, if you operate an Enterprise Risk Management framework, is one of the key success criteria for good security operations.
So what does good management information look like? Well, financial metrics are a good start.
Everyone in the C-suite understands pounds, shillings and pence (excuse my pre-metric example, dollars and cents to my US friends) – it’s a good place to start. Creating financial metrics has long been a difficult proposition, but there are several ways to do it. Myself, I tend to map my SIEM event categories onto the VERIS framework. This then lets me use the average costs and time-to-resolve metrics from the Verizon Data Breach Investigations Report, which I still consider to be one of the best yardsticks of what is going on in the wider world, to show my organisation’s performance against the average.
The other is that it must have context. Providing count-based metrics for the whole organisation doesn’t impart any information about what line-of-business assets are involved and what the potential bottom-line impact is to the business. “37 unauthorised access attempts across Acme Corp” says one thing; “34 on cardholder processing systems”, “2 on bank transfer systems”, “1 on the customer relationship management system”, all of which are buried deep inside your infrastructure behind several layers of controls, says quite another. I’m going to talk more about this in a later blog posting, so I am going to park this here for a while.
Another aspect of the context is granularity, and this normally requires input from the analysts and incident responders and some form of established taxonomy for the more granular categorisation of incidents. For instance, saying you’ve blocked “34 malware infections” says one thing; saying “24 malware infections were stopped at the host level and were detected by the Intrusion Detection System”, and that out of the other 10, “5 exhibited DNS behaviour showing they attempted to connect over port 443 to external systems” and the other “5 encrypted the hard disks of systems in our payroll department just before payday”, says quite another.
It’s not just about the granularity, it’s also about the curation: helping the execs understand what the impact to the business is and giving advice on what they could do about it; who the likely perpetrator is, based on the tools, techniques and procedures they are using, or at least providing an indicator of what their capability is; understanding when this started, whether it is part of a campaign or a single attack, and whether it is still ongoing; how this occurred, what vulnerabilities the attacker exploited and how it could be prevented from happening again; and the most difficult, and often most important, question: why did this attacker attack us? What were they after?
Having management information that allows information security risk to have a seat at the boardroom table with the rest of the functions that handle risk is a starting point. Providing context to the C-level execs, enabling them to make informed risk decisions, helps move security operations from a reactive function to one that is proactive. When this is coupled with the topics I’m going to talk about in my next couple of postings, providing line-of-business metrics and using threat intelligence, we’re moving from a Jarrow accent to a Received Pronunciation one – although there’s now’t wrong with a Yorkshire accent as my fiancée is from Hull.
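As a rough illustration of the difference, the sketch below turns a flat count into a per-system breakdown. Everything in it is hypothetical: the asset names, the asset-to-system mapping and the event records are invented purely to reproduce the 34/2/1 split from the example, and in practice the mapping would come from whatever asset inventory you maintain.

```python
from collections import Counter

# Hypothetical mapping from an asset to the line-of-business system it supports.
ASSET_TO_SYSTEM = {
    "pay-db-01": "cardholder processing",
    "pay-app-02": "cardholder processing",
    "transfer-gw-01": "bank transfer",
    "crm-web-01": "customer relationship management",
}


def breakdown_by_system(events, asset_map):
    """Count events per business system; unmapped assets fall under 'unknown'."""
    return Counter(asset_map.get(event["asset"], "unknown") for event in events)


# Invented events that reproduce the 34/2/1 split quoted in the text.
events = (
    [{"asset": "pay-db-01"}] * 20
    + [{"asset": "pay-app-02"}] * 14
    + [{"asset": "transfer-gw-01"}] * 2
    + [{"asset": "crm-web-01"}] * 1
)

report = breakdown_by_system(events, ASSET_TO_SYSTEM)
print(f"{sum(report.values())} unauthorised access attempts in total")
for system, count in report.most_common():
    print(f"  {count} on {system} systems")
```

The total is the same number either way; the breakdown is what tells management which part of the business is carrying the exposure.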
Security Information and Event Management (SIEM) platforms are all about turning the mass of raw events that occur in your organisation’s infrastructure into intelligence that can be assessed by analysts and incident responders to identify and react to information security incidents.
SIEMs, despite what vendors will tell you, are not magic. It will take you months to tune your ruleset to eliminate the bulk of false positives and you’re probably working against a moving target of an increasing number of event sources as well as continually having to adjust the rules to detect the new threats you’re facing.
To ensure the maximum use of your highly-skilled trained analysts, it is common to tier your analysts into at least two layers.
The initial layer is solely responsible (at least to start with) for the triage of incoming events: that is, the identification of false positives and ensuring the appropriate prioritisation and escalation.
In an effective SOC, however, these Level 1 analysts are not simply “click-monkeys”. As well as triaging false positives, they should be doing some form of initial assessment so they can evaluate the potential impact and scope of the incident. They should also be performing some form of adversary characterisation by evaluating where in the attack chain the event was detected (detection further down the chain, such as at the command and control or lateral movement stage, may imply that the adversary has conducted significant reconnaissance and crafted a specific exploit to be undetectable to your host or network Intrusion Detection/Prevention System – this implies a motivated and fairly skilled adversary). From their initial investigation they should also be ascertaining the potential impact to the business.
Often the SIEM will have some form of prioritisation algorithm based on a number of factors, but only a human analyst can take all of the context into consideration. What is the skill level of the attacker? Does the attacker exhibit known behaviour in their Tools, Techniques and Procedures (TTP) that can assist with attribution? What is the apparent intent of the attacker (disruption, theft, espionage)? Is this a one-off event or part of a sustained campaign? Does the attack demonstrate investment of a lot of time or funds (use of zero days, for instance)? What systems are affected and what line-of-business do they support?
Only events that the Level 1 analyst deems to be real events are escalated to the next level of more skilled analysts for a deeper level of investigation. You can create specialisations at the Level 2, or above, layers to allow workflows to be created that direct events of a certain category to specific analysts, or groups of analysts. Some organisations have as many as three or four tiers of analysts, gradually becoming more skilled and specialised as you move up the chain.
Any false positives discovered by the analysts can be routed to content authors who can further tune the SIEM rules to try and prevent the false positive from occurring in the future.
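The routing described above can be sketched as a small function. This is a generic illustration rather than any particular SIEM’s workflow engine; the queue names, event categories and the `Event` fields are all assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FALSE_POSITIVE = "false_positive"
    REAL = "real"


@dataclass
class Event:
    event_id: str
    category: str     # e.g. "malware" or "unauthorised_access" (hypothetical names)
    verdict: Verdict  # set by the Level 1 analyst during triage


# Hypothetical Level 2 specialisations, keyed by event category.
LEVEL2_QUEUES = {
    "malware": "l2-malware",
    "unauthorised_access": "l2-access",
}
DEFAULT_L2_QUEUE = "l2-general"
TUNING_QUEUE = "content-authors"


def route(event: Event) -> str:
    """Return the queue an event is sent to once Level 1 triage is complete."""
    if event.verdict is Verdict.FALSE_POSITIVE:
        # False positives go to the content authors so the rule can be tuned.
        return TUNING_QUEUE
    # Real events go to the matching Level 2 specialists, if there are any.
    return LEVEL2_QUEUES.get(event.category, DEFAULT_L2_QUEUE)


print(route(Event("e-1", "malware", Verdict.REAL)))
print(route(Event("e-2", "malware", Verdict.FALSE_POSITIVE)))
print(route(Event("e-3", "phishing", Verdict.REAL)))
```

Keeping the decision in one place like this is also what makes it straightforward to count what went where later on.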
The focus should be on making this process as efficient and repeatable as possible, while allowing the collection of metrics to support continual improvement. For instance, in HP ArcSight, we create ActiveLists for a ‘triage channel’ and for ‘content needs tuning’. As we’re largely automating this workflow we can collect metrics on operational Key Performance Indicators such as time-to-triage, time-to-investigate, number of false positives per use case category, number of events escalated per analyst and number of incorrectly categorised false positives per analyst. These metrics, when combined, can help you achieve the right balance of efficiency and effectiveness.
In my practice we’ve evaluated hundreds of Security Operations Centres where all of the analysts are highly trained and all operate at a single tier. They all randomly pick the events they wish to work on off the console and do their typical ‘deep dive’ investigation. This causes several problems:
- It’s hard to maintain both the broad spectrum of investigatory skills needed for triage of all event types and the deep level of specialisation needed to do a full investigation;
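To make a few of those indicators concrete, here is a minimal sketch that computes them from a list of triage records. It does not use any ArcSight API; the record layout, analyst names and timestamps are invented for the illustration, and a real implementation would read these from whatever your workflow tooling exports.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical triage records, one per event handled at Level 1.
records = [
    {"analyst": "alice", "category": "malware", "false_positive": True,
     "raised": datetime(2020, 1, 6, 9, 0), "triaged": datetime(2020, 1, 6, 9, 10)},
    {"analyst": "alice", "category": "malware", "false_positive": False,
     "raised": datetime(2020, 1, 6, 9, 30), "triaged": datetime(2020, 1, 6, 9, 50)},
    {"analyst": "bob", "category": "unauthorised_access", "false_positive": True,
     "raised": datetime(2020, 1, 6, 10, 0), "triaged": datetime(2020, 1, 6, 10, 30)},
    {"analyst": "bob", "category": "malware", "false_positive": True,
     "raised": datetime(2020, 1, 6, 11, 0), "triaged": datetime(2020, 1, 6, 11, 20)},
]


def mean_time_to_triage_minutes(rows):
    """Average minutes between an event being raised and being triaged."""
    return mean((r["triaged"] - r["raised"]).total_seconds() / 60 for r in rows)


def false_positives_per_category(rows):
    """Number of false positives, grouped by use case category."""
    return Counter(r["category"] for r in rows if r["false_positive"])


def escalations_per_analyst(rows):
    """Number of events each analyst judged real and escalated."""
    return Counter(r["analyst"] for r in rows if not r["false_positive"])


print(mean_time_to_triage_minutes(records))
print(false_positives_per_category(records))
print(escalations_per_analyst(records))
```

A category with a persistently high false positive count is a candidate for rule tuning, while a long time-to-triage on one category may point to a skills or staffing gap.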
- The analyst may prefer to investigate specific categories of events, meaning that some event types may remain in the triage channel for extended periods of time;
- Having your highly-skilled analysts conduct the initial triage of false-positives is a bad use of their time; and
- Often Security Operations Centres find it really difficult to produce meaningful metrics on the overall performance of the capability, or individual analysts.
Implementing at least a two-tier system of triage/prioritisation and investigation can dramatically increase the performance of your Security Operations Centre.