Security Vision
In the previous article we covered a methodology for assessing the quality of the investigation and response process. In a nutshell: we discussed how to build a system of metrics from statistics on SOC analysts' actions that allows security operations processes and procedures to be analysed. Applying the metrics system allows us, on the one hand, to optimise analysts' work: eliminate unnecessary actions (or make them optional), automate repetitive ones, develop workarounds for actions that take too long, and add to playbooks actions that are missing from them but in practice are always performed. On the other hand, the metrics system improves detection quality by collecting statistics on the conditions under which a correlation rule produces false positives time after time. And finally, a metrics system can help analyse the utilisation of protection tools, as well as the cost-effectiveness of subscriptions to analytics services and feeds. How? Let's look at some real-world examples.
Optimising response plans by analysing incident action statistics
Response plan analysis should start by checking whether it achieves its goal: full analysis of the incident landscape. Why start with this? Because all security management strives first and foremost to maximise the completeness and relevance of context: the cleaner and more correct the data, the more accurate the decision. So to build a good investigation process, we need to understand:
- Whether we analyse all the artefacts collected in the incident thoroughly enough.
- Whether every object associated with the security event has been evaluated and defused.
- Whether each of the associated hosts, accounts and processes has a verdict: malicious or clean.
- Whether, for a malicious verdict, the compromise has been neutralised or removed where possible.
The next step is to analyse the playbook in more detail. Typically, the effectiveness of the investigation process is evaluated at a fairly high level, based on criticality and SLAs: duty shift supervisors look at the number of false positives and check playbook execution times against the criticality of the incident. This takes into account neither the incident type nor the time to complete individual milestones.
But let's think about it: in some cases, speed of containment matters most when repelling an attack (for example, in a malware investigation); in others, the enrichment process needs to be thorough so that more information is gathered before a final decision is made (for example, policy violations and locking of internal accounts require care, so as not to hastily lock out a legitimate, and perhaps resentful, manager). For more detailed metrics and assessments, we want to break the analytics down into investigation phases and estimate the time to complete each phase for different types of incidents.
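The phase-timing breakdown described above can be sketched in a few lines. The record layout below (fields `incident_type`, `phase`, `minutes`) is illustrative, not a real SOAR schema; a minimal sketch, assuming each analyst action is logged with its incident type, investigation phase and duration:

```python
from collections import defaultdict
from statistics import median

# Hypothetical action log: in practice these records would come from
# the SOAR/IRP platform's audit trail. Field names are assumptions.
actions = [
    {"incident_type": "malware", "phase": "containment", "minutes": 12},
    {"incident_type": "malware", "phase": "containment", "minutes": 18},
    {"incident_type": "malware", "phase": "enrichment", "minutes": 45},
    {"incident_type": "policy_violation", "phase": "enrichment", "minutes": 90},
    {"incident_type": "policy_violation", "phase": "enrichment", "minutes": 70},
]

def phase_timings(records):
    """Group execution times by (incident type, phase), report the median."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["incident_type"], r["phase"])].append(r["minutes"])
    return {key: median(times) for key, times in sorted(buckets.items())}

print(phase_timings(actions))
```

Grouping by incident type is the key design choice: the same phase (e.g. enrichment) can be fast for one incident class and deliberately slow for another, so a single global SLA hides exactly the differences we want to see.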

Figure 1: Distributions of timing metrics across phases
If we decompose playbook analysis down to collecting statistics on phase execution times, followed by analysis of the final actions, we can optimise the following:
- Simple actions repeated time after time can be handed over to automation (with an important advantage: they can be automated safely, because we know the typical conditions of their execution);
- Actions that do not produce results should be excluded from the response plan (e.g. actions that do not enrich the context, do not produce new objects and do not confirm maliciousness in a TP incident);
- Actions that take too long should be replaced by other actions or moved to other phases (which allow for longer execution times).
If we have retrospective analytics on recurring missteps and bottlenecks, we can start improving the process.
But the story of playbook analysis doesn't end there. We have looked at how thoroughly we analyse all the artefacts, and at how effectively the actions are aligned to achieve the outcome. Now let's look at whether all the playbook actions are actually being performed by the SOC analyst. It is extremely useful to compare the actions the analyst actually performed against the ones the playbook proposed.

The easiest way to explain these metrics is through set intersection. Imagine two sets: R, the recommended actions, and P, the performed actions; their intersection R ∩ P contains the actions that were both recommended and performed.
From these sets we can build two metrics. The first is the accuracy of the playbook: accuracy = |R ∩ P| / |R|. For example, the playbook suggests 10 actions, of which 3 were actually performed, so accuracy = 3/10 = 0.3. That ratio is not high enough. If, time after time, the response plan falls short on accuracy and the analyst does not execute the suggested steps, those steps should be removed or made optional.
The next metric for playbook execution is completeness: completeness = |R ∩ P| / |P|. For example, 3 of the 10 suggested steps were completed plus 2 more of the analyst's own, so completeness = 3/5 = 0.6. If, time after time in this type of incident, the analyst takes two additional actions on top of the agreed response plan, the playbook should be expanded with them.
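The two metrics map directly onto Python set operations. A minimal sketch using the article's own numbers (10 recommended steps, 3 of them performed, plus 2 of the analyst's own; the step names are invented for illustration):

```python
def playbook_accuracy(recommended, performed):
    """Share of recommended actions that were actually performed: |R ∩ P| / |R|."""
    return len(recommended & performed) / len(recommended)

def playbook_completeness(recommended, performed):
    """Share of performed actions that the playbook recommended: |R ∩ P| / |P|."""
    return len(recommended & performed) / len(performed)

recommended = {f"step_{i}" for i in range(1, 11)}                 # 10 suggested actions
performed = {"step_1", "step_2", "step_3", "mass_check", "extra_lookup"}  # 3 + 2 own

print(playbook_accuracy(recommended, performed))      # 3/10 = 0.3
print(playbook_completeness(recommended, performed))  # 3/5 = 0.6
```

Run over all closed incidents of a given type, persistently low accuracy points at steps to demote to optional, while persistently low completeness points at analyst habits the playbook should absorb.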
The success of the actions themselves is also an important criterion, because a wasted action will only waste valuable time and possibly money (in the case of paid subscriptions) without having any effect. A successful action is one that:
- gave more data on existing objects,
- established the status of the object,
- neutralised the compromise on the object,
- brought in new objects, expanding the attack landscape.
Accordingly, unsuccessful actions should be removed, as we noted above when discussing the response plan metrics.
Let us now put the metrics into practice by analysing them through the lens of the ‘Malware Infection’ playbook.

Figure 2: An example of analysing the ‘Malware Infection’ playbook
So, let's collect all the true-positive incidents of this type and analyse the statistics to see whether SOC analysts follow the recommendations of the current playbook. In the containment phase, we see that the actions ‘Check in sandbox’ and ‘Enrichment in analytics services’ are either not executed or take a very long time. In the context of first response to malware, such time costs are unacceptable (especially if we try to analyse first and perform containment afterwards). A reasonable conclusion, therefore, is that these actions should either be eliminated or moved to the next phase, analysis.
Further, retrospective data shows that quite often in this phase SOC analysts perform an additional operation called ‘Mass Check’. The systematic repetition of an action that is not present at this step for this type of incident indicates that the completeness of the recorded playbook is insufficient, and a security officer with a lower grade and level of competence may skip this action. So the playbook needs to be expanded.
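Detecting such systematically repeated extra actions can be automated. A minimal sketch, assuming the action history per incident is available from the IRP platform (the playbook steps, action names and the 50% threshold are illustrative assumptions):

```python
from collections import Counter

def extra_action_candidates(incidents, playbook_steps, threshold=0.5):
    """Return actions outside the playbook that analysts performed in at
    least `threshold` of incidents of this type: candidates for inclusion."""
    extra = Counter()
    for performed in incidents:
        for action in set(performed) - playbook_steps:
            extra[action] += 1
    n = len(incidents)
    return [a for a, c in extra.items() if c / n >= threshold]

# Hypothetical 'Malware Infection' playbook and four incident histories.
playbook = {"isolate_host", "check_sandbox", "enrich_ti"}
history = [
    ["isolate_host", "mass_check"],
    ["isolate_host", "check_sandbox", "mass_check"],
    ["isolate_host", "one_off_lookup"],
    ["isolate_host", "mass_check"],
]
print(extra_action_candidates(history, playbook))  # ['mass_check']
```

The frequency threshold separates systematic habits worth codifying (‘Mass Check’ in 3 of 4 incidents) from one-off improvisations that should not bloat the playbook.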
Let's discuss another application of statistical analysis of response process execution. In today's world it is very important to account for the high volatility of attack methods and techniques, because the same virus can be heavily modified to bypass defences and avoid detection. Most often attackers use familiar tools and utilities but change the methods of propagation, persistence and the processes they spawn: if yesterday a piece of malware or a group used the reverse tunnelling technique, today the WSO web shell becomes the indicator of presence. Threat hunting, behavioural indicators and neighbourhood analysis via retrospective search can help us track this rapidly changing landscape. For example, we can look at the history of queries for incidents like a Bitrix compromise and see that it is most often paired with IDS alerts (security events) for attempts to exploit a WordPress vulnerability; this retrospective analysis stage can therefore not only be added to the playbook but also automated, since the pattern is typical.
Optimisation of correlation rules by handling false positive conditions
Another important block of information can be obtained by analysing purely false-positive (FP) incidents. If you analyse all incidents triggered by the same correlation rule, you can find the conditions under which the rule produces false positives. These conditions can either be added to the correlation rule or excluded from it, thus adjusting the base of decision rules and reducing the number of false positives.
Figure 3: Parameters of events on which the correlation rule produces false positives
The second option for analysing the result: instead of adding events to exceptions by refining the rules, we can measure the proximity of processed events by searching for events semantically close to incidents already labelled as FP. If a triggered event is close to some FP, then most likely it is also an FP. In this way we can solve the problem of endlessly inflating exception lists for correlation rules.
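The proximity idea can be illustrated with the simplest possible similarity measure. A minimal sketch using Jaccard similarity over event attributes; the attribute encoding, the `rule:`/`src:`/`user:` naming and the 0.6 threshold are all assumptions for illustration (a production system would likely use richer feature vectors):

```python
def jaccard(a, b):
    """Jaccard similarity of two attribute sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def looks_like_fp(event_attrs, known_fp_attrs, threshold=0.8):
    """Flag a triggered event as a likely false positive if it is close
    to any incident already labelled FP."""
    return any(jaccard(event_attrs, fp) >= threshold for fp in known_fp_attrs)

# Hypothetical attribute sets of incidents previously closed as FP.
known_fps = [
    {"rule:brute_force", "src:10.0.0.5", "user:svc_backup", "proto:smb"},
    {"rule:brute_force", "src:10.0.0.7", "user:svc_scan", "proto:smb"},
]
new_event = {"rule:brute_force", "src:10.0.0.5", "user:svc_backup", "proto:rdp"}
print(looks_like_fp(new_event, known_fps, threshold=0.6))  # True
```

Because similarity is computed on the fly against labelled history, no explicit exception entry has to be written for each benign variant, which is exactly how the endlessly growing exception lists are avoided.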
Evaluating the effectiveness of applied defences, enrichment tools and analytics
Another set of conclusions and decisions can be obtained by analysing true-positive incidents. In these, we know for sure that objects were compromised and the incident was created for good reason. The true-positive label can help us clean playbooks of unnecessary actions that ultimately failed to bring a result: the object is malicious, but the action (and the tool behind it) has repeatedly failed to confirm this in this type of incident, whatever the reason.
Let's perform a similar analysis using the example of a bruteforce playbook: multiple failed login attempts followed by a successful one.
Figure 4: Performance of the tools in the ‘Bruteforce, or successful password guessing’ playbook
The correlation rule brought us the following objects: the attacker's external host, the victim's internal host, and the compromised account. Let's run them through the enrichment tools to get more information for the maliciousness verdict. Following the playbook from the example, we go to Threat Intelligence, the SIEM, and analytics services such as VirusTotal. If the incident is a TP and the external attacking host obviously belongs to the attackers, but a specific data source does not (and on all previous occasions did not) provide information about the maliciousness or danger of the artefact, then most likely that source is simply irrelevant in this particular case.

Figure 5. Enrichment source scoring
This approach can be used to filter feeds, select subscriptions to analytics services, optimise the set of connected internal enrichment sources and calculate the utilisation of defence tools. Statistics is the goddess of data: it gives us conclusions on the cost-effective use of a tool for a specific type of incident, at a specific stage of investigation and under specific conditions: when we need an antivirus engine, when a crawler, and when an analytics service is useful to get a verdict on an indicator of compromise.
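The enrichment source scoring from Figure 5 boils down to a usefulness ratio per (incident type, source) pair. A minimal sketch, assuming each enrichment query is logged with a flag saying whether it returned a verdict or new context on a truly malicious object (field names and source labels are illustrative):

```python
from collections import defaultdict

def source_usefulness(queries):
    """For each (incident type, source), compute the share of enrichment
    queries that actually contributed to the verdict in TP incidents."""
    totals, useful = defaultdict(int), defaultdict(int)
    for q in queries:
        key = (q["incident_type"], q["source"])
        totals[key] += 1
        if q["gave_verdict"]:
            useful[key] += 1
    return {k: useful[k] / totals[k] for k in totals}

# Hypothetical enrichment log for TP bruteforce incidents.
log = [
    {"incident_type": "bruteforce", "source": "ti_feed", "gave_verdict": True},
    {"incident_type": "bruteforce", "source": "ti_feed", "gave_verdict": True},
    {"incident_type": "bruteforce", "source": "virustotal", "gave_verdict": False},
    {"incident_type": "bruteforce", "source": "virustotal", "gave_verdict": False},
]
print(source_usefulness(log))
```

A source whose ratio stays near zero for a given incident type over many TP incidents is a candidate for removal from that playbook (and, for paid subscriptions, from the budget).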
Conclusion
With such a systematic approach, we have the enhanced information needed to capture and optimise the investigation and response process based on retrospective analysis of right and wrong actions. In an automated way, we can:
- preserve the expertise of strong specialists by capturing it in playbooks,
- highlight and rework bottlenecks,
- optimise overly long plan branches,
- eliminate redundant tools and analytics,
- optimise playbooks on a regular basis in the PDCA cycle,
- deliver real-time integral assessments for decision making.
That is, by applying its core advantage over humans (the easy processing of large volumes of data and the search for patterns), the analytics system can help both retain valuable skills and suggest steps to reduce costs and improve the efficiency of the investigation and incident response process.