SSDL: ML for code and behaviour testing of open source solutions

30.05.2024


Security Vision

Introduction

In the previous article we looked at ways of checking open source code, both vendor-based and standalone. Regularly checking updates for suspicious or outright malicious activity is painstaking, high-stakes, and time-consuming work when done manually.

There are various ways to automate this process, but the methods of disguising malware as legitimate software evolve by the day, so detection methods have to evolve at least as fast.

Artificial Intelligence

In the early days of the antivirus industry, the detection of malware on computers relied on heuristic functions that identified specific malicious files by:

- code fragments

- hashes of code fragments or the entire file

- file properties

- and combinations of these features.

The main goal was to create a reliable fingerprint - a combination of features - of a malicious file that could be quickly verified.
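As a rough illustration of such a fingerprint, the sketch below combines a whole-file hash, hashes of fixed-size fragments, and one file property (size). The function names, the 64-byte fragment size, and the "two shared fragments" rule are assumptions made for this example, not how any particular antivirus engine works.

```python
import hashlib

def fingerprint(data: bytes, fragment_size: int = 64) -> dict:
    """Build a simple fingerprint: whole-file hash, fragment hashes, file size."""
    fragments = [data[i:i + fragment_size] for i in range(0, len(data), fragment_size)]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "fragment_hashes": {hashlib.sha256(f).hexdigest() for f in fragments},
        "size": len(data),
    }

def matches(candidate: dict, known_bad: dict, min_shared_fragments: int = 2) -> bool:
    """Match on an identical file hash, or on enough shared fragment hashes."""
    if candidate["sha256"] == known_bad["sha256"]:
        return True
    shared = candidate["fragment_hashes"] & known_bad["fragment_hashes"]
    return len(shared) >= min_shared_fragments

# Hypothetical data: a known sample and a variant that changes only the last fragment.
known = fingerprint(b"A" * 64 + b"B" * 64 + b"C" * 64)
variant = fingerprint(b"A" * 64 + b"B" * 64 + b"X" * 64)
print(matches(variant, known))
```

A fingerprint like this is fast to verify, but it is also easy to evade: shifting the content by a single byte changes every fragment hash, which is one reason the approaches below are needed.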

Machine learning algorithms go a step further. Instead of only matching known patterns, they learn from labelled samples and can be retrained as attackers change their techniques, which lets them adapt to an ever-changing environment.

File information is collected at two different stages:

- Before execution, that is, before any of the file's code is run. At this stage, data about the file's format, its source code or code description, its binary representation, and similar information can be recorded.

- After execution. Here the collected data are logs of the system events and calls that occur while the file runs inside an isolated sandbox environment.

For both types of data, machine learning can be applied. Let's look at the classical approaches to each.

Classification based on static file properties

In the context of open source malware detection, one powerful machine learning tool is the Random Forest algorithm. This method is particularly effective when dealing with static file properties such as code fragments, hashes, metadata, and executable structure.
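Before any classifier can be applied, the static properties have to be turned into numbers. The sketch below shows one possible way to do that with three features: file size, byte entropy (packed or encrypted payloads tend to have high entropy), and a count of suspicious tokens. The feature set and the token list are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def static_features(data: bytes, suspicious_tokens=(b"eval", b"exec", b"base64")) -> list:
    """Turn raw file content into a small numeric feature vector."""
    return [
        float(len(data)),                                      # file size in bytes
        byte_entropy(data),                                    # randomness of the content
        float(sum(data.count(t) for t in suspicious_tokens)),  # hypothetical token list
    ]

print(static_features(b"import base64; exec(base64.b64decode(payload))"))
```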

Random Forest works by building multiple decision trees, each on a different random subset of the training samples and attributes. Each decision tree analyses a specific set of file characteristics, such as strings found in the code, imported and exported symbols, binary patterns and other static attributes. The trees work independently of each other and each gives its own prediction about whether a file is malicious; the forest's final verdict is the majority vote of the trees.

The advantage of this approach is that the forest can be retrained as new samples appear, so it adapts to new types of malware, while the voting makes it less sensitive to the mistakes of any single tree.
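To make the mechanics concrete, here is a deliberately simplified sketch using only the standard library. It is not a full Random Forest: each "tree" is a one-level decision stump, but the three ingredients described above are present (bootstrap samples, random attribute subsets, majority vote). The feature columns and all values are made up for the example; in practice a library implementation such as scikit-learn's RandomForestClassifier would normally be used instead.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold, polarity) with the fewest errors on this sample."""
    best = None
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            for polarity in (1, 0):  # label predicted when value > threshold
                preds = [polarity if r[f] > threshold else 1 - polarity for r in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, polarity)
    return best[1:]

def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]       # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)  # random attribute subset
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Each stump votes independently; the majority label wins."""
    votes = Counter(
        polarity if row[f] > threshold else 1 - polarity
        for f, threshold, polarity in forest
    )
    return votes.most_common(1)[0][0]

# Hypothetical columns: [byte entropy, suspicious token count, network-related imports]
rows = [[4.1, 0, 0], [4.5, 0, 1], [4.8, 1, 0], [5.0, 0, 1],   # benign
        [7.2, 3, 4], [7.5, 5, 3], [7.8, 4, 6], [7.9, 6, 5]]   # malicious
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict(forest, [7.95, 7, 7]), predict(forest, [3.0, 0, 0]))
```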

Behavioural analysis of a file using recurrent networks

Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) networks, are well suited to analysing file behaviour, especially the sequences of system calls a file makes during its execution.

RNNs are able to process temporal sequence data by storing information about previous states in their memory. This allows them to analyse the behaviour of a file not only at the current moment of a sandbox simulation, but also cumulatively take into account its previous activity. This approach is important for detecting complex patterns of behaviour that are often inherent in malware.
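The sketch below illustrates the idea of a state that accumulates history. System call names are mapped to integer ids, and a plain (Elman-style) recurrent update mixes each new call with the previous hidden state. The vocabulary, the hidden size, and the random untrained weights are all assumptions for illustration; a real detector would learn the weights from labelled traces and put a classifier on top of the final state.

```python
import math
import random

def encode(trace, vocab):
    """Map system call names to integer ids; unknown calls share id 0."""
    return [vocab.get(name, 0) for name in trace]

def rnn_states(ids, vocab_size, hidden_size=4, seed=0):
    """Return the hidden state after each call: h_t = tanh(W_x[x_t] + W_h h_{t-1})."""
    rng = random.Random(seed)
    # Untrained random weights. w_x is indexed by call id (equivalent to a one-hot input).
    w_x = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(vocab_size)]
    w_h = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    states = []
    for call_id in ids:
        h = [
            math.tanh(w_x[call_id][j] + sum(w_h[k][j] * h[k] for k in range(hidden_size)))
            for j in range(hidden_size)
        ]
        states.append(h)
    return states

# Hypothetical vocabulary and traces that differ only in their first call.
vocab = {"open": 1, "read": 2, "write": 3, "connect": 4}
a = rnn_states(encode(["open", "read", "write"], vocab), vocab_size=5)
b = rnn_states(encode(["connect", "read", "write"], vocab), vocab_size=5)
print(a[-1] != b[-1])  # the final state depends on the earlier history
```

Both traces end with the same two calls, yet their final states differ, because the state carries information about what happened earlier in the run.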

LSTM, a type of RNN, is particularly effective at processing long sequences of data. It is able to ‘remember’ important information over both long and short time intervals, and ignore irrelevant data. This makes LSTMs particularly suitable for analysing complex and long-lasting behavioural patterns, such as a series of branched system calls, changes to the system registry, operations on user policies or on files that may indicate malicious activity.

LSTM networks are built around structures called gates, which let the model control what information is kept in its memory (the cell state). There are three main types of gates: the forget gate, the input gate and the output gate. The forget gate decides what information from the previous state should be ‘forgotten’ or discarded, the input gate determines what new data should be added to the cell state, and the output gate controls what information from the current cell state should be used in the network output.
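The interaction of the three gates can be seen in a single step of a one-unit LSTM cell, sketched below with the standard update equations. The parameter layout (a dict of (w_x, w_h, bias) per gate) and the hand-picked weights are assumptions for this example; real networks use vectors and learned weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))         # how much of the old cell state to keep
    i = sigmoid(pre("input"))          # how much new information to write
    g = math.tanh(pre("candidate"))    # the new information itself
    o = sigmoid(pre("output"))         # how much of the cell state to expose
    c = f * c_prev + i * g             # updated cell state (the memory)
    h = o * math.tanh(c)               # output / hidden state
    return h, c

# Hand-picked weights: forget gate open, input gate closed -> the memory is preserved.
keep = {"forget": (0, 0, 10), "input": (0, 0, -10), "candidate": (1, 0, 0), "output": (0, 0, 10)}
# Forget gate closed, input gate open -> the old memory is overwritten by the new input.
overwrite = {"forget": (0, 0, -10), "input": (0, 0, 10), "candidate": (1, 0, 0), "output": (0, 0, 10)}

print(lstm_step(1.0, 0.0, 0.5, keep)[1])       # stays close to the old value 0.5
print(lstm_step(1.0, 0.0, 0.5, overwrite)[1])  # moves close to tanh(1.0)
```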

By applying LSTM to the analysis of virus system call sequences, the model is able to effectively learn complex and variable patterns of behaviour that may not be obvious in static analysis. For example, a virus may perform a series of routine file operations to hide its malicious activity, such as copying or modifying system files. LSTM is able to detect such behavioural anomalies by analysing the sequence and context of these operations.

In addition, LSTM effectively handles other dynamic characteristics of programs, such as changes to system registries or the creation and deletion of processes, which are often a sign of malicious activity. This is particularly important for detecting advanced malware that can mask its actions or change its behaviour over time.

Deep learning vs. non-proliferating attacks

Typically, machine learning deals with problems in which both malicious and benign samples are well represented in the training set. But some attacks are so rare that we have only one example of the malware to train on. This is typical of high-profile targeted attacks. In this case, a very specific model architecture based on deep learning is used. This approach is called exemplar network (ExNet).

The idea here is that the model is trained to create compact representations of the input features. These representations are then used to simultaneously train multiple per-exemplar classifiers, that is, algorithms that each detect a specific type of malware. Deep learning combines these steps (extracting object features, building compact feature representations, and creating a local model for each sample) into a single neural network pipeline that extracts distinguishing features for different types of malware.

This model can efficiently generalise knowledge about individual malware samples and a large collection of clean samples. It can then detect new modifications of the corresponding malware.


Figure 1: An example of how the exemplar network (ExNet) algorithm works
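The following sketch illustrates only the "one classifier per exemplar" part of this idea, and it is not ExNet itself: there is no neural network and no learned compact representation. Instead, a small logistic regression is trained for each exemplar, with that single malware sample as the only positive example and a set of clean samples as negatives. All feature vectors, names, and hyperparameters are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_exemplar_classifier(exemplar, clean, epochs=500, lr=0.5):
    """Logistic regression with one positive (the exemplar) and many clean negatives."""
    w = [0.0] * len(exemplar)
    b = 0.0
    data = [(exemplar, 1.0)] + [(x, 0.0) for x in clean]
    for _ in range(epochs):
        for x, y in data:
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]  # gradient step on weights
            b -= lr * err                                     # gradient step on bias
    return w, b

def score(model, x):
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical, already-normalised feature vectors.
exemplars = {"sample_a": [0.9, 0.1, 0.8], "sample_b": [0.1, 0.9, 0.7]}
clean = [[0.1, 0.1, 0.1], [0.2, 0.1, 0.0], [0.0, 0.2, 0.1], [0.1, 0.0, 0.2]]
models = {name: train_exemplar_classifier(x, clean) for name, x in exemplars.items()}

modified_a = [0.85, 0.15, 0.75]  # a slightly changed version of sample_a
print({name: round(score(m, modified_a), 3) for name, m in models.items()})
```

Each classifier only needs to recognise samples close to its own exemplar, which is what makes training possible with a single malicious example per attack.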

Conclusion

In the ongoing battle against malware, one of the key strategies is to create reliable fingerprints for fast file inspection. This is where powerful machine learning algorithms come into play, capable of not only following known patterns but also adapting to new and devious hacker tactics.

Automated methods of checking suspicious code can significantly improve and speed up the software update process, which in turn speeds up the related process of patch management.

At two stages of information gathering - before and after file execution - machine learning reveals its full potential. In the first stage, the Random Forest algorithm does a great job with the static properties of the file, identifying code snippets, hashes, and the structure of the executable. By using multiple decision trees, each analysing a unique set of characteristics, this method provides adaptability to new threats.

The second stage - post-execution - becomes the arena for recurrent neural networks (RNNs) and their derivatives, such as LSTMs. These networks analyse sequences of system calls, remembering previous states and providing a comprehensive view of a file's behaviour. This method is ideal for identifying complex malware patterns, highlighting the importance of analysing the dynamics of actions.

Not everyone will adopt this level of automation in the search for suspicious patterns just to use free software, but having the option is a good thing. After that, it is up to each team to decide for itself.