
AI Cybersecurity. Part 2. Transformers, LLM, AI
12.02.2026

Ruslan Rakhmetov, Security Vision

 

In the previous article, we described the main machine learning methods and neural network architectures, including transformers – the architecture used for natural language processing in popular AI chatbots such as ChatGPT. Today, AI assistants process not only text but also images and speech, and AI agents can independently interact with digital services and perform sequences of actions to achieve a user-defined goal. Naturally, attackers immediately began searching for vulnerabilities in popular AI tools and devising new cyberattack methods: for example, ChatGPT's Deep Research AI agent can be abused to surreptitiously access a user's email, and AI browsers are susceptible to vulnerabilities that allow arbitrary actions to be performed on webpages on the user's behalf, including clicking phishing links and paying for goods in fake online stores. By tracing the history of AI development from the first language models to multimodal agents, this article will help us understand the specific AI features that attackers exploit.

 

The theoretical foundations of AI were laid centuries ago: in 1763, a fundamental work in probability theory was published (Bayes' theorem, reformulated by Laplace in 1812). In 1805, Legendre described the method of least squares, now used in regression analysis to approximate and predict the influence of a number of independent variables on a dependent variable. In 1906, the Russian scientist Andrey Andreevich Markov proved that the law of large numbers applies to dependent events and proposed a model in which the probability of the next event depends only on the current one – the theory of "Markov chains" is now actively used in forecasting, recommendation algorithms, text processing, and language models. In 1943, American researchers Warren McCulloch and Walter Pitts proposed a mathematical model of an artificial neural network, and in 1950, the British scientist Alan Turing formulated the concepts of machine learning and artificial intelligence. In 1957, the American Frank Rosenblatt created the perceptron, the world's first single-layer neural network, which emulated human perception; in 1960 it was implemented as the Mark I neural computer, which, after training, could distinguish simple geometric shapes. The first program capable of emulating conversation with a user (the prototype of modern chatbots) was ELIZA, developed in 1966 at the Massachusetts Institute of Technology, and in 1979, a prototype autonomous vehicle, the Stanford Cart, was created at Stanford University; it could move independently and avoid obstacles in laboratory conditions.

 

The surge of interest in machine learning in the 2010s was driven, first, by a significant increase in hardware performance; second, by the growth of digital datasets on which ML models could be trained; and third, by a number of landmark events. In 2009, the ImageNet project was launched, collecting millions of labeled images – it was these freely available datasets that allowed researchers to advance computer vision and deep learning. In 2012, the ImageNet image recognition competition was won by AlexNet, a convolutional neural network with 8 layers and 60 million parameters, created by Alex Krizhevsky and Ilya Sutskever under the supervision of Geoffrey Hinton at the University of Toronto. They were among the first to train a neural network on two Nvidia GeForce GTX 580 graphics cards, setting a long-term trend toward GPUs (although TPUs – tensor processing units – are now increasingly used). In 2014, DeepFace was created – a deep-learning facial recognition system with nine layers and over 120 million parameters, trained on four million photos of users of a popular American social network. FindFace, a facial recognition technology developed by the Russian company NtechLab in 2016, made it possible to search for VK users by their photos. In 2017, Google developed the modern transformer neural network architecture, which reduced training time compared to recurrent neural networks; it was this architecture that OpenAI used to create its large GPT language models. The release of GPT-3 in 2020 was a landmark event, as the 175-billion-parameter model could handle tasks it had not been trained on, while GPT-3.5 improved the interpretation of user prompts – it was this model, exposed through the publicly available ChatGPT chatbot web interface, that brought the general public into the world of AI in November 2022.

 

The transformer neural network architecture is built from the following components:


1) Incoming information (text, images, audio) is broken down into tokens: a tokenizer converts a sentence or image into a sequence of integers. In the case of text, one token (number) can correspond to a word, part of a word (a sequence of letters or symbols), a phrase, a punctuation mark, or the beginning or end of a sentence. Depending on the implementation, one token corresponds on average to about four characters of English text, while for Russian the ratio is slightly higher, around five to six characters per token. This is because Russian uses longer words with suffixes, declensions, and conjugations that carry the semantic distinctions between words and phrases. Other types of data are tokenized similarly: images are divided into small patches (e.g., 8x8 or 16x16 pixels), and audio is divided into frames (e.g., 20-40 milliseconds long), which are then processed and grouped into tokens.
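As a purely illustrative sketch: real tokenizers (e.g., BPE-based ones) learn their subword vocabulary from data, whereas the tiny hand-made vocabulary below only shows the mechanics of turning text into integer token ids.

```python
# Toy illustration of tokenization: a hand-crafted subword vocabulary and
# greedy longest-match splitting. Real LLM tokenizers learn the vocabulary
# automatically from huge text corpora.
TOY_VOCAB = {
    "<start>": 0, "<end>": 1, "trans": 2, "form": 3, "er": 4,
    "s": 5, " ": 6, "use": 7, "token": 8, ".": 9,
}

def toy_tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against TOY_VOCAB."""
    ids = [TOY_VOCAB["<start>"]]
    i = 0
    while i < len(text):
        # try the longest vocabulary entry that matches at position i
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(TOY_VOCAB[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    ids.append(TOY_VOCAB["<end>"])
    return ids

print(toy_tokenize("transformers use tokens."))
# → [0, 2, 3, 4, 5, 6, 7, 6, 8, 5, 9, 1]
```

Note how "transformers" is split into the subwords "trans" + "form" + "er" + "s" – exactly the "part of a word" case described above.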

 

2) The resulting tokens are converted into vector representations (embeddings). Consider, for example, the Word2Vec vectorization method developed by Google in 2013: each word from a given set is assigned a vector in an n-dimensional space (the dimensionality depends on the complexity of the data but can reach a thousand or more). To assess how related two words are (i.e., their semantic similarity), the proximity of their vectors is compared using cosine similarity, i.e., the cosine of the angle between them. For example, if two synonyms (say, "original" and "initial") frequently appear in similar contexts in the dataset, their embeddings will be similar; thus, a transformer that has processed a huge body of text will know that these words are synonyms. However, this also requires taking adjacent words into account and understanding the context – for example, the word "month" can refer to a calendar month or to the Moon, Earth's natural satellite (in Russian, the word "месяц" has both meanings). To solve this problem, transformers use an attention mechanism, which modifies a word's embedding by forming a linear combination of the embeddings of adjacent words with certain weighting factors. A further development of this concept is the Multi-Head Attention mechanism, in which the transformer analyzes different parts of a sentence in parallel to find relationships between words and their context and adjusts the weighting factors. This idea was presented in the 2017 paper by Google researchers, "Attention Is All You Need", which first proposed the transformer architecture.
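Cosine similarity itself is simple to compute. A minimal sketch with hypothetical 3-dimensional embeddings (real embeddings have hundreds or thousands of dimensions, and the values below are made up for illustration):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: semantically close words get close vectors.
cat = [0.9, 0.1, 0.3]
kitten = [0.85, 0.15, 0.35]
car = [0.1, 0.95, 0.2]

print(cosine_similarity(cat, kitten))  # close to 1.0: similar meaning
print(cosine_similarity(cat, car))     # noticeably smaller: unrelated words
```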

 

3) Positional Encoding accounts for the position of each word in the sentence and is applied to the embeddings. The encoder consists of several layers: it processes the input phrase, applies the Multi-Head Attention mechanism, passes the result through a built-in feedforward neural network (FFN), and applies normalization. The encoder's output is then fed to the decoder, which also consists of several layers and applies attention mechanisms, an FFN, and normalization; it then passes the result to the output layer, which produces the final output as a probability distribution over tokens using the softmax function. For example, if the transformer were tasked with translating a sentence from English into Russian, the encoder would convert the tokens of the original English phrase into embeddings, while the decoder and output layer would generate a set of Russian words and arrange them in the correct order to produce the final translated phrase.
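The final softmax step can be sketched in a few lines; this is a minimal pure-Python illustration (production implementations operate on large tensors, and the scores below are invented):

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert the output layer's raw scores into a probability
    distribution over the token vocabulary (numerically stable form)."""
    m = max(logits)                         # subtract max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a 4-token vocabulary:
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)        # the highest score gets the highest probability
print(sum(probs))   # probabilities sum to 1.0
```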

 

A transformer neural network self-trained on a large volume of unlabeled text is called a Large Language Model (LLM) – examples of practical implementations include well-known chatbots such as ChatGPT, Google Gemini, and Microsoft Copilot. If a transformer is trained to work not only with text but also with images, audio, and video, it is called multimodal (a modality is a type of data); the first popular multimodal transformer was GPT-4o (where o stands for omni, i.e., comprehensive). A distinctive feature of LLMs is their knowledge cutoff – for example, the latest version, ChatGPT-5, released in August 2025, has knowledge limited to mid-2024. Although ChatGPT can process more recent information from various sources at the user's request, this temporary context is forgotten as soon as the current conversation with the chatbot ends. Full fine-tuning of an LLM can be a lengthy and expensive task – for GPT-3.5 it would require computing resources for 175 billion parameters, while for GPT-5 we are talking about trillions of parameters.


Therefore, various simplified methods are used to adapt the model to specific tasks and modify its behavior:


1) Prompt engineering – optimizing the query prompt helps achieve specific tasks without reconfiguring the LLM. A prompt can contain a detailed description of the task, source data, context, the model's operating mode, and the expected form of the response. For example, a prompt for accurately translating a technical text from English to Russian might look like this: "Work in technical translator mode. Maintain the translation accuracy and style of the source text. Use precise mathematical and physical definitions. Output the result in plain text without markup. Translate the following English text into Russian:...." In this example, we gave the model instructions without an example of the expected response – this technique is called Zero-Shot Prompting, which we combined with Role Prompting (specifying the mode in which the model should operate) and Instruction Prompting (providing precise and detailed instructions). To improve the model's accuracy, the prompt can be accompanied by examples of how the LLM should solve the given problem – this approach is called Few-Shot Prompting. A chain-of-reasoning prompt (Chain of Thought) involves decomposing the task into several steps (the user "explains" to the model what the sequence of reasoning should be). The prompt-splitting technique (Chunking) helps overcome the limitations of the model's context window (the amount of text processed in one query), and the query-chaining technique (Prompt Chaining) allows the model to solve more complex problems given in several sequential prompts, while letting the user inspect intermediate results.
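The prompting techniques above are ultimately string construction. A small sketch of assembling a Few-Shot prompt (the instruction wording, examples, and layout are illustrative, not a prescribed format):

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble a Few-Shot prompt: an instruction (Instruction Prompting),
    several input/output examples, then the actual query."""
    parts = [instruction, ""]
    for source, target in examples:
        parts.append(f"Input: {source}")
        parts.append(f"Output: {target}")
        parts.append("")
    parts.append(f"Input: {query}")
    parts.append("Output:")          # the model continues from here
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "Work in technical translator mode. Translate English terms into Russian.",
    [("firewall", "межсетевой экран"), ("vulnerability", "уязвимость")],
    "threat intelligence",
)
print(prompt)
```

Dropping the examples list turns the same construction into a Zero-Shot prompt; feeding the model's answer into the next call implements Prompt Chaining.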

 

2) In-Context Learning involves the user giving the model a series of clear instructions along with several examples of the expected behavior/responses; the model generalizes from the examples and adjusts its behavior accordingly.

 

3) The Parameter-Efficient Fine-Tuning (PEFT) method allows changing only a small number of parameters of the original pre-trained model while preserving its structure, reducing the cost of retraining. With PEFT, most of the model's parameters and layers are frozen, while a few trainable parameters (adapters) are added to the outer layers. Additionally, a memory-optimization technique called gradient checkpointing can be used. Using PEFT, a model can be trained to perform additional tasks – for example, to use specific vocabulary when interacting with users (a banking chatbot should address clients in a formal tone) or to recognize new types of objects in images (dogs as well as cats). PEFT protects the model from catastrophic forgetting, in which the model "forgets" the knowledge acquired during initial training after retraining, as well as from overfitting, in which the model works correctly only on data from the training set and cannot generalize or draw conclusions from new data.


Among the algorithms for the practical implementation of PEFT are:

 · adapters – small neural network adapter modules are inserted between the layers of the original network; during further training, only their parameters are changed;

 · LoRA (Low-Rank Adaptation), in which the weights of the original model do not change, but low-rank matrices containing the trainable parameters are added;

 · QLoRA (Quantized Low-Rank Adaptation), in which the weights of the original model are converted into more compact representations, significantly reducing resource consumption during model adaptation at the cost of a small decrease in accuracy.
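The LoRA idea from the list above can be sketched numerically: a minimal, purely illustrative example in which the frozen weight matrix W is combined with a low-rank update B @ A whose entries are the only trainable parameters (all values here are toy numbers):

```python
# Minimal numeric sketch of LoRA: effective weights = W + B @ A, where the
# rank of B @ A is much smaller than the matrix dimensions, so only a few
# parameters (those in A and B) would need to be trained.
def matmul(X, Y):
    """Plain matrix multiplication for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, x):
    """Apply the effective weights W + B @ A to input vector x.
    W stays frozen; only A and B would change during fine-tuning."""
    delta = matmul(B, A)
    W_eff = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_eff]

# Frozen 2x2 base weights and a rank-1 adaptation (B: 2x1, A: 1x2):
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5], [0.25]]
A = [[1.0, 2.0]]
print(lora_forward(W, A, B, [1.0, 1.0]))  # → [2.5, 1.75]
```

Here the 2x2 update is stored as 2 + 2 = 4 low-rank parameters instead of 4 independent weights; at realistic scales (thousands-dimensional matrices, rank 8 or 16) the savings are dramatic.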

 

5) The Retrieval-Augmented Generation (RAG) method allows an LLM to access an external source (a knowledge base) for up-to-date information. The developer places relevant and regularly updated information (e.g., data on new product models, new cyberattack techniques, or recent scientific articles) in the knowledge base, and the model finds the information the user requested using vector (semantic) search based on the cosine similarity of embeddings. Using verified knowledge bases not only increases user confidence in the model's responses but also improves data security, since knowledge bases are not part of the training dataset, are stored in their original form, and the model's access to them can be quickly revoked. Some AI assistants (e.g., Perplexity and Yandex Neuro) can use publicly available internet resources as the external source instead of internal knowledge bases that have been prepared and cleaned of inconsistent data.
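The retrieval step of RAG can be sketched with a toy example. Real systems embed text with a neural encoder; here, simple bag-of-words vectors stand in for embeddings, and the knowledge-base entries are invented:

```python
import math
from collections import Counter

# Toy semantic search for RAG: bag-of-words vectors stand in for neural
# embeddings, but the cosine-similarity ranking works the same way.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

knowledge_base = [
    "phishing attacks use fake emails to steal credentials",
    "ransomware encrypts files and demands payment",
    "firewalls filter inbound and outbound network traffic",
]

def retrieve(query: str) -> str:
    """Return the knowledge-base entry most similar to the query;
    in RAG, this entry would be injected into the LLM's prompt."""
    q = embed(query)
    return max(knowledge_base, key=lambda doc: cosine(q, embed(doc)))

print(retrieve("how does ransomware demand payment"))
```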

 

6) Knowledge Distillation helps compress models to reduce resource consumption while maintaining performance, allowing them to run on devices with low computing power, including smartphones. In distillation, a large pre-trained teacher model transfers "knowledge" to a smaller student model. For example, the distilled DistilBERT model is 40% smaller and 60% faster than the original BERT model while retaining 97% of its performance. The distillation method has been known for over 10 years and has even caused a scandal in the AI industry: in early 2025, Microsoft and OpenAI announced suspicions that the Chinese startup DeepSeek may have used distillation of OpenAI's GPT-4 model without authorization to optimize its R1 model.
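A common ingredient of distillation is training the student on the teacher's softened output distribution. A minimal sketch of temperature-scaled softmax (the logit values are invented; the exact loss any particular lab uses is not specified here):

```python
import math

def softmax_t(logits: list[float], temperature: float) -> list[float]:
    """Softmax with a temperature: T > 1 softens the teacher's distribution,
    exposing how it ranks even the 'wrong' answers - information the
    student model can learn from."""
    exps = [math.exp(x / temperature) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

teacher_logits = [8.0, 2.0, 1.0]   # hypothetical teacher scores for 3 classes

hard = softmax_t(teacher_logits, temperature=1.0)
soft = softmax_t(teacher_logits, temperature=4.0)
print(hard)  # nearly all probability mass on class 0
print(soft)  # a smoother target distribution for the student to imitate
```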

 

6) Hyperparameter optimization involves experimentally selecting hyperparameters to ensure the most efficient model performance. Model hyperparameters include characteristics such as learning rate, number of layers, number of neurons (nodes), activation function used, number of epochs (the number of complete passes of the entire dataset by the neural network), and batch size (the parts into which the dataset is divided).
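As an illustration, a minimal grid search over two hyperparameters might look like this; the validation_score function is a hypothetical stand-in for actually training the model and measuring its accuracy:

```python
from itertools import product

# Hypothetical stand-in for "train the model with these hyperparameters
# and measure validation accuracy" - in practice, this is the costly step.
def validation_score(learning_rate: float, batch_size: int) -> float:
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 32) / 256

# Grid of candidate hyperparameter values to try exhaustively:
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [16, 32, 64],
}

best = max(
    product(grid["learning_rate"], grid["batch_size"]),
    key=lambda combo: validation_score(*combo),
)
print(best)  # the combination with the highest validation score
```

Exhaustive grid search grows combinatorially with the number of hyperparameters, which is why random search and Bayesian optimization are often preferred at scale.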

 

7) Optimizers are used to speed up neural network training, including gradient descent (classical, stochastic, batch), momentum methods (including Nesterov momentum), AdaGrad (adaptive gradient), RMSProp (root mean square propagation), and Adam (adaptive moment estimation). Furthermore, some LLM developers offer their own optimizers – for example, the creators of the Chinese open-source Kimi-K2 models built their own optimizer, MuonClip, for improved training stability.
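For example, classical momentum can be sketched on a one-dimensional quadratic; the hyperparameter values below are illustrative:

```python
# Gradient descent with classical momentum on f(x) = (x - 3)^2,
# whose gradient is 2 * (x - 3). The velocity term accumulates past
# gradients, smoothing the trajectory and speeding up convergence.
def minimize_with_momentum(lr: float = 0.1, beta: float = 0.9,
                           steps: int = 200) -> float:
    x, velocity = 0.0, 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)
        velocity = beta * velocity - lr * grad   # accumulate momentum
        x += velocity
    return x

print(minimize_with_momentum())  # converges near the minimum x = 3
```

Adaptive optimizers such as AdaGrad, RMSProp, and Adam extend this idea by additionally scaling the step size per parameter based on the history of squared gradients.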

 

After training the neural network and adapting the LLM, one can move on to inference – the practical use of the model for text generation and, in the case of multimodal LLMs, for image and speech recognition, anomaly detection, etc. When a model begins to draw accurate conclusions about new data, independently analyze information and make decisions, and create new content – i.e., reproduce human actions in response to a prompt – one can speak of fully fledged generative artificial intelligence (Generative AI, abbreviated GenAI). If an AI system can act only in a strictly defined area (for example, perceiving voice commands or recognizing objects), it is called weak or narrow AI (Narrow AI). If an AI system can understand, self-learn, and apply knowledge across a wide variety of areas, surpassing human capabilities, it is called strong or general AI (Artificial General Intelligence, AGI).

 

In addition to generative AI systems, there are also:

 · Predictive AI, which predicts the future consequences of current events based on historical data and is used, for example, for data-driven business analytics;

 · Discriminative AI, which is trained to discriminate between data or classify objects in images;

 · Reasoning AI, which independently builds a long internal chain of reasoning, checks itself, recognizes its mistakes, goes back a step if it makes an incorrect decision, and can also display the entire decision-making chain to the user;

 · Agentic AI – a set of AI agents: autonomous AI-based programs that use the current context, conversational memory, execution history, obtained results, and tools (API integration with external programs and systems, interaction with other AI agents, internet access) to independently perform user-defined tasks, such as making purchases in an online store, planning a travel route and buying tickets, or writing and installing a program from a user's text description (so-called vibe coding). In turn, multimodal AI agents can interact with various types of data – not only text but also images, video, and speech, including information obtained from a device's built-in sensors (cameras and microphones).
