Ruslan Rakhmetov, Security Vision
In early 2024, a finance employee at the Hong Kong branch of the international engineering company Arup transferred $25.6 million to fraudsters, and it was not the result of hacking, a phishing email, or a virus. The employee transferred the money because he received a direct instruction from the company's chief financial officer in London, whom he saw with his own eyes on a group video call alongside other senior executives. There was just one problem: every participant in that call, except the victim himself, was a deepfake avatar created by the attackers.
This incident is a clear demonstration that deepfakes have ceased to be entertainment or science fiction and have become a powerful tool for fraud powered by artificial intelligence.
In today's article, we will look at these high-tech forgeries of video, audio, and photo content, created with the help of artificial intelligence and becoming harder to distinguish from reality every day, as well as ways to protect yourself from attackers. To fight an enemy effectively, you need to understand how it works. The term Deepfake itself is a blend of two English terms: deep learning and fake. It came into wide use in 2017, when a user on the Reddit platform began publishing synthetic videos, drawing public attention to the technology.
The creation of deepfakes is based on complex neural network models, which can be divided into two groups:
- Autoencoders, which work like an artist and a sculptor in the world of technology. First, one neural network (the encoder) "studies" thousands of photos and videos of a single person (for example, Tom Cruise) in order to "compress" their unique facial features (eye shape, nose shape, facial expressions) into a compact set of key characteristics. Then another neural network (the decoder) learns the opposite: to take this compressed set of characteristics and reconstruct a full face from it. Once both models are trained, the scammers take the victim's video, pass it through the encoder to obtain the motion structure, and then "superimpose" Tom Cruise's characteristics onto this structure using his decoder. This technology is at the heart of the well-known face-swapping application FaceSwap.
- Generative Adversarial Networks (GANs) work in a more complex way, but they are also built from two models, which can be compared to an artist and a critic competing with each other. The "Generator" model (the artist) takes random noise and tries to produce an image, for example a human face (at first, it does this terribly). Meanwhile, the "Discriminator" (the critic) studies both real faces from the training set and the Generator's clumsy attempts. The Generator draws a face and shows it to the critic, who "replies" that it is a fake, pointing out why (for example, the ears are blurred, there are six fingers on a hand, or the eyes look in different directions, like Pennywise in the movie "It"). The Generator makes corrections and draws a new face, showing each result to an ever pickier critic until the critic is satisfied. This process is repeated millions of times (a minimal training-loop sketch is shown after this list). It is this competitive nature of GANs that makes deepfakes improve at an exponential rate: each new deepfake "learns" from the very mistakes we use to recognize them (for example, unnatural blinking or artifacts along the edges).
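To make the artist-and-critic dynamic concrete, here is a minimal, illustrative GAN training loop in PyTorch. It is a sketch under simplifying assumptions, not a production deepfake pipeline: the tiny fully connected networks, the 64-dimensional noise vector, and the random stand-in for a face dataset were all chosen purely for brevity.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64      # size of the random "noise" the artist starts from
IMG_DIM = 28 * 28    # toy image size; real face models are far larger

# The "artist": turns random noise into an image
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_DIM), nn.Tanh(),
)

# The "critic": outputs the probability that an image is real
discriminator = nn.Sequential(
    nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def training_step(real_faces: torch.Tensor):
    """One round of the artist-vs-critic competition."""
    batch = real_faces.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Train the critic: learn to tell real faces from the artist's fakes
    fakes = generator(torch.randn(batch, LATENT_DIM)).detach()
    d_loss = (loss_fn(discriminator(real_faces), real_labels)
              + loss_fn(discriminator(fakes), fake_labels))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the artist: produce fakes the critic accepts as real
    g_loss = loss_fn(discriminator(generator(torch.randn(batch, LATENT_DIM))),
                     real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Repeated millions of times on real photos, this loop is exactly the
# "competition" described above: each side learns from the other's mistakes.
for step in range(1000):
    real_faces = torch.randn(32, IMG_DIM)  # stand-in for a real face dataset
    training_step(real_faces)
```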
Deepfakes were initially used for photographs, and neural networks can now generate video as well, but the threat also extends to audio: speech synthesis (Text-to-Speech, TTS) and voice cloning technologies make it possible to create frighteningly realistic audio clips.
Fraudsters no longer need hours of studio recordings to create a convincing voice deepfake: modern AI systems need only a few seconds of your voice, after which the model can make your "voice" say anything.
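To show just how low this barrier is, here is a minimal sketch of few-shot voice cloning, assuming the open-source Coqui TTS library and its XTTS v2 model (our choice for illustration; the file names are hypothetical placeholders). Similar tooling is freely available, which is exactly why a few seconds of audio from a social-media video is enough for an attacker.

```python
# pip install TTS  (the open-source Coqui TTS package; an assumption here)
# Legitimate uses of voice cloning require the speaker's explicit consent.
from TTS.api import TTS

# Load a multilingual voice-cloning model (weights download on first run)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A short reference clip of the target speaker is enough
tts.tts_to_file(
    text="Hi, it's me. I'm in trouble and urgently need money.",
    speaker_wav="reference_sample.wav",  # a few seconds of the cloned voice
    language="en",
    file_path="cloned_output.wav",
)
```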
At a more widespread level, scammers use deepfakes to attack ordinary people. The scheme usually targets elderly relatives and parents, plays on emotions, and follows this formula:
1) A person (for example, a mother) receives a call;
2) She hears her child's cloned voice, which sounds scared or crying;
3) The voice says a typical phrase: "Mom, I was in an accident / the police detained me, I urgently need money for a lawyer, but I can't talk right now, so I'm handing over the phone...";
4) An "intermediary" (now a real, live fraudster) joins the conversation, applies strong emotional pressure, urgently demands a money transfer and, most importantly, asks the victim "not to tell anyone," so that she has no time to verify anything.
Fraudsters often intentionally add noise or interference to the audio stream to hide minor speech-synthesis artifacts and pass voice defects off as "poor connection quality."
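The effect is easy to reproduce. Here is a minimal sketch, assuming NumPy and a speech waveform already loaded as a float array, that adds Gaussian noise at a chosen signal-to-noise ratio: the hiss sounds plausible on a phone line, yet it buries the faint clicks and metallic resonances that would betray synthesis.

```python
import numpy as np

def mask_artifacts(waveform: np.ndarray, snr_db: float = 15.0) -> np.ndarray:
    """Add Gaussian noise at a given signal-to-noise ratio (in dB).

    The scammer's trick: the hiss is blamed on a "bad connection,"
    while it actually drowns out subtle synthesis artifacts.
    """
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), waveform.shape)
    return waveform + noise

# Usage (hypothetical): noisy_call = mask_artifacts(cloned_voice, snr_db=12)
```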
Exactly the same scheme was used in the incident described at the beginning of this article: preparation (phishing) → overcoming doubts (video call) → execution (social engineering) → result. Let's look at it step by step:
1) First, an employee in Hong Kong received a classic phishing email, allegedly from the chief financial officer (CFO) at the London headquarters. The letter stated the need for an urgent and confidential transaction;
2) The employee, as one would expect, became suspicious. However, the scammers were ready for this: he immediately received an invitation to a video conference to discuss the details;
3) When the employee joined the call, he saw on screen not only the CFO but also several other senior executives of the company whom he knew by sight. They all looked and spoke completely normally, discussing the details of the transaction and confirming its urgency;
4) Fully convinced by the visual "proof" and the authority of the assembled leadership, the employee followed their instructions and made 15 separate transfers totaling HK$200 million (about $25.6 million) to five different accounts controlled by the scammers.
To recognize a voice deepfake, it's worth focusing on several factors:
- Intonation and tempo
- Emotions
- Background noise
A fake voice often sounds robotic and monotonous, without lively human intonation. Phrases may be pronounced unnaturally fast or with incorrect stresses. There may be a clear mismatch between the intonation and the content of the message: the voice stays "flat" where panic or joy would be expected. Strange sound artifacts, clicks, metallic echoes, or unnatural noise can also signal a forgery. One of these cues, pitch monotony, can even be checked programmatically, as sketched below.
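As an illustration of the "monotonous voice" cue, here is a heuristic sketch using the open-source librosa library (our choice, as are the threshold values): it estimates the pitch contour of a recording and flags suspiciously low pitch variation. Real detectors are far more sophisticated; this only shows the idea.

```python
import librosa
import numpy as np

def flag_monotone_voice(path: str, min_pitch_std_hz: float = 20.0) -> bool:
    """Crude heuristic: a flat pitch contour -> possibly synthetic speech.

    The threshold is illustrative. Natural speech pitch usually varies
    far more than early TTS output, but calm legitimate speech can also
    trigger false positives, so treat this as a hint, not a verdict.
    """
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]
    if voiced_f0.size == 0:
        return True  # no voiced speech detected at all: suspicious in itself
    return float(np.std(voiced_f0)) < min_pitch_std_hz

# Usage (hypothetical file): print(flag_monotone_voice("suspicious_call.wav"))
```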
When watching a suspicious video or during a video call, pay attention to the following inconsistencies:
- Eyes and blinking
- The edges of the face and hair
- Skin
- Lighting and reflections
- Lips and facial expressions
For a long time, AI models could not correctly simulate the random nature of blinking: if a person blinks too often, too rarely, or not at all, that alone is suspicious. Also watch for a "dead," "empty," or "frozen" gaze; the eyes may not track objects or may not reflect the emotions conveyed by the voice. Look for blurred edges of the face, especially where it meets the hair, ears, or neck: these are the places where the AI "glues on" the fake face, and the seam may show as artifacts, pixelation, or unnatural blurring. The skin may look odd (too smooth, as if under a heavy "wax" filter, or, conversely, too wrinkled, or with artifacts inconsistent with age or lighting). Lip movement is often poorly synchronized with the sound. And one of the hardest elements for AI is lighting: shadows on the face may not match the light sources in the room, and if the person wears glasses, the glare in them may not change, or may look unnatural, as the head moves. The blink cue is simple enough to check programmatically, as sketched below.
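As an illustration of the blink cue, here is a sketch of blink counting with the open-source MediaPipe Face Mesh and OpenCV (the library choice, landmark indices, and thresholds are our assumptions): it computes the eye aspect ratio (EAR) for each frame and counts eye closures. An abnormal blink rate in a recorded video call is one measurable red flag among the many listed above.

```python
import cv2
import mediapipe as mp
import numpy as np

# Commonly used Face Mesh landmark indices around the left eye
LEFT_EYE = [33, 160, 158, 133, 153, 144]
EAR_THRESHOLD = 0.2  # below this the eye is considered closed (tunable)

def eye_aspect_ratio(pts: np.ndarray) -> float:
    """EAR = (sum of two vertical eyelid gaps) / (2 * horizontal eye width)."""
    v1 = np.linalg.norm(pts[1] - pts[5])
    v2 = np.linalg.norm(pts[2] - pts[4])
    h = np.linalg.norm(pts[0] - pts[3])
    return (v1 + v2) / (2.0 * h)

def count_blinks(video_path: str) -> int:
    blinks, eye_closed = 0, False
    cap = cv2.VideoCapture(video_path)
    with mp.solutions.face_mesh.FaceMesh(refine_landmarks=True) as mesh:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_face_landmarks:
                continue
            lm = result.multi_face_landmarks[0].landmark
            pts = np.array([(lm[i].x, lm[i].y) for i in LEFT_EYE])
            if eye_aspect_ratio(pts) < EAR_THRESHOLD:
                eye_closed = True
            elif eye_closed:  # eye reopened: count one full blink
                blinks += 1
                eye_closed = False
    cap.release()
    return blinks

# Usage (hypothetical file): compare count_blinks("suspicious_call.mp4")
# per minute of video against a typical human rate of 15-20 blinks/minute.
```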
In a future article, we will separately examine deepfake-detection technologies and corporate security systems, but for now here are three simple steps you can start using in everyday life right away:
Come up with a safe word
Agree with your loved ones (especially parents and children) on a secret code word or a question that only you know: if a "relative" calls with a cloned voice and an urgent, emotional request for money, your first question should be: "Say our word." The scammer doesn't know it, which makes this the simplest and most reliable method.
Call back yourself
If you receive a suspicious call or even a video call (no matter whether it is from your "boss" or your "son"), hang up immediately and then dial that person yourself using a number from your own contacts. If it was a scammer, they will not be able to intercept your outgoing call to the real number.
Use critical thinking
Trust your intuition: if something feels wrong, or the request is too urgent, too emotional, and comes with a demand for secrecy, it is almost certainly fraud. Attackers always press on urgency and panic so that you have no time to think; pause and don't make impulsive decisions.
The main conclusion is that we can no longer trust by default what we see and hear online, even if it is the face and voice of a boss or a loved one. But at the human level, victory depends on us: implement strict verification protocols against deepfakes, train employees and loved ones, share this text with them, and never hesitate to double-check information. The golden rule of protection against deepfakes in the new digital age is simple: "Doubt, verify, call back."