Masking data to protect personal data

26.01.2026

Security Vision

In the modern digital economy, data is often called the new "oil". The comparison goes further than intended: oil is not only fuel but also a flammable substance, and raw personal data (PD) is a toxic and explosive asset. On the one hand, development, testing, and analytics teams depend on raw data: they need realistic customer profiles, transaction histories, and complex relationships to ensure software quality. On the other hand, for information security specialists this same data is an area of maximum risk that should be isolated in the "fortress" of the production environment. For a long time, the industry lived with a compromise in which security was sacrificed for speed, and copying real production databases into development and testing environments (Test/Dev) became an unspoken standard. The result was a "shadow landscape" of data (akin to shadow IT): uncontrolled copies of sensitive information scattered across developer servers, analyst laptops, and contractor cloud workstations.

In this article, we explore data masking, a technology that offers not just a compromise but a win-win. In the Russian Federation, turnover-based fines for data breaches have been introduced and new Roskomnadzor regulations have entered into force, which has made the topic even more relevant and important for business.

The traditional cybersecurity model often resembles a medieval castle: high walls and a deep moat surrounding the "treasury" (the production database). In this analogy, the development and testing environments (Dev/Test) often act as a poorly guarded side gate for the servants. Statistics support this: up to 70% of the risk of sensitive information leakage is concentrated in non-production systems. For every protected production database, there are, on average, 8 to 10 copies of it in test, development, analytical, and sandbox environments. These copies often contain the same real personal data (full names, passports, transactions), but are under significantly less control and protection.

Supply chain attacks account for a growing share of incidents for the same reason:

- In January 2024, the Midnight Blizzard group compromised Microsoft's corporate communications. The entry point was an outdated test tenant: the sandbox account was not protected by multi-factor authentication (MFA) because the system was considered non-critical. The attackers brute-forced passwords to gain a foothold in the test environment and then, using excessive privileges, moved laterally into the main corporate network.

- Uber made the same mistake twice. In 2016, hackers found AWS access keys in a private GitHub repository used by engineers, which allowed them to download the data of 57 million users. In 2022, the attack was repeated by compromising the credentials of a contractor who had access to internal development tools.

- In Russia, there has been a surge in leaks through "forgotten" databases. A number of major leaks in 2024-2025 (retail, logistics, medicine) occurred because developers uploaded database dumps to temporarily rented cloud servers for testing and forgot to close ports or set passwords. Search engines such as Shodan and Censys indexed these exposed Elasticsearch or MongoDB instances within hours.

As a result, data used in development (keys, dumps) is often stored carelessly and becomes easy prey, while an unprotected test environment becomes a springboard for an attack on the core business. Data masking is the creation of a version of the database that looks and behaves like the real thing but does not contain real secrets.

Think of it as a high-quality fake, or a stunt double in a movie. Imagine filming a blockbuster with a Hollywood A-lister (the real data), whose face is known to everyone and whose health is insured for millions. Risking the star is unacceptable, yet the script includes an explosion or a fall from a roof (the test environment). Masking is the professional stunt double: they wear the same suit and have the same height and build (the format is preserved). In a wide shot (in the app), the viewer won't notice the substitution. But if an accident occurs on set (a leak), the stunt double is the one who gets hurt, while the "star" (the real client) remains safe.

Now let's figure out how it works, highlighting two approaches:

1) Static Data Masking (SDM), or the "golden copy for developers", is the process of creating an irreversibly anonymized copy of a database before it reaches the test environment. First, a clone (snapshot) of the production database is created in an isolated staging zone; then a masking script runs over the clone and overwrites the sensitive data, so the original values in the clone are destroyed. The "clean" copy is then handed over to developers. This approach provides maximum security because the copy physically contains no real data. Static masking is commonly used for development, functional testing, AI/ML training, and transferring data to outsourcers.

2) Dynamic Data Masking (DDM), or the "perception filter for operators", works differently: the data in the database remains real and unchanged, and masking happens on the fly at query time (much as SIEM correlation rules, UEBA anomaly detection, and TIP threat analysis process events as they arrive). A proxy gateway or a DBMS mechanism sits between the user and the data: an administrator who requests the data sees everything, a call center operator sees **** instead of the card number, and technical support and BI reports see only the data relevant to the current task. This approach protects against insider curiosity, but not against direct theft of the database files.
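The dynamic approach can be illustrated with a minimal Python sketch. It is not how any particular proxy or DBMS implements DDM: the role names, the `Client` record, and the rule of keeping the last four digits of the card number are all assumptions made for the example. The point it shows is that the stored row never changes and only the returned result depends on who is asking.

```python
from dataclasses import dataclass

# Hypothetical policy: only these roles may see unmasked values.
FULL_ACCESS_ROLES = {"administrator"}


@dataclass(frozen=True)
class Client:
    full_name: str
    card_number: str


def mask_card(card_number: str) -> str:
    """Star out everything except the last four characters (assumed rule)."""
    if len(card_number) <= 4:
        return "*" * len(card_number)
    return "*" * (len(card_number) - 4) + card_number[-4:]


def read_client(row: Client, role: str) -> Client:
    """Return the row as the given role is allowed to see it.

    The stored row is never modified: masking is applied to the result only.
    """
    if role in FULL_ACCESS_ROLES:
        return row
    return Client(full_name=row.full_name, card_number=mask_card(row.card_number))


stored = Client(full_name="Ivan Ivanov", card_number="4111111111111111")
admin_view = read_client(stored, "administrator")
operator_view = read_client(stored, "call_center")

print(admin_view.card_number)     # the real value
print(operator_view.card_number)  # the masked value
print(stored.card_number)         # the stored value is still real
```

Because the real value is still in storage, anyone who obtains the database files directly bypasses this filter, which is the limitation noted above.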

Imagine how intelligence agencies declassify an archival document: before handing it over to historians (developers), an officer makes a photocopy, blacks out the real names of the agents with a marker, and writes fictitious ones over them. The historians receive a document they can work with, but the real agents cannot be identified even if the document is stolen. That is how static masking works.

The data itself can be overwritten using different methods:

a) Replacement (substitution): a value is replaced with another one taken from a pre-prepared dictionary. This requires extensive dictionaries (thousands of names and cities) to avoid unnatural repetitions. It works like an actor's or writer's pseudonym: the person is the same, but the credits show a different name.

b) Shuffling (permutation) of real values within a column. The system takes the "Salary" column of all employees and randomly swaps the values between rows. This preserves the column's statistics (sum, mean, distribution), which is critical for financial analytics, but it is unsafe for small samples: if a department has only a Director and an Intern, swapping their salaries hides nothing. It's like shuffling a deck of cards: the cards are the same, but now they're in the hands of different players.

c) Nulling or truncation, which replaces data with NULL or a mask. This allows quick masking of fields that aren't critical for tests, or hiding only part of a value (e.g., a card's PAN).

d) Deterministic masking is the most complex method and the most important one for integration testing. In a microservice architecture, customer data is spread across different databases (CRM, Billing, Logistics). If each database is masked randomly, the links between them (foreign keys) break. The masking system therefore uses an algorithm with a secret "salt" that always produces the same output for the same input.
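The four methods can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production masking tool: the dictionary `LAST_NAMES`, the key `SECRET_SALT`, and all function names are hypothetical, and HMAC-SHA256 is just one reasonable choice of a keyed function for the deterministic case.

```python
import hashlib
import hmac
import random

# Hypothetical dictionary and secret. A real setup would use a much larger
# dictionary and keep the key in a secrets manager, not in source code.
LAST_NAMES = ["Smith", "Jones", "Brown", "Taylor", "Wilson"]
SECRET_SALT = b"demo-only-secret"


def substitute(rng: random.Random) -> str:
    """Replacement: pick an unrelated value from a dictionary."""
    return rng.choice(LAST_NAMES)


def shuffle_column(values: list, rng: random.Random) -> list:
    """Shuffling: the same values, randomly reassigned to different rows."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def truncate(pan: str) -> str:
    """Truncation: keep only the last four characters of a card number."""
    return "*" * max(len(pan) - 4, 0) + pan[-4:]


def pseudonymize(real_value: str) -> str:
    """Deterministic masking: a keyed hash gives the same output for the same input."""
    digest = hmac.new(SECRET_SALT, real_value.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]


rng = random.Random(42)  # fixed seed only to make the demo repeatable

salaries = [250_000, 90_000, 120_000, 60_000]
shuffled = shuffle_column(salaries, rng)

# Two services mask their own tables independently, yet the join key matches.
crm_row = {"email": pseudonymize("ivanov@mail.ru"), "last_name": substitute(rng)}
billing_row = {"email": pseudonymize("ivanov@mail.ru"), "pan": truncate("4111111111111111")}

print(sorted(shuffled) == sorted(salaries))      # True: same values, so same sum and mean
print(crm_row["email"] == billing_row["email"])  # True: the link between tables survives
print(billing_row["pan"])                        # ************1111
```

Without the secret key, an attacker cannot recompute the pseudonym for a known email and look it up in the masked data, which is why the salt has to be kept out of the test environment.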

To conclude the review, here is an example using postgresql_anonymizer: masking the clients table for developers.

1) Connecting the extension:

CREATE EXTENSION IF NOT EXISTS anon CASCADE;

SELECT anon.init();

2) Declaring a masking strategy (declarative approach), replacing surnames with random values from a dictionary (Replacement Method):

SECURITY LABEL FOR anon ON COLUMN clients.lastname

IS 'MASKED WITH FUNCTION anon.fake_last_name()';

3) We generate a pseudo-email based on the ID, which ensures that the same id (for example, 5) always receives the same email:

SECURITY LABEL FOR anon ON COLUMN clients.email

IS 'MASKED WITH FUNCTION anon.pseudo_email(clients.id)';

4) We completely remove passport data (nulling, i.e. replacing the value with NULL):

SECURITY LABEL FOR anon ON COLUMN clients.passport_num

IS 'MASKED WITH VALUE NULL';

5) For phone numbers, we keep the country code and the last two digits and hide the rest (partial masking):

SECURITY LABEL FOR anon ON COLUMN clients.phone

IS 'MASKED WITH FUNCTION anon.partial(phone, 2, ''*******'', 2)';

6) In-place anonymization is then run on a copy of the database:

SELECT anon.anonymize_database();

As a result, a row that used to look like Ivanov / ivanov@mail.ru / 4500 123456 / +79031234567 becomes something like Smith / user_84@example.com / NULL / +7*******67.
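To see where the masked phone number comes from, here is a small Python analogue of the partial masking rule used in step 5. It is not the extension's own code: the `partial` function below is a hypothetical re-implementation that only mirrors the behavior described above (keep a prefix and a suffix, put fixed padding between them).

```python
def partial(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Keep `prefix` leading and `suffix` trailing characters, with `padding` between."""
    head = value[:prefix]
    tail = value[-suffix:] if suffix > 0 else ""
    return head + padding + tail


masked_phone = partial("+79031234567", 2, "*******", 2)
print(masked_phone)  # +7*******67
```

Note that the padding has a fixed length, so the masked value does not reveal how many characters were hidden.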

Data masking technology is the only mature answer to the challenge of finding a balance between security and rapid development. By moving from copying to anonymization, companies create a "digital sandbox" where developers can safely build castles, break down walls, and experiment without risking the collapse of their real business.

In the era of digital transparency and strict regulation (Federal Law No. 152-FZ, Order No. 140), the use of raw data in testing is becoming unacceptable: test environments filled with real personal data have become toxic and explosive assets, capable of sinking a business under the weight of turnover fines. But implementing static, deterministic masking in the CI/CD pipeline is more than just legal compliance. It's a strategic advantage that allows you to be faster, more flexible, and more secure than your competitors.