How to manage a flock of sheep with one dog, or current approaches to configuring network equipment

30.05.2022

Ruslan Rakhmetov, Security Vision

In this article we discuss how to configure a large fleet of network equipment: specifically, how to produce the configuration files for that equipment and verify that they are correct. This is not an easy task, but progress is possible. First, let's ask ourselves: what counts as a large network infrastructure, and what scale of workload are we talking about?

Information about large network infrastructures is hard to come by. In practice, no owner of a large network or data centre will publish its architecture, for information security reasons. That doesn't stop us from making an educated guess. Let's look at a brief description of the infrastructure behind the Odnoklassniki website (https://habr.com/ru/company/odnoklassniki/profile/).

Odnoklassniki is the largest entertainment network in the Runet and one of the most visited sites in the world. The site is visited by tens of millions of users every day, who view feeds and videos, listen to music, correspond with relatives and friends. This user activity puts a tremendous strain on our systems.

Under the bonnet, Odnoklassniki has more than 10,000 servers, a powerful fault-tolerant infrastructure spread across three data centres.

Another article tells us a bit more about the infrastructure itself: https://habr.com/ru/company/odnoklassniki/blog/115881/.

The network is divided into internal and external networks, which are physically separate. Different server interfaces are connected to different switches and operate in different networks. On the external network, web servers communicate with the world. On the internal network, all servers communicate with each other.

The topology of the internal network is a star. Servers are connected to L2 switches (access switches). These switches are connected by at least two gigabit links to a stack of routers. Each link goes to a different switch in the stack.

Aggregation switches are connected by 10Gb links to the root routers, which provide both connectivity between data centres and connectivity to the outside world.

The switches and routers used are from Cisco. For communication with the outside world we have direct connections with several major telecom operators.

Based on this information, we can assume that Odnoklassniki data centres have a standard network architecture. The diagram is shown below:

Now we can estimate the approximate amount of network equipment. We know that Odnoklassniki has 10,000 servers. Assuming 40 servers per rack (and enough power in the rack to keep those 40 servers running), that gives 10,000/40 = 250 racks. Each rack has 2 TOR (top-of-rack) switches with 48 ports each, so we conclude that there are 500 TOR switches.

Next comes the aggregation layer: each of the 500 TOR switches has two uplink ports, one to each of 2 aggregation switches, for 1000 uplinks in total. Let's assume each aggregation switch has 40 ports, hence 1000/40 = 25 aggregation switches.

These 25 aggregation switches are connected to a stack of routers, probably 2 in each of the three data centres, for a total of 6 routers.

In total we get 500 + 25 + 6 = 531 pieces of network equipment, and this is without taking into account the IPMI interfaces. That is, we can assume that Odnoklassniki has roughly 500-1000 pieces of network equipment.
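
The estimate above is simple enough to reproduce as a short calculation, which makes it easy to vary the assumptions. The Python sketch below is illustrative only: the function name `estimate_devices` and its parameters are made up for this example, and every input is one of the guesses stated above (servers per rack, uplinks per TOR switch, ports per aggregation switch, routers per data centre), not a published figure.

```python
import math


def estimate_devices(servers, servers_per_rack, tor_per_rack,
                     uplinks_per_tor, ports_per_aggregation,
                     routers_per_dc, data_centres):
    """Return a rough per-layer device count for a three-tier network."""
    racks = math.ceil(servers / servers_per_rack)
    tor = racks * tor_per_rack
    # Every TOR uplink occupies one port on an aggregation switch.
    uplinks = tor * uplinks_per_tor
    aggregation = math.ceil(uplinks / ports_per_aggregation)
    routers = routers_per_dc * data_centres
    return {
        "racks": racks,
        "tor": tor,
        "aggregation": aggregation,
        "routers": routers,
        "total": tor + aggregation + routers,  # racks are not devices
    }


# Inputs are the article's assumptions, not measured values.
estimate = estimate_devices(
    servers=10_000,
    servers_per_rack=40,
    tor_per_rack=2,
    uplinks_per_tor=2,
    ports_per_aggregation=40,
    routers_per_dc=2,
    data_centres=3,
)
for layer, count in estimate.items():
    print(f"{layer}: {count}")
```

Changing a single input, for example `servers_per_rack`, shows how sensitive the total is to each guess.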

Good, we now have a sense of the size of the infrastructure. Now let's work out how to ensure that the network operates correctly (in line with the logic dictated by the business requirements) and without interruption. To do that, we need some way to make sure that the configuration file on every piece of network equipment is correct.

Automation is key to this task, but there are many pitfalls. We may encounter the following difficulties in accomplishing this task:

The configuration file is unstructured. A router or switch configuration file can run to several thousand lines of text that is not machine-readable, so the configuration logic has to be explained to the program.
There is no unified logic for grouping settings. Network equipment settings can be divided into services that together implement the architecture of the entire network, but because the text is unstructured and related settings are not grouped together, doing this by hand is difficult.
There is no unified approach to configuring equipment. The equipment has been configured over many years and added to piecemeal, and each network engineer has their own idea of how it should be configured.

How do we approach solving these problems?

Consider the first problem: the configuration file is unstructured. An example is shown below:

As we can see, the text itself does not reveal what is a number, what is a string, or what a given piece of text refers to. Before any analysis can start, the configuration file has to be marked up so that the program can interpret and structure the data. The result of this first iteration is structured text that the program can work with.
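
To make the idea of marking up a configuration more concrete, here is a minimal Python sketch. It assumes an indentation-based, Cisco IOS-style text format in which indented lines belong to the preceding unindented line and `!` lines are separators. The sample configuration, the device name and the helper names `parse_config` and `classify` are invented for illustration. Real configurations contain many constructs (banners, multi-level nesting) that this sketch does not handle.

```python
import ipaddress

# Hypothetical sample, not taken from any real device.
SAMPLE_CONFIG = """\
hostname edge-sw-01
!
interface GigabitEthernet0/1
 description uplink to aggregation
 switchport mode trunk
!
interface Vlan10
 ip address 10.0.10.1 255.255.255.0
!
ntp server 10.0.0.5
"""


def parse_config(text):
    """Group lines into blocks: one parent line plus its indented children."""
    blocks = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.lstrip().startswith("!"):
            continue  # skip blank lines and '!' separators
        if line[0].isspace() and blocks:
            blocks[-1]["children"].append(line.strip())
        else:
            blocks.append({"line": line.strip(), "children": []})
    return blocks


def classify(token):
    """Label a token as a number, an IP address/mask, or a plain word."""
    if token.isdigit():
        return "number"
    try:
        ipaddress.ip_address(token)
        return "ip"
    except ValueError:
        return "word"


blocks = parse_config(SAMPLE_CONFIG)
for block in blocks:
    print(block["line"], "->", block["children"])

typed = [(tok, classify(tok)) for tok in blocks[2]["children"][0].split()]
print(typed)
```

The output of such a step is a list of blocks with typed tokens, which is the kind of structured text that later stages can query.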

The next problem is that there is no common logic for grouping settings. We don't need to compare all the settings of all network equipment at once: that is a lot of text, and we rarely need the whole configuration file. Usually we are interested in one service. Suppose we want to see how the VPN between offices is configured. We cannot simply take, say, lines 5 to 10 and be sure that they describe the VPN. Firstly, the relevant settings are scattered across different parts of the configuration file. Secondly, their positions are not fixed and can 'move' as the configuration file grows or shrinks. This task requires network hardware expertise and close interaction with the developer. As a result of this iteration, we will be able to isolate the part of the text that relates to a specific service, here the VPN. It is important to realise that even with a unified approach to VPN configuration, the settings on different devices will not be exactly the same. This needs to be taken into account as well.
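
One way to isolate a service without relying on line numbers is to select whole configuration blocks by content. The Python sketch below assumes the same indentation-based, IOS-style format and a hand-written list of patterns that a network engineer considers VPN-related. The patterns, the sample configuration and the function names are assumptions made for this example, not a complete definition of a VPN service.

```python
import re

# Hypothetical patterns; a real list would come from a network engineer.
VPN_PATTERNS = [
    re.compile(p)
    for p in (r"^crypto\b", r"^interface Tunnel\d+", r"^tunnel\b")
]

# Hypothetical sample, not taken from any real device.
SAMPLE_CONFIG = """\
hostname branch-rtr-01
!
crypto isakmp policy 10
 encryption aes 256
 authentication pre-share
!
interface GigabitEthernet0/0
 ip address 192.0.2.1 255.255.255.0
!
interface Tunnel0
 ip address 10.255.0.1 255.255.255.252
 tunnel source GigabitEthernet0/0
 tunnel destination 198.51.100.7
!
ntp server 10.0.0.5
"""


def split_blocks(text):
    """Split a config into blocks: [parent line, child line, ...]."""
    blocks = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line or line.strip().startswith("!"):
            continue  # skip blank lines and '!' separators
        if line[0].isspace() and blocks:
            blocks[-1].append(line.strip())
        else:
            blocks.append([line.strip()])
    return blocks


def extract_service(text, patterns):
    """Keep every block in which at least one line matches a pattern."""
    return [
        block for block in split_blocks(text)
        if any(p.search(line) for line in block for p in patterns)
    ]


vpn_blocks = extract_service(SAMPLE_CONFIG, VPN_PATTERNS)
for block in vpn_blocks:
    print("\n ".join(block))
```

Because blocks are selected by what they contain, the result stays the same when unrelated lines are added or removed elsewhere in the file.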

Another problem is the lack of a unified approach to configuring equipment, which makes it difficult to choose a reference configuration. Here we can analyse the settings of all network devices involved in a given service, for example the same VPN. Suppose the results show that 40% of devices are configured one way, 20% a second way, 15% a third way and 5% a fourth way, with the remainder being one-off variants. We can then take the first, most common method as the reference and roll it out to the rest of the fleet.
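
Choosing the most common variant can be sketched as a simple frequency count. The Python example below assumes that the service-specific lines have already been extracted per device, and that device-specific values (here only IPv4 addresses) are masked before comparison so that equivalent setups count as the same variant. The device names, the fragments and the functions `normalise` and `pick_reference` are hypothetical.

```python
import re
from collections import Counter

IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def normalise(lines):
    """Mask IPv4 addresses so per-device values do not split variants."""
    return tuple(IPV4.sub("<ip>", line.strip()) for line in lines)


def pick_reference(fragments_by_device):
    """Return the most common variant, its share, and non-matching devices."""
    counts = Counter(normalise(lines) for lines in fragments_by_device.values())
    variant, hits = counts.most_common(1)[0]
    share = hits / len(fragments_by_device)
    outliers = sorted(
        device for device, lines in fragments_by_device.items()
        if normalise(lines) != variant
    )
    return variant, share, outliers


# Hypothetical VPN fragments for five devices.
fleet = {
    "rtr-01": ["crypto isakmp policy 10", "encryption aes 256",
               "tunnel destination 198.51.100.1"],
    "rtr-02": ["crypto isakmp policy 10", "encryption aes 256",
               "tunnel destination 198.51.100.2"],
    "rtr-03": ["crypto isakmp policy 10", "encryption aes 256",
               "tunnel destination 198.51.100.3"],
    "rtr-04": ["crypto isakmp policy 10", "encryption 3des",
               "tunnel destination 198.51.100.4"],
    "rtr-05": ["crypto isakmp policy 20", "encryption aes 128",
               "tunnel destination 198.51.100.5"],
}

variant, share, outliers = pick_reference(fleet)
print(f"reference variant used by {share:.0%} of devices")
print("devices to bring in line:", outliers)
```

The most common variant is not automatically the best one, so in practice an engineer would still review the candidate before it is adopted as the reference.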

As a result, we have a reference configuration file that we can keep up to date and against which we can monitor compliance across the entire network infrastructure. This approach allows us to automate checking that network equipment is configured correctly.
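
Monitoring compliance against the reference can be as simple as a line-by-line diff. The sketch below uses Python's standard `difflib` module and assumes that both the reference and the device fragment have already been extracted and normalised. The fragments, the device name and the function `compliance_diff` are illustrative only.

```python
import difflib


def compliance_diff(reference, actual, device):
    """Return unified-diff lines; an empty list means the device complies."""
    return list(difflib.unified_diff(
        reference, actual,
        fromfile="reference", tofile=device,
        lineterm="",
    ))


# Hypothetical normalised fragments.
reference = [
    "crypto isakmp policy 10",
    "encryption aes 256",
    "tunnel destination <ip>",
]
device_fragment = [
    "crypto isakmp policy 10",
    "encryption 3des",
    "tunnel destination <ip>",
]

drift = compliance_diff(reference, device_fragment, "rtr-04")
if drift:
    print("rtr-04 deviates from the reference:")
    print("\n".join(drift))
else:
    print("rtr-04 is compliant")
```

Run on a schedule across the whole fleet, a check like this would flag any device whose configuration drifts from the reference.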