Soar: Simulation for Observability, reliAbility, and secuRity
Presented by: Yan Zhai
Originally aired on July 5, 2022 @ 12:00 PM - 12:30 PM EDT
Tune in for a discussion on building a test bed for hundreds of Cloudflare services.
Transcript (Beta)
Hello everyone. My name is Yan. I'm a system engineer at Cloudflare. Today I'm going to talk about SOAR, Simulation for Observability, Reliability, and Security.
This is a newly built testbed at Cloudflare and we feel it's interesting to share the experience and story with it, how we build it, and how we use it.
I will take questions at the end of the presentation and if I do not have time to answer all of them, they will be emailed to me and I'll answer them afterwards.
So feel free to send any of your questions and we will answer them afterwards.
So first of all let's look at the testing at Cloudflare.
As we know, Cloudflare operates a massive edge network.
As of the fourth quarter of 2020, we already hosted more than 25 million Internet properties, and we span more than 200 cities and more than 100 countries.
We serve a large number of requests every day and intercept a lot of cyber attacks.
This basically means that there's no small error at our scale.
Anything you do wrong could accidentally be magnified to a very large scale.
That's undesired. So we need to be very careful with testing.
Regarding the normal release of our products: when we perform tests during these release cycles, the teams usually first do their development work and test things in their development environment.
Then they use the CI pipeline to build the new release and run unit tests and integration tests.
The release then goes into a lab environment. This is a production-grade environment that is dedicated to each of the teams.
So each team has dedicated servers on which to build its own further testing facilities for ensuring performance, reliability, and so on.
After the lab environment, we deploy the new products into dogfooding datacenters.
These datacenters only serve Cloudflare internal employees. Our web browsing and so on all goes through those datacenters.
The dogfooding phase makes sure that Cloudflare tests all of its products before they are exposed to customers.
After dogfooding, we move the products to a few canary datacenters.
These canary datacenters have a small number of servers, but with live customer traffic.
There we observe, and if nothing bad happens, we start a slow process of releasing the new version to the rest of the world by tiers.
There are several tiers.
Lower tiers have fewer servers and less impact, and tier one is the most critical datacenters that we have.
So this is a gradual process. Along this release cycle, one of the problems we encounter is this: the dev and CI pipelines are fine when you run unit tests and integration tests, but we need a better way to organize our lab.
In the past, the lab was dedicated to each of the teams.
So each team exclusively owned the resources there. They could build what they liked.
But it turns out this is not very efficient as the number of teams grows. Cloudflare is growing fast.
So the number of teams definitely outpaces the lab environment.
And we need to find a way to make it more efficient. If we do not do proper lab testing, there will be problems, because dogfooding and canary tests are just serving live traffic.
So there is no guarantee that they cover enough of the code paths.
That can be a potential problem. We do not want to have any issues once we have decided to release to the rest of the world.
And, well, not that recently, but we have more and more products that have datacenter-specific configurations.
A certain customer may only be served in certain datacenters.
And that means that in the lab, many of our previous tests may not cover this use case.
That's what motivated us to think that we actually need to design a new testbed.
This testbed needs to be scalable, and it should be as close as possible to our production environment.
And it should not have any customer impact.
Against this background, we designed SOAR. It's a simulation platform designed to run complex tests, performance tests, and security tests, all of which are hard to perform in the CI environment.
So basically, SOAR simulates a mini Cloudflare.
The design is that we build a dedicated datacenter to host a bunch of Cloudflare servers.
All the servers have the same specification as what we have in production.
What makes this possible is that the Cloudflare edge architecture is actually pretty homogeneous:
most of our servers run the same set of software and configurations.
Basically, that means once you have a server provisioned and synced up to date, it is ready to serve traffic as long as the traffic arrives through the network stack.
On the server, we do not need to distinguish whether traffic comes from the Internet or from our simulation environment.
So we divide the servers into several categories.
The eyeballs simulate browsers and cell phones. They send traffic to the product servers, which simulate our edge network.
There are also mock servers running as customer origins, which are basically all kinds of websites.
This datacenter is pretty much dedicated,
and we intentionally do not have any customer traffic going through it.
So if it fails at some point, or there are any errors, there will be no impact on the Internet.
That's what we want. And instead of dedicating the environment to each team, we built a service to share the environment.
Engineering teams use it in their own time slots.
They submit jobs when they need to and have those jobs run in specific slots, instead of checking out a machine and not using it 90% of the time, which is a waste of resources.
The SOAR design is pretty straightforward. It's not something new, but something necessary at Cloudflare.
So we have a coordinator, which is a Kubernetes service.
This coordinator is engineer-facing. Engineers can configure simulations, either a one-time simulation submitted to test a hot change, or a repeating simulation used for monitoring purposes,
for example a scenario where we make sure a feature never fails.
The coordinator assigns all of these simulations to per-server agents.
These agents are the actors that actually run the simulation tasks.
A task can be as simple as a hello world, or it can be a very complex benchmark and setup.
While an agent runs a task, it collects metrics and reports them to our metric store, using Prometheus and Alertmanager.
This gives engineering teams a way to program how they want to receive the results of these simulations.
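The coordinator and agent split described here can be sketched roughly as follows. This is only an illustration: the class names, the fields, and the round-robin assignment are assumptions made for the example, not details of SOAR itself.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import List, Optional


@dataclass
class Simulation:
    name: str
    command: str
    # None means a one-time simulation; a number means "repeat every N seconds".
    repeat_every_s: Optional[int] = None


@dataclass
class Agent:
    host: str
    assigned: List[Simulation] = field(default_factory=list)


class Coordinator:
    """Hands each submitted simulation to one per-server agent.

    Round-robin is an assumption for this sketch; the talk does not say
    how simulations are mapped to agents.
    """

    def __init__(self, agents: List[Agent]):
        self._next_agent = cycle(agents)

    def submit(self, sim: Simulation) -> Agent:
        agent = next(self._next_agent)
        agent.assigned.append(sim)
        return agent


agents = [Agent("eyeball-1"), Agent("eyeball-2")]
coordinator = Coordinator(agents)
coordinator.submit(Simulation("hello-world", "echo hello"))
coordinator.submit(Simulation("feature-check", "run-check", repeat_every_s=3600))
```

In this sketch a one-time job and a repeating job differ only in whether `repeat_every_s` is set, which keeps the submission path the same for both kinds of simulation.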
Through the process of designing the system, we found a few capabilities that people should be aware of when they design a similar system.
The first one is network simulation.
You want to build your testbed so that your service processes just process all the requests, no matter whether they come from the Internet or from a simulated environment.
That's important. You should not distinguish that.
The second point is about logs and metrics. You definitely want to keep long-term log storage and metric storage.
One very important thing to remember is that you want to tag the metrics specifically for each of the tasks that you run.
When you have a lot of test tasks running in an environment, your metric store can sometimes cause problems for you.
For example, when you configure Prometheus to monitor time series metrics, you might encounter inaccuracy in these metrics, because Prometheus has a scraping interval.
If you run multiple tasks between two scrape points, you'll have a very hard time correlating them later:
given this resource consumption, how do I attribute it to individual tasks?
So you need a way to either tag the metrics or find the delta for each of your task runs.
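To make the tagging idea concrete, here is a minimal sketch in plain Python. It does not use a real metrics library; `TaskMetrics`, the metric name, and the task ids are hypothetical and chosen only for illustration. The point is that the task id is part of the key, so usage from two tasks that run inside one scrape interval stays separate.

```python
from collections import defaultdict
from typing import Dict


class TaskMetrics:
    """Keeps one running total per (metric name, task id) pair."""

    def __init__(self) -> None:
        self._values = defaultdict(float)

    def add(self, metric: str, task_id: str, amount: float) -> None:
        self._values[(metric, task_id)] += amount

    def for_task(self, task_id: str) -> Dict[str, float]:
        # Return only the metrics recorded under this task's tag.
        return {m: v for (m, t), v in self._values.items() if t == task_id}


metrics = TaskMetrics()
# Two tasks interleave inside what would be a single scrape interval.
metrics.add("cpu_seconds", "task-a", 1.5)
metrics.add("cpu_seconds", "task-b", 4.0)
metrics.add("cpu_seconds", "task-a", 0.5)
```

Without the task id in the key, the three updates would collapse into one number and could not be attributed afterwards.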
Scheduling in this environment can be relaxed.
We do not actually need a very complex scheduling engine, because we can assume the scheduling is cooperative.
So jobs will finish in time and will not intentionally do bad things.
We make this work because we constantly review what's running in the environment and audit what goes wrong.
And you also need to educate your engineers about this.
But even in the worst case, this data center does not have any customer traffic,
so if anything unintentionally breaks the environment, we can still recover it without causing any customer impact.
Version management is another topic that is very, very challenging.
When you build a testbed, you naturally have to integrate it with all stages of your development pipeline.
The most common question from our engineering teams about using it is: how do I control the version?
Can I run multiple versions at the same time? How do I compare them when I want to run some tasks?
So you definitely need tight integration with version management.
Similarly, configuration management is something you very much need: you want to test things in an up-to-date scenario.
You do not want to accidentally have an outdated environment while production is far ahead in configuration.
To achieve this, there are many configuration management systems you can use, like SaltStack, Puppet, Chef, Ansible, and so on,
so that you just sync your testing environment with your production environment.
Make sure they are always the same. And lastly, you definitely want to lock down your system
so engineers cannot log into the servers. One problem we had a lot in the past with our lab environment was that people could freely log into the servers there to test their things.
They made some unrecoverable changes, which would suddenly break others' tests.
That happened very often. So you want to lock down your system to avoid any kind of human problems.
As for the actual implementation of SOAR, going through this list, the first item is network simulation.
We place the SOAR environment in the same VLAN so that we can use the Linux routing system and hostname system to simulate IP-layer and higher-layer connectivity.
We also have the flexibility to simulate certain failures. For example, we can mark a link as down and check whether, in that situation, we can use another host for load balancing, and so on.
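The link-failure check can be illustrated with a small sketch. Everything here is hypothetical: the host names, the `link_up` map, and the byte-sum hashing are stand-ins for the example, not how the real load balancing or link marking works.

```python
from typing import Dict, List, Optional


def pick_host(hosts: List[str], link_up: Dict[str, bool], key: str) -> Optional[str]:
    """Pick a host for `key` among the hosts whose link is currently up."""
    healthy = [h for h in hosts if link_up.get(h, False)]
    if not healthy:
        return None
    # Deterministic toy hash for illustration; real balancers hash differently.
    return healthy[sum(key.encode()) % len(healthy)]


hosts = ["edge-1", "edge-2", "edge-3"]
link_up = {h: True for h in hosts}

before = pick_host(hosts, link_up, "flow-42")
link_up[before] = False          # simulate the failure: mark that link as down
after = pick_host(hosts, link_up, "flow-42")
```

The test scenario then asserts that `after` is a different, still-healthy host, which is the behavior the simulated failure is meant to exercise.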
For logs, we store them as objects in our internal S3 storage.
We configure retention on these logs, and right now we feel it's already useful for us to analyze events.
For metrics, we mainly use Prometheus to monitor the currently running tasks.
We also built a customized data exporter for correlation: for a task run, engineers can specify some metrics, and the exporter calculates the delta before and after the run,
so it accurately reports what happened during the process.
It's a pretty simple exporter, roughly 200 to 400 lines of code.
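The before-and-after idea behind such an exporter can be sketched in a few lines. This is not the actual exporter: `run_with_deltas`, the in-memory `counters` dictionary, and the metric names are assumptions used to show the technique of snapshotting counters around a task run.

```python
from typing import Callable, Dict, Iterable


def run_with_deltas(
    task: Callable[[], None],
    read_counters: Callable[[], Dict[str, float]],
    wanted: Iterable[str],
) -> Dict[str, float]:
    """Run `task` and return how much each wanted counter grew during the run."""
    wanted = list(wanted)
    before = read_counters()
    task()
    after = read_counters()
    return {name: after[name] - before[name] for name in wanted}


# Hypothetical in-memory counters standing in for real server metrics.
counters = {"packets_dropped": 100.0, "bytes_sent": 5000.0}


def read_counters() -> Dict[str, float]:
    return dict(counters)  # copy, so the "before" snapshot is not mutated later


def fake_task() -> None:
    counters["packets_dropped"] += 7
    counters["bytes_sent"] += 1200


deltas = run_with_deltas(fake_task, read_counters, ["packets_dropped"])
```

Because the snapshots are taken exactly at task start and end, the result does not depend on where the scrape interval happens to fall.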
Another important feature we have is that we generate flame graphs during task runs.
This is important for analysis after the run.
A key challenge in performance testing is that it is sometimes hard to reproduce the exact context.
So by generating the flame graph and related statistics on the fly, you have somewhere to begin investigating a past problem without having to reproduce it first.
That's a very critical thing.
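As a rough illustration of what gets kept from a run, the sketch below collapses stack samples into the "folded" text form (frames joined by semicolons, followed by a count) that common flame graph tools take as input. How SOAR actually samples and stores stacks is not described in the talk; the sample data and function names here are made up.

```python
from collections import Counter
from typing import Iterable, List


def fold_stacks(samples: Iterable[List[str]]) -> List[str]:
    """Collapse stack samples (root frame first) into 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]


# Hypothetical samples captured while a task was running.
samples = [
    ["main", "handle_request", "parse_headers"],
    ["main", "handle_request", "parse_headers"],
    ["main", "handle_request", "encrypt"],
]
folded = fold_stacks(samples)
```

Storing the folded lines alongside the task's logs is enough to render the graph later, without rerunning the task.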
The scheduling of tests is priority-based.
It's a simple scheduler that we built ourselves. It's also a very small thing, only 300 lines of code.
We feel it's already sufficient when we want to run urgent fixes or test something earlier.
Users specify a job budget when they configure their test jobs. We guide them on how long to configure it, so that jobs are not killed and are tuned properly.
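A priority queue plus a per-job time budget is enough to sketch this kind of scheduler. The code below is an illustration under assumptions: the names, the "lower number runs first" convention, and the budget check are invented for the example and are not taken from SOAR's scheduler.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(order=True)
class Job:
    priority: int                        # lower number runs first (assumed convention)
    seq: int                             # tie-breaker: submission order
    name: str = field(compare=False)
    budget_s: float = field(compare=False)


class PriorityScheduler:
    def __init__(self) -> None:
        self._queue: List[Job] = []
        self._seq = itertools.count()

    def submit(self, name: str, priority: int, budget_s: float) -> None:
        heapq.heappush(self._queue, Job(priority, next(self._seq), name, budget_s))

    def next_job(self) -> Optional[Job]:
        return heapq.heappop(self._queue) if self._queue else None


def over_budget(job: Job, elapsed_s: float) -> bool:
    """True when a running job has used up its budget and should be killed."""
    return elapsed_s > job.budget_s


sched = PriorityScheduler()
sched.submit("nightly-benchmark", priority=5, budget_s=1800)
sched.submit("urgent-fix-check", priority=0, budget_s=300)
sched.submit("weekly-soak", priority=5, budget_s=7200)

order = []
while (job := sched.next_job()) is not None:
    order.append(job.name)
```

The urgent job jumps the queue, while jobs of equal priority run in submission order because of the `seq` tie-breaker.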
Version management adapts to our current internal software.
We have a release manager to control the version.
So we integrated with that. We're also integrating SOAR into our CI pipeline so that engineers can build a fresh package and deploy that version into our testing environment.
For config management, we didn't have to change much to implement SOAR, because Cloudflare already uses SaltStack to manage our edge configurations.
We just reuse that, nothing special, and keep running highstate frequently so the configs are up to date.
SREs make sure these highstates complete successfully. If there are any problems, we investigate what caused the highstate failure and make sure the environment is actually up to date with production.
Locking down the system is even simpler: we do not hand out SSH access.
And as I said, SREs maintain the data center,
so it is held to the same standards as our production data centers, which helps us perform those tests.
That's what I wanted to share about the SOAR design and implementation.
Now let's see some use cases that we had.
Engineers pretty much build a monitor for a test scenario on SOAR.
There are several categories of things they care about. As the title suggests, the first is observability.
Here they want to use end-to-end benchmarks and synthetic workloads to test the service in a black-box manner, but also watch whether any metrics can help them determine whether a certain test succeeded or not.
This is an iterative process. When they onboard new tests, they check whether the metrics are already sufficient.
If not, they iteratively add more metrics and run the tests again to see what can actually help them assert the quality of the software.
Once that's enough, they just deploy the tests, leave them running, and monitor them as a daily operational practice.
For reliability, this environment serves to run complex acceptance tests, and also to exercise events that are hard to test in a normal test scenario.
For example, one thing we really care about is how the software behaves when you push the server workload to a red line.
That gives us confidence that things still operate correctly under high or even extreme utilization.
Or at least we can verify that our servers are not taken down by high utilization.
Security is an interesting category that we are still exploring.
We want to run more penetration tests and other tests on ourselves.
Cloudflare is a DDoS mitigation provider, so we care a lot about the efficiency of our mitigation software.
In this environment, we can actually examine the mitigation efficiency of individual servers.
As I mentioned before, Cloudflare has a homogeneous edge architecture.
That means once you have a result on one of the servers, you can usually extrapolate how it works out at a larger scale.
So what we do is aim existing known attack traffic at our own system and test how well we drop it.
As a concrete example, I want to share some simulations that we built around one of our products.
It's called Magic Transit. Magic Transit is an IP-layer product.
Customers bring their IP addresses to us and Cloudflare advertises them over BGP, so that when the customer's customers, the eyeballs, try to send traffic to our customer, the traffic is absorbed into Cloudflare data centers.
The data center performs all sorts of processing, like mitigating DDoS attacks and dropping packets based on customer-configured firewalls.
Once this processing finishes, we encapsulate the traffic and deliver it over a GRE tunnel established with the customer.
This is clean traffic,
so the customer receives it and processes it as part of normal business. When this product was launched,
we did GRE tunnels from all over the world.
It turns out we can actually do anycast GRE to our customers. It works fine because every data center is the same.
But then, as the product gained new features, we started to have PNI connections with our customers.
On our website, it may be labeled as CNI, but it's the same thing.
It's a physical network interconnect with the customer.
For example, in the diagram here, we may have a customer data center in Paris, and it peers only with the Cloudflare data center in Paris.
That means when the eyeballs send traffic to this Paris customer, our Paris data center becomes the choke point.
It is different from our other data centers.
Recall our release pipeline from earlier. This introduces some difficulty in testing this kind of configuration, because our dogfooding and canary test phases will not cover any aspect of our software in this data center, since they are not in Paris.
So in this case, we break the scenario down into two parts.
I'm only showing you the second part, which is the PNI phase.
We set up the PNI server with the PNI-specific configurations, use simulated eyeballs to send traffic, and then encapsulate it in GRE and deliver it to the simulated origins.
Here the PNI servers have basically the same configuration as our PNI data center, so that engineers have a place to check whether their PNI configuration and implementation are correct and perform properly.
For the test cases, we feed TCP and UDP flows for performance-related bandwidth and latency checks.
We also check MTUs. MTU is important because when you encapsulate packets, you frequently run into MTU issues.
MTU discovery on the Internet is broken.
So we run that. We also simultaneously feed ACK floods, SYN floods, and other attacks through the servers,
to make sure the PNI servers also mitigate these attacks while serving normal traffic, and so on.
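The MTU issue mentioned in this test case comes down to simple arithmetic, worked through below. The numbers assume plain GRE over IPv4 with a 4-byte GRE header (no key, checksum, or sequence fields) and no IP or TCP options; the 1500-byte link MTU is the common Ethernet value, not a figure from the talk.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
GRE_HEADER = 4     # bytes, basic GRE header without optional fields
TCP_HEADER = 20    # bytes, TCP header without options


def inner_mtu(link_mtu: int) -> int:
    """Largest inner packet that fits in one GRE-over-IPv4 packet."""
    return link_mtu - IPV4_HEADER - GRE_HEADER


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload for an IPv4 packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def fits_in_tunnel(packet_len: int, link_mtu: int = 1500) -> bool:
    """True if an inner packet of this size needs no fragmentation."""
    return packet_len <= inner_mtu(link_mtu)


tunnel_mtu = inner_mtu(1500)       # 1500 - 20 - 4 = 1476
tunnel_mss = tcp_mss(tunnel_mtu)   # 1476 - 20 - 20 = 1436
```

A full-size 1500-byte packet from an eyeball therefore no longer fits once it is encapsulated, which is why a test that sends packets around this boundary is useful.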
So that's pretty much what I wanted to share today. As a recap, we built this testbed as an internal simulation platform for Cloudflare.
On this platform, we can build dedicated test scenarios to improve different aspects of our software and deliver our best to our customers.
There are more details in a blog post about this testbed.
It can be found on the Cloudflare blog. So if you are interested, please do check it out.
And if you have any questions, feel free to leave your comments.
I will answer them. And thank you for watching. Have a good day.