*APAC Heritage Month* Datacenter Automation at Cloudflare
Presented by: Jet Mariscal, Zena Soans
Originally aired on April 9, 2022 @ 5:30 AM - 6:00 AM EDT
A discussion on how Cloudflare introduced data center automation to scale our systems and increase operational efficiency.
Transcript (Beta)
Hello and a very very warm welcome to all of our viewers. Thanks for joining us today.
We will be deep diving on the topic of data center automation. I'm your host for the segment Zena.
I manage the data center engineering team within the infrastructures organization and I'm based in Singapore.
And our guest today is Jet who is a systems engineer from the infrastructure engineering team.
As a matter of fact, Jet and I are both ex-SREs, and that ties in very well with this topic, because when Jet joined Cloudflare about a year and a half ago, we were faced with the huge issue of how to speed up our provisioning process.
And when I say provisioning, that basically involves expanding our existing data centers, decommissioning the older generations, and spinning up newer ones.
To give all of our viewers a little more context, Cloudflare at present has about 200 plus points of presence across the globe.
As you all can see, Jet has presented this beautiful map.
You can also view it by going to www.cloudflare.com/network.
It lists all of our POPs, and this is just to show how widespread our footprint is.
And until automation kicked in, our SREs, our systems reliability engineers, would provision or commission all of these servers manually by running lengthy SOPs, aka our standard operating procedures, which was quite time-consuming.
It was extremely tedious and prone to errors due to all of the manual copy-pasting.
All of these servers that I'm actually talking about would originally be installed by the DCE team, the data center engineering team.
That's my team.
And they would be the ones to plan and install all of these servers and the SREs would provision them manually.
So mainly there are two distinct teams here that are involved in this process.
One is the DCE team, the data center engineering team, and the other is the SRE team, the systems reliability engineering team.
Now, with all of the scaling that has been going on, there came a point where we just could not keep up with the rate at which installations were being done, because of the need for increased capacity.
This, by the way, is also a testament to how Cloudflare's network has grown by leaps and bounds over the past years.
And as we all know, the larger the network, the better the performance and the security that we can deliver to our customers.
But in this bargain, the SREs would become the bottleneck for our provisioning pipeline.
And our pipeline would be queued up for days and days, thereby delaying our capacity turns and not allowing us to meet our targeted deadlines.
I believe this was right around the time that Jet joined Cloudflare.
And I clearly remember having this discussion with him over a very casual lunch break in our Singapore office.
And that's when he jumped in with his expertise in automation and in Airflow, which not only helped us save time and made the process error-proof,
but also enabled the data center engineering team, which is my team, to take up the responsibility of provisioning, thereby freeing up a lot of cycles for the SREs.
So let's hear a little more from Jet on this topic and on how he was able to make our lives a little easier.
Jet, what is provisioning as a service, or PRAS, as we call it?
Can you please deep dive into the goal of this tool?
What is it exactly? Yeah, sure. Thank you, Zena. So provision as a service, or PRAS as we call it internally, is the platform where provisioning operations such as expansions and decommissions are executed in an automated fashion.
The self-service nature of PRAS and its ease of use empower non-SRE teams to perform expansions and decommissions on their own without SRE involvement.
This allows us to quickly add capacity to a data center, increasing the performance and security that immediately benefits all Cloudflare customers.
We built this tool primarily to eliminate the toil, as you mentioned earlier, involved in any of those operations in the data center.
So by eliminating the toil, we're able to conduct these various provisioning activities safely and in a highly efficient manner.
We're really talking about 90% reduction in labor-intensive work.
So PRAS is written in Python, and we leverage Apache Airflow to transform the procedures into automated workflows and orchestrate their execution.
The team that has really been harnessing the power of this tool is obviously the data center engineering team, which is your team.
And actually, that's why I like it even more, because it just benefits my team more than any.
So yay for that, and thank you, Jet. I know I touched a little on this earlier, but could you let us know what the precise requirements were, and how did we end up using Airflow?
Were there any other contenders that you were thinking about?
Sure. So traditionally, SREs were the ones who conducted these operations, as you mentioned in your intro.
So for years, we have relied on our standard operating procedures to execute these operations.
That means carefully reading through each step and executing each step manually by hand.
Over the years, this approach has allowed us to successfully build and expand hundreds of data centers globally.
But we acknowledge that this has been a huge toil for SRE for quite a long period of time.
And then with Cloudflare's aggressive data center expansions, we really needed a way to address this increasing demand and come up with a solution to automate expansions, and decommissions for that matter.
So we were both SREs before, and if you're an SRE at Cloudflare, you spend a certain percentage of your time on engineering work.
So you step aside from your tactical shifts, and you focus only on writing code and automating things: for example, building various tools for operational efficiency, or working on projects that increase the reliability of the edge.
PRAS was a result of this engineering work.
It was a project that I worked on for most of my engineering time. So as for the requirements, the obvious ones were to automate our SOP, to deliver it as a tool that is easy to use, and to do it in a short period of time.
So just to elaborate on each requirement: for automating the SOP, at Cloudflare we use SaltStack to automate a lot of things and to manage our fleet of bare metal servers, routers, and switches.
And although much of the automation is really in Salt, we still have some steps that need to be executed by hand, mostly by SREs.
And with this requirement, executing a step should not require any human involvement and, more importantly, must not use SSH.
Until then, we had been using SSH to log on to a remote host where our tools are installed, and we executed them from there.
But in introducing this automation, that needed to be totally eliminated.
And then for delivering a tool that is easy to use, apart from being easy to use, the tool must be centrally accessible.
So most of our tools are CLI-based, which means that we execute them via the terminal shell.
But with these requirements, we were already considering a web-based application for this one.
And lastly, it was very challenging because we had only a short period of time.
So we were in a race against time because of the increasing demand.
And delaying it further would have increased the amount of toil for SREs as well, and would have forced us to put in more human resources to cope with the demand.
So to satisfy these requirements, we had two options.
First, we could write our own software. There's a lot of complexity involved in automating our SOP because we have to deal with lots of things, such as execution dependencies, failure handling, parallelism, orchestration, and everything else defined in our RFC.
Just so our viewers and listeners know, at Cloudflare, whenever we work on projects, we write an RFC to request feedback on a proposed solution from our team members.
So these complexities can only be addressed by using a real programming language.
And this would obviously be either Go, Python, or maybe even Rust, which are all Cloudflare favorites, by the way.
However, writing this software from scratch requires a very high level of effort, and would also have been very challenging given the time constraint that we had.
So our next option was to leverage open source software.
One of the most crucial parts of this automation is really the orchestration part.
It really requires a flawless orchestration of the execution of each step in our SOP.
So it was pretty clear, at least to me, that those requirements would be satisfied by Apache Airflow.
I was very confident that Apache Airflow could even exceed these requirements, and that the automation could really be delivered in a short period of time.
So with that, I spearheaded the project, started writing code for the SOP transformation, and later on came up with a full proof of concept running on my laptop.
And I think in as little as two weeks, we were able to execute successfully against a small production data center.
And if my memory serves me well, that was Lisbon.
I definitely think that was Lisbon. I think that's correct. And it's actually funny that you mentioned running n number of commands on the CLI.
It takes me back to the days when I was an SRE, and I was absolutely new to all of this stuff.
And I would have to do a bunch of provisioning. Obviously, we had capacity that was required across the globe, and my entire screen would basically fill up with terminals in order to run multiple tasks in parallel.
Oh, my God, what a pain that was.
And I would end up spending most of my time on trivial things, checking a ton of times whether I had enabled BGP with the right set of servers or not.
Or, for example, did I enter a command correctly or not?
Or did I enter the name of the servers correctly or not? Or have I made any other clumsy mistake?
All of which can actually be totally avoided with Airflow.
So going back to the topic of Airflow, are there any prerequisites to run Airflow, Jet?
I know that our eventual goal is to go into this one-button provisioning world, where an engineer just hits a button, sits back, enjoys a nice cup of coffee.
But as of now, is there any human intervention required throughout the process?
And if so, how does the automation handle it? So the great thing about our automation is that it can still work very well alongside humans.
If a task requires human intervention, our automation is smart enough to seek the attention of a human operator and ask for it.
So to cite an example, we have a task in our expansion procedures that requires an SRE to review the changes that are detected in our DNS zone records as a result of expansions and decommissions.
So the automation will ask for this intervention through a chat notification.
It then pauses the execution of the workflow while it waits for the SRE to review and deploy the change, and then proceeds to execute downstream tasks as soon as it gets the input back, confirming the completion of the deployment.
So our automation asks for intervention if it's needed, pauses the execution while waiting for a human to actually intervene, and resumes the operations once that intervention has been made.
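The pause-and-resume behavior described here can be sketched in plain Python. This is only an illustration of the idea, not the PRAS implementation and not Airflow's API: the `Task` and `Workflow` classes, the `needs_human` flag, and the task names are all hypothetical, and the chat notification is replaced by a simple callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Task:
    name: str
    action: Callable[[], None]
    needs_human: bool = False  # hypothetical flag, not an Airflow parameter


@dataclass
class Workflow:
    tasks: List[Task]
    notify: Callable[[str], None]  # stand-in for a chat notification
    position: int = 0
    waiting_on: Optional[str] = None
    confirmed: Set[str] = field(default_factory=set)
    log: List[str] = field(default_factory=list)

    def run(self) -> str:
        """Run tasks in order; stop and notify when one needs a human."""
        while self.position < len(self.tasks):
            task = self.tasks[self.position]
            if task.needs_human and task.name not in self.confirmed:
                self.waiting_on = task.name
                self.notify(f"'{task.name}' needs review; confirm when deployed")
                return "paused"
            task.action()
            self.log.append(task.name)
            self.position += 1
        return "done"

    def confirm(self, task_name: str) -> str:
        """Record the operator's confirmation and resume downstream tasks."""
        if task_name != self.waiting_on:
            raise ValueError(f"not waiting on {task_name!r}")
        self.confirmed.add(task_name)
        self.waiting_on = None
        return self.run()


messages: List[str] = []
noop = lambda: None  # placeholder for real work
wf = Workflow(
    tasks=[
        Task("provision_servers", noop),
        Task("review_dns_changes", noop, needs_human=True),
        Task("enable_traffic", noop),
    ],
    notify=messages.append,
)
first_state = wf.run()  # stops before the DNS review and sends one notification
log_at_pause = list(wf.log)
final_state = wf.confirm("review_dns_changes")  # operator confirms; run resumes
```

The point of the sketch is that the workflow keeps its own position, so nobody has to watch a terminal while it waits: the operator is pinged once, and downstream tasks run only after the confirmation arrives.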
So does that actually mean that I don't have to keep staring at my terminal to see the progress and I can just...
Yeah, it's funny.
It's just taking you back to the time. But anyway, you were talking about expansions earlier.
Given that expansions are super complex and critical, that their complexity increases with every condition that's required, and that not all expansions are the same, do we just use a single workflow, or are there multiple workflows involved?
And if there are, how does it go into a specific workflow based on a condition?
And just because of how different each expansion is, is this even customizable?
Can we tweak it here and there? Yeah, sure. So let me just share this quick slide.
Okay. So we have automation for each of our data center operations, provisioning that is.
That means that we have a corresponding DAG for expansions and decommissions, for example.
So in Airflow, a DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
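To make the idea of a DAG (directed acyclic graph) concrete, here is a minimal sketch using only Python's standard `graphlib` module rather than Airflow itself. The task names are hypothetical and are not taken from the actual PRAS workflows; the sketch only shows how declared dependencies determine which tasks may run, and which may run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "power_on_servers": set(),
    "boot_kernel": {"power_on_servers"},
    "run_provisioning": {"boot_kernel"},
    "update_dns_records": {"run_provisioning"},
    "verify_health": {"run_provisioning"},
    "enable_traffic": {"update_dns_records", "verify_health"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

batches = []
while sorter.is_active():
    # Tasks whose upstream tasks have all finished; these could run in parallel.
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

In this sketch, `update_dns_records` and `verify_health` land in the same batch because neither depends on the other, while `enable_traffic` waits for both.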
So an engineer who has the right permissions can trigger any of these DAGs, depending on which operation is to be conducted on a data center.
So in addition to these DAGs, we have other external DAGs that are dynamically triggered by these data center DAGs in order to execute its own specific workflow with its own specific tasks inside of it.
And these independent DAGs are orchestrated all together, which allows us to solve really, really complex workflows.
This is why we have written our DAGs to be fully reusable. We really did not try to incorporate all the logic into a single DAG.
Instead, we created other separable DAGs that are fully reusable and can be triggered on demand, not only programmatically.
Our DAG manager and this helper DAG are examples of it. Right.
So you mean primarily right now, we handle two tasks. For example, if you're handling expansions, then that has a separate DAG.
And if you're handling decommissions of a service, that has a separate DAG.
That is correct. Yes. Okay. It differentiates the functions very well.
Perfect. Moving on. Suppose I've never used Airflow: how easy is it for me to adopt it?
And should I have some sort of prior knowledge to use this technology?
Yeah. So as an end user operator, no prior knowledge of Apache Airflow is needed.
One of the customizations that we did to ensure that our automation is easy to use is to create custom user interfaces that are tailor fit for provisioning.
So we created a plugin that added a custom menu called provisioning.
And under it are sub menus for a specific operation that we want to do, which will then direct you to a web form that is rendered as custom views in Apache Airflow.
And any tooling for that matter to be successful really requires strong adoption and acceptance from end users.
Actually, Airflow made it really easy for us to create customized web interfaces on top of these processes that we have, which made it simple to use by more employees who didn't have to understand all the details.
I can recall Jose doing the expansions on his own.
Jose didn't have any prior knowledge of Airflow, but we made it really simple for colleagues who don't have any prior knowledge of it.
Absolutely. Pretty much all of my team, they're excellent.
Yes, they are. Some of them had never even entered a CLI command before, but here they are running expansion after expansion and decommission after decommission, which is great.
It has just improved the pipeline so much that we have freed up so many cycles, even for us and even for the SRE team to do more such automation, to do more such good work and just make our lives easier, basically.
You've listed all of the good stuff so far. How much time savings are we actually talking about here?
Give us some numbers, please.
That'll give us a better idea. Sure. Take a phase two expansion, for example.
It now takes only between 30 and 40 minutes, and that's without human interaction needed, really.
Previously, it took about an hour and a half to two hours, right?
You know how it was before. Also, you have to sit there in front of your terminal, so you're really glued in there and unable to do anything else.
Today, by contrast, you simply click a button, you let it run, you carry on doing some of your other exciting stuff, and it really optimizes a person's time.
There's just an enormous list of advantages in automating our data center operations.
I think, to name a few, automation really gave us guaranteed consistency compared to manual actions performed by humans, because the inevitable lack of consistency displayed by humans can lead to mistakes, oversights, and even reliability problems.
That has now totally vanished. It's very easy to do multiple simultaneous expansions now, which had always been very challenging if you were actually doing it in two or three data centers.
Then lastly, we no longer have the need to use SSH.
Automation has significantly contributed to the overall reduction in SSH usage.
Very, very true. This is absolutely great. Right now, my team is making the utmost use of it in our day-to-day.
That being said, as humans, we're never fully satisfied with what we have, are we?
Talking about the future, Jet, what's next?
Right now, we are at about 200 plus data centers. At the rate at which Cloudflare is scaling, how prepared are we to tackle this growth?
I know for a fact that we're not going to slow down anytime soon.
Please shed some light on scalability, or do you see a foreseeable maintenance nightmare?
Hopefully not, but do we see a dark period for us anytime in the future, Jet?
Yeah, sure.
I just presented a slide now. In the case of expansions, they're done in two phases.
Phase one is the phase where servers are powered on, boot our custom Linux kernel, and begin the provisioning process.
Then phase two is the phase in which newly provisioned servers are enabled in a cluster to receive production traffic.
Now, to reflect these phases in the automation workflow, we also have two separate DAGs, one for each phase, as mentioned earlier.
Right now, we have over 200 data centers.
If we were to write a pair of DAGs for each, then we would end up writing about 400 files, right?
Parameterizing a single DAG might seem like a viable option, and at first glance this approach sounds reasonable, but one major challenge is that it could be very difficult to track the progress of each DAG run.
Imagine an engineer conducting multiple simultaneous expansions.
Following the software design principle called DRY, and inspired by the factory method design pattern in programming, we've instead written both the phase one and phase two DAGs in a way that allows them to dynamically create multiple different DAGs with exactly the same tasks and to fully reuse the exact same code.
As a result, we only maintain one code base. As we build new data centers around the world, our automation can automatically support them, generating a DAG for each new data center instantly without our writing a single line of code.
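The factory approach can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the PRAS code: the `Dag` class is a stand-in for a real Airflow DAG object, and the task names, DAG ID format, and data center names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Dag:
    """Stand-in for a workflow object; not the Airflow DAG class."""
    dag_id: str
    tasks: Tuple[str, ...]


# One shared definition per phase (hypothetical task names).
PHASE_TASKS: Dict[str, Tuple[str, ...]] = {
    "phase1": ("power_on", "boot_kernel", "provision"),
    "phase2": ("enable_in_cluster", "verify_traffic"),
}


def make_expansion_dag(datacenter: str, phase: str) -> Dag:
    """Factory: build one DAG for one data center from the shared task list."""
    return Dag(dag_id=f"expansion_{phase}_{datacenter}", tasks=PHASE_TASKS[phase])


def build_all_dags(datacenters: List[str]) -> Dict[str, Dag]:
    """Generate a phase one and a phase two DAG for every data center."""
    dags: Dict[str, Dag] = {}
    for datacenter in datacenters:
        for phase in PHASE_TASKS:
            dag = make_expansion_dag(datacenter, phase)
            dags[dag.dag_id] = dag
    return dags


# Hypothetical names; in practice the list would come from an inventory source.
datacenters = ["lis01", "sin01"]
dags = build_all_dags(datacenters)

# Adding a data center needs no new code, only a new entry in the list.
dags_after_growth = build_all_dags(datacenters + ["nrt01"])
```

Because each data center gets its own DAG ID, the runs of simultaneous expansions stay separate and easy to track, while the task definitions live in exactly one place.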
Now, speaking of scale, we're able to execute these tasks simultaneously at scale because our colleagues from the core SRE team did an amazing job setting up the Apache Airflow infrastructure.
We use a modern executor, the Kubernetes executor, which creates a new worker pod for every task instance that needs to be executed.
And upon completion of the task, that pod gets killed.
So this ensures maximum utilization of resources at our core data centers.
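The one-worker-per-task lifecycle can be illustrated with a small simulation. This is only a sketch of the idea: it does not call Kubernetes or Airflow, and the `worker_for` and `execute` helpers and the worker naming are hypothetical. A context manager stands in for creating a pod before a task and removing it afterwards.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Set

events: List[str] = []
active_workers: Set[str] = set()


@contextmanager
def worker_for(task_id: str) -> Iterator[str]:
    """Simulate a short-lived worker that exists only while one task runs."""
    worker = f"worker-{task_id}"
    active_workers.add(worker)
    events.append(f"created {worker}")
    try:
        yield worker
    finally:
        # Cleanup runs whether the task succeeded or raised.
        active_workers.discard(worker)
        events.append(f"deleted {worker}")


def execute(task_id: str, fn: Callable[[], str]) -> str:
    """Run one task inside its own dedicated worker."""
    with worker_for(task_id):
        return fn()


def failing_task() -> str:
    raise RuntimeError("task failed")


results = [execute(name, lambda n=name: n.upper()) for name in ["task_a", "task_b"]]

try:
    execute("task_c", failing_task)
except RuntimeError:
    pass  # the worker is still cleaned up
```

No worker outlives its task in this model, which is the property that keeps resources free between task instances.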
I see that we're thinking from all aspects that we can on how to save space and how to be more efficient, which is great.
And I'm absolutely all in for scalability; it supports the idea of Cloudflare expanding our footprint everywhere, and the two just go hand in hand.
Let's talk a little bit about SOPs, our very favorite standard operating procedures.
Given that the automation using Airflow is already in place, does this mean that SOPs are no longer needed?
Yeah.
So since the automation is already in place, we no longer need to go through these SOPs, but we still keep them there.
We keep them in sync with the code of PRAS.
That way, if we need to, we can still do expansions and decommissions by hand.
So for example, if we encounter serious technical issues preventing us from using the automation, we still have the SOPs we have trusted over the years, which can be used to keep provisioning and decommissioning activities from stopping completely.
And also, on a personal note, I believe SOPs still have value, especially if you're a new hire.
If you're a new SRE, it's actually a great training tool and you will actually learn a lot by doing it the traditional way, at least initially.
Very true. I think you get to know what's happening in the background when you run the SOPs manually for the first few times and then you use automation and you already know why it's happening.
So yeah. So Jet, what's next for PRAS or provisioning as a service?
Yeah. So PRAS will ultimately be delivered as an autonomous system for the data center, wherein it would be able to detect our bare metal servers being racked on site and powered on, and would then automatically provision them and expand the capacity of the data center without any human involvement at all.
I mean, that's the ultimate goal that we want to achieve.
Currently, we continue to improve this service, working on feature requests, for example, and we also continue to onboard existing manual procedures that we still have.
There's some really exciting stuff going on at the moment, such as the initiative to also automate network provisioning, and even RMA for that matter.
Right. All of the good stuff. And just reiterating this fact that my team is taking the utmost advantage of it at this moment.
They've been able to use technology and certain concepts of even automation that probably would not have usually crossed their paths.
And I'm certainly looking out for all of the upcoming features and looking forward to the ultimate goal of reaching our one-button provisioning process.
A very, very big thank you, Jet, for sharing the technical details about provisioning as a service.
Before I conclude the segment, I also wanted to take this opportunity to make a small hiring pitch while we still have some time on the clock.
My team is hiring for data center engineers as well as network deployment engineers in multiple locations.
So if any of our viewers or listeners are interested, please drop an email to zena at cloudflare.com or visit our careers page for more information.
I can vouch for the fact that Cloudflare is absolutely a great place to work.
I've been having the time of my life here for about three and a half years now, working and growing alongside such brilliant, talented, and nice people.
And I'm super thankful for it. Jet, any closing comments from you?
Yeah, thank you very much, Zena. I mean, thanks as well for this platform, for the opportunity to showcase some of the exciting stuff that we do here at Cloudflare.
And just like your team, our team is actually also hiring.
So we are always on the lookout for the best talent out there.
So we have open positions for both SRE and infrastructure engineering to be based, of course, here in Singapore.
So, yeah. Perfect.
All right. Concluding this segment, a very warm goodbye to all of our viewers.
Thanks for watching. And please, please stay tuned to Cloudflare TV. There is a lot of exciting stuff coming up.
And yeah, have a good day or good night, everybody across the globe, and stay safe.
See you, everybody. Thank you for watching.
Bye.