Cloudflare Control Plane and Analytics outage
Presented by: Nitin Rao, João Tomé
Originally aired on July 23 @ 3:30 AM - 4:00 AM EDT
In this special episode, João Tomé is joined by Nitin Rao, Cloudflare’s Interim CPO, to discuss the Cloudflare Control Plane and Analytics outage. The incident lasted from November 2 until November 4, impacting several customers.
We explain how one of our core data center providers failed catastrophically and the steps we are taking to ensure it never happens again. These measures, which we’re calling Code Orange, involve reducing dependence on our core data centers, requiring all Generally Available products and features to have a reliable, tested disaster recovery plan, running recurring audits, and much more.
Transcript (Beta)
Hello everyone and welcome to This Week in Net. It's the November 16th, 2023 edition. This is a special episode focused on Cloudflare's control plane and analytics outage.
I'm João Tomé based in Lisbon, Portugal, and with me I have our Interim Chief Product Officer Nitin Rao.
Hello Nitin, how are you?
Hello João, thanks so much for having me. I appreciate it.
Where are you based?
I'm based in Oakland, California, near Cloudflare's San Francisco office.
Exactly, so an eight-hour difference from Lisbon, where I am today.
This episode focuses on our outage, which was a big one: what happened, who was impacted, what we learned from it, and what we're going to do to make sure it doesn't happen again.
For those who don't know what happened, even customers, what can we say from the get-go?
Yeah, so for starters, Cloudflare plays a very important role protecting a substantial portion of the Internet.
There's a very large number of Internet applications and corporate networks that rely on Cloudflare every day.
We take that responsibility seriously and anytime we have an incident or an outage, we have a culture of transparency.
It is important to us to be really transparent about exactly what happened, what we learned from the incident and sharing and being accountable for making things better.
Between November 2nd and November 4th, we had a significant outage of Cloudflare's control plane and analytics plane. Across our global network of hundreds of data centers, traffic continued to flow: hundreds of trillions of requests a month.
We continued to stop attacks against our customers, but our control plane was impacted, which meant that if you were a Cloudflare customer trying to log in or make changes to your configuration for key services, you wouldn't have been able to do that.
In addition, for a set of our newer products that weren't yet generally available, in some cases services like our Stream offering or parts of our Zero Trust offering, customers were more acutely impacted. I'm very sorry for the outage.
It should not have happened. It's completely unacceptable that for a period of time our control plane was unavailable and in some cases key analytics or logs data was lost for our customers.
That just cannot happen, and we're committed to ensuring it doesn't happen again and that we learn from the incident. I'm truly embarrassed and sorry that the incident happened in the first place. In the spirit of Cloudflare's culture of transparency, it's important to share exactly what happened and act with urgency to remediate it, and that's what the team's been really focused on.
Exactly and it didn't impact all the customers, right?
It impacted the customers, as you said, that use these specific products, so security and network traffic wasn't impacted in general, right?
Yeah, so when we think of Cloudflare's network, there's the network that customers are especially familiar with.
Our global network, we internally sometimes refer to it as the edge.
These are hundreds of data centers all over the world in over 300 cities in over 100 countries.
That global network was very resilient through this control plane outage.
However, we have a separate set of data centers that we sometimes refer to as our core data centers, especially in the United States.
There are three data centers that act together as a high availability zone, so they act in concert, continually syncing data for consistency where a lot of our control plane sits.
The control plane is the interface with which you interact with Cloudflare, things like our dashboard, our key APIs for Cloudflare.
The analytics plane is where we do things like logs processing, and so those were impacted during the outage.
That was clearly disruptive and especially disruptive for a subset of customers and has given us the opportunity to identify exactly what we need to urgently shore up, and that's what the engineering team has been really focused on.
Exactly. You mentioned before the Cloudflare culture of blogging about what happens even when there's an outage. This is the blog post that we published on November 11th, the postmortem of what happened, and it's pretty detailed.
It goes back to the Cloudflare culture of transparency, as you said, not only for customers but also so that people in the industry can understand what happened.
We usually disclose the specifics, right?
Yeah.
Cloudflare has a large and global set of customers that really rely on the availability of our services.
We have a status page, so if we have an incident, we share ongoing updates there, along with the specific services that are impacted. But for larger outages like this one, we aim to get the details out to our customers as quickly as possible. This began Thursday morning U.S. time, and late Friday night, our CEO, Matthew Prince, published a very detailed blog, which is the one that you're showing on the screen right now.
It acts as a postmortem describing exactly what happened, and it goes into great detail describing the failure at one of our significant data centers.
It also describes the impact to our services, which I'm happy to go into more detail on, and begins to preview lessons and remediation.
Separately, we've reached out one-on-one to a number of our important customers and addressed this in a webinar that we held in the following days, so I appreciate having another opportunity with This Week in Net to speak to the incident.
Absolutely. There's a lot of details here, as we usually do in these types of blog posts.
You were mentioning before exactly the measures that we put together at the time.
Do you want to go into more detail on those?
Sure. Maybe before doing that, so as not to skip too many steps ahead, if it's okay, let me take a moment to describe exactly what happened.
I know that some folks have already read the blog or are very familiar, but for folks who aren't familiar, let me take a moment to just quickly describe it so we're all on the same page.
As I mentioned, Cloudflare has a control plane, with key services like the dashboard and the APIs used to make changes to Cloudflare, and an analytics plane where we do things like logs processing.
A lot of that happens at three major data centers in the United States.
We sometimes refer to them as the U.S. core data centers. Because these are databases that are syncing data with each other and maintaining a level of consistency, they can't be too far from each other.
They are at the greatest distance from each other that is possible while still keeping latency low enough that they can consistently sync data with each other.
This is sometimes referred to as a high availability zone, where you have three distinct data centers and you're resilient to the failure of any of those data centers.
These are distinct data center partners. They have multiple power feeds and several fail-safes.
They have independent network connections and the like, and they're constantly syncing data.
However, what happened is we had a truly catastrophic failure at the first and largest of those three data centers, and it had significantly more severe impact than we'd expected.
Typically, a power failure alone should not cause such significant impact, but this was truly catastrophic: the data center not only lost multiple power feeds but also lost several backup arrangements, including a generator and battery. That impacted Cloudflare as well as a number of other technology companies and telecom providers that rely on the infrastructure in that data center.
The second and third data center continued to run during this incident, and that's where you'd expect high availability to kick in.
You'd expect several services that are correctly configured to be highly available to continue to run on that second and third data center even as you're waiting for the first data center to come back.
Unfortunately, while that happened for some services, that did not happen for all services.
We realized that a subset of services had hidden dependencies on our first data center, ones that we should have known about but did not fully appreciate and that weren't exposed in the different types of chaos testing we'd done. We also found that a number of our newer products, where we'd given the product teams a lot of autonomy to build quickly, weren't configured to take advantage of our high availability cluster.
Just a side note on that, we have, and this is also a bit of the culture of Cloudflare, we have a lot of new products.
Birthday week was just around the corner, that's one of our innovation weeks.
We launched a lot of products, and of those, some are in beta or alpha and some are already generally available.
We'll speak to this a little bit later, but one of the things that's coming out of this outage is ensuring that any product that is GA has to be HA.
No GA without HA. We're very strictly enforcing that across all of our products, ensuring that every one of them is verifiably highly available.
They take advantage of the architecture that exists for this purpose that's spread across data centers that has a verified disaster recovery plan.
It's important that any product we put in our customers' hands and designate as GA has to pass that test.
That became one of the main takeaways from this incident that we clearly do not want to happen again.
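As a rough illustration of the "No GA without HA" rule described above, here is a minimal Python sketch of a release gate. Everything in it is an assumption made for illustration (the `Product` fields, the `REQUIRED_SITES` constant, and the function names); it is not Cloudflare's actual tooling.

```python
from dataclasses import dataclass

# Hypothetical model of a product's readiness; field names are illustrative only.
@dataclass
class Product:
    name: str
    stage: str              # "alpha", "beta", or "ga"
    replicated_sites: int   # number of core data centers the service is replicated across
    dr_plan_tested: bool    # a disaster recovery plan exists and has been exercised

# Assumption: the high availability cluster spans the three core data centers.
REQUIRED_SITES = 3


def ga_blockers(product: Product) -> list[str]:
    """Return the reasons a product may not be designated GA (empty means it passes)."""
    blockers = []
    if product.replicated_sites < REQUIRED_SITES:
        blockers.append(
            f"replicated across {product.replicated_sites} of {REQUIRED_SITES} sites"
        )
    if not product.dr_plan_tested:
        blockers.append("no tested disaster recovery plan")
    return blockers


def can_promote_to_ga(product: Product) -> bool:
    """No GA without HA: promotion is allowed only when there are no blockers."""
    return not ga_blockers(product)
```

A gate like this could run as part of a launch checklist, so that the reasons a product is not yet eligible for GA are listed explicitly rather than discovered during an outage.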
If it's generally available, it will have all of the redundancy that makes the Internet hold.
Yes, absolutely. No GA without HA. And while the incident happened on November 2nd, and it's been roughly two weeks, even in that period of time, we've made significant progress ensuring many more of those services have been moved to the high availability cluster.
And the team is acting both with urgency, prioritizing this above all other engineering efforts, ensuring that the control plane is highly available, but at the same time doing it in a very thoughtful way because we want to make sure that any changes we roll out help our customers but are not unintentionally disruptive to our customers.
So in a very thoughtful way, these changes are happening.
There's been substantial progress even in the last two weeks.
And we already have this architecture in place that's highly available, that's spread across data centers, that has a verified disaster recovery plan.
It's important that we enforce that across all of our products.
And so we keep reminding ourselves no GA without HA and appreciate all the work from across team members to make progress against this.
It was a team effort across many teams at Cloudflare. This was one of those all-hands-on-deck projects, where people are creating these procedures, some very immediate, others for the long term.
And we call this process Code Orange, right?
Yeah. This is somewhat inspired by our understanding of what Google has done for matters of great importance.
They have what they sometimes call a code yellow or a code red for matters of great criticality.
These could be stability issues or other issues that just need to take precedence above all other product development.
This felt similar to us coming out of the outage. We knew that we wanted to urgently ensure that the control plane is highly available.
And so we created Code Orange.
It's an engineering resilience initiative where all non-critical engineering resources and team members are prioritizing making the control plane HA above all other initiatives, and I appreciate the progress we're making against it.
I know that customers want us to quickly remediate these issues.
And so we're making progress against them and transparently sharing those updates with our customers so that we're held accountable to the changes we need to make.
Some of these things are not expected, and even more so, it's difficult to test for disasters, even natural disasters.
There's a lot of mathematics and a lot of logistics there, but it's always difficult to calculate in some sense.
So how do we try to predict some of these things?
Yeah, so the one consolation, if you will, from the outage is that it has been the most full and thorough test, not only of our high availability systems, but also of our disaster recovery systems.
So our first measure of control is high availability. So to the extent services like databases are replicated across and synced across the high availability cluster, they should stay resilient.
But in the event high availability doesn't work, as it didn't in the outage, we can trigger what's called disaster recovery.
We have a facility in Europe specifically for this purpose, where we fail over to use the infrastructure in Europe instead.
Through this exercise, we were able to identify exactly what needs to be shored up in our disaster recovery systems and also map the entire timeline: what contributed to every single minute of the disaster recovery process.
And our goal is to make sure that in the event of a disaster, we're able to recover in closer to an hour rather than the much longer time it took in this case.
And so there are significant opportunities both to make more services highly available, but also to really bolster the disaster recovery systems.
And that's what the team's been working hard on.
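The two layers described above, high availability first and disaster recovery as the fallback, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the site names (`core-1`, `dr-europe`) and the function are hypothetical, and real failover involves data replication and many steps that are not modeled here.

```python
def choose_serving_location(
    core_sites_up: dict[str, bool],
    service_sites: set[str],
    dr_ready: bool,
) -> str:
    """Pick where a service is served from.

    1. High availability: use any healthy core site the service is replicated to.
    2. Disaster recovery: otherwise fail over to the DR facility, if it is ready.
    3. Otherwise the service is unavailable.
    """
    healthy = sorted(s for s in service_sites if core_sites_up.get(s, False))
    if healthy:
        return healthy[0]
    if dr_ready:
        return "dr-europe"  # hypothetical name for the disaster recovery facility
    return "unavailable"
```

In this sketch, a service replicated to all three core sites keeps running when one site fails, while a service that only lives on the failed site depends entirely on disaster recovery being ready.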
In the medium term, of course, there's a lot of process work going on as well.
What does that involve, in terms of reducing dependence on our core data centers or recurring audits?
Yeah. Cloudflare's strength really lies in distributed systems.
And we saw that in part in the resilience of the entire global network that stood up, that's highly available, that's incredibly diverse in its design.
That's where Cloudflare's strengths lie. And so we recognize that there are some services that run at our US core data centers that arguably should not even run at those core data centers.
They should actually run across our global network.
And while that's harder to do for things like databases, it is certainly possible for a number of other services that run at the core.
And so we're going through the catalog of everything that runs at the core.
And we're asking, could it be coreless? Could it just run across our entire global network instead?
And so there's a suite of services that's just moving off the core to instead run across our distributed network.
But then separately for the services that do stay at the core, we're doing the work to ensure that if the product is GA or even on the path to being GA, it has to be HA.
It has to have a verified disaster recovery plan.
We need to do a lot more rigorous chaos testing.
I can't underscore enough the importance of chaos testing. One of the things we reflected on coming out of the outage was that, of those three data centers, the first and largest facility went down while the second and third stayed up.
The second and third had undergone rigorous chaos testing where we had taken them down.
For the first data center, we had done some chaos testing, but in hindsight, we would have done a great deal more rigorous chaos testing.
Even if chaos testing causes some pain to customers in the moment, we think that's outweighed by the benefit of the exercise.
And so we're going to be a great deal more rigorous with that as well.
And so a number of measures in the immediate term, in the near term, some in the longer term to ensure we're more resilient.
Separately, as you mentioned, we do rely to some degree on our data center partners.
And so we're going to audit every one of our data center partners for their backup systems and the like.
But ultimately, we can and should expect that data centers can fail.
And we want to make sure we can have systems that are resilient to any data center failure.
And so that's going to be the thrust of our effort.
So we're spending a little bit of time working with our data centers, ensuring they follow best practices, but spending the majority of our time at the software layer, looking at all of our applications and asking, can this be resilient even if a data center fails? That's something we should always be prepared for.
You were mentioning chaos testing before, which I found really interesting, because you have to create some real chaos, with real impact, to do the testing properly, right?
Yeah, the image that comes to mind is of a chaos monkey that's in the data center.
The people who built those services don't know exactly where the chaos monkey is.
The chaos monkey is metaphorically pulling out different cables to see what disruption they can cause.
And it's important to do that level of testing to, for example, take down any and all of those data centers in a controlled manner, so that we know exactly what gets disrupted in the event of that failure.
And so we already do chaos testing, but we identified some places where we weren't doing enough.
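To make the idea concrete, here is a small Python sketch of a simulated chaos test that takes each core site down in turn and reports which services stop working. The service inventory, site names, and functions are hypothetical and heavily simplified (dependencies are assumed to be acyclic); the point is only to show how a hidden single-site dependency surfaces.

```python
# Hypothetical inventory: the core sites each service can run from,
# and the other services it needs in order to work.
SERVICES = {
    "dashboard": {"sites": {"core-1", "core-2", "core-3"}, "needs": {"config-db", "auth"}},
    "config-db": {"sites": {"core-1", "core-2", "core-3"}, "needs": set()},
    "auth": {"sites": {"core-1"}, "needs": set()},  # hidden single-site dependency
}
ALL_SITES = {"core-1", "core-2", "core-3"}


def is_available(name: str, sites_up: set[str], services: dict) -> bool:
    """A service is available if it has a healthy site and all of its dependencies do too."""
    service = services[name]
    if not (service["sites"] & sites_up):
        return False
    return all(is_available(dep, sites_up, services) for dep in service["needs"])


def chaos_report(services: dict, all_sites: set[str]) -> dict[str, list[str]]:
    """Simulate losing each site in turn; map the site to the services that break."""
    report = {}
    for site in sorted(all_sites):
        sites_up = all_sites - {site}
        report[site] = sorted(
            name for name in services if not is_available(name, sites_up, services)
        )
    return report
```

In this made-up inventory, losing `core-1` breaks `auth` directly and `dashboard` indirectly, even though `dashboard` itself is replicated across all three sites.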
And so that's going to be a big area of focus for us.
I was curious, because you have many years of experience in this area, and companies evolve with things like these.
I remember seeing blog posts from the past 10 years about problems that emerged, and there were patches and new things done to avoid repeating some of those mistakes.
One of the interesting things is how complex the Internet is, and yet things work.
With that complexity, you're always learning, right?
It's difficult to say, I know everything.
Yeah, so I think part of the learnings from our outage were technical: what is the architecture we could have in place that's resilient?
But some of them are about the collection of human beings working together to run these critical services on behalf of the Internet.
And so some of our learnings were actually about how do we enforce high availability standards?
How do we make sure that it's given high importance, and in some cases, higher importance than other engineering efforts?
And so we've been able to reflect on that.
One of the things I'm really grateful for is every engineering team stood up and said, we never want this outage to happen again.
How can I help? What do I need to do? I appreciate that in moments like this, while any kind of change inherently comes with a level of pain, it also inspires people to step up and make the changes needed.
And so emerging from this incident, every engineering team is doing the work to ensure that their services are highly available, and I appreciate all the work that's going on.
And I know the customers expect this of us.
So this is very much the job and something we're spending time on.
And some of these are very technical, of course. In terms of the learning process, with different teams and different ideas, how do ideas flow, in terms of, hey, we should do this process,
we should make sure this process is working?
How does that process of deciding what to test and put into practice happen in this situation?
So for starters, you need greater consistency across our process for something like high availability.
And, like you mentioned, because of the sheer number of products we launch and the amount of innovation we do, we have typically indexed a little more on speed and giving product teams the autonomy to move quickly and innovate.
But that has also meant that different teams have taken a different path from their alpha to beta to GA.
And we realized that this is one of those areas where it actually helps to have more consistency to enforce a set of rules.
And so we're driving greater standardization across all of our teams to ensure no product is GA, generally available, unless it's HA, highly available.
It is demonstrably replicated across the high availability cluster we have in place.
It has a verified disaster recovery plan. And team members across Cloudflare's engineering teams have been very supportive, very responsive to that, because we recognize that's job number one.
In many ways, reliability is the most important feature we sell to our customers.
And so it has to be the very first thing we pay attention to. And so we're making urgent progress based on the lessons we learned towards these remediation measures that we've spoken about.
Of course, there's excitement, as you mentioned, about launching a lot of products.
That's exciting for customers, some of whom are waiting and asking for specific products, and for the teams that want to put them out there.
But as you were saying, adding this process makes sure that even if something very catastrophic happens, which hopefully won't happen in the next 10 years, things will still hold, because the process is there, right?
That's really important, now that the company is also growing, in a sense.
Yeah, absolutely.
I agree with what you said. In terms of, just to wrap it up also, in terms of Cloudflare, what do you take from this as lessons learned?
We mentioned already the specific steps we were taking, but in a more general way, in the history of Cloudflare, now 13 years, what do you take from it, even historically?
Yeah, so first, at the outset today, I recognize there are some customers dialing in.
I'm very sorry for the outage.
It should not have happened. We want to make sure that when something like this happens, we use it to really galvanize the engineering team towards these urgent remediation efforts.
We use it as an opportunity to become even more resilient across all of our services.
And so I'm optimistic that our services for customers, and our teams, will emerge stronger from it, and that we'll be transparent along the way.
And we're already seeing significant progress towards services becoming more highly available.
That's a good way to also put it in terms of, we will be stronger.
Cloudflare will be stronger after this and more resilient, as you were saying.
Last but not least, you have a lot of experience in this area, and there's a lot of talk these days about the geopolitical situation, where the Internet plays a role.
There are submarine cables, and a lot of related things.
Redundancy in the Internet is really important, not only for specific companies, as in this case, but also for things like submarine cables.
Having that redundancy makes a difference, right?
Yes, absolutely.
This outage was largely within Cloudflare's control.
We should have had a design that was verifiably highly available and resilient to these failures.
More generally, yeah, it really takes a village to help run the Internet and make sure it is available, fast, and secure.
That depends on a number of moving parts. That depends on things like submarine cable systems, as you described.
Cloudflare is one of a number of companies that does its part to make sure the Internet is as available as possible.
Exactly.
This was great, Nitin. Any final words for those who are listening?
I'll just reiterate for all the team members who had to work on this, for our customers who were impacted, for the team members of customers who were impacted, and for the customers of customers who were impacted: I'm very sorry about the outage.
It was completely unacceptable, but it gives us an opportunity to make systems more resilient.
The team's acting with urgency to ensure we follow through on the remediations that we shared in our blog.
I appreciate the opportunity to speak briefly to this on This Week in Net.
Thank you so much, Nitin. Also, thank you so much to all our team members who worked so hard these past two weeks to make sure this doesn't happen again.
Thank you. And that's a wrap.