What Does an SRE Do at Cloudflare?

Presented by: Michael Wolf, Tom Godawski, Daan Van Gorkum

Originally aired on June 10, 2020 @ 9:30 PM - 10:00 PM EDT

Join for a high level discussion of the role Systems Reliability Engineers play at Cloudflare. This includes how we handle oncall and alert ownership, the software engineering we do to make internal systems reliable and easier to use, and the ways we dogfood Cloudflare's products to make our lives easier.

Reliability

SREs

Dogfooding

Transcript (Beta)

What's up guys? Hey, thank you so much for coming and watching this Cloudflare TV presentation. You know, I'd like to give a quick shout out to Calvin for his talk on bot management, but this presentation is going to talk about what SREs do at Cloudflare. I'm going to start out with kind of a quick presentation from my perspective, and then we'll go and talk to two SREs and get what their perspectives are and their experiences have been at Cloudflare. So let's go ahead and get started. If I click on the right page. Yeah, so this is me. I'm Michael Wolf. I'm an SRE on the core team at Cloudflare. I started here about a year ago, but I've been doing, you know, SRE sort of things for like five years, maybe. I started as an intern, which is a crazy intern job to have in the beginning. And based here in San Francisco, a fun fact is that when my wife and I got married late last year, we tattooed our rings on each other. So that's been pretty fun. Despite, you know, all of the quarantine stuff, I haven't had the need really to cut off my finger yet. We'll see how it goes. This is the classic systems reliability engineer photo. If you don't know what a systems reliability engineer is, you might have seen an individual like this on the street, you know, crouched down in consternation staring at a laptop in like, you know, very inopportune spot, you know, maybe on the bus or something. In this case, it was a fire drill. Basically, you know, SREs are kind of known as people who are like putting out fires for perfectionists. You know, that's kind of a decent idea, but we'll go into more of a traditional definition. So if you look at like the Cloudflare jobs description page, this is what you'll see. These vague explanations, you know, working on complex and distributed systems and, you know, all the complexity that entails, making sure that things are up and running and kind of gluing all of these products together when they're running in production scalably. 
As well, you know, contributing to open source and, you know, taking solutions that, you know, either from open source software, building them up and deploying them and basically serving as, you know, operations, but more. And the sort of person that's drawn to this role is, you know, someone who's never complacent. So, who sees a problem or, you know, has to do a manual task is like, I could be better, you know, let's make this like more automated or, you know, let's make this process smoother. I'm tired of dealing with this like horrible code base. Furthermore, any SRE who's been an SRE for a while is going to look at anything very skeptically or have been woken up in the middle of the night or have had their Friday ruined by something like this, you know, like, oh, you know, production server database went down. That's terrible. So, you kind of have this skepticism that you wouldn't see in, you know, say like a front end engineer, for better or worse. And finally, like many engineers or many people in the world, just a general curiosity about technology and things that aren't even related. I'd like to point out myself as a very curious person, but most SREs will have a variety of hobbies and some random interests you probably wouldn't necessarily expect. But really, a better definition of what an SRE does is, as Colin in the UK mentioned, is they do the needful, which is, you know, basically whatever needs to be done to keep production running 24-7, you know, 365, we're there to do. So, we have like the standard operations of supporting production, as well as, you know, being there to make the deployment pipeline more resilient, making the engineering integration into production more seamless by enabling developers, and then furthermore, just kind of constantly making the whole system more resilient, faster and more efficient. So, I'll go into each of these like kind of individually as just a brief dive. 
So, supporting production is probably one that individuals who interact with SREs will be most familiar with. This is making sure that the applications that are running at the edge, that are interacting, that customers interact with, that they're up and they're running, they're stable. And when you have an incredibly complex ecosystem, this is no small task. And so, SREs, as well as the engineers that actually build the software, will be setting up these systems to monitor and alert on problems that happen. For instance, this is our beautiful Grafana monitoring dashboard. Log into Grafana, you see this dashboard, and you know Grafana is healthy. That's a joke example, but we actually do do some pretty intense observability related stuff to make sure that our systems are running. And it's not following just a couple key metrics. It's really diving in, you know, like, things can go wrong at many layers of the stack. There's really complex software that we do at Cloudflare, you know, any cloud service providers, really, really a whole lot to be keeping track of. You know, that can be like request latency or CPU or memory or, you know, even more, you know, application specific metrics like, you know, how many requests you're getting in or what the requests are even looking like if you're being, you know, attacked by certain, you know, bot actors or whatever. So, taking all that into effect is part of, you know, monitoring and making sure that production is healthy. Obviously, your metrics aren't always looking great. It can sometimes look very mean, and you can have problems. At least in the last year that I've been here, Cloudflare has had two major, you know, well-publicized incidents. And it's in these times that the SREs really shine as the individuals who go out, troubleshoot the issues, and do their best to remediate the issues quickly, but, you know, safely. And this is the part that SREs don't like is necessarily dealing with a horrible incident. 
When it does happen, this is when the true value of the patience and the coolness under pressure comes in handy. So, the one tool that I really want to talk about here is PagerDuty. So, for those who aren't familiar, PagerDuty is sort of like the "hello, yes, this is dog" meme, except the dog is the "this is fine" dog and the dog's on fire. You're getting called at a certain, you know, in the middle of the night, and, you know, everything's broken, you have to go wake up and fix this. So, at many places, PagerDuty is incredibly painful to deal with. And not that the software isn't great, it's probably that it's too good and, you know, it always pages you when there is an incident. So, this was a screenshot that I took from a PagerDuty developer survey, and they're discussing, like, how often are you on call, basically. And one of the answers was, like, oh, I'm on call for everything all the time. And that is not a great feeling to be, like, the person that if things are sour, you're, you're, you're welcome, you have to deal with it. And that's really how you drive someone to, like, quit engineering forever. And I had a similar experience, it wasn't quite this bad, where, you know, it had gotten to the point where the sound of my phone buzzing was causing me, like, really horrible anxiety, and I would feel my phone vibrating when I wasn't even, you know, like, I wasn't being called, there wasn't a call, and anytime that someone called me legitimately, I was, like, freaked out, and I'd, like, pick up the phone shaking. And this is, like, a blog post that I wrote, just hyping up my blog a little bit, but about kind of that experience. And so, like, the first thing that really fixed this problem for me was, like, I got a new phone. But the second one was that I went to Cloudflare.
And so, during the interview process for the SRE team, we're talking about, you know, what, like, how does the on-call rotation work, you know, I was a little bit nervous about how it would work. Obviously, that's kind of an important part in understanding the work-life balance of a place. And they mentioned the follow-the-sun rotation, in which you're only on call from nine to five during business hours, and then, you know, like, nine to five on the weekend, like, when you're awake, and it doesn't disrupt your sleep, and then you hand over to an incredibly competent engineer in a different time zone. Yeah, it's really incredible to be able to work with these teams. And not only just on on-call, but on project work and everything. So, Cloudflare is incredibly lucky to have this many engineers that are able to support production together. It's awesome. I would not go back. Kind of discussing how production feels as an engineer. It's obviously very high pressure, and there's a lot of stress that comes along with supporting production, but it is an incredibly important role. And it's something that, you know, that you're definitely contributing to the success of the company by keeping things up. And you know the value that you bring when you help resolve an incident like that. Furthermore, it's like, it's not like it's a competitive thing. You're working on this in a team, and you build these incredibly close bonds with the engineers that, you know, even, you know, a job or two away from, you know, teams, you know, teams that I'd worked with, you know, three or four years ago, I still interact with, and I know them very well. So, in addition, just the ways that things break are fascinating to me as a, as an engineer and a curious person. And so just like learning how the Internet works is also a great part of this part of the job.
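The on-call handover described above, where each site covers its own daytime and then hands the pager to engineers in another time zone (commonly called a follow-the-sun rotation), can be sketched roughly as follows. This is only an illustration, not Cloudflare's actual scheduling logic: the list of sites, the fixed UTC offsets (which ignore daylight saving changes), and the "closest to local midday" rule are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical on-call sites with fixed UTC offsets (assumed, mid-2020 values).
# A real schedule would use proper time zone data and per-site shift hours.
SITES = {
    "San Francisco": timezone(timedelta(hours=-7)),
    "London": timezone(timedelta(hours=1)),
    "Singapore": timezone(timedelta(hours=8)),
}

MIDDAY_HOUR = 13  # centre of a 09:00-17:00 working day


def hours_from_midday(now_utc: datetime, tz: timezone) -> float:
    """Distance in hours between a site's local time and its local midday."""
    local = now_utc.astimezone(tz)
    local_hour = local.hour + local.minute / 60
    diff = abs(local_hour - MIDDAY_HOUR)
    return min(diff, 24 - diff)  # wrap around midnight


def on_call_site(now_utc: datetime) -> str:
    """Pick the site whose local clock is closest to the middle of its workday."""
    return min(SITES, key=lambda name: hours_from_midday(now_utc, SITES[name]))


if __name__ == "__main__":
    now = datetime(2020, 6, 10, 20, 0, tzinfo=timezone.utc)
    print(on_call_site(now))
```

Choosing the site nearest its local midday, rather than checking strict nine-to-five windows, means some site is always selected even when working hours in different offices do not line up exactly.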
Next thing I want to talk about is interfacing with developers, you know, this is sort of like the DevOps concept that's existed for a little while, but making it so that engineers can sort of own their code from development all the way into production and providing the tools and support they need to make it happen. So, this can be, you know, building out the monitoring solutions or building out the platform tools to interconnect services, or just like providing advice to teams as they transition from kind of like a young product into something that's fully fledged and out there. And obviously this comes out of the fact that, you know, an operations team simply cannot understand all the services that we have. There's, you know, thousands and thousands of services you potentially have. But, you know, being able to get a single team to own their chunk of the ecosystem makes it so that the entire system can function as a whole. So here's like a quick example of kind of an engineer-SRE relationship. The engineer comes up, you know, it's like, I got this like amazing product, it's going to just like blow our customers' minds, it's going to be absolutely, you know, unicorns and rainbows and stuff. And, you know, the SRE is like, I think I've seen something like this. And it didn't go so well, you know, you chugged up all kinds of, you know, all kinds of memory and it just like destroyed our whole product. And so the cynicism really comes into effect, you know, the SRE definitely remembers the last time that, you know, a major product went through and sees where things could go wrong. But obviously, like we need to get these products out, like there are awesome features, we just need to make sure that they go out to production safely and that if there are problems, we can roll back quickly. So one of the tools that we built at Cloudflare, thank you Edge team, shout out to Bill and Jared, is the Release Manager.
So Release Manager is a tool that's developed to effectively allow engineers to say, like, I'm upgrading this package and start rolling out slowly to each of the, like, canary colos and then slowly just out to the entire Edge. This allows us to track the resource utilization and basically monitor that everything's going all right. This tool has been incredibly helpful, at least like as well as like monitoring the health of deployments and again, reducing the footprint that SREs need to be in when dealing with like, you know, releases. There's a ton of stuff that we do around helping developers out and a lot of our time is spent trying to improve this process because that helps us move faster as a company. As well as there's a lot of things, you know, where it's minimizing the amount of work that we have to do. So, working with developers can be a little bit complicated because, again, there are a lot of teams, just as there are a lot of services, and so understanding who owns what, keeping that down is challenging, and it's always fluctuating as well. As well as like, you know, when you build this new like internal tool and you're like, I want everyone to use this tool, it'll be great, we'll get everyone on the same system so they come to us with like the same process and make things super streamlined, but that's like really hard. Every team has their own schedules and they set things up so like we can't like mandate these tools, so just trying to get adoption going, it can be a bit slow, but obviously like working with developers is a ton of fun. We're all engineers, we can all nerd out on things. And when you're making a tool that's helping someone else, they're going to be grateful for you. You know, just building that cross-team collaboration and, you know, you can hop in a Zoom channel and play Jackbox or something. It's a lot of fun. Finally, I want to talk about the last thing, which is maximizing efficiency.
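Looking back at the rollout pattern described for Release Manager (canary colos first, then gradually the rest of the edge, with health monitoring and the option to roll back), a minimal sketch might look like the following. It is not Release Manager's real interface, which the talk does not show: the function names, the stage fractions, and the `deploy`, `healthy`, and `rollback` callbacks are all hypothetical.

```python
from typing import Callable, List, Sequence


def plan_stages(canaries: Sequence[str], fleet: Sequence[str],
                fractions: Sequence[float] = (0.1, 0.5, 1.0)) -> List[List[str]]:
    """Canary colos go first; the rest of the fleet follows in growing slices."""
    rest = [colo for colo in fleet if colo not in canaries]
    stages = [list(canaries)]
    done = 0
    for fraction in fractions:
        upto = round(len(rest) * fraction)  # cumulative share of the remaining fleet
        if upto > done:
            stages.append(rest[done:upto])
            done = upto
    return stages


def roll_out(stages: List[List[str]],
             deploy: Callable[[str], None],
             healthy: Callable[[str], bool],
             rollback: Callable[[str], None]) -> bool:
    """Deploy stage by stage; undo everything if any colo in a stage is unhealthy."""
    deployed: List[str] = []
    for batch in stages:
        for colo in batch:
            deploy(colo)
            deployed.append(colo)
        if not all(healthy(colo) for colo in batch):
            for colo in reversed(deployed):  # most recent first
                rollback(colo)
            return False
    return True


if __name__ == "__main__":
    fleet = [f"colo{i}" for i in range(1, 11)]
    stages = plan_stages(["colo1", "colo2"], fleet)
    ok = roll_out(stages, deploy=print, healthy=lambda c: True, rollback=print)
    print("rollout succeeded:", ok)
```

Checking health only after each full batch keeps the sketch short; a real tool would presumably watch metrics such as resource utilization continuously during each stage.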
This is a screenshot from Factorio for those who are unfamiliar, there will be more, but when you're maximizing efficiency, you're effectively taking what you have and cleaning it up, making it go faster, making it easier to read, reducing the amount of time that you've just spent maintaining it. So it's cleaning up tech debt, removing things you don't need, and just like generally making things faster. This is where you can get your like dopamine hits of like, I made that like so much faster. So one of the tools that we set up here was Jaeger. So Jaeger is a distributed tracing system, which allows us to look at, you know, the amount of time that each request spends as it goes through the pipeline and identify which services are taking the longest and what, you know, specific function calls are taking the longest. This allows us to like substantially reduce the amount of time that we're spending in these requests. And, you know, even if it's saving like, you know, milliseconds, that makes a difference when you're at the scale that Cloudflare is at. And the improvements that we make through efficiency don't necessarily have to be specifically like the technical like request time or anything like that. It can be actually just reducing the amount of time that humans need to be interacting with things. This is an example of some of the Kubernetes automation that we've done to when we need to take a node out of rotation and drain it and clear out the pods and then bring it back up. It's awesome. You know, it's, it works really well. And it's some, you know, one less thing that the on-call engineer doesn't need to do. Thank you to the service automation team for this. It's awesome. Maximizing efficiency is difficult for me because I'm not the best engineer, but it can be very rewarding. You know, when you find something and you're like, things were going like super slow and you sped that up and you can like show a graph of like how much you improved it. 
You're like, yeah, that feels really great. You know, when you're doing a presentation in front of a bunch of people who are also nerds, you're like, yeah, that's awesome. You can really see like you've truly like accomplished something tangible. It's unquestionable. So that was kind of like my main overview of how, what SREs do at Cloudflare. This is kind of a lighter section where I talk a bit about what we've done since quarantine. So one of the great things I love about Cloudflare and why I joined was I was told that there are a lot of really good chat rooms. And there are, there's a ton of really good like completely off topic chat rooms, obsolete media formats, like I may have made that one so I'm a little biased, but it's a great place to interact with your coworkers and really get to know them better. And in this time when you're spending, you know, half your day like at home with people online, it's like this is, this is the time like that you're able to socialize a little bit and you know, having like Zoom calls, you know, this is a fantastic screenshot that my manager took. I just look amazing here. Yeah, just having Zoom calls or the week after quarantine started, we had a like a retrospective meeting where we were just like talking about what's going well, what's going wrong, you know, especially around in light of the current events and it's just like a lot of fun to get out this, you know, frustration or whatever and we have fun on core, we have a lot of fun. But yeah, so that was my main presentation. I just want to kind of walk you through quickly an introduction of these two guys. So at Cloudflare, there are two types of SREs. They're all the same SRE, but we deal with different parts of our infrastructure. So there's the edge SREs and there's the core SREs. Obviously, we have a lot of points of presence around the world where we have servers. 
The edge SREs are going to be dealing with, you know, these incredibly widely geographically distributed servers where all our customers' traffic goes through and it goes through like our network, whereas the core team is working on like a handful of data centers that's holding, like, analytics data, the API and similar, you know, data that needs to be in like kind of a single spot for now. Another quick distinction about these two guys is that Tom has just joined pretty recently. And when a new hire joins, this is kind of what they see. This is an image from Kellen's Factorio server. This is what like, you know, an experienced SRE sees, you know, it's a big mess sometimes. Without further ado, I would like to introduce Tom Godawski, who is on the core SRE team, and Daan Van Gorkum. I'll let you read their fun facts real quick, but I'm going to pop off share so that we can see their faces. Oh, right. That's All right. Take it away. Let's see. I have Tom up first. So, my question was, what stood out to you about Cloudflare when you joined?

What stood out to me? Um, I'd say the biggest thing that stood out to me is when I first joined, it was the ability to talk to anybody I wanted at any time, no matter their seniority nor title or anything. I mean, I hopped in the chat room. I had questions and I would just ask, no matter if there were 10 engineers or 500 people in a chat room and I felt comfortable doing so. So that's probably the number one thing that stood out for me.

And so, since you've been here, what sort of projects are you looking forward to working on? Or, you've been here long enough, what are you working on right now?

So I'll tell you about my favorite current project, which is deprecating old operating systems and bringing in new ones to our fleet. There's something about it. I enjoy doing it. And yeah, that's my current favorite thing I'm working on.

And that's, it's not easy to do, but it is really, really important. Thank you for that.
On to Daan, I would like to ask, what has kept you at Cloudflare for so long?

So I joined Cloudflare back in 2017 and SRE was a slightly different beast, almost, I would like to say. But the interesting thing is that every day you come in, it feels like a new chapter. Like it's a book you want to keep reading. You just want to see what's happening next. And I think that explains it very well. Some days are very quiet, right? There are days that nothing's happening. Everything is working as it should. No major alerts are firing. No one is getting paged. No one has to wake up in the middle of the night. But of course, there are these days that everything just seems to go wrong. It just doesn't want to work. And that is what keeps it interesting, right? You are able to work on these incredible systems and things that are like, yeah, you learn something new every day. And I think that's amazing.

What are you working on, like, on a day-to-day basis?

So the SRE has a couple of different areas. We touch the technical stuff like on-call and like dealing with the most difficult customer issues. Those are the things that we do on a day-to-day basis. And on the side, we have projects. And I'm mainly working currently on like hardware validation to really make sure that the hardware we run in production is untouched by anyone, right? We are in over 200 locations these days. And there is just a lot of things that could go wrong if we don't monitor this stuff correctly. And of course, we have existing systems that really take care to make sure that nothing is being put in production with wrong settings or anything. But still, it's important that we keep checking ourselves as well.

So in the years you've been here, have you run into any particularly interesting stories or maybe horror stories?

I think, well, maybe not horror stories, but I think everyone remembers the WAF thing, right? That happened to the WAF. You know, that hurts almost, right?
That was not inside our shift here in Singapore, but you see your phone light up, right? And just everyone is being paged. And we're starting to look into this. And, you know, it's the power of the group of all those people that we are still quickly able to find what the issue is and continue. So these things, those keep me up at night, right? I don't want those things to happen. I don't like those things. That's just terrible.

I remember I was walking into the office. I was wondering why everyone was so upset.

Yeah, definitely.

So since the quarantine has kind of begun, obviously the way that we've worked has substantially shifted. But Tom, you actually never kind of stepped foot inside the Austin office. You joined and your first day was about to be in the office, right? How has onboarding been completely remote?

So my first day in the office, I actually worked from my closet because movers were moving things around me and that was the only place they weren't going to touch. But as far as onboarding goes, I would say I've been looking a lot at prior commits and actions taken in the recent past by other SREs and other departments. And I really forced myself to just have absolutely no fear to just hop on calls and talk to absolutely everybody. But the support I've gotten from every single team and every single person I've talked to has been just stellar. So yeah, it's been great.

Daan, how have things changed since quarantine? How has the team been adjusting to working from home?

I think it depends a little bit on the person, to be honest. I do miss the social interactions we normally have in the office. It's very easy to just sit down, have a chat about any subject. And we cannot really do that now, like Zoom meetings and Google Meet and other things. It's a little bit more formal, it feels, right? It's not as casual as like one-to-one social interactions. But in general, I think we're doing quite well.
I think SREs are more or less used to being able to work from everywhere, right? As you showed the picture earlier, you're sitting on the ground like during a fire drill. That's part of our life, right? We need to make sure that we have a 4G connection always ready if something happens. So I think if it comes to technical, we are able to support everything without any problems. But sometimes project work is slightly more difficult because people don't know if someone's busy, right? If you don't get a response on the chat, it might be because someone is busy or something else is going on. So in general, I think we adapted quite well, but it will always be a struggle, I think.

One more question, Daan. Obviously, you're working out of the Singapore office. How have you felt like the interactions with other offices have gone? Do you feel like you're able to connect with the offices better now after quarantine?

That's a good question. I think in the last few years, we grew a lot in that field already. I mean, inter-office communication, definitely in different time zones, right? That is always a challenge. But it's getting a lot better. I'm not sure if that's because of the quarantine, but yeah, it is getting better for sure. It's a lot easier. Maybe people are working more hours. I'm not too sure. But yeah, it looks like people are more available.

So I have a question. It's a little bit off topic here, but for someone who is interested in being an SRE and going through the trials and tribulations and joys that are being an SRE, do you have resources that you would recommend someone get started in this? Tom first.

My initial answer was my best resource is you, Michael. But in all seriousness, how I got into the industry was just doing it because of necessity. I play a lot of video games, and I always wondered, what does it take to run a server? So I looked everything up. As far as specific resources besides you, I can't think of any.

Daan?

It's getting easier.
And I think that's great. You can be up and running very quickly these days, including all your SSH keys and passwords and other things that you need to do your day-to-day job. So yeah, just read a lot online, I guess, a lot of the things you're interested in. If you're interested in something, go for it, right? Don't try to hold back and definitely express it as well. If you're working inside Cloudflare, there's a lot of opportunity for self-growth, right? They really want you to do the next big thing almost. And yeah, that's the advice. Go for it, man. Just go for it.

I would like to call out, especially the Cloudflare blog. It's an incredible resource for people who are outside of Cloudflare, and something that's brought in a lot of talent. And seriously, if you're interested in Cloudflare at all, we're definitely hiring. We're always looking for more people to go in. You don't have to be an SRE. It's okay, I understand. But just being at this company is incredibly fascinating.

Absolutely. Yeah, the blog is great. There's so much information on there. It's amazing.

I think we're just at about time, but I want to thank you guys. Thank you, Tom and Daan, for hopping on here and helping me fill out 30 minutes of time. So I did have to just talk really rapidly at the camera. Thank you so much for tuning in and have a wonderful evening or morning, wherever you are.

Thank you. Take care.