Story Time
Join John Graham-Cumming, Cloudflare CTO, as he interviews a Cloudflare Engineer or a Solutions Engineer and they discuss a "war story" of a problem that needed to be solved and how they did it.
This week's guest: Simon Moore, Principal Supportability Engineer at Cloudflare.
Transcript (Beta)
Simon, welcome. This is Story Time, which is a silly show where I talk to people from Cloudflare usually about weird stuff that's happened in the past, problems they've solved, how they ended up at Cloudflare, all manner of things and well you and I have worked together for a hell of a long time now.
Indeed, yeah. I think this will be seven years coming up in August.
Yeah, I'm trying to... People don't know Cloudflare history.
I was the first person, not in San Francisco. So when I joined Cloudflare, there were 24 people and I was at home alone.
Funnily enough, actually the desk in front of me is the really rubbish desk that was my Cloudflare desk and I'm using it now for Cloudflare TV.
And then finally we got an office and I'm trying to remember, were you the second person we hired in London?
No, so I think I was number five or four.
So there were, at that time there was Andrew, another engineer, there was James who was the first SRE in London and then there was Sam and Marty who were the first support people.
What always strikes me about that is that, of that original crew of people we hired in London, only one person has left in seven years, basically.
We're all still hanging in there. Yeah, it's amazing, and actually the two people who joined around the same time and after, Marek and Sigis, are still here too.
So you have moved as well just like that. And I think one of those folks is actually thinking of moving to Lisbon as well.
So we may all get, we might get the whole crew back together again.
We'll get the band back together. We'll walk down the river in slow motion.
Down the river in slow mo, yeah. It's like The Right Stuff, that kind of thing.
Well, I thought it'd be good to get you on this because of that long time at Cloudflare.
I mean, you and I were there when things were really small and things weren't quite as stable, as efficient or as organized as they are now.
We've had a few sort of crazy things happen over time. One of them is that you ended up getting on aircraft and upgrading servers, right?
Yeah.
So yeah, you're right. Like in the early days, we ran very, very lean. So we had a team of like five or maybe seven of us in London at that time.
But around the world, we were aiming to do a bunch of upgrades to our data centers.
And we had a few things that I think we wanted to do.
In some places we needed to add new servers.
In some places we needed to add RAM to existing servers and TPMs as well. And also some routers and switches and all sorts of things.
But basically we had to do this very, very quickly because Cloudflare was growing extremely quickly and so was our traffic.
And we had to do it in basically every location on our network, I think over a relatively short period of time.
At that time, we didn't really make a massive habit of using smart hands, remote hands.
And I think for this type of work, we decided that it would be better if we did it ourselves.
It would save us some money.
And also we had more people in more locations. But we now have a dedicated team for infrastructure, right?
I think we had a network engineer or two and a small SRE team, and that was it.
So at some point, Sri, who was the first SRE at Cloudflare, emailed me or messaged me.
I think we were using Skype back then and just said, hey, do you want to go to all of these data centers?
And there was London, Paris, Amsterdam, Prague.
There was maybe about seven or eight of them.
And I was like, yeah, of course. Because in my head, I was just like, these are all my favorite cities.
Why would I want to go? Why would I not go?
All I have to do is shove some RAM in some machines. Everything will be fine. So you guys, yeah.
I was like, you guys are going to pay to fly me to all of these cities?
Of course I'm going to do that. So yeah, I jumped straight in with both feet. And Marty, who was working with us at the time, was also very excited and muscled his way in.
But he demanded to be included and also went to some data centers. So yeah, the job was to install some RAM.
It was to install some TPMs, some servers, some routers, the Pluribus, I think we were using.
Yeah, right. So it's worth talking about some of the gear, right?
So when I joined Cloudflare, we actually had a bunch of servers that HP had essentially given to Cloudflare.
And they were installed in places around the world.
I think HP was hoping that Cloudflare would grow enormously, would buy loads of HP kit, and it would be a really great strategy.
We did grow enormously. We didn't buy HP kit. And we ended up getting our own servers made by ODMs in Taiwan.
But there was this intermediate period where, instead of doing what we do today, which is upgrading by sending new servers and running different generations of hardware, we actually had to go in and literally shove RAM and, I think, disks in machines because we were just outgrowing RAM, all that kind of stuff.
And so this was what led to this, you know, European vacation thing where you go around and you go plug stuff in.
And the other thing we were really, really interested in was making sure that the software that was running on those machines was our software.
And to do that, well, you mentioned these TPMs, right, these Trusted Platform Modules, which give you a cryptographic way of verifying your boot process or the software you're running, and we hadn't installed those.
And so somebody literally, it's a little square, right, somebody had to go and power off the machine and plug them in.
So where did you go on your, you know, your jolly trip around Europe?
So I went to, and I've made notes, I went to Stockholm, Prague, Vienna, and Warsaw.
And we also had a little warm up trip to London.
So those of us who were prepared, so Marty and myself, but also Sam and James, and I think Sigis and Marek, we all had the opportunity to go to the London data center because no one had seen a Cloudflare data center, or most of us hadn't really been to a data center at all.
So we all jumped at the chance to go. But what people don't tell you about data center upgrades, and it all sounds very glamorous, traveling to all these places, is that when you're running a live network, you can't take London down in the middle of the London day.
So what you have to do is go at sort of two in the morning, when traffic has kind of tailed off from the evening, so that you can safely take the location offline without impacting performance and capacity across the whole network.
So invariably, what you do is you show up at these locations in the dark, they're normally in sort of strange parts of cities as well, they're not always quite where you think they may be.
Although in London, it was actually in Canary Wharf.
So it's kind of, you know, a very popular space, but at 2am, pretty, pretty quiet.
So you're quite sleep deprived. And then you have to do a bunch of very intricate work with screwdrivers, unracking servers, carefully placing RAM modules inside, putting everything back, remembering where all the cables went.
So it's actually really mentally taxing work.
It doesn't sound like it, but in the middle of the night when you're sleep deprived, it's really difficult.
Yeah. And also, the one thing you haven't described for anyone who's not been in a data centre is the noise.
Everything has got fans in it. There's air conditioning. It's actually really not a very nice place to be.
And most people who work in data centres are not actually in the data centre bit, they're like in a little office with a big window.
Because the noise is tremendous. Yeah, yeah, exactly. You've got all these fans running constantly.
And it just kind of drives you a bit crazy.
Right. And the air is very strange, because depending on where you are, you'll have like hot aisles and cold aisles.
And depending on where you are, it's going to be very warm or very, very cold, and also very, very noisy.
So it's quite difficult.
And I remember someone on the SRE team, Josh saying, explaining like, you are basically operating at half your mental capacity when you're in a data centre.
Because of the noise, because of the sleep deprivation, because you're kind of heightened and stressed, your normal sort of logical way of thinking through things doesn't operate quite the same.
So you have to be really careful. And actually, what you normally need is when you're doing anything really important, you're normally on the phone to someone who is not in that environment who can guide you and help you.
And that was something that really stuck with me because it was, yeah, you're doing it, apart from the London trip, it was done all on my own.
I'm on the phone to someone else pretty much who was making sure that I wasn't making mistakes along the way.
I found that, you know, having like a checklist, having things that are really basic, just so that you can keep yourself sane in there is helpful.
The other thing that can be really difficult in some data centres is that many of them have a restriction where you can't take photographs.
And because they don't want you photographing other people's equipment, they have a blanket rule: you can't photograph anything.
And the first thing you often want to do is take a picture of the rack, just as a sort of aide-mémoire in terms of like, oh, where were these things plugged in?
Did I actually get it absolutely perfectly right?
So it's a very stressful environment.
I seem to remember that Stockholm didn't go so well, right? Yeah, that's true.
That's true. So Stockholm was a was another classic example. I remember going to Stockholm, it was snowing, which felt very appropriate.
And I had trouble getting in, that was always the hard part as well.
Like, even though the right people at Cloudflare had contacted the data centre, normally you have to book an appointment, right?
They have to know that you're coming. You have to be authorised, you have to provide identification.
So you have to do all of that stuff.
But invariably, you show up and they don't know. They don't know. And then you have to get on the phone to people.
And it must have taken like a good 90 minutes to even get into the place.
So I stood outside in the snow and cold for about 90 minutes.
I got in plenty of time, same deal, right? You don't start work until sort of, I don't know, one, two in the morning.
So you don't actually start touching anything.
So you start kind of preparing, you start unboxing the new kit and getting everything ready.
So I'd sort of done all of that. And then you kind of need to take a look at the rack.
And a rack of servers, it's just the big column. If you've never seen it, it's just a big column of shelves.
And the servers kind of just slot into it.
And normally they have a door and the door might be locked. And the door is kind of a mesh grill.
So you can't normally really see what's going on inside.
All you can see is a bunch of blinking lights from the machines behind. But in order to kind of assess what you were saying, take a photo, that's totally true.
That's really what you want to do. You want to see how everything is laid out.
And also when you're communicating with people who are coordinating it, you want to be able to send it to them so they can see what you're seeing and help you.
But you often can't do that. So the first thing, before the maintenance starts, you kind of want to just take a look at the rack and see what's going on.
So that seems like a safe thing to do.
It ordinarily would be. So I opened the door. About maybe a couple of minutes later, I'm just looking at things.
I'm literally not touching things.
A couple of minutes later, I get a phone call. And it's Sri. And Sri is probably one of the calmest people I've ever met.
He's got such a soothing, relaxed voice.
And Sri says to me, Hey, Simon, did you do something? And I was like, at that point, all the blood like drained from my body.
And I was like, oh no.
I literally opened the rack door. He's like, Stockholm is offline. And I was like, oh God.
And eventually what it turned out was that the power cord in the router was not seated correctly.
It hadn't been pushed all the way in.
And the router also was screwed in on a small set of rails, a half set of rails, which meant it was kind of balancing.
It was kind of hanging down with the weight of the router.
And just opening the door caused it to wobble slightly.
And when it wobbles, the power fell out. And when the power falls out, the whole communication to all of the servers just completely dies.
And the whole of the Stockholm location just disappears off the map. And at that point, Sri calls me and says, did you do something?
And then we have to kind of explain.
It took a long time. If a router dies unexpectedly, I don't really understand why because I'm not a network engineer, but it takes a long time for it to come back online and to get everything back into a safe state.
And I'm sure things are much more optimal now in how we configure things.
But back then it was kind of a hard manual task for the SRE and network team to kind of get everything back and check everything over in order to be able to put everything back online.
So yeah, that was a chilling moment.
I'll never, ever forget that. I mean, the good thing is because we use BGP Anycast, when that happened, I mean, Stockholm would have disappeared, but other data centers would have immediately picked up the load.
I guess the bad thing was this was a long time ago and we had hardly any data centers.
So it would have been, you know, suddenly those Swedish, those middle of the night Swedish web surfers were probably being routed to Prague or something and the performance wasn't so great for them.
Yeah. Yeah. That's, you know, early on when you put these things together... you remember that famous picture of Google's first server rack, where there's cardboard between the layers of servers? I think early on, companies do a lot of stuff really fast.
And then you have to go back and make it a bit more professional. And thank goodness, we have this infrastructure team that manages what, 200 cities now where it's, it's a little bit better than send Simon to Stockholm.
Yeah. Yeah.
I would. Yeah. If you were watching this as a Cloudflare customer, you need to know that they don't let me in data centers anymore.
Like absolutely not. And yeah, that was very, it was a very interesting and a massive, massive learning experience.
And the Stockholm one in particular was particularly challenging.
Yeah. The other thing to mention, actually, since you mentioned the whole photography thing, and this is a true example,
is that I would literally have a piece of paper and I would draw a rack and write things down for myself on paper.
I believe our infrastructure team now has a piece of software that essentially inventories all of your racks.
So you know exactly where every server is, you know all of the labeling, and it's just an all-around slick operation.
But back then it was intensely manual.
Yeah, it was, it really was all that sort of stuff. And there was a lot of flying machines to countries to get them there.
And if you remember in London, we used to have that big pile of servers.
There was all those servers that would just turn up in the London office and be like, yeah, don't worry about those, send those to Lisbon tomorrow.
Someone will come and get them. And, you know, if you haven't seen the size of servers we have in the box, then they're absolutely huge.
There's like this gigantic box.
It's nothing like a PC box. It's a huge box. And occasionally Marty would get one out and test it in the office.
And if you remember in that tiny office, the sound of that, just one of these things in full blast was, was terrifying.
Yeah. I wanted to ask you about something else. In 2014, we had a very well-known service provider.
I won't name them because they probably don't want us to talk about it, but they had a vulnerability they were very worried about.
And they knew that there were a whole load of their users who use Cloudflare.
And they came to us and they said, could you put something in your WAF that would block this particular attack just while we have time to patch and get everybody to upgrade?
And we're like, sure, no problem. And they said, well, here's what you're going to do.
You're going to match this URL and you have to match this list of IP addresses.
And they gave us a bunch of CIDRs. And I think it actually ended up being about 200 individual IP addresses.
And we were like, yeah, we'll do that.
And it turned out we didn't have a system for doing that. Right. And so we had to hack it.
And I remember I was in Poland and you were on a bus in London and we got on the phone and discussed this.
What, what was, what was the solution?
So, yeah, you're right. Like at that time, I think your job title was program.
It was, it was. Yeah. Your job title was programmer and you had written the first version of Cloudflare's WAF?
The second one, the second one. So the second version of Cloudflare's WAF, just full disclosure for everybody who wants to know about, if you think the server thing was a little bit scary, the very first version of Cloudflare's WAF was you got NGINX, which everyone was talking to.
Then you got another NGINX and that NGINX is talking to Apache.
And Apache was running ModSecurity, like the classic ModSecurity thing.
And the reason there was an NGINX in the middle was that Apache kept crashing and it was a way of protecting our main service.
And one of the first things Matthew did when I joined Cloudflare was give me his list of things to do.
And one of them was write a new WAF.
So I ended up writing a new WAF, which is still in production. I think the team is slowly now migrating off of it.
So we had this thing. So yeah, so I, you, I was kind of the WAF guy and you had got a reputation for being the other WAF person who would write rules for customers, right?
Yeah, that's true. So I had my head in my hands as you explained the technology of the first WAF there, because I thought you would, the letters that were going to come out of your mouth, I thought were PHP and they didn't.
So fine, we'll move on. Listen, do you want to hear some PHP?
All of Cloudflare was written in PHP at the beginning. I do remember that. I think just as I joined, I remember the migration from PHP to Lua.
Yeah. That was a crazy time, but yeah.
Okay. So, but I mean, Apache, ModSecurity, it was, it was, it sort of worked, but you're talking about when Cloudflare had like 5,000 customers or something instead of 27 million domains.
So it was a different world.
For sure. And yeah, so the WAF that you wrote still used the ModSecurity syntax, which meant we were able to migrate the existing functionality that we had into this newer, faster, more flexible thing.
And it meant that customers also could come to us and say, Hey, I run ModSecurity on my origin, or my previous provider had these rules.
Can you create them for us in Cloudflare? And yeah, we would take those in the support team, John, I think you trained us and showed us how to do it and how to write tests for it.
So it was something we would do relatively regularly.
But most of the time we were writing rules just for a specific customer.
So kind of the scope was much more narrow. But yeah, when you called me about this one, we were talking about writing something that ultimately we turned on for every single customer on our network.
I think we gave that rule away for free.
The WAF traditionally is something you get on our Pro plan, which starts at $20 a month, but in specific cases where there's a vulnerability that has a much wider impact, we have decided to deploy those rules to protect everyone.
Yeah, that's a very good point, actually.
We, for a very, very long time, ran a set of rules for every customer of Cloudflare because we felt like the threats that we were dealing with were so widespread that it was better just for the Internet just to say, I know you should be paying for this, but we're going to protect you.
We didn't tell anybody. We just protected people just to make the Internet better.
And there were actually some threat protections that were literally hard-coded.
So when Shellshock came out, if you remember the Shellshock vulnerability, I actually literally hard-coded protection for that.
Not even in the WAF, it's like literally an if statement because that was such a scary, scary vulnerability.
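For readers wondering what "literally an if statement" could look like, here is a minimal sketch. It is not Cloudflare's actual code. It assumes request headers arrive as a plain dictionary, and it relies only on the publicly documented trait of the original Shellshock bug (CVE-2014-6271): vulnerable versions of bash treated an environment variable whose value began with `() {` as a function definition. The function name and sample headers are hypothetical, and the check is deliberately broad (it looks for the marker anywhere in a value).

```python
# Marker that vulnerable bash versions interpreted as a function definition.
SHELLSHOCK_MARKER = "() {"


def looks_like_shellshock(headers: dict[str, str]) -> bool:
    """Return True if any header value carries the Shellshock marker."""
    return any(SHELLSHOCK_MARKER in value for value in headers.values())


if __name__ == "__main__":
    attack = {"User-Agent": "() { :; }; echo vulnerable"}
    normal = {"User-Agent": "Mozilla/5.0"}
    # The hard-coded "if statement" would sit in front of everything else.
    for request_headers in (attack, normal):
        if looks_like_shellshock(request_headers):
            print("block")
        else:
            print("allow")
```

A check this blunt trades some false positives for speed of deployment, which fits the "such a scary vulnerability" framing above.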
So yeah, we were doing this.
And then this particular one, as you say, if you remember, I made that little thing where you could create a custom rule and it would set up a test harness for you and you would write positive and negative tests.
And then this one came along and this one was block this URL pattern if the real IP header has one of these 200 email addresses in it, right?
Sorry, 200 IP addresses. Yeah. Yeah. So it was like a control endpoint for a piece of software.
And that endpoint was only designed to be used behind some authentication and by a trusted set of servers.
And I think a bug was found in the authentication layer, which meant that if someone could exploit it, then they could take control of this application.
So it was a big vulnerability.
And the team who wrote this software, I think, wrote to someone at Cloudflare and was like, you guys have a lot of customers (at that point we had a lot, and we have even more now), so could you potentially protect all of our mutual customers for us?
So yeah, what they had was a list of, I don't know, did you say it was about 20?
No, I think the actual number of IP addresses was 200, but we had about 20 CIDRs, right?
So something slash 23 or whatever.
So. Yeah. So we have these 20, like, what are called CIDR ranges, I guess.
And basically it's a shorthand way of expressing a range of addresses, starting from a number and ending at a higher number.
But the logic can get quite complicated and it's really confusing.
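To make the "shorthand for a range" idea concrete, here is a small sketch using Python's standard `ipaddress` module. The addresses are illustrative (documentation and private ranges), not the ones from the story, and `cidr_bounds` is a hypothetical helper name.

```python
import ipaddress


def cidr_bounds(cidr: str) -> tuple[str, str, int]:
    """Return (first address, last address, address count) for a CIDR range."""
    network = ipaddress.ip_network(cidr)
    return str(network[0]), str(network[-1]), network.num_addresses


if __name__ == "__main__":
    # The number after the slash is how many leading bits are fixed;
    # the remaining bits are free, which is what defines the range.
    for cidr in ("192.0.2.0/24", "192.0.2.16/28", "10.0.0.0/23"):
        first, last, count = cidr_bounds(cidr)
        print(f"{cidr}: {first} to {last} ({count} addresses)")
```

A /24 lines up with an octet boundary, while a /28 or a /23 does not, which is where the confusion tends to start.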
I've always had problems with CIDR ranges.
So I was very nervous for two reasons. One, I don't really understand CIDR ranges.
I understand them a bit better now, having been here for seven years.
Two, the WAF did not have the ability to analyze IP addresses, but what it did have was the ability to look at request headers and test the values of request headers.
So Cloudflare, being a proxy, would always pass around the IP address of the real visitor, which meant that theoretically we could just write a rule that says, look at this header, and the header is called something like X-Real-IP or remote address.
Yeah, something like that. And test that value against it. So that's problem one.
We can now, we can actually write functionality to test the IP address by using request headers.
But as a string, right? Not as a number. So it's like literally string matching.
Yes. So expressing ranges is a challenge, I suppose.
But what you can do, you can write regular expressions to parse strings and you can express a regular expression to match a CIDR range.
But it gets really, really complicated because there are lots of different variations.
And so for example, a CIDR range might be what we call a slash 24.
And if you know an IP address, an IPv4 address, it has four segments or octets.
And a slash 24 basically means the last octet is the range that you're trying to match: anything from zero to 255 in the last octet.
So you can kind of express that in a regular, I could figure out how to express that in a regular expression, because I'm just going to match the first three parts and I'm going to ignore the last one.
But then you can have like a slash 28 or you can have all these other numbers, which kind of don't match the octets perfectly.
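The difference between a slash 24 and a slash 28 shows up clearly when the address has to be matched as a string. The sketch below uses hand-written patterns for two illustrative ranges (not the ranges from the story) and cross-checks them against real CIDR arithmetic from Python's `ipaddress` module. All names here are hypothetical.

```python
import ipaddress
import re

# Any decimal octet from 0 to 255, with no leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# 192.0.2.0/24: the first three octets are fixed, the last can be anything.
SLASH_24 = re.compile(r"192\.0\.2\." + OCTET)

# 192.0.2.16/28: only 16 to 31 is allowed in the last octet, so the range
# has to be spelled out digit by digit (16-19, 20-29, 30-31).
SLASH_28 = re.compile(r"192\.0\.2\.(?:1[6-9]|2[0-9]|3[01])")


def header_matches(pattern: re.Pattern, header_value: str) -> bool:
    """Pure string match; fullmatch stops '192.0.2.160' passing as '192.0.2.16'."""
    return pattern.fullmatch(header_value) is not None


def agrees_with_cidr(pattern: re.Pattern, cidr: str) -> bool:
    """Compare the regex with real CIDR membership for nearby addresses."""
    network = ipaddress.ip_network(cidr)
    for third in (1, 2, 3):
        for last in range(256):
            text = f"192.0.{third}.{last}"
            expected = ipaddress.ip_address(text) in network
            if header_matches(pattern, text) != expected:
                return False
    return True


if __name__ == "__main__":
    print(agrees_with_cidr(SLASH_24, "192.0.2.0/24"))
    print(agrees_with_cidr(SLASH_28, "192.0.2.16/28"))
```

The slash 28 pattern is already three alternatives for one octet; ranges that cut across two octets grow much faster, which is why a generator tool is tempting.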
So I sort of Googled around, I did what any enterprising coder would do.
I Googled it and someone has written a tool online that you can plug a CIDR range in and it'll give you a regular expression back.
It was really designed for one CIDR range, but we had like 20. So I just did that like 20 times.
And I just piped them all together into one giant regular expression.
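Joining per-range patterns with the regex alternation operator (the pipe) can be sketched as follows. The two patterns are illustrative stand-ins, not output from the online tool mentioned above, and `is_trusted` is a hypothetical name. Wrapping each pattern in a non-capturing group and matching the whole string keeps one range's alternatives from bleeding into the next.

```python
import re

# Hypothetical per-range patterns, one per CIDR range.
RANGE_PATTERNS = [
    r"192\.0\.2\.(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])",  # 192.0.2.0/24
    r"198\.51\.100\.(?:1[6-9]|2[0-9]|3[01])",                     # 198.51.100.16/28
]

# One giant expression: (?:range1)|(?:range2)|...
COMBINED = re.compile("|".join(f"(?:{pattern})" for pattern in RANGE_PATTERNS))


def is_trusted(header_value: str) -> bool:
    """True only if the entire header value is an address in one of the ranges."""
    return COMBINED.fullmatch(header_value) is not None


if __name__ == "__main__":
    for candidate in ("192.0.2.7", "198.51.100.31", "198.51.100.32", "1192.0.2.7"):
        print(candidate, is_trusted(candidate))
```

With 20 ranges the same join produces a very long expression, so it is the kind of rule that benefits from exhaustive tests.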
And then I wrote, went away and wrote 40 tests, like one for every single, I was so paranoid about this and so worried about the scale at which we were planning to run this and also the speed at which we needed to do it.
And I wrote the most exhaustive tests I could come up with. And yeah, your WAF tool had this ability: you would write your rule and then you would also have tests.
And the tests basically were just text files, each of which had an HTTP request in it.
And then I remember, well, there were two directories, like a 200 directory and a 403 directory.
The idea was: write tests, write stuff that's meant to get through, write stuff that's meant to get blocked, and then we can run it to test that the rules work.
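The directory-per-status idea can be sketched roughly like this. It is not the original tool: the rule, the protected path, the header name, the trusted range and the file names are all hypothetical, and it assumes the rule blocks the protected path unless the header holds a trusted address. Each test case is a text file holding a raw HTTP request, and the directory it sits in (200 or 403) is the expected result.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical stand-ins: one trusted range (192.0.2.16/28), one protected path.
TRUSTED_IP = re.compile(r"192\.0\.2\.(?:1[6-9]|2[0-9]|3[01])")
PROTECTED_PREFIX = "/control"


def parse_request(raw: str) -> tuple[str, dict[str, str]]:
    """Pull the path and headers out of a raw HTTP request stored as text."""
    lines = raw.splitlines()
    _method, path, _version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return path, headers


def rule_status(raw: str) -> int:
    """Toy rule: block the protected path unless the header holds a trusted IP."""
    path, headers = parse_request(raw)
    trusted = TRUSTED_IP.fullmatch(headers.get("x-real-ip", "")) is not None
    return 403 if path.startswith(PROTECTED_PREFIX) and not trusted else 200


def run_suite(root: Path) -> list[str]:
    """Run every case; the directory name is the expected status code."""
    failures = []
    for expected in (200, 403):
        for case in sorted((root / str(expected)).glob("*.txt")):
            got = rule_status(case.read_text())
            if got != expected:
                failures.append(f"{case.name}: expected {expected}, got {got}")
    return failures


def build_demo_suite(root: Path) -> None:
    """Write four illustrative cases into 200/ and 403/ under root."""
    cases = {
        "200/trusted.txt": ("/control/restart", "192.0.2.20"),
        "200/other_path.txt": ("/index.html", "203.0.113.9"),
        "403/untrusted.txt": ("/control/restart", "203.0.113.9"),
        "403/near_miss.txt": ("/control/restart", "192.0.2.32"),
    }
    for relative, (path, ip) in cases.items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(f"GET {path} HTTP/1.1\nHost: example.com\nX-Real-IP: {ip}\n\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_suite(Path(tmp))
        print(run_suite(Path(tmp)) or "all cases passed")
```

The near-miss case (one address past the end of the range) is the kind of test that catches a generated regex that is slightly too generous.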
Yeah. So what you're always really worried about with a WAF or any kind of security thing is what's called false positives and false negatives.
And what I'm really worried about, yeah, is blocking things that we shouldn't be blocking and not blocking the things that we should.
So your tests really allow you to assert that that's not the case.
So I write like 20 tests that should be allowed through, and they're all matching those IP addresses that are trusted.
And then 20 that don't, and then run those.
I've got it right in front of me, actually. You wrote 16 tests that were meant to 403 and 15 that were meant to 200.
So yeah, you wrote a lot of tests.
I was just so terrified because obviously, like I say, I can understand regular expressions mostly, but when it came to this stuff, I was just plugging this into a tool and who knows whether that tool is accurate or trustworthy.
So that's where testing really, really helps so that you can at least prove that even if you don't understand what you've written, or more importantly, if you don't understand what someone else has written in your team, you can kind of just put a test and go, well, they pass or they fail.
So it works or it doesn't.
Yeah. So I think we got there. And that regular expression is enormous. And it lived for quite a while.
I don't think it's still in existence anymore, because eventually we did implement the ability to properly match IPs and also to do CIDR logic.
So we were able to stop using request headers and also stop using regular expressions for CIDR matching.
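For contrast with the regex approach, here is roughly what "properly matching IPs with CIDR logic" looks like when the address is treated as a number rather than a string. This is a sketch with Python's standard `ipaddress` module, not Cloudflare's implementation; the ranges and the `is_trusted` name are illustrative.

```python
import ipaddress

# Illustrative trusted ranges, parsed once into network objects.
TRUSTED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.16/28", "198.51.100.0/24")
]


def is_trusted(address: str) -> bool:
    """Parse the address and test membership using real CIDR arithmetic."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # not a valid IP address at all
    return any(ip in network for network in TRUSTED_NETWORKS)


if __name__ == "__main__":
    for candidate in ("192.0.2.20", "192.0.2.32", "198.51.100.200", "not-an-ip"):
        print(candidate, is_trusted(candidate))
```

Adding a twenty-first range here is one more string in a list, rather than another branch in an enormous regular expression.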
So that functionality is now there. The nice thing is I pulled this up in our code system, and you wrote a really nice comment describing what you're doing, how you wrote the test, everything.
It's as clear as day.
You can get in there and you can say, okay, yeah, this is exactly what's going on.
And myself and Lee Holloway, technical founder, we both reviewed it.
It looks like you probably wrote it in the time it took me to fly from London to Warsaw.
I remember calling you from Heathrow. I was almost boarding the plane.
I was like, can you do this crazy thing? And you were like, yeah, I think I probably can.
Yeah, I think I wrote it at the end of the day. I worked in London.
And then I got on the 141 bus back home to Harringay where I lived at the time.
And I was still doing it. We were still sending emails and we're still fixing tests and stuff.
So I do remember sitting on the top of the 141 bus going around Old Street roundabout, tapping away in the terminal, trying to get this done.
So I'll always remember that moment.
And I'll always remember the 141 bus for that reason.
I think I'm going to have to get this RegEx engraved into an acrylic gift for you, like a sort of souvenir of Cloudflare.
Because it's quite large looking at it.
I mean, I wish I could publish it, but it's got a lot of IP addresses in it.
But it's quite something. But luckily, we have better functionality these days.
Yes. And I think you were saying before, we were chatting before, this functionality, the ability to kind of match a whole bunch of IPs is something customers ask for a lot.
And we do have ways of doing it. But managing sort of large lists of IPs, independent from your rules, is not something we've ever had.
But it's actually very, very close to being launched.
So finally, this would be very, very easy for us to do today.
And I wouldn't need to be on the 141 bus with my laptop open.
Well, now you can get on a 28 tram here in Lisbon and go up and down the hills and write a regular expression.
Oh, God, yeah. I've been on the 28 tram once, and I didn't even want to be on there for tourism.
So I don't think I would enjoy being on there writing.
Do it now. There's nobody on it. I mean, it's the perfect time to go on it.
Well, listen, Simon, thank you for talking to me about the horrible history of that.
And, you know, if anyone else is watching this, who is starting a company, don't feel bad about some of the crazy stuff you're going to have to do in the beginning.
And some of it's not going to feel like you're doing a great job.
It's easy to judge companies from the outside and think, wow, they must be doing everything perfectly.
No, we did lots of things that were expedient and needed to get done.
And then over time, we replaced them and made things better. And so, you know, that's really the story of building any company.
So, Simon, thank you for being here seven years.
Hopefully we'll be here many years more and I'll see you again on the show sometime.
Thanks, John. Yeah.