To Really Mess Up Takes a Computer
Presented by: Watson Ladd, Peter Wu, Tom Strickx
Originally aired on March 6, 2021 @ 4:00 AM - 5:00 AM EST
Experienced sysadmins and coders talk about their worst mistakes: rm on the wrong tmux, dropping prod db, expiring certs, and the operational improvements/lessons that resulted. Doesn't have to be Cloudflare related!
Transcript (Beta)
Welcome to To Really Mess Up Takes a Computer, where Cloudflare engineers talk about some of their worst mistakes.
I'm Watson, your host, talking live from New Jersey.
Today's guests are Tom Strickx and Peter Wu. Tom is a network software engineer on the network engineering team at Cloudflare.
He mixes daily operations and incident response with network automation development.
Furthermore, he's the vice chair of the non-executive board of INEX, an Internet exchange in Ireland.
Peter is a colleague of mine on the research team.
He's a systems engineer who likes to use and improve open-source software and has been contributing to KDE and Wireshark, to name just a few of the many projects he contributes to.
So Peter, I hear you had a rather unfortunate space in a command some time ago.
Yeah, so let me share a command I've been executing and ask you, so what do you think that this command is doing?
You are in some directory because some commands require permissions and you start running chmod O minus R.
You run the command with star and dot star because you wanted to make things more secure by removing permissions.
The intention here is to also include hidden files because that by default is not included with a star.
Yeah, it's gonna go through all the files that start with a dot, as well as all the ones the shell lists by default.
There's nothing wrong with that command. That's what I thought as well, but then I was surprised that I started getting like errors from my web server, forbidden errors.
I couldn't run sudo anymore, and there was some other strange stuff. It turned out that, by default, star doesn't include hidden files, the dot files, files whose names start with a dot.
So that's why I added this second part, which unfortunately also matches the entries dot and dot dot.
The first one, dot, is the current working directory, so the command made that inaccessible.
The second one, dot dot, is the parent directory, which, combined with being in the root home folder, also resulted in rather unexpected behavior.
So after doing that, luckily I had backup so I could recover.
At first I was like, ah, my server got hacked. Oh, what did I do?
Maybe I can restore from backup and hopefully the hacker doesn't get in. It turned out it was just like my mistake.
But like a real scientist, you remember that xkcd with the lightning bolt striking someone when they touched a pole?
A normal user would probably be like, ah, let's not try to do this again.
Yeah, instead I restored from backup, ran the same command again, and had the exact same problem.
So at what point did you figure, okay, you know what, I'm going to put echo there and see what it does.
I don't recall when I ran echo. I should have, indeed, but I probably didn't at first.
Then I figured out like, huh, interesting. Why do I have the root in there?
In this particular case, the solution was rather simple. So instead of adding something to the effect of trying to achieve what I was hoping to achieve, I tried to modify the shell instead to make sure that the star also includes hidden files.
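The expansion rules behind this story can be modeled in a few lines. The following is a minimal sketch, not the shell's actual implementation: `shell_expand` and its `dotglob` flag are hypothetical names, the directory listing is made up, and the rules model the traditional behavior described here (a shell may be configured differently).

```python
from fnmatch import fnmatchcase

def shell_expand(pattern, entries, dotglob=False):
    """Model traditional shell filename expansion within one directory."""
    matches = []
    for name in sorted(entries):
        hidden = name.startswith(".")
        if hidden and not pattern.startswith("."):
            # A leading dot must be matched explicitly unless dotglob is on,
            # and "." / ".." are never matched implicitly.
            if not dotglob or name in (".", ".."):
                continue
        if fnmatchcase(name, pattern):
            matches.append(name)
    return matches

# Directory entries as the system reports them, including "." and "..".
entries = [".", "..", ".bashrc", ".ssh", "notes.txt", "www"]

print(shell_expand("*", entries))                # ['notes.txt', 'www']
print(shell_expand(".*", entries))               # ['.', '..', '.bashrc', '.ssh']
print(shell_expand("*", entries, dotglob=True))  # ['.bashrc', '.ssh', 'notes.txt', 'www']
```

The second call shows the trap: the pattern meant to pick up hidden files also picks up the current and parent directories. The third call models the fix of changing the shell's behavior so that star covers hidden files without an extra pattern.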
But this is like, usually you don't learn this thing. You only like learn this when you make the mistake and then you will remember forever.
That's a bashism, isn't it?
Yeah. So this particular case, it wasn't destructive, right?
I mean, all the data was still there. I could like easily recover from it because I still had the root shell and access to restoring backups.
Somewhat less fortunate are some Linux users who used Bumblebee in the past.
Wait a sec. So, but all you did was change the permissions of root to remove world readability, world writeability.
Could you have just chmodded it right back? I could probably have, but I didn't have like access to all permissions.
So it was easier just to restore from backup.
I believe when you modify permission like perhaps it also removes the set user ID bit.
So things like sudo and su might not work or maybe it does, but in any case, normal user wouldn't be able to execute those.
Yeah. So if you'd done this to sudo, bad day, bad day. Yeah. Luckily it was just my personal server with hosting a site for a few of my friends.
It wasn't entirely destructive except for maybe some log files, but the data was still there.
The data was still there in that case.
So mistakes.
So for those who don't know, Bumblebee is a project that tries to improve the state of NVIDIA graphics on Linux.
So some laptops have a dedicated NVIDIA graphics card and a normal Intel, less powerful Intel graphics card, which is supposed to save power when you don't need the more powerful NVIDIA card.
And in the past, Bumblebee was like a shell script that tried to make the life of those users happier by allowing use of this NVIDIA graphics card.
So yeah, it's a big shell script.
And of course, because it wasn't like entirely standardized and so on, it has an installation script.
And yeah, you know, as it's an open source project, everyone can start contributing to it.
And sometimes I don't know whether it was like night or so, but there was like a small cosmetic change.
Small cosmetic change. Famous last words, right?
So yeah, version. And yeah, small cosmetic change. Okay. Doesn't look very cosmetic to me.
Okay. And then we have this interesting case. Oh, no.
So yeah, this is the installation script, which is trying to remove some drivers which were not supported, which were installed before.
Okay, back up a second.
When I use the machine, I have this thing called a package manager that installs things and keeps track of all the files for me.
And they get unhappy when you change things in their directories without telling them about it.
That's correct.
But so in order to make this little graphics stuff work on Linux, you have to do some non-standard stuff, like combine drivers in a special way, modify config files and so on.
And this was basically what this script was doing.
Yeah, exactly.
Oops. So if you haven't noticed, there's a space between /usr and the part after it.
So this /lib/nvidia-current directory probably doesn't exist.
That's fine. What's more concerning is the removal of the /usr directory, which spawned a huge thread. Well, this was fixed.
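A quick way to see what the stray space does is to split the command line the way a shell would, without running anything. This is a sketch under assumptions: the full path is an illustrative reconstruction of the one discussed here, and `targets` is a hypothetical helper.

```python
import shlex

# Illustrative command lines; nothing here is executed.
intended = "rm -rf /usr/lib/nvidia-current/xorg/xorg"
as_shipped = "rm -rf /usr /lib/nvidia-current/xorg/xorg"

def targets(command):
    """Return the path arguments the shell would hand to rm."""
    argv = shlex.split(command)
    return [arg for arg in argv[1:] if not arg.startswith("-")]

print(targets(intended))    # ['/usr/lib/nvidia-current/xorg/xorg']
print(targets(as_shipped))  # ['/usr', '/lib/nvidia-current/xorg/xorg']
```

With the space, the single path becomes two arguments: the whole system directory, and a second path that most likely does not exist and is silently skipped because of the force flag.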
It's exactly as described.
This is, I think, maybe one of the biggest pages on my GitHub in terms of comments.
People started adding so many comments to it.
It's a whole meme thread. If you like need to spend a wasted day, this is a good place to go.
That's awful.
That reminds me of that. I don't know if you guys ever played EVE Online.
It's this big MMORPG, and I think it was 2007-ish. They also had a similar issue with their installer on Windows.
And instead of deleting a subdirectory of, I think it was System32, they deleted the entirety of System32.
Windows has this very unfortunate side effect of not being able to operate without System32.
And one of the things the installer needed to do at that time was reboot the computer before it actually got everything going.
So deleting System32 by itself won't break your actively running system.
But what it will do is break your system after you've rebooted it, because everything that was in System32 is suddenly no longer there, because the entire directory is gone.
That also took quite a bit of doing and a lot of like emergency posts on their forum and like everything else just kind of shouting into the void kind of going like, please, please, please don't run this.
Please don't do this. That was that was a fairly painful situation as well. Oh, we're talking about Windows installers.
One time my brother and I were installing Rome Total War.
And for some reason, we ejected the disk drive in the middle of the install.
So it didn't work anymore. And we tried again to reinstall it.
And their installer refused to run if it thought you'd already installed it. So we had to go manually delete all the files, all the registry keys that it had created.
And we couldn't tell if we were done except by rerunning it. But the installer, when it reran, wouldn't abort immediately if it thought you'd already installed it; it would put some of those back first.
So it was a long afternoon of alright, did we get them all?
Nope. All right, all the ones we found before and let's look for the new one.
We're very lucky to have a bootable machine by the end.
So you basically end up with like a notepad of just constantly adding registry keys.
It's like, okay, I need to delete this. You rerun it again. Okay, to delete all the games.
Like, okay, let's find the files. Jesus, that's, that's, that's impossible.
Oh, it was a long afternoon. Guess you got to keep busy, right?
Well, you know, we were kids, we didn't have anything better to do. Yeah, exactly.
I mean, it's not like you need to get an education or anything, right? It's like, just, you know, backups, backups.
Oh my, in terms of useless things, I remember as a kid, like, you know, you know, you get, if you normally press a keyboard, you get like ABC and so on.
But if you press alt and a certain number, you get a special key.
So what did I do as a kid? I had Word open and I was literally pressing alt one, alt two, and so on.
I don't know where I stopped, but I had a couple of pages with nice characters.
And then I hear if you press the ones above the number row, it gets even more interesting on windows.
I remember the days of playing multiplayer games. And then there's always going to be that one person that goes like, if you want to have a wall hack, press alt F4 or anything else.
And then there's, it's always fun to just, just wait until you see like, at least one person goes like, insert username and then disconnected.
It's like, oh, it's yeah. Good times. People still try and spam IRC with +++ATH.
I mean, and then the worst thing is, occasionally you still see people disconnected.
It's like, this, this thing has been around since what, the late nineties?
And how is this still being detected as a virus attempt?
Like, why is your CPE still thinking that this thing is not supposed to be here?
That's, that's impossible. But I mean, to be fair, IRC is still around, which in and of itself is an achievement.
That's not the, so the +++ATH, it's a modem command.
Because the Hayes modem command set is in band. So there are character sequences which, if they go through, TCP just takes the data and passes it along; it's not doing anything.
If it recognizes that, it will disconnect. I thought there was, there was also that, that one that was being recognized, I think, by, by certain CPEs as being malware.
And that would then also force a disconnect or I might be missing, like, mixing up certain things.
There's some strings that are used to test that it's actually working.
And if your intrusion detection system is aggressive, when it detects it, it decides to cut the connection or something.
I mean, you know, there's, there's, there's things, network engineers do weird things to the network.
I can, I can fully attest to that fact. Oh, say more.
I mean, so, like you, you mentioned, like, I'm a mix of daily operations and as well as automation to give some context.
At Cloudflare, we, we use SaltStack for our automation.
The way that works is, is using a minion master architecture.
So you have every individual device that you have basically has a minion running.
And that minion then executes all the commands that you want to do.
SaltStack is also used for, mostly used actually for, for server management, but we also use it for network equipment management.
The one thing with network equipment, unfortunately, is we're kind of stuck in the early 2000s when it comes to network equipment, which means that we can't always run our own software on top of the hardware that we buy.
So usually you're kind of stuck with either the vendor operating system and they kind of limit things to what you can run on it.
There are some exceptions there.
There's some vendors that allow you to even just run a plain Linux with some enhancements in the user space or allow you to run your own compiled programs on top of it, but that's not always the case.
So instead of running the minion directly on the hardware, what we do for our network equipment is we run the minion on a server that is within the same location as the network equipment.
And then that minion makes a connection through a bunch of different protocols to that network equipment and then manages it through that, right?
So there's certain things where it will either connect through SSH with NETCONF or it will connect through HTTP.
So there's, there's some network equipment that actually runs a full-fledged REST API on HTTP, which in and of itself, I'm sure certain people will have quite a lot of problems with because you're running an HTTP server on network equipment, which is usually not what you want to do, but yeah, fair enough.
Could be worse, like Telnet. Yeah, well, you know, let's not talk about Telnet.
But so, you know, so that's, you know, the context.
So enter late 2017, I just joined Cloudflare in September of 2017.
You know, all new to the company, just, you know, getting my bearings, figuring things out, suddenly working at a company that is working with like over 150 data centers, massive global scale, a massive amount of network equipment as well.
It's all being managed, automation here, automation there.
So, you know, a bit overwhelmed, but, you know, getting there. And then I got very lucky that Mircea Ulinic, who's no longer with Cloudflare, unfortunately, but Mircea was there.
And together with Jerome as well, they were very good at mentoring me and leading me into this like automation journey, which was, which was in and of itself amazing.
So what happened is Mircea decided to push me, it's like, okay, let's give you an exciting project.
Let's get you to turn up SaltStack automation for our network switches.
So in our architecture, in our edge location architecture, there's basically two things that you have, you have the router, which will be present in, you know, Internet exchanges, transit providers, stuff like that will do the routing.
And then you have a top of rack switch, which will just do layer two forwarding of packets.
So we had automation for all our edge routers. That was all done.
But because of the scale of the top of racks, there were a lot more top of racks than there are routers.
There were some things to take into consideration, right?
It's like at a certain point you run into, you will always run into scaling issues.
There's no such thing as the perfect solution, which will always work for any, any scale.
So we had to do some investigation, figure out, okay, will this work with the current setup we have?
Will we need a different setup?
You know, what's the scale of this? So when I joined, that was one of the first projects I started working on is okay, like, you know, let's roll out automation on our top of rack switches.
So like I touched on, like our top of racks, fortunately, well, I guess, fortunately, come with an HTTP rest API.
So set up the minion, set up the minion to connect to that HTTP rest API.
And that works, you know, so suddenly, yay, we have automation for our top of racks.
That's amazing. That's really cool.
And that went really well. And then so we did some testing in a couple of smaller locations to make sure they're like, okay, is everything working?
It's like, are we sure?
Like, okay, this works. So we left that running in some smaller locations for I think, three to six hours.
Everything fine in those locations, nothing that we saw of like, okay, is this really working?
Or is there anything going on?
Are we seeing issues? Everything was, you know, perfect. So we decided, okay, yeah, so this clearly works.
And then everyone's kind of excitement and eagerness to get like, oh, yeah, we can finally like start automating our top of racks as well.
Said, okay, yeah, let's just roll out for the rest of world.
So do a ROW release. Did that, rolled it out to all our top of racks across the globe.
At that time, I think we had roughly 150 locations where we're now up to, I think, 250.
So, you know, this dates me. But so, you know, rolled it out. I think plenty, plenty of top of racks.
And this was roughly at the end of my workday.
So that's roughly I think that was around like 6 p.m. British time. We were done, right?
So everything worked. Ran a bunch of test commands. Everything seemed to be happy.
We were getting the output we were expecting. Great. So I decided, you know, go to the pub, have a pint.
And then go home, have a lovely evening in and then go to bed.
With the, you know, that happy, warm, fuzzy feeling inside of, yeah, I finished my first major project.
This makes me very, very happy. And then you wake up and the first thing I saw on my phone was a chat message from one of my coworkers in Singapore saying like, yeah, I can't log in on this top of rack switch.
And I then tried to log in on a couple of other top of rack switches and I can't log in to those either.
Do you know what's going on? Like, right. Okay. And then you get that sinking feeling in your stomach, right?
It's like you realize like, oh, no, this is not gonna end well, is it?
So log on and start trying to figure things out.
And then you start realizing, okay, there's quite a few top of rack switches where we can no longer log in.
So we first try to log in using in band, right?
So there's two ways of logging into the management plane of a managed switch, which is in band and out of band.
Out of band is through, which is also still surprising that that's a thing, is an RS232 console connection, right?
So that's out of band.
Gives you the benefit of if in band is broken or if your network connectivity is broken, you can still access the switch so you can figure out what's going on.
So, okay. In band, clearly broken. We couldn't SSH into the switches anymore.
So we tried out of band. And, okay. We get the login prompt. Try logging in.
And switch doesn't let us log in. And the way it complains about it is too many open files.
Now, if you know kind of how Linux works is there's a thing called file descriptors, right?
So any file system and any operating system kind of has a limit to the amount of open file descriptors one can have.
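The "too many open files" message is what the operating system reports when a process hits its file descriptor limit. The following is a minimal sketch for Linux or another Unix-like system; `open_until_limit` is a hypothetical helper that temporarily lowers this process's own soft limit, opens files until the kernel refuses, and then restores the original limit.

```python
import errno
import os
import resource

def open_until_limit(soft_limit=64):
    """Lower the soft fd limit, open files until the kernel refuses, restore."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = soft_limit if hard == resource.RLIM_INFINITY else min(soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    fds = []
    err = None
    try:
        while True:
            fds.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        err = exc.errno
    finally:
        # Release everything we opened and put the original limit back.
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return err, len(fds)

err, opened = open_until_limit()
print(f"opened {opened} descriptors, then got: {os.strerror(err)}")
```

The error code returned is `EMFILE`, the per-process limit. A login program that cannot open the files it needs fails in much the same way, whatever the underlying reason the opens are being refused.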
So that was the error we were getting is the Linux control plane operating system for those switches was complaining about the fact that, yeah, we have too many open files.
Which is, you know, quite strange because that's not supposed to happen for all clarity.
So, okay.
Obviously because we couldn't log in anymore, we were kind of stuck in debugging and trying to understand what was going on.
What we did fairly quickly realize was, yeah, clearly this is related to, you know, this new SaltStack deployment that we did.
So we immediately shut down anything we had running. And then started trying to assess, okay, how many top of racks switches are broken.
So we built a pssh script, which is parallel SSH, that basically just iterated through all the top of racks and tried to log in and then do a show version command.
Show version lists you, you know, the operating system it's running, the hardware it's running, all of that.
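The fleet survey described here could look roughly like the sketch below. It is not the script that was actually used: the host names are invented, `ssh_probe` and `survey` are hypothetical helpers, and the demo swaps in a fake probe so it runs without any network access.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_probe(host, timeout=10):
    """Return True if `show version` succeeds on the host over SSH."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
           host, "show version"]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

def survey(hosts, probe=ssh_probe, workers=32):
    """Probe all hosts in parallel; return broken hosts and their fraction."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(probe, hosts))
    broken = [host for host, ok in zip(hosts, results) if not ok]
    fraction = len(broken) / len(hosts) if hosts else 0.0
    return broken, fraction

# Demo with a fake probe standing in for real switches.
def fake_probe(host):
    return not host.endswith("-bad")

broken, fraction = survey(["tor1", "tor2-bad", "tor3-bad", "tor4"], probe=fake_probe)
print(broken, f"{fraction:.0%}")
```

Keeping the probe as a parameter means the counting logic can be checked offline before pointing it at real equipment.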
So tried to figure out, okay, how many things did I break? Got the assessment.
I think we broke roughly 60 to 65 or 70% of our active fleet.
Fortunately, to add some context, fortunately it was only the control plane. So in network equipment, you have two things, right?
You have the control plane and you have the data plane.
The data plane is the thing that pushes bits, right?
That just decides, like, I need to go from A to B, blah, this goes out of that port.
That's your data plane. Control plane is what tells your data plane how to switch those packets and, you know, some management stuff like that.
So the only thing that was broken was the control plane.
So what that meant is, fortunately, it didn't break our data centers because that would have been a lot worse.
I think I would have been paged at, like, 3 a.m. at that point.
So fortunately that didn't happen.
I did get my sleep. But what that does mean is if you can't access the control plane, that means that if you need to make changes to a port or if you, you know, need to assess what's going on, you can't because, you know, control plane.
So all of that was happening.
It's just awful. So one of the first things that we did is, you know, shut it all off and then, all right, let's see if we can replicate, right?
Because if you can't log in, you can't access the log files. And if you can't access the log files, you can't know what's happening, which is unfortunate.
Good thing, though, is that we did have a syslog forwarding setup. So what syslog forwarding does is instead of just keeping your log files on the device, it will forward it to a server where you can then store it off system, which has the major benefit if your system crashes, you still have all the information, right?
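As a small illustration of syslog forwarding, the sketch below uses Python's standard `logging.handlers.SysLogHandler` to send a record over UDP to a collector. The collector is a local socket standing in for the remote log server, and the logger name and message text are invented for the demo.

```python
import logging
import logging.handlers
import socket

# Stand-in for the remote log server: a UDP socket on an ephemeral local port.
collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
collector.bind(("127.0.0.1", 0))
collector.settimeout(2.0)

# The "device" side: records are sent off-box instead of staying on local disk.
logger = logging.getLogger("switch-demo")
logger.setLevel(logging.INFO)
handler = logging.handlers.SysLogHandler(address=collector.getsockname())
logger.addHandler(handler)

logger.error("demo: web server process dumped core")

# The collector still holds the message even if the sender later loses its disk.
received, _sender = collector.recvfrom(4096)
print(received)

logger.removeHandler(handler)
handler.close()
collector.close()
```

The point is the one made here: the copy on the collector survives a crash or reboot of the device that produced it.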
So that's useful.
That's cool. So we knew in the back of our heads, okay, so we have, like, the syslog backups if we need to.
And let me just basically go through every single line and figure out what's going on.
But first thing we tried to do is, okay, let's see if we can replicate this.
Because if you can replicate it, it's usually a lot easier to just kind of figure out what's going on.
It's a lot easier to understand what your problem scenario is instead of learning it from log files.
Log files can have so many different meanings, so many different causes that it's usually a bit harder to, you know, work backwards from there.
So set up a top of rack, set up SaltStack again.
And we couldn't replicate. So no idea what was going on.
But we did this on a top of rack that wasn't broken yet. Then realized, okay, so maybe we should try to reproduce it on a top of rack that we broke.
Unfortunately, the only way that you can do that is you need to reboot the box, right?
Because you're out of file descriptors, so you need to kill whatever process it is that's keeping all those file descriptors open.
But you can't log in.
How do you reboot a box where you can't log into? That's a hard one, right? So, yeah.
Big red button. Yeah, the big red button in our case was a system called remote hands.
What that basically means is data centers offer a service where a human will do a thing that you want them to do.
In our case, that was can you please unplug the two cables connecting that top of rack to the power and that rebooted the top of racks?
Because, you know, there's power supplies where you can SSH into and then switch off the outlets.
We were not excellent at documenting at that point in time where our outlets were connected.
So, that would have meant that if we were to go down that route, Russian roulette with power supplies is not exactly, like, the kind of thing where you consider happy days, right?
So, we decided to go with the a bit more expensive but safer route where you just ask remote hands, yeah, can you please unplug those two cables and then plug them back in?
That reboots the top of rack.
So, rebooted top of rack. Unfortunately, what that also meant in the case of that specific switch is it wiped your entire file system.
It keeps configuration.
So, it's still, like, it reboots in the configuration you want it to, but log messages, dump files, core dumps, anything like that, anything that can be considered, you know, temporary, it's actually gone.
Because there was no permanent storage.
So, everything was basically volatile memory, gone the moment you reboot. You have that in Linux configurations as well: you can set it up for, like, your /tmp folder, where tmpfs just wipes everything when you reboot.
That was basically the system they used.
Everything volatile memory, which meant, you know, any useful information besides log files that might have been on that device, yeah, definitely gone.
So, we had to start from scratch. So, rebooted one of the devices, set up salt.
And then we started seeing this very interesting thing.
At the same time, we were, you know, somebody else, I think it was Jerome, was going through the log files and trying to figure out what the hell is going on here, because this doesn't make sense.
This is not a good thing. So, what Jerome then suddenly realized is all the devices that Tom broke, because I definitely broke them, that Tom broke, they all are running the same version of the operating system.
They're all the exact same version. Like, okay, that's a good pointer.
So, that's something to start from. Furthermore, they were all the same hardware release as well.
So, it's all the same exact same type of top of rack with the exact same operating system.
All right. That's good. That's something to go on.
And then he started seeing something else, which was Nginx core dump. As I might have mentioned, those top of racks, they run an HTTP API, a REST API.
You need to have something that serves HTTP.
The vendor decided that, you know, what's a great web server?
Oh, Nginx is a great web server. So, they packaged Nginx for their operating system and put Nginx on top of their operating system.
Awesome. What they did decide to do is compile in their API as a module into Nginx, which meant that if your module fails, Nginx fails.
Failure in this case being core dumps.
What we then saw was, okay, Nginx core dumps. Interesting. That's not supposed to happen.
Pretty sure core dumps aren't supposed to be happening, right? I think.
So, okay. That happened once. Interesting. Five minutes later, exact same log file.
Like Nginx core dump. But I didn't restart Nginx. Nobody restarted Nginx.
Why is Nginx still core dumping? When Nginx core dumps, it restarts. You want to have instant recovery as soon as possible.
If it's a one off kind of thing, that's perfectly fine.
What happens when you core dump is you store the actual core dump on the file system.
You can see where I'm going with this, right? You store the actual core dump on the file system, and they closed it, all right?
But they filled up the file system.
Because Nginx ran as root. So, you can, you know, for a file system, you can reserve 5% of the file system for like root stuff so you don't get into this.
Unfortunately, Nginx ran as root, which meant that, yeah, Nginx don't care.
Nginx will fill up your file system with core dumps.
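The root reserve mentioned here is visible from user space. This is a minimal sketch for Unix-like systems using `os.statvfs`; `reserved_for_root` is a hypothetical helper, and on file systems without a reserve the difference is simply zero.

```python
import os

def reserved_for_root(path="/"):
    """Return (free for root, free for other users, reserve) in bytes."""
    st = os.statvfs(path)
    free_for_root = st.f_bfree * st.f_frsize    # all free blocks
    free_for_users = st.f_bavail * st.f_frsize  # free blocks for unprivileged users
    return free_for_root, free_for_users, free_for_root - free_for_users

root_free, user_free, reserve = reserved_for_root("/")
print(f"free for root:  {root_free:>15,d} bytes")
print(f"free for users: {user_free:>15,d} bytes")
print(f"root reserve:   {reserve:>15,d} bytes")
```

A process running as root is allowed to write into that reserve, which is why the safety margin did not help in this case.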
So, it filled up the file system. Which meant that when you try to log in, because it's Linux, you know, it needs to check a bunch of files.
File system full. You can't even create like the lock file or whatever.
You can't lock your password file or all the things that Linux needs to do before logging in.
Which meant, yeah, you try to log in, yeah, no. Because Nginx was just every five minutes in this interesting restart loop of restart, everything happy, happy, happy, and then something happens because we at that point didn't know what triggered the core dump.
Just happy, happy, happy, five minutes later, core dump.
And, okay, yeah, that's something. So, we knew what was causing it.
We knew SaltStack was doing something that was causing Nginx to receive a command and core dump.
So, we then went through the core dump and tried to understand, okay, what exactly failed?
And then we realized that it was libXML. Because a lot of APIs in the networking world still, unfortunately, are written to return XML.
As I said, early 2000s, right? We're stuck in the past with a lot of things.
And using XML is one of those things. This is that YANG stuff that I keep having to see people sending documents for.
It's like, okay. OpenConfig YANG, yeah.
That's a totally different story. But, yeah. So, libXML segfaulted.
You don't want your libraries to segfault. What happened is when that segfaulted, that took down their API module and that API module then took down Nginx and then that core dumped.
So, all right. Perfect. So, we got there. Then we saw the segfault reason was libXML was trying to decode a byte ordinal.
So, a literal byte stream where the bytes were 255, 255, 255, and it was trying to interpret that as ASCII.
So, as you may know, that's not valid ASCII. So, libXML says, yeah, no, I'm not doing this.
Nice try, but no. So, segfault. Fair enough. So, trying to figure out, okay, where the hell does that byte stream come from?
Because, you know, something clearly something of the operating system is returning that.
At the same time, we also understood what was causing this every five minutes.
What SaltStack does is basically it tries to maintain a database of facts about the device.
Facts such as what's the operating system? What version of the operating system am I running?
How many interfaces do I have? And a bunch of other things, right?
And it tries to update that database every five minutes. So, we got there, right?
So, we figured out, okay, it's this function that's being run that is then being executed on the HTTP API and that's crashing it.
Awesome. So, I think there are roughly six or seven commands on the switch that are being run.
So, we had six commands that we then figured out, okay, we need to run these and then pipe each one to their XML exporter to see which one breaks.
And there's a specific command on those switches that shows you which transceivers you have.
A transceiver being an optical module, for example, that you plug into the switch and then allows you to pass bits through it, right?
That can be over copper, that can be over fiber.
There's a bunch of different transceivers that do a bunch of different things.
So, that was the command it was running. And what that outputs is, you know, the vendor of the optic, the version, if there's a hardware version, a software version.
All of this is embedded in the firmware of that transceiver.
Usually in an SFP kind of module thing. Again, out of scope. But so, gets all that information.
These firmwares are programmed by the vendor usually, right?
There's only a very limited subset of transceiver vendors that allow you to customize the firmware of your transceiver.
Usually they just come prepackaged by the vendor and then just, you know, you plug them in and that's it.
You don't touch the firmware.
Because touching firmware is usually not the thing you want to do.
But what we saw is, we got a bad batch of transceivers where the version, instead of being ASCII, was just returning the bytes 0xFF, 0xFF, 0xFF, so 255, 255, 255.
And apparently, what the operating system did, instead of doing what it does, for example, with the command line, is it does some sanity checking.
It sanitizes the output and instead of outputting, like, instead of crashing or whatever, it just outputted question marks.
So, we were wondering, okay, this is weird.
This is not supposed to be happening. And then we did a display XML and then it crashed.
Okay. That's good. We now know what the exact command is.
And then we saw, like, oh, oh, no. The firmware has 255 as bytes and the operating system for the API, it doesn't go through the CLI.
It has direct access. So, it doesn't do any of the sanitization or the sanity checks.
It just tried to directly output the firmware output of the version of that transceiver into libXML and libXML says, no, I'm not doing this.
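The difference between the two code paths can be shown with the same three bytes. This is a sketch in Python rather than the vendor's code: `cli_style` and `api_style` are hypothetical names, and where the C library in the story crashed, Python raises an exception instead.

```python
# The version field as read from the bad transceivers: three bytes of 255.
raw_version = bytes([255, 255, 255])

def cli_style(raw):
    """Sanitize: replace anything that is not printable ASCII with '?'."""
    return "".join(chr(b) if 32 <= b < 127 else "?" for b in raw)

def api_style(raw):
    """No sanitization: strict ASCII decode that fails on invalid input."""
    return raw.decode("ascii")

print(cli_style(raw_version))  # ???

try:
    api_style(raw_version)
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)
```

The sanitizing path degrades to question marks, which matches what was seen on the command line, while the strict path has no fallback for bytes outside the ASCII range.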
Take down the entire system every five minutes because you're trying to check what's the version of this transceiver.
And the transceiver tells you exactly what version of the transceiver it is.
And we had that bad batch spread across the globe.
We had it in so many locations. But that also explained why it didn't take down our entire fleet because, you know, those transceivers weren't in every single top-of-rack.
So, we kind of got lucky in a way because it could have been a whole lot worse.
But that entire process of trying to figure out what the hell did we do and how did that go wrong took, I think, together with vendor support, took two and a half months.
Because, you know, you're trying to understand exactly, like, how does this all interconnect with each other?
And yeah, you own the hardware and it's all your thing. But the operating system is not, like, for example, with Linux, you own the entire operating system, right?
Like, you run it. You can get into every single nitty gritty little thing.
It's not the case for a lot of network operating systems, unfortunately. So, you know, you have to interact with vendor support, try to understand, okay, how does this work?
How do things interconnect with each other? So, that took a lot of doing and then finally got that working.
So, we understood, okay, so, clearly, don't run SaltStack for these devices.
But the thing is, like, yeah, but we need to.
Because we're at a scale where I don't want to run parallel SSH scripts for every single change I'm going to need to make to all of our top-of-racks.
So, the next best thing is, okay, upgrade the operating system.
So, we upgraded the operating system on a switch we knew was broken, and then ran SaltStack against it for 12 hours.
And it didn't break. And we didn't see core dumps. So, yay.
So, that's kind of how we knew, right? But the thing that bit us the most, I think, at that time was how long it took for the control plane to actually fail.
Because, there's quite a bit of disk space that you can use on those devices.
Which meant that me going away at 6 p.m. meant that it only actually started failing, I think, 10 hours later.
Because it took that long for those core dumps to just fill up the file system.
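The delay is just arithmetic: free disk space divided by the rate at which core dumps are written. A minimal sketch follows. The five-minute interval comes from the polling described earlier, while the disk and dump sizes are made-up numbers chosen only to show how a ten-hour fuse can come about.

```python
def hours_until_full(free_bytes: int, dump_bytes: int, interval_minutes: float) -> float:
    """Hours until core dumps of dump_bytes, written every interval_minutes, fill free_bytes."""
    if dump_bytes <= 0 or interval_minutes <= 0:
        raise ValueError("dump size and interval must be positive")
    dumps_that_fit = free_bytes // dump_bytes
    return dumps_that_fit * interval_minutes / 60


# Hypothetical sizes, not measurements from the real devices.
FREE_BYTES = 12 * 1000**3   # 12 GB of free disk
DUMP_BYTES = 100 * 1000**2  # 100 MB per core dump

if __name__ == "__main__":
    print(hours_until_full(FREE_BYTES, DUMP_BYTES, interval_minutes=5))
```

With these assumed numbers, 120 dumps fit on the disk, and at one dump every five minutes that is ten hours before anything visibly fails. Alerting on the appearance of the first core dump, rather than on a full disk, would shorten that window.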
And that's just one of those things that you can't predict.
Like, usually the expectation that people will have is, if you push something new to production, it will either work or it won't.
There's no middle ground, right? If it fails, you should expect to see failures in some way or form.
Or you should see some indication of something's going wrong.
And we didn't have that. Because, yeah, SaltStack was working perfectly.
We were getting all the information we needed. We didn't get any reports or any log files at that time where we knew, okay, something's clearly broken or something's clearly not working well.
But, yeah, after about, like, yeah, 10 hours, suddenly everything just kind of all at the same time, roughly, just kind of went, nope, not doing this.
And that is scary. That is really, really scary. Especially because I was only at Cloudflare at that time, I think.
This was December. So, you know, two months, right?
Very, very fresh. Very green behind the ears. And then you have to admit to the fact that, oh, I broke things.
I broke things badly. But fortunately, I got a lot of support from all my coworkers that were also very invested in figuring out what was going on.
And we have this really cool thing within the company, which is a no blame culture, which is incredibly, incredibly important.
Even when you're on any kind of incident, even if it's a single person's code change, for example, that causes an issue, it's not their fault.
Because you should have processes or systems in place that then catch the fact that you probably shouldn't do this, right?
The failure isn't the person. The failure is the process or anything around it.
And that was great. Because that was exactly what happened: you know, we did a postmortem, tried to understand, okay, what happened?
How did it happen? And how do we make sure that if, you know, we upgrade to a different release and they have a regression?
Because, you know, again, not your operating system.
You can't audit the entire thing. You kind of just have to upgrade and then pray that it doesn't break again.
So, if we were to upgrade operating systems again and they have a regression for that specific kind of bug, how do we make sure that we catch it early?
And how do we make sure that it doesn't break the entire fleet again?
So, you know, that was, I think, that's the key takeaway is, like, don't be afraid, though, to fail.
Don't be afraid to hit the pavement running and then definitely hit the pavement with your face.
Because otherwise, you're not going to do the exciting things, right?
Otherwise, like, if after this failure, we decided, like, oh, yeah, let's not do this.
Let's definitely not use automation for our top-of-racks, we'd be in a way worse state.
Because now we have, like I said, over 250 locations with a lot more top-of-racks.
So, like, have fun managing that. Like, that's your day job now.
Like, managing top-of-racks. And that's not exactly what you want to do as a network engineer.
It's like, you want to do the exciting, cool things.
So, that's super important. Like, don't be afraid to fail. Because like Peter said, as well, it's like, you learn interesting things.
Like, you start learning about, like, you know, how Bash does glob expansion.
I learned about, you know, this network vendor embedding Nginx on their operating system for reasons.
And, you know, like, what a chain of events can cause, right? And there's so many interesting, really cool things that you can learn from failing a lot more than you can learn from being successful.
Because being successful is cool, right?
That gives you this happy, fuzzy, warm feeling, right? And you can learn from it.
And, you know, there's things that you can take away. But failure is definitely those things that will stick in the back of your head.
And you'll remember it like it was yesterday.
Because, you know, this was two and a half years ago.
And I can still pretty much recount to you my entire progress through that entire thing.
Off the top of my head. And, you know, that was exciting. It was really cool.
And that just kind of set me up for the rest of my career within Cloudflare.
Just kind of doing automation, doing cool things and not being afraid to fail.
Because I had the support of my coworkers. And there was, like, yeah. Everyone just kind of went, eh.
Like, you run a production system. You touch production.
You're gonna make mistakes. That's a given. If you're not making mistakes, you're not working hard enough.
So, that was, yeah. Fun. So, when I was an intern, I was working on a project that involved the code that terminates connections on the edge.
And it's a biggish code base. At the time, there were sort of two different programs built from that code base.
And there was heavy use of Git submodules.
Now, submodules in Git are just sort of pointers to another repository.
And because I was adding a new feature, I needed a way to test this feature.
And so, I had a separate implementation of the feature that I had worked on earlier to make sure the specification worked and it was basically the same.
And so, I took that other repository, that other implementation, and put it in as a submodule.
And then one of the tests I wrote would run the two against each other and make sure it all worked.
So, I get that all working on my machine and it's set up and I push it on a branch and it was Friday in San Francisco, so I go home.
And then I come back to work Monday and learn that over the weekend, or sorry, not over the weekend because nobody's working on the weekend, but for that Monday, they have had a problem in the UK where most of the team is based.
The CI system has been unable to build anybody's branches.
And the reason for this is it was a personal repository and by default on the system that we use, those personal repositories are private.
And so, the CI system could not check out the submodule.
Now, the other compounding thing is that when the CI system can't check out a branch it knows must exist, it says, well, okay, it tries to check out all the submodules it can see.
It doesn't limit them by branch. So, because this one submodule on one branch was problematic, all the branches were failing the same way.
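One way to catch this before pushing is to check every submodule URL against the set of locations the CI system can actually read. The sketch below is an assumption about how such a check could look, not something described in the talk: the URL prefix, the sample `.gitmodules` content, and the function names are all hypothetical.

```python
import re

# Hypothetical prefixes for repositories the CI system is assumed to be able to read.
CI_READABLE_PREFIXES = ("https://git.example.com/shared/",)

# Matches a "url = <value>" line as it appears in a .gitmodules file.
URL_LINE = re.compile(r"^\s*url\s*=\s*(\S+)\s*$")


def submodule_urls(gitmodules_text: str) -> list[str]:
    """Extract every 'url = ...' value from the text of a .gitmodules file."""
    urls = []
    for line in gitmodules_text.splitlines():
        match = URL_LINE.match(line)
        if match:
            urls.append(match.group(1))
    return urls


def unreadable_submodules(gitmodules_text: str) -> list[str]:
    """Return submodule URLs that fall outside the CI-readable prefixes."""
    return [
        url
        for url in submodule_urls(gitmodules_text)
        if not url.startswith(CI_READABLE_PREFIXES)
    ]


# Made-up example: one personal (private) repository and one shared repository.
SAMPLE_GITMODULES = """\
[submodule "reference-impl"]
    path = third_party/reference-impl
    url = https://git.example.com/~intern/reference-impl.git
[submodule "shared-lib"]
    path = third_party/shared-lib
    url = https://git.example.com/shared/shared-lib.git
"""

if __name__ == "__main__":
    for url in unreadable_submodules(SAMPLE_GITMODULES):
        print("CI may not be able to fetch:", url)
```

Run as a pre-push or pre-merge check, something like this would flag the personal repository before it ever reached the shared CI system.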
Oh, damn, that's rough. That is so rough. That reminds me of a similar issue we had with the CI where someone used an emoji and I think it was either a branch or a commit message.
Peter is laughing, Peter remembers.
So, yeah, they pushed that and apparently the CI system doesn't get emojis.
So, the CI system just failed. Because the best part was I think it was the poop emoji.
It was just, you know, it was the best mix of circumstances one can possibly imagine for that kind of failure.
Because of all the things, like, of all the emojis, right, the emoji subset being so massive that it is, of all the emojis to use, it's the poop emoji.
And that just broke CI entirely.
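A guard for this kind of failure can be as small as listing the non-ASCII characters in a branch name or commit message before it reaches CI. The sketch below assumes that restricting these strings to ASCII is acceptable, which the talk does not say. The function name and the example branch names are made up.

```python
def non_ascii_characters(text: str) -> list[str]:
    """Return the characters in text that fall outside the ASCII range, in order."""
    return [ch for ch in text if not ch.isascii()]


if __name__ == "__main__":
    # Hypothetical branch names; the second one ends in the pile-of-poo emoji.
    for name in ("fix-ci-again", "fix-ci-again-\U0001F4A9"):
        offending = non_ascii_characters(name)
        status = "ok" if not offending else f"rejected, contains {offending!r}"
        print(f"{name!r}: {status}")
```

The better long-term fix is for the CI system to handle Unicode correctly. A check like this only keeps one bad input from taking everyone's builds down in the meantime.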
It just, yeah. That's, yeah, actually interesting. So, usually, when you work on something, you definitely know already, like, okay, this kind of process I follow, it works.
But when you have someone new, they might not know that, oh, my branch name is limited to ASCII or Unicode or whatever.
And, like, usually, I would probably never use emojis in my branch name because it doesn't even occur to me.
But someone new might try to do that. It wasn't even someone new.
It was a fairly experienced SRE, as far as I remember. It was just for that specific thing, it was just, you know, it made sense.
Because it was, you know, as this is coming from a network engineer that interacts with SRE on a daily basis, like, SRE has to deal with a lot of finicky tiny things.
And then from time to time, you have to make commit after commit after commit because you keep running into, like, a bug that you're trying to fix.
Or it's like, you fix one case and then you run into the next case and then you run into the next case.
So, after a certain time, even, you know, the best people will break.
And instead of using constructive or, you know, very descriptive git commit messages, they'll just kind of resort to not this poop emoji again.
Something like that.
Because, you know, that makes sense. Because at a certain time, it just created, yeah, let's not do that.
Oh, yeah. It's like a pattern. Like, yeah. Indeed, until you break it, you won't notice it.
Yeah. So, just keep continuing doing that.
It only gets better. Yeah. I think we got a couple of questions, actually, from the audience, which is really cool.
Yeah. So, going back to the no-blame postmortem, does the panel have any opinions on how to constructively deal with accidents like those discussed today?
Or conversely, what types of reactions didn't help address the situation at hand?
So, I think for my case, anyway, the best constructive way of dealing with this was not trying to go after anyone. Like, don't shoot the pianist, right?
It happened. Blaming someone or going after someone because it happened isn't going to fix it.
The only way that you're going to fix it or understand the issue is digging in and fixing it.
That's the only way forward. So, that's the best way of being constructive about this.
Like, forget that, you know, things broke.
That's, you know, that's history. We're now at a point where, you know, the fact is things are broken.
You don't have to, like, that's not worth considering.
So, that was definitely, I think, super important for us to just kind of deal with this.
And what definitely helped as well was we had someone, I think it was Jerome, just kind of he delegated tasks.
Like, he kind of figured out, okay, like, let's spread the load across as many people as we can and then just deal with this in your own kind of select piece of thing that you need to look at.
And then you had someone else that just kind of merged all of it.
So, it becomes a lot easier when you can just kind of go head down into a very specific issue that is part of the bigger issue at hand to deal with it and then fix it.
Yeah, I think in order to investigate such issues, being able to reproduce is very important.
Tom said, like, oh, we realized that some of these devices return this version and only devices with a certain, like, image are affected.
Being able to reproduce the exact environment and conditions is very important to speed up investigating and fixing these issues.
I had some issues before, or still, like, ongoing, where there's clearly an issue, as in, like, no doubt.
It's definitely there.
It's very significant, except it's very hard to, like, investigate because there's no clear way to reproduce it.
Yeah, I think that's always the pain, right?
You can have as much observability as you want, but if you can't reproduce the very specific failure mode that you're experiencing, it's going to be a lot harder to debug the issue because, especially for a stack like Cloudflare's, there's so many moving parts and so many things interact with so many other things and just kind of all chain into this massive thing.
So, if you're kind of stuck with, okay, we have to run this entire thing through the entire stack, you're going to be stuck waiting quite a while to, like, figure out what's going on.
So, if you can figure out precisely where in the stack it's going wrong and what you need to do in that stack to make it break, you're, yeah, you're miles ahead.
For your case, you said you, like, someone detected the problem, like, within a day, but it took, like, three months or so to fully resolve the issue.
How long did the investigation process actually take?
I mean, the investigation process, I think we got it done within, like, about two days, right?
We got very far very, very quickly.
Most of the rest of it was just working with the network vendor, with their support team, to just kind of get to, like, understand if there's a specific bug number, because all the network operating systems have a bug tracking system.
You get a bug number if it's a known bug and stuff like that.
Trying to more, like, understand if this is a specific bug so we can figure out if they know what the bug is, that also means that they'll know in which operating system release the bug is fixed.
If they don't know about the bug and you're the first one to report it, that usually means that it's very likely that that exact same issue is going to happen in the next release.
Because, if they don't know about it, how are they going to fix it, right?
So, that was most of the time that we spent with it. But in the meantime, obviously, so, we were working with the vendor and at the same time, because we weren't running the latest of the latest of the latest of the releases, we just kind of went, okay, let's figure out which operating system we can run.
Like, which version can we run that works? And the way we did that is just, okay, upgrade this switch and then throw SaltStack at it and see if it breaks.
And if it doesn't break, you're fine. It's, you know, it's time consuming, but it was still a faster process to resolution than working with the vendor and getting, like, through the entire step of getting a bug number and then getting engineering to fix it and then getting a new release that explicitly fixed the bug.
Instead, what we kind of relied on was, yeah, let's just run a new operating system and maybe they tweaked some code that somehow touches the thing that broke.
And they did, apparently.
I still don't think we actually got an official bug number for it.
I'd have to check. No, because, I think the end resolution of it was, like, oh, yeah, don't insert transceivers with a broken version.
Which, I mean, yes, you're technically correct, which is the best kind of correct.
But could you, like, not fail when it happens?
That would be nice. And, yeah, at a certain point, you just kind of don't keep pushing for it, right?
Because you have a version that's working, fine.
It's not worth the engineering time.
It's not worth spending more time on than you should. So, we just kind of went with, like, yeah, that's fine.
And then moved on. And now we're running SaltStack on all our network devices.
Yay! Exactly. Make would win. Yeah, definitely.
I mean, if you're running at the scale that we are, it's not one of those things that you want to run manually.
That's not feasible. But Salt does enable you to mess up faster than humanly possible.
Yep, it definitely does. I mean, there's been plenty of cases where automation rollouts just kind of go, it's like, oh, no, no, no, stop it, stop it, stop it, make it stop.
So, that's always, yeah, that's always.
I have had to do that. Like, you know, after they merged my thing, scream, no, no, stop, stop, stop, you're setting it wrong, something's wrong, pull it back.
So...
Yeah, I've been doing things a bit slower and putting a bunch of safeguards up so that, you know, like, gradual releases and stuff like that, so that if it fails, it only fails in a small subsection, and you can stop it from failing everywhere else.
And you watch it and you can be in a loop and have the ability to react.
Yeah, exactly. Very helpful. Now, of course, you can always roll back.
You don't need to... Hopefully, you don't need to push it out. You can stop it, figure out what went wrong, come up with a different strategy to apply the change, which is what I did in that situation.
Came up with one where, okay, it's going to be a series of steps, and each one's controlled.
I'll make sure the first one's a no-op everywhere, and it's going to be the later ones that are going to actually do something, and they're going to be very contained.
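The gradual, contained rollout described here can be sketched as a loop over growing batches that stops at the first failed health check. This is a minimal illustration under stated assumptions, not the tooling used at Cloudflare: the stage fractions, the device names, and the `apply_change` and `is_healthy` callbacks are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    targets: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.05, 0.25, 1.0),
) -> list[str]:
    """Apply a change in growing batches and stop at the first unhealthy target.

    Returns the targets that were changed, including the one that failed
    its health check, if any. Targets beyond that point are never touched.
    """
    changed: list[str] = []
    for fraction in stage_fractions:
        # Each stage extends the rollout up to this index in the target list.
        stage_end = max(1, int(len(targets) * fraction))
        for target in targets[len(changed):stage_end]:
            apply_change(target)
            changed.append(target)
            if not is_healthy(target):
                return changed  # halt: later targets and stages never run
    return changed


if __name__ == "__main__":
    switches = [f"sw{i}" for i in range(20)]  # made-up device names
    touched = staged_rollout(
        switches,
        apply_change=lambda name: print("applying to", name),
        is_healthy=lambda name: name != "sw3",  # pretend sw3 breaks
    )
    print("stopped after:", touched)
```

A real rollout would also wait between stages, since, as the core dump story shows, some failures take hours to appear. The sketch only shows the containment: a failure early in the list leaves the rest of the fleet untouched.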
Yep. I think, like, that's also a very important thing, that is, try to keep your changes as, like, uniquely identifiable as possible.
Like, try to touch on a single individual component per release, right?
Obviously, that's not always going to work, because sometimes changes in components need changes in other components, and, you know, that kind of trickles down.
But as much as possible, just try to keep it contained, because that makes it a lot easier to kind of figure out which component is failing.
Otherwise, like, if you have, like, you roll out this massive, like, you know, like, oh, I built a feature in two quarters.
I'm going to roll this out in one go now, because clearly nothing can possibly go wrong with this, right?
So steady releases is usually the way to go. Yeah.
For this release, it did make sense to, like, start a service comp, like, oh, does it even start without, like, failing?
Or before you add another, like, make another service dependent?
I think that's what we've done quite well. Yeah.
And also, I've been lucky that when we've had deadlines to release stuff, I've gotten it working long before, so the inevitable fixes can happen a bit before race booking.
Well, thank you very much, Tom and Peter, for appearing on this week's edition of To Really Mess Up Takes a Computer, where Cloudflare engineers talk about some of their worst mistakes.
I'm Watson, your host, talking live from New Jersey.
Thank you once again. It's been a pleasure, Watson. It was super fun to relate stories of breaking things and then hearing about how everyone else did it.
That's been really, really fun. Definitely. Thank you all. Thank you very much.