Cloudflare TV

To Really Mess Up Takes a Computer

Presented by Watson Ladd, Lucas Pardue, Michael Wolf
Originally aired on 

Experienced sysadmins and coders talk about their worst mistakes: rm on the wrong tmux, dropping prod db, expiring certs, and the operational improvements/lessons that resulted. Doesn't have to be Cloudflare related!

This week's guests: Lucas Pardue and Michael Wolf

Transcript (Beta)

Welcome to To Really Mess Up Takes a Computer, where Cloudflare engineers talk about some of their worst mistakes.

I'm Watson, your host, live from New Jersey. Today's guests are Lucas Pardue and Michael Wolf.

Lucas is an engineer in London working on the protocols team, which is responsible for handling HTTPS connections that come into Cloudflare.

He's also co-chair of the IETF QUIC working group and a contributor to HTTP/2 and HTTP/3 related standards.

Michael Wolf is a systems reliability engineer on the core team based out of San Francisco.

He's been at Cloudflare for a year focused on observability and service discovery projects.

So Lucas has been advertising this segment on Twitter promising a rabbit warren of errors.

Let's dive in. Yeah, so I picked one story amongst like a whole load of stupid things I've done while working, mainly just due to clumsiness or accidental stuff.

This one is kind of related to QUIC and H3, but you don't really need to know any technical details.

It's more like how chains of errors cause annoyances.

And before you know it, you're deep in a yak shave, going off and trying to fix a bunch of other people's problems.

But it's kind of something I found a bit funny when looking back on it all.

So without further ado.

So if you can see my screen at the moment, what's on here is an excerpt from the HTTP/3 spec.

You don't need to know what HTTP/3 is to care about this. The headline summary is there's things called streams and there's types and there's registered types.

So the document registers two, control stream and push stream, doesn't matter what they do, but they have a value, which is zero and one in this case.

And that value is sent as a 62-bit integer.

So it has a very large maximum possible value, but the document only defines two, zero and one.
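
For reference, the 62-bit integer here is QUIC's variable-length integer encoding: the top two bits of the first byte select a length of 1, 2, 4, or 8 bytes, leaving 6, 14, 30, or 62 bits for the value. A minimal sketch in Rust follows. The names `VARINT_MAX`, `encode_varint`, and `decode_varint` are hypothetical and are not taken from the talk or from any particular library.

```rust
/// Largest value a QUIC variable-length integer can carry: 2^62 - 1.
pub const VARINT_MAX: u64 = (1 << 62) - 1;

/// Encode `v` using the shortest varint form.
/// Returns `None` if `v` does not fit in 62 bits.
pub fn encode_varint(v: u64) -> Option<Vec<u8>> {
    // The two most significant bits of the first byte select the length.
    let (len, prefix): (usize, u8) = match v {
        0..=0x3f => (1, 0b00),
        0x40..=0x3fff => (2, 0b01),
        0x4000..=0x3fff_ffff => (4, 0b10),
        0x4000_0000..=VARINT_MAX => (8, 0b11),
        _ => return None,
    };
    // Keep the low `len` bytes of the big-endian representation.
    let mut out = v.to_be_bytes()[8 - len..].to_vec();
    out[0] |= prefix << 6;
    Some(out)
}

/// Decode one varint from the front of `buf`.
/// Returns the value and the number of bytes consumed.
pub fn decode_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let first = *buf.first()?;
    let len = 1usize << (first >> 6);
    if buf.len() < len {
        return None;
    }
    // Mask off the two length bits, then append the remaining bytes.
    let mut v = u64::from(first & 0x3f);
    for b in &buf[1..len] {
        v = (v << 8) | u64::from(*b);
    }
    Some((v, len))
}
```

With this sketch, the two registered values 0 and 1 each take a single byte on the wire, while values near the top of the space take eight.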

So you've got this almost infinite space that's taken up right at the bottom end, and so you might ask, well, what was the point in creating this space?

The purpose is to allow extensibility of protocols.

So this is something that's really powerful and lets us define something that people can use today to fit a basic use case.

And then in the future as people come up with new ideas, there's an ability to extend and change stuff up and do new things by using new values or sometimes they call new code points or whatever.

But implementations look at what's defined today. And they might say, well, there's only zero and one.

So what I'll do is write a block of code that says, well, if I get anything that's not zero or one, I'll just ignore it or I'll send back an error.

And in practice, what that does is break the extension mechanisms that the protocols work hard to invent in the first place.
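
As an illustration of the alternative, here is a hedged Rust sketch of a receiver that keeps unknown stream types instead of rejecting them. Only the two values named in the talk (0 and 1) are given meaning; `StreamType`, `Action`, and `on_stream_type` are hypothetical names for this example and do not describe any real implementation.

```rust
/// Stream types as seen by a receiver. Only the two values mentioned
/// in the talk are named; everything else is preserved as `Unknown`.
#[derive(Debug, PartialEq)]
pub enum StreamType {
    Control,
    Push,
    Unknown(u64),
}

impl StreamType {
    /// Extensible parsing: this never fails, unknown values are kept.
    pub fn from_wire(value: u64) -> StreamType {
        match value {
            0 => StreamType::Control,
            1 => StreamType::Push,
            other => StreamType::Unknown(other),
        }
    }
}

/// What the receiver does with a newly opened stream.
#[derive(Debug, PartialEq)]
pub enum Action {
    Process,
    Ignore,
}

/// Tolerant handling: an unknown type is ignored rather than being
/// treated as an error that tears down the connection.
pub fn on_stream_type(value: u64) -> Action {
    match StreamType::from_wire(value) {
        StreamType::Control | StreamType::Push => Action::Process,
        StreamType::Unknown(_) => Action::Ignore,
    }
}
```

The design choice is that the "anything that's not zero or one" branch returns `Ignore`, so a future extension using a new value does not break older receivers.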

So this isn't a new problem.

It's age-old, but it reared its head in some of the TLS work.

So that's Transport Layer Security and the extensions they have there, and people in the middle clamping on to this notion that there's only so many values, and we'll inspect them and we'll break connections, because it must be a bad client or a bad server sending this value.

Why would they ever do this? So a concept of greasing came up, which is to say of all of the space that's available, let's reserve some special values that purposely have no meaning to make sure that all the components between a client and a server in this case can handle a value that has no meaning by ignoring it and not breaking the connection.

So that's kind of the premise of this excerpt: there are these two values, but of all the other values, evenly distributed throughout the entire 62-bit number space, some are reserved and have no meaning.

And so there's an equation here, which is 0x1f * N + 0x21.

That's for non-negative integer values of N, and N goes 0, 1, 2, 3, up to some value of N that creates this maximum value, 0x3fffff, etc.
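
The formula above can be sketched in Rust as follows. The function names are hypothetical, and the only inputs are the formula 0x1f * N + 0x21 and the 62-bit limit described in the talk.

```rust
/// Largest value that fits in 62 bits.
pub const MAX_62_BIT: u64 = (1 << 62) - 1;

/// The n-th reserved value, 0x1f * n + 0x21, or `None` if it
/// would not fit in 62 bits.
pub fn grease_value(n: u64) -> Option<u64> {
    let v = n.checked_mul(0x1f)?.checked_add(0x21)?;
    if v <= MAX_62_BIT {
        Some(v)
    } else {
        None
    }
}

/// True if `v` has the form 0x1f * n + 0x21 for some non-negative n.
pub fn is_grease(v: u64) -> bool {
    v >= 0x21 && v <= MAX_62_BIT && (v - 0x21) % 0x1f == 0
}

/// Largest n for which the reserved value still fits in 62 bits.
pub fn max_grease_n() -> u64 {
    (MAX_62_BIT - 0x21) / 0x1f
}
```

The first reserved values are 0x21 and 0x40, and under this formula the largest one that fits works out to 2^62 - 2, one below the 62-bit maximum.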

So say like 18 months ago, shortly after I joined Cloudflare, I was starting to implement some of the HTTP/3 stuff.

And so you look at this code and you're like, oh, well, I'm going to write the code for those two values 0 and 1, but I'm going to follow the rules.

I'm going to be a good citizen. I'm now going to add in some extra special code to, one, generate these values in the first place.

So you need a random number generator. And, two, to validate them when I receive them.

On the generation side, I wanted to make sure that although I have a full allowed space of 62, well, 2 to the 62 minus 1, that I don't go beyond this maximum that's in the document, because it's there and there's no point.

So I can just write some basic test code, right?

So for those that are maybe not familiar, the language we write our HTTP/3 implementation in at Cloudflare is Rust.

So this is just some Rust code, but you can imagine more or less like it's not that hard in any other language to see what this is doing.

I haven't got the full code here, but it's, you know, I want to define the upper bound variable so that when I use a random number generator, I can just say, well, if it's more than that, clamp it down or something like that.

So all I did was copy the text from the document I had open and paste it into my code editor, and I had a red squiggly line of death, which is always a bit worrying.

It's kind of good for IDEs or any code editor to be doing real-time analysis to catch these things.

But at the time, I was also kind of new to Rust and learning stuff.

And so, you know, after like a month or two, you've got some basic confidence that if you just assign a variable, a value that it's going to work, you know, there's all kinds of hard Rust stuff that the compiler shouts at you, kind of the learning curve could be a bit steep, but it does generally give you some quite informative error messages.

But for something so simple, it was a bit worrying.

So the actual message that I saw was no valid digits found for number, which, you know, that's just weird.

Did I just do something really stupid, like forget the equal sign or like, what have I done here?

I couldn't quite figure it out.

So then you kind of start grepping the code base to say, well, let's just take another example.

There's like loads of hex values that we've got through the code.

They all seem to work. I just don't know what's going on here. So kind of tried a few things.

And then I was like, well, I don't need this right now. Context switch away to something else, come back to it a day later, and then go back to the specification to try and find what was the thing I was looking for.

So this is in a web browser, say, back to the section of the specification, which is in an HTML format.

So I was like, it was 0x3 something. Here's a fuzzy match of the string.

There's three matches throughout the document. So there's three different kinds of protocol elements that can be extended in this way.

But ultimately, they all have the same bound.

So I'm like, OK, refresh my understanding of the specification.

Now what I'm going to do is go back into the code and search for the value, 0x3ffff.

So I go back into my editor. And what I had open at the time was the QUIC specification.

So the H3 specification, actually, in this case. And for those not familiar, the workflow here is that you write the specification in Markdown, a special kind of version of Markdown like everyone has.

And then you run it through a tool that can generate the IETF style of docs.

And it told me there were no results for that value, which was weird.

But this is my local version of the specifications that were checked out.

So maybe they got out of sync or something like that.

So kind of do the whole rebase, resync to master, all of that stuff.

And still no dice. So from that point, I went and just manually scanned through the code to look for that section.

I'll search for a control stream. And this is the Markdown version of the table.

So we can see here that although my code editor is telling me there's no string matches for this value, it is actually in the code right there.

And so I'm just getting so confused. All I wanted to do was assign a variable, a value, and write some basic code.

And now I'm just like, I don't know what's going on.

Is it one of these weeks? And so the problem is caused by a few things.

And I don't know if based on the set of slides so far, if either of you two have any ideas what might be causing something here.

Would that zero happen to not be ASCII?

As in, would it be a look-alike from somewhere else in Unicode?

You're very close. The zero is a zero though. Are there any other non-printing characters?

Yes. Well done, Michael. And in hindsight, you're like, well, of course.

If there was some error, then, you know.

But because sometimes I think, like I find if you're unfamiliar with things, you tend to assume it's the newest thing that you're not familiar with.

Or like you're doing something wrong.

Like I'm obviously writing hex digits in Rust in the wrong way and something weird is going on.

So my pro tip is always read the errors that are being presented.

There were more errors, but I was so focused on the first one.

Because generally, like in new languages, for me, it's like the first error is the obvious thing.

And then all the errors follow that are the chain of things.

So yeah, the next line said unknown start of token after the 0x. It's like, what?

And then the bottom one is just the consequence of the mess up that happened here.

So you've got here unknown start of token \u{202d}. So you put that into your favorite search engine.

And it tells you this is the Unicode character left to right override.

And I thought, I've never heard of that. But actually, you know, I am familiar with control characters messing stuff up, usually because of line ending conversions in, you know, bash scripts, makefiles, and those kinds of things.

So my editor does have show control characters. But for some reason, this one doesn't show up as anything.

It's like a non-visible format character, a Unicode format character kind of thing.
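
One way to catch this class of problem is to scan pasted text for invisible formatting characters before using it. The Rust sketch below is illustrative only: the character list is a partial, hand-picked set that includes U+202D (left-to-right override), and the function names are hypothetical.

```rust
/// Invisible formatting characters that commonly survive copy and paste.
/// This list is illustrative, not exhaustive.
fn is_invisible_format_char(c: char) -> bool {
    matches!(
        c,
        '\u{200B}'..='\u{200F}'   // zero-width space and joiners, LRM, RLM
        | '\u{202A}'..='\u{202E}' // bidi embeddings and overrides (U+202D is LRO)
        | '\u{2066}'..='\u{2069}' // bidi isolates
        | '\u{FEFF}'              // byte order mark
    )
}

/// Byte offset and code point of every invisible formatting
/// character found in `s`.
pub fn find_invisible(s: &str) -> Vec<(usize, u32)> {
    s.char_indices()
        .filter(|&(_, c)| is_invisible_format_char(c))
        .map(|(i, c)| (i, c as u32))
        .collect()
}

/// Strip those characters, then parse a `0x`-prefixed hex literal.
pub fn parse_pasted_hex(s: &str) -> Option<u64> {
    let cleaned: String = s
        .chars()
        .filter(|c| !is_invisible_format_char(*c))
        .collect();
    let digits = cleaned.trim().strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok()
}
```

Calling `find_invisible` on a pasted literal reports where the hidden characters sit, which is the information the editor's "show control characters" option did not surface in this story.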

So I don't know what this is.

You know, it's maybe obvious because it says left to right. It's related to text direction and stuff.

And I'm definitely no expert in Unicode or anything like that.

But it's like, OK, well, that's kind of unfortunate. I can maybe fix the error.

But I always want to know why, you know, yes, it affected me. But why?

Why is this here? Who put it in? Why didn't it appear elsewhere? So actually, what I found out was if I copied the value from the text or HTML rendering of the document, like the stuff that's on there, the IETF canonical source of things, it's stripped away.

It's only if you copy it from the markdown itself. And so there's not many people who would be doing that.

I was just very unlucky in the first instance to have that document open and decided to copy the value from there and not anywhere else, like three different places I could have copied it from.

So, you know, you want to do some background into this thing. And you search left to right override.

And you hit, you know, Krebs on security. It's like the number one hit with this right to left override email attack spamming thing.

And you're like, oh, no, right, I'm not going to go to that thing.

There's no security problem here.

That case is kind of interesting for weird ways you can convince Windows to do stuff.

But it's nothing related to what my problem was. Actually, I think because I'm not the author of the code in the markdown, that oops, this is caused by a bug in Windows calculator is my strongest suspicion.

So, yeah, there's a link here to a super user kind of forum question or tip, which is to say, if you're doing programmer calculations in Windows calculator, which is probably what the author did, I don't know.

I wouldn't want to speculate too much.

But you can imagine they've looked at the equation and they've just gone, oh, put some values in.

And they've copied them from the not from here section, which you might think you would do, copy them from somewhere else.

So, you've got two different text copies that behave differently in the same application, which is just horrible.

And so, you know, I wasn't aware of that. And there's a way to mitigate that.

But also, someone raised a bug on the Microsoft calculator GitHub, which I didn't even know was a thing.

So, I first encountered this issue in March 2019.

And it turns out someone opened a bug on calculator to kind of describe this issue.

And I wasn't aware of this at the time. I was just researching this thing today as I find it.

And then, even more funnily, this was fixed in April 2020 with a really cool description to say that, you know, calculation results or memory items or things that numbers should be displayed as left to right, whatever the culture.

And therefore, that's the reason why they did this override to force them to appear in a certain way.

But unfortunately, the way they did that was to embed the Unicode characters into the string that then got passed up to Windows rendering engines.

They've apparently fixed that in some specific Windows way, which is cool.

That change has landed.

But that doesn't help me. So, you know, I'm the only person who probably would have encountered this bug.

So, back in March, having figured out what the problem was and corrected the value in the code in quiche itself, I then made a patch to the HTTP/3 specification to remove the non-visible character.

So, the patch itself looks stupid, right? So, I had to then go into the PR itself, explain explicitly the problem that I'd encountered.

So, I kind of made this problem for myself, but also fixed it for anyone who would do this weird workflow.

I can imagine the reviewer looks at that. It's like, why is there a diff here?

Yeah. Sneaking something in. Definitely sneaking something in. Yeah. So, that's it.

It's annoying. And it's such a small thing. But, you know, it kind of wastes cycles of stuff.

And ultimately, I think the moral of the story is maybe if you're copying, pasting values, be aware that the stuff is just weird.

Try and paste them into an editor or into some intermediary that's going to strip anything like that.

Source of errors, DNS, Unicode, time zones. It's like the nightmare fuel.

And you could even combine all three. I hope not. Well, you know, if somebody puts in an entry in zone info, that includes some Unicode characters.

Yeah.

So, I was editing some code once and I wanted to use, I had to make a symbol for microseconds.

And the thing is I was, I sort of like, well, Go has this wonderful facility for letting you type in Unicode characters as identifiers.

And I'm like, okay, I'm going to use that.

What I did not realize is that Unicode includes two subtly different, visually identical characters.

Greek small letter mu and micro sign.

They are visually identical in almost every font. And half the time I was typing one of them and half the time I was typing another, because in the middle of doing this, I reconfigured Emacs to have a shortcut for the Greek characters.

That shortcut was configured to insert the Greek character.

It was not configured to insert the micro sign that the other keyboard shortcut inserted.

So, I was very confused for a while.
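
The two characters in question are U+00B5 (micro sign) and U+03BC (Greek small letter mu). A small Rust sketch, with hypothetical helper names, shows how to tell them apart by code point when the glyphs look the same:

```rust
/// U+00B5 MICRO SIGN.
pub const MICRO_SIGN: char = '\u{00B5}';
/// U+03BC GREEK SMALL LETTER MU.
pub const GREEK_MU: char = '\u{03BC}';

/// List each character of an identifier with its code point.
pub fn code_points(ident: &str) -> Vec<(char, u32)> {
    ident.chars().map(|c| (c, c as u32)).collect()
}

/// True if two identifiers are different strings that become equal
/// once every micro sign is replaced by Greek mu.
pub fn differ_only_by_mu(a: &str, b: &str) -> bool {
    let fold = |s: &str| s.replace(MICRO_SIGN, "\u{03BC}");
    a != b && fold(a) == fold(b)
}
```

Printing the result of `code_points` for both spellings of an identifier makes the mismatch visible even in a font that draws the two characters identically.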

You're setting up a phishing site, definitely. What was the manifestation of the error, Watson?

Was it compiler error or? The compiler had an error.

And the compiler error is the very informative, no symbol micro S found. Did you mean visually identical symbol?

You're completely correct.

Yeah. I did mean that. Yep. Yeah. I mean, it's kind of annoying when you get compiler errors, but it's way better than getting runtime errors for these kinds of things where maybe for some reason, like, I don't know, if I'd pasted that code in and Rust had swapped the bytes around, then something different might have happened occasionally.

I don't know. I'm speculating here.

I think it's unlikely, but you know, with that grease code, actually what happened was, beyond the bounds-checking error, the random number generator was also slightly skewed, insofar as there was like a 99.9% chance you'd always get the maximum value.

So, it was great for what we're trying to do, interoperability testing, because people are like, what's the first value that you're not supposed to use?

0x21. Great. Okay. I'll try that one. I'll send that to people.

So, actually, having a bug in my implementation that wasn't syntactically wrong, but was kind of semantically pushing the boundaries, helped other people find bugs in their code.

And then after a while, I actually plotted the distribution of the random function and realized what I'd done wrong.
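
The talk does not show the actual bug, so the following Rust sketch is only one hypothetical way a "clamp it down" approach can skew a distribution: drawing a full 64-bit value and clamping it to a 62-bit bound lands on the bound roughly three times out of four. It assumes the bound is the largest value of the form 0x1f * N + 0x21 that fits in 62 bits, and it uses a tiny hand-rolled xorshift generator so that no external crate is needed.

```rust
/// Assumed bound: largest value of the form 0x1f * n + 0x21 in 62 bits.
pub const MAX_GREASE: u64 = 0x3fff_ffff_ffff_fffe;

/// Tiny xorshift64* generator (seed must be non-zero).
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        self.0.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Skewed: draw 64 bits, clamp into range, round down to a reserved value.
/// About three quarters of all u64 values exceed the 62-bit bound.
pub fn clamped_grease(rng: &mut XorShift64) -> u64 {
    let v = rng.next_u64().clamp(0x21, MAX_GREASE);
    v - (v - 0x21) % 0x1f
}

/// Less skewed: pick n first, then apply the formula.
/// (The modulo adds a small bias that is ignored in this sketch.)
pub fn sampled_grease(rng: &mut XorShift64) -> u64 {
    let max_n = (MAX_GREASE - 0x21) / 0x1f;
    let n = rng.next_u64() % (max_n + 1);
    0x1f * n + 0x21
}

/// Fraction of `draws` samples that land exactly on MAX_GREASE.
pub fn fraction_at_max(draws: u32, mut f: impl FnMut() -> u64) -> f64 {
    let hits = (0..draws).filter(|_| f() == MAX_GREASE).count();
    hits as f64 / f64::from(draws)
}
```

Comparing `fraction_at_max` for the two generators is a cheap stand-in for plotting the distribution, which is how the skew was eventually noticed in the story.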

So, that was fun too. And now people ask me to do it the old way, which was not so helpful.

Now I've hard-coded the fix for your old version. Yeah. Oh, yeah.

TLS has had its share of things like that. So, when I started in grad school, this was way back in 2013, the sort of the big story that winter was the Juniper bug.

And what had happened is Juniper had implemented Dual EC, et cetera, et cetera.

And okay, that was all interesting, but they weren't using BSAFE.

And we knew from the Snowden files that BSAFE had been paid a whole bunch of money, or rather they had been paid to implement this as a default.

And there was a paper saying, well, actually, it doesn't work if you go through something special.

So, who knows if anybody used it? And there was a TLS extension that was out there in a draft that had some numbers reserved, but never really officially for employing this extension to make the exploitation easier.

And then the TLS 1.3 process started and there was all sorts of stuff.

And then after about, I think it was two years in, so 2016, TLS 1.3 was done, except that browsers had to figure out whether or not they could deploy it.

Remember all those boxes out there do funny things and think that people, you know, all the extension problems.

And they were just having problem after problem. So, okay, they start by fixing up the handshake, make it look like older versions, that solves a whole bunch of the problems.

And finally, they're down to a really small number of problems, all involving people hitting print.

And so, the people at Google went out, bought a whole bunch of printers, started connecting them, started trying to print through them to the web screen until they found the one that broke the TLS 1.3.

And it turns out that one printer maker had used RSA BSAFE and implemented the whole backdoor, and TLS 1.3 had squatted on that code point.

And that is how we found out almost five years later what was going on. Nice.

These things are kind of annoying, but it is rewarding if you find the actual culprit.

It's not an unsolved mystery, you know, brute force attack the problem. Now, Michael, there's one other very common suspect.

Have you ever had the pleasure of working on an environment with shared NFS home directories?

Yes, you hinted at this before.

And, you know, I just started an NFS share at my house. So, I need to know what to avoid.

Well, the first thing anybody who does this is going to tell you is don't use NFS.

So, for those who don't know, NFS is Network File System. It's an old thing.

It dates back to Sun. And the old versions, NFSv3 and earlier, all have a stateless server.

The server doesn't know anything about what you're doing to the files.

It doesn't know whether you have them open, whether you've closed them.

All it knows is you want a block from the file. Here's the block from the file.

Go away. Tell me when you want another one. And so, as a result, very few operations on NFS files have the semantics that you expect from a file system.

In particular, locking does not work very well.

And when I was a grad student, I was, I had implemented something in Python.

And it worked on my computer. I needed to do a big calculation, and it was way too slow.

And I'd optimized it as much as I could, but the calculation I had to do was just huge.

All right, what to do? Well, luckily, there's a supercomputer I could use.

So, I'll just go run this Python script on the supercomputer on a whole bunch of nodes.

Great. Except that the Python script I was using, it's actually in Sage, and there's a whole bunch of systems that, when they load modules in, make use of locking in the home directory.

If you have a script, they assume the script directory your script is in is one where you can make a file where you can do these sort of locking things, and it will work.

And the supercomputer environment used NFS for home directories.

So, this meant that I could not run the script until I went through and carefully trimmed down all the inclusions, all the module imports to exclude the offending module.

And of course, the result was, don't use NFS.

Oh, you didn't have like data destruction or anything?

It's the fun part. No, I mean, well, data destruction, I will tell you that one time I managed to knock out my .profile completely.

And so, when I logged in, the path wasn't set, everything was bad.

And I managed to remember where, well, I didn't remember, I used man to figure out where the template for this was and copied the template over to fix it without needing the intervention of a sysadmin.

So, that was successful. I can't remember what I did to knock out that profile.

Oh, you know what it must have been? It must have been that I thought it only had my, it must have been that I had some configuration and thought that it was reading from somewhere else because I'm so used to Mac OS X where the initialization, the shell is different.

But no, no, no, no, no, no. I thought you were going to say you catted something with a single caret instead of a double caret, which is a classic, classic mistake.

So, my experience with file systems is a little bit different.

This isn't necessarily work related, but it does relate to how I initially got into operations.

So, I went to college, I was in my, I think I just finished my second quarter and I was about to start on the first chunk of like internship where you go out and work for a semester at a company somewhere.

And so, up to that point I had always been using Windows, since, you know, that's what I grew up with.

I'm like, okay, you know, this is fine. But I'd been interested in other things.

I talked to this recruiter that came through from a big Silicon Valley company and, you know, I'd handed my resume.

I was like, I'll never hear back from these people.

And a little bit later, I got a call and this dude was like, oh, hey, I heard you're interested.

Do you want to do operations?

I'm like, sure. It's like, do you know Linux? And I'm like, I can. And so, I was given an offer on that promise.

So, I went home. I took like an old hard drive out from the closet and I put Ubuntu on it and I got started and I slowly over the course of a couple months, you know, got ready to learn how to use Linux and able to, you know, actually be a decent, you know, intern in operations.

I don't know how great I did.

It was only until a couple years later did I realize that that hard drive had all of my Dogecoin on it.

I had like 50,000 Dogecoin. It's gone forever.

But I mean, I'm an engineer now. So, you know, rising like a phoenix from the ashes of my Dogecoin.

How much would that be in today's kind of money? Like 40 bucks, I think.

Oh, okay. But just, yeah. But yeah, I'll kind of go to one of the other anecdotes that I lined up here.

So, you know, years down the line, after I finished school, I decided to move out to California to be with my now wife.

And one of my friends from school had been working at a video game company.

And I'm like, sweet. Like, I saw him, you know, working there part-time before and seemed to be having a lot of fun.

I was like, let's go for it.

Let's go work at this, you know, really small video game company. So, we start there.

I worked there for about a year and a half. The way that we did deployments or like updating video game was, you know, normally you'd have a downtime.

And during that time, you would push the patch out to all of your distribution platforms, to Apple, to Google, to Amazon, to, you know, whatever.

I think we had our own distribution by the end of it to like PC and Mac.

And you would force everyone off at the same time.

It'd all have to go and update their clients to the latest one and go.

And that was usually like, you know, a downtime because we had to like update the servers and everything.

And, you know, players don't like that.

They want to be playing constantly. And, you know, it was very easy to overrun those times.

And so, we're like, let's do blue-green deployments where we'd have, you know, the old version infrastructure spun up and then the new infrastructure spun up using Terraform and Ansible.

And it was just a beautiful system that would seamlessly go from one release to the other and you could like slowly push people over.

It was great. It was incredibly expensive. It like bankrupted the company probably, but like it was awesome.

The issue is there were still some manual steps in there.

And so, one release, I don't remember which one, it was like a pretty major release of this video game.

And obviously, it had a very global community.

I think our largest player base was in Southeast Asia, actually.

So, we had different regions so that people were playing people that were close to them geographically.

So, we had this whole setup. We, you know, sat down with our coffee and donuts and pushed the button to like move up the new release.

We saw people trickling over.

It's awesome. You know, I think we did it sort of early in the morning.

So, it was mostly like Europeans and Americans were getting on. It's like, okay, sweet.

You know, I'd go back, sit down at my desk and hanging out seeing some people, you know, complaining about bugs here and there on Twitter.

I had a pretty active Twitter at that point.

People would directly message me with problems.

So, it's a great alerting system. And then I got a message from a Chinese player.

It was like, why have the servers been down for six hours? I was like, oh, oh no.

You know, our Chinese players are very passionate about the game.

And they had either been messaging in Chinese, which I didn't understand, or hadn't been on yet.

But we, I had black holed them for six hours because I forgot to set up a firewall rule to let Chinese traffic onto the servers.

I felt a little bit bad about that.

That was an honest mistake. A lot of the production incidents I've caused have been pretty boring, honest mistakes like that.

But once I saw that, and then I looked on the Chinese fan groups, and I was like, oh no, this has been happening.

And I just didn't realize what was going on.

But that is one thing about working in a really small industry or in video games, because you can be very close to your customers through Twitter.

And sometimes it's too close.

You'll get a lot of nasty messages. But then if you resolve the problem, they're very nice to you.

So, Michael, how did you, well, how did you mitigate the issue?

And did it ever happen again?

Yes. Well, we eventually set up probes to make sure that an IP from China was able to connect to the servers.

We obviously had test accounts in each of these regions, but we were testing from our office, which was whitelisted because we were testing on it.

Setting up probes is definitely a major thing. And then eventually adding the firewall rule into Terraform to make sure it wasn't a manual step.
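
A connectivity probe of the kind described can be very small. The Rust sketch below is a hypothetical example, not the setup used at the company: it only checks that a TCP connection can be opened, and the region labels and addresses are whatever the caller supplies. To catch a region-specific firewall gap, it has to run from a vantage point inside that region, not from an allowlisted office network.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Result of one connectivity check against a server endpoint.
#[derive(Debug, PartialEq)]
pub enum ProbeResult {
    Reachable,
    Unreachable,
}

/// Try to open a TCP connection to `addr` within `timeout`.
pub fn probe(addr: SocketAddr, timeout: Duration) -> ProbeResult {
    match TcpStream::connect_timeout(&addr, timeout) {
        Ok(_) => ProbeResult::Reachable,
        Err(_) => ProbeResult::Unreachable,
    }
}

/// Probe every (region label, address) pair and return the labels
/// of the regions whose endpoint could not be reached.
pub fn failing_regions<'a>(
    targets: &[(&'a str, SocketAddr)],
    timeout: Duration,
) -> Vec<&'a str> {
    targets
        .iter()
        .filter(|(_, addr)| probe(*addr, timeout) == ProbeResult::Unreachable)
        .map(|(region, _)| *region)
        .collect()
}
```

Running `failing_regions` on a schedule and alerting when the returned list is non-empty would have flagged the missing firewall rule well before a player message did.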

Manual steps are dangerous. Absolutely. Even if you have it in your spreadsheet, it's bound to get missed.

That's part of the fun of being in a smaller organization, where it's more about getting something out there with as little resources as possible that you know, you're going to make mistakes, definitely.

Yeah, I think in my experience, people, although they might be annoyed that it happened, like you say, they're very pleased that you could put it right quickly.

It's not like you said, oh, I don't know, we pushed, well, there's cases of games that have pushed out bad code and then turned around and said that they can't afford to fix it, right?

Because of certification costs and and basically the game is completely broken.

That's really bad.

And this is obviously a completely different case, but yeah, you can change stuff quite quickly and respond is great.

Having, I think the external probes is like, in my experience, a really good thing.

Although it's pretty hard, like, you know, load testing and external kind of things.

It's like, well, why didn't you just load test launch day ahead of time?

Because there's no possible way for a small kind of dev house to pay for that amount of traffic sensibly and not just, you know, high volume, bad traffic, but legitimate game systems and, you know, whole chains of errors that happen and arise because of interactions, databases or whatever, you know?

So, yeah. And no matter how much you might have like a runbook, it's very easy to be in a panic state and kind of miss a crucial step.

Pretty much any time that you're running a release out or something like that, like something's going to go wrong that you're going to know for next time, or, you know, there's been a code change that made one of the steps different.

And so it's never just like so easy you can kind of do with your eyes closed.

But, you know, if you go into it expecting that things are going to go horribly wrong, then it's usually better than that.

And that's kind of the general ops mentality that I've been aware of.

Assume everything's going to go horribly wrong and prepare for the worst. You know, you have these backup plans, you have all these graphs where you're monitoring things, and, you know, you're especially watching Twitter very closely to make sure that people can get on, even if they're just saying, I can't connect, what's going on, and it's usually the router or something.

So for Cloudflare, like sometimes we have a global traffic level kind of feel for this kind of thing as well.

So it's an, I wouldn't call it a probe, but an internal way of measuring, like, did you have that in this case?

You said you, like you were having connections, but were you able to measure traffic?

Does it work the same in that kind of industry as ours?

Yeah, so, like, I don't know if I could speak for the whole industry, but at least in that place, like we were definitely watching traffic from all the regions, but it was, you know, really, really late at night in Asia.

And oftentimes that isn't an issue, but we've also had, you know, problems with metrics around there.

And so I guess it was just like, whatever it is, we weren't paying enough attention to that one particular region's traffic.

And it was also, or the traffic from China would be incredibly variable, you know, if we were featured on one of the app stores there, our traffic was like 10x.

And, you know, it would fade away within a week.

And we're like, wait, where did everyone go?

That's great. Come back. You know, like, we love the hug of death. You know, if our servers are getting overloaded, because like, you know, players love us, that's awesome.

What feels bad is when your servers are doing fine, because people aren't playing your game anymore.

It's kind of like the weird dichotomy where you're trying to, especially if you're using a cloud provider, that's very expensive.

It's like you're trying to straddle that of like providing just good enough service that you're making a profit without like scaring away your customers from being too slow.

But I guess I do find it interesting, there's a balance in games in which you want people to be playing your multiplayer match, you want them to be like geographically close, so that they're able to connect to like a server that's between all of them.

And so that limits your pool, the people that are in that region.

Furthermore, you want people to be playing in a match that like is their style they like.

So if they're playing, you know, if you find like a tactics based game mode versus just like a shooter game mode, you want people to be able to decide between that.

So that also divides your player base. And then you also want them to be matched together based on their experience or their skill in the game.

So you don't have like, you know, scrub stomping. It's like, you know, actually steamrolling one team or the other.

So that cuts down your player base.

It's a really small group of people. And then people also don't want to play with other people that they've reported for being like, bad sports.

And as you know, in multiplayer games, there's a lot of people who are bad sports.

So trying to figure out a way to get that many people to get into a match in like a reasonable amount of time is an incredibly difficult problem.
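As a rough sketch of why the pool shrinks so quickly, the constraints just described (region, game mode, skill, and reported players) can be written as one compatibility filter. Everything here is hypothetical: the `Player` fields, the skill numbers, and the `max_skill_gap` parameter are illustrative assumptions, not how any real matchmaker works.

```python
from dataclasses import dataclass, field


@dataclass
class Player:
    # All fields are hypothetical, chosen only to mirror the constraints above.
    name: str
    region: str
    mode: str
    skill: int
    blocked: set = field(default_factory=set)  # names this player has reported


def compatible(a, b, max_skill_gap=100):
    """True only if every constraint holds at once for players a and b."""
    return (
        a.region == b.region                        # geographically close
        and a.mode == b.mode                        # same game mode
        and abs(a.skill - b.skill) <= max_skill_gap # similar skill
        and b.name not in a.blocked                 # neither has reported the other
        and a.name not in b.blocked
    )


def candidates_for(player, queue, max_skill_gap=100):
    """Everyone in the queue who could share a match with `player`."""
    return [
        p for p in queue
        if p is not player and compatible(player, p, max_skill_gap)
    ]


if __name__ == "__main__":
    alice = Player("alice", "eu", "tactics", 1500)
    queue = [
        alice,
        Player("bob", "eu", "tactics", 1550),
        Player("carol", "eu", "shooter", 1500),
        Player("dave", "us", "tactics", 1500),
        Player("erin", "eu", "tactics", 1900),
        Player("frank", "eu", "tactics", 1480, blocked={"alice"}),
    ]
    print([p.name for p in candidates_for(alice, queue)])
    print([p.name for p in candidates_for(alice, queue, max_skill_gap=500)])
```

Loosening `max_skill_gap` makes the list longer, which is one way to picture the trade-off between queue time and match fairness.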

When I was there at the time, I actually did like a presentation at, I think it was GDC, on that topic.

Yeah, it's, it is really difficult.

The second that you make it quicker to get into a match people complain about getting, you know, in really unfair matches.

And other people are like, I've been sitting in this queue with my team for like five hours.

Like, it's probably after I actually just got disconnected at that point.

It's a much different set of problems than you'd necessarily have to solve at a place like Cloudflare, which is another set of difficult engineering problems.

Yeah, so you mentioned using Terraform to make sure that you have all the firewall rules in, so that when you apply, everything works nicely.

At Cloudflare, we use Salt for the same purpose.

The one thing to watch out for with these systems, which I've run into a lot, is, so Salt in particular uses Jinja for templating, and it's quite flexible.

It's quite nice. Until you remember, Jinja is based on Python, and Python has this unfortunate habit of letting you compare things of different types.

In particular, you can compare a string with an integer and get false.

One day, I wanted to turn something on, and I thought that the value I was using was an integer.

So I looked it up in a list of integers. It was a string. Instead of turning it on in some places, I turned it off in all the places.

So that was unfortunate.
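A minimal Python sketch of that failure mode, assuming a hypothetical list of integer IDs and a value that arrives as a string. The names `ENABLED_IDS`, `feature_enabled`, and `feature_enabled_strict` are made up for illustration and do not come from any real Salt state.

```python
# Hypothetical data: a list of integer IDs where a feature should be on.
ENABLED_IDS = [101, 102, 103]


def feature_enabled(node_id, enabled_ids=ENABLED_IDS):
    """Naive membership check.

    If node_id is the string "102" rather than the integer 102, the
    comparison against each integer is simply False, with no error raised,
    so the feature ends up off everywhere.
    """
    return node_id in enabled_ids


def feature_enabled_strict(node_id, enabled_ids=ENABLED_IDS):
    """Normalize to int first, so "102" and 102 are treated the same.

    int() raises ValueError on junk input instead of silently answering False.
    """
    return int(node_id) in enabled_ids


if __name__ == "__main__":
    print("102" == 102)                   # equality across types: no error
    print(feature_enabled("102"))         # string looked up in a list of ints
    print(feature_enabled_strict("102"))  # normalized before the lookup
```

The strict variant trades a silent wrong answer for a loud failure on bad input, which is usually the easier one to notice in review.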

It's kind of the polar opposite of what you were trying to achieve, right?

Like the worst-case scenario of what you're trying to do. Yeah, but the nice thing is this didn't matter.

This was for a feature. It was okay that it was off until I could get around to turning it back on again.

So it wasn't an incredible, it wasn't a huge disaster.

Did you know that you turned it off, though?

I realized once the errors started coming in, where it was complaining about, okay, I was like, did I turn this on more?

And then I started asking, well, what does it say on this box?

What's the configuration file? That's the other problem with Salt: it's sometimes tricky to figure out what it actually did.

You can look at it and see what you thought it would do, but actually seeing the files on the system, you might have to ask someone to grab the file and look at it.

It's like, oh, oh, it's off there. I didn't turn it on.

And then, of course, we look at the code and the PR, you know, a number of people put their eyes on it, and written this way, it was like, yep, that's what we did.

It's like, you know, the computer is sort of like a djinn, and you're being granted three wishes, but it's taking you in a very literal sense that you didn't mean, and you just, you know, that's what computers are.

Is that why it's called Jinja?

Because it's, it's actually... Nothing else. That joke is why we had this segment.

It wasn't planned. Maybe you gave the perfect, like, lead-in, and I delivered the unwitting punchline.

That's great. I've got another anecdote, which has no slides, but yes, I've got lots of these kinds of things.

So, this is kind of going back earlier in my career, where I was working on a thing.

I can't really talk too much about what the thing did, but just imagine there's some software that you have to build on your machine, and then the only way to deploy that to where it needs to run is by physically taking it to the machine and running it there.

There's no, like, servers or anything like that.

So, you know, there were USB keys, but, at the time, they were kind of a bit small, a bit expensive.

So, you know, like, you could get a gig, maybe, but you'd have to get, like, a project approval to pay for a USB key like that.

So, actually, you know, there were other keys that were smaller, that were more available.

So, I was using something like a 256 megabyte USB key. It was pretty cheap and whatever.

So, I was early in my career, and I hadn't learned a lot of the kind of life lessons of developing software and the ways to avoid getting into deep problems.

Just some of the good practices of stuff, things that they can't easily teach you in school or college or whatever, they come from experience, and maybe they're things that you work out yourself based on how you like to program as well, and what your strengths are, what your weaknesses are.

And so, what I did was basically a big bang patch to this application.

So, I'd been working for, like, a few days, and from an old version, had this thing that was, like, completely different.

And, you know, it's a case of load it onto the USB stick, take it, you know, a 10 minute walk to the place it needs to run, put it in, do a cycle of, there's, like, no unit or automated testing here.

So, it's a manual process of, like, 10, 15 minutes to get it to the point where the thing, the new thing should appear, or whatever the feature is, you can actually get to that point and say, huh, it didn't work.

Okay. Now, this isn't connected to a debugger or anything. There's no real way to get any logs or anything.

Either the feature works or it doesn't.

So, at that point, you have to kind of dismount the USB stick and take it back, walk 10 minutes back to the machine, try and think of ways, you know, almost like the old web programming style where you can alert at certain points and say, you know, oh, okay, I did get here.

Yeah, there's a console log, but in this case, was it easily visible?

Maybe you can add some logging, but whatever. But this whole torturous process of going there, trying something out, it doesn't work.

Okay. Like, these small tweaks and stuff.

So, you try and make the changes very explicit.

Like, yes, it did load. Yes, this is the version of the software that you're running kind of thing.

But it took me a long time to get to that process. So, you know, a whole day of walking back and forth.

I spent more time walking than programming or doing anything.

It just felt very, very annoying. To the point I just went, look, obviously I've made a whole hash of this patch.

It's, like, I just can't do anything.

I'm going to roll back to what I did, try and run that version, like the original version.

And even that, like, it didn't work as I expected.

So, I just thought, like, what the hell? Like, what is going on? I'm going to, like, do a thing.

So, rather than have to go through a 10-minute process of put the stick in and load the application, do all these steps, just do, as soon as the program launches, print, like, hello world effectively.

That should work.

Put this code in. No, it still doesn't appear. So, I honestly, I was just so scratching my head.

Someone else is, like, oh, look, you seem a bit frustrated. Like, somebody who's a bit more senior and a bit more mentoring.

So, they're, like, let me build it for you and we'll try it.

And, yeah, sure enough, it appeared first time.

So, it's just, like, have I been copying the wrong files? Like, has it been, like, this bad?

Well, like, no. So, we went through it. Again, these, like, boring, monotonous steps of trying stuff out.

And even to the point, my colleague pulled down the patch version and we tried it.

And guess what? The feature worked first time, right?

It did exactly what it should have. But, again, I wanted to find out, because it's no good relying on your colleague to build your own software for you, what was happening.

And it turns out that, basically, the USB stick was faulty. But faulty in a way that it would tell you on the file system, like, on the build machine, that all the file labels and everything had updated and changed.

But when I went to the other machine, it would also, like, give the right dates and times.

But, like, the actual files hadn't been modified. It was just so weird, like, the USB drive lying.

Practical joke, USB drive. Yeah, but it had worked. Like, and you hear these stories that, you know, over time, the quality of the devices degrades.

And, like, this can happen. And memory cells effectively turn from writable to just a read-only.

And that whatever the firmware or controller in the device itself doesn't know either.

It's like the commit had failed. And it was like one of those lessons where if you're getting such illogical problems, yeah, the first thing is don't do a big bang approach like that.

Like, it's never going to end up in a good style.

But also never trust the physical layer. Same in networking as well.

Like, if things aren't talking, just check the cable, check it's switched on.

Because sometimes the logical leap is to go so far into the deep problem that you're solving and assuming that you've done something wrong.

And a lot of times that is actually where the problem is.

Assume that it's an update to a piece of software that's broken something.

But yeah, I spent so much, like, I wore my shoes out on that thing.

You got a lot of steps in, yeah. Yeah. What was funny is, like, a year later, because I was on a graduate scheme, we were doing training courses.

And so they come to this thing. It was a networking training course to teach you, you know, things, how to crimp a cable, what the different layers of the network stack are, how things work on, say, Windows and Linux, these kinds of things.

And, you know, not just, oh, configure an IP address with DHCP, but how ARP works, what MAC addresses are, and all of these things.

And they taught it in a really good way.

And they always say, like, cables can go wrong. They can just break.

Like, do check them. Like, we're not just saying this. It happens. But basically, through all these steps, it kind of reaffirmed some of the lessons I'd learned.

But at the end of the course, in partners, they sent one of the partners outside for half an hour and basically got me and the other cohorts to sabotage the computer in a way, like, all the networking things that we learned to, you know, poison the ARP table and do these things.

And then when they brought them back in, challenged them to get the thing working.

And fair enough, my partner remembered everything apart from the, this is on Windows.

So, just the device manager, I think, just the device had been disabled physically.

Not physically, sorry, but, like, configurably.

And he was so annoyed, again, because he'd done all the complicated steps.

We didn't have a sheet to work through them. He genuinely remembered them from memory.

So, bad times. I had to tell him the answer in the end.

I think that was the most insulting part. He was that close. It's getting, you know, a minute away from the escape room and you have to ask for help.

Yeah. And in real life, it's great when someone could come in and just look at something, actually, like, here's the answer.

You know, you can move on and be happy. But it's that kind of, when you feel like you're being tested, especially, just an artificial environment.

And he kind of blamed me, even though all I did was follow the instructor.

I felt aggrieved as well. So, in the end, yeah, moral of the story, don't do a training course like that.

It was really useful. I still remember those kinds of things.

You sort of mentioned live debugging. I have a story that's not from myself, but from a friend of mine who was working at a large Internet company and they had set up load balancers in front of their web servers, you know, as a standard web scale company would do.

And they were trying to troubleshoot something with the UI.

You know, something wasn't rendering right in staging.

And they were just like, what's going on here? And so, he went in, went, you know, on the live server, you know, super, you know, ninja style.

And it's just like, I'm going to just put in, like, I'm frustrated.

I'm just going to put an F word, right, on, like, the main page here.

And it's like, staging is fine. And he put it on, he put it on one server.

And he was, you know, hitting refresh, refresh, refresh, refresh to see if he could hit the, see if you could see it on the screen.

Like, wasn't it rendering? Still, still looking, you know, added it to another server.

And it's like, still not showing up here. Like, what's going on? And then they got a report that someone was saying F word on the production, production server, like, every, every couple times.

So, it was like, two out of the, two of the five load balancers on production now had the F word, just like in the, I think it was in the, the header of the web page.

So, live debugging, especially when you don't check what server you're on.

I think it's also very important to have consistent naming schemes for your things.

So you know exactly if you're on a production node or not.
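One way to lean on a naming scheme is a small guard that refuses to run a risky step on the wrong environment. This is only a sketch under an assumed convention of `<role>-<n>.<env>.<domain>` (for example `web-03.staging.example.com`); the convention, the function names, and the example hostnames are all hypothetical.

```python
import socket


def environment_of(hostname):
    """Extract the environment label from a hostname.

    Assumes the hypothetical scheme <role>-<n>.<env>.<domain>, so the
    environment is the second dot-separated label. Returns "unknown"
    when the name does not have enough labels to tell.
    """
    labels = hostname.lower().split(".")
    if len(labels) < 2:
        return "unknown"
    return labels[1]


def require_environment(expected_env, hostname=None):
    """Raise unless the host belongs to expected_env; return the hostname."""
    if hostname is None:
        hostname = socket.getfqdn()  # name of the machine we are actually on
    actual = environment_of(hostname)
    if actual != expected_env:
        raise RuntimeError(
            f"refusing to continue: {hostname} looks like {actual!r}, "
            f"expected {expected_env!r}"
        )
    return hostname


if __name__ == "__main__":
    print(require_environment("staging", "web-03.staging.example.com"))
    try:
        require_environment("staging", "web-01.prod.example.com")
    except RuntimeError as err:
        print(err)
```

A guard like this only helps if the names are consistent everywhere, which is the point being made about naming schemes.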

It kind of echoes the point Tom was making on last week's episode around, you know, being a bit, like, curt in the messages that you're using.

And it's very tempting, but as much as it might feel good while you're doing it, there's also sometimes side effects.

Yes, the issue is much worse if you, if you put F instead of just like, this is a test, like, this is a test is fine.

It's also, it's important to get plenty of sleep and step away from the computer, you know, maybe those walks that you were taking between your computer and where you were putting the USB stick in were sort of clearing your head a little bit.

Not in this case, you're just thinking like, well, yeah, it reminded me like, to the point where I put more, like multiple copies on, you know, because, because the work takes so long, like, well, if it only takes me 10 minutes to spin out like a variant and build it and deploy, well, I'll just come up with like, I'll spend half an hour making more variants and then I'll go and try them all.

And, and they all fail again. It's yeah, stepping away out of that environment would have helped.

But I think it was kind of the assistance of a colleague just to give a little bit more insight into that issue itself.

But yeah, you know, like last week, the question is, you know, what, what kind of is good or bad in those situations?

In all of these things, I think people have always been like, that's, that's weird.

Doesn't sound right. Let me have a look as well. Sometimes just getting validation that it's also an error for them is enough to give you the motivation to keep digging deeper.

Like sometimes it's better just to nuke your entire directory and kind of start again.

Change device. I don't know. Remove all the variables possible.

But the worst would have been if you hadn't realized that was the USB stick.

Your colleague goes over, plugs it in, it works for them, you go over, it doesn't work for you.

I can see it taking a good time to realize that the USB device you're using is the problem.

Yeah. And, and the, the opposite problem is that then, again, I'm just reminded now, you start to question everything you've done up to that point.

So all the things you said, yeah, that, that worked fine.

Depending on the kind of thing you're developing and how you're testing that you've progressively gone along and yeah, yeah, I've introduced no bugs here.

You know, what version did I ever test? You don't know. So you have to go through a whole re-qualification of things.

So, you know, how could you mitigate that?

I don't know, like hash your software and, or embed versions into it.

It's like, it's a lot easier said than done to do that. And even then sometimes it's like, well, the thing that prints the version isn't the actual software.

It's something else.

You know, it's, it's lying as well. I've got no direct experience of that, but I, I do know of that happening.
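A minimal sketch of the "hash your software" idea: copy a build artifact, then re-read the copy and compare SHA-256 digests. The function names are hypothetical. One caveat that matters for the USB story: a re-read straight after the copy may be served from the operating system's cache rather than the device, so a matching digest here is weaker evidence than hashing again after unplugging and re-inserting the stick, or on the target machine.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst):
    """Copy src to dst, then compare digests; raise if they differ."""
    shutil.copyfile(src, dst)
    expected = sha256_of(src)
    actual = sha256_of(dst)
    if expected != actual:
        raise OSError(f"copy of {src} to {dst} does not match: "
                      f"{expected} != {actual}")
    return expected  # record this and compare again on the target machine


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "app.bin"
        source.write_bytes(b"hello world")
        target = Path(tmp) / "usb_copy.bin"
        print(copy_and_verify(source, target))
```

Writing the digest down next to the build gives a "what version did I ever test" answer that does not depend on file dates.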

Yeah. It's, it's, it's when you have those things, those visual cues that you use to confirm something that are wrong.

So in your case, the USB stick saying that it had the updated file when it didn't actually is one of those problems where you're just like, oh, you know, those rocks of truth have crumbled.

What I'm interested in is if anyone who'd worked with punch cards would have run into issues where the bug was in the punch cards themselves.

I'm sure like you, it was easy enough to make mistakes with them, but like an actual bug in, in the punch card that it would be like read incorrectly, just due to like some malfunction in the medium.

I think we have some engineers who used punch cards, don't we, Watson?

We should, you should get them on the next episodes.

Oh yeah. I will definitely reach out to some of the people with long careers.

Yeah. That'd be fun. Punch cards are moisture sensitive.

So they, and they are, I can't, I mean, all these media are mechanical media.

So if you have a batch of punch cards that, that it's moist and they're swelling just a little bit, you can have problems.

Network cables, by the way, are, so the optical ones have a minimum bend radius.

And if you happen to tighten it, it's not that it stops working unless you get really lucky.

It just sort of starts to not work.

And if you're very unlucky, when you take it out and plug it into the tester, you have just undone the bend that is the problem.

I've got something related to cables, not optical, but in a place I worked, we were doing devices that can measure like heat and light levels.

Effectively measure the data center environment and we're developing the software and hardware and firmware for these devices.

And so I had some tests I needed to do.

Actually, what I was working on was IPv6, but in order to test some feature related to IPv6, I had to go and plug in a lamp, a high energy lamp that could produce enough impedance or resistance, I can't quite remember, to trigger a sensor, to create a reading that could be read and then converted.

And this is this whole chain of stuff that had nothing to do with what I was actually working on.

So, because I didn't have to do it that often, I kind of forgot about it.

So I turned this lamp on and walked away and then came back in like the afternoon.

And my colleague said, is your Internet playing up today? Like, it's been really bad the whole day.

I'm like, no, it's absolutely fine over here. I've just been on a call.

It turns out his cable, his Ethernet cable, was right next to it. And it's basically like there was hardly anything of it left.

It was like down to the bare kind of copper stuff with melted plastic everywhere.

And it got progressively worse through the day as both things had kind of increased.

Yeah. I felt really bad then because he'd got really stressed through no fault of his own.

And I'd just forgotten. And basically, I never used that lamp ever again.

So thank you very much for appearing on the show today.