Join Cloudflare CTO John Graham-Cumming as he interviews Cloudflare engineers and they discuss a "war story" of a problem that needed to be solved — and how they did it.
This week's guest: Andrew Galloni, Engineering Director at Cloudflare.
All right, welcome to Storytime, and good morning from Lisbon to London, where my guest Andrew Galloni is sitting.
I assume you're in London, I guess. Near enough, Sevenoaks.
Near enough, near enough, somewhere near London. All right, well, welcome to Storytime. Every week what I try to do on this show is talk to someone at Cloudflare about either the history of Cloudflare, a particular debugging problem, or something they worked on. And Andy and I have worked together for a very long time at Cloudflare.
How long is it now, Andy? Cloudflare, I joined in 2013 in January.
And then we'd worked together before that, so it's now so long I'm starting to forget how long it is.
2009. It's good to chat with you. I thought one of the things we might talk about, so you don't work in my team at all, you work in this other thing which is now called Emerging Technology and Incubation, ETI.
Tell us about what ETI does. ETI is about looking to the future of Cloudflare.
So we know what Cloudflare is good at and the idea is to have a separate team that can think about new and different areas.
So for example, I'm looking at Cloudflare for Teams, which is how to secure businesses.
And especially at the moment where everyone's in remote working, looking at those sort of areas.
Things that may not necessarily work out, and that aren't part of the core technology, because as businesses grow, as you know, it's actually very hard to make big moves into other areas.
So the idea is to have a team that can focus on emerging technologies and incubating ideas.
Yeah, people who don't know about this structure in Cloudflare, there are actually, in a way, three engineering leaders.
There's your boss, Dane Knecht, who's the SVP who runs that group, Emerging Technology and Incubation.
There's my team, which is actually a fairly small team, which has some engineers and the research group under Nick Sullivan, which does stuff with universities.
And then there's Usman Muzaffar, who is the SVP of engineering, who's doing product development.
So we actually have this deliberate structure with different people leading different teams.
And so you're on the bit that's doing not necessarily the next product, but the sort of next next.
Exactly. So even ideas like Cloudflare TV.
Yeah, exactly. But if we dial back, when Cloudflare got started in London, so I joined, I was the first person there and you were number two, right?
Correct. Yes. So now it's what, is it eight years? Seven and a bit, I'd say January 2013.
You basically left me on my own for a year is what happened. I worked for Cloudflare for about a year before I persuaded you to join.
That's right. Yeah. Because what happened was, as I said, we worked at a previous company.
I remember actually when I signed out a previous company was ETL 2.0.
It was. At the time. And what really fascinated me was you changed roles from being a CTO to becoming a programmer.
Yeah. And I remember emailing you or texting you in about November of that year saying, that sounds fascinating because I was an engineering director, you were CTO.
And I wasn't sure that's what I wanted to do. And you just made a huge jump to become a programmer.
That was fascinating. So I came and had a chat, had a coffee.
And, you know, as you can see, that's really worked out for both of us.
Right. What's funny about that coffee is I remember that really well because we had it in Paul near South Kensington station upstairs.
What I remember about that is you asking me whether, you know, I thought Cloudflare was a good thing.
And I think I said something like, well, it seems like a good place to work.
I'm not sure it's going to work out, but, you know, it's not a bad spot.
And here we are. So it kind of worked out. So exactly. Yeah, that's right.
Yes. And then, yeah, I went over to San Fran in January of that year, next year, sorry, to meet the team, decided to join.
And then there was me and you working from home.
Yeah. Working from home for quite a while. And one of the things, if you remember, as you say, I went from running the engineering group with our previous company back to literally just writing code.
Right. I just wrote code at home.
And you did as well. Right. Because your background, you've done a lot of stuff that was to do with web technologies.
And you were like, yeah, I'm going to get back to doing that.
Yeah, stuff we still use today.
Do you remember that? Because I think, didn't we both fly out? We ended up flying out together.
And then we got, Matthew met us at the airport or he called us and was like, there's a piece of malware which is breaking our challenge.
Exactly. Yeah. Yeah. That was my first day. Yes. It was your first day. Yeah.
Yeah. Yeah. I think you remember that very, very clearly. And so, okay, so you went back and you did that kind of stuff.
And now you've gone back to being engineering director again.
So we've come full circle from me too, right? Exactly. Yeah.
So we've regressed to the mean, basically, of what we were doing before.
Pretty much. But give us a sense for why did you want to go back to sort of the engineering side of it rather than managing people at that time?
I think I just wanted to get back to coding and solving some interesting problems.
It's something I really enjoyed doing so much. And as I sort of progressed, the assumption was you had to be a manager, right?
Just the traditional model.
And I sort of did enjoy that, and still do enjoy that. But as I said, once you'd actually made that step yourself and gone and just said, I can be a programmer,
it was like, okay, maybe I should go back to it.
So that's literally the conversation I had with you.
And there's lots of interesting things to do at Cloudflare as well, which sort of helped.
I think that's right. There still is this huge amount of stuff we're doing, and there's always these different projects you can end up working on.
And then I know that for me, when I went back into it, I had managed all sorts of different teams.
And I just got to the point where I felt disconnected from the technology completely.
I didn't know what was going on, particularly at the previous company where we worked at, where there was that very strong group of people who knew each other and worked together.
I was just like, I don't know what any of this stuff is.
And then to actually go back and just write some code and mess it up and write tests and just felt great.
Yeah, no, absolutely.
And deploying something on the Cloudflare network was like, that was really cool.
Yeah, definitely. Yeah. And now it's, I mean, what's interesting is if you remember back then, I think we had like five locations around the world and deployment was pretty quick.
And now we've somehow managed to keep that deployment speed up in 200 locations.
So it really did, as you say, work out. What I think is interesting is when you talk about these career things, because people often feel that, for career progression, they need to be a manager and then a director and all this kind of stuff.
And reality is, first of all, that pyramid is impossible for everybody to be at the top because of the shape of it, right?
There just aren't enough places for people.
But also you don't have to go and be a manager.
You've really moved in different roles in Cloudflare. Just take us through that because you've done lots of different stuff.
And I think it's interesting to talk about how we think about things.
Yeah, absolutely. So yeah, I started off, and it's almost like I've come full circle, just looking purely at technical problems and small things. Web performance is one of my big passions.
So you've been looking at like WAN optimization and how to improve edge to origin performance.
So I sort of focused on edge to eyeball and still do actually.
But sort of during my time at Cloudflare, I think I've run and managed most teams from www, the UI team and the design team.
I ran the whole design team for a while.
Yeah. The whole of the edge as well. The whole of FL, Nginx. Yep. All the security: the firewall, the WAF.
Yeah, pretty much. I've ended up running and managing every team, which is fascinating when you think about it, the opportunity to be able to do that.
I mean, that's the great opportunity of small companies, right?
Is you end up, you know, how did I end up writing a WAF for Cloudflare? I had no intention of doing that when I came to Cloudflare.
I was like, well, somebody's got to do it.
You're available and you probably have the right skills or maybe you don't, but you'll learn them.
Yeah. And yeah. Why was I in charge of rolling out SPDY and then HTTP/2?
You know, figure it out, figure it out. Yeah. The other thing that's interesting is that the two of us being in London and then at the time when we joined the rest of the engineering team in San Francisco, what happened was the center of gravity of Cloudflare from an engineering perspective moved somewhere, I think Bermuda maybe, but you know, it moved towards the UK.
And I think a lot of people outside the company don't realize how many of the critical core products were and still are completely made in London, right?
That's certainly almost every interview I used to do. It was, we're not a satellite engineering team in London.
It was, we actually build and own these products.
It's not like we just keep them going in the night. Yes, exactly. Yeah, exactly.
I think that was, and that's sort of now we've grown. Now you now work for a team in Austin, right?
Yep, exactly. Yeah. I have a team just based in London working on some areas.
Yeah. Right. Right. So just go, if you, if you can remember now, I mean, you worked on, let's just think about it.
The dashboard and stuff. That's right. So yeah, I ran the migration to, yeah, the 2.0 version, which I think the one we use now is probably still based on.
And that was a huge migration from... Oh, did I lose you?
Yeah. Oh, I think I lost you for a moment there. You said it was a huge migration from...
Yeah. Just from, you know, the original code base into, you know, something that could support a full API.
Yeah. My Internet connection is unstable.
There we go. Yeah. Sorry. Yeah. I can't, it's going back a bit now and my memory isn't the best.
Yeah, I know me too. And there's things I worked on.
I was like, wait a minute. Did I? Yeah. Yeah. Cause I worked on DNS while you were, that's right.
A lot of things. And there were, there are now things where I look back and I'm like, wait, did I, I think I did.
How many other people worked on it?
And I think they were like two, you know, you suddenly realized how much a big part you were of something.
Absolutely. Yeah. And I think probably the most famous story about that was I grew a beard and I promised in the meeting not to shave until we actually shipped it.
It's probably the biggest thing. That was a terrible mistake, right?
I know. I never, that's my one bit of advice to everyone now is never say that.
Never say that. It would be quite large. Yeah, exactly. Cause it was such a huge migration, you know, absolutely every feature, every API you can imagine.
Yeah. Yeah. It was a big thing because of the way Cloudflare had like any company had sort of grown by accretion of stuff.
And there was this big PHP thing that was still sitting there with lots of weird, well, there was a weird thing where you had leaky abstractions and all sorts of stuff.
It was just the mess had grown basically.
Yeah. And you know, no coherent style or IA or anything like that.
So afterwards you could add more pieces to it, which we still can do today.
So, yeah, yeah, exactly. And we still now use that migrated platform.
And then is it after that you worked on FL? Yep. That's right.
And then I moved on to managing FL, yeah. And that was... For people that don't know what FL is, that is actually the server that, if someone goes to a website or something that's on Cloudflare, serves that request.
So it is, it has to work. Pretty much. Yeah. That's sort of the core thing.
So we have SSL sat in front, which is the TLS termination. And that's the, it receives a request.
It figures out any of the features that are turned on, does all the sort of lookups and then, you know, calls those features in order.
And then obviously it goes to cache, and then cache will go to the origin.
But yeah, it's, it's, we used to call it the brain.
If you remember, it's that every single request goes through this.
Yeah. Yeah. I think we also called it the brain because we didn't understand how it works.
Right. That was part of it. Yeah. For a while. I remember, I remember for a long time, because particularly the bit in FL that did all of the request processing, there was a while there where it was almost a running thing where people would ask: in what order do our features execute?
Because it was not clear. Like, does the WAF run before the... and when are all the overrides put in place?
So yeah. Yeah. You just think about the number of features Cloudflare has and the way in which they interact.
It's actually monstrously complicated, just that bit. And that was just when a request happens, like, you know, someone's web browser goes, get this image.
There's a whole massive amount of code that executes to figure out how to do that.
Oh, exactly. And I think when I first started, that was when we were first porting it from PHP to Lua to run in Nginx.
Yeah. And that must've been happening when you joined was that PHP to Lua.
Exactly. Yeah. And you know, one engineer and, you know, there's always feature requests.
So you can imagine everything was just being added and you were starting to get, you know, action at a distance where you weren't sure when you change this, what would change there.
So, yeah, there was a big push to modularize FL and have, you know, clear callbacks so that anyone else could then, you know, the goal was anyone could write a feature and be able to put the module in.
So that took quite a while. Yeah. That took a while to get into that.
One of the other big things as well was just to sort out our patch sets for NGINX.
We used to hack everything together. Yeah. And remember it would take us six months to do a new release of NGINX because we'd have to do it all ourselves.
We weren't running the core tests, et cetera.
And we did a big change where we made it all very clear and it's just very easy then to update.
That was a big part of what I did.
Yeah. It's true. I'd forgotten all about that whole NGINX patching mess and how we tested it and everything.
That was, that was hard work. It goes away then.
Yeah. And then you don't remember. Yeah. Yeah. My brain has deliberately put that on the side, like, remember the horrors of that.
But then you, so you do all these things and I'm trying to think, when did you transition over to the ETI team?
I think it was around a year ago. So yeah, you sort of, sort of sit back and think, you know, what did I enjoy doing?
What, you know, what's, what's my passion?
And it's, it was really, you know, building these new technologies, looking at new things and driving those things forward.
And, you know, I was having a team that was able to do that.
It was great. So then I transferred over there, brought a small team with me and working on, you know, interesting new things.
And one of those is all the image optimization stuff, right?
That's been a whole wonderful world of its own. No, absolutely. Yeah. We've taken, you know, a lot of the core image optimization stuff and we've built this really interesting Rust server to do it all in.
Obviously everything's done in Rust now.
Our whole team only works in Rust. Is that right? The whole team only works in Rust?
Pretty much. Yeah. But you're the cool kids. Exactly.
Yeah. So we've got an image resizing proxy that does all the standard sort of image resizing stuff.
We've been looking at the new formats, but the really fascinating thing that we were working on was progressive images.
And how you can deliver progressive images: if you encode a JPEG progressively rather than baseline, then instead of loading each piece at a time, you get a couple of low-detail scans first.
And that's actually good enough for the visible quality of a page.
So you can get a site loaded, you know, 50% faster if you use progressive images.
And the real trick we did, which I think is really, really exciting, is that we've linked that in with HTTP/2 prioritization.
So as the first scan is loaded and passes through us, we then drop the prioritization.
So if you've got multiple images being loaded at the same time, all the first views will be sent first and prioritized.
Yeah. That's a fascinating thing, right?
Because it's like, one of the things that's weird about the Internet is that it's grown massively because of these different layers, which are essentially independent of each other.
But then you get things like this, where it's like, well, if I know about the layer below me, I can, I can sort of send a hint to the layer below me and say, yeah, prioritize this bit in this case, and then get that sent.
And then you get what you get a slightly less high definition image very quickly.
And then you get the rest of the definition afterwards. Yep. So I think about it is how can we send the optimal bytes in the optimal order to our customers, to our eyeballs.
Right. The way I sort of phrase it now is: it's either compression, where we make them as small as possible, or the ordering, which is critical, because a browser will only be able to render a site if it has certain information.
And if you send that in the right order, you get a much faster experience for everyone.
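The scheduling idea described above can be sketched in a few lines of plain Python. This is a toy model with hypothetical image names and scan counts, not Cloudflare's actual implementation: with progressive encoding plus prioritization, the first scan of every image on a page is delivered before any image's later scans, so everything becomes visible (at low quality) as early as possible.

```python
# Toy model: each image is a list of progressive scans (scan 0 is a
# low-detail preview, later scans add detail). With HTTP/2-style
# prioritization we send every image's first scan before any later scans.

def interleave_progressive(images):
    """Yield (image_name, scan_index) pairs in priority order:
    scan 0 of every image, then scan 1 of every image, and so on."""
    max_scans = max(len(scans) for scans in images.values())
    order = []
    for scan in range(max_scans):
        for name, scans in images.items():
            if scan < len(scans):
                order.append((name, scan))
    return order

# Hypothetical page with three images of varying scan counts.
images = {
    "hero.jpg":  ["scan0", "scan1", "scan2"],
    "logo.jpg":  ["scan0", "scan1"],
    "photo.jpg": ["scan0", "scan1", "scan2"],
}

schedule = interleave_progressive(images)
# All three first scans come before any second scan.
print(schedule[:3])
```

Without the prioritization trick, a naive server would send all of hero.jpg before starting logo.jpg, so only one image would be visible at a time.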
Yeah. Yeah. It's kind of amazing how like simple the web is and how much complexity there is in that simplicity, like an individual webpage, when you go to it, you don't even think about the masses of information that were first of all sent and then reconstructed in a way that makes the final, the final look that the user has.
Yep. And with the amount of media and everything that wants to be sent now, there are so many different facets to think about, even just battery life, right?
Because how much and what encoding you use and how much has to be encoded on the device.
So, you know, we even think about what's the best way to save battery for end users.
That's a good point. Just talk about that a little bit more.
My understanding, and you're more of an expert on this than me, is that certain image encodings, depending on the platform you're talking to, whether it's like a Microsoft thing or an Android thing or an Apple thing, actually lower battery life to display an image.
Is that right? Yeah. And the main thing is actually sending the right image down, right?
Because you send a huge image and then you've got to try and fit it into a small place.
You've got to decode the whole thing first.
So actually one of the main things is to send the right-size image to the right device, because then the browser doesn't have to do much with it, or decode it all, or, you know, resize it.
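As a toy illustration of that point (the breakpoint widths here are hypothetical, not Cloudflare's), picking the smallest pre-generated variant that covers the display width times the device pixel ratio means the browser never downloads or decodes more pixels than it will actually show:

```python
# Hypothetical set of pre-generated image widths, smallest to largest.
AVAILABLE_WIDTHS = [320, 640, 960, 1280, 1920]

def pick_width(css_width, dpr=1.0):
    """Return the smallest available variant that still covers the
    CSS display width at the given device pixel ratio."""
    needed = css_width * dpr
    for w in AVAILABLE_WIDTHS:
        if w >= needed:
            return w
    return AVAILABLE_WIDTHS[-1]  # fall back to the largest we have

print(pick_width(300, dpr=1.0))   # → 320
print(pick_width(300, dpr=2.0))   # → 640
print(pick_width(1200, dpr=2.0))  # → 1920 (capped at largest variant)
```

This is essentially what the HTML `srcset`/`sizes` mechanism lets a browser do client-side; doing it at the edge means the right bytes are chosen before they cross the network.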
Yeah. I think the other actually interesting thing, which we're literally working on this week, is there's a new image format in town called AVIF.
And we're just getting that into testing now, which is really cool.
So what's great about AVIF? It's way smaller. That's the story, right?
But what's the trade-off? There's got to be some trade-off. It may take a little bit more CPU.
And that's kind of why we want to get it into testing, to see how long it takes and why, and see if we can improve that, you know, classic Cloudflare thing, and then also understand what it does for the client as well.
It'll be a much smaller bandwidth, but does the decoding take longer?
I don't know yet. That's what I'm going to do.
Yeah. That's one of the things that I found sort of so fascinating and at the same time difficult about something like, you know, what Cloudflare says, which is we'll make the web faster, or we'll make your Internet property faster.
And then it's like, well, what does that actually mean when you've got different network types, different handset types?
I mean, different computer speeds. It's like, it's a really complicated equation to actually figure that out.
There's a balance between those things. And that's the thing we're trying to move towards, which is we want to serve the optimum image or asset depending on that network and device type each time.
That's like the goal I'm aiming for.
I've got lots of bits in play. We've not quite got there yet, but that's the goal.
It's like we see you're on this device with this network bandwidth at the moment, we'll send you this quality of image that we know will be perfect for you.
That's going to work well. Yeah. And you get sort of a little taste of that on the filmstrip view we have in the dashboard, right?
You can kind of start to see what Cloudflare does for you versus if you're not.
Yeah, exactly. That's one of the hard things sometimes for people to understand is what's the value of Cloudflare?
How can you show it? So WebPageTest, which, you know, hats off to Pat Meenan, as always, for providing this great tool, shows how websites load.
But for anyone who's not super technical, I think the best view is the filmstrip view, which will literally show how much of your website is shown to your customers at certain points.
So it's very clear to see, you know, several seconds, what the differences are turning on this feature, or if you use slightly smaller images, or if you're using Cloudflare.
Yeah. Yeah. It's really, it's really interesting to see that.
And you can actually see how, and then you can modify something and rerun it and see, okay, this is, this is what will make a difference.
Yeah. You worked on some of the compression stuff in the past as well.
Didn't we? Didn't we look at Gzip and Brotli and all these nonsense things?
Yeah. Yeah. Yes.
Gzip, yeah. We're still using Brotli on the site. In fact, we're working on a feature now, which is the latest thing I've been landed with, or working on, which is CDNJS actually.
And we're looking at the best compression methods for CDNJS and what level of Brotli we can serve.
And, and we're actually going to be using workers to do that, which is quite interesting.
Well, that could have a huge impact beyond properties that use Cloudflare directly, because that's incredibly widely used on websites, you know, if you're embedding jQuery or React or whatever.
So we're seeing if we can use a higher level of Brotli, because we've got time, because we can then store it ourselves, right.
Not just doing it on the fly, as we do for sites whose origins don't support Brotli at the moment.
Ah, so we would keep it in cache pre-compressed.
Actually in KV. In KV. Wow. Using all the technologies.
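The trade-off being described, spending more CPU once to compress at a higher level because the result is stored rather than recomputed per request, can be illustrated with Python's standard library. Brotli isn't in the stdlib, so this sketch uses zlib, where the compression levels behave analogously (higher level, smaller output, more CPU):

```python
import time
import zlib

# A stand-in for a JavaScript asset that will be served many times.
payload = b"function add(a, b) { return a + b; }\n" * 5000

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Higher levels cost more time but produce fewer bytes. That's a
    # win when you compress once and cache the result (e.g. in KV),
    # rather than recompressing on every request.
    print(f"level {level}: {len(compressed)} bytes, {elapsed_ms:.2f} ms")
```

The same reasoning applies to Brotli's levels (0 through 11): on-the-fly compression stays at a modest level, while pre-compressed stored assets can afford the maximum.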
Yeah. Yeah, exactly. You mentioned that your team only writes everything in Rust.
How did that come about? You probably know as well as I do.
Yes, but the people listening don't. Um, we were using a language.
It wasn't so memory safe, shall we say? Yeah. Nothing bad has ever happened with that.
Yeah. Whenever you do that. And in the postmortem for that happening, I remember distinctly you were sat at the table with us and our team, and Ingvar mentioned that he thought the best thing to do would be to go and use Rust.
And you were like, really? Do you think this is a safe thing to use?
Well, it's safe, but do you think this is a reasonable thing to use?
To which I said, didn't someone ask you the same thing about Golang many years before?
Yes. Yes. And you went, ah, okay then. Yeah. Yeah, exactly. I mean, I think that that's exactly what happened, right?
Which is that after Cloudbleed, which is the thing we're talking about, where a C-based program leaked memory, it seems such a long time ago now. Even though it was so hard at the time going through it, we really tried to move to memory-safe stuff, particularly memory-safe stuff on the edge, where you're dealing with the great heterogeneity of the Internet throwing stuff at you.
Absolutely anything. Yes. And Rust has really nice properties and we're like, hmm, yeah, maybe we should write all this stuff in Rust.
And so I think my only hesitation, and honestly anybody would have said this about Go at the time when I started using Go, because it was 0.98,
was: is Rust stable enough? Has the language, you know, got enough of the libraries and all this kind of stuff?
Again, performance, all those things.
Yeah, exactly. And yes, yes, it turns out. Yeah. Yeah. We've written quite a lot; our core libraries now are in Rust.
So we wrote the matching engine, wirefilter.
Yeah. That's in Rust. And the new proxy we're working on for Cloudflare for Teams, that's a Rust proxy.
Yep. Yeah. And the new proxies are an interesting thing, because we've always used Nginx as the core proxy component within our architecture, with all the stuff around it, but it's now feeling like it's time to actually replace Nginx as well, right?
Yeah, exactly. So every problem doesn't have to be solved with another Nginx.
Now we can add another Rust proxy.
Yeah. We have more control. Yeah. Yeah. All right. So it's been what, seven years?
What's the next seven going to do to the Internet? Wow.
That's a question. I have absolutely no idea, which is why you're not in the job of writing those annual things I have to write, predicting the future.
So I'm not going to tap you for extra help at the end of this year, then; you're not doing that.
So, so how's it been doing this?
You were an engineering director, you left to become an engineer, and now you're back to being an engineering director.
I think you've found a niche where you're running a team at a senior level, but really involved in the technology still.
Yeah. I think I found a place where I can be very effective in terms of, because I've been around long enough, I know how everything works and I can actually get traction and see where there's synergies between different areas.
As you say, I'm using all these different things, but, you know, I've really taken on how to use Workers and really pushed things forward there for our features, and I know how the protocols work and can marry these things together, while being senior enough to be able to make these things happen.
Yeah. And work is interesting in this context, right?
Because this is now, I mean, something we provide our customers with, but you're using it to build internal stuff at Cloudflare, core functionality.
Well, yeah, I'm a great believer that if we can't build our systems using it, why should our customers be able to build theirs with it, right?
It just, you know, it doesn't make sense. So, you know, I remember saying we need better observability, and that's what the Workers platform is providing, and, you know, logging and debugging and all these things are coming, but it's sometimes better for that to be driven from inside as well.
And so, yeah, cooking your own dog food or something. I'm not quite sure what the right metaphor is here, but yeah, exactly.
But yeah, dog fooding, I think is very important.
Yeah. And the, and the Wrangler tool, I think has really grown because of that, because we're actually trying to do stuff.
If you try and get stuff done, you're like, oh wait, I can't do X.
And it's like, oh, well, we have a tool where we can add that feature.
So. Exactly. And it's interesting, you know, because we say we've got to have things in CI, we've got to have tests, we've got to have it built.
And I was saying, literally, if I'm using Workers, I need to be able to go through the same pipeline, because that's the way Cloudflare works, and, you know, similarly with our customers.
Yes. If you remember when we joined, we didn't really have a CI pipeline, did we?
We had Lee's laptop. Pretty much. Yeah. So yeah, not just Lee's laptop either, but yes.
Yeah. It was Lee's laptop, and occasionally, I remember, on the PHP for the website (not for the part that serves our customers, but for the Cloudflare website itself), actually logging in and hand-editing the PHP live on the site.
So that was eight years ago. Now, that would, well, first of all, probably get me fired, but second of all, I don't think I could do it.
So yeah, fortunately I don't have access to anything like that.
The other day somebody asked me something and I was like, all right, I can't, I can't do that.
You don't want me doing that otherwise I'll break something.
So I don't have access.
So, all right, well, we're almost out of time. Thank you so much for coming on my silly show story time and talking about Cloudflare stories.
I think your point about, you know, having been here for a long time is really an important one for companies, which is, you know, you get people with institutional knowledge who understand the technology, who understand how things operate, who can really help, you know, lubricate things and make stuff happen.
So I'm hoping you stay for another seven years.
Help keep things going. So, yeah. All right, Andrew Galloni, Engineering Director in the Emerging Technology and Incubation group.
I don't know, what was your job title when you joined? Was it software engineer?
Yes. There you go. Front-end engineer or something. Yeah. Front-end engineer.
Yeah. Yeah. All right. Well, thank you very much. Thanks for coming on the show.
Have a good time in London and I'll get to see you in person one of these days.