Latest from Product and Engineering
Presented by: Usman Muzaffar, Tom Lianza
Originally aired on March 26, 2021 @ 8:30 AM - 9:00 AM EDT
Join Cloudflare's Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Joining this week: Cloudflare Engineering Director Tom Lianza
Transcript (Beta)
All right. Good afternoon from California, everybody. Welcome to the latest from product and engineering.
Although this week's show will only be the latest from engineering.
And in fact, a look back at engineering. Jen Taylor is out this week, but I'm very thrilled to welcome my colleague, Tom Lianza.
Tom, say hi. Hi. Yes.
Welcome to the very oldest from engineering. The very oldest from engineering.
Right. So Tom and I were just chatting about what we'd like to cover today.
And Tom has been at Cloudflare. Tom, I don't even, I don't, I'm not going to get it.
I'll get it wrong if I guess. Exactly how long? Five years and four days.
Just hit the fifth anniversary. So this, you know, Tom's been a key player in Cloudflare engineering on many different teams.
I was just trying to make a list of how many different projects and teams you've been a part of.
And I thought it might be fun just to sort of walk through that list with you and some of the great things that you've been part of over the years.
When I first met you on the roof of Cloudflare's 101 Townsend office, I think we were having the Internet Summit.
The subject of the day was SSL certs and some great products that we were releasing around that time, I think.
So talk about that. Let's start there. That seems like it's a good enough place to start.
So when I joined, actually, I found out several of my peers have joined under similar circumstances.
It was an exciting time in SSL.
It was the one year anniversary of Universal SSL. So let's remind our audience what Universal SSL is, which was a big deal.
I was sort of like jaw dropped.
I was a Cloudflare customer at the time, not an employee. Yeah, I mean, it's part of why I became a Cloudflare customer when I was at the startup.
Dealing with SSL certificates once a year was just like a calendar reminder, a pain in the neck.
You have to go and figure out which one you're supposed to buy and people scared you that theirs was the best.
And, you know, like I had other things to do.
Hey, more money for my longer number is how I remember the market.
Yeah. Is this bank grade? Is that good? Is that good enough? How green is the green checkmark going to be if I get it from this?
So that's why I became a Cloudflare customer at the startups I worked with prior.
So I was excited to take part in it.
But it was the one year anniversary. So all the certs that we had from that first couple of weeks of its announcement were being renewed at that time.
They renew a few weeks before they expire. And just let me say, all those certs, we're talking like, let's get it.
Let's put an order of magnitude on this.
What does all the certs mean? I mean, back then, I'm not sure. I mean, so here's the thing.
And it's just the challenge of that product back then was none of us were sure how many.
We put multiple domains on one certificate. So you'd have these certs that you would share with other customers.
Right. In part for conservation of IP addresses.
And we didn't know, we didn't want to overwhelm our edge with lots of certificate material.
So the number of certs, even then, I think it was still 20-some-odd potential zones per cert times.
I think, I want to say tens of thousands, maybe six figures in that order of magnitude back then.
I don't know that we opted everyone into that at the time. We're not talking about hundreds or thousands.
We're talking about seven digits of a number of certs to manage.
And suddenly, and we were just talking, two seconds ago, we just said what a pain in the butt it was to renew the cert for the one or two services that you are responsible for.
And now, basically, the calendar reminder is going off for, let's call it a million certs, let's say, need to be renewed.
Right. And SSL is interesting.
I don't spend as much time with it anymore. But what it means to have a certificate, I think, was at the time not clear.
Like everyone understood it meant encryption.
So as a browser, as a visitor, I'm now talking securely and I can enter private data and know that it's not being snooped on.
But there was a period there where people thought it also meant we won't give you a certificate as a business unless we know who you are or we trust you or we trust what you're doing.
And there's a whole other level of implication by having a certificate.
And so certificate authorities at the time took on some of that responsibility and said, you know, if something's suspicious about the certificate or the name or their domain name, they'd reject it.
And sometimes you wouldn't, you'd have no idea why.
And us just trying to help our customers get certificates, we had to then explain this to our customers.
Basically making applications to another authority on their behalf and are now in the business of having to explain to them why something was or wasn't working.
So we literally had to build a complete pipeline to manage an integration with a third party provider and then ship it to the edge.
Right. And with customer sharing, have these situations where like, okay, I need to attest that all 20 of these domains are completely valid.
There's no illicit activity happening.
And if one of them has a substring in the domain name that looks suspicious, then all of a sudden we've got a blockage and we have to escalate and all this stuff.
It was a tricky problem, and one whose scale only increased because it was such a successful product.
And then, you know, this landscape changed dramatically in the coming years as SSL became the default in browsers.
We started signing on customers that themselves were SaaS providers that wanted to make, you know, the website be, you know, support.mycompany.com, you know, as opposed to a third party provider.
And then, of course, Let's Encrypt showed up. Right.
And so the whole landscape continues to change on us. Yeah, I think it's evolved in a really positive direction overall where the Internet expects everything to be encrypted, which is good for privacy.
And we've scaled our own systems.
We introduced dedicated certificates so you could opt to be on your own cert.
And then as we progressed, we decided it was dedicated, like everybody should really be on their own cert for the sake of it.
One of the other things I wanted to talk about is like, so this is all, when we try to describe the Cloudflare architecture at the highest level, we talk about the control plane where people set up and configure the service, including where we monitor and operate the service.
And then there's the edge, of course, which is actually processing the traffic and, you know, doing things like actually terminating those SSL connections.
So when we come back to, like, we think about what the core looked like, you know, when Cloudflare first started, it was just a simple database, a little bit of PHP logic, and we're off and running.
Like, that's what the code, you know, that's what the SaaS stack looked like in the early 2010s.
And now it is an incredible piece of engineering, you know, with multiple teams that are contributing to a vast, orchestrated, cross-continental system.
What were some of the early breaking points and, like, how did we start to, you know, align on how the core should operate and how, you know, how do we get out of, you know, simple PHP calls into full-blown microservices, multiple databases, clusters, Kubernetes?
Yeah, I think we did it correctly. I think as a startup at that age, you build the simplest thing possible, you move fast, you solve your customers' problems, and as long as your organization stays small, you can have a monolith, you have one repository, and you ship monolithically and have a single build, single set of tests.
As we grew, the bottlenecks just became so clear. Like, first of all, writing code in PHP, just as an engineering culture, became less and less desirable.
Especially for a company that was so good at adopting new stuff, right?
It felt like we were being held back. So we had people writing code at the edge, using Nginx and Lua, and then in the core, Go was really taking off; the company was really interested in it, and people were building services.
But in the end, if you wanted your service, like, your logic, to be accessible at api.cloudflare.com or dash.cloudflare.com (actually, in that day, it was all www.cloudflare.com, I believe),
you had to write code in PHP, or get people who knew PHP to write that code, and probably you needed to get stuff into the same database, the monolithic database that everybody was using.
So we had a very poor separation of, I mean, from my perspective, I remember as I came on board, I was like, wow, these teams have a lot of dependencies between each other, and they're slowing each other down in really sort of, well, it doesn't have to be this way.
Like, we could easily separate this. Easily. Easily. We could imagine, it's easy to imagine in a better world, put it that way.
The solution didn't seem tricky to get to.
Right. Yeah, I think, you know, people use Conway's Law as a pejorative, but it's also to your advantage.
When you're small, you build things that are, you know, you know everything in your head, you code it all together, this calls this, and you can move really fast.
And then as you spread out and you have people in other continents who program in different languages, speak different languages, like the coordination becomes really painful unless you give people a way to separate and move independently and autonomously.
And that's where, not just the software, but also the topology.
Also, you know, one computer or a mirrored set of computers wasn't enough to run Cloudflare's control plane.
So, okay, I have a service. I don't need a whole computer for this one little service that turns, you know, load balancing off or on.
How do I run that thing?
Meanwhile, all the logs coming back from the edge needed compute to sort of churn through them and produce the graphs and analytics that our customers wanted.
And so there was a team, the data team, building what was effectively generalized compute for the purposes of all that big data stuff.
And those of us building control plane services were like, oh, that looks pretty good.
Yeah. You're writing little Go services to churn through the logs at the edge and generate, you know, chomp data.
Can we get some of that? And that's where we started using Marathon and Mesos at Cloudflare.
And that served us really well for a good number of years. For a long time, the control plane was hosted on Marathon and Mesos.
And in fact, the problem was one of rapid adoption, like the number of services that were piling onto this architecture.
And, you know, I remember engineers coming to me going, this thing is breaking at the seams.
And part of me said, you're absolutely right.
And part of me felt like that means we've hit on something correct, because it's usually hard to get people to align on a single solution.
And yet here we had the opposite problem.
It was just, you know, people pounding at the gate effectively to get in.
It's the same problem a product has always had. Like it's a success signal that things start falling over.
And I still, you know, the decision was before my time, but it seemed like still a very smart decision.
They evaluated the different schedulers, job schedulers at the time.
And Mesos was open source and simple.
And at the time, Kubernetes was very complicated. You just really wanted someone else to run that thing for you.
And it wasn't until many years later that the decision was made that Kubernetes was just a rocket ship that we couldn't ignore.
Like people were coming and joining the company. And they're like, I knew Kubernetes where I just came from.
What are you all using here?
And so that was also a big shift. That was very palpable to me just in interviews, you know, if you just talk about, especially senior engineers.
So, tell me a little bit about the architecture: you start describing it and you sort of get the cocked-head look, like, you're not using Kubernetes?
You know it was built for exactly the problem that you guys are trying to solve.
And I think there was a gutsy decision made by core SRE very early in the thing, because at the same time, we're setting up a secondary core data center.
And they said, the secondary will not have Mesos and Marathon in it.
And that was an interesting, you know, that raised a whole bunch of eyebrows in the other direction, because they're like, well, then how much of a secondary is it?
Because the primary has a heck of a lot of Marathon and Mesos in it.
Yeah, I think people still disagree about the decision, but I think it was the right decision to this day.
There was, we were busting at the seams in enough ways that the concept of replicating that was, I couldn't, I don't even know how we could stomach it.
Like, I know, you know, it might have been faster.
Yeah, we would have been able to duplicate what we had faster.
But the number of challenges we were having was so great that duplicating those felt like a step backwards.
It was a step backward. And part of me was, I remember wrestling with this conversation, too, because I had teams of smart people coming to me saying, if the goal is replication, if the goal is redundancy, let's just DD all the drives, dump it, recreate it, and get past this and cross this off the list and just declare.
And, you know, the other half are saying, it is going to set us back months.
Don't make that call, please. Let's lean into the future.
And, and so I'm proud to say that I do think with the, with the power of hindsight, it feels like it was the right call.
So, but it took us a while to get there, Tom.
So talk about that. How did, what was, what did we have to do to get everybody to go from Marathon to Kubernetes?
I mean, the biggest thing I think is, Kubernetes is, not many, maybe I don't know the exact numbers, but there's not a lot of companies who are like, you know, we're going to run Kubernetes, we're going to take our bare metal servers, our inbuilt networking fabric, and we are going to run our own Kubernetes infrastructure, we're going to let teams self serve.
And here's the thing, like, even if you use a cloud provider, use, you know, one of the big cloud providers, we are one product.
Yeah, Cloudflare is a product.
We are not, teams don't go and create their own Kubernetes cluster, because they all have to work together.
Like, we need to build a platform that's, you know, co-resident, that lets these teams interact with each other.
So it is an extremely nontrivial endeavor to do bare metal Kubernetes, shared across a lot of teams, but separated.
So teams have rules and network policies that they can express about how they interact and what they can and can't talk to.
And in a lot of ways, the challenge of, I think, getting to Kubernetes was, was solving the fundamental primitives that we were, we didn't have in Mesos.
And yet we knew what it should feel like when it's done, right?
It should feel just like the public cloud, it should feel just like using, using like an individual engineer shouldn't have to know about all that stuff.
And so that made the bar even higher.
And I think one of the other things that was critical was to recognize that, yeah, you're right, we're not writing Kubernetes, we're not selling Kubernetes, we're not making this available as an external product or anything like that.
But this is, this is engineering in and of itself. And this merits a team of some of our best engineers focused on this problem.
And so we created a team, the service automation team, specifically to attack this.
And not long after that, you came to me, I remember very clearly and said, we need a database team to, you know, and, and so talk about that.
Like, so, you know, after all, we'd been running Postgres in production at scale for, you know, close on seven or eight years at that point.
When is it that you feel, as an engineering organization, like, yeah, we need a team for this, versus, we don't need a team, we can get by with best practices and a lot of goodwill?
We're having that exact conversation now with Redis, right? Like, it's really hard to know when you've crossed over the, like, is, can this plausibly be the full-time job of three people?
Like, because a team, you know, a team isn't one person, can't be one person either.
As they say, like, you're just an abstraction of an individual if you're one of three people.
So, so with databases, as we proliferated these microservices and got them out of that monolithic database, we ended up just sort of recreating a tragedy of the commons with another shared DBMS, had lots of different databases on it, but all it took was one product to be successful.
And then its peers started getting sort of, sort of edged out of the database infrastructure.
And then, then it was all about, okay, how do we slice these databases up and put them in different places and give people the separation that they need?
And I think that's one piece. You need some stewardship over the overall topology of the databases that no one team could offer.
They, I mean, the team that used the database the hardest could complain that they weren't getting what they wanted and the team being squeezed out would complain, but like nobody felt.
And they're both right.
Yeah, they're both right. Yeah. So, so needing some, one entity to sort of be stewards of the overall platform.
The other thing was, was expertise.
So as your, as your product gets successful and you scale, things get slow.
And databases are programmable. They're very sophisticated if you write stored procedures, but even, you know, the SQL language.
So does every team hire a DBA? Does every team hire a SQL expert?
And the answer there is, of course not; you've definitely done something wrong if you're unable to factor out that kind of common logic into a shared team.
So I think, I think you're right. And I think it's very interesting.
I think part of what we're so proud of in Cloudflare's internal engineering culture, which is that we are collaborative and have an incredible amount of empathy, in some ways makes these kinds of problems harder for us.
Because everyone works so hard to be collaborative until it's like, this needs to be someone's full-time job: someone who wakes up in the morning, cares about this for the good of the entire company, the entire team, and takes it over.
And I think, I think, you know, the, the transformative thing for me that happened when we, when we started saying, okay, we have a database team is how many database clusters showed up?
Like all of a sudden things got split up, factored, multiple failovers, multiple nodes running in locks, running into things.
So there was clearly a sort of pent-up demand, of tech debt really, that, you know, the database team took a sledgehammer to once they were constituted and had that as their mission.
And, and, and sort of made it a cookie cutter thing.
Like we solved the problem the same way with each, with each team, the tools that are used, you know, the expertise and failure modes is applicable across everything.
You know, related to this is the problem of dev tools, which also sort of lived in your department for the longest time, and continues to, in fact, through multiple leaders.
So what, what, you know, especially with such a heterogeneous environment, you know, it's so many different kinds of services, things that run at the edge, things that run on the core, Debian packages.
Today, I was even talking about, you know, external plugins that need to happen.
So many different things need to be built. How, what is the, what is the, what is the role of dev tools?
How do we think of what its responsibility is to the organization?
We like to empower teams to move as fast as possible, but I don't want, I don't want 47 teams standing up their own copies of this continuous integration.
So really that seems wasteful. So how do we think through this?
Yeah, I mean, actually the idea that everyone runs their own CI has been, I think some companies do that.
One of the reasons we've always pushed back on that is, as autonomous as Cloudflare is as an engineering organization,
there are certain, I don't know if choke points is the right word, control points,
where if we need to exercise security scanning or other sort of broad compliance things, we really do need a way of doing a thing.
And that's controlled by a team who works with compliance and makes sure that we do it that way.
And who can work with security to make sure we'll just plug in that scanner into the, into the software.
DevTools, just like the database team, started off as a person or a pair of people until eventually it became.
We felt passionate about that stuff.
Yeah. Yeah. And eventually it became a team. I think it's really interesting because it's the team that almost more than any other team is, is their own customer.
Like they, they live and breathe this stuff. They use the stuff for their stuff.
And when we started, we were a full Atlassian shop through and through.
And, and I think a lot of engineers, you know, everyone will complain about tools.
That's just like what a tool is almost for. But whereas I think there's a lot of confidence that some of the Atlassian tools were best of breed or thereabouts.
The CI tool at the time was not. So we did make a move to TeamCity, the JetBrains product for CI, a couple of years back.
And we built a lot of integrations on top of that.
And a lot of engineering around it. It's not like we just turned up the, the CI tool and said, everyone go nuts.
So we sort of like, I love the idea that one of the mantras of the DevTools team is make the easy things easy and the hard things possible.
Right. So that you can always fall back to, you know, getting into the nitty gritty and customizing it.
But even if I look at some of the, you know, part of this series is the latest from product and engineering.
One of the, one of the latest products, internal products coming out of the DevTools team is a new configuration tool.
Specifically to make it super easy to do bringing this conversation all the way back to the beginning of what we talked about.
There used to be a page which would say, oh, this is how easy it is to set up a new service in Marathon.
And that page got longer and longer and more and more steps.
And, you know, and then it shrunk back down when it was Kubernetes, which automated a lot of it for us.
And now it's still, it's still, you know, there's a lot of things to pay attention to.
And I think DevTools thinks through it as, okay, look, hello world should be a push button.
We should be able to get a simple service up and running.
And I was playing with it myself and, sure enough, hello world literally was a push button.
So it was really, really cool to see that sort of focus and thinking through what the experience should be for a brand new engineer.
Cause Cloudflare keeps growing and we keep adding new people.
Yeah. I think it's like chess, like a minute to learn a lifetime to master it.
There are very common things. People are not all experts in make and Docker and building Docker images and where the packages go and et cetera.
And so making that easy is I think a real superpower. But eventually a lot of Cloudflare engineers end up just full stack: every little bit of their software they own and want to optimize.
And so they have more advanced options as well.
And that's great. Like, I think, you know, the model I keep using (it's convenient to have an engineering background, if only because it gives you a lot of extra metaphors to describe things) is the interface versus implementation difference.
Like, we see a lot of our job as engineering leadership as being to sort of define the interfaces that we think are important.
But the implementation can be whatever is appropriate for that team, whatever technologies, whatever platform it makes sense for them to use.
So, yeah, that's another, I think, story that we've shown a lot of success with.
So Tom, another thing that you, you've worked on recently is a massive project to make sure that our services are available in Luxembourg.
And I don't want to do a blow-by-blow, but we can probably dedicate a whole Cloudflare TV series to that project.
I want to ask you sort of the meta question there, which is, so here's something which involves every team, which has a control plane, a footprint of the control plane, which is literally every engineering team in the company.
And we're trying to align them along an ambitious goal, which is to say that we want to be able to prove that in the event of an issue in the primary data center, that everything automatically fails over and goes to the second one.
And some of this is underlying infrastructure that they need to count on things like that database team and that Kubernetes team and the core network.
But a lot of it is their own services and, and double checking all that.
And so how, how did we even approach this problem?
And how did you, you know, I sort of, we sort of nominated you, Tom, you're in charge of this.
How do you try to get 40-odd teams to line up to a very ambitious target?
Yeah, so it's really hard. We, we did, as you say, for the things that were shared, for things that we had platforms for, like a database, we created patterns.
So we're going to, we're going to copy your database in both colos at all times.
We're going to give you a proxy. This is how you talk to the database.
It'll work in either place. Yeah. There's Kubernetes as well. It's in both places.
Some of the shared services, you know, in the end, we have a lot of services at Cloudflare, but there's only a handful that everybody really needs.
You know, there's the actual product. Tiering them to make sure that we understood which were the most critical ones as well.
Right. So the teams I cared about, I spent the most time with, are the teams that everybody depended on.
Yeah.
And make sure that they had an answer. Like if you're using our shared message bus, okay, that's got to be in both places.
Can you use it across the ocean?
Entitlements, anything to do with authentication, like these core, the API gateway itself, all of these things needed to have a story to work in both places before the long tail of teams could even follow along.
We need to paint them a picture of how they can move there.
And I would love to say, you know, click this button and now your service is in both places.
But the reality is a lot of services are all different.
They have different performance characteristics and some services can survive latencies and certain dependencies that others can't.
And so it's just a big, it's a very big project management exercise.
Once we've got the primitives and proxies and patterns in place, then it's a march down the teams and see what they can do.
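One way to picture that march down the teams is as an ordering problem over service dependencies. The sketch below is only an illustration, not how Cloudflare actually planned the work: the service names and the dependency map are hypothetical, loosely based on the shared pieces mentioned above (database proxy, message bus, authentication, entitlements, API gateway). It groups services into waves so that each wave depends only on services already running in both places.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (an assumption for illustration):
# each service maps to the set of services it depends on.
DEPENDS_ON = {
    "database-proxy": set(),
    "message-bus": set(),
    "authentication": {"database-proxy"},
    "entitlements": {"database-proxy", "authentication"},
    "api-gateway": {"authentication", "entitlements"},
    "team-service-a": {"api-gateway", "message-bus"},
    "team-service-b": {"api-gateway", "database-proxy"},
}


def migration_waves(depends_on):
    """Group services into waves; a wave only depends on earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        # Everything whose dependencies are already migrated is ready now.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(migration_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With this made-up map, the shared platform pieces land in the first waves and the long tail of team services comes last, which matches the order described in the conversation. Real services would also need per-service checks, such as latency tolerance, that a dependency graph alone does not capture.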
Yeah. Someday we have to write this up in a blog because it's sort of the nested level, like the multidimensional project management matrix here of how, you know, normally you think of tickets rolling into epics and epics rolling into stories.
Like here, this was that same thing with a whole other layered dimension of, you know, across teams, across products, across tiers of importance and cross dependencies there.
And I think, you know, it's, we still have, as always, we'll always have work to do on resilience, but it's an incredible result of how much is now transparently running in multiple continents and a redundant control plane.
And it doesn't, it's so tricky time.
Like it doesn't end. Even as we're doing this project, new products are appearing.
Like the things that we announced on birthday week, some of those were in development.
I'm trying to get everyone in both places. And so they're like, we got a deadline to hit.
This is the last thing. Yeah, exactly.
And that's, you know, that's, that's how the, that's exactly those kinds of risks that the, the, the business takes.
You know, we only have a couple of minutes left, and one last thing.
I remember you and I talked and you sent me a screenshot of your calendar and you were like, this is a problem.
And it was, it was literally just packed, like from morning to night, five, five and a half days across the week.
And, you know, the, the answer there we both know is, you know, grow.
And thankfully we work in a company that's doing very well. We're so proud of all we've achieved.
It enables us to keep hiring. But, you know, in two minutes, what are some of the key learnings of scaling an engineering team?
Just, just a quick two minutes. So, I mean, I think seconds. At least at Cloudflare and probably at any high growth company, we're all asked to do more than we can do.
And so, so prioritization is your, is your answer. A former colleague of mine used to say, who am I going to disappoint today?
Like you, there are hard choices you have to make.
And I think being ruthless, but polite about what you can do, what you can't do is your only way to, to stay sane.
And, and really is a lot of sort of emotions to making sure you don't beat yourself up over the fact that you can't do every last thing that the world expects of you at work.
And I would say that, that, that prioritization exercise also makes crystal clear where the next headcount goes.
Right. Because it becomes, it brings it into sharp focus. Like this is one, one, another group of people I wouldn't disappoint if I had three more people.
And, you know, this is the kind of leader I need. And so that ruthless prioritization is not just your day-to-day salvation.
It's your path to the future as well. And if you don't include hiring as part of that ruthless prioritization, I know it doesn't pay off the hours you're interviewing somebody.
But a few months from now, that's, that changes everything.
That's right. A couple of hires. All right, Tom, thanks so much for joining me on the latest from product and engineering.
We got to geek out about engineering the whole time.
I always, I always love chatting with you about this stuff.
We got to do this again soon. Thanks. Thanks for having me. All right. Bye, everyone.
See you next week.