Latest from Product and Engineering
Presented by: Jen Taylor, Tom Lianza
Originally aired on October 11, 2021 @ 1:00 PM - 1:30 PM EDT
Join Cloudflare's Head of Product, Jen Taylor and Engineering Director, Tom Lianza for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
English
Product
Transcript (Beta)
Hi, I'm Jen Taylor, Chief Product Officer at Cloudflare, and I am pleased to invite Tom Lianza to join us for Latest from Product and Eng this week.
Tom? Yes, I'm Tom Lianza, Engineering Director at Cloudflare.
I've been here about five years, and I mostly work on our core control plane.
So the brains of Cloudflare, versus the edge that everybody knows and loves.
I started on the product side, a lot of SSL in the early days, but now it's all core and API, all the time.
Tom, what makes our core hard? That's a good question.
One thing I like about the core is that it's actually very easy to reason about.
It's not an unusual set of services or patterns: microservices, running in Docker, running in Kubernetes, behind an API gateway.
I think a lot of companies out there have a similar architecture and similar technologies to ours.
What's hard, I think, is that across the company, as you can tell from our blogs and from talking to our engineers, our skill set needs to be super diverse: good at these traditional microservices architectures and also at worldwide, distributed, low-latency network technologies.
And that's not just in software. It's in hardware, in infrastructure, in ordering and supporting that infrastructure. The core, the control plane, has completely different requirements than the edge, I would say.
So I wouldn't say it's particularly hard in general. But at Cloudflare we have to cover so much ground that if we only had to do the edge, or only the control plane, or only a big data deployment, it might be pretty straightforward. We have to do all of these different architectures to provide the service.
Yeah, to do them concurrently, hyper-local and hyper-global at the same time.
It's every direction.
Right. So when you want us to build a feature in engineering, we almost have to build three different products to deliver a proper product to our customers, if you include the UI, the API, analytics, and then the actual proxying behavior at the edge.
Wow. I've been here a little over three years, and I think about the fact that I run the product team, we've tripled in size, and we're constantly increasing the volume of product we're building.
What are some of the things you do to make it easier for teams to develop on the core, and to scale with the rate of innovation in the products we're building?
One of the things we do all over Cloudflare engineering is use standard, open source tools.
Tools people already know when they come to Cloudflare: they know Docker, they know Kubernetes, they know Go, they know the programming languages.
So we're not inventing everything from scratch.
And then, as you see people ship products, there are some very common patterns.
So we try to spare teams the pain, the toil, of recreating things we know they're going to need, like audit logs.
I think Rajesh was on one of these a while back. There's a whole team devoted to: you're probably going to need audit logs,
so plug into this.
You might want to send an alert to customers, so you can plug into this. We create internal products that every other customer-facing product can hook onto.
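As a rough illustration of that plug-in idea (the names and shape here are hypothetical, not Cloudflare's internal API), a shared audit-log service might expose one place for any product team to record events, so each team doesn't rebuild its own trail:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List

@dataclass
class AuditEvent:
    actor: str      # who performed the action
    action: str     # e.g. "zone.create"
    resource: str   # what was acted on
    when: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

class AuditLog:
    """A shared sink that any product team can plug into."""
    def __init__(self) -> None:
        self._sinks: List[Callable[[AuditEvent], None]] = []

    def subscribe(self, sink: Callable[[AuditEvent], None]) -> None:
        self._sinks.append(sink)

    def record(self, event: AuditEvent) -> None:
        # Fan the event out to every registered consumer (storage, alerting, ...).
        for sink in self._sinks:
            sink(event)

# A product team emits events instead of building its own audit trail.
log = AuditLog()
seen: List[AuditEvent] = []
log.subscribe(seen.append)
log.record(AuditEvent(actor="user@example.com", action="zone.create", resource="example.com"))
```

The value of the pattern is that storage, retention, and display of audit events live in one service, and product teams only ever call `record`.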
There's miles to go on that front, but that's certainly part of the philosophy.
Yeah, the other thing we do, and it's almost a challenge, is we try to dogfood our own products.
Over the years we've increasingly been building tools for developers, not just classic operations but application development, and we try to get our engineers to use the tools we build to build parts of our product.
So the same challenges we have with customer adoption, wanting to put some stuff at the edge and maybe keep some stuff on-prem for now,
our engineers are going through internally, and they give very direct feedback to their fellow peers on those platforms.
Yeah, that's one of the fun things about working at a company with such a strong culture of dogfooding our own things:
the fact that we build so much of our own stuff on our own stuff.
Leveraging standards, as you said, but yeah. One of the things I feel like I've seen in my time here, in the work we've done together, is the centralization of some of these services.
Effectively building a platform that ends up making it, not quite plug and play, but easier and more efficient to get things up and running.
Yeah, that's right. By using standard platforms like Kubernetes and Postgres, open source databases and open source orchestration systems,
Prometheus for alerting, open source monitoring tools people are familiar with, people can get ramped up pretty quickly.
And then the challenge is: a lot of companies know how to do that in one geographic location,
or one availability zone. If you're using a cloud provider, maybe even two availability zones. But how do you do it globally?
How do you do it across the earth? Yeah, we can proxy traffic across the earth,
no problem. But how do we manage state across the world? That is an unending challenge. We're making big strides, and it's taking years.
Yeah. What are some of the things we're doing to tackle that, or to refine our thinking on that?
So, as some of our customers probably know, a lot of our control plane runs out of North America, and we have a secondary control plane site in Europe. Like many large infrastructure providers, we test it for failovers.
We have our data backed up and replicated there at all times, but we've never run all of Cloudflare out of that site.
Soup to nuts: sign up, login, pay, DNS, all the core functionality. After considerable effort to get comfortable that we are fully replicating the systems and that the secondary site is ready,
we're ready to prove that, to our customers and to ourselves.
And that's what we're going to be doing starting next week, on a national holiday.
That's right. Birthday resilience.
Birthday resilience: Cloudflare takes a trip to Europe.
As you've been preparing for this, what are some of the things the team has had to do, or had to think about differently than before?
The big thing that's really hard to get people to understand internally, though I think a lot of people who build apps understand it, regular old web apps, which is my background, is state.
Our edge has this beautiful property: it doesn't have state that it can't afford to lose.
It proxies traffic. It knows what to do because the core, the control plane, tells it what to do.
We push that data out, and if for whatever reason a machine crashes or dies, we can bring it down.
It's fine. There are plenty of others. Bring it back up, and it'll get its state.
It's not the source of truth for that configuration; it receives it.
And that's true of all of the edge colos, and the same is true of the cache.
On a cache miss, we'll go back to the origin and get the item. We'll go through Argo if we have to, whatever it takes.
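That cache behavior, which is what makes edge state safe to lose, can be sketched in a few lines (a simplification, of course, relative to the real cache):

```python
def fetch(url, cache, origin_fetch):
    """Serve from the edge cache when we can; on a miss, go back to the
    origin and repopulate. Losing a cached object is never fatal."""
    if url in cache:
        return cache[url], "HIT"
    body = origin_fetch(url)  # whatever it takes: back to the source of truth
    cache[url] = body
    return body, "MISS"

cache = {}
body, status = fetch("/logo.png", cache, lambda u: b"image-bytes")
# A second request is served from the edge without touching the origin.
body2, status2 = fetch("/logo.png", cache, lambda u: b"image-bytes")
```

The key property is that the cache is always reconstructible: dropping any entry only costs one extra origin round trip.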
But we're not worried about losing an image worldwide; we have a big infrastructure for that. When a customer signs up
and enters their email and hits save, though, we cannot lose that. It is stored somewhere, and it needs to be durable across all manner of disaster scenarios. That's the big difference, and working around that,
Making that reliable in the event of a disk failure, a rack failure, or a colo failure:
we've been working on power failures, all sorts of failure modes, so that we don't lose critical data.
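One common way to get that durability (a sketch of the general technique, not a description of Cloudflare's actual storage layer) is to acknowledge a write only after a quorum of independent replicas has stored it:

```python
def durable_write(key, value, replicas, quorum=2):
    """Write to every replica; only report success once `quorum` of them
    hold the data, so one disk/rack/colo failure can't lose the write."""
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value
            acks += 1
        except OSError:
            continue  # that replica is down; keep trying the others
    if acks < quorum:
        raise RuntimeError(f"write not durable: only {acks} acks")
    return acks

replicas = [{}, {}, {}]  # stand-ins for independent storage nodes
acks = durable_write("signup:email", "user@example.com", replicas)
```

With a quorum of two across three nodes in different failure domains, any single failure leaves at least one surviving copy of the customer's data.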
But in the event of a true disaster, sort of a natural disaster,
we have to make sure we handle it with as little data loss as possible.
So a customer signs up, hits save, sends their email address to us, and a horrible earthquake occurs, something really terrible.
Where did that email go? How much data would we have lost, if something horrible happened in an instant?
Thinking about that: for most people, even availability zones aren't enough.
If it's a real disaster, multiple availability zones can go sideways.
So that's where people start going multi-region, replicating data across regions, and that's what we're working on now.
That's what we've been doing: making sure that, yes, in the event of a natural disaster,
there's some data loss, there's an instant where part of the world is in trouble,
but we are not far behind in some other region of the world, ready to pick up and go from there.
Yeah, yeah, always, always on.
It is interesting to think about, right? The architecture of the edge,
the notion of Anycast, is this inherently durable, no-worries-I've-got-it kind of thing.
But the notion of state and replication at the core is much more complicated.
Yeah, it's easy to take for granted. And in fact, disasters don't happen very often.
There are many, many people using cloud providers, and you know where those providers sit in the United States.
We're all very lucky that those parts of the country haven't had big disasters. But if they did, a lot of the Internet would be in trouble if people aren't doing disaster planning.
As you and the team prepared for this work, dug in, and really made this investment in resiliency, where are the things you looked to for inspiration?
Are there lessons you learned to look to that helped inform our approach?
You know, if we were to do it by the book, it's kind of like we skipped a few chapters and went straight to the hard part.
Instead of having a couple of regions nearby,
so we could have very fast replication and tolerate the loss of a single facility, we've jumped straight to worldwide: from North America to Europe, which is quite far.
So early on, when we knew we wanted to jump straight to the hard part, straight to the end, from the beginning,
a lot of the key things we needed to do were internal, process-oriented. As long as people treat the secondary site as secondary,
it's going to continue to be a slog getting people to think about it,
to think about the disaster that may never come.
Obviously we were going to exercise this, but we didn't want to treat it as a chore.
We wanted to flip the script and make it feel like an opportunity.
Here's a complete facility that you could be using all the time, right now.
You could spread the load, use those CPUs. Our friends on the Stream team could be compressing videos there, all those sorts of things.
We have computers around the world that you can use. So the main focus at the beginning of this year was: let's be using both, primary and secondary, at all times.
So earlier this year, we adjusted our API so that if you are closer to Europe, when you go to api.cloudflare.com you're hitting our European control plane.
And if you're closer to North America, you're hitting our North American control plane, using dynamic steering, a feature of our own load balancing product.
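Conceptually, dynamic steering picks whichever pool has the best measured latency from the visitor's vantage point. A toy version of that decision (the pool names and RTT numbers here are made up for illustration, not real measurements):

```python
def steer(client_region, pools, default="north-america"):
    """Return the name of the pool with the lowest measured round-trip
    time from the client's region; fall back to a default pool."""
    measured = {
        name: rtts[client_region]
        for name, rtts in pools.items()
        if client_region in rtts
    }
    if not measured:
        return default
    return min(measured, key=measured.get)

# Hypothetical RTT measurements (ms) from two client regions to two pools.
pools = {
    "north-america": {"us-east": 20, "eu-west": 110},
    "europe":        {"us-east": 105, "eu-west": 15},
}
```

In the real product the measurements come from continuous health probes, but the routing decision reduces to this comparison.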
From there, the teams at Cloudflare who build the control plane have the ability to start taking advantage
of both facilities. And maybe they're not ready,
so they still send things from Europe back to North America, and that's okay for now.
But our cache team, for example, just recently really leaned into this and is running in both places all the time.
Which is really cool, and a perfect scenario for leveraging both primary and secondary at the same time.
Yeah, and a number of teams are getting closer and closer to being able to do that.
Well, if I think about purging of the cache, that's kind of the quintessential use case.
It's also so paramount to Cloudflare and what we do, if you just think about the volume of data we store in the cache
and the rate at which we're constantly refreshing and handling it. Yeah, if you look at traffic to api.cloudflare.com, there is no endpoint that takes as many requests as cache purge. It is certainly the heavy hitter, and having it be able to spread across colos across the world is a really great step forward for us.
Yeah. What are some of the things we've learned along the way?
Oh boy. We learned, and you know this intuitively, that Europe and North America are far away.
You can see it on a globe. You can put up a website in Europe and visit it from North America, and it's pretty fast, good enough, a couple hundred milliseconds extra, I barely notice. But when you write an application that talks to a database, and that database now moves across the world,
everything explodes. I know it sounds obvious, but a lot of people take for granted local, atomic, read-after-write semantics, beautiful full SQL
semantics. Starting to steer away from that, because if you want to be really reliable you have to give some of that up, has been a big learning.
And I wish I could say it's obvious, but it's not until people who have built software really see it. They're like, wow, I didn't even realize this one API call meant ten database queries, because it never mattered when the database was super fast. Then those queries go across the ocean, and it's a wake-up call.
We've got to be really thoughtful about data access patterns.
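The back-of-the-envelope arithmetic shows why: when queries are issued sequentially, total wait time is simply round trips times round-trip time, so moving the database across an ocean multiplies the cost of every chatty API call.

```python
def db_wait_ms(round_trips, rtt_ms):
    """Time an API call spends waiting on the database when queries are
    issued one after another (the classic N+1 access pattern)."""
    return round_trips * rtt_ms

# Same code, same ten queries; only the distance to the database changes.
local = db_wait_ms(10, 1)           # database in the same rack: ~10 ms total
transatlantic = db_wait_ms(10, 80)  # database across the ocean: ~800 ms total
```

The 1 ms and 80 ms figures are illustrative, but the lesson holds: an access pattern that was invisible locally becomes an 80x wake-up call across the ocean, which is why batching and rethinking data access patterns matter.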
And I think with serverless, with Workers, with all the different evolving paradigms for building applications,
everyone's got to get more mindful of that. The days of having a server with a database right next to it
are increasingly going away, because the reliability and uptime people expect keep increasing. Yeah, it really is.
And it's just kind of quintessential Cloudflare to me.
Right? We see this challenge, and of course we go to the hardest part of the challenge
first. We tackle that first. And as we're doing it, we bump into constraints that push up against the natural laws of physics, and we believe that's a false dichotomy.
So we continue to innovate, to push and push and push. It's very quintessentially Cloudflare to me.
We also started, like every startup, as a monolith: one app, one code base, one repo. And I would still advise any startup to do that as long as you can.
Don't go breaking it up
until you really, really have to. And we did have to. But those were good years while we got away with it.
Yeah. So you talk about breaking up the monolith and some of the strategies we've taken with our own services. I know we continue to make a lot of improvements, even with our own API architecture.
Where are we at with that journey, and how are we thinking about it?
Yeah, so there have been some recent changes there as well.
Over the recent several years,
most of the new Cloudflare features you see built aren't part of the monolith, aren't part of our original core monolith.
But it's still very much in use. Things have broken off, but there's still that core thing.
I always think the users part will be the last to go.
So: new things are not in the monolith, and we're slowly pulling the monolith apart. But what we did in the last two weeks was move the monolith off the bare metal physical servers, a cluster of them, that it's been running on for many years. The API moved into Kubernetes only a couple of weeks ago, and without disruption.
The team did a really, really good job there. Deployments, management,
and observability all improved dramatically.
We try to work toward a fleet of cattle, not pets, like most companies, and we have a lot of cattle. But this was one of the pets that's still in the process of converting. And one of the oldest pets: the codebase is eight years old.
Yeah. When you promote that pet to cattle, and it's so core and so significant,
what are some of the things the team had to do, or keep in mind, with that move in particular?
I think with any change you make, the biggest risk is making it all at once.
Sometimes redundancy can be an illusion.
You're like, well, we have five of them, so we can afford to lose one. Maybe that's true, but not if you're making a change to all five at once.
They might as well be one thing. So what the API team did, and it's the same as what the cache team did in moving active-active globally,
is start with a small percentage of traffic. They might start with 1%: as requests come to api.cloudflare.com, send one in 100 to the new Kubernetes version and the other 99 to the bare metal.
As you grow confidence, watching your errors and your performance characteristics, you increase that number until you're just not sending traffic to the old one anymore.
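A gradual cutover like that is often implemented by hashing a stable request attribute into 100 buckets, so routing is deterministic and the canary percentage can simply be turned up over time. A sketch of the general technique (not Cloudflare's actual production router):

```python
import hashlib

def route(request_id: str, canary_percent: int) -> str:
    """Send `canary_percent`% of requests to the new deployment,
    deterministically, based on a hash of a stable request attribute."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "kubernetes" if bucket < canary_percent else "bare-metal"

# At 1%, almost everything still goes to bare metal, and the same request
# always lands in the same bucket, which makes debugging sane.
targets = [route(f"req-{i}", 1) for i in range(1000)]
```

Because the hash is stable, raising the percentage only moves new buckets over; requests that already went to the new deployment keep going there.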
And that's how we do all rollouts of new software to the edge as well.
It's gradual: monitor as it goes more global, so that by the time it's really worldwide,
it's had plenty of eyes on it.
Yeah. The image I have in my mind, again just the metaphor of the pets and the cattle, is releasing an animal from a rescue center into the bay or something, and monitoring its progress to make sure it will be healthy and thrive.
I wish I had that beautiful metaphor.
I keep thinking: you take such care of a pet, you love it, and cattle aren't treated so well.
They're not unique, and they're treated poorly. So our aspiration toward cattle is almost... yeah.
You don't want to have to care for your servers that way. This is why you need product managers in the world.
Right. Yeah. To enhance the metaphors.
Nice. The API
has really come a long way.
And I think a lot of companies can relate: even if we haven't been able to fully deconstruct the code, deconstructing its deployment, its topology, its scaling properties
is a nice step, a big step, for us. It's part and parcel of our move forward.
Flexibility, scalability, resilience, all of that touches in there.
So, yeah. Yeah, and we're still shipping things, which is exciting.
Yes. Yeah. What is your favorite thing that we shipped recently? It's like asking me which one of my children is my favorite; it's hard to pick just one. You have to answer.
Yeah, exactly. It's whoever didn't bug me for screen time three minutes ago.
Um, you know, one of the things we've done over the past couple of weeks that I'm really excited about is actually not something brand spanking new.
In fact, it is basically as old as the web itself.
But part of what I love about it is that we took something that was complicated and only feasible in code, and we made it possible and easy. That's a huge part of why I love doing product at a place like Cloudflare: we have such a focus on taking complicated things and making them easy.
And that brings me to the new feature here: URL rewriting. It is not rocket science, right?
Honestly, this is translating what a user puts in a browser into the right structure to match how you might store things in your own systems.
A super basic feature, but today,
or historically, the way we've supported this is to say: that's great,
you should write a Worker to do that. And Workers are great and powerful and easy to use, they're performant, you can use JavaScript, it's great.
But it's not clicking buttons. So what we did is make this button-clickable: we made it possible for people to create URL rewriting rules using effectively the same intuitive UI we use for our firewall rules.
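The underlying idea, a pattern plus a substitution, is easy to sketch. This illustrates rewrite logic in general, not Cloudflare's actual rule engine, and the example rule is hypothetical:

```python
import re

def rewrite(path, rules):
    """Apply the first matching (pattern, replacement) rule to a URL path;
    return the path unchanged if nothing matches."""
    for pattern, replacement in rules:
        new_path, count = re.subn(pattern, replacement, path)
        if count:
            return new_path
    return path

# Map the pretty URL a visitor types to the structure the origin expects.
rules = [(r"^/products/(\d+)$", r"/catalog/item?id=\1")]
```

The point of the product work is that customers express exactly this kind of rule by clicking through a UI instead of writing and deploying code.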
Again, taking complicated, heavy things you would have done in code and making them easy. That's what I'm pretty fired up about.
When I interviewed at Cloudflare, I almost didn't interview, because I liked apps, more consumer things; that was my thing, and infrastructure was something I just really wanted to take for granted. But what really drew me to Cloudflare was that, as I was building things at previous startups, I used Cloudflare, because it made the hard stuff easy for me. And I still think we have a world-class UI for really complicated things that we should be really proud of.
And yeah, you're right. Rewriting
is a perfect example. I feel like I've written so many routers in my apps. There's any number of ways to do that, templating and all this stuff.
Yeah, that's cool.
Well, I think the other thing we continue to enhance a great deal is our analytics, right?
One of the things I've talked about a lot in these Cloudflare TV sessions on Latest from Product and Eng
is the journey we've been on over the past year to fundamentally transform the way we present information to customers in the user interface, so they can better understand what is happening with their systems.
Right. I mean, so much to your point,
so much of this is infrastructure. If you just looked at what was happening above the surface,
you'd see only the tip of the iceberg and miss the mass of what's happening underneath. I often think of analytics as the thing that shines a light on that, that really helps customers understand, in a scalable, easy-to-use way, what's happening with their traffic and what's happening with their security rules.
So, again, we started with the transformation and the move to the GraphQL platform, moving ourselves to a place where we're really leveraging and building on top of reusable components,
so investments we make there bear fruit across the entire UI. And then this week,
we took many of those same enhancements to our network analytics, an area where we've just started digging in to support our newer products, Magic Transit and Spectrum, bringing in the ability to show the top N, to filter, and to support TCP flags.
Again, we're taking the goodness and continuing to spread it across the UI.
Yeah, that's another thing,
a whole class of things that was shocking to me when I first started here.
I felt like Cloudflare was doing so much for people, and we were telling them so little about what we were doing. And top N:
there's the construct of top N, and when I look at the sites I personally use behind Cloudflare, that's a big chunk of it.
Where's all the bandwidth coming from? What are the top N requesters, the top N things being blocked, the top N threats? It's such a nice shorthand for consuming, for at least expanding the tip of the iceberg,
so you can see a little more of what's going on. Well, and to your point, I feel like we've been on this journey from a sort of trust-us posture.
We've got it. Don't worry about it.
To being able to actually say: we have it,
we understand it, and let us show you what's going on, because there may be pieces here that you also want to drill into. And I like the way we built the top N, because, back to what we were talking about with URL rewriting, I don't want to have to go in there and write a bunch of filters. Part of what we've done with our components
is build them such that you can use point-and-click logic to filter your analytics view, to refine and drill into specific aspects of your top N or whatever.
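Under the hood, a top-N view with point-and-click filters reduces to filtering events and counting, though at Cloudflare's scale this runs through heavy pre-aggregation rather than a flat scan. A toy version of the shape of the query (field names are illustrative):

```python
from collections import Counter

def top_n(events, key, n=5, **filters):
    """Keep events matching every attribute filter, then count the
    most common values of `key` (e.g. the top blocked IPs)."""
    rows = [e for e in events if all(e.get(k) == v for k, v in filters.items())]
    return Counter(r[key] for r in rows).most_common(n)

events = [
    {"ip": "1.1.1.1", "action": "block"},
    {"ip": "1.1.1.1", "action": "block"},
    {"ip": "2.2.2.2", "action": "block"},
    {"ip": "3.3.3.3", "action": "allow"},
]
```

Each click in the UI just adds another keyword filter before the count, which is why the drill-down feels instant to compose.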
And that whole part of the
UI is responsive and really dynamic, which is great.
I'm impressed every time I look at these analytics announcements.
I'm actually shocked that we're building all this stuff.
At a smaller company, I'd be like, why don't we just give them the data and let them use their favorite tool? But our tools really are good, really purpose-built for the scenarios we're trying to help people understand. I'm sure there's value in exporting it all for the real data mavens, but I love the tools.
Yeah, well, I do too. And it comes back to what you were talking about at the beginning: the power of the core, and the investments we're making in centralized platforms,
and using that as a way to make these things easy for teams to pull off the shelf and build on top of. Yes, it's funny: as simple as they may look, you drag this top N and that, there's so much data we have to chomp through, slice up, aggregate, and summarize for you to have that nice view, and to get it quickly.
Yeah, it's very easy to take for granted.
But it's a whole lot of infrastructure required to deliver the solution.
Again, it's taking complicated things and making them easy. It's one of my absolute favorite things.
So, yeah, that's fantastic. Yeah.
Um, well, Tom, I know you have a big week ahead of you, with your national holiday.
You're going to get a little bit of a respite here in the US, at least, and hopefully a little time over the long weekend with your family. I just wanted to thank you for joining us on Latest from Product and Eng.
I really appreciate all the work we're doing to support resiliency and stability for our customers.
And I'm really happy to get a chance to talk about it.
And I look forward to celebrating with you on the 10th. Thank you. Yeah, you know, I think we have a good history of being open about our incidents and what we're doing to make things better.
And this is just a very direct
response to that. We mean it: when we have problems, we want to make things better. We invest in the infrastructure, and maybe it's not glamorous, and usually not visible,
but it's a big point of emphasis in Cloudflare engineering to make this stuff better.
Well, and I'm excited to read the blog post. I know that will be a big part of sharing our learnings and our experience with the community, and continuing to iterate and grow from there.
Absolutely. Fantastic. Have a great weekend. Good luck with everything.
Thank you. You too. Right.