Latest from Product and Engineering
Presented by: Jen Taylor, Usman Muzaffar, Dimitris Antonellis, Brian Batraski
Originally aired on August 5, 2022 @ 1:00 PM - 1:30 PM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
All right. Hello, everyone. And welcome to another episode of the latest from Product and Engineering.
I'm Usman Muzaffar, Cloudflare Head of Engineering. Jen, nice to see you.
Nice to see you. I'm Jen Taylor, Chief Product Officer here with my partner in crime, Usman, and I call them the Beatles of Cloudflare.
Yes, we do call them the Beatles.
Actually, before we introduce John, Paul, Ringo and George here, why do you call them the Beatles, Jen?
Because they keep shipping the hits. Very consistently, every quarter, this team is shipping out great customer-facing features that move the needle in terms of adoption and use cases for the customer.
And I bet everybody who's listening is wondering what team we're talking about.
Who are we talking about?
Who are we talking about? Who are you guys and what do you do?
I will say their names and they can introduce themselves. Brian and Dimitris are Product Manager and Engineering Manager respectively for Load Balancing and the associated products, Health Checks and Waiting Room.
Guys, thank you so much for joining us this afternoon.
Brian, why don't you just introduce yourself, say hi, what you do and how long you've been at Cloudflare?
Yeah.
Hello, everybody. Coming in from sunny California. My name is Brian Batraski.
I've been at Cloudflare for almost two years now and run the Load Balancing, Health Checks and Waiting Room products and love it every day.
Nice. And I'm Dimitris.
I'm the Engineering Manager for the Load Balancing team. I work closely with Brian and yeah, I joined the team six years ago.
Six years? No, I was just thinking that.
Cloudflare was a small startup at that time. It's been really fun and yeah, happy to be here.
Okay. So, you know, one of the things I'm thinking about is you guys say you're the Load Balancing team, but then you're like, we do Load Balancing and Health Checks and Waiting Rooms.
Like, why do we put all these products in one family?
How do you guys, how can you guys scalably manage three significant products at the same time?
Why? Why are all these products together? It's a great question.
You know, at the end of the day, it comes down to our customers, right?
They have a number of challenges they go through every day and it takes a lot of time, effort, coordination and a lot of cost to kind of solve all these challenges in terms of making your application available, resilient, performant, having a pulse and understanding of what's going on in your app at any given time.
And so we want to help customers focus on their table stakes, on the customers, on the products and their services to really ship amazing stellar experiences to their end users.
Whereas we as Cloudflare will help abstract those problems and challenges away and allow them to lower that maintenance cost, lower that coordination and be able to get just focus on what's important to them.
Great. Is there a lot of shared technology here? There is, there is, especially in the load balancing and health checks.
A big component, the health checking mechanism and aggregation, it's quite common.
On the Waiting Room side, it's kind of not, we don't really have any good overlap.
And the reason is, we can discuss more later, but we utilize newer technologies like Cloudflare Workers, Durable Objects, and we got a lot of things for free basically.
So we don't have to implement all these things on ourselves.
One thing to add is, you know, one of the reasons why they all kind of go together is because they all, at the end of the day, are different sides of the same coin of how do we manage traffic and how do we view our traffic and protect traffic?
And so that's why it kind of falls into our lap to make sure that we handle all those in a nice way and an integrated way across our platform.
Yeah. So actually this, and this is a perfect segue into some of the things we wanted to talk about today.
So, you know, the picture we draw of Cloudflare when we're trying to introduce Cloudflare to new employees or people who are just hearing about it for the first time, is that of a classic reverse proxy.
There's something that sits in front of their origin. It receives those requests from around the world and, you know, caches them and accelerates them.
But ultimately, if it doesn't have the answer, the reverse proxy is a proxy of something and it's got to go back to origin.
And in all the sort of the simplified diagrams, there's always one, there's one origin box there.
And the trick is that, you know, the lines are going from the big orange cloud back to this singular origin.
But of course, big customers don't have one origin. They got a whole bunch of origins and they spread them out over the place.
And so part of what load balancing is about is exactly this challenge of, okay, how do we know which origin to pick?
And sort of, Brian, tell us the story. Like what was the hello world here of like what we had to do?
And then like lead that into some of the more interesting things that was exactly what you just said, which is steering.
Like where does this get more interesting and how should Cloudflare's edge decide which of potentially multiple origins it should send the traffic to?
Yeah. Yeah. Great, great question.
So, you know, we don't live in a world anymore where you can have a fixed number of servers in a single location and still be able to scale and meet the demands of your customers as you grow your company to reach customers in different areas of the world.
And so what is really most important to help businesses set up for success is to be where their customers are.
Now, if you have customers coming from London, you have customers interacting with your app in Tokyo, be in London, be in Tokyo.
And so with that, you're going to, to your point, have multiple servers located across the world to very quickly and reliably meet those requests.
But, you know, sometimes that those origins can be overloaded if you have too much traffic coming from one particular point in the world.
And so instead of adding more CPU capacity and more servers, which are very expensive, and again bring high maintenance cost, coordination, more people on your teams and overhead, we want to empower folks to leverage their existing infrastructure in a much more intelligent way.
And so that's what load balancing is for: to intelligently steer your traffic away from unhealthy origins, using our active health checking system, and then steer that traffic to origins and servers that are available, performant, and can actually satisfy the needs of those requests from your end users.
And so it is, in all in all, an availability and reliability solution that we provide customers.
Exactly. And let me just, let me pause there because you already said one of the magic words, which is health check.
So Dimitris, how does Cloudflare's edge know what's healthy and what's not? I mean, like, are we literally taking the temperature of all these origins simultaneously and trying to figure out who can say, yeah, I can take traffic or no, I can't?
Yeah. So this is, as I said, this comes from the health checking mechanism that we have.
We are doing some really intelligent things there. We have all these data centers, but we don't want to send health checks from every single one of them.
So what we do, we pick a subset of them, we aggregate, we let the other data centers know of the health.
And yeah, we utilize that health information when the request comes, the eyeball request comes, and we pick the right server.
The other tricky part, again, there are a lot of optimizations that are happening, but imagine you have a health check that happens every minute and you run the health check and right after a second, the origin goes down.
We don't freak out the moment we see the request.
Don't freak out. Don't freak out. Good life advice, good system architecture advice.
Even if the health check agent told us the origin was healthy one second ago, we try our best, we go to the origin, we see the origin is down and we fail over to the next one.
These are some of the minor things we are doing, but they improve the availability of these services a lot.
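Editor's sketch: the "don't freak out" behavior described above can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not Cloudflare's implementation. The names `send_with_failover`, `OriginDown`, `fake_send`, and the `cached_health` dictionary are all hypothetical, and the origin failure is simulated rather than detected over the network.

```python
from typing import Callable, Iterable, Optional


class OriginDown(Exception):
    """Raised by the (hypothetical) send function when an origin cannot be reached."""


def send_with_failover(
    origins: Iterable[str],
    cached_health: dict[str, bool],
    send: Callable[[str], str],
) -> Optional[str]:
    """Try origins in priority order, skipping ones the last health check marked down.

    A failure on an origin that was reported healthy is not fatal: the request
    simply moves on to the next candidate.
    """
    for origin in origins:
        if not cached_health.get(origin, False):
            continue  # the last health check said this origin is unhealthy
        try:
            return send(origin)
        except OriginDown:
            # The cached health was stale; remember the failure and fail over.
            cached_health[origin] = False
    return None  # every candidate was unhealthy or failed


def fake_send(origin: str) -> str:
    """Simulates origin-a going down right after a passing health check."""
    if origin == "origin-a":
        raise OriginDown(origin)
    return f"200 OK from {origin}"
```

Calling `send_with_failover(["origin-a", "origin-b"], {"origin-a": True, "origin-b": True}, fake_send)` tries origin-a first because its cached health is still green, then falls through to origin-b when the attempt fails.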
Now, one of the things that I think is great about what you talk about in terms of health checks is that it gives us signal to help us with some of our steering.
But one of the things that you have done a phenomenal job with is providing capabilities to customers to enable them to basically configure their load balancing using a variety of different signals.
Can you talk to me a little bit about some of those mechanisms and how we decided to tackle some and what's driving the vision there?
Yeah, I think I can take that one. We have those health checks that are constantly pulsing those origins to see are they healthy, are they unhealthy?
And then we take that feedback and determine where we steer that traffic, so we don't have errors served to end users and we make sure they have a good and wonderful experience.
And before we've had geo-steering, so if you have requests coming from different segments of the world, you can map it to a particular data center near that location, make sure it's going to the right location.
And again, being available, performant, localized.
There's dynamic steering, right? It picks the path of lowest latency, the fastest, most performant path and pool to send your traffic to.
But that wasn't enough. We wanted to continue to add more intelligent methods to satisfy the needs of our customers.
And again, provide that really stellar experience.
And so we introduced recently a new steering capability called proximity steering.
Proximity steering is where you don't necessarily have requests coming from different locations in the world that want to map it to a particular data center.
And you may not want to take the path of lowest latency specifically.
As a customer, you may want to send your requests to literally the closest physical location among your data centers, relative to where those requests come into a Cloudflare POP.
And so we actually take in the latitude and longitude information of customer data centers to make sure that when a request comes into our Cloudflare POP, we say, okay, where is the closest physical location of our customer's data center?
And let's send that traffic over there. And that helps keep requests very, very performant and also help again with that localization to make sure requests are going to the right areas in the world.
That's amazing.
So literally in the control plane, we've... And actually we're not talking about the control plane, but the control plane for load balancing is no joke.
There's a lot of buttons and switches here to let customers have the flexibility to describe where are their origins?
How are their pools? How do you want to steer between them?
But you're telling me we literally let them enter latitude and longitude coordinates, like a push pin on a map.
And that's not just to draw pretty pictures.
Dimitris, your code is actually going to use that, figure out the geography of planet earth and where we should be sending those and how this is actually going to affect steering.
Do I have that right? Exactly. Yeah. So they give us the GPS location more or less.
And then we take a look on the client IP. We figured out where they're located or where the data center that they hit is located.
We do some basic math and yeah, we pick the closest location.
And I think Brian, the design team have done a really good job on the UI, where you can literally see a map and drop a pin with the location of the data center.
And we calculate under the hood the GPS location.
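Editor's sketch: the "basic math" for picking the closest location can be illustrated with the haversine great-circle distance. This is a sketch under stated assumptions, not the code Cloudflare runs: the transcript does not say which distance formula is used, and the function names, pool names, and coordinates below are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_pool(
    pop: tuple[float, float], pools: dict[str, tuple[float, float]]
) -> str:
    """Return the name of the pool whose coordinates are nearest to the POP."""
    return min(pools, key=lambda name: haversine_km(*pop, *pools[name]))


# Hypothetical customer pools, each pinned with (latitude, longitude).
POOLS = {
    "us-east": (39.04, -77.49),
    "eu-west": (51.51, -0.13),
    "ap-northeast": (35.68, 139.69),
}
```

With these pins, a request arriving at a POP near Paris, `closest_pool((48.86, 2.35), POOLS)`, would be steered to the `eu-west` pool.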
You know there's one chance to have a base on Mars by the end of the next decade.
So tell me you have left enough data width to have a planet field in there as well, because we're going to need to geosteer across the solar system before long here, guys.
That's awesome. It's such a cool thing. So that's proximity steering.
Talk to me another thing about like back to that control plane when we talk about making sure that customers know what's going on.
Maybe a big enough company, you got enough people messing with the pool configurations and the enablement, disablement.
If I'm one of a bunch of different engineers at a shop that's managing multiple origins, and one of my colleagues, as part of their routine maintenance for their system, disabled something,
how am I supposed to know? What are some of the tools that we built there to let people inside an organization be aware of this stuff?
Yeah, absolutely. So before we had notifications that customers could configure to send an array of different emails to notify about our health events.
We didn't have the concept yet of if there are mission critical data centers and pools that are turned off, that could be catastrophic to a business if it's done accidentally or at the wrong time or not properly.
And so we heard the calls from our customers and we want to build what is absolutely necessary.
And so we heard that they need capabilities and granular functionality to be able to say, hey, if my US East data center is turned off, I need to let a number of different teams know by email, but that's not enough.
We need to be able to allow custom scripts and to leverage webhooks to be coordinated so that we can feed that data into existing systems and not have to redo all the different workflows that customers use today.
And it's really expensive to have to reaffirm and reset processes that companies have been used to for months and years.
On top of that, when we work with engineering teams, PagerDuty is one of our lifelines.
And so being able to have a deep integration that can be very easily and quickly set up to where if that US East data center is turned off, that we get alerts sent to the proper teams across the company to make sure that the right folks are paying attention to it, can take quick action.
And again, make sure that customers are not inundated by human error.
So another feature you guys worked on recently is load shedding.
So this is the load balancing team working on steering and shedding.
So let's help orient the audience here.
What's the difference between load shedding and load balancing?
What is the subtlety that we're trying to address there? And how does that relate to steering?
Yeah, this is super exciting. I cannot wait to tell you about this.
So with load balancing, we are intelligently providing capabilities to steer traffic away from unhealthy origins, from unhealthy data centers, and make sure that errors served to customers are mitigated and brought down to zero, so that we can serve the customer's needs and make sure that they have what they need.
But as we see growing populations, there's more traffic on the Internet than we've ever seen before.
There are times in which a particular data center may not be able to handle all the traffic in the area that it's situated.
For example, take a US East data center in Ashburn, Virginia. The East Coast has twice the population of the West Coast, over 100 million people.
And so US East is likely not going to be able to handle the amount of traffic that's going to come from that portion of the world versus US West.
And so there are times where we need to proactively shed, or get rid of, or redirect this traffic to the rest of our infrastructure to make sure that I don't get to 80 or 90% CPU usage for that data center, which could tip it over and have dire and drastic consequences across all the other parts of my application.
And so we provide a very easy mechanism and fine-grained tuning for a customer to be able to say, hey, I see in my server logs that US East is starting to get to 80% CPU usage.
That's making me a little worried. I don't have a great way to send my traffic around, but I have load shedding.
And so I can determine that all my new non-session based traffic should then be reduced by let's say 20%.
And that's a relative metric.
So as my traffic increases through the day and we get busier, that's going to continue to reduce that amount of traffic by 20% relatively as we continue to get more and send it to the rest of my infrastructure, consistent with the steering policies, steering mechanisms that we talked about before.
So then we can do this in an intelligent manner. On top of that, if we see not enough traffic is being shed, we also provide the controls for customers to be able to do this for their session-based traffic.
So that at the first start, they don't have to sever any sessions.
They don't have to kick people out of their shopping carts or out of their doctor form fills. But at the end of the day, in the worst case, it's better to kick a few people out and put them on a new origin than have the whole system go down.
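Editor's sketch: the relative shedding Brian describes, with one percentage for new traffic and a separate one for session-based traffic, can be illustrated as follows. This is a minimal sketch under stated assumptions, not the product's implementation: the function and parameter names are hypothetical, and a random draw per request is only one possible way to shed a fixed percentage.

```python
import random
from typing import Callable


def should_shed(
    is_existing_session: bool,
    new_shed_pct: float,
    session_shed_pct: float,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether one request is shed away from its usual pool.

    Because the decision is a percentage per request, the absolute amount of
    shed traffic grows and shrinks with the total traffic volume.
    """
    pct = session_shed_pct if is_existing_session else new_shed_pct
    return rng() * 100.0 < pct


def route(
    is_existing_session: bool,
    primary: str,
    fallback: str,
    new_shed_pct: float = 20.0,
    session_shed_pct: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> str:
    """Return the pool a request goes to: the primary, or the fallback if shed."""
    if should_shed(is_existing_session, new_shed_pct, session_shed_pct, rng):
        return fallback
    return primary
```

With the defaults above, roughly one in five new requests bound for the primary pool goes to the fallback instead, while existing sessions are left alone until `session_shed_pct` is raised.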
Right. You know, it's interesting to me when I think about the problems you're solving with load shedding, and then sort of in some ways, kind of the parallels to some of the problems you're solving with waiting room, right?
It's the same idea of kind of surges of traffic and making sure you're creating a great user experience.
Can you talk a little bit about like what is waiting room and like what are the problems that waiting room solves?
Yeah, yeah, absolutely.
So waiting room is a new product that we brought to Cloudflare that handles and protects customer applications from being inundated by unexpected traffic spikes or massive traffic spikes for expected events that take place online.
And so before, customers would be able to utilize products such as rate limiting, right, where they cap the number of requests coming in so their origins don't tip over. The side of the coin that was missing is the customer experience, right?
What expectations are customers going to have?
The eyeball experience is what you mean, like our customer's customer, right?
Our eyeballs, exactly, exactly. Thank you.
Our eyeball experience. And so Waiting Room is a way where very easily, in literally a matter of minutes, you can set up a waiting room in front of your application with a custom branded template. When too many folks are coming to your application, we automatically turn the waiting room on and send this custom branded template to the end user, to the eyeball, so that they don't even know that when they're trying to go to Usman.com, they've actually gone to a waiting room that looks just like Usman.com, with the same header and footer.
It's a popular website, by the way, Usman.com.
Billions of people. I got searches every night on that.
All over the world. All the viewers of Latest from Product and Eng every week.
It's their number one go-to. And so you'll be able to put in your fonts, your colors, your header, your footer, different languages.
And we provide the HTML and the CSS in the hands of our customers so that they can get something up and running that has a seamless experience for the end users.
And on top of that, we provide an estimated wait time so that as an end user and eyeball trying to get to Usman.com, the wonderful product and services, I know, hey, I have a few minutes to wait.
I'm going to browse some other parts of Usman.com.
And then once a space is available, I'm going to automatically load that page.
I'm in. Right. And so it's an application. Yeah. It's kind of like when I go to In-N-Out and it says your order number 53, like just sit tight, your burger's on its way, right?
Exactly. The way I think about it is like for load balancing, we get the request from an eyeball and then we pick an origin and we send it to that origin, right?
Yeah. For waiting room, the difference is the moment we get the request, if all the origins are busy, we send it back to the eyeball.
The response goes back and then we're like, come back later.
So that's kind of how I think about waiting room versus load balancing.
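Editor's sketch: the "come back later" flow with an estimated wait time can be modeled as a capacity limit plus a first-in, first-out queue. This is a toy sketch under stated assumptions, not how Waiting Room is built (the speakers say the product runs on Cloudflare Workers and Durable Objects). The `WaitingRoom` class, its fields, and the wait-time formula of queue position divided by an assumed admission rate are all hypothetical.

```python
import math
from collections import deque


class WaitingRoom:
    """Toy in-memory waiting room: admit up to `capacity` visitors, queue the rest."""

    def __init__(self, capacity: int, admits_per_minute: int) -> None:
        self.capacity = capacity
        self.admits_per_minute = admits_per_minute  # assumed rate at which slots free up
        self.active: set[str] = set()
        self.queue: deque[str] = deque()

    def request(self, visitor: str) -> dict:
        """Handle one request: pass it through, or answer with a queue position."""
        if visitor in self.active:
            return {"admitted": True}
        if visitor not in self.queue:
            # Newcomers are admitted only if there is room and nobody is waiting.
            if not self.queue and len(self.active) < self.capacity:
                self.active.add(visitor)
                return {"admitted": True}
            self.queue.append(visitor)
        position = self.queue.index(visitor) + 1
        return {
            "admitted": False,
            "position": position,
            "estimated_wait_minutes": math.ceil(position / self.admits_per_minute),
        }

    def leave(self, visitor: str) -> None:
        """Free a slot and promote queued visitors in arrival order."""
        self.active.discard(visitor)
        while self.queue and len(self.active) < self.capacity:
            self.active.add(self.queue.popleft())
```

A queued visitor who retries after `leave` has freed a slot gets `{"admitted": True}`, which mirrors the page loading automatically once space is available.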
That's extraordinary. And then Dimitris, just back up to, you know, Brian also raced through load shedding.
How did we build load shedding?
Because load shedding is able to shed before the health checks turn red, right?
I mean, that's part of what's going on here is that we're picking up, we're dealing with a surge before the thermometer turns red and origins are like, I'm out, tag me out, you know, go to origin B.
Yeah. So we keep track of rate of requests.
We know the limits, the specified limits. And based on this information, we decide if we have to send it to that pool or send it to the next one.
And as Brian said, there's some granularity there. Customers can configure percentages on the sessions, new sessions versus the existing sessions.
So customers, like they don't set traffic from existing sessions.
Instead, they decide for new users to go to the next pool, which is really powerful.
It's amazing.
So across all these products, there's a great homology here, right? In all cases, it's about proxying traffic, and it's about the Cloudflare edge, the load balancing edge, working with these sophisticated hints.
These are not simple hints.
After all, people are giving us portions of their traffic, their GPS coordinates, their fractions of where they want things to fail over, how they want proximity steering to happen, how they want session steering to happen.
And then Cloudflare's edge is the brain that's thinking through this going, okay, let's figure out where the right place to send this is and do that.
One of the things I'm very proud of with WaitingRoom is that it helped power FairShot.
Can you talk a little bit about FairShot, Brian, and why that was such an interesting project to have in 2020 and 2021?
Yeah, absolutely. So we were making WaitingRoom and unbeknownst to us, a pandemic was surely in sight.
What do you mean you didn't have visibility into that?
You guys don't have metrics on that? I thought you had health checks for the whole thing.
We're still working on it. It's in beta. Yeah. And so we had this looming pandemic that came up and it was really daunting and it was scary.
I think it was scary for everyone around the world, for our friends, our family.
And we saw almost overnight this massive shift in traffic that went from in-person brick and mortar to online because people weren't allowed to be outside anymore.
But one of the biggest things and biggest challenges we face, I think, through every country, everyone in the world was what's the light at the end of the tunnel?
How do we get the vaccine? And so municipalities, governments, private businesses, public businesses, anyone who was able to step up to the plate and help with the dissemination and scheduling of the vaccine was massively unprepared for the volume and steep spikes of traffic that were going to hit their infrastructure day in and day out.
And that led to massive frustration for people like you and me trying to go and book our appointment to get our vaccine.
And so we saw an opportunity where Cloudflare could help, where one of our bread-and-butter capabilities is helping with the reliability and availability of applications.
And what can be more important than be able to, I don't want to say this is a stretch, to help save lives, to help our mothers and fathers, our aunts and uncles, to be able to get the necessary vaccines so that they're protected.
And so we put on a program called Project Fairshot earlier in February this year.
So we provided our CDN and our Waiting Room capabilities for free to any business, private or public, municipality, hospital, or government that is responsible for aiding the dissemination and scheduling of the COVID-19 vaccine.
And we've seen a massive, massive turnout for the folks that we've been able to help.
And so we have been able to implement waiting room across 10, over 10 countries in the world, helping to get amazing access and quick available access to their vaccine scheduling.
We've been able to get over 100 million vaccines scheduled through different booking systems that have Waiting Room in front of them, and really lower the pain and frustration that people, what we call eyeballs, were experiencing day in and day out trying to get access to the vaccine.
And we have, it's a very humbling experience, I'll tell you, to be able to be a part of that journey, to be a part of solving that problem.
And we've had amazing, we've been able to get case studies written back from our customers who have been on the front lines of being able to help those scheduling systems.
And folks such as Luma Health, where we help the Cook County of Illinois, right, one of the largest counties in the United States, help get their 5 million people population vaccinated.
Same happening in areas of Germany and Japan. And so it's been a massively humbling experience.
And we will continue to extend Project FairShot for another year to make sure that as COVID-19 is around, that we're here to help and to make sure that this is alleviated as much as possible.
So amazing. Really, really, really proud of all the work you guys have done here.
It's just, and what an easy thing to feel like you're, people talk a lot about being mission driven, but literally, like, you know, help make the Internet better so that the Internet can help.
I really liked our CEO, Matthew said, you know, like the heroes of the pandemic are the health workers, but sort of like the faithful sidekick is the Internet, you know, sort of being helping and being around to try to help where it can at least connect people and systems together.
And I think I'm so proud that Cloudflare was part of that.
So before we wrap, Dimitris, interesting tech on waiting room and the, you know, technology that powers FairShot in that it's partly built on workers.
So talk about that. Why, why did we pick workers? What did it allow us to do?
And how did that help accelerate development and deployment? Yeah, that's a great question.
Honestly, we, when we started thinking about the waiting room, we sat down and we're like, we need to make sure the system will scale.
It's going to be highly available.
And we had a few options, right? We could write our own code and then have to deal with the availability and the scaling part, right?
Or take all these things for free and use Cloudflare Workers that are available on every data center, Cloudflare data center, they scale, they are highly available.
And then the development is like the velocity there on development is really good, right?
So it was kind of the only decision. Yeah. The only, the only right answer.
Yes. Right. We got all these things for free. And we also utilized Cloudflare Workers KV and Durable Objects, which are great technologies.
They scale great. And yeah, that's why we were able to develop waiting room in such a tight schedule.
You know, we, we talk about, you know, building a platform and building, you know, building our products and our capabilities on top of that same platform that we make available to our customers.
And it's just, it's so great when I see that, you know, in action with, with things like, with like waiting room and just the impact that that has.
And the fact that, you know, by using it ourselves, we talk a lot internally about the importance of dogfooding and, you know, us relying on our own platforms is just an extra layer of really kind of going deep with those technologies and pushing them to their limits.
Obviously, as Fair Shot has done with Waiting Room built on Workers.
Yeah. And it also works in the other direction, right?
It makes, it makes workers better, right? Because it forces workers to be able to, okay, so this is, there's a real application with your engineers, right?
Just virtually down the hall who are, we're now building on this whole thing and they've got demanding deadlines and real, you know, Cook County, for God's sake, that's an enormous number of people who need to, who need to go get registered.
And so that's, it's fantastic. I'm always amazed how fast the 30 minutes goes when we, when we get you guys on here and how, how many, how many different things we have to talk about, but we are amazingly out of time.
And I just want to thank you, Brian, and thank you, Dimitris, for joining us. We'll be back next week, or in a week, Jen, whenever we're back next, to talk with another great team and talk about what's next on Latest from Product and Eng.
But thanks everybody for watching.
Thanks for joining us. Thanks so much, everybody.
Optimizely is the world's leading experimentation platform.
Our customers come to Optimizely, quite frankly, to grow their business.
They are able to test all of their assumptions and make more decisions based on insights and data.
We serve some of the largest enterprises in the world and those enterprises have quite high standards for the scalability and performance of the products that Optimizely is bringing into their organization.
We have a JavaScript snippet that goes on customers' websites that executes all the experiments that they have configured, all the changes that they have configured for any of the experiments.
That JavaScript takes time to download, to parse, and also to execute. And so customers have become increasingly performance conscious.
The reason we partnered with Cloudflare is to improve the performance aspects of some of our core experimentation products.
We needed a way to push this type of decision making and computation out to the edge, and Workers ultimately surfaced as the no-brainer tool of choice there.
Once we started using workers, it was really fast to get up to speed.
It was like, oh, I can just go into this playground and write JavaScript, which I totally know how to do, and then it just works.
So that was pretty cool.
Our customers will be able to run 10x, 100x the number of experiments. And from our perspective, that ultimately means they'll get more value out of it.
And the business impact for our bottom line and our top line will also start to mirror that as well.
Workers has allowed us to accelerate our product velocity around performance innovation, which I'm very excited about.
But that's just the beginning.
There's a lot that Cloudflare is doing from a technology perspective that we're really excited to partner on so that we can bring our innovation to market faster.