Latest from Product and Engineering
Presented by: Usman Muzaffar, Jen Taylor, Brian Batraski, Dimitris Antonellis
Originally aired on July 13, 2023 @ 6:00 PM - 6:30 PM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
All right. Hello everyone and welcome to the latest from product and engineering. Jen, nice to see you again.
Great to see you Usman. I could not be more excited about today because we've got one of my favorite products.
It really is one of your favorite products.
You're always saying it. It is. You know, to me, this team, I often refer to this team like the Beatles internally just because they ship a phenomenal amount of hits within the organization.
It's just like these guys are rock stars.
I'm just really happy. So let's start by introducing these rock stars.
Brian, can you introduce yourself and a little bit about what this mysterious phenomenal product is?
Yeah, absolutely. So hi everybody. Thanks so much for joining in today.
My name is Brian Batraski. I'm the product manager for the load balancing product.
A load balancer, in its simplest form, takes a bunch of requests that are coming into your infrastructure and balances them across the different origins that you have in the back end.
The idea is very, very simple, but it can get really complicated very, very quickly.
And so we're going to go over some ways that we make those challenges easier.
We make the solution as simple as possible for all the different customers we have.
Great. And Dimitris, if you could introduce yourself too. Yeah. Oh, sorry, Jen.
Yeah. Hello folks. I'm Dimitris Antonellis. I'm the engineering manager for the load balancing team here at Cloudflare.
I joined Cloudflare a little bit over five years ago.
At that time, I was a systems engineer. I helped build the first version of the product.
And then when there was traction, we decided to build a team and I took over as the engineering manager there.
Yeah. I work closely with Brian.
He provides me the product requirements and I try to find ways with the rest of the team to build all the wonderful features here.
Find lots of good ways.
Okay. So just stepping back, Brian, you've covered already a little bit, like what is a load balancer, right?
Why do we have a load balancer? Like why does Cloudflare feel the need to offer a load balancer to our customers?
And what do we bring to the load balancing equation that is different than what everybody else does?
Yeah. Great question. So in a business, there are different segments in your infrastructure that are going to be paramount, table stakes that are necessary to make your business be able to scale and meet the needs of your customers.
And you're going to need security.
You're going to need to make sure that everything you do is performant, because the bar of what people expect is growing day over day.
But on top of that, you need to make sure that your website and that your application is resilient and reliable.
I think we've all had the experience before where we go into a website and it says, oh, 404 not found or 502 not available.
And that's not a great experience that we have.
And so having a load balancer, making sure that your application is resilient, reliable, and always available, what we call a hundred percent maximum availability, that is again, table stakes for any business to make sure that they can meet the needs of their customers and provide a great experience for everybody.
And so there are a lot of people in the market that have a load balancer, but what we do is we have a cloud-based load balancer that not only can go to different levels of the network stack, whether you have layer seven for your application, HTTP, HTTPS requests, all the way going down to layer four, if you have TCP or UDP types of applications and UDP traffic.
And so because of our massive global Anycast network and all the massive amounts of POPs we have across the world, we have no single point of failure.
We can load balance across any of our POPs, anywhere in the world.
And again, we're within 100 milliseconds of any eyeball on the planet.
And so it creates an incredibly compelling and again, very resilient and reliable solution for all the different customers, whether you're a smaller mom-and-pop shop or one of the biggest enterprise customers in the world.
On top of that, the benefit of a cloud-based load balancer is that you have much lower maintenance and overhead costs compared to a hardware load balancer, where you have to have people who can go check on it, be physically available, and know the specific languages or rules for that particular hardware load balancer.
And so it really just allows businesses to focus more on their table stakes and allow us to take care of the items that we specialize in and what I believe we do so well.
Lastly, a cloud-based load balancer can talk to all the different other products on our platform.
And so that creates an incredibly, incredibly strong solution, not only across reliability, but again, like I said before, performance and security as well.
That's awesome.
What I liked about your explanation is that it lines very squarely with, when you say, what does Cloudflare do?
We help make a better Internet. So faster, safer, more reliable.
This is it. This is that more reliable piece, which is how do you make sure that there's always a way to get, the end user always has a way to get.
So Dimitris, I guess the next question is, so how did we build this? And how did we leverage the thing that we had, even in 2015, as you joined five years ago, 2016, five years ago, we had an anycast network.
So how did all this play into the architecture of the solution that you helped build from the very beginning?
So how does this work?
Yeah. Great question. I would say, again, this is a really complicated system, right?
We didn't build the whole thing from the day one.
So we kind of came up with different iterations. You would imagine the first MVP version was something really simple.
That was even before we had customers in the system, right?
We wanted to prove that we can build such a system at Cloudflare scale.
And then we start improving different kinds of parts of the system. I remember, I still remember initially we had more like kind of centralized architecture where you would run your health checks from the different data centers.
You would aggregate information in a central location, and then you would push the aggregates to all of our data centers.
So you have the health information ready when you were resolving load balancers.
We tried, so that worked perfectly fine.
But then we didn't want to have these kinds of failure scenarios where the centralized system would fail, right?
So we started improving that architecture, moved to a more kind of coreless architecture where we don't rely on the central location for the aggregation of the data.
So to me, I think we keep improving out of a can.
We always try to be resilient and highly available, and we'll keep doing that.
So you mentioned the word health check, and that's a word that comes up a lot.
And part of you teaching me a lot about how a load balancer works is this word health check showing up over and over again.
So like in the simplest terms, what is a health check?
Like what is the actual data in a health check?
And where is the health check information going?
And what does the rest of the system do with it when it gets it? Yeah, that's a really good question.
I would say a health check is a way for us to know if the origin is considered healthy or unhealthy, right?
So the customers give us the details of their origin, the path, their origins, all the API endpoints that they're exposing, any port numbers, of course, origin IP, origin names.
And then they also give us more details about the expected response, right?
Expected response code, certain strings that we should see in the response.
And then based on this information, we run, the simplest form would be an HTTP health check, open HTTP connection, run HTTP health check.
We wait for the response.
We check against the configured information. If we see everything in the response, we declare the origin as healthy, right?
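As a rough illustration of the comparison Dimitris describes here, the following is a minimal Python sketch. The names `HealthCheckConfig` and `is_healthy` are hypothetical and are not Cloudflare's API; the sketch only covers the "check the response against the configured expectations" step and assumes the status code and body have already been fetched.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckConfig:
    """Hypothetical container for what the customer configures."""
    path: str = "/"
    port: int = 443
    expected_codes: tuple = (200,)
    expected_body: str = ""  # substring that must appear; empty means "don't check"


def is_healthy(config: HealthCheckConfig, status_code: int, body: str) -> bool:
    """Return True only if the observed response matches every expectation."""
    if status_code not in config.expected_codes:
        return False
    if config.expected_body and config.expected_body not in body:
        return False
    return True
```

With `expected_body="Today's specials"`, an origin that answers 200 with a maintenance page would still be treated as unhealthy in this sketch.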
So the moment the origin is healthy, it's a candidate when we resolve the load balancer to include that origin IP in the kind of-
And when you say resolve the load balancer, it sort of means like if it's going to usmansdonuts.com, in fact, inside Cloudflare, we can say there's multiple servers that can serve the storefront.
And so if two of them, if the one here in Sunnyvale and the one in San Francisco, the one in Sunnyvale is sick, but the one in San Francisco is saying, actually, I'm healthy, then our Cloudflare edge around the world knows, send it to San Francisco because you're not getting any donuts from Sunnyvale.
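The Sunnyvale and San Francisco example can be sketched in a few lines of Python. This is an illustration only: `resolve` is a hypothetical name, and picking at random among healthy origins is an assumption made here for simplicity, not a description of Cloudflare's steering policies.

```python
import random


def resolve(origins, health, rng=random):
    """Pick one origin from those currently marked healthy.

    `health` maps origin name -> bool; origins with no entry are
    treated as unhealthy. Returns None when nothing is healthy.
    """
    healthy = [o for o in origins if health.get(o, False)]
    if not healthy:
        return None
    return rng.choice(healthy)
```

If Sunnyvale is marked unhealthy and San Francisco healthy, every call returns San Francisco.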
One of the things that Dimitris touched on, Brian, that I wanted to ask you about is, one of the things we have to do is give our customers the ability to actually define what healthy means with increasing sophistication.
If I can respond to a ping that's good enough, that's not good enough.
If I respond to an HTTP, that's not good enough.
You got to actually see me say this, like today's specials is showing up on the front page.
Only then can you consider this healthy. So talk a little bit about that.
How did we decide what controls to give them and why, and how are our customers using some of that?
Absolutely. I mean, so it really all starts with the customers, right?
We take a very customer-centric approach. And so I went out and spoke to our customers and asked them, what are you looking for?
What do you need?
What are the requirements that would make your lives easier and benefit your overall infrastructure and your application?
So we got it directly from them and then did a good amount of research, made sure that we had our ducks in a row and really validated the hypothesis that we had of what would benefit our customers all the way across the market from smaller businesses that are just starting out, again, to larger businesses that have existed for my whole life.
And so there are a number of things, again, like Dimitris said, the expected response code, a particular path, or maybe headers that are specific to different parts of your application.
On top of that, there are thresholds that may need to be put into place.
For some businesses, maybe you can expect a particular origin to maybe go down once in a while, but it's actually okay for you.
But then you want to be notified if it goes down a few times, oh, then something has to be paged, someone has to look into it, and that's where we come into play to make sure we minimize that time to resolution.
But it is incredibly important to make sure that we give those tools and give that balance between simplicity and granularity so people can not only configure this easily and quickly, but also be able to focus, again, on their table stakes, on the customers, on the experience they have day-to-day.
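The threshold idea Brian mentions (one failure is tolerable, a few in a row should page someone) could look something like the sketch below. The class name `FailureThreshold` is hypothetical, and counting consecutive failures is an assumption made for this example; the transcript does not say how the product counts them.

```python
class FailureThreshold:
    """Signal once when an origin fails N health checks in a row."""

    def __init__(self, max_consecutive_failures: int):
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when it is time to notify."""
        if healthy:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        # Fire exactly once, at the moment the threshold is reached.
        return self.failures == self.max_consecutive_failures
```

Firing only when the count equals the threshold keeps a long outage from paging on every subsequent failed check.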
It's just a very powerful, if you think about it, it's a very powerful interface between our network and the customer's infrastructure, right?
And it's really the place where there is this kind of flexible dialogue, actually, between Cloudflare and that infrastructure to really kind of, you know, and configure it and optimize it in a way that it's meant to meet the customer's intentions and the customer's needs.
The thing that blows me away, though, Brian, is when I open up the dashboard and I'm looking around, I mean, if I look at the load balancing user interface, like, it is a work of art, right?
It is an incredibly sophisticated interface, but at the same time, it's still fairly easy to use.
Like, talk to me a little bit about that journey and kind of how you guys have thought about striking that right balance.
Absolutely. So, first, I want to give a shout out to our design team here at Cloudflare and our engineers.
They do a fantastic job of building that beautiful and elegant UI.
But to your point, you know, it is a challenge, right?
It is difficult to give the speed and velocity to make changes and drive that certainty that everything is going to be okay and drive that confidence, but also not give, you know, as we can imagine, the more granularity, the more options you give, the more buttons, the more selections that are available on the screen.
And so, we wanted to be able to give the most pertinent and important information available at a glance.
So, then whether you want to just get a snapshot that everything is okay versus being able to go really nitty-gritty into a particular issue, we provided that balance between them both.
But again, we went back and we talked to those customers.
We talked to them who are just starting the business out and said, you know, what do you need?
What is really important to you?
And what are kind of gotchas that you experienced in the past that you kind of don't want to see happen again?
And so, that led to a number of confirmations being added in, but put it in such a way where you don't have to just click, click, click, click and say, okay, everything is good to go.
But on the other end, you know, we are human.
We're prone to error. And so, it really comes down to having the confidence that when you make a change, the right person is making the change and that they're certain this is what they want to happen.
I think we can all have an experience where we tried to update something or delete something and we're like, oh no, that shouldn't happen.
And so, we want to make sure that never happens to any of our customers.
So, we've gone through a number of iterations, making sure that our customers are happy with what they see and that they can, again, have that velocity, have that very stellar experience that when they come and look at and update their infrastructure or want to check out something or dive into an issue that could potentially become more of a fire, that not only are we providing that data and that insight, but that we are helping them as a partner decrease that time to resolution.
So, something that's an evolving challenge, but so far, I think we're doing a pretty good job.
That's great.
So, the title of our segment, of course, is the latest from Product and Eng.
So, talk a little bit about some of the stuff we've done recently. What's a waiting room?
Why is waiting room showing up in our external blogs and what does that have to do with load balancing?
And so, what problem are we trying to solve? And then, Dimitris, if you could jump in like, how did we solve it?
Yeah, absolutely.
So, I think since 2020 started and the onset of the global pandemic, we've seen, especially at Cloudflare, a massive shift from in-person brick and mortar traffic, switching very drastically to digital traffic.
And many of these different businesses, such as e-commerce would be an example, have over many years been able to very directly track and forecast how many folks they're going to have in their stores, how many transactions are going to take place.
And this is all going to be kind of turned upside down now that all of this has moved over into the Internet.
And so, it kind of caught many folks by surprise. And there's a few ways that this can be solved.
Either you can try and throw more origins, more servers at the problem and try and solve this by adding a lot more money.
But unfortunately, not everyone has that capability to do so.
And so, another option that was available is to create this waiting room, right?
It's this concept of taking the amount of traffic that your origins and your infrastructure can handle overall, and then taking the excess and putting them in the waiting room that acts as an extension of your own brand, right?
It comes down really to creating that seamless customer experience.
And so, the last thing we would ever want is for customers to then see that 404 page unavailable, that 502 server not available or network error.
And that doesn't instill confidence, that doesn't drive belief that this application or this business is going to be able to meet my needs.
And so, not having to throw money at the problem, but being able to get a reasonable cost solution that can maintain that customer experience while solving your infrastructure inundation, it was a great fit.
And so, we talked to a lot of customers and we saw that this is a really good fit, especially with the oncoming pandemic.
And so far, we've seen a really great response and early response from people wanting to test this out.
How does it work, Dimitris? In my mind, network traffic is kind of like a beam of light.
You can't stop it and then just say continue.
So, what's actually going on at a network level? What is the eyeball seeing when you're basically metering?
In my mind, when Brian's talking, I'm imagining like metering lights on a highway, right?
You don't get to go on just yet, hold your horses, now you can come on.
How do we pull this off? Yeah. So, more or less, that's what's happening.
The beauty of waiting rooms is we built waiting rooms on top of Cloudflare Workers, which is pretty cool.
That's a testament like Cloudflare Workers can be used for all these fancy things out there on the Internet.
But under the hood, what's happening is you see a request that is destined to a waiting room.
We identify there's a waiting room that is associated with that host name, that path, and then we trigger the waiting room logic on the worker.
We do maintain state. We need information about number of requests that we have sent to the origin, number of requests we've seen, and we put them in the queue.
So, there is all this kind of bookkeeping logic under the hood. And then we issue cookies.
Once we see the cookie again, we have a way of going back and figuring out the right order if we should send the request to the origin or it should wait longer.
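The bookkeeping Dimitris outlines (count who is at the origin, queue the rest, recognize returning visitors by cookie) can be sketched as a tiny in-memory model. This is a hedged illustration, not the Workers implementation: the `WaitingRoom` class, its method names, and the cookie format are all hypothetical, and a real deployment would need shared state across many machines rather than one Python object.

```python
import itertools
from collections import deque


class WaitingRoom:
    """Single-process model of a first-come, first-served waiting room."""

    def __init__(self, capacity: int):
        self.capacity = capacity      # how many visitors the origin can take
        self.active = set()           # cookies currently let through
        self.queue = deque()          # cookies waiting, in arrival order
        self._ids = itertools.count(1)

    def handle(self, cookie=None):
        """Return (cookie, decision, position); position is 0 when admitted."""
        if cookie is None:
            cookie = f"visitor-{next(self._ids)}"  # issue a new cookie
        if cookie not in self.active and cookie not in self.queue:
            self.queue.append(cookie)
        # Admit from the front of the queue while the origin has room.
        while self.queue and len(self.active) < self.capacity:
            self.active.add(self.queue.popleft())
        if cookie in self.active:
            return cookie, "origin", 0
        return cookie, "wait", self.queue.index(cookie) + 1

    def leave(self, cookie):
        """Free a slot when a visitor is done with the origin."""
        self.active.discard(cookie)
```

A visitor who presents the same cookie again keeps their place in line, which is what makes the "you're seventh in line" page possible.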
So, from the point of view of my web browser, it's just talking to the site that I typed into the browser.
It's an absolutely seamless thing. It's just that I'm hitting a worker first that's giving me the interstitial page.
And because that worker is displaying content that we let the customer configure, it can be totally branded.
I'm looking at the logo that I expect to see, and it's like, just hang on.
You're seventh in line. You're going to get tickets for the great concert it is that you want to see.
And then the backend, the worker, is keeping track of how much time it actually needs to stick around and then lets the connection go through.
Correct. Yeah, that works. That's amazing. It's fantastic work.
What are some of the use cases people are using it for, Brian? Yeah.
So, one of the use cases we see is largely for e-commerce businesses, any sort of sales or marketing campaigns that they're kicking off, concert sales.
We also see this for telehealth and a huge impulse from banking as well, especially, again, more people are going over the Internet and not going in person.
And so, we've seen a really great response.
But again, if you're selling something, if you have a webinar or anything where you believe that there's going to be more folks that are interested than what your current infrastructure can handle, it is an absolutely perfect fit for being able to use your waiting room.
You don't have to go and buy more origins.
You don't have to go hire more people. You can use our solution that integrates, again, different products in our platform, such as bot management, DDoS, WAF.
And again, get rid of that illegitimate traffic early in the pipeline.
And then once you have all that legitimate traffic and real folks, get them into that queue and make sure they have a really stellar experience as they filter into your shop.
It's really great expectations. Well, okay. So, the Beatles, they had Abbey Road and then they had Sgt. Pepper's.
And it's phenomenal to think that one band was able to produce such a breadth of music and really innovation.
And I go back to the fact that in my mind, you guys really are the Beatles of what we do here.
On one end, you guys have just delivered waiting rooms, but on the other end, you just did per-origin host header override, which is something that has been a top feature request for us for a long time.
Can one of you guys talk through what does that mean and why do customers want it so badly?
Yeah, I can start that off and then I'll pass it over to Dimitris.
So, when you or I, we go to usmansdonutshop.com, right?
When we send that request over to the Internet, so we can't wait to get that donut.
When we send that request over the Internet, there's something called a host header, right?
So, when that request comes from my browser, the Internet, somebody needs to tell the Internet saying, oh, I want to go to Usman's donut shop.
And this origin is the one that's actually serving that donut shop, right?
There needs to be a match to make sure that the request is going to the right place.
And again, you don't get that error being served to you. And so, sometimes though, if you want to use third-party applications, such as Amazon S3 or Heroku or Google App Engine, that host header on the origin may not match up with what the browser is sending over.
And so, this has been, like you said, a long-term ask from many of our customers.
And they want to be able to leverage these third-party applications in the infrastructure because it gives them a lot of benefit, right?
And before they were having to do certain workarounds to be able to leverage it with our load balancer and infrastructure.
And we wanted to make that much easier, like we do everything here at Cloudflare, make it a very seamless and easy experience.
And so, we've built this out as a first-class feature where you can add a host header on a per-origin basis, because not every origin is going to be the same, providing that granularity.
But then, we want to add that simplicity of being able to do it quickly and easily.
And so, you'll be able to do it through the UI or through the API, add it with one single field.
And then, what's very, very powerful, we go back to the idea of these health checks.
Not only does the request from the customer need to make sure that it can reach the correct origin, whether you're using these third-party applications or not, these health checks need to be able to get to the origin and say, hey, are you available?
And can you take this request? Are you healthy and not over-inundated?
And so, we will automatically take that host header and apply it to the health check that's attached to those pools that the origins are a part of, and make sure we handle that and abstract away from the customer any extra configuration that's necessary.
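To make the per-origin override concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `Origin` fields, the function names, and the example hostnames are hypothetical, not Cloudflare's configuration schema. The point it shows is that one optional field on the origin decides the `Host` header for both proxied requests and health checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Origin:
    name: str
    address: str
    host_header: Optional[str] = None  # per-origin override; None means no override


def host_for(origin: Origin, requested_host: str) -> str:
    """Use the origin's override if set, else the host the visitor asked for."""
    return origin.host_header or requested_host


def proxied_request_headers(origin: Origin, requested_host: str) -> dict:
    return {"Host": host_for(origin, requested_host)}


def health_check_headers(origin: Origin, lb_hostname: str) -> dict:
    # Same helper, so the health check picks up the override with no
    # extra configuration, mirroring what is described above.
    return {"Host": host_for(origin, lb_hostname)}
```

Because both paths go through `host_for`, an origin hosted on a third-party platform gets the host name that platform expects in either case.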
How do we do it, Dimitris? That sounds hard.
That sounds hard. Gaitan, that sounds hard. Where is that thing getting persistent?
That sounds hard. Honestly, the most difficult thing here was keeping the config simple.
And we have a lot of back and forth with Brian. When we have a new project, we always try to come up with the proper API.
We don't want to add too many things, but we want to make the API config as simple as possible.
And allow the customers to configure different settings there.
I would say to me, I was struggling a lot.
I want to make sure we don't complicate the API and the UI a lot. And the whole implementation was way easier than kind of designing the API.
The interface was the hard part.
Interesting. Yeah. And then once we did it, Dimitris, was it built in such a way that, like, did we have to solve it twice?
Or did Health Checks pick this up automatically because of the place where we implemented this facility?
Yeah. More or less, we got it for free. And again, we built the basic components out there.
It was a matter of extending, making sure we pass along all the host headers, making sure we send the Health Check with the correct host headers.
So, yeah. And I think we can truly support a multi-cloud environment. I think we were limited for some of the cloud providers because of the lack of this feature.
Now, I think you can have any kind of cloud behind Cloudflare, which is pretty powerful.
That's amazing. You were just talking about the importance of configuration.
And we talked a few moments ago about just sort of the robustness of the UI and the configuration that we have today for load balancing.
But I know the two of you are constantly looking to sort of push the boundaries.
What are some of the ways you guys are thinking about giving customers more control and more fidelity over their load balancing configuration?
You've released all these albums.
How are you going to help them create the greatest hits? I think we're actually right about to come out with one of our biggest creations, and I'd say it's going to be the platinum record.
And this is what we're calling LB Rules. And so today, we have a really great arsenal of features that customers are taking advantage of day in and day out.
But there may be one particular item, maybe load balancing by path, taking advantage of particular host header or different type of header.
But there may be one particular item or set of particular items that we may not have built out just yet, but makes a huge world of difference to your overall business.
And so instead of customers having to wait for us to build everything out as a first class product, which again, like we said before, we take the time and the energy and the research to make sure we do it correctly, not have to go back and rejigger things.
We came up with this concept of adding rules, similar to what Firewall has already, where you can put the power in the customer's hands and add that customization onto their origin selection or traffic steering decisions that directly aligns and ties to the overall goals and needs of their business.
And so it has a very easy and consistent UI experience, similar to our WAF product.
And it allows us to not only provide this, to put this power into our customer's hands, but also speed up our development velocity and creates a channel for us, where if someone has a request saying, this will make a game changer to my business, we can add it into rules.
And that happens at a much faster velocity than having to then go back, design, and put this out into the UI and API as a first class citizen.
So just to caveat this a little bit, LB Rules is not a second class citizen.
It is incredibly powerful.
And we are incredibly, incredibly excited to put this platinum hit, double, triple platinum into your hands to make sure that you really get the needs alignment to what your business really wants to solve with load balancing and overall reliability solutions.
Dimitris, what's an example of a kind of rule a customer will be able to write?
And where does that logic, how do they define that logic and how does it get executed?
Yeah, so you'll be surprised. We support all kinds of things.
One of my favorite ones is time-based load balancing, where customers can specify different steering policies for different times of the day, which apparently we have customers asking for that.
But in addition to that, again, we are really flexible.
Customers can specify conditions based on different headers that we see on the request, IP, client IPs, or any kind of path-based load balancing.
That's a really popular one. So with the LB Rules, you'll be able to specify different steering behavior for different paths of your host name, which is pretty powerful.
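The kinds of rules mentioned here, path-based and time-based, can be modeled as an ordered list of conditions where the first match decides the pool. The sketch below is only an illustration of that idea: `Request`, `Rule`, and `pick_pool` are hypothetical names, and LB Rules themselves are written in Cloudflare's rules language rather than as Python functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Request:
    path: str
    hour_utc: int            # hour of day the request arrived, 0-23
    headers: Dict[str, str]


@dataclass
class Rule:
    condition: Callable[[Request], bool]
    pool: str                # pool to steer to when the condition matches


def pick_pool(rules: List[Rule], request: Request, default_pool: str) -> str:
    """Return the pool of the first matching rule, else the default."""
    for rule in rules:
        if rule.condition(request):
            return rule.pool
    return default_pool


# Example rules: send API paths to one pool, overnight traffic to another.
example_rules = [
    Rule(lambda r: r.path.startswith("/api/"), "api-pool"),
    Rule(lambda r: 0 <= r.hour_utc < 6, "night-pool"),
]
```

First-match ordering is a design choice in this sketch; it keeps the outcome predictable when a request satisfies more than one condition.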
And some more details for the implementation here. Again, we utilized a lot of the work that the Firewall team has already put in place.
Rules are already a big deal at Cloudflare, so it's great that we stood on the shoulders of those giants to get to Brian's platinum record.
Why not? We sampled a little bit.
A little bit of the song is a hook from a real popular tune from two years ago.
Yeah, so we get a lot of things for free because we are using the same libraries there.
Whatever improvements we are making on that API, we'll get them for free, and we are really excited about this.
Yeah, and we'll keep adding more and more conditions.
That's the beauty of the LB Rules. We get the customer request, and within a few days,
You can see it in production. It's going to be really simple to improve.
That's fantastic. Thank you so much. I can't believe the time just flew by, as it always does.
I'm looking at the clock, it's like, holy, that's it?
We're down to the last 90 seconds. Brian and Dimitris, thank you so much for joining.
It's always so much fun to talk to you guys. We'll have you back in a few months to give us the latest, to talk about whatever the next hit album is.
For now, everybody, thank you for watching Latest from Product and Eng.
We'll see you next week. Thanks. Thank you all. Thank you. Bye-bye. Bye.