Latest from Product and Engineering
Presented by: Usman Muzaffar, Aki Shugaeva, Alex Krivit
Originally aired on October 23, 2021 @ 2:00 PM - 2:30 PM EDT
Join Cloudflare's Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
All right, hello everyone, and welcome to another episode of The Latest from Product and Engineering.
I'm Usman Muzaffar, Cloudflare's head of engineering. Jen Taylor's out today, but no matter, I'm very pleased to welcome two of our other colleagues here. Alex Krivit and Aki Shugaeva are joining me today.
Alex, say hi and just tell everyone what team you're part of and how long you've been with Cloudflare.
Hey everybody, my name is Alex Krivit and I'm the product manager for the cache team.
I've been a product manager on the cache team for just a little under a year now, but at Cloudflare for about three years.
And what were you doing before cache team product manager?
This is great. You can't not say this, this is too interesting.
Before I was a product manager I worked on the legal team. I was a legal associate working on a number of things from privacy to compliance to trust and safety matters.
And that's the kind of interdepartmental mobility we have at Cloudflare.
The lawyers can become cache product managers, which is just fantastic. We're thrilled to have you.
Aki, how about you? How long have you been with us, and what's your role?
I've been at Cloudflare for about a year and a half.
I'm the engineering manager for the cache team. We've definitely used Alex's legal expertise many times, whether it's for hiring or if we have questions about content, things like that.
That's great. And so, as viewers can imagine, the word cache showed up in both of our guests' titles and roles this afternoon.
So that is the subject of this episode: cache. Aki, can you start us off? What is a cache? And this is c-a-c-h-e, not the money kind of cash.
So what do engineers mean when they say, I have to cache something, or it's in the cache, or it's not in the cache, or cache eviction? What are they talking about?
Yes, in a very general sense, a cache is just intermediary data storage used for high-speed retrieval. Our cache is distributed across the network, and so what we do is use it to put our content closer to the eyeballs that are requesting it. Eyeballs as in you, sitting at the computer right there, requesting it from one of our customers' origins that might be on the other side of the world.
Yeah, so another way of saying it: a cache is your own private copy of something, something you can consult rather than having to go back to the authoritative original source. The phone book on all of our iPhones and mobile phones is a cache of phone numbers that we pulled in from somewhere else and can consult. And of course that immediately leads to the problem we'll touch on a little bit later in the episode: what happens when your copy is incorrect? How are you supposed to know when to go back and update your copy, or erase it, because it's wrong? Because the whole point is that you don't have to keep going back and checking the original source. But Aki touched on the key word there, Alex, which was distributed. It's global. Our cache is not on one computer, it's on thousands of computers, and in fact that's the D in CDN. What's a CDN, and why are companies so interested in them these days?
Yeah, so the CDN is a network of all of these caches that are around the world. As Aki mentioned, we store our caches in various data centers, and these data centers are distributed really, really close to where our customers, the website owners, have a lot of people trying to access their content and the things that are in cache. So our goal is to move the content from wherever the customer has their origin to as close as possible to where their visitors are coming from.
And so the whole point is, why would we do that, though? I mean, ultimately, what are we trying to actually do? We're trying to bring the content closer to where the eyeballs are so that what? What's in it for the customer? What's in it for the eyeball?
It's all about performance. If the request from the visitor to get that information doesn't have to go really, really far away, bringing the content closer to the visitor improves performance. It decreases page load times for the various types of assets they're trying to look at on the Internet, and so that's beneficial both for the customer and for the end user trying to access it.
That's fantastic. So that means, for those millions of websites that are behind Cloudflare, the people who are trying to access that content don't actually have to go to the one server, wherever it happens to be on the other side of the planet, to fetch that content. You can just go to the nearest Cloudflare server, and somehow, magically, the Cloudflare server already has a copy of that content there. Just like magic.
Just like magic.
So, Aki, let's take the wraps off the magic a little bit. How does Cloudflare know what it's allowed to cache? Can it just cache everything? What's allowed to be in cache and what's not?
That's a very good question. There are a number of different controls we have for what can or can't be in cache. One of them is that customers can set their own cache-control directives, just as part of the HTTP headers that they send us, and then we also allow people to configure it through a number of different config options we have in the Cloudflare dashboard.
So when you actually set up a zone on Cloudflare, and you have, let's keep it really simple, a web server, and that web server is responding, it can sort of put a little sticker on the content and say, by the way, you're allowed to cache this for so long, or, by the way, don't ever cache this, because it could be wrong the second you keep a copy and you won't know that I've just changed it.
Right. And one thing is, sometimes people aren't actually able to change these stickers, for whatever reason, so we allow people to change those, to put new stickers on, through Cloudflare.
And we're using a stickers analogy for the non-technical folks, but really we're talking about an HTTP header here, something that is actually on the response, that Cloudflare's cache proxies know what to do with.
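To make the "sticker" concrete: the directives below are standard Cache-Control values an origin can attach to its responses. This is a minimal sketch using the Web-standard Response API in a Workers-style handler, with illustrative paths and TTLs; it is not Cloudflare's code.

```typescript
// Minimal sketch of an origin putting "stickers" (Cache-Control headers) on responses.
// Paths and TTL values are illustrative only.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);

    if (url.pathname === "/logo.jpg") {
      // Shared caches (like a CDN) may keep this copy for an hour.
      return new Response("...image bytes...", {
        headers: { "Content-Type": "image/jpeg", "Cache-Control": "public, max-age=3600" },
      });
    }

    if (url.pathname === "/account") {
      // Personalized content: never keep a copy anywhere.
      return new Response("...account page...", {
        headers: { "Content-Type": "text/html", "Cache-Control": "private, no-store" },
      });
    }

    return new Response("Not found", { status: 404 });
  },
};
```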
What other kinds of controls, Alex, do we give our customers? Like, what else is it that they want to be able to do? Some of this is baked into the definition of the Internet; some of this has nothing to do with Cloudflare, it's built into how web servers talk to clients. They also tell you, you should cache this, you shouldn't cache this. But what other kinds of controls has Cloudflare put on top of that that our customers care about?
It's a really good question. There are a number of different controls that Cloudflare makes really easy for customers to configure for their cache. One of those is what we mentioned: how long something should stay in cache before it's assumed that it's wrong, or incorrect, or not fresh anymore, and that's generally called TTL, which stands for time to live, time to live in the cache. And so we give a number of different ways for customers to replace and override headers to say how long something should have a TTL for, how long it should live in cache, both in our cache and in the browser cache as well.
So let's make this really concrete. Let's suppose I'm building a web application, and like most web applications, when a user logs in there's a little profile picture of me in the corner, and if I hover over that, it's not going to say usman.jpg, it's just going to say something like user.jpg, right? So how do we make sure that we didn't actually cache my picture, so that when you log in, you don't see my ugly mug in the top corner of your profile page? How do we give our customers the ability to say, wait a minute, this has a generic-sounding name like user.jpg, but that's very different from logo.jpg? Logo.jpg, great, we don't change our logo every time someone logs in, but the user photo might really be a different human being. So how does a customer even go about doing this?
I think what you're referring to is called a cache key. Cache keys are how we take a number of different elements of those HTTP headers and combine them together, and when that combination happens, we try to find a match in cache to see if there's anything there that's still fresh, that still has a good TTL, that we can return. If there isn't a match in cache, then we have to go back to the origin, get a fresh copy, populate the caches, and return that to the visitor. And so that's how we do that.
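As a rough sketch of the cache key idea Alex describes, not Cloudflare's actual (and configurable) key format: the key is just a combination of request elements, and which elements you include decides whether everyone shares one cached copy or each user gets their own. The fields chosen here are assumptions for illustration.

```typescript
// Conceptual sketch: a cache key built from selected parts of the request.
interface KeyParts {
  host: string;
  path: string;
  query?: string;
  // For per-user assets you might fold in something that differs per user,
  // e.g. a session identifier (purely illustrative).
  userToken?: string;
}

function cacheKey(parts: KeyParts): string {
  return [parts.host, parts.path, parts.query ?? "", parts.userToken ?? ""].join("|");
}

// logo.jpg: same key for everyone, so one cached copy serves all visitors.
const logoKey = cacheKey({ host: "example.com", path: "/logo.jpg" });

// user.jpg: the per-user token keeps Usman's photo from ever matching Alex's lookup.
const usmanKey = cacheKey({ host: "example.com", path: "/user.jpg", userToken: "session-abc" });
const alexKey = cacheKey({ host: "example.com", path: "/user.jpg", userToken: "session-xyz" });

console.log(logoKey, usmanKey !== alexKey); // distinct keys mean no collision
```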
The opposite of that is kind of the scary part about running a cache. It's called a cache collision, and that's when, it even sounds scary, the Usman jpeg in the corner would be an Alex jpeg or something, and you'd be like, oh no.
Serving the wrong content to the wrong eyeball is the ultimate mistake. You cannot, you will never want to, in the name of performance, ever make that mistake, so that is a hard no. You'll always pay the performance hit if that's the only way to guarantee that you don't have a cache collision, right? And so an enormous amount of Aki's engineering team's testing is making sure that there's just no way we can serve the wrong content. But here's the other interesting thing, though. Let's say, okay, forget the case of the Usman and Alex jpegs in the corner. Let's say I'm the IT person for my company's website. New CEO, I'm putting up the bio page for my new CEO, and I fat-fingered the spelling of her last name. And so now thousands of computers around the world, 200 data centers around the world, have cached a misspelled name of my new boss. What do I do? Now what? How do I tell Cloudflare's cache, forget it, whatever you've learned, please unlearn it and come back to me again, because you've got the wrong spelling of the content? What do customers do for that problem right there?
We've built a massively distributed purge system that takes in a request from our API and then distributes it out to our network.
And so purge is literally a button that our customers can press that says, ignore what you've cached.
Exactly right.
And it's not even just one button, right, Alex? It's got multiple different ways of letting them use this. We let people purge by path, purge by host, purge by prefix. So talk a little bit about that. Where's all that coming from?
Yeah, so it comes from these various degrees of blast radius for specific types of assets that you, as a webmaster, may want to purge. You organize your web page in a directory format or something else, and so you may want to purge only one specific asset. Again, if we're going to continue to beat the Usman jpeg example, get that thing off of the network, for God's sake, you can purge specifically by that URL: all data centers, usman.jpg, we can ignore that. And so that's a really, really surgical strike for a purge. We can also purge by directory, and so if we stored a bunch of management jpegs or something together in that directory folder, we call it a purge by prefix, and so you can purge all the jpegs that you have in a management directory by sending that request through, and it'll go to all data centers and purge all of the jpegs in that directory. We also have a purge by tag, which is a really powerful and interesting feature that allows the webmaster to tag various assets and then send purge requests looking for those tags to get rid of them. And we also have a purge by hostname, which allows you to purge a given hostname's assets.
It's like going to the grocery store and looking for peanut butter. There are many different flavors of purge; we've got every different shape and size of addressing this.
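For readers who want to see those flavors of purge side by side, here is a sketch of requests to Cloudflare's purge endpoint. The zone ID, token, URLs, and tags are placeholders, and some of these options (tags, prefixes, hostnames) are only available on certain plans; treat the details as illustrative rather than a complete reference.

```typescript
// Sketch: the different flavors of purge via the Cloudflare API.
// ZONE_ID and API_TOKEN are placeholders for your own values.
const ZONE_ID = "your-zone-id";
const API_TOKEN = "your-api-token";

async function purge(body: object): Promise<void> {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/purge_cache`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  );
  console.log(res.status, await res.text());
}

// The surgical strike: purge one exact URL everywhere.
await purge({ files: ["https://example.com/images/usman.jpg"] });

// Purge by prefix: everything under the management directory.
await purge({ prefixes: ["example.com/images/management/"] });

// Purge by tag: anything the origin labeled with this Cache-Tag value.
await purge({ tags: ["management-photos"] });

// Purge by hostname.
await purge({ hosts: ["images.example.com"] });
```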
And it's important to note, actually, Aki, if we purge too much, or if a customer even accidentally purges too much, it can actually cause problems for them, because all of a sudden hundreds of Cloudflare data centers will come storming back to the origin and be like, hey, we want content, we don't have a copy of this, the eyeball's looking for it, we need it. And you can wind up really overwhelming the origin, which is kind of the opposite of the whole point of our cache. So part of what our cache added is a tiered cache to avoid that. Can you talk a little bit about what tiered cache is trying to solve?
Sure. Tiered cache is our system where, if we check one of our caches and it doesn't have the asset, instead of going directly back to the origin, we'll first check one of our larger caches and see if the asset is in cache there, to avoid going unnecessarily back to origin.
So that means we actually let our customers think through, and give them some control over, which data centers are closer to or bigger relative to their origin, so that if one of the farther-out data centers doesn't have the asset, it consults one of the tier-one data centers for that origin. That gives them a lot of control. So, Aki, how does purge work? What are some of the challenges we have to solve in trying to get 10,000 computers to suddenly forget that they have a copy of an image? How does that actually work? What did we have to build to solve that?
I think one of our biggest challenges in building out this purge system is handling when one of our data centers goes down, right? What do we do with those purge requests? Where do we store them? How do we make sure that when the data center comes back online, it isn't still holding assets that should have been purged?
Yeah, and it could be a totally normal, good reason for it to go down. Maybe it's just offline for maintenance, they're replacing a hard drive, they're doing something completely mundane, but it wakes back up and it doesn't have the right information anymore. Then, if it's asked to serve the about page for the CEO, it's going to have the wrong spelling again. So what do we do about that? How does our purge system guarantee against that?
So in our purge system we actually use different queues: for whichever colo has gone offline, we'll store those purge requests so we can distribute them once the data center comes back online.
So our purge system is actually in cahoots with the state of the network, so that it knows how and when to send that kind of request based on whether a data center was offline or not.
Exactly.
That's pretty great.
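The queue-and-replay idea Aki describes can be pictured roughly like this. This is a toy model of the concept only, not Cloudflare's purge pipeline: purges destined for an offline colo are held and replayed once it reports healthy again.

```typescript
// Toy model: hold purges for offline data centers, replay them when they return.
type Purge = { url: string; issuedAt: number };
type Colo = { online: boolean; apply: (p: Purge) => void };

class PurgeDistributor {
  private pending = new Map<string, Purge[]>(); // colo name -> queued purges

  constructor(private colos: Map<string, Colo>) {}

  purge(p: Purge): void {
    for (const [name, colo] of this.colos) {
      if (colo.online) {
        colo.apply(p); // drop the asset from that cache right away
      } else {
        // Queue it so the colo doesn't wake up serving content that should be gone.
        const queue = this.pending.get(name) ?? [];
        queue.push(p);
        this.pending.set(name, queue);
      }
    }
  }

  markOnline(name: string): void {
    const colo = this.colos.get(name);
    if (!colo) return;
    colo.online = true;
    for (const p of this.pending.get(name) ?? []) colo.apply(p); // replay missed purges
    this.pending.delete(name);
  }
}
```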
So, another interesting possibility here. We always say that Cloudflare's mission is to help build a better Internet, and the three adjectives that come out of that sentence right away are faster, safer, more reliable. So here's faster, for sure, right? We help make it faster. And there's an aspect of more reliable, which is that we take a lot of burden off the origin, so it's way easier to keep your origin up when it's not getting hammered all the time. But there's also reliable in the sense of, what happens if the origin is completely off? Can we still serve the entire web page? And we had a feature, from long before any of the three of us were at Cloudflare, called Always Online. I'd like to talk a little bit about Always Online: what was awesome about it and what was not so awesome about it.
Yeah, Always Online was a fantastic original idea from the founders of Cloudflare, and it is a little bit like an insurance policy for your website, for when we couldn't reach your origin. So what do we do then? Instead of returning an error page, which is not necessarily the best thing for your customers, you know, they want to see content, they don't want to just be told, hey, we're having problems right now, we would look at the specific content in cache. We've spoken before about TTLs. When something no longer has a fresh TTL, it's considered stale or expired, but it doesn't necessarily leave cache immediately when that happens; it sticks around for a little bit of time until it eventually falls out of cache. So in the situation where we're not able to reach the origin, Always Online would first look to those stale and expired assets and serve those, with a little banner warning people, hey, the origin is having problems, this is stale content, it might be a little bit old or out of date, but at least it's something.
It's better than not seeing anything at all, right?
Absolutely. And so that was the idea. Then over time, we also began doing some crawling of various websites and storing some static pages in certain data centers. That worked well for the most part, but it became less of a great place to store content. We needed a place where we could definitely rely on the crawled assets when we couldn't find something that was stale or expired, and so we started talking to the Internet Archive and their crawler and storage system.
And they've been building crawlers and archive mirrors, those guys are really into this problem, they've been invested in this racket for 20 years: how do you archive an ancient web page in a way that you can browse decades later?
Absolutely, yeah. It's really their core competency, they're really great at it, and they've been a fantastic partner for us to work with in building that project out. It's been well received, it's been pretty well adopted, and so we're really hoping to make vast portions of the Internet more reliable and resilient with that. It's a really cool project that we get to work on here at Cloudflare.
Let's just double-click on this for one second. So how does it work now? What do we serve, what does the Internet Archive serve, and how are these two connected? I know you wrote a great blog post on this, our viewers should go check out blog.cloudflare.com, but give us the two-sentence summary here one more time.
Yeah. So a request comes in from a visitor, Cloudflare tries to reach the origin, and it's unable to reach the origin. The first thing it does is look in the cache to see if there's stale or expired content that we can serve. If nothing's there and we still can't reach the origin, then we reach out to the Internet Archive for the specific page that is being requested, and the Internet Archive returns that to the eyeball.
Which the user could have done anyway, and so really we're just streamlining what you might have done yourself if you couldn't get to the site, and connecting to really, really great services.
So the eyeball has as good an experience as they can have, given the state of whatever the system and the issue is.
Yeah, absolutely. That's great.
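As a very rough sketch of the flow Alex just summarized, origin first, then stale cache, then the Internet Archive as a last resort. This is conceptual, not Cloudflare's implementation; the stale-cache lookup is a stand-in interface, and the Wayback Machine availability endpoint is shown only as an illustration of asking the Archive for a snapshot.

```typescript
// Conceptual sketch of the Always Online flow.
async function serveWithAlwaysOnline(
  request: Request,
  cache: { matchEvenIfStale: (req: Request) => Promise<Response | null> }, // stand-in
): Promise<Response> {
  // 1. Try the origin.
  try {
    const origin = await fetch(request);
    if (origin.ok) return origin;
  } catch {
    // Origin unreachable; fall through to the fallbacks.
  }

  // 2. Serve a stale or expired copy if one is still lying around in cache.
  const stale = await cache.matchEvenIfStale(request);
  if (stale) return stale;

  // 3. Ask the Internet Archive's Wayback Machine for a recent snapshot.
  const lookup = await fetch(
    `https://archive.org/wayback/available?url=${encodeURIComponent(request.url)}`,
  );
  const data = (await lookup.json()) as {
    archived_snapshots?: { closest?: { url?: string } };
  };
  const snapshot = data.archived_snapshots?.closest?.url;
  if (snapshot) return fetch(snapshot);

  // 4. Nothing left to serve.
  return new Response("Origin unavailable", { status: 503 });
}
```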
Okay, I want to go back to purge for a second, because there's one other thing I want to ask you about. One of the really interesting things about Cloudflare is that we have the ability to tell our edge new information very quickly. So when new websites sign up, all those 10,000 computers around the world can be informed, in less than two seconds I think was the last stat I saw, that a new website has signed up. Ditto for firewall rules, DDoS rules, all this kind of stuff. And our purge system actually, architecturally, has two very big chunks of different systems under it: there's one which is an actual purge request, and then there's another thing called flex purge. So can you just talk for a second about the difference between a purge request and a flex purge, and why we have two almost radically different ways of solving the different flavors of purge that Alex was talking about earlier?
Sure. So our purge by URL, that purge system, when it gets a request, will go directly out to our edge, if the colo is online. That one, like Alex said, is a little bit more surgical; you need to know the exact URL. Our flex purge options, which are a lot more flexible because you can do host or tag or prefix, those are actually distributed out to our edge using a key-value store that a lot of our other services use as well, so it gets put out to the edge within a few seconds of us getting it.
And so that one acts more like a rule that then checks the request and says, wait a minute, this matches this pattern, so I'm going to not serve it.
Exactly.
Got it. And so it's interesting: it doesn't actually remove the content, but it's a rule that makes sure the content isn't served because it matches, almost like an 11th-hour check against it.
Right. The first kind of purge that I was mentioning, the purge by URL, that one will remove the content from our cache, whereas this one doesn't actually remove the content; it just checks to see whether such a rule exists or not.
Right. When this was first explained to me, I thought, wow, that's really interesting, they're completely different kinds of systems. One is almost like, no, actually remove the asset, and the downstream effect is it won't show up. Whereas the other one is, let the normal process go, but right as you're about to serve it, go, no, wait a minute, it matches a filter, it matches something that the user wants us not to serve. And that's how, with a simple rule, we can effectively purge whatever we need to, really complex stuff, including those purge by tag requests that Alex was talking about.
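A rough way to picture that 11th-hour check, purely as a conceptual sketch and not how Cloudflare's flex purge is actually implemented: the cached bytes stay where they are, and a small set of distributed rules is consulted right before serving; a cached copy older than a matching rule is treated as a miss.

```typescript
// Conceptual sketch of flex purge as serve-time rules.
type PurgeRule =
  | { kind: "prefix"; value: string; issuedAt: number }
  | { kind: "host"; value: string; issuedAt: number }
  | { kind: "tag"; value: string; issuedAt: number };

interface CachedEntry {
  url: string;
  host: string;
  tags: string[];
  storedAt: number;
  body: string;
}

function isPurged(entry: CachedEntry, rules: PurgeRule[]): boolean {
  return rules.some((rule) => {
    if (rule.issuedAt <= entry.storedAt) return false; // fetched after the purge; still fine
    if (rule.kind === "prefix") return entry.url.startsWith(rule.value);
    if (rule.kind === "host") return entry.host === rule.value;
    return entry.tags.includes(rule.value); // kind === "tag"
  });
}

// Right before serving: if a newer rule matches, pretend it was never in cache.
function serveFromCache(entry: CachedEntry, rules: PurgeRule[]): string | null {
  return isPurged(entry, rules) ? null : entry.body;
}
```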
So on the subject of other really interesting new stuff we've done, Alex, we've also done work on Vary. Now, Vary is built into the HTTP spec, so just talk for a second: what is Vary, and why does it start to affect a CDN? Why do we have to worry about Vary?
Yeah, Vary is kind of an interesting use case. It is a response header. I know we've talked a little bit about requests, but response headers are the different type of HTTP headers that come back in answer to the request. Vary is a header that signifies, hey, the content can change in various capacities, you can serve variants of it, which is, I guess, where the term Vary comes from.
Why would you want to do that? Let's make this concrete. Why would a website decide that, based on something, I might want to give you something different? What were they trying to solve?
So a lot of people access the same types of content, the same websites, from different browsers, from different devices, from different places. And so you might want to vary on a lot of different things depending on where your visitors are coming from, what type of browsers they have, what language they're reading your content in. You can vary across a lot of different things. For example, if you're looking at a web page, a blog with a lot of images, on a browser like Google's Chrome versus a Firefox browser, those different browsers might have different image optimizations that have been put in place in their engineering, and so you want, as the webmaster, to serve the image that's best optimized for the viewing experience of your visitor. And so you might want to vary based on those things. Frequently that sort of content negotiation happens when the browser sends the request and says, hey, these are my preferences for image optimizations, and so, as somebody who's going to be returning a variant of an image, you would look to see which is the preferred image for that browser.
And so if we didn't add first-class support for this, we might naively return whatever the first client had requested and we had cached, and effectively undo the optimization that the client was actually looking for and that the webmaster wanted us to honor. So what did we have to build, what do we have to support, so that this works right?
So we built support for looking at the Accept headers and normalizing those, so that we could avoid some of the cache collision problems we talked about earlier, make sure we're doing the best thing for the browser, for the client, and return the right variant. That is something that's relatively new for us and something that we're working on, and it's pretty exciting that we're going to be able to do some great optimizations for clients and the content they're trying to view.
Yeah, awesome. So it all comes down to really getting that cache just right, so that you can maximize reuse of the assets you've cached as much as possible, but still always serve the best possible asset without serving a copy of something you didn't intend to.
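To make the image example concrete, here is a small sketch of varying a response on the Accept header and normalizing that header first, so the many slightly different Accept strings browsers send collapse into a few buckets instead of fragmenting the cache. The buckets and payloads are assumptions for illustration.

```typescript
// Sketch: pick an image variant from Accept, normalize Accept into a few buckets,
// and tell downstream caches the response varies on it.
function normalizeAccept(accept: string | null): "avif" | "webp" | "jpeg" {
  if (accept?.includes("image/avif")) return "avif";
  if (accept?.includes("image/webp")) return "webp";
  return "jpeg";
}

function serveImage(request: Request): Response {
  const variant = normalizeAccept(request.headers.get("Accept"));
  const body = `...${variant} bytes for /blog/header...`; // placeholder payload

  return new Response(body, {
    headers: {
      "Content-Type": `image/${variant}`,
      "Cache-Control": "public, max-age=86400",
      Vary: "Accept", // the cached copy is only reusable for the same (normalized) Accept
    },
  });
}
```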
And just the last topic I want to talk about, Aki. Some of the engineers on your team are some of the world's experts on TCP optimization, because the other thing a cache has to do is actually reach out to an origin and pull information. Can you give an example of a kind of problem they had to solve to really optimize that connection, when the Cloudflare edge has to talk back to an origin? What were some of the challenges they were trying to optimize for?
So one challenge we have tried to solve is that we don't want to be opening a ton of connections to each one of our origins, so we've done some things to optimize our connection reuse when we're connecting to origin, instead of overwhelming theirs.
And it's conservative on resource use on our side as well.
Exactly.
And with that, I'd encourage our viewers to check out posts from the TCP and cache teams on the Cloudflare blog, and Alex's posts about some of the more recent stuff there. And I can't believe this, but we are at time. So I want to thank both of you, Aki and Alex, it was so great to talk to you, and we will definitely have you back on in a couple of months to tell us about all the other stuff you haven't told me about that you're working on, and to bring some of the engineering team with you. So thank you both for spending this time and talking a little bit about Cloudflare's cache and the CDN with us.
Thank you so much. Thanks. Thank you. Great day, everybody. Goodbye.
Hi, we're Cloudflare. We're building one of the world's largest global cloud networks to help make the Internet more secure, faster, and more reliable. Meet our customer Wongnai, an online food and lifestyle platform with over 13 million active users in Thailand. Wongnai is a lifestyle platform, so we do food reviews, cooking recipes, travel reviews, and we do food delivery with LINE MAN, and we do POS software that we launched last year. Wongnai uses the Cloudflare content delivery network to boost the performance and reliability of its website and mobile app. The company understands that speed and availability are important drivers of its good reputation and ongoing growth. Three years ago we were expanding into new services like a chatbot. We were generating images dynamically for the people querying the chatbot. Now, when we generate images dynamically, we need to cache them somewhere so they don't overload our server. We turned to a local CDN provider that could give us a caching service in Thailand for a very cheap price, but after using that service for about a year, I found that the service was not so reliable. We turned to Cloudflare, and for the one year that we have been using Cloudflare, I would say they achieved the reliability goals that we were expecting. With Cloudflare we can cache everything locally, and the site is much faster. Wongnai also uses Cloudflare to boost their platform security. Cloudflare has blocked several significant DDoS attacks against the platform and allows Wongnai to easily extend protection across multiple sites and applications. We also use web application firewalls for some other websites, which allows us to run open source CMSes like WordPress and Drupal in a secure fashion. If you want to make your website available everywhere in the world, and you want it to load very fast, and you want it to be secure, you can use Cloudflare. With customers like Wongnai and over 25 million other Internet properties that trust Cloudflare with their performance and security, we're making the Internet fast, secure, and reliable for everyone. Cloudflare: helping build a better Internet.