Dogfooding Workers at Cloudflare

Presented by: Rita Kozlov

Originally aired on June 9, 2020 @ 3:00 PM - 3:30 PM EDT

Best of: Cloudflare Connect 2019 - UK

Rita Kozlov, Product Manager for Cloudflare Workers, shares how Cloudflare leverages the Cloudflare Workers serverless development platform internally, for a broad range of use cases.

Cloudflare Connect

Transcript (Beta)

Hello, everyone. So today I'll be talking to you guys about dogfooding Workers at Cloudflare. My name is Rita Kozlov. I'm the product manager for Workers. So Cloudflare Workers, Stefan did a very nice intro for me. And if you guys were here for the earlier sessions with JGC, you might already be familiar with it. But Cloudflare Workers is Cloudflare's serverless platform offering. So we've taken our edge and our growing network of 180 data centers and we figured out for you guys how you can run code out of every single one of them.
But this is a lot of new concepts for a lot of people, right? Serverless is a relatively new paradigm, and it's the concept of developers not having to think about the servers where applications are running, but instead thinking about the code. And this is fairly new because not too long ago, we were thinking about virtual machines and containers. Serverless at global scale is even newer, right? Some of you have wrapped your head around the idea of running code in a single location. Some more advanced use cases call for load balancing between two origins. But running code out of 180 data centers around the world is kind of tough to wrap your mind around.
So what I'm very, very often asked, and I think that was a question that JGC, John Graham-Cumming, was asked earlier on stage today as well, is how do I get started with Cloudflare Workers? Whenever I think about this question, I think about Gall's Law, which is the idea that any complex system that works was invariably developed out of a very simple system that worked. People are very bad at designing complex systems from the get-go. Getting a lot of parts of a complex machinery right is very, very difficult. So what you have to do is start simple, right? Start from components that work and then grow from there on out.
So today I'm going to tell you guys the story of dogfooding at Cloudflare with Cloudflare Workers and how we've learned to adopt Workers step by step at Cloudflare. For those of you who aren't familiar with the concept of dogfooding, dogfooding is the idea of consuming your own products. So rather than there being one version of the product that we use, the human food, and another product that our customers use, the dog food, we get to consume our own products and feel the pain of our customers and develop empathy for them. So just like we didn't send you guys to a separate line during lunch today, we ate the same food as you guys, we do the same thing for our products.
So Cloudflare's journey into Workers also started in a few different pieces, where we started working on Workers for some pretty simple use cases, but over time they developed, and from each one we learned more and more to apply to the next. And I'm hoping you guys will take some of our learnings with you today and think about how you can apply them within your own organization.
So I'll start with the first use case. This was basically the very first worker that started running in production at Cloudflare: deprecating old TLS. So we had this requirement. A lot of you guys are familiar with PCI; it's a standard for anyone who accepts payments on the web, right? And old versions of TLS served us really, really well, but at a certain point we decided that they were vulnerable and not secure enough. So to be able to accept payments from you guys, it became a requirement for us to only support versions of TLS that were 1.2 or above.
The challenge for us is that Cloudflare offers many different services. If you've logged into our dashboard before, you've probably seen all of these icons. Each service is managed by an entirely different team, and each team has full flexibility to run whatever tech stack they want.
So what we end up with are lots and lots of services, all running their own unique tech stacks. So one way that we could have gone about solving this problem is to go to each and every one of these teams and say, hey, we're sorry, but you have to put your current project on the backlog, this is a high priority for us for PCI compliance, you need to upgrade your tech stack to decline older versions of TLS. This, of course, would have been very challenging and required someone to constantly monitor and check in, and it would have impeded our productivity since we would have been focusing on this instead of building new features.
The nice thing about Workers is that workers sit between the eyeball and your origin, or in our case, origins, plural, right? So we're able to set up a proxy and disable the old versions of TLS at the worker level. So instead of doing it for every single service, we could do it just once, at api.cloudflare.com, by setting up a worker there.
The worker itself is actually incredibly simple. This is the whole worker, this is not a snippet of a worker. So I'll walk you guys through it pretty quickly. We have the event listener that listens on a fetch, so for any HTTP request that comes in, we take a look at it and we respond with the SSL block function. We look at the TLS version that's passed in the request.cf object, and if it's not 1.2 or 1.3, we respond with a 403; otherwise we fetch the request from the origin and continue business as usual. This is all of, like, eight lines of code without comments. So this is super, super simple. Any of you guys can do this. And it got us through the first hump of writing a worker.
So this was a great use case for us. It helped unify many different services, and we took no performance hit for it because we already had Cloudflare in front of our services.
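The worker walked through above can be sketched roughly as follows. This is a minimal TypeScript sketch, not the code shown on the slide: the `RequestLike` and `ResponseLike` shapes and the injected `forward` function are stand-ins introduced here so the logic can run outside the Workers runtime, and the 403 message text is an assumption. The version strings follow the format the Workers runtime reports in `request.cf.tlsVersion`.

```typescript
// TLS versions that are still accepted; anything else is refused.
const ALLOWED_TLS_VERSIONS = new Set<string>(["TLSv1.2", "TLSv1.3"]);

// Minimal stand-ins for the Workers Request/Response types (assumption:
// only the fields this sketch needs are modelled).
interface RequestLike {
  cf?: { tlsVersion?: string };
}
interface ResponseLike {
  status: number;
  body: string;
}

// Pure check: is the negotiated TLS version one we still accept?
function isTlsAllowed(tlsVersion: string | undefined): boolean {
  return tlsVersion !== undefined && ALLOWED_TLS_VERSIONS.has(tlsVersion);
}

// Refuse old TLS with a 403, otherwise pass the request on to the origin.
// `forward` stands in for the runtime's fetch(request).
async function sslBlock(
  request: RequestLike,
  forward: (req: RequestLike) => Promise<ResponseLike>,
): Promise<ResponseLike> {
  const tlsVersion = request.cf?.tlsVersion;
  if (!isTlsAllowed(tlsVersion)) {
    return { status: 403, body: "Please use TLS version 1.2 or higher." };
  }
  return forward(request);
}

// Inside a Worker, the wiring would look roughly like this:
//   addEventListener("fetch", (event) => {
//     event.respondWith(sslBlock(event.request, fetch));
//   });
```

Keeping the version check in a pure function means it can be exercised without a live request, while the listener wiring stays a one-liner.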
Because Cloudflare runs in 180 data centers around the world, having Cloudflare in front can, if anything, actually accelerate the connection if you're using things like Argo to route to your origins.
The next case study that I'm going to talk about is Access on Workers. So Cloudflare Access is one of our products. Can I get a quick show of hands for how many of you are familiar with Access or have used Access? Nice. So quite a few of you. So at Cloudflare, we practically don't use a VPN anymore. Access is our gateway into most of our internal applications. So every day when I go to the wiki, this is what it will look like for me, or something like this, where I'll type in wiki.Cloudflare, our internal URL, and it will take me to the sign-in page through Google. I'll select my account, and then I'm able to access the wiki.
So I'm going to dive a little bit deeper into what it looks like behind the scenes, step by step. So the first URL that I'm going to hit is wiki.mycompany.net, and that will issue a 302 to the auth domain that you selected during the sign-up process, at the /login path. The login will then respond to you with a 200 and a link to your identity provider. So let's say you choose Google. You're then going to go to Google, which will then redirect you yet again to the auth domain, this time to the callback path. So now we're on the callback path, and now we're finally able to get the auth token back from Cloudflare and set it as a cookie so that we can finally access the website that we were going to in the first place.
The challenge here is that just for the login flow, we have to go back and forth between our core service and the user three different times. So I just flew in from San Francisco yesterday. It took a long time. Making that round trip three times, as you guys can imagine, adds quite a bit of lag.
So what we're able to do with Workers is, when we're all here in London, we can actually connect to the London PoP that's here, and from the edge, we can generate the token and start using the website right away. So these three very long round trips have all of a sudden become three incredibly short round trips that are literally within milliseconds of us. Now the only bottleneck that we have is the identity provider, and if you're using Google, they're probably in quite a few locations as well.
One of the other options that you have when you're using Access is to use a single-use nonce. So that was another bottleneck for us, because previously, to get that nonce generated, yet again, you would go back to one of our core data centers, which are not the same as the edge data centers, right? There are just inevitably fewer of them. So what we were able to do is, yet again, use Workers to generate the nonce on the fly and store it in KV instead.
Just to give you guys some insight into how we've worked on this iteratively too: originally we used the Cache API, which is also a key-value store, but it's ephemeral, and it only stores data in the location where you accessed it. The reason that this approach was problematic for us is that if you're traveling, you might hit one PoP one time, and then another time, instead of the London PoP, you might connect to, let's say, the Frankfurt PoP, right? But using Workers KV, which is our distributed key-value store, we're able to propagate that nonce to all of our data centers, and you're able to access it from any location.
So one of the approaches that the Access team took is to split out the logic for each of the endpoints. So each endpoint gets its own worker.
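The idea of generating a nonce at the edge and storing it in KV can be sketched as below. This is an illustrative TypeScript sketch under stated assumptions, not the Access team's implementation: the `KVLike` interface mirrors the `get`, `put` (with `expirationTtl`), and `delete` calls of a Workers KV namespace, while `MemoryKV`, the `nonce:` key prefix, the five-minute TTL, and storing a user ID as the value are all hypothetical choices made for the example.

```typescript
// Subset of the Workers KV namespace API used by this sketch.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
  delete(key: string): Promise<void>;
}

// Assumed lifetime for a nonce, in seconds (hypothetical value).
const NONCE_TTL_SECONDS = 300;

// Create a nonce and store it so that any data center can look it up later.
// `makeNonce` is injected; in a Worker it would wrap a secure random source.
async function issueNonce(
  kv: KVLike,
  userId: string,
  makeNonce: () => string,
): Promise<string> {
  const nonce = makeNonce();
  await kv.put(`nonce:${nonce}`, userId, { expirationTtl: NONCE_TTL_SECONDS });
  return nonce;
}

// Look the nonce up and delete it so it is not accepted a second time.
// Returns the stored user ID, or null if the nonce is unknown or already used.
async function consumeNonce(kv: KVLike, nonce: string): Promise<string | null> {
  const key = `nonce:${nonce}`;
  const userId = await kv.get(key);
  if (userId === null) return null;
  await kv.delete(key);
  return userId;
}

// In-memory stand-in for a KV namespace, for running the sketch locally.
// It ignores expirationTtl; a real namespace would expire the key.
class MemoryKV implements KVLike {
  private data = new Map<string, string>();
  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }
  async put(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
  async delete(key: string): Promise<void> {
    this.data.delete(key);
  }
}
```

In a deployed Worker, the `kv` argument would be the bound KV namespace rather than `MemoryKV`, and the nonce would come from a cryptographically secure generator.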
Giving each endpoint its own worker means lower-risk deployments, because you can operate on different parts without necessarily impacting the rest of the service, and it allowed individuals to work on different parts of the service and kind of split out the work without having to do all of this at once.
The other outstanding question as the Access team was doing this was, what do we do about logging? So you can use the waitUntil function, which will basically call out asynchronously while your worker serves the response. So that way the end user doesn't have to wait on you to send the log line somewhere while they're authenticating. So we're able to use this for audit logs, where if you log in to your account in Access, you'll be able to see everyone who's used the service, and we're able to use it for our own debugging as well, by logging to our Sentry logs.
So the result that we saw was improved performance, right, because we're not making these insane round trips all the time. Improved reliability by having fewer points of failure, right? No matter how many core data centers we get, 180 data centers is obviously going to be much more resilient. And we're able to try different approaches and quickly iterate.
The third case study that I'm going to talk through is building the reservation system for workers.dev. So, so far we had a use case where we augmented traffic, we had a use case where we enhanced the traffic, but now we're going to build something new. So I'm not sure how many of you guys have seen this page before, but in February we pre-announced the service, so you could sign up and register for subdomains of workers.dev. We had a few pretty simple requirements, right? You put in your subdomain, you put in your email, you click reserve, and you expect it to work. We wanted to originally limit the reservations to one per email address, right? So we didn't want someone to take everyone's phone and steal every first name out there for themselves.
Another requirement was that only one person is allowed to grab a given subdomain. We didn't want any unpleasant surprises where two users were guaranteed the same thing, but in the end only one of them was able to have it. And the other thing that was really important to us was the ability to blacklist a few key subdomains. So we didn't want this to be used for phishing, and we needed to be able to blacklist things like admin.workers.dev or SSH.
Some of the challenges that we were facing as we were thinking about this: we didn't know how many signups we were going to get or what the traffic spike was going to look like. With services like this, it's often a bit of a grab bag and everyone's trying to register. The other thing was we knew that this was going to be a temporary service up until we launched the actual workers.dev service. We didn't want something cumbersome or that we would have to maintain long term. The other thing was we wanted to launch this within a month, and so we needed to get something working and up and running very, very quickly.
Again, the flow looks like this. You reserve a subdomain, you receive an email, you need to confirm it, and then once it's confirmed, it's been reserved so that you can later claim it. We used two workers for this: one to reserve the actual subdomains and the other to validate your email.
The worker to reserve the subdomains looked like this. When you clicked on reserve, you hit a worker that was running on workers.dev/reserve. We then used Firestore to store the actual reservations. As a quick note, we thought about using KV versus Firestore, since we already have an offering for this. Because of the eventually consistent nature of KV, it does allow for the possibility that if I'm in London and someone else is in San Francisco and we both register the same subdomain at the same time, we'll both be guaranteed it originally, but in the end obviously only one of us gets it. We needed a centralized data store.
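The reservation rules listed above (blocked names, one owner per subdomain, one reservation per email) can be sketched as a single check-and-save step. This is a hypothetical TypeScript illustration rather than the worker that was deployed: `ReservationStore` is an in-memory stand-in for the centralized data store, and every name in it is invented for the example. In a real deployment the check and the write would need to happen atomically in that central store, which is exactly the guarantee an eventually consistent store cannot give.

```typescript
type RejectReason = "blocked" | "taken" | "email-already-used";
type ReservationResult = { ok: true } | { ok: false; reason: RejectReason };

// In-memory stand-in for the centralized data store (hypothetical shape).
interface ReservationStore {
  blocked: Set<string>;                // names that may never be reserved
  ownerBySubdomain: Map<string, string>; // subdomain -> email that holds it
  emailsWithReservation: Set<string>;  // emails that already reserved a name
}

// Apply the three rules in order, then record the reservation.
function reserveSubdomain(
  store: ReservationStore,
  subdomain: string,
  email: string,
): ReservationResult {
  // Normalise so "Rita" and "rita" count as the same name.
  const name = subdomain.trim().toLowerCase();
  const owner = email.trim().toLowerCase();

  if (store.blocked.has(name)) {
    return { ok: false, reason: "blocked" };
  }
  if (store.ownerBySubdomain.has(name)) {
    return { ok: false, reason: "taken" };
  }
  if (store.emailsWithReservation.has(owner)) {
    return { ok: false, reason: "email-already-used" };
  }

  store.ownerBySubdomain.set(name, owner);
  store.emailsWithReservation.add(owner);
  return { ok: true };
}
```

Because every request goes through one store, two people asking for the same name at the same moment cannot both be told they got it, provided the store performs the check and the write as one operation.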
We chose Firestore for this. I think this actually really nicely tells the story of workers being able to connect to any property on the Internet. You can use our products, and we've tried to build them out and make it as easy as possible for you to use them, but you don't have to. The worker would have to authenticate to GCP, because we didn't want anyone else to write to or read from it. It needed to check to make sure the subdomain wasn't taken, that it wasn't blacklisted, and then save the reservation. You can find the worker that we used for the GCP authentication part in this blog post that I've referenced. But as you can see, it's pretty easy for us to assemble the data that we have into JSON, and then we use the node-jose library to actually create the JWT token.
The second worker that we used was the worker to verify emails. Pretty similar setup. We had to generate a verification token and store it alongside the email in Firestore. We then used the Mandrill API to send the email. When you clicked on the URL in the email, it would take you to a link that sent the token to a worker, which checked it against the matching email to confirm your registration. Then we would hold it for you until you're ready.
This led to a successful launch. It was seamlessly scalable. Whether we were building this to serve 100 users or a million users, the code would look exactly the same. We got no double bookings. We don't need to maintain it anymore now that we've launched. This is something that was perfect for a single use. It can be used with any APIs or cloud providers. We used the Mandrill API for the email and Firestore for storing the records.
What did we learn from this experience of using Workers for these three cases? As I mentioned earlier, start small and move functionality to the edge deliberately. I see customers try these big migration projects. It's really, really hard to keep the steam going.
With a big migration, you need that little bit of dopamine along the way to make sure that you're making progress. But you also need to be able to learn quickly and apply what you've learned to the next iteration. The other thing is, break up endpoints into multiple workers. This will allow you to deploy frequently and work in small and independent groups. Workers have allowed us to move very, very quickly. We actually use Workers a lot even outside of these three use cases. Products such as KV itself were actually built with Workers. It's really allowed our engineering velocity to go up much, much more quickly. The last thing is, new projects are great for Workers. It's obviously easy to fall back to technologies that you're familiar with when you're trying to get things out the door. But they're also a great opportunity to experiment. Highly recommend it. Thank you.
More secure and more reliable. Meet our customer, BookMyShow. They've become India's largest ticketing platform thanks to their commitment to the customer experience and technological innovation.
We are primarily a ticketing company. The numbers are really big. We have more than 60 million customers who are registered with us. We're at 5 billion screen views every month, 200 million tickets over the year. We think about what is the best for the customer. If we do not handle customers' experience well, then they are not going to come back again. And, you know, BookMyShow is all about providing that experience.
As BookMyShow grew, so did the security threats it faced. That's when it turned to Cloudflare.
From a security point of view, we use more or less all the products and features that Cloudflare has. Cloudflare today plays the first level of defense for us. One of the most interesting and aha moments was when we actually got a DDoS. And we were seeing traffic burst up to 50 gigabits per second, 50 Gbps. Usually, we would go into panic mode and get downtime.
But then all we got was an alert, and then we just checked it out, and then we didn't have to do anything. We just sat there, looked at the traffic peak, and watched it being controlled. It took less than a minute for Cloudflare to start blocking that traffic. Without Cloudflare, we wouldn't have been able to easily manage this, because even at our data center level, that kind of pipe, you know, is not easily available. We started with Cloudflare for security, and I think that was the aha moment. We actually get more sleep now because a lot of the operational overhead is reduced.
With the attack safely mitigated, BookMyShow found more ways to harness Cloudflare for better security, performance, and operational efficiency.
Once we came on board on the platform, we started seeing the advantage of the other functionalities and features. It was really, really easy to implement HTTP/2 when we decided to move towards that. With Cloudflare Workers, which is the computing at the edge, we can move the business logic that we have written custom for our applications to the Cloudflare edge level. One of the most interesting things we liked about Cloudflare was that everything can be done by the API, which makes for almost zero manual work. That helps my team a lot, because they don't really have to worry about what they're running, because they can see, they can run the test, and then they know they're not going to break anything. Our teams have been able to manage Cloudflare on their own for more or less anything and everything.
Cloudflare also empowers BookMyShow to manage its traffic across a complex, highly performant global infrastructure.
We are running on not only hybrid; we are running on a hybrid and multi-cloud strategy. Cloudflare is the entry point for our customers. Whether it is a cloud in the backend or it is our own data center in the backend, Cloudflare is always the first point of contact. We do load balancing as well, as we have multiple data centers running.
Data center selection happens on Cloudflare. It also gives us fine-grained control over how much traffic we can push to each data center, depending upon what is happening in that data center and what the capacity of the data center is. We believe that our applications and our data centers should be closest to the customers. Cloudflare just provides us the right tools to do that.
With Cloudflare, BookMyShow has been able to improve its security, performance, reliability, and operational efficiency. With customers like BookMyShow and over 20 million other domains that trust Cloudflare with their security and performance, we're making the Internet fast, secure, and reliable for everyone. Cloudflare, helping build a better Internet.
No one likes being stuck in traffic, in real life or on the Internet. Apps, APIs, websites, they all need to be fast to delight customers. What we need is a modern routing system for the Internet, one that takes current traffic conditions into account and makes the highest performing, lowest latency routing decision at any given time. Cloudflare Argo does just that.
I don't think many people understand what Argo is and how incredible the performance gains can be. It's very easy to think that a request just gets routed a certain way on the Internet no matter what. But that's not the case. There's network congestion all over the place, which slows down requests as they traverse the world. And Cloudflare's Argo is unique in that it is actually polling for what is the fastest way to get all across the world. So when a request comes into Zendesk now, it hits Cloudflare's PoP, and then it knows the fastest way to get to our data centers.
There's a lot of advanced machine learning and feedback happening in the background to make sure it's always performing at its best. But what that means for you, the user, is that enabling it and configuring it is as simple as clicking a button.
Zendesk is all about building the best customer experiences, and Cloudflare helps us do that.
What is a bot? A bot is a software application that operates on a network. Bots are programmed to automatically perform certain tasks. Bots can be good or bad. Good bots conduct useful tasks like indexing content for search engines, detecting copyright infringement, and providing customer service. Bad bots conduct malicious tasks like generating fraudulent clicks, scraping content, spreading spam, and carrying out cyber attacks. Whether they're helpful or harmful, most bots are automated to imitate and perform simple human behavior on the web at a much faster rate than an actual human user. For example, search engines use bots to constantly crawl web pages and index content for search, a process that would take an astronomical amount of time for any human user to execute.
Meet our customer, HubSpot. They're building software products that transform the way businesses market and sell online.
My name is Carrie Muntz, and I'm the Director of Engineering for the Platform Infrastructure teams here at HubSpot. Our customers are sales and marketing professionals. They just need to know that we've got this. We knew that the way that HubSpot was growing and scaling, we needed to be able to do this without having to hire an army of people to manage this.
That's why HubSpot turned to Cloudflare.
Our job was to make sure that HubSpot, and all of HubSpot's customers, could get the latest encryption quickly and easily. We were trying to optimize SSL issuance and onboarding for tens of thousands of customer domains. Previously, because of the difficulties we were having with our old process, we had about 5% of customers SSL-enabled. And with the release of version 68 of Chrome, it became quickly apparent that we needed to get more customers onto HTTPS very quickly to avoid insecure browsing warnings. With Cloudflare, we just did it, and it was easier than we expected.
Performance is also crucial to HubSpot, which leverages the deep customization and technical capabilities enabled by Cloudflare.
What Cloudflare gives us is a lot of knobs and dials to configure exactly how we want to cache content at the edge, and that results in a better experience, a faster experience for customers. Cloudflare actually understands the Internet. We were able to turn on TLS 1.3 with zero round-trip time with the click of a button. There's a lot of technology behind that.
Another pillar of HubSpot's experience with Cloudflare has been customer support.
The support with Cloudflare is great. It feels like we're working with another HubSpot team. They really seem to care, they take things seriously. I've filed cases and gotten responses back in under a minute. The quality of the responses is just a night and day difference. Cloudflare has been fantastic. It was really an exciting, amazing time to see when you have teams working very closely together, HubSpot on one side and Cloudflare on the other side, on this mission to solve for your customers' problems, which is their businesses. It really was magic.
With customers like HubSpot, and over 10 million other domains that trust Cloudflare with their security and performance, we're making the Internet fast, secure, and reliable for everyone. Cloudflare. Helping build a better Internet.
