Cloudflare at Cloudflare

Name: Cloudflare at Cloudflare
Uploaded: 2020-08-11T12:30:00.000Z
Duration: 30 min
Description: It's Serverless Week so come join us to hear how we dogfood Cloudflare Workers internally to solve problems and build amazing products!

Presented by: Juan Rodriguez, Rita Kozlov, James Royal, Sachin Fernandes

Originally aired on January 5, 2022 @ 7:00 AM - 7:30 AM EST

Transcript (Beta)

So welcome everyone to another session of Cloudflare at Cloudflare. What we do in this program is talk about what we call dogfooding, which is how we use our own internal technologies and products inside of Cloudflare to solve problems and build other products and solutions. My name is Juan Miguel Rodriguez. I'm the Cloudflare CIO and I am your host. I am joined today by three amazing people way smarter than me. I have Rita. Why don't you introduce yourself, Rita? What do you do?
Hi everyone, I'm Rita. I'm a product manager working on Cloudflare Workers, which is our serverless platform that we'll be talking about today.
All right, thank you. Sachin.
Hello everyone, this is my first time doing Cloudflare TV, so that's pretty exciting. I'm a systems engineer on our core API team, and we help run the API as a platform for other consumers and services at Cloudflare.
All right, thank you Sachin. James.
Hi everybody, I'm James. This is my first time, but I am the engineering manager for the Cloudflare Access team, and before that I was an engineer on Access for quite a long time.
Perfect, thank you. So this week, because it's Serverless Week, in case you haven't heard, we're going to talk about dogfooding Workers internally to build other things. So I'm going to have Rita explain, in case you don't know what Workers is, which would be unheard of given that this is Serverless Week, what Workers is and how it started, and maybe show a little bit of Workers. So Rita, do you want to go ahead?
Yes, so Workers is Cloudflare's serverless platform that allows developers to run their code from each one of our 200 data centers around the world. So it's really, really powerful in that it's really performant, because it runs code really close to your users, and it scales basically infinitely because it's built on the same network that we built to protect from things like DDoS, right? So I'm going to give a quick demo of how it works and what it looks like.
We like demos.
All right, I'll even do some live coding.
Yeah, it's very exciting. We always know how live demos go.
Yeah, I know. So welcome to my menagerie of cloud projects. This is the Cloudflare dashboard, and I'll click into the Workers product. You can see again all the different things that I've worked on. I end up creating a lot of projects because I use Workers myself pretty regularly for any random ideas that I can think of. But in here you can see I can quickly open up a little code editor that allows me to test out a little bit of code. So maybe we'll start with just deploying a super simple little worker that says hello. So I can customize it and do something like get the name based on a URL query string parameter. I'll do a new URL. And then I can do name equals url.searchParams.get.
It has tooltips and everything.
Yeah, it's super handy, especially for demos like this where I forget the exact command. And it's a lot easier to show off than command line stuff where it's like, look at me, call the API.
It also shows you errors, which is kind of cool.
Yeah, it does. So, all right, it says hello null, so that makes sense. But if we do something like name equals one, see what happens. All right. Here we go.
Is that running when you deploy it, Rita, and it gets pushed to all our PoPs globally?
So this is just a little test preview. So this is not live yet.
Ah, when you deploy to go live is when it goes.
Are you ready?
Yes.
Oh, no. I have too many scripts.
You see, you got the demo effect where something always goes wrong.
I know. All right. Just for you guys, I'll delete one of my old workers.
It's pretty wild that you have 130 workers.
Don't tell anyone. All right. So now it should be deployed and live. And so if anyone wants to go to donpapere7e7.rita.workers.dev, I can do name equals James, and it will tell me, hello, James. And that is right now running everywhere.
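As a rough reconstruction of the worker from the demo (not the exact code typed on screen), the greeting logic could look like the sketch below. The function names `greet` and `handleRequest` are illustrative assumptions; only `new URL(...)` and `url.searchParams.get(...)` come from the demo itself.

```typescript
// Builds the greeting from the `name` query string parameter.
// A missing parameter yields null, so the text reads "hello null",
// which matches what the preview showed in the demo.
function greet(requestUrl: string): string {
  const url = new URL(requestUrl);
  const name = url.searchParams.get("name");
  return `hello ${name}`;
}

// Hypothetical fetch handler wrapping greet(). In a deployed Worker this
// function would be wired to the runtime's fetch event.
function handleRequest(request: Request): Response {
  return new Response(greet(request.url), {
    headers: { "content-type": "text/plain" },
  });
}
```

Keeping the greeting in a plain function makes it easy to try locally before deploying, since it only depends on the standard `URL` class.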
That is pretty amazing. So that was very quick, and we just pushed all that code basically globally to our edge, so that's pretty cool. Thank you, Rita. That was a great demo. We even had a little bit of excitement with it.
It wouldn't be a true demo if something didn't show up.
That's right. Exactly. Now it's good to know that it's actually running live versus showing a video, right?
Yeah, that's how you know it's not pre-recorded.
So we're going to go to Sachin, who is one of your internal customers that is using Cloudflare Workers to build stuff. Sachin, you said that you were on the API team. So tell us about the team. What does the team do? What is the function of the API inside of Cloudflare and all that kind of stuff?
So essentially, the team is sort of like a platform team where our consumers or our users are other folks inside Cloudflare. So we help drive other APIs, and we help support teams with things like authentication, authorization, anytime someone needs to pull out account info, user info, zone info, things like that. They usually end up coming through our systems in some way. So we have a very broad set of things that we touch. And the other two or three big things that we run: everything under api.cloudflare.com flows through our team, and then our team decides which service to send it to. And we also drive a lot of those APIs under api.cloudflare.com. We drive the APIs in the dashboard. So anytime you go to the dashboard and you ask for some information inside the dashboard, that usually flows through one of our boxes or one of our servers. We also help with all the partner APIs. So anytime you're using WordPress, Magento, those things, we build and maintain some of the plugins for all the partners. And we also have the host API and other broader APIs that we help maintain.
So your API platform is in the middle of many things that we're running.
Yes, essentially.
It's sort of like a supporting platform. It might not be the final thing that gives the answer, but requests usually go through our systems to let the user know where to go in order to get the answer. So it might redirect you to the billing service or the SSL service and things like that.
I know the billing service very well.
Yeah, you do know the billing service, yes.
His team makes building any APIs from a service team much easier to do. So I know we appreciate their work a lot.
It's kind of great. We're customers of each other.
Like a reciprocal relationship.
Teamwork, yeah.
So Sachin, how are you using Workers inside of the service, and for what things?
I'm sorry. So it's obviously very easy for us to break things if we make large changes across everything. If we decide to change even a small thing, we know it affects every single team that uses our platform. And it can essentially bring down everyone, which we're very careful about. We don't want to make changes like that. But we also want to deprecate old code. And we also want to progress and move forward and experiment with things, and do that in a safe way, in a way that's controllable. And the API team was actually one of the first consumers of Workers at Cloudflare, which was terrifying and also exciting at the same time, because we had no idea. Would a billion requests a week be okay? How many of those would error out? When we do the deploys, what is the time between the new code getting deployed and the change going out? And even if that's a few seconds, that's thousands of requests that we miss. So things like that were top of mind when we were trying to use Workers. And one of the first things we used it for was TLS deprecation.
So what we did: we needed to start rejecting requests across Cloudflare that used TLS 1.2 or lower, because it had legacy algorithms in there, things we didn't want to support because of security concerns and things like that. But we also didn't want to be mean to our customers. We wanted to give them a reasonable amount of time to migrate to 1.3. We wanted to give them warnings: hey, you're using a deprecated protocol. This is going to end on this date. Here's a migration pathway. So we thought of many things. We were like, you know, we could go to every service and basically implement TLS deprecation at the service level and get each service to say, well, if it's 1.2, then we reject it. We were thinking about platform-level stuff where maybe the gateway itself tries to reject things, but not everyone. We didn't want to make large changes to the gateway because then again, it affects the big teams, and it's written in Lua and Nginx and things like that. So it's hard to debug. It's hard to parse. And then we're like, well, what if we just put a worker in front of all of api.cloudflare.com and all of dash.cloudflare.com? Just in front of all the API. And this is, again, many, many, many requests every second of every day. And it was very scary, because if we messed up a semicolon in that worker, it essentially means that we don't have the API anymore, which was very bad. So we ended up writing the thing in JavaScript. And then the really easy and nice thing was we could deploy it to as many test zones as we wanted. We could run as many tests on sister zones without actually affecting the primary zone or doing things like that. And we knew that it was a real test because it would go out to the edge.
We didn't have to spin up another API gateway to do these tests, and we didn't have to spin up 50 million test services alongside everything to say, well, now you have to say 1.2 is deprecated, but go through your test cycle, do a release, do a thing, make sure it goes out. And the really nice thing is when we started out, we just started out by logging. All we did was just log anything that's 1.2.
And that tells you the patterns on the volumes.
Exactly. We wanted to have a before metric in order to have an after metric. So we saw that quite a few people were still using 1.2, and we're like, this is bad. We don't want them to use 1.2. So we said, well, instead of rejecting them outright, we can just split the code paths in the worker itself. So if they're using 1.2, append a deprecation message to the response body when we send the response back instead of outright rejecting their stuff. So we started doing that. That was a little bit of an experiment to say, well, we added this little if condition in there, but it was four lines of code, and we were able to change all of api.cloudflare.com with those four lines of code. And it did what we wanted. We didn't touch any platform. We didn't touch any service. But the moment the response came back and we knew it was 1.2, we would say, hey, just by the way, here's a deprecation message saying you're using a bad protocol. And it worked great. We saw the logs, we could see people were trying to switch things up. We saw people reaching out to their folks being like, please help us. Don't start rejecting our requests. So that was really nice to be able to just do that in four lines of code.
That's a very interesting use case. And the part that I find interesting is that you said that was one of the first things that we used internally with Workers. So Rita, that must have been a little bit, probably like a bit of a heart attack.
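To make the TLS deprecation approach described above concrete, here is a minimal sketch of a worker that logs deprecated-TLS requests and appends a notice to the response instead of rejecting. It is not the API team's actual code: the cutoff constant, the notice wording, the helper names, and the assumption that responses are JSON objects with a `messages` array are all illustrative. The sketch reads the negotiated version from `request.cf.tlsVersion`, which the Workers runtime populates with strings such as `TLSv1.2`.

```typescript
// Assumption: lowest TLS version still accepted, following the 1.3 target
// mentioned in the discussion. Adjust to the real policy.
const MIN_SUPPORTED_TLS = 1.3;

// Hypothetical wording; the real notice text is not given in the talk.
const DEPRECATION_NOTICE =
  "You are using a deprecated TLS version. Please migrate to TLS 1.3.";

// Parses strings such as "TLSv1.2" and reports whether they are below the cutoff.
// Unparseable values are treated as not deprecated so traffic is never flagged by mistake.
function isDeprecatedTls(tlsVersion: string): boolean {
  const match = /^TLSv(\d+(?:\.\d+)?)$/.exec(tlsVersion);
  if (!match) return false;
  return parseFloat(match[1]) < MIN_SUPPORTED_TLS;
}

// Appends the notice to a JSON object's `messages` array (assumed response shape).
// Bodies that are not JSON objects are returned unchanged.
function appendNotice(bodyText: string): string {
  let body: any;
  try {
    body = JSON.parse(bodyText);
  } catch {
    return bodyText;
  }
  if (typeof body !== "object" || body === null || Array.isArray(body)) {
    return bodyText;
  }
  const messages = Array.isArray(body.messages) ? body.messages : [];
  return JSON.stringify({ ...body, messages: [...messages, DEPRECATION_NOTICE] });
}

// Sketch of the worker: pass the request through to the origin, then branch
// on the TLS version of the incoming connection.
async function handleRequest(request: Request): Promise<Response> {
  const tlsVersion: string = (request as any).cf?.tlsVersion ?? "";
  const response = await fetch(request);
  if (!isDeprecatedTls(tlsVersion)) return response;

  // Logging first gives the "before" metric described above.
  console.log(`deprecated TLS request: ${tlsVersion} ${request.url}`);

  const headers = new Headers(response.headers);
  headers.delete("content-length"); // the body length changes below
  return new Response(appendNotice(await response.text()), {
    status: response.status,
    headers,
  });
}
```

Switching from warning to rejecting later would only mean changing the deprecated branch to return an error response, which is the kind of small, central change the approach is meant to allow.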
It's like, oh my God, my platform's first internal use case, and we're going to be putting it in front of all the traffic that we're getting from an API perspective, right?
Yeah, definitely. I mean, I'm an optimist at heart. So I was like, it's going to be fine. And I mean, no offense to our API team, but we have much, much bigger customers on us now, with much higher volumes of traffic. But definitely even the fact that we didn't have to do anything and the API team didn't have to do anything to scale to being able to handle so many requests per second was really reassuring. And I think it's really interesting. It sounds so simple, right? We just need to deprecate TLS. But in a services scenario like that, going to every single team and getting them to get their service to handle it is really, really challenging.
Yeah, exactly. That's great. Well, thank you, Sachin, for that use case. I'm going to switch gears to James, and James runs the Access team. And so James, tell us a little bit about Access. And I know that Access didn't start as a platform on Workers and eventually it got there. So maybe you can tell us a little bit of all that.
Yeah. So Access was kind of built for a couple of different reasons. But one of the major ones was we wanted to get off of our VPN, especially a long time ago. So Access started as an internal project, I believe in 2015 or 2016, somewhere way back a couple of years ago. And we pretty much only used it internally, I think, to protect a Grafana instance. And then eventually we decided that maybe this is something that we can productize and maybe start selling to users. So then I was hired to take it from that POC into an actual product.
Industrialize and sell to customers.
Exactly. And originally, because it was a POC, it was just a Go service running, I think, in like one or two containers in our core data center.
And that worked because the amount of traffic that hit it was basically minimal. I mean, you only log in a couple of times a day. So that was really the only traffic that it was having to deal with. And that worked for a very long time. And it kept working. Sometime though, in like 2018, we started having a lot of instability in the core data center that we were in. Either bad LB configs, which seem to be the cause of every major cloud outage these days, would get pushed out. Or the database: we were on a shared database instance because we were small potatoes as far as databases were concerned. So they just had us on a shared instance with a bunch of other people. And you have noisy neighbor problems where somebody would start hammering the database, and then our reads would suddenly take 10 seconds to do something. So Access would have periodic minor outages just due to things almost outside of our control. And when you're a service that is in charge of making sure that people can get to their internal tools or whatever else, having outages in your login service is not something that's acceptable. And so we started looking for ways to reduce our dependency on that core data center, either at least a little bit or entirely. And this was around the time that Workers actually came out. And we evaluated it. And at the time, it was really close, but it wasn't quite what we needed. One thing that finally did take it over the edge for us was when they rolled out Workers KV, because then we could actually store configuration at the edge. And we wouldn't ever have to make calls back into our-
For people that may not know, what did you say about that KV?
Yeah. So Workers KV is, I mean, it is what it is. It's basically a generic key-value storage. So you can write key-values. So basically you have a key and then you store some arbitrary blob inside of it.
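As an illustration of reading configuration from a key-value store at the edge, here is a small sketch under stated assumptions. The `AppConfig` shape, the `config:` key scheme, and the in-memory stand-in are hypothetical and not how Access actually lays out its data; the only real API relied on is the KV binding's `get(key, "json")` call, which resolves to the parsed value or `null` when the key is missing.

```typescript
// Minimal subset of a Workers KV namespace binding used by this sketch.
interface KvNamespace {
  get(key: string, type: "json"): Promise<unknown>;
}

// Hypothetical configuration blob stored as JSON under one key.
interface AppConfig {
  allowedEmails: string[];
}

// Hypothetical key scheme: one entry per protected hostname.
function configKey(hostname: string): string {
  return `config:${hostname.toLowerCase()}`;
}

// Reads the configuration from KV, so no round trip to a core data center is needed.
async function loadConfig(kv: KvNamespace, hostname: string): Promise<AppConfig | null> {
  const value = await kv.get(configKey(hostname), "json");
  return (value as AppConfig | null) ?? null;
}

// In-memory stand-in for a KV namespace, useful for local experiments.
function memoryKv(entries: Record<string, string>): KvNamespace {
  return {
    async get(key: string) {
      const raw = entries[key];
      return raw === undefined ? null : JSON.parse(raw);
    },
  };
}
```

In a deployed Worker the `kv` argument would be the namespace binding configured for the script, and writes would come from whatever system holds the source of truth.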
Like for Access, everything is just a JSON blob. So basically when you make an API update to the Cloudflare dashboard, we have our source of truth running inside of our core data center, but then that gets flushed out to all of the data centers that KV listens to, or it's written into KV so that our workers can pull that configuration out at the edge. So we never have to make that round trip back to our core data center. It's as close as wherever KV stores its data. So once that came out, that was a viable option for us. And Cloudflare does retreats, or did retreats, I guess, before Corona, where they bring everybody in the entire company into the San Francisco office for a week, once a year, so that you can meet other teams and things like that. And during that retreat week, we decided to do a hackathon on the Access team, where we were basically like, could we build the core Access product in KV? It was a relatively new product. KV was very new, especially at the time. And so there wasn't somebody I could ask, hey, how did you build this internally? I mean, Sachin's team did, but everyone's stuff was kind of basic, like modify a request, right? At the time, I think in Cloudflare, we were the first real product built out of Workers where there's no origin behind it. This is the end. The request hits us and then it stops and we return a response. And so that was a week-long hackathon. All we did that week when we had free time that wasn't hanging out with other people and getting to know everybody, the reasons why you do retreat week, was spent just trying to make this work. And so we built a POC by the end of the retreat week that was like, look, we proved you can actually build this. And so we put a hold on everything else that we were building and decided to rewrite our entire core authentication piece on Workers, which was a really cool thing.
Because as an engineer, rewrites are always kind of fun, because I had basically three years' worth of code that we had learned from over time. It has cruft: you built this, but it doesn't quite work, so you modify something and then you modify something else, and eventually it's just old. And all those things that you've got to deal with. And so getting to rewrite it from scratch with what you now know a couple of years down the line was a really cool experience. And we ended up building it in TypeScript, so we didn't lose any of the typing that we had with Go, because that was one of the original concerns we had: everything was JavaScript. And we were like, well, we're going to lose our type guarantees. But writing TypeScript for Workers is actually really easy. We ended up using something called Serverless because this predated Wrangler. Everything that we do these days uses Wrangler, which for those of you that aren't aware is the Workers CLI tool to help you build workers. It's a really, really cool tool. And like I said, we use it for basically everything that we've built since then, which is a lot. Actually, we run a lot of workers now for basically everything related to our product. And like I said, it was a really neat experience. We learned a lot along the way, because there wasn't a lot out there for building products on Workers yet. So there were a lot of things that we had to figure out. Some of those things were like versioning. So you could see Rita's worker was deployed, but when you first deploy it, you never really know what version of the worker you're running. So one of the first things that we added was that all of our workers return a header that gives us, the devs, an idea of what was running, which helps for a lot of things.
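The version header idea can be sketched in a few lines. The header name `x-worker-version` and the version string below are assumptions for illustration; the talk does not say what the Access team actually uses.

```typescript
// Hypothetical values: in practice the version would likely be injected at build time.
const WORKER_VERSION = "1.4.2+abc1234";
const VERSION_HEADER = "x-worker-version";

// Returns a copy of the response carrying the version header, leaving the
// status, the other headers, and the body untouched.
function withVersionHeader(
  response: Response,
  version: string = WORKER_VERSION
): Response {
  const headers = new Headers(response.headers);
  headers.set(VERSION_HEADER, version);
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}
```

Applying this wrapper to every response, including error pages, means any captured response shows which build produced it.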
It helps when we do dev, and it helps when customers write in support tickets, because if they give us a HAR file, we can see inside of their response what version of the worker they're getting, what version of code they're getting, which is super helpful for us. We had to figure out lots of metrics, which I think is still kind of ongoing; we're still trying to work our way to what is actually best. But one of the things that we do personally is that Cloudflare has a service that you can post HTTP requests to, and it takes them and puts them into a Kafka queue. And then that Kafka queue runs back to our core data centers, where we can then process them. So basically as our workers run, we dump our logs and metrics to this Kafka queue and then it ends up back there and then we can process it and add it. So we have the operational visibility that you would expect out of a normal product running anywhere in a normal place. So that's never really been a problem for us. We are hooked into Sentry. So we use Sentry for error reporting. So if we're going to return a 500, one of the big key things for workers, in my opinion, is you should usually have some sort of outside handler, basically a giant try-except block, to make sure that you catch any kind of exceptions so that you can do processing with them. So for us, we'll send an error report to Sentry with all the data that we need to probably debug it and then return an error page that's helpful for the customer. So that way if they need to write in to support, they can, or if they can solve their own issue, then it will try to help them solve their own issue. But all of that is built into our worker, which is super nice.
But some of the things that you said there were very interesting, James, and this is part of why we dogfood, right? We try to make a product work for us.
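The "giant try-except block" pattern described above can be sketched as a wrapper around the main handler. This is an illustration under assumptions, not the Access team's code: `report` is a hypothetical callback standing in for whatever sends the error to a service such as Sentry, and the error page text is made up.

```typescript
type Handler = (request: Request) => Promise<Response>;
type ErrorReporter = (error: unknown, request: Request) => Promise<void>;

// Wraps a handler so that any uncaught exception is reported and turned into
// a 500 response with a customer-facing message.
function withErrorBoundary(handler: Handler, report: ErrorReporter): Handler {
  return async (request: Request): Promise<Response> => {
    try {
      return await handler(request);
    } catch (error) {
      try {
        // Hypothetical reporting hook, e.g. a call to an error tracking service.
        await report(error, request);
      } catch {
        // A failure to report must not replace the error page.
      }
      return new Response(
        "Something went wrong. Please try again or contact support.",
        { status: 500, headers: { "content-type": "text/plain" } }
      );
    }
  };
}
```

Usage would look like `const handle = withErrorBoundary(mainHandler, sendReport)`, where `mainHandler` and `sendReport` are whatever the application defines.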
We think if it's not good for us, it's probably not going to be good enough for our customers, right? And many of these things that you talked about, whether it's versioning, logging, or even KV or whatever, you see them become requirements that we then pass over to Rita and the people that build those things, right? So our customers can take advantage of them to build their services, right?
We talk with Rita all the time. We end up trying a lot of the new features coming down the Workers pipeline. So we're usually beta testers for them, just so that they can get some internal usage, and that way they know: are there rough edges that we need to sand down, or is there some core problem that we need to fix before it becomes a bigger issue, which almost never happens. But it's those kinds of things; the logs and metrics idea is the reason why that's now part of Wrangler, I believe, Rita, I think, yeah.
Yeah, it's super helpful to have this really tight feedback loop, even when we go to recruit for user research, right? Like our design team is like, is this intuitive? Do you like this? Do you not like this? I mean, when we were at the office, it would literally just be like turn desk. But even now, there's James or Sachin or some of our other internal customers. But yeah, Access has definitely been a valued customer of ours from the very, very beginning, which we really appreciate.
Yeah, great. And, James, I know that during the pandemic we've made a big push to provide Access for free to anybody that needed it. So how have things been with growth and scalability, and maybe this is where Workers has paid off, right?
Yeah, no, it really has. The amount of growth that we've seen through the pandemic because of that offer has been rather high.
But from an engineering perspective, I honestly didn't notice. The only real thing that I noticed was an increase in generic support tickets of like, how do I set this up and things like that, that you just, you know, as you get new users, they have questions. But beyond that, Workers scaled so perfectly that we never really noticed. It made that a lot easier. It wasn't like I was sitting there watching CPU spike and then being like, right, I guess I need more containers in my, you know, Kubernetes pool. I didn't worry about that. That is one of the best things about Workers: I just didn't have to think about it, really.
And is there anything being announced this week during Serverless Week that you're particularly excited about, that you want to use in Access?
So, I saw that we're doing Workers Unbound. Workers Unbound would be really nice, actually. We have certain things. So, we have to do SAML in our worker, because that's just how SSO works for a large portion of the world, still, unfortunately. And signing SAML assertions in JavaScript is not the cheapest thing in the world, as far as time is concerned. So, we basically skirt the boundary right now, but having Workers Unbound would make me feel better.
Yeah. I mean, the Access team has been beta testing Workers Unbound, I would say, for a long time.
That's great. Well, we have a couple more minutes. So, Sachin or James, is there anything that you want to mention to anybody who would consider Workers for their own development?
One of the things that was really useful to us, which initially wasn't part of the Workers setup, was secrets.
Like, having encrypted secrets inside the worker is pretty life-changing right now for us, because then it opens up this whole world of, I want to send things to this other API, but I don't want just anyone to send things to this other API. And then we have a bunch of auth stuff, also built on top of Access, that then lives on top of Workers, that then is internally dogfooded for other things. So, there's this whole mashup of dogfooding happening with Workers and everything. Thank you, James, for all the Access stuff. It's been super nice and useful. Thank you, Rita, for all the Workers stuff.
Great. Well, so, we're coming now to the hour. We have, like, 30 seconds. So, thank you so much. I thought this was very, very interesting. I hope that people that are watching can really see how we dogfood all these things, the amount of volume that we put through them, and how we try to be, as I always say, Cloudflare's first customer before things get to you. So, I hope that this was helpful, and if anybody has any questions or any comments, send us a note at Cloudflare TV, and we'll be happy to get back to you. Thank you so much, guys. Have a great rest of the week.
Yep. See you. Bye.
Thanks, y'all.