Cloudflare at Cloudflare
Presented by: Juan Rodriguez, Rita Kozlov, James Royal, Sachin Fernandes
Originally aired on July 30, 2020 @ 1:30 PM - 2:00 PM EDT
It's Serverless Week so come join us to hear how we dogfood Cloudflare Workers internally to solve problems and build amazing products!
Transcript (Beta)
So welcome, everyone, to another session of Cloudflare at Cloudflare. What we do in this program is talk about what we call dogfooding, which is how we use our own technologies and products inside of Cloudflare to solve problems and build other products and solutions.
My name is Juan Miguel Rodriguez.
I'm the Cloudflare CIO and I am your host. I am joined today by three amazing people way smarter than me.
I have Rita. Why don't you introduce yourself, Rita?
What do you do? Hi everyone, I'm Rita. I'm a product manager working on Cloudflare Workers which is our serverless platform that we'll be talking about today.
All right, thank you. Sachin. Hello everyone, this is my first time doing Cloudflare TV so that's pretty exciting but I'm a systems engineer on our core API team and we help run the API as a platform for other consumers and services at Cloudflare.
All right, thank you Sachin. James. Hey everybody, I'm James. I've also, this is my first time, but I am the engineering manager for the Cloudflare access team and before that I've been the engineer on access for quite a long time.
Perfect, thank you.
So this week, because it's Serverless Week, in case you haven't heard, we're going to talk about dogfooding Workers internally to build other things.
So I'm going to have Rita explain, in case you don't know what Workers is, which would be unheard of given that this is Serverless Week, what Workers is and how it started, and maybe show a little bit of Workers.
So Rita, do you want to go ahead?
Yes, so Workers is Cloudflare's serverless platform that allows developers to run their code from each one of our 200 data centers around the world.
So it's really, really powerful in that it's really performant, because it runs code really close to your users, and it scales basically infinitely, because it's built on the same network that we built to protect from things like DDoS, right?
So I'm going to give a quick demo of how it works and what it looks like.
We like demos. All right, I'll even do some live coding.
Yeah, it's very exciting. We always know how live demos go.
Yeah, I know. I feel like I'm cursing. Yeah, so welcome to my Menagerie of Cloud project.
So this is the Cloudflare dashboard and I'll click into the workers product.
You can see again all the different things that I've worked on. I end up creating a lot of projects because I use workers myself pretty regularly for any random ideas that I can think of.
But in here you can see I can quickly open up a little code editor that allows me to test out a little bit of code.
So maybe we'll start with just deploying a super simple little worker that says hello.
So I can customize it and do something like get the name based on a URL query string parameter.
I'll do a new URL. And then I can do name equals url.searchParams.get.
It has tooltips and everything.
Yeah, it's super handy, especially for demos like this where I forget the exact command.
And it's a lot easier to show off than command line stuff where it's like, look at me, call the API.
It also shows you errors, which is kind of cool.
Yeah, it does. So if I, all right, it says hello null. So that makes sense.
But if we do something like name equals Juan, see what happens. All right.
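The worker from this part of the demo can be sketched roughly as follows. This is a minimal sketch, not the code shown on screen: the function names `greeting` and `handleRequest` are assumptions, and only the `name` query parameter and the "hello" text come from the demo.

```typescript
// Build the greeting from the `name` query-string parameter.
// URLSearchParams.get returns null when the parameter is absent, which is
// why the preview in the demo printed "hello null" before a name was passed.
function greeting(requestUrl: string): string {
  const url = new URL(requestUrl);
  const name = url.searchParams.get("name");
  return `hello ${name}`;
}

// Worker-style handler (hypothetical name): turns a request URL into a
// plain-text Response.
function handleRequest(requestUrl: string): Response {
  return new Response(greeting(requestUrl), {
    headers: { "content-type": "text/plain" },
  });
}

// In a deployed Worker this would be wired to the fetch event, roughly:
//   export default { fetch: (request: Request) => handleRequest(request.url) };
```

Keeping the greeting logic in a plain function makes it easy to check outside the Workers runtime before deploying.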
All right, here we go. Is that running when you deploy it, Rita, and it gets pushed to all our PoPs globally?
So this is just a little test preview. So this is not live yet.
Ah, when you deploy to go live is when it goes. Are you ready? Yes.
Oh, no.
I have too many scripts. You see, you got the demo effect, where something always goes wrong.
All right. I will, just for you guys, I'll delete one of my old workers.
It's pretty wild that you have 130 workers.
No, don't tell anyone. All right. So now it should be deployed and live.
And so if anyone wants to go to donpapere7e7.rita.workers.dev, I can do name equals James and it will tell me hello, James.
And that is right now running everywhere.
And that is right now running everywhere. That is pretty amazing.
So that was very quick. And, you know, we just pushed all the code basically globally to our edge.
So that's pretty, pretty cool. Thank you, Rita.
That was a great demo. We even had like a little bit of excitement with it. It wouldn't be a true demo if something didn't work.
That's right. Exactly. Now it's good to know that it's actually running live versus showing a video, right?
Yeah, that's how you know it's not pre-recorded. Exactly. So we're going to go to Sachin, which is like one of your internal customers, right?
That is basically using Cloudflare Workers to basically build stuff.
So Sachin, you said that you were like in the API team.
So tell us about like the team. What do you, what does the team do?
What is the function of the API inside of Cloudflare and all that kind of stuff?
So essentially, the team is sort of like a platform team where our consumers or our users are other folks inside Cloudflare.
So we help drive other APIs and we help support teams with things like authentication, authorization, anytime someone needs to pull out account info, user info, zone info, things like that.
They usually end up coming to, through our systems in some way.
So it sort of, we have a very broad set of things that we touch.
And the other two or three big things that we run: everything under api.cloudflare.com flows through our team.
And then our team decides which service to sort of send it to.
And we also drive a lot of those APIs under api.cloudflare.com.
We drive the APIs in the dashboard. So anytime you go to the dashboard and you hit, you ask for some information inside the dashboard, that usually flows through one of our boxes or one of our servers.
We also help support all the partner APIs.
So anytime you're using WordPress, Magento, those things, we maintain some of the plugins for all the partners.
And we also have the host API and other sort of broader APIs that we help maintain.
So your API platform is in the middle of many things that basically we're running.
Yes, essentially. It's sort of like a supporting platform or like a, it might not be sort of the final thing that gives the answer, but it usually goes through our systems to let the user know where to go in order to get the answer.
So it might redirect you to the billing service or the SSL service and things like that.
I know the billing service very well.
Yeah, you do know the billing service, yes. Yeah. His team makes building any APIs as a service team much easier to do.
So I know we appreciate their work a lot.
It's kind of great. We're customers of each other. Like a reciprocal relationship.
Teamwork, yeah. So Sachin, how are you using workers inside of the service and for what things and stuff like that?
So it's obviously very easy for us to break things if we make large changes across everything.
So if we decide to sort of change even a small thing, we know it affects every single team that uses our platform and it can essentially bring down sort of everyone, which we're very careful about.
We don't want to make changes like that, but we also want to deprecate old code and we also want to like progress and move forward and experiment with things and do that in a safe way, and in a way that's sort of controllable.
And the API team was actually one of the first consumers of workers at Cloudflare, which was terrifying and also exciting at the same time.
Because we had sort of no idea, like would a billion requests a week be okay? Like how many of those would error out?
Like when we do the deploys, what is the time between the new code getting deployed and the change going out?
And even if that's a few seconds, that's thousands of requests that we miss.
And so things like that were sort of top of mind when we were trying to use workers.
And one of the first things we used was for TLS deprecation.
So what we needed to do was start rejecting requests across Cloudflare that used TLS 1.2 or lower, because it had legacy algorithms in there, things we didn't want to support just because of security concerns and things like that.
But we also didn't want to sort of be mean to our customers.
We also wanted to give them a reasonable amount of time to migrate to like 1.3.
We wanted to give them warnings about, hey, you're using a deprecated protocol.
This is going to end on this date. Here's a migration pathway.
So we thought of many things. We were like, you know, we could go to every service and basically implement TLS deprecation at the service level and get each service to say, well, if it's 1.2, then we reject it.
We were thinking about platform level stuff where maybe the gateway itself tries to reject things.
But we didn't want to make large changes to the gateway because then, again, it affects the big teams, and it's written in Lua and Nginx and things like that.
So it's hard to debug, it's hard to parse.
And then we're like, well, what if we just put a worker in front of all of api.cloudflare.com and all of dash.cloudflare.com?
Just in front of all the APIs.
Just every... Just in front, right. And this is, again, this is like many, many, many, many requests every second of every day.
And it was very scary because if we messed up a semicolon in that worker, it essentially means that we don't have the API anymore, which is very bad.
Sure. So we ended up sort of writing the thing in JavaScript.
And then the really easy and nice thing was we could deploy it to as many test zones as we wanted.
We could run as many tests on sister zones without actually affecting the primary zone or doing things like that.
And we knew that it was a real test because it would go out to the edge. We didn't have to spin up another API gateway to do these tests.
We didn't have to spin up 50 million test services alongside everything to say, well, now you have to say 1.2 is deprecated.
But like, go through your test cycle, do a release, do a thing, make sure it goes out.
And the really nice thing is when we started out, we just started out by logging.
All we did was just log anything that was 1.2 and tell us...
The patterns and the volumes. Exactly. We wanted to sort of have a before metric in order to have an after metric.
So we saw that quite a few people were still using 1.2, and we're like, this is bad.
We don't want them to use 1.2.
So we said, well, instead of rejecting them outright, we can just split the code paths in the worker itself.
So if they're using 1.2, have the worker append a deprecation message to the response body when we send the response back, instead of outright rejecting their stuff.
So we started doing that.
So that was a little bit of an experiment to say, well, we added this little if condition in there, but it was four lines of code and we were able to change all of api.cloudflare.com with those four lines of code.
And it did what we wanted.
We didn't touch any platform. We didn't touch any service. But the moment the response came back and we knew it was 1.2, we would say, hey, just by the way, here's a deprecation message saying you're using a bad protocol.
And it worked great.
We saw the logs, we could see people were trying to switch things up. We saw people reaching out to their folks being like, please help us.
Don't start rejecting our requests.
So that was really nice to be able to just do that in four lines of code.
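The two phases described here (first only log which TLS versions show up, then append a deprecation notice instead of rejecting) can be sketched as below. This is a sketch under assumptions, not the API team's actual worker: the function names, the message text, and the JSON body shape are hypothetical, and the deprecated-version list simply mirrors what is said in the talk. In a Worker, the negotiated version is available as `request.cf.tlsVersion`.

```typescript
// Assumption: which versions count as deprecated is a policy decision.
// This list mirrors the "1.2 or lower" wording in the talk.
const DEPRECATED_TLS: string[] = ["TLSv1", "TLSv1.1", "TLSv1.2"];

// Hypothetical wording for the notice appended to API responses.
const DEPRECATION_MESSAGE =
  "This TLS version is deprecated for this API. Please upgrade your client.";

type ApiBody = Record<string, unknown>;

function isDeprecated(
  tlsVersion: string,
  deprecated: string[] = DEPRECATED_TLS
): boolean {
  return deprecated.indexOf(tlsVersion) !== -1;
}

// Phase 1: only count what is seen, to get a "before" metric.
function recordVersion(counts: Record<string, number>, tlsVersion: string): void {
  counts[tlsVersion] = (counts[tlsVersion] || 0) + 1;
}

// Phase 2: the small "if" in front of every response. The origin's body is
// passed through untouched unless the client used a deprecated version, in
// which case a deprecation field is added rather than rejecting the request.
function annotate(body: ApiBody, tlsVersion: string): ApiBody {
  if (!isDeprecated(tlsVersion)) {
    return body;
  }
  return { ...body, deprecation: DEPRECATION_MESSAGE };
}
```

Because the branch lives in the worker, neither the gateway nor any backing service has to change, which is the point made in the talk.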
That's a very interesting use case. And the part that I find interesting is you said that that was one of the first things that we use internally with workers.
So Rita, that must have been a little bit, you know, probably like a bit of a heart attack.
It's like, oh my God, you know, like my platform, the internal use case, first of all, we're going to be putting like in front of all the traffic that we're getting from an API perspective, right?
Yeah, definitely. I mean, I'm an optimist at heart.
So I was like, it's going to be fine. And I mean, no offense to our API team, but we have much, much bigger customers on us now.
So with much higher volumes of traffic, but definitely even the fact that we didn't have to do anything and the API team didn't have to do anything to scale to being able to handle so many requests per second was really reassuring.
And I think it's really interesting.
It sounds so simple, right? We just need to deprecate TLS.
But in a services scenario like that, going to every single team and getting them to get their service to handle it is really, really challenging.
Yeah, yeah, exactly.
That's great. Well, thank you, Sachin, for that use case. I'm going to switch gears to James, and James runs the access team.
And so James, tell us a little bit about access.
And I know that, you know, access didn't start as a platform in workers, and eventually they got there and stuff like that.
So maybe you can tell us a little bit about all that.
Yeah, so access was kind of built for a couple of different reasons.
But one of the major ones was we wanted to kind of get off of our VPN, especially a long time ago.
So access started as an internal project, I believe, in 2015 or 2016, somewhere way back a couple of years ago.
And we pretty much only used it internally, I think, to protect like a Grafana instance.
And then eventually, we decided that maybe this is something that we could productize, and maybe start selling to users.
So then I was hired to kind of take it from that POC into kind of like an actual product.
Industrialize and sell to customers.
Exactly. And originally, because it was a POC, it was just a Go service running, I think, in like one or two containers in our core data center.
And that worked, because the amount of traffic that hit it was basically minimal.
I mean, you only log in a couple times a day. So like that was really the only traffic that it was having to deal with.
And that worked for a very long time.
And it kept working. Sometime, though, in like 2018, we started having a lot of kind of instability in our core data center that we were in, like either bad LB configs, which seems to be the cause of every major cloud outage these days, would get pushed out.
Or like the database, like we were on a shared database instance, because we were small potatoes as far as databases were concerned.
So they just had us on a shared instance with a bunch of other people. And you have noisy neighbor problems, where, you know, somebody would start hammering the database, and then our reads would suddenly take 10 seconds to do something.
So Access would have, you know, periodic minor outages, just due to things almost entirely outside of our control.
And so when you're a service that is in charge of making sure that people can get to their internal, you know, tools or whatever else, like having outages in your login service is not something that's acceptable.
And so we started looking for ways to kind of reduce either our dependency, at least a little bit from that core data center or entirely from that core data center.
And this was around the time that workers actually came out, and we evaluated it.
And at the time, it was really close, but it wasn't quite what we needed.
One thing that finally did kind of take it over the edge for us was when they rolled out Workers KV, because then we could actually store configuration at the edge, and we wouldn't ever have to make calls back.
For people that may not know, what would you say about what KV is?
Yeah, so Workers KV is, I mean, it is what it is.
It's basically generic key-value storage. So you can write, like, key-values.
So basically, you have a key and then you store some arbitrary blob inside of it.
Like for Access, everything is just a JSON blob. So basically, when you make an API update to the Cloudflare dashboard, we have our source of truth running inside of our core data center.
But then that gets flushed out to all of the data centers that KV listens to, or it's written into KV, so that our workers can pull that configuration out at the edge.
So we never have to make that round trip back to our core data center.
It's as close as wherever KV stores this data. But so once that came out, that was a viable option for us.
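The pattern described here (core writes a JSON blob per key, the worker reads it at the edge) might look roughly like the sketch below. The `AppConfig` fields, the `config:` key prefix, and the function names are hypothetical. `MemoryStore` is an in-memory stand-in so the sketch runs anywhere; a real Workers KV namespace binding exposes `get` and `put` methods of the same basic shape.

```typescript
// Minimal shape of a key-value namespace: string keys, string values.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

// Hypothetical per-application configuration stored as one JSON blob.
interface AppConfig {
  domain: string;
  allowedEmails: string[];
}

// In-memory stand-in for a KV namespace, for illustration only.
class MemoryStore implements KeyValueStore {
  private data: Record<string, string> = {};

  get(key: string): Promise<string | null> {
    const value = this.data[key];
    return Promise.resolve(value === undefined ? null : value);
  }

  put(key: string, value: string): Promise<void> {
    this.data[key] = value;
    return Promise.resolve();
  }
}

// A missing key yields null rather than throwing.
function parseConfig(raw: string | null): AppConfig | null {
  return raw === null ? null : (JSON.parse(raw) as AppConfig);
}

// Core side: after the source of truth changes, push the blob out.
function publishConfig(store: KeyValueStore, config: AppConfig): Promise<void> {
  return store.put(`config:${config.domain}`, JSON.stringify(config));
}

// Edge side: the worker reads its configuration from the store, with no
// round trip back to the core data center.
async function loadConfig(
  store: KeyValueStore,
  domain: string
): Promise<AppConfig | null> {
  return parseConfig(await store.get(`config:${domain}`));
}
```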
And so Cloudflare does retreats, or did retreats, I guess, before Corona, where they bring everybody in the entire company into the San Francisco office for a week, once a year, so that you can meet, you know, other teams and things like that.
And during that retreat week, we decided to kind of do like a hackathon on the access team, where we basically like, could we build the core access product in workers?
And, you know, it's a relatively new product; KV was very new, especially at the time.
And so like, there wasn't like somebody I could be like, hey, how did you build this internally?
I mean, Sachin's team did, but like, everyone's stuff was kind of basic, like, modify a request, right?
At the time, I think in Cloudflare, we were the first real product built out of Workers, where there's no origin behind it.
This is the end, like the request hits us, and then it stops and we return a response.
And so that was like a week long, just hackathon, all we did that week, when we had free time, that wasn't like hanging out with other people and getting to know everybody, like the reasons why you do retreat week, was spent just trying to make this work.
And so we built a POC at the end of the retreat week, that was like, look, we proved you can actually build this.
And so we kind of put a hold on everything else that we were building and decided to rewrite our entire core authentication piece into workers, which was a really cool thing.
Because as an engineer, rewrites are always kind of fun.
Because like I had three years, like, basically three years worth of code that like we had learned over time, like, you know, it has cruft of like, you built this, but it doesn't quite work.
But you modify something, and then you modify something.
And then it's just like, eventually, you just have like a, you know, it's old.
And all those things that you got to deal with.
Yeah. And so getting to rewrite it from scratch with like, what you already now know a couple years down the line was a really cool experience.
And we ended up building it in TypeScript.
So we didn't lose any of like the typing that like we had with Go, because that was one of the original concerns we had was like, everything was JavaScript.
And we were like, well, we're gonna lose like our type guarantees.
But writing TypeScript for Workers is actually really easy. We ended up using something called Serverless because this predated Wrangler.
Everything that we do these days uses Wrangler, which, for those of you that aren't aware, is the Workers CLI tool to help you build workers.
It's a really, really cool tool. And like I said, we use it for basically everything that we've built since then.
Which is a lot.
Actually, we run a lot of workers now for basically everything related to our product.
And, you know, that, like I said, it was just, it was a really neat experience.
We learned a lot going along the way, because like I said, there wasn't a lot for building products yet on workers.
So there was a lot of things that we had to figure out.
Some of those things were like versioning. So like you could see Rita's worker was deployed, but you never know when you first deploy it, you never really know what version of the worker you're running.
So one of the things that we always, one of the first things that we added was basically like all of our workers return a header that basically gives us like the devs an idea of what was running, which helps for a lot of things.
It helps when we do dev, it helps when customers write in support tickets, because if they give us a HAR file, we can see inside of their response what version of the worker they're getting, what version of code they're getting, which is super helpful for us.
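Stamping every response with a version header could be sketched like this. The header name `x-worker-version` and the version string are assumptions; the talk does not say what the Access team's header is called or how the version is produced.

```typescript
// Hypothetical version string, for example injected at build time.
const WORKER_VERSION = "2020.07.30-abc1234";

// Return a copy of the response with a version header added.
// The headers are copied first because a response obtained from fetch()
// has immutable headers and cannot be modified in place.
function withVersionHeader(
  response: Response,
  version: string = WORKER_VERSION
): Response {
  const headers = new Headers(response.headers);
  headers.set("x-worker-version", version);
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}
```

With something like this in place, a header captured in a customer's HAR file shows which build served the request.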
We had to figure out lots of metrics, which I think is still kind of an ongoing, like we're still trying to work our way to what is actually best.
But one of the things that we do personally is that Cloudflare has a service that you can basically post HTTP requests to, and it takes them and puts them into a Kafka queue.
And then that Kafka queue runs back to our core data centers, where we can then process them.
So basically like as our workers run, we dump our logs and metrics to this Kafka queue and then it ends up back and then we can process it and add it.
So we have the operational visibility that you would expect out of a normal product running anywhere in a normal place.
So that's never really been a problem for us.
We are hooked into Sentry, so we use Sentry for error reporting.
So like if we're going to return a 500, one of the big key things for workers, in my opinion, is you should usually have some sort of outside handler that like, it's basically a giant try except block to make sure that you catch any kind of exceptions so that you can do processing with it.
So like for us, we'll send an error report to Sentry with all the data that we need to probably debug it and then return an error page that's helpful for the customer.
So that way, if they need to write in to support, they can, or if they can solve their own issue, then it will help them try to solve their own issue.
But all of that is built into our worker, which is super nice.
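The "giant try/except" idea, an outer handler that catches anything the worker throws, reports it, and returns a helpful error page, can be sketched as follows. The names are hypothetical, and `report` stands in for whatever sends the error to a service such as Sentry; no real Sentry API is used here.

```typescript
type Handler<T> = (input: T) => Promise<Response>;
type Reporter = (error: unknown) => Promise<void> | void;

// Wrap a handler so that an uncaught exception becomes a reported error
// plus a 500 page, instead of an unexplained failure for the customer.
function withErrorBoundary<T>(handler: Handler<T>, report: Reporter): Handler<T> {
  return async (input: T): Promise<Response> => {
    try {
      return await handler(input);
    } catch (error) {
      try {
        await report(error);
      } catch {
        // A failure in reporting must not hide the error page below.
      }
      return new Response(
        "Something went wrong. If this keeps happening, please contact support.",
        { status: 500, headers: { "content-type": "text/plain" } }
      );
    }
  };
}
```

The wrapper is generic over the handler's input so the same boundary can sit around a whole worker or around a single route.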
But some of the things that you said there were like very interesting, James, that in many cases, I mean, and this is part of like why we dog food, right?
We try to make a product work for us, and we think like if it's not good for us, it's probably not going to be like good enough for our customers, right?
Many of these things that you talked about, whether it's versioning, logging, or even KV or whatever, then become requirements that we pass over to Rita, basically to build things in there, right?
So our customers can basically take advantage of it to build their services, right?
We talk with Rita all the time.
We end up trying a lot of the new features coming down the Workers pipeline.
So we're usually beta testers for them just so that that way they can get kind of some internal usage.
And then that way they know like are there rough edges that we need to sand down?
Or is there like some core problem that we need to fix before it becomes like a bigger issue?
Which almost never happens. But like it's just like those kinds of things, like the logs and metrics idea is the reason why that's now part of Wrangler, I believe, Rita, I think, yeah.
Yeah, it's super helpful to have this really tight feedback loop of when we go to even recruit user researchers, right?
Like our design team is like, is this intuitive? Do you like this?
Do you not like this? I mean, when we're at the office, it would literally just be like turn desk.
But even now, it's just James or Sachin or some of our other internal customers. But yeah, Access has definitely been a valued customer of ours from the very, very beginning, which we really appreciate.
Yeah.
Great. And James, so I know that, you know, during the pandemic we've basically done our best to provide Access for free to anybody that needed it. So how have things been with growth and scalability? Maybe these are things where Workers has paid off, right?
Yeah, no, it really has.
Like the amount of growth that we've seen through the pandemic, because of that, that offer has been rather high.
But from an engineering perspective, I honestly didn't notice. The only real thing that I noticed was an increase in kind of generic support tickets, like, how do I set this up and things like that, that, you know, as you get new users, they have questions about.
But beyond that, Workers scaled so perfectly that we never really noticed.
It made that a lot easier. It wasn't like I was sitting there watching CPU spike and then being like, right, I guess I need more containers in my, you know, Kubernetes pool.
It just kind of, I didn't worry about that.
That's one of the best things about Workers: I just didn't have to think about it, really.
And is there anything being announced this week during Serverless Week that you're particularly excited about, that you want to use in Access?
So, I saw that we're doing Workers Unbound. Workers Unbound would be really nice, actually.
We have certain things. So, we have to do SAML in our worker, because that's just how SSO works for a large portion of the world still, unfortunately.
And signing SAML assertions in JavaScript is not the cheapest thing in the world as far as time is concerned.
So, we basically skirt the boundary right now, but having that kind of Workers Unbound would make me feel better.
Yeah, I mean, the Access team has been beta testing Workers Unbound, I would say, for a long time.
That's great.
Well, we have a couple more minutes, so is there anything, Sachin or James, that you want to mention to anybody that would consider Workers for their own development?
One of the things that was really useful to us, which initially wasn't part of the Workers setup, was secrets.
Like having encrypted secrets inside the worker is pretty life-changing right now for us, because then it opens up this whole world of I want to send things to this other API, but I don't want anyone to send things to this other API.
And then we end up using, so we have a bunch of auth stuff also built on top of Access that then lives on top of Workers that then is internally dogfooded to other things.
So, there's this whole mashup of dogfooding happening with workers and everything.
Thank you, James, for all the Access stuff.
It's been super nice and useful. Thank you, Rita, for all the workers stuff.
Great. Well, so we're coming now to the hour. We have like 30 seconds.
So, thank you so much. I thought this was very, very, very interesting. I hope that people that are watching can really see how we dogfood all these things, the amount of volume that we put through them, and how we try to be, as I always say, Cloudflare's first customer before things get to you.
So, I hope that this was helpful.
And if anybody has any questions or any comments, you can send us a note in Cloudflare TV, and we'll be happy to get back to you.
Thank you so much, guys.
Have a great rest of the week. Yep. See you. Bye. Thanks, y'all.