Latest from Product and Engineering
Presented by: Usman Muzaffar, Briana Berger, Meyer Zinn, Alyssa Wang
Originally aired on July 24, 2022 @ 3:30 AM - 4:00 AM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Transcript (Beta)
All right. Good afternoon, everybody. My name is Usman. I'm Cloudflare's head of engineering.
Jen's on vacation, which is great. So we'll see her when she's back next time.
But no matter, I'm very, very happy to welcome three interns from Cloudflare's intern summer projects.
So Alyssa, Meyer and Brianna, I'll ask all of you to just wave.
We were all practicing our waves. That's the Queen's wave.
I think everyone's really good at this except me. But I'm thrilled to have you here and to talk about your projects.
There's an incredible diversity of projects that interns work on here.
And I've often joked, I'm extremely grateful that I didn't have to compete with people with your skills and resumes when I was your age.
So but with that, let's just start. Brianna, I'm going to pick on you first.
Tell us a little, say hi and tell us a little bit about yourself. Be sure to go off mute.
That's important. And tell us a little bit about yourself. Where are you studying?
Where are you from? And how did you wind up at Cloudflare? Yeah. Yeah, I guess I'm originally from Florida.
I'm studying at Stanford. I'm a rising senior, which is crazy, I think, because maybe years go by faster.
I'm studying computer science there, focusing in systems.
And then hopefully, well, I am going to get a master's in our security track.
So Cloudflare fits right into my- I didn't ask you this.
I'm already going off script. What were some of your favorite classes in school and in computer science?
Oh, yeah, that's a good question. I really liked our CS155.
It's like a breadth class of all things security. So you learn networking, web security.
And then I also really liked the compilers class, where we made a compiler, which is really cool.
Do you still use the Dragon book? There was a famous book with a Dragon cartoon on the cover, but at least that was what I had.
They didn't have us use it, but they said we could, but they said they didn't want us buying it.
All right. All right. Very good. All right. So that shows how old I am.
All right. So how did you- how did Cloudflare cross your radar? How did you even hear about us?
Yeah, I guess I mostly heard about it in my CS155 security class. They quoted you guys a bunch, talking about what you guys did.
And I got interested.
Us. Not you guys. You're part of it. Us. And so I got interested in the fact that you guys had an intersection of technically challenging problems, but also a social impact.
And I really liked my team, which is super diverse and super friendly and nice.
So I was excited to get started. That's a perfect segue.
So you interned with the security engineering team. And security engineering is a big cross-disciplinary team at Cloudflare.
It's actually a peer of main engineering that I lead.
And security's job is to keep us safe, keep our customers safe.
So in that enormous challenge, Brianna, what was the sub team you joined and what was their mandate?
Yeah, I joined vulnerability management, which is a team that like Jenna Davenport kind of leads, I guess.
And basically, it's about finding vulnerabilities and managing them and creating a pathway for the future of how we continuously keep an eye on everything and manage everything.
So I'll give the perspective from my side. So as the person who's leading the team with hundreds of software engineers who are constantly writing stuff, we do our very best.
We've got all kinds of training, lots of best practices, lots of libraries, hardened code.
There's so many things we do to make sure that the software we write is bulletproof and is well guarded.
But we know that we're human, ultimately, and attackers are creative, too.
So we're constantly also thinking about, all right, so what could we have missed?
And that's part of the mandate of the team that Brianna joined is to actually, OK, so where could there potentially be a problem?
Where could there potentially be something that engineering needs to look at?
And then how do we get that prioritized?
Because there's a lot of noise in that. There could be a lot of things, well, that could be a problem.
It's not really clear, I guess, maybe, kind of, sort of.
And so Brianna, what are some of the tools and processes that you worked with?
And what was your specific assignment over the summer to try to help address this?
Yeah, I guess I'll address the latter part of your question first, and then go into the tools and details.
The project that I'm working on is Flanscan.
It's an internal network scanner that scans our network. It uses Nmap, which basically detects installed applications and looks at the ports.
And then once it has the application, it sees like, OK, the version of this is like 2.0.
And then it checks that against a database to see whether or not a common vulnerability exists for that version.
And then from there, it creates a ticket for our team to go and address and update that system if necessary.
Excellent. So Flanscan, Nmap. So some of these are famous tools if you've been around.
So Nmap is something that I remember typing on the command line, and it does this thing.
And it's basically taking its best guess of what it thinks is running on the other side.
It's kind of like, you know, the moral equivalent of walking past somebody's house, peering through the windows, and trying to guess what their favorite TV show is based on what you might see, you know, from the outside.
And then trying to be like, well, you know, since they had, you know, since I saw the TV on, I think they like sitcoms.
So maybe we should pay attention to these kinds of things.
So what does that turn into for you then? Like, how does that?
So Flanscan, Nmap, these tools run, what do they wind up with?
And then what does your, where do you take the ball? Where does your challenge start?
Yeah. So I guess the problem is that, because it's guessing at the versions and guessing at the packages, it doesn't know for sure if there is like a common vulnerability there.
So it ends up creating a bunch of tickets and a bunch of reports unnecessarily.
When we go and check, we see, oh, it already fixed that problem in our latest update or whatever we did to that server.
So basically my job is to, I guess, automate that process. So then we don't waste time going and checking production and seeing if it needs to be updated because there have been like thousands of tickets before and it wasn't necessary to go check them all.
So automate how, how did you do that? What was the, what's the magic that you did there to make this mountain of paperwork turn into something manageable?
Yeah. So basically Flanscan is in production, but then in between Flanscan and reporting it, there's this thing called the Flan Normalizer, which is a mediator between like getting the report from Flanscan and then converting it into a ticket for our team to address.
So the Flan Normalizer lives within our cloud platform, where we're able to call into this thing called BigQuery, which lets us query information about individual servers and hosts.
So basically we do a query within the Flan Normalizer, check that information, and do version range checks to see if the actual version is vulnerable.
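As a rough illustration of the version range check described above, here is a minimal Python sketch. It is not the Flan Normalizer's code: the names `VulnerableRange` and `should_open_ticket`, the placeholder CVE identifier, and the idea that the inventory returns a plain dotted version string are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a numeric dotted version such as '2.4.1' into a comparable tuple.

    Trailing zeros are dropped so that '2.5' and '2.5.0' compare as equal.
    Only purely numeric versions are handled in this sketch.
    """
    parts = [int(part) for part in text.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


@dataclass(frozen=True)
class VulnerableRange:
    """Hypothetical record: versions in [first_affected, first_fixed) are vulnerable."""
    cve_id: str
    first_affected: str  # inclusive lower bound
    first_fixed: str     # exclusive upper bound

    def contains(self, version: str) -> bool:
        v = parse_version(version)
        return parse_version(self.first_affected) <= v < parse_version(self.first_fixed)


def should_open_ticket(
    guessed_version: str,
    inventory_version: Optional[str],
    vuln: VulnerableRange,
) -> bool:
    """Prefer the version reported by the inventory over the scanner's guess."""
    version = inventory_version if inventory_version is not None else guessed_version
    return vuln.contains(version)


# Placeholder identifier, not a real CVE.
vuln = VulnerableRange("CVE-0000-0000", first_affected="2.0", first_fixed="2.5")

# The scanner guessed 2.0, but the inventory says the host already runs 2.5.0.
print(should_open_ticket("2.0", "2.5.0", vuln))
```

The point of the sketch is the ordering of evidence: the scanner's guess is only used when no authoritative version is available, which is what lets already-patched hosts drop out before a ticket is created.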
Did you have to write code to get, make all this, all these connections work?
Yes. Good. All right. Excellent. So what, what was the, what were the most surprising things?
What'd you learn and what, what would you, what'd you take away from this experience?
I thought the most interesting was coming away like as a product manager in a little bit of a way because I kind of had to define how my project would work and like defining like how, where and what I would be querying from and like what's allowed, what I can do, which is like very different and a cool opportunity in comparison to like, I guess, bigger companies where I've like had a more defined set of a project.
Meanwhile, this was more exploratory.
And I guess it was also cool to see like how, just because you define a spec doesn't necessarily mean it'll work out that way.
So like recently, like this past week-
The real world gets in the way every time.
So like this past week I was adding in concurrency because there ended up being timeouts: the normalizer runs as a cloud function, and our querying added extra time.
So it was like interesting having to add concurrency and like deal with like all the balancing of that.
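One common way to keep many slow lookups inside a function's time budget is to run them concurrently with a bounded worker pool. The sketch below uses Python's standard `concurrent.futures`; the function `lookup_installed_version`, the worker count, and the deadline are hypothetical stand-ins, not details from the actual normalizer.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def lookup_installed_version(host: str) -> str:
    """Hypothetical stand-in for a slow inventory query for one host."""
    time.sleep(0.05)  # simulate network latency
    return f"{host}:2.5.0"


def lookup_all(hosts: list[str], max_workers: int = 8, deadline_seconds: float = 5.0) -> dict[str, str]:
    """Query all hosts concurrently instead of one after another.

    as_completed raises TimeoutError if the results are not all available
    within deadline_seconds; this sketch lets that error propagate.
    """
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(lookup_installed_version, host): host for host in hosts}
        for future in as_completed(futures, timeout=deadline_seconds):
            results[futures[future]] = future.result()
    return results


hosts = [f"host{i}" for i in range(16)]
versions = lookup_all(hosts)
print(len(versions))
```

Bounding `max_workers` is the balancing act mentioned above: more workers finish sooner but put more simultaneous load on whatever is being queried.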
Awesome. I love it. It's a great example of exactly what we want, honestly, all employees at Cloudflare to be tasked with: here's the problem.
We're not really sure how to solve it. But hopefully you've got the support of your team, you've got the support of your leads, go figure it out.
And you know, you're going to run into all kinds of tricky things. I'm sure worrying about concurrency and timeouts was not at the top of your list.
But that's exactly where your time goes, trying to address all this stuff. Yeah, for sure.
Brianna, that's awesome. Thank you so much. Thanks for joining Cloudflare this summer.
Thanks for all your efforts. I'm really pleased that you were part of it and hopefully had a good experience here.
Yeah, thank you guys so much. Hi, Alyssa.
Hi. Say hi to our audience watching and tell us what team you were on.
I'm Alyssa. I'm calling in from Sacramento.
I'm also going to be a rising senior at UCLA studying computer science and engineering.
I honestly got through most of my CS classes.
So I took up a minor in public affairs. That definitely drew me to Cloudflare.
I really heard about you guys from the outage last summer where someone was like, wow.
That's why you're listening right now. Any publicity is good publicity.
Yeah, I saw your guys' blogs and I was like, wow, this company is so open.
They like really value security. It's like super cool. Excellent. Our blog, our blog, you guys are all part of the family now.
All right. Okay. So Alyssa, you joined the gateway team.
So the word gateway shows up in engineering all the time.
Which part, which gateway are we talking about when we talk about the gateway team that you joined and what was the, what is the function of that team and what is the challenge that you were facing as you joined your project?
Yeah. Which gateway was definitely a question I was always asking when I first joined.
So I work in the gateway that's under Cloudflare for Teams.
So we do a lot of network management for like enterprise people and just people who work together in teams.
Gateway is sort of like a firewall where people can apply network rules and policies to protect their team's traffic.
So I worked on. So let's back up. So let's see.
So, so if I'm an employee at a company and, you know, my, the IT team at this company is like, oh my God, like there's so many threats out there.
We have to make sure that our employees are protected.
Our employees' devices are protected.
So at the very least they can do is say, well, what are you connecting to on the Internet?
Like, as long as you're not connecting to ugly, scary, weirdo stuff, that dramatically simplifies things and gives us a much higher level of confidence.
So the gateway here is let's get everything through this gate and this gateway will then act as a gateway and be like, okay, yes, you can go, you can access this website.
This thing is safe. This is company versus actually no better not do that on a company laptop, because that's exactly how we get into trouble.
I'm right so far, right? Yes, exactly. That's a great explanation.
And we use it ourselves, which is, which is also important.
I mean, obviously if we believe in this product, we should be using it ourselves, which we absolutely do.
So, excellent. So now, with that context: this is a piece of software that our customers connect through, and in particular our customers need control over what is allowed through the gate and what's not, customer by customer.
So what was the challenge in the project that you were working on?
Yeah. So specifically what I worked on was called tenant control.
So for a lot of applications that we use, like we've got a Google doc open right now.
We're using Zoom and we're specifically under like Cloudflare's Google suite or Cloudflare's Dropbox.
I don't know if they actually use Dropbox.
These are, these domains are like not domains.
These are tenants that Cloudflare uses for the company, but I also personally have my own Gmail, my own Slack workspace.
So how does a firewall let me log into drive.google.com for Cloudflare, but also stop me from logging into the same drive.google.com for my personal email?
I'm already stumped. How does a firewall do that?
That already sounds really hard. It's not too bad. So what I had to do was get the firewall to essentially give more information to servers like Google and Microsoft saying, okay, this is the Cloudflare tenant.
And essentially, since, like you said, gateway is taking in all of the network traffic and it's either saying, okay, I'm going to send it upstream to Google, or I'm going to block it and tell this person, no, you can't visit facebook.com.
I just attach a little piece of information and a header that Google will then read, and then they'll do the blocking or allowing for me.
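To make the idea concrete, here is a minimal Python sketch of attaching a tenant restriction header to upstream requests. Google documents a request header named `X-GoogApps-Allowed-Domains` for restricting sign-in to listed domains; whether Gateway uses exactly this header, and the function name `annotate_request` and the `example.com` tenant, are assumptions for illustration only.

```python
# Header documented by Google for restricting sign-in to specific domains.
GOOGLE_TENANT_HEADER = "X-GoogApps-Allowed-Domains"


def annotate_request(host: str, headers: dict[str, str], allowed_domains: list[str]) -> dict[str, str]:
    """Return a copy of the headers, labeled with the allowed tenant for Google hosts.

    The proxy does not decide which account is being used. It only labels the
    request, and the upstream service does the allowing or blocking.
    """
    annotated = dict(headers)  # leave the caller's headers untouched
    if host == "google.com" or host.endswith(".google.com"):
        annotated[GOOGLE_TENANT_HEADER] = ",".join(allowed_domains)
    return annotated


labeled = annotate_request("drive.google.com", {"Accept": "*/*"}, ["example.com"])
print(labeled)
```

Other providers use their own header names and value formats, so a real implementation would need one such rule per supported application.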
I see. So we are actually in cahoots. We're partnering with the applications that have this and labeling the traffic on the way.
And then the other side, the one that actually knows whether this is a Cloudflare account or whether this is Alyssa's personal account, can say: actually, yes, she can come in to see the notes that Usman created for this Cloudflare TV session, but she shouldn't be checking the plans she has to go hiking with her friends later today, because there's a link on that page that might be problematic.
And so it's actually a partnership with our providers, and gateway has to annotate those requests in a way that the partners can actually consume.
Exactly. Yeah. Awesome. So, okay. So given that we now understand what you had to do, how did you do it?
Well, how many different parts of the architecture did you have to touch to make this work?
Yeah. This honestly was one of the scariest things for me at first, because I learned that even though we've got just the Cloudflare gateway, which is a firewall, right?
We've got the actual firewall, which lives in Cloudflare's edge.
And then I had to work on Cloudflare's API so that IT admins who want to actually create their rules can go through the API.
And I also had to work on our team's dashboard in the UI. So there were lots of different moving parts there.
And I'm learning there's more to gateway, hopefully working on there as well.
That's fantastic. So did you actually, were you able to contribute code that landed on all the parts of the architectures, the control plane, as well as at the edge?
Yeah. Yeah, absolutely. I was super surprised when I first got to be able to release first to the API, then to the edge, which is really scary because I don't know what's the number of how many metals we have on the edge.
Yeah. I think it's well over 10,000 and tens of millions of requests a second.
The edge is serious business. So just for the audience, when we talk about the edge, what we're talking about is Cloudflare's global network and in hundreds of cities around the world, the thing that actually processes the traffic.
And so Alyssa's code, when ready, when it's gone through all the zillions of rigorous checks and it's time to actually get deployed means it is running at the edge for all those requests and labeling for the part of the product that she was working on.
Her code is now running in all those and all those data centers around the world correctly labeling that traffic.
So that when it goes back to Google or Dropbox or Slack or whatever the other tenants are, that it's correctly annotated.
And it's important to get it right, because if you get it wrong, that means it won't make sense to the other partner, or it could potentially confuse or offend.
And that's why there's so many checks and balances to make sure that everything is double checked.
Talk to me a little bit about the difference between the core and the edge.
Did those feel like programming in completely different environments?
Or did it feel like mostly it's the same? Or what were the challenges there?
That's a good question. At least for my team, they're in completely different languages.
So it's kind of like code switching. Maybe I don't have to use that phrase.
Definitely, it's a lot easier to work on the core.
We're able to release a couple times a week, whereas in the edge, once you let your code out, it's going to take several hours to release.
And you have to make sure that it's efficient.
It's not going to take up too much memory and stuff like that.
Yeah. And so it's interesting. I think conceptually the edge in some ways is simpler, because it's dealing with, it gets a request, has to tag it, has to label it, move it on, firewall, block it, get it through.
But the rigor of the edge is so much more.
The room for error is very little, and the efficiency and the security and all that great stuff.
And meanwhile, Brianna's team is poking around and they're looking everywhere, like what could potentially, and she's running Nmap on everything you just ran and everything you stand up.
And so all of this is connected.
That's fantastic. Well done. That's a great achievement in one summer to be able to be part of a product and ship both its control plane and its data plane work at the same time.
Same question I had for Brianna. What was interesting? What did you learn?
What was surprising? I guess I was really surprised by how, again, how open the company was.
Whenever I had questions, which I had a ton of questions, obviously, there was always someone I could talk to almost immediately.
Everyone on the team, even if they didn't know exactly the answer to my question, would take time out of their day to figure it out, and then explain it to me and make sure I understood.
Everyone was just so supportive and really great mentors.
I'm so happy to hear that. I will say, I think I've been at a bunch of different software companies in my career, and I think if there's one ingredient that all the successful ones have, it's this incredible willingness to learn and teach at the same time.
It's just when you're inventing the future and you're literally building things that no one has ever seen before, there is no Wikipedia page you can just go bookmark and go read.
Even if you've done your best to write things up in internal docs, the truth is all the knowledge is floating around scrambled in pieces in all of our heads.
The only real weapon against that is tremendous empathy and curiosity, constantly trying to ask, how does this work this way and why?
And when someone asks you why it works this way, your gut reaction is, well, let me tell you.
By the way, do you have a better idea of how it could work?
Because that was the best I came up with at the time.
It's excellent. Well, thank you so much again. I'm so pleased that you had a great internship.
It sounds like you were able to have a real impact on a product that is super valuable to our customers, so well done.
Congratulations.
Meyer, how are you? Howdy, howdy. I'm doing great. Say hi and tell us where you're from and what you've been working on.
How did you hear about Cloudflare?
Whoa, okay. I will start with, I'm from Dallas, Texas originally. I go to school in Austin at UT Austin.
I'm a rising sophomore studying computer science.
How did I hear about Cloudflare originally? It's kind of a fun story. I grew up with Cloudflare.
When I wanted to start a Minecraft server in, this must have been late elementary school, Cloudflare was there for me with free DNS.
It's always been in the back of my mind.
And I think I had the opportunity to interview and there happened to be a role for me here this summer and it was a really great experience.
That's fantastic. What team are you part of and what does that team do?
Sure. So I joined the Magic Transit team. Cloudflare, for most of our history, has protected individual websites at layer seven.
Magic Transit, which just turned two years old today, protects entire company networks at layer three.
So we go a lot deeper down into like sort of the deep underground of the Internet, the nitty gritty of how things work.
And then we protect companies from these huge scale denial of service attacks that can be launched today.
And we integrate with all of these other products like Gateway, like Firewall to offer those services.
That's fantastic. So Magic Transit is one of the most exciting projects that has actually come out.
And it's interesting that you had to remind me that it hit two years today.
In some ways, it feels like it's been around for a long time.
In other ways, it feels like it just showed up depending on my perspective.
But yeah, it's exactly right. It's literally the front door of your IP space.
So this is companies that own their own IP addresses. They basically sign a piece of paper that says, actually Cloudflare, you tell the world that you're the front door and get all of that traffic, no matter what it was, anything that was due to my address, you pick it up and then block it if it's bad and send it through and use all the Cloudflare products.
And the magic in Magic Transit is very interesting, right?
So I think part of what is going on here is this notion of these very interesting tunnels where, okay, so Cloudflare has received generic IP level traffic, and now it's got to make its way if it's due back to customer origins.
And it does that over a tunnel. And the tunnel is asymmetrical. There's a whole sort of magic trick, sort of rabbit in a hat kind of thing going on here where it goes down the tunnel, but then comes back outside of the tunnel.
And the effect is exactly what both the customer and the eyeball want.
Talk a little bit about that.
And what are those tunnels and what's on either side of them? Yeah.
So you mentioned that we sort of act like the front door. It's sort of an interesting analogy because what happens is that the traffic comes into Cloudflare.
We scrub it at all of our locations around the world.
So we filter just the traffic that we think is authentic that should actually be delivered to the customer.
Well, what happens next?
We actually just, most of the time, spit it back out onto the public Internet.
And we do this through a technology called Generic Routing Encapsulation (GRE) tunnels.
And what that means is we have sort of a private, so to speak, connection between us and the customer, even over the public Internet or however we're connected to them, whether they have a direct connection to us in one of our locations.
To us, it looks the same. We're just shoveling traffic into a tunnel and trusting that it will be delivered to the customer.
And then for the customer, all they have to do is accept only the traffic that comes from Cloudflare.
They don't have to do anything else.
They don't have to look at, have we gotten a lot of traffic from this address?
Is this a known attacker address? They just have to filter out everything that's not us.
Excellent. So from their perspective, they basically have this scrubbed clean pipe of information that by definition is important and useful to them.
Right. And a lot of, sorry, no, no, great. The real magic in Magic Transit is what a lot of customers do is we're so good at scrubbing that traffic and then delivering it to them that they actually configure multiple tunnels with us.
And they tell us, okay, if you have trouble delivering traffic over this tunnel, try the other tunnel.
And that way, no matter what is happening on their side or on the public Internet, if a router between us and them is having a bad day and it's dropping packets, we are delivering their traffic to them.
We are delivering on the core promise of transit.
That's the real magic in Magic Transit.
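The failover behavior described above can be sketched in a few lines of Python. This is only an illustration: the `Tunnel` record, its `priority` field, and the rule "lowest priority number wins among healthy tunnels" are assumptions, not Magic Transit's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tunnel:
    """Hypothetical view of one customer tunnel."""
    name: str
    priority: int   # lower number means more preferred
    healthy: bool   # latest health check verdict


def pick_tunnel(tunnels: list[Tunnel]) -> Optional[Tunnel]:
    """Choose the most preferred tunnel that is currently healthy, if any."""
    healthy = [t for t in tunnels if t.healthy]
    return min(healthy, key=lambda t: t.priority) if healthy else None


tunnels = [
    Tunnel("primary", priority=1, healthy=False),  # a router on this path is dropping packets
    Tunnel("backup", priority=2, healthy=True),
]
chosen = pick_tunnel(tunnels)
print(chosen.name if chosen else "no healthy tunnel")
```

Everything interesting hides in the `healthy` flag: the selection is trivial once each tunnel's health is known quickly and accurately.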
And that's the project that I got to work on this summer is how do we know where to send customer traffic?
Which one's healthy, right? You got a bunch of patients in front of you.
Let's stick a thermometer in each one of them and see who's in the mood to receive traffic.
So how do we do that? And what was the challenge that you had to address?
Sure. So the way that we do that is we send health checks along each of these tunnels every so often.
And we ask the customer basically to when they receive this health check, just respond the way that they would to any health check request.
And that packet makes its way back to us. And then we say, okay, this tunnel looks healthy.
Or it doesn't come back. And we say, okay, we're having a problem delivering traffic across this tunnel.
So the problem with that is we are very big.
We have a lot of servers all around the world.
And if every server is sending health checks on its own, we started off at once per minute, we are sending a lot of health checks to the customer.
On the other hand, if the server can only send health checks once per minute, and the customer's tunnel goes down, it could take up to 60 seconds for us to detect that.
It's a long time to lose your whole connection.
That's a long time. Yeah, exactly.
I mean, all of your company's work just stops. So that's obviously that's not an acceptable delay.
So the observation was that when servers within the same data center send traffic out to the Internet, it usually takes the same path once it leaves the Cloudflare data center.
So what we started doing was having the servers in each data center multicast the results of each tunnel health check to every other server in the same data center.
And so that way, they're sort of all sharing these insights. It's like you and I are standing at arm's length.
I feel a raindrop. Instead of just declaring it's raining and opening my umbrella, I talk to you, and you're like, well, the sky is clear.
I haven't felt anything. And then we're both like, okay, maybe we'll watch this a little more.
It's not worth an umbrella right now. But we'll keep an eye on this, right?
Technology is awesome. But what if I say, yeah, actually, I felt a raindrop too.
Then we might pull the umbrella. The problem with that approach, though, and this is sort of the project that I got to work on this summer, is some of our data centers are very large, and we're onboarding a lot of customers with more and more tunnels.
And if we keep that up, there's an N squared term in multicast, where each server in the data center needs to be able to process a message from every other server for every tunnel once per minute.
In our largest data centers, we were rapidly approaching half a billion messages a minute, and we just weren't able to keep up.
So this summer, I had the opportunity to just redesign how we do it.
And instead of sharing individual tunnel health results, now we assign the responsibility for checking the health of each tunnel to one server in the data center.
That server can send health checks much more frequently, right?
It can do it even once a second or once every couple seconds. And then it sends out a weather forecast or weather report to every other server instead of individual observations.
And so just by doing that, we have decreased the CPU usage of Magic Transit's control plane on the edge by like 60% or more.
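Here is a small Python sketch of the two ideas in the redesign described above: giving each tunnel exactly one responsible server, and counting how message volume changes as a result. The source does not say how the owner is chosen; rendezvous hashing is used here purely as an assumed technique, and the server names, tunnel IDs, and counts are made up for the example.

```python
import hashlib


def owner_for_tunnel(tunnel_id: str, servers: list[str]) -> str:
    """Pick one owner per tunnel with rendezvous (highest random weight) hashing.

    Every server can compute the same answer locally from the same inputs,
    so no extra coordination is needed to agree on who checks which tunnel.
    """
    def score(server: str) -> int:
        digest = hashlib.sha256(f"{server}|{tunnel_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=score)


def all_to_all_messages(servers: int, tunnels: int) -> int:
    """Old scheme: every server shares every tunnel result with every other server."""
    return servers * (servers - 1) * tunnels


def owner_report_messages(servers: int, tunnels: int) -> int:
    """New scheme, unbatched upper bound: each tunnel's owner reports to every other server."""
    return tunnels * (servers - 1)


servers = [f"server-{i}" for i in range(100)]  # hypothetical data center size
print(owner_for_tunnel("tunnel-a", servers))
print(all_to_all_messages(len(servers), 50), owner_report_messages(len(servers), 50))
```

With these made-up numbers the message count per round drops by a factor equal to the number of servers, and batching several tunnels into one report would reduce it further. That headroom is what makes it affordable to check each tunnel far more often than once per minute.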
Fantastic. And yet, we haven't lost any visibility into what we're trying to do.
Yeah, we have better observability now.
We have better reaction to changes in Internet weather, and we're ready to scale for the next two years of Magic Transit.
This is such a great project.
I could keep talking to all three of you about everything. We only have 90 seconds left.
So just very briefly, Meyer, what was the most interesting thing you learned at Cloudflare?
I think the most interesting thing I learned...
I was a freshman last year. This was my first experience in software engineering where the stakes are pretty high.
And I think I learned that things break in really mysterious ways when you 10,000x a problem.
So I think I really took a taste of humility this summer.
My first intuition is not always correct.
And I always really need to slow down and think about problems, talk to other people, ask the questions before just jumping headfirst into the project.
The other way we like to say this, when you do something 30 million times a second, one in a million happens all the time.
One in a million is not rare at all. One in a trillion is not rare at all.
So that's part of the really interesting and challenging part of working at scale.
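The arithmetic behind that remark is easy to check. The short Python sketch below takes the 30 million operations per second quoted above and works out how often an event with a given rarity would be expected; the function names are invented for the example.

```python
def events_per_second(operations_per_second: int, one_in: int) -> float:
    """Expected number of 'one in N' events each second."""
    return operations_per_second / one_in


def hours_between_events(operations_per_second: int, one_in: int) -> float:
    """Expected time between 'one in N' events, in hours."""
    return one_in / operations_per_second / 3600


RATE = 30_000_000  # operations per second, the figure quoted in the conversation

print(events_per_second(RATE, 1_000_000))                  # one in a million
print(round(hours_between_events(RATE, 10**12), 2))        # one in a trillion
```

At that rate a one-in-a-million event is expected about 30 times every second, and a one-in-a-trillion event roughly every nine hours, so neither can be treated as something that will never happen.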
All three of you, Meyer, Brianna, Alyssa, it was such a pleasure to have all of you on.
I'm so glad that you were part of the Cloudflare team this summer.
I hope we see you all again. And I just want to thank everybody, thank all of you for being part of it.
And thanks, everyone, for watching. Thanks so much for having us, everybody.
Thank you.