Story Time
Presented by: Sam Rhea
Originally aired on September 30, 2021 @ 9:30 AM - 10:00 AM EDT
Join Cloudflare CTO John Graham-Cumming as he interviews Cloudflare engineers and they discuss a "war story" of a problem that needed to be solved — and how they did it.
This week's guest: Cloudflare Director of Product Management Sam Rhea
Transcript (Beta)
All right, good morning from Lisbon, where I've got my close neighbor, not that far away, Sam Rhea.
Sam, we were just talking about your job title and I had actually downgraded you, I'm afraid, so you're actually director of product management, right?
Yes, it's relatively new, so sometimes I forget that myself. So tell me about that.
So what are you directing in product management? I look after our set of products that we call Cloudflare for Teams, which started years ago as our Access product and has since evolved in a number of different directions, and it's a really fun team.
My job now is I get to work with a bunch of really bright product managers and engineering leaders to build this vision that started as that kind of fun prototype a few years ago.
And this reports into, I mean, not many people really understand the structure of Cloudflare, this reports into a different team than the engineering team, right?
Yeah, it does. We have a group at Cloudflare called Emerging Technology and Incubation, or ETI, and ETI has a pretty simple mandate where we want to constantly think about the new types of problems that our customers have that we can solve on Cloudflare's existing network.
What else can we build on that network to address new problems?
And things like Cloudflare for Teams, and also Workers and our Registrar product, have kind of come out of those questions, but many of them start with our own internal problems that we have to solve first.
And just to sort of continue on this theme with the structure of Cloudflare, I think one of the interesting things about Cloudflare is I saw somebody tweeted the other day, Cloudflare has obviously found the secret to unlimited engineering staff because we've obviously done a lot of announcements.
So was it two weeks ago now we did Birthday Week, we announced a whole bunch of stuff, and this week we're doing Zero Trust Week, which you're really leading in terms of what we're doing.
Cloudflare really, really, really read Clay Christensen's book, The Innovator's Dilemma, and said, we have to reinvent ourselves as we go forward.
And so we actually have three separate engineering teams basically.
So there's a small team that reports to me and that consists of our research team, which does like three to five years out kind of stuff.
And there's actually a small number of staff engineers who roam around from project to project.
There's the engineering team, which partners with product and produces most of the stuff that Cloudflare is producing, right?
You go in the dashboard, and it's most of those things in there. So if you think about our caching and DDoS mitigation and WAF and load balancing, all this stuff is there.
And then there's your group, which is the ETI group, which is then working on somewhere between those products and the products that the research or the things the research team might do.
So having given that introduction, what we announced yesterday, this thing called Cloudflare One, which is a single unified network for the corporation today and tomorrow, actually consists of a bunch of stuff that ETI started incubating a while ago.
So what was the first of those? Yeah, the first was our Access product.
A lot of these have kind of DNA that goes back years and years in Cloudflare, but the product within the set of what we announced yesterday that has maybe the most obvious kind of evolution that got us to this week is our Access product.
We, like a lot of other organizations, were keeping applications and resources safe just by nature of being on a private network.
So the security model was, if you're on this private network, you can reach this application or this resource.
That's kind of a weird stand-in for a more traditional obvious security model where if I'm in the office, I'm allowed to reach that resource.
And the VPN was kind of this pretend office that extended the private network of the office to people who were working outside of it.
And that felt really discordant, I think, for us as an organization to run Cloudflare's massive global network to connect to data centers in 200 cities around the world through a physical box in San Francisco.
And the kind of origin of this was Grafana.
If our audience is familiar with Grafana, it's a very popular charting tool.
You use it to monitor alerts or events that you're trying to measure in your applications or in your infrastructure.
It's typically the place that people go to first if they get a PagerDuty alert.
If something's wrong, I want to understand just what flavor of wrong is it right now.
And when you get a PagerDuty alert, you have to run to your laptop, and the first thing that you do is fire up the VPN and maybe punch in your VPN credentials, which are separate from your SSO credentials and separate from other things.
If you got that PagerDuty alert on your phone and your laptop's in another room, or maybe you're out walking the dog, you have to run back.
So it became this real burden to go just check what is going wrong.
How do I diagnose this problem? And a team of engineers within Cloudflare thought, hey, wait a second.
We run a network that can check a request against some criteria.
We do that at scale. What if we added identity into that criteria?
And Cloudflare's network has kind of always fit this funny term, Zero Trust, where even before Zero Trust was a term, our network was thinking about every request headed to a resource protected behind it.
And so adding identity into it was a really natural evolution. And so it started as that little prototype to make it easier to get to Grafana.
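The idea of adding identity to the criteria a request is checked against can be sketched in a few lines. This is only an illustrative sketch, not how Access is implemented: the `Request` type, the `is_allowed` function, and the email-domain rule are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical request shape: a path plus an optional verified identity."""
    path: str
    user_email: Optional[str]  # None when no verified identity is attached


def is_allowed(request: Request, allowed_domain: str) -> bool:
    """Allow the request only if it carries an identity from the allowed domain."""
    if request.user_email is None:
        # Being "on the network" is not enough: no identity, no access.
        return False
    return request.user_email.lower().endswith("@" + allowed_domain.lower())
```

For example, `is_allowed(Request("/grafana", "alice@example.com"), "example.com")` passes, while the same request with no identity attached is rejected regardless of where it came from.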
And what you may not know about the history of this is that when I joined Cloudflare, I was the only person not in San Francisco.
And so that VPN performance was atrocious, because I was VPNing from London and then getting into Jira and other things in the Atlassian suite remotely.
And it was, you click on something and it's like, and now you get the page and stuff like that.
And so actually fixing that was a huge problem. And of course, some of the way that got fixed as we got bigger was to say, well, we've got a group of employees in London and let's put a VPN concentrator in London.
And now we've got the London VPN connection. And yeah, actually that kind of worked for a bit, but then of course you get a team in Lisbon and then you get, and then COVID-19 happens and nobody's anywhere where there's a convenient connection point.
So yeah, this architecture really doesn't work very well.
Yeah. It needs to be something that matches kind of where both applications live and where users live rather than some choke point.
And just the evolution of Access: I think we should write a blog post someday about how the evolution of Access really represents a lot of fun developments within Cloudflare, because for a while Access was a service that primarily ran in our control plane, which meant that you were still able to connect to a resource through Cloudflare's network.
It's just that, if you were authenticating, that happened in a few central locations.
And that meant if a central location had some impact or some service degradation, you didn't get all the benefits of a distributed authentication engine.
And so I think the most important thing that ever happened to the Access product was, it must've been a year and a half ago now, maybe two, when the Access engineers put kind of a full stop on new feature development and said, we're not just going to do a distributed model, we're going to move Access to run on Workers.
Right. Which means every Cloudflare data center is now capable of, every isolate running in a Cloudflare data center is now capable of the full Access functionality.
And we saw the time of logins go from hundreds of milliseconds down to three or four.
Right. Because it's all happening there at the edge.
And if the central control plane is undergoing maintenance or disrupted for any reason, no one notices.
No one notices.
Yep. And given the scale of our network, more than 200 cities where we have hardware, you're connecting to something local to you.
And so it's going to be fast.
So yeah, it was just magic. We on the Access team are customers of Workers, in some sense.
And we had Serverless Week a couple of months ago now, and we were just like everybody else, following along with the blog posts, like, oh, this is going to be great for Access.
Yeah. Actually, on our internal wiki,
there are a number of pages about using Workers Durable Objects, which we announced.
It's like, oh, wait a minute. We can take that thing that we were doing in Lua or in C in some cases, or in Rust, and actually we can run it directly on the edge and it'll scale automatically.
So yeah, it's fascinating to be a customer.
Another thing that we were a customer of, I think was, we've tried to have some additional security around things.
So for example, recently we announced hard key support for Access.
And that was a big one for me because it's pretty amazing how many phishing attempts we get per day against our employees.
And some of those are coming by email and some of them by phone, right?
People phoning up and saying, Hey, this is the IT department or whatever.
And for the internal management tools of Cloudflare, we really wanted to enforce not just two-factor.
We've had two-factor forever, but you have to use, here it is, my YubiKey, you have to use a YubiKey to actually authenticate and prove who you are.
So that was an interesting feature to add, because that's not necessarily provided by the single sign on people.
No, it's not. I think it's one of a string of really fun things that we've gotten to build by being best friends with Cloudflare security.
I think a lot about, we're the luckiest product team in the world because we get to sit next to, or at least virtually now, but we used to get to sit next to just some of the best security people in the business who were pretty unvarnished in their feedback about what they wanted us to go build to keep Cloudflare safe.
And when we were back in the Austin office, which is where a lot of the Access team and the Teams team is located, the security team sat on the other end of the office, not a big space, and they used to joke that, since I'm something of a caricature and wear cowboy boots, you could tell from the speed of the clop, clop, clop, clop how urgent a question I had for the security team about a new idea that we had.
And they come to us with a lot of really fun questions to the point that we used to all meet up in the Austin office to kind of do hackathon sessions together.
A lot of Access features were built in collaboration with them, but the hard key one's a great example.
We were looking at the Twitter breach, which started with someone who had hijacked one of the weaker forms of multi-factor auth.
And if you think about multi-factor auth, it's not just on or off, it's a spectrum of security.
There's SMS auth, which is relatively less secure because you can do a SIM swap attack or something else to hijack it remotely.
And on the other end of the spectrum is the hard key, this secure cryptographic exchange that can only take place if I'm in physical possession of that device, which means for someone to take over my account, they have to know my password and physically steal this thing.
And we were thinking about the Twitter hack with our own internal tools in mind, and how do we make sure that all of our internal tools require that harder, more rigorous form of MFA.
And our SSO options didn't provide that.
That was kind of an all or nothing thing. And our security team came to us and just said, hey, can you please build this?
Yeah. And my favorite part about this is I initially recommended the wrong approach.
One of the security team members and I were talking the day after the breach, and he's like, we want to build this.
I'm like, all right, cool. Would it look like XYZ? And he goes, no, totally not.
And just the fact that we have that relationship that the security team is willing to say like, no, this idea that you had is bad.
What we really want is one, two, three, right?
This other implementation. And we went out and built it.
And now it's part of the Access product that all of our customers can use.
Yeah. You know, that's been really cool. The other thing I thought was interesting is, so we went public over a year ago now, and you didn't know this, but for six months before that, actually more than that, we were working with a bunch of external contractors from like, you know, from accountancy, from, you know, all of the internal systems, particularly around finance processes, because you have to get yourself in a position where, you know, if you're going to go public, then, you know, an investor wants to be sure that, you know, your results are true, right?
And you're able to match an invoice up with a payment and all this kind of stuff that actually requires a lot of engineering.
And so we had external consultants doing that, which who then needed access to our systems for certain things.
And they might not, they, you know, what we were actually doing at the time was provisioning them in our, you know, source of truth.
And that was a real pain.
So there's another feature of Access, which I think is really interesting, which is this multi-SSO thing.
Yeah, Access doesn't just integrate with one provider, we can integrate with multiple providers.
But what's fun about that is we can do it simultaneously.
So if you have employees, they can log in with Okta.
And then we actually integrate with things like GitHub, and LinkedIn, and even Facebook, so you don't have to onboard everybody to your core system of truth about user identity.
Instead, Access is a bouncer just standing at the door, and it knows, you know, what types of identities to check based on the rules that are configured.
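The "bouncer" picture, where several identity providers are accepted at once and rules decide which identities get in, might look something like the sketch below. Everything here is hypothetical: the `Identity` type, the `admit` function, the provider keys, and the example domains are invented for illustration and do not reflect Access's actual configuration format.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Identity:
    """Hypothetical verified identity: which provider vouched for which email."""
    provider: str
    email: str


# Assumed example rules: employees sign in with one provider, contractors with another.
RULES: Dict[str, Set[str]] = {
    "okta": {"example.com"},
    "github": {"contractor.example"},
}


def admit(identity: Identity, rules: Dict[str, Set[str]]) -> bool:
    """Admit an identity if its provider is configured and its email domain is listed."""
    allowed_domains = rules.get(identity.provider)
    if allowed_domains is None:
        return False  # provider is not configured at all
    domain = identity.email.rsplit("@", 1)[-1].lower()
    return domain in allowed_domains
```

The point of the design is that a contractor never has to be provisioned in the employee directory: `admit(Identity("github", "pat@contractor.example"), RULES)` succeeds on its own rule.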
And I think that was a feature in the product that we kind of stumbled upon.
Right, right. We were building the original, okay, you can integrate identity into access.
And then someone kind of came to us.
And I think almost inadvertently integrated multiple sources of identity. And it worked.
Right. And they were like, Yeah, this is really exciting for us, because we have contractors.
And we're like, Oh, that's great. Yeah. And we've since made the tooling a bit more well suited to that use case.
But it's another place where it's kind of someone came to us with an idea that our network could solve for them.
And now it's a kind of full, significant feature of the platform, which is really exciting.
And the platform isn't standing still, right? So there are announcements this week around Access itself.
Yes. The most recent announcement, I think, was the one around endpoint security, right?
So you can then say, you know, it's this person using this hard key to authenticate themselves.
And it's on this device.
Yeah, we think about Access a lot as an identity aggregator, where your identity, or your demonstration of trust, isn't just that you logged into your identity provider. There are other heuristics we may want to check, especially depending on the severity of what you're attempting to connect to.
So there are some really sensitive applications where we want to enforce that hard key auth and then integrate with an endpoint protection provider that says not only is John connecting from a country where we think John operates, with a hard key, but he's connecting from a healthy corporate device, before we allow him through.
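Combining those signals (how strong the MFA method was, where the user is connecting from, and whether the device is healthy) can be sketched as a single policy check. This is a minimal illustration under stated assumptions: the `MfaMethod` ordering (including the TOTP tier, which the conversation does not mention), the `Session` and `Policy` types, and the `evaluate` function are all hypothetical names, not part of any Cloudflare API.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import FrozenSet


class MfaMethod(IntEnum):
    """Assumed ordering of MFA strength, weakest to strongest."""
    NONE = 0
    SMS = 1       # can be hijacked remotely, e.g. by a SIM swap
    TOTP = 2      # assumed middle tier, added for illustration
    HARD_KEY = 3  # requires physical possession of the key


@dataclass
class Session:
    user: str
    mfa: MfaMethod
    country: str          # country the connection appears to come from
    device_healthy: bool  # as reported by an endpoint protection provider


@dataclass
class Policy:
    min_mfa: MfaMethod
    allowed_countries: FrozenSet[str]
    require_healthy_device: bool


def evaluate(session: Session, policy: Policy) -> bool:
    """Every configured signal must pass before the session is allowed through."""
    if session.mfa < policy.min_mfa:
        return False
    if session.country not in policy.allowed_countries:
        return False
    if policy.require_healthy_device and not session.device_healthy:
        return False
    return True
```

A sensitive application could then set `Policy(MfaMethod.HARD_KEY, frozenset({"PT", "GB"}), True)`, while a low-risk one relaxes `min_mfa` or drops the device requirement.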
And again, that is another feature that came to us directly by way of our security team, one of whom texted us just saying, hey, it's really important that we restrict this to corporate devices.
How do we do that? And that started the journey. And I think what's interesting about that is this, this sort of rolls into the new architecture for corporate networking, which is if you're not going to trust the network, what are you going to trust?
You're going to trust knowing who the people are, and also actually the systems, so authentication in general and all the things around that, and then how they're connecting.
So sort of where they are, is the link encrypted in some way, the authentication service that they use and the user's hard key, but then also, I've got my laptop in front of me: is it actually the laptop that Cloudflare provisioned?
And have I not installed something on it that was, you know, potentially risky?
Yeah, if we're in this world where we're all working from home, we don't, you know, on the Internet, anyone can be a dog, but you know, we don't know who's connecting and what are they connecting from?
And we want to shift this corporate network model into running on the Internet, that comes with the consequence that we have to have better, smarter controls over how are people able to connect and what they're able to reach, but also in a way that I think doesn't really degrade their experience significantly.
Like we want this to be something that actually improves the administrator and user experience as well as the security posture.
Well, yeah, I mean, for me, everything that I have to get access to which is behind Cloudflare Access is easy.
And anything that was behind the VPN, and luckily, I no longer have to use the VPN, it's actually been uninstalled from my machine, was a total pain, right?
There's all sorts of problems of authentication, of disconnecting and reconnecting.
And, you know, it just seemed like a very fragile thing.
And then, you know, with Access, it's just simple. And I think your point about the security posture and ease of use going together is very, very important, because, you know, for years and years, we've had to educate users about what they can and can't do, watch out for phishing emails and stuff like that.
And what you really want to do is make the security just seamless.
It's like, oh, yeah, this is how I authenticate.
It's dead easy, done. Yeah, you want people to almost ask for it.
I think a lot about where I'm from in Central Texas, not my generation, but not too many generations ago, not all roads were paved.
And so people would in the small towns in the hill country drive their cars over unpaved roads, which is both dangerous to the car, dangerous to the passengers and a pretty miserable experience.
And it was expensive to pave all the roads, people thought this is a huge undertaking, we're not necessarily going to start it just yet.
And so what a few local cities in Central Texas would do is they would pave one mile of road so that people driving on the road would have this miserable experience, it's dangerous for them in the car.
And then they get onto the paved road and for one blissful mile, they'd see the future.
They think, wow, this is smooth, this is good for my tires.
I can actually hear the passenger talking next to me. And then at the end of the mile, they'd go back to the bumpy road.
And they'd immediately go home and phone city council and say, let's start the bond election for paving all the roads.
And when people roll out products like Access, we want that to be the experience.
We want this to be something where, when a team adopts a model like Access, or now more broadly Cloudflare One, the experience is so great that the people who interact with it as part of their day go tell their team, hey, we've got to get to this model too.
And it's almost something that people ask to pave all the roads, as opposed to having to kind of, dang it, we're raising money to pave the roads, it's going to be multiple years kind of thing.
Now, of course, the other thing that happened was in March, once the pandemic really got going, a lot of companies were forced into, you know, everyone's now working from home, everyone who can now working from home.
And that kind of really put strain on the old model of corporate networking VPNs.
We've been trying to meet people where they're going.
And it started, I think, a few years ago with how people connect from anywhere.
That was when we first launched Cloudflare Warp, a service that really, for the first time, turned our network in the other direction.
Most people think of the services you mentioned in Cloudflare, the WAF, the DDoS protection, those are delivered as a reverse proxy.
Cloudflare stands in front of the resource being protected and can shield it from attack.
But our network can flow in both directions. And so a year and a half ago, I believe it was, we released a consumer mobile application built on a next-generation technology called WireGuard, with our implementation of it called BoringTun, which will take all of the traffic from my device as just a consumer user and route it through Cloudflare's network, which for me as a consumer user gives me a few benefits.
One, it means my traffic is encrypted and protected from things like spying ISPs.
But also it means I get to benefit from what Cloudflare knows about where the Internet is fast or slow.
And in building that, we ran into, at the consumer level for millions of users, a bunch of funny edge cases, like IPv6 behavior for certain carriers in Italy, things that you wouldn't necessarily know unless you were dealing with it at that scale.
And now, also this week, we're repackaging everything that we learned building Warp for millions of users who have a choice, right?
You don't have to use Warp if you're a consumer user.
You have to want to use it. We're packaging that up in a way that, for organizations, wherever their employees are, their corporate devices now just have a seamless on-ramp to the Internet that routes through Cloudflare, where we can keep that traffic safe.
Yep. And Access gets a bunch of new features and we're going to take it from there.
The other thing is I want to go back to Cloudflare One.
What is Cloudflare One? Yeah. Cloudflare One is what we want to help our customers move to as the future of corporate networking.
We want to let our customers run their network on Cloudflare's network.
They have better things to do, quite frankly, than have to worry about how are connections being made available, how are connections routing, how are connections being secured.
They run their business. I'd like to point out that it should have been called Cloudflare on, okay?
You need to go buy expensive appliances.
They've had to deal with constant upgrades, constant configuration challenges, the now overwhelming burden of supporting employees working everywhere.
That's pretty miserable. We were talking to customers who said, why can't I just run my network on yours?
Benefit from the security controls that you've enforced, benefit from the routing that you're able to do, whether it's my office or data center or employee laptop in Portugal.
Cloudflare One really brings together all these products that we built to solve components of that problem, whether it's Access to solve identity and Zero Trust rules, or Gateway to solve outbound filtering, or products like Magic Transit, which are really exciting and are meant to be protection for your entire corporate network, now delivered at Cloudflare's edge. It brings it all together into just one simple solution that customers can use to, again, stop worrying about their own corporate network and just get to focus on why they built their business, what their business does.
Right. Cloudflare One got launched yesterday, but actually, essentially, people have been using Cloudflare One for a while, right?
We have customers using Access and Gateway and Magic Transit.
Yeah, there are some Cloudflare One customers out there who don't know it yet, in some sense.
They, or yesterday, they saw it and they're like, oh, this makes sense now.
And a lot of it is because Cloudflare One is a direct response to how customers perceive this problem.
So customers who've had this problem already, who have been buying the different products within our portfolio, kind of helped us define what Cloudflare One is and what Cloudflare One should be, because we kept seeing a pattern of customers who would come to us and say, you know, I want Access for apps.
I want Gateway for filtering.
I want Magic Transit as an on-ramp and for DDoS protection.
And this pattern kept resonating with us, and we thought, we can make this even better and easier and make these pieces fit together a bit more seamlessly.
The other thing that has been going on since March, and my impression is that it has accelerated lately, is attacks with ransoms attached to them.
And this is, you know, because people are now working in different places, the Internet has become so critical for businesses to keep going that we're seeing a bunch of that kind of activity, right?
Yeah. It's really terrifying. It's a very sophisticated type of attack, as opposed to maybe a more vanilla DDoS attack that's focused on a website, which can be attempted by, you know, any number of bad actors.
These types of attacks take kind of knowledge. Someone has kind of cracked into a network and understood some of its nuance and is able to then leverage that information to produce a ransom and say, hey, I know your secrets.
I know how to exploit them. And our Magic Transit product and this kind of whole suite of products, Magic Transit in particular, is something really exciting because like we built Cloudflare Access to initially keep our own Grafana secure and performant, Magic Transit takes everything that Cloudflare has been building for a decade, securing our own network, right?
Because our network is a network that, you know, people might want to attack.
And we had to build safeguards. Yeah, we had to build safeguards and defenses for our own network.
And now Magic Transit, again, as kind of all of these have evolved over the last several years, Magic Transit is something that customers can deploy in front of their own IP space.
Yeah. Get all the benefits of our network.
This is really interesting. I talked to a customer the other day who had received a DDoS attack on a piece of infrastructure.
So not on their website or something very obvious.
That had caused them difficulty. They had to do a bunch of work to get around it.
They'd have, you know, people not having access to internal applications and stuff.
But the DDoS was followed up by what was essentially, you know, blackmail, which was to say that the attacker said, here is a list of your internal IP addresses and how stuff is laid out in your network.
This is what we're going to attack.
At this data rate, we've done a demo. That was what the attack was the other day.
And, you know, here's how much you should pay.
That, you know, requires a little sophistication, which I think is pretty striking.
Brian Krebs, who is a fantastic investigative journalist, had a thing on his website
about a gang doing this stuff, where they had previously figured out how much businesses could afford to pay by breaking into the networks of the businesses, getting financial data, and then saying, well, okay, so we'll do a ransomware attack on this company.
And we know from their cash flow that this is the right amount to ask them to pay, which is just, you know, stunning.
They're running a business, basically.
Yeah, that becomes, for the people at those organizations and enterprises responsible for keeping the network up, that becomes the worst day in their career, right?
Like a truly, like, abjectly awful experience for the people responsible for it, of course, for the entire business itself and its ability to execute.
And I think anything we can do to make that a non-factor for those customers through a combination of services like Cloudflare One, I just kind of love the idea that we can help these businesses not have to worry about things like that.
Yeah.
They need to solve some other interesting problem, one that oftentimes is really fun to learn about.
What we don't want them to have to learn about is the kind of disaster response that might occur if someone's able to exploit a weakness like that.
Yeah. I mean, the real thing that occurred to me is for years and years, I've been telling people, you know, attackers will look for the weak point and they'll go after that weak point.
And it's really clear now that's what they're doing, right?
It's not a marketing story. It's not one of them. It's, that's what they're doing.
They're going out and figuring it out, and you actually end up needing to protect the entire IP space, hence Magic Transit.
They're not knocking off your main website.
They're going after some piece of infrastructure. Well, listen, we don't have much time left.
Sam, thank you so much for joining me this morning from sunny Lisbon.
We could have done this in person, right? But yeah, I'm so used to Zoom now that I just talk to everybody on Zoom.
But, you know, thanks for talking about Cloudflare One.
Good luck with the rest of the launches this week.
I mean, I know what they are, but I'm looking forward to everyone else knowing what they are.
And, you know, I'll see you in person at some point. Cheers, Sam.
Yeah. Thanks, John. Good to see you. Bye.