Cloudflare TV

Let's //Get\ Technical: Blocking AI bots and crawlers for free to make a safer Internet

Presented by John Graham-Cumming, João Tomé
Originally aired on 

In this week’s episode, we go a bit technical in our segment “Let’s //Get\ Technical,” featuring our CTO, John Graham-Cumming. We discuss AI bots and crawlers, elections and more.

Host João Tomé and John Graham-Cumming begin by explaining how Cloudflare recently launched a new “easy button” to block all AI bots, scrapers, and crawlers to help maintain a safe Internet for content creators. It’s available to all customers, including those on our free tier. We demonstrate how to activate this button on the Cloudflare dashboard. We also explain the purpose of the robots.txt file—it resides at the root of your domain and contains directives for search engine crawlers on how to index pages on a website.

Next, we discuss how we recently replaced links to polyfill.io with Cloudflare’s mirror to ensure a safer Internet—polyfill.io is a popular JavaScript library service that can no longer be trusted and should be removed from websites.

We also analyze recent election trends in the UK and France, as well as the Biden vs. Trump US election debate. We were surprised to find that so many websites related to political parties were the targets of DDoS attacks blocked by Cloudflare around the election (or debate) days.

Also mentioned is Cloudflare’s 1.1.1.1 incident on June 27, 2024. The root cause was a combination of BGP (Border Gateway Protocol) hijacking and a route leak.

Mentioned topics:

English
News

Transcript (Beta)

Hello, everyone, and welcome to This Week in Net.

It's the July 10th, 2024 edition. And this week, we're going to talk about AI bots crawling the Internet, and also about a thing called Polyfill, and also about elections.

I'm your host, João Tomé, based in Lisbon, Portugal.

And back on the show with me, I have John Graham-Cumming, our CTO.

Hello, John, how are you? I'm good. Good afternoon, João. Good afternoon, or morning, night, wherever you may be hearing this.

We wrote a few blog posts in the past week, but there's one that has got a lot of attention, because everyone is talking not only about AI, but these days about AI crawlers, bots that are crawling the Internet, in a sense.

And this blog post, called "Declare Your AI Independence: Block AI Bots, Scrapers and Crawlers with a Single Click," got us a lot of attention.

What is this specific button that we mentioned in this blog post?

Well, it's a button that says Block AI Scrapers and Crawlers. And what it does is, if we identify a bot that is coming to your website, that we understand is being used to get content to build some AI model, we will block it.

Some bots do that very openly.

They state they are a bot for this use. Other bots are pretending to be genuine web users.

They're pretending to be different web browsers. But because we do a lot of security around detecting bots for other reasons, we're able to detect those.

And so this will block those ones as well. So the idea is that we can block AI crawlers coming to your website so that your content does not get used by them to build large language models or other AI models.
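As a rough illustration of the first case, where a bot openly states what it is, here is a minimal Python sketch that matches the User-Agent header against a short list of crawler tokens. The token list and the function names are assumptions made for this example, and this is not how Cloudflare's detection is implemented.

```python
# Illustrative list only: tokens that some AI crawlers send in their
# User-Agent header. A real list would be longer and would change over time.
DECLARED_AI_CRAWLER_TOKENS = ("GPTBot", "Bytespider", "Amazonbot", "ClaudeBot")


def is_declared_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent openly names a listed AI crawler."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in DECLARED_AI_CRAWLER_TOKENS)


def decide(user_agent: str) -> str:
    """Hypothetical policy: block declared AI crawlers, allow everything else."""
    return "block" if is_declared_ai_crawler(user_agent) else "allow"


if __name__ == "__main__":
    declared = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
    browser_like = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0.0.0 Safari/537.36"
    print(decide(declared))
    print(decide(browser_like))
```

A crawler that pretends to be a browser passes this check untouched, which is why header matching alone is not enough and other bot-detection signals are needed for the second case.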

And the reason we did this is that a lot of customers are worried about copyright, or more generally about things they create as humans being used to create an artificial intelligence.

And so there it is. We launched it. It was previously available under a more complicated form where you could go in through our web application firewall and set a rule.

But we've now created this simple, literally on-off button.

Block those for me, please. Let me share my screen here just to show this specific blog post.

In this case, I can also show the actual button. Here it is.

Block AI scrapers and crawlers. Block bots from scraping your content for AI applications like model training.

And it's a very simple button. It's literally on-off.

On and off, exactly. Quite easy to understand. You're mentioning the blog post itself.

It has some more explanation on this topic. One of the things that I found interesting, first for those who don't know, is that the existing system for bots is robots.txt.

This has worked in a specific way for a while, right? Targeting bots, specifically crawlers.

Well, yeah. I mean, I think the thing that's interesting about robots.txt, this really dates back to as the web was getting going and search engines were visiting websites and they were doing that to build the search engine.

So your website will be findable. And sometimes those web crawlers were too aggressive or sometimes there were things that you didn't want indexed within a search engine somewhere.

So you use that robots.txt to prevent those bots from coming in.

And by and large, for legit services, robots.txt is respected. And if you stick in robots.txt, do not index my website, you will not appear in Google and Bing and other search engines.

But it was kind of voluntary, right? I mean, the search engines had to go ask for that file and parse it and then respect what it said.

But by and large, pretty widely used. Yes, go ahead.

In this case, it's a matter of trust, right? You have this to respect what the content author wants.

But it's a matter of you can choose to respect it or not.

It's not like mandatory, right? Yeah, and that's right. And so we for a long time have blocked bots that are misbehaving, right?

So one of the ways of misbehaving is to ignore robots.txt because you don't want that to happen.

But we've entered a kind of a new world in which there are crawlers which are not building things for search engines anymore.

They are in a way consuming the content.

If you think about something like a search engine, you probably want your site indexed by the search engine because that makes your site findable, right?

It's not necessarily the case that you want your site content consumed by something that's building an AI model.

And so there's a slightly different use case there.

So we do see people in their robots.txt specifically saying that the Bing crawler, which might be coming to do crawling for a search engine, is allowed.

And GPTBot, which is being used to build ChatGPT, is not allowed, right?

So you can do that kind of granularity.
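To make that granularity concrete, here is a small Python sketch that uses the standard library's urllib.robotparser to evaluate a robots.txt that allows the Bing crawler and disallows GPTBot. The robots.txt content and the example.com URL are made up for illustration, and the helper name is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep the search crawler, turn away the AI crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: bingbot
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent in ("GPTBot", "bingbot"):
        ok = parser.can_fetch(agent, "https://example.com/article")
        print(f"{agent}: {'allowed' if ok else 'disallowed'}")
```

This only models a crawler that chooses to ask. Nothing in the file stops a crawler that never reads it or that announces itself under a different name.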

But we also saw some of the systems that are doing AI being dishonest about their crawler.

So instead of saying, we're a bot of this type, they're actually pretending to be a version of Firefox or a version of Chrome and trying to pretend to be human.

And so we're blocking that behavior with the AI blocking button.

Our blog post has a few specific data here about AI bot activity these days.

This comes from Cloudflare Radar, our Radar tool. And we can see here a clear increase since July 2023 in requests by user agents specific to AI bots.

And Bytespider, which is from ByteDance, the Chinese company that owns TikTok, is number one, but there's also Amazonbot, ClaudeBot and GPTBot; those are for specific chat services.

Although ByteDance also has its own generative AI chat service, called Doubao.

But there's a clear tendency of increase of these types of AI bots in a sense that also shows the relevance of this topic right now.

Well, I mean, there's an intense battle going on between different companies to build different models with different capabilities, right?

We're all seeing that with, like, the chat-type things, multimodal models where you can take in imagery or create imagery, video, voice, music, all this kind of stuff.

And then very specific examples where it's like, things that are very good at coding or trying to make things that's good at mathematics, et cetera.

So there's a tremendous amount of content being crawled.

And if you look at many of the bots that have been created, their biggest source of content is the web.

After that, it tends to be printed books and some other very specific material.

But the web has a tremendous amount of content on it.

And so these bots are out there looking for it.

And also, recently, you've also got these sort of search engines that are sort of what we might think of as summarization engines, where you go and ask them a question and they use AI to look at content on the web and then summarize it.

And they, again, are going out and getting stuff off the web. There's also another twist on some of the concerns our customers have, which is like, well, if you come and visit my website, take the content and summarize it for some user, I've lost a click, basically.

Someone hasn't come to my website, even if that service summarizes it and says, actually, I got this from my website, jgc.org, for example.

If that happens, then you've lost potentially revenue, you've lost potentially a customer or a visitor to your website.

So I think there's a lot of concern.

And actually, we have absolutely seen with the number of people who have turned on that button that there's a tremendous amount of uptake on this, that people are worried about the stuff they are creating being used in a manner that they didn't intend, and that they want control over it.

Makes sense. And in this case, it's their livelihood because those are content creators, they have their websites.

And if users are not going there, they will lose something. So it's also a matter of incentives of you creating something online.

And if others will use it and take some revenue from there, you won't have the incentive to continue to create to do your own thing, right?

I think it's more than just money. I think it's also philosophical, right?

I think there are people who are certainly concerned about money, but also concerned about feeding something which is creating an artificial intelligence, right?

And there are artists who are concerned about their art being in some ways commoditized, right?

If I can say, create me, I really like Claude Monet, but I'd like a Monet of Lisbon, Monet never came to Lisbon, so make me one.

And suddenly I've got something that looks like Monet, that sounds kind of cool, but also for an artist who's working today, who is expressing something very deep, that is beyond the surface of what it looks like, that's worrying.

It is, it takes something, right? From, in this case, an artist, a creator in a sense.

Yes, I think people are concerned about it. And hence, although we had a system for doing this already within the WAF, we decided to make it very, very easy and it has had tremendous uptake.

In this case, we have some explanations on how it works.

There's also a reporting tool that we set up where any Cloudflare customer can submit reports of an AI bot scraping their website without permission.

This also helps to make it more effective, right? Make this tool more effective.

Yes, yes, absolutely. If we've missed something, although we are building our own machine learning models to detect these things.

Exactly.

I have here, just for those who would like to see what a typical robots.txt page looks like, one from Cloudflare: Dear Robot, Be Nice.

So some sites have tried to make it not only for robots, but also more fun for humans.

YouTube also has some text, interesting text, a little bit catastrophic text.

Google, for example, is much more straightforward.

Facebook, Apple, all of these. Nike actually has a logo at the end.

Wikipedia has some messages for typical humans, but not necessarily something really nice or cool.

And Netflix has a humans one. This is for humans, humans.txt.

Ah, yes, yeah. Come join us. Exactly. So different aspects here on this topic, which is an interesting one, more and more and more in this case.

Where should we go next? You also wrote a blog post with other folks a few weeks ago about Polyfill, automatically replacing polyfill.io links with mirror for a safer Internet.

This is a popular JavaScript library that many folks use: different companies, different services use it.

And it can no longer be trusted, right?

Yeah, this is an unusual situation, right? Because, so polyfill, the idea of polyfill was that the APIs that are used in browsers from JavaScript have varying functionality across different browser versions, different historical versions.

And so the idea of the polyfill was you loaded this polyfill thing and it made sure you got a consistent API basically.

And a very popular one was polyfill.io.

And some time ago, earlier in the year, the polyfill.io domain got sold to a third party who wasn't well known.

And there was a concern that what might happen is somebody could do something malicious, because if you could change what was being served from polyfill.io, it could get injected into webpages and therefore could have a security problem.

At the time, what we did was we made a mirror of it on CDNJS, which is our JavaScript CDN.

And so you could replace polyfill.io with CDNJS slash polyfill and get exactly the same piece of JavaScript, but you didn't have the same supply chain risk.

Subsequently, the thing that everyone was concerned about, that something malicious would happen with polyfill.io, happened.

And when we looked at it, what was happening was in certain circumstances, some redirection malware stuff was being inserted into someone else's website.

So if you use polyfill.io on your website, there was a potential that something got injected.

And so June 25th, this actually happened.

It was no longer a theoretical thing. We decided that we would use the power of Cloudflare to actually modify our customers' websites so that if they referenced polyfill.io, we would change it to cdnjs.cloudflare.com with the polyfill thing.

And this was kind of an unusual thing. We had a few times in the past decided something was so severely bad that we would do it for free for everybody.

We did it for Log4j, the vulnerability. That was very, very serious.

We turned on the WAF for everybody. We did it long ago for a thing called Shellshock with the WAF.

And in this case, we actually used our ability to modify HTML as it goes through our network.

And we'd look for where it says polyfill.io and actually swap it out and replace it.
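The swap described here can be pictured with a short Python sketch. It rewrites links in an HTML string with a regular expression, which is a simplification chosen for this example; it is not a description of how Cloudflare's network rewrites HTML. The function name, the pattern and the handling of the cdn.polyfill.io subdomain are assumptions.

```python
import re

# Matches http://, https:// or protocol-relative links to polyfill.io or
# cdn.polyfill.io. The lookahead requires the host name to end there, so a
# host such as polyfill.io.example.com is left alone.
POLYFILL_HOST = re.compile(
    r"(?:https?:)?//(?:cdn\.)?polyfill\.io(?=[/\"'?\s>]|$)",
    re.IGNORECASE,
)

# The mirror location mentioned in the conversation (CDNJS slash polyfill).
MIRROR_PREFIX = "https://cdnjs.cloudflare.com/polyfill"


def rewrite_polyfill_links(html: str) -> str:
    """Point polyfill.io links at the mirror, keeping the path and query."""
    return POLYFILL_HOST.sub(MIRROR_PREFIX, html)


if __name__ == "__main__":
    page = '<script src="https://cdn.polyfill.io/v3/polyfill.min.js"></script>'
    print(rewrite_polyfill_links(page))
```

Because only the host part changes, whatever path and query string the page already used are carried over to the mirror unchanged.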

And that was, you know, it was an extraordinary decision to make that, but it felt like our customers come to us for, one of the things they come to us for is more secure websites, right?

And it was really on us to make this more secure. This was a serious threat.

Supply chain threats are very serious. I mean, we've seen other ones with the XZ backdoor.

We've seen Magecart stuff before that. It's really important that this stuff gets fixed.

And so we rolled this out for free for everybody.

Exactly. In this case, it's rolled out for free, but there's some action that the customer should do, right?

Well, you should go through and get rid of polyfill.io from your website.

So it'd be better if you yourself change it to something else.

It doesn't have to be CDNJS. It can be another good JavaScript provider.

But if you do that, then it will, you know, get rid of any more risk that, you know, polyfill.io is somewhere in your code.

But in the meantime, you have protection from us.

Exactly. We also have an incident. I'm not sure if we want to go more deep here.

This is the quad one incident on June 27, 2024. This affected a small number of users globally.

And this was related to something that sometimes happens on the Internet.

In this case, BGP hijacking, right? Yes. So we've written about BGP lots of times.

We've talked about it before. BGP is the thing that binds the networks that make up the Internet into the Internet, right?

You've got some network and you say, hey, I'm this network.

Let's say I'm Cloudflare. And you can advertise to the rest of the world who you are and where you are and what portion of IP address space you control.

So in our case, 1.1.1.1. And what happened was two things simultaneously. One was a hijack: somebody said, actually, it's me who owns 1.1.1.1.

And the other thing was a route leak, which is someone's internal routing leaking out onto the Internet. The upshot was that a bunch of networks around the world thought, oh, those people over there handle 1.1.1.1, not Cloudflare, and sent traffic over there.

And particularly, you know, 1% of users in the UK and Germany, for example.

So what would happen is the resolver would either work really slowly because it was all getting sent to someone else's network and then ultimately forwarded on, or it just didn't work at all.

And, you know, this is a problem with the Internet. We've done an enormous amount to work around this.

And in particular, there's a bunch of stuff called RPKI, which is trying to secure when someone says, hey, I own this set of IPs, there's a way to cryptographically verify that.

And that's now at about 50%: about half of routes are actually protected by that.

We're going to keep pushing on that. And if you want to know if your ISP supports that, because it's super important that everybody supports it, you can go to isbgpsafeyet.com and test your ISP.
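The verification idea behind RPKI can be sketched in Python as a route origin check: a signed record says which network may originate a prefix, and an announcement is compared against it. This is a simplified model that follows the valid, invalid and not-found outcomes of route origin validation. The record contents are assumptions for illustration, AS64512 is a private-use number standing in for a hypothetical hijacker, and real validators work from cryptographically verified data rather than a hard-coded list.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    """Simplified Route Origin Authorization: who may originate a prefix."""
    prefix: str
    max_length: int
    origin_asn: int


def validate_origin(announced_prefix: str, origin_asn: int, roas: list) -> str:
    """Classify an announcement as 'valid', 'invalid' or 'not-found'."""
    announced = ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ip_network(roa.prefix)
        # Only ROAs whose prefix contains the announced prefix are relevant.
        if announced.version != roa_net.version or not announced.subnet_of(roa_net):
            continue
        covered = True
        if roa.origin_asn == origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some ROA but matching none of them means invalid.
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    # Illustrative record: 1.1.1.0/24 originated by Cloudflare (AS13335).
    roas = [Roa(prefix="1.1.1.0/24", max_length=24, origin_asn=13335)]
    print(validate_origin("1.1.1.0/24", 13335, roas))
    print(validate_origin("1.1.1.1/32", 64512, roas))
```

A network that drops announcements classified as invalid would ignore the hijacked route, which is why the share of networks doing this check matters so much.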

Yeah, that's a cool site that we have that's around for a while now.

It's not a recent one.

We started it during the pandemic, actually, because, you know, during the pandemic, the Internet really went up another level of importance, right?

In terms of we were all relying on it for schooling, for food, for communications, for work, for, you know, staying in contact with loved ones.

Well, you're using Cloudflare WARP.

So we do support BGP. But if you did this on your ISP, you'd see your ISP.

And so we did it during the pandemic. And we've been pushing for this, as have many other players, and keeping track of who has made good progress.

And we can make the Internet much safer by rolling out RPKI across the Internet.

Makes sense.

It's one of those cases that shows us the complexity in a sense of the Internet, but also the fact that it became what was not expected to become in a sense, right?

Well, I mean, it became much more critical to our lives than I think the original people imagined it could do.

And, you know, we built a world on it at this point.

And so, yes, we need it to be secure. Hence, a lot of things Cloudflare does and a lot of things that as a community we do together to make sure that the Internet stays online and secure for everybody.

We also have several election-related blog posts that I've been working on specifically recently.

And also exam-related Internet shutdowns in Syria, Iraq and Algeria that have still been happening in all of those countries.

I mean, as you were saying in your blog post, I mean, this is the year of elections.

And I think it's more. If you consider the European Union, which is 27 countries, the election, it's more than 60.

And so in the last week, we have had France, the UK, and then France again.

And it's been kind of a wild week in terms of European politics, right?

Because the UK moved to the left with the election of the Labour Party and massive losses for the Conservative Party.

And you looked at this and you saw, you know, perhaps unsurprising things, drops in traffic at certain times and people were looking at the news and things like that.

And then a couple of attacks on political parties. You kind of see the trends of the Internet there.

So that was the UK, which kind of went to the left. Last Sunday, the 30th of June, France kind of veered to the right, with the party called RN, Rassemblement National, getting the most votes in the first round.

But they have a two round voting system. And what happened was that you saw that, you saw attacks on different parties, you saw a lot of traffic and things happening because the French had a massive turnout, right?

Then there was such concern in France by some people about the rise of the sort of essentially the far right, that a political deal was made by a whole lot of parties.

So that a lot of the races in France, instead of being multi-candidate, became two-candidate in order to try and block it.

That actually worked. And the French veered to the left, whichever way you're watching my hands go.

And actually, perhaps the most vocal person possible, Mélenchon, suddenly becomes, you know, at least the leader of the largest group within the Assembly.

So that was like left and right and left all over Europe.

And then we've got more, right? More coming, yeah.

More coming. But as you know, you know, what we see in these things is drops of traffic when people are voting, attacks on political parties or other groups associated with things.

I mean, it's sort of almost run of the mill, right?

I mean, you sort of see this and it's like, oh, did the attack happen this day or that day?

Or did it, you know, that kind of thing. True. What surprised me the most is these politically related sites.

So political parties in those countries. And we've seen this recently, actually, I wrote about it like a few weeks ago in the Netherlands during the European Union Parliament election.

Yes, because we had the EU elections as well, just before that.

Exactly. I feel like I've been voting every five minutes in the last...

We had in June, so it was not far past. In one month period, a lot had happened.

So we saw some attacks there, DDoS attacks specifically, and in France also, and in the UK.

So political parties under attack. And the blog post I just finished writing is about the French election.

So even on the election day, Sunday, July the 7th, there were also attacks on political parties' sites, which is interesting in that political parties continue to be DDoS attacked.

But in the past, we've seen how DDoS attacks sometimes are used even to divert attention.

So if IT teams are trying to deal with a DDoS attack, maybe there's another attack happening, maybe using another vector.

And that's just like a smokescreen trying to divert attention. Yeah. And in some cases, it's just, I want to make this party look bad or incompetent or something because their website's down, right?

So yeah, you have to see that kind of thing. Well, but it's kind of interesting that so many have been attacked.

And this is in a matter of a few weeks.

So not a lot there specifically. So elections are still going on.

And we have a report on Radar, actually, that folks can follow. A lot of elections are coming in 2024.

And we'll keep an eye on that in our election report on Radar, actually.

So that's it for this week. And I'm on vacation in the next few weeks.

So that's it for a few weeks. Exactly. Good. Good to see you again, João.

Bye. Good to see you. That's

This Week in Net
Tune in for weekly updates on the latest news at Cloudflare and across the Internet. Check back regularly for updates. Also available as an audio podcast!