Originally aired on July 29, 2021 @ 6:30 PM - 7:00 PM EDT
Join Alex Krivit (Product Manager, Cloudflare), Matthew Prince (CEO & Co-Founder, Cloudflare), and Rustam Lalkaka (Director of Product, Cloudflare) as they discuss a new company initiative that aims to reduce the technical and environmental costs associated with redundant website crawls.
And we're live. Welcome to today's session of Cloudflare TV. I'm joined by two very special guests, Matthew Prince and Alex Krivit. And we're here today to talk to you about what we're doing to reduce the environmental impact of search engine crawling. Before we dive in on that topic, I want to talk about Impact Week a little bit. Matthew, this is Cloudflare's first Impact Week. Where did the idea come from? What's the goal for the week? And what would you like folks to take away from the week when all is said and done?
So, you know, I think that we've done these Innovation Weeks. They were born out of the very first year after we launched, when we decided that right around the anniversary of our launch date, which is at the end of September, the 27th, we would announce a new product that we wouldn't make any money off of, but that we thought was a good thing for the Internet. And that first product was an IPv6 gateway that allowed IPv6 support across our network. And it was great for the team. You know, it felt like it was the right thing to do for the Internet. And it felt like something that really resonated across the market. And, you know, getting more IPv6 support is just a per se good thing globally. And so that was something that we really wanted to work on. And I think that then inspired us over the years, every year on our birthday, to try and do something that gave back to the Internet. That wasn't a product that we charged for, but was something that, you know, we thought was just per se good. I think the biggest one, as I think about some of the smartest things that we did at the company and some of the things I'm the most proud of, was that in 2014 we made TLS, so HTTPS, free for all of our customers.
And it was super scary at the time, because it used to be that the one difference between our free plan and our paid plan was that one supported encryption and the other didn't. But, you know, we looked out and said, we want to be on the right side of history here. Obviously, a better Internet is an encrypted Internet. And so if our mission is to help build a better Internet, then of course we should do that. And it was a huge technical challenge and a huge lift. But that moment that we pushed the button and encrypted a huge portion of the web, you know, literally doubled the size of the encrypted web in 24 hours. You know, it's just one of my fondest memories at the company. And so I think that it was such a great motivating thing that we are always looking for, you know, how could we do other weeks that would drive us to take kind of innovative, interesting risks. And I think a lot of companies use their user conference or whatever as a deadline: stuff has to get done by then, because we're going to present it at the conference. And I think we've deconstructed that to some extent and said, how can we use these weeks over the course of the year in order to, you know, push the launching of various features. And so we did a Security Week, we've done a Developer Week, you know, we've got some more coming up. But at some point, Alyssa, who runs our public policy team, said, you know, we should really feature some of the good that we're doing and package it up, because a lot of both our customers and our investors are interested in what's known as ESG. So enterprise social good, or excuse me, environment, social, and governance, I guess, is actually what that stands for. I was thinking of it the other way around. And as we started talking about it, we thought, well, maybe this can be a week.
But inherently, Innovation Weeks at Cloudflare mean launching new products. And so we looked out at a number of the things that we were doing and asked ourselves, you know, what are some of the projects that are just per se good across this? And how can that come together? And like all of these, it's often a race, right? At the last minute, Alex was, you know, dealing with that; about 10 days before, there's always a moment of sheer panic. But it comes together well, and tomorrow will be the last day of Impact Week. And it's a list of things that I think our whole team is incredibly proud to have worked on.
I think just speaking personally, as a citizen of the planet, and then speaking professionally, as a product manager, I thought one of the cool intersections here is that we could have just said, oh, we're going to buy some carbon offsets, we're going to do the sort of checklisty things that big companies go out and do. But we sat down and thought about what products can we build? How can we be innovative to solve real problems here? Not just,
And I think, you know, sometimes, and we're not immune to this, companies do things that don't feel core to what it is that they're doing. And I think what's been good about Impact Week is that it very much isn't just that we decided to do these check-the-box things. We said, how can we use our technology and our network and our people in order to make the Internet better. And that's exactly, you know, where Crawler Hints came from.
Well, yeah, let's switch gears and talk about crawlers for a second.
I know that since that first birthday for Cloudflare, when we came into the world, interacting with crawlers and helping customers manage their interactions with crawlers has been very close to the top of the jobs we have to do. What were the initial products focused on helping customers manage crawler interactions, and have those matured or changed over time?
Yeah, I mean, at some level, Cloudflare was inspired by the problems that crawlers create. Lee Holloway, who is one of the three co-founders of Cloudflare, and I had worked on something called Project Honeypot, which is basically an open source project that would track kind of bad actors online, still around at projecthoneypot.org. And I mean, it's a testament to what a bad coder I am that, you know, if you surf around those pages, they're all hyper, hyper, hyper interlinked, but they're all very database driven. And there was no caching layer. And what would happen is, you know, the Google crawler would come along and find all these links and be like, wow, this is amazing content, and crawl it incredibly aggressively. And our back-end server would crash all the time. And so I think, you know, almost as much as understanding that there are all these malicious crawlers out there that were harvesting email addresses or searching for vulnerabilities, which Project Honeypot was specifically studying, we also saw that there were good crawlers that would impose, you know, a substantial load on the Internet. And what's actually interesting is that as a percentage of traffic, search crawlers have decreased over that period of time. And that's not because the total volume has gone down. It's because there's just been so much more malicious crawling and so much more Internet usage generally.
But it's still about 5% of all Internet traffic globally. And it puts a lot of load, especially on smaller customers. And so, you know, early on we were always saying, how can we make it so that if a page is static, if a page doesn't have to be database driven, if it doesn't have to be driven from, you know, a WordPress instance or whatever, how can we make that available and make sure that it's incredibly fast. And I think especially as Google and other search engines started to use performance as one of the key metrics for ranking, that really helped us focus on how do we make the performance, you know, whenever Google is crawling the web, as fast as possible. And so I think we've always had in the back of our mind that, you know, helping our customers work with crawlers, both to decrease their load and also to help them have better SEO, search engine optimization, were some of the values that we probably don't talk about a ton at Cloudflare, but were, you know, very much part of the value proposition from our earliest days.
Yeah, no, that makes a lot of sense. Alex, for those of our viewers that haven't read the blog or aren't familiar, how does Crawler Hints, this thing we announced, do the things that Matthew talks about: help crawlers, help our customers, and help the Internet in general?
Yeah. So I think that's a really good question. It goes back to Matthew's point about a lot of the automated traffic that exists on the web, in terms of, I guess, good bots, which I think is an important distinction to draw here: automated traffic that performs some sort of useful function.
So if we're talking about crawlers, that could be going to your webpage and indexing it such that you show up in a search engine, indexed in some way that's useful, so that your website can be found by people using that search engine to find specific types of content. Or it could be social media aggregators, or pricing bot traffic. But what we were interested in with all this traffic was the efficiency of it. How frequently do they go to a webpage, and do they find something new that they haven't seen already that could better inform the purpose of the crawl? So if you're a search engine crawler, you're going to a website and it's getting indexed. And then you come back a little bit later. And if that website hasn't changed, you already have that content. So there's no reason to keep coming back over and over again if nothing's changed. So we were really focused on trying to answer the question of whether or not these search engines were doing redundant crawls, the same thing that Matthew had seen, you know, these over-and-over-again crawls: were they being used efficiently or not? And so, diving into the data, we found that around 50% of the crawls didn't really find anything new. They were just hitting the web pages behind Cloudflare sort of randomly or naively, or they thought that they might find something new, which I guess is not really a point of blame or accusation against these crawlers. They're trying to find the newest content, the freshest content, so that they can accomplish their task of being useful to the user.
But we thought that if we were going to, you know, work with some of these large bot developers or search engine providers and help give them an additional heuristic, we could actually look to see when content on our edge has changed and provide that information, such that, you know, these trips back to check whether anything has changed could be diminished. And they could add our hint, our crawler hint, I guess, as an additional data point that could help inform the cadence of crawls to these websites.
Cool. So just to simplistically explain it: for websites that use Cloudflare that opt into this thing, we will help transition crawling from a pull-based, or sort of polling-based, operation to one that happens on a push basis, right? Like when we know something has changed, we will let the major crawlers know, and they can be much more efficient and just stingy, almost, in how they spend their crawls, right? Is that accurate?
Yeah, that's accurate. That's providing them additional information besides what they had in the past, which may be that this sort of page has been seen to change at this cadence or something, which may not be representative of when things are actually changing now or in the future. And so providing them an additional piece of data so that they can make a little bit smarter decisions was the goal there.
I always cringe when I hear the win-win or win-win-win term, but this really feels like that, right? The search engines get fresher content than they would if they were polling constantly. Our customers and websites get less crawler traffic, and the Internet is a little greener as a result. Are there any downsides to this thing? Like what's the catch?
I mean, nothing really comes to mind.
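[Editor's sketch: the shift described above, from crawlers polling on a guessed cadence to crawlers being told when content has changed, can be illustrated with a toy simulation. Everything in it is hypothetical: the function names, the day-based timeline, and the change dates are made up for illustration, and it is not the Crawler Hints protocol or any Cloudflare API. It only shows why a change signal removes fetches that find nothing new.]

```python
def polling_fetches(change_days, horizon_days, interval_days):
    """Pull model: revisit on a fixed cadence, whether or not the page changed.

    Returns (total_fetches, redundant_fetches).
    """
    fetch_days = range(0, horizon_days, interval_days)
    redundant = 0
    last_fetch = None
    for day in fetch_days:
        if last_fetch is not None:
            # A revisit is redundant if no change landed since the last fetch.
            changed = any(last_fetch < c <= day for c in change_days)
            if not changed:
                redundant += 1
        last_fetch = day
    return len(fetch_days), redundant


def push_fetches(change_days, horizon_days):
    """Push model: one initial fetch, then one fetch per change notification.

    Returns (total_fetches, redundant_fetches).
    """
    hinted = {d for d in change_days if 0 < d < horizon_days}
    return 1 + len(hinted), 0


# Hypothetical page that changed on days 3, 10 and 24 of a 30-day window.
changes = [3, 10, 24]
print("polling:", polling_fetches(changes, horizon_days=30, interval_days=2))
print("push:   ", push_fetches(changes, horizon_days=30))
```

[In this made-up scenario most of the polling visits find nothing new, while the push model fetches only on the initial visit and after each change. Real crawlers would treat such a hint as one more input to their scheduling rather than the only one, as Alex notes.]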
It's that Cloudflare sits in this privileged position between, you know, a lot of these client requests and a lot of origin servers that, without Cloudflare, would field a vast majority of these requests. And so because of where we sit in the Internet's architecture, I guess, we can really do some good here. And so it's really a good opportunity for us to use some of the information that we have to help reduce the carbon emissions associated with these redundant crawls. We can, you know, repurpose these servers that would otherwise be fielding these excessive crawls to do something else, to field other types of traffic, to do something more useful.
I thought you would have said the downside is that your engineering team has to do a whole bunch of high-end engineering work, but.
Well, there's that.
So, two things. I think we did try to set out to make this a win for the Internet as a whole. And it's important to note that this is not something which is proprietary to Cloudflare in any way. It's an open standard. Anyone who's running a crawler can subscribe to it. And anyone who is serving content, whether it's a traditional hosting provider, a service like Cloudflare, or any of our competitors, is welcome to use this. And again, hopefully it becomes a standard; that's the only way that we can have the really big environmental impact. The one set of parties that I can imagine not liking something like this is the people who are currently the dominant crawlers out there. You know, part of the challenge if you're launching a new search engine is how do you crawl in a way that gets allowed by people? Because all of a sudden a brand new crawler shows up, you know, with some new name that nobody recognizes, and people instinctively block it.
And I think that's one of the things that has actually held back more innovation around search. And so if we can make it so that you can be a responsible crawler, you can build a system out. I think one of the benefits might be that we actually get more innovation around search. And so I think, you know, that might disadvantage some of the current providers that are out there. And so that's the only downside that I can imagine, but I think again, for the good of the Internet as a whole, that's well worth it.
A more level playing field is hard to complain about. Makes a lot of sense. How big an impact do we think this will have if we see broad adoption? I guess that's a question for both of you. Matthew, you want to take first crack here?
You know, it's really tough, because the numbers are all a little bit hand-wavy in these cases, but we're pretty confident in a couple of things. So we're pretty confident that 5% of all Internet traffic is these good crawlers. And we're pretty confident that 53% of the time that they request a page, that request is redundant and doesn't need to happen. It doesn't improve search in any way. And so that tells you that about two and a half percent of all Internet traffic is redundant. After that, I think that's where it gets hard. And we spent a lot of time trying to figure out the most reliable way we could tell those numbers. I think the best data that's out there is that the Internet as a whole accounts for about 2% of global CO2 emissions in order to run on an annual basis. And that's growing a little bit faster, although servers are getting more efficient and there's a lot of work to make that more efficient, but you assume that's about right. And that's the BCG data, Boston Consulting Group data.
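[Editor's sketch: the "about two and a half percent" figure above is the product of the two quoted estimates. A minimal check in Python, using only the 5% and 53% numbers from the conversation; the function name is made up for illustration.]

```python
def redundant_traffic_share(crawler_share, redundant_rate):
    """Share of all traffic that is redundant crawling, as a fraction."""
    return crawler_share * redundant_rate


crawler_share = 0.05   # good crawlers as a share of Internet traffic (quoted)
redundant_rate = 0.53  # share of crawler requests that find nothing new (quoted)

share = redundant_traffic_share(crawler_share, redundant_rate)
print(f"{share:.2%} of Internet traffic")  # 0.05 * 0.53 = 0.0265
```

[The exact product is 2.65%, which matches the rounded figure given in the conversation. Turning that traffic share into a carbon figure needs further assumptions that the speakers describe as hand-wavy, so the sketch stops here.]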
And if you assume that's right, then you can just say that, you know, this is a meaningful amount of carbon if we could get the entire industry to adopt this. Now, obviously that's going to take a lot of work, but if you imagined a world in which every search engine, every legitimate bot, was subscribing to this service and every content provider was subscribing to the service, it's the equivalent of, you know, planting something like 30 million acres of new forest every year. And that's such an incredibly valuable thing that we can do, again, as you put it, with a, you know, win-win-win across the industry. And so these are the sorts of things that, you know, we really love to work on. It doesn't generate any revenue for us, but, you know, it does make the Internet a better place and it helps us live up to our mission.
Yeah. Alex, you've actually made a specialty out of figuring out interesting ways to partner with crawlers and to help make the Internet a better place. This feels kind of related, but not really, to the work you did with the Internet Archive last year to help get content archived. I mean, for those of our viewers that aren't familiar with that project, you might want to spend 10 seconds talking through that, but then how are these things related, and where does our work go with helping crawlers and other folks on the Internet get smarter, you know, using the data we have and the positions on standards bodies and that? What's the next step here?
Yeah. So taking that first question, about some of the work that we did with the Internet Archive last year: we started working with them to help power our Always Online service, which is a service that serves up content that has been crawled and archived by the Internet Archive when a site's origin is unreachable.
So instead of serving an error page to an end user, which, you know, kind of sucks in terms of their experience, we go and serve them content that's been crawled by the Internet Archive and archived in their system, which is a little bit better. Maybe not as good as if the origin was up and running, but it's still a sort of Internet insurance policy, I guess, is a good way to think about it. And so we've spent about a year working with them. They're, you know, a large crawler. They do a lot of traffic. Their whole mission is to archive the entirety of the web. And so they're running a lot of these crawlers, and their purpose is to, you know, go get content and take snapshots of it. And then we use that service, obviously, when we're seeing an origin down or a problem with part of the Internet, such that these things can be up and available to end users. And so with our experience working with them in this, again, push model-esque sort of way, we were able to, you know, learn some of the, I guess, rough outlines of how something like this should work and scale. And so it's really helped, I think, to inform a lot of the early wireframes around this product. They were one of the, I think, first large crawl operators that were really excited about this because, you know, they obviously need to continuously go back and check to see if things need to be updated in their archive. And so if they can have another hint, like, you know, Crawler Hints, when something's changed, they can operate their infrastructure more efficiently, and that's good for them. It's good for us, and it's good for the end user again.
It's actually a really cool little snapshot into how product development here works, right? We start with one idea, we learn a bunch of stuff, and then we say, how can we make this bigger? How can we make this cooler, right?
And none of these things happen overnight; they're all built on years of trying lots of things. And a lot of these things don't work, right? This happens to be a good, you know, couple of successes in a row. In the context of carbon emissions and climate change, we're obviously looking at a couple of big questions as a, you know, society, and then with a focus specific to the Internet, right? How can the Internet be more sustainable, and whose job is it to make sure that we go in the right direction there?
I think the good news is that everyone's interests are aligned. We have to pay for the power that powers our servers around the world. We want to deliver content from as close as possible, to waste as little power as possible. And we're not alone. You know, Google and Facebook and Microsoft invest enormous resources to figure out how to run processing as efficiently as possible across their networks. And so at Cloudflare, we've been just obsessed for a long time with how we can be as power efficient as possible. I think one of the frustrations that I've had has been that for a long time, the chip industry, at least in server chips, sort of talked like they care about power optimization, but it doesn't feel like it has been a big priority. And I remember going up to the Intel Research Center back in early 2012, and we were a tiny company at the time. And so it was amazing they were even inviting us there. But to their credit, I think they saw that we had a lot of potential. And I remember sitting with a bunch of their senior chip team and saying, you know, we care about cores per watt. We want to have as many cores as possible with as low a wattage as possible. And they spent the entire time trying to convince me that that actually wasn't what we needed, and that in servers, power efficiency wasn't as much of a priority.
And honestly, it was just a disappointing conversation, not only because we were obsessed over how to make our services as efficient as possible, but because if you thought about how that conversation played out across millions of server chips, it was just an enormous amount of energy being lost. And so I think what's changed in the last little bit is, first of all, with the move to first laptops and now mobile devices, there's been so much more focus on power efficiency in those battery-powered devices. What we're now seeing is that that efficiency is making its way kind of the other direction. It started with mobile phones. You saw what Apple did in terms of moving their laptops to ARM-based processors. My wife has an ARM-based laptop, and the battery just lasts forever. And it's amazing. And so we've done a lot of the hard work to make sure that we could port our stack over to run our software on more efficient ARM-based chips in servers, which we figured would inevitably come. And for a while, we were pretty excited about the Qualcomm ARM servers and the opportunity they offered. Unfortunately, Qualcomm, right as we were about to deploy them, shut that division down. But we've been talking with and working with nearly everyone who's working in the ARM-based space. And the new Ampere 128-core ARM chips, which we now have in production at Cloudflare, are really potentially a game changer, not just because they can replicate the performance of what we got with an x86 design, but because they can do it at approximately half the power consumption. And that's great for us: it means that we can go into more locations. It means that we can put more servers per rack. It means that our power bill is less. But I think it's also incredibly important for the Internet.
And so what I hope is that this is a wake-up call for the likes of Intel and AMD, because I think more and more companies are going to start to choose what is better for the environment, but then also what is better for their power bill and for their overall business. And I think that now that you've got places where you can get the same performance for half the power usage, that's just an incredible opportunity for us to make the Internet as a whole much more efficient.
Yeah, I think that's an important and obvious, but maybe not always appreciated, point. For any business like ours, reducing our carbon footprint and energy consumption and all that is just good business. Obviously, it's the right thing to do for the planet, but it also helps our bottom line, which is a good alignment.
It's interesting to see with a product like Workers: the magic of Workers is that if you write code as a developer, it runs on every server throughout our entire network, potentially. And you don't have to spin up instances. You don't have to handle any of that. And so we can distribute it everywhere. What that also means is that we're able to run our servers at a much higher utilization rate than typical cloud providers. And so just by the mere fact of using our platform, it typically is significantly greener than what you see from other, more traditional computing platforms. And it's not because we've invented new chips or anything else. It literally is just because we can get more utilization out of our network by moving that traffic around. And so that's just one of those consequences where, by being able to get more out of our existing gear, not only is that, again, aligned with our business interests, not only does that mean that our customers can literally scale up and down to whatever capacity they need, but at the same time we can make sure that it is as environmentally friendly as possible.
And one of the other announcements from Impact Week was green Workers, which allows developers to pick where their workloads are running, not just to be as efficient as Cloudflare is normally, but even more efficient over time. And I think that we're the first of the major public clouds to allow a green option with computing. And again, I think that's something that our entire team is really proud of. And we've been amazed at how many people have already reached out wanting to take advantage of it.
Yeah, that makes a lot of sense. We've got a minute left. What's on deck for Impact Week 2022?
Well, I think we'll always be thinking about how we can help the environment, which is something we've talked a lot about. I think the other piece that is just part of what we've done at Cloudflare all the time is thinking about how we can help civil society organizations, journalists, and governments around the world that are trying to administer elections. And so I think we're always looking for ways that we can use our network in order to make the Internet a better place and in turn make the world a better place. And so I'm excited for what it is, and we should start working on it as soon as possible.
Exciting time for the Internet and for Cloudflare. Cool. Thank you both for joining, and see you all on Cloudflare TV soon.
Thanks, Rustam.