Workers and Cloudflare Network Expansion
Presented by: David Antunes, Jon Rolfe, Mekhala Sanghani, Kabir Sikand
Originally aired on September 29, 2021 @ 12:00 PM - 12:30 PM EDT
Join our product and engineering teams as they discuss how we are expanding our network and for an exciting update on our Workers product!
Read the blog posts:
- Cloudflare Workers: the Fast Serverless Platform
- Cloudflare Passes 250 Cities, Triples External Network Capacity, 8x-es Backbone
- Vary for Images: Serve the Correct Images to the Correct Browsers
Visit the Speed Week Hub for every announcement and CFTV episode — check back all week for more!
Transcript (Beta)
Thank you all for joining me. Thanks for joining on Cloudflare TV. I am Jon. We're here to talk about Speed Week and our really exciting passage of 250 cities, our tripling of external network capacity, our 8x growth of backbone, and of course, we have Kabir on here to kind of keep us grounded on what that means for our users, what that means for the folks over in product, who take the great work that's done on the infrastructure side to make all the new cities, make all the new expansions happen, but what that ultimately means.
Just to kick us off, let's just go around and talk a little bit about what everyone does.
I'll start. I'm Jon.
I work on the network strategy team under infrastructure. I wear a bunch of different hats depending on what day it is, but today I'm here to kind of talk about the contents of the blog, the planning process before it even gets to the point of installation or shipping or whatever, and also here to guide the conversation.
David, could you tell us a little bit more about what you do? Sure. So I'm David Antunes.
I joined the company almost eight months ago. I'm part of the infrastructure team, which is an awesome challenge.
We're enjoying it.
So basically, we have to deal with building the network, connecting the network, maintaining the network.
I guess we'll go into those details during the conversation.
So, to wrap up, that's pretty much it. Great. Mekhala? Hi, everyone. I'm Mekhala from the infrastructure team.
And yes, we are a team and we are collaboratively working between our teams.
And Kabir? Yeah. Hi, I'm Kabir. I'm a product manager on the Workers team.
So today I'll be chatting a little bit about how some of these changes to our network and the expansion can help you guys.
Great.
And just to also base our discussion a little bit, let me go quickly over the accompanying blog to this segment, which talks about how we have just surpassed 250 cities.
It's come with some really exciting news, like obviously, yay, we passed 250 cities, but also the incredible expansion of our external capacity, our new stats involved in faster connection to more IPs, new backbone segments, et cetera.
So we're in 17 new cities outside of mainland China, across eight countries, that we were not in just two months ago when I last went on Cloudflare TV.
So we've got Ecuador, Saudi Arabia, Algeria, Thailand, Guam (United States), Russia, the Philippines, and Brazil.
But also something that we don't talk about as much is all of the awesome work done with our partner, JD Cloud & AI, which is up to a whopping 37 cities spread throughout mainland China.
Oh, that's really great. We're now 250 plus cities, 100 terabits plus. I guess we don't go into the exact number, but we have a very large internal backbone that we're really excited to be growing.
And all of this has come together at a time where it's been exceptionally hard to do really anything.
So we've added all these new cities just in two months while it's not really possible to transit borders at all easily, either for humans or for cargo, for servers.
And you can have all the maintenance contracts in the world and it doesn't mean anything if you can't get goods and services elsewhere.
So this kind of growth has really enabled some great changes to our products. Like, what is it?
We're entering the era of the waitless Internet, thanks to all the hard work being done on the back end by the folks on the infrastructure team and by all the great work done by our engineering and product and strategy teams to make it so that once you connect really quickly to our data centers, because the speed of light only goes so fast in fiber, then you have a great experience from there on out.
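To put a rough number on the speed-of-light point, here is a minimal sketch that estimates the best-case round-trip time over a fiber path. It assumes light in fiber travels at about 200,000 km/s (roughly two thirds of its vacuum speed); the function name and the example distances are illustrative, not figures from the talk.

```typescript
// Assumption: light in optical fiber covers roughly 200 km per millisecond
// (about two thirds of the speed of light in a vacuum).
const FIBER_KM_PER_MS = 200;

// Lower bound on round-trip time for a fiber path of the given length.
// Real paths are longer than the straight-line distance and add queuing
// and processing delay, so measured latency will be higher than this.
function minRoundTripMs(pathKm: number): number {
  return (2 * pathKm) / FIBER_KM_PER_MS;
}

console.log(minRoundTripMs(100));  // 1  (a data center in a nearby city)
console.log(minRoundTripMs(6000)); // 60 (a data center on another continent)
```

Under these assumptions, no amount of server-side tuning can recover the tens of milliseconds lost to distance, which is why adding cities closer to users matters.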
So with that little background set, just to kind of kick off the discussion.
So David and Mekhala, while I get the great pleasure of looking top down on these new cities, expansions and everything, I don't really have the space to go into the hard technical details and technical challenges of this multi-team process that makes this happen.
Since you guys are working on the inside, could you tell us and our viewers some more about the team and this process end to end?
Yeah, I can start. So, as I mentioned earlier, the team has three main pillars of activity. The first is about maintaining the network, which has to do with handling faults.
Faults can be backbone links down, can be servers, can be networking, can be a lot of things.
We have the connect team, which is literally about connecting, again, links, backbones, interchange, things like this.
And the build team, which is the one that I'm more focused on, and so is Mekhala.
So the build team has to do with, one, new sites, new colos, what we call colos.
And number two, expanding existing ones: capacity, adding servers, adding networking gear, all that stuff.
So focusing on a new colo, which is what we posted today, reaching 250-plus cities.
Yeah, it's quite a long process, so we have other teams interacting with us.
So the site planning team, for instance, that handles all the negotiations with our partners, if that's the case.
The capacity team as well, which basically will identify what are the needs in terms of gear for there.
So if it is a big country, a small country, a big city, of course, that will affect how many parts we will need to address the needs.
So once that is settled, we have all the part of the scoping, so we need to specify exactly what we need to fulfill those needs, how many servers, how many routers, how many switches, everything.
Then another team that works very closely with us is the logistics team, so they have to do all the part of the supply chain.
That gives us some time also to prepare the installation ourselves, and once everything is ready, then we interact with our technicians on site and with the partners to get things installed and cabled.
Once the installation is done, we have all the provisioning of the services, so that includes networking and the servers as well.
And ultimately, it will end with enabling the site, turning up the services, and voila, there's another city to the list.
So yeah, it's challenging, but a good challenge, and the type of things that really encourages us to try to improve better and faster.
So of this massive number of individual parts, what do you most often get blocked on?
Like, what is the most common cause of delays in this process?
Once we have gear on site, what is the biggest block between that and getting traffic flowing?
You asked about delays, Jon.
So, a delay occurs only if the material is not delivered.
Otherwise, all the teams collaborate with each other, and so we prioritize our installations.
We have the installation document ready for the engineers, and we make sure that we achieve our RFS (ready-for-service) date to get that site live.
So, yeah, we keep an eye on all the projects in parallel, so that they run smoothly.
Yeah, it's kind of incredible, just looking on it.
As I said, we posted this two months after our last post, a bunch of new cities, 17 cities in two months, not even going into JD.
That's just a lot of literal projects to manage, to keep on top of, particularly since we don't have any full-time staff on site.
And given that, what is it actually like working with these internationally spread out contractors, all these different folks on site, who you instruct and you have this documentation for, but obviously, you're not over their shoulder being like, hey, plug that in, right?
Yeah.
So, we have the installation guide for the engineers working on site to help them, and we work with them on a call basis, an email basis, everything.
And we make sure all the connections are intact and verified, and that connectivity is 100% healthy.
And then we proceed to the network provisioning or the server provisioning task.
So, we also have to assess the situation on site and what the physical infrastructure can support. Sometimes the physical infrastructure does not support what Cloudflare requires.
So, we prepare a solution for that.
So, for example, the partner has DC PDUs, but we need AC.
So, we put our inverters there. So, that is one case.
In another case, sometimes the partners don't have IPv6 infrastructure.
In that case, we build a solution like a GRE tunnel, and we work toward the final turn-up of the service.
We don't just stop, and we try to work it out from there.
Yeah, there are some challenges that are very specific to specific regions.
The basic one, which has to do with power supplies, it's not the same everywhere.
So, we need to adapt to that. Of course, customs are not always the same either.
Sometimes, there are delays in that part as well.
Mekhala already mentioned network provisioning. We try to use kind of a standard approach, but it's not always possible, and we have to adapt to specific cases.
Time zones are also a challenge, but fortunately, we have teams spread all over the place, so we can follow the sun, as we say.
So, that's not actually a big challenge, I would say.
Sometimes, even the language, of course, we all try to speak English, but it's not always possible.
So, we need to adapt to that as well, but we also have different people in the team who are able to meet some of those needs.
So, yeah, one of the things that I personally enjoy is the opportunity to meet different people from different places, and certainly, we have that here.
Yeah, that's actually, if you don't mind me asking a follow-up on that, that is like truly, again, going back to how it's incredible that we have so many new cities and so quickly, but it's not like there is a cookie-cutter process.
Each area, each region has its own culture and its own issues.
Maybe some places use IPv6 more or less. Obviously, we have a bunch of great automation tooling and have good guides, but how much does it vary place by place in terms of challenges or experiences while doing new city turn-up?
David?
Challenges, sorry, if I understood the question. Challenges when turning up a new site, is that the question?
Yeah, yeah, yeah, but like specifically, we have all these processes.
We try to standardize wherever possible, but not all places are the same in terms of how their setups work.
Could you just talk a little bit about like maybe some of those cultural or technical issues?
You said DC versus AC power, IPv6.
What are some other examples, or are those the two big ones?
Well, power for sure is one of them, and you mentioned an important one, but it's not the same everywhere.
Yeah, we have a tool, and with that tool, we can analyze that region. So, suppose we turn up a site in a Brazil location, and users are not getting the prefixes or service from that PoP, but from some other PoP.
Then we do an analysis using our tools of where that traffic is really going, which colo that traffic is hitting.
Because with that tool, we get an idea of why the customer is seeing that he's not getting the service from that PoP, but from another one.
So, we assess that and explain to the partner what is happening and how they can balance the traffic on their ISP side.
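The internal tooling is not described here. As a rough public analogue, sites proxied by Cloudflare expose a `/cdn-cgi/trace` page whose plain-text body includes a `colo=` line naming the data center that served the request. The sketch below only parses such a body; the sample text, hostname, and function name are illustrative assumptions.

```typescript
// Parse a key=value text body (one pair per line) into a lookup object.
function parseTrace(body: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of body.split("\n")) {
    const idx = line.indexOf("=");
    // Skip blank lines and lines without a key before the "=".
    if (idx > 0) {
      fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
    }
  }
  return fields;
}

// Hypothetical sample body, shaped like a trace response.
const sampleBody = "h=example.com\ncolo=GRU\nloc=BR\n";

console.log(parseTrace(sampleBody)["colo"]); // GRU
```

Comparing the reported colo against the city a user is actually in is one simple way to spot traffic that is being routed to a more distant location than expected.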
Yeah, yeah. Oh, that's an excellent point, particularly when we don't control the last mile ISP.
So, particularly where we have a bunch of cities deployed with one partner.
Yeah, traffic doesn't always flow in the nice organized way.
So, there are cases where we're really close to a user, but that user's traffic flowing over the original ISP goes somewhere entirely different.
Do you have any stories, like what is the strangest routing? Are there any cases where while turning up, like you're putting up a city in Rio de Janeiro or something, and traffic is hitting Miami?
Are there any funny stories like that?
I remember one... No, we have seen many. Recently we have turned up so many PoPs in the Brazil area.
So, there we have seen such issues and we help our partners to understand the situation.
I remember a case recently in Eastern Europe where it was not the first nor the second location in that country.
And when we enabled, well, another location, basically we noticed that the traffic was not being attracted there.
So, all the traffic was still going to the others and we had to discuss that with our partner, with the ISP, to check what was actually going on.
And we had to work together to fine-tune this and finally managed to balance the traffic because, yeah, it needs to be a kind of a handshake between the two parties.
So, yeah, it happens. But we also, when we enable a new site, we do it in a controlled manner.
So, we don't turn everything on in the first place. We do it gradually, step by step, and we like to enable just a small set of services in the beginning to see how it is going.
And then when we are absolutely sure that, okay, this is behaving as we want, we start adding further and further until we get everything that we want working there.
So, we try to, well, mitigate the risk, sort of, and identify any potential issues very early.
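The gradual turn-up described above can be sketched as a loop that enables one stage of services at a time and stops at the first stage that does not look healthy. This is a minimal illustration only: the service names, the stage grouping, and the health check are hypothetical and are not Cloudflare's actual process or tooling.

```typescript
// A health check receives the full list of services that would be enabled
// and reports whether the site behaves as expected with them turned on.
type HealthCheck = (enabled: string[]) => boolean;

// Enable services stage by stage. A stage is kept only if the site is
// healthy with it; otherwise the rollout stops so the issue can be
// investigated before anything further is added.
function stagedEnable(stages: string[][], healthy: HealthCheck): string[] {
  const enabled: string[] = [];
  for (const stage of stages) {
    const candidate = enabled.concat(stage);
    if (!healthy(candidate)) {
      break;
    }
    enabled.push(...stage);
  }
  return enabled;
}

// Hypothetical stages and a check that fails once "workers" is included.
const stages = [["dns"], ["http"], ["workers"]];
const result = stagedEnable(stages, (list) => list.indexOf("workers") === -1);

console.log(result.join(",")); // dns,http
```

Starting with a small first stage keeps the impact of a misbehaving site small, which matches the goal of identifying issues early.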
Kabir, speaking from the product side, and before this, I gather, you were a solutions engineer.
How much do you think about this?
Like, how transparent is this process for the work you do?
Yeah, that's a great question. So, a little bit twofold of an answer on that. One nice thing about the way that Cloudflare has architected our infrastructure is that as a solutions engineer, when we were building out solutions for large, small, medium-sized customers, you don't have to think about the network very much.
It's there for you. It works out of the box. It's incredibly fast.
It's reliable. There are problems that might arise, as Mekhala and David have spoken about.
Maybe a new data center is spun up, but we're seeing some issues there and we might have to route around that.
It's built so that, out of the box, as a customer, I don't have to think about those things necessarily.
Now, on the flip side, when we're building products at Cloudflare, we do think about this and we want to make sure that we're delivering some of the fastest speeds that we can, the most reliable network that we can.
Even within data centers, we have different services that run on top of these pieces of infrastructure.
And so, we have to think about how does this play nicely in such a way that our customers don't have to think about this?
Let's make something where we're taking the burden of a lot of the issues and complexities of distributed computing off of our customer.
And we're making it something that's completely transparent, yet reliable and easy to use.
So, in answer to your question, yes, we think about it, and we know about it and we're acutely aware, but we don't always have to.
And it's nice when we see some of these updates, right?
I think only a year ago, we were in something like 200 cities.
Jon, you might have the right number there. And then, within this last year, just moving up to 250, we're incrementally seeing these improvements as we do tests and speed tests, and our customers are telling us, hey, things got faster, I don't know why, right?
But at the same time, I might be able to deliver the good news to a customer in one of our new cities that, hey, look, we have a point of presence in a place where your business is fairly uniquely run as compared to other businesses, or maybe there's a new niche market that you guys are tapping into.
And luckily, Cloudflare is there now. So, we're able to serve those customers incredibly quickly.
Yeah, that's awesome detail.
I think that's something I like about doing these blog posts and Cloudflare TV segments is kind of shining a light because it is, even from my role on network strategy, it's incredibly easy to take it for granted.
Because if everyone on infrastructure is doing our jobs right, it is perfectly transparent.
And it's like something that is good to think about and good to brag about.
I have this really great map in my background.
But almost, it's nice to not be seen in some ways when it comes to our infrastructure.
We just add new dots, more, more, more, more, more, more, more.
And because our network is already so great, the incremental improvements are sometimes small on a macro scale, but matter hugely to the customers and our customers' customers in these various cities.
In my blog post, I had this graph about South American traffic and essentially South American latency to users from just three months ago and now.
Obviously, South America is an incredible market and we've turned up so many cities in this past year.
But looking at that graph, which I encourage you and our viewers to do if you haven't already seen, you just see a notable shift to the left of just suddenly everyone's Internet experience on like literally tens of millions of Internet properties just got that much faster in three months.
It's a real source of pride for me in my work. Yeah, absolutely.
I think you brought up a good point there in that it's incredibly simple to use and not have to think about.
But in aggregate over time, it's just another piece of another benefit for our customers.
You can see this by looking at any single one of Cloudflare's products.
The whole thing is built on this incredibly fast and reliable network that's growing over time.
Products like Workers, which is the one that I'm working on, are getting faster and faster every time that the infrastructure team adds more points of presence to Cloudflare's network.
On the other side of that, if you wanted to do this on your own without a service like Cloudflare, you have a whole host of reliability concerns, a host of concerns around consistency, partnerships, where you're going to hook your data centers into, what services you're going to use in each region, how to balance between those regions.
And that's just really the tip of the iceberg of some of the problems.
I'm sure David and Mekhala have even more items they could add on to that laundry list.
But suffice to say, we're able to make it a lot simpler to use the Internet at scale.
Yeah.
And that's a really great thing to keep us grounded in. So from a product perspective, given that we are making this move towards discussing our network, I feel like, and this is something I touched on at the end of my post, I feel like we're turning the corner from thinking of ourselves as a global network, which I think we've really known we are for years now, to starting to think of ourselves as a globally local network.
You can make decisions for what to build and work on reliability assumptions based on incredibly low latency and incredible proximity to our customers.
Would you mind telling us what some cool applications of Workers are, just because our customers really get the freedom to play that close to our users?
What kinds of things are made possible by this globally local network, this globally low latency?
Yeah, there's been a lot of use cases that I think anything from very, very simple parts of an application, let's say an endpoint that needs to be close to the user or previously was within the browser and made the experience a lot slower because the packages might be larger for the customer to download, or maybe their instance of their mobile browser or their web browser just gets a lot heavier when they're using your application.
You can take a lot of those and bring them to the edge because the edge is now no longer hundreds of milliseconds or thousands of milliseconds away.
It's tens up to 100 milliseconds, really.
And as Cloudflare grows, that number is going to get lower and lower for those end users.
The other thing that's nice about the Workers platform is last year, we announced that we've built out tooling to help eliminate cold starts completely.
That's been a large, I guess, point of contention amongst the serverless world, which is, yes, I want to be very fast, but for some reason, every time I spin something up in any data center, there's cold starts that I have to deal with.
And Cloudflare's eliminated that using our isolate model, using some really interesting tricks that we've pulled through the front end of our architecture.
From there, you're able to take not only those small use cases that are things that used to live in the browser, but now you can start to take things that were previously thought of as functions you might put in a server in some regionally distributed or even centralized cloud.
And you can start to take those and put those onto Cloudflare's edge.
One of the things we did with the launch of Unbound is we removed a lot of the limits that you had around the CPU time you're allocated within the Workers environment.
So you're no longer bound by something that you're not bound by in a cloud environment.
You're now able to do that in Cloudflare's environment.
So you can do things that require complex algorithms or lots of CPU time on Cloudflare's edge.
You can schedule those jobs to be run using our functionality.
And a lot of that just opens up use cases that we haven't seen before.
So I think over the course of the next few months and years, you're going to see a lot more improvement and I think a lot more use cases of Cloudflare displacing some classic cloud infrastructure on the edge with the Workers platform.
Awesome. Thank you so much. And we're almost at time, just about 70 seconds left.
But just to wrap us out, could you all tell me about your teams and maybe what positions you're hiring for?
People you need to help to make a better Internet?
Well, I believe we are searching for people to help us in the networking part because, well, this is a never-ending project.
So we reached 250. I'm pretty sure that this number will get bigger and bigger.
So we need people for that. So if you have good networking skills, please look for the roles that we have open and apply.
And you will enjoy it for sure. It's a great company with a great future and a very challenging and wonderful environment to work in.
And certainly across the board at Cloudflare, there's a lot of growth happening.
So anything from product management to engineering, we are hiring across the board.
We'd love to work with you.
Great. Thank you all so much. Pleasure talking. Have a nice rest of your day.
Bye, all. Bye.