Cloudflare TV

Hardware at Cloudflare (Ep1)

Presented by Rob Dinh, Brian Bassett
Originally aired on 

What does Cloudflare look for in hardware?

Episode 1

English
Hardware

Transcript (Beta)

Hello, everyone.

Thanks for tuning in to Cloudflare TV. This is a segment where we're going to be talking about hardware at Cloudflare.

My name is Rob. And we got on the other side, Brian, who is a hardware engineer.

I'm an SRE. Both of us are really excited to talk about what our daily life is like and what we're doing at Cloudflare.

I've always felt that. So, I mean, just to get a little bit of context, I've been at Cloudflare for almost five years.

And hardware has always been something that I've always messed around with.

And we will always talk about all the products like, you know, serverless and Argo Tunnel and Magic Transit.

But it all kind of roots down to hardware. So a lot of the customers or even friends when I try to explain to them what we do at Cloudflare, they get surprised about, oh, yeah, there's this hardware at Cloudflare.

And I've been basically just trying to preach that, yeah, actually, there would not be any Cloudflare without the hardware.

So it makes me really happy that I'm here with Brian to have this session here.

So a little bit of the outline that we're going to be talking about is going to be our latest generation server for our Edge Compute, the Gen 10.

We'll talk a little bit, some insights about what we do at Cloudflare as an SRE, just for a little bit, and particularly for the hardware team that Brian is in, and just a little bit about what Cloudflare is looking for in terms of hardware.

So we can lay it all out first. There are 200 cities that Cloudflare is set up in.

That's over 95 countries. And that includes what, about at least 17 cities in mainland China.

And all of that has basically the hardware that we buy. So Brian, if you could talk to us a little bit more about our latest gen server.

That was a huge change from Gen 9.

Yeah. Hi, everybody. Thanks for viewing. I do want to remind people that if you have questions while we're on the air, you can email livestudio@cloudflare.tv and those questions will get relayed to us and we'll answer them in real time.

So late last year, Cloudflare announced what we're calling our Gen X servers.

They're our 10th generation of servers. And they're a lot different than the Gen 9 and Gen 8 and so on that came before them.

One of the big differences is that we went with AMD processors for this generation.

All the previous generations of servers at Cloudflare had been with Intel processors.

And I can show a picture of what one of them looks like.

So, Yes.

The reason we did that is we tested both. We didn't just have faith that one or the other would be the best fit for us.

We tested with a 48 core set up on the AMD side and a 48 core set up on the Intel side using their latest processors.

And our primary metric for determining whether we're getting the best bang for the buck out of a particular processor is request per second.

It's request per second that are serviced by NGINX, which is the web server that we use.

This is a shot of one of the processors or one of the servers.

You can see the AMD EPYC CPU in the middle.

We went with 8 32 gig DIMMs for a total of 256 gigabytes of system memory. We arrived at that based on the calculations that we did for the minimum amount that we need for programs and data that are running, and then a little bit of margin, which is used by the Linux file system for file system buffers.

This is primarily for caching.

Caching of customers' web assets, for example. So here you can see another shot, again 8 DIMMs, the big AMD processor in the middle.
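
To make that memory sizing concrete, here is a rough sketch in Python. The 8 x 32 GiB DIMM configuration is from the episode; the working-set figure is an illustrative assumption, not the actual budget.

```python
# Rough sketch of the memory sizing described above. The 8 x 32 GiB DIMM
# population is from the episode; the working-set figure is an illustrative
# assumption, not Cloudflare's actual budget.
dimm_count = 8
dimm_size_gib = 32
installed_gib = dimm_count * dimm_size_gib            # 256 GiB installed

assumed_programs_and_data_gib = 192                   # running services + data (assumption)
page_cache_margin_gib = installed_gib - assumed_programs_and_data_gib

print(f"installed memory:               {installed_gib} GiB")
print(f"margin for file system buffers: {page_cache_margin_gib} GiB")
```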

And yeah, we went with 48 cores for this generation.

We also tested 24 cores and 64 cores.

And to make that determination, we went with basically requests per second per dollar.

And it turned out for this generation that 48 cores was the sweet spot.
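
A minimal sketch of that requests-per-second-per-dollar comparison follows; the throughput and price figures are made up for illustration, only the metric itself comes from the discussion.

```python
# Illustrative "requests per second per dollar" comparison for picking a core
# count. The RPS and price figures are made up for the example; only the
# metric itself comes from the discussion.
candidates = {
    "24-core": {"rps": 100_000, "price_usd": 4_000},
    "48-core": {"rps": 190_000, "price_usd": 6_500},
    "64-core": {"rps": 220_000, "price_usd": 9_500},
}

for name, c in candidates.items():
    print(f"{name}: {c['rps'] / c['price_usd']:.1f} requests/sec per dollar")
```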

Okay. I mean, I was just trying to expand this a little bit.

I don't know if you could just help me. So Gen 9 had 48 cores as well, right? Yeah.

So what was sort of like the difference there? Because at least in my shallow understanding, more cores just means more performance.

Right. So we saw a higher performance per core with Gen X. There's, there's a couple of reasons for that.

First reason is that the Gen X server has 48 cores, but it's all on a single CPU package.

And so any accesses from those CPUs to memory are direct accesses.

On previous generations, we had two 24 core processors.

And so if one processor had to access memory that's attached to the other processor, it has to go through an interconnect, and that adds a lot of latency.

So Gen X solves that problem. A big part of the reason also is that these AMD processors have 256 megabytes of level three cache per processor package.

Level three cache is very fast memory that's attached directly to the processors and is used to hold recently accessed programs and data.

And if you get what's called a cache hit, then it's accessed much quicker than if you have to go to main system memory.

And with such a large level three cache, the cache hit ratio with Gen X is much higher.
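
The effect being described can be illustrated with the standard average-memory-access-time model; the latencies below are assumed round numbers, not measured figures.

```python
# Textbook average-memory-access-time model to illustrate why a higher L3 hit
# ratio lowers effective latency. The latencies are assumed round numbers,
# not measured Cloudflare figures.
def avg_access_ns(hit_ratio, l3_hit_ns=12.0, dram_ns=90.0):
    return hit_ratio * l3_hit_ns + (1.0 - hit_ratio) * dram_ns

for hit_ratio in (0.70, 0.85, 0.95):
    print(f"L3 hit ratio {hit_ratio:.0%}: ~{avg_access_ns(hit_ratio):.1f} ns average access")
```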

Yeah. So I guess that kind of comes down a little bit if I was to try to simplify for everybody else.

It's almost like we're basically building a cache server, right?

100% cache server. And the bigger the cache size that we have at the lowest level, the better.

Is this correct? Yeah. I mean, that is one of the primary functions of these servers, to cache our customers' web assets and make accesses quicker or lower latency.

There's some other processing that these machines do to each packet as it comes in.

For example, it determines whether a packet is part of a denial of service attack, and if so it throws it away.

And one of the things I was interested to learn when I came to Cloudflare is how CPU intensive that process is.

Cloudflare is not just passing data through. We're actually processing each packet and that's the value add.

Yeah, I'd say just at least on a bigger system level.

Having a system that's CPU bound is maybe easier to manage, is probably one way of looking at it.

We're not looking if it's NIC bound or bandwidth bound.

Maybe it's kind of a luxury that we don't do much of bandwidth. Anything that we care about is IO or IOPS.

This is actually a pretty good segue on how we're looking at our CPU and L3 cache, but also how we transitioned from SATA drives to NVMe.

Yeah. As you know, the stuff that we do is not terribly IO bound, but there are bursts.

And so one of the changes that we made for Gen X was moving from the SATA SSDs that we had used in the past, which are kind of the legacy protocol, to NVMe, which is a newer protocol and allows the CPUs to talk directly to the drives.

Our overall capacity didn't change.

We have the same capacity in each host. But we went with these NVMe drives that are in what's called an M.2 form factor.

It's about the size of a stick of gum.

And that gets us about 5x the read IOPS and about 2x the write IOPS compared to the SATA drives that we used before.

And that's only with three drives.

We had six drives in the past. We went with three drives at double the size and we're still getting this big improvement in IOPS performance.
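
A back-of-the-envelope version of that comparison follows, with assumed per-drive IOPS chosen only to reflect the roughly 5x read and 2x write improvement mentioned above.

```python
# Back-of-the-envelope aggregate IOPS for the old and new disk layouts. The
# per-drive figures are assumptions chosen only to reflect the roughly
# 5x read / 2x write improvement mentioned above.
old = {"drives": 6, "read_iops": 90_000, "write_iops": 40_000}    # SATA SSDs (assumed)
new = {"drives": 3, "read_iops": 900_000, "write_iops": 160_000}  # M.2 NVMe (assumed)

old_read, old_write = old["drives"] * old["read_iops"], old["drives"] * old["write_iops"]
new_read, new_write = new["drives"] * new["read_iops"], new["drives"] * new["write_iops"]

print(f"read IOPS:  {new_read / old_read:.1f}x the old layout")
print(f"write IOPS: {new_write / old_write:.1f}x the old layout")
```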

Okay. What was the previous configuration?

It was like six disks of 512, I believe, right?

Or 480 gigs. There you go. 480, yeah. Yeah. Just to give a little bit of a context too.

When I was trying to spec for the Gen7s or Gen8s way, way long ago, we started with just SATA drives.

And we didn't have that many metrics back in the days for what would be the perfect drive that we needed for the workload that we had at the time.

I mean, one thing I was worrying about was capacity, right?

Back in the days, we had 240 gigs per disk. And, you know, one of the disks that we have per server is a state disk that stores some state values, as opposed to caching.

And it was going to be pretty big trouble for us because that size was going to keep on increasing.

So, yeah, I'm actually glad that it's something that we were able to tackle.

So I believe now that we have three instead for one gig, right?

Or no, one terabyte. Yeah. 960 gigs. Yeah. Okay. Okay.

Yeah. And of course, the reason that's possible is because the price of NVMe and the price of SATA have pretty much converged these days to where you're not paying a big premium to get NVMe drives.

The NVMe protocol has been around for a while, but as recently as two or three years ago, the price premium was really big to get to NVMe drives.

And now we have the luxury of getting drives that are so much faster for about the same price per gigabyte.

Right. And what was the decision that we made to have the gumstick form factor instead of, I guess, more readily available U.2s?

Yeah. So what we went with with this generation is a carrier card.

And it's a PCIe card with just blank sockets for M.2s that plug in there and then goes into a PCIe slot.

Each one of those gets its own four PCIe lanes. And the PCIe x16 slot is carved up, what they call bifurcation, into four different groups of four PCIe lanes.

So what that lets you do is it lets you use a pretty cheap method of connecting them to the system.
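
A quick sketch of the bifurcation math follows; the per-lane throughput assumes PCIe Gen3, which is an assumption about the platform rather than a stated spec.

```python
# Quick sketch of the bifurcation math: an x16 slot split into four x4 links,
# one per M.2 drive. Per-lane throughput assumes PCIe Gen3 (~985 MB/s usable
# per lane), which is an assumption about the platform, not a stated spec.
GEN3_MB_PER_LANE = 985

slot_lanes = 16
lanes_per_m2 = 4
m2_slots = slot_lanes // lanes_per_m2

per_drive_mb_s = lanes_per_m2 * GEN3_MB_PER_LANE
print(f"{m2_slots} M.2 drives, ~{per_drive_mb_s} MB/s of link bandwidth each")
```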

The problem with the other kind of drives, which is U.2 drives, is they're two-and-a-half-inch rectangles.

It was really designed to be a hot-pluggable form factor; people usually plug them into the front of a server with a backplane and you can hot swap them.

We don't have the need for hot swapping.

We just want the drives to be internal to the system. So we're not paying for that extra plumbing, basically, with U.2 drives.

Right, right, right.

Yeah, I guess something's kind of has changed a little bit. I remember sort of, quote unquote, falling in love with hot pluggable components.

For us to try to service a node or server, it was sort of like the thought that we had to ensure that it was still up and having a hot pluggable was sort of the assumption or the solution that we had.

So it's good that we have something where we can actually save a whole bunch of useless components like the hard drive carriers and the screws that we don't need anymore.

And our services have to be down anyways. And it's fine.

We have a whole bunch of engineering that can also load balance without having us to shut down a whole bunch of nodes because of repairs.

So I think it's a great start that we have done.

Yeah, that's one of the advantages of a cloud environment and especially the Cloudflare one where every machine runs every service on the edge.

So any one machine going down, it does decrease your overall capacity, but there's no critical service running on that one machine that's going to cause any disruptions for customers.

So we have the luxury of being able to take the system down, get it out of production, and then get it fixed offline whenever there's someone available to do that.

And for having three NVMes, are we talking about RAIDing two of them?

Is that right? Or if you could just explain it to us.

About what? What's the configuration for having three NVMes? Why not two?

Why not four? Maybe are we RAIDing them? Yeah, so we have, like you said before, in previous generations, we had a single 480 gig disk dedicated to the state, which is some stateful information about the server, a database of configurations and stuff.

That disk on the older generations is close to getting filled up.

So with Gen10, we went with a single 960 gig drive for the same information.

That gives us a lot of headroom for that database to grow. And then the other two disks are for caching customers' web assets, and they're not in a RAID format.

They are just single drives that the management of the data on them is handled by our services.

And so, yeah, we can afford to lose one of those, and the system can actually keep functioning with just a single cache disk, just at a lower capacity.

Yeah, I think I remember a blog that we've done, at least internally. Not too much detail I can disclose, but there wasn't much of a performance difference between having many cache disks and having fewer disks.

So I think that's what also encouraged that we can just do two cache disks and one state disk.

Not only are we actually saving on the plumbing that you mentioned, but we're also dealing with fewer disks.

So it's something, at least for me as an SRE, I'm excited about.

It's hard to deal with servers that have subpar performance on a disk, because that means we have to take them off.

And with more disks just means more single points of failure, at least the way I see it.

Yeah, sure. And there's always been this claim that NVMes have, I guess, a lower failure rate, but I don't know if it's something that we were able to test, unless there was something you developed there, I'm not exactly sure.

Yeah, I mean, we're going to have to just monitor failure rates over time.

GenX is new enough that we're not seeing any disks failing at a rate high enough to be able to judge.

The rated mean time between failures for these drives and the SATA disks that they replaced is the same.

They're rated at 2 million hours for both of them. Endurance is about the same between these drives and the previous one.

So really what we're getting is performance and the low latency.

Okay, yeah. Also, we were just showing our Gen10 picture.

Something to notice is that it's a pizza box shape. So it's a 1U server. The past generations, we're talking about the 9, 8, all the way from 6, so 6, 7, 8, 9, were 2U 4N servers.

So just to explain a little bit, it's a big 2U chassis. So imagine you have two pizza boxes stacked on top of each other, and they're also quite long.

But that big chassis has four nodes sharing the power supplies in the chassis itself.

And the whole point of it was so we could actually have a huge amount of compute density-wise.

A multi-node server's purpose was to actually just have as many cores as we can.

So we decided on that. And in a way, you're actually also saving on the hoods, the chassis itself, the rails, and everything like that.

So why did we move away from the 2U 4N to a regular 1U?

So the advantage to a multi-node chassis like that is that you do save on the shared components.

So you're saving on power supplies, the chassis metal itself, the power distribution board is shared.

But those components, honestly, are not that big a percentage of the cost of a system compared to the CPU, the memory, the SSDs.

And there's no savings there. The downside to a shared chassis is that there are some things that can fail that could cause all four nodes to go down.

That power distribution board is one, for example. The other one is that you're gaining density in the rack.

But I know in some colos, we're running out of our power allocation for a rack before we run out of physical space.

So we're really not gaining that much.

The 1U form factor has the advantage of being simpler. And in a cloud environment, you're always striving for simplicity.

It's no shared components between hosts, so one can't affect the one next to it.

It ends up being that this generation is much cheaper than the previous generation.

That's not just due to form factor.

There are other factors that go into that. But even with the loss of potential savings with the shared components, it's much cheaper this generation.

Yeah, I'd say we can actually feel the benefits already when it comes to provisioning those servers.

Normally, we would provision a whole chassis, and we would assume that this whole chassis has four nodes.

And it was always hard for us to figure out which node was in a particular chassis.

If we were looking for a failed node, for example, was it node one, two, three, or four?

And like you said, it's more simplified.

Because there was a lot of confusion when we talked about how we actually identify a server: is it the chassis serial number?

Or is it going to be the node serial number, which is usually tied to either an asset tag or a motherboard?

When we go with the serial number of a motherboard, it's super hard to look at it physically.

If you try to imagine a technician at the data center, and we actually give them a ticket that says, hey, could you replace this disk?

And that's the serial number of the node. And it's really unfortunate when the serial number of the node is actually on a sticker inside the server, on the motherboard.

So that's always been hard. So that's great. And I think another positive that we can look at is there's less batching to do.

If we wanted to go leaner, it means we need to minimize our batch size.

So say somehow in a capacity, we say we only need 10 servers there.

And when you go to buy 2U 4Ns, you have no choice but to buy 12 servers.

Because it's four times three. And that does complicate things when it comes to trying to overprovision or get overcapacity.

And that's money that we could actually just spend or invest somewhere else.

So that's good. Well, I'm glad to hear that GenX is making your life easier, too.

I didn't know that. We'll see. It's still pretty early. But there are still some adjustments that we have to do.

So for quite a few years now, we've kind of tailored our environment to take 2U 4Ns.

And one supplier, one vendor.

Now we're introduced to two different vendors with Gen X. So it does, I wouldn't say complicate things, but there are more variables to manage.

So there's more firmware versions to manage, for example.

But as long as we got that going on, I think we're okay.

All we need to do is just ensure that we have good communications between SREs and hardware engineers.

And also the vendor relations. I think we'll get there if we're not there yet.

Like the BMC, for example, right? Or BIOS settings.

We've seen in the past that these things also affect our performance. It's something that we always have to keep an eye on anyways.

Yeah. But that dual sourcing of our server vendors has been really good for our cost structure.

Having two server vendors competing against each other for our business is always a good thing.

And then we're protected in case there's a disruption in the supply chain of one of them.

We can order from the other. So it does create some additional complication for you.

It did for me during the qualification phase, but it's worth it.

Yeah. We're in a place where we just want to scale things. And when you're trying to scale things, you try to keep the values as constant as you can.

So that way you won't confuse anything else.

But we're trying to realize, okay, where are the servers exactly?

And then also, are these servers from vendor one? Or are they from vendor two or three?

I mean, it's just something that we just have to engineer around.

It's not something we've faced before. If anything, I remember specing for Gen 8.

We did ask ourselves about that positive of, well, if you have two vendors, it's nice because there's some sort of a redundancy effect to it.

But for us, we were just like, well, I don't know if that makes any sense.

Because that means we have to tackle two different vendors and solve somehow.

So we just didn't see it that way. But we're a big company now. We get to see more metrics and we have other ways to try to do some cost savings.

So I think it's worth it.

We have some questions in the chat channel. Yeah, why not?

We can go through just a little bit here. Where do you want to start?

We can talk about that first one. So, listener asked, can you help me understand your thoughts on building your own servers versus buying systems versus some mix of components integrated?

That's a good question. Yeah, so like a lot of cloud providers, we use ODMs rather than OEMs.

OEMs being original equipment manufacturers.

So that's like your big enterprise manufacturers like Dell or HPE. We use ODMs, original design manufacturers, which are smaller companies that are willing to customize the hardware to exactly what we want.

If you go to Dell and say, I want a server that looks like this, if they don't have it off the shelf, they're not probably going to be able to give it to you.

These ODMs are willing to work with us to configure the system the way we want.

We partner with them and rely on them to do some of the things that we're not yet staffed up for, such as thermal testing and electrical engineering validation.

At some point, Cloudflare is going to get to the scale, I think, where it would make sense to do more of our own development in-house.

We're growing really fast, and I don't know how long it'll take to get to that point, but it seems like it will get there eventually.

Yeah, I'd say if it makes sense, it's just what Cloudflare will do.

That's been the philosophy that I've just been doing for years now.

We're not married to any method, and we're not married to any vendor or any company.

We're just going to go for what's best, what's best for us.

What we're married to really is the quality of service to our customers.

Whatever it takes, whether it's AMD or Intel or 2U4N or whatever, this is what we're going to go for, whatever makes sense for us.

That's kind of how I'm going to go with it. ODMs currently just make more sense for us.

Like the configurations you're saying, I'm just going to add on a little bit more also that they can also configure it on the firmware side of things.

There's support there, and there's also support with the components themselves.

The NICs, like we have with Mellanox or the CPUs, we have open communications with all these guys, too.

There's this triangle of system integrators, and then there's the components, and then there's us here that we're all working together.

I think that's a communication that you would not be able to benefit if we went through the side of HPE or Dell, for example.

Yeah, that's a great point about the BIOS. We ask our ODM partners to customize the BIOS with defaults set to the settings that we need to provision the systems in our environment.

That's not always a service that's available from the OEM vendors.

Okay, yeah. Let's see. What's another one that we got here?

What type of support do you typically find valuable for your servers, on-site, self-service, etc.?

I'm trying to actually understand this question. What type of support do you typically find valuable for our servers?

Maybe how to manage our servers on-site is probably how it goes.

I think this is for faulty equipment remediation.

Oh, yeah, yeah. At least in my position, we should have somebody from DC Ops to come in here to talk more about it.

At least in my experience, years ago when I was in DC Ops, the type of support that we had would be the technicians that are on-site.

That kind of does pose a bit of a challenge. I'm going to try to maybe leave this answer more as a question.

I would love to hear anybody else who has some sort of suggestion about it.

Again, we set up shop in 200 different cities, just all over across the world.

It's hard to find a way to replace one drive that's at the corner of the world.

The reason is because a drive could cost us, I don't know, what's the market price for one of those things?

$120 for a random drive.

Let's say this server is still under warranty. The typical warranty is, if you have a failed component, like a drive, within the years of service that we expect, it will be replaced for free.

However, there's going to be some support where the logistics is now going to be a problem.

Because now we're sending this one disk through customs and everything else.

It has to get to the colo locally over there.

Then we also have to hire a technician to be able to not only read our own instructions, but to locate and actually do the job properly.

That's tough. That's tough when you're essentially completely remote. It's different.

In my past life, when I was at Twitter, we had site technicians. Shout out to the site technicians.

To me, they're the essential workers for this time around.

They're trained to have our language, our jargon. They know exactly what to look for and stuff like that.

When you have 200 different cities where we have a number of different POPs, a number of different partner POPs, ISPs, and colocations where there's nothing that's too standardized, it becomes difficult to do.

While the support is there, we can turn those things on and rack them with good and clear instructions.

Managing them for RMAs has admittedly been less than perfect, but it's still something that we have to figure out.

I think Cloudflare has a good position of where we are.

We're at the forefront to try to define how to operate at the edge, basically.

The edge is the future, if it's not already.

For us to figure out how to get that kind of support, in a way that makes business sense, is a solution that we're trying to look for.

The first time I met you, you mentioned that some of these locations that we're in, it's so hard to get replacement parts to them.

When you send a server there, it's about like sending a rover to Mars.

It gets there and lands, and it works until it doesn't, and then your options are limited at that point.

Yeah. There's so many things that we could wish to do.

There's a lot of solutions like, let's just go ahead and send one of us over there.

We've done that. We've done it. It's become a very critical thing to do.

If there's a very complex project that needs to turn up a very, very big COLO, it's great to have one of our DC ops engineers to go out there.

It's just one of those things where you have to go through until you figure out a solution.

That answer is a question that's left out there for anybody else who can help us out.

Okay. Yeah. Someone asked about what are your thoughts, plans on OpenBMC?

I guess we can go ahead and spoil that one.

I had that one for the end, but we are actively experimenting with OpenBMC.

We have it running in a test environment today. We think that OpenBMC is the future, and we're hoping to get it in place for Gen 11, potentially even retrofitting Gen 10.

We really like what we see about it so far.

Okay. That's the first time I learned that Gen 10 is backwards compatible. You can open it up to OpenBMC, I guess.

Yeah. Gen 10 and Gen 11 are both using AST2500 BMC processors, and OpenBMC has a fork that's made for that.

It wouldn't be the same firmware image between Gen 10 and 11.

There are changes that you have to make to OpenBMC to tell it, for example, how many fans you have, how many RPMs you expect them to spin at, what level is an alert, what temperatures you expect to alert on if your processors are getting hot, stuff like that.

That all has to be customized per the physical chassis.

But the actual processor that OpenBMC runs on is common, and that's a super common processor in the industry.

Right. Yeah. I guess that's true.

We can slow it down just a little bit to give us a little bit of context. We've always wanted to tinker around with OpenBMC for a while, and now it looks like we actually have the boards and the CPU to do it.

It's something that I'm actually pretty excited about for Gen 11.

I didn't know that we could actually try to do it on Gen 10.

It's something to look forward to. If you could actually just list out or at least describe what OpenBMC is to the audience, compare it to what's the classic management tool, I guess.

What we're hoping for with OpenBMC. The BMC is the Baseboard Management Controller, and that's the out-of-band way to talk to and monitor and control your servers.

You can use open standard tools like IPMI tool to ask a server whether it's powered up or off right now, to turn it on and off, to monitor its temperatures, stuff like that.
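
For example, here is a minimal sketch of those out-of-band checks with the standard ipmitool CLI, wrapped in Python; the BMC address and credentials are placeholders, and this is not Cloudflare's actual tooling.

```python
# Minimal sketch of those out-of-band checks using the standard ipmitool CLI
# over IPMI-over-LAN. The BMC address and credentials are placeholders; this
# is not Cloudflare's actual tooling.
import subprocess

def ipmi(host, user, password, *args):
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password, *args]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

BMC = {"host": "bmc01.example.internal", "user": "admin", "password": "changeme"}

# Is the server powered on or off right now?
print(ipmi(BMC["host"], BMC["user"], BMC["password"], "chassis", "power", "status"))

# Read the temperature sensors.
print(ipmi(BMC["host"], BMC["user"], BMC["password"], "sdr", "type", "Temperature"))
```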

The way that we've done it in the past and the common way to do it is the server vendor takes a common code base and then does that customization to the server.

Sometimes they'll add features on. Sometimes it's not features you want.

They're doing all that coding in the background and they just give you a binary blob and you don't have any insight into what's happening.

Bugs can tend to creep in that way. We can't audit the security of the firmware that way.

OpenBMC, this is a common problem across the industry. The OpenBMC consortium is a group of industry players who have a stake in this.

It's big players like Intel and Facebook and Google, and then a bunch of other players who are also interested.

What they've done is come up with a code base that provides all these features of monitoring and controlling the servers, but it's all open source.

You can look at it and see that it's secure. You can add features to it if you want.

That's what we think we want to be running going forward for the additional security and the peace of mind of being able to fix bugs yourself if you have to.

Right. Okay. It's always been on our wish list. It's good that now it's not just a wish list, but a requirement.

I guess that's what it is. I think OpenBMC has come a long way.

I've always heard about it through the grapevine, from some ex-colleagues that are part of pushing and leading this thing.

For me, it's more like, does that mean I have to train for something a little bit different now, or can I just still use IPMI tool or something like that?

I'm excited to see what we would like to control.

There's definitely a lot of things that we could actually, maybe PowerCap or find a way to control our fans would be pretty interesting, especially when we have colos that are different.

We don't have anything standardized in our colos. Again, 200 colos.

We have different power delivery that's going to be out there. There's something that's 208 volts here.

It's just going to be different in Europe. It's going to be different in Asia.

In cooling too, cooling could be different. Some cooling could be from the top or from the ground, or there could actually be no cooling at all.

It really depends on where we're going to be at. As long as we set our minds to set up shop, we're going to try to find a way to make it.

To have all these variables, we could allow ourselves to have that control and flexibility in OpenBMC.

I'm excited to see what we could do.

Yeah, it works with standard IPMI tool. Someone in the questions asked about Redfish.

OpenBMC and our vendor-provided firmwares, they do work with Redfish, although we haven't really made that switch internally.
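
For comparison, here is a hedged sketch of the same kind of power-status query over Redfish; /redfish/v1/Systems is the standard Redfish systems collection, and the BMC address and credentials are placeholders.

```python
# Hedged sketch of the same kind of query over Redfish instead of IPMI.
# /redfish/v1/Systems is the standard Redfish systems collection; the BMC
# address and credentials are placeholders.
import requests

BMC = "https://bmc01.example.internal"
AUTH = ("admin", "changeme")

# Many BMCs ship self-signed certificates, hence verify=False in this sketch.
systems = requests.get(f"{BMC}/redfish/v1/Systems", auth=AUTH, verify=False).json()
for member in systems["Members"]:
    system = requests.get(f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False).json()
    print(system.get("Id"), system.get("PowerState"), system.get("Model"))
```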

We're mostly using IPMI tool for server control, right?

Yeah. Just regular IPMI tool. Yeah. It looks like there was a follow-up on the question about support.

Do we expect the hardware vendor to partner and take care of that repair?

I would say it's a typical warranty configuration.

I don't know what's the real word for it.

Whatever your typical warranty that we have for our servers.

Each of our vendors, at least the past vendors when I used to deal with them, they would have a portal.

And you would exchange your information about what the faulty component is.

But the actual swapping of the drives, or swapping those PSUs, or even maybe reseating memory, right?

All of that is done by the local technician that's out there at the colos.

This is a good question.

Someone says, is there only one type of server to run on the Cloudflare platform?

An edge caching server? Is there a core type server also? That's a good question.

On the edge, we have what we call edge metals. And those are what runs the CDN service, our denial of service mitigation, all the other good services that Cloudflare provides.

And those all look the same for each generation.

And like we mentioned earlier, they all run every service. We also do have two core data centers.

And those are where we collect customer analytics. For example, if you're a Cloudflare customer and you go to your dashboard and it tells you that Cloudflare has mitigated X number of denial of service attacks to your website in the past seven days, that's where that data gets collected and gets served to users when they go to look at that.

So the amount of data flowing through the core is actually pretty high because the customer analytics is a lot of data.

But that's not where customer data goes because customer data needs to be on the edge.

Yeah, I think that does set it up pretty well.

That was actually going to be one of the questions that we had running down there.

Yeah, again, there is an edge metal and then there's a whole variety of things that we do at the core.

So I'm working at the edge, so it's more of an education for me.

I could ask somebody at the core, but from your perspective, what kind of hardware or what type of things are you guys looking for at the core?

Yes, so our databases run on rotational hard drives still, because we have a lot of data that we need to store, but the requirements for data throughput are not high enough to demand SSDs.

And obviously we save a ton of money that way.

We do also have some Kafka machines that are running on SSDs.

And then there's compute in the core as well. Okay.

Yeah.

Let's see. Anything that's maybe, I guess, fancy or sexy about the edge metals that we have compared to the core?

Well, the NVMe drives, although we will soon be deploying NVMe in the core as well.

It was just kind of by coincidence. There are a couple of core configurations that are coming up soon that use the same AMD 48 core processor, the same RAM, same NIC.

And really the difference is the SSD configuration and then of course how they're going to be utilized.

Right. That's really good because it gives us economy of scale.

It makes it easier for us to debug problems.

Like if we solve a bug one place, it should solve it for the other places as well.

And I guess for the SREs who do work on the core machines, the closer they are to each other, the easier it is for them to provision and identify problems.

Yeah. It's a different set of problems actually that we have here. Yeah. I mean, in SREs at the edge, it's technically just one metal.

Yeah, there's two types, but it's actually really one.

And to be able to ensure that we have a whole bunch of different types of servers like compute and database and maybe a little bit of storage that we have at the core at the same time using the same CPU, I think it's a great deal to do.

We mentioned a little bit that we're tinkering with SmartNICs.

So we have OpenBMC now. And so what can we expect from SmartNICs?

I mean, maybe we're not actually making the decision that we go with SmartNICs, but what are you guys envisioning at the hardware team?

Yeah, we're just in the proof of concept phase with SmartNICs.

We know that there are other cloud providers who are using them and getting really good benefit out of them.

So our idea with the SmartNICs is to figure out how much of the load that's running on the x86 cores in a server today we can offload to the ARM cores that are on a SmartNIC and thereby free up more x86 CPU cycles to run the stuff that's CPU intensive on an edge metal, so NGINX basically.

Right. So we're still in the process of figuring out what's that percent?

What percentage of the CPU that's running on x86 today can we offload?

And what percentage higher request per second can we get out of that metal in that case?

So we'll see. Right. We have samples from multiple vendors in-house that we're testing right now.
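
One way to frame the question the proof of concept is asking: if the host is CPU-bound, offloading a fraction of the per-request work frees cycles roughly in proportion. The offload fractions below are hypothetical.

```python
# Simple model of the question the proof of concept is asking: if a fraction
# of today's per-request x86 work moves to the SmartNIC's ARM cores, how much
# more throughput could the freed cycles support? Offload fractions are
# hypothetical.
def rps_multiplier(offload_fraction):
    # A CPU-bound host that frees X% of its cycles can serve ~1 / (1 - X)
    # times as many requests, all else being equal.
    return 1.0 / (1.0 - offload_fraction)

for frac in (0.10, 0.20, 0.30):
    print(f"offload {frac:.0%} of CPU work -> ~{rps_multiplier(frac):.2f}x requests per second")
```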

Yeah, okay. Yeah, I see it as almost like a host within a host really.

It is, yeah. Yeah. This is an interesting question.

Have you looked at disaggregation of CPU and storage to get much better flash utilization?

Giving you the same performance as direct attach, using NVMe over fabric, i.e.

NVMe over TCP. So I did do some experimentation with this earlier this year.

And our thinking was, if we ever wanted to offer persistent storage at the edge, this could be one way to do it.

You'd have, for example, one big drive in a host that is otherwise the same as any other metal.

And then you could share it with other hosts in the same colo over NVMe over TCP.

Another potential use case would be you could share it with a machine that has a failed cache disk and let that machine basically cache the customer data over NVMe over TCP, thereby getting around that problem we were talking about earlier about how it can be hard to replace disks in some colos.
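
As a sketch of how such an experiment might attach a remote namespace with the standard nvme-cli tool; the address, port, and NQN are placeholders, not a real Cloudflare target.

```python
# Sketch of how such an experiment might attach a remote namespace with the
# standard nvme-cli tool. The address, port, and NQN are placeholders, not a
# real Cloudflare target.
import subprocess

TARGET_ADDR = "10.0.0.42"                   # host exporting the shared drive (placeholder)
TARGET_PORT = "4420"                        # conventional NVMe/TCP port
TARGET_NQN = "nqn.2024-01.example:cache0"   # placeholder subsystem NQN

# See what the target exports, then connect to it.
subprocess.run(["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", TARGET_PORT], check=True)
subprocess.run(["nvme", "connect", "-t", "tcp", "-a", TARGET_ADDR,
                "-s", TARGET_PORT, "-n", TARGET_NQN], check=True)

# The namespace then shows up locally (e.g. /dev/nvme1n1) and can be used
# like a direct-attached drive.
```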

This was just experimentation, sort of like to see, what kind of performance would we get out of it with this kind of thing.

What I saw was that the performance really was pretty close to direct attach storage.

I compared it to NVMe over TCP, sharing an NVMe drive, and compared it to local SATA, and it was no comparison.

The NVMe over TCP was much faster.

Even comparing to a local NVMe drive in the same host, I saw that the performance was about the same but when you get to the very long tail of latency, like the 99.95, 99.99, then the performance of NVMe over TCP kind of started to tail off.

Also, with very small packet size, the performance was a little worse than direct attach storage, which kind of makes sense because you think about you're encapsulating each packet of data inside TCP.

There's going to be some overhead that comes with that.

Overall, I was impressed. I think it could work for the things that I was thinking we might use it for.

I guess it depends on the load, too, that we have.

If we're going to be seeing this type of thing often enough, maybe it's actually not worth doing NVMe over TCP.

At the same time, the first thing I can see is there's already a physical benefit about using NVMe over fabric, over direct attach.

The big benefit is that you can allocate space where it's needed.

Right now, we're allocating physical hardware to each metal.

We know how that hardware is going to be used. We control all the services, so that works good.

For example, let's say we wanted to solve this problem of needing an extra disk to replace a failed one.

We don't want to ship an extra disk with every server just in case of a failure.

This is a way that you could allocate just the capacity that a host needs on demand.

Yeah. Okay. We're down to the 10 minutes left.

If there's any more questions, please bring them in. Meanwhile, I will pose questions for hardware.

Gen 11 is going to be the first time I'm going to officially make myself divorced from hardware selection.

I've been very proud about Gen 10.

It's thanks to the hardware team. It's like you, Brian. We actually have actual hardware engineers that can do this and not leave me making decisions anymore.

This is good. I have a pretty good feeling that I would be more proud with Gen 11.

Is there any kind of previews? Any tidbits that you could tease us with for Gen 11?

What are we looking for? We're really looking for any technology that benefits us.

We don't want to get locked into any one technology or vendor.

We're trying to look at as much as we can. I mentioned OpenBMC. That one, it's not a go yet, but I think we'll get there for Gen 11.

The SmartNICs are still in proof-of-concept phase, but I'm cautiously optimistic that we'll see the benefit there.

We're testing CPUs from several manufacturers. We love our AMD CPUs, but it's not a given that we'll use them again for Gen 11.

We'll test them, and we'll see which one gives us the best benefit in performance, performance per watt, performance per dollar, and we'll go with whichever one's best.

I think that's what's great about working at hardware at Cloudflare.

We are positioned to, I wouldn't say pivot, but decide which component makes sense for us.

We can do it in many different ways.

The elephant in the room is everybody was saying, so when are you going to be switching off to ARM or something?

The best, I guess, most accurate answer would be we are ready.

We can be ready for ARM. It doesn't mean that we're committing to it or anything like that.

It has to make sense. Whatever solutions that we have available for ARM, AMD, and Intel, and anybody else, it's more like, are they ready?

Because we are. One of the features that we got from the AMD processors that are in the current generation is we can encrypt memory while it's in use.

It's a feature called Secure Memory Encryption, SME. We found out that there is a little bit of a performance hit from doing that, but we like the idea of having data encrypted on disk when it's at rest and on the network when it's in transit, which we do both of today.

Then the AMD processors give us the capability of doing it when it's in use as well.

That's the feature that we hope that other CPU manufacturers will match in their forthcoming generations because we really like that idea.

Yes. I had a question about 64-core CPU compared to 48.

When we were making that decision, we were looking at not just performance, but also performance per dollar and performance per watt.

Like I said earlier, we landed on 48 cores as the sweet spot for the current generation.

I wouldn't be surprised if that changes in the future.

The good news for us is that our network stack and our whole software stack scales really well with CPU cores.

It's pretty much just a drop-in replacement to go to more cores if we decide that it makes sense from a cost and power consumption perspective.

Yes. I guess that does bring up a move that we could do too: rather than iterate a server, we can always upgrade some of our current servers if it makes sense.

Then again, it goes back to the whole logistics of shipping a CPU to the corners of the world and have a qualified technician to make that move of swapping the CPU.

Yes, that's true of any part in a cloud environment.

It's just upgrades are tough. I think in this case, if we decide to go with more CPU cores, we'd probably just get that into Gen 11.

Yes. It's definitely a factor, but we were trying to also ensure that we're also future-proofing our servers.

Is that a good question to have? Our servers typically last four years of service, five years of service, whatever the warranty is.

Yes. You're typically looking at about four years.

One of the things we did is we over-provisioned the NVMe drives that are in these.

What that means is you leave some of the space unused on each one.

The controller that's inside each NVMe drive will use that space as basically a spare space for when it's moving data on disk.

What that does is cut down on write amplification, which is the ratio between how much data you've written to disk from the OS perspective versus how many bytes have to actually be written on the disk to rearrange pieces on the disk.

What that does, the effect that that has is it greatly increases the durability of the disk.

A disk that might be rated for 1.3 drive writes per day, if you do 20% over-provisioning, it'll about double that.

Endurance of these drives is a concern.

That's one of the ways you can mitigate that and like you said, future-proof.
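
To put rough numbers on that endurance point: the doubling factor for 20% over-provisioning is the rule of thumb mentioned above, while the capacity and service-life figures are just for illustration.

```python
# Rough endurance arithmetic for the over-provisioning point above. The
# doubling factor for 20% over-provisioning is the rule of thumb from the
# episode; capacity and service life are just for illustration.
capacity_gb = 960
rated_dwpd = 1.3              # drive writes per day, as rated
op_endurance_multiplier = 2   # "about double", per the episode
service_years = 4

rated_tb_written = capacity_gb * rated_dwpd * 365 * service_years / 1000
with_op_tb_written = rated_tb_written * op_endurance_multiplier

print(f"rated endurance over {service_years} years:   ~{rated_tb_written:.0f} TB written")
print(f"with 20% over-provisioning: ~{with_op_tb_written:.0f} TB written")
```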

Future-proofing for that is definitely an immediate thing.

That's currently what we're dealing with. I think the past year or two, we did those upgrades, not for CPUs but for disks.

A lot of the disks that we had, at least with the testing and metrics that we had at the time, which was barely anything really, we said laptop disks made sense.

It was just because they were cheap.

That was exactly the reason. We couldn't find any other reasons to try something more expensive unless you could find a way to reason it.

Now that we have a team that does all that type of stuff, the benchmarking, we could talk a little bit about that.

It's already almost 12 o'clock. Maybe we can have another time to talk about how we're testing things.

Yeah, I think that'd be good.

Benchmarking, I'm sure there's a lot of folks who would be excited to come in.

I would love to hear more about other teams that are involved with hardware, the supply chain, there's DC ops and everybody else.

Anyways, I guess we can close that for now.

I just want to say thanks for everybody for tuning in. We apologize if we couldn't answer all the questions.

We'll do our best to get over them. Maybe we'll have another time for those things.

Yeah, I do appreciate the questions. Yeah, so this is it in closing.

This is me, Rob, and Brian. We're out here. Hardware at Cloudflare.

Shout out to everybody who's a Python illiterate at Cloudflare, basically.

This is what we do. We're proud folks, and we hope to deliver good products the way our software engineers deliver good products.

For us, it's the hardware.

We're very proud of that. Thank you. Yeah, thanks for being here.

For sure. Thank you for tuning in, the audience. We're going to shut off the mic now.

Thank you.

Thank you.

The whole population of Singapore is on Carousell. Our end user traffic is close to 10,000 to 15,000 queries per second.

Carousell started working with Cloudflare to address their immediate DNS needs, but quickly expanded their usage in order to continue serving web content quickly as their user base grew.

Started off as a solution for our DNS and SSL termination requirements, and then we sort of explored the product further and our usage grew after that.

We started exploring the ability to cache our assets at Cloudflare, which we found quite good and within three years we have moved all our caching requirements to Cloudflare.

Carousell also adopted Cloudflare security features like the web application firewall after noticing vulnerabilities in their existing firewall.

As someone who would run a public property on the Internet, there is a constant threat of adversaries targeting you for DDoS attacks for various purposes.

The firewall which we had at that time had a few lacunae, and we wanted to have a solution which was much more comprehensive, and we explored the web application firewall offered by Cloudflare and it ticked all our boxes.

It had the basic OWASP security related rules as well as various other specialized rules that Cloudflare themselves added, and we also had the ability to add custom rules.

So all of these features essentially made it a perfect fit for our need.

As a thriving e-commerce platform, Carousell understands that its Internet security and performance must work hand-in-hand.

The company cannot allow its security posture to impact site speed and availability, especially when flash sales attract sudden influxes of valuable traffic.

One of the biggest benefits that we get out of Cloudflare is that the security is nearly free of cost when it comes to performance.

So user experience doesn't degrade with all these checks that are put in place before a particular request is served from our origin.

So that is one of the benefits that we definitely see.

In the past three and a half years of using Cloudflare, they have solved our problems around the various areas of handling traffic in caching, in security, reliability.

Cloudflare is very key for the experience that we offer to our users.

With customers like Carousell and over 25 million other Internet properties that trust Cloudflare with their security and performance, we're making the Internet secure, fast, and reliable for everyone.

Cloudflare, helping build a better Internet.
