Hardware Debugging and Performance
Presented by: Yasir Jamal, John Graham-Cumming
Originally aired on August 17, 2024 @ 5:00 PM - 5:30 PM EDT
Join Cloudflare Hardware Engineer Yasir Jamal and CTO John Graham-Cumming for a technical discussion on how Cloudflare keeps its network running as efficiently as possible. Tune in for a deep dive on how we diagnosed and debugged certain server SKUs to achieve optimal performance.
Transcript (Beta)
Okay, well good afternoon everybody, or good morning, or good evening, depending on where you are around the world.
I am John Graham-Cumming, Cloudflare's CTO, reaching you today from Lisbon.
And I have with me today Yasir Jamal, who is a hardware engineer at Cloudflare.
Welcome, Yasir. Welcome, good morning, good evening, good afternoon.
Listen, it's going to be interesting to talk about hardware because I suppose that most people think about Cloudflare as a cloud company, a software company.
And if you read through our blog or any of our publications, we're often talking about software we've created and things we've open sourced or services we run.
But of course, we run that on our own hardware. I think perhaps a little bit unusually for companies these days, we're a cloud provider, which means we have to build the cloud ourselves, which means real hardware.
And obviously that requires hardware expertise. So you're here obviously to help explain what we actually do with hardware.
And you wrote a really interesting blog post about hardware debugging and performance back in May, I think.
Yes. And you've been here about a year. So first of all, what's it like as a hardware engineer to come into a software company?
And I sort of have a little bit of an inkling of this because my first ever job was working on network adapter cards, writing the firmware, and I worked very closely with the hardware engineers.
So tell us about your experience coming into Cloudflare.
Yeah. I mean, it's an interesting experience for sure.
I spent the first 10 years of my career at HP. Everyone knows it's a hardware company, right?
And when I decided to make that move, leaving HP and joining Cloudflare, it surprised everybody because when people talk about Cloudflare, it's a software-based company, right?
And me joining Cloudflare was also very interesting.
One day I was just looking at LinkedIn and I saw the Cloudflare name pop up, and there was an Easy Apply button, and I just clicked on it.
And that's where it all started. It's been a year and it's been an awesome experience.
One of the best things about this company is the senior leadership, the engagement of the senior leadership.
It's just at the next level.
If I have any suggestions, if I have any recommendations, I can reach out to the senior leadership, which is something you rarely see in other companies.
So that's the best part.
And that's what I share mostly with my friends and community. And coming into the company like you positioned it as a sort of a software company, right?
And I think perhaps I've always thought that clearly we build a lot of software, but there's a real synergy between the software and the hardware.
We don't just buy any old Dell servers off the shelf and wrap them up.
It's actually quite a lot of work in choosing it.
So tell us a little bit about what you've discovered in that realm since you joined the company.
Yeah. So people think that if you buy a server, that server will stay forever.
No, it's not like that, right? It's like buying a car: after five years, you have to go purchase a new car because the technology has evolved.
The same is the case with servers, right? And that's what Cloudflare is doing with our infrastructure, right?
We have to move on, we have to pick up, we have to adopt new technologies.
So when I joined last year, I came across a situation, I think during orientation week, during the first week, where it was brought to my attention that there was a specific set of servers that were not performing up to expectations.
And a little background is when we are buying infrastructure, we normally focus that our supply chain is open so that if you're buying from one vendor, and if some unfortunate things happen to that vendor, we are not just solely dependent on that vendor, right?
We can go and buy it from the next vendor as well.
So that's how we have our infrastructure and it's what we call multi-source, right?
So this is what was observed when we launched our Gen X servers, which were Rome-based.
The servers from one vendor were performing, on average, at least 10% behind servers from another vendor, right?
And the good thing about that was that all the servers from that one vendor were not performing well, because then it's not like something on the board is bad, the chipset is bad or the memory, right?
So that's good. That points to maybe something that's wrong somewhere else, the firmware, right?
You have BIOS, you have BMCs that you can take a look into.
When you talk about buying from multiple vendors, obviously we are working closely with our manufacturers on the actual servers we're building, right?
So we're not like we're saying, we'll get this off the shelf model and we'll just run that.
We're choosing the chipset, we're choosing the NICs, the memory configuration. There's a huge amount of work that goes into that.
Although we're multi-sourced, it is in some sense our design, in cooperation with the vendor.
Yeah, that's right. So we have a team which basically does the spec work, right?
We do the spec work, we see how much performance gains we will get with our next generation of servers.
So there's a lot of maths, there's a lot of work, legwork that's done, right?
Yes, obviously we just don't go buy off the shelf.
We work with our vendors very closely. And also we're working with our vendors for our next generation of infrastructure, right?
So back to the debugging thing. So I started looking into logs because I just joined Cloudflare and it was brought to my attention.
So there was a lot of work already done by the team.
And the thing was that, okay, let's start with the compute analysis, right?
When we talk about compute, okay, what is the performance of the server or the processor?
So there are open tools available, for example, LINPACK.
You can get floating-point operations per second out of it, just to see how you can gauge the processors.
You can use DGEMM, which is double-precision matrix multiplication.
So we started with DGEMM because it's simple and it was based on a specific set of libraries for the AMD Rome processor.
So what we noticed was the server vendor A, which was doing well, they were a hundred gigaflops ahead of server vendor B.
So, okay, this is a good pointer for us.
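To make the gigaflops figure concrete: a double-precision matrix multiplication (DGEMM) of an m x k matrix by a k x n matrix performs 2 * m * n * k floating-point operations, and dividing that count by the elapsed time gives the rate. The sketch below is illustrative only. The pure-Python `naive_matmul` stands in for the tuned AMD library routine mentioned in the talk, and every function name here is hypothetical.

```python
import time


def dgemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A (m x k) times B (k x n): one multiply plus one add per inner step."""
    return 2 * m * n * k


def gflops(flop_count: int, seconds: float) -> float:
    """Convert an operation count and an elapsed time into gigaflops."""
    return flop_count / seconds / 1e9


def naive_matmul(a, b):
    """Reference triple loop; far slower than an optimized BLAS DGEMM."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for p in range(inner):
            aip = a[i][p]
            for j in range(cols):
                c[i][j] += aip * b[p][j]
    return c


def measure_gflops(n: int) -> float:
    """Time an n x n multiply and report the achieved rate."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    naive_matmul(a, b)
    elapsed = time.perf_counter() - start
    return gflops(dgemm_flops(n, n, n), elapsed)
```

Running the same measurement on two machines and comparing the results is the kind of comparison that showed one vendor's servers about a hundred gigaflops ahead of the other's; a real run would use an optimized DGEMM rather than this loop.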
Let's start looking into why. When we were running these DGEMM applications, we noticed that the idle states were kicking in.
Okay. So this gives an indication, why not just disable those idle states?
And by the way, if someone's watching this and they don't understand what an idle state is, can you just tell us quickly?
Because, you know, we're in the broad polls. Yeah. So in a processor, there are two kinds of states: performance states (P-states), and then you have the C-states, right?
So you're not always using your computer all the time, unless it's a supercomputer and you're computing and crunching on it, right?
So if you're not using it, it stays idle, and then the cores shift into the C-states, right?
So that's the high level summary, right? So what we did was, okay, let's disable the C-states.
And once we disabled those C-states, what we saw was that while running the performance applications, we got that hundred gigaflops back on that machine.
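One way to see which idle states (C-states) a Linux host exposes, and whether each has been disabled, is to read the kernel's cpuidle files. This is a minimal sketch that assumes the standard cpuidle sysfs layout (`stateN/name` and `stateN/disable` under a CPU's `cpuidle` directory). The transcript does not say how the team disabled the states (it may well have been a BIOS setting), and `read_idle_states` is a hypothetical helper.

```python
from pathlib import Path


def read_idle_states(cpuidle_dir):
    """Return (name, disabled) for each stateN directory under a cpuidle directory."""
    state_dirs = sorted(
        Path(cpuidle_dir).glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric order: state2 before state10
    )
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() == "1"
        states.append((name, disabled))
    return states
```

On a typical Linux machine the directory to pass would be `/sys/devices/system/cpu/cpu0/cpuidle`, with one such directory per CPU.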
While talking about these gigaflops, we also noticed that there's a terminology that I mentioned on the blog, it's called TDP.
So this is called the thermal design power, right? So this is the kind of a key element when we are doing any performance analysis of the processor, we're doing any benchmarking applications, we see, are we using the whole processor stack or not, right?
Like for example, if it's a 240-watt rated processor or a 225-watt rated processor, and if we are running full benchmarking, it should draw current and it should post 225 watts or 240 watts.
If it's doing 220, 210, there's something wrong, right?
We're not fully optimizing, fully using a processor. So we were able to gauge that as well.
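The TDP check described here is simple arithmetic: compare the package power measured under a full benchmark load with the rated TDP. The sketch below assumes a 2% tolerance, which is an illustrative choice and not a figure from the talk; both function names are hypothetical.

```python
def tdp_utilization(measured_watts: float, tdp_watts: float) -> float:
    """Fraction of the rated thermal design power actually drawn under load."""
    return measured_watts / tdp_watts


def below_tdp(measured_watts: float, tdp_watts: float, tolerance: float = 0.02) -> bool:
    """True if the processor draws noticeably less than its TDP under a full benchmark.

    The 2% default tolerance is an assumption made for this example.
    """
    return tdp_utilization(measured_watts, tdp_watts) < 1.0 - tolerance
```

With these defaults, a 225-watt part that posts 225 watts passes, while one that posts 220 or 210 watts is flagged for a closer look.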
Okay. So once we figured out the gigaflops were back, okay, let's throw it back into production, but we still saw the delta; the 10% delta was still there.
Again, we went back to the whiteboard, started looking, okay, we did the compute performance, it checked out.
And then we moved on and we looked into, okay, let's talk where, what else is missing, right?
And by the way, just at our scale, 10% is absolutely huge.
I mean, you're talking, we're in the 30 to 40 million HTTP requests per second.
And so, anything where there's a 10% difference in performance, especially for the sorts of tasks we're doing would really, really impact.
Yeah. It's huge as well, right, John? Because our whole stack runs on each metal.
Yeah. Right. That's why it's important, right? If one metal is underperforming by 10%, and that's half of the fleet, we're losing a lot, and it's also not efficient, right?
We're losing a lot of power, we're losing a lot of dollar value, right?
So 10% is huge; even 2 to 3% is a big deal for us as well, right?
Yep. And just so that people understand, we also use Intel, right? So we have a mixture of stuff in the fleet.
So it wasn't 10% of the whole fleet and we have a mixture of generations of hardware, but still it was 10% of a significant chunk of machines.
Yeah. Right. It's all the Gen X servers from one specific vendor that were 10% behind.
So we then started looking into different aspects of, okay, because the data is coming from the PCIe, right?
Let's take a look at the network bandwidth.
And there are good tools available in open source communities, such as iPerf, right?
You can do a communication between two processors, not processors, two computers or two servers, and the bandwidth was good.
So there was no bottleneck on the bandwidth aspect of the system.
The next was, okay, AMD has this Preferred I/O option in their processor firmware.
We enabled that and- What does that do?
What is Preferred I/O? It basically gives preference to the I/O bus that's being used by the network device, right?
I see. So yes, and then it didn't give us the benefit that we were looking for.
So we were running out of options. While doing that, we also ran the STREAM memory benchmark just to see if there were any bottlenecks on our DDR side, and nothing was observed there as well.
And actually, I like STREAM a lot because you can always email the person who wrote it, Dr. John McCalpin, and he will respond to you.
If you have any suggestions, he will make those changes and upload them to the STREAM repo.
So yes, I definitely recommend reaching out to Dr. John McCalpin.
He's very nice and he can help you understand STREAM.
STREAM is used all across the world for memory benchmarking. So done with STREAM.
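For readers unfamiliar with the benchmark, its best-known kernel is the triad, a[i] = b[i] + q * c[i], which is credited with 24 bytes of traffic per element for 8-byte doubles (two reads and one write). The sketch below only illustrates that accounting; pure Python is nowhere near memory-bound, so the number it reports says nothing about real DDR bandwidth, and the function names are hypothetical.

```python
import time


def triad(b, c, scalar):
    """STREAM-style triad kernel: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


def triad_bytes(n: int, word_bytes: int = 8) -> int:
    """Bytes credited to one triad pass: read b, read c, write a."""
    return 3 * word_bytes * n


def measure_triad_gb_per_s(n: int, scalar: float = 3.0) -> float:
    """Time one triad pass over n elements and convert to GB/s."""
    b = [1.0] * n
    c = [2.0] * n
    start = time.perf_counter()
    triad(b, c, scalar)
    elapsed = time.perf_counter() - start
    return triad_bytes(n) / elapsed / 1e9
```

The real benchmark is written in C or Fortran, uses arrays much larger than the caches, and reports the best of several repetitions.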
Stream's all good and memory's good. So we reached out to AMD to see if they have any tools that we can use.
And then we got lucky. They had a tool called HSMP, which stands for Host System Management Port.
You can run it in a host environment.
And by running in a host environment, basically you can go into Linux, load the base and see.
So what we noticed was there's this thing called...
So in a processor, you have clock frequencies, right? So you have memory clock frequency.
And for AMD, there's also a special fabric clock. We call this Infinity Fabric, and what it does is, if I visualize this, it basically connects the memory to the processor cores, right?
So what we noticed was on good systems from different vendors, the values were synced, meaning they were 1467, 1467.
And when I say 1467, it's basically we were using 2933 megahertz DDR4 DIMMs, right?
So it's the double frequency that gets you the transfer rate. And the fabric frequency was not set to 1467, it was set to 1333 on a bad system, or let's say on a system from a different vendor.
So when we synced those and ran the systems back in production, we got our desired performance gain.
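The relationship between the numbers quoted here can be checked directly: DDR memory transfers data on both clock edges, so DDR4-2933 corresponds to a memory clock of about 1467 MHz, and the fabric clock is in sync when it matches that value. This is a small sketch of that check; the 1 MHz tolerance and the function names are assumptions made for illustration.

```python
def memory_clock_mhz(transfer_rate_mts: float) -> float:
    """DDR moves data on both clock edges, so the clock is half the transfer rate."""
    return transfer_rate_mts / 2.0


def clocks_in_sync(fabric_clock_mhz: float, transfer_rate_mts: float,
                   tolerance_mhz: float = 1.0) -> bool:
    """True when the fabric clock matches the memory clock within a small tolerance."""
    return abs(fabric_clock_mhz - memory_clock_mhz(transfer_rate_mts)) <= tolerance_mhz
```

With 2933 MT/s DIMMs, a fabric clock of 1467 MHz passes the check and 1333 MHz fails it, which matches the good and bad systems described above.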
Now, it is so simple when I say that, because this was just a simple knob, which we tweaked and everything was working.
But this whole thing of the time we spent, and it was not just me, it was the whole team looking into different aspects of hardware, right?
And that's how it is.
People say that PhD is important, but the way you do the PhD is important. I'm sure you can give us a better experience, right, John?
The way you do things, that's more important.
So that's how we came across. And I wanted to get it done, and I was very happy.
I was very, very happy that we got this thing done and we got the performance gain back.
I want to just dial back. So explain to someone who hasn't been inside these processors a lot.
So there were these two different frequencies for two different components.
Just dig in and say, what are these two different components doing?
And why would they even have two different frequencies? So there is this memory, which is your DDR4.
So it has its own frequency clock, right?
It's set to level. And then there's this thing called Infinity Fabric. That's the AMD IP, right?
And it has its own frequency, right? And I would have to go take a look at a more detailed analysis, but these are different components, right?
So each component normally requires their own frequency clocks.
Normally you can have it sync, like if it's a master clock, but there has to be a reason where you have different frequency clocks for different components.
It's like, for example, if you have a server, right?
You have FPGAs, you have BMCs, right? So they have their own oscillators and those oscillators provide the frequency clocks for BMCs, for FPGAs.
FPGAs can have multiple clocks.
BMCs can also have multiple clocks. It's fascinating these kinds of things that you discover when you operate something at scale.
A very long time ago when we had the Cloudbleed bug, afterwards we went on this mission to get rid of any crashes of our software on our hardware.
And eventually there were still some crashes happening.
And David Wragg in the UK, he spent an enormous amount of time trying to figure it out. We thought it might be a kernel bug in Linux.
We thought there were all sorts of things, right? We thought it might be cosmic radiation because it was like very, very occasional.
And going back to your like how much effort it is to debug, eventually it turned out it was a microcode bug.
And there was actually an update and we had to update. Yes. Microcodes are normally provided by chip vendors, right?
Exactly. And if you think about our network with all the different machines it was, what eventually happened was we realized it was one particular chip type.
And then we realized it was one particular microcode level.
And we're like, oh, wait a minute. That's interesting.
And we went and looked and then lo and behold, in the Intel documentation, it said rather obtusely that system instability might occur, which is exactly what we were seeing.
Because we were seeing, in that instance, we were seeing crashes that could never happen.
Like an invalid instruction, but the program counter was pointing to a perfectly valid instruction.
So at this level of debugging, it's a bit of an Agatha Christie, right?
You're trying to figure out who done it.
And it's quite a lot of work. Yeah. I mean, also, right? Hardware debugging is different from doing software debugging.
You can get lucky and do hardware debugging work remotely, right?
For example, we have these PCIe devices that link at different speeds: Gen 3, which is 8 gigatransfers per second, and Gen 4, which is 16 gigatransfers per second, right?
And you have a Gen 4 device and you will see that it's just linking with Gen 3.
Why? Right? And you're doing everything, all your efforts in software.
Why? If something's wrong, are there any correctable errors or not?
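A software-side check for this situation is to compare the speed a link actually trained at with the maximum the device supports. On Linux these are commonly exposed as the `current_link_speed` and `max_link_speed` attributes of a PCI device in sysfs, with text such as "8.0 GT/s PCIe"; the exact wording varies by kernel version, so the parser below only relies on the leading number. The helper names are hypothetical.

```python
import re

# Per-lane signalling rate for each PCIe generation, in GT/s.
PCIE_GEN_GTS = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}


def parse_gts(text: str) -> float:
    """Extract the GT/s figure from a link-speed string such as '8.0 GT/s PCIe'."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", text)
    if match is None:
        raise ValueError(f"unrecognized link speed: {text!r}")
    return float(match.group(1))


def generation(gts: float) -> int:
    """Map a signalling rate back to its PCIe generation."""
    for gen, speed in PCIE_GEN_GTS.items():
        if speed == gts:
            return gen
    raise ValueError(f"no PCIe generation runs at {gts} GT/s")


def is_downtrained(current: str, maximum: str) -> bool:
    """True when the link trained below the fastest speed the device supports."""
    return parse_gts(current) < parse_gts(maximum)
```

A Gen 4 device reporting a current speed of 8 GT/s against a maximum of 16 GT/s would be flagged, which tells you that something is wrong but not why; that is where the board-level investigation begins.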
But then you go and take a look at the board and say, okay, what's wrong with the board, right?
Are there stubs on the board?
You go and take a look at the schematics, the board layout, everything looks good.
But what if the board vendor forgets to clear the stubs, right? The stubs are basically leftover traces that kill the PCIe bus if you have them on there, right?
So not all hardware debugging you can do remotely. In this time, I got lucky, right?
But there will be time where you have to be in front of your machine.
Okay, let's figure out what's going on. So, yep. It's fascinating.
It's like, I can give you a story of, like, if you're a supervisor, if you're a manager, it can be a lesson for new engineers that are coming up, right?
Tell you to do something that you don't like.
For example, for me, it was back in 2019, one of my managers said to me, hey, Yasir, I want you to go unrack servers from one rack and install them in another rack, cable them up, and then we have to ship it to some party.
And in my mind, I was thinking why he wants me to do a technician's job, because this is a technician's job, right?
I'm an engineer. My job is to take a look at the schematics, the layout, run some performance.
But okay, then I decided, okay, let's go take a look at it. And I did all that work.
And then in the end, we had to ship it. We powered it up. Everything was green.
I'm like, okay, let's ship it. So there was one more engineer.
He said, do you have the data to prove it that everything is good or not? In 2019, then, okay, let's, how do I get the data?
Okay, let's start doing LINPACK. Let's start doing memory benchmarking, right?
And that's how a hardware engineer got involved in doing performance benchmarking.
So for me, the lesson is, even if you think that you shouldn't be doing something, at least go do it, experience it.
You will be out of your comfort zone, but you will learn something new.
So I had the opposite experience, right? Back when I was working on this network adapter card stuff, I got shipped out to an HP office outside of Boston, outside at Route 128, to debug a device driver I had worked on, on a particular machine.
And this machine had a slightly weird architecture in that it had a six, I think it had a 16-bit bus internally.
And they had an EISA sort of extender thing. So you can plug in the EISA cards.
And it turned out that it had to latch two 16-bit values to get a 32-bit value.
And it was a timing problem. And that we were reading, basically, we were reading this 32-bit value before it was stable, because in fact, it was a problem with the bus in the device.
It wasn't within the timing thing. So I came in from outside.
And of course, one of the great things about being at HP, particularly at that time, was that every piece of debug equipment I wanted, I could get.
So like, oh, you need an in-circuit emulator? You need a 32-channel logic analyzer?
Great. We can give you all this kind of stuff. And eventually, unfortunately, HP had locked the design of this machine.
And so this was going to be forever.
So in fact, we actually fixed it in software, waiting for the bus to become stable.
But it's a real eye-opener that I think, as a software engineer, when you start to work on looking at the sort of problems that hardware has, it's really fascinating to see the sorts of things going on.
And equally, when you think about stuff like microcode, and the fact that you, for example, in this example, tweaked a configuration parameter on a piece of hardware, that it feels a bit like software.
And these two worlds are not so far apart that you can really appreciate it.
And it's helpful to understand both worlds. So what's next? I mean, if you debug this problem, we've got that 10% back.
That's fantastic. What are the sorts of things you're working on right now?
Yeah. So that's the thing, right?
So we are working on new technologies that we will be adopting at Cloudflare.
We are also working on offloading. So we're reaching out to different technologies in the market to see what we can offload.
We are also working right now on how much we can optimize our current fleet.
Can we gain at least 2%, 3%, by trying different BIOS configurations, such as, okay, we are running with NUMA disabled.
Let's run with NUMA enabled, right? Non-uniform memory access.
Let's see if it's going to give us 1% out of it, right? Let's take a look at the thermals, right?
Are we overcooling our processors, right? Because power is also very important for us, right?
If a processor's thermal threshold is at 100C, meaning it will start to throttle at 100C, what is the point of keeping the processor as cool as 60C?
Why not just decrease the fan speed so we save some power and bring it to like 80C?
So we will save some wattage. So one of our key metrics is performance per watt, right?
So if we reduce our wattage, we will automatically increase our performance per watt.
We will pay less bills, but also we will save the environment because we are drawing less power, right?
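The performance-per-watt argument reduces to a ratio: if throughput stays the same and power drops, the metric rises by the ratio of the old power to the new. The numbers in the usage note are invented purely for illustration and are not Cloudflare measurements; the function names are hypothetical.

```python
def performance_per_watt(requests_per_second: float, watts: float) -> float:
    """Throughput delivered for each watt drawn."""
    return requests_per_second / watts


def efficiency_gain(watts_before: float, watts_after: float) -> float:
    """Relative improvement in performance per watt when throughput is unchanged
    and only the power draw changes."""
    return watts_before / watts_after - 1.0
```

For instance, a server handling a made-up 10,000 requests per second at 400 watts delivers 25 requests per second per watt; trimming 20 watts of fan power at the same throughput improves that by a little over 5%.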
So these are the things that are in the pipeline.
Very excited to be here. The team is awesome, right? And I said it before, right?
In this company, the interesting part is, I mean, I'm talking to a CTO of the company and it's not just a CTO, right?
I mean, if I have questions, concerns, I can reach out to Matthew, right?
He's one of the best CEOs that, I mean, I've been to different companies, I've worked at HP, but yeah, Matthew is something special.
If I have a suggestion, I can go directly reach out to Matthew and he provides his honest opinion.
Yes. He listens to employees. Yeah. I mean, to be honest with you, I think that that comes from a recognition that the management team as a whole has ideas about where Cloudflare is going, but we often say to people that what we're trying to do in the company is build a team and that it's the teams that really win, right?
You can have the heroic engineer who does some great work, but really what you want is a team that does great work.
But that team also applies at every level, right?
And so when you think about the management team, it is itself a team.
And that means that team needs to, first of all, work as a team together, but also will get better if it has more knowledge, right?
So you don't want to be like a sort of pantheon of these great minds who supposedly know the solution to everything.
It's kind of like, well, I actually know. And in fact, the larger the company gets, the less I know about what's happening in the company.
The more useful it is.
When I read your original blog post, I was like, this is fascinating, right?
There's parts of this I know nothing about and I'm going to learn something and I get smarter.
So it's a recognition that as a team, you grow from the team that's around you, under you, and around you.
And so I think that's very important.
And I do think some companies are very, very hierarchical.
And I think sometimes people are also embarrassed to say they don't know something.
And I think we've kind of immunized, I hope we've immunized the company against that, where people can say, I have no idea what this is all about.
And I would be very challenged by some of the things that are happening in your world.
I'd have to sit down and be like, hey, explain this to me. What I hope is I'm smart enough to understand it as long as you explain it well.
Yeah. One thing I didn't mention is also, Nathan Wright, he was there.
He was a good support during the whole, he used to check on me.
Hey, okay, what's the progress? Where are we?
It's not just the management pressure, it's the management who cared, right?
So that's also a good thing to notice. And when I wrote this blog post, my friends from HPE, my friends at AMD, they read it and they were like, you know what?
We referred to your blog to look into our systems just to make sure that we have these settings in sync, right?
Because if you are getting performance gain, we would love to also see it in our system so that we also get the same performance gain.
So I was very happy about that, that we are shaping up the industry and helping the industry.
Exactly. And I think what's interesting about that is that people are often somewhat reticent to write about things like what you wrote about, because there's this fear that someone out there, someone in the peanut gallery is going to be like, oh, are those idiots?
Why didn't they have that? Or they should have checked that before.
And sure, maybe we should in some perfect world, but you know what?
The reality is that if you build something as big as Cloudflare, there's going to be all sorts of surprises and you're going to have to go in there and debug them.
And if you talk about them, first of all, you talk about them and others learn from it.
And also you get some satisfaction saying, yeah, we fixed that.
Now we gained this extra 10%, which is pretty amazing. So I'm hoping you'll write 10 more blog posts over time about the surprising things we discovered.
That's what the goal is, right?
If you do something new, if you do something that you can see that you can help the industry, you open it up to the world.
Okay.
We did this. It helped us. We want you to also learn and fix it if you have the same problem.
And that's what I see. That's the best part of Cloudflare, right? You see a lot of blog posts that are coming out.
It's because we're helping the community.
That's absolutely right. And actually, I'm looking forward to the idea of using accelerators for some things, maybe for compression or crypto, et cetera.
I think the reason we've perhaps struggled with, or were a little bit reticent about, using slightly more exotic hardware at Cloudflare was that you can get a lot out of a modern CPU, right?
And so it was like, well, we'll just stick with this modern CPU and then we won't have these different variants of hardware out there.
But it's now looking like it probably makes sense for us to start saying, hey, what's the trade-off here?
And that trade-off about whether it's worth it is going to be really to understand.
It's in the pipeline, John, and I'm sure we will have one more session and we will talk about acceleration next.
Well, that sounds great because it's going to be very interesting to see what really makes, and hopefully others will learn from our experience or tell us their experience because we'd like to learn from what other people are doing.
Well, as we get into the end of this, I'm curious what you would say to someone who's thinking about applying to Cloudflare on the hardware side, right?
What would your pitch be for in, you've got a minute, and why you should join Cloudflare?
If you want to build a better, safe, efficient Internet, and you like working problems, solving problems, you should come and work for Cloudflare.
There are tons of opportunities available, and you can reach out to me, you can reach out to John, you can even reach out to Matthew directly, and yes.
So yes, Cloudflare basically helps build a better Internet.
That's what our- That's fantastic. And if you do work on something at Cloudflare, I mean, it's literally millions of Internet applications, websites, IoT devices, and everything like that.
Hardware deployed in 300 locations around the world.
I mean, the scale never ceases to amaze me, and I've been here for quite a long time, and that will continue to grow.
And so going back to you, when you talk about a 10% impact or a 2% impact, you're talking about something that ends up being an enormous amount of time saving for people, an enormous amount of electricity saving, or millions of dollars.
And so I think for me, that's one of the most satisfying things about being at Cloudflare.
Well, listen, thank you so much, Yasir, for writing the blog post, for fixing this 10% thing, and for being on Cloudflare TV with me.
I hope you'll be back at some point, and we can talk more about hardware.
Likewise, John. Have a good rest of your day. You too. Take care. See you.