Cloudflare TV

Hardware at Cloudflare (Ep5)

Presented by Rob Dinh, Brian Bassett, Ryan Chow
Originally aired on 

What does Cloudflare look for in hardware?

English
Hardware

Transcript (Beta)

It looks like we are live now. So thanks to everybody for tuning in. This is Hardware at Cloudflare episode 5.

This is your host Rob Dinh and Brian Bassett. Hi. And today we have a guest from the hardware team as well.

His name is Ryan Chow, and when we talked about Gen X in the very first episode, we mentioned something about BMC and the plans we've had for it.

So Ryan just took the initiative to put himself on the list so he can present about what he's got going on.

So we can take it away. Yeah, I guess before we get started we should mention that people can email us questions at livestudio@cloudflare.tv and we'll be taking them live.

So Ryan why don't you tell people who don't know what the heck is a BMC anyway?

Yeah, so a BMC is a Baseboard Management Controller, and basically you can think of it as another computer in our servers that's external to the main CPU.

And what it does for us is provide a subset of functionality that allows us to manage our servers.

So you have things such as a power management being able to turn the CPU on and off.

You have inventory being able to identify all the pieces of hardware in the server.

You have debugging utilities such as logging, sensors as well to monitor voltage output, temperature of the system.

And also importantly fan management being able to cool your system effectively without consuming too much power.

So does every computer have a BMC?

I mean, they can, but it's mainly specific to servers.

I've heard of some home computers adding an external BMC but it's not completely practical.

But it's mainly used for out-of-band management. If you don't have physical access to your computer, which is the case for a lot of us at Cloudflare, the BMC allows us to have that external control.

Let's see here.

I'm switching back to the slides there and we can try to walk through there.

So just for the audience who may not be very familiar with how hardware works and how we manage them at the edge, we'll just give a quick overview about what BMC is and what's the difference between OpenBMC and what makes OpenBMC something that we kind of consider as a requirement.

So as most employees at Cloudflare know, we're in the business of Internet security.

And one of the things that we currently use for our existing firmware is we ask our ODMs to provide firmware for the BIOS, the BMC.

And even for our vendors, they outsource it to other vendors to provide them with the firmware.

So there's a lot of black box. It's unknown to us what's actually in the firmware.

And for us, we want to understand what's in there so we can add our own features, make our own bug fixes, etc.

So does this mean like, you know, maybe I grab like an HP server or a Dell server or like the ones that we have currently, we got three different vendors that are active in our fleet.

So does this mean that we have different firmwares for each BMC or do they go to, you know, this common third party?

Typically, it's done through a common third party.

But the firmware can be different because it is tuned to the specific motherboard and all the peripherals that they use and the particular address spaces and devices that they have on their motherboard.

So it's tuned to their system, which is also information that's kind of held tightly within the vendors that we outsource this work to.

Okay, all right.

Let's see. So yeah.

Yeah, so why do we care? As I mentioned earlier, we don't know what goes into our firmware.

And the visibility is pretty important for us to effectively secure pieces of hardware in our servers.

When we have bugs or defects with our systems, we kind of go through this cycle with our vendors of saying, hey, you know, we have this bug, here are the symptoms, here are some potential causes, can you help us with it?

And, you know, sometimes the vendors are like, okay, yeah, we'll take a look at it.

And maybe two to three weeks later, they come back and say, well, we can't replicate it.

And, you know, that's annoying for us when we're trying to serve our own customers and we have effectively useless metals all around the world, and we're just like, okay, and we do our best to continually try to provide feedback.

And maybe two weeks later, they come back with, oh, they're like, oh, yeah, we're finally able to reproduce it.

And that's a month gone, just identifying the problem for them. And then another two to three weeks later, you know, they're still working on an actual bug fix.

And then, you know, they come back to us and say, hey, try it out.

And when we try it out, it may or may not work. And you can understand the headache that we go through with some of our vendors and why we're interested in opening it up, so that we can take a look and maybe fix things and help identify problems.

Yeah, the response that you actually mentioned is actually a pretty big deal.

When we talk about things that are very crucial for us.

So like, let's say, I'm trying to think of some examples, like, you know, we could see a whole bunch of servers showing that one power supply is off, even though maybe it's a false positive.

But you know, if there's a bad PSU, that's, that's really bad for us.

Because, you know, we've tried, we try to strive for redundancy, we try to strive for, you know, like a 99.999% operation time.

So if we have one of those crucial servers, that is losing power supplies, you know, that's supposed to be something that we know we take, we take action on.

And what we saw is that it's actually a false positive, both PSUs are actually on.

And, you know, you try to follow through with like a vendor that says, Well, okay, it's a bug, we're gonna take a look at it, like you said, and it takes them a couple weeks.

And then they're like, I don't know what you're talking about.

I can't reproduce it. You know, can you give me a little bit more information or something?

And you sort of like, well, you know, I, we gave you all the information that we got.

And any more information is basically just more instances of the same thing that we're reproducing ourselves.

And not only is it hard to communicate with a vendor, simply because, you know, plainly, maybe there's some things that we just can't share.

And then also, it's hard to have outside engineers to come in and then just sort of have this sort of intimate, intimate relationship with the servers that we have.

And a lot of the ones that we talked to actually have different time zones, or they're actually in different days, right?

A lot of the teams that they sort of forward their problems to are maybe based in Taiwan or based in Shenzhen or somewhere.

And that's where all the lag kind of comes from, too.

And it's hard for them to actually prioritize things because they might have other customers, they might have bigger customers than we are as Cloudflare.

But for us, it's like a priority for us, right?

So that's, that's kind of why we kind of wanted to say, hey, let's just take things on our own.

You know, we're more intimate with it. We got our engineers, we got Ryan, we got Brian, and just go ahead and figure it out.

And to be fair to the vendors, a lot of times they have a hard time reproducing issues that we find because they don't have access to our network environment.

And so that's another thing: we can't give them that access to our servers working in our environment, but we can reproduce that ourselves at will.

So that should also help make these bug fixes easier.

Yeah, so on a slide there, we talked, we saw something about security.

So could you mention a little bit more about that?

Yeah. So, you know, part of security is understanding what goes into our BMCs.

And if we don't know, you know, it could be as simple as having dangling pointers existing in the code, and not being able to see those kinds of things leaves our code open to different vulnerabilities.

Also, you know, security wise, we can add in and remove our own features.

Some people on the security team that we've talked to, you know, have had a desire to remove the web GUI, just to, you know, remove another attack vector and another attack plane.

And, you know, we have that option by moving to an open source firmware stack that we can modify ourselves.

And there's, yeah, so it's, the way we do currently as well is, you know, we're a security company and we do strive to make things as secure as we can.

And, you know, the reality is a lot of the servers that we buy, that get shipped from those factories will actually come in, you know, with default login credentials too.

And, you know, it's our job to try to figure out or to ensure that, you know, whatever we got is actually not tampered.

And, you know, we can only do so much with shipping or at least ensuring physically that things have not been tampered.

So, you know, OpenBMC does open up, you know, some possibilities there.

So some of the troubles that we have, I think is, you know, the use of LDAP or at least have like, you know, static passwords and credentials that, you know, at least we actually are trying to recycle them or change them.

I mean, on a periodic basis. So just wondering if, you know, if you could look more into that or.

Yeah, I guess OpenBMC doesn't necessarily, you know, I think as far as the default passwords that some of our vendors provide, that's a vulnerability that OpenBMC isn't really required to address.

But as far as other such as moving from our existing IPMI interface to a Redfish interface, you know, IPMI being this old 80s protocol to communicate between the BMC and the server, you know, just moving away from that and being able to move to some more modern protocols helps us out tremendously.

Going back to your comment about LDAP, you know, it currently exists in OpenBMC as well, but we have the option to, you know, make it LDAPS, LDAP Secure.

And, you know, yeah, I guess that's kind of like.

LDAP over SSL. Yeah, LDAP over SSL. Yeah, exactly.

I guess one of the things that's interesting about controlling your software, your firmware yourself, is that we could potentially turn off features if we want to remove an attack plane, right?

Correct. Like I had mentioned before with the web GUI that we're accustomed to using. We can even change, excuse me, OpenBMC currently uses Dropbear by default, which is a small SSH server and client.

We can move over to OpenSSH, you know, we can kind of pick and choose the libraries that we use, the packages that we use, etc.

Right, yeah. Currently what we have with BMC is that our current servers are shipped with Serial over LAN disabled for security reasons.

We try not to use Serial over LAN unless we need to.

Most of the things that we have to operate and repair don't necessarily need it.

But, you know, there is sort of a last resort where we actually need to use, you know, SOL activate.

And, you know, that's one of those things where we actually have to flash a BMC firmware and then enable it, go into it, you know, making sure that we have it closed after we're done with that kind of business.

So after it's fixed, you know, we're going to try to re-disable it, I guess.

So you can see the additional toil that hardware engineers and SREs have when trying to fix them at all, because, you know, we don't want to leave those servers vulnerable like that.

But it is a very last resort that we have to do.

And I think OpenBMC can open a few things up here.

So I'm resharing the presentation to show what that solution is for OpenBMC. Yeah, so OpenBMC was originally created by, well, Facebook and IBM both like to claim that they were first in creating OpenBMC.

But now it's under the Linux Foundation, and you have all these companies that are contributing to OpenBMC, and we are trying to kind of join in and use their firmware.

So some people are familiar with kind of bare metal embedded, you know, they're used to using some real-time operating systems and writing all the drivers, I2C drivers, SPI drivers, communication drivers, and then writing their application on top of these bare metal to get their desired task accomplished.

And then, you know, we're familiar with Linux, which is a very general purpose, you know, operating system that we can run our favorite programs on.

What OpenBMC is, is an embedded Linux space. So it's somewhere in between that very narrow embedded space and the general purpose operating system space.

And the way it's done is OpenBMC uses a build system called the Yocto project.

And what the Yocto project is, is removing all the headache that we had to do when writing embedded drivers over and over again from I2C to all those SPI drivers.

When you moved from one processor to another, you would have to rewrite all those drivers over and over again because those processors had different definitions, stuff like that.

With the Yocto project, the Yocto project eliminates that and basically these companies would share their BSPs or their board support packages where they had all these pin definitions and these functionality definitions and mappings.

And it would basically pull all the software, all the source code, it would compile it and you can even add in your custom patches throughout compilation, package it such that it can be prepared for a root file system.

All this is done through, it's a Python-based tool, and all it does is take these recipes or these little metadata files to point to different URLs all over the Internet to know where to retrieve the source code.

And is this something that, do you know if other companies or other?

Oh yeah, so all sorts of embedded companies use the Yocto project because they're tired of writing these low-level, bare-metal projects and you kind of just want to hit the ground running.

The Yocto project allows developers to create their own Linux distribution for their target hardware.

Okay, I mean that's the way I'm understanding it.

That's going to be the future for all the server integrators or any kind of hardware supplier.

They kind of have to buy into OpenBMC. Did you guys see that?

Is that the trend or maybe these servers are still going to be stubborn about it?

I'm not super familiar with the server space necessarily, but as far as embedded applications that we're used to, such as thermostats and kind of the IoT space really, that's where the Yocto project kind of got its motivation from.

Could we see it in the future of switch processors or whatever other specialized kind of hardware?

There's certainly room for that.

Yeah, that seems like it would be a thing that we'd be doing, really.

The switches and routers would be interesting. I haven't really thought too much about it yet.

Servers seem like it would be something that we for sure we're going to be doing.

I do have to assume that hardware vendors may need to, I don't know, re-strategize or redesign how they make servers.

Maybe there's like a special board for it or a special port.

Do you guys know anything about that?

Not here. Not that I'm aware of. Yeah, other parts of the OpenBMC stack use systemd for process management, D-Bus for inter-process communication, and U-Boot as the universal bootloader.

Some other software for the OpenBMC stack.

Right, yeah. Are we there yet? So I guess we kind of segue a little bit into that if the server industry is going to be doing it.

I mean, I'm sort of seeing just from my senses that there's going to be more and more adoption from the server side.

It's hard to not be able to attend to any kind of conferences nowadays.

But it seems like this is where we're heading towards the direction and this is going to be a requirement for Gen 11, right?

Yeah, so we're targeting Gen 11 for our launch of OpenBMC.

We've started working with vendors to open that up.

Internally, we have some samples that we've been playing with OpenBMC, adding our own small little applications, kind of playing with the ins and outs of it.

We have a working CI/CD pipeline because, on an average four-core CPU, some of these images can take up to six hours to build.

So we've kind of got the infrastructure kind of going for our development and we're hoping, or no, we're going to be collaborating with some of our vendors to get the OpenBMC done on Gen 11.

Now, you mentioned that the firmware for a BMC has to be customized per system.

It's also true with OpenBMC, right? Yes. So, I mean, things such as description of the devices that exist on the motherboard, address spaces and address locations for where some of these particular pieces are.

And those are the big things.

Also, you know, traditionally, without OpenBMC, I mean, you had to work on a fan curve to effectively cool your system.

That's also a system provided by OpenBMC.
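The fan curve idea mentioned here can be sketched as a piecewise-linear mapping from temperature to fan duty. This is only an illustration of the concept, not OpenBMC's fan control code; the curve points and the `fan_duty` name are assumptions made up for the example.

```python
from bisect import bisect_right

# Hypothetical curve: (temperature in Celsius, fan duty in percent),
# sorted by temperature. Real values depend on the chassis and CPU.
CURVE = [(30, 20), (50, 40), (70, 80), (85, 100)]


def fan_duty(temp_c, curve=CURVE):
    """Return the fan duty (percent) for a temperature by linear interpolation."""
    temps = [t for t, _ in curve]
    # Clamp below the first point and above the last point.
    if temp_c <= temps[0]:
        return float(curve[0][1])
    if temp_c >= temps[-1]:
        return float(curve[-1][1])
    # Find the segment that contains temp_c and interpolate inside it.
    i = bisect_right(temps, temp_c)
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)
```

A flatter low end keeps power draw down when the system is cool, while the steep upper segments protect the hardware under load, which is the trade-off described earlier in the episode.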

Yeah, nothing else off the top of my head right now.

So do we have a feel at this point for how much of that work is going to be done by our server vendors and how much by us?

A lot of the, at least the descriptions, will be done by the vendors.

I mean, they hold on to those schematics very tightly and sometimes they're reluctant to share them, but we're hoping to open that up and understand that a little bit better ourselves, as well as what the BMC is connected to.

And, you know, even things such as pin assignments can be a little black box for us.

So we've opened the software or the firmware side of things, but not necessarily the hardware at this point.

No, not quite yet. That seems to be the one thread that's sort of missing, I guess, in a way.

And I mean, I can kind of sort of see, you know, from the vendor side.

I personally have never really been a fan of it because I like to do things myself.

So if I were to fix something, I'd like to see everything and just open it up myself rather than rely on somebody.

I guess it would be kind of interesting on how vendors are going to be able to, I guess, adopt this way, right?

We've had different, I guess, side quests or side narratives. Even with NICs, we had this thing where we used a NIC vendor where, you know, they kind of had their own sort of magic tricks or, you know, things that they wouldn't really let us in on unless, you know, we're really tight with them and we have great relationships with them.

And, you know, we say now, you know, we kind of want to use the current NICs that we have where a lot of things can be open sourced and, you know, we'll just kind of have to try to manage the firmware ourselves this way.

So it's trending this way with BMC as well.

And I assume for pretty much all the firmware controls that we have, there may be, I'm trying to think of a con for us or like for any company, like, you know, the Facebooks and the Googles, you know, they probably have a whole team for firmware control, right?

I'm trying to figure out, like, what is our ambition then for OpenBMC besides controlling fans?

I guess for us, it's getting involved in kind of adding our, aside from adding in the tracking features, even kind of baking it into some kind of automated process to have as far as testing, right?

Testing is pretty vital as far as trying to effectively secure our images.

You know, when we're given new images to test out, you know, we can verify what's in the change logs and maybe even, if we're diligent, verify that everything in the past change logs is still present, but we still don't necessarily know what effects those changes could have.

So, personally, for my vision, in addition to the CI/CD pipeline, I'm hoping that we can automate the flashing process so that, upon completion of the image build, we could remotely flash a test server's BMC and run a series of unit tests, a series of automated tests, so that we have some higher level of confidence to say, hey, this is working for us, you know, this is secure enough, this is effective enough, etc.

Provide some more robustness to the system.

Yeah, yeah. I mean, from just a surface level, the little demo that you're going to show right after this, you gave me a little bit of a sample.

And the way I looked at it was, it almost looks like a Linux service, basically, like it's on systemd, you can have some dependencies on it.

And for us at SRE, that's great, because we can do all this version control, maybe not necessarily writing the versions themselves, or the firmware versions, but at least rolling it out to the fleet.

So that way, you can have it tested in some of our colos and then just have it roll out to the rest of the world, and if something bad happens, we can actually revert it quickly.

And I think that's great. That's great that we can do that.

So we can just have it like typical services that we have, they get tested in a low test lab first, and then we expand them into certain colos.

So we just have a handful of colos that sort of this is going to be your soft opening of a new service, or maybe a new version of a service, and then it rolls out to the rest of the world.

And there's a whole bunch of controls that we have to do in between to ensure that we're not pushing bad firmware or bad software out to the rest of the world before anything.

So we will catch them in those test labs and those certain colos.
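The staged rollout described here (test lab first, then a handful of colos, then everywhere, with a quick revert) can be sketched roughly as follows. This is a minimal illustration and not Cloudflare's release tooling; `flash`, `healthy`, and `revert` are hypothetical callables the caller would supply.

```python
def staged_rollout(stages, flash, healthy, revert):
    """Flash firmware stage by stage and stop at the first unhealthy stage.

    stages:  ordered list of (stage_name, [host, ...]) pairs
    flash:   callable(host) that installs the new firmware (hypothetical)
    healthy: callable(host) -> bool, checked after a stage is flashed
    revert:  callable(host) that restores the previous firmware

    Returns the name of the stage that failed, or None if all stages passed.
    """
    flashed = []
    for name, hosts in stages:
        for host in hosts:
            flash(host)
            flashed.append(host)
        if not all(healthy(host) for host in hosts):
            # Undo everything flashed so far, most recent host first.
            for host in reversed(flashed):
                revert(host)
            return name
    return None
```

Because later stages are never touched once an earlier one fails, bad firmware is caught in the lab or the first colos instead of reaching the rest of the fleet.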

So it's exciting that we can do that with BMC, or OpenBMC, I guess. Because, yeah, there's been a lot of times where we have had this toil of reflashing the entire fleet because of some faulty BMC firmware that was shipped.

And it goes through the whole like, we have to identify which servers had the bad firmware.

And that's always hard, especially when you have 200 colos. And correct me if I'm wrong, but I don't think a BMC update needs the server to be rebooted.

So that's a good thing, in a way, relatively, I guess, because, you know, we can talk about BIOS and BIOS we need to reboot.

And it sucks to reboot servers if they have bad BIOS. Yeah, we're right at the demo now.

So if we can take a look. Yeah, I guess to start to preface this, we have this working on one of our metals right now.

And it's a very simple demo where I have my local host. Let's see if I can. Yeah, so there's my local host, and I'm actually SSHed into the BMC right now.

Yeah, this is my BMC.

And I'm not really sure how technical I want to go, but so out on GitHub, I just have some random repo where I have some open source software.

It's out there on the Internet, so anyone can get to it.

And I have kind of a build file and just a simple source file that anyone can grab.

So with the Yocto project, what I would do in the Yocto project is I would create this recipe to go out to this URL, this kwongflare at fosterryan.

It would grab it, it would compile it and link it and package it into the root file system.

So just taking a quick look at the source code, we won't get into too much detail.

All I'm doing is going onto the D-Bus. I'm attaching myself to the D-Bus and I'm just making a bunch of calls to all the ADC sensors on our metal and printing the results out to the log.

Earlier, I'd just set this up as, let's see, excuse me.

So I set this up using rsyslog to export all the logs from the BMC to my remote host.

So on my left is the remote host and here's a simple kickoff of my service.

And we should see it on my local host here.

Yeah, there we go.

So taking a quick look at the log, you know, we have timestamps and a simple monitoring of some of the ADC sensors.

So our CPU core, so this metal is turned off right now and all the various voltages on these ADC sensors.

And I have this running as a background task and every two minutes it'll print out another set of readings.

And yeah, that's pretty much it. I guess what this kind of demonstrates as far as what we're looking for in an open source firmware is that we have the ability to tune and add our own applications or remove applications that we deem kind of useless or vulnerable whenever we want.

And in this case, additional logging that didn't come with the original OpenBMC project, right?

Right.

So, you know, for SREs and other people who are trying to debug some of our systems remotely, I guess previously you would have to just use ipmitool to request sensor data.

Even with this small application, you have timestamps to, you know, show over time like, you know, how these readings changed or didn't change.

And, you know, maybe if you see anomalies in some of the logs, you'd be able to debug things more easily. But basically it just opens up information.

Yeah.

Yeah, there could be more details on those logs too. And it looks like these logs can actually retain information from many, many weeks, maybe days.

I'm not exactly sure.

I mean, that's the thing we can do however we want to. If we want to have a, you know, revolving buffer, we can, you know, or we can just say, you know, we'll allocate enough space for maybe a few days and be done with anything past that.

It's very, very flexible to how we want to be able to control things. Yeah, that's great.
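The "revolving buffer" option can be sketched with a fixed-size queue that silently drops the oldest entry once it is full. This is an illustrative sketch only; the `RollingLog` class and `entries_for` helper are names invented for the example, and the two-minute interval is taken from the demo described above.

```python
from collections import deque


def entries_for(days, interval_s):
    """Number of log entries needed to cover `days` at one entry per `interval_s`."""
    return int(days * 24 * 3600 // interval_s)


class RollingLog:
    """Keep only the most recent `max_entries` lines."""

    def __init__(self, max_entries):
        # A deque with maxlen discards the oldest item on overflow.
        self._entries = deque(maxlen=max_entries)

    def append(self, line):
        self._entries.append(line)

    def dump(self):
        """Return the retained lines, oldest first."""
        return list(self._entries)
```

For example, `RollingLog(entries_for(3, 120))` would retain roughly three days of readings taken every two minutes.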

For us, at least as far as I remember, we were only able to retain 14 days or two weeks worth of IPMI data, which may actually not be very helpful sometimes, especially if it is something that's just been long running.

And it's nice to just have logs that we can extract to, you know, just to try to give a little bit of history.

We had no way of reading anything from IPMI before, in terms of logging or having it in a time-based graph.

We didn't have any of that.

We just had ipmitool. And when we tested for Gen 8, I think, we tried to read power and CPU usage, so we wanted to stress the CPU and then also try to read how much power that took.

It was a challenge to not have any kind of logging for that type of stuff.

And what we had to do was just create a little bash script with a for loop that just says, you know, can you print out the power from ipmitool, like, every two seconds or something.

So, you know, we did that. We let it run for like multiple days and then just graph it out.
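The polling loop described here can be sketched as below. It is a rough reconstruction of the idea, not the script that was actually used. It assumes the BMC supports DCMI so that `ipmitool dcmi power reading` returns a power value; the `poll` and `read_power_via_ipmitool` names are made up for the example.

```python
import subprocess
import time
from datetime import datetime, timezone


def read_power_via_ipmitool():
    """Return the raw text of one power reading.

    Assumption: the platform supports DCMI power readings; other systems
    may need a vendor-specific sensor query instead.
    """
    result = subprocess.run(
        ["ipmitool", "dcmi", "power", "reading"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def poll(read, samples, interval_s=2.0, sleep=time.sleep):
    """Call `read` `samples` times, pausing `interval_s` seconds between calls.

    Returns a list of (UTC timestamp string, reading) pairs for later graphing.
    """
    rows = []
    for i in range(samples):
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        rows.append((stamp, read()))
        if i < samples - 1:
            sleep(interval_s)
    return rows
```

Calling `poll(read_power_via_ipmitool, samples=5)` on a host with ipmitool installed would collect five timestamped readings two seconds apart; passing a different `read` callable makes the loop reusable for other sensors.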

I guess it seemed kind of fun, but, you know, we're civilized now, right?

So we should do something better. A little more sophisticated.

Yeah. A little more baked into the, baked into our tooling. Yeah. Yeah.

And our tooling does a lot of funny things where, you know, we try not to have that many vendors.

I mean, we definitely don't have more than, you know, maybe three or something like that.

But, you know, back then we only just had one vendor.

We've hard-coded a lot of our stuff from, you know, their own ipmitool things.

So sometimes they actually did, like, a string-based grep for, you know, VNPSU0 or something like that.

And that's what we had to hard-code so we can at least monitor the power. And then a new generation of servers comes in, or maybe we have another vendor, and, you know, they start to write their own ways for ipmitool.

And, you know, that gives us toil because that means now we have to account, you know, for this generation as well.

And, you know, you know, if we can standardize all of that, you know, for monitoring, that would be great.

Yeah. I mean, it's, I mean, it's terribly annoying to have to change your scripts just to add an underscore for a subscript of your sensor or anything like that.

And yes, with, with OpenBMC, we have capability to even rename these sensors as we please, if we want to, and give things more descriptive names or for purposes of automation, like you had suggested, just unify things.
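The unification idea can be sketched as a simple alias table that maps each vendor's sensor name onto one canonical name, so monitoring scripts only ever see the canonical form. All sensor names below are hypothetical examples, not names taken from any real vendor or from OpenBMC.

```python
# Hypothetical vendor-specific sensor names mapped to one canonical name.
SENSOR_ALIASES = {
    "VIN_PSU0": "psu0_input_voltage",
    "PSU0_VIN": "psu0_input_voltage",
    "PSU0 Vin": "psu0_input_voltage",
    "CPU_Temp": "cpu0_temperature",
    "CPU0 Temp": "cpu0_temperature",
}


def normalize(readings, aliases=SENSOR_ALIASES):
    """Return readings keyed by canonical sensor names.

    Names without an alias are kept unchanged so nothing is silently dropped.
    """
    return {aliases.get(name, name): value for name, value in readings.items()}
```

With firmware that lets sensors be renamed at the source, a table like this could shrink or disappear entirely, since every generation would report the same names.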

Yeah. Let's see. It was, there was also another one. Oh, yeah, that was, I think that was back in Gen 9 or even Gen X.

So before we try to spec for Gen X, we had some samples.

And, you know, I know, I know for sure vendors would like to try to give you a prototype or something.

So a lot of the things that they have is just not exactly finalized yet, which is fine and understandable.

But, you know, there's some metrics that were probably crucial for us to make decisions on.

I mean, I don't know if that's how it is now, but I remember, you know, CPU power being a big factor in the choice of CPU.

And, you know, one of our vendors didn't have it in their ipmitool output.

And, you know, we were just sort of left like, well, I can't really tell anything then, basically.

I can't, maybe I can read the servers, but, you know, reading server power to another server power for different CPUs when we try to make CPU selection was, it was like apples to oranges for comparison.

And, you know, that was just not something that I was a big fan of.

But it looks like we can do stuff like that now.

Like we can try to find any kind of sensor or, or maybe there's like additional sensors that we can do.

I'm not even sure. I can't think of any off the top of my head.

Like maybe we can figure out what the power draw is between, you know, between a NIC and a CPU or something.

I don't know. That'd be interesting.

I could see some potential here also for just improving the quality of the logging that gets exposed for the data center folks for like, for troubleshooting purposes.

Those, the logs are not always real helpful, but since we have control over it, we could make them more helpful.

So an example is like, a lot of times you'll see a message, and I've seen this on servers from multiple vendors where it'll say something like ECC memory error on DIMM 0x137.

And then you're supposed to know from that, what that maps to, to figure out which DIMM to go replace.

And people end up sending around decoder sheets and emails.

And there's like some tribal knowledge. Some folks know where they are and how to get them and others don't.

So that kind of decoding is the thing that computers are good at.

So, you know, we could fix a problem like that. Yeah, that's good.
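The decoding described here can be sketched as a lookup that annotates a raw log message with the physical slot. The table contents and slot labels are invented for illustration; a real mapping would come from the board vendor's decoder sheet.

```python
import re

# Hypothetical decoder sheet: raw DIMM code -> physical slot label.
DIMM_SLOTS = {
    0x137: "CPU1 / channel C / slot DIMM_C1",
    0x138: "CPU1 / channel D / slot DIMM_D1",
}

DIMM_CODE = re.compile(r"DIMM (0x[0-9A-Fa-f]+)")


def annotate(message, slots=DIMM_SLOTS):
    """Append the slot label after each known DIMM code in a log message."""

    def replace(match):
        label = slots.get(int(match.group(1), 16))
        if label is None:
            # Unknown code: leave the message text untouched.
            return match.group(0)
        return f"{match.group(0)} ({label})"

    return DIMM_CODE.sub(replace, message)
```

Keeping the table in the firmware or the tooling, rather than in emailed spreadsheets, means the person replacing the DIMM reads the slot straight from the log.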

That's good. Just have it all customized, and agnostic to what vendor we have, too, which is great.

Yeah. It's not like we're looking at, you know, two different vendors and we're trying to figure out which vendors there are.

At least on my side, it's hard for me to look at it. I mean, there's probably a few, you know, command lines, I can figure it out.

But, you know, when a server is down, it's like urgent.

You know, that's kind of like the last thing I'm trying to think of.

I'm trying to turn it up and to try to figure out, you know, what command is what, depending on what vendor it is.

Especially IPMI raw commands, which vendors have tended to be about giving for some reason.

Yeah, and that's another thing that those, even when you get them, they tend to be tucked away in email chains and not really someplace that you can get them necessarily when you need them.

So, for OpenBMC, I mean, there's still a little bit of, it's not completely agnostic to the vendor, right?

Like, there's going to be part of the repo, like a list of hardware that's available for OpenBMC, right?

Correct. So, this is kind of what I alluded to when we had talked about, and it's not just even the vendor hardware, it's even specific to the BMC chip that you use.

You know, we're accustomed to use a very particular group's BMC, but there are other BMC makers as well.

Right now, I believe, yeah, OpenBMC supports the AST series as well as a couple of Nuvoton chips, as far as BMCs.

And as far as the motherboard and the interactions with the motherboard, yeah, it's kind of a pick-and-choose system: with the Yocto Project, you can specify which group's configuration files you want to use.

OpenBMC is really just the baseline functionality of the BMC features, but how does that get mapped to the hardware itself?

And that's where you can say, okay, hey, I'm using this chip, I'm using this machine.

You showed a picture of the AST chip at the beginning, that's a super common one for BMCs, right?

A super what? A super common in the industry? Yeah, yes, it is.

I believe in the past few years, or I believe AST just came out with a 2600, and I think the 2500 is a single-core ARM core, and the 2600 is a three-core ARM core part.

So if we move on to upgrading BMC chip over time, I guess what I'm getting at, these BMCs are getting much more sophisticated and becoming their own computers, general-purpose computers themselves almost.

Have you run into anything in your work with OpenBMC where you wish you had more CPU horsepower?

No, I think I've run into some performance issues as far as D-Bus itself.

That might be, that'd probably be aided if the BMC was a more powerful chip.

But yeah, D-Bus itself is just, it can be a little slow sometimes.

Sometimes the feedback is multiple seconds long, and you're kind of expecting it instantaneously as far as requesting information or even setting information and other common tasks.
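One way to put a number on that kind of delay is to time a single D-Bus property read on the BMC. The sketch below only builds the `busctl get-property` command line and provides a small timing helper. The service, object path, interface, and property names are assumptions based on common OpenBMC naming and may differ on a given build.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def busctl_get_property_cmd(service: str, path: str, interface: str, prop: str) -> List[str]:
    """Build the argv for `busctl get-property SERVICE OBJECT INTERFACE PROPERTY`."""
    return ["busctl", "get-property", service, path, interface, prop]


def time_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Assumed names for the chassis power state; verify them on the target image.
CHASSIS_POWER_CMD = busctl_get_property_cmd(
    "xyz.openbmc_project.State.Chassis",
    "/xyz/openbmc_project/state/chassis0",
    "xyz.openbmc_project.State.Chassis",
    "CurrentPowerState",
)
```

On the BMC itself, something like `time_call(lambda: subprocess.run(CHASSIS_POWER_CMD, capture_output=True, text=True))` would return the command result and how long the round trip took.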

How does the D-Bus performance manifest for the end user? Like what do you see when you're trying to do something?

I guess it's really just when I've kind of been playing with some of the power management as far as turning on and off the machine.

As far as the end user and specifically, I mean, right now I'm SSH'd into the BMC.

That's definitely not going to happen in production. This is a strictly debugging feature.

Yeah, but as far as the end user, I mean, I can't say I've noticed anything quite terrible.

Yeah. In general, pretty responsive to things like IPMI commands?

Yeah, for sure. Returning all the sensor data you request, returning status of the chassis, being able to send a command to identify your chassis, sending firmware update requests.
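For reference, the sensor data mentioned here is what a command such as `ipmitool sdr` returns, one `name | reading | status` line per sensor. The sketch below parses that text and lists sensors whose status is not `ok`. The sample readings are invented for illustration and are not from a real server.

```python
from typing import Dict, List

# Illustrative sample in the `name | reading | status` layout of `ipmitool sdr`.
SAMPLE_SDR = """\
CPU Temp         | 45 degrees C      | ok
FAN1             | 5400 RPM          | ok
PSU2 Temp        | 95 degrees C      | cr
P12V             | no reading        | ns
"""


def parse_sdr(output: str) -> List[Dict[str, str]]:
    """Split each pipe-separated line into name, reading, and status."""
    readings = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, reading, status = fields
        readings.append({"name": name, "reading": reading, "status": status})
    return readings


def sensors_needing_attention(readings: List[Dict[str, str]]) -> List[str]:
    """Return sensor names whose status is neither 'ok' nor 'ns' (no reading)."""
    return [r["name"] for r in readings if r["status"] not in ("ok", "ns")]


print(sensors_needing_attention(parse_sdr(SAMPLE_SDR)))
```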

It's been responsive. And it has a GUI similar to the ones in the BMC firmwares that we're using today.

Yeah, and take a look at that actually. Am I still presenting? I think so.

Yeah, you are. Yeah, let me take a look here. Yeah, so here's the web GUI that we're kind of customizing.

Even the web GUI is customizable for us if we so choose.

That landing page loaded nice and fast. Yeah, that's not always the case.

Yeah, I guess now that you mentioned it, but granted, running it locally as opposed to requesting it from a data center that may be hundreds of miles away, maybe that contributes to it as well.

Yeah, I'm not sure how much of that lag is physical distance and how much of it is just maybe like how busy the page is.

I think maybe some firmwares are trying to display too much there and end up basically having to process more data than they need to, I think.

For sure. Yeah, the web GUI could be a lot more minimal than it is for the current servers that we have, at least.

Yeah. Which is great. It's nice to have it just something more straightforward.

There's definitely a lot of features that I don't think I've ever clicked on every time I go to the web GUI.

I think JavaScript has been pretty much useless for me so far.

Like uploading ISO images, things like that is still definitely part of our troubleshooting for our servers.

So if in case we need to either reprovision or do anything where we can't even re-login, but at least the BMC is still alive, the web GUI is still pretty crucial.

So it's nice to just have it really lightweight. Oh, yeah.

Yeah. I mean, you know, the idea to remove it has just kind of been thrown around.

No, I mean, when we struggle to, you know, SOL to the host, you know, some of us go through the web GUI and sometimes the web GUI is working for one reason or the next, right?

Yeah. Having a second option has been helpful for me as well. Can OpenBMC also take like your, what would you call it, your list of, your FRU list, right?

I'm talking about part numbers, serial numbers of each of our DIMMs, disks.

Yeah. So, I mean, we're aiming to make OpenBMC on Gen11 a one-for-one replacement as far as functionality is concerned.

I wasn't quite prepared for this, but let me see as far as FRU.

Yeah, I don't know if I found it intuitive of our current vendors to see things like that, or at least like a little status to say, you know, failed DIMM or bad DIMM or something.

That'd be pretty cool. I'm just throwing wish lists now. Yeah, I wish not to have to decode all the time, like what the errors mean.

And you wanted FRU.

Yeah, that looks pretty good.

Yeah, I mean, so we're really hoping to create a one-for-one replacement here as far as OpenBMC on Gen11.

From there, we have some kind of stretch requirements that we're targeting as well security-wise.

How much of a part does OpenBMC play in the security piece?

So we've had, how do I say this?

If we're able to patch pieces of, if we're able to patch the OpenBMC firmware to our liking as far as even using OpenBMC.

Yeah, I'm trying to avoid some certain topics here, I guess, as far as our currently existing vulnerabilities.

Okay. Yeah, I guess that's kind of why I'm a little reluctant or trying to figure out how to phrase some of these things right now.

Yeah. Nonetheless, it's a requirement. We just got to have OpenBMC for Gen11.

Right. I guess some of the milestones would be that, yeah, like you said, it would be, it should be like a one-to-one replacement to IPMI.

And I assume it would be something that's going to be like totally transparent or invisible to anybody that can SSH into a metal.

Yeah. It's funny because talking about OpenBMC requirements for Gen11, it was hard to get currently existing functionality on the firmware that we have right now.

It's kind of based on what SREs use the most. And we kind of came up with some list of functionality and we're hoping for the best here.

Yeah. Yeah.

Let's see. For SREs, with ipmitool, there's not too much to it besides just trying to identify the hardware health, like power supplies, are they asserted or de-asserted, things like that.

It's great to see a log of when it happened and all that.
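The log in question is the System Event Log, which `ipmitool sel list` prints as pipe-separated fields: record ID, date, time, sensor, event, and whether the event was asserted or deasserted. Below is a minimal sketch of pulling the power supply events out of that output. The sample lines are invented for illustration.

```python
from typing import Dict, List

# Illustrative sample in the layout of `ipmitool sel list`.
SAMPLE_SEL = """\
   1 | 04/17/2020 | 12:34:56 | Power Supply #0x64 | Power Supply AC lost | Asserted
   2 | 04/17/2020 | 12:40:02 | Power Supply #0x64 | Power Supply AC lost | Deasserted
   3 | 04/18/2020 | 03:12:45 | Temperature #0x30 | Upper Critical going high | Asserted
"""


def parse_sel(output: str) -> List[Dict[str, str]]:
    """Turn each six-field SEL line into a dict; other lines are skipped."""
    keys = ("id", "date", "time", "sensor", "event", "direction")
    events = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != len(keys):
            continue
        events.append(dict(zip(keys, fields)))
    return events


def power_supply_events(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the events reported by a power supply sensor."""
    return [e for e in events if e["sensor"].startswith("Power Supply")]


for event in power_supply_events(parse_sel(SAMPLE_SEL)):
    print(event["date"], event["time"], event["event"], event["direction"])
```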

It is pretty much hardware driven anyways. So if it makes your lives easy, then I'm sure it's going to make our lives easy too.

And it's just all about knowing what the commands are.

I'm trying to look.

I think that's pretty much it. Or maybe like identification of servers, identification of all of our components.

That definitely helps. Years ago, we weren't able to turn on specific LEDs for our hard drives.

So if we had a failed drive, like a hard drive, it tends to be a hard drive for our storage servers.

It was the same thing for SSDs. If we had one failed SSD, like sdb or something, and sdb doesn't necessarily mean it's the second physically located flash drive or hard drive in a server.

We weren't able to do that either. Maybe we fixed it before, but back then what we had to do was actually use this process of elimination of, well, here are all of the drives that are still good because we can still recognize them.

And here are the serial numbers. So we're going to shut off the whole server and then just pull out all of the drives and just match all of the serial numbers that you can find on the list.

Those are the good ones. So the missing one is a bad one.

So please replace that one. Yeah. I hope that we actually fixed it.
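The process of elimination described here is a set difference: the serial numbers the server should have, minus the ones the operating system can still see. A small sketch follows, with made-up serial numbers and a hypothetical `find_missing_serials` helper.

```python
from typing import Iterable, Set


def find_missing_serials(expected: Iterable[str], detected: Iterable[str]) -> Set[str]:
    """Return serials that are in the inventory but no longer detected."""
    return set(expected) - set(detected)


# Made-up serial numbers for illustration.
inventory_serials = ["S3X1A", "S3X1B", "S3X1C", "S3X1D"]
detected_serials = ["S3X1A", "S3X1C", "S3X1D"]

# The drive whose label carries this serial is the one to replace.
print(find_missing_serials(inventory_serials, detected_serials))
```

This only works if an inventory of expected serials exists beforehand. Without it, the technician is back to pulling every drive and reading labels.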

I think we did. It just took us a while to actually find the actual raw command for it.

And it was begging the vendors, just give us the manual for what we need to do and even test it and things like that.

There's going to be some things like that that's going to come about.

Otherwise, that should be it.

Just controlling LEDs, controlling the locations of things.

Rob, does your team do anything with Redfish today?

Not that I know of, no. Not at all.

Obviously, it's very rarely. I personally, myself, haven't really used Redfish at all.

That was actually going to be a question that I had for you guys about what are we thinking about Redfish here.

I think some of us are saying, yeah, it's nice to use it.

Currently, I guess personally, I've used it to initiate firmware updates.

That's kind of the big one I've been using. Unfortunately, for OpenBMC, Redfish is a direction that the community wants to go in, but it's not complete yet.

Not all the APIs exist yet. Something that's being worked on by the community.
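For context, initiating a firmware update over Redfish is typically a POST to the UpdateService's `SimpleUpdate` action with the image location in the body. The sketch below only builds that request and does not send it. The BMC hostname and image URI are hypothetical, authentication is left out, and the action path is the conventional one; a real client should read the target from the UpdateService resource, and whether a given firmware implements the action at all is an assumption.

```python
import json
import urllib.request

# Conventional action path; a real client should discover it from
# /redfish/v1/UpdateService rather than hard-coding it.
SIMPLE_UPDATE_PATH = "/redfish/v1/UpdateService/Actions/UpdateService.SimpleUpdate"


def build_simple_update_request(bmc_host: str, image_uri: str) -> urllib.request.Request:
    """Build (but do not send) a Redfish SimpleUpdate POST request."""
    body = json.dumps({"ImageURI": image_uri}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{bmc_host}{SIMPLE_UPDATE_PATH}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and image location, for illustration only.
request = build_simple_update_request(
    "bmc.example.com", "https://images.example.com/bmc-firmware.bin"
)
print(request.get_method(), request.full_url)
```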

Do you have a feel for ETA on that? No. From what I understand about the Redfish part of OpenBMC, that is a part of the community that is closed to certain companies that have already bought in.

I'm not entirely sure why, but that's kind of what I found.

I didn't realize that there were closed committees adding features. This is the only one.

It's just the Redfish. I'm not really sure why that is. Although their repository is under OpenBMC, they're super open and you can check out everything.

It's only Redfish. Redfish itself is an open source implementation, right?

Or an open standard? I'm not sure, actually.

I know we had to use Redfish once to try to flash a whole bunch of servers at once.

And I don't know if it was actually that successful either.

It was kind of buggy. Yeah, that's going to depend a lot on the firmware vendors to implement it correctly.

I mean, even on our currently existing fleet, we've had some trouble with Redfish being properly supported and operating correctly.

That would be the only alternative to socflash or Yafuflash too, right? Yeah.

We're having troubles with socflash and Yafuflash as well for other reasons.

I think I can speak for the whole industry that we have some problems with those things.

It's just not really consistent. We kind of pray that it works and just smile that it does and be grateful for it.

Yeah. That's just kind of how it goes.

It sucks too, because for us to do all this automation, we also have to hard-code which tool we're going to be using.

We have socflash, Yafuflash, and then all the flag options that we have to do.

So those things are things that we have to consider every time we have to introduce a new server or a new hardware or firmware too.

Some firmwares just don't work with this version of Yafuflash, for example.

And yeah, it's kind of a pain in the ass. I was just going to say, we're pretty much wrapping it up for this episode.

If there's any other questions, Brian, if you have any more, I'm pretty good here.

Yeah, that was all I had for today.

Yep. So yeah, thank you, Ryan, for your time. Thanks for the little demo. Things are coming very, very promising for Gen 11.

I'm excited to use this whole OpenBMC thing.

It's going to be new to me when it comes to actually try to use it in production.

Hopefully you won't hurt me. Well, at least instead of yelling at the vendor, I can yell at you now.

Yeah. That'd be great.

And I know I don't have to wait for months to get it fixed. For sure.

All right. So this wraps it up. Hardware at Cloudflare Episode 5. Thank you very much, Ryan.

See you next time. Thank you. Thank you.

We're making the Internet fast, secure, and reliable for everyone.

Cloudflare. Helping build a better Internet.
