Hardware at Cloudflare (Ep5)
Presented by: Rob Dinh, Brian Bassett, Ryan Chow
Originally aired on September 8, 2020 @ 3:00 PM - 4:00 PM EDT
What does Cloudflare look for in hardware?
English
Hardware
Transcript (Beta)
It looks like we are live now. So thanks to everybody for tuning in. This is Hardware at Cloudflare, episode five.
This is your host Rob Dinh, with Brian Bassett. Hi.
And today we have a guest from the hardware team as well. His name is Ryan Chow.
And when we talked about Gen X in the very first episode, we mentioned something about the BMC and what we had planned for it.
So Ryan just took the initiative to put himself on the list so he can present about what he's got going on.
So he can take it away. Yeah, I guess before we get started, we should mention that people can email us questions at livestudio@cloudflare.tv and we'll be taking them live.
So Ryan, why don't you tell people who don't know, what the heck is a BMC anyway?
Yeah, so a BMC is a Baseboard Management Controller. And basically, you can think of it as another computer in our servers that's external to the main CPU.
And what it does for us is it provides a subset of functionality that allows us to manage our servers.
So you have things such as power management, being able to turn the CPU on and off.
You have inventory, being able to identify all the pieces of hardware in the server.
You have debugging utilities, such as logging, sensors as well to monitor voltage output, temperature of the system, and also importantly, fan management, being able to cool your system effectively without consuming too much power.
So does every computer have a BMC? I mean, they can, but it's mainly specific to servers.
I've heard of some home computers adding an external BMC, but it's not completely practical.
But it's mainly used for out-of-band management.
If you don't have physical access to your computer, like a lot of us at Cloudflare do, the BMC allows us to have that external control.
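For a rough sense of what that out-of-band management looks like in practice, here's a sketch using the classic ipmitool CLI; the BMC address and credentials below are placeholders, not anything from our fleet:

```sh
# Talk to the BMC over the network (lanplus = IPMI v2.0), not through the host OS.
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme chassis power status   # power management
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme fru print              # inventory of boards, DIMMs, PSUs
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme sensor list            # voltages, temperatures, fan speeds
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme sel list               # hardware event log for debugging
```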
Let's see here.
I'm switching back to the slides there, and we can try to walk through them.
So just for the audience who may not be very familiar with how hardware works and how we manage it at the edge, we'll just give a quick overview of what the BMC is, what's different about OpenBMC, and what makes OpenBMC something that we consider a requirement.
So as most employees at Cloudflare know, we're in the business of Internet security.
And the way we currently get our existing firmware is we ask our ODMs to provide the firmware for the BIOS and the BMC.
And even our vendors outsource it to other vendors to provide them with the firmware.
So there's a lot that's opaque to us. It's unknown to us what's actually in the software or firmware.
And for us, we want to understand what's in there so we can add our own features, make our own bug fixes, et cetera.
So does this mean, if I grab an HP server or a Dell server, or the ones that we currently have (we've got three different vendors that are active in our fleet),
does this mean that we have different firmwares for each BMC, or do they go to this common third party?
Typically, it's done through a common third party.
But the firmwares can be different, because the firmware is tuned to the specific motherboard, all the peripherals that they use, and the particular address spaces and devices that they have on their motherboard.
So it's tuned to their system, which is also information that's held tightly within the vendors that we outsource this work to.
OK. All right. Let's see. So go ahead.
Yeah. So why do we care? As I mentioned earlier, we don't know what goes into our firmware.
And the visibility is pretty important for us to effectively secure pieces of hardware in our servers.
When we have bugs or defects with our systems, we kind of go through this cycle through our vendors of saying, hey, we have this bug.
Here are the symptoms. Here are some potential causes. Can you help us with it?
And sometimes the vendors are like, OK, yeah, we'll take a look at it.
And maybe two to three weeks later, they come back and say, well, we can't replicate it.
It can be annoying for us when we're trying to serve our own customers and we have effectively useless metals all around the world.
And we're just like, OK.
And we do our best to continually try to provide feedback. And maybe two weeks later, they come back with, oh, yeah, we're finally able to reproduce it.
And that's a month gone just identifying the problem for them. And then another two weeks later, two to three weeks later, they're still working on an actual bug fix.
And then they come back to us and say, hey, try it out. And when we try it out, it may or may not work.
And you can understand the headache that we go through some of our vendors and why we're interested in opening it up such that we can take a look and such that maybe we can fix things and help identify problems.
Yeah, the response time that you mentioned is actually a pretty big deal when we talk about things that are very crucial for us.
So let's say, I'm trying to think of some examples, like we could see a whole bunch of servers showing that one power supply is off, even though maybe it's a false positive.
But if there is a bad PSU, that's really bad for us because we try to strive for redundancy.
We strive for 99.999% uptime. So if we have one of those crucial servers that is losing power supplies, that's supposed to be something that we take action on.
And what we saw is that it was actually a false positive.
Both PSUs were actually on. And you try to follow through with a vendor that says, well, OK, it's a bug.
We're going to take a look at it, like you said. And it takes them a couple of weeks.
And then they're like, oh, I don't know what you're talking about.
I can't reproduce it. Can you give me a little bit more information or something?
And you're sort of like, well, we gave you all the information that we got.
And more information is basically just more instances of the same thing that we're producing ourselves.
And not only is it hard to communicate with a vendor because, plainly, maybe there are some things that we just can't share,
it's also hard to have outside engineers come in and have that sort of intimate familiarity with the servers that we have.
And a lot of the ones that we talk to are actually in different time zones.
Or they're actually on different days. A lot of the teams they push these problems through to are maybe based in Taiwan or Shenzhen or somewhere.
And that's where all the lag kind of comes from, too. And it's hard for them to actually prioritize things.
Because they might have other customers.
They might have bigger customers than we as Cloudflare. But for us, it's a priority for us.
So that's kind of why we kind of wanted to say, hey, let's just take things on our own.
We're more intimate with it. We got our engineers.
We got Ryan. We got Brian. And just go ahead and figure it out. And to be fair to the vendors, a lot of times, they have a hard time reproducing issues that we find because they don't have access to our network environment.
And so that's another thing that we can't give them: access to our servers working in our environment.
But we can reproduce that ourselves at will. So that should also help make these bug fixes easier.
Yeah, so on the slide there, we saw something about security.
So could you mention a little bit more about that? Yeah, so part of security is understanding what goes into our BMCs.
And if we don't know, it could be something as simple as dangling pointers existing in the code.
And not being able to see those kinds of things leaves the code open to all sorts of vulnerabilities.
Also, security-wise, we can add and remove our own features.
Some people on the security team that we've talked to have had a desire to remove the web GUI just to remove another attack vector and another attack plane.
And we have that option if we move to an open source firmware stack that we can modify ourselves.
And yeah, so the way we do things currently as well: we're a security company, and we do strive to make things as secure as we can.
And the reality is a lot of the servers that we buy that get shipped from those factories will actually come in with default login credentials too.
And it's our job to try to figure out, or to ensure, that whatever we got has actually not been tampered with.
And we can only do so much with shipping, or at least with physically ensuring that things have not been tampered with.
So OpenBMC does open up some possibilities there.
So some of the trouble that we have, I think, is with the use of LDAP, or at least having static passwords and credentials that we try to recycle or change on a periodic basis.
So just wondering if you could look more into that or explain. Yeah, I guess OpenBMC doesn't necessarily address that; as far as the default passwords that some of our vendors provided, that's a vulnerability that OpenBMC wasn't really required to fix.
But as far as other things, such as moving from our existing IPMI interface to a Redfish interface, IPMI being this old protocol from the late '90s for communicating between the BMC and the server, just moving away from that and being able to move to some more modern protocols helps us out tremendously.
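As a hedged illustration of that difference: IPMI is a binary protocol typically spoken over UDP port 623, while Redfish is a REST/JSON API over HTTPS. The sketch below uses the standard Redfish resource layout (OpenBMC's bmcweb exposes the system under /redfish/v1/Systems/system; other firmwares may name the member differently), with placeholder addresses and credentials:

```sh
# Old IPMI way:
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme power status

# Redfish way: plain HTTPS + JSON, easy to script and to put behind modern auth.
curl -k -u admin:changeme https://10.0.0.42/redfish/v1/Systems/system

# Power the host on via the standard ComputerSystem.Reset action:
curl -k -u admin:changeme -X POST \
  https://10.0.0.42/redfish/v1/Systems/system/Actions/ComputerSystem.Reset \
  -H 'Content-Type: application/json' -d '{"ResetType": "On"}'
```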
Going back to your comment about LDAP, it currently exists in OpenBMC as well.
But we have the option to make it LDAPS, LDAP secure. And yeah, I guess that's kind of it.
LDAP over SSL. Yeah, LDAP over SSL. Exactly. I guess one of the things that's interesting about controlling your firmware yourself is that we could potentially turn off features if we want to remove an attack plane, right?
Correct. Like I had mentioned before, kind of like the web GUI that we're accustomed to using. We can even change what we ship: excuse me, OpenBMC currently uses Dropbear by default, which is a lightweight SSH server.
We can move over to OpenSSH. We can kind of pick and choose the libraries that we use, the packages that we use, et cetera.
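As a rough sketch of the kind of knobs Yocto gives you for that, something like the following could go in a local.conf or an image bbappend; the feature and package names here (ssh-server-dropbear, ssh-server-openssh, phosphor-webui) are assumptions about how an OpenBMC image might be trimmed, not a configuration we actually ship:

```
# Hypothetical image tweaks: swap the SSH server and drop the web GUI package.
IMAGE_FEATURES_remove = "ssh-server-dropbear"
IMAGE_FEATURES_append = " ssh-server-openssh"
IMAGE_INSTALL_remove  = "phosphor-webui"
```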
Yeah. Currently, what we have with the BMC is that our servers come shipped without Serial over LAN enabled.
So it comes disabled for security reasons.
We try not to use Serial over LAN unless we need to.
Most of the things that we have to operate and repair don't necessarily need it.
But there is sort of a last resort where we actually need to use sol activate.
And that's one of those things where we actually have to flash a BMC firmware and then enable it, go into it, and make sure that we have it closed off after we're done with that kind of business.
So after it's fixed, we're going to try to re-disable it.
So you can see sort of the additional toil that hardware engineers and SREs go through to try to fix these at all.
We don't want to leave those servers vulnerable like that.
But it is a very last resort that we have to do. And I think OpenBMC can open a few things up here.
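For reference, the Serial-over-LAN dance described above looks roughly like this with ipmitool; the address, credentials, and channel number are placeholders:

```sh
# Enable SOL on the channel, use the console, then close it back up afterwards.
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme sol set enabled true 1
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme sol activate
# ... do the repair over the serial console, then:
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme sol deactivate
ipmitool -I lanplus -H 10.0.0.42 -U admin -P changeme sol set enabled false 1
```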
So I'm resharing the presentation to show what that solution is for OpenBMC.
Yeah, so OpenBMC was originally created by Facebook and IBM, who each like to claim they were first in creating OpenBMC.
But now it's under the Linux Foundation.
And you have all these companies that are contributing to OpenBMC.
And we are trying to kind of join in and use their firmware. So some people are familiar with kind of bare metal embedded.
They're used to using some real-time operating systems and writing all the drivers, I2C drivers, SPI drivers, communication drivers, and then writing their application on top of these bare metals to get their desired task accomplished.
And then we're familiar with Linux, which is a very general-purpose operating system that we can run our favorite programs on.
What OpenBMC is is an embedded Linux distribution. So it's somewhere in between that very narrow bare-metal embedded space and the general-purpose operating system space.
And the way it's done is OpenBMC uses a build system called the Yocto Project.
And what the Yocto Project does is remove all of the headache we had writing embedded drivers over and over again, from I2C drivers to all those SPI drivers.
When you moved from one processor to another, you would have to rewrite all those drivers over and over again, because those processors had different definitions, stuff like that.
The Yocto Project eliminates that.
And basically, these companies would share their BSPs, or board support packages, where they have all these pin definitions and functionality definitions and mappings.
And it would basically pull all the software, all the source code.
It would compile it, and you can even add in your custom patches during compilation, and package it so that it can go into a root file system.
All this is done through a Python-based tool. And all it does is take these recipes, these little metadata files, which point to different URLs all over the Internet so it knows where to retrieve the source code.
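To make the recipe idea concrete, here is a minimal, hypothetical BitBake recipe of the kind Ryan describes; the project name, URL, and revision are illustrative only, not a real Cloudflare repo:

```
# sensor-logger_git.bb: fetch, build, and install a small D-Bus application.
SUMMARY = "Example ADC sensor logging service"
LICENSE = "MIT"
LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/MIT;md5=0835ade698e0bcf8506ecda2f7b4f302"

# Where to get the source (hypothetical repo and branch):
SRC_URI = "git://github.com/example/sensor-logger.git;protocol=https;branch=main"
SRCREV = "${AUTOREV}"
S = "${WORKDIR}/git"

# Reuse the build system and install the systemd unit for us:
inherit meson pkgconfig systemd
DEPENDS = "sdbusplus"
SYSTEMD_SERVICE_${PN} = "sensor-logger.service"
```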
Yeah.
And is this something that, do you know if other companies or other organizations?
Oh, yeah. So all sorts of embedded companies use the Yocto Project because they're tired of writing these low-level, bare-metal projects.
And you just want to hit the ground running.
The Yocto Project allows developers to create their own Linux distribution for their target hardware.
OK. The way I'm understanding it, it's kind of like that's going to be the future for all the server integrators or any kind of hardware supplier.
They're going to have to buy into OpenBMC.
Do you guys see that as the trend? Or are the server vendors still going to be stubborn about it?
I'm not super familiar with the server space, necessarily.
But as far as embedded applications that we're used to, such as thermostats and the IoT space, that's really where the Yocto Project had its motivations.
Could we see it in the future of switch processors or whatever other specialized hardware?
There's certainly room for that.
Yeah, that seems like it would be a thing that we'd be doing, really.
Yeah, like the switches and routers would be interesting. I haven't really thought too much about it yet.
Servers seems like it would be something that we, for sure, we're going to be doing.
But I do have to assume that hardware vendors may need to, I don't know, re-strategize or re-design how they make servers.
Maybe there's a special board for it or a special port. Do you guys know anything about that?
Not here. Not that I'm aware of. Yeah. Let's see. But yeah, other parts of the OpenBMC stack use systemd for process management, D-Bus for interprocess communication, and U-Boot as the universal bootloader.
Some other software for the OpenBMC stack.
Right. Yeah. And are we there yet? So I guess we kind of segue a little bit into that if the server industry is going to be doing it.
I mean, my sense is that there's going to be more and more adoption from the server side.
It's hard not being able to attend any kind of conferences nowadays.
But it seems like this is the direction we're heading.
And we sort of made this a requirement for Gen 11, right?
Yeah, so we're targeting Gen 11 for our launch of OpenBMC. We've started working with vendors to open that up.
Internally, we have some samples that we've been playing with OpenBMC, adding our own small little applications, kind of playing with the ins and outs of it.
We have a working CI/CD pipeline, because these builds on an average four-core CPU can take up to six hours to build some of these images.
So we've kind of got the infrastructure kind of going for our development.
And we're hoping, or no, we're going to be collaborating with some of our vendors to get the OpenBMC done on Gen 11.
Now, you mentioned that the firmware for a BMC has to be customized per system.
It's also true with OpenBMC, right?
Yes. So I mean, things such as description of the devices that exist on the motherboard, address spaces and address locations for where some of these particular pieces are.
And those are the big things. Also, traditionally, without OpenBMC, I mean, you had to work out a fan curve to effectively cool your system.
That's also something provided by OpenBMC.
Yeah, nothing else off the top of my head right now.
So do we have a feel at this point for how much of that work is going to be done by our server vendors and how much by us?
A lot of the, at least the descriptions, will be done by the vendors.
I mean, they hold on to those schematics very tightly.
And sometimes they're reluctant to share it. But we're hoping to open that up and understand a little bit better ourselves what the BMC is connected to.
And even things such as pin assignments can be a little black box for us.
So we've opened the software or the firmware side of things, but not necessarily the hardware at this point.
No, not quite yet. That seems to be the one thread that's sort of missing, I guess, in a way.
And I mean, I can kind of sort of see from the vendor side.
I personally have never really been a fan of it because I like to do things myself.
So if I were to fix something, I'd like to see everything and just open it up myself rather than rely on somebody.
I guess it will be kind of interesting to see how vendors are going to be able to, I guess, adopt this approach.
We've had different, I guess, side quests or side narratives; even with NICs, we had this thing where we used a NIC vendor that kind of had their own magic tricks or things that they wouldn't really let us in on unless we were really tight with them or had great, great relationships.
And we said, now we kind of want to use the current NIC that we have, where a lot of things can be open sourced.
And we'll just kind of have to try to manage the firmware ourselves this way.
So it's trending this way with the BMC as well, and I assume for pretty much all the firmware controls that we have. I'm trying to think of a con for us or for any company. The Facebooks and the Googles,
they probably have a whole team for firmware control, right?
I'm trying to figure out, what is our ambition then for OpenBMC, besides controlling fans?
I guess for us, aside from adding and removing features, it's getting involved and even kind of baking it into some kind of automated process as far as testing, right?
Testing is pretty vital as far as trying to effectively secure our images.
When we're given new images to test out, we can verify what's in the change logs, and maybe even, if we're diligent, verify that everything in the past change logs is still present.
But we still don't necessarily know what side effects could possibly exist.
So personally, my vision, in addition to the CI/CD pipeline, is that we can automate the flashing process: upon completion of the image build, we could remotely flash a test server's BMC and run a series of unit tests, a series of automated tests, so that we have some higher level of confidence to say, hey, this is working for us.
This is secure enough. This is effective enough, et cetera.
Provide some more robustness to the system. Yeah, yeah. From just a surface level, the little demo that you're going to show right after this, you gave me a little bit of a sample.
And the way I looked at it was, it almost looks like a Linux service, basically.
Like it's on systemd, and you can have some dependencies on it.
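As a sketch of what that looks like, a small OpenBMC application can be wired up as an ordinary systemd unit with dependencies; the unit name and install path here are hypothetical:

```
# /lib/systemd/system/sensor-logger.service (hypothetical)
[Unit]
Description=ADC sensor logger
After=dbus.service
Requires=dbus.service

[Service]
ExecStart=/usr/bin/sensor-logger
Restart=on-failure

[Install]
WantedBy=multi-user.target
```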
And for us at SRE, that's great, because we can do all this version control, maybe not necessarily writing the firmware versions ourselves, but at least rolling them out to the fleet.
That way you can have it tested in some of our colos, or just have it roll out to the rest of the world, and if something bad happens, then we can actually revert it back quickly.
And I think that's great. That's great that we can do that.
So we can just handle it like the typical services that we have: they get tested in our load test lab first, and then we expand them into certain colos.
So we just have a handful of colos that are sort of like, this is going to be your soft opening of a new service, or maybe a new version of a service.
And then it rolls out to the rest of the world. And there's a whole bunch of controls that we have in between to ensure that we're not pushing bad firmware or bad software out to the rest of the world.
So we will catch them in those test labs and in those certain colos.
So it's exciting that we can do that with the BMC, or OpenBMC, I guess.
Because yeah, there have been a lot of times where we've had the toil of reflashing the whole entire fleet because of some faulty BMC firmware that was shipped.
And it goes through the whole process of identifying which servers had the bad firmware.
And that's always hard, especially when you have 200 colos.
And correct me if I'm wrong, but I don't think flashing the BMC requires rebooting the host.
So that's a good thing, relatively, I guess, because with the BIOS, we need to reboot.
And it sucks to reboot servers if they have a bad BIOS.
For sure. Yeah, we're right at the demo now.
So if we can take a look. Let me share this real quick.
Yeah, I guess to start to preface this, we have this working on one of our metals right now.
And it's a very simple demo where I have my local host. Let me see if I can.
Yeah, so there's my local host, and I'm actually SSHed into the BMC right now.
Yeah, this is my BMC.
And I'm not really sure how technical I want to go, but so out on GitHub, I just have some random repo where I have some open source software.
It's out there on the Internet, so anyone can get to it.
And I have kind of a build file and just a simple source file that anyone can grab.
So with the Yocto Project, what I would do is create this recipe to go out to this URL, to this repo here.
It would grab it, compile it, link it, and package it into the root file system.
So just taking a quick look at the source code, we won't get into too much detail.
All I'm doing is getting onto D-Bus. I'm attaching myself to D-Bus, and I'm just making a bunch of calls to all the ADC sensors on our metal and printing out to the log.
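A rough sketch of that kind of call, using the sdbusplus C++ bindings that OpenBMC applications commonly use; the bus name and sensor object path below are assumptions, not the exact names from the demo:

```cpp
// Minimal sketch: read one ADC sensor's Value property off D-Bus and print it.
#include <sdbusplus/bus.hpp>
#include <iostream>
#include <variant>

int main()
{
    auto bus = sdbusplus::bus::new_default();

    // Hosting daemon and object path are assumed for illustration.
    auto req = bus.new_method_call(
        "xyz.openbmc_project.ADCSensor",              // sensor daemon (assumed)
        "/xyz/openbmc_project/sensors/voltage/P12V",  // sensor object path (assumed)
        "org.freedesktop.DBus.Properties", "Get");
    req.append("xyz.openbmc_project.Sensor.Value", "Value");

    auto reply = bus.call(req);
    std::variant<double> value;
    reply.read(value);

    std::cout << "P12V: " << std::get<double>(value) << " V" << std::endl;
    return 0;
}
```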
Earlier, yeah. Earlier, I'd set this up as...
Let's see, excuse me. Can't find my... So I set this up using rsyslog to export all the logs from the BMC to my remote host.
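For context, forwarding the BMC's logs to a remote host with rsyslog can be done with a single rule like the one below; the log host is a placeholder:

```
# Forward all facilities/priorities over TCP ("@@" = TCP, a single "@" = UDP).
*.* @@loghost.example.com:514
```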
So on my left is the remote host, and here's a simple kickoff of my service.
And we should see it on my local host here.
Yeah, there we go.
So taking a quick look at the log, you know, we have timestamps, and simple monitoring of some of the ADC sensors.
So our CPU core, so this metal is turned off right now, and all the various voltages on these ADC sensors.
And I have this running as a background task, and every two minutes, it'll print out another set of readings.
And yeah, that's pretty much it. So I guess what this kind of demonstrates as far as what we're looking for in an open source firmware is that we have the ability to tune and add our own applications, or remove applications that we deem kind of useless or vulnerable whenever we want.
And in this case, additional logging that didn't come with the original OpenBMC project, right?
Right, so, you know, for SREs and other people who are trying to debug some of our systems remotely, I guess previously you would have to just use ipmitool to request sensor data.
Even with this small application, you have timestamps to show, over time, how these readings changed or didn't change.
You know, maybe if you see anomalies in some of the logs, you'd be able to debug things more easily. But basically it just opens up information.
Yeah. Yeah, there could be more details in those logs too.
And it looks like these logs can actually retain information for many days, maybe weeks.
I'm not exactly sure.
I mean, that's the thing, we can do it however we want to. If we want to have a revolving buffer, we can, you know, or we can just say, you know, we'll allocate enough space for maybe a few days and drop anything past that.
Yeah. It's very, very flexible as to how we want to control things.
Yeah, that's great.
For us, at least as far as I remember, we were only able to retain 14 days, or two weeks' worth, of IPMI data, which may actually not be very helpful sometimes, especially if it's something that's been long running.
And it's nice to just have logs that we can extract, you know, just to try to give a little bit of history.
We had no way of reading anything from IPMI before, in terms of logging or having it in a time-based graph.
We didn't have any of that.
We just had ipmitool. And when we tested for Gen 8, I think, when we tried to read power or CPU usage, we wanted to stress the CPU and then also try to read how much power that took.
It was a challenge to not have any kind of logging for that type of stuff.
And what we had to do was just create a little bash script with a for loop that just says, you know, can you print out the power from ipmitool every two seconds or something.
So we did that. We let it run for multiple days and then just graphed it out.
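Roughly the kind of throwaway loop being described; "dcmi power reading" is one ipmitool subcommand that reports power, though the exact command used back then may have been different:

```sh
# Poll the BMC's power reading every two seconds and append it to a log with a timestamp.
while true; do
    echo "$(date -u +%FT%TZ) $(ipmitool dcmi power reading | grep 'Instantaneous')"
    sleep 2
done >> power.log
```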
I guess it seemed kind of fun, but you know, we're civilized now, right?
So we should do something better. We're a little more sophisticated and a little more, yeah, a little more baked into our tooling.
Yeah, yeah. And our tooling does a lot of funny things. You know, we try not to have that many vendors.
I mean, we definitely don't have more than, you know, maybe three or something like that.
But, you know, back then we only just had one vendor.
We've hard-coded a lot of our stuff around, you know, their own ipmitool output.
So sometimes we actually grabbed string-based names, like, you know, VNPSU0 or something like that.
And that's what we had to hard-code so we can at least monitor the power.
And then a new generation of servers comes in, or maybe we have another vendor, and, you know, they start to write their own ways of doing ipmitool output.
And, you know, that gives us toil, because that means now we have to account for this generation as well.
And, you know, if we can standardize all of that for monitoring, that would be great.
Yeah, I mean, it's terribly annoying to have to change your scripts just to add an underscore for a subscript of your sensor or anything like that.
And yes, with OpenBMC, we have the capability to even rename these sensors as we please if we want to and give things more descriptive names for purposes of automation, like you had suggested, just unify things.
Yeah. Let's see. There was also another one. Oh yeah, I think that was back in Gen 9 or even Gen X.
So before we try to spec for gen X, we had some samples.
And, you know, I know for sure vendors would like to try to give you a prototype or something.
So a lot of the things that they have just aren't exactly finalized yet, which is fine and understandable, but, you know, there are some metrics that are probably crucial for us to make decisions on.
I mean, I don't know if that's how it is now, but I remember, you know, CPU power being a big thing, a big factor in the choice of CPU.
And, you know, one of our vendors didn't have it in their ipmitool support.
And, you know, we were just sort of left like, well, I can't really tell anything then, basically.
Maybe I can read the servers, but, you know, comparing one server's power to another server's power for different CPUs when we were trying to make a CPU selection was like apples to oranges.
And, you know, that was just not something that I was a big fan of, but it looks like we can do something like that now.
Like we can try to find any kind of sensor, or maybe there are additional sensors that we can add.
I'm not even sure, I can't think of one off the top of my head. Like maybe we can figure out what the power draw is between, you know, a NIC and a CPU or something.
I don't know. That'd be interesting. I could see some potential here also for just improving the quality of the logging that gets exposed to the data center folks for troubleshooting purposes.
The logs are not always real helpful, but since we have control over them, we could make them more helpful.
So an example is like a lot of times you'll see a message and I've seen this on servers from multiple vendors where it'll say something like ECC memory error on DIMM 0x137.
And then you're supposed to know from that, what that maps to, to figure out which DIMM to go replace.
And people end up sending around decoder sheets and emails and there's like some tribal knowledge.
Some folks know where they are and how to get them and others don't.
So that kind of decoding is the thing that computers are good at.
So, you know, we could fix a problem like that. Yeah, that's good.
That's good. We can just have it all customized, agnostic to whatever vendor we have, which is great.
Yeah. It's not like we're looking at, you know, two different vendors and trying to figure out which vendor it is.
At least on my side, it's hard for me to look at it. I mean, there are probably a few command lines I could use to figure it out.
But, you know, when a server is down and it's urgent, that's kind of the last thing I want to be thinking about.
I'm trying to bring it back up, not trying to figure out which command is which depending on what vendor it is.
Especially IPMI raw commands, which vendors have tended to be guarded about giving out for some reason.
Yeah, and that's another thing: even when you get them, they tend to be tucked away in email chains and not really someplace that you can get to them when you need them.
So for OpenBMC, I mean, there's still a little bit of, it's not completely agnostic to the vendor, right?
Like there's going to be part of the repo, like a list of hardware that's supported by OpenBMC, right?
Correct, so this is kind of what I alluded to earlier, and it's not even just the vendor hardware, it's even specific to the BMC chip that you use. You know, we're accustomed to using one particular maker's BMC, but there are other BMC makers as well.
And right now, I believe OpenBMC supports the ASPEED AST series as well as a couple of Nuvoton chips, as far as BMCs go.
And as far as the motherboard and the interactions with the motherboard, yeah, it's kind of a pick-and-choose system; with the Yocto Project, you can specify which vendor's configuration files you want to use, you know.
OpenBMC is really just the baseline functionality of the BMC features, but how does that get mapped to the hardware itself?
And that's where you can say, okay, hey, I'm using this chip, I'm using this machine.
You showed a picture of the AST chip at the beginning, that's a super common one for BMCs, right?
A super what? Super common in the industry?
Yeah, yeah, yes, it is. I believe in the past few years, ASPEED just came out with the 2600.
And I think the 2500 is a single-core ARM part and the 2600 is a three-core ARM part.
So, you know, if we move on to upgrading the BMC chip over time... I guess what I'm getting at is, these BMCs are getting much more sophisticated and becoming their own computers, almost general-purpose computers themselves.
Have you run into anything in your work with OpenBMC where you wish you had more CPU horsepower?
No, though I think I've run into some performance issues as far as D-Bus itself.
That would probably be helped if, you know, the BMC was a more powerful chip.
But yeah, D-Bus itself can just be a little slow sometimes.
Sometimes the feedback takes multiple seconds and you're kind of expecting it instantaneously, as far as requesting information or even setting information and other common tasks.
How does the D-Bus performance manifest for the end user?
Like, what do you see when you're trying to do something?
I guess it's really just when I've been playing with some of the power management, as far as, you know, turning the machine on and off.
As far as the end user and specifically IPMI, I mean, right now I'm SSH'd into the BMC.
That's definitely not going to happen in production.
This is strictly a debugging feature. Yeah, but as far as the end user, I mean, I can't say I've noticed anything too terrible.
Yeah. In general, pretty responsive to things like IPMI commands.
Yeah, for sure. Returning, you know, all the sensor data you request, returning status of the chassis, being able to send a command to identify your chassis, sending firmware update requests.
It's been responsive.
And it has a GUI similar to the ones in the BMC firmwares that we're using today.
Yeah, I can take a look at that actually.
Am I still presenting? Yeah, you are. Yeah, let me take a look here.
Yeah.
So here's the web GUI that we're kind of customizing.
Even the web GUI is customizable for us if we so choose.
That landing page loaded nice and fast.
Yeah. That's not always the case. Yeah, I guess now that you've mentioned it. But granted, you know, running it locally, as opposed to requesting it from a data center that may be hundreds of miles away, maybe that contributes to it as well.
Yeah, I'm not sure how much of that lag is physical distance and how much of it is just how busy the page is.
I think maybe some firmwares are trying to display too much there and end up basically having to process more data than they need to.
For sure. Yeah, the web GUI could be a lot more minimal than it currently is, at least for the way we use it.
Yeah. Which is great. It's nice to have it just something more straightforward.
There's definitely a lot of features that I don't think I've ever clicked on every time I go to the web GUI.
I think JavaScript has been pretty much useless for me so far.
But yeah, things like uploading ISO images are still definitely part of our troubleshooting for our servers.
So in case we need to reprovision or do anything where we can't even log back in, but at least the BMC is still alive,
the web GUI is still pretty crucial. So it's nice to just have it really lightweight.
Oh yeah, yeah. I mean, the idea to remove it has just kind of been thrown around.
No, I mean, when we struggle to SOL into the host, some of us go through the web GUI, and sometimes the web GUI is the one that's working, for one reason or another, right?
Yeah. Having a second option has been helpful for me as well.
Can OpenBMC also take, what would you call it, your FRU list, right?
I'm talking about part numbers, serial numbers of each of our DIMMs, disks.
Yeah, so I mean, we're aiming to make OpenBMC on Gen 11 a one-for-one replacement as far as functionality is concerned.
I wasn't quite prepared for this, but let me see, as far as FRU.
Yeah, I don't know if I've found it intuitive with our current vendors to see things like that, or at least a little status saying, failed DIMM or bad DIMM or something.
That'd be pretty cool. I'm just throwing out wishlist items now. Yeah, I wish we didn't have to decode what the errors mean all the time.
And you wanted FRU.
Yeah, that looks pretty good.
Yeah, I mean, so we're really hoping to create a one-for -one replacement here as far as OpenBMC on Gen 11.
From there, we have some kind of stretch requirements that we're targeting as well security-wise.
How much of a part does OpenBMC play in the security piece?
So we've had, how do I say this?
If we're able to patch pieces of...
If we're able to patch the OpenBMC firmware to our liking as far as even using...
Yeah, I'm trying to avoid some certain topics here, I guess.
As far as our currently existing vulnerabilities.
Okay. Yeah, I guess that's kind of why I'm a little reluctant or trying to figure out how to phrase some of these things right now.
Yeah. Nonetheless, it's a requirement. We just got to have OpenBMC for Gen 11.
Right. Yeah, so some of the milestones would be that, like you said, it should be a one-to-one replacement for IPMI.
And I assume it would be something that's got to be totally transparent, or invisible, to anybody that can SSH into a metal.
Yeah, yeah. It's funny, because when writing the OpenBMC requirements for Gen 11, it was hard to even pin down the currently existing functionality of the firmware that we have right now.
It's kind of based on, like, oh, what do SREs use the most, you know?
And we kind of came up with some list of functionality, and we're hoping for the best here.
Yeah, let's see. For SREs, with IPMI, like ipmitool, there's not too much to it besides just trying to identify the hardware health, like power supplies, you know, are they asserted or de-asserted, things like that.
You know, it's great to see a log of when it happened and all that kind of stuff.
But I mean, it is pretty much hardware-driven anyways. So if it makes your lives easy, then I'm sure it's going to make our lives easy too.
And it's just all about, you know, knowing what the commands are.
Trying to look.
I think that's pretty much it. Or maybe like identification of servers, identification of all of our components, that definitely helps.
Years ago, we weren't able to turn on, you know, specific LEDs for hard drives, right?
So if we had a failed drive, like a hard drive, it tends to be a hard drive for our storage servers.
It was the same thing for SSDs. You know, if we had one failed SSD, like sdb or something, and sdb doesn't necessarily mean it's the second physically located flash drive or hard drive in the server.
You know, we weren't able to do that either.
Like maybe we fixed it before, but back then what we had to do was actually use like this process of elimination of, well, here are all of the drives that are still good because we can still recognize them.
And here are the serial numbers.
So we're gonna shut off the whole server and then just pull out all of the drives and just match all of the serial numbers that you can find on the list.
Those are the good ones. So the missing one is a bad one. So please replace that one.
Yeah, I hope that we actually fixed it. I think we did. And it just took us a while to actually find the actual raw command for it.
And it was, you know, begging the vendors, you know, just give us the manual for what we need to do and even test it and things like that.
So, you know, there's gonna be some things like that that's gonna come about.
Otherwise, yeah, that should be it.
Just controlling, you know, LEDs, controlling, you know, the locations of things.
Yeah. Rob, does your team do anything with Redfish today?
Not that I know of, no. Not at all, or only very rarely. I personally haven't really used Redfish for anything at all.
That was actually gonna be a question that I had for you guys about what are we thinking about Redfish here?
And I think some of us are saying, yeah, it's nicer to use it. Currently, I guess, personally, I've used it to initiate firmware updates.
That's kind of the big one I've been using.
Unfortunately for OpenBMC, Redfish is a direction that the community wants to go in, but it's not complete yet.
Not all the APIs exist yet.
Something that's being worked on by the community. Yeah. Yeah.
Do you have a feel for an ETA on that? No. From what I understand about the Redfish part of OpenBMC, that's a part of the community that is closed off to all but certain companies that have already bought in.
Yeah, I'm not entirely sure why, but that's kind of what I found.
Okay. I didn't realize that there were closed committees adding features.
This is the only one. It's just Redfish. So I'm not really sure why that is.
The other repositories under OpenBMC are super open and you can check out everything; it's only Redfish that's like this.
Redfish itself is an open source implementation, right? Or an open standard.
I'm not sure, actually.
I know we had to use Redfish once to try to flash a whole bunch of servers at once.
And I don't know if it was actually that successful either.
It was kind of buggy. Yeah, that's going to depend a lot on the firmware vendors to implement it correctly.
Yeah. I mean, even on our currently existing fleet, we've had some trouble with Redfish being properly supported and operating correctly.
That would be kind of sort of the only alternative to something like socflash or Yafuflash too, right?
Yeah. We're having trouble with socflash and Yafuflash as well, for other reasons.
Yeah. Yeah, I think I can speak for the whole industry that we have some problems with those things.
Or it's just not really consistent and we kind of pray that it works and just smile that it does and be grateful for it.
I don't know, it's just kind of how it goes.
But it sucks too, because for us to do all this automation, we also have to hard-code
which tool we're going to be using; we have socflash, Yafuflash, and then all the flag options that we have to pass.
So those are things that we have to consider every time we have to introduce a new server or new hardware or firmware.
Like some firmwares just don't work with this version of Yafuflash, for example.
And it's, yeah, it's kind of a pain in the ass.
But I mean, I was just going to say, we're pretty much wrapping it up for this episode.
If there's any other questions, Brian, if you have any more, I'm pretty good here.
Yeah, that was all I had for today.
Yep. So yeah, thank you, Ryan, for your time. Thanks for the little demo. Things are coming along, very promising for Gen 11.
I'm excited to use this whole OpenBMC thing.
It's going to be new to me when it comes to actually trying to use it in production, so.
Hopefully you won't hurt me. Well, at least instead of yelling at the vendor, I can yell at you now.
Yeah. Well, that'd be great. And I know I don't have to wait for months to get it fixed.
Appreciate it. All right, so this wraps it up.
Hardware at Cloudflare, episode five. Thank you very much, Brian. Ryan, see you next time.
Thank you. Bye. Hi, we're Cloudflare.
We're building one of the world's largest global cloud networks to help make the Internet more secure, faster, and more reliable.
Meet our customer, Wongnai, an online food and lifestyle platform with over 13 million active users in Thailand.
Wongnai is a lifestyle platform. So we do food reviews, cooking recipes, travel reviews, and we do food delivery with Lineman, and we do POS software that we launched last year.
Wongnai uses the Cloudflare content delivery network to boost the performance and reliability of its website and mobile app.
The company understands that speed and availability are important drivers of its good reputation and ongoing growth.
Three years ago, we were expanding into new services like a chatbot.
We were generating images dynamically for the people who were querying the chatbot.
Now, when we generate images dynamically, we need to cache them somewhere so they don't overload our server.
We turned to a local CDN provider.
They could give us caching service in Thailand for a very cheap price.
But after using that service for about a year, I found that the service was not so reliable, so we turned to Cloudflare.
And for the one year that we have been using Cloudflare, I would say that they have achieved the reliability goals that we were expecting.
With Cloudflare, we can cache everything locally and the site is much faster.
Wongnai also uses Cloudflare to boost their platform security. Cloudflare has blocked several significant DDoS attacks against the platform and allows Wongnai to easily extend protection across multiple sites and applications.
We also use web application firewalls for some of our websites that allow us to run open source CMS like WordPress and Drupal in a secure fashion.
If you want to make your website available everywhere in the world and you want it to load very fast and you want it to be secure, you can use Cloudflare.
With customers like Wongnai and over 25 million other Internet properties that trust Cloudflare with their performance and security, we're making the Internet fast, secure, and reliable for everyone.
Cloudflare, helping build a better Internet.