Hertzbleed in simple terms
Presented by: Armando Faz, João Tomé, Yingchen Wang
Originally aired on July 27, 2022 @ 10:00 AM - 10:30 AM EDT
Hertzbleed is a new class of side-channel attack that leverages changes on the CPU frequency. To learn more about it, Yingchen Wang (co-author of the Hertzbleed paper) joins the conversation with João Tomé and Armando Faz from Cloudflare.
Check our in-depth blog post: Hertzbleed explained
Transcript (Beta)
Hello and welcome. We were just discussing how to pronounce my name, it's João. Welcome to our special program about the Hertzbleed attack that was recently disclosed.
With me we have Yingchen Wang. Hello Yingchen. Hello João. Is that the right pronunciation for your name?
Yingchen? Yes. Good. You're based in Austin, you're a co-author of the Hertzbleed paper, and you're now a research intern at Cloudflare. And we also have our very own researcher, Armando Faz Hernández, based in San Francisco.
Hello Armando. Hello João. Thanks for having me. Thanks for being here.
I'm João Tomé, storyteller at your service, and I'm in Lisbon, Portugal. Again, this is my second segment with three time zones represented, two from the US and one from Europe.
So let's get started. Yingchen and Armando, you wrote a blog post called Hertzbleed explained.
Actually anyone can see our blog and see the blog post there.
There are many details about this very specific and silent attack, and I can disclose right now that no systems at Cloudflare were affected by this attack.
But I'm curious Yingchen, you were one of the student researchers from the team that discovered this vulnerability.
Can you help us understand how you found it?
How did it all start for you on this topic? Yeah, that's a great question.
Actually, we found this vulnerability as two groups. On my collaborators' side, Riccardo and his advisor Chris first read a paper called Platypus, which shows that power interfaces leak data, and then they started thinking: is it possible that frequency also leaks data, because there's a frequency interface too? They started by trying to figure out how Turbo Boost works.
And from my side, I actually started from a completely unrelated topic called silent stores, which is a microarchitectural optimization that squashes a store of zero on top of zero.
So we were thinking: is there any cryptosystem vulnerable to silent stores? And we found a new vulnerability in SIKE such that it can trigger a lot of zeros depending on a secret key bit.
And later we ran it on a real machine and figured out that this can actually work on a real machine, and that's how we found Hertzbleed, starting from a topic completely unrelated to Hertzbleed.
So, going back to putting it in simple terms: the Hertzbleed attack, what is it really, for most people?
This is related to CPU frequencies, of course, but what is it in terms that most people can understand?
So traditionally speaking, power side channels and timing side channels are two disjoint areas.
Power side channels can leak very fine-grained data but usually require a measurement interface, usually an oscilloscope.
Timing side channels, however, can only capture coarse-grained data, but they do not require any co-residency.
And Hertzbleed is a new class of side channel that bridges these two by turning power side channels into timing side channels.
Basically, if some program leaks via a power side channel, meaning that if we measure its power we can see that the CPU power consumption depends on secret information,
now it is possible to transform this information into timing.
So we remove the need for any power measurement interface.
Exactly.
So in essence, if this attack is carried out, people will lose efficiency for sure, and there could be more problems at hand, right?
Yes.
Because it's all related. But I say it's silent because, and it was a good thing it was discovered this soon, this could create problems and people would have difficulty understanding why.
Why were their systems less efficient, slower? And this would be a reason why, right?
Yeah, exactly. I'm also curious here about what most users should think about this.
So the general user of the Internet probably won't be affected by this directly, although if systems are slower there's some indirect impact. But should most users be worried or not, and who should be dealing with this vulnerability?
So for an ordinary user who is not a cryptography engineer, I don't think they should be worried. But this should be a real problem for cryptography engineers, because if any system can leak via a power side channel, which means that if we measure the power interface the power consumption depends on some secret key bit, then it might be possible to transform that information into frequency and then into timing.
So I think this is a problem that should be patched by cryptography engineers. Ordinary users don't need to do anything, unless you're a crypto nerd like me running a server with SIKE, in which case you probably need to apply a patch.
Of course, makes sense. In the blog post, we did a comparison with a runner, a runner in a long distance race.
I really like that comparison.
Can you give people that comparison in terms of why this is related to a long distance runner?
Yeah, that's a great question. So you can think about a CPU just like a human runner, and when it first starts a workload... sorry, João, do you mind showing the second figure?
Okay, here is the blog post, and the figure you want to show is this one, right?
Sorry, one down. One down?
Yes. Oh, here it is, yeah. So when a CPU first starts a workload, it can run extra fast.
That's the phase when the fan makes a lot of noise, and that's also the phase called Turbo Boost on Intel CPUs.
On this figure, that's probably from zero seconds to 100 seconds.
However, after a while, the CPU realizes, okay, this workload is not a short one, so it cannot maintain such a speed forever, just like a human runner cannot sprint forever.
So it downclocks itself into a state that we call steady state in the paper, and inside the steady state, the CPU oscillates between several different P-states.
And what is a P-state? A P-state is an operating level of a CPU.
So on this figure, the CPU actually oscillates between the P-states at 4.0 and 3.9 gigahertz after the 100 seconds.
And in a sense, this was you trying out how the vulnerability could play out, the problems that it could entail, right?
Exactly.
The reason that it has such an oscillation between the P-states is that every CPU has a thing called TDP, thermal design power, that caps the total power consumption of the CPU.
So suppose a workload W1 consumes more power than workload W2... João, do you mind going to the next figure?
Sorry, next, next. This one, right?
Next, next figure. Sorry. This one? Next, next, next. Yeah, this one. Yeah, suppose we have a workload W1 that consumes more power than W2. The CPU power consumption is capped by this TDP, the thermal design power.
Then, of those two workloads, W2 can spend longer at the higher frequency compared to W1, because it consumes less power.
Hmm. And in such a way, we transform power consumption into frequency and frequency into timing, because now we can actually measure how long it takes for W1 and W2 to execute and then figure out which one is which.
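[Editor's sketch] The chain described here, from power to frequency to timing, can be illustrated with a small toy model. It is only a sketch under stated assumptions: the 100 W cap, the watts-per-gigahertz figures for W1 and W2, and the linear power model are all hypothetical and are not taken from the paper or the blog post. Only the 3.9 and 4.0 gigahertz P-states come from the conversation.

```python
# Toy model: a power cap forces the CPU to split its time between two
# P-states, so a hungrier workload ends up with a lower average frequency.
# All numeric values except the two frequencies are hypothetical.

F_LOW_GHZ = 3.9       # lower P-state mentioned in the conversation
F_HIGH_GHZ = 4.0      # higher P-state mentioned in the conversation
POWER_CAP_W = 100.0   # hypothetical cap standing in for TDP


def fraction_at_high(power_low_w, power_high_w, cap_w=POWER_CAP_W):
    """Fraction of time spent at the high P-state so that the
    time-averaged power equals the cap (clamped to [0, 1])."""
    if power_high_w <= cap_w:
        return 1.0
    if power_low_w >= cap_w:
        return 0.0
    return (cap_w - power_low_w) / (power_high_w - power_low_w)


def average_frequency_ghz(watts_per_ghz):
    """Time-averaged frequency, assuming power grows linearly with
    frequency (a simplification chosen for this sketch)."""
    p_low = watts_per_ghz * F_LOW_GHZ
    p_high = watts_per_ghz * F_HIGH_GHZ
    a = fraction_at_high(p_low, p_high)
    return a * F_HIGH_GHZ + (1.0 - a) * F_LOW_GHZ


def run_time_seconds(cycles, watts_per_ghz):
    """Wall-clock time for a fixed number of cycles."""
    return cycles / (average_frequency_ghz(watts_per_ghz) * 1e9)


CYCLES = 4_000_000_000     # same cycle count for both workloads
W1_WATTS_PER_GHZ = 25.5    # hypothetical: W1 draws more power
W2_WATTS_PER_GHZ = 25.1    # hypothetical: W2 draws less power

t1 = run_time_seconds(CYCLES, W1_WATTS_PER_GHZ)
t2 = run_time_seconds(CYCLES, W2_WATTS_PER_GHZ)
print(f"W1: {t1:.4f} s, W2: {t2:.4f} s")
```

In this model both workloads execute exactly the same number of cycles, yet W1 takes longer on the clock because the cap keeps it at the lower frequency for more of the time.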
This might not be a problem if it only lets us distinguish between two workloads, because usually the workload is not a secret.
It's okay for people to know which workload I'm running.
However, there's a very long history showing that power consumption is data dependent.
João, do you mind going up to the Hamming distance and Hamming weight figure?
Yeah. Oh, this one. Yeah. So what does "power consumption is data dependent" mean?
It means that if we're doing a computation that involves a higher Hamming weight, which means the number of ones, the bits that are turned on, it is going to consume more power.
For example, the left one here is going to consume more power than the right one here.
And if the data flow has a larger Hamming distance, it is going to consume more power.
For example, in the next figure at the bottom, the right one is going to consume more power than the left one because the right one has a Hamming distance equal to six.
Hamming distance means the number of bits that differ between two numbers, flipping either from one to zero or from zero to one.
Exactly.
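[Editor's sketch] The two quantities just defined can be written down in a few lines. The function names and the example values below are illustrative and do not correspond to the figures shown on screen.

```python
def hamming_weight(x: int) -> int:
    """Number of bits set to one in a non-negative integer."""
    return bin(x).count("1")


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where a and b differ, i.e. bits that
    flip from one to zero or from zero to one."""
    return hamming_weight(a ^ b)


# Illustrative 8-bit values (not the ones from the blog post figures).
all_zeros = 0b00000000
mixed = 0b10110101

print(hamming_weight(all_zeros))           # 0
print(hamming_weight(mixed))               # 5
print(hamming_distance(all_zeros, mixed))  # 5
print(hamming_distance(mixed, mixed))      # 0
```

An all-zero value has Hamming weight zero, which is the extreme case that matters later in the conversation when the computation produces all zeros.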
So these were the tests made to see the difference, and the difference was there.
This is like the proof that the difference was there. Before we get into more specific things, I'm also interested to know, Armando, how did Cloudflare approach this vulnerability?
How did we get to know about the vulnerability initially, and how did we approach it?
Yeah. So fortunately we have the research team, where Yingchen is working.
So she communicated to us about the possible attack in the very early stages of the research, and about its potential application to SIKE, right?
So we were interested in that because we have this implementation of SIKE in Go.
And during that time we were looking for ways in which we could mitigate it, or at least add some countermeasures.
So if I remember correctly, one of the first countermeasures was to include some kind of randomization in the computations.
Randomization is important because, as Yingchen already mentioned, we are detecting differences in the data and in its binary representation, like the Hamming distance or the Hamming weight.
So whenever we randomize the computation, this scrambles the actual values.
And at that point, in the very early stage of the attack, this mitigated it, because the attack wasn't able to reconstruct anything based on randomized data.
But after that, Yingchen, you commented on how the attack got more efficient in that sense.
And then randomization was not enough to protect against it.
Do you want to follow up? Yeah, exactly. Basically, the attack we reported to Cloudflare can actually be made more complete, in that there are more ways to trigger it.
What we reported to Cloudflare is a way to trigger a lot of zeros depending on a secret key bit.
And then those zeros can leak via frequency and can be observed via remote timing.
Which means we can measure how long it takes for the server to run SIKE and then figure out the secret key bit by bit.
However, the attack we reported to Cloudflare is incomplete. There's actually another paper that introduces two different types of this attack.
They all trigger a lot of zeros.
And the original patches only patched the one that we reported to Cloudflare, but not the other two.
You didn't tell us, Yingchen, about the university where you are doing your work.
Can you just give us a summary of where you are working?
How do you collaborate with your colleagues there? Oh, I'm at the University of Texas at Austin and my advisor is Hovav Shacham.
He is a professor in computer security.
He used to do cryptography and now he focuses more on security, especially browser security.
That's a good background there. And going back to the collaboration, because you are now a research intern at Cloudflare, but you are also doing your own work.
Collaboration is very important.
When a researcher from a university finds something, sharing it with others in the industry is important.
When someone from the industry finds something and shares it, like we do in our blog, with universities and other companies, that is important.
Collaboration is really important in these topics, right?
Yeah, exactly. I think we would not have made the story as good as it is right now if we hadn't been able to collaborate with my other co-authors' group and with Cloudflare for the patches.
Exactly. And in terms of Cloudflare, it's important to be aware.
That's why we have a really important research team, Armando.
So it's important to be aware of what's happening, of vulnerabilities people are discovering, right?
Yeah. So that's super important.
And that's why we are here trying to communicate how Hertzbleed actually works and what the implications are.
Yeah. So something that I want to focus on is that this is, let's say, a generic attack on implementations.
So this is very important for crypto implementers, or in general for the academic community, in terms of how to securely implement an algorithm.
Because this brings, as Yingchen already mentioned, the connection between power analysis and timing attacks.
Before that, for example, people used to measure differences in power on very tiny constrained devices, like small chips or small devices that can be found in several products.
But now with this there is a way to observe the power consumption of a program at a very fine level on a regular desktop or laptop CPU.
And this brings another panorama of how to actually protect implementations.
And as I mentioned, this was exemplified on SIKE, which is actually a post-quantum algorithm, but don't be surprised if some other types of algorithms can be exploited using the methodology that Hertzbleed uses.
Well, I'm interested to understand a little bit more about SIKE, that algorithm specifically, because it's an important one, right?
Yeah, yeah. If you can go to the blog post, I can, I have one figure to show.
So the one with the curves. So SIKE, yeah. SIKE is basically an algorithm that works with elliptic curves.
And the difference from classic elliptic curve cryptosystems is that in this case it adds a new operation, which is the computation of an isogeny.
An isogeny can be regarded, in very simple terms, as a transformation that takes one point of a curve.
For example, we have this input curve in green, and this function, which is called the isogeny, takes the point
and basically maps it to another point on a different curve.
This is one important difference from classic elliptic curve cryptosystems, because pretty much all the functions in regular cryptosystems use one single curve.
But in this case, for SIKE, there are multiple curves and there are relationships between them.
So basically here, the difficult problem is to find such a relation, or to compute this relation efficiently, when the number of curves is exponentially large.
And this is all related to computing the protocol; in a sense it is all related to how computers behave, how fast they are, right?
How they compute, in a sense. Yeah. So, sorry.
So, yeah, this is the mathematical interpretation of that. And now we map it to what is actually happening.
There's a small detail that I want to mention that is displayed in the next figure.
One important point is that some of the mathematical formulas that we use to compute operations on elliptic curves don't cover all the points. Or, put a different way, there are some points for which this isogeny computation doesn't actually map to a point on a curve.
So in this case you can see that the points that are marked went to nowhere, right?
And this nowhere has a very important property: it is actually represented by coordinates that are all zeros.
And this is a very important fact for the case of Hertzbleed, because when we represent a number in the machine, in the computer, and this number is all zeros, we will have several registers that are set to all zeros.
And as we saw previously, the Hertzbleed attack can distinguish computations that have a different Hamming weight or a different Hamming distance.
And in this case, the Hamming weight of all zeros is zero, versus a random-looking word that could be a valid point or a valid computation.
That will just look random, and then the Hamming distance, sorry, the Hamming weight, will be much higher.
So basically this is the point at which you can see a difference between the computations.
Going back a little bit, Yingchen, how did this name come about?
Because there was a very well-known attack that doesn't have any relation to this one, Heartbleed.
This doesn't have a relation, but the name is not completely different.
Why, and how, was it called Hertzbleed in this case?
Yeah, so that's a great question. First, I want to say that it has nothing to do with the rental car company, because it's very funny.
I saw people commenting on Twitter or Hacker News that they thought the rental car company Hertz was going bankrupt or something.
And it also doesn't have anything to do with Heartbleed. Traditionally speaking, attack papers have a habit of naming their work something-bleed.
And we chose Hertzbleed because our attack is related to frequency, CPU frequency, and hertz is the unit of frequency.
That's why we chose Hertzbleed. We thought about something like Frequencybleed, but that's too long and just doesn't sound good.
So we went with Hertzbleed.
Makes sense. And in terms of putting the attack into practice, what tools did you use?
Yeah, we tried it in our lab environment.
So we have a server that is free of background noise, and we run a SIKE server on it.
And then we have another machine connected to the server with a cable.
And then we time how long it takes for the server to respond to a certain number of requests.
And a common question that people have is: is the signal going to go away if you have background noise on the server?
And we tried that.
We tried adding background noise on our CPU, and even though we added some noise, the signal was still there.
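[Editor's sketch] One way to see why a small timing signal can survive noise is that averaging many measurements shrinks the noise while the gap stays. The simulation below is only an illustration of that statistical idea, not the measurement method used in the paper: the response times, the 1 ms gap, the noise level, and the sample count are all hypothetical.

```python
import random
import statistics


def simulate_response_times(base_ms, n, noise_ms, rng):
    """Draw n simulated response times around base_ms with Gaussian
    noise. This stands in for timing real requests to a server."""
    return [rng.gauss(base_ms, noise_ms) for _ in range(n)]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
N = 20_000               # hypothetical number of timed requests
NOISE_MS = 20.0          # hypothetical noise, much larger than the gap

# Hypothetical case: one input makes the server 1 ms slower on average.
slow = simulate_response_times(1001.0, N, NOISE_MS, rng)
fast = simulate_response_times(1000.0, N, NOISE_MS, rng)

gap_ms = statistics.mean(slow) - statistics.mean(fast)
print(f"estimated gap: {gap_ms:.2f} ms")
```

A single measurement is dominated by the 20 ms noise, but the difference of the two averages should land close to the 1 ms gap, because the uncertainty of a mean shrinks with the square root of the number of samples.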
Makes sense. Looking at what developers can do from now on, what lessons should be taken from this attack?
Well, yeah. There are very important points to summarize here.
First of all, be aware of whether a computation can trigger basically all zeros, or a big difference between valid cases and invalid cases.
And as I mentioned before, in this case the changes in frequency depend on the data being processed, and that actually leads to visible timing differences.
And your perspective there, Yingchen, in terms of lessons?
I think it's quite surprising for cryptography engineers, because nobody was prepared for such a channel.
And I think the lesson is that right now the constant-time programming principle is insufficient, because the current constant-time principle that we have deals with how many cycles a program runs.
However, what Hertzbleed says is that constant cycle and constant time are two distinct concepts from now on, because a program can be constant cycle, but if its power consumption is secret dependent, it is no longer constant time.
So Hertzbleed adds a new requirement to the constant-time programming principle: the program not only has to be constant cycle, but its frequency variation has to be secret independent too.
Makes sense. We're almost out of time, but I'm also curious, personally for you, Yingchen, what dealing with this attack meant for you as a PhD student, because this is something that adds to your career, right?
Yeah, I think, first, it got me an internship at Cloudflare, which is nice.
Yeah, I'm very happy about that. And second, I've been trying to attack many things during the first couple of years of my PhD, and this is the first time I made one thing happen.
Before that, all of my attacks actually failed.
I liked some of those ideas, but they failed for this reason or that reason.
Basically, I had some idea, but the implementation just happened to be a certain way.
So my idea didn't work. And this is the first time it works.
So I'm really happy about it. And it has a real-world impact, because what you and your team did there had an impact on companies, on how they deal with this vulnerability and how they protect themselves.
So it's really important in a sense, right?
Yeah, I think it's not that every system is vulnerable to this attack, but there is a cryptosystem that is vulnerable. Although it doesn't lead to a real-world attack, it at least gives people a new perspective on security.
So I'm very glad I could do this. For sure. And for us, Armando, for Cloudflare, the difference is having those patches already available when this was presented and disclosed publicly.
Is that right? Yeah. So once the attack evolved a little bit, both Yingchen's paper and the other paper discussed which specific measures to take in order to protect SIKE.
So basically this was in part caused by the validation of public keys or the validation of ciphertexts.
So basically enforcing, or being more restrictive on, the kind of validation that is done on the keys actually prevents triggering zeros.
And yeah, a very funny thing is that, for example, the cryptanalyst who is trying to attack some algorithm is very happy when this succeeds, as Yingchen already mentioned, but for the cryptographer it is totally the opposite.
Oh, my system was attacked.
And so there are these differences between cryptographers and cryptanalysts.
And yeah, we're very happy with the work that Yingchen was doing.
And then, yeah. And we're almost out of time. I want to thank you both for your work and for your explanations.
It was great. And I should mention that you should check our blog post about this and also research.cloudflare.com.
And if you're interested, you can apply for an internship and also join as a visiting researcher.