Using HPKE to Encrypt Request Payloads

Presented by: Andre Bluehs, Nafeez Ahamed, Miguel de Moura

Originally aired on April 7, 2021 @ 9:30 AM - 10:00 AM EDT

HPKE is a new encryption standard, developed with help from Cloudflare. The first application of it is in our Matched Payload Logging product. Two of the engineers involved in building Matched Payload Logging talk about how we selected HPKE, the benefits it has over other encryption techniques, and why we chose to encrypt our data at all.

Engineering

Cryptography

Transcript (Beta)

All right. Hello and welcome, everyone. We are going to be talking today about HPKE, hybrid public key encryption. Got it right. And about why we chose HPKE and what we use it for here at Cloudflare. I am joined here today by two of the engineers who were instrumental in helping us implement this and pick HPKE for our use cases. If you don't mind introducing yourselves, Miguel, kick us off.

All right. So I'm a systems engineer on the WAF team.

And Nafeez?

Hi. I'm Nafeez. I work in the product security team. I help different Cloudflare engineering teams with some security stuff.

Excellent. Thank you. And I'm Andre. I'm the engineering manager for the WAF team. Internally, we're known as the managed rules team because we own a bunch of rules that are managed by us. And externally, we own the product that is the WAF. So let's talk a little bit about one of the features of our managed rules and our WAF product in general. We had a couple of customers that came to us sometime last year and asked us for a little bit more information about the rules that were matching and why they were matching. Miguel, can you go into detail about the ask that the customers had and how we ended up approaching that?

Yes. So the WAF has a bunch of rules. These rules look at an incoming request, apply a bunch of heuristics and signatures, and decide whether or not to take an action on the request. And essentially, when an action is taken, we log an event, which contains a bit of information about the request. We log, for example, what website this was, the path, the query string, the user agent, that sort of thing. The sort of thing you find in your web server logs. But we never log the body or the headers. And in some cases, we have rules that look at these bits of the request. So they look at a specific header, they look at the body of the request.
And customers wanted to know a bit more about why these rules matched. When they looked at the events, they wouldn't be able to see the body or the header, so they couldn't drill down into why the rule matched.

So when you say we log these things, you mean that we actually give these back to the users, right? We actually say, for this particular request, it matched, and the request looked generally like this, with this path and query parameter.

Yes, exactly. So if you go to our dashboard, the Cloudflare dashboard, or even our log integrations, you can see exactly: okay, there's this request that came in, it matched on this specific WAF rule, and we took a specific action. For example, we blocked the request. In these cases, they were interested in knowing a bit more about why the request matched. And this could be for a variety of reasons. Either they're interested in why we matched, or they're interested in knowing, okay, perhaps this is a false positive. Should we do something about it or dig a bit more? And this essentially made us look into this new feature, which we introduced, which was the payload logging or match data feature.

So why don't we just do this across the board? Why don't we log the entire request for every request that matches a WAF rule?

Yeah. So that'd be great, right? So no.

It'd be super helpful, yeah.

Yeah. We don't log the whole thing for a variety of reasons. The most obvious one is that if we were to log the body of these requests, they could contain very sensitive info. You could have PII in it, personally identifiable information, and that wouldn't be very good at all. We could be logging, for example, a cookie header, which would contain an authorization cookie or a token. It's the sort of thing you should not log. So that's why we don't have this by default, and that's why we had to introduce a new feature to support this.

Okay.
So we're only logging this for certain requests and for certain customers. If we are only logging it for certain customers and certain requests anyway, why do we need to encrypt it at all?

So, yeah. We have to encrypt it for a very simple reason. We can't store this in plain text, essentially. The data is very sensitive or could be very sensitive. Usually we're just matching on specific payloads. It would be normal attacks, but it could also contain bits of information that would fall into the category of sensitive. So we have to encrypt it in such a way that we can't just access the information, because obviously it's very sensitive.

So Nafeez, this is kind of a question for you then. We have data encrypted on disk, meaning if someone wandered into our data center and yanked a disk out, they wouldn't be able to get anything meaningful because the data is encrypted. Why do we need to double encrypt our data as it goes through this logging pipeline?

Sure. Yeah. So usually, with any customer data that we have at Cloudflare, we try to make sure that it's in a form where, if Cloudflare is attacked someday and the attacker gets all this data, it will be very hard for them to know what it is. So that's one reason. The other reason is that even Cloudflare ourselves are not interested in this data. We call this a toxic asset, this very sensitive customer data. In this case, this is going to be our customers' customers' data, or maybe even customers' customers' customers' data in some cases. So we want to treat it as opaque. We just want to use it for the specific purpose we mentioned. In the case of the web application firewall, it's only to determine whether this request has anything malicious. And that's at the very early phase of the request life cycle. After that, we don't need this data. But for the very specific use case, the customer might need it later on. So we make it so that even we can't decrypt it.
So that's a very good motivation for this. And what's more interesting is that even though WAF started this, we are planning to do a similar kind of encryption throughout Cloudflare's logging mechanisms going forward.

That's cool. So basically any data that exists anywhere in our logging pipeline is going to be encrypted.

Yeah. And looking at how most companies across the security industry have been doing this, what we have seen is a trend of moving away from policy-based security to more technologically progressive solutions. With policy, we say, hey, we won't look at your data, but that's mostly just enforced through auditing or policy. Every so often, companies come and audit. A more technologically progressive solution is to give something that everyone else can audit. You can audit through software, you can audit through anything else. If you look at some of the protocols that are out there, like Oblivious DoH, that is, Oblivious DNS over HTTPS, or things like the certificate transparency logs, instead of trust, you give more auditing capabilities to everyone else out there. Similarly, we try to provide solutions that can be audited as well. So yeah.

Okay, cool. So you mentioned an interesting point there, that we view logging data, or data in general, as a toxic asset. We definitely had a choice of mechanisms to encrypt this data. Talk to me, either one of you, a little bit about how we eventually got to our decision to use HPKE. What was the process like? How did we arrive at that? I guess I'll pick on Miguel first.

Yeah, so we definitely had a bunch of options we could take on this. The first thing we decided was what Nafeez mentioned, which was that we didn't want to be able to decrypt the data afterwards.
So at this point, we realized we don't actually need symmetric encryption in the first place, because we don't actually need to hold the keys to this data. So right off the bat, we narrowed down the whole class of encryption mechanisms we could actually use.

And what do you mean by symmetric versus asymmetric?

So with symmetric encryption, essentially, we would have a key. This key would be used at the Cloudflare servers all over the world to encrypt the data we would be logging. And then, for example, in our dashboard, in the UI, we'd have an API. This API would also hold the same key, and the same key would be used to decrypt the data that was encrypted at our Cloudflare edge. So essentially, it's the same key to encrypt and decrypt. Whereas for asymmetric encryption, it's different. We have two keys, essentially: a public and a private key. The public key would be used at the Cloudflare edge, so in all our edge servers all around the world, to encrypt the data. And then we could have, for example, the private key available in a service, which our API would call and which would decrypt the data.

But even in that kind of case, because Cloudflare has both keys, we would still be able to decrypt. And that was one of the points that Nafeez was making, that we don't even want to be able to decrypt. So how did you get to the solution that we eventually chose?

Exactly. So this alternative, which I just explained, is still better than the first one, because we don't distribute the decryption keys all over the world to our edge points. But as you said, we still have the two keys, so it's still not great. We realized that we don't actually need and we don't want to have access to this data. So instead of us having the private key, we would have the customers hold the private key.
And then it means that we're not able to decrypt this data at all as soon as it leaves our edge. And this is more or less how we settled on, okay, we want something that is asymmetric, where we don't need the actual decryption key. But then we also have other concerns, one of which is performance. And one of the, let's say, packaged solutions for this kind of system, which combines both asymmetric and symmetric encryption, was HPKE. And perhaps Nafeez can...

Is that what the H stands for? Hybrid?

The hybrid, yes. Yes, pretty much. So it combines both things. There have been a bunch of projects that have used this sort of scheme before, because asymmetric encryption is much slower than symmetric encryption. So there are these kinds of hybrid schemes around, and people have implemented them in very different ways. One of these emerging standards now is HPKE. One of the co-authors is actually at Cloudflare as well and has been helping us and guiding us through picking this. And yes, we essentially settled on HPKE because it met all these requirements. We don't have access to the private key, and it's also very, very performant. And one of the other very important things is that it's designed in a way that makes mistakes unlikely. When you try to roll your own crypto, it's usually bound to have issues. So yes, it was pretty much the complete package.

Okay, cool. So you used the phrase emerging standard. How dangerous is it that we're using this at the cutting edge of technology? How well are we future-proofing ourselves against needing to upgrade to some better solution in the future, and how robust are the tooling, the technologies, and the libraries that we're using now?

Yeah, so exactly. The usual concern with something that's emerging is that it isn't finalized yet. When we wrote this, the final draft wasn't in. There have been some minor changes since.
And the way we designed this, it's versioned, so we can always bump the version of the protocol we actually use. And Nafeez can talk a bit more about this as well. We're using tooling that has been audited by us, because obviously there's not a lot of tooling around for HPKE. But yes, I think Nafeez can explain this a bit more.

Oh yeah. So I know there was a reference implementation in Go, but we run everything on Rust these days at Cloudflare, especially things related to firewall and WAF. So there was this other implementation that we looked at. It's on our GitHub, and you can check it out: the matched data CLI on the Cloudflare GitHub. One of the libraries that we use was implementing, at that time, the latest specification for HPKE in Rust. It was not audited, so we thought we would do an internal audit. With the regular crypto auditing in Rust, we have some level of confidence in terms of memory corruption bugs, so that's out of the way to some extent. If you're in Rust, C, C++ land, other than memory corruption, the other things you'd look at are: how is the source of randomness? The random generator, the functions being used, are we seeding them right? And also zeroing out memory for all the different variables that hold sensitive values, so different cross-customer functions and calls don't mix up with each other. Those are some of the things that we looked at. We were quite confident in its ability to do the simple primitives that we use. In our case, we used X25519 for the key exchange, with HKDF, the HMAC-based KDF, as the key derivation function. To go a little bit more in depth about how HPKE really functions, the way I look at it, you can compare it with a TLS handshake, where the first phase is where you do the key exchange and agree on a shared secret.
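
For readers unfamiliar with HKDF, the following is a minimal Python sketch of its two steps, extract and expand, using HMAC-SHA256 from the standard library. It is an illustration only, not the code discussed in this session: the shared secret and the labels are made-up values, and the HPKE specification wraps these steps with its own labeling scheme, which is not shown here.

```python
import hashlib
import hmac

HASH = hashlib.sha256
HASH_LEN = HASH().digest_size  # 32 bytes for SHA-256


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """Condense input keying material (e.g. a key-exchange output) into a fixed-size PRK."""
    if not salt:
        salt = bytes(HASH_LEN)  # an absent salt is treated as HashLen zero bytes
    return hmac.new(salt, ikm, HASH).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """Stretch the PRK into `length` output bytes bound to the `info` label."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long for HKDF")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        # Each block chains the previous block, the label, and a one-byte counter.
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


# Hypothetical shared secret and labels, for illustration only.
shared_secret = bytes(range(32))
prk = hkdf_extract(b"", shared_secret)
key = hkdf_expand(prk, b"example key", 32)
nonce = hkdf_expand(prk, b"example nonce", 12)
```

Using different `info` labels is what lets one shared secret yield several independent values, such as a key and a nonce.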
And then you use the shared secret, or the master key in the context of TLS, to do more encryption, the symmetric encryption. In the case of HPKE, you agree on a key using X25519 and HKDF. It's kind of interesting how this is derived. You do the key exchange on the curve, and the other party arrives at the same shared key. And you expand on that key from there: you use it to derive more symmetric keys. So in this case, you generate that key and that's how you encrypt and decrypt. We don't have the private key, so we don't have the ability to generate the key that can be used to decrypt. So that's, at a high level, how HPKE works and what we did to make it secure.

Cool. So we've talked a little bit about how we at Cloudflare can't actually decrypt this match data logging. Miguel, talk a little bit about how that actually works, about how we handle the public versus private key. Do we allow users to generate them? Can we generate them on their behalf? How is that safe?

Yes. So one concern is always, okay, what about user experience? How can I actually use this? The way this works is we have a few pieces. We have the piece that runs at the Cloudflare edge, which encrypts the data. And then we have some pieces to decrypt the data. So we have two options. We have the Cloudflare dashboard, where you can generate your public and private key and then store your private key securely. And then we also have a CLI, a command line interface, a small program that we provide to decrypt data. So for example, in the dashboard, when you are setting up the WAF, you can enable this feature. If you already have a public and private key pair, you just provide your public key to us, which is the only thing we have. You set up the WAF. And if some request is blocked by the WAF for some reason, you can then drill down on the events and you can supply to us your private key.
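
The hybrid flow described above (an ephemeral key exchange, a key derivation step, then symmetric encryption) can be sketched as a toy in Python. This is not HPKE and not secure: finite-field Diffie-Hellman stands in for X25519, a single HMAC call stands in for HKDF, and an HMAC keystream with a MAC stands in for a real AEAD. All function names are hypothetical. The point is only to show that the encrypting side needs nothing but the recipient's public key.

```python
import hashlib
import hmac
import secrets

# Toy group parameters. The modulus is a well-known prime, but this is plain
# finite-field Diffie-Hellman, NOT X25519, and must not be used for real data.
P = 2**255 - 19
G = 2


def keygen():
    """Return a (private, public) pair for the toy group."""
    sk = secrets.randbelow(P - 2) + 1
    return sk, pow(G, sk, P)


def derive_key(shared: int, enc: int, label: bytes) -> bytes:
    # Stand-in for HKDF: bind the shared secret and the ephemeral public value.
    material = shared.to_bytes(32, "big") + enc.to_bytes(32, "big")
    return hmac.new(label, material, hashlib.sha256).digest()


def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Stand-in for a real cipher: XOR with an HMAC-derived keystream.
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


def seal(recipient_pk: int, plaintext: bytes):
    """Encrypting side: only the recipient's public key is needed."""
    eph_sk, enc = keygen()                  # fresh ephemeral pair per message
    shared = pow(recipient_pk, eph_sk, P)   # key exchange
    ct = keystream_xor(derive_key(shared, enc, b"enc"), plaintext)
    tag = hmac.new(derive_key(shared, enc, b"mac"), ct, hashlib.sha256).digest()
    return enc, ct, tag


def open_(recipient_sk: int, enc: int, ct: bytes, tag: bytes) -> bytes:
    """Decrypting side: requires the recipient's private key."""
    shared = pow(enc, recipient_sk, P)      # same shared secret as in seal()
    expected = hmac.new(derive_key(shared, enc, b"mac"), ct, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication failed")
    return keystream_xor(derive_key(shared, enc, b"enc"), ct)


# The customer keeps customer_sk; the encrypting side only ever sees customer_pk.
customer_sk, customer_pk = keygen()
enc, ct, tag = seal(customer_pk, b"example matched payload")
recovered = open_(customer_sk, enc, ct, tag)
```

Because the ephemeral private value is discarded after `seal` returns, the encrypting side cannot recompute the shared secret later, which mirrors the property described in the conversation.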
And when I say supply to us, you supply it in the dashboard and it stays in your browser. The private key is never transmitted to Cloudflare at all. Decryption itself is also done in the browser. We use something quite exciting as well, which is WASM, so WebAssembly, where essentially we take most of our Rust code that actually runs at the Cloudflare edge and we compile it to WASM, which runs in the browser. So you're essentially running the exact same code that we run at the edge for the encryption part, and with it you can decrypt the data that we stored. The Cloudflare API sends the dashboard essentially a blob of information, a meaningless encrypted blob. And then by providing your private key in the browser, the browser decrypts the data for you and you can see exactly which thing matched on the request. And this is the same thing with our small program, the CLI, where you can do the same thing on your own command line, or with the integrations with other log providers. So for example, if you're pushing to Sumo Logic for your security analysts to take a look at the payloads, you can have a pre-processing step where your private key decrypts the data with our CLI, and then you can see all the data neatly on your dashboards. So this is how we allow you to access this data, which is encrypted, and all this without ever passing the private key to Cloudflare. Both in the browser and obviously in your CLI, in your actual terminal, we never transmit the private key. And the UX is quite nice because, for example, in the browser, if you're in the same session, in the same tab, we just store your private key in a JavaScript variable. This means you don't have to provide the private key every time you want to see a single event.
So for example, if you're drilling into a bunch of events for a few requests, you don't have to provide the same private key every time. And this is also secure because as soon as you close the tab or refresh, the private key is lost, and it is never transmitted.

That was going to be my question. You say we keep it around for ease of user experience, but how long does it last? So as soon as you navigate away from the page or close the browser tab, it goes away?

Yes. It's trying to strike that balance between UX and security, and I think we did it very well.

Excellent, excellent. So, sorry, Nafeez, go ahead.

Oh yeah, I just wanted to add: when this whole HPKE and encrypted WAF logging was being developed, I was only aware of the matched data CLI, and this whole browser feature was a thing that I came to know about at the end. It was super awesome to see that, especially because it makes usability super easy. And it also goes back to the auditing capabilities that I talked about. Whenever a website claims something related to security, that we don't do this or that, it's quite hard to look at and audit that. In this case, you can just go to the browser, open up your network tab, and see what really comes from the Cloudflare servers. Most of these things are generated locally, and it also goes along with how browsers have evolved recently, so that things like the key generation functions for HPKE can be done right in the browser. As I said, that's great.

Yeah, that's super cool. So it allowed us to reuse a lot of the same code. Miguel, what was the process like to actually take our Rust code and turn it into something we could use in the browser?

Yeah, so we used it for two reasons. The first one was that the APIs for HPKE don't quite exist neatly in the web APIs that browsers support. And essentially all we did was add a small bit of code to be able to compile to WebAssembly, which is the target that we actually use.
So it was very, very straightforward. It's quite nice because you're able to share most of the code that we actually run. So the surface that you have to audit is much smaller than if you had multiple implementations, one for each target. If you have one implementation for what runs at the edge, one for what runs in the browser, and another one for what runs in the CLI, it's much more complicated to do. So this was not only much easier for us to audit and to make sure it was actually correct, but it's also very simple to use in the multiple places. And yeah, so it was very, very simple to do.

Okay, excellent. So I think that was actually all of the points that I had to cover. Nafeez and Miguel, did we get to all of the talking points that you wanted to mention about HPKE and about our implementation?

Sure. Just going back a little bit to HPKE and why the standard in the first place. We should also talk about some of the alternatives. As we said, this is not new. A lot of people have had this use case over the years, and there are some solutions, like crypto box. There are some libraries. These are opinionated: these are the cipher suites, these are the combinations of cryptographic primitives that you can use. And if you look at the recently popular Go-based tool called age, by Filippo, that's also similar. It's super simple in that, as one of my teammates says, you just encrypt and you just decrypt, and you don't have to worry about anything else. So there are tools that solve these problems. And as I keep saying about HPKE, we have different products at Cloudflare and they have different use cases as well. So the optionality in HPKE, with different cryptographic primitives, helps us balance. We know that this is going to be secure because this is already audited. That's why I say this is similar to the TLS cipher suites, the well-audited TLS cipher suites that you can choose from.
And also it's going to make a good trade-off with performance. So I think that's something I wanted to discuss.

Yeah, no, that's a good point. So the HPKE standard is flexible enough for our use case, where we cared a little bit more about performance when we were choosing the encryption schemes, the primitives that you were talking about. But other teams' use cases might not necessarily be that. Some teams might actually care more about what it costs in terms of RAM or CPU to encrypt a thing. Is that correct?

Yeah, that's right. In general, that's the reason why we have these different primitives or combinations. One of the things I always used to wonder is, why do we have so many TLS cipher suites, and all these vulnerabilities? Over the years, they have shrunk. The number of cipher suites that are commonly used has become smaller, but it's never going to be just a single one. And I think that's the whole point of HPKE. You want some optionality and you want some level of opinion, cryptographic experts' opinions. And that's why we looked to people like Chris Wood at Cloudflare, who are the authors of this specification.

Excellent. Yeah, Miguel and I are very familiar with Chris's work. He was working with us on another project that we just launched recently, last week or the week before, for Security Week. And that was our whole login protection suite of products. Okay, that was all I have for everyone today. Thank you very much for tuning in, and thank you very much to Miguel and Nafeez for joining us.

Thanks so much.