🌐 Introducing Geo Key Manager v2

Presented by: Dina Kozlov, Jon Levine, Tanya Verma

Originally aired on March 5, 2023 @ 4:00 PM - 4:30 PM EST

Welcome to Cloudflare Impact Week 2022!

Cloudflare's mission is to help build a better Internet. We believe a better Internet can be not only a force for good, but an engine of global sustainability. This week we'll be highlighting an array of initiatives inspired by these optimistic ideals, as well as stories from partners who share them.

In this episode, tune in for a conversation with Dina Kozlov, Jon Levine, and Tanya Verma.

Tune in all week for more news, announcements, and thought-provoking discussions!

For more, don't miss the Cloudflare Impact Week Hub

Impact Week

Transcript

Hello, everyone. Welcome to Cloudflare TV. My name is Jon Levine, and today we're going to be talking about Geo Key Manager version two. Very exciting product, very exciting update. We're going to learn all about what this is and why we built it. What's new? I'm joined today by Dina and by Tanya. Dina, do you want to go first and introduce yourself? Sure. Hi, everyone. Thank you for watching. I'm Dina Kozlov. I'm the product manager for the SSL/TLS team. Been here for a few years at Cloudflare now, but really excited to be talking about this launch. I'll pass it on to Tanya. Hey, I'm Tanya Verma. I'm an engineer on the research team, and research works on different projects that are sort of in the proof-of-concept stage, and we try to make them into functional products over time. So, excited to be here to talk about Geo Key Manager. Awesome. Yeah, I guess I didn't introduce myself. So I'm Jon Levine. I'm a product manager here at Cloudflare. I work on our data products and I work on data localization more generally, which Geo Key Manager is a critical part of. And we'll definitely have some time, and then we'll talk more about our broader data localization efforts. But cool. Let's jump in. Okay. So, Dina, let's just start with: what is it? What is Geo Key Manager? Yeah. So the way that traffic stays encrypted on the Internet is through the use of cryptography. So there's always a public key and there's a private key. And the private key is the one that's responsible for encrypting the data that's being transmitted. And so in general, at Cloudflare, we manage tens of millions of domains and we manage their TLS certificates and the encryption keys that are associated with them. And in general, we store all of these encryption keys on all of our machines at our edge in more than 200 data centers around the world. And we do that because that way we can complete the TLS termination closer to the user.
And so that really helps with latency and just makes requests a lot faster. But some customers don't want these encryption keys to be stored everywhere. There are new data localization laws that are coming into play. So GDPR was one major one that came in a few years ago, and now they're being developed around the world. I think there are more than 15 countries that are developing these data localization policies. And what they say is that you should keep your encryption keys in one area. And so the customers that don't want to store these globally instead want to define a region where these are stored. So this region can be something like a country, like the US; it can be something like a region, like the European Union; or some customers want to define this region based on an attribute. So, for example, the FedRAMP-compliant colos, or different kinds of compliance-based rules. So Geo Key Manager allows customers to specify these attribute-based, essentially, geo restrictions so that we only store the encryption keys in certain areas based on a certain attribute. Oh, you're on mute. I'm on mute. Thank you. Thank you, Dina. So, yeah, I mentioned earlier we have a whole suite of data localization products, and I think Geo Key Manager was the first one we launched, in 2017. Is that right? And so it's really one of the most fundamental things, because there are a lot of elements of data localization, there are a lot of types of data that Cloudflare handles where we want to make sure customers control where that's processed and where that's stored. But these private keys used for TLS, that's some of the most sensitive data that we have that our customers trust us with. So, yeah, it's super important that people can control the regions specifically where these are. So yeah, a lot of what you said was actually true, you know, five years ago. So we just launched a big update.
We're calling this the v2, the biggest update we've really ever done to this product. So tell me, what feedback were we hearing from customers, and what's new? What did we change in this new version? Yeah, so there are a few challenges that we saw with the first version of Geo Key Manager that we launched. So the first thing is that it initially only supported three regions. This was the US, the European Union, and a set of data centers that we ran as our high security data centers. These are locations that, beyond our standard physical security measures, have a higher standard to them. And so one of the things was that as these compliance laws keep developing, these two regions are not enough for customers. They want to be able to define other countries where these laws are being developed. And so the first thing was just being able to support more regions and more countries so that customers have the flexibility, as these laws develop, to be able to essentially develop the logic with them. The next thing is customers want configurability. Sometimes it's not as simple as just saying, "Store my keys in Japan." Sometimes customers want to write different rules on where their keys are stored. So this could be "store my keys in the US and in the European Union." This could be "store my keys in the European Union, but not in France." And so that was another thing that we saw that customers were asking for: that configurability aspect, to be able to write rules based on where these keys should be stored. The next thing was scalability. When we first built out the three initial regions, this was actually a hardcoded list of data centers, but this product was launched a few years ago, and since then our network has grown significantly. And so what we want to make sure is that as our network grows, these lists also continue to grow and change dynamically instead of statically.
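
To make the rule examples above concrete, here is a minimal Python sketch of selecting data centers from their attributes rather than from a hardcoded list. The data center names, the attribute fields, and the `select` function are all hypothetical and chosen for illustration; this is not Cloudflare's API or configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    name: str             # hypothetical identifier
    country: str          # two-letter country code
    regions: frozenset    # region attributes, e.g. {"EU"}


# Hypothetical inventory; a real one would come from a live source.
DATA_CENTERS = [
    DataCenter("fra01", "DE", frozenset({"EU"})),
    DataCenter("cdg01", "FR", frozenset({"EU"})),
    DataCenter("nrt01", "JP", frozenset()),
    DataCenter("iad01", "US", frozenset()),
]


def select(data_centers, allow_regions=(), allow_countries=(), deny_countries=()):
    """Return the names of data centers permitted by an allow/deny rule."""
    chosen = []
    for dc in data_centers:
        allowed = dc.country in allow_countries or any(
            region in dc.regions for region in allow_regions
        )
        if allowed and dc.country not in deny_countries:
            chosen.append(dc.name)
    return chosen


# "Store my keys in the European Union, but not in France."
eu_not_france = select(DATA_CENTERS, allow_regions={"EU"}, deny_countries={"FR"})
print(eu_not_france)

# When the network grows, the same rule picks up the new site automatically.
grown = DATA_CENTERS + [DataCenter("mad01", "ES", frozenset({"EU"}))]
print(select(grown, allow_regions={"EU"}, deny_countries={"FR"}))
```

Because membership is computed from attributes each time, adding a data center or changing one of its attributes changes the result without anyone editing a list by hand.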
And so the next thing is just the whole architecture that Geo Key Manager was built on. Tanya can speak about this a bit more, but we essentially went back to the drawing board and re-architected everything so that it can be a lot more robust, a lot more efficient, a lot more scalable. And so we did a complete overhaul. Yeah, and now it's a much better product. Cool. It sounds like, yeah, the keyword is scale, right? More data centers, more countries, more customers. Exactly. More flexibility, more requests. Yeah. Cool. You know, it's interesting, as you're talking, I realized that sometimes when we talk about data localization, there are kind of two pretty different types of customers. Like you mentioned bringing this to Japan. Sometimes we work with customers who, let's say, are like a Japanese bank, and their customers are in Japan, their business is in Japan. They don't have much reason for traffic or anything to ever leave Japan, and they want things to stay in that one region. And that's quite common. We have customers who have that. It might be a health care company in Germany, right, where they're dealing with one country and one set of rules. And the new version of Geo Key Manager is great for that. And really what we're doing there is bringing this product to many more regions where it wasn't available before. I think before it was just US and EU, maybe. So it was a bit limited. So now we're bringing that to... Do you know how many countries we're in now? We are in more than 15 countries, so more than ten regions. Wonderful. More than 15 countries and ten regions. Bringing this to many, many more places where people have these requirements, which is really awesome to see. But then the other thing I think we're doing that's really cool is we also see customers that have a global customer base themselves. So obviously Cloudflare is a global company and we have customers around the world.
And what we also see is the need for people to kind of have, like, a denylist: say, you can be in this place and this place and this place, but not in these places. So it sounds like you're really giving that kind of granularity and flexibility, which is awesome to see. Exactly. Well, actually, I want to correct that: we're in more than 30 countries. More than 30 countries. Amazing. More than 30 countries. Thank you for the correction. It's an impressive number. So yeah, and I think... Oh, go ahead. Oh, I just wanted to say one of the other things that we also learned from v1 is that we wanted to have some safety measures in place. So when customers restrict their keys to only be stored in one country, sometimes that country can have a small-capacity data center or, for example, the region can be too small. And so one of the things that we've done is we've added essentially a redundancy requirement that just requires that, whatever region you choose, we feel that the data centers there are capable of handling that level of traffic. Or, for example, if customers choose a country where we don't have a data center, we prevent them from doing that. And so we just wanted to be very careful to prevent our customers from essentially shooting themselves in the foot by setting policies that were not good for their setup. Yeah, that's a great point. I mean, obviously there are some tradeoffs in using Geo Key Manager, which we can talk more about, but we want to kind of minimize that and make sure that you don't have a ton of extra latency incurred or take a big hit to reliability, right, for that. So cool. So that's actually a great transition. Tanya, I'd love to learn more about how you did this. How did we build it? How did we re-architect it? Maybe just give the high-level summary and then we can kind of dig in from there. What were the biggest or main changes we made?
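
The redundancy requirement Dina describes a moment earlier can be pictured as a validation step that runs before a policy is accepted. The sketch below is only an assumption about its shape: the thresholds, the capacity scores, and the `check_redundancy` name are invented for illustration and do not reflect the real checks.

```python
def check_redundancy(candidates, min_sites=2, min_capacity=100):
    """Decide whether a policy's matching data centers are safe to accept.

    `candidates` maps a data center name to a made-up capacity score.
    Returns (accepted, reason).
    """
    if not candidates:
        # The customer picked a place with no data center at all.
        return False, "no data centers satisfy this policy"
    if len(candidates) < min_sites:
        return False, "too few data centers for redundancy"
    if sum(candidates.values()) < min_capacity:
        return False, "combined capacity is too small"
    return True, "ok"


print(check_redundancy({}))
print(check_redundancy({"small01": 40}))
print(check_redundancy({"small01": 40, "small02": 30}))
print(check_redundancy({"big01": 80, "big02": 90}))
```

Rejecting the policy at configuration time, rather than failing at request time, is what keeps a customer from accidentally choosing a region that cannot serve their traffic.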
Yeah. So one big thing was that we used a particular cryptographic scheme for the first version of Geo Key Manager, and this scheme had a lot of restrictions that were fundamentally built into the cryptography. So it supported very inflexible policies, and the policy I'm referring to is, like, what country or wherever you want things to be. We couldn't add in new policies after the instantiation of the scheme without doing a lot of work. Anytime we wanted to change a policy, it was just very complicated. The other big problem was, again, scale. Like, we add new data centers all the time, but with this particular scheme we were restricted to only the data centers that existed in 2017 as endpoints for being able to decrypt these keys. So we couldn't take advantage of all the new data centers that were coming up, and that affects speed, that affects performance, that affects several different things. And then a couple more internal things: what if you change the attribute of a given data center? So an example is, if you have a list of Europe-based data centers and the UK is no longer in Europe, how would you get it out of that list? So that was very complicated. So sometimes you have these sorts of attribute changes. There's a lot of flux in the system, and all of that can be very elegantly handled by switching to a different cryptographic scheme that underpins the whole system. So if you've heard of public key cryptography, this is kind of like a variant of that, and it allows us to make these really featureful policies, very detailed. You can do negations, which is like "not this particular country"; you can allowlist a country, you can exclude certain countries, you can do whatever you want. So that was one part, and the other challenging part was the performance. So again, as I mentioned, research is where we start out projects and we deploy them. We don't necessarily think that they're going to take off.
So when the demand for Geo Key Manager v1 scaled up quite a bit, we realized that a lot of the decisions that we made in 2017 didn't really hold up over time. So this led to certain performance problems. One issue that we noticed quite a bit was at the tail, which means at, say, the 99th percentile of requests, we would see a lot of performance drops that would affect availability. It would cause incidents. It was just a whole lot of pain. And figuring out those small issues, we worked for a while to figure out why certain things were the way they were. And it's a very complex system, so we had to add in a lot of distributed tracing, because there are several different services across the system. And so that took time. And yeah, the third major component was just being able to rotate keys. So, you know, there's this running joke on the Internet that for every outage, everybody says, "It's DNS. This is caused by DNS." And I would say that probably the second or third major reason for outages is that someone has changed a key somewhere and the keys don't get propagated properly. So in Geo Key Manager v1, that was a big concern. It was really difficult to just change any of these keys. So with v2, that was a big part of our design: make sure that process is as simple as possible. So yeah, I would say three components. Yeah, I'd love to dig in and learn a little more. You mentioned that the issue with policies, like having different types of policies per customer and adding new data centers, was solved by moving to a new cryptographic scheme. It would be awesome if you could describe what the old scheme was, how the new scheme is different, and how it solved those challenges. Yeah.
So the old scheme was, well, so we use this thing called attribute-based encryption, which basically allows you to encrypt a file, for instance, based on some access policy, so that it can only be decrypted by the users or data centers that satisfy that particular policy. So it kind of changes the access control model. If you're building any sort of access control system, like IAM, everybody has an access management system, what you do is you make a request to some central party and they check, "Hey, does this person have the right authorizations?" And then they give you access to the resource, or they give you some token to do whatever. But the problem with doing this at Cloudflare is that we built out this huge and amazing data center system where everything is as close as possible to you. And if every single time we need to enforce this access policy you have to call back to the control plane or the core in, like, Portland or something, from, like, Japan, for every single request, that's really problematic. So what attribute-based encryption allows you to do is, instead of having to call back home all the time, and that being in the critical path of, like, a handshake, you can fetch the correct set of decryption keys for your particular set of attributes out of band. Think of it like a token that you don't have to get right when you're making the authorization request; you can get it at a different time, and then you can continue to use it for a set of requests until you have to change up your keys. So it allows you to do these things really fast and in a more capable fashion. And again, with these policies themselves, you can support a Boolean formula. So it's like AND, OR, NOT; any combination of these you can support.
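
The Boolean-formula policies mentioned above (AND, OR, NOT over attributes) can be illustrated with a small evaluator. This is only a sketch of the policy logic, under the assumption that attributes are plain strings such as `region:EU`; in real attribute-based encryption the check is enforced by the cryptography itself, since decryption only succeeds when the key's attributes satisfy the formula. The class names and attribute labels here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str       # a single attribute, e.g. "country:US"


@dataclass(frozen=True)
class Not:
    term: object    # negation of a sub-policy


@dataclass(frozen=True)
class And:
    terms: tuple    # all sub-policies must hold


@dataclass(frozen=True)
class Or:
    terms: tuple    # at least one sub-policy must hold


def satisfies(policy, attributes):
    """Return True if a set of attribute strings satisfies the policy tree."""
    if isinstance(policy, Attr):
        return policy.name in attributes
    if isinstance(policy, Not):
        return not satisfies(policy.term, attributes)
    if isinstance(policy, And):
        return all(satisfies(term, attributes) for term in policy.terms)
    if isinstance(policy, Or):
        return any(satisfies(term, attributes) for term in policy.terms)
    raise TypeError(f"unknown policy node: {policy!r}")


# (EU or US) and not France
POLICY = And((Or((Attr("region:EU"), Attr("country:US"))), Not(Attr("country:FR"))))

print(satisfies(POLICY, {"region:EU", "country:DE"}))
print(satisfies(POLICY, {"region:EU", "country:FR"}))
```

The negation node is the piece Tanya says the 2017 construction could not express.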
So our previous scheme was made at a time when some of these attribute-based encryption systems did not yet exist. The system that we're using is from 2020, actually, after a couple of changes. But we basically couldn't have supported negation, for instance. Interesting. In 2017, that just did not exist, right? So we had to cobble together these two other primitives and sort of make something that kind of simulates attribute-based encryption, but it's not. So yeah, you get some theoretical issues with that, but then also, because it's sort of brittle, the way that it all fits, it just doesn't scale very well. And I think going into the details would take a lot of time. So maybe... Oh, go ahead. Yes. This is a good preview. We are going to have a very detailed technical blog post, scheduled for next week, very soon. And we also have a paper in the works describing this system. So, yeah, if you're interested in how to use attribute-based encryption for your access control, please check out that paper and read the technical blog post. Yeah, it's super cool. I'm going to have to study up on this. I'd love to know, maybe you could just explain a little bit the life of an SSL handshake, or TLS handshake, when using Geo Key Manager. Do I understand correctly that this is still using Keyless SSL under the hood? Yeah. So maybe can I tell you my very simplistic idea of how this works? You can tell me if this is wrong. So when a client initiates a TLS handshake, the idea is that the server, like our edge server, may or may not have access to that TLS key locally. So either we have access locally and we can just fetch it locally, or we have to essentially go over the network. And we still actually don't fetch it; it's not like we're fetching a copy of that key.
We're using Keyless SSL to kind of use that key in this delegated way to generate, like, a session token. Is that kind of right at a high level? Sort of, sort of. Like, several points right. Okay, good. One thing is that in a TLS handshake, the only part that requires any sensitive data at all is the signature part, which happens at the very end. So that authenticates whether or not the person is who they say that they are. So that's the TLS signature. So if, like, you have a policy for Europe and the request comes to the US, what's going to happen is the US data center does have the key; it just can't decrypt it, because it doesn't have the right set of attributes. I see, I see. Yeah. Because we use this global key-value store system that basically gives everyone a copy of the same data. So the key is distributed everywhere, but the key is wrapped, encrypted. So it's about where we can unwrap and decrypt the key. Exactly. And so that is actually what determines the policy. And you have this attribute-based decryption scheme, which determines where it can be decrypted. Exactly, yeah. So the keyless part would work the same? Does that work the same way in Geo Key Manager v2: if you don't have it, use keyless? Yeah. So if you can't decrypt it, what you're going to do is, you have a list of satisfying data centers that can, so you send it over to the closest one. And you just send a request; you don't send any keys, because that one has a copy, right? So you just tell them, "Hey, can you sign this handshake for me?" Yes. And they give you the signature back. Yes, exactly. And you finish the handshake. Great. Yeah, but the amazing thing is, once you do this first handshake, then you have a session ticket. The session ticket, right? Yeah. And you can just continue to use it; you no longer have to do this whole process. It's just for the first one.
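
The handshake flow just described can be sketched as follows. This is a toy model under several assumptions: a subset check on attributes stands in for attribute-based decryption, an HMAC stands in for the TLS signature (real TLS uses an asymmetric signature), the `rtt_ms` field is a made-up distance from the data center that received the handshake, and all names are hypothetical.

```python
import hashlib
import hmac


class DataCenter:
    def __init__(self, name, attributes, rtt_ms):
        self.name = name
        self.attributes = set(attributes)
        self.rtt_ms = rtt_ms  # assumed round trip from the receiving data center

    def can_unwrap(self, wrapped_key):
        # Stand-in for attribute-based decryption: it only succeeds when this
        # data center holds every attribute the key's policy requires.
        return wrapped_key["required"] <= self.attributes

    def sign(self, wrapped_key, handshake):
        if not self.can_unwrap(wrapped_key):
            raise PermissionError(f"{self.name} cannot unwrap this key")
        # Placeholder for the TLS signature over the handshake.
        return hmac.new(wrapped_key["secret"], handshake, hashlib.sha256).hexdigest()


def handle_handshake(local, peers, wrapped_key, handshake):
    """Return (signer_name, signature) for a handshake arriving at `local`."""
    if local.can_unwrap(wrapped_key):
        return local.name, local.sign(wrapped_key, handshake)
    eligible = [peer for peer in peers if peer.can_unwrap(wrapped_key)]
    if not eligible:
        raise RuntimeError("no data center satisfies the key's policy")
    nearest = min(eligible, key=lambda peer: peer.rtt_ms)
    # Only the handshake bytes are sent; the key itself never travels.
    return nearest.name, nearest.sign(wrapped_key, handshake)


# Every data center stores the same wrapped key; only some can unwrap it.
wrapped = {"required": {"region:EU"}, "secret": b"toy-private-key"}
us = DataCenter("iad", {"country:US"}, 0)
peers = [
    DataCenter("fra", {"region:EU", "country:DE"}, 90),
    DataCenter("ams", {"region:EU", "country:NL"}, 80),
    DataCenter("nrt", {"country:JP"}, 150),
]

signer, signature = handle_handshake(us, peers, wrapped, b"client-hello-transcript")
print(signer)
```

In this toy run the US data center cannot unwrap the EU-restricted key, so the signing request goes to the eligible peer with the lowest round-trip time.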
And the key material never lands in that country. Super interesting. Yeah. So the hard part here, if I understand, in v2, is really making it so that we can distribute these keys everywhere, but you can establish these sort of cryptographically verifiable rules about where these keys can be unwrapped and accessed. Yeah, really cool. That's amazing. Thank you for explaining that. I feel like I understand it much better now. Thank you. Yes, thank you for the great questions. Yeah. Okay, I have one more nerd question, because I'm very curious, and we can spend a couple of minutes on this and then I'll jump to what's next. You mentioned performance. I'm always interested in this question of tradeoffs, so why doesn't everyone use this in every request? So I understand that if you are using Geo Key Manager, in some cases, especially if you're making a connection where that key is not located, that can add some latency. But obviously we want to minimize that latency. We want to have the highest reliability. So I'm curious, can you say more about these tradeoffs and how you worked to reduce them or minimize the tradeoffs that need to be made? Yeah. So one big component is, again, session keys, or session resumption, right, with TLS. So once you've made this initial connection, you don't have to do it for a very long time for a given user. So the amortized cost is pretty low. The second crazy thing that we did was we added a layer of indirection, because, you know, in computer science, the best way to solve all problems is to just add a layer of indirection. So that's an indirection, right? Yes, exactly. So what we do is we cache this layer of indirection. Basically, we have a key encryption key, and we encrypt this key encryption key with the policy.
But the key encryption key itself, think of it kind of like hybrid encryption, or like a symmetric key. And we use that key to encrypt the actual private keys for a given policy. So once someone decrypts this particular key, they just cache it for a while until, you know, a restart or reboot or whatever. And you can just continue using that to amortize. You don't have to do as many decryptions; you just have to do it once for every policy. Once per policy per... Is it per data center, per session? Is that the way to think about it? Per machine. Per machine. Per session. Addressing this amortization point is really important, because if you think about a user interacting with a web service, they establish a connection, but then they might make many, many, many requests over the lifetime of that connection, which could be quite long-lived. And so maybe it's just that first request. And then the key latency factor here, I imagine, is going over the network and finishing the TLS connection using keyless, where you might have to go a very far geographic distance and you're limited by the speed of light, right, to do that round trip. Right. And then, of course, if you have the key locally, it will be very fast, even, because of this caching. Yeah. Because you can cache it across users, right? Because there aren't that many variants of policies; most people just want, like, US or Europe or Japan. There's not that much diversity. Yeah. And it's also per customer, so this kind of limits how many of these policies we can have. It's not per end user, per eyeball. Right. Yeah. Per customer. Right. That makes a lot of sense. So when it's local, it's pretty fast.
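
The caching of the key encryption key (KEK) can be sketched like this. The `Machine` class, the policy labels, and the use of a hash as a stand-in for the unwrapped KEK are all assumptions for illustration; the point is only that the expensive policy-based decryption runs once per policy per machine, and every later handshake reuses the cached result until the process restarts.

```python
import hashlib


class Machine:
    """One edge machine holding an in-memory cache of unwrapped KEKs."""

    def __init__(self):
        self._kek_cache = {}       # policy label -> unwrapped KEK bytes
        self.slow_decryptions = 0  # how many expensive unwraps have run

    def _decrypt_kek(self, policy):
        # Placeholder for the expensive attribute-based decryption.
        self.slow_decryptions += 1
        return hashlib.sha256(("kek:" + policy).encode()).digest()

    def kek_for(self, policy):
        # Unwrap at most once per policy; later calls hit the cache.
        if policy not in self._kek_cache:
            self._kek_cache[policy] = self._decrypt_kek(policy)
        return self._kek_cache[policy]

    def restart(self):
        # The cache lives in memory only, so a restart empties it.
        self._kek_cache.clear()


machine = Machine()
policies = ["US", "EU", "JP"]  # hypothetical policy labels
for i in range(1000):          # a thousand handshakes across three policies
    machine.kek_for(policies[i % 3])

print(machine.slow_decryptions)
```

Because the cache is keyed by policy rather than by end user, and there are few distinct policies, the number of slow decryptions stays small no matter how many handshakes arrive.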
We kind of minimize how often we need to go over the network. Were there other parts of the system that you had to optimize besides those, that were surprising? Yeah. So the keyless part, right? Like making a connection to another data center. So there, we do a little bit of... because you can't just have everyone connect to everything else; otherwise it's too many connections. So you do a little bit of load balancing there, where a bunch of machines all talk to one, and so on. So we had a little bit of an issue in how we were sending these requests over. Basically, they were all getting serialized instead of being asynchronous. And part of this was related to the keyless protocol. So we wrote the keyless protocol before gRPC, and, yeah, so this was like 2013 or something. So okay. Yeah. So the protocol itself supported bidirectional streaming and all of these things, but the way that the Go RPC library works, it does not support that particular thing. So we had this sort of mismatch between the two, and we had just put a lock on it, you know. And yeah, so there were some concurrency issues that we hadn't quite solved. And again, these are super subtle, so it took so much tracing in order to figure some of this out. Yeah, it's super interesting how it sounds like you can describe this problem so quickly. Like, okay, we need to keep keys only in certain data centers. And then that one-sentence description turns into novel cryptography and these gnarly distributed systems problems. Yeah, pretty cool. Incredible. Yeah. Thanks for walking us through all that. So we've got about 3 minutes left. Dina, I would love to know... Well, actually, two questions quickly. How do we start using this? Let's start with that. What do I need to do to start using Geo Key Manager? Yeah.
So we launched the blog post today about the new version of Geo Key Manager. So if you go to the blog post, there's a link to the form at the top and at the bottom of the blog. When you fill it out, it'll get sent to us and we can include you in the closed beta for this product. Awesome. Yes. Please get in touch. We love to talk to folks who are interested in using this. Thank you. And then, yeah, what's coming next? We did v2; have you already thought about what's next? Yeah. So actually, on the topic of performance, one of the things that we want to start doing is to show customers the performance impact of the changes that they're making. So let's say I choose to only store my keys in Canada. What is the global performance impact of that decision? What is the impact going to be like on the customer base in other regions? Also, Geo Key Manager is only supported right now for certificates that you upload. So we want to also support this for certificates managed through Cloudflare, so ones issued through Advanced Certificate Manager. And another big thing is we want to make this easy to configure. So right now it's API only, but we want to bring a nice shiny UI to this so that it's easy for customers to use. But actually, while we have a quick minute, I have a question for you, Jon. So what's happening with data localization? I'm so glad you asked. So yeah, I'm glad you mentioned the ease-of-use theme. So for folks who don't know, we launched the Data Localization Suite I think about two years ago now, and last year we had a big update with what we called the Customer Metadata Boundary, which is all about keeping your logs and your analytics data stored in a region. And that kind of joins Geo Key Manager and Regional Services. And just like you mentioned you're working on a UI to make it easier to manage Geo Key Manager, we are working on user interfaces in the dashboard to very easily manage Regional Services and to manage the Metadata Boundary as well.
And another common theme: bring it to more places. So the way to think about this is: have these products available in more regions for people, reduce these kinds of tradeoffs like we were talking about with latency, make sure our whole product portfolio is really compatible across these products, and then just make them easy to use. Right. And I love what you're doing there with the certificate manager, because it's great that we have it for uploaded certificates, but a lot of customers don't want to think about it. They just want to push a button and have their certs in one place. We're going to do the same thing with the Metadata Boundary and the same with Regional Services as well. So yeah, stay tuned. We hope to be debuting those soon, next year. Thanks so much, Dina and Tanya. It was a pleasure. Thanks for answering my questions and nerding out. And see everyone again soon on a future Cloudflare TV segment. Thanks for watching.