Originally aired on May 7, 2021 @ 6:00 PM - 6:30 PM EDT
Join Cloudflare's Head of Product, Jen Taylor and Head of Engineering, Usman Muzaffar, for a quick recap of everything that shipped in the last week. Covers both new features and enhancements on Cloudflare products and the technology under the hood.
Hi, I'm Jen Taylor. Welcome to another installment, or an episode, of Latest from Product and Eng. I forgot what I was talking about. Usman. Hi Jen. It's nice to see you. I'm Usman Muzaffar. I'm Cloudflare's head of engineering. I'm really thrilled today because we've got two of our favorite teammates joining us, Niels and Thomas. Niels, say hi. How long have you been at Cloudflare and what do you do? Hi everybody. I've been here for almost two years at this point. I'm the PM on IAM, which is Identity and Access Management, and I kind of do my best to keep the wheels on the bus while our team is running along. I'm Tom Hill. I'm the engineering manager for the IAM team. Been here three and a half years, I think. Yeah, I was about to say like it'll turn four this year. I remember hiring you, Thomas. That's right. I will graduate six months from now, and I'm the one building the wheels that Niels tries to keep on the bus with the rest of the team. So I'll start. The name of this team is IAM, and we were just joking right in the green room before. This is an acronym that I see everywhere, and let's even say 10 years ago, I'd never seen it before. All of a sudden, it became this critical thing that every software engineer and software engineering leader needs to know about. Niels, what is IAM? What does it stand for? Why is this so important? Why do we have a whole team at Cloudflare devoted to this topic? Right. So IAM stands for Identity and Access Management, and so that is a piece of software trying to figure out who is this person and what should they be allowed to do. Is this person really who they say they are? Should they be allowed to operate this feature? How do we know? And so this has really been thrust into the forefront now with the recent coronavirus pandemic because everybody is working from home.
So you have, all of a sudden, someone comes into your system and says that, I am Usman, and I want to take our whole website down, and someone has to decide, is this really Usman? Should he be allowed? Maybe that's not me. I don't think I would say that. It seems unlikely, but we can't be sure. So whose job is it to figure out whether we can trust this entity that is asking to do something, and that is Identity and Access Management. Now, OK, hold on a second. So when we think about Identity and Access Management, you're really thinking about how we enable our users to access our product and our resources? Yeah, that's a good point. So authorization and authentication and IAM, those terms get tossed around a lot, and there's kind of two separate markets here. There's authorization and authentication. There's IAM for employees, which is, I have someone at home, and they're dialing into my system. Dialing, and you can tell I'm old. There's someone at home, and they've logged into our system, and we're trying to figure out, what should our employee be allowed to do? Should they have access to this tool or not? And then there's the other side of that, which is, I'm a developer, and I have all these users, and I want to know which features should, you know, is this user really who they say they are, and which features should they have access to? Have they paid for premium? Should they have access to those premium features? Can they log in on the dashboard? Can they use my API? What pieces of my product do they have access to? And so I think we've heard more recently about IAM and authorization and authentication for employees. You know, that's where VPNs play, where Okta plays, and of course, where Cloudflare Access plays, and that's become a really big concern with everybody working from home.
But there's also IAM for customers, which is where our team focuses a little bit more, and we try and figure out, okay, for a given Cloudflare customer, for this given user, how do we know that they are who they say they are, and what features inside of Cloudflare should they have access to? Well, and if I step back and think about it, I mean, like, you know, we were just saying, you know, Thomas has been here for three and a half years. Usman and I have been here for four or five. The world at Cloudflare today, from a product and user perspective, looks very different than it did when we arrived or even before that, right? Like, you know, when we started, you know, the whole product offering was come to Cloudflare, sign up for free, and that worked with, you know, kind of one person working on a website. You know, it's shifted now, right? We still have those customers, but we have, you know, some global 1,000 organizations that are running their web properties through our services with teams of hundreds of people. How has that shifted the work, or how has that sort of driven your roadmap and the types of things that you've had to do? Right. It's this amazing story about, you know, kind of about the growth of our customers. You know, when Cloudflare first started, we assumed, like, there is just a user, and they have a website, and that's it. And we didn't even consider the idea that, well, maybe there might be multiple people who want to have access to edit this website. Maybe an account should be shared in some way. And I think that was one of the first features that Thomas ended up working on. I mean, it's a pretty massive project, if the stories I've heard can be believed. What was the central challenge, Thomas? Shared account access was one of the landmark, like, it was literally this continent that we were trying to conquer in 2018, 20, you know, going into 2019. 2018, yeah. Yeah. Let's talk about that. 
What was the Hello World, even the simplest Hello World use case, that you couldn't do in the early era? And now it's something you can absolutely- That's right. Elegantly do. Like Niels said, the original user management, the perspective of managing users for a long time, almost half of the company's life at that point, more than, was there is a user, and then they own, contain, or manage directly the things that they're associated with. And even between the beginning and when I joined, there was a need to support multi-user accounts. And account was a noun that didn't exist in the database. So there were ways in which you had to simulate or synthesize an account and a membership to it, while at the same time, not breaking the database or the way in which, you know, people had been using the system and they, you know, they have their IDs that they've been storing and tracking and you can't break those, you can't change those. So how do you get around all of that? And it was, it was an example of how the internal identity and access management system at Cloudflare grows with not only the company that's building it, but the customers that it's serving in order to support their needs. So the concept of the account needed to be created. The membership needed to be created. What dimensions of membership can you have? And there was a coarse set of permissions that you could grant that were packaged into roles. So all of these things became words that we started throwing around. And it seems like common sense, right? You would normally think like, of course, we would want to link up this and like that. But then when you start to think about it, the things that you sign up for as a person on the web don't often have permissions or roles or management, unless they're there for large companies with complicated expectations. And so that's why this older system was able to last for so long. So when I joined, it had already been in development for a long time.
The biggest problem was doing a no downtime migration from one system that had a lot of code and a lot of database expectations around the way things worked to a brand new one that had been beta tested. And we even had to make sure the UI was working correctly. And what does this mean for errors on the API? There might be access errors that had never existed before. So now we have to introduce- Correctly. And that's the system functioning correctly, right? Yes, exactly. Yes. So there are now failures by design that didn't exist before, which is not only us having to teach ourselves what the new system is like, but also having to figure out how to carefully educate users. Okay, you're going to make this call. This may have worked for you in the past, but now you're part of an account with membership, with a restriction, and it might fail, and you might not have code that deals with that failure. And we did go through some periods where we would get pinged on chat from the system and it would be, oh, the database is struggling. Oh, why? Oh, it's because someone had a script that went haywire because it didn't know how to deal with not having access. Do you know who I am? I've been here for so long. It's so true though, right? Because I was a Cloudflare customer myself for years, and it was my startup. And the day we created a Cloudflare account, I looked over to my co-founder. I was like, which one of us should have it? And he's like, oh, just create an alias. We'll share it. We both need to be able to log in. Don't type your name or my name. Let's just go to Google, create an alias, and we'll use that. We'll call it cf@startup.com. And we'll just share the password in our 1Password or LastPass or whatever. And that'll be great. And it was great because there was only two of us who ever needed access to it. So Niels, what does it look like when that falls apart?
Unfortunately, my startup never made it to the point where that became- Your startup fell apart before the authentication did. I started working at Cloudflare before I ran into the problem of us using a shared account, literally a shared username and password. But if my little startup had turned into a gigantic company, what is the first problem where customers are like, wait a minute, this isn't going to cut it. We can't just get away with a shared username and password. What are the problems they need to actually fix that prompted us to build all this tech? Well, maybe you and your co-founder start thinking about this and start thinking that maybe Usman, our very technical co-founder, should not have the same permissions to change everything inside Cloudflare that, I'm going to make assumptions about, your completely non-technical, sales-driven co-founder should have. Maybe you would be totally happy having him go inside Cloudflare and updating some content, purging the cache, but maybe he shouldn't be able to go there and change DNS settings, something like that. And so maybe you want to say, okay, let's both access the Cloudflare account. These are the features inside the Cloudflare account that you can safely change to your heart's content. And then I want to make sure that you don't accidentally touch any of these other features that you aren't interested in anyway. So let's just keep those behind the curtain. You won't even have to see them. You won't have to think about them. And that doesn't work when you just have a single shared login and a shared password. You need roles and permissions and policies. Well, and the other thing is in the world in which the startup becomes wildly successful, you hire more people, right? And you want to be able to... Your tech team expands, you become rich and famous, and you don't have time to administer your Cloudflare domain. And so you need to delegate that permission to some of these other people you hired.
So you need to find a way to invite them into your account, right? And then once you find a way to invite them into your account, you need to figure out how you give them the right roles so that they don't do something silly, like accidentally delete your main DNS and take you offline. That's right. Role is the next noun we have to invent, right? So that's another word that didn't exist. And now it's this thing. So what's a role, Niels? And then I'm going to ask Thomas, inside, under the covers, what the heck's a role? How did we even think about this? Right. So a role is a set of permissions. That's how we think about it. And so, of course, a standard role would be something like a super admin. You create the Cloudflare account. You are the super admin, and you can choose who to invite to it. You can decide what permissions everybody should have. And that's kind of one end of the spectrum. That's maximum permission. You get everything. And there's the other end of the spectrum, which is maybe someone who just manages DNS settings or can come in and only purge the cache when you start. That might be something that perhaps a content manager working on the blog might do, that they make an edit to a blog post. They want it to go out right away. And then they notice that Cloudflare is still serving a cached version of that blog post. So they just want to go into Cloudflare and tell it, hey, refresh and get the latest content. And that's all you want them to be able to do inside Cloudflare. You don't want them to go in and change the DNS settings or accidentally boot you out of the account. And so that's something that different roles and different permissions allow you to do within your account. So Thomas, how do you do that? Just on first blush, that feels like this is not just login and access.
I need to somehow associate every possible action that you could do in Cloudflare with some kind of notion of, OK, you can look at the DNS table, but you can't edit it, or you can look at the cache settings, but you can't change them. How does this work? Yeah, so the original permission system was pretty coarse. We would sort of identify the objects that we felt were worth giving a permission to, like zone or DNS, and then just simply say read or edit. And as the company grew, more product engineering teams grew and they lived in their own code bases and they had their own opinions about what permissions they would want to use or interact with. But I found that chiefly they would simply say, you know what, zone edit, that's good enough for me. So we ended up with a lot of permission reuse, which is fine. I mean, the implication of permissions is a normal thing. So as we would bag up permissions into roles, we definitely wanted to make sure we were explicit about being as granular as we could across the system as we understood it when we created the new system or the new IAM backend, which meant that we didn't want a role that just had like zone edit and account edit. We wanted like record create, record read, record update, record delete. We wanted worker read, worker route, load balancer origin. We wanted to give all the teams the opportunities to express themselves in ways that they didn't probably really consider up until that point necessary. But at the same time, we didn't really want to require them to have to go through a lot of red tape to describe or say whatever, or even for a user, because as granular as you become on the core system, you're then asking the user to engage and consider all of this stuff, which could be nonsense. Yeah, that's right. So for us, a permission attempts to be as granular as it makes sense to be.
But then we also say you can take this permission and say it is implied by another permission, which sort of goes up this tree, if you will, of objects, because the way that our objects work is like, you know, an account has zones. That's a universal truth for us. And then a zone will have a DNS record. You know, so you have these containing, these sort of containing sets of things, this tree, if you will. And so you're able to say, you know, DNS edit is implied by zone edit. So that maintains that taxonomy here of how these actions go. And the interesting thing is you're in the meta taxonomy business because you're providing a framework where your teammates, product managers and engineering managers, are the ones who are actually going to describe what are the operations that make sense for that corner of the... Yeah, what's been kind of surprising to me, I guess this is an example of the manner in which my brain is broken. But like, the ability to finally put names to these faces, to put something there that wasn't there before, was genuinely like interesting and exciting, to be able to give this option to all of the teams, and thus also all of the users, to say, you can restrict as much as you want, hopefully in a way that isn't very complicated. So that you can provide the ability to say something like, yes, I'd invite Usman to my account. No, you'd not be able to edit anything. Or, you know, only the ops at Usman.com. By the way, if you ever invite me to an account, that's a good default. But then you're kind of victims of your own success, right? I mean, like, so now, right, you got Usman.com has become this wildly successful startup. Usman is rich and famous. He's hired a bunch of employees. And he's got all of these employees invited into his account. And he's got all of these different properties that live within his Cloudflare account that they all have access to.
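To make the "DNS edit is implied by zone edit" idea concrete, here is a minimal Python sketch of a role as a set of permissions, checked against a tree of implied permissions. The permission names, the role names, and the `is_allowed` helper are all hypothetical, made up for illustration; nothing here is taken from Cloudflare's actual schema or code.

```python
# Hypothetical permission names. Each key is a granular permission; the value
# lists the broader permissions that imply it (walking up the object tree:
# account -> zone -> DNS record).
IMPLIED_BY = {
    "dns_record.read": ["dns_record.edit"],
    "dns_record.edit": ["zone.edit"],
    "cache.purge": ["zone.edit"],
    "zone.read": ["zone.edit"],
    "zone.edit": ["account.edit"],
}

# A role is just a named set of permissions (role names are also hypothetical).
ROLES = {
    "super_admin": {"account.edit"},
    "dns_manager": {"dns_record.edit"},
    "cache_purger": {"cache.purge"},
}


def is_allowed(granted: set[str], needed: str) -> bool:
    """Return True if `needed` is granted directly or implied by a grant."""
    seen = set()
    stack = [needed]
    while stack:
        perm = stack.pop()
        if perm in granted:
            return True
        if perm in seen:
            continue
        seen.add(perm)
        # Climb to every broader permission that implies this one.
        stack.extend(IMPLIED_BY.get(perm, ()))
    return False


if __name__ == "__main__":
    print(is_allowed(ROLES["super_admin"], "dns_record.read"))   # True
    print(is_allowed(ROLES["cache_purger"], "dns_record.edit"))  # False
```

The point of the design is that product teams can define granular permissions, while a user who only ever grants something broad like zone edit still gets everything beneath it.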
Even if you have like roles defined, like they're still looking around and messing with things that like maybe they shouldn't mess with, like Niels, what do we do now? Right. So as the company grows, you start realizing that, oh, maybe even something that is as broad as cache purge is still too broad for it. Because now we don't just have a marketing blog, we have the US marketing blog and the Europe marketing blog and the EMEA marketing. And the company that we just acquired, that's a completely different animal altogether. And nobody in the parent organization should be messing with those settings yet. Yeah. What if you're a dev shop and you're managing properties for your 50 different customers who are completely different companies and all of them are rolled up into your Cloudflare account and all of them are saying, hey, can you just give me access to the Cloudflare account so that I can purge the cache when I make an edit to the post on my blog or something like that? And you're like, well, I can't do that because if I give you access to the Cloudflare account, you are going to have access to the websites of all of my customers or of all the other teams inside my company. And what if you make the wrong setting and you blow up everything that they're working on? Nobody's going to be happy. And we should also remember that this same system is also what powers our API. So it's not just about users hitting a UI, it's about the developer teams inside those orgs using automated systems that can make, to really mess up takes a computer, right? So if you think about how wide the surface area for mistakes is, as soon as you think about robots making those mistakes at scale, it can be really, really, you can suddenly see why customers care a great deal about this problem. Right. And it's so sad for us because what we have is we have customers who are telling us, hey, we want more people inside our Cloudflare account. We want to give people more power.
We want to empower our users and get more people inside Cloudflare using the amazing tools that you've built, but we can't do it safely yet. We need you to build guardrails so that we can give Cloudflare to more people inside our org and let them get in there and interact with their tools safely. What we want is to not only have roles to limit what they're allowed to do, but also what we're calling scopes, which is where they're allowed to do it, which zones inside the account. Now you got who, what, where. Right. And so, after overseeing one major re-architecture of the Cloudflare infrastructure, Thomas gets to come in for round two. Yeah. Why not do it again? Fantastic. So now we're calling in the construction crews. Thomas, what is this going to look like, right? When you add this, the where, and like, you know, what's the work there? I could talk about this for like an hour. This is the second dimension to access control that doesn't really get baked into a lot of systems or expressed very well, this resource scoping. And it's because it's difficult and contextual and specific usually, because the way we do it doesn't necessarily map to the way that a hospital or a university or just another startup would do it; they might never need to scope by resource or ever be asked by one of their customers to do so. But in order to think about how do you scope access, you have to, again, structure the things that you're talking about in a way that lets you scope how far into that tree you want to go, where are you starting from in that tree to begin with. So in order for us to be able to pull that off, we had to add, we had to basically teach our authorization system a little bit about the relationships between the resources that we're managing, just like we did when we taught it about the permissions, the permissions tree. These things are pretty linked, right? So the way in which we expressed permissions was in that hierarchy of resource.
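The resource scoping Thomas is describing, where a policy names a place in the account and zone tree and, as he goes on to explain, a blanket statement can have holes poked in it for specific zones, might look roughly like the Python sketch below. Everything in it is an assumption for illustration: the `account/<id>/zone/<id>` path format, the `Statement` shape, the IDs, and the rule that statements are evaluated in order with the last match winning. The conversation does not spell out Cloudflare's real policy format or evaluation rules.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Statement:
    effect: str      # "allow" or "deny"
    permission: str  # what may be done, e.g. "zone.edit" (hypothetical name)
    resource: str    # where: a resource path, optionally ending in a wildcard


def can(statements: list[Statement], permission: str, resource: str) -> bool:
    """Evaluate statements one after another; the last matching one decides.

    With no matching statement the answer is False (deny by default).
    """
    decision = False
    for stmt in statements:
        if stmt.permission == permission and fnmatchcase(resource, stmt.resource):
            decision = stmt.effect == "allow"
    return decision


# "Niels can edit no zones in this account, except he can edit zone 123."
NIELS = [
    Statement("deny", "zone.edit", "account/abc/zone/*"),
    Statement("allow", "zone.edit", "account/abc/zone/123"),
]

if __name__ == "__main__":
    print(can(NIELS, "zone.edit", "account/abc/zone/123"))  # True
    print(can(NIELS, "zone.edit", "account/abc/zone/456"))  # False
```

Because the blanket deny uses a wildcard, a zone added to the account later is covered without touching the policy; only the exceptions are ever edited.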
And then for each resource, we'd say, here are the things you can do. We took that tree, and then we sort of just, instead of putting permissions at the end, we put identifiers at the end. And so when you tell the authorization system something like Niels can only edit, Niels can edit no zones in this account, except he can edit zone one, two, three. So those two statements, evaluated one after another, allow the restriction that I'm looking for, so that I never have to do maintenance to a policy going forward. I can just simply have a blanket statement and then poke holes in it as I need. And we were able to do that because you can express these, I don't want to call them queries, but they're definitely like shorthand expressions of resources in the system. Like account with a wildcard at the end of it, you can think of it like that, or with an ID instead. It's an old problem with a new twist, right? It's access control. It's been there for a long time in the NT permission system, but having to build that in a way that makes sense for Cloudflare, and then to do it performantly. It's got to be fast. This can't slow down the things. Thomas mentioned access a couple of times. One of the other things we worry about is making sure that we support the most modern, more sophisticated ways to authenticate. Another acronym that I think most people have heard of now is 2FA. Niels, what is 2FA in the context of the IAM team and our mission, and how did that intersect with all this? So 2FA stands for two-factor authentication. Everybody is used to having a username and a password. And so that means in order to know that you are you, you want to know who you claim you are, and then some piece of proof that you are actually who you say you are. And usually, for years, that piece of proof has been a password, something that you know. But passwords get leaked. Sometimes people reuse passwords.
Maybe someone phishes your password, and now they're on the other side of the world, and they know your username, and they know your password, and they want to get into an account. So one thing that we can do to make things more secure is require you not only to have the username and the password, but also something physical. So not just something you know, but also something you have, like your phone, for example, or a hardware key. And so even if somebody gets your password, they're not going to have that second piece of hardware that you have with you, and we're not going to let them into your Cloudflare account. Once you turn on 2FA, we won't let you into your Cloudflare account unless you know both the password and you have that physical device with you to prove that you are who you say you are. But here's the thing. By the time we started rolling out 2FA, we already had millions of customers. How do you wake up one morning and roll out 2FA in a scalable and performant way to a huge base of customers? What does that look like? Thomas, do you want to take this? I mean, yeah, you don't wake up and do it. Well, yeah, I'm a product manager, so that's exactly how it works. I wake up and decide. They wake up and write it down in a PRD. That's what they do. They write it down, so it's true. Well, the first step is a lot of reading because the keys themselves interact in a very specific way with the system that they're a part of. Even if it's a virtual key on your phone, there's a way in which it works algorithmically. And then if there's a hardware key or something you have to click or touch, it's doing something very specific as well. And all of that then has to get put into the browser, which itself has APIs and expectations. And then the browser is going to talk to the server in specific ways that you have to consider. And then it's going to have expectations down the line, right? So you can recover it.
So there's a recovery system as well based on more expectations. So you learn a lot about certificate management and these state machines that exist in these multiple places and that you all have to line them up just right in order for the account to get unlocked and so you can continue to access your account. And then past that, it's very similar to how you would carefully manage other credentials that you want to keep track of securely, like a password or recovery question. You pay attention to how you're storing that securely in the database so that you don't have to worry about it getting leaked. You want to make sure that it's stored encrypted and managed securely. And then you have to teach the users, okay, you're in this new world about these multiple factors of authorization for access, authentication, excuse me. And if you lose this, you're in a world of hurt. So here's - Actually, I was going to swivel the chair back to Niels. So now, so it's something I know is something I have. What if I lose the thing I have? Now what, Niels? Yeah. I mean, the worst part of this is we can guarantee that you will lose this thing that you have because every two years, people upgrade their phone. And the standard 2FA code generator program on your phone, Google Auth, does not bring your codes over with you to your new phone. So people always discover at the worst possible time that, oh, I can't log back into my account because I have my code on my old phone, which I sent back to the company, or I traded in for my new phone, and now I'm locked out. And so really the solution to this is you can't say- We don't want people to have just one second piece of proof. We want you to have your username. We want you to know your password. But then we also want you to have a collection of hardware devices that you keep with you to prove that you are who you say you are. So maybe it's not just this code that your phone generates, but you can also have a hardware key. 
Maybe you've got a few hardware keys. You've got one that you keep at work, one that you keep at home, one that you keep at your parents' house. Maybe you've got some backup codes, which are some codes that we give you and you write down. You write them down on a piece of paper. You can use them only one time, but if everything else fails, you can use one of those codes to get back into your account. And we have to implement all that infrastructure too, right? You can build all of that and a management interface so that people can manage all of those different pieces. Yeah, that's great. Guys, I can't believe how fast- The time always goes fast, but with talking about this stuff, it's endless. I think we really should have just immediately scheduled that follow-on session, but Niels and Thomas, it was so awesome to have you on board and on the show and talk about all the awesome stuff here. I know there's a whole bunch of other stuff, so we're just scratching the surface of this, but thank you both for joining. Jen, always great seeing you. Thank you. Likewise. Great to see you. And we will see everybody at the next episode of Latest from Product and Eng. Thanks for watching, everybody. Bye. Thanks, all.