Consul in Core: Dynamic Service Discovery in a Multi-Core World
Presented by: Neerav Kumar, Yury Evtikhov
Originally aired on July 15, 2020 @ 11:30 AM - 12:00 PM EDT
How Consul helps Cloudflare Resilience with Service Discovery
English
Transcript (Beta)
This episode is presented by Neerav Kumar and Yury Evtikhov. Hello everyone, this is Neerav.
I'm a Core Systems Reliability Engineer here at Cloudflare Austin, and joining me from across the Atlantic Ocean today is Yury, who is an SRE in Core, too.
He's based in our London office, and today both of us are going to be talking about Core data centers, how we are using Consul in our Core data centers, and how we are improving Consul, too.
Yury, do you want to go?
Yeah, absolutely. Thank you, Neerav. I'm Yury, a Core SRE from London, and as Neerav mentioned, we both work on the Core SRE team.
Neerav, do you want to tell our viewers what the difference is between Core SREs and other SREs, such as Edge SREs?
Oh, awesome. That's a great question. So when I first joined Cloudflare, I actually joined as an Edge SRE.
So I have worked for three years as an Edge SRE and for almost two years now as a Core SRE.
So that puts me in a good position to answer this question, I think.
And at Cloudflare, Edge and Core data centers are very different.
So our Edge data centers are where we serve our customers; our content delivery network, our network security, DDoS prevention, etc.
all happen at the Edge. And the control plane and the data plane for these services live in our Core data centers.
So whenever I am asked this question in an interview, I like to give this analogy between a farm and a zoo.
So the Edge is like a farm, with far fewer machine types compared to Core.
And with each machine type, we have a defined workload.
And things are simpler, while Core is like a zoo, because we have data services like Kafka and ClickHouse, and we have the control plane services.
So Core is like a zoo with a lot of animals.
But as we have been scaling up our Core data centers, it looks more like a wildlife sanctuary.
Which brings us to why we use Consul.
Yury, do you want to explain why we chose Consul? Yeah, absolutely.
So just a quick overview. As services in our Core data centers started to take off, we got to the point where we thought, okay, we have a lot of hardware and we basically need something to do service discovery.
We were looking at multiple products available out there, and we also looked into whether we should develop something internal, in-house.
But that wasn't really an option, because we would have ended up spending far more effort than we wanted to.
So the reason for Consul was primarily service discovery, but we had other ideas in mind as well.
So for service discovery, console is one of the best products probably.
It's very easy to use, very like nice API and pretty much flexible.
Yeah, but apart from that, it brings us like a few other different, other differences.
For example, if we look at etcd, etcd is mostly a distributed log and key-value store, while Consul, on top of the same thing, provides health checks, including external health checks that you can run from your cloud or from your external data centers.
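As a concrete illustration of that registration-plus-health-check model, here is a minimal sketch using Consul's official Go client (github.com/hashicorp/consul/api). The service name, port, and health check URL are hypothetical and only for illustration.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local Consul agent (default address 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Register a service together with an HTTP health check; Consul only
	// returns this instance from discovery queries while the check passes.
	err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		Name: "web", // hypothetical service name
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:8080/health", // hypothetical endpoint
			Interval: "10s",
			Timeout:  "2s",
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```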
Since Neerav has more experience in Core than myself, as I joined fairly recently,
maybe Neerav can give us a more detailed overview of the background of Consul in Core.
Of course. And coming back to my point about Core data centers being a wildlife sanctuary:
So we have a lot of machine types and a lot of services running at Core.
And as I said, together they make up the control plane and the data plane of all the services that we offer at Cloudflare.
So service discovery is really important, because all these services are highly interdependent and they talk to each other a lot.
And that's why we decided to look for a service discovery solution that satisfied some of our needs.
One of our biggest needs was a service discovery solution that was not tied to any particular platform and is independent in itself, which Consul is.
So for example, we can run services on bare metal servers, or we can run services in Kubernetes.
And we can still have them talking to each other and discovering each other using Consul, since Consul serves as a neutral framework between them.
Also, Consul allows us to do inter-data-center service discovery really well.
And as we are building out these Core data centers, we are trying to make our services more resilient and more redundant.
And Consul fits there perfectly, since it has WAN federation, which allows us to do inter-data-center service discovery over DNS or HTTPS.
Yeah. I remember one of the points, when we were looking at different products for discovery, was ease of use.
And one of the questions was DNS service discovery versus some sort of HTTP API service discovery.
So in Core, we provide both methods with Consul, since internally Consul's view of registered services is the same for DNS and HTTP.
So you can query all the services over either DNS or HTTP.
And we provide both solutions to our customers, who in this case are our engineering teams.
But what we found was that a lot of our services already use HTTP host names to communicate with each other.
And since DNS sits a layer below HTTP, we can implement service discovery there without applications having to change a lot of things.
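To illustrate the point that DNS and HTTP are two views onto the same registry, here is a small sketch that reuses the Go client from the earlier example; the "web" service name is hypothetical, and the github.com/hashicorp/consul/api and fmt imports are assumed.

```go
// discoverWeb queries healthy instances of the hypothetical "web" service
// over Consul's HTTP API. Over DNS, the same instances would resolve as
// web.service.consul, so applications that already speak HTTP to a
// hostname can keep doing so without code changes.
func discoverWeb(client *api.Client) error {
	entries, _, err := client.Health().Service("web", "", true, nil)
	if err != nil {
		return err
	}
	for _, e := range entries {
		fmt.Printf("%s:%d\n", e.Service.Address, e.Service.Port)
	}
	return nil
}
```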
A good example of this is how we did service discovery in Kubernetes: to register our Kubernetes services with Consul, we took the external DNS controller for Kubernetes and forked it to register services directly from the Kubernetes API to Consul.
That way, we were able to speed up the adoption of Consul by our engineering teams.
And as I said, for more complex use cases, we still advertise an HTTPS endpoint for Consul to our engineering teams, which they can use if they are not satisfied with DNS.
Right, right.
Yeah. So to summarize, it's much easier to start using service discovery over DNS, until things get more complicated, I guess.
Yeah. Yury, do you want to tell us about some Consul features that you like personally?
Oh, well, yeah. So while we were solving service discovery and taking our first steps with Consul,
the first thing, right from the beginning, was that Consul has an ACL system, which basically allows you to manage different levels of access for different applications and different Consul agents.
Consul consists of agents, which can be servers or clients.
Basically, every client agent can authenticate with the Consul servers, each having a different level of access.
So for example, one agent can write to the KV store while another agent may not be allowed to, and the same goes for applications, which usually talk to agents with specific tokens.
So security is bound to the token.
An application or service has a security token, which it presents during its communication with the Consul system.
And policies limit it to specific actions, like reading from or writing to the service discovery subsystem, or reading and writing to the KV store; basically, you can say that a given service can discover other services, among multiple other options.
It's prefix-based.
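As a rough sketch of that token-and-policy model with the Go client: the policy name, service prefix, and KV prefix below are hypothetical, not the actual prefixes we use, and the client is the one constructed in the earlier sketch.

```go
// createServiceToken creates a policy that may only register services named
// "billing-*" and touch KV keys under "billing/", then mints a token bound
// to that policy for the service to use.
func createServiceToken(client *api.Client) (*api.ACLToken, error) {
	policy, _, err := client.ACL().PolicyCreate(&api.ACLPolicy{
		Name: "billing-service", // hypothetical policy name
		Rules: `
service_prefix "billing-" { policy = "write" }
key_prefix     "billing/" { policy = "write" }
`,
	}, nil)
	if err != nil {
		return nil, err
	}

	token, _, err := client.ACL().TokenCreate(&api.ACLToken{
		Description: "token for the billing service",
		Policies:    []*api.ACLTokenPolicyLink{{ID: policy.ID}},
	}, nil)
	return token, err
}
```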
So it's really, really flexible. Yeah. Another thing I was recently thinking of, which we're going to use a lot, is distributed locks, which we are thinking of using to solve problems such as canary deployments or staged rollouts.
So for example, if you want to have a staged rollout of a service, or an upgrade, or something like that, we can basically say, okay, let's lock this set of racks and allow only those racks to be upgraded, and no other service should be upgraded at this time.
So this is a good example where we can actually use distributed locks.
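Here is a minimal sketch of what such a rack-upgrade lock could look like with the Go client's Lock helper; the key layout and session TTL are hypothetical assumptions, not our actual procedure.

```go
// upgradeRack takes a distributed lock on a per-rack key before upgrading,
// so only the holder may proceed; anyone else attempting the same rack
// blocks until the lock is released.
func upgradeRack(client *api.Client, rack string) error {
	lock, err := client.LockOpts(&api.LockOptions{
		Key:        "locks/rack-upgrade/" + rack, // hypothetical key layout
		SessionTTL: "10m",                        // lock is released if the holder dies
	})
	if err != nil {
		return err
	}

	lostCh, err := lock.Lock(nil) // blocks until the lock is acquired
	if err != nil {
		return err
	}
	defer lock.Unlock()

	// ... perform the upgrade here, aborting if lostCh is closed (lock lost) ...
	_ = lostCh
	return nil
}
```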
That's pretty much it for what I've experienced with Consul in terms of features so far.
Maybe you have something on your mind.
What was your experience working with Consul and its features?
Yeah, I'm definitely a big fan of the ACL system too. One of the big concerns that we had while we were designing the system was that, since the Consul cluster was going to be a global cluster across our Core data centers,
we didn't want one service owner to be able to mess up the service for someone else.
So that's where the ACL system really helped us.
And as you said, we have prefix tokens for everything, and no client is given a global token, so they cannot mess things up.
They can only register services or access keys that we have given them permissions to via the prefix.
But another thing that I was really impressed by was prepared queries in Consul, which allow us to implement different kinds of failover modes.
So to just give everyone a bit of context:
in Consul, we register services in all our Core data centers.
Right. But by default, if you only use the service URLs, it gives you active-active as the failover method,
since you are basically registering a service in multiple data centers and all of them will be active.
And that's the default failover policy for Consul, and it works in some cases, but we found that our developers required options like active-passive failover, and that's where prepared queries come in.
They give us a very rich format to specify exactly how to answer a query for a particular service, and allow us to do, as I said, active-active or active-passive kinds of topologies, and we can do that on a service-by-service basis.
So we actually give service owners a choice of active-active or active-passive, and this has worked out great for us, because it means that service owners can move at their own pace as they're setting up new services in our newer Core data centers, while we still have a failover solution.
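As a sketch of an active-passive prepared query with the Go client: the field names here are as in the 1.8-era client and may differ in newer versions, and the "billing" service and query name are hypothetical.

```go
// createFailoverQuery builds a prepared query that resolves "billing" in the
// local datacenter first, and only fails over to up to two nearby
// datacenters when no healthy local instances remain. It can then be
// resolved over DNS as billing-failover.query.consul.
func createFailoverQuery(client *api.Client) (string, error) {
	id, _, err := client.PreparedQuery().Create(&api.PreparedQueryDefinition{
		Name: "billing-failover", // hypothetical query name
		Service: api.ServiceQuery{
			Service:     "billing",
			OnlyPassing: true,
			Failover: api.QueryDatacenterOptions{
				NearestN: 2, // try the two nearest other datacenters
			},
		},
	}, nil)
	return id, err
}
```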
Right. The other thing that really, really stands out
for me about Consul is how they designed the inter-data-center service discovery.
I don't know how many people follow the Cloudflare blog here, but in April of this year we had an incident where we lost total connectivity to our primary data center. One of the biggest learnings we had from that was that we need service failover that is totally independent, and that we should be able to tolerate a full data center loss. That's when we decided that we would not have a primary-secondary architecture in Consul, but an active-active architecture where the clusters in all of our Core data centers are completely independent of each other.
So even if we lose a full data center,
our Consul clusters in the other data centers keep working. But we still have to do cross-data-center service discovery, and Consul's WAN federation works perfectly there.
With WAN federation you can discover a service in another Core data center, but you are not dependent on it; if that Core data center has network issues or is totally down or whatever,
you can still have local service discovery running, and intra-colo service discovery still works.
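For completeness, a sketch of what an explicit cross-datacenter lookup looks like with the Go client; the service and datacenter names are hypothetical.

```go
// discoverRemote runs the same health query, but explicitly targeted at
// another datacenter over WAN federation. If that datacenter is
// unreachable, queries without the Datacenter option keep answering from
// the local cluster.
func discoverRemote(client *api.Client) ([]*api.ServiceEntry, error) {
	entries, _, err := client.Health().Service("billing", "", true,
		&api.QueryOptions{Datacenter: "other-core-dc"}) // hypothetical DC name
	return entries, err
}
```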
So that was one of the big things, I think, that attracted me to Consul.
Yeah, right. And I guess the new release, 1.8, actually brings even more features to Consul.
Yeah. Going back to ACLs, we can now authenticate clients not just using tokens, but with more sophisticated systems, like integration with Vault, for example, or Kubernetes authentication, by allowing JSON Web Token authentication.
Yeah, I think that's in the most recent release, 1.8. So one of the things that we noticed when we set up the ACL system is that the ACL system is good, but it's very static.
The auth methods that Consul is introducing now allow us to create dynamic tokens and policies for agents that are prefixed down to the set of services or resources that they should have access to.
So that's something that we are planning to integrate with Consul now.
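A hedged sketch of that auth-method login flow with the Go client: the auth method name is hypothetical, and how the JWT is obtained (for example from a Kubernetes service account) is out of scope here.

```go
// loginWithJWT exchanges a bearer credential (for example a Kubernetes
// service account JWT) for a Consul ACL token via a configured auth method,
// instead of handing out a pre-made static token.
func loginWithJWT(client *api.Client, jwt string) (*api.ACLToken, error) {
	token, _, err := client.ACL().Login(&api.ACLLoginParams{
		AuthMethod:  "kubernetes-core", // hypothetical auth method name
		BearerToken: jwt,
	}, nil)
	return token, err
}
```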
Speaking of 1.8.0, by the way,
didn't you fix something in Consul that got released in 1.8, Yury?
Oh, yeah. So, actually, that was sort of a bug we bumped into when I was doing some work.
I was updating a standard operating procedure a bit for Consul, which is a document
outlining how to carry out a certain operation.
Can you go into a bit more detail about what an SOP is? Yeah, so basically they're just documents
on what we need to do in order to perform some action.
We have those procedures documented, so initially they are run manually, and over time
they usually get automated, resulting in some new service or some other type of integration or automation.
So yeah, I was working on this SOP, and one of the steps was to discover all the servers in the Consul cluster and pick one to start with.
And I was like, okay, let me try that.
What I did was a DNS query to discover Consul nodes, but I made a typo. I needed to find a Consul node in a specific data center, and instead of putting in a valid data center,
I typed an invalid data center, basically.
What I received was a SERVFAIL response over DNS, which didn't seem right to me.
It should actually be NXDOMAIN, a non-existent domain, not a server failure.
So I started digging into that, and it turned out that Consul for some reason was treating a valid NXDOMAIN response as an error, so you would just receive a SERVFAIL instead of the proper response.
So I was like, didn't anyone try to fix that previously?
And actually there was a PR which was meant to solve exactly that.
But for some reason it didn't, specifically in our situation. Can you explain why that PR did not fix the issue for us?
Absolutely. So in that PR, all the logic was legit and everything was kind of fine.
But when I started debugging it, it appeared that when you make a request
to a Consul server, it doesn't go directly to the server; it first goes to the agent, and then the agent requests the response from a server over RPC, because agents don't really keep all the information.
They just keep information about membership.
And if they don't have a response in their cache, they will query the server.
So it appeared that the original fix was issued and it does work, but only when you query a Consul server directly.
If you query a Consul agent, which then goes to a Consul server over RPC, you basically get a failure, you have all the logs about RPC errors, and it doesn't really work.
So I started looking into it.
The first thing I did was look at how this person implemented it. I checked what they were testing, and it appears
the tests were written just for the Consul server itself, without RPC communication in mind.
So I wrote tests that proved that.
Okay, this doesn't work: basically, the test just spins up two Consul instances, one agent and one server, and tries to do some RPCs.
So once that was proved, I implemented the fix, and it fixed the issue.
Absolutely. And now if you query Consul over RPC, basically via an agent, it will respond to you over DNS with NXDOMAIN.
Unfortunately, it's not yet released. It was accepted upstream.
It's in master for Consul right now, and we're just waiting for the next upstream release for this fix to ship.
Awesome. And how was your experience contributing to Consul open source, Yury? Oh, that was fun.
The community was really welcoming and supportive. I described the bug I found and what we were going to do about it.
They responded pretty fast, I guess, partly because the community is driven by HashiCorp itself.
Yeah. And I think it was merged the same day. So yeah, very good.
Awesome. Yeah. So let's talk a bit about what's next for Consul. We have already set up Consul.
Right. And as you said, we have written SOPs, we have done tests, and we have made sure that we've fixed the bugs.
What do you see, Yury, as the next thing that we could do with Consul?
Like, for example, distributed locks, as I said, are something really exciting that we could do with Consul.
Right. What are some other things that you think would be a good use of Consul?
Yeah, well, I think of the KV store, because we're hardly using that at this moment. Our main goal was to set up discovery, and basically the next thing is the KV store.
For example, we can actually keep our hardware node metadata in it, compared to how we store it right now in a centralized fashion.
So each node can have information about itself
stored under its own distinct key. Yeah, so this definitely can help with configuration management and simplify a lot of tasks.
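As a sketch of that idea with the Go client's KV API; the key layout and payload are hypothetical, not our actual configuration management schema.

```go
// storeNodeMetadata keeps per-node hardware metadata in Consul's KV store
// under a distinct key per node, instead of one centralized file.
func storeNodeMetadata(client *api.Client, node string, meta []byte) error {
	_, err := client.KV().Put(&api.KVPair{
		Key:   "nodes/" + node + "/hardware", // hypothetical key layout
		Value: meta,                          // e.g. []byte(`{"cpu":"epyc","disks":12}`)
	}, nil)
	return err
}

// loadNodeMetadata reads it back; any agent in the datacenter can serve this.
func loadNodeMetadata(client *api.Client, node string) ([]byte, error) {
	pair, _, err := client.KV().Get("nodes/"+node+"/hardware", nil)
	if err != nil || pair == nil {
		return nil, err
	}
	return pair.Value, nil
}
```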
Yeah, that is also one of the things that I am excited about.
Configuration-management-wise, storing configuration values is a difficult problem, especially at our scale, where we have thousands and thousands of machines.
Right now, we store a lot of configuration data statically.
For example, alert definitions: we use Prometheus for our monitoring, and all the alert definitions are stored statically.
So that was one of the things that I was very interested in solving: we could have these alert definitions stored dynamically in Consul, and we could also leverage Consul health checks for some of the alerting. That's one of the things that I'm looking into implementing soon with Consul.
Right. Well, yeah. Another thing I can think of is that once we have services and applications that need more flexibility in service discovery and configuration, we might want to actually move off DNS for discovery and maybe go with embedding Consul clients into applications directly.
That would simplify service discovery for those applications. Yeah.
Another thing that I wanted to talk about was this. To give you some context, Criteo is another big user of Consul; they have a lot of posts on their infrastructure blog about Consul, and they are a few years ahead of us in using Consul.
So they're doing very interesting things with Consul.
One of the things that I really liked was this, which is inversion of control for infrastructure.
And if you notice, Pierre is the same person who was on Yury's PR, because he was the one who committed the initial fix as well.
Yeah, so what they do is use Consul for service metadata. That is also one of the things that I'm looking into, where you can basically define alerts as Consul service metadata, and then you can have
preset alerts globally for services that can be modified by developers using their service tags.
So basically, developers can define their own alerting thresholds for us, and we can have a global set of alerts that applies to every service, using Consul.
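A sketch of what that could look like with the Go client, registering alert thresholds as service metadata; the metadata keys and values are a hypothetical convention, not an existing one, and a monitoring pipeline would read them back from the catalog to template the alerts.

```go
// registerWithAlertMeta registers a service along with metadata describing
// its own alerting thresholds, which an external alerting pipeline could
// consume from the Consul catalog.
func registerWithAlertMeta(client *api.Client) error {
	return client.Agent().ServiceRegister(&api.AgentServiceRegistration{
		Name: "billing", // hypothetical service name
		Port: 9090,
		Meta: map[string]string{
			"alert_latency_p99_ms": "250",  // hypothetical threshold convention
			"alert_error_rate":     "0.01",
		},
	})
}
```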
Right. I think we've had this conversation previously.
Yeah. Well, self-service, without the monitoring team in the loop.
Yeah. We can definitely do that. Okay, I think that should be it from our side.
If you have any questions, feel free to send them to us.
Right. I think it's CFTV at Cloudflare.com. Yeah, CFTV at Cloudflare.com.
Thanks, Yury. That should be it from me and Yury. And if you have any questions, send them to CFTV at Cloudflare.com.
Thanks for attending this talk. Thank you.
Bye. Bye. Bye.