Making socket lookup in Linux programmable with BPF - a retrospective

Presented by: Lorenz Bauer, Arthur Fabre, Jakub Sitnicki, Marek Majkowski

Originally aired on October 2, 2023 @ 6:00 PM - 7:00 PM EDT

Cloudflare outgrew the Linux network stack, and fixing it took three years. Join us to discover why it took so long, what mistakes were made along the way, and why we should've used an ioctl() after all!

Transcript

Hi, everyone. Welcome to us live from London and Warsaw. We're here to talk to you about what led to sk_lookup and tubular, two big projects we open sourced recently. My name is Arthur, I'm a systems engineer and I wasn't directly involved in any of these things. I'm just here to look pretty, and I was kind of on the sidelines during this whole time. And with me here today, we have the real heroes, Marek, Jakub and Lorenz, that were part of this saga. If you could all just go around and introduce yourselves. Hello, I'm Marek. I joined Cloudflare many years ago, and I was one of the early employees doing early hacks. So I have, I have to say, the origin story of our sk_lookup that landed in the kernel today. Looking forward to it. Hello, I'm Jakub and I work on all kinds of network proxies and the kernel here at Cloudflare as well. I've been here for, I think, three plus years now. Hello, everyone. My name is Lorenz. I'm from the London office as well. I'm a systems engineer on the same team as Arthur and I work on kind of load balancing, all things that are related to that, and I also do a fair bit of work on kind of control planes that we have at Cloudflare. I'm also the guy who didn't have his tab muted just now, so apologies for that. Nice. Cool. And so every one of you has written a blog. I think we have three blog posts spanning almost four years to kind of show the magnitude of this project. And I think to start it off, Marek has a cool story about kind of the origins and how we got started with this all. Right. So this segment is about our work that we did in the kernel. It's an eBPF hook that is called sk_lookup. It's about putting eBPF code into the kernel, into the socket lookup. The story of that work is very long. It started with me five years ago maybe, and then it kind of finishes with, we think, Lorenz's blog post.
Lorenz published a blog post a couple of days ago, which publishes a tool that we use to actually use that kernel feature. But the whole saga is kind of very long and very bumpy. And it started with me very early on. So when I joined the company, I was asked to do the work on the DNS server. And I must admit, doing DNS, I wasn't really very good at that, but I knew how to write servers, how to do networking. So I looked more closely into that and I quickly noticed that our DNS servers in London crashed every Friday at 5 p.m. I was like, I should probably do something about it. Were they the existing DNS servers that you were working on or was this a project you started? No, no, no. It was the DNS server, it's written in-house. It's called rDNS and it was written by Ray, a great engineer. And the whole idea of why we did a DNS server is probably something for another time. But the point is that it was only recently productized and it was not really as stable as we wanted. But remember, the company was very small back in the day and we didn't have many European customers, so the impact of London crashing wasn't as dramatic as it would be right now. So I looked at the code and I fixed it, which is, I made it faster. It was in Go, it is in Go. And in Go, each socket has a goroutine around it. That's how your programming goes. So we did that and I sped up the goroutine that was basically picking up packets from the sockets and then sending them for a kind of more central processing with the thread pool, we had kind of more structured things. And once I did it, I was very, very proud. It's like, hey, we'll be much better now, much faster. And obviously the next day it crashed in an even worse situation. It was much worse.
And the reason was that, yeah, my code would work perfectly, it was bug-free, it was ideal, but it ended up just putting too much work on the central queue, on the kind of central processing. So we ended up with basically putting way more work on the server than it could have done. We just ran out of CPU. So you'd offloaded kind of accepting packets... so there was one thread accepting packets and all the rest of the work was offloaded to a different thread group. And because this kind of reading the packets was slow in the past, it all worked reasonably fine. As for, yeah, if we get too many packets on one socket, yeah sure, the socket will drop packets, but the whole server will generally kind of work. After my fix, the whole server was dying, was basically unresponsive. So this is a story where I started looking at the sockets like, okay, so we have many sockets that our server is listening to. That may be an issue. Should we fix that? And at the same time, I... Can we take a step back? Why did we have so many sockets in the first place? Exactly. So at the same time I noticed that the reload of the server takes like 5 minutes. So if you do like, it wasn't systemd back then, it was circus, so like due to that circus it would be a proper long time. Not like 5 seconds, it's like 50 seconds, it was... you had to really go for a coffee and exactly, it was because we had many, many, many sockets. I think it's thousands. Why was that? It's because we are just listening on so many IP addresses. So you can then say, okay, but maybe we'll just wildcard the IP address and just get all the packets into one socket. But then we go back to my problem, which is, yeah, sure, but then that socket will be overwhelmed by potentially incoming DDoS attacks or network spikes, and the whole server will be in an even worse case.
So we had this kind of weird situation where we understand that we need the sockets on inbound, we understand that we need to put some kind of throttle on them so that each socket, its IP address, can get some quota, but then we really cannot have those sockets because our systems underneath don't really work, because of just the reloads of the service. We used socket activation and the reload was just too long. So the fix was to fix circus and I made it faster by doing some low level hacks. It's like, yeah, circus is in Python, you really don't need to do all this magic, just get the number, an integer, and don't... it's like, trust me. So it was sped up to a more reasonable time, I think 50 seconds - was after... - Wow, so fast. Yeah. And we did want to fix the TCP. So I think we did do a bind to star. So we did bind to star for TCP because for TCP we don't really care about the inbound queuing, because in TCP we only get connections when there is a SYN and an ACK, so we are pretty sure that the connection is legitimate. Right, so you're talking more about having denial of service attacks where... Yeah, it was. Kind of naturally we have to do the handshake so we know things are at least somewhat legitimate, whereas for UDP it's any random junk. Yeah, but imagine if you're running a DNS server, it's not really that hard to overload with DNS floods. And we did have them all the time. So that was kind of the root cause of the failing in the queues. So this is when we kind of started to think about it more carefully and it's like, okay, but what if we have another instance of a DNS server, how are we going to do bind to star twice? How's that going to work? So obviously, it doesn't work. You can only bind to kind of wildcard or any IP, you can only bind to it once. So we had this problem because we couldn't really do that.
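The "you can only bind to star once" limitation is easy to reproduce. The following is a minimal sketch, not anything from the rDNS code base: it assumes plain UDP sockets with no SO_REUSEADDR or SO_REUSEPORT set, and the function name is made up for illustration.

```python
import errno
import socket


def try_second_wildcard_bind() -> int:
    """Bind two UDP sockets to the same wildcard address and port.

    Returns 0 if the second bind unexpectedly succeeds, otherwise the errno
    raised by the second bind.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first.bind(("0.0.0.0", 0))          # port 0: let the kernel pick a free port
        port = first.getsockname()[1]
        try:
            second.bind(("0.0.0.0", port))  # same wildcard address, same port
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = try_second_wildcard_bind()
    print(errno.errorcode.get(err, "second bind succeeded"))
```

The second bind is refused with EADDRINUSE, which is why two independent servers could not both take "any IP" on the same port without some extra way of partitioning the address space between them.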
So we wrote a quick patch to the kernel. Gilberto wrote a patch which we called bind to prefix, which is basically the simplest possible thing you can do on the kind of socket lookup layer. We added a field to the socket, which is like what subnet is the socket supposed to be listening on, and if it's not empty then we only pass packets for that subnet to the socket. Then we have some reuseaddr, maybe even reuseport machinery. No, it was reuseaddr, with the reuseaddr machinery we're able to basically have different subnets on each socket. So each socket, with the bind to prefix patch, is binding to any, to the wildcard. But then with some custom setup, you specify a particular subnet that you're interested in. And this is the part we are still using. It has kind of its ups and downs. On the good side, it's pretty easy to use. On the bad side, it is a custom fix for the kernel and it also requires custom changes to systemd. We really don't want to patch systemd. It's really not helping anyone. But that's what we needed to use. You mentioned balancing between kind of all the queues and the sockets. So is that a concern with bind to prefix? Do you have to like carefully choose how many prefixes and which prefixes you bind on which sockets? We only use bind to prefix for TCP, which is less of a concern. Okay. So if I understand correctly, you said the DNS server had many, like hundreds or thousands of UDP sockets. But bind to prefix doesn't really affect the UDP case. It's more for TCP and it's all about being able to have another DNS server. So we had rDNS, which was authoritative, and then the other DNS server you're talking about, is that 1.1.1.1? Well, it was earlier than that. I think we had some other instance.
But yeah, we used bind to prefix, we started I think with the DNS and yes, indeed, the original use case was: how can we not have these thousands of sockets? And for TCP we can solve it with basically the star, bind to star, plus cooperating with other tenants on the system, other processes on the system, so the bind to prefix. For UDP, you cannot really do that. So the fix for UDP was just to speed up circus and... I also then wrote a kind of throttling mechanism that will basically count packets per second from each IP address. And if it's too much, it will be like, hey, hey, hey, slow down. Let's be clear here. Now, I think we could be able to fix the UDP case, but we never got down to that. So what you're saying is today we still have thousands of UDP sockets. We have different systems, we don't have circus, we still have rDNS and we still have thousands of sockets, although it's much more optimized with advisord. And so do you know how quick it is to restart now with advisord and systemd? I think we try not to restart too often. Bind to prefix itself is not very popular with SREs because we never adapted sock_diag to show that you are actually binding to a subnet. So when you're looking at what sockets are listening, you just see conflicting bindings, like two processes bound to star, any IP, on port 443, which is just plain confusing when you're trying to debug something. That's a really good point. And so we tried to upstream this bind to prefix when we first wrote it, right? We did. And the response of the community was: it's very nice, we are sure you at Cloudflare have a use case for that, but it's too specific to you, which is a fair argument. So this is kind of one origin story of how we went into this sk_lookup. Then we did something different. We created Spectrum, which is our layer four or layer five, depending how you look at that, load balancer.
It basically terminates the connections on our edge, on Cloudflare servers, and then pushes the connections, usually TCP connections, back to the origin. So it's a pretty simple kind of forward proxy. And we did it. We can terminate client connections and stuff. But obviously the question is what happens if the user wants us to listen on some port? Ok, we can bind to it. So when you say the user here, you mean the customer who's... The customer, yeah. So in Spectrum, the idea is that the customer can ask us to forward any port, which is kind of obvious. So initially when we designed Spectrum, we focused on the TCP part, it's like, let's listen, let's process, let's send, let's make it fast, let's add TLS, let's add more features. But originally I did know that there would be an issue with binding. I knew that. Yeah, I can bind on like obvious ports, like port 80, 443, maybe 22, maybe 25 and a couple of dozen more. And that was the initial scope of the project. It's like, let's wait for the customers to ask for more ports and no, we can probably do a dozen, two dozen, maybe a thousand. I know what happens once you get to 10,000, but let's not go there. Let's ignore the issue. And so, every time a customer wanted a new port, we'd have to actually go start up a new socket. Yep. We had to do a new proper release, add a new socket to systemd to bind on that port, on that prefix. And so this was actually in production, right? It was. Ok, interesting. And guess what? People wanted new ports. Maybe I'm misremembering this, but if you think about the time before we had Spectrum, basically the thing that Cloudflare was really good at is HTTP, HTTPS, and then everything else was kind of like, ahh, it's kind of difficult. We had support for WebSockets and stuff. So like Spectrum filled an important need, right? Absolutely.
So Spectrum, so the TCP forwarding, that was definitely a missing feature on our stack. So while you could do decent security and kind of load balancing and caching on the HTTP level, which is where most of your traffic as an end user is, we really were not good at other protocols, so an obvious use case was email, so port 25. Yeah, sure, you can put Cloudflare in front of your domain, but then the email will still kind of go directly to your origin, potentially leaking your IP address. So one of the early use cases was exactly that, making sure that the user's origin IP address is not leaked by things like email. So mainly protecting customers like email providers from DDoS attacks, so they could have their inbound, like SMTP ports? Ok, makes sense. I think we also had gaming pretty early on, no? No, we did, and Spectrum is very successful in gaming, so it actually works very nicely protecting TCP and UDP inbound connections for game servers. Cool. So we are at the point where Spectrum is live and you have to... Yeah, exactly. ... every week. Yeah, so Spectrum is live, we announced it, we have requests, people trying to use different ports. It's very laborious for us. So how can we fix it? At some point, I don't remember how we found it, but we found the TPROXY iptables kind of option or feature. TPROXY is obvious to us these days, but it was really obscure. And if you look at the documentation, the docs for TPROXY are very bad. They introduce the common case for which TPROXY was created, but that's pretty much it. They don't really go into more details. It's really hard to understand what's under the hood. So do you know what TPROXY was created for initially? Yeah.
So TPROXY was originally created for transparent proxying, as you can see by the name, so that you can put your Linux router, Linux server on the path of the cable. So you can cut the cable, put the server in the middle, and it should be able to terminate the connections flowing through that cable without the end user really noticing. So this is, for example, for Squid, for an HTTP, without the S, proxy. So you could terminate TCP connections, do whatever you want. Kind of like man in the middle, really. It is totally a man in the middle. It's totally about... there is a connection to someone else, and by the way, let me allow the network stack to actually terminate that. So it's totally a man in the middle case. And after we dug into the documentation and did some examples and read the code and read the historical writings about it, we realized that we can actually use it as the terminating part, not the man in the middle part, but as the part that actually terminates. And it's a very weird way of using TPROXY. That's why it took us so long to discover it. I don't think anyone else is doing that. But basically we can use the same technology on the end side, on the end host. Right. But that is also the reason why TPROXY happens so early in the ingress path, right, before you even make a routing decision, whether a packet should be delivered locally or forwarded. Totally. TPROXY is for the forwarding case. So if you are forwarding packets, even if you don't have the IPs locally, TPROXY will allow you to terminate them on your machine, while we do have the IPs. So it's kind of a bit different but applicable. And so the big problem when you're proxying like that is preserving, I guess, the original 4-tuple while still allowing you to have sockets, right? Right, exactly. So with TPROXY, how it works is you give it a local socket address, so like 127.0.0.1:1234 for example.
And then TPROXY, which is configured on the firewall level, will take that configuration and will basically forward or move those connections to the networking stack, ignoring all the routing decisions. And it will allow you to answer with a SYN-ACK, and then you can do accept on that TCP listening socket and get the connection. It's like, hey, it's very nice. Someone from remote connected to my 127.0.0.1 socket. How come? How are you here? Why did you arrive here? So it's very confusing and it's kind of not really obvious why it happens. And indeed, then you need to do some magical syscalls to figure out, okay, actually the source IP is correct, but then the destination was actually supposed to go to someone else, and then you can use that in your application to do the normal application lookup. There are many more issues with TPROXY. But anyway, TPROXY allows us to solve this kind of problem, this time not the binding to multiple IP addresses like we had with DNS, but this time we had this problem of binding to all the ports. So TPROXY allows us to bind to all the ports. So with this experience we kind of knew, okay... So, would we end up with just one socket for all the ports and we'd still... would we use this with bind to prefix? How many sockets would we actually end up with in - Spectrum? - Single socket. Single socket for... One TCP, one UDP. And so did we run into the same queuing problems that we did for DNS with the UDP socket? Exactly. So the TCP is again easy because it's after SYN and SYN-ACK, so the connections that we can accept from that socket are legitimate, basically. And if not, then we are doing some magic, but it's next level. So that's usually kind of good enough for basic DDoS. For UDP, exactly. This is exactly it. If all the packets from all the ports, potentially many IPs, many prefixes, land in one UDP socket, it's going to blow up.
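To make the TPROXY description above more concrete, here is a rough, Linux-only sketch of the application side. It is an illustration under assumptions, not Spectrum's code: the function names are invented, the prefix 192.0.2.0/24 and port 1234 are placeholders, and the firewall rule in the comment is only an example of the kind of rule involved. Setting IP_TRANSPARENT needs CAP_NET_ADMIN, so the helper returns None when run unprivileged.

```python
import socket

# Example firewall side (placeholder prefix and port), steering every TCP
# port on a prefix to one local listener:
#   iptables -t mangle -I PREROUTING -d 192.0.2.0/24 -p tcp \
#       -j TPROXY --on-ip 127.0.0.1 --on-port 1234

# IP_TRANSPARENT is 19 in <linux/in.h>; fall back to that value if this
# Python build does not export the constant.
IP_TRANSPARENT = getattr(socket, "IP_TRANSPARENT", 19)


def make_transparent_listener(addr: str = "127.0.0.1", port: int = 0):
    """Create a TCP listener with IP_TRANSPARENT set, or return None.

    Returns None when the option cannot be set, for example when the
    process lacks CAP_NET_ADMIN.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.IPPROTO_IP, IP_TRANSPARENT, 1)
        sock.bind((addr, port))
        sock.listen(128)
    except OSError:
        sock.close()
        return None
    return sock


def original_destination(conn: socket.socket):
    """For a connection steered by TPROXY, the accepted socket's local
    address is the IP and port the client actually dialed, not the
    listener's own address."""
    return conn.getsockname()


if __name__ == "__main__":
    listener = make_transparent_listener()
    print("listener:", listener.getsockname() if listener else "needs CAP_NET_ADMIN")
    if listener:
        listener.close()
```

The "magical syscall" in this sketch is simply getsockname() on the accepted connection, which is what lets a single listener on 127.0.0.1 recover which customer IP and port a connection was really meant for.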
So we spent a significant amount of time writing a BPF filter that sits on that socket. So again, it's another layer of BPF, but it's controlled by the application. So Spectrum controls it and this BPF is basically doing the throttling. So it's okay, let's measure how many packets per second to each destination IP, from each source port, is that a reflection attack? And it strikes on the socket level to basically figure out more or less Is there an error problem, has the socket failed to match? And in that case it reacts and it will actually drop the packets. But again, it has to protect from DDoS and it has to protect other services. So sometimes in some extreme cases, we can afford one customer on one service not being fully available globally. And this is usually an okay compromise, but we cannot accept that the whole service is going down for other customers as well because we have some networking event for some other users. And so how did this work to kind of separate customers out? Because I guess before in the DNS case, we could have given different customers different sockets or different bind to prefix things. So one customer couldn't really impact other customers that much. So were we, in the BPF accounting, trying to account per customer and rate limit per customer in Spectrum? Pretty much. I think we even wrote a blog post about that, it's called Target Limit. I think it's called Rake. Right, Lorenz? So there's two versions. There's the internal version, that's Target Limit, and the open source version is called Rake Limit, which an intern did, Jonas. And there's a blog and it's open source. It's really interesting. Perhaps it's worth mentioning why it's okay to have just one TCP socket because that's going to come up later, I suppose.
So, when you have too many connections coming in and your accept queue starts overflowing, then SYN cookies kick in and the kernel will serialize the state, send it over in the SYN-ACK response, and the client will reflect it in the final ACK, right. So you don't need to keep that state on your side. And that's actually something that proved problematic for us in the case of the load balancer and detecting which sockets are listening and which sockets are issuing SYN cookies currently, right. So we'll probably get into that a little bit later, but let's maybe take stock. Right now we have Spectrum, and using TPROXY you fixed kind of the problem that every week you need to do a restart. We have kind of the port wildcard, if you think about it that way. What's kind of the next thing? Because we're kind of collecting little problems, paper cuts, that kind of add up to the thing that we open sourced. What's the next thing that happened? Right, so we defined kind of two issues right now. One is called the DNS issue, with just too many sockets. And one is the Spectrum issue with too many ports. So we have kind of vertical lines right now. So we realized that we needed kind of a better solution. Why is that? It is because TPROXY was very, very, very painful. It was relatively easy to start. It's a firewall rule. But the deeper you go there, the more dragons there are. So for example, there is a performance issue, which Jakub can talk more about. On some sockets, you need to set the IP_TRANSPARENT flag. This flag is necessary on the listening socket, but it turns out in our case we also needed that on the other sockets, because we actually create sockets on top of existing kind of bind to star sockets. So it was super painful and then we had other issues, like there was a team that was working on Unimog, our load balancer, and it didn't really play nice with TPROXY.
The load balancer needs to answer a question: given the packet, is the socket available on the system? And because the destination IP has actually been rewritten by TPROXY, Linux was confused. So the load balancing mechanism, Unimog, didn't really work well. Then, we really don't like doing any application stuff, including this kind of socket changes for TPROXY, on the firewall. Cloudflare's firewall is super complex already. And it's really not a place where people should tune their applications and make changes based on customer features. So there were really strong reasons to fix TPROXY, especially since we knew that in future we wanted to extend Spectrum to be able to do other things. There is another good example. So in Spectrum, we have an IP address, we listen on all the ports effectively, and the customer can say, okay, on port 4000, that's forwarded to my origin. That's easy. But what if the customer asks us to forward port 443? Again, that's easy. We terminate the connection on port 443 and we pass it back to the origin. But what, then, if the customer says, oh, by the way, I want port 22 to go to my SSH server, but port 443 to go to Cloudflare because you guys are good at HTTP? It might sound easy as a kind of feature request, but internally it was basically impossible. Like, come on, I'm terminating all the connections in Spectrum, I have all the ports, including 443, like sure I can send to your origin, but sending it internally? No, no, no, no, no, no. That's too much. So it was a real issue for us to connect those things. And we believed that maybe we can create a better abstraction to express those kind of mappings of, this port, these IPs will go here, will go there. We wanted much more flexibility than TPROXY allowed us to do. Yeah, I think that's a key, at least from the sidelines, my point of view, when we were seeing this emerge, that was kind of the second big problem.
The first one was having so many sockets, which kind of covered the DNS and the Spectrum case where you have so many sockets, you have so many IPs and ports you want to listen to. And the second problem was having the flexibility of being able to say, we have tons of IPs and tons of ports and we need to hand-pick, oh, but this one port goes somewhere else, and just have tons of exceptions everywhere. Right. So I think we can kind of finish the origin side here and kind of move on to how we fixed it. And basically at some point I asked Jakub: Jakub, you know enough about TPROXY and its issues. Can we please have a more coherent solution? So that's how the sk_lookup story starts. Awesome. So, anyone want to tell us a bit more about what sk_lookup even is? Right. Right. So like Marek already said, we spotted these three problems in production and that's kind of the beauty of having a diverse production environment, that you can see what problems the applications on the top are having and you can come up with some kind of general mechanism that you can build into lower layers, the lowest one being the kernel, that then many users will benefit from. And that's what we decided to do here. So like we mentioned, TPROXY works for us. We built Spectrum on top of it. And truth be told, we could have built a more flexible dispatch on top of TPROXY, but with all the downsides that come from manipulating your firewall rules often and having applications integrate with TPROXY. So instead, Marek came up with a better idea. Why don't we add flexibility into the socket lookup itself? Why don't we allow the user to be able to program the socket lookup even? For people like me who don't know much about all this, what is the socket lookup? Right, right.
So you see, when the packet comes in to your metal, right, it first hits the NIC, the NIC hands it over, XDP processing happens, then traffic shaping happens, then some initial filtering happens, conntrack to name one, other things, then finally you hit the routing decision where the kernel takes a look at the packet, takes a look at the routing tables and decides: is this packet destined for me or should I forward it along? So if you decide the packet is to be delivered locally, then some more filtering happens. Your input firewall tables are getting traversed and once the packet is through the firewall it's time to call on the transport layer, this being usually TCP or UDP, to find a socket that is willing to receive that packet, and that is what we are calling the socket lookup here. Okay, cool. And so this is usually predictably just based on the 4-tuple, - like the ports and IPs, we find what sockets are bound to what. - Exactly, the input to your lookup function is the 4-tuple that comes from the packet header, for v6 also the 5-tuple, right, together with the flow label. And yeah, the socket lookup itself, it's actually a process that is split into a few phases, so a few steps, right. First we always look for connected or established sockets, which are bound to the 4-tuple that matches the packet header exactly. If we don't find that connected socket, then we start looking for listening sockets that have just the local address set, and if we fail to find one, only then we move on to the wildcard sockets which we have been talking about, which are bound to any address that is assigned to the host and some certain port number. So coming back to your question, what socket lookup is or what BPF socket lookup is, it's actually an extension that we added to this whole process. We added another step.
Just after the connected socket lookup, we added another step where we actually run a BPF program that has a chance to return a listening socket that will be used to receive the incoming packet. Okay, cool. And so is this kind of a similar place to where bind to prefix was implemented? Somewhat. Yeah, so I mean, bind to prefix extended the second step of the lookup, what used to be the second step of the lookup, where the kernel looked for a listening socket. It was just a couple of if branches that also compared the prefix that the socket was listening to, that Marek mentioned, instead of just the local address. I'm laughing here because the socket lookup that we did the bind to prefix in, yeah, that's a couple of ifs, but they are the most complex ifs in the kernel, they change often and there are thousands of corner cases and no one understands it. And if there's a change, there's always a regression around it. So yeah, yeah, yeah. There are a couple of ifs. Yes, but they're important ifs. Yeah, yeah. There are definitely many criteria that are being compared and it's not the prettiest part of the kernel. Yeah, and that's why we also didn't want to have this work on our hands, every time a new stable kernel release comes out, having to worry if the patch will apply. And every year when we switch to a new LTS kernel, porting the patch, and hopefully, you know, we'll be able to kill that patch in our custom kernel someday. Yeah. Awesome. I think Lorenz also mentioned at some point that the testing was also really hard because of course the patch is out of tree, so you're not rolling it into your local developer machine and every time you need to backport it, you need to go try to find all your unit tests. Oh, wait, does this still work? All the tests were kind of very dispersed. Yeah. I mean, the closer you run to upstream, the easier it is to do your regular development or to do kernel development, definitely. Cool.
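The lookup phases described above can be modeled in a few lines. This is a toy model in Python, not kernel code and not the BPF API: sockets are just names, the tables are plain dictionaries, and the `steer` policy (port 22 to one socket, every other port on a placeholder prefix to another) is a hypothetical example of the kind of mapping discussed earlier.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

FourTuple = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)
Program = Callable[[str, int], Optional[str]]


@dataclass
class SocketTables:
    """Toy stand-in for the kernel's socket tables; values are socket names."""
    connected: Dict[FourTuple, str] = field(default_factory=dict)
    listening: Dict[Tuple[str, int], str] = field(default_factory=dict)
    wildcard: Dict[int, str] = field(default_factory=dict)


def socket_lookup(tables: SocketTables, pkt: FourTuple,
                  program: Optional[Program] = None) -> Optional[str]:
    _, _, dst_ip, dst_port = pkt
    # Step 1: a connected socket matching the full 4-tuple always wins.
    if pkt in tables.connected:
        return tables.connected[pkt]
    # Added step: the programmable hook may pick a socket itself.
    if program is not None:
        selected = program(dst_ip, dst_port)
        if selected is not None:
            return selected
    # Step 2: a listening socket bound to this exact local address and port.
    if (dst_ip, dst_port) in tables.listening:
        return tables.listening[(dst_ip, dst_port)]
    # Step 3: a wildcard socket bound to any local address on this port.
    return tables.wildcard.get(dst_port)


SPECTRUM_PREFIX = ipaddress.ip_network("192.0.2.0/24")  # placeholder prefix


def steer(dst_ip: str, dst_port: int) -> Optional[str]:
    """Hypothetical policy: port 22 on the prefix goes to one socket,
    every other port on the prefix goes to another."""
    if ipaddress.ip_address(dst_ip) not in SPECTRUM_PREFIX:
        return None  # not ours: fall through to the regular lookup
    return "ssh-proxy" if dst_port == 22 else "spectrum"
```

In this model a single "spectrum" socket receives every port on the prefix without being bound to any of them, one port is carved out as an exception, and addresses outside the prefix keep the ordinary listening and wildcard behavior.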
So I'm kind of curious, what was the experience like of proposing this extension to the Linux kernel? Right. I don't know what your... I think you had contributed to the kernel before. Is that right? Right, but I mean, it was definitely nothing major. I mean, I wouldn't say I was a seasoned kernel developer or anything like that. But in the past, that's for sure. Well, not very seasoned, but I did enough to know the process and feel comfortable with the whole process. And because I started out probably like many people, just cleaning up a driver in the staging tree, - that's always... - I do it all the time. Yeah. Like many people, everyone's brought up... Well, that's a good starting point, definitely, if anyone is looking to get into the kernel. Then, when I was working for my previous employer, I did a couple of fixes in the networking stack and then yeah, I kind of got enough of the context to be able to dig through TPROXY and understand what is going on in socket lookup. But that was it. That was actually the first major project in the kernel for me, as well as for all of us here. But it did take you a couple of tries before we finally got it in, right? Yeah, yeah. I mean, definitely. And the idea, I mean, it wasn't like there was a point where we had to throw everything away and start over, but the idea certainly evolved as we went. I recall that the first idea and the first implementation, the RFC implementation, was just a BPF program that was receiving the lookup 4-tuple, so the input that goes into the lookup, and it had the ability to rewrite that key, so you could rewrite the destination port or the destination address. Right, right. So the input is IP and port, the output is the IP and port of the local socket. Sounds very similar to TPROXY, right? Maybe because we were using that before. Exactly.
And it was also familiar to other BPF types of programs that I think already existed at that time, like the bind hook, so it was the simplest approach. But then when you presented it at the NetDev Conf, I think the feedback that you brought from there was that we should instead make it more flexible and try to return a socket from the BPF program. But the machinery was already there and the verifier was ready for something like - that. - Right, - so... - I would take.... I need to quickly bring up the verifier. An important thing about BPF programs is that there's this kind of threshold that every BPF program that you try to get into the kernel has to pass, which is the verifier, which does like analysis. So what Jakub is saying, like BPF as a thing itself is like, you can think of it as a description of a virtual machine, but then BPF in the Linux kernel has, over the last two years, gathered a lot of features, and I think what he was saying is that at this point in time, eBPF in the Linux kernel had come to a place where it had an idea of what a socket is, right? Which is kind of... sounds kind of easy, but it's actually a big deal if you... It's a very big deal. Right, right. So I remember that we were told basically that it's very nice that you returned this 2-tuple as the return of the function, but that's kind of silly because it's a lookup function. So what are you going to do with this 2-tuple? You're going to give it to lookup again. So it's kind of turtles all the way down. It's, it's, it would be much nicer to break the abstraction. And it turns out that the facilities already exist. So I think before, before that the developers created SOCKARRAY, which is an array, like a linear array for which you define the size. It's a BPF object that you can put sockets in, and that was a shocker to me. It's like, how cool is that? You can have a socket and bang. It's not really a file descriptor. It's kind of more than that.
And yeah, going back to what Lorenz said about the verifier. I think a big part of it is the verifier being able to support having sockets, tracking sockets, knowing what is a socket, what isn't a socket. I don't think we would be able to extend the verifier to support this. So thank you for doing that before. The next step was to extend your program, Jakub, to make use of the socket array, if I remember - right. - Right, right. Exactly. I mean, the only container, the only BPF map at that time that supported holding references to listening sockets was the REUSEPORT_SOCKARRAY you mentioned, that was built for REUSEPORT BPF programs, which do load balancing among groups of sockets, groups of listening or receiving sockets. And we tried to repurpose that for BPF socket lookup and it worked and that's, that's what we presented at Linux Plumbers, I think. But we also noticed that REUSEPORT_SOCKARRAY was very tightly coupled to reuseport logic and that came with some downsides in how the sockets that we put into that map have to be configured, and the way forward wasn't clear because we would have to make the reuseport code a bit more complex. And at that time, during that conference, John Fastabend, who's maintaining another BPF map that is used to hold sockets, called sockmap or sockhash, suggested that we should instead use sockmap as our socket container from which the BPF program selects a socket and then it returns it as a result of the lookup. The catch there was that sockmap did not support holding listening sockets... - or UDP. - Yeah, or UDP. So only connected TCP sockets in it. Exactly, because its initial use case was using BPF programs that essentially splice two connected sockets, but in the kernel context. Right. So, so but I want to give a bit of a feel of a timeline.
I think the first kind of this RFC, the request for comments that you mentioned, I looked it up yesterday, I think you posted that like in March of 2019 and then Marek, kind of shortly after, went to a conference and kind of gave a talk about it. And then this Plumbers conference you're talking about, I think that was towards the end of 2019, if I'm not mistaken. And so by that point, you had done the second version of your patch set, which kind of used this new approach with just kind of directly returning a socket. Yeah, that checks out. Yeah. And then we kind of keep going because this is the thing that I thought was amazing when I kind of went through all the blog posts that you wrote, the internal ones. The biggest chunk of work was still to come, right? Yeah, exactly. Exactly. I mean, that was the biggest surprise on this project, that it wasn't about adding the new hook or the new BPF program type. Most of the work went into extending the sockmap to be able to put listening sockets there and then to store UDP sockets, which you worked on, so you know the best. Yeah. It turned out that it was really tricky. I mean, sockmap overrides some protocol callbacks when a socket is inserted into it. And that turned out to be a challenge for listening sockets, which get cloned locklessly on the hot path, on the ingress, but fortunately, with the help of more experienced network stack engineers, we managed to get it working and shake out all of the bugs. Yeah. And that was definitely a big chunk of work. And so this new helper applies only like if we're going back to the sk_lookup and kind of the path where first, we're trying to find an existing connected socket with the 4-tuple, this runs after that, right? So existing connected sockets, once you've found the listening socket for a given packet, then that's... the kernel remembers that and subsequent packets will just get routed, sent to that connected socket automatically. Right? Right.
I mean once the first packet in the TCP connection gets routed, the listening socket will spawn a request socket for that, and a request socket is like a connected socket in the sense that it has a 4-tuple assigned to it. So any subsequent packets, the final leg of the three-way handshake, will land in that request socket. So, so kind of you get to amortize running this function as well, even if it would have been expensive, you only kind of have to run it once to set up an existing connection. Yeah. TCP. Yes, interesting point. So what happens for unconnected UDP sockets? I mean, the same thing, the lookup was a little bit different because in UDP we have only two steps for the lookup. In one step, we look up both the connected and the receiving sockets and the second step of lookup checks for any wildcard sockets. So here we had to hook up our logic a bit differently after the first lookup. And like you said, the cost is a bit greater because every packet that doesn't land in a connected socket, so UDP connected sockets, which are not that common, will go through that BPF program. Cool. Awesome. So and then I think one question we want to ask. Any regrets now that you've committed, we've contributed this upstream in Linux, it's kind of set in stone forever unchangeable? I'm sure. I mean, BPF is so hard, right? We should have made it an ioctl probably, right? No no, but jokes aside, definitely a couple things, which was kind of a learning on the job experience. And one thing that slipped through was capabilities for our program. Guys from Huawei just recently noticed that it's possible to load an sk_lookup program without CAP_NET_ADMIN capabilities. And that is different than all the other programs that can affect kind of how the traffic travels in the networks, like flow dissector, for instance, or XDP. And that was something that was missed during the reviews.
So I was trying to backtrack how it happened and I think that it was just that CAP_BPF got introduced in 5.9 and sk_lookup landed in 5.10 and we were still getting used to the whole idea of CAP_BPF, that now we just need CAP_BPF to run BPF programs, but also because the use case we were targeting is that you always use sockmap, and sockmaps are still guarded behind the CAP_NET_ADMIN capability. So it's... you won't be able to build an sk_lookup program that uses a sockmap without CAP_NET_ADMIN anyway. But that is something that was missed, definitely. I guess another thing that we're still working today to straighten out is that BPF programs get a context object which lists the input, and this context object is usually backed by some kernel objects, kernel data structures. And when you have a mismatch between the sizes of the fields in the BPF context and in the kernel data structures, that is just a recipe for problems because you have to be very careful about how you convert your access to the context and rewrite it to do loads from actual kernel memory. But that is it. I mean, I mean, it's, truth be told, it's still a very young feature. And I'm sure we'll find out with time if anything's seriously wrong with it. And yeah, I mean, if... We'll have to come back for a retrospective in five years. Let's see... Let's see if LWN covers us in the future, any major misdesigns. I think, I think we can pivot from here. So initially it was, as Jakub mentioned, designed to take the BPF code that we kind of customized, we could put different rules, customer data loaded somehow there, return the 2-tuple, but then we changed that to return the actual socket. And then the socket needs to be referenced from somewhere. Initially it was SOCKARRAY, then it was changed to use sockmap, which is another kind of kernel BPF magical abstraction that holds sockets.
And this sounds very nice on paper and it's very nice to kind of try to use for tests, but as it turns out, it's a big deal in production because, yeah, okay, I have a server running, how do I get its sockets? We quickly realized that if we do this via sockmap, then we need to put sockets from applications that are using systemd or some other way of launching, we need to handle the socket and put it into the sockmap so that then it can be referenced. The socket doesn't need to exist, it could be 127.0.0.1, doesn't matter. All that matters is that we have this, the abstraction kind of moved up to the BPF program. So we realized that, you know what, we actually will need some kind of full featured machinery to configure the logic there, to configure the sockets, to put the socket, maybe to report something, maybe to do some drops and maybe to do some logging. And we realized that this kind of userspace component will be far from trivial. And this is where Lorenz, you come in, right? Yeah, exactly. So I think as Jakub was developing the patch, you had written kind of a proof of concept tool that would kind of do all of these, that would allow you to set up kind of the userspace bits of sk_lookup. And when I was working on the blog post, the open source, kind of the version of that, I took your proof of concept and I kind of, together with Jakub, turned it into like a production-ready version of that, basically. When I looked at the code for this thing called Tubular, it actually turns out I think we have like 150 lines of kind of BPF code, if you will, and there's like a couple of thousand lines of just code that does all the finagling. So you're entirely right that the userspace part is actually quite kind of complicated. And I think it's useful to look at the goals that we had for Tubular. Like you, you've mentioned all the bits that are kind of tricky with our previous approach, like bind to prefix was a patch we had to maintain.
TPROXY is really hard to set up. It's really difficult to debug errors. We have to change the iptables firewall, all these kinds of things. And we wanted to fix that with Tubular. And I think one additional benefit that we haven't touched on is, by having something like sk_lookup, we could actually on the fly change addresses for a service, which for us is a really big deal because it means we can, for example, we could say, oh, Spectrum, there's a new customer, we have people that bring their own IP addresses. Somebody comes onto the Cloudflare network and says, Here's my IPs. I would like to have Spectrum run on this IP address. Before Tubular was a thing, it would mean that we kind of have to do some... change iptables and add this here and add this there, which could actually mean that we have to kind of do what's like a full edge reboot, if you will, reboot every single machine, because that's the only way we had of sometimes adding new addresses to services. And the key bit that Tubular enables is that the service is running and it's kind of oblivious to which addresses and ports it's running on. We can just use Tubular to feed addresses and ports into the service. So the service only has one kind of, or a set of listening sockets and it needs to give those to Tubular. And so does it give them to Tubular or does Tubular give it listening sockets? Yeah. So exactly, this is the problem that Marek was talking about. The usual case for integration with Tubular is that you'll have like a TCP socket that is, for example, bound to localhost. You could also have a UDP socket if that's something you have to deal with, it's a little bit more tricky as always, but it does work. And now this socket has to, Tubular has to get ahold of this somehow. And I think the first and kind of the thing that we built first was an integration with systemd socket activation.
And, for people that aren't familiar with it, what systemd socket activation does is, at a high level, you can tell systemd like these are the IPs and ports that I'm interested in and then systemd will kind of create a socket for you and then pass it to your application when you start it up. Right. And the reason to use socket activation is that you don't have downtime when you start the server. So historically when you restarted like HAProxy, you had the old daemon that was accepting sockets on port 80 and then you killed it and then a new daemon came in and it opened the socket and there was a gap. And for our services, it is not acceptable to have this gap. So all our services use some kind of trick to keep the socket alive beyond the lifetime of a single program. So socket activation, these days, is kind of the stock method of doing that. It's well recommended. It's a feature of systemd, so basically you delegate the feature, you ask systemd to create the socket for you and then systemd is actually passing that down onto the server. Yeah. And the cool thing about this is that systemd allows a socket to be shared. So multiple... you can configure another service that says, oh, I would like actually to get access to the Spectrum socket. That's something we're interested in and this is exactly what we used to kind of build the first integration of Tubular. So you can have, there's a Spectrum service that runs on each of our servers, and then we just add another service which kind of depends on the same socket unit that systemd has. And that way we can get access to the socket and kind of put it, insert it into this magic sockmap that we're talking about. And then from then on, it's like smooth sailing, - basically. We built this and it actually worked... - You almost sound surprised... You know, never say never... don't know what lurks. It worked.
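As a rough illustration of the socket activation handover described above, the following Python sketch shows how a service could pick up the sockets systemd passes to it. It follows the documented `sd_listen_fds` convention, where inherited descriptors start at 3 and are announced through the `LISTEN_PID` and `LISTEN_FDS` environment variables. The helper names `inherited_fds` and `inherited_sockets` are made up for this example, and this is not taken from Spectrum or Tubular.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited descriptor, per sd_listen_fds(3)


def inherited_fds(environ=None, pid=None):
    """Return the descriptor numbers systemd handed to this process."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    # LISTEN_PID must name us; otherwise the variables were inherited
    # from a parent and the descriptors are not meant for this process.
    if environ.get("LISTEN_PID") != str(pid):
        return []
    count = int(environ.get("LISTEN_FDS", "0"))
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def inherited_sockets():
    """Wrap each inherited descriptor in a socket object.

    socket.socket(fileno=...) detects family and type from the descriptor.
    """
    return [socket.socket(fileno=fd) for fd in inherited_fds()]
```

A second service that depends on the same socket unit would see the same listening socket through the same mechanism, which is the sharing trick mentioned above.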
The kind of problem with that is systemd socket activation doesn't come for free. You kind of have to, usually have to modify your application in some way to be aware that it exists, like, that it uses systemd socket activation. You have to change the config on our metals, and metals is slang for our servers, I guess. You kind of have to change configuration there as well. So really what it boiled down to is that initially, it wasn't possible to use Tubular without modifying the application. And that was always a big concern because we also run other software that isn't using systemd socket activation and having to tell the teams that maintain that software, Oh yeah, you should just kind of shell out to this thing or send a couple of messages here. It was difficult, actually. And did that have... kind of impact the rollout as well? So it didn't... It didn't initially because we... the first rollout was using, kind of targeting Spectrum because that's kind of where the need was most immediate and Spectrum was using socket activation. So we were good, but we knew that to push this further, we needed another way of doing it. And I think Jakub actually discovered the other way of doing it when he was doing, was it a talk for, I think the BPF online conference or something like that. I don't remember exactly. I think it was a demo for a BPF summit. Yeah, exactly. And he actually discovered, or I don't know where you found it, maybe you can tell us, that was kind of magical, this is actually a magical, magical syscall called pidfd_getfd. And what it allows you to do is just kind of, as a privileged process on a system, you can say, I would like to have a copy of the file descriptor of this other process.
And that is incredibly powerful because all of a sudden we can just use the system call to look through all of the file descriptors of any service on the system and just pick the one that we're interested in by saying, Oh, we want, you know, localhost port 1234, like Marek said, and then we can put that in the BPF socket map and everything works. And so does that even allow you to do like zero downtime rollouts of this where you can, you can have an existing program with normal sockets, then you start Tubular, it goes and sneakily finds all the sockets, changes everything in the background for you. Absolutely, yeah. And I think for Spectrum, we also did that. We did like a live rollout of this. We didn't do a reboot or anything. Right? So, so the problem of adding sockets from the user space to the BPF is, you're saying, it could be done in one of two ways. One is the systemd socket activation with the separate kind of listener on that socket. And the other one is, this is, the second method is the post hook, post run hook in systemd, right? Which just steals the socket, basically steals it. Right? Yeah, we don't call it stealing because... I borrow it. Borrow. Yeah, exactly. The original's still there. It's like downloading music, is that piracy, we don't know. Just highly debated. You wouldn't download a socket. You wouldn't download a socket. That's true. So that's kind of where we're at. I think those are the big things that we're... the big discoveries. Jakub and I worked very closely on kind of making, working... Well, making Tubular kind of a thing. It was a really intense process, I think, for the both of us. Like we started out with very diverging ideas of how it should work.
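The "borrowing" of a socket via pidfd_getfd, as discussed above, could look roughly like the Python sketch below. Treat it as a sketch under assumptions rather than how Tubular does it: Python has no wrapper for `pidfd_getfd`, so it is invoked through `ctypes` with syscall number 438 (assumed here for x86-64), the helper names `pidfd_getfd`, `is_listener_on` and `borrow_listener` are hypothetical, and the caller needs ptrace-level permission over the target process. It requires Linux 5.6 or newer and Python 3.9 or newer for `os.pidfd_open`.

```python
import ctypes
import os
import socket

SYS_pidfd_getfd = 438  # assumption: syscall number on x86-64

_libc = ctypes.CDLL(None, use_errno=True)


def pidfd_getfd(pidfd, target_fd):
    """Duplicate target_fd of the process behind pidfd into this process."""
    fd = _libc.syscall(SYS_pidfd_getfd, pidfd, target_fd, 0)
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def is_listener_on(sock, want_addr):
    """True if sock is a listening socket bound to want_addr."""
    try:
        accepting = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ACCEPTCONN)
        return bool(accepting) and sock.getsockname() == want_addr
    except OSError:
        return False


def borrow_listener(pid, want_addr):
    """Scan the descriptors of `pid` and return a copy of the listening
    socket bound to want_addr, or None. The original stays open in `pid`."""
    pidfd = os.pidfd_open(pid)
    try:
        for name in os.listdir(f"/proc/{pid}/fd"):
            try:
                link = os.readlink(f"/proc/{pid}/fd/{name}")
            except OSError:
                continue  # descriptor vanished while scanning
            if not link.startswith("socket:"):
                continue
            try:
                fd = pidfd_getfd(pidfd, int(name))
            except OSError:
                continue  # not permitted, or closed in the meantime
            try:
                sock = socket.socket(fileno=fd)
            except OSError:
                os.close(fd)
                continue
            if is_listener_on(sock, want_addr):
                return sock
            sock.close()
        return None
    finally:
        os.close(pidfd)
```

Something like `borrow_listener(pid, ("127.0.0.1", 1234))` would return a second handle to the service's listening socket. Inserting that handle into a sockmap is a separate step not shown here.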
Like for me, I came from a background where, when you need to do something privileged, you would have like a persistent daemon or something running that had the right privileges, and you know, it would do all of the finagling that we needed to do, and I think Jakub was much more coming at it from the angle, Well, we don't want that. We want the... If we have a central daemon that could crash, what happens when it crashes? How do we handle this? How would you reduce, you know, capabilities, all that kind of stuff? And in the blog post, I make it sound like from the start it was going to look the way it does. Of course that's not true. I think Jakub kind of wisely fought for it to be not a persistent process, but just kind of a command line tool that you can run. And then it does kind of... And the configuration is fetched from where? What is the source of the data for the configuration? So at Cloudflare, we have a thing called the addressing API and that kind of holds a lot of this stuff that we talked about like, oh, this is an IP address, what should be on this IP address? What services should go on there? Who owns this? etc., which is... So that's a separate system? Separate system, yeah. I see. So there's a separate team that manages this and what Tubular just does is that we take the information from the addressing API, we dump that into a config file and then there's some kind of orchestration that just uses Tubular, the kind of binary tool, and that's how you plug it together. And kind of at the end of the blog post, my teaser is that if you want to listen on another million IP addresses, it's just a POST request away, which, it's more involved, but theoretically, you can add something to this API which runs centrally. That information is kind of fed out to our servers and automatically you'll be able to kind of receive traffic, kind of mix and match stuff. I think we are running out of time.
So, very nice job. Yeah. Thanks, everyone, for joining. Thanks for tuning in. We hope you enjoyed this. Yeah, there's also, I think, tons of other cool parts about tubular like introspection and testing that we don't get to touch on. Yeah, definitely check out the three blog posts and the sources on GitHub. Thank you very much, everybody. Thank you for joining us. Bye. Great, thank you. Bye bye.

Low-Level Linux: Technical Deep Dives

The Linux ecosystem offers an amazing array of tools and capabilities — and Cloudflare's engineers are often pushing them to their limit, and beyond. Tune in for insights on how Cloudflare's team is extending Linux to help power its global network.
