Cloudflare TV

How to share IPv4 addresses by partitioning the port space

Presented by: Jakub Sitnicki, Marek Majkowski, Shawn Bohrer
Originally aired on December 11, 2023 @ 10:00 AM - 11:00 AM EST

Back in September, we gave a talk at Linux Plumbers 2022 conference on what challenges we face today when sharing an egress IP between Linux hosts, and how Linux networking API could be improved to facilitate this use case. Join Jakub, Marek, and Shawn for a recap of the talk, a discussion of the feedback we have collected so far and next steps.

Transcript (Beta)

Alright, hello everyone, welcome to Cloudflare TV segment on how to share IPv4 addresses by partitioning the port space.

My name is Jakub, and together with me here today are Marek from the technology team in Warsaw and Shawn from the DoS team in Austin.

So Marek, can you kick us off and tell us a little bit about the problem that we've been having and the situation in which we were and why we had the need to share IP addresses between hosts?

Hi, it's a very long story.

It started, I think, back in 2014 and then it had many, many, took many turns and many kind of different projects and many different avenues and we are finally here so we have this discussion.

So the, what we call right now the sharing of the port space is kind of about a limitation of Linux API or maybe even POSIX API kind of depending how you look at that.

And we first hit that limitation in 2014 when we were working on adding WebSockets to our NGINX, into our CDN, so that you could proxy, obviously, the WebSockets.

So while the code in NGINX existed for kind of a long time at that stage, we didn't really enable that and that's because we were afraid of long-living connections. Obviously WebSockets are kind of long-living, but historically NGINX and our infrastructure were much more about kind of short-lived HTTP things, short-lived connections.

So why does that matter?

Well, it matters because due to the kind of complexity of our networking setup, we are and we were using custom IPs to fetch content from the Internet.

So when requests come to our servers, there's a lot of magic happening and then at the end of that magic, sometimes it hits disk and sometimes it goes back to the origin to the Internet in which case we actually need to select the IPs like which one of the IPs should be used for, you know, to actually connect to the origin, to actually get the resource from the Internet.

And while it may sound simple, it's like, okay, you just do ip addr add and you put some flag somewhere in the config, it's much more complex than that.

The reason is that when you do choose the egressing IP, you are using slightly different path in the Linux kernel and that path was really uncharted territory at that stage.

So the issue that was happening was that, because of the API, you first do bind, in which you select the IP address that you want to kind of egress from, and then you do connect, which is kind of the different way around.

Normally you do bind and then you listen and you accept sockets.

So it's kind of the same bind but kind of different semantics at the end of that.

But because you're doing this bind before connect, the Linux kernel, and I guess any POSIX or BSD sockets compliant system, just cannot know, when you're doing bind, what it means.

Are you going to do connecting socket?

Are you going to do accepting socket? It has no idea because it's not at that stage yet.

That means that the port is being locked at that point.

It's like, okay, the kernel doesn't know what to do so it will try to give you the, well, the free port.

Which is fine. But then that means that if you're doing it over and over again and if the connections are kept alive for a long time, that means you will eventually run out of ports because each of the new kind of bind before connect will try to allocate a new port.

Which means that the total concurrent number of connections, total, like on the whole operating system, like using this trick, it will be like under 64,000.

Whatever, 30,000 depending how you configure that.

But basically it's a fixed resource and that was a big issue. So we, I spent a lot of time to kind of dig into the API, try to figure out, okay, what should I, what settings should I set so that it works.

I was able to get that and then half a year later there was a patch to the kernel that implemented that as a feature, and in the kernel it is called IP bind no port, correct me if I'm wrong.

Yeah, IP_BIND_ADDRESS_NO_PORT, yeah.
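
A minimal Python sketch of the difference discussed here, assuming a Linux kernel that supports the option (4.2 or newer). The loopback listener stands in for an origin server, and the helper names are made up for illustration. A plain bind with port 0 reserves a port immediately, while a bind with IP_BIND_ADDRESS_NO_PORT leaves the port unset until connect.

```python
import socket

# Linux value from <linux/in.h>; some Python builds do not export the name.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)


def port_after_bind(src_ip, delay_port_allocation):
    """Bind to src_ip with port 0 and report the local port right after bind()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if delay_port_allocation:
        s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
    s.bind((src_ip, 0))
    return s, s.getsockname()[1]


def demo():
    # A loopback listener stands in for the origin (illustration only).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    try:
        early, early_port = port_after_bind("127.0.0.1", False)
        late, late_port = port_after_bind("127.0.0.1", True)
        # With the option set, the kernel picks the port only now,
        # when it already knows the full four-tuple.
        late.connect(listener.getsockname())
        connected_port = late.getsockname()[1]
        early.close()
        late.close()
        return early_port, late_port, connected_port
    finally:
        listener.close()
```

On Linux, the first value is expected to be a real port, the second to be 0, and the third to be a real port again.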

Yeah, here we go. But if we can step back for a moment, why did you need to choose an egress IP?

I mean, all of our servers are obviously connected to the Internet.

Why couldn't we just connect to our origins from whichever default IP the machine has?

Yeah, right. So it goes even back into the history.

So we started with that. We started just having a single egressing IP from our servers.

That was fine. Everyone was happy, but soon we realized that we actually need to be kind of more careful.

I think there are two core reasons. One is our internal segmentation.

We have different kinds of our own traffic and it's very important for us to segment the traffic that is driven by customers, basically origin requests and traffic that is internal to us.

Like for example, our core infrastructure logs, right?

The service that received the logs should not be addressable in any way from the system, from the user kind of visible traffic, right?

The user should not be able to access those systems. And the simplest way to do that is by just IP segmentation.

So we had IPs for the internal stuff and we had the IPs for the kind of CDN traffic that is potentially not controlled by us.

It's a kind of a security feature, a segmentation feature.

But also there was one other aspect, which is our servers were doing quite a lot of traffic, each of them.

I mean, they're still doing, but compared to the numbers back in 2013 or 14, it was a kind of a large traffic from not a large number of our servers.

And that meant that from the origin side, how you saw is you saw a lot of traffic from a single IP address.

And there are many products, many middle boxes, and many kind of usual configuration that people did on their Apache servers on the origin.

Oh no, that's too much traffic from a single IP.

I'll kind of blacklist it. It's obviously malicious.

It was a big, big deal. That's why we added more IPs. So it was actually not a single IP as we were going to originally.

We were kind of hashing and we were selecting between, you know, I think four and five.

It changed after that because we added more and our infrastructure evolved, but yeah, basically there was a kind of need to select a specific egressing IP.

Okay. That's quite complicated, right?

Because the kernel, if you have more than one IP address on your interface, like one of them is the primary, and then you have some secondary addresses, the kernel will not help you use them in a round-robin fashion or in any other scheme of cycling through them, right?

Yeah, it was pure application logic.
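
What "pure application logic" could look like, as a rough Python sketch. The addresses are hypothetical documentation addresses, and neither the hash nor the function names come from the talk. The application picks the egress IP itself and would then bind the socket to it before connecting.

```python
import itertools
import zlib

# Hypothetical egress addresses from the documentation range (RFC 5737).
EGRESS_IPS = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]


def pick_by_hash(origin_host, egress_ips=EGRESS_IPS):
    """Map an origin to a stable egress IP by hashing its name."""
    index = zlib.crc32(origin_host.encode("utf-8")) % len(egress_ips)
    return egress_ips[index]


def round_robin(egress_ips=EGRESS_IPS):
    """Endless iterator that cycles through the egress IPs in order."""
    return itertools.cycle(egress_ips)
```

A caller would then do something like `sock.bind((pick_by_hash(host), 0))` before `sock.connect(...)`, which is exactly the bind-before-connect pattern discussed earlier.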

So yeah, that gets quite messy. But going back to the port thing.

So the port exhaustion wasn't a real issue. So we didn't have that issue before WebSockets and now once we enable WebSockets, we have no idea if we will have, you know, 10,000 concurrent connections or more.

And we wanted to support that.

So anyway, the IP_BIND_ADDRESS_NO_PORT setsockopt option is now the solution.

And that's awesome. It's very nice that it was covered by the kernel. And then fast forward a couple of years and we have the same issue with UDP.

A quick question there.

So the point, though, is to delay the port allocation to connect.

Is that what's happening there? Right. So that goes into the internals of the operating system.

Well, if you know how Linux works, then yes, then the kind of the early allocation I think is called in the kernel.

But obviously from a user point of view, from a developer point of view, you don't really want that.

You don't really care about the port allocation.

What you do care about is you just want to establish a connection and potentially share a source port.

And that's basically an unsupported feature in the kind of traditional POSIX API.

I see.

Right, right. The Sockets API just gives you a bind operation, right, for choosing your source IP and your source port potentially.

But at that time, the operating system doesn't know yet if it's going to be a listening socket or just an established socket, right?

So it cannot make assumptions whether it can share the source port with other established sockets, right?

Right. So I'll give an example to illustrate kind of the gravity of the issue.

So let's imagine that the ephemeral port range, the port range that Linux can use to source traffic from in the kind of in the direction, let's imagine the ephemeral port range is of size one.

So we have one port allocated for, allowed for the operating system to use.

If you do that, there's like a sysctl option in Linux. If you do that, you will probably not notice any issue at all.

Like, yeah, sure, your operating system, your machine will always egress from the port that you kind of gave it.

But then you will almost never hit a conflict.

The only case when you can have the conflict is when you're connecting to the same two-tuple.

Like if you're going to 8.8.8.8, port 53, obviously you can't have two concurrent connections because then the whole four-tuple cannot be reused.

Like that's kind of fundamental, that a four-tuple in networking must be unique.

But basically, unless you reuse that, you will not notice.

But that's different if you do this bind trick.

This bind trick locks the port and then this port cannot be reused for other use cases.

So basically, now you with kind of selecting the egress IP in a traditional API, you can just have one connection total.

Another connection will just fail because oh, no, no, no, no.

The port range is one, but that port is already occupied, right?

I cannot share it. Well, you can share it because it's a connected socket, but that's not how it works.

And that's exactly what this setsockopt option kind of solves.

So it delays the early allocation so that it means that basically the socket is treated as a proper connected socket without the conflicts.
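
The locking behavior from the thought experiment can be reproduced without touching the sysctl. In this small Python sketch (loopback only, hypothetical function name), two TCP sockets are bound to the same local IP and port with no sharing options set, and the second bind is refused even though neither socket has connected yet.

```python
import errno
import socket


def second_bind_errno(src_ip="127.0.0.1"):
    """Bind two TCP sockets to the same IP and port; return the errno of the second bind."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((src_ip, 0))          # kernel picks a free port and reserves it
        port = first.getsockname()[1]
        try:
            second.bind((src_ip, port))  # same port, no sharing options set
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()
```

The returned value is expected to be `errno.EADDRINUSE`: the port is locked at bind time, before the kernel knows the destination.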

So the solution is correct. And it's nice and it's working well for TCP. And everyone is using that these days.

And it's kind of solved the problem. That's basically it.

Yeah, it should be the default now. Well, that's this kind of bind dance.

If you know that you're going to connect after bind, obviously that's how it should work.

But at the bind state, you don't know. It's basically an API issue, right?

When API was created, no one ever thought that, okay, there might be actually a select egress IP because there might be more than one.

That was kind of unusual.

Right. Well, 64,000 ports should be enough, right? Right. And then also one IP per machine should be enough.

So what the IP identifies, is it a machine or is it an interface or is it a VLAN?

So that goes to the way back of the Internet development.

Okay. So let's kind of roll forward. So now we hit exactly the same issue for UDP with a different project, with a kind of different use case.

So maybe I'll focus on that use case for a moment.

So I mentioned that early on, we had this traffic segmentation problem, and it seems that this kind of traffic segmentation issue comes back and back and hits us kind of more and more.

So we had some IPs for egressing the CDN traffic, the HTTP requests.

And then at some point later, we went and we had a WARP product, the kind of VPN WireGuard thing, where users can connect to our edge and basically forward traffic, and then we can do, again, speed it up, and we can add other higher level features to that.

And that means that we are basically forwarding untrusted traffic.

And also, that cannot, should not egress from our CDN traffic, from the CDN IPs.

We don't want to pollute them with potentially low scores or some malicious traffic.

We basically don't know what the WARP traffic is, so that should be separated from the other services.

So that means that we added more and more IPs. And while we were okay with adding the kind of WARP IPs for egressing, it turns out we need to add even more because they actually need to be country-located.

So in the Internet, there are many services that look at that kind of resolve your IP to like where are we coming from, what country are we coming from.

And that's normally not an issue, but in our case, in our design, because we are using Anycast on ingress, that means that we cannot steer customers to the data centers that they could egress from.

So again, kind of simplifying, imagine you have, I don't know, 250 countries in the world, and then imagine you have 250 data centers.

You could imagine that you forward people from one country to one data center, like obviously the UK will go to London, and then I don't know, and Germany will go to Frankfurt or Berlin.

It's like the model can be quite simplistic, but we cannot do that.

Our infrastructure is working on Anycast, which means that the traffic from any person or IP on the Internet can go to any data center.

That's a good thing. That's a feature.

We like that. It gives us many kind of good positive things, but it also means that German traffic can totally end up in London and UK traffic can totally end up in Frankfurt, and then we need to deal with that because, okay, so now we have London user, UK user in Frankfurt, and the user wants to watch the BBC, and BBC is doing geofencing, is looking, ha, you're coming from Frankfurt, so obviously we'll not give you content, but the user is not really coming from Frankfurt.

The user is really in London, and we can really attest to that, so we need to solve that.

We need to add the country IPs, kind of WARP egress, and that was a big issue because we just ran out of IP space.

We just don't have enough IPs to add to every server, every IP possible, so our solution was kind of creative.

We had to change the problem to fit our engineering constraints, and also not use too many IPs, and the solution we derived is basically sharing the port space.

It's not the first initiative to do that. There are kind of many different hacks around over the years to share the port space, but I think we are the first one doing that on a serious scale, not on a kind of small, small, small hobby scale, so basically what we are doing is we are selecting, we have some IP addresses which have this proper geolocation tag, and then those IP addresses are shared inside the data centers across many machines, and that means that we also only need, well, we need less IPs.

Basically, we don't need all the problem space, all the countries for each machine.

We just need kind of all the countries for each data center, which is still a lot, but then we can kind of do the optimizations over that.

We can do some traffic forwarding, so basically our infrastructure right now allows us to do that in a quite dynamic way, but the kind of Linux API problem remains, which is, okay, so we have these connections flowing through our servers.

There's traffic that needs to egress from a selected IP and port range, and it turns out that on UDP it's basically very, very, very hard.

But wait, wait, like this is not a new problem, right, sharing the IP.

My home router has a public IP assigned, and I have, I don't know, five, seven devices sharing that IP, so why couldn't we do it like that?

Well, your home router is not doing 100 gigabits per second.

That's for one. Right, so people did solve the problem of sharing an IP across machines in the past.

There are different types of NATs.

Basically, it's a source NAT issue, right, so you egress from whatever, and then there's a central party like a router that can do the source NAT in that place.

So the issue is that the technologies we would be interested in don't exist.

So first, we don't want to have a single point of failure, so we don't really want a single router that owns all the state.

Second is, even if that kind of router existed, or it could be duplicated or something, we'd have some kind of fancy setup.

Then we would need to communicate with the router, oh, by the way, this traffic should egress from that IP, and we actually change that over time.

It's not a fixed thing that you hard-code in the router and you kind of move on.

We actually modify and change it. So it would mean that basically the router will need to be part of the kind of application scope, which is kind of not perfect.

And then the other option is just to do a distributed NAT, which is kind of what we are doing, but we are doing it in a stateless way.

So basically, you can think about kind of our solution as a stateless distributed NAT.

We basically pre-allocate port slices, ranges of ports to each machine.

So machine one can have ports from 2,000 to 3,000, machine two can have ports from 3,000 to 4,000, and so on.

So it's pretty simple.

It's easy to explain. There's no shared state. It's just a configuration in a single place.

We also have the advantage of having the infrastructure to do this kind of routing on the ingress, which is our load balancer called Unimog.

And it was actually a fairly easy addition to Unimog to just tell it, by the way, if you see packets from that port range, always forward it to that particular machine because that machine is responsible for that.

So it was actually fairly kind of fitting our needs.
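
The slicing scheme can be sketched in a few lines of Python. The start port and slice size follow the example figures mentioned above; the function names and the fixed slice size are assumptions made for illustration, not how Unimog is configured. `port_slice` is the egress side (which ports a machine may use) and `owner_of` is the ingress side (which machine gets the return traffic).

```python
FIRST_PORT = 2000   # start of the shared range, as in the example above
SLICE_SIZE = 1000   # ports per machine (hypothetical fixed size)


def port_slice(machine_index):
    """Half-open port range [lo, hi) owned by a machine (index starts at 0)."""
    lo = FIRST_PORT + machine_index * SLICE_SIZE
    hi = lo + SLICE_SIZE
    if hi > 65536:
        raise ValueError("no ports left for machine %d" % machine_index)
    return lo, hi


def owner_of(port, machine_count):
    """Machine index that should receive return traffic for this port, or None."""
    if port < FIRST_PORT:
        return None
    index = (port - FIRST_PORT) // SLICE_SIZE
    return index if index < machine_count else None
```

Using half-open ranges avoids the ambiguity of port 3,000 belonging to both machine one and machine two. The only shared state is the pair of constants, which matches the "just a configuration" point.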

Right. So let's go ahead. Yeah. So I guess the important part is you still need to steer traffic now, not just solely based on an IP address, but based on an IP and port.

And that sort of needs to be done somewhat upstream.

And we do that with our layer four load balancer that runs on every machine.

So every machine is sort of responsible of receiving that initial traffic and then potentially forwarding it to the correct machine in that data center.

That's correct? Right. Right. So we are in a lucky point that we have. We own the load balancer, so we can do the L3 logic as we wish.

Yes, the load balancer is distributed.

There isn't a single point of failure. And the state is just the configuration of who owns what ports.

But you can think about this kind of problem in a kind of broader way.

So as you know, on ECMP, like the equal cost multipath that BGP routers can do, there is an algorithm that basically the router is using to, OK, I'll use this port and forward it on that path.

Right. So we, as also many other companies, we use the ECMP for our ingress and anycast traffic.

And you could, in theory, reverse the algorithm.

And you could also encode that on your setup.

So if you knew that you had your, whatever, 16 machines that were responsible for that IP, and then if you knew the algorithm, the ECMP algorithm, you could totally do the egressing traffic and say, by the way, this IP, this port, I selected the source port so that the four-tuple will go back to me.
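
A rough Python sketch of that idea. The hash here is a CRC32 stand-in chosen purely for illustration; real routers use vendor-specific ECMP hash functions, so `path_index` is an assumption, as are the function names. The point is only the search: try source ports until the return packet's four-tuple maps to this machine.

```python
import ipaddress
import struct
import zlib


def path_index(src_ip, src_port, dst_ip, dst_port, n_machines):
    """Stand-in for a router's ECMP hash (NOT a real router algorithm)."""
    key = struct.pack(
        "!IHIH",
        int(ipaddress.IPv4Address(src_ip)), src_port,
        int(ipaddress.IPv4Address(dst_ip)), dst_port,
    )
    return zlib.crc32(key) % n_machines


def pick_source_port(my_index, n_machines, local_ip, remote_ip, remote_port,
                     ports=range(1024, 65536)):
    """Find a source port whose return traffic would be routed to my_index."""
    for port in ports:
        # Return traffic has the remote end as source and us as destination.
        if path_index(remote_ip, remote_port, local_ip, port, n_machines) == my_index:
            return port
    return None
```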

That's fine. So yes, we are using our internal load balancer, but there are many other ways of doing that.

And there's more. So the other use case that this kind of port slicing could be useful in is just container segmentation, container traffic segmentation without using NAT.

Sometimes you don't want to use NAT on Linux, although NAT on Linux with conntrack is awesome.

But if you didn't want to use that, then you could use similar tricks.

You could configure your local Linux to own an IP address and then say, those ports should go to that namespace.

These ports should go to that namespace.

That actually would work quite well. So back to the UDP.

So while for TCP selecting the egress kind of port and sharing the tuple that we started this discussion from, that was kind of solved problem and easy.

For UDP, it turned out to be a big, big, big can of worms. And I think about a year ago, I did a blog post on the Cloudflare blog about connectx.

So Jakub, if you can, you can show it.

Which was basically explaining my kind of low level hacks on if you want to select the source port from and use the egressing UDP connection, then this is how you do that.

And it goes into much detail and kind of uses many pretty hardcore kind of quirks in Linux.

But the point is that UDP works completely differently than TCP.

So in UDP, due to the way it kind of behaves, you can totally have a four-tuple socket that is conflicting with another one.

And it will work fine.

Linux will be happy. Everyone will be happy. Semantically, this is completely allowed and in some cases, sensible.

So, again, back to the kind of BSD socket API: in some cases, like, for example, in multicast, it makes sense to share the same kind of two-tuple or four-tuple across different sockets.

So the semantics of Linux allow that. While for us, it would mean that we have big, big issues in establishing new connected UDP sockets.
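
A minimal Python sketch of that behavior, assuming Linux and using loopback addresses as stand-ins for the real endpoints (the helper names are made up). Two UDP sockets that both set SO_REUSEADDR bind to the same local address and connect to the same remote address, and neither call reports an error.

```python
import socket


def connected_udp(local, remote):
    """Create a UDP socket bound to `local` and connected to `remote`."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(local)
    s.connect(remote)
    return s


def duplicate_udp_four_tuple():
    """Return True if two connected UDP sockets end up with the same four-tuple."""
    # A loopback socket stands in for the remote peer.
    peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer.bind(("127.0.0.1", 0))
    remote = peer.getsockname()
    a = connected_udp(("127.0.0.1", 0), remote)
    local = a.getsockname()
    # Same local IP and port, same remote: the kernel does not object.
    b = connected_udp(local, remote)
    same = (a.getsockname(), a.getpeername()) == (b.getsockname(), b.getpeername())
    for s in (a, b, peer):
        s.close()
    return same
```

Nothing in the return values of bind or connect tells the second socket that the four-tuple was already taken, which is the problem described here.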

So this goes into details how you can solve that.

And at the end, it's kind of a solution which is using pretty obscure features.

But the point is it's basically very hard.

And it shouldn't be that hard. Like, it should be you should be able to do basically the same what we're doing for TCP.

Easier for UDP. But that's, yeah, just hard.

Right. The UDP sockets API wasn't really built with sharing the source port in mind, right?

And detecting when there's a conflict on the local IP plus port, right?

Right. So it's basically impossible to see if the two-tuple or four-tuple is already occupied.

So this is kind of where we started.

This is our early implementation. And it's kind of working fine.

Although the issue is that this API is not really multi-tenant. So if you have cooperative tenants, like different applications using this trick, it should work fine.

But it took me a long time to kind of figure it out so that they can actually share it.

So this is kind of where you start, right?

So seeing my hacks, you looked at that and like there must be a better way.

And this is our submission to the conference, which is: maybe there is a way to implement basically the same tricks for UDP, to be able to detect the four-tuple conflict in UDP in a pretty coherent way.

Right.

Well, to take one step back is why do you need to detect the four-tuple conflict, right?

But then that's needed because if you want to egress from just a certain port range, you also cannot ask today the operating system to pick a free port for you from a given range, right?

I mean, you can do that globally by changing a system global configuration, but you can't do it per application, per cgroup or per socket.

So the way we work around that today is we select the port manually ourselves, the local port, but then we also want to share this local port with other flows, outgoing flows, so that we can have more than one connection or UDP flow from the same port.

Right. So I like to look at that as kind of two separate issues.

So one issue is if you know the source port, let's say that the ephemeral port range is of size one.

So you basically know the source port, right? If you know the source port and you want to establish a connection to the destination, you'd call some kind of connect. The syscall is called connectx on the Mac, which is kind of nice.

That's why we stole the name. So you basically put the four-tuple in the connectx and you fire and everyone is happy, but then the issue is that if you have an overlapping connection already, the four-tuple is already occupied, you want to see some kind of error.

So that's completely trivial in TCP, maybe not completely, but if you know what you're doing, it's easily doable in TCP and it's basically impossible in UDP.

So that's the first issue. It's like if you know the four-tuple, you need to detect the issue, the overlapping connection issue, and in UDP that's just not supported, that just doesn't work like that.

The other issue is, okay, so if you are actually selecting, if you actually know your port range, you can potentially defer the source port choosing to the kernel.

And it's slightly faster if you do that in the kernel, but I would say it's a separate issue.

So it would be nice to support both because for our use case, it will play nice, but I would be just happy with the conflict detection.

Right, yeah, correct.

TCP will not let you create two sockets that have the same four-tuple, and hence there's no conflict over which socket receives any return traffic that comes into the metal, right?

Whereas with UDP, like we said when we were talking about the bind syscall, you either reserve exclusively the source port for yourself, and then you have a problem of running out of ports, or you tell the operating system that you would like to share this port, but then there is no way to detect that you are conflicting with any other socket that is using the same local address, but perhaps a different remote address, and to detect that, we had to resort to some hacks involving Netlink and manual querying of what sockets exist today.
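
The TCP side of this can be sketched in Python, assuming Linux and a loopback listener as a stand-in for the remote server (helper names are hypothetical). Both sockets set SO_REUSEADDR and bind to the same explicit local port, which the kernel allows, but the second connect to the same destination is expected to be refused.

```python
import errno
import socket


def tcp_from(local, remote):
    """Open a TCP connection to `remote` from an explicit local address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        s.bind(local)
        s.connect(remote)
    except OSError:
        s.close()
        raise
    return s


def tcp_conflict_errno():
    """Return the errno raised when a second socket reuses an occupied four-tuple."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)
    remote = listener.getsockname()
    # Find a currently free port to use as the shared source port.
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))
    local = probe.getsockname()
    probe.close()
    first = tcp_from(local, remote)
    try:
        try:
            second = tcp_from(local, remote)  # same four-tuple as `first`
        except OSError as exc:
            return exc.errno
        second.close()
        return 0
    finally:
        first.close()
        listener.close()
```

On Linux the result is expected to be `errno.EADDRNOTAVAIL`, so an application that picks its own source port can treat that error as "four-tuple taken, try another port".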

Right, exactly. So, yes, so if you don't need to reuse the source port and have this kind of total number of concurrent connections is very small, that's fine.

Then Linux does support that. You basically lock one port, the source port, and you're done.

But we cannot afford that. We need much better. In that case, we do need the conflict detection.

Yeah, yeah. All right. Yeah, so like you said, the problem is kind of twofold.

One is detecting the conflict, and I'll talk about that in a little bit.

But first, we can attack the simpler problem of just making the operating system find a suitable egress port for you, but within a given range, right?

So, let me bring up the slides. There we go.

All right.

So, yeah, like we mentioned, it's possible today to tell the operating system to open connections from just a selected range of ports, and at least in TCP, you will be able to share the local source port when you set the IP_BIND_ADDRESS_NO_PORT socket option.

But the biggest downside is that the port range configuration is global, so you affect all the services, all the applications running in the same network namespace as your process, which in our case is a problem because we don't put every service in its own network namespace.

Right, so this is basically who is the owner of the ephemeral port range, right?

Is that the operating system, is that a namespace, or is that a process?

And I think we treat it as kind of an application feature, as opposed to a kind of higher level feature.

Yeah, yeah.

So, that's one way to look at it, and I look at it a bit differently. Like, you tell the operating system, hey, I want a port from the ephemeral port range, but I would also like to hint within which lower and upper bound we want the port, right?

And the system in the end has to check if there are any free ports there, and yeah.

So, the first idea that applies to both TCP and UDP for an API improvement is that we could perhaps introduce another socket option through which we would be able to give this hint to the operating system that we would like our socket to use a local port, but the port should be within some given range, not within the global range.

And of course, we can only limit down, narrow down the range.

We can't go out of the network namespace ephemeral port range. And if we could do this together with the late bind, the IP_BIND_ADDRESS_NO_PORT socket option, it turns out that would be everything we need in order to easily share the source ports for TCP.

Nothing else is needed. So, we took a stab at that, at implementing that, and turns out it's not so complex to do.

It's actually pretty, the patch, the proof of concept patch was pretty small.

So, we've posted it to NetDev for comments.

The biggest downside, the biggest change there is that we need more state in the socket, and that's always a bummer.

These structures are already quite heavy, and you need to allocate a socket for each outgoing connection.

So, we try to not put new stuff in there. But if we could have two more fields, u16 fields there, then we can use that state, which is configured through the socket option that I showed earlier, at the time when we are looking for a free ephemeral port.

And instead of doing what we do today, which is checking the whole ephemeral port range as configured by your sysctl, we could clamp down the range which we search through to the values set for this particular socket.
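
The clamping rule can be modeled in a few lines of Python. This is an illustration of the logic as described, not the kernel patch: the function name is hypothetical, and raising an error for a range that falls completely outside the global one is an assumption. The per-socket hint can only narrow the namespace-wide range, never widen it.

```python
def clamp_port_range(global_range, socket_range=None):
    """Return the inclusive (lo, hi) range to search for a free local port."""
    g_lo, g_hi = global_range
    if socket_range is None:
        # No per-socket hint: search the whole namespace-wide range.
        return g_lo, g_hi
    s_lo, s_hi = socket_range
    lo, hi = max(g_lo, s_lo), min(g_hi, s_hi)
    if lo > hi:
        raise ValueError("per-socket range lies outside the global range")
    return lo, hi
```

For example, with the Linux default range of 32768 to 60999, a socket hint of 40000 to 40999 is used as is, while a hint of 1024 to 33000 is narrowed to 32768 to 33000.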

Sounds obvious and nice. Do we have a chance to push that to the kernel?

Yeah, I mean, we haven't seen any vetoes or negative feedback for the RFC patch set.

And when talking with folks at Linux Plumbers, I also didn't hear any objections to that.

So, the plan is actually to make this patch complete, like merge-worthy, because that still needs to be extended with support for IPv6.

I've only done v4 changes for now.

And submit it as v1 for actual review.

But let me guess, the UDP part is harder, right?

Yeah, that's the thing. I mean, even if we do that, that doesn't solve our problem because of the other part of a problem, which you were talking about that, hey, there are two problems here.

And one of them is detecting the conflict. And for UDP, even if we are able to configure the ephemeral port range per socket, that still creates a problem because we still don't get information back from the server if there was a conflict with another socket, right?

And you will be allowed today, with the SO_REUSEADDR socket option, to easily create two sockets bound to the same four-tuple.

So, let me ask a quick question here. This is UDP.

UDP is inherently a connectionless protocol. So, why call connect at all? Why not just use sendto and send to whoever you need to send to, and then maybe do the handling of the return traffic in user space when you get it back?

Yeah, that's a totally valid option, I would say.

And then you don't need to worry about sharing the source ports between connected sockets.

I would say it's about choosing your design for your network server, right?

If you want to have multiplexing logic on egress in your application, having to choose the right socket bound to the port which you want to egress from in your application logic, and then do the same on ingress when you need to find the context or the threads to process an incoming packet.

So, connected sockets, they kind of follow the same programming model as TCP. So if you need to handle both TCP and UDP traffic, like say you have a web server and it needs to handle both HTTP 1 and 2, and then HTTP 3 over QUIC and UDP traffic, you can have just one processing path for both.

If you can have the same model for both TCP and UDP, where there's a socket per outgoing connection.

Right, it's nice to be able to share the application logic and not have to have explicit UDP versus TCP paths.
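
The unconnected alternative raised in the question can be sketched like this in Python. Loopback sockets stand in for two remote servers, and the function name and payloads are made up for illustration. One socket sends to many destinations with sendto, and the application itself has to match each reply to its sender.

```python
import socket


def unconnected_exchange():
    """Send from one unconnected UDP socket to two peers and demux the replies."""
    # Two loopback sockets stand in for remote UDP servers.
    servers = []
    for _ in range(2):
        srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        srv.bind(("127.0.0.1", 0))
        srv.settimeout(2.0)
        servers.append(srv)

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.bind(("127.0.0.1", 0))
    client.settimeout(2.0)
    try:
        # One socket, one source port, many destinations.
        for i, srv in enumerate(servers):
            client.sendto(b"ping %d" % i, srv.getsockname())
        # Each stand-in server echoes a reply to whoever sent the request.
        for srv in servers:
            data, addr = srv.recvfrom(2048)
            srv.sendto(data.replace(b"ping", b"pong"), addr)
        # Demultiplexing is the application's job: key replies by sender.
        replies = {}
        for _ in servers:
            data, addr = client.recvfrom(2048)
            replies[addr] = data
        expected = {srv.getsockname(): b"pong %d" % i
                    for i, srv in enumerate(servers)}
        return replies, expected
    finally:
        client.close()
        for srv in servers:
            srv.close()
```

The `replies` dictionary is the user-space multiplexing logic that connected sockets would otherwise let the kernel do.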

I'll take a stab at this as well, if I may, Jakub.

Yeah, sure, sure. I was just going to say, but of course it's not the whole story, but you go ahead first.

Right, so this is a completely valid question.

Many kernel developers and many people ask, okay, but UDP, obviously you can do it differently.

So, in UDP, there are two different worlds. One is the egressing world, in which you're connecting to the Internet.

In that world, you do want to have a unique source port, because this may be a security feature.

For example, for DNS, you're doing a request and response.

Sure, you can have one unconnected socket and do requests to multiple DNS servers, but that's a vulnerability, because then you're revealing your source port, which is 16 bits of your entropy.

So, basically, the assumption is that for egressing, a lot of times you want to have a pretty random source port, right?

Some degree of randomness, but basically, you don't want to reuse that.

This is actually a fairly big deal.

And this is also what many people are doing. On egressing, you rarely use unconnected sockets.

You usually use connected sockets. That's just an easier programming model.

It's nicer. It's easier to keep state. It's just, kind of, everyone understands that.

That's basically, that's how you use UDP sockets. On ingress, though, it's different.

On ingress, though, people mostly use unconnected sockets, and there's a lot of pushback, like, why would you use connected sockets on ingress?

That's very hard. I wrote a blog post about that some time ago when I, kind of, said that you can do that.

This is the hack you're doing.

You can use connected sockets on the ingress. It's not working perfectly due to limitations that we'll discuss in a moment, I guess, but the point is it also has benefits.

The connected sockets on ingress also have benefits, and that's for restarting servers and, kind of, copying the same logic that you have in other, like, TCP servers.

So, for QUIC, QUIC may not be a perfect example, but let's say for QUIC, which is a UDP-based protocol, if you have a connected socket, it's, for example, way easier to handle restarts because there's state.

There's this file descriptor, so you may keep on using this file descriptor while your server is being restarted, and the new server may get the inbound, kind of, unconnected packets that don't belong to any existing connection.

So, it's basically very similar to the, kind of, accept flow in TCP, and it's very useful.

For some services, it's very useful.

For others, not so much. For example, for DNS, it makes no sense.

DNS is fire and forget, sorry, request and response, so, yeah, it makes no sense to use a connected socket there. But basically, connected sockets exist, and we might as well use them, and for egress, they are, kind of, obvious.

For ingress, they're not obvious, but they have their place, so why not support them?

Why not make sure it works nice? Right, so I guess that leads to the, how do we figure out the conflicts, right?

Right, but I wanted to touch on what Marek said, that connected sockets on ingress obviously don't follow the TCP listen and accept semantics.

One interesting bit of feedback that we got from the talk, from one of the network maintainers, was that, well, actually, there's nothing standing in the way of, perhaps in the future, UDP supporting the listen and accept semantics.

That would be something. That would be crazy, no?

Accept on UDP, that would be very interesting. But this is exactly the, kind of, issue we have with this project and, kind of, talking about UDP.

UDP is just not used in a way it could be used. It's underutilized. It was not really used for heavy traffic for years and years, and now, it's becoming more and more useful.

With WireGuard, with QUIC, with gaming protocols, it's becoming, kind of, a big, big, big deal.

So, we are now discovering all these API limitations, and they happen both on the egressing side, they happen on the ingressing side, there's performance issues.

So, it's very interesting that these problems, kind of, overlap, and I think they merge in a similar, kind of, place.

Yeah. Yeah. All right. Let's go back to the conflicts. So, I, kind of, let me recap.

So, what's the status quo today? If we let the operating system choose the source port for us, then whether we set the late-bind option (IP_BIND_ADDRESS_NO_PORT) or not, the operating system will just reserve the local port exclusively for this particular socket.

We won't be able to share it with any other socket. On the other hand, if we ask the system to share the local address with the SO_REUSEADDR option, then we won't get any information back that we have just created two sockets which have the same local and remote address, and hence, there's a conflict.

If traffic comes back in, there's a question of which unicast socket should receive this packet and process it, right?

So, that's the second problem that we've been trying to tackle.
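The silent conflict described above can be reproduced with a short Python sketch. It assumes a Linux host, where two UDP sockets that both set SO_REUSEADDR may bind the same unicast address; the function name is made up for the example. It creates two sockets with an identical local and remote address and reports whether the system raised any error along the way.

```python
import socket
import sys

def connect_twice_same_four_tuple():
    # A local peer to connect to, so the example stays on loopback.
    remote = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    remote.bind(("127.0.0.1", 0))
    remote_addr = remote.getsockname()

    def make_socket():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        return s

    first = make_socket()
    first.bind(("127.0.0.1", 0))        # let the system pick the port
    local_addr = first.getsockname()
    first.connect(remote_addr)

    second = make_socket()
    try:
        second.bind(local_addr)         # same local IP and port
        second.connect(remote_addr)     # same remote IP and port
        conflict_reported = False
        same_four_tuple = (
            second.getsockname() == first.getsockname()
            and second.getpeername() == first.getpeername()
        )
    except OSError:
        conflict_reported = True
        same_four_tuple = False

    for s in (remote, first, second):
        s.close()
    return conflict_reported, same_four_tuple
```

On Linux this is expected to return `(False, True)`: both sockets end up with the same 4-tuple and nothing tells the application about it. Other operating systems may behave differently.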

And so, to understand what solution we came up with, let's quickly look at how connect actually allocates the port, so how the late bind works for UDP today.

So, when you call connect on a UDP socket, and your socket happens to not have a source port chosen yet, so you went for the ephemeral port, zero, and let the operating system choose it, the operating system will then try to find a local port, which is the same procedure as would happen if you called bind without setting the IP_BIND_ADDRESS_NO_PORT option.

Now, inside that protocol operation for getting a free local port, we have a couple of branches for processing, and we'll hit the first one, which is concerned with finding a local port when we haven't specified any preferred port.

And once we do that, we need to publish the binding, right?

So, the reason there is no way to make the system report back a conflict lies in the details of how this function works.

So, today, when we ask the system to pick a local port, it's gonna pick a number at random and look up that port number in the UDP sockets hash table.

We're then gonna walk through all the sockets that are in this hash table entry and check which ports are already taken, right?

And confront that with the port that we have just chosen.

So, in other words, when looking for a free source port that we can use, we only consider the local addresses of all the sockets that already exist.

So, you can kind of see the problem here, that we completely ignore the destination address of connected UDP sockets, right?

So, there's no way to report back whether there's a conflict with a connected UDP socket with which we could potentially share a source address.
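The search described above can be modelled with a toy sketch in Python. This is a simplified illustration, not kernel code: the class and method names are made up, the table is a plain list of chains keyed on the local port, and wildcard addresses are ignored. It shows that the check only looks at the local address and port, so a port held by a socket connected to a different destination still counts as taken.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Sock:
    local_ip: str
    local_port: int
    remote: Optional[Tuple[str, int]] = None  # None means unconnected

class UdpPortTable:
    """Toy stand-in for the UDP socket hash table, keyed on local port only."""

    def __init__(self, nslots: int = 8):
        self.slots: List[List[Sock]] = [[] for _ in range(nslots)]

    def add(self, sock: Sock) -> None:
        self.slots[sock.local_port % len(self.slots)].append(sock)

    def port_in_use(self, local_ip: str, port: int) -> bool:
        # Walk one hash chain; the remote address is never consulted.
        for other in self.slots[port % len(self.slots)]:
            if other.local_port == port and other.local_ip == local_ip:
                return True
        return False

    def pick_port(self, local_ip: str, low: int, high: int,
                  rng: random.Random) -> Optional[int]:
        # Start at a random port in [low, high] and probe each port once.
        count = high - low + 1
        first = rng.randrange(count)
        for i in range(count):
            port = low + (first + i) % count
            if not self.port_in_use(local_ip, port):
                return port
        return None  # every port in the range is taken
```

For example, if ports 5000 and 5001 on `192.0.2.1` are each held by a socket connected to some destination, `pick_port("192.0.2.1", 5000, 5001, rng)` returns `None` even when the new socket wants to talk to a completely different destination.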

So, the idea we had is that, hey, we could maybe reuse all that logic, which is pretty complex and it's battle tested.

So, it would be nice to not make the same mistakes by writing something from scratch.

But instead of checking for any existing sockets which might conflict with us in the way we do today, we will have some other predicate function.

And that predicate function will, in addition to looking at the local address, will also consider the remote address, right?

So, if we know that there's no other socket that is connected to the same remote address as we are connecting to, using the same source address as we picked, then we can share with whoever else is bound to this local address.
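A minimal sketch of such a predicate, in Python, could look as follows. The names `four_tuple_conflict` and `can_share` are hypothetical and the socket representation is simplified; it assumes every socket involved has opted into sharing the local address.

```python
from typing import Iterable, NamedTuple, Optional, Tuple

Addr = Tuple[str, int]

class Sock(NamedTuple):
    local: Addr
    remote: Optional[Addr]  # None for an unconnected socket

def four_tuple_conflict(existing: Sock, local: Addr, remote: Addr) -> bool:
    # A conflict needs both the local and the remote address to match.
    return existing.local == local and existing.remote == remote

def can_share(sockets: Iterable[Sock], local: Addr, remote: Addr) -> bool:
    # The local address can be shared as long as no socket owns the same 4-tuple.
    return not any(four_tuple_conflict(s, local, remote) for s in sockets)
```

Under this predicate an unconnected socket, or a socket connected elsewhere, does not block sharing the same local address; only an exact 4-tuple match does.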

Right.

So, how do we hook it up into the kernel? We can't modify the existing get_port operation because that is also used by the bind syscall, right?

So, we can't just change that behavior, and it would be cumbersome to change the signature of the get_port operation, because today it doesn't take the address we are connecting to as an argument, because it doesn't need to.

So, we need a new operation. But fortunately, we don't need to introduce a new protocol operation.

There is one already that we can reuse, and it's called bind_add, which seems kind of appropriate, and which is used by SCTP today and not by UDP or TCP.

And it happens to take a socket and an address, which can be the destination address in our case.

So, once we have this new operation, with which we'll be able to find a local port that we can share with other sockets without running into 4-tuple conflicts, we have to glue it into our connect syscall, right?

So, that's how it looks today.

And we can imagine that we can make this logic a bit more complex and we can have two branches.

So, one for the existing case where we just want to allocate the port as usual, and the other for when we want to share the local source address as long as the destination is unique, on some conditions, right?

There's a question then, when do we enter this new port allocation logic?

So, when can we call this?

What would be the criteria for entering this branch? Well, for sure, we want to call this only when we know the source IP address, right?

We are not interested in wildcard sockets, right?

Because we need to have the full 4-tuple in order to check for a conflict.

And then we have to kind of make it backwards compatible by some combination of flags and possibly we don't need to introduce another flag.

And why is that? Well, that's because as it happens today, if you set the SO_REUSEADDR socket option on your socket, then IP_BIND_ADDRESS_NO_PORT has no effect on what the system will report back.

And you will always be able to bind the socket and connect it, creating a conflict.

It doesn't matter if you set the IP_BIND_ADDRESS_NO_PORT socket option.

So, what we would be proposing is that we assign a meaning to using these two socket options together, so that when both are set, you get source port sharing, which is what SO_REUSEADDR kind of conveys, but you also get a conflict report. Or rather, you get a local address assigned, but only when there's no conflict with any other socket that connects to the same destination, right?

And in the extreme case, when we run out of free ports, and we cannot find a local source port number for the socket, because all of them are taken by sockets connected to the same destination as we are trying to connect to, you will get an EAGAIN error that, hey, we are not able to facilitate that.
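The proposed semantics can be summarised with a small Python model. This is a sketch of the behavior as described in the discussion, not kernel code: the function name, the set of `(local_ip, local_port, remote)` tuples and the in-order port scan (the real search starts at a random port) are all simplifications made for the example.

```python
import errno
import os

WILDCARD = "0.0.0.0"

def autobind_on_connect(sockets, local_ip, remote, ports,
                        reuseaddr, bind_no_port):
    """Pick a local port at connect time and record the new socket.

    sockets is a set of (local_ip, local_port, remote) tuples.
    """
    # New branch: both options set and a concrete source IP is known.
    share = reuseaddr and bind_no_port and local_ip != WILDCARD
    for port in ports:
        if share:
            # Only an identical 4-tuple counts as a conflict.
            taken = (local_ip, port, remote) in sockets
        else:
            # Existing behavior in this model: the local address alone decides.
            taken = any(ip == local_ip and p == port for ip, p, _ in sockets)
        if not taken:
            sockets.add((local_ip, port, remote))
            return port
    # No usable port left for this destination.
    raise OSError(errno.EAGAIN, os.strerror(errno.EAGAIN))
```

With a two-port range, two connections to the same destination use up both ports, a connection to another destination can still share the first port, and a third connection to the first destination fails with EAGAIN.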

So, I think what you're saying is that basically IP_BIND_ADDRESS_NO_PORT makes no sense without SO_REUSEADDR.

And on the other hand, yeah.

So, basically, you need both. And that's kind of the difference between this proposal and TCP.

In TCP, you don't need SO_REUSEADDR, while here you kind of do.

But this goes back to the kind of the, again, the API semantics.

It's like, if you want to share the port, you need SO_REUSEADDR. So, otherwise, it makes no sense.

Yeah. It's worth pointing out that SO_REUSEADDR has a completely different meaning in TCP and UDP.

In TCP, you need SO_REUSEADDR to recover the ports taken by TIME-WAIT sockets when your server dies and you need to rebind to the same port.

In UDP, it's needed if you want to force two sockets to be able to use the same local address.

And it was designed with, it was added with multicast sockets in mind, right?

Historical stuff again. It's like how it was derived.

Yeah. And if you set SO_REUSEADDR and IP_BIND_ADDRESS_NO_PORT today, you won't notice the difference from the application point of view.

In reality, well, if you call getsockname, you'll notice that the port allocation doesn't happen until connect time when you set IP_BIND_ADDRESS_NO_PORT.

But that's a very subtle change. Yeah. And still the port cannot be reused.
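The getsockname observation can be checked with a short Python sketch. It assumes a Linux host; the function name is made up, and the fallback value 24 for IP_BIND_ADDRESS_NO_PORT is the Linux constant, used only when the Python `socket` module does not export it. The function returns `None` where the option is not available.

```python
import socket
import sys

# 24 is the Linux value; used only if the socket module lacks the constant.
IP_BIND_ADDRESS_NO_PORT = getattr(socket, "IP_BIND_ADDRESS_NO_PORT", 24)

def late_bind_ports():
    if not sys.platform.startswith("linux"):
        return None
    remote = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    remote.bind(("127.0.0.1", 0))
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        s.bind(("127.0.0.1", 0))                 # address only, no port yet
        port_after_bind = s.getsockname()[1]
        s.connect(remote.getsockname())          # port gets allocated here
        port_after_connect = s.getsockname()[1]
        return port_after_bind, port_after_connect
    except OSError:
        return None                              # option not supported here
    finally:
        s.close()
        remote.close()
```

On a kernel that supports the option, the first value is expected to be 0 and the second a non-zero ephemeral port, which is the subtle difference mentioned above.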

I think it's nice that actually this is completely feasible. So, actually adding IP_BIND_ADDRESS_NO_PORT to UDP is, kind of, possible.

That's a kind of revelation.

That's interesting. But I think it touches on a very subtle difference between TCP and UDP, which is that in UDP it's very hard to detect the conflicts because the sockets may coexist.

While in TCP, they can't coexist.

And I think it's also implemented differently. I think in UDP, there is the one hash table for both connected and unconnected, while in TCP, there are different ones.

Is this correct? Yeah, that's right. But that's a major difference between TCP and UDP.

Like, all your connected and unconnected sockets, they are all in one hash table.

We do have two hash tables for UDP, but neither of them is keyed on the full 4-tuple, right?

Right. And are they dynamic size? I think they're fixed size.

Yeah. If I remember correctly, size is picked at the boot time.

Right. It's picked at boot time based on the memory your system has, or you can set the size up to 65,000 ports.

So that means basically that if we are going to use many connected sockets, we can suffer at some point, right?

If the table is not tuned correctly.

Right. Yeah. That was another piece of feedback that we got.

Like, all right, we want to use connected sockets, but aren't we concerned about performance?

If you have all these connected sockets in your hash table and you distribute them only based on the port number, or the local IP and port number, then you are susceptible to building long chains of connected sockets that use the same local address, even if they connect to different destinations, and that's a problem.

The entries in your hash table are not distributed well, right?

Not like in the case of TCP, where the ehash, the established hash table, uses both the local and remote address as the key to the table, right?
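The effect of the key choice on chain length can be illustrated with a small Python simulation. This is a toy model with made-up addresses and a SHA-256 based slot function standing in for the kernel's hash; it only compares keying on the local address against keying on the full 4-tuple.

```python
import hashlib
from collections import Counter

def longest_chain(sockets, key, nslots=256):
    """Length of the longest chain when sockets are hashed by key(sock)."""
    def slot(sock):
        digest = hashlib.sha256(repr(key(sock)).encode()).digest()
        return int.from_bytes(digest[:4], "big") % nslots
    return max(Counter(slot(s) for s in sockets).values())

# 1000 connected sockets sharing one local address, each with a distinct remote.
sockets = [
    ("192.0.2.1", 443, "198.51.100.%d" % (i % 250 + 1), 10000 + i)
    for i in range(1000)
]

by_local_addr = longest_chain(sockets, key=lambda s: s[:2])  # local IP and port
by_four_tuple = longest_chain(sockets, key=lambda s: s)      # full 4-tuple
```

Keyed on the local address, all 1000 sockets land in a single chain, while the 4-tuple key spreads them across the table, which is the property the TCP established table is credited with above.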

Right. So that's basically why UDP connected sockets on ingress don't really work at scale.

In some of our products, we are using that, but then it's kind of different because we have multiple server IPs and ports, so they are kind of hashed differently.

But yeah, generally, if you're doing that on ingress, that doesn't scale.

On egress, it's not a big deal. And also, I think we still will be using more and more connected sockets for UDP, due to QUIC, for example, but it's still not there yet.

So I hope we'll solve the problem of the hash table conflict before we go to a properly big scale there.

Yeah. As long as we can vary the local port and local IP, so as long as we have many local IPs, then we are able to keep the chain lengths under control.

But yeah, the suggestion from upstream maintainers was that, well, perhaps we should have a separate hash table for the connected sockets, right?

That would then enable using them on egress and on ingress, and creating lots of connected sockets without worrying about long hash table chains, and potentially that could even have some beneficial aspects for ingress, right?

Because we have this special mechanism called early demux on ingress, right, Shawn?

Would having a dedicated hash table for connected socket improve anything there?

Do you think? Yeah, I think it could.

So many years ago, I worked in the financial world, which heavily uses multicast.

And the Linux kernel used to have a routing table lookup cache. And for a variety of reasons, they removed that.

But in that kernel version change between when they removed the routing table cache, we saw a big drop in our multicast receive performance.

And when I talked to the networking maintainers about that, the answer was, oh, well, for TCP, nobody has noticed this problem because TCP has early demux.

And UDP doesn't have early demux. Perhaps you could implement early demux for UDP.

So I did, but there are some corner cases there. It really only works for multicast where you have, again, a chain of one in this UDP hash table, which isn't too hard to achieve with UDP.

And it only works for connected sockets. But again, we allow a chain, but we only check the first socket in that chain.

So anytime you end up with a chain, you lose this nice performance benefit of early demux.
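The "first socket in the chain" limitation can be pictured with a tiny Python model. It is a sketch of the behavior as described here, not the kernel implementation; the function name and the tuple representation of sockets are made up for the example.

```python
from typing import List, Optional, Tuple

FourTuple = Tuple[str, int, str, int]  # local IP, local port, remote IP, remote port

def early_demux(chain: List[FourTuple], packet: FourTuple) -> Optional[FourTuple]:
    """Return the matching connected socket only if it heads the chain.

    None means the fast path missed and the regular lookup has to run.
    """
    if chain and chain[0] == packet:
        return chain[0]
    return None
```

A socket sitting second in its chain never benefits from the fast path in this model, even though a full walk of the chain would find it.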

So I think having a separate hash table could give everybody who's using connected UDP sockets a nice performance win.

Right, right. Because then maybe we could make an assumption that chains will be short enough and entries will be distributed well enough to just walk that chain on early demux.

And then that mechanism, just to give some more context, that will save you a routing lookup on ingress, right?

Correct. It's all about skipping the routing lookup, which can be kind of expensive.

Okay, okay. So it looks like it could have several benefits if we go that way.

Right, exactly. So we kind of made the full circle.

We started with WebSockets, scalability, and kind of worrying about the number of concurrent connections possible.

And we got up to the point where it seems that UDP needs to look like TCP these days.

So yeah, so basically, if the proposal is to create the new hash table for connected sockets for UDP, that would make me happy because I could do the connect on ingress.

I guess, Jakub, that will make you happy because it will be way easier to implement the conflict detection.

It will be trivial because that's one lookup in memory. And I think it will also make Shawn happy with just early demux working for basically all connected sockets and many more unconnected sockets as well, I guess.

Right.

Sounds so. All right. And with that, I think we can wrap up. So thanks, Shawn.

Thanks, Marek. It was great chatting with you. Yeah, great conversation.
