Missing Manuals - io_uring worker pool
Presented by: Jakub Sitnicki
Originally aired on May 23 @ 3:00 PM - 3:30 PM EDT
Join Cloudflare Systems Engineer Jakub Sitnicki for a technical deep dive into io_uring and its worker pool.
You can access the mindmap featured in this presentation at this link: https://cftv-io-uring.pages.dev/
For more, don't miss the companion blog post:
Transcript (Beta)
Hello and welcome. My name is Jakub and thank you for joining me for the Missing Manuals Cloudflare TV segment.
Today we're going to talk about io_uring and its worker thread pool.
Let's get started. First off, what is io_uring?
Well, it's an asynchronous IO API. And in case you're not familiar with it yet, asynchronous IO allows your application to submit IO requests and continue doing other tasks while it waits for the requests to be carried out and the results to be delivered.
io_uring was introduced into the Linux kernel in 2019, when version 5.1 was released.
And if you want to read up on the background of how we got here, it was covered by both the Kernel Newbies changelog and Linux Weekly News.
So these are both great reads. And I'm saying it's an asynchronous IO API, but in reality, it's a full-blown IO runtime that has various components, like a thread pool, work queues where tasks wait for their turn to be executed, and a dispatcher that chooses how to carry out different tasks.
And before we go on, just a short warning that this video is not an introduction to io_uring, but I can recommend some excellent resources where you can get familiar with the basics.
First off, there is the "Efficient IO with io_uring" article by Jens Axboe, the author and creator of the API.
Then there is also the "Lord of the io_uring" guide by Shuveb Hussain.
I hope I pronounced that correctly.
If not, sorry. And if you're more of a video tutorial person, and I assume you are because you're watching Cloudflare TV, there was an excellent talk at the Kernel Recipes conference shortly after the io_uring API was added, and that talk was also given by Jens Axboe.
And last but not least, io_uring comes with a set of manual pages.
Just remember to install the liburing-devel package.
There's the introductory man page for it in section seven, io_uring(7).
That gives you an overview of the API. There are also dedicated man pages for the three new syscalls that io_uring introduces.
The setup syscall, io_uring_setup, which we use to create the io_uring; the register syscall, io_uring_register, which initially was intended for registering buffers and file descriptors on which we're going to operate with io_uring, but with time was also extended to be kind of like a control syscall for configuring your io_uring instance.
And we're going to see it in use later today. And last, there is the io_uring_enter syscall, which we use to submit IO requests and optionally to wait for the requests to complete.
All right. So, after the short overview of io_uring, let's move on to our main topic.
So, as I mentioned, io_uring is an IO runtime, an IO framework.
And using frameworks that take care of polling file descriptors for readiness to be read or written, and that spawn thread pools to which work is delegated, is very convenient from the developer's point of view.
But when we build services on top of such frameworks, it's also important to understand how underneath the abstractions this framework allocates resources.
And threads are one of such system resources.
And that's what we're going to be talking about today.
We're going to ask ourselves a question. When we use io_uring, how many worker threads will it actually create by default to handle our IO requests?
So, if we dig through the documentation, we can come across a bit of information in the io_uring_register man page, which says that by default, io_uring will limit the number of unbounded workers to the maximum new process limit, RLIMIT_NPROC.
And it will also limit the number of bounded workers to a number that is a function of your submission queue depth and the number of CPUs on your system.
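As a rough illustration of the defaults just described: the io_uring_register(2) man page states that bounded workers are capped at the submission queue size or four times the number of online CPUs, whichever is smaller, and that unbounded workers are capped only by RLIMIT_NPROC. The sketch below computes those two numbers. The function name `default_worker_limits` is hypothetical, and `os.cpu_count()` is only an approximation of the kernel's online CPU count.

```python
import os
import resource


def default_worker_limits(sq_entries, online_cpus, nproc_soft_limit):
    """Return the default worker caps as documented in io_uring_register(2).

    This is a model of the documented rule, not a query of the kernel.
    """
    return {
        # Bounded workers: the smaller of SQ depth and 4 x online CPUs.
        "bounded": min(sq_entries, 4 * online_cpus),
        # Unbounded workers: limited only by the process (task) limit.
        "unbounded": nproc_soft_limit,
    }


if __name__ == "__main__":
    # A soft limit of -1 (resource.RLIM_INFINITY) means "no limit".
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print(default_worker_limits(4096, os.cpu_count() or 1, soft))
```

The submission queue depth of 4096 used in the example matches the depth used by the toy workload later in this talk.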
All right. That is something.
But we've got to ask ourselves, what actually is an unbounded worker? And how does it differ from a bounded worker? We need to know that to understand what this means.
So, as it turns out, io_uring does not treat all IO requests the same way.
And the requests get categorized into two types when they're submitted.
There are requests that we expect to complete in bounded time, which are normally requests that operate on regular files and block devices.
And then there are requests that may never complete, like network IO.
And these are called unbounded work, and they're handled by unbounded workers. io_uring will spawn as many of them as RLIMIT_NPROC allows at maximum.
Today, we're going to focus on unbounded work and the unbounded worker pool because we're going to be dealing with network requests.
All right.
So, now we know how many workers io_uring will spawn by default for unbounded work.
So, let's see how we can actually cap the unbounded worker pool size. To experiment and see what io_uring does, how it scales up the worker pool, we're going to need a toy workload that creates an io_uring and submits some requests.
And you can grab our toy workload from GitHub.
And it's not very complex.
It's about 200 lines of code that is just parametrized. So, it allows us to experiment with different options for setting up an io_uring.
And all it does is create a UDP socket and submit read requests on that socket.
Nothing ever arrives on the socket unless we send something to it.
So, the request will just wait for completion.
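To make the socket behaviour concrete, here is a minimal sketch using plain sockets rather than io_uring. It shows that a read on an idle UDP socket has nothing to return, and that a single datagram sent to the socket is what lets one pending read finish. The function name `pending_read_demo` and the loopback address are assumptions made for illustration; this is not the toy workload from the talk.

```python
import socket


def pending_read_demo():
    """Show that a UDP read stays pending until a datagram arrives."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))  # let the kernel pick a free port
    rx.setblocking(False)

    # Nothing has been sent yet, so a non-blocking read cannot complete.
    try:
        rx.recv(64)
        idle_read_blocked = False
    except BlockingIOError:
        idle_read_blocked = True

    # Sending one datagram is what completes one read.
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.sendto(b"ping", rx.getsockname())
    rx.settimeout(2.0)
    data = rx.recv(64)

    tx.close()
    rx.close()
    return idle_read_blocked, data


if __name__ == "__main__":
    print(pending_read_demo())
```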
Okay.
So, first off, let's see what happens when we just fill our submission queue with as many requests as we can.
We'll confirm by tracing that io_uring has consumed the requests that we submitted.
And for that, we can trace the io_uring:io_uring_submit_sqe tracepoint, which gets triggered whenever io_uring takes a request out of the submission queue for processing.
And once we do that, we're going to check the number of threads in our application to see how io_uring has scaled the unbounded worker pool to handle UDP read requests.
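One way to check the thread count of a process on Linux is to read the `Threads:` field from `/proc/<pid>/status`. The sketch below does that; the helper names `parse_thread_count` and `thread_count` are hypothetical, and the talk itself may use a different tool to obtain the same number.

```python
def parse_thread_count(status_text):
    """Extract the thread count from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def thread_count(pid="self"):
    """Return the number of threads in a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_thread_count(status_file.read())


if __name__ == "__main__":
    # Prints the thread count of this Python process.
    print(thread_count())
```

Passing the PID of the running workload instead of `"self"` reports the main thread plus any worker threads that belong to that process.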
Okay.
So, as I mentioned, we're going to trace that one tracepoint to see how many requests were submitted, and run our workload under perf.
Our workload is running.
Now we're going to check the thread count. And there seems to be just one thread running.
So, this is our main thread. No workers were spawned. And once we kill our workload, we can see that we have indeed submitted around 4,000 IO requests.
So, we have filled the queue completely. But despite that, no worker threads were spawned.
Why? Well, it turns out that io_uring is smart enough to know that all sockets can be polled for readiness to read and write.
So, it doesn't enter the processing path where we spawn workers, which would be the upper path on our graph.
But rather, it enters the non-blocking wait, where we just wait for the files to be ready to read in this case.
And we can confirm that if we rerun our workload, this time recording the io_uring:io_uring_poll_arm tracepoint, which gets hit whenever we enter the non-blocking path.
All right.
So, that's what happens by default if we try to read from a socket. So, what happens if we actually want to process the request asynchronously, meaning that we want to block on the read and we want to delegate the work to a worker thread?
First off, how can we do that?
How can we force io_uring to treat the socket file descriptor as if it was non-pollable?
Well, we can use a flag, IOSQE_ASYNC, which asks io_uring to always treat the request as if it were going to block on the operation.
And that's what we're gonna use.
So, we're gonna run pretty much the same experiment again.
But this time, we're gonna track, by recording a tracepoint, whenever we have pushed any requests onto the blocking path.
And from there, onto a run queue from which the worker pool consumes requests to process.
All right. So, let's see what happens. All right. We run our workload again.
This time, with the async flag, which tells our toy workload to set the async flag on requests.
And we check the number of threads. And, yeah, this time, we see that, indeed, around 4,000 threads were spawned.
So, io_uring has scaled up the worker thread pool this time.
So, why 4,096 threads?
Well, that is our submission queue depth by default, as I mentioned earlier.
And io_uring has decided to spawn one thread for each submitted IO request.
So, moving on.
We now know what happens when we just submit requests and let io_uring scale up the worker thread pool as it wishes.
But admittedly, having 4,000 threads running just to cater to a bunch of read requests on a socket is not really efficient.
So, now we would like to know, well, how do we actually control the number of worker threads?
How can we limit the size of the pool? And to do that, we have a few methods at our hand.
So, first off, the naive approach.
We can just limit the number of in-flight requests. If we never submit more than, say, eight requests and wait for them to complete before we submit new requests, then we don't expect io_uring to spawn more than eight workers.
So, that's what we're gonna try. Right?
We've run our workload, this time asking it to submit just eight requests. And as expected, io_uring has created just eight additional workers apart from the main thread.
And now we're gonna complete one of the requests by sending a packet to the UDP socket on which we are waiting to read.
And we're gonna wait a bit to give io_uring time to retire a worker that is now sitting idle, and recheck the thread count.
And indeed, we can see that the thread count went down by one.
That's because there are only seven requests remaining to be completed.
So, as expected, the thread count never exceeded the number of in-flight, yet to be completed requests.
That's the first method we can use to control how many worker threads will be in the pool.
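The in-flight cap described here is application-side bookkeeping, and can be sketched independently of io_uring. The class below is a hypothetical helper, not part of the toy workload: the application would call `try_submit()` before queueing a request and `complete()` whenever a completion is reaped, so the number of outstanding requests, and with it the number of workers needed, never exceeds the cap.

```python
class InflightLimiter:
    """Track outstanding requests and refuse to exceed a fixed cap."""

    def __init__(self, max_inflight):
        self.max_inflight = max_inflight
        self.inflight = 0

    def try_submit(self):
        """Reserve a slot for one request; return False if the cap is hit."""
        if self.inflight >= self.max_inflight:
            return False
        self.inflight += 1
        return True

    def complete(self):
        """Release the slot of one completed request."""
        if self.inflight == 0:
            raise RuntimeError("no request in flight")
        self.inflight -= 1


if __name__ == "__main__":
    limiter = InflightLimiter(8)
    accepted = sum(limiter.try_submit() for _ in range(20))
    print(accepted, limiter.inflight)
```

The drawback of this approach is that the cap lives in every code path that submits requests, which is why a limit enforced by the ring itself is attractive.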
However, we can do a little bit better. io_uring has an IORING_REGISTER_IOWQ_MAX_WORKERS command, which we can use to configure our io_uring through the io_uring_register syscall.
And we can use this command to configure how many workers we want to be spawned at most in our thread pool.
And the limit is separate for unbounded workers and bounded workers.
So, this is what we are gonna use in our next experiment.
Okay.
This time we're gonna trace the syscalls as we run our workload. And we're gonna pass the workers option so that we'll limit the number of workers to eight.
And indeed, we can see that we have just the main thread and eight workers spawned.
And once we terminate our workload, we can see that our toy workload has issued an io_uring_register command to configure the worker pool size.
So, we have submitted 4,000 requests just as before, but only eight worker threads were spawned.
So, that works well.
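According to the io_uring_register(2) man page, IORING_REGISTER_IOWQ_MAX_WORKERS takes a pointer to an array of two unsigned integers, with `nr_args` set to 2: index 0 is the bounded worker limit and index 1 is the unbounded worker limit. A value of 0 leaves that limit unchanged, and the kernel writes the previous values back into the array. The sketch below only builds and decodes that argument; it does not issue the syscall, and the helper names are hypothetical.

```python
import struct

# Two native-endian unsigned 32-bit integers: [bounded, unbounded].
_IOWQ_ARG = struct.Struct("=II")


def iowq_max_workers_arg(bounded=0, unbounded=0):
    """Pack the two-element argument; 0 means "leave this limit as is"."""
    return _IOWQ_ARG.pack(bounded, unbounded)


def parse_iowq_max_workers(buf):
    """Decode the array, for example the previous limits written back."""
    bounded, unbounded = _IOWQ_ARG.unpack(buf)
    return {"bounded": bounded, "unbounded": unbounded}


if __name__ == "__main__":
    # Cap unbounded workers at 8 and keep the bounded limit unchanged.
    arg = iowq_max_workers_arg(unbounded=8)
    print(parse_iowq_max_workers(arg))
```

In a C program using liburing, the same array would typically be passed through the library's registration helper instead of being packed by hand.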
There's also another option to limit the number of created threads in the worker pool.
And that is, as we've seen already, by limiting the number of new processes that can be created.
In other words, by setting our RLIMIT_NPROC resource limit. So, let's give that a try.
All right. This time, we're gonna ask for 16 worker threads in the pool, but we're gonna set the new process limit to just 12.
And something weird happened, because io_uring created just three worker threads.
And if we trace what is happening, and we trace the function that creates the workers, we can see that it's being hit pretty heavily, and it returns an EAGAIN error because io_uring is not able to create new worker threads.
At the same time, we see that we're burning one core just trying to scale up the pool.
So, that's because we already have some processes created in our user namespace under our UID.
So, in order to apply that limit reliably, we need to have a fresh user namespace or a fresh user ID, which has not used up any of the quota yet.
And if we do that, then, yeah, indeed, we can use the RLIMIT_NPROC resource limit to control how many worker threads will be in the pool.
So, to recap, it's easy to run into this situation by setting RLIMIT_NPROC too low, where io_uring will try to create threads and will just burn CPU doing that.
That's because we might have already created other processes or threads, which the kernel treats the same way, as tasks, and, simply put, io_uring is just not able to scale the worker pool up to the target size.
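The trap can be illustrated with a simplified model: RLIMIT_NPROC is charged against every task the user already owns, so the room left for new workers is the limit minus the existing task count. The sketch below estimates that count by scanning `/proc` and then computes the headroom. Both function names are hypothetical, the `/proc` scan is only an approximation of the kernel's per-user accounting, and the numbers in the example are made up rather than taken from the talk.

```python
import os


def tasks_owned_by(uid, proc_root="/proc"):
    """Approximate how many tasks (threads) a user currently owns."""
    total = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        path = os.path.join(proc_root, entry)
        try:
            if os.stat(path).st_uid != uid:
                continue
            total += len(os.listdir(os.path.join(path, "task")))
        except OSError:
            continue  # the process exited while we were scanning
    return total


def spawnable_workers(nproc_limit, existing_tasks, wanted_workers):
    """Model how many workers fit under the limit (simplified)."""
    headroom = max(0, nproc_limit - existing_tasks)
    return min(wanted_workers, headroom)


if __name__ == "__main__":
    # Hypothetical numbers: limit of 12, 9 tasks already running,
    # 16 workers requested.
    print(spawnable_workers(12, 9, 16))
```

With a fresh user ID or user namespace, `existing_tasks` starts near zero, which is why the limit then behaves predictably.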
And just a note that we can actually fall into the same trap with cgroup process limits, but I'm not gonna go into that.
We have covered that in the companion blog post for this video.
All right.
So, we know how many workers io_uring will create by default, and we also learned how to limit the number of workers that io_uring will create and how to monitor it by checking the number of threads in our process.
And these are the basics of how the worker pool inside io_uring works.
But there are also other, say, more advanced scenarios, like the case when we have multiple rings.
We can imagine a network proxy that uses one io_uring on ingress and another io_uring on egress.
And in this case, we also have to ask ourselves the question, how are these limits applied?
Are they per io_uring or are they per process? Then there's also the case of NUMA systems, where we have multiple nodes with CPUs, and the limits can apply differently, as the io_uring manual page says.
All right. So, if you would like to learn about these cases as well, well, as I mentioned, please go read our io_uring blog post.
We have covered these there as well, so you can find all the answers there.
Thank you, and I'll see you next time.