(Filipino/Tagalog) Quicksilver: Configuration Distribution at Internet Scale
Presented by: Jayson Cena, Jet Mariscal
Originally aired on June 8, 2020 @ 10:00 PM - 10:30 PM EDT
Jayson and Jet, Systems Reliability Engineers from Cloudflare's Singapore office, will be discussing Quicksilver, the key-value store for Cloudflare's edge. (Discussion will be in Filipino/Tagalog.)
Filipino
Quicksilver
Scaling
Transcript (Beta)
🎵 Outro Music 🎵 Good morning everyone, and welcome to Cloudflare TV.
In the past few years, Cloudflare has been continuously building a large network of data centers around the world to fulfill our mission of helping to build a better Internet.
At present, there are more than 200 data centers in 95 countries, and we continue to expand to cities worldwide so that we are as close as possible to end users.
Cloudflare's network processes more than 14 million HTTP requests per second on average and reaches 17 million requests at peak.
For DNS, it handles 4.6 million requests per second, and it protects more than 26 million Internet properties.
Here at Cloudflare, we put all our capabilities into building tools that make these requests faster and more secure.
But part of the secret sauce that makes this possible is how we distribute configurations globally.
To walk us through the details, we now have with us Jayson Cena, who is also an SRE here at Cloudflare.
Good morning, Jayson.
As I mentioned, Cloudflare certainly uses databases behind its public services like HTTP and DNS.
With the volume of requests and the large number of Internet properties that we serve, what kind of configuration database does Cloudflare use to support this many configurations?
Good morning, Jet.
Every application, especially an Internet-facing service, always has configuration.
It can be a simple flat file, a structured format like JSON or YAML, or a database.
In Cloudflare's case it's a mix, but all customer configurations are stored in a database.
At the scale of Cloudflare, the requirements for the database are high.
The average response time is within microseconds.
It should not exceed 1 millisecond.
Storage should also be efficient because of the 26 million Internet properties registered with Cloudflare.
If storage efficiency is not considered, the data will consume a large amount of space.
Cloudflare created a Postgres extension, the Kyoto Tycoon foreign data wrapper, to integrate Kyoto Tycoon with Postgres.
Through it, Postgres can read from and write to Kyoto Tycoon, so every customer configuration change stored in Postgres is also stored in Kyoto Tycoon.
One of the features of Kyoto Tycoon that Cloudflare uses is master-slave replication.
But because encryption is important at Cloudflare and Kyoto Tycoon does not have it, Cloudflare contributed a patch to add mutual TLS to Kyoto Tycoon.
This is one of the advantages of open source software.
Let's go back to master-slave replication.
Cloudflare uses master-slave replication to send all configuration updates to thousands of worker nodes.
What issues have you experienced so far with using KT?
There are latency issues when reads and writes are combined in one KT instance.
Even with just one writer, reads suffer: the 99th percentile latency becomes 9 ms and the 99.9th percentile becomes 15 ms.
This is not a problem for worker nodes, because worker nodes serve HTTP and DNS traffic.
There is no service on them that writes; all services that run on worker nodes are only readers.
But the case of the top master is different, because it is linked to Postgres, and Postgres writes into Kyoto Tycoon using the foreign data wrapper.
Another issue with KT is upgrading.
Because almost every service on the worker nodes depends on KT, KT cannot simply be upgraded at will.
KT upgrades are always scheduled, four times a year.
Cloudflare's scale is also one of the issues.
Kernel panics and processor bugs are common, and when they happen, servers get stuck or stop responding.
Like other databases, when the database application crashes, there is partial corruption.
In the case of KT, it supports offline repair, but at Cloudflare's scale, with the amount of data in a single KT instance, the offline repair takes too long.
And since the repair is offline, KT receives no replication updates while it runs, so the replication lag grows too high.
So although KT has an option for offline repair, Cloudflare chose instead to transfer the KT database file from the management nodes.
A file transfer is faster than an offline repair, because the repair works through the database at the application level.
It is not a simple file transfer.
Another problem that Cloudflare encountered is the scale of expansion.
Because of server upgrades and the expansion of Cloudflare's data centers, the number of servers keeps increasing, and with it the chance of a corrupted KT database.
This is the most dangerous case: when the source server's database is corrupted, the corruption replicates to the different management nodes and on to the worker nodes.
Imagine how many servers Cloudflare has in one colo, and how many colos there are in one region.
SREs need to sync these manually, and as Cloudflare's customer base grows, the data stored in KT grows too.
At the same time the number of servers grows, so keeping KT consistent and syncing it manually became a critical issue for Cloudflare's SREs.
It looks like this is a big challenge for SREs.
And obviously, based on what you said, we really reached the limits of KT.
So, no wonder, this became Cloudflare's engineering motivation to make a replacement for KT from scratch.
And this replacement is now Quicksilver.
Can you tell us how Cloudflare solved KT's problems?
Cloudflare decided to build a key-value store from scratch, and that's Quicksilver.
It's written in Go, and its backing store is LMDB.
Among Quicksilver's features are bootstrapping and replication.
But before I explain Quicksilver, let me clarify the common terms first.
Bootstrapping is copying a database file from one server to another server.
Replication is replaying updates to a database from one server to another.
For example, when there are key-value updates, those updates are transferred to another server.
And of course, LMDB is the Lightning Memory-Mapped Database.
In KT, if a server needs a complete copy, the KT instance on the server the copy comes from needs to be shut down before the KT database can be transferred.
In the case of Quicksilver, since its backing store is LMDB, it supports online snapshots, and snapshots are synced to worker nodes.
Because of this, the management node does not need to shut anything down to sync the database.
Another feature of Quicksilver is the replication of transaction logs.
Each transaction log has at least one operation and two hashes.
The first hash is the hash of the previous transaction log entry.
The second hash is the combined hash of the previous transaction log entry and the current transaction log entry.
Because of the two hashes in each transaction log entry, Quicksilver can verify, before applying the transaction to the database, that the hash of the previous transaction is correct.
The next transaction log entry will then perform the same check.
Because of this process, Quicksilver ensures the consistency and ordering of updates in the LMDB database.
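To make the two-hash scheme concrete, here is a minimal Go sketch of that chain check. The entry layout, field names, and the choice of SHA-256 are all assumptions for illustration; the transcript does not specify Quicksilver's actual log format or hash function.

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// LogEntry is a hypothetical transaction log entry following the scheme
// described above: an operation payload plus two hashes, the hash of the
// previous entry and a combined hash chaining previous and current.
type LogEntry struct {
	Op           []byte // serialized key-value operation(s)
	PrevHash     []byte // the previous entry's combined hash
	CombinedHash []byte // hash over PrevHash and this entry's payload
}

// combined chains the previous entry's hash with the current payload.
func combined(prev, op []byte) []byte {
	h := sha256.New()
	h.Write(prev)
	h.Write(op)
	return h.Sum(nil)
}

// verifyAndApply checks the chain before applying each entry: the stored
// previous hash must match the last applied entry, and the combined hash
// must recompute exactly.
func verifyAndApply(entries []LogEntry, lastApplied []byte, apply func([]byte) error) error {
	prev := lastApplied
	for i, e := range entries {
		if !bytes.Equal(e.PrevHash, prev) {
			return fmt.Errorf("entry %d: previous-hash mismatch", i)
		}
		if want := combined(prev, e.Op); !bytes.Equal(e.CombinedHash, want) {
			return fmt.Errorf("entry %d: combined-hash mismatch", i)
		}
		if err := apply(e.Op); err != nil {
			return err
		}
		prev = e.CombinedHash
	}
	return nil
}

func main() {
	genesis := make([]byte, sha256.Size) // zero hash before the first entry
	e1 := LogEntry{Op: []byte("SET a=1"), PrevHash: genesis}
	e1.CombinedHash = combined(genesis, e1.Op)
	e2 := LogEntry{Op: []byte("SET b=2"), PrevHash: e1.CombinedHash}
	e2.CombinedHash = combined(e1.CombinedHash, e2.Op)

	err := verifyAndApply([]LogEntry{e1, e2}, genesis, func(op []byte) error {
		fmt.Printf("apply: %s\n", op)
		return nil
	})
	fmt.Println("verified:", err == nil)
}

A tampered, reordered, or truncated log fails the check at the first bad entry, which is what lets the chain guarantee both consistency and ordering before anything touches LMDB.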
You mentioned the upgrades of KT.
How are the upgrades in Quicksilver?
Quicksilver uses systemd socket activation.
What is socket activation?
In systemd, there are systemd services, but systemd also features systemd sockets.
Systemd creates sockets such as TCP, UDP, or Unix sockets on the application's behalf.
When systemd starts your application, you just read an environment variable to know how many sockets systemd created for the application.
You can see in the illustration, that is the LISTEN_FDS environment variable.
In the systemd configuration, you can see that there are two socket definitions, a TCP socket and a Unix socket.
Because there are two file descriptors, these are what get passed to the Quicksilver service.
In your application's code, in the case of Quicksilver, you just iterate over the available file descriptors, then identify in the application what kind of socket each file descriptor is.
The advantage of this is that you can pass the file descriptors to a separate process, and the new process becomes the owner of the sockets.
You can see in the illustration that, initially, the file descriptors are held by the first Quicksilver instance.
After the Quicksilver update, the new Quicksilver takes ownership of the sockets' file descriptors.
This is how Quicksilver passes the sockets from the old Quicksilver instance to the new Quicksilver instance.
And because LMDB supports multiple processes, one LMDB database can be accessed by two Quicksilver instances.
So what happens is that the upgrade is just seamless.
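As an illustration of the mechanism, here is a minimal Go sketch of picking up systemd-passed sockets. It follows the standard sd_listen_fds convention (LISTEN_FDS holds the count, inherited descriptors start at fd 3); this is not Quicksilver's actual startup code, and a real implementation (for example the coreos/go-systemd activation package) would also verify LISTEN_PID.

package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// listenersFromSystemd collects the sockets systemd passed to this
// process. Per the sd_listen_fds convention, LISTEN_FDS holds the count
// and the inherited descriptors start at fd 3 (SD_LISTEN_FDS_START).
func listenersFromSystemd() ([]net.Listener, error) {
	nfds, err := strconv.Atoi(os.Getenv("LISTEN_FDS"))
	if err != nil || nfds == 0 {
		return nil, fmt.Errorf("no sockets passed by systemd")
	}
	const listenFdsStart = 3
	var ls []net.Listener
	for i := 0; i < nfds; i++ {
		f := os.NewFile(uintptr(listenFdsStart+i), fmt.Sprintf("listen-fd-%d", i))
		l, err := net.FileListener(f) // works for TCP and Unix stream sockets
		f.Close()                     // FileListener dups the fd; drop our copy
		if err != nil {
			return nil, err
		}
		ls = append(ls, l)
	}
	return ls, nil
}

func main() {
	ls, err := listenersFromSystemd()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, l := range ls {
		// Identify what kind of socket each descriptor is.
		fmt.Printf("inherited %s listener on %s\n", l.Addr().Network(), l.Addr())
	}
}

Because systemd, not the service, owns the listening sockets, the new instance can adopt the very same descriptors the old instance was serving, which is what makes the handover seamless.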
Okay.
Let's talk about observability.
There are many blogs published by Cloudflare on how Prometheus is used for monitoring.
So in the case of Quicksilver, how do we monitor using Prometheus?
So in terms of observability, Quicksilver has a built-in Prometheus exporter, so the Prometheus server can scrape its metrics.
There are also many metrics available in Quicksilver, like response time histograms, replication lag, and many more.
We also have histograms of disk access, memory usage, and many other important metrics, because for a database there are many metrics that need to be monitored to ensure that Quicksilver's response time and uptime stay reliable.
So with the metrics you mentioned, the ones available within Quicksilver itself, Quicksilver already exposes the endpoint that Prometheus scrapes.
So this is built into Quicksilver as a feature, right?
Yes, this is included in Quicksilver.
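For illustration, a minimal built-in exporter using the prometheus/client_golang library could look like the sketch below. The metric names and buckets are invented for this example; the transcript does not list Quicksilver's real metric names.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Hypothetical metrics: a response-time histogram and a
	// replication-lag gauge, mirroring the kinds of metrics mentioned.
	responseTime = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "qs_read_duration_seconds",
		Help:    "Read response time.",
		Buckets: prometheus.ExponentialBuckets(1e-6, 4, 12), // from 1 µs up
	})
	replicationLag = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "qs_replication_lag_seconds",
		Help: "Seconds behind the replication source.",
	})
)

func main() {
	prometheus.MustRegister(responseTime, replicationLag)

	// Instrument a (fake) read path.
	go func() {
		for {
			start := time.Now()
			// ... perform a read here ...
			responseTime.Observe(time.Since(start).Seconds())
			replicationLag.Set(0.5) // placeholder value
			time.Sleep(time.Second)
		}
	}()

	// Expose /metrics for the Prometheus server to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}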
Aside from that, we have a lot of Prometheus instances.
At the same time, we also have Grafana instances with different dashboards where we can see Quicksilver's health.
We have Quicksilver's health for a single machine, Quicksilver's health for a single colo, and also Quicksilver's global health, like replication lag across different regions and different colos.
So because of the capability of Prometheus to scale, we also have an overall view, different views of Quicksilver.
And our alerting is also included here, of course.
This is also important for SREs and development teams.
Okay.
You mentioned earlier the process of upgrading Quicksilver.
Now, when we bootstrap Quicksilver, as you mentioned earlier, we don't need to transfer the database via file sync like we did with KT, right?
So how does this happen now, for example within a colo where we have a lot of bare-metal servers that need to sync at the same time?
Okay.
LMDB supports multiple snapshots.
For example, a worker node requests a copy of the Quicksilver database, and LMDB takes a snapshot of the current database.
And if another worker node needs another copy, LMDB snapshots it again, and that snapshot is transferred to that worker node.
An LMDB snapshot is considered a complete database file.
So if you have a snapshot, you just transfer it, save it as a file, and the worker node simply opens it.
The management node's current database and the snapshot are just the same kind of file.
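As a sketch of how an online snapshot can be taken, here is what it looks like through the bmatsuo/lmdb-go binding; the binding choice and paths are assumptions, and the underlying call is LMDB's mdb_env_copy, which copies a consistent view of the live environment without stopping readers or writers.

package main

import (
	"log"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

func main() {
	env, err := lmdb.NewEnv()
	if err != nil {
		log.Fatal(err)
	}
	defer env.Close()
	if err := env.Open("/var/lib/quicksilver", 0, 0644); err != nil {
		log.Fatal(err)
	}

	// Copy takes a consistent snapshot of the live environment while
	// other processes keep reading and writing; no shutdown required.
	// The destination directory must exist, and the resulting data
	// file is a complete database a worker node can open as-is.
	if err := env.Copy("/srv/snapshots/quicksilver"); err != nil {
		log.Fatal(err)
	}
}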
Okay.
Now, I think we have reached the part where we can talk about a lot of things, and we should also check whether there are questions from our viewers.
But aside from the features that KT has, what are the features of Quicksilver that KT doesn't have?
So first, unlike KT: by default, KT doesn't encrypt its replication.
In Quicksilver, from the ground up, we put in mutual TLS encryption.
So of course, like I said earlier, the seamless handover of sockets from the old Quicksilver service to the new Quicksilver service.
Now, in the case of KT, when we upgrade KT, we need to shut down the worker node's services, because a lot of services depend on KT.
In the case of Quicksilver, because the socket handover is seamless and because LMDB supports multiple reader processes, the upgrade is just seamless.
After the upgrade is initiated in systemd, the new Quicksilver instance automatically takes over the sockets and reads the same LMDB at the same time.
In the case of KT, that's not possible.
So I just want to ask, because earlier we also mentioned the size of the database.
How big is the difference now; is there a significant difference in terms of size?
And what optimizations did we apply in Quicksilver that let us achieve a very fast response time during replication?
So one of the reasons the response time of Quicksilver is fast is LMDB.
Because LMDB is memory-mapped, the LMDB file's address space is treated as ordinary memory.
So when transferring data between functions inside Quicksilver, it's just like passing a memory pointer.
It's not like other databases, where the contents of the key-value store have to be copied out; in the case of LMDB, the LMDB file is treated as part of memory.
That's the memory map, which is a feature of Unix and Linux kernels.
Because of that, the response time of LMDB is fast.
At the same time, the only disadvantage is that LMDB doesn't have online compaction, unlike other databases.
But it's not a problem, because we can do a regular compaction of LMDB.
Does that happen outside of LMDB itself, on the application side, or within LMDB?
So in compaction, what we usually do is copy the contents of the LMDB file.
We create a new LMDB file, and all the key-value pairs in the old LMDB file are written into the new file in order.
So all the empty spaces in the old LMDB are left behind, and the data in the new LMDB file is contiguous.
LMDB manages its storage in pages.
So let's say that in one LMDB file we write continuously; then the data is also laid out contiguously.
But commonly, a database doesn't only see reads and writes; there are also delete operations.
In LMDB, when you delete, that space is not reused right away; the pages are just put on a list of deleted pages, and new writes use new pages.
Eventually, before writing new entries, LMDB looks for freed space large enough for them, so it reuses the pages we deleted.
But over time some of those pages become too small, so small that no new key will fit in the deleted pages.
So what happens is that we have pages that just sit empty.
So in compaction, that empty space is left behind in the old LMDB database.
When you write everything into the new LMDB, it's just a sequence of read and write operations, so the order of the pages and the writes comes out contiguous.
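LMDB exposes this copy-and-rewrite compaction directly as mdb_env_copy2 with the MDB_CP_COMPACT flag, which skips the freed pages while copying. A sketch, again assuming the bmatsuo/lmdb-go binding and hypothetical paths:

package main

import (
	"log"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

func main() {
	env, err := lmdb.NewEnv()
	if err != nil {
		log.Fatal(err)
	}
	defer env.Close()
	if err := env.Open("/var/lib/quicksilver", 0, 0644); err != nil {
		log.Fatal(err)
	}

	// CopyFlag with CopyCompact rewrites the live database into a new
	// file, omitting the free pages, so the copy comes out contiguous
	// and smaller. Swapping the compacted file in for the old one is a
	// separate, operator-driven step; LMDB has no in-place compaction.
	if err := env.CopyFlag("/srv/compact/quicksilver", lmdb.CopyCompact); err != nil {
		log.Fatal(err)
	}
}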
The most familiar example is fragmentation in Windows.
In the Windows file system, when you delete a file, the storage blocks that held it are not immediately reused.
It's the same with LMDB: the empty spaces in the storage pages are not reused immediately.
So what happens is that the database file grows because of the empty spaces left inside the LMDB database.
So we also have fragmentation, right?
Yes, we also have fragmentation.
But because of the speed of LMDB, we also don't encounter slowdown because of fragmentation.
As you mentioned, because of the speed of LMDB, we were also able to quickly deliver other Cloudflare services to our customers.
And unlike KT before, database corruption is not common in Quicksilver.
One of the features Quicksilver enables in LMDB is fsync.
Every transaction log that is replicated from the management nodes is always fsynced to disk.
So it's not like KT, where writes are buffered, so that when the KT application crashes, the database file on the file system is incomplete and ends up partially corrupted.
In the case of LMDB, every transaction is synced to storage.
So LMDB on storage is always consistent.
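The durable behavior falls out of LMDB's defaults: unless the environment is opened with flags like MDB_NOSYNC, every committed transaction is flushed to disk before the commit returns. A small write sketch, again assuming the bmatsuo/lmdb-go binding and a hypothetical key:

package main

import (
	"log"

	"github.com/bmatsuo/lmdb-go/lmdb"
)

func main() {
	env, err := lmdb.NewEnv()
	if err != nil {
		log.Fatal(err)
	}
	defer env.Close()
	// flags=0 keeps LMDB's default durable behavior; lmdb.NoSync or
	// lmdb.MapAsync would trade that durability away for speed.
	if err := env.Open("/var/lib/quicksilver", 0, 0644); err != nil {
		log.Fatal(err)
	}

	err = env.Update(func(txn *lmdb.Txn) error {
		db, err := txn.OpenRoot(0)
		if err != nil {
			return err
		}
		// Once Update returns, this write is synced to storage, so a
		// crash immediately afterwards cannot leave the file partially
		// applied.
		return txn.Put(db, []byte("zone:example.com"), []byte("config"), 0)
	})
	if err != nil {
		log.Fatal(err)
	}
}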
But just in case there is corruption in LMDB, bootstrapping is seamless, unlike KT, because in the case of KT, the KT instance on the management node needs to be shut down before its database file can be synced to the worker nodes.
In the case of Quicksilver, a snapshot is taken without shutting Quicksilver down, and the snapshot is synced to the worker node.
Eventually, after the snapshot is synced to the worker node, we enable the dependent services and the worker node serves traffic again.
So unlike KT, where shutting down the management node causes replication lag, in the case of Quicksilver, because it's an online snapshot, there is no downtime on the management nodes.
So, I think one of the features that I know of is peer-to-peer, P2P, bootstrapping.
Can you share how P2P bootstrapping works and the advantages it gives Quicksilver?
So, we can think of P2P bootstrapping like BitTorrent.
Because the Quicksilver database is an LMDB file, it is treated as a simple file.
So, usually, we could use SCP to transfer the database file.
But in P2P bootstrapping, we transfer the file peer to peer, BitTorrent-style.
P2P bootstrapping lets the worker nodes contribute to syncing the Quicksilver database.
So, let's say we have 10 servers that need to sync the Quicksilver database.
At the scale of Cloudflare, our LMDB, the Quicksilver database, is large.
So, in P2P bootstrapping, the Quicksilver database is sliced into chunks, and every server included in the P2P bootstrapping gets chunks.
Since those chunks are available on different nodes, instead of getting them from the management nodes, the worker nodes can contribute.
For example, if chunk 1 is not available on worker node 1, it can get it from another worker node that has that specific chunk of the LMDB database.
So, eventually, instead of the bottleneck being in the management nodes, especially when we're bootstrapping hundreds of databases, we can reconstruct the database from the different worker nodes.
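A sketch of the chunking idea: hash fixed-size chunks of the snapshot file into a manifest, so any peer holding a chunk with a matching hash can serve it. The chunk size, path, and manifest shape are invented for illustration; the transcript does not describe Quicksilver's actual P2P protocol.

package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
)

// chunkManifest hashes fixed-size chunks of a snapshot file. In a
// BitTorrent-style bootstrap, every worker advertises which chunk hashes
// it already holds, so a node missing chunk i can fetch it from any peer
// that has it instead of from the management node.
func chunkManifest(path string, chunkSize int) ([][sha256.Size]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var hashes [][sha256.Size]byte
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			hashes = append(hashes, sha256.Sum256(buf[:n]))
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // final (possibly short) chunk handled above
		}
		if err != nil {
			return nil, err
		}
	}
	return hashes, nil
}

func main() {
	// Hypothetical snapshot path; 4 MiB chunks.
	hashes, err := chunkManifest("/srv/snapshots/quicksilver/data.mdb", 4<<20)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("snapshot splits into %d chunks\n", len(hashes))
}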
So, it's a shame that this is the last segment.
It looks like I still have a lot to talk about, but I think we need to say goodbye.
Before that, and before we take any questions from the viewers, I have one last question.
So, there are a lot of open-source projects here in Cloudflare.
So, do we plan to open-source Quicksilver in the near future?
So, Cloudflare plans to open-source Quicksilver, but for now, there's no definite timeline on when.
Okay.
Thank you very much, Jayson.
This is a very exciting segment, and I learned a lot, in fact.
So, to the viewers, I hope you learned a lot in this segment.
We will continue to share different kinds of content.
So, just visit cloudflare.tv for the full schedule.
And, finally, we are hiring.
So, if you find the projects that Jayson and I are working on at Cloudflare as SREs exciting and challenging, and you want to help us and support our mission to help build a better Internet, just go to the jobs page of cloudflare.com and apply.
Until next time, we will be here.
And stay tuned, because the next segment, Dynamic DNS with Cloudflare, will be presented by Nico.
Thank you.