💻 The Future of Distributed Data: Why What You Learned in CS Class Is Outdated
One of the foundational theorems in CS states that it is impossible for a distributed data store to provide more than two of these three feautures: consistency, availability, and partition tolerance.
However, that is no longer the case. MongoDB Principal Engineer, Asya Kamsky, and Director of Product at Cloudflare, Aly Cabral, sit down to discuss why this CAP theorem is outdated and what this means for you.
Welcome, everyone. Today I have Asya and myself. Asya is a principal engineer for the office of the CTO at MongoDB, and I am myself a MongoDB alumni, but now the director of product for our Cloudflare Edge Compute platform.
And today we're going to be talking about the problems of simplifying distributed systems theory.
And we get to kind of riff on each other about the problems that arises in practice.
First things first, let's get into one of the main concepts to pick up on with, the CAP theorem.
Asya, can you give us a high-level description of what the acronym stands for?
Sure, why not? CAP theorem, it's a bit of a simplification. We'll start with that.
And the CAP letters stand for consistency, availability, and partition tolerance.
But of course, you can't just hear the word and know what it means, because those words could mean a lot of things, which is, of course, one of the biggest bones we're going to pick with the whole thing.
Yep, exactly. Why is consistency important?
Is it? Is it important? So this is an excellent question.
First of all, it's an excellent question, because the CAP theorem and the whole discussion was about distributed systems, systems where you had multiple nodes that were connected by a network.
This is why partition tolerance came in as a concept.
And consistency meant one thing. But we also are familiar with databases.
And in databases, we have ACID. And people would be like, oh, you can have an ACID database or a not ACID database, which is what the CAP theorem says.
I'm like, it doesn't say that at all.
The C in consistency doesn't even mean the same thing as the C in ACID consistency.
In ACID, it basically means we won't corrupt your database.
It means whatever constraints you have on your database will be true before and after you finish an operation.
And by the way, I mentioned, Ali, to you before, I was like Googling around about common misconceptions.
I read somebody who claimed that consistency in CAP is like atomicity in ACID.
I'm like, no, it's like isolation in ACID.
It's like, this is all wrong. But yeah, so one of the problems with acronyms and simplifications is a lot of times we mean different things, right?
So consistency, the way you and I are talking about it, and in terms of CAP and in terms of development, is the ability to assume that things are marching along forward in time in a way that's intuitive, right?
A consistent read is one that's of, if I just wrote the data, it would be consistent for me to read that data and get it back or get some later point of it back.
If something else got in there, right? I wrote something, then you wrote something.
When I read it, I better get either my version of it or your version of it.
I shouldn't get back the version that was there before I wrote, that's easy to reason about.
And that's, if the world functioned that way, it would be very easy to write code and you wouldn't have to think any more about it except, hey, everything just works.
Yeah. Yeah. I think that's exactly right. Like to put it in an example, like if I update my username, I don't want to see the old username value the next time I go and read, because then I'm like, oh shit, like it didn't propagate.
It didn't take, I better change it again. Exactly. Like, let me do the wrong action and try and take action against that, which is not intuitive for the developer, but also not intuitive for your end user, right?
That kind of experience.
Yeah. Or like, you post something on Twitter or Facebook or whatever, right?
And then you come back to the page and it's not there. And you're like, I could have sworn I posted this.
And now you post it again. And now you look like the idiot who's yelling the same thing over and over again, right?
Exactly. Now you have a double post, right?
And that really leads to imagine, yeah. Imagine if that's a charge or a transaction, like you're double, like charging someone, right?
Like then that's money. That would be bad. That would definitely be bad.
But of course, the behavior we're talking about comes with a tax, right?
There are trade-offs that exist in the world. What are typical forms of that tax a system like has to pay in order to offer a strong consistency?
Well, so it's funny that you ask that because if you only had a single server, you actually wouldn't have to worry about this, right?
And in the old days when we had big monolithic server and that's all you talked to, you didn't have to worry about it.
The tax was that that server cost you a few million dollars. And then how do you scale it once you get to a Cray or something like that, right?
So once we started scaling things by parallelizing workloads to multiple machines, all of a sudden we're like, oh, how do we keep the data consistent between the different nodes, right?
And so the cost or the tax can be paid in a bunch of different ways, right?
One way is you could say, okay, you can't actually talk to every node. You can only talk to the one that like you're currently talking to, because then you don't have to worry about asynchronous replication, right?
For example, or you could say, well, you can only talk to any of the nodes that are up to the same point in time as wherever you sent your data or worse.
How about just once you write your data, you have to block until we've informed you that the data now exists everywhere else.
And essentially that ends up being the trade-off between availability or performance or consistency or something, right?
And it's a range of options, which of course, why a simplification that just says, oh, it's just three things in a triangle.
It's not going to tell you all the subtle things about it.
Yeah. Yeah. That's exactly right. It's funny because also there are some people who look the CAP acronym and think the P stands not for partition tolerance as we discussed, but for performance.
But actually performance has a lot more relatability to availability.
Asya, any tips on how to think about availability and performance as related?
Here's a funny thing. I was thinking about this because it's like, yeah, you know, if you're willing to go to any node, including one that may not be as up to date, you might be able to get your answer quicker, right?
You don't have to wait until the data arrives there. But actually something occurred to me while I was mulling over this that might surprise you.
P actually does relate to performance in the sense of what is a partition?
If I send a request to a server and it takes a really long time, isn't that kind of not, I can't really differentiate that from a partition.
Am I partitioned or is the server just so busy that I can't get my answer within the SLA?
And I'm actually going to mention SLAs a lot because frankly they matter a lot.
If you have to have a response within five milliseconds and you don't get one within five milliseconds, does it really matter if the system was available or not?
No, no. And does it matter if it was not available because it's waiting to give you the most consistent or correct answer?
So yeah, I actually got to a point where I convinced myself that the word available might be completely meaningless, right?
Because, I mean, it's already said that it's not sufficient to return an error, right?
Availability means you return a response to the request.
A response that says the system is not available does not define availability.
We know that a slightly stale response makes the system available, though less consistent.
How bad, how off from consistent can the response be when you can still get away calling your system available?
What if you return a response that eventually turns out to be completely incorrect because you're one of those available systems that eventually heals the partition by throwing away some of the rights that you accepted?
To me, that's not available.
You said you were available. You accepted the right, but then you threw it away.
That's not available. So there's a lot of really hairy edges and transitions between these different states.
Yeah, yeah. I definitely get your point around partitioning and workload partitioning and performance too.
It's like you partition to get that performance to spread the work across multiple machines efficiently.
That does make sense on how they're also related. And you know how one of the read preferences, I'm sure we'll talk about this in a minute, but one of the read preferences MongoDB supports is nearest.
That's literally saying to the system, I care about getting my response back as quickly as possible more than I care about getting the latest possible version of the data.
It doesn't mean you won't get the latest version. It just means you're telling the system what you want to prioritize for a particular call, right?
And I think that granularity is pretty powerful, like offering users that kind of granularity and configuration.
Totally. If they know that it's available. If they know how to yield that power.
Yeah, yeah. That's right. Okay. Now that we've touched on the acronym, there's a silly, at least in my view, rule that systems can choose two of these things in the three acronym.
My construction manager has a similar rule for my new house that is being built.
You can have cheap, good quality on time, but you only get to choose two.
Sure. Software development. We've been saying this for decades, right?
It's a cheap, fast, correct pick two. Yeah, yeah.
I keep saying that instead of calling scope cutting. I do my projects, if I need to cut scope or whatever, instead of calling it that, calling it value engineering from now on, also another term to take from construction.
It's like, I like that framing much better. Yeah. Yeah. So how do you feel about that rule applied to CAP though?
I mean, for construction projects, for software engineering, it feels like, yeah, you get cheap, good quality on time, but is that really true for CAP?
No. I mean, and one of the ways that I see people talk about it, and there's a lot of critiques of the CAP theorem out there as a simplification, not as something that's completely wrong, but just as it's only useful as a very first step of talking and thinking about distributing systems.
And so what a lot of the critiques, including one that Eric Brewer himself, the author of the CAP theorem, says is that, look, the system should, in absence of network partition, the system should try to achieve both consistency and availability.
And if you don't have a network partition, it seems like you should be achieving both, right?
So by default, right, that's in MongoDB, you send your reads and writes to the primary.
This is the default. And as long as nothing ever happens, you're going to keep going to the primary.
You're going to keep getting the latest version of everything.
But the fact is that in the real world, things will happen, right? Nodes will fail.
Networks will fail. Nodes failing sometimes looks like network failing.
Network misconfiguration might be one or both way. Failure of packets to get there in time, right, and appear like failure.
And so MongoDB specifically doesn't, you know, says in default mode, we value both, right?
But we're going to give you consistent data.
But you can specify if you're willing to trade off that consistency for certain operations, if there should be a failover, right?
So if you will lose the primary, and there's an election happening, and it might only take one or two seconds, but during that time, 10,000 requests might come in.
And we need to know, do we hold those requests?
Do we answer them with the latest available data from one of the secondary nodes, right?
And without the user telling, without, and the user here, of course, is the developer.
It's not the end user who's, the end user is presumed to be spoken for by the business product owner, right, who's looking out for them.
And who says the end user must receive the fastest possible response of the most correct possible answer.
But the developer has to translate that into something and communicate that to the system.
And there are different contexts.
So, you know, you can specify read preference, which says, which node am I willing to read from?
Then there's read concern, right? That says, how committed, how durable should the data be for me to read from it for this particular query?
And it can be different for two different queries for multiple different operations, right?
MongoDB, you know, supports transactions, you know that very well.
Product managed that particular thing, right? The transactions are not necessary for every set of operations.
For some of them, they are. Who makes that decision?
Well, the developer. They have to know that that option is available, right?
Write concern, right? How durable should the write be before I acknowledge it to the application, which acknowledges it presumably to the end user.
So, these are actually the trade-offs that say that we're not going to tell you what's important for your application.
You should tell us, and we will then take the appropriate action.
And there's no wrong answer, right? There are applications where any of those combinations make sense, depending on the application, and even parts of the application, right?
Like you can choose this kind of configuration at the operation level.
So, it's not even like a blanket label that you apply to an application trade-off.
Right, right. And it frustrates me when people say, oh, you know, MongoDB is a CP system, or MongoDB is a whatever, or some other system is a whatever.
It's like, there's no such thing as a system. I mean, sure, there are probably systems that don't support strong consistency, no matter what settings you make, right?
Or there may be a system that doesn't have even a concept of inconsistent reads.
I'm not aware of any, but I'm sure there are some simplistic ones.
But all systems allow you to, just like in isolation levels in relational databases, traditionally, you actually could set different isolation levels, depending on if you wanted to go faster for some operations, trading off potentially phantom reads, or, you know, committed reads, or whatever else, right?
Yeah, yeah. Funny tidbit on that. So, when I was product managing transactions on MongoDB, I would ask everyone who was using a relational database that I would speak to, what isolation level are you using?
And guess how many people could answer that question?
I can tell you, because when interviewing people, you know, if they claim to be experts in relational databases, I would want to discuss isolation levels and transactions.
And even answering what ACID stood for, a lot of times people would kind of flounder.
And I'd be like, okay, maybe you have expert level understanding of something related to that system, but not that bit.
Yeah. Yeah. Be careful what you claim to be an expert in when you're talking to us.
Yeah. No, I mean, at the time, right, like when I first started having those conversations, I was shocked because I'm like, this is like real trade-offs, real trade-offs people don't realize they're making.
I mean, it actually has application behavior consequences that you're not considering.
And then I was like, well, I guess those consequences aren't painful enough for them to spend time.
Like what is going on here? And it was just a kind of like a very interesting eyeopening experience for me and wanted me to talk, like, it inspired me to talk about this more, right?
Like make people aware of the granular trade-offs they're really making.
Absolutely. Yeah. And a lot of times, you know, when people aren't aware of the trade-offs they're making, and we know this, they go, well, it must be that the tool I'm using is doing something wrong.
And that results in my favorite thing in the world, air quotes, which is the feature request for a database to break the laws of physics or something like that, right?
It's like, why can't you just do X? And it's like, well, we don't know that.
First of all, we don't know that X is what you want, right? Somebody else might say X is wrong behavior.
You should do Y, but also they go, well, why can't you do both X and Y?
And it's like, physics? Yeah. Yeah. It's like, why can't it be consistent and always available?
It's like, well, as long as nothing ever goes down, maybe it can be, right?
Yep. Yep. A lot of those conversations do get back to some physics fundamentals, don't they?
Like, if I could break the laws of physics, man, I wouldn't be talking to you about this product.
I'd be off breaking laws of physics because it would be fun.
I'm going to go back a little bit, though.
I sent you a screenshot recently of a question that's like, does MongoDB value consistency or availability more?
I sent it because I'm like, I don't even know how I would answer this question.
I hate that question. So how would you kind of like, if someone asked you that, how would you answer that question?
I mean, if I'm forced to answer that question with, it's a multiple choice and the choices are consistency and availability, output consistency.
Because of the default configuration, right?
By default, you're reading from a primary. If the primary is not available, by default, you will get back after whatever timeout you may have set.
It's like, eh, couldn't reach the node or primary not available or whatever.
Yep. So if you go by the defaults that we set, and also we only accept writes for any particular record on a single node and then replicate it from there, right?
Which is kind of like the primary secondary architecture. So based on that, I guess consistency, right?
But the whole point is you as a developer get to take advantage of the fact that there are secondary nodes with copies of the data, right?
That there are subsets of data that you may be able to reach versus the ones that you can't.
And depending on how you structure your operations, you may not see lack of availability, which another part of the application might see, right?
Yeah. And that's the most common follow-up to the cap theorem, is that it doesn't apply to a system.
It applies to an application interacting with the system during times of network partition or some kind of lack of availability event.
I think that that makes sense. Do you think that these are inherently different problems than concurrency or multi-threading?
No. Actually, I mean, in a way they're not, right?
And you and I have talked about this. A code that's written for a single-threaded system might be perfectly correct, but it holds within it some assumptions that might break down once you're in a multi-threaded system.
I mean, a database or data store, whether distributed or not, is a massive parallel system.
And that's why it's kind of tricky to serialize everything perfectly while not giving up performance.
And there's a lot of database research that goes on kind of around that.
This is similar in the sense that as an application developer, you kind of have to step away from assumptions that the system exists only in your one view, right?
There's a lot of other moving parts. My favorite application error is when you get a pop-up or an error message that says, should not be here, should never be here.
This should never happen. That's like, oh, my favorite error message, right?
Gives you tons of confidence. Yeah. But it tells me that there was an invariant that either somebody assumed an invariant that was not so, or they were unaware of an invariant that existed in the system, right?
Yeah. What are you going to do, right?
Writing good complex software is hard. That's a fact.
Sorry to disappoint everyone. No, I think that's right. I know that you already touched on several examples of where you can configure based on your availability or consistency needs in MongoDB.
Did you touch on W, like, write majority or those yet?
Yeah, that's a perfect example, because I think I only mentioned reads, where you want to read from nearest or you want to read from primary or secondary or whichever, or read concern, you only want to read majority committed data.
What does majority committed data mean? Well, in a distributed system, majority committed means the majority of the nodes have accepted that write and made it durable locally.
So write concern in MongoDB is a concept that when you send a write to the system, you can tell it when it should acknowledge that write to you as being successful.
And, you know, you might say, okay, just tell me if it's successful as in there was no error or constraint violation, right?
Or don't acknowledge it until it's been committed to the majority of the cluster.
And that means that when I tell the client and the end user that your write is good, that means that we're not going to lose it unless we lose the majority of the nodes in the cluster, you know, which hopefully won't happen and is far less likely to happen than losing a single node, right?
If you have five nodes, you know, three is a majority.
So you can lose two and still have one copy of that data, which will then, you know, replicate to the rest of the system.
Yeah. Yeah, yeah.
I think that makes a lot of sense. So what happens if I'm not using W majority and a failure happens?
Like, is there a guarantee there that it doesn't persist or what do I expect?
Funny thing is most of the time, absolutely nothing bad happens.
Because most of the time the system is just cranking along and there's no failures.
Sometimes there's a failure on a secondary and that doesn't impact your write.
So only during those times when you send the write to the primary and the primary accepts the write and executes it without a failure, tells you it did, but then the actual crash, let's say of the primary happened after that point, but before the point when replication to secondaries happened.
So in that case, you might lose that write.
Now it's still going to be on that primary and it's going to get set aside because when the primary comes back up, it's going to heal itself and try to get back into the replica set and sync itself up to a consistent point.
But if it was the only one that had that write, it's going to have to roll it back or set it aside.
And now it's going to require manual intervention to figure out whether something should be done with that write or not.
Now, what was that write? It was just like somebody clicked something.
Whatever. Data is now off by 0.00001% maybe.
But if it was a user changes their username, I would want to wait until it's acknowledged by majority because I don't want to piss off the user.
Even though frankly, they have changed their username again, and they'll be like, didn't I already do this?
I don't know. Did I? I don't know. I thought I did. Did I just mean to?
Did I forget to click save? Actually, a lot more often, I bet you when I go, oh, the system lost my data, it might be that I forgot to click confirm because the UI had a third place where I had to confirm something.
On the other hand, if they did click confirm and that was some kind of a financial transaction, I would want to make sure that it's there and saved before it's processed because you can always reprocess something if you manage to save it.
But if you forgot to save it, then it's like, ooh, oopsie.
Yeah. Yeah. And MongoDB also tries really hard, even in cases where there are failures to not lose writes to.
One of my favorite things to talk about is even if you have a five node replica set, if you even propagated to one secondary, that write to one secondary, we're going to try really, really hard to elect that secondary as the primary in an election.
So even in that case, just getting propagation to a secondary really even makes those edge cases even smaller.
Yeah. Yeah. And a lot of times people don't realize, I mean, any node can replicate from any other node that's ahead of it, right?
You don't have to replicate from the primary, right? And so yeah, the distributed systems are hard.
Oh my God. There's so much stuff going on. Yeah. But fun.
I think fun. Yeah. Yeah. All right. So especially right now where the world, like most of our world software runs on commoditized hardware, on machines that are distributed, any one node can go down at any time.
Having some awareness of the trade-offs seem critical.
In your view with users you speak to, how many folks seem to truly understand the way they work with distributed systems?
Well, definitely not enough.
Definitely I wish more people realized that systems can't break the laws of physics, right?
That if certain things are going on, there must be some complex system that's operating behind the scenes.
And, you know, re-examining assumptions is something that sometimes happens way later than you think.
So, right. If you're testing an application, you get a certain error.
I would think the first thing to do would be to re-examine your assumptions, not to be like open a support case with the vendor and scream at them that their system did something wrong.
And it's like, no, actually that's what the system is supposed to do, right?
So yeah. But I do think that there's huge value in abstracting away as much of this from the end user as possible, right?
We want, we don't want people to be distributed systems experts to write good applications.
We just want them to be savvy and have awareness at the high level, you know, about the assumptions that they can and can't make.
They don't need to be expert level, intricate details, read every publication, you know, like that's just, that's just not necessary.
And, and that's kind of, that's the philosophy behind, you know, reducing the number of knobs, right?
Give them a small number of options to kind of tell the system in a simplified way, what a particular operation wants to prioritize, what its tolerance for latency delay is, right?
Like various timeouts that you can specify with reads and writes, all tell the system that this, you know, okay, I need a response within five milliseconds.
Even if their correct response is not coming yet, I need a, I couldn't reach server within five milliseconds, right?
That's valuable too, right?
So. Yeah. Yeah. And it also speaks to the importance of a good default experience as well, right?
Like making sure that the default experience is not requiring people to make those intense trade-offs in the very beginning either.
It's something that can be viewed as an optimization even, that would be good.
What do you wish people would spend more time thinking about in this space?
You know, it's funny. The first thing that occurs to me always when you say that is, I wish they would spend time thinking more about what the end user actually needs.
What's the real requirement? And sometimes for a developer, it means they have to go back to the business owner and be like, well, what are the SLAs here?
Because the system, like, it's not reasonable to have an SLA of 100% uptime combined with, you know, one millisecond latency for every operation.
That's just, you know, you want that great, but even if you spend a billion dollars, you're not going to get that.
So, like, why not just be realistic? And so, I feel like developers are sometimes in a position where they don't know what these requirements are.
So, how can they even make the right trade-off decisions when they didn't get that from, you know, whoever wrote their requirements?
And so, it's super important that, you know, the business people think more about what the end user actually requires and the developer think more about, given those requirements.
Like, here's an example. I don't want the system to optimize to always most consistent, most available data, even if there's no partition, because it's more expensive, right?
Ask for only what you need, right? Yep.
Yep. That makes a lot of sense. And then we'll leave it at that. Thank you so much, Asya, for joining us today.
I had a lot of fun. Thanks for having me. Of course.