Cloudflare TV


Presented by Brian Bradley
Originally aired on 

Join Brian for a deep dive into how the paxos distributed consensus algorithm works and an overview of practical optimizations.


Transcript (Beta)

Hello. My name is Brian Bradley, and I work on the content delivery team at Cloudflare, which is primarily responsible for caching and purging assets on Cloudflare's massive network.

A little bit about myself. I started working at Cloudflare in January 2020.

Before that, I worked at a startup called Aware. I really enjoy computer science and hard puzzles, like video games and engineering.

One thing I don't like is the feeling of not understanding what I'm doing or the tools that I use.

When that happens, I tend to become indolent until I figure out what I'm doing.

Throughout my time working as a software engineer, I began to realize there's so much I just don't understand and a lot of stuff I just take for granted.

I try to learn as much as I can to minimize what I don't understand. One thing I've taken for granted is how databases achieve replication.

I felt like I didn't really understand that so well, or how, say, a peer-to-peer LAN video game manages to give an illusion that there's a consistent experience for everybody and even knows when someone is trying to cheat.

Sort of weird.

Have you ever thought about that? How can two machines agree with each other and keep a consistent state?

It doesn't seem so easy, if you think about it.

What do they have to do? They have a checkpoint and wait for confirmation from every other computer and then check everything matches before they make progress?

Doesn't really seem particularly efficient.

So, that's the sort of inspiration that got me sort of looking into all this.

It's like, okay, so let's do a bit of exploration and let's try and figure it out.

I found out that this problem actually is called consensus.

It's actually one of the most interesting problems in computer science because of how rich the theory and set of available solutions and tradeoffs are.

Computers are not the only participants that participate in consensus.

People also strive for consensus on a daily basis, if you think about it.

You're always trying to achieve consensus with your coworkers to work together.

Consensus is pretty much what you think it is.

It's achieved when every participant achieves the same conclusion, given the evidence that's available to them.

I mean, the evidence as it's presented to each participant isn't necessarily the same, because everyone doesn't necessarily have exactly the same worldview.

But if consensus is to be achieved, then the conclusion must be the same, even if the evidence doesn't appear the same.

If all participants can reliably communicate with each other, consensus may just be easy.

For instance, if there are, let's say, like an odd number of to achieve consensus.

The problem becomes a lot more difficult when the participants might fail to communicate with each other, or if the participants can make errors.

If you've ever been a participant on like a project, you may have experienced difficulty achieving consensus if there were communication failures.

Computers are a little bit different, but they can also have difficulty achieving consensus for many of the same reasons that people do.

Computers may be commanded to work on something that's like a project, like say, handle this request.

And they make progress by independently responding to the events that they witness.

Whereas humans may respond based on like a combination of training, intuition, emotion, rules, mood, that sort of stuff.

Computers pretty much are just programmed to respond based primarily on rules and sometimes training.

Examples of the kinds of projects computers collaborate on that require consensus would be like agreeing on the identity of a leader, ensuring that each computer has a consistent record of events that's transpired, like a check log of the past.

Guaranteeing that a state machine is perfectly replicated on multiple computers.

State machine is just like a conceptual device that can only be in one guaranteed state and it can move from one state to the other.

And there's like a defined way that lets you move from specific states to specific other states.

Projects like these can be combined with other rules and algorithms to solve problems in the real world, like synchronizing clocks, ranking web pages according to whatever arbitrary criteria that you'd like to rank them by, control over and collaboration of like unmanned vehicles.

Consensus would be really important in a military application for making sure that unmanned aerial vehicles are collaborating to extract the best intel that they can.

Same thing for probes and space vehicles.

More utilitarian would be bouncing demand of the same service across multiple hosts.

So consensus can be difficult or even impossible to achieve if enough of the participants fail to communicate with each other.

If a protocol is invented for achieving consensus and it's able to achieve consensus even when some of the participants fail to communicate or fail to respond properly to communication, then the protocol is just said to be resilient or fault tolerant.

The protocol must, it just has to be defined in a way that the participants, even though they're independently following the protocol and they could fail to communicate successfully with each other, they'll still arrive at a conclusion and the conclusion must be the same for every participant.

That's how you have a consensus protocol.

Consensus could always be achieved if every participant were given the rule to just totally ignore every event witnessed and always surmise the same value.

If you think about it, it's like a trivial thing, right?

The goal is just to have consensus, then my protocol could be ignore everything and the answer is 42.

Not very useful.

Technically some conclusion is being reached when the participants engage in this pretend protocol, but the protocol is not very useful because the conclusion doesn't depend on the events that the participants are witnessing.

Generally it is the consensus protocols where the conclusion does depend on the events that the participants witness that are useful.

For instance, in the voting scheme given earlier, the consensus could be interpreted sort of like a measure of how common some evidence is from a perspective of each participant.

In the study and in the comparison of consensus protocols, it's useful to label certain like desirable qualities.

So there's three really big ones. Termination is a constraint such that eventually every faultless participant reaches a conclusion.

Agreement is a constraint such that every faultless participant must agree on the same conclusion.

And integrity is a constraint such that if all faultless participants proposed the same conclusion, then all faultless participants must reach that conclusion if they reach any.

So integrity is also sometimes called validity.

And you can weaken it in some applications. For instance, to guarantee that the conclusions reached by any participant is a conclusion proposed by only some proportion or constant number of participants, maybe like even as few as just one.

It depends on what you want to tolerate. So the constraints you get to pick.

If you're inventing a new consensus algorithm, you can say these are the constraints.

Certain faults, so I used that term earlier, faults.

Certain faults demonstrate permanent compromise of a participant.

If a fault prevents a participant from ever reaching a conclusion, it's called a crash failure.

A certain category of faults is the set which includes all faults.

In the literature, those are called Byzantine failures.

The participants, it may be individually unreliable.

So like computers can catch fire. Maybe the individual participants can't be trusted to achieve consensus.

Like maybe they're malicious and they're actually trying to interfere with each other to prevent each other from reaching consensus.

Consensus protocols have different levels of tolerance to faults in different categories.

Fault tolerance is never fault immunity. There's always a limit to the proportion of communication failures or internal faults that can occur in any consensus protocol.

So let's see.

There are ways to make a consensus protocol resistant to Byzantine failures.

You can strengthen the integrity constraint. You can strengthen it so that if any faultless participant reaches a conclusion, then it must have been proposed by some other faultless participant.

And that way you can tolerate Byzantine failures.

So the spirit of that is basically like someone can't inject an idea into the heads of the participants.

The only ideas that are valid are the ones that were suggested by faultless participants.

So any protocol can guarantee consensus among some number of participants.

But if you do that and you say we're going to tolerate up to two faulty participants and still be able to satisfy our constraints, then what you have is said to be too resilient.

And if you can only guarantee consensus when you have up to one faulty participant, then what you have is you call that one resilient.

So a few other qualities that might be of interest is the time required to achieve consensus and the number and size of messages exchanged by their participants while executing the protocol.

The messages might have an attachment that serves as kind of like a log of every participant that's communicated it or received it.

And that model is called written communication model.

And an oral communication model just means that you don't have those kinds of attachments.

So it's just like everything's very transitory and the messages that you get are just what was said to you.

So different consensus protocols may make different assumptions about the communication capabilities or inclinations of each participant.

For instance, most companies are organized hierarchically.

And most people are inclined to communicate more readily with people they sit next to.

A consensus protocol that is designed for people in that context, maybe they take those inclinations as an assumption.

Another example would be for computers. So not every participant in a computer may be connected to every other participant in a computer.

If they're connected in a ring pattern or like a star pattern or some other way, a network topology may be taken as an assumption of a consensus protocol designed for the computers.

Some protocols may assume discrimination of messages based on the authorities of the sending and receiving participant.

Typically when that happens, an identifier for the sender or at least an identifier for the authority level of the sender is attached to the message that might be guarded cryptographically or might not be.

A weaker form of discrimination could be assumed that's just based on the channel through which the message was sent rather than being like attached to the message itself.

But generally, consensus schemes that involve more stringent guarantees about authority can offer a higher level of fault tolerance than those that offer less stringent guarantees if they depend on authority at all.

The manner in which communication takes place could also be of interest.

For instance, if communication involves passing messages back and forth, sort of like at will, then that could be called asynchronous message passing.

But if you sort of have communication happen in batches or like rounds and you make this guarantee that messages that are exchanged in a single round can't interfere with each other or inform each other, so you sort of collect a bunch of responses and send a bunch of requests or messages out, and that just happens in a big batch, and you get your batches one batch at a time.

Then what you have is called asynchronous message passing.

So that's a lot of terminology.

It's a lot of words. There's results, sort of like in the literature.

It was shown in the 1980s that there is an anonymous synchronous protocol that provides resilience to Byzantine failures if less than one third of the participants are faulty.

That was like in the 1980s, early 1980s.

And in 2004, it was shown that there exists no algorithm that solves the consensus problem if the number of participants that suffer Byzantine failures is at least one third of the number of participants total in any oral messaging scheme.

In a written messaging scheme, there are known consensus protocols that can tolerate Byzantine failures even if every participant except maybe, well, except one suffers from such a failure.

Probably like the landmark paper, like the big deal paper, is one that's just referred to as FLP.

Those are the initials of Fisher, Lynch, and Patterson, just called the FLP paper or FLP result.

And it shows that there is no consensus algorithm that can always tolerate one or more crash failures in a fully asynchronous messaging scheme.

They were able to show in general that any such setup where you have fully asynchronous messaging, you can always, just by using that property and some other known properties, you can always construct a situation in which you can't have consensus.

But those are generally pretty difficult to actually instigate those conditions.

So it's usually not so much of a problem, but it's something to keep in the back of your head if you're worried about perfect correctness.

Well, what I'll describe next is a framework that has distributed computing applications.

So, like, specifically a way to implement a service that's resilient to failures of the machines that are being used to host it.

To achieve that, you need to make sure that the participants interpret the state of the service, sort of in terms of, like, a state machine.

And there's a reason for that.

You want things to be in terms of a state machine because you can have this state machine be replicated on each of the participants.

And the goal is you only want to have to deal with maintaining consensus about what the current state is or what the next state is going to be for the state machines.

If you can do that, then you can basically maintain the same state machine everywhere.

And then even if one of your computers is taken away, goes down, bursts into flames, whatever, you can replace it.

You can add a new one. And then there's got to be some way of, there's a process for doing that and letting you copy the state back into that new one in a way that's at least somewhat efficient.

And then you're able to swap in and out machines like it's nothing, you know, and it doesn't affect your service at all.

So, in this framework, basically, every one of the participants receives every message.

And each of them submit every response. And in order to reach a consensus about the present state of the state machine, you could use any one of a multitude of consensus protocols.

The one I'll be covering later in this talk will be Paxos.

But by leveraging consensus, the state of the service can survive not just one failure of one machine, but a number of simultaneous failures.

It just depends on the choice of the protocol and some other parameters.

So, the nice thing about state machines for this is because they're deterministic, they will always achieve the same state if you give them the same inputs in the same order.

And so, that is basically the whole problem, is you want to be able to reach consensus on what that order should be, the order of those inputs.

And the client would be able to detect when something's gone wrong.

When one of the machines begins to deviate, the client can be pretty sure that that machine is the culprit and that it's faulty.

And then you can have some sort of fault resolution, like replace the authentic participant, the one that doesn't have the same result as everyone else.

If the inputs to each participant are stored in a log, the participants can provide each other some resiliency to faults that may occur if there's a transient issue with the communication between the client and the participants.

Maybe the connection goes bad temporarily.

If our participant loses its ability to communicate expediently with the client, it can request copies of the log entries from some other participant.

And then it can use that to derive, it can drive its copy of the state machine until it's able to communicate with the client's recovered.

To keep the log from going too large, you have to employ some strategies, right?

Because the log can't grow forever. You need to be able to squash it down or forget pieces of it.

But it's not ideal if you just forget pieces of it.

Because what happens if one of your partner machines goes down for a long time or you need to add new ones, what do you do then?

So the trick there is you just have a duplicate of the state. And you process the log.

In order to remove log entries, you duplicate the present state of the state machine and then you just allow yourself to append more log entries.

And then when you've got a huge batch of log entries and you're ready to start squashing them down, you apply them into the state, make another snapshot of the state, and you keep that around.

So if a new machine needs to come up or you need to help another machine out, you can just give it that copy of the state and also play any logs on top of that.

And that should put it in the same state as everyone else.

That's called checkpointing. Doing that is called making a checkpoint when you make a copy of a state.

So I guess one key requirement of this approach is, as I said before, they have to be able to agree on ordering for the input they receive.

So a choice of consensus protocol.

So Paxos is a crash failure, resilient, asynchronous messaging, fully connected with respect to topology, consensus protocol for solving this problem.

Messages can be lost, duplicated, take an arbitrary amount of time to deliver.

They can even arrive in an arbitrary order.

The simpler Paxos algorithm versions, the simpler versions of the protocol, are not resilient to Byzantine failures.

So messages must not be corrupt and participants must not conspire to subvert the protocol.

Paxos requires that one participant sort of serve as a leader in order to guarantee liveness.

Essentially, Paxos lets one of the participants decide what the next state should be and the other participants just follow that lead.

The participant only needs to serve as leader long enough to achieve consensus on the next operation of the state machine.

After that, after it's been leader long enough to just achieve the next state, another leader can be picked.

You know, you can lose leadership after that and another one can be picked.

It could actually change a bunch of times before you even get to that point.

But if you can't maintain leadership at least long enough to achieve the next state, you'll never achieve the next state.

So being able to have that leadership long enough is critical for Paxos to be live, to always be making progress.

So why the name Paxos?

In case you're wondering about the name, Paxos gets its name from a fictional legislative system used on the Paxos island in Greece.

Well, it's just a fictional story.

So the inventor of the Paxos algorithm, Leslie Lamport, imagined that the parliament had to function even though legislators continually wandered in and out of the parliamentary chamber.

It's useful to frame the Paxos algorithm in terms of voters and laws.

It makes a lot easier to understand, I think.

Just make everything seem legislative and the algorithm becomes so much easier to understand.

So technically Paxos is actually a family of protocols.

It's not just one. And this family has like, it's just a bunch of instances of the same one, but you make various tweaks, optimizations, and trade -offs here and there.

And there's like a lot of dimensions to it, like the failure modes that might be possible, number of participants, number of messages, number of message delays before achieving a consistent state transition, amount of activity by the participants would be another.

And like I alluded to earlier, Paxos does not necessarily guarantee progress.

It can't, right? Because it's an asynchronous messaging-based consensus protocol.

And FLP states that you can't always guarantee progress in that case. But it does guarantee consistency.

Although the conditions under which Paxos stops making progress are pretty difficult to actually instigate.

So there's a version of Paxos called Byzantine Paxos that can even survive Byzantine failures, such as corrupt messages or participants that are conspiring to push the system out of consensus.

And I don't know, like an example application for that would be if you had like a peer-to-peer video game where the individual players can't be trusted and not try and cheat.

It doesn't even matter what they do.

They could try polluting the packets, whatever they want. They could try cheating and the protocol could be designed to be resilient against some number of Byzantine failures like cheaters, for instance.

So let's get into the meat of Paxos.

The behaviors are like, they're divided up into five primary or important roles.

There's clients, there's voters, proposers, learners, and leaders.

So a client just issues requests to proposers. So remember, try and think about this as much as possible in like legislative terms.

I know that client isn't a particularly very legislative term, but keep that mental framing going forward.

So a client issues requests to proposers and waits for responses.

And many versions of Paxos, the participator that is responsible for giving the client a response to a request is not the participator that the client sent the initial request to.

That's where you get and see this whole asynchronous, just push messages around sort of world.

So what about proposer?

A proposer, well, accepts requests from clients and it reframes them in the language of like a law proposal, at least that's how I like to think of it.

And it keeps trying to convince voters to agree on the law, agree on this new proposal for a law.

And when conflicts occur, the proposer coordinates with other roles to ensure the protocol keeps making progress.

It's really important that no matter what, you want to be able to continue to make progress.

Voters are also, you could also call them, I think they're called in some places acceptors as well.

They collectively achieve fault tolerance for the protocol. So any message sent to any voter, they must be sent to all of them, all the voters.

Well, that's not quite right. They have to be sent to a quorum of voters. So any message received from any voter or you can call it acceptor is ignored unless it's received from a quorum of acceptors.

The quorum is just a simple majority. So for instance, if the number of voters is five, then the quorum would be a size three.

It could be any three of the five.

So learners replicate the conclusions of the voters.

Remember learners is one of these roles. So after the voters make their vote and come to a conclusion, learners are responsible for like replicating that and making it available.

The learners learn the conclusions and then they take further actions, including sending result back to the client.

So now, okay, so leader.

So a leader is a proposer that is elected as, it's sort of like a side effect.

So it's elected as a side effect of accepting a new law. And that doesn't make sense right now, but I'll explain that.

So a leader is a proposer, it's elected as a side effect of accepting a new law, and Paxos will continue to maintain consensus even if multiple proposers believe they are leaders at the same time.

I like to have a picture in my head of multiple people in the Senate thinking that they're the head honcho in the Senate at the same time, and there are two or three or four people trying to lead at the same time.

If they were following Paxos, they would still maintain consensus, but they wouldn't make progress until one leader is chosen.

That's one of the critical things about Paxos is it needs to be able to choose a single leader.

So leader selection is an interesting detail.

And then I mentioned this before, quorums.

Quorums are used to ensure that at least one participant, if you think about it, why do I have a quorum?

Why is it important to have a quorum of voters, a majority of voters?

It's because you want at least one participant that will have a record of the state.

So even if some of the voters become faulty, it doesn't matter, you've got the record of the state somewhere.

So as I mentioned, Paxos always meets the validity and agreement constraints, but it may fail to meet the termination constraint in rare cases.

It also includes a mechanism to drop a permanently failed participant.

So something that's undergone crash failure.

But I think also it has a mechanism to add a new participant. So there's four stages to the actual algorithm, the actual protocol.

And because it's all so asynchronous, different participants can be in different stages at the same time.

In the first stage, any client sends a message to any proposer.

And then the proposer is converted into a law proposal and a unique identifier.

It's got to be comparable and orderable, so like a number.

We can think of it like a number. And so it's given this unique number that identifies the law proposal.

And then at number, it has to be greater than any number used in any of the previous law proposals that were ever issued by that particular proposer.

The proposer then sends its law proposal to a quorum.

Not necessarily all, but it doesn't hurt to try all.

It has to send at least to a quorum of voters. And then if it can't do it, if it can't do that, if it can't access a quorum of voters, then it shouldn't initiate Pexos.

It should just return a response to a client saying it can't do, it can't proceed with this algorithm.

Or it could just block until conditions change.

It's worth mentioning that a law proposal is not just a number.

That represents newness.

It's also a description or value for the law itself.

So this number that has to constantly keep growing, you can think of it, a way I like to think of it is higher numbers indicate a law is newer than lower numbers.

And newer laws are given priority.

So that's one way of thinking of it. In the second stage, the voters wait for a law proposal from any of the proposers.

And when a voter receives a law proposal, it inspects the identifier to make a decision.

If a law proposal is newer than any law proposal that the voter has ever seen from any proposer, then the voter has to promise the proposer that it will ignore any proposal that is not newer than this specific proposal that I received.

So once a voter sees the newest thing that's ever seen, it has to promise that it'll ignore anything else that's not at least as new as that.

If the voter already previously accepted some other law proposal, then, well, that's a detail.

So right now, what I'm describing is a promise that the voter gives to the proposer saying that it will ignore anything that isn't at least as new.

That's just a promise of ignoring something. That's not an acceptance of the law.

In a separate stage, like a subsequent stage, you can actually get acceptance of the law.

And if due to that, if the voter had previously accepted some other law proposal, then the voter has to include the previous law proposal in its response to the proposer instead.

That's probably super confusing.

But I promise I'll explain better later in this talk.

If the law proposal is not newer than every law proposal the voter has seen from every proposer, then the voter ignores the received law proposal.

So, think about that.

Someone gives it a law proposal, but it's an old story, you know, months old.

And I've seen much newer proposals than that that have my attention.

So what do I do? Just ignore it. Technically, the voter doesn't have to tell the proposer that it's going to be ignoring it in order for Paxos to function.

But usually a rejection is sent to the proposer. It's kind of like an optimization saying, you know, you can stop trying to persuade quorum of voters to vote on a law, this particular law, because it's not new enough.

So that suits for stage two.

Stage three of Paxos protocol, the proposer awaits the promises or rejections from the voters.

If it has a quorum of promises, remember, we're talking about voters that were giving promises back.

We're looking for a quorum of promises, looking for a quorum of assurances that these voters won't accept anything unless it just happens to be newer than what you've already got.


So it won't accept older stuff. If you've got that, if you've got a quorum of promises, then the proposer tells the voters that it accepts the newest law proposal.

So it's like speaking past the point, saying, oh, you've promised me this.

I'm converting your promise into law. I'm converting it into reality. So the proposer tells the voters that it accepts the newest law proposal.

It might get, if you think about it, it could get, the promises that come back, because remember what I said earlier, I said that the promises that are given back, it could depend, right?

Like the voter may have already agreed and accepted a previous law, some other law, not the law that you're trying to get passed, something else.

And it has to give that back instead. Right? So when you look at your quorum of promises, some of them could be not the one that you're trying to get put out.

Some of them might be newer than your stuff. And what you want to do is as a proposer, you want to look at the quorum of promises and you'll look for the newest law proposal that exists in that quorum.

And whatever you get, the newest one, you send that back, send that back to all the voters.

Seems a little weird to retrieve a, you know, receive a value from, because you know, we're going to be getting a value from one of the voters and then sending it right back.

The value that you've got, it's like an echo. That's not the real trick of the algorithm.

The trick is that you're sending it to all the voters.

The newest one goes out to all of them. All of them in the quorum that you have.

So essentially that is like the consensus part of the entire protocol.

You're cooperating with the desires of other proposers and having a hard and fast tool that lets you decide whether to cooperate or lead.

So by doing that, the algorithm can ensure that multiple participators are able to reach a consensus.

In stage four of the PACSOS protocol, this is the last stage, if a voter is told that a proposer has accepted a law proposal, that voter must also accept the law proposal.

But only if and only if it's not already promised to only consider newer law proposals.

Some proposals that are newer than the one I'm referencing now. So that's important for guaranteeing order, right?

So voters told that a proposer has accepted a law proposal.

I said that that's a way that the proposer can sort of convert the promise that the voter gave into a law.

That's only gonna actually work if the voter hasn't already given a promise away saying they wouldn't do that, right?

So the voter has to be consistent with the promises that it gives out basically.

If the voter accepts the law proposal, then it sends a confirmation to the proposer and every learner.

And if the voter does not accept the law proposal because of a prior promise, then it just ignores the fact that a proposer has accepted a law proposal.

It shouldn't matter. It doesn't matter. Ultimately, a quorum of voters must reach consensus, and they might still even do so just even though one voter ignores the proposer.

Or possibly the prior promise that we're talking about will result in an even newer law being proposed.

That's a possibility. In case you're wondering, yeah, that means we're going through the whole cycle again.

So to make this a little more clear, because all these concepts are pretty tangly, and I want to get some diagrams up to make it a little easier.

Let me share a screen.

So this is the basic access algorithm. You can see we start with a client.

The client says, I want to make a law. So help me make a law, proposer. Try to get this passed for me.

The proposer takes the law and it submits it off to a quorum of voters.

In this case, our quorum is of size three. Like the example I gave earlier, you might have five voters total.

You could submit it to all five, and that's probably what you would do in practice, but you need to at least submit it to a quorum.

So the voters receiving these laws, well, law proposals, I'm sorry, give back promises.

They say, okay, I promise not to look at anything or consider anything that is not newer than one.

Remember, higher number means newer. So the proposer gets this, and it sees that it got a quorum of promises.

So it's like, okie doke.

So I'm telling you now that I accept this. I accept this new law. I'm declaring this new law.

It's sort of jumping the gun. The voters get to ultimately decide, but the proposer, you know, eager.

So it says, okay, I'm accepting.

And the voters go about doing their thing. They say, okay, well, I'm confirming this is a law.

Confirming to the proposer, I'm confirming to the learner, and converting to the other learner.

There's multiple learners in this diagram for availability.

And they each do that. Each do that exact thing. So you confirm one all over the place, going to everyone.

Once that's happened, the learners know now.

The learners know that they received a quorum. Remember what I said, it's receiving messages, receiving a quorum of messages from a voter of any kind.

You require that before you can make progress, regardless of your role. So the learner sees that it got a quorum, sends a response back.

So there's this other one, right?

And, you know, maybe that looks a little weird, because you have a client that sent out one message and got back two.

But that's totally expected. It's fine to have duplication there.

It doesn't matter. It just means if one of the learners goes down, it'll be just fine, actually.

We can look at that example.

So this is an example where the learner quits out, right? It begins the same way.

You propose a law. As a client, you want to get it passed. The proposer takes it, handles it, sends the law off, promises to come back, accepts to go out.

You get confirmations.

But somewhere along the line, when you have voter two trying to confirm to learner one, it sends out the confirmation.

But unfortunately, voter three isn't able to send out its confirmation to that learner, because that learner just went away.

Maybe the learner computer or whatever caught on fire or something.

It doesn't really matter, because the client still gets a response, gets one response back.

See? From learner two. So everything's great. So another example might be if the voters have problems.

That's one thing. The learners could have a problem.

The learning machine could go up in flames. But what about the voting machines?

Begins pretty similar.

So client says, here's a law I want to try and get passed. Proposes, like, I'll take it from here.

Sends it out to a quorum of voters. One of the voters has an issue.

And in this particular example, maybe it's a little confusing.

Let's say that the total number of voters is three, and a quorum would be two.

Okay? So not our normal three and five. Two and three instead.

So you tried sending it out, but you got a message that comes back. Right? You can see promises come back from voter one and voter two, but that's plenty.

You send your accepts out.

You send your accept out to voter three, but it didn't arrive because voter three burst into flames.

You still send your confirms out. And learner one and learner two see that they got a quorum, because quorum here would be two out of three.

And everything proceeds as normal. So this diagram should make it pretty clear that why we talk about things in terms of quorum.

Because it's pretty trivial if you have a quorum.

You don't really need to consider the other voters or machines in that context.

So in this scheme, a voter can accept multiple law proposals concurrently.

Might be pretty surprising.

And Paxos will still guarantee that a quorum of voters eventually agree on a single law.

They'll fight with each other. Eventually, they'll arrive at a single law.

That could happen when another proposer that is unaware of the new law proposal that is, like, in the process of being accepted by a voter sends a newer law proposal to that very same voter.

In that case, the voter can promise the proposer and then later accept the new proposed law, even though it has accepted an earlier law already that it proposed.

That seems like it could never work, but it does.

It works. So you can still get into some weird stuff.

Weird stuff can happen. So here's an example of something weird that could happen.

You have a client. It sends out a request to one of the proposers.

So in this case, you have two proposers. This is an example of something weird that can happen when you have more than one proposer.

Maybe you do that for availability to clients or resiliency. Remember I said earlier that laws have a content.

They have a value. It's not just a number.

I wasn't showing that in these other ones because it didn't matter, really.

We were only talking about one thing. Now I'm talking about values as well.

I send out this law, and I say the value of the law is A. As a proposer, I send out these laws to the three voters, and I get promises that come back promising me that it won't consider anything that isn't newer than one.

And then my proposer, right after it sent an accept for basically to say I'm commanding you to accept this as new law, right after it sends out that to one of the voters, it dies.

It just gives up and bursts into flames.

And so it doesn't even get to confirm back because it's gone. But the confirm does go out to both learners.

So this is an interesting situation. So maybe now we spin up a second proposer because our first one's gone.

Make a request to that.

Or perhaps the second proposer was there all along. But it just wasn't handling any requests at the time.

So we send out a request. And as you can see, it's a different value for the law.

So this is a problem, right? Up here, we have the law of sequence number one is A.

Here we have law of sequence number one is B.

So you send out this law to voter one, voter two, and voter three.

Voter one says it's already seen one equals A. And voter two and voter three say, no, I'm not going to consider anything that is that's one or less.

I need to see something newer.

But remember the algorithm I said before.

I said it would be confusing. Basically in this case, what you do is you look and you say, well, I was given something.

So I have to cooperate with the desires of the other proposers. In this case, the one that you're trying to cooperate with isn't even there anymore.

But you still follow the algorithm.

And you say, okay, well, law one equals A. I'm giving up on my law one equals B.

Law one equals A. So I send that out. Now, that may seem strange because it's almost like this request isn't being honored.

But this is how Paxos functions.

You send out law one equals A because you're trying to cooperate with the earlier intention.

Then you get promises coming back.

They say it's all fine. That's great. Send your accepts out.

You do your confirmation cycle to everyone. And you get responses coming back.

So that's an example of how you can sort of recover from this weird thing.

So as you can see, even when proposers, it's not only when learners and voters aren't available.

Those seem kind of trivial. But also proposers. When proposers lose their cool because they burst into flames, Paxos can still guarantee consensus.

Now, if it goes through a few of these cycles, it may not be able to provide strong guarantees on top of that.

So an example of how something even stranger can happen. Proposer one sends out so it gets a request.

It says I want to try to make this law.

So send this out to the voter one, two, and three. And then immediately dies.

Right after it tried to send out those requests. Proposer two comes along to take over.

And it sends out its own idea of what the law should be.

Law sequence number is two. Okay? So its law should override. And the voters are happy to oblige.

And they give back promise two. Well, maybe proposer one was only done temporarily and came back up.

That's what I'm calling proposer one B.

It sends out law two because its previous thing was law one. So it thinks it should send out law two.

Remember, it should always be incrementing. But then it gets back messages saying it's already seen two.

You know? You have to do better than that.

Nack. These are nacks. So it says okay. I'm going to send out law three.

But now proposer two sends out except two. Right? Because it got promise two.

It's going to send out except two. But now the voters say already seen three. Because that's exactly what proposer one B did.

You can see how they're fighting now. They're just going to keep fighting like this.

Until they get unentangled. So this is kind of like an entangling.

This is like a good demonstration of one of like the key drawbacks of Paxos, which is it doesn't guarantee progress.

Doesn't guarantee termination. It can't. Remember? The result from the like landmark paper can't always do that.

But it does remain consistent.

It still achieves consensus. It just isn't making any progress. The state still stays consistent.

Everything is still correct. So it's interesting that when a voter if we go back and look at like the normal cases, when the voters decide to accept a law proposal, they're effectively electing the proposer.

So like if you had multiple proposers, they're electing one of the proposers.

They're saying, you're the leader.

We're considering you now. Right? That's basically what that means.

And that's the meaning of leader in the Paxos algorithm. It's just what happens to a proposer when it gets a quorum of promises that come back that are good.

It's like, okay, now I'm the leader. Now I'm being listened to and my law is being considered.

So why have that?

In many versions of Paxos, different kinds of conflict can occur.

And it's the leader's responsibility to take action in order to resolve the conflict.

So this entire process can fail when multiple proposers send conflicting law proposals or when the proposers don't receive a quorum of promises or acceptances.

And when that happens, it just means you have to retry the whole process with a newer law.

So, okay.

So if we think about it, agreeing on laws in a defined order is tantamount to agreeing on commands in a defined order.

And reaching consensus for the ordering of a stream of commands would be necessary for implementing a distributed state machine.

However, there is a lot of overhead to effectively select a leader and thus accept the law.

If the same leader happens to be, to keep being selected, then the stages to get a promise, to not accept law proposals that aren't newer, those can just be skipped, right?

If you just constantly keep getting the same leader, you can just, as long as it's fairly stable and you have fallback mechanisms, you can just skip that step.

To do that, law proposals include a number that starts at some predefined value, for instance, zero.

Basically, when the voters give it back, I promise.

And it's incremented only by the proposer that's serving as the leader.

And every time it ends up serving as leader, it keeps incrementing that number.

And so the voters only have to inspect that to see whether it's just the next number in the sequence.

Because if it is, it means it's like the number of times that that proposer has served as leader in a row.

So if the next proposal, like I said, the next proposal that voters receive indicates that the same leader was selected again, then the voter doesn't bother sending back a promise, doesn't bother doing promises anymore, just simply sends back confirmation of accepting the new law immediately upon proposer saying, here's what I want you to do.

Voters are saying, yep, sure. So here would be an example of that.

This whole mechanism is called multi -paxos, by the way.

And when the client sends out a law proposal, the voters come back and they say, okay, I'm going to promise, and here's your sequence number for this.

It starts at zero. So this is the number of times you've served as leader in a row.

Right now, zero, because this is the first time. It gets back, okay, I command you to accept this as new law, if possible.

And then you get these confirmations going out, and it's all great.

And then you get responses coming back, just like normal.

But now look what happens. Now look what happens.

So that was the normal cycle. This looks very much like the normal paxos algorithm.

This is where the optimization kicks in. Law two comes in from the request, requester, the client, right?

Sends out an accept with a sequence number that's been incremented by one.

It's a second law that has to also increase. This is the first time that it's served as leader in a row.

So sends it out. It gets confirmations immediately.

You see, these are accepts. These are not like law proposals, and there's no promises.

These are just straight accepts, and then it gets confirmations.

And then it's good. It comes back with responses. And that's how that works.

That's how multipaxos works. You can also have participants act out like multiple roles at once.

So a participant can be fulfilling the learner, voter, and proposer roles without sacrificing any guarantees that paxos offers.

When this is done, the participants are just called servers, and the clients communicate to any server, and the servers all communicate with each other.

So here you have, like, three servers, and they are, and depending on the state of requests, they're taking on the role of learner, taking on the role of proposer, or taking on the role of a voter.

So here it gets requests come in. It sends out, it's pretending that it's the proposer, and it's sending out the law to the other two servers.

So it's pretending that those other two servers are voters. The voters, the other two servers, server two and server three, are going to play that game with them.

And you're like, okay, so I'm fine. I'll be the voter for you. I'm going to give you back prompts.

And here you go, zero. Zero time you've gone in a row. And then it gets accepts coming back.

And then it gets confirmations. And then it responds to the client.

In this case, it's actually acting as a learner, right?

All these confirmations go out, and these others have the state be duplicated as well.

So they're all learners. Another request comes in. It's the same model as before.

It just squashed into three columns now. You get your accepts, you get confirms, and you get response, just like earlier.

So there's this other version of Paxos, which is sort of, like, it involves this new role that I haven't discussed yet, called, like, an idle role.

And it's called, let me try to recall the name.

It's called cheap Paxos.

What makes it cheap is you don't have to keep as many voters going.

You can instead just have, like, a few voters going, like, two. Two instead of three.

And as long as the two constantly agree with each other, everything is great.

You don't have to involve a third one, right?

Maybe you're looking to do, like, a quorum of two out of three, for instance.

One of your voters misbehaves. So you no longer have a quorum.

If you send it out to all three, then maybe you get a response from the third one, right?

But realistically, you only need to get responses from two reliably, right?

Because you don't expect your machines to go up and down that often. So what you can do is you can reduce the availability or, yeah, you can reduce the availability of the third one.

So the idle one just sort of sits by. It doesn't really take an active participation in the protocol.

It's like a very lazy voter that only gets up and does anything when it actually has to.

So in this case, client sends out quest.

And I'm jump starting it. So the sequence number is already one. It sends out accept.

Immediately I'm just doing that to reduce the amount of clutter in the diagram.

Accept goes out to voter one and voter two.

Voter two dies immediately. So we don't have that anymore.

It gets a confirm coming back out to the proposer and also sends a confirm out to the learner.

But a failure is detected here by the proposer. It knows there's a failure because it only got one comeback.

It knows the quorum is two, size two.

It only has one. So what does it do? It does it again, but this time it involves the idle agent.

Normally it doesn't involve the idle agent. Idle agent isn't something you're supposed to talk to that frequently.

Supposed to just talk to voter one or voter two if possible.

So it sends it out to idle because it has no choice.

So voter one sends back confirms again. Idle also sends back confirms in this case, but idle wants to get a voter two up and running because it doesn't want to be involved because it's too lazy to be involved.

So it starts up voter 2B, whatever that entails, a machine starting machine or something.

That's great.

So now when a request comes in, voter one and voter two are able, well, voter 2B are able to handle it like normal and progress is made and then it gets responses coming back.

Everything functions just the way that it should. So that's it for cheat Paxos.

But there's lots of other optimizations that can be performed too, each with their own tradeoffs.

For instance, by letting the leader also serve, as a lead learner, like a new concept, it can be responsible for informing the other learners of the new law.

So that's like a tradeoff between extra delay and reduced number of messages.

That's just an example of how you can leverage the fact that there's so much dimensionality to the Paxos algorithm and one of the reasons it's so complicated.

You can leverage the complexity in it to basically invoke tradeoffs at will.

You can tweak it in a lot of ways because there's so much dimensionality to it.

If the voters store the newest law proposal in stable storage that they all share, then stages two and three of the protocol become unnecessary.

They all share that storage and then you just put it there. Then if it's always just the latest, then voter can just respond with that shared value.

You don't have to do anything. Another example of an optimization is instead of sending a law proposal, the leader can send a hash of the law proposal to the voters.

Using hashes instead of law proposals can save space and network activity if the proposals are large.

Of course, that doesn't come for free.

A learner will learn the new law is confirmed if it receives a confirmation for either the law or the hash from a quorum of voters and at least one of those messages contains the new law rather than the hash because then you know that the leader was involved.

However, a leader could receive a promise telling it to use the law behind the hash, a new hash that it gets, but it doesn't know what that is so it'd have to receive it from some other participant and that is additional complexity.

But that can be useful too for reducing the sizes of the messages.

Maybe the messages are very large in your particular application.

In the server model from before, in this model, the proposer can send its proposal only to the voter marked as the leader.

Rather than all voters. All of this requires that the leader election be broadcast to all the participants.

That would be another optimization.

Another is instead of each voter telling each learner that a new law has been accepted, the voters can tell the leader and later confirm each of the learners.

It's similar to what I already mentioned.

That also adds a delay. And one last thing is the first two stages are unnecessary when the protocol just starts up for the first time because the leader can be chosen beforehand, like a priori, before anything happens.

You just say this machine is leader. As long as you have that, then it can just start off by saying, it can tell the voters that I've accepted this new law proposal, even though it hasn't received a promise beforehand.

You can just start things off that way if you'd like. Um, so there's a bunch of different versions of Paxos that change the basic guarantees or assumptions.

I've already talked about cheap Paxos, but fast Paxos is another idea.

And it's basically like, okay, let's not have proposers, really. Let's just go client straight to voter.

So you can client is behaving like a proposer. But that requires, for reasons that are pretty arcane, that requires basically more voters in order to handle the same number of faults.

It's like three times n plus one instead of two times n plus one.

But the client must be configured to, this is one of the other things, is the client has to be configured to send multiple, basically send a request to multiple participants, to these voters.

In the earlier example, just client sends one request to proposers and the proposers branch out.

And if you're going to squash things down, then of course the client has to now do that job.

The advantage here is we reduce the number of steps, like the number of times messages have to bounce around.

Some of these can happen in parallel, concurrently with each other.

But there is some amount that can't. For instance, the accept has to go out before the confirms can come back or the response can come back.

But in this case, we really only have to send accept out, confirm, blast, response comes back.

So very, very efficient in those terms. But there's drawback here, and that's what happens, that would be what happens if you do have a collision or a fault of some kind.

In that case, you have to, here's an example of one.

You have a couple of clients, client one sends out, immediately says, okay, here, I'm on sequence one.

I'm telling you, I want you to accept this law of newness value two, and the law is jump.

Client two sends out something else and says, no, no, run.

Okay. So client one sends it out to voter one, client two sends it out to voter two, and they do their whole broadcast to each other, and a collision is detected by the leader.

Then, in response to the collision, the leader, which is just, in this particular context, it's, it doesn't normally do anything.

And it sort of lets voters and clients talk to each other, but it only steps in when it sees that there was a collision.

And in this case, it knows there was a collision because it sees a conflict between one accept two jump and one accept two run, right?

So, just in response to that, it just makes an arbitrary choice, and it just says, well, I'm just going to say run, then.

You'd have to have, it doesn't even really need necessarily to be deterministic, as long as it can make a choice that all voters will then receive.

So, it does that, and then, and then there's a confirm cycle, and then response comes back.

You'll notice that when these two clients sent out their requests, one said that it wanted it to be run, and the other said it wanted to be jump, and the one that said run got their response, but the one that said jump, the other one, it was just dropped.

It was never handled. It was never done.

So, FastPak says, yeah, okay, let's do this. Super fast when everything's happy, but things can sort of clobber over each other.

When things compete, they step over each other, and they like sort of interfere with each other, and some clients get their stuff to go through.

Others don't. So, one thing that you can do to make that even faster, it's not the best world to be in, but you can make this faster if you'd like, is you can have the leader decide on a strategy for doing this resolution beforehand, right?

So, that's what it does.

It says one except two, and then run is a recovery step. It sends that out before the clients even get their request coming in.

You can have a bunch of those rules if you'd like or a whole table of things to do, but then it just proceeds as it did before.

The requests go out.

You get these collisions. You get a confirmation. In this case, the voter is the one that can detect the collision because they've had a strategy pushed to them.

It says run as recovery, and you can see that it was getting jump and it was getting run at the same time, voter two.

Same thing for voter one.

They detect a collision, and in this case, they're using run as a recovery, so they send out the confirms, and then the confirm is run, and you get a response back, and then jump got dropped.

So, that's pretty much it for fast Paxos.

Then there is Byzantine Paxos. So, Byzantine Paxos, if you remember, Byzantine or basically any kind of failure, not just a halting failure, but even when things try to interfere with each other or be malicious and actually interfere with consensus itself, you still can have, it's pretty amazing, but you can still have a consensus protocol that works and is resilient to the attempts of the participants to interfere with consensus.

So, to go into detail about that, Byzantine Paxos is just Paxos with an extra step to broadcast the law proposal acceptance messages, right?

So, you get law proposal acceptance messages that voters receive to every other voter.

So, when a proposer says, here, accept this, I want you to accept this as new law, the voters, they take that, and they're like, okay, first, before I do anything, I'm going to broadcast this to all other voters, just to make sure, right?

By doing that, you can actually, it doesn't seem super intuitive, but by doing that, you can make your, you can make your protocol resilient to Byzantine failures.

So, here's an example of that.

So, laws come in, promises come back, accepts go out, and then the confirms go out, and then you get responses that come back, right?

This is, this is basically Paxos with no failures, but see how that compares with Byzantine.

There's this extra step, right?

This step called verify. This is like the broadcasting step between the voters.

You see they verify it to each other. Verify happens after the accept step, and voter three verifies with voter one and voter two.

Voter two verifies with voter three and voter one, and voter one verifies with two and three, and then after verifying, making sure everyone has that, then you do your confirm steps.

So, it adds delay, but it makes it resistant to Byzantine failures, which is pretty interesting, and you can try to speed that up a bit.

So, you can sort of blend all these ideas together, and this is, this would be multi -Paxos fast Byzantine, right?

A bunch of things. So, in this case, what you're doing is you sound accept two dog, one accept two dog, so the one here, again, means that this is the first time you're acting as leader, and two, here in this case, it's fast because the client is serving as proposer as well.

That's why it's able to send out this one.

The two means that it's the newness of the law, and dog is just the value of the law that you want.

So, you send that out to all the voters, and, you know, maybe immediately after sending this command to accept out to voter three, it has a fault, and it's not a crash fault.

It's a, I'm going to pollute your messages in the future, then turn your dogs into cats fault, right?

So, voter one, it received a command to accept, so it sends out, it's confirmed to the voters, voter two and voter three, for dog.

Voter two, same thing, dogs everywhere.

Voter three, cats. Cats. Maybe voter three was hacked into or something, sending cats back.

Well, voter one can see there was a fault because it's got dogs and cats.

Voter two, same thing, dogs and cats, fault detected. Can't say anything about voter three.

I mean, it's going to get dogs and cats, but it's getting two dogs from the other people, but it's had a fault, so we can't know anything about it.

So, but these two faults that were detected, the voters know that.

They know that they had faults. So, they send confirm for dog out to learner one and learner two, right?

And as you can see, voter three sends out cat.

So, fault was already detected by voter one and voter two, but now learners also ignore because there's a conflict, right?

Because they got cats and dogs as well. What happens in this case? Well, we know the learners get to ignore this, so we can just have the voter one and voter two, which are the valid ones, which detected the fault, do it again.

They can try again.

They're in the majority, so they rebroadcast. That's the rule. So, they confirm two dog again with each other, then they send out the confirm for two dog to the learners, and the learner gets it, and it's able to send a response back.

And that's pretty much it for my explanation of Paxos.

I tried to make sure to combine all ideas that I could at the very end.

As you can tell, it's like it's a it's like a very deep it's a very deep sort of subject, right?

There's like a lot of details.

There's a lot of complexity, and there's a lot of nice little levers that you can tweak to sort of build up your own versions of the protocol to achieve different kinds of constraints.

Yeah, I hope you learned something during this, and I hope you enjoyed learning about it as much as I did.

See you around on Cloudflare TV.