From: Bach Duy Bui Sent: Thursday, March 02, 2006 9:29 AM To: Indranil Gupta Subject: Re: 598ig review 03/02 BAR Fault Tolerance for Cooperative Service 1. Summary This paper describes a general approach to constructing cooperative services. The authors propose a model that accommodates three classes of nodes: (1) Rational nodes are nodes that try to gain net benefit (2) Byzantine nodes are nodes that can depart arbitrary from normal operation (3) Altruistic nodes are nodes who execute a proposed program regardless of rational choice. The system architecture has three levels: (1) the basic primitive level provides reliable distributed services. At this level, the system use a BART asynchronous replicated state machine a modification verion of PBFT with the following principles: (1) ensure long-term benefit to participants (2) mitigate the effects of residual non-determinism (3) limit non-determinism (4) enforce predictable communication patterns. (2) work assignment assigns work to individual nodes: this level protocols work to reduce the replication overhead required by cooperative applications. There are three protocols for work assignement: (1) guaranteed response: ensure that every requests are answered (2) periodic work: ensure that clients periodically answer requests (3) message binding: binds messages to authoritative time. (3) application level implements a desired service using the levels underneath. 2. Comments The assumption of rational nodes without colluding is not reasonable. Rational nodes are more likely to do cleverly for the best of their benefit. From: Long Hai Vu Sent: Thursday, March 02, 2006 8:49 AM To: Indranil Gupta Cc: Long Vu Subject: cs598ig review 03/02 Class CS598IG-Spring 2006 Name: Long Vu REVIEW PAPERS BAR Fault Tolerance for Cooperative Services In this paper authors present the idea of constructing the cooperative services over multiple administrative domains. The proposed protocol BART could overcome Byzantine and rational behaviors. This protocol is also the key component for state machine replication that relies on synchrony assumptions only for liveliness. In the level 1 of their architecture, there is the BART state machine with a few fundamental primitives. BAR model ensures the long-term benefit to participants to encourage rational node to participate the system faithfully. This seems hard for reality if there are so many selfish nodes in the system. They also provide the limit non-determinism to offer nodes multiple behaviors. Moreover, they enforce nodes to take part in the system’s activities long enough for a permission to execute a command. They provide three protocols for work assignment in level 2. Guaranteed Response protocol makes sure that all the requests are replied while the periodic work protocol ensures clients periodically answer the implicit request required by an application. And the last one for binding the message to an authoritative time called Message binding protocol. The third level presents BAR-B protocol as a cooperative backup system. Node could join the system by participating in the system’s sate machine and contributing a certain amount of disk storage. After node contributes this disk space to the system, it could use the same amount of space from other nodes. This may cause a problem if there is malicious node which could gain write access into remote machine. They obviously could not ensure the security issue here. There are several contributions of the paper: - Introduction of BAR model (Byzantine, Altruistic, and Rational). This model is used as a foundation for reasoning process in cooperative services. - Given this BAR model, this paper also describe the three level architecture to reduce complexity service under BAR - Describe the BAR-B system which could tolerate Byzantine and rational behaviors. The Byzantine Generals Problem Byzantine Generals problem has actually been one of the central issues in distributed systems so far. This paper focuses on a group of generals of the Byzantine problem to control the system whose malfunctioning nodes generate conflicting information. To solve the problem, at least two third of the nodes are loyal. In this case, nodes could reach the agreement regardless of the malfunctioning nodes. The paper also presents the context in which the system is divided into subgroup of nodes. Each subgroup has a leader who could be loyal or malfunctioning. All loyal generals (or leaders of subgroups) decide upon the same plan of action. Relying on this settings, they propose an OM protocol (Oral Message) to control the system by forcing nodes to keep exchanging message with each other, given that at least two third of nodes are loyal. At each node, the correctness of information is obtained by a majority basis. That is, after receiving messages from remote nodes in the network, the node will use the value by voting on values it has got from these messages. One requirement of this algorithm is that the message has to be sent through a certain nodes before reaching the destination node. If the path is shorter, there is no guarantee about the correctness of the vote. In addition, authors propose a SM protocol (Signed Message) in which nodes send the message signed by its signature to a set of nodes in the network. This is different from the above scenario as in the above approach message is not signed. The paper was written in 1980 with a naive solution but it has an important impact for later research in distributed system. This is because achieving a reliable system is the ultimate goal of research in general and distributed system in particular. From: Brandt Dewald Dusthimer Sent: Thursday, March 02, 2006 8:40 AM To: Indranil Gupta Subject: 598ig review 03/02 netid: dusthime Title: Practical Byzantine Fault Tolerance by Miguel Castro and Barbara Liskov Summary ------- This paper presents a replication algorithm that is able to tolerate Byzantine faults. While Byzantine fault-tolerant algorithms existed before this algorithm, most of them were slow or assumed an underlying synchronous system. One of the core ideas behind this paper was to design an algorithm that could be used on the Internet (asynchronous network) and was fast enough for practical purposes. The core algorithm exists as a replication library that other services could be build on. For example, in the paper they talk about building BFS, a Byzantine fault-tolerant distributed file system that supports the NFS protocol, on top of their replication library. The replication library works through a state-machine replication protocol; the service is modeled as a state machine that is replicated across different nodes in a distributed system, which consists of a primary node and replicas. The algorithm works by first having a client send a request to the primary node. The primary then multicasts this request to the appropriate replicas. The individual replicas execute the request and send a reply back to the client. Finally, the client waits until it hears back from f+1 replicas with the same result (f+1 because at most f replicas can be faulty.) Through out the whole process messages are sent and verified for authenticity using public-key signatures (for saying who you are) and message digests (for fast verification of message data). Beyond this basic overview of the algorithm, the paper also discusses handling different methods of operation in the network (how things work if the primary is down, how to garbage collect on logs, etc), about various optimizations that can be done to make the algorithm faster, and also talks about a NFS-compatible network file system that they wrote using the presented replication library and it's performance. Throughout the paper, it is vaunted that their NFS-compatible network file system offers all the gains of a distributed file system with only a 3% decrease in speed versus normal NFS. Comments -------- - Over all, a good paper complete detailed design concepts and real world results. - I would have liked it if the paper discussed why they decided on using a primary node versus a pure peer-to-peer system. - While performance results seem to be good, I still think the algorithm is very message heavy. For every request sent out, every replica is supposed to respond. This means that if there are, for example, 8 replicas for a 100 MB file, at *least* 800MB is transfered over the network when it might be possible to transmit just the 100MB. This would probably require a significantly different algorithm, but it would be worth looking in to. Title: BAR Fault Tolerance for Cooperative Services by Amitanand S. Aiyer, Lorenzo Alvisis, et. al. Summary ------- This paper presents, BAR, a framework for constructing cooperative services that span multiple administrative domains which can handle Byzantine behaviors. First it introduces how BAR should work, then discusses the architecture of the BAR model, and finally describes BAR-B, cooperative backup service that tolerates Byzantine users. Inside of the BAR model there are 3 different types of nodes: Altruistic nodes (ones that follow a given protocol exactly), Rational nodes (ones that are interested only in maximizing their own utility with the given protocol), and Byzantine nodes (any other behavior that's not rational or altruistic.) A BAR-based system is based on a couple of core beliefs. First, each node has a specific identity (which is enforced by a trusted authority). This allows reputations to be built within the system (for example, if you send out Byzantine data the system might mark you as "probably Byzantine".) By having reputation, you enforce good behavior in the system. This, along with the other core beliefs, turns a BAR system into a semi-automatic trust based network. And, different actions can be taken with respect to a given node depending on how much another node or a group of nodes trusts it. Comments -------- - Like the Practical Byzantine Fault Tolerance, I found this paper to be good. It contained design ideas, explained in depth why they worked, constructed a real world program from the idea, measured its performance, and got good results. - Potential research: Could a physically down node have a different node type than "Byzantine"? It is the right thing to do to have a node which has a very good reputation (and therefore receives a lot of requests) go down for a short while but (because of the magnitude of requests it gets) be considered a low reputation node after only a short while? Would things weigh themselves out? (Read: the node's reputation wouldn't be as tarnished, because once it dropped in reputation, it would receive a smaller magnitude of requests which it would not be able to fulfill?) Would it be possible to near accurately guess a "down" node from a Byzantine node in order to classify them differently? Brandt Dusthimer From: Mehedi Bakht Sent: Thursday, March 02, 2006 8:31 AM To: Indranil Gupta Subject: 598 IG review 03/02 netid : mbakht2 Title: The Byzantine Generals problem Summary: This is a classic paper on the problem of reaching a consensus among distributed units if some of them give misleading answers. The problem has been expressed metaphorically by a situation where a group of generals, each commanding a division of the Byzantine army, is encircling a city. These generals wish to formulate a plan for attacking the city. In its simplest form, the generals must only decide whether to attack or retreat. Some generals may prefer to attack, while others prefer to retreat. The problem is complicated by the presence of traitorous generals who may not only cast a vote for a suboptimal strategy, they may do so selectively. The goal is to ensure that every honest general agrees on a common decision. To do that, we also need to determine the minimum number of traitorous generals that can make consensus among the loyal generals impossible. To formalize, we need an algorithm that satisfies the two conditions: 1. Every general should have the same idea of what every other general sent (i.e., every loyal general's list should be identical). 2. If general i is loyal, then every other general must have general i's true vote in his list. If we did not have this condition, then traitorous generals would be able to get the loyal generals to agree on a bad plan (one for which most or all loyal generals did not vote). The paper shows that if there are t traitors, then there must be at least n=3t+1 loyalists to solve the problem with oral messages. The basic idea is to have multiple rounds to pass messages and the node takes a majority vote. To handle any number of traitors, we need to resort to written messages instead of oral messages. The idea is to use a public key signature to ensure that messages can not be forged, and to pass those messages. The above solutions assumed that each node can communicate with any other node. The paper also presents a solution that works for 3m-regular graphs (containing 3m+1 nodes). Comments: Even though it has been written more than 20 years ago, the conclusions of this paper are still applied in practice today. From a less informed reader’s point of view, the best thing about this paper is that it clearly explains the theory behind the conclusions drawn and provides a useful analogy to understand the reasoning easily. I have some reservations about one of the assumptions mentioned in the paper - does the assumption of reliable communication (that all messages sent will be received) always hold in real-life scenarios? Title: BAR Fault tolerance for cooperative services The paper describes an approach based on Byzantine- Altruistic-Rational BAR) model to construct cooperative services that span multiple administrative domains. The model classifies the nodes of a system into three groups – Rational, Byzantine and Altruistic. A rational node tries to maximize its own benefit – even if it means deviating from an agreed protocol. A Byzantine node represents arbitrary behavior –resulting from component failure or malicious intent. The remaining class of nodes is Altruistic – they adhere to proposed protocol irrespective of the gain they derive from doing so. The system is based on three-level architecture. An asynchronous replicated state machine (RSM) resides in the lowest level. The protocol used is based on PBFT. The key ways that the proposed protocol differ from PBFT are the use of TRB instead of consensus, the use of round-robin leader selection policy, and the requirement of at least 3f+2 nodes to tolerate f Byzantine nodes. Three proposed protocols form the basis of second level (work assignment). The protocols are guaranteed response, periodic work, and message binding. In the third and final level, the paper describe BART applications capable of discharging various responsibilities like assigning works to nodes in fault tolerant manner, sanctioning misbehaving nodes, etc. The paper also contains a description of experimental results to evaluate the performance of proposed replicated state machine and BAR-B (a co-operative backup system). Comments: I like the introduction of the idea of selfish nodes – something not covered by Byzantine faults. However, the requirement of digital signatures in the proposed system seems an obstruction to achieving good performance. I could not also find any good discussion on the scalability of the system. From: ercanucan@gmail.com on behalf of Ercan Ucan Sent: Thursday, March 02, 2006 7:37 AM To: Indranil Gupta Subject: "598ig review 03/02" Byzantine Generals Problem --------------------------------------- Problem Addressed: Main idea in presenting Byzantine Generals Problem and a solution for that is the following: A reliable computer system must be able to cope with the failure of one or more of its components. A failed component can act as sending conflicting information to different parts of the system. The stated problem of trying to overcome with this type of failure is expressed abstractly as the Byzantine Generals Problem. Situation can be depicted as a group of generals camped around an enemy city and they can only communicate via messenger. The problem is trying to find an algorithm to ensure that the loyal generals will reach the agreement where there is a possibility that some of the generals may be traitors. Approach Taken & Comments: First solution is by using oral messages. Here is the algorithm to solve the problem: Algorithm OM(0). (1) The commander sends his value to every lieutenant. (2) Each lieutenant uses the value he receives from the commander, or uses the value RETREAT if he receives no value. Algorithm OM(m), m > O. (1) The commander sends his value to every lieutenant. (2) For each i, let vi be the value Lieutenant i receives from the commander, or else be RETREAT if he receives no value. Lieutenant i acts as the commander in Algorithm OM(m - 1) to send the value vi to each of the n - 2 other lieutenants. (3) For each i, and each j != i, let vj be the value Lieutenant i received from Lieutenant j in step (2) (using Algorithm OM(m - 1)), or else RETREAT if he received no such value. Lieutenant i uses the value majority (v1,v2, . . . . . vn-1 ). This algorithm works perfectly fine in a step by step model. However, this requires that the system this algorithm would get run be a synchronous system. Obviously, this algorithm cannot be a solution to the same problem if was assume an asynchronous system. The reason is that as we have an asynchronous system, then this problem would reduce to the consensus problem which is proven to be unsolvable by the FLP result. The other solution is for the signed messages. I will not be putting the algorithm on this review in order to save space. Basically, the idea in using signed messages is to constrain the messages from being manipulated by each commander signing the message before sending his value. In this case, the problem is now much more simplified than it was in the former. Paper also discusses an algorithm in the case of there are missing communication paths in the graph. In conclusion, I should say that this algorithm is the only one that solves such a problem in the area, and probably it is the optimal one. Practical Byzantine Fault Tolerance ----------------------------------------------- Problem Addressed: The main problem addressed in the paper is the Byzantine faults. In order to overcome this type of failure, the authors describe a new replication algorithm which is able to tolerate Byzantine faults. There are several key points in design of the algorithm though. The algorithm proposed on the paper is aimed to work in practical environments such as the Internet where there is no guarantee of synchrony of the system. Approach Taken & Comments: The algorithm is mainly a form of state machine replication meaning that the service is modeled as a state machine which is replicated across different nodes in a distributed system. Each replica has this state machine and maintains the service state and implements the service operations. Replicas move through a succession of configurations that are called views. In a view one replica is the primary and the others are backups. These replicas must be deterministic so that we can be sure that they work correctly as far as the state machine is concerned. In the paper, the algorithm is outlined as follows: • A client sends a request to invoke a service operation to the primary • The primary multicasts the request to the backups • Replicas execute the request and send a reply to the client • The client waits for f+1 replies from different replicas with the same result where f is the number of replicas that could be faulty (similar variable naming scheme is used as in Lamport's Byzantine Generals paper); this is the result of the operation. The contributions of the paper could be stated as follows: First, it is the first paper to describe the state machine replication protocol which tolerates Byzantine faults in asynchronous networks. Secondly, there is a set of optimizations presented in the paper that allow the algorithm perform better enabling the deployment of this algorithm on real systems. Third, it describes the implementation of a distributed file system which used their algorithm and thus Byzantine fault tolerant. Lastly, the authors provide experimental results that quantify the cost of the replication technique. Results: After implementing a Byzantine-fault-tolerant NFS service using their algorithm, the authors found out that their service is only 3% slower than a standard unreplicated NFS. I think that this performance hit is quite acceptable. -- Ercan Ucan - eucan2@uiuc.edu Graduate Student Computer Science Department University of Illinois at Urbana-Champaign ------------------------------------------------------------ From: Praveen Jayachandran Sent: Wednesday, March 01, 2006 5:29 PM To: Indranil Gupta Subject: cs598ig review 03/02 CS598IG Review 03/02 Praveen Jayachandran The Byzantine Generals Problem ------------------------------ The Byzantine Generals Problem is defined as a variation of the consensus problem in a synchronous system, where malfunctioning nodes could give conflicting information to different parts of the system. In this problem, a commanding general must send out an order to n-1 lieutenants such that all the lieutenants must agree on the same order, and this order must be the order issued by the commanding general, if he is loyal. The authors show that with only oral messages, the problem is solvable if and only if more than two-thirds of the nodes are loyal (non-malicious). With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. With only oral messages, the authors' first reason the impossibility of solving the Byzantine generals problem with at most two-thirds of the generals being loyal. A recursive algorithm to solve the problem with only oral messages, and a proof of its correctness based on induction is presented. Then a solution with signed messages is discussed. They further generalize their solutions to p-regular graphs. With only oral messages, they show that their algorithm is correct when p >= 3m. However, with signed messages, their algorithm is correct for arbitrary graphs as long as all nodes are connected to one another. They discuss an application of the Byzantine Generals problem in reliability of distributed systems. Comment: 1. The algorithm with only oral messages is optimal only in terms of the number of rounds in which agreement is achieved, but not in the number of messages. How can the number of messages transmitted between nodes be reduced and still correctness be ensured? 2. The algorithm cannot handle nodes that crash and then recover after a period. Practical Byzantine Fault Tolerance ----------------------------------- This paper develops a state machine replication algorithm that can tolerate Byzantine faults. The algorithm does not require synchrony to ensure safety, but requires it for liveness. Safety is guaranteed as long as at most floor(n-1)/3 nodes are faulty. Liveness is also guaranteed as long as the delay between sending and receiving, delay(t), does not grow faster than t indefinitely. Their evaluations show that their algorithm improves response time by more than an order of magnitude compared to earlier algorithms. In the algorithm, one replica acts as the primary and the others act as backups. Whenever a primary failure is detected, the view is changed and the replica with the next higher id (mod no. of replicas) is chosen as the primary. In this environment, a three-phase protocol that clients and replicas execute is presented. Garbage collection and view changes are discussed. Optimizations to reduce communication between clients and replicas, and amongst replicas are described. Based on this algorithm, a Byzantine-fault-tolerant NFS service is implemented and evaluated. Comments: 1. Under normal operation, the system is not distributed in the true sense. The primary replica encounters a lot more load than the backup replicas. 2. In the garbage collection section, the authors mention that the proof generation can be done once every 100 requests. Although all of these requests may be executed by f+1 non-faulty replicas, it may not be the same set of replicas executing them. In such a case, it might be difficult to identify replicas that executed each request while generating the proof (replicas could have failed in the meanwhile). 3. If a primary crashes, and recovers after a period of time, it is unclear as to how it is informed of the view change and about the fact that it is no longer the primary replica. From: Juan Jose Jaramillo Jimenez Sent: Wednesday, March 01, 2006 10:38 AM To: 'Indranil Gupta' Subject: 598ig review 03/02 1 The Byzantine Generals Problem 1.1 Summary This paper copes with the problem of component failure in a computer system, where the failure can cause conflicting information to different parts of the system. This is posed as the Byzantine Generals problem, where a commanding general sends and order to its lieutenant generals such that all loyal lieutenants must obey the same order, and if the commanding general is loyal, then every loyal lieutenant obeys the order he sends. It is proved that this problem is solvable in the case of oral messages (where the sender of the message cannot guarantee the message will arrive intact) only if the number of generals is greater than 3f+1, where f is the number of traitor generals; in the case of signed messages, this problem is solvable for any n and f. The algorithm in the case of oral messages is based on f+1 rounds of messages, and assumes a function "majority". The idea is that every lieutenant general sends the order to the other n-2 generals and then every general applies the majority function to find the order to follow. For the case of signed messages, the commanding general sends its order signed to the other lieutenant generals, and then they retransmit it to its peers. When receiving an order from other lieutenants, the general checks if the order was already received: if not, it signs the messages and retransmits it again. 1.2 Comments The problem with the proposed solution, as it is stated in the paper, is the cost of the amount of time and messages involved, so more assumptions are required in order to decrease this cost. However, if high reliability is a big concern, and the general case of the Byzantine generals must be handled, this solution is optimal, as has been shown in similar works, like the one by Fischer and Lynch. 2 Practical Byzantine Fault Tolerance 2.1 Summary The paper presents an algorithm for state machine replication that tolerates Byzantine faults. For the algorithm it is assumed an asynchronous distributed system, where messages can be delayed, duplicated, get lost or received out of order. It is also assumed that nodes fail independently. It is assumed an adversary that can coordinate faulty nodes and delay communication or nodes in order to cause damage to the system. To avoid spoofing, corrupted messages and replays, it is used a cryptographic system containing public-key signatures for the messages, message authentication codes, and message digests produced by collision-resistant hash functions. The algorithm is a form of state machine replication, where replicas move through a succession of configurations called views, where in a view one replica is a primary and the others are backups. The algorithm works as follows: when a client requests a service operation it sends a message to the primary, which is responsible for multicasting the request to the backups. Then all replicas execute the request and send a reply directly to the client, which waits for f+1 replies from different replicas with the same result (where f is the maximum number of failures the system can handle), and this is the result of the operation. To prove the feasibility and practicality of the algorithm the authors implemented a Byzantine-fault-tolerant NFS service, and using the Andrew benchmark they showed that the performance is only 3% slower than the unreplicated NFS service. 2.2 Comments One of the assumptions the authors make is that nodes have independent failures, and to be able to assume that, the authors propose that for this to be true even in the presence of malicious users, some steps should be taken: (1) each node should run different implementations of the service code and operating system, (2) have different root password and (3) a different administrator. While (2) sounds reasonable, (1) and (3) are not that sound, specially (3), since it requires as many administrators as replicas, which in a big system can become impractical. Also, having different implementations of the OS and the service code would make the maintenance of the system cumbersome. Thus, it raises concerns on how practical this algorithm can really be if it is based on an impractical assumption.