From: hossein.ahmadi@gmail.com on behalf of Hossein Ahmadi [hahmadi2@uiuc.edu] Sent: Thursday, March 06, 2008 9:30 AM To: indy@cs.uiuc.edu Subject: 525 review 03/06 =============================================================================== On Scalable and Efficient Distributed Failure Detectors Indranil Gupta, Tushar D. Chandra, German S. Goldszmidt Failure detection is essential in distributed systems such as group membership protocols and computer clusters. According to this paper, two main properties of a failure detection scheme are "completeness" and "efficiency". Completeness means that the system can eventually detect any failure in the system while efficiency corresponds to both speed and accuracy of detection. This paper first studies the optimal message overhead for complete and efficient detection of failures. Based on the probability at which a non-faulty node can detect a fault by mistake, one can find a relationship between number of messages sent over a time period and the loss rate of links. In order to bound the minimum accuracy, the number of messages per each process is lower bounded. Applying the completeness condition yields a lower bound on the total number of messages. Based on the discussion proposed in the paper, distributed heartbeat approaches does not meet lower bounds on message complexity. Therefore, authors take an step further to design a new distributed failure detection which is more optimal. This is done by relaxing detection time constraint to expected value instead of deterministic guarantee. In their protocol, each node selects another node randomly and tries to check if it is failed. This is done by first sending a direct ping, and if it fails, then requesting a group of nodes to ping the specified node. Authors show that the protocol indeed has the desired expected value of speed and is complete. The approach proposed in this paper is interesting as you can sacrifice some accuracy and speed in order to gain much more scalability. The analysis on the optimal message overhead itself gives a very clear insight for designing new failure detection protocols. However, the analysis in based on the performance metrics presented (completeness, and efficiency). Now the question is whether or not the definition of accuracy and completeness (as in this paper) are enough to characterize the performance of a system. Regarding to protocol design itself, it is interesting to know if there is more distribution information about the speed of detection more than only mean. Another drawbacks of this paper is that it does not provide any evaluation based on simulations or real implementation to be compared against other schemes. ------------------------------------------------------------------------------- SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol Abhinandan Das, Indranil Gupta, Ashish Motivala The second paper proposes SWIM, a distributed protocol to address both failure detection and dissemination of newly joined nodes. SWIM splits into to separate component: first, a failure detection system. SWIM uses the probabilistic protocol presented in the previous paper as failure detection component. Second, the dissemination component which is responsible for distributing failure information, newly joined nodes, and voluntarily leaving nodes. One basic approach is to use multi-cast or broadcast for dissemination. However, it is not scalable since the message overhead for each process is proportional to the group size. In SWIM, dissemination component is an epidemic which piggybacks failure, joining, and leaving information into ping messages of failure detection system. Since the failure detection system has constant message overhead per process, the total message per node is constant. Using a round-robin table in failure detection against pure random pinging, allows SWIM to bound the maximum detection time (as opposed to the first paper which only provides expected value). Using some epidemic analysis, it can be shown that the total time required for all nodes to be aware of one nodes failure is indeed bounded. Simulation results verify that the message overhead is indeed constant, while detection latency follows the analysis provided. The proposed protocol has a very efficient design. SWIM achieves low message overhead at a very low cost which I believe is a great move from previous approaches. However, it still is highly dependent to failure and message loss probabilities and more importantly their stability. Therefore, it may not be suitable for dynamic infrastructures. Furthermore, simulation result does not compare this protocol against others. Therefore, it can not be shown that how much the saving in terms of number of messages has affected to overall performance which can be another point of strength for SWIM. =============================================================================== From: fariba.mahboobe.khan@gmail.com on behalf of Fariba Khan [fkhan2@uiuc.edu] Sent: Thursday, March 06, 2008 9:09 AM To: indy@cs.uiuc.edu Subject: 525 review 03/08 A gossip-based failure detection service, R. van Renesse et al, Middleware 1998 Summary: Authors propose a gossip protocol for failure detection. Though it is impossible to know if a process is has failed or incredibly slow, it is important to know if it reachable. In this paper authors use failed to mean unreachable . Authors propose a failure detection protocol with following properties: detection is independent of total members, resilience to both message loss and process failure, low false positive and false negative – most failure are detected correctly, scales in time with number of processes, and bandwidth grows linearly with increased processes. In the basic protocol each member keeps a list of all known members and a heartbeat count. Heartbeat count is incremented each round and gossiped to members. When a member receives this list and a heartbeat from another member he updates his list (with the higher gossip number from his and sender's). Periodically each member broadcasts its list. If member's heartbeat is not incremented for T_fail he is considered dead. To avoid resurrection of the dead, the dead member is removed after T_cleanup (> T_fail). The authors provide an analysis for how to fix these numbers so that certain quality of detection is achieved and worst-case bandwidth is linear to number of members. To scale in a larger system, author extend the basic protocol to multi-level gossiping. The members are partitioned on locality based on subnet and domain address. Gossip is mostly done in subnet and less between subnets and domains. Another extension to the protocol is to recover from massive failures or partitioning. Each member probabilistically broadcasts to subnet. This part was not clear to me is it broadcasting to his own subnet only? However the probability to broadcast increases if the member did not receive a broadcast longer, which is a sign that massive failures might be going on. Discussion: I guess the first paper on failure detection, introducing the idea of heartbeat. Is it requiring clock sync? But what if each subnet only has few members? Will this protocol results hold. I did not understand why it is less likely tha lot of members will broadcast at the same time. Intuitively, all of them will have high prob to broadcast after 20s so it should be highly likely. Also the broadcast will not work in a wireless environment. SWIM: Scalable Weakly-consistent Infection-style process group Membership protocol, A. Das et al, DSN 2002. SWIM focuses on membership information rather than failure detection. Or said differently they propose that unlike traditional systems failure detection and membership knowledge are different problems. They argue that heartbeating protocols scale badly in a large system and compromise either response time or false positive rate to deal with load. SWIM has: const msg per member, detects failure in constant time, deterministic bound on local time at member that detects (what?), propagate updates in infection-style, reduces false positive by suspecting before declaring. The failure detection component from the PODC paper, uses hierarchical ping to member that is suspected dead by one node. After suspicion by a node it chooses few other nodes randomly to ping him and if none get an ack, it is declared dead. This is also extended by using a suspicion round and disseminating the news and then waiting to see if it was just sleeping and then do the dead round. In basic protocol membership is simply broadcast. The extended version piggybacks membership info on ping and acks in case of broadcast unavailability or unsuccessful broadcast (multicast is not reliable). Discussion: The performance evaluation was only on 16 nodes. A simulation with larger number of nodes would make the protocol more visible. -- Fariba Khan 217-778-3922 PhD candidate Illinois Security Lab University of Illinois From: rebolledodaniel@gmail.com on behalf of Daniel Rebolledo Samper [dreboll2@uiuc.edu] Sent: Thursday, March 06, 2008 9:03 AM To: Gupta, Indranil Subject: 525 review 03/06 A GOSSIP-STYLE FAILURE DETECTION SERVICE In this paper, the authors propose a gossip-based scheme to detect node failures in a network with low clock drift, bounded delivery times with high probability and processes that may fail-stop. Nodes keep track of a heartbeat counter for each node and the time when the counter was last updated. Each node periodically increments its own counter and sends its heartbeat table to a randomly-selected node. The receiving node then merges the two tables by selecting the maximum value of each corresponding entry. A member is considered "failed" if his entry hasn't been updated after Tfail and is forgotten after twice that interval. The authors claim that the algorithm requires semi-linear (O(n log n)) detection time and linear bandwidth requirements, and that it is resilient to relatively small rates of message loss and process failure. As an extension, they propose multi-level gossiping: under this optimization, nodes located in the same subnet are more likely to gossip with each other, thus reducing bandwidth consumption at the public internet level. They achieve this by gossiping the subnet masks of the different nodes along with the heartbeat counters. Finally, they introduce a final optimization where nodes periodically broadcast their tables using the broadcast address of their respective subnets. The idea of using a gossip protocol to detect failures is very original, and two optimizations are very interesting (and potentially widely applicable). The first is exploiting node locality by encouraging communication between hosts in the same subnet. In many cases, nodes within the same subnet have higher bandwidth and lower latencies, hence the potential for performance improvements. Likewise, the authors use local subnet broadcasts to recover from "catastrophic" failures. However, this could be slightly impolite with other machines in the subnet. However, and even though this paper was accepted and published, the analysis in section 3 seems quite dubious: certainly the authors probably are more knowledgeable than they appear. Nevertheless, their new concept of round is tricky because, by definition, its length is random. Arguably, you could deduce asymptotic properties from limit theorems, but this is not done in the paper. Further, it is not very clear why the probability of mistake P(p,r) depends on the node p as from the definition it seems that it doesn't. Therefore, some additional experimental results would have bolstered their claims. Finally, the topology they use in section 4 (more specifically to generate Fig. 5) seems simplistic and I would have suggested that they use a topology generator with e.g. power laws (though maybe this paper predates studies of the internet's structure). SWIM – SCALABLE WEAKLY CONSISTENT INFECTION-STYLE PROCESS GROUP MEMBERSHIP This paper proposes a group membership protocol called SWIM. Its goal is to provide weakly consistent group-membership information to the nodes of a group. The protocol works in rounds, but nodes' clocks needn't be synchronized. In the basic protocol, each round, a node pings another node chosen uniformly at random. If it doesn't respond, the former queries k other nodes, which then ping the latter and transmit any acknowledgements to the original node. If none of these pings succeeds, the node is considered failed. This protocol ensures average detection time independent of the number of nodes and a (configurably) low false positive rate. The authors then introduce three optimizations. First, they use the ping/ack messages to transmit membership updates in an epidemic style. This ensures that dissemination of membership updates reaches a high percentage of the nodes in logarithmic time. Second, they introduce a suspicion mechanism whereby unresponsive nodes are first considered "suspected" (and this fact is gossiped) before they are deleted after a certain time-out. If a node resurfaces, an "alive" message is gossiped. Since nodes may be in several different "suspected" states throughout their lives, they introduce a reincarnation number controlled for each node (controlled by the node itself) that gets incremented each time a node sends an "alive" message about itself. All other gossip messages contain the corresponding node's incarnation number. Finally, to ensure that nodes detect failures in deterministic time, the authors modify the protocol to make the nodes probe other nodes following a round-robin scheme. The authors' mathematical analysis of the basic swim protocol is very simple and effective, and indeed the protocol seems very efficient: it ensures strong completeness – crash-failure is detected by all non-faulty members – with an average delay independent of the number of nodes, and a maximum delay linear in the number of nodes. The message average message load per member is also independent of the number of nodes, the rate of false positives is low and we can trade off speed for accuracy by configuring the protocol's parameters. From: Rahul Malik [rmalik4@uiuc.edu] Sent: Thursday, March 06, 2008 9:02 AM To: Gupta, Indranil Subject: 525 review 03/06 Submitted by: Rahul Malik Date: 03/06 A Gossip-Style Failure Detection Service SUMMARY: In this paper, authors have described a protocol for failure detection that is based on gossiping. The main advantage of the current protocol over the existing protocols is that the current protocol scales well with the number of nodes. It also provides timely detection of failures. First of all, they have presented a simple protocol that assumes a uniform network topology as well as a bounded number of host crashes. This protocol is based on gossiping of the heartbeat counters among the nodes. An analysis of this scheme has been performed using epidemics. Then next, they extend the protocol to discover some information about network topology and to use this information optimize network resources. There are two aspects to this protocol. First, the lengths of Subnet and Host numbers for each domain are gossiped along with the heartbeat counters of each host. Secondly, gossips are mainly done within subnets, with few gossips going between subnets, even fewer between domains. Finally, they present another protocol that when combined with gossiping, deals with arbitrary host failures and partitions. In order to deal with this, the failure detector does not immediately report members from whom it has not heard from Tfail, instead, it waits longer, but it doe! s ! stop gossiping to these members. PROS: The current protocol obviously has many nice properties. The algorithm is resilient against both message loss and process failures. The probability that a member is falsely reported as having failed is independent of the number of nodes. The algorithm detects all failures or unreachabilities accurately with known probability of mistake provided that the local clock drift is negligible. The algorithm scales well in detection time as well as the network load. Also, the protocol makes minimal assumptions about the network. CONS: This is a very good work and provides a very good approach for failure detection. However, there are certain assumptions in the paper such as local clock drift being negligible and messages being transmitted in bounded time, that may not be true in real time systems. They can improve by using a hierarchical tree-based approach. SWIM : Scalable Weakly-consistent Infection-style Process Group Membership Protocol SUMMARY: In this paper, the authors have described a protocol for disseminating weakly-consistent knowledge of process group membership at all the participating nodes. The approach is motivated by the fact that traditional heart-beating based protocols do not scale well with group sizes. The approach chosen is that they have separated the failure detection and membership update dissemination functionalities of the membership protocol. The failure detection protocol is based on random ping-ack based protocol. Next, for disseminating the component and dynamic membership, they simply multicasts the information to the group. Next, they propose a more robust and efficient SWIM protocol. Because the multicast capability is not really available on all the machines or is disabled due to administrative reasons, they propose to use epidemic infection style protocol for disseminating information. Another feature that they introduce in order to reduce the frequency of false positives is that th! ey! run a suspicion sub-protocol whenever a failure is detected by the basic protocol. PROS: The advantage of the current protocol is that it imposes a constant message load per group member. As a result, it is scalable. The protocol is able to detect a process failure in a constant time at some non-faulty process in the group. Also, the dissemination latency in the group grows slowly with the number of members. It also provides a mechanism to reduce the rate of false positives using suspicion sub-protocol. CONS: One of my concerns is that the experimental evaluation is provided using small group sizes only, the maximum being 56. However, one of the main claims of the paper is that their protocol is designed for large groups. So, they should provide results for bigger groups as in todays systems, much bigger group sizes are usually there. From: Alejandro Gutierrez [agutie01@gmail.com] Sent: Thursday, March 06, 2008 3:54 AM To: Gupta, Indranil Subject: 525 review 03/06 =============================================================== “SWIM: Scalable weakly-consistent/ Infection-stylem Process Group Membership” Abhinandan Das, Indranil Gupta, and Ashish Motivala. Reviewed by: ALEJANDRO GUTIERREZ =============================================================== This paper presents a project from the Department of Computer Science at Cornell University. In this paper the authors present a Scalable weakly-consistent infection-style process group membership Through out the paper, the authors provide a description for the SWIM protocol with the purpose of maintaining the group membership in scalable weakly-consistent manner. The protocol has two different phases: Failure detection and membership update. The difference between failure detection and membership update achieve is geared towards more savings in communication Each node is in charge of periodically sending a ping message to a randomly selected neighbor, and wait for acknowledgement. Failures are detected if there is no acknowledgement message received within the timeout period. Once the failure is detected, the membership updates the spread out via piggybacking on ping message and acknowledgements. =============================================================== =============================================================== “A Gossip-Style Failure Detection Service” Robbert van Renesse, Yaron Minsky, and Mark Hayden Reviewed by: ALEJANDRO GUTIERREZ =============================================================== This paper presents a project from the Department of Computer Science at Cornell University. The authors present a failure detection algorithm based on Gossiping. The algorithm works as follows: each node maintains a list of known active members. Every “Tgossip” seconds, a node increase its own personal heartbeat count and sends their list to another randomly chosen node. When a list is received, a node combines its own lists with the received one. If a node hasn’t received a heartbeat from any of the nodes in its list in Tfail seconds, it assumes the node has failed. Then it proceed to remove the failed node from its list after Tcleanup = 2*Tfail seconds. The paper analyzes the algorithms based on epidemic theory and tries to perform simulations. The result shows that the proposed approach is scalable and resilient against both message loss and process failure. In order to further reduce the overhead of useless gossip message, the authors propose a multilevel gossip algorithm. I am wondering the performance of this algorithm when there exist malicious nodes. Future approaches to this problem should verify the performance of such algorithm. The contributions of the described algorithm are: - I enjoyed guiding to a theoretical approach as well as the practical implementation in the paper. The main disadvantages of the described algorithm are: - It requires the clock drift for the computers to be tolerable. - The multilevel gossiping scheme increases the rounds for propagation of the active membership list. From: Justin Tulloss [jmtulloss@gmail.com] Sent: Thursday, March 06, 2008 1:32 AM To: Gupta, Indranil Subject: 525 review 03/06 Hi Indy, Here are my reviews for today. Thanks! Justin On Scalable and Efficient Distributed Failure Detectors: This paper discussed practical systems for detecting failure completely and accurately in an asynchronous, unreliable network. While completely achieving both goals has been proven to be impossible, this paper proposes that there is an optimal network load that will guarantee completeness and assure accuracy and speed to certain, defined probabilities. The paper proves this and then asserts that the currently favored failure detection methods, most of which are oriented around heartbeating, are very sub-optimal. The final part of the paper illustrates a randomized, distributed failure detector that achieves a much lower sub-optimal coefficient. I found this paper to be interesting in a couple aspects. First was the solid mathematical foundation. When the authors put these proofs at the onset of the paper, they are showing what is the known theoretical maximum of what they are trying to achieve. This very valuable not only for evaluating the systems illustrated in the paper, but also for comparing it to other systems. Once there is a mathematical maximum, we have an exact bar against which we can measure system performance. The second was the actual proposed protocol. I find it very interesting how again and again randomized/probabilistic approaches to distributed systems outperform their more traditional counterparts in the average case. This algorithm is another example of that. However, the protocol seemed overly simplistic. Without factoring in crash-stop failures, byzantine failures, and the lack of dynamic membership, it seems that this approach is not ready for real world systems. It would be interesting to see future work that established that this failure detection protocol is actually usable in these more complex environments. SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol This article was essentially an extension of the first. The main idea behind the paper was that by separating failure detection from membership dissemination, you can design a system that does both well, and certainly much better than by just using heartbeats. This paper stepped through a few different iterations of SWIM. The most naive was to just use multicast for membership dissemination and the failure detector used in the first paper. This added membership capabilities to the random failure detector, but was expensive and frankly unnecessary. The more intelligent versions of SWIM piggybacked on the packets already being sent around by the failure detector to also disseminate membership information. They then added on a suspicion protocol to reduce the numbers of false positives. SWIM was implemented, tested, and shown to work. It clearly uses much less overhead than heartbeating. I liked this paper for following up where the first fell short. This paper is really the culmination of the research, and while the first give a good background onto why this all works, I feel that it's all for naught when read outside of this paper. I was a bit confused by the suspicion protocol however, and perhaps I just need to read it more carefully. It does reduce false positives, but it does so in a way very similar to removing a node from the group. In a network with a known coordinator, it seems that the suspicion protocol does not actually reduce load at all and increases likelihood of the node being re-included only slightly. At the same time, it reduces the speed of failure detection immensely. I think some clarification on how this differs from standard joining and leaving would be nice. From: dkassa2@uiuc.edu Sent: Thursday, March 06, 2008 12:57 AM To: indy@cs.uiuc.edu Subject: 525 review 03/05 ============================================================================================= Review 9: Paper Title: A Gossip-Style Failure Detection Service The paper presents a failure detector based on random gossiping. In the failure detection protocol, each host in the network runs a failure detector process that executes the protocol and offers continuous process failure and recovery reports to interested clients. Output from one of the failure detector processes in the CS department at Cornel University is used to generate and automatically update a website at the department. This website gives an overview of the department’s network status, and a log of recent activity on it. The paper agrees that other approaches based on hierarchical, tree-based protocols have the potential to be more efficient than the ones it described. Nonetheless as what the paper said tree-based failure detector protocols require complex agreement protocols to construct the tree and repair it as failures occur. Gossip protocols do not have this problem, and in fact one of their important advantages is their simplicity. The probabilities involved in gossiping can be calculated using epidemic theory where for example the execution is broken up into synchronous rounds, during which every process gossips once. A member that has new information is called infected. At each round the probability that a certain number of members is infected given the number of already infected members is calculated. The paper presents the failure detector service in detail. It assumes that there is no bound on message delivery time. I have a problem with the failure detection time as a function of the probability of making false detections. When the probability of mistake decreases the detection time goes to 100s of seconds. I don't know how effective a protocol which detects failures after 100s of seconds is! ================================================================================================ Review 10: Paper Title: On Scalable and Efficient Distributed Failure Detectors This paper compares scalability of different failure detecting protocols and presents a randomized failure detector protocol. It shows how the optimal load scalability of the distributed complete failure detectors can be quantified as a function of application specified requirements: the quick failure detection by some non-faulty process and the accuracy of failure detection. The paper characterizes the optimal worst-case network load imposed by any failure detector that achieves an application's requirements. It then shows that tranditional heartbeating schemes are unscalable according to the optimal load. The paper then presents the randomized distributed failure detector algorithm that imposes an equal expected load per group member. The protocol uses a simple, probabilistically lossy, network model. This protocol satisfies the application defined constraints of completeness and accuracy, and speed of detection (average). The netowrk load it imposes differs from the optimal by a sub-optimality factor that is much lower than that for traditional distributed heartbeating schemes. The sub-optimality factor does not vary with group size (for large groups). The paper is presented in sufficient details and gives good future research directions and extensions. I do not have any major criticism against this paper. ============================================================================================ From: ysarwar@gmail.com on behalf of Yusuf Sarwar [mduddin2@uiuc.edu] Sent: Wednesday, March 05, 2008 11:57 PM To: Gupta, Indranil Subject: 525 review 03/06 --------- Reviewed by Yusuf Sarwar -------------- Title: On scalable and efficient distributed failure detectors Indranil Gupta, Tushar D. Chandra and German S. Goldszmidt The paper proposes a complete and scalable distributed failure detection technique, where timing and accuracy parameters can be specified by the applications. The authors argue that failure detection with strong completeness (detect all failures) and accuracy (no non-faulty can be detected as failed) is sufficient to solve consensus, hence impossible to solve. But the both completeness and accuracy can be attained to some thresholds specified a priori. The paper analytically shows that a protocol can be designed so that it can support the quality of failure detection in compliance to those thresholds. The proposed failure detection protocol gives two choices to the application, SPEED, a time bound T by which non-faulty group members should come to know of failure of any node in their group, and ACCURACY, a probability PM(T) such that the probability that a non-faulty node has been declared to be failed by the time T should be at least (1 - PM(T)). This value of T and PM(T) determines other parameters of the protocol such that the desired detection behavior is achieved. The authors show analytical results explaining how these bounds are achieved. Procs: - A randomized protocol, scales well for large system. - User tunable parameters for completeness and accuracy is an important point of the protocol. Cons: - =================== Title: SWIM: Scalable Weakly-consistent Infection-style process group membership protocol Abhinandan Das, Indranil Gupta, and Ashish Motivala The paper proposes a weak-consistent group management protocol. Membership management runs among a set of nodes and indents to notify the existence of all nodes to all nodes. Membership management protocol keeps track of all member nodes and eventually it turns out that failure detection and dissemination of these failures to all nodes are two fundamental tasks of a membership protocol. The authors designs SWIM, that provides constant message overhead per group member, time bound detection of faulty process, and infection-style member updates propagation. SWIM has two components, failure detector and dissemination. SWIM's failure detection appears in 'On scalable and efficient distributed failure detectors' that has just been reviewed. The idea is pretty simple. A node N of a group pings another node M in the same group. If it does receive a ping reply, it's all ok. But when it misses the ack back from M, N picks randomly a few nodes from the group that are asked to ping M. If any of them gets ping-reply, they relay it to N. If N receives none, M is detected as 'dead'. When a faulty node is detected, SWIM does not try to make a multicast to update its members. Rather it piggybacks this failure information in all future pings, ping-req and ack messages. The detector node when participating in usual pings and ping-ack passes this failure information to all, just like gossip in society. Eventually all members of the group come to know of failed members, and group membership is updated. The authors also introduce suspicion mechanism to reduce the frequency of false positives and round-robin probe target selection to provide time bounded completeness. Pros: - Randomized gossip based protocol requires less state maintenance at each group. - Failure detection is fairly scalable, allows applications to specify the degree of completeness and accuracy. - Overhead is preferably low Cons: - Group updates may be noisy due to infection-style gossip. Some nodes may be getting well updated, whereas some may be keeping slate membership. - Latency of spread of infection through group seems to be quite non-uniform and random, indicating arbitrary time delay in establishing updated memberships in the group. =========================== From: marefin2@uiuc.edu Sent: Wednesday, March 05, 2008 11:49 PM To: Gupta, Indranil Subject: 525 review 03/06 SWIM: Scalable Weakly-Consistent Infection-style Process Group Membership Protocol A. Das, I. Gupta and A. Motivala SWIM is a generic software module that offers failure detection and weakly consistent membership maintenance protocol for large-scale distributed system. The basic failure detection of SWIM protocol works by exchanging ping, ping-req and ack messages. When a process (Mi) detects another process (say Mj) as failed by not getting ping reply, sends a ping-ack message to k random neighbors telling that it detects Mj as failed. These random neighbors then ping Mj and reply the ack back to Mi. After the protocol period, Mi checks if it has received any ack for Mj or not directly or indirectly. If Mi finds no ack, it disseminates that Mj has failed. Dissemination protocol can be costly and to overcome this, the efficient SWIM protocol uses piggybacking approach through ping, ping-req and ack. It is infection style dissemination using gossip architecture. This failure detection protocol provides full completeness, but the detection time is not bounded. By selecting the ping target by Round Robin fashion from the neighbor list and rearranging the list randomly after the complete round, can bound the detection time. The basic protocol has higher frequency of false positive. To reduce the frequency of false positive, a process (Mi) when finds another process (Mj) failed after the protocol period, instead of declaring it failed, it marks it as suspected and disseminate this information. Any group member receiving this information marks Mj as suspected. Suspected members stay in the membership list and treated as non-faulty members for ping request. If any member of the group receives ack from Mj, it forwards alive message using another dissemination. When the suspected member receives the suspected message about itself, it also disseminates its liveliness message. If Mj is suspected at some member Mh , and this entry times out before receipt of alive message, Mh declares Mj as faulty and disseminates the confirm message. Confirm message override both suspect and alive messages. An incarnation number is attached to each of these messages to avoid the ambiguity. Also SWIM maintains the membership by multicast. Joining of a node can be done through a central manager or by broadcasting. Some of the pros and cons of this paper are below: · It scales nicely for large distributed system. It uses random peer-to-peer probing instead of heartbeating protocol. Figure 2 in the paper shows that the average message-load in the system is nearly constant with the increase of group size. · Idea of piggybacking seems to me an interesting approach of reducing the dissemination overhead. SWIM bounds the failure detection time by defining round-robin approach in ping selection. · SWIM provides completeness. Modified SWIM marks a process as suspected before declaring it as failed. This reduces the frequency of false positive. · This paper seems to me as a combination of several available ideas (some of these are author’s previous works, some by the others) in a software package with some modification. · The justification of protocol period and the suspecting time-out (both are 3log(N+1)) in the experiment are not explained in the paper and how to measure this generally for a system is not clear to me. · Also gossip does not work well in case of partitioning, and this paper does not tell anything about this and how to handle partitioning properly. Future work can be the modification of SWIM for managing the Byzantine failure. A Gossip Style Failure Detection Service R. V. Renesse, Y. Minsky, and M. Hayden This paper is probably one of the earliest papers that show the application of gossip style protocol in failure detection. The basic protocol is based on the Clearinghouse project. Each member maintains a membership list and their heartbeat counter. Every Tgossip seconds, each member increments its own heartbeat counter, and selects one other member at random to send the list to. After receiving the gossip message, the member merges the list with its own list and keeps the maximum. Occasional broadcast is used in order to be located initially and to recover from network partitions. If the heartbeat counter of a member has not increased for more than Tfail seconds, then that member is considered as suspected but does not deleted until next Tcleanup seconds. Tcleanup is chosen so that that the probability that a gossip is received about this member, after it has been detected as faulty, is less than some small threshold Pcleanup. This is done as the process might not have been failed actually, the network can be partitioned or the process is still alive but responding slowly or the link is slow. Within this Tcleanup time, a broadcast protocol tries to restore the connection to the remaining members. Based on this basic protocol, the authors have designed multi-level gossip protocol. Within the subnets, the basic gossip will work among the hosts. On average, one member of a subnet will gossip to another subnet within its domain and one member on each domain will gossip to another domain. Broadcasting can also be improved in the multilevel protocol for scaling purpose. Each subnet will run an instance of broadcast among the hosts, as well as each domain among the subnets and the domains each other. This is an improvement as most partitions are not arbitrary rather they occur in the boundary of subnets and domains. Some of the pros and cons of this paper are below: · This protocol makes minimal assumption about the network. It doesn’t enforce any bound on message delivery time and also allows that messages may be lost altogether. So this protocol provides much flexibility. · The multilevel gossip reduces the amount of bandwidth that flows through the Internet routers, since gossiping is concerned on the lowest level of the hierarchy. Also it accelerates the failure detection time (within the subnet) and resilience against the network partition for large network. · It provides completeness with known probability of false positive. But I have seen some later papers that have decreased the probability of false positive even further. · It is a heartbeat-based protocol. But the heartbeat message size increases with the increase of the number of members. The probe-based protocol can improve this. · Also this protocol uses broadcast for handling with partitioning, which can be expensive in network environment. This paper shows the nice analytical result and experimental output for the failure detection using gossip-based approach. The novelty of this paper is that it showed the approach of gossip in failure detection. Many other later works (and improved works) have been created based on this paper. Arefin From: Mirko Montanari [mmontan2@uiuc.edu] Sent: Wednesday, March 05, 2008 10:59 PM To: Gupta, Indranil Subject: 525 review 03/06 A Gossip-Style Failure Detection Service Review by Mirko Montanari The authors propose a gossip-based failure detection service that works within a group of processes in a distributed system and it is able to identify faulty processes. The basic functioning of the algorithm is the following: each group member keeps a list of all the other member he is aware of and, for each of them, it keeps a heartbeat counter. Every T seconds it increments its own heartbeat counter and forwards to the complete list of another randomly selected member. When the other member receives the list, it merges the member information by keeping the highest heartbeat counter. Any member from which no updated heartbeat information are received within Tfail seconds is considered as faulty and its entry in the list is removed after Tcleanup. Also, the authors propose an enhancement of the gossip protocol that keeps into account the structure of the Internet network (IPs divided into domain, subnet, host) to concentrate the gossips within subnets and minimize communication between subnets and domain. Broadcasts are used to recover from network partitioning. The authors provide a probabilistic analysis of their protocol and also give ways to compute the different time interval according to quality required in the detection. The analysis also addresses the problem of catastrophic failures (when half of the processes become faulty) and it theoretically prove that the broadcast and the gossiping algorithm can work well. PRO: + The authors provide a theoretical analysis of the performance of the algorithm. Through this analysis they are able to provide ways to compute the algorithm parameters analytically. + Implementing the proposed protocol is easy and it can be deployed without incurring in high bandwidth overhead. + The idea of relating the gossiping to the network locality is definitely interesting. CONS: - The authors do not seem to have systematically tested the system in a real environment. They run the system at Cornell but they did not provide performance results in a systematic form. Also, the results the behavior of the system in critical conditions. These results would have given an experimental verification of their formal probabilistic model. ---- On Scalable and Efficient Distributed Failure Detectors Review by Mirko Montanari The contribution of this paper is twofold: a formal analysis of failure detection protocols to determine optimality conditions and the introduction of a distributed failure detection algorithm that exploits the tradeoff between completeness, speed, accuracy and network load. The theoretical analysis provides a classification of the characteristics of failure detection algorithms in term of completeness, accuracy and speed. Completeness describes the fact that a crash-failure is detected by all or some of the non-faulty members. Strong accuracy describes the lack of false-positives in the faulty- process detection while speed describe within which time the faults are detected by at least some of the nodes. Also, an optimality bound on the network load of any failure detection algorithm that wants to deterministically satisfy some specified conditions about completeness, speed and accuracy in given conditions is provided. The second part of the paper introduces a randomized failure detection algorithm that is satisfy the completeness requirement and can be configured to obtain requested speed and accuracy. The proposed algorithm can also scale well in the size of the group. PRO: + The characterization of failure detection algorithms is interesting and the definition of an optimal network load provides a way to compare different algorithms analitically + The proposed algorithm performs well in term of network load and size of the group CONS: - This protocol does not keep into consideration locality in performing gossiping. It seems a good way to reduce traffic in the most congested links. From: Anthony Cozzie [acozzie@gmail.com] Sent: Wednesday, March 05, 2008 10:51 PM To: Gupta, Indranil Subject: 525 Review 3/6 I read both of the Cornell membership papers which seemed the easiest, since they are essentially the same paper. The original theoretical paper is based on a simple insight: if I know the message loss rate of a link, and I expect to receive 1 message every period, then after N periods without receiving a message, I can conclude that the node has failed with probability 1-p_ml ^ N. When fully parameterized this leads to the formula in the paper. SWIM follows this up with pointing out that it is more efficient to separate the failure detection and the propagation of failure information by gossip. Both papers are very simple IMO, but in a good way. It's difficult to comment on papers like this, which prove theoretical bounds and then achieve them. SWIM is basically a building block that stands as is: for its model, it cannot be improved. However, I am not totally convinced of the accuracy of the model (loss probability for packets being constant). From what I know if the internet, this probability should vary both in time and space. The SWIM paper addresses this, but their solution is simply to tune parameters conservatively, and their constant is 20 - pretty large. anthony From: emenese2@uiuc.edu Sent: Wednesday, March 05, 2008 4:54 PM To: Gupta, Indranil Subject: 525 review 03/06 Paper: A Gossip-Style Failure Detection Service Reviewer: Esteban Meneses (emenese2@uiuc.edu) Failure detection is one of the building block in many distributed system applications, for instance, resource management, replication, load balancing, group communication and so on. This paper presents a failure detector based on epidemic theory that according to the authors has the following properties. The probability that a member is falsely reported as having failed is independent of the number of processes. The algorithm is resilient to both message and member failures. Given that local clock drift is negligible all failures are detected accurately with known mistake probability. It scales in detection time, O(n log(n)) according to the number of processes. It scales in network load, because it goes up at most linearly with number of processes. The protocol assumes that every host in the network runs a failure detector process that executes the algorithm and that offers process recovery to the client. The basic protocol works in gossip-style, where a member forwards! n! ew information to randomly chosen members. It is supposed that gossip combines much of the efficiency of hierarchical dissemination with much of the robustness of flooding protocols (in this case each member sends new information to its neighbors). A list of all known members is kept by every member. In this list, the address and an integer for each host is kept. The integer is called the heartbeat counter. Every Tgossip seconds, each member increments its own heartbeat counter and randomly pick several members and sends its list to them. Upon reception, a node merges the two lists and kept the maximum heartbeat for one particular member. Moreover, for each member in the list, a node maintains the last time that its heartbeat has increased. Hasn't the counter increased for more than Tfail seconds, the member is considered failed. This limit is selected to that the probability that any node makes an erroneous failure detection is less than a particular threshold Pmistake. How! ev! er, a particular node doesn't remove a presumably failed member immediately, it will wait until Tcleanup seconds pass. This limit is selected to minimize the probability of receiving a heartbeat of a already deleted member. The authors performed an analysis about the probability of infecting (informing) all the particular nodes in the network about any event and also an upper limit for the probability of mistakenly informing about a failure node. An extension of this protocol is offered for the host, subnet and domain features of the Internet. The protocol scales well but there is still a trade-off. The more accurate we want the protocol to be, the more failure detection time. This is a fair feature in the sense that offers a quantifiable guarantee for detecting a failure node. In the Internet domain, it reduces the number of messages that flows through the Internet routers, given that the gossiping is concentrated in the lower levels of the hierarchy. At the same time, it provides faster failure detection time within subnets and it is resilient to network partitions. Its drawback is that it has negative influence on the number of rounds needed for information to be disseminated through the system and hence on failure detection time across subnets and domains. One thing that came into my mind was about the possible implication IPv6 protocol could have on the failure detector. It turns out that there is not such a big deal, as in IPv6 a similar idea to CIDR was integrated. The main complain I have for the paper appears in the Analysis section. Although it is very clear and illustrative of the main properties of the protocol, the assumptions are completely ridiculous for the real world. You cannot assume that you will have a static system with maximum number of failures and starting after all nodes fail. I wonder if all the deduction in that section can safely be translated into a real scenario. I doubt it. Also, in the experiment section I would like to have had a burst or churn test to see how the protocol behaves in that case. Paper: On Scalable and Efficient Distributed Failure Detectors Reviewer: Esteban Meneses (emenese2@uiuc.edu) The paper surveys several proposals for building scalable failure detectors and presents a new protocol based on epidemic theory. The authors made an effort to define a common vocabulary in order to grade the algorithms. A failure detector is supposed to be complete (it discovers, eventually, all the failed members), which is a liveness property, as well as efficient (it is fast and accurate) which is a safety property. By accuracy we mean that it doesn't make many mistakes about presumed failed nodes. Although there is a theoretical result that show why a protocol running over an unreliable network cannot be both accurate and complete, the proposals for failure detectors try to minimize the false positives. The model in this case in that the system is divided into groups and a group member has a list (called view) containing other members in the same group. The members can crash (non Byzantine), but they can recover from a crash in a new incarnation which is distingu! is! hable from the old node. A borrowed characterization is that protocols can be strong/weak complete if the crash of a member is detected by all/some members in the group. Also, by strong accuracy we mean that no non-faulty member is declared as failed. An interesting part of the paper consists in showing the requirements imposed by an application on a failure detector: the completeness (either strong or weak) and the efficiency (by means of speed and accuracy). The proposed algorithm uses a couple of parameters. The first is related to the protocol period T and the size of the failure detector subgroups, k. The algorithm basically says that every T time units every node picks a random member from its view and ping it. If it receives an ACK in the worst-time RTT, then it knows its neighbor is not failed. On the other hand, if it doesn't receives the ACK, it will pick k members from its view to help it to discover if the first member is alive or not. These other k members upon ! re! ceiving the message will ping the suspect node and after receiving an ACK will also acknowledge the original node. The initial node will declare other node failed if there is no witness of its aliveness. The authors presented an analysis of the algorithm but no single experimental result. Again, there is a trade-off in the accuracy of the system and the detection time or the overhead in the network. The more accurate we want our protocol to be, the higher the price we should be willing to pay. Although, this protocols tend to scale well and overcome several failures in the nodes. I wonder what happens in the case where crashes are not independent, which is a common assumption in many algorithms. What if we have cascade failures. Will all these epidemic algorithms work? I guess the main drawback of the paper is the lack of a proper experimental support for the conclusions. Also, there is no comparison with other proposals. They just presented a new protocol and generalized few concepts with an extended mathematical display, but never showed how well this protocol behaves within a real world scenario.