From: Mehedi Bakht Sent: Tuesday, March 14, 2006 9:28 AM To: Indranil Gupta Subject: 598 IG Title: A Gossip-Style Failure Detection Service Summary: This paper presents the use of a gossiping mechanism to detect failures in a distributed system. The basic idea is that each member maintains a list for each known member's address and a heartbeat counter. Every Tgossip seconds, each member increments its own heartbeat counter, and selects one other member at random to send its list to. Upon receipt of such a gossip message, a member merges the list in the message with its own list, and keeps the maximum heartbeat counter for each member. Each member also maintains, for each other member in the list, the last time that its corresponding heartbeat counter has increased. If the heartbeat counter has not increased for more than Tfail seconds, then the member is considered failed. Tfail is selected so that the probability that anybody makes erroneous failure detection is less than some small threshold. One of the main features of this failure detection mechanism is its scalability as it has a failure detection time of O(n log n). It also detects all faulty nodes with some mistake bound and is resilient against both transient message loss and permanent network partitions. Comments: The overhead of maintaining an accurate list at each member may limit its applicability to other resource- constrained distributed systems (like sensor networks). Title: SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol Summary: This paper describes a protocol for maintaining membership in a peer-to-peer system that does not use the heartbeat mechanism. SWIM uses a PING-ACK handshake to probe members. Every time period each member will choose a node to ping. When this fails, a second PING-ACK is relayed through intermediary nodes to attempt to circumvent congested links. If this also unsuccessful, the failed node is updated throughout the system via multicast. A more robust approach to SWIM involves piggy-backing updates on the PING-ACK packets. Another optimization is to include a pre-failure state called suspected to give nodes a second-chance to come alive. This greatly reduces the positive-failure rate of SWIM. The protocol provides constant detection time for failed processes and constant message load per process, with respect to the number of processes participating in the group. The paper presents analysis of a prototype implementation of their protocol on a large test-bed of nodes with ideal conditions. The evaluation seems to confirm the earlier protocol analysis in the paper of message overhead, latency, bounded detection-time, and positive-failure rate. Comments: According to the protocol, suspected entries in membership lists expire after a specified time-out. I believe it will be better if there is a mechanism to adjust this time-out value dynamically to adapt to network conditions. Mehedi Bakht PhD Student Dept of Computer Science University of Illinois at Urbana-Champaign 201 N Goodwin Avenue Urbana IL 61801 From: Juan Jose Jaramillo Jimenez Sent: Tuesday, March 14, 2006 9:14 AM To: 'Indranil Gupta' Subject: 598ig review 03/14 1 SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol 1.1 Summary SWIM tries to solve the problem of scalability of traditional heart-beating protocols. To do this, it separates the failure detection functionality from the membership update dissemination; it uses a periodic randomized probing protocol that achieves the property that both the expected time to first detection of a failure and the expected message load per member do not vary with the network size. Additionally, to decrease the rate of false accusations, it allows nodes to suspect a process before declaring it as failed. The work in this paper tries to incorporate the Membership Dissemination component presented in a previous paper to build a working membership subsystem; this system provides a substrate that imposes constant message load per group member, detects processes failures in a constant expected time, gives a deterministic bound to detect failure of a process by another process, and propagates membership updates using an epidemic algorithm. The analysis and experimental results show that this alternative to the traditional heart-beating protocols can give an order of magnitude reduction in overhead if used within more medium sized subgroups. 1.2 Comments One of the main contributions of this work, in my opinion, is to present the idea that membership updates and failure detection can be separate functionalities, and therefore, can be tackled in two different ways. It is also interesting to see a different approach towards diminishing false failure detection by the introduction of the concept of suspected process. 2 Using Random Subsets to Build Scalable Network Services 2.1 Summary Previous works have suggested that building distributed systems on top of a location infrastructure where each node can quickly locate any remote node while maintaining a relatively low local state can help scale distributed systems to million of nodes for applications that require to keep track of characteristics of a subset of their peers. This paper presents the hypothesis that there can be significant additional benefits from periodically distributing a different random subset of global participants to each node. To show this, it is presented the design, implementation and evaluation of RanSub, a protocol for delivering global state and global network probing. RanSub utilizes an overlay tree to periodically distribute random subsets to the overlay participants. To demonstrate the benefits of RanSub, it is presented the design and evaluation of a Scalable Adaptive Randomized Overlay (SARO). It is shown that SARO is able to match underlying networking topology and to adapt to dynamically changing network conditions by sampling members of its random subset once per epoch. A simulation of 1000 SARO nodes in a 20,000 node network show the stability and scalability of the approach. Additionally, experimental results from running SARO in PlanetLab are presented. 2.2 Comments RanSub is developed for a wide range of applications that need keep track of characteristics of a subset of their peers. It is interesting to see that for the case of membership updates and failure detection this protocol has some common points with SWIM. However, in this paper there are no performance guarantees, as opposed to the case of SWIM, where mathematical analysis showed the bounds of the protocol. It would then be interesting to further investigate if the concepts of SWIM can be applied to more general scenarios like the ones suggested in this paper. Other way to see it is trying to analyze the characteristics of RanSub to see if it is possible to find some performance guarantees. From: Brandt Dewald Dusthimer Sent: Tuesday, March 14, 2006 8:58 AM To: Indranil Gupta Subject: 598ig review 03/14 Title: Swim: Scalable Weakly-consistent Infection-style Process Group Membership Protocol by Abhinandan Das, Indranil Gupta, and Ashish Motivala Summary ------- This paper presents SWIM, a failure detection protocol. The core difference between SWIM and most previous failure detection protocols is that the number of messages in SWIM stays constant at a node as the network size grows. The paper first presents a basic SWIM failure detection. Then, it discusses how the original, simple model can be improved. In the SWIM model, there are two components: a Failure Detector Component, which detects failed members, and a Dissemination Component, which disseminated information about new or left members. In the simple SWIM approach, these two components are separate, with the Failure Detector working through random probing and the Dissemination Component working through network multicast. With the simple approach, during each period a given node, M_i, selects another node, say M_j, out of its membership list and pings it. If the ping isn't received during a specified timeout time, the pinging node selects a set of nodes in its membership at random, and has them ping M_j. If they get a ping back from M_j, they let M_i know. If M_i doesn't receive any responses back from any pings, direct or no, M_i marks M_j as failed and disseminates this information through the network through multicast. The paper then talks about two improvements to the basic model: SWIM with infection-style dissemination and SWIM with infection-style dissemination and a suspicion protocol extension. Finally, the paper shows results SWIM's performance, which shows that the messages at individual nodes do remain constant no matter what the network size. Also, it shows how the SWIM with infection-style dissemination and suspicion protocol extension performs much better than the other methods during the joining phase. Comments -------- - The experimental data presented only takes into consideration small groups (from about 0-56 nodes.) This doesn't reflect scaling into the "large-scale" realm, as the paper is aiming for. - How well would SWIM perform across different network topologies? Is there a way that we could reduce the number of non-local network messages? Title: A Gossip-Style Failure Detection Service by Robbert van Renesse, Yaron Minsky, and Mark Hayden Summary ------- This paper presents a method for failure detection based on random gossipping. The basic protocol consists of every T_g seconds, every node in the network sends out a gossip message to one of its neighbors. Every node also maintains a list of all the nodes that it has heard back from recently. If a node hasn't heard anything from a given node in a certain amount of time, it marks the node as failed. It then waits for another T_g before cleaning up the node from its neighbor list. It does this to avoid accidently marking a node as failed (read: it gives the node a grace period.) The paper then presents how this basic protocol performs. It shows how different failure detection probabilities perform and that the algorithm is fairly resilient to process failures and message loss. It is then discussed how the protocol can be expanded to take internet domains into consideration when dealing with large scale systems. Basically, most of the system remains local, while, probabilistically, only a few given nodes inside the local system will be communicating with a given network outside of the local system. This reduces the number of the messages sent over the non-local network, but the paper shows that this protocol is just about as resilient as the basic one. Comments -------- - Worthy research for later would be to see how the protocol works with the proposed gossip servers. - How would dynamic T_g values effect the system? (For example, let's say a given node was regularly slow. If we increase its T_g, then maybe we can avoid sending a regular flood of messages to it.) From: Maifi Khan [maifi.khan@gmail.com] Sent: Tuesday, March 14, 2006 8:06 AM To: Indranil Gupta Subject: 598ig review 03/14 Title: A gossip style failure detection service. Summary: Failure detection is a very important service for any system to be reliable and useful. In this paper they proposed a gossiping based failure detection service. Each member has a list of known members addresses and a heartbeat counter. Every Tgossip seconds heartbeat counter is incremented and sends the list to one of the member in the list at random. The receiver merges the two lists and update the maximum heartbeat counter. If the heartbeat counter for a member is not updated for a prespecified amount of time (Tfail)that is considered failed. The failed member is kept on the list for Tcleanup seconds to avoid resurrecting a faulty node by mistake. For large system they propose multilevel gossiping which considers the topology of the network to choose nodes while gossiping. They use domain and subnet information to choose nodes. Gossips are mostly done within subnets, few gossips between subnets and lot less among domains. If a large fraction of member is down, gossiping becomes less effective. So It uses periodic broadcast (every 20 sec) with high probability for recovery from network partition and catastrophe failure. Discussion: One of the attractive features of their algorithm is that probability of reporting false positive is independent of number of processes which is good for really big systems. It also shows significant resilience against message loss which is necessary for environments like internet or wireless network. Multilevel gossiping makes it scalable for larger system. One of the limiting assumption is they did not consider Byzantine fault model which limits its applicability in many situation. Their protocol detects failed nodes that are completely unreachable. Link failures between hosts are not detected as long as there is a path through some other node. Multilevel gossiping suffer from increased number of rounds and time to detect failure across subnets and domains. They require to manually configure subnet information in other domains. Title: SWIM: Scalable Weakly –consistent Infection-style Process Group Membership Protocol. Summary: The main idea of this paper is to separate the failure detection and distribution of membership information. They relaxed the consistency requirement to design a scalable system. Failure detector component at each member selects a member at random from membership list in a predefined protocol period and sends a ping message to it. If it does not receive a ack within timeout period, it selects k members from the list and sends ping-req message about the non replying node. If initiator does not receive any direct ack or indirect ack, it declared the member as faulty and hands this update off to Dissemination component which multicast this information to the rest of the group. They designed a Dissemination Component that piggybacks the update information with ping and ack messages and called this infection style dissemination. They use suspicion mechanism to reduce the false positives. Suspicion sub protocol is run whenever SWIM failure detector detects a failure. It disseminates a Suspect message about the faulty node to other group members. If anyone else can successfully ping the suspected member, it spreads an Alive message to other members. If no Alive is received in time period, it is marked as faulty, dropped from the list and a confirmation message is disseminated. They also propose the selection of members to ping in a round robin fashion from the membership list and reinitialize that list randomly after each round. This reduces the chance of missing a faulty nodes repetitively and thus provides time bounded strong completeness. They also showed promising experimental results about effect of message loss and overhead based on group size. Discussion: Some of the attractive features of the protocol is constant message load per group member, detection of failure in a constant time, logarithmic growth of dissemination time for update information and reducing the rate of false positive. The protocol guarantees a deterministic time bound to detect failure which makes it very attractive for time critical applications. By launching a denial of service attack someone can make a correct member appear as faulty one. This algorithm does not address this issue completely and this is an important aspects. As this is hard to detect at protocol level within a time bound and some algorithm guarantees eventual detection such attacks. Determination of optimum group size is also an issue in this algorithm. From: Ravishankar Sathyam Sent: Tuesday, March 14, 2006 4:24 AM To: Indranil Gupta Subject: cs598ig review 3/14 Ravi Sathyam CS 598ig A Gossip-Style Failure Detection Device Summary: This paper presents a protocol that uses gossiping technique in order to provide timely failure detection. The protocol for this basically works as follows: Each member maintains a list that consists of an address for every known member and a heartbeat counter. Now, for every T seconds, all members increment their own heartbeat counter and send this list to a random member. In addition, for each member in the list, the last time its corresponding heartbeat was incremented is also maintained. If a member’s heartbeat counter has not increased for a certain amount of time, then the member is considered failed. In the analysis of the protocol, the methodology for calculating parameters such as T and the rate of gossiping is explained. Alternate ways to apply the protocol included gossiping on the lower levels of the internet hierarchy (such as subnets), which would reduce the amount of bandwidth flowing through the internet routers. This ends up improving scalability. However, spreading members out into subnets can also result in a higher number of rounds for the protocol. Comments: Some of the assumptions made during the explanation of the protocol do not apply to real world scenarios. How about protocol performance during high churning of members? SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol Summary: This paper presents a module that offers weakly-consistent knowledge of process group membership information at all processes. It does so using two components: a) Failure Detector Component – This is in a way similar to the failure detection protocol described in the paper above. Nodes gossip with each other and if the required acknowledgement messages are not received from a node A either directly or indirectly, then A is said to be failed, and this node is handed over to the Dissemination Component. b) Dissemination Component – The dissemination component basically involves multicasting new information. For example, information about failed node A will be multicast, and nodes receiving this multicast will remove A from the membership list. A more robust version of the SWIM protocol has the dissemination component piggybacking membership updates on ping and ack messages (this is known as infection-style dissemination mechanism) sent by the failure detector. Also, processes not responding to pings are not labeled faulty outright, and are only labeled faulty after a timeout. This helps slower processes. An addition to the SWIM protocol involves selecting ping targets in a round-robin fashion rather than randomly – this helps in providing time-bounded strong completeness. In the performance analysis, we see that this protocol is most optimal when used in medium-sized subgroups. From: Mike Earnhart [mearnhart@gmail.com] Sent: Tuesday, March 14, 2006 4:01 AM To: Indranil Gupta Subject: 598ig review 03/14 Michael Earnhart March 14, 2006 CS598 IG Summary SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol and A Gossip-Style Failure Detection Service SWIM is a heartbeat protocol replacement designed for scalability. The focus is on separating the failure detection and membership updates to minimize traffic overhead. In addition they introduce extra states to help prevent unnecessary removal of computer nodes if in fact they are not truly failed. There are two components that are built and analyzed, the failure detector and the dissemination component. These two sets of code create the entire system and the various manners in which they are implemented create the optimal solution. The failure detector is designed as everything is in this paper to be scalable so that they can claim that nothing in the detector depends on the group size. This constant cost is very important because the number of messages increases a square of the nodes in a traditional failure detector. The dissemination component uses a gossip protocol to provide efficient fault tolerant information propagation. This however is performing dual duty in that the information is disseminated via a piggy-back mechanism on the ping request and replies used for failure detection. Therefore the additional overhead is alleviated a small amount because it is directly associated with the failure detection. SWIM shows promise because of its scalability. In addition the response times are bounded fairly well regardless of group size as Figure 3a showed. This is very important because every node does not attempt to heartbeat every other node and therefore I believe its only a probabilistic guarantee to detect all failures within a reasonable amount of time. Finally the use of the suspicious state clearly helps the stability of the system. The single downside as far as I am concerned is the rather simple gossip protocol. This is a very efficient protocol but it could definitely be optimized I would think to some degree. Other wise the cluster of computers was not large enough to show true stability and scalability. In A Gossip-Style Failure Detection Service the focus was again a scalable failure detection mechanism that utilized the gossip protocol not only to disseminate information but also to detect failures. There are two protocols introduced, the basic protocol, and the multi-level gossip or hierarchy protocol. The basic protocol instituted the standard gossip protocol and was on the O nlog(n) in terms of detection time after a failure. This system also employs an alternate state to just running and failed to mitigated large system failures and massive gossip messages in reaction to the failure. This allows for a more efficient means of distributing the information because too often several nodes see this failure and then they all gossip it and waste bandwidth. These are the essential topics of this paper. A clear strength of this paper is the restructuring of the gossip protocol into domains, subnets and global networks. This allows for a much better aggregation of data between these elements of the network but it does not significantly adversely affect the performance. Once problem is how to define exactly to what granularity do you sub divide to optimize the gossip protocol without seriously damaging the system. They present no data on optimizing this variable. In addition to that there is no implementation of this, and I would wonder how well it would work in a high failure prone environment. This simulated but never implemented which makes it rather difficult to believe the results. From: ercanucan@gmail.com on behalf of Ercan Ucan Sent: Tuesday, March 14, 2006 2:40 AM To: Indranil Gupta Subject: "598ig review 03/14" MapReduce ---------------- Problem Addressed: Google has many special purpose computations that process large amount of raw data to compute different types of derived data. Conceptually these computations are straightforward. However, as the input data gets larger and larger, several issues like parallelizing the computation, distributing the data and handling the failures show up in order to finish the job in a reasonable amount of time. All these make the job a lot more complex. Towards dealing with this complexity the authors of the paper came up with a new abstraction that allows expressing such computations by hiding the messy details. Approach Taken: Authors propose MapReduce, which is a functional programming type of a model. Mainly, users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all intermediate values associated with the same intermediate key. The run time takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures and managing the required inter-machine communication. Fault tolerance is well taken care of. The master that distributes the works to the workers pings every worker periodically and if no response is received from a worker in a certain amount of time, the master marks the worker as failed and this work gets scheduled on other workers. Currently in case of failures of master, computation aborts. However it could set as the taster takes periodic checkpoints. If it dies at some point of time, a new copy can be started from the last check pointed state. Comments: The programming model is easy to use, even for the programmers without experience with parallel and distributed systems. I think this is an important plus. As stated in the paper, by this expression method the can span a large variety of problems. They are taking locality into consideration in order to save network bandwidth, which is another important optimization from my point of view. They are using redundant execution in order to reduce the impact of slow computers near the end of the computation. This idea is another nice optimization I think. Also redundant is used to tolerate machine failure and data loss. This is an important improvement to the system in terms of fault tolerance. MON ------- Problem Addressed: Since it is a difficult task to run simulations and applications on a large scale system like PlanetLab, it is important to have a tool that helps application developers/programmers to manage their applications on them. There are many existing useful tools on the PlanetLab for status monitoring and query, resource discovery and software distribution. However, only few of these tools allow the users to execute instant management commands that are related to their own applications. There are tools like PSSH and vxargs that serve such purposes but they have the scalability problem since they use a centralized approach. Approach Taken: MON adopts a distributed management approach. Mainly, an overlay network, like a spanning tree or directed acyclic graph (DAG) is used for propagating the commands to all the nodes and for aggregating the results back. MON is a reactive system which builds these trees on demand since maintaining such overlays proactively would be really complex. Each node runs the MON daemon process which has three layers: Membership management, overlay construction and distributed system management. Comments: On demand approach makes the system simple and lightweight since no overlay structure is maintained when no commands are executed. However, I still think that creating the tree on demand is a more costly operation when compared to case where the tree is maintained proactively. As an advantage for on demand creation, the paper states the different structures can be created for different tasks. But I guess it would be better if there was more elaboration on this idea. For instance, what other types of structures could be built and how would they be useful? Moreover, I think it would be nice if the paper discussed some issues regarding the push message like fault-tolarence, churn support etc. -- Ercan Ucan - eucan2@uiuc.edu Graduate Student Computer Science Department University of Illinois at Urbana-Champaign ------------------------------------------------------------ From: Muyuan Wang [mwang2@uiuc.edu] Sent: Tuesday, March 14, 2006 12:47 AM To: Indranil Gupta Subject: 598ig review 03/14 A Gossip-Style Failure Detection Service This paper describes a new protocol based on gossiping that does scale well and provides timely detection, and extend it to discover and leverage the underlying network topology for much improved resource utilization. They also combine it with another protocol based on broadcast, to handle partition failures. The basic protocol is based gossiping. In this protocol, each host in the network runs a failure detector process that executes the protocol and offers continuous process failure and recovery reports to interested clients. A failure detector is described in this paper. It has several distinguished properties. First, the probability that a member is falsely reported as failed is independent of the number of processes. Second, the algorithm is resilient to message loss and process failures. Third, if local clock drift is negligible, the algorithm detects all failures or unreachabilities accurately with known probability of mistake. Fourth, the detection time is of O (nlogn), where n is the number of processes. Fifth, the required bandwidth grows linearly with regard to the number of processes. SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol This paper presents SWIM, a generic software module that offers weakly-consistent knowledge of process group membership information at all participating processes. SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. The expected time to the first detection of failure and the expected message load per member do not vary with group size. Information about membership changes is propagated via ping messages and acknowledgments. SWIM also reduces the rate of false failure detections by modifying the protocol to allow group members to suspect a process before declaring it as failed, which makes it easier to discover and rectify false failure detections. Therefore, a deterministic time bound to detect failures is guaranteed. It is claimed that the results of SWIM is generally applicable, because although is is targeted at large scale groups of processes, it is shown in the experiments that this method can greatly reduce the message overhead, compared with all-to-all distributed heartbeating. From: Raghu Kiran Ganti Sent: Monday, March 13, 2006 11:33 PM To: Indranil Gupta Subject: 598ig review 03/14 Reviews for Distributed Membership Protocols, March 14 ------------------------------------------------------ Paper title: SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol Summary/Main ideas: ------------------- In a distributed system, one of the main issues to be addressed is to detect what nodes are present in the group and which of them failed. This paper (SWIM) presents a distributed peer-to-peer process group membership protocol with weak semantics. The contribution of this paper is that it provides failure detection and membership update dissemination functionalities, such that the failure detection time is stable, the false positive rate is stable, and message load on each group member is low. This allows distributed applications to scale well. The authors claim that the earlier heart-beating membership protocols do not scale well due to quadratic message load, or poor response times/false positive frequency. The properties of SWIM are: 1. Impose constant message load per group member 2. Detect process failures in expected constant time 3. Deterministic bound on failure detection time (local) 4. Dissemination latency grows logarithmically with number of members 5. Reduces false positive rates The basic SWIM approach has two components - failure detector component and dissemination component. The basic failure detector protocol uses the random probing based scheme. The dissemination protocol is a network multicast. The authors improve the basic SWIM protocol to provide a more efficient and robust scheme through various methods. For example, they piggyback membership updates on ping and ACK messages sent by the failure detector protocol. In the basic SWIM approach, a process not responding to a ping is declared faulty. In the modified SWIM, this process is declared as a suspicious process, and only after a certain time-out is it declared as failed. This reduces the rate of false-positives. Further, the authors present a time bounded completeness guarantee, where the time interval between the occurrence of a failure and its detection at a given member is no more than two times the group size (this time bounded mechanism uses a round robin scheme). Comments: --------- 1. The authors tradeoff failure detection time with false positive frequency. Is this good? 2. How much will providing a weaker variant of group membership affect the protocol in general? 3. How can this protocol be applied to sensor networks? What changes need to be made to it? (This is a general thought, although I do not have a concrete answer to this question) ********************************************************* Paper title: A Gossip-Style Failure Detection Service Summary/Main ideas: ------------------- This paper presents a failure detection protocol that is based on gossiping which scales well and provides timely detection. The properties of the failure detector presented in this paper are: 1. False positive probability is independent of the number of processes. 2. Resilience against message losses and process failures. 3. If clock drift is negligible, the algorithm detects all failures accurately with a given probability of mistake. 4. Algorithm scales in detection time (detection time increases as O(nlogn)) 5. It also scales in network load (bandwidth requirements increase linearly) The basic protocol is to gossip the list of members every T seconds with a randomly selected member. A suspicion mechanism similar to the one presented earlier is used for reducing false positive rate. The authors analyze their protocol and show that the algorithm scales well with detection time and network load. In the modified (efficient) protocol, the authors use the subnet and host number lengths of each domain to reduce the number of gossip messages exchanged. Gossip messages are exchanged normally within subnets, whereas between subnets (or domains), the probability of gossip is tuned so that every round, on average one member per subnet will gossip to another subnet in its domain, and one member per domain will gossip to another domain. The authors also mention an approach taken to do recovery in the situation of a catastrophe. Comments: --------- 1. How does this failure detection protocol work in presence of Byzantine faults? Does it need to completely change or a simple addition will work? 2. It is not clear how the broadcast capability is implemented with a few gossip servers, how are these placed? 3. It is not clear how the authors conclude that the failure detection times are longer by not more than 50%, when subnets are considered. 4. Also, can this protocol be extended to be applied to sensor networks? From: Long Hai Vu Sent: Monday, March 13, 2006 11:08 PM To: Indranil Gupta Cc: Long Vu Subject: cs598ig-review 03/14/06 Name: Long Vu Class: CS598IG-Spring 06 REVIEWS A Gossip-Style Failure Detection Service In this paper authors present a new protocol or a failure detector based on random gossiping which provides timely detection for nodes’ failures and scales for large network. In proposed model, each host in the network runs a failure detector process, or protocol, and report the failure to interested clients which are chosen randomly. The most important advantage of gossip based protocol is its simplicity and robustness. Discussion 1. How is gossip message determined to avoid so much overhead? I think it depends on specific networks and application domains. 2. How often does a node send gossip messages? And, to whom does the node send – the whole networks or only particular groups? Selecting nodes randomly to send is the best approach? Is there any scenario we should not use random selection? If we divide the network into subgroups, do we need to send gossip message randomly? 3. How do we know that the messages are sent to every member of the network? Are there any mechanisms for this? If churn rate is high, how could we change the gossip protocol to ensure that nodes update latest information about the network? Using Random Subsets to build Scalable Network services In this paper, authors develop a system call SARO which make use of a scalable mechanism for delivering state about a random subset of global overlay. According to them, random subset approach could help overcome inherent scaling limitations to services dealing with global state maintenance and probing nodes’ state. Ransub is designed to deliver state of nodes in the overlay; it adapts dynamically to changing network conditions, and is used in SARO which runs on top of PlanetLab. Moreover, SARO is used to distribute streaming media content from server over the overlay. SARO also makes use of Ransub state information to locate appropriate peers satisfying delay and bandwidth constraints. Discussion: 1. Maintaining an overlay tree is not very scalable. What happen if churn rate is high? 2. In SARO system, how could they define the unit of media data? This is actually important as we need to ensure the delay and bandwidth constraints. Thus, unit of media data plays an important role. Moreover, order of media data segment exchanged among nodes in the overlays effect the delay. From: Praveen Jayachandran Sent: Monday, March 13, 2006 5:53 AM To: Indranil Gupta Subject: cs598ig review 03/14 CS598IG Review 03/14 Praveen Jayachandran SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol ---------------------------------------------------------------------------------- This paper presents a weakly-consistent group membership protocol for peer-to-peer applications, where all members need not have a consistent view of the group membership at all points in time. The central focus of the paper is scalability. The authors point out that previous membership protocols either impose network loads that grow quadratically with network size, or incur higher response times, or have a high false positive rate. To overcome these pitfalls, SWIM decouples the failure detection and the membership update mechanism. An infection-style randomized peer-to-peer probing protocol is used for failure detection, wherein the expected time for failure detection and the expected load per member is independent of the group size. The dissemination latency grows roughly logarithmically with the number of members. A mechanism to reduce the rate of false positives by 'suspecting' a process before assuming it as failed is also proposed. Comments: 1. Can SWIM handle heavy churn? The experiments do not seem to answer this very clearly. 2. A method to estimate the period for sending probe messages, based on the application-specified failure detection time is specified. Also, the period needs to be at least three times the round-trip estimate, and round-trip estimates are made quite conservatively. Will the value obtained from the application-specified failure detection time be always greater than three times the round-trip estimate? 3. The 'confirm failed' message is said to over-ride the 'suspect' as well as the 'alive' messages. However, some obvious conflicts could occur in the system. A node that has recently received an ack from a suspected node, could receive a 'confirm failed' message and be forced to remove the node from the neighbor list. I suppose that the intention here is to announce the node as failed when it is unable to connect to a reasonable number of nodes. However, this point is not very clear. 4. It isn't clear how SWIM decouples the failure detection from the membership update protocol. The ping, ping-req, and ack messages seem to be used for both purposes (with piggybacking). 5. The SWIM protocol is not localized and messages could be transmitted in a manner that is wasteful of network resources. Is it possible to include some locality information to improve the network utilization? 6. Could knowledge of link failures and congestion be used to improve the performance of SWIM? It appears that SWIM would use ping-req messages whenever a link is not congested or failed. This overhead can be reduced by sending ping messages through alternate paths, and ping-req messages to nodes that are responsive. This probably is the job of the underlying routing protocol. A Gossip-Style Failure Detection Service ---------------------------------------- This paper describes a simple failure detection protocol based on gossiping and provides analytical and experimental studies to support their proposal. An important consideration for this protocol is scalability. The authors show through analytical and experimental studies that the probability of false reports is independent of the number of members, the protocol is resilient to message loss, the failure detection time increases logarithmically with the number of members, and the network bandwidth increases at most linearly with the number of members. Each node periodically selects another node to transmit its view of the current non-faulty members and the latest time at which it has received a reply from those machines. The recipient of such a ping message merges this view with its own view. Analytical studies to estimate the number of gossiping rounds required to achieve a certain quality of detection, the frequency with which members need to gossip, and the time taken to achieve a certain low probability of false failure detection are presented. Extensions to the basic protocol to reduce the number of cross-subnet and cross-domain traffic, and to make the protocol more resilient to network partitions are presented. Comments: 1. This paper uses a push technique to maintain membership knowledge. However, a pull, or a push-pull technique could help to reduce the time taken to disseminate failure information. 2. The protocol cannot detect poor or partial connectivity of nodes. 3. Will this protocol be able to handle heavy churn in the system? The protocol is targeted towards the Internet rather than P2P systems, and may not be suitable for P2P systems. 4. Gossip protocols typically exhibit bimodal behavior. However, this paper incorrectly assumes that the probability of a gossip dying early is remote.