From: Long Hai Vu Sent: Thursday, March 09, 2006 9:30 AM To: Indranil Gupta Cc: Long Vu Subject: cs598ig review 03/09/06 Name: Long Vu CS598IG-Spring 2006 REVIEWS MON: On-demand Overlays for Distributed System Management In this paper, authors present an on-demand approach to manage MON system, a distributed system running on the PlanetLab testbed. According to the authors, this is a light-weight system which requires minimum amount of resources, especially when there is no command executed. Moreover, this approach avoids a complex failure repair process as there is no need to maintain the overlay structure for a long period of time. The main idea of this paper is dynamically constructing a spanning tree when necessary. This is adaptive approach to learn the instant status of the overlay and avoid keeping the long-time spanning tree which obviously becomes degrade over the time. Furthermore, creating spanning tree on-demand provides flexibility for different tasks. However, if the frequency of demands is high, creating spanning tree for all the requests will result in significant overhead. Could we maintain a list of temporary tasks in which each element consists of ? If the same task happens in a very short period, considering reuse the last spanning tree is reasonable. This also depends on how stable the overlay is. On the other hand, scalability is an issue in this approach. Current experiment is performed in PlanetLab with about 300 nodes. This is not big an overlay. Will MON be scalable with more number of nodes? Obviously, the larger the overlay is, the more number of demands MON has to serve synchronously. This raises the issue of overhead because we have many spanning trees at the same time. MON seems to be suitable for the small overlays as it creates spanning tree when requested. In addition, it propagates the command to all the nodes of the overlay. Should the network was large and the churn rate was high, this propagation would get problem as links would be broken during the process. Future work: - Larger network (Scalability) - Specified subset of nodes: there are cases we need to propagate the message to a certain subsets of node rather than the whole overlay. This requires to change the spanning tree construction process MapReduce: Simplified Data Processing on Large Clusters In this paper, authors present a implementation of MapReduce which runs on a large clusters of commodity machines and is highly scalable with terabytes data and thousands of machines for a typical MapReduce computation process. Moreover, hundreds of MapReduce programs are created and executed on Google’s clusters currently. MapReduce is mainly used for Google web search service, for sorting, for data mining and machine learning. In essence, MapReduce is a programming model and an associated implementation which could generate and process very large data sets. Given a map and a reduce functions, MapReduce helps to generate a set of intermediate key/value pairs and merges all intermediate values associated with the same intermediate keys. This allows handling the very large set of values which could not fit in the main memory. One of the most significant uses of MapReduce is a complete rewrite of the production indexing system used by Google currently. Given the large document set as input (about 20 terabytes of data) indexing process uses five to ten MapReduce operations. MapReduce is considered simple to learn, good performance, and easy to operate. From: Mike Earnhart [mearnhart@gmail.com] Sent: Thursday, March 09, 2006 9:23 AM To: Indranil Gupta Subject: 598ig review 03/09 Summary MON: On-demand Overlays for Distributed System Management And ACMS: The Akamai Configuration Management System In the MON: On-demand Overlays for Distributed System Management the authors built a system that provides a method for administering large clusters. They emphasize the ability to execute instant management commands as well as pushing software, and querying for data. The distributing method for their management messages is an overlay network. However this overlay network is constructed as needed and it is often customized for the management task being performed. This allows the system to avoid incomplete and broken overlay networks due to untimely failures. They implemented this system on 300+ planetlab nodes and the results were very reasonable. However the software push mechanism was only tested on 20+ nodes for some reason, which is not mentioned. The strength of this paper is in the on the fly overlay which has significant potential for delivering small payloads of information to a large distributed network reliably even with failure prone nodes. The created of overlay structures in a matter of seconds is impressive especially given the coverage of the nodes. In addition for specific queries they provide in network aggregation to reduce traffic and increase response time. The portions that are questionable are specifically the software push and the need for aggregation. They test the software push but not nearly on a large enough set of nodes to make a case for scalability. This is a rather important feature of a management overlay network for a distributed system. Also the aggregation of data is very important in applications such as wireless motes where power is of utmost concern but in these networks power is cheap and aggregation reduces the raw data received to nothing and is therefore good for only specific queries. However the collection of raw data could potentially provide a wide variety of information. In ACMS: the Akamai Configuration Management System the system by which the Akamai network is configured and managed is described in some detail. The focus is on their scalability and fault tolerance. Essentially the scalability is provided via the client pull mechanism and the fault tolerance is provided by quorum replication. Also they describe their acceptance algorithm, which closely mimics a two-phase commit protocol. The important strength of this paper is that their system is used on a commercial profitable system of over 15000 nodes connected via an unreliable network and between several different network administrative domains. Also a few innovative ideas were the use of vector messages, utilizing the Akamai web caching to reduce loading on the storage points, and their index tree. Essentially they provide an almost self-healing system, which is quite an achievement. The downside of this protocol is its rather narrow view or application. Unlike the MON, which is designed to run on any distributed network, this system has a very specific system to manage and no surprisingly its optimized for that system. From: Anthony Cozzie [acozzie@gmail.com] Sent: Thursday, March 09, 2006 4:13 AM To: Indranil Gupta Subject: 598ig review 03/09 Anthony Cozzie MON This actually appears to be the result of a CS598IG paper. Anyway, while MapReduce uses a master process to control all the other processes (quite practical in a cluster, especially since it looks like the actual data is read off of GFS) MON plans for the PlanetLab scenario, which involves greater latencies, lower bandwidth, and less reliability. It constructs an overlay for each command the user wants to run (MON nodes maintain connectivity anyway, but the DAGs they construct are temporary). They also did some cute things; a node downloading a new program can download from multiple-sources bittorrent-style. I also like how they bash the Rochester sysadmins. MapReduce: The clever folks at Google had built a large cluster of cheap commodity machines to run their massive web processing tasks on ~1 billion webpages. Unfortunately, when running on that many machines load balancing, failure recovery, and extreme parallelism become significant issues. Functional programming is intriniscly parallel, so they borrowed the map and reduce concepts from lisp: map is a function f(a) -> b where a is an input element, and reduce is a function f (list(b) -> list(b) where in theory the second list is smaller. This allows them to separate out the overhead code that deals with reliability and write code for their cluster more easily. From: ercanucan@gmail.com on behalf of Ercan Ucan Sent: Thursday, March 09, 2006 9:12 AM To: Indranil Gupta Subject: "598ig review 03/09" SWIM -------- Problem Addressed: In this paper, the authors address the process group membership protocol. SWIM is a generic software module that offers this service for large-scale process groups. The SWIM effort is motivated by the unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency with respect to detecting process crashes. Approach Taken & Comments: The abstract of the paper summarized the method of the research very neatly: Unlike traditional heartbeating protocols, SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol. Processes are monitored through an efficient peer-to-peer periodic randomized probing protocol. Both the expected time to first detection of each process failure, and the expected message load per member, do not vary with group size. Information about membership changes, such as process joins, drop-outs and failures, is propagated via piggybacking on ping messages and acknowledgments. This results in a robust and fast infection style (also epidemic or gossipstyle) of dissemination. The rate of false failure detections in the SWIM system is reduced by modifying the protocol to allow group members to suspect a process before declaring it as failed – this allows the system to discover and rectify false failure detections. Finally, the protocol guarantees a deterministic time bound to detect failures. A Gossip-Style Failure Detection Service ------------------------------------------------------ Problem Addressed: This paper presents a new protocol for failure detection problem. Main concerns are scalability of the protocol and timely detection of the failure in the system. Approach Taken & Comments: The paper makes minimal assumptions about the network which is a nice approach I think. The basic protocol is based on gossiping as pioneered in the Clearinghouse project. The protocol presented in the paper gossips to figure out whom else is still gossiping. Conclusions: As discussed in the conclusion part of the paper, the service provides accurate failure detection with known probability of a false detection, and is resilient against both transient message loss and permanent network partitions, as well as host failures. The service uses two separate protocols that automatically take advantage of the underlying network topology, and scale well in the number of members. The failure detection service can be used by distributed applications directly, or support other middleware services such as system management, replication, load balancing, and group communication and membership services. As such, failure detection is a valuable extension to current O.S. services. -- Ercan Ucan - eucan2@uiuc.edu Graduate Student Computer Science Department University of Illinois at Urbana-Champaign ------------------------------------------------------------ From: Mehedi Bakht Sent: Thursday, March 09, 2006 8:15 AM To: Indranil Gupta Subject: 598 IG Review : 9th March 2006 Title: MON: On-demand Overlays for Distributed System Management Summary: The paper presents Management Overlay Network (MON) - a system for managing large distributed applications. The basic goal is to facilitate efficient execution of instant management commands in a large distributed system like PlanetLab. To achieve this goal in a scalable way, MON adopts a distributed management approach instead of a centralized one. A spanning tree or a DAG is used as an overlay structure for propagation of commands and also for collection and aggregation of the results of those commands. One important feature of the overlay structure is that it is created on-demand. Each time a user wants to execute one or more management commands, an overlay structure is dynamically created. It makes the system simple, lightweight, flexible (support for different structures at different times), and efficient. The MON system consists of a three-layered daemon process (called a MON server) running on each node of the system. The bottom layer manages membership with a gossip-style approach. The middle layers is responsible for the most important task of constructing the overlay on demand. Third and topmost layer manages the execution of different management commands. For constructing the overlay, the emphasis is not on full coverage - but rather on probabilistic node coverage by using quick and efficient algorithms like random tree construction. Once this overlay has been dynamically created, MON is ready to accept different types of management commands. The paper discusses the implementation details of two specific commands - Status Query and Software Push. The paper also contains the result of deploying MON on 330 nodes of PlanetLab. Presented data show MON performs well in terms of command response time and aggregate bandwidth for software push. Comments: The idea of on-demand creation of the overlay structure sounds good when we consider a network that is prone to node failures and the topology of the network changes frequently. But if a system is more failure resistant and robust, I do not think it be will efficient to incur the overhead of overlay construction for each management session.I was also wondering whether some learning mechanism can be introduced so that this overlay creation process does not begin from the scratch every time. The level of details provided about implementation of aggregation of status query results seemed inadequate to me. I was looking for a more elaborate description so that I could compare with TAG and other aggregation techniques that we have discussed with respect to sensor network. Title: MapReduce: Simplified Data Processing on Large Clusters Summary: The paper presents MapReduce - a programming tool developed by Google to facilitate parallel computations over large sets of data. The terminology of "Map" and "Reduce", and their general idea, is borrowed from functional programming languages use of the constructs. The basic idea goes like this. At first, input data is divided into multiple splits. The data within each input split is parsed to a set of key/value pairs. A key/value pair is called record.Then for each split the records are passed to the map function. The map function, which is provided by the user, creates a set of intermediate key/value pairs out of each input key/value pair. Subsequently the reduce function is called. The reduce function is also provided by the user. A reduce function takes all the values that share the same intermediate key from the map functions output as input. So basically for each intermediate key produced by the map functions, a reduce function is called. The reduce function reduces the passed values to a possibly smaller set of output key/value pairs, typically to only one output key/value pair. The output of the reduce function is written into an output file. Various other mechanisms has been incorporated with this basic idea of map/reduce for fault tolerance and performance improvement. The paper presents results from implementation of MapReduce on Google's clusters which shows that it has greatly enhanced the performance of large-scale distributed jobs like indexing used by Google for web search. Comments: The fault tolerance model described in the paper omits any mechanism for the master. Won't it become a single point of failure for a MapReduce program? MapReduce obviously gels well with GFS - as it was surely designed keeping GFS in mind. But how will MapReduce perform on top of other distributed file systems needs to be further investigated. From: Maifi Khan [maifi.khan@gmail.com] Sent: Thursday, March 09, 2006 7:44 AM To: Indranil Gupta Subject: 598ig review 03/09 Title: MapReduce: Simplified data processing on large clusters. Summary: This paper describes a programming model called "MapReduce" which can be used even by programmers without any experience with parallel or distributed systems to develop distributed applications. It hides the details of parallelization, fault tolerance, data distribution and load balancing from the programmers and supplied the necessary library. User has to write Map and Reduce function and also write a MapReduce specification object to specify the names of input and output files and some tuning parameters. Map produces intermediate key/value pair from supplied pair. Reduce merges the intermediate values into relevant groups if possible. The user program calls the MapReduce function which consists of a master that distributes work to the worker programs. Worker can perform either map or reduce tasks depending on the assignment by the master. When all the workers are finished and final output is written to the files specified by user, MapReduce wakes up the user program. Master keeps information about the worker machines and any assigned task failed due to worker failure is reexecuted by a replaced worker. Handling of master failure is handled by clients. MapReduce master try to use the location information of input files to assign tasks to workers to avoid copying input files. To avoid delay due to slow worker, at the ending of the operation, master schedules backup execution of the remaining ongoing tasks and accepts the result from either primary or backup whichever finishes first. It improves the performance by 44% in case of sort operation. They showed impressive performance of this scheme in case of grep, sort application and using it for various tasks at Google everyday. Advantages: 1. Highly scalable and currently used by Google every day which proved its practicality. 2. Can be used by inexperienced programmers. Criticism: 1. Interface of MapReduce function varies with application domain and thus requires deep understanding. 2. Configuration files need to be written manually. It would be easier and less error prone if it could be done automatically and dynamically. Title: ACMS: The Akamai Configuration Management System Summary: Akamai corporation of responsible for hosting third party's (e.g. CNN) web contents for efficient delivery to the end user. To efficiently manage contents Akamai need to use configuration files. For example, third party controls their content's various configurations such as cache timeout for web pages etc. and they submit such configuration files to Akamai. Akamai also need to configure to allow dynamically assign clients to different networks. This paper deals with how to efficiently distribute configuration files to 15000+ servers hosted at different countries and ISPs. They have statically allocated 5 servers as storage point (SP). To publish a new configuration file, Client contacts any one of this server and submits the files to the Accepting SP (ASP). ASP replicate the file on at least a majority of SP's and exchange vector for recovery purpose. Once this is done ASP notify client that the file is accepted. The receiver runs a background process that periodically checks for new updates and downloads if available. To efficiently download and check for updates, they maintain a group index tree and root index file which is checked before downloading the original file. After restarting from a down time, servers use vector to restore and get the missing updates. Due to clock skew problem they allow bounded reordering for ease of use. They have showed that to distribute the files to all the clients may take up to several minutes and to replicate the file in a quorum it takes about 50 ms on average. The size of configuration files can also vary from few kilobytes to 100 MB. Criticism: The proposed solution is designed targetting Akamai CDN network which is inherently highly reliable and use the facility of NOCC (Network operations command center) for manual maintenance and recovery. So their test results may be optimistic comparing to unreliable network. They does not allow dynamic addition or removal of SPs . Old configuration files are overwritten by the new ones. In case of an error, there is no way to recover the old files. They also did not address the problem of malicious updates by faulty publisher. Their solution is not applicable for p2p system as there is no central control and no authority to assign some nodes as SP. Future work: It would be helpful if it can be extended for adding or removing SPs dynamically which can improve reliability and availability. It is not a pure distributed protocol as some node is more important than others. A pure distributed protocol is needed for p2p network. From: Muyuan Wang [mwang2@uiuc.edu] Sent: Thursday, March 09, 2006 1:10 AM To: Indranil Gupta Subject: 598ig review 03/09 MON: On-demand Overlays for Distributed System Management Summary: The paper builds a management overlay network system, and test it on the PlanetLab platform. It is designed to facilitate the management of large distributed applications. It builds an on-demand overlay structure to allow the users to execute instant management commands. The system focuses on the ability of a user to execute instant management commands, and it does not rely on other infrastructures such as DHT. The MON system contains a daemon process running on each node of the distributed system, and has a 3-layer architecture. The top layer is 'distributed system management', the middle layer is 'overlay construction', and the bottom layer is 'membership management'. In the bottom layer, gossip-style membership management is used, and the 'age' entry is adopt to keep the freshness. The middle layer is the central component of the system. It is constructed using 2 kinds of overlay structures, trees and DAGs. In the top level, two management commands, Status Query and Software Push, is discussed. Comments: I believe it is a paper generated from the project of this class. It is really amazing that they can do so much things in such a short time. In order to complete the project, we must not only have good ideas, but also have strong implementation skills. I will ask Jin about how to quickly master these tools such as PlantLab. Moreover, I am not sure about the reason to implement the on-demand overlay using 2 different ways. The two different ways are not compared in theoretically or experimentally, and the DAG approach are only described using less than 10 lines. MapReduce: Simplified Data Processing on Large Clusters Summary: This paper presents a programming model for processing and generating large datasets, named MapReduce. It works like this: the users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function t merge all intermediate values associated with the same intermediate key. This work provides a simple and powerful interface which enables the automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of PCs. The model hides the details of parallelization, fault tolerance, locality optimization and load balanceing. A large variety of problems are easily expressible using this model. Also, the work aims at reducing the amount of data sent across the network, by using the locality optimization method. Comments: I do not understand this paper very well. The style of this paper is very different from previous papers I read for this class, too. I have many questions about the evaluation of this kind of models. It seems that no other comparable systems are implemented and experimented on; and I am also not sure about the major concerns if we want to build such a programming model. Besides, what if we just want to use it to generate data for some particular usage, as the paper suggests, for data mining, or machine learning? Is there any difference? From: Bach Duy Bui Sent: Wednesday, March 08, 2006 11:43 PM To: Indranil Gupta Subject: 598IG review 03/08 MapReduce: Simplified Data Processing on Large Clusters Bach Duy Bui - 598IG I. SUMMARY This paper introduces a prgramming model for processing and generating large data sets. The programming model abstracts away the complicated underlying details of parallelization, fault-tolerance, data distribution and load balancing in the library. The user of the programming model expresses the computation as two functions: Map and Reduce. Map takes an input pair and produces a set of intermediate key/value pairs. Reduce acceptes an intermediate key and a set of value for that key producing an aggregated value for the key. The Reduce function is called via an iterator. The implementation of the model on Google cluster-based computer system is also described. Basically, the input is splitted into smaller size inputs. Many copies of the main program will run on different machine: one master program and multiple workers. They are all share the job of processing the inputs to which they are assigned by the master. The intermediate values produced by workers are buffered in memory. These values are partitioned into group and their locations are passed back to the master who is resposible for forwarding these information to reduce workers. The reduce workers collect the intermediate values and aggregate them. Upon finishing the jobs, the master wakes up the user program returning the results. MapReduce is resilient to lar-scale worker failures. Redundant executions are used to alleviate the effect of slow machines. II. COMMENTS This paper shows the success of simple and practical-oriented designs. However, more insight analytical analyses should be paid attentions. A balance between practical and analytical analyses will help the community gain more knowledge which may result in more effective future researches. MON: On-demand Overlays for Distributed System Management Bach Duy Bui - 598IG I. SUMMARY MON is a distributed system designed to facilitate the management of large distributed application. MON builds on-demand overlays strutures that allow users to execute instant management commands. The MON system consists of a daemon process running on each node of the distributed system. Each MON server has a three layer architecture: (i) membership management (ii) overlay on-demand management (iii) Instant command layer. In membership management layer, the authors adopted a gossip-style management. Overlay on-demand management support two kinds of overlay structures: trees and directed acyclic graphs. A random tree construction algorithm is adopted. Its extension is proposed to fix its lack of locality. The instant command layer supports two types of management commands: (i) status query (ii) software push. II. COMMENTS This paper propose a simple management overlay for distributed system. It would be better if the authors give more detail analysis of the effectiveness of the design in large-scale networks. The question of how the overlay operate in a non-coorperative environment is not addressed in this paper. From: Ravishankar Sathyam Sent: Wednesday, March 08, 2006 9:06 PM To: Indranil Gupta Subject: cs598ig review 3/9 Ravi Sathyam CS 598ig MON: On-Demand Overlays for Distributed System Management Summary: This paper presents the Management Overlay Network system that helps manage distributed applications. Essentially, MON dynamically constructs overlay structures based on need. MON basically lets users of large distributed applications to facilitate instant management commands (such as querying current status etc). For each instant management command, an on-demand overlay structure is created. This overlay tree construction is the key to MON, and can happen in various ways. One way to do this is random tree construction, where MON clients send a Session message to a nearby MON server. Each MON server that receives a Session message and responds with SessionOK becomes a child of the MON client, and fans out accordingly by sending session messages to other nodes. Such an algorithm can be modified accordingly to add locality, or also to create DAG’s. After construction of an on-demand overlay structure, one can perform several types of management commands. A couple of types of commands covered in the paper are 1) Status Query and 2) Software Push. Status Query commands are basically propagated down the overlay tree to all nodes. These commands are then executed locally on each node, and the results are then aggregated and pushed upwards. For Software Push commands, DAG’s are created and a multi-parent, receiver driven downloading approach is adopted. The evaluation of MON is then presented – and as expected, DAG’s work better for Software Push over randomized trees (bandwidth wise), and the two-stage randomized tree algorithm covers more nodes quicker than the one-stage algorithm. Comments: Why aren’t any security mechanisms just built into MON? Are there possible other ways to construct paths – something more efficient (reduction in the number of paths) while still spanning similar number of nodes? ACMS: The Akamai Configuration Management System Summary: This paper proposes the Akamai Configuration Management System, which gives clients a higher level of control over their content when using a third-party CDN. Basically, the ACMS consists of a front-end (several Storage Points) and a back-end (the Akamai CDN). The ACMS basically works as follows: Say an application wants to submit a configuration file. This publisher basically contacts an SP, which is known as the Accepting SP for the submission. A quorum of SP’s then agree on a submission after which the SP acknowledges the publisher’s request. In order to receive updates, each node in the CDN network runs a process called Receiver that coordinates subscriptions for the node. It is then shown that the acceptance algorithm amongst the SP’s satisfies the acceptance and correctness requirements. SP’s recovering from crashes also use Index Merging in order to regain information from other SP’s. Basically, the configuration files are organized as an index tree in each SP. The index trees from all SP’s are merged and the missing information is then gained. This index formation and merging algorithms are also presented in the paper. In the receiver end, subscriptions for configuration files specify the location of that file in the index tree. The receivers then combine all local subscriptions into a subscriptions tree. Receivers check for updates to the tree by making HTTP IMS requests to the root, and if changes are made to the root, parses down the tree in order to check which intermediate indexes (that are also in the receiver’s subscription tree) have been updated. In the evaluation of the ACMS, it is shown that the Agreement Algorithm amongst the SP’s scales quite nicely. Also, for most of the configuration files, the submission time takes less then 1 second!! From: Raghu Kiran Ganti Sent: Wednesday, March 08, 2006 2:48 PM To: Indranil Gupta Subject: 598ig review 03/09 Reviews for Distributed Management, March 8, 2006 ------------------------------------------------- Paper title: MON: On-demand Overlays for Distributed System Management Summary/Main ideas: ------------------- This paper describes an architecture and implementation of a distributed system designed to manage large distributed applications. The paper addresses the issue of instant management commands execution when the distributed application is running. There are two types of management commands from a user's perspective, pushing the application code onto a set of selected nodes, and querying the current status of an application. MON architecture is three layered, with the lowest layer managing membership, the middle layer doing overlay construction and the top-most layer handling distributed system management. When a request is issued by the user application, an on-demand overlay is constructed and the query is executed. The authors argue that an on-demand overlay construction provides several advantages, which are: 1. The system is simple and lightweight, as overlay structures are absent when no commands are being executed. 2. On-demand overlays provide good performance, as they are built based on the current network configuration. 3. Different structures can be created for different tasks. In MON, the membership management is done by using a simple gossip-style protocol. Overlays are constructed in one of the two ways- using trees or Directed Acyclic Graphs (DAGs). The authors look at two types of tree construction, a random and a two stage. They argue that the random tree construction is simple and has good coverage, but is not locality aware. On the other hand, the two stage algorithm works the same as a random algorithm in its first stage, but selects nodes from a local list (neighbors) in the second stage. This provides a better locality for the two stage algorithm. Commands are of two types, status query and software push. As explained earlier, the status query is a command to find the status of the distributed application and software push is used to push code updates to different nodes. Comments: --------- 1. The problem addressed in the paper is quite interesting, given that it is difficult to keep track of the progress of a distributed application. The authors present a simple and nice architecture towards this direction. 2. Using on-demand overlays, which is the main idea of the paper is also quite interesting. 3. The evaluation is not very detailed. 4. One important direction for MON is to look into security issues. Quoting the authors, ``...in theory any arbitrary operation to be executed on each node...'', such commands can be pretty nasty! *************************************************** Paper title: MapReduce: Simplified Data Processing on Large Clusters Summary/Main ideas: ------------------- This paper presents the MapReduce programming model for processing and generating large data sets in a large scale distributed environment (in this case, Google engineering environment). The basic idea is that a user specifies a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The map and reduce functions are written by the user, but the processes are executed in the background on a large set of machines. This paradigm was implemented in a specific environment (Google), where machines are simple commodity machines connected by a commercial gigabit Ethernet, with inexpensive IDE storage. In the current implementation, Map invocations are distributed across several machines. The following sequence of operations occur when a user program calls a MapReduce function: 1. The user program splits the input files into M pieces and starts several copies of the program on a cluster of machines. 2. One copy is designated the master, which assigns work to other worker nodes. 3. A worker, which is assigned a map task executes code and buffers the results in memory. 4. Buffered pairs are written to local disk, partitioned into R regions. These regions are passed back to the master. 5. The master notifies the reduce worker about the disk locations, which uses RPCs to read this buffered data. 6. When all the map and reduce tasks are completed, the master wakes up the user program and the MapReduce call completes. Two types of failures can occur: worker and master failure. When a worker fails, the master assigns a new worker to do that particular task. These failures are detected by using periodic heartbeats. Master failure is handled using checkpointing. During non-deterministic failures, weaker semantics are provided. The authors identify that network resources are at a premium and hence try to assign tasks in a locality aware manner. For example, if a file is on a given node, the map task is executed on that node and if it is not available on a node which is nearby. Load balancing is achieved by executing different tasks on each worker. The authors discuss a straggler machine, which increases the computation time of a MapReduce call, because of unexpected failures. To avoid long delays when large MapReduce operations are being executed, they propose to use a backup tasks mechanism, where a backup task is assigned by the master (later in the execution). Several optimizations and extensions have been provided for the implementation of map-reduce paradigm in the Google environment. Also, provided are counters to keep track of the execution of MapReduce calls made by the user. Comments: --------- 1. The concept of map-reduce is quite innovative. This reduces the burden of the programmer, as he/she need not worry about how to parallelize his/her code. 2. The methods presented in the paper to reduce network bandwidth consumption are also simple and nice. 3. In my opinion, having a master is a more practical (easy?) approach, but from a theoretical point of view, it is not scalable and is a single point of failure. 4. It is not clear from the paper how this master is chosen (is a leader election algorithm used?) 5. Also, when workers' status is set, it is not obvious whether these states are set by the master or the worker itself? (I would assume that the natural thing to do is to let the worker set idle/busy states, but the master tells the worker what to do) From: Praveen Jayachandran Sent: Wednesday, March 08, 2006 2:28 PM To: Indranil Gupta Subject: cs598ig review 03/08 CS598IG Review 3/8 Praveen Jayachandran MON: On-demand Overlays for Distributed System Management --------------------------------------------------------- This paper describes a Management Overlay Network (MON) system that facilitates the management of large distributed systems. Their current implementation on more than 300 nodes on PlanetLab, builds on-demand overlays that allow users to execute instant management commands, such as query the current status of the applications, or push software updates to all the nodes. The on-demand overlay construction approach enables MON to be light-weight, scalable, and requiring minimal resources when there are no commands executed. The performance of this approach is likely to be good as the overlay is constructed considering the current network conditions. Also, since the overlays are created on-demand, different structures can be created for different tasks. For examples, status queries can be served using trees, and software push can be served using directed acyclic graphs (DAG). The authors provide two algorithms for tree construction. When an overlay is created as a DAG, a downstream node can download the software update from multiple parents simultaneously. The failure resilience of the overlay network is also improved as long as each node has at least one alive parent. The authors provide results from their implementation currently running on PlanetLab. The coverage percentage, the overlay construction latency, and query response time are measured. The bandwidth achieved for software updates is also presented. Comments: 1. How are node failures during a session handled? A node failure could leave a large portion of the nodes in the system unreachable. The exposition seems to assume that there are no failures within each session. To further motivate the on-demand approach, the authors could have presented data on average session duration and mean time before overlay is disconnected. 2. What is the frequency of management commands? If a user creates watchdogs that execute frequently, creating overlays every time may be costlier than simply repairing failures and optimizing the overlay constructed. ACMS: The Akamai Configuration Management System ------------------------------------------------ This paper describes the Akamai Configuration Management System (ACMS) that provides fast, light-weight, reliable, and asynchronous delivery of configuration information in a fault-tolerant and extremely scalable manner. Akamai currently provides configuration management to over 15000 servers, deployed over more than 1200 different networks in over 60 countries. It provides their customers with close control over the manner in which content is distributed and provides several options that include varying html cache timeouts, whether to allow cookies, and whether to store session data. The key issues that face ACMS are the large number of failure-prone clients, their wide-area distributed nature, unreliability of Internet connections, strong consistency requirements, and security. The ACMS architecture is divided into two subsystems, which the authors call the 'front-end' and the 'back-end'. The front-end consists of a small set (typically 5 machines) of storage points (SPs), which are responsible for 'accepting' and storing configuration updates. For availability reasons, the SPs are placed on independent ISPs. ACMS 'accepts' a configuration update request, when a quorum number of SPs are aware of the request. The quorum is set as the majority of the SPs. A quorum based system is used in order to improve the availability of ACMS. If only a single node received an update, the entire system would have to be halted if that node failed. A publisher contacts any one of the SPs, termed as the Accepting SP. The Accepting SP executes an algorithm called Vector Exchange to establish a quorum of SPs to accept the request. When a quorum is established, an 'Accept' message is sent back to the user and the data is ready for download. Occasionally, when a SP experiences failures or network outage, a recovery mechanism based on index merging is used to bring the SP up to date. This index management also allows SPs to provide snapshots for receivers to learn the latest configuration updates. ACMS also uses download points, which do not participate in the quorum system, but help alleviate the load on the SPs. The back-end of ACMS consists of a multitude of Receivers that receive data from the SPs. This update propagation is done using a pull-based approach as against a push-based approach, as the Akamai CDN is itself fully optimized to handle HTTP download. The pull-based approach allows the SPs to be oblivious to the number of Receivers and their location, providing scalability. The Receivers are responsible of contacting the closest SP. Also, it allows Receivers to make use of caching to reduce network bandwidth requirements. Comments: 1. One of the major issues that drives the design of ACMS is their strong consistency requirements. This is the reason why a quorum is used to accept an update. However, a client contacts only one of the SPs to receive an update. Although, the SPs use a recovery mechanism to bring itself back up to date, during this down time, it could be serving stale or no information to clients connected to it. This could be avoided if the clients also contact a quorum set of SPs and chose the most recent update. This would ensure consistency, but would increase the overhead considerably. 2. If in future, Akamai needs to increase the number of SPs, finding a majority quorum for each update could be wasteful. 3. ACMS uses caching to reduce network load at the cost of increasing the update propagation time. Why not use a gossip protocol for propagating the updates. This would serve the same purpose as caching, and in addition would reduce and distribute the load on the network. 4. The evaluation (the table) shows that up to configuration file sizes of 100K, the average submission time is nearly a constant less than a second. Beyond that the average submission times seem to increase exponentially. The reason for this trend is unclear. From: Juan Jose Jaramillo Jimenez Sent: Wednesday, March 08, 2006 11:40 AM To: 'Indranil Gupta' Subject: 598ig review 03/09 1 MON: On-demand Overlays for Distributed System Management 1.1 Summary Large distributed computing systems used for research require new tools that help the user execute some instant management commands on some selected nodes and get results immediately. Several tools exist to help the researcher on status monitoring and query, resource discovery and software distribution, but none of them help to execute instant management commands pertaining to the applications. MON tries to solve this problem using an on-demand approach: it builds an overlay structure for propagating commands to all nodes, and aggregates the results back. MON consists of a daemon process running on each node of the distributed system, and has a three layer architecture: membership management, on-demand overlay construction, and distributed system management. Since the overlay network is created on demand, there several advantages: there is no concern about maintaining the structure, the overlay network is likely to have good performance, and different structures can be created for different needs. MON is currently deployed in PlanetLab, running on more than 300 nodes, and its two management commands are status query and software push. 1.2 Comments The main contribution of the paper is presenting a system that is built on demand, and that can construct an overlay network tailored to the specific needs of the management command that is executed. Since the overlay network is decentralized, there are no bottleneck points and in-network aggregation can be achieved. However, there is a tradeoff while building on-demand systems: the response time to the command includes the overhead of building such a network, so a possible optimization could be storing in a cache previous overlay structures and using them if they have not expired and are suitable to the command to be executed. 2 MapReduce: Simplified Data Processing on Large Clusters 2.1 Summary The major contribution of the paper is a simple and powerful interface that enables automatic parallelization and distribution of large scale computations, and that can achieve high performance on large clusters of commodity PCs. MapReduce is a programming model and implementation that helps processing and generation large data sets, allowing programmers with no previous experience in parallel and distributed systems to easily utilize resources on large scale systems. The idea behind MapReduce is that the issues of how to parallelize computation, distribute data, and handle failures obscure the programmer's main objective of processing large amounts of data, so the implementation creates an abstraction that allows to express simple computations while hiding the messy details of parallel and distributed systems. In the programming model, the computation takes a set of input key/value pairs and produces a set of output key/value pairs. To do this, the interface provides two functions: Map (produces an intermediate key/value pair from an input key/value pair) and Reduce (accepts an intermediate key and a set of values and merges together those values into a smaller set). Fault tolerance is basically handled by re-execution. MapReduce has been successfully used at Google for several reasons: it is easy to use, a large variety of problems can be easily expressible as Map/Reduce computations, and the implementation is able to scale to large clusters of machines. 2.2 Comments Probably I am biased because these guys work at Google, but it is nice to see a research problem that has already been tested in a real world environment, not just through simulations or test beds, but with real systems, showing the feasibility and practicality of the proposed approach. One big point in the paper is that the interface is really simple, having basically two functions; however, it could be interesting to explore if there are any other type of function abstractions that could benefit from this architecture, enhancing the system to do more complex tasks and allowing it to be deployed in other kinds of distributed systems where the goal is focusing on your task rather than worrying about the complex details of parallelization and distribution.