Designing Large Distributed Systems Using Novel P2P and Hybrid Approaches

Abstract

Our society increasingly derives social and economic benefits from large distributed computing systems such as enterprise networks and online collaboration environments. As a result, peer-to-peer (P2P) technology, with its inherent scalability and resilience, has become a promising approach to dealing with the scale and complexity of such systems. Existing research has not fully uncovered the power of P2P technology, either because its scope is artificially limited to file-sharing applications, or because it focuses on the technology itself (e.g., insisting on a purely decentralized design) rather than on the problem at hand. In this talk, I will present my thesis research, which focuses on developing novel ideas and algorithms for building scalable, reliable and dependable large-scale computing systems.

First, I will talk about our work on management overlay networks (MON), a system we have built to facilitate the management of large computing infrastructures such as those in an enterprise network. The goal is to support dynamic distributed status query and control. To achieve this goal, MON adopts a novel on-demand approach. It builds overlay networks for the distributed execution of management commands, but an overlay is built only when it is needed and is discarded as soon as the commands finish. As a result, MON is simple and lightweight: no overlay is maintained when no management commands are executing, and there is no need for complex failure repair. Because overlays are built on demand, there is little opportunity for gradual overlay optimization, so the overlay construction algorithm is vital to the performance of on-demand overlays. I will describe several algorithms we have designed to achieve high coverage, reliability and performance in on-demand overlay construction.
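To make the on-demand idea concrete, the following is a minimal, illustrative Python sketch. It is not MON's actual construction algorithm, which this abstract does not specify. It assumes a known list of unique member names, builds a random tree with a bounded fanout when a command is issued, runs a status probe on every node, and discards the tree afterwards. All function names and the `fanout` parameter are hypothetical.

```python
import random


def build_on_demand_tree(root, members, fanout=3, seed=None):
    """Build a throwaway tree overlay rooted at `root`.

    Hypothetical sketch: members are attached in random order, and each
    node accepts at most `fanout` children. Returns {node: [children]}.
    """
    rng = random.Random(seed)
    pending = [m for m in members if m != root]
    rng.shuffle(pending)
    tree = {root: []}
    frontier = [root]  # nodes that can still accept children
    while pending:
        parent = frontier.pop(0)
        for _ in range(min(fanout, len(pending))):
            child = pending.pop()
            tree[parent].append(child)
            tree[child] = []
            frontier.append(child)
    return tree


def run_query(tree, node, probe):
    """Run `probe` on `node` and its subtree, merging results upward."""
    results = {node: probe(node)}
    for child in tree[node]:
        results.update(run_query(tree, child, probe))
    return results


def on_demand_query(root, members, probe, fanout=3, seed=None):
    """Build an overlay, execute one command over it, then discard it."""
    tree = build_on_demand_tree(root, members, fanout, seed)
    try:
        return run_query(tree, root, probe)
    finally:
        tree.clear()  # no overlay state is kept between commands
```

In this sketch nothing persists between commands, so there is no overlay to repair when nodes fail between queries; the cost is that every query pays for construction, which is why the quality of the construction step matters.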
In the second part, I will talk about using control plane services to improve the performance and QoS of P2P applications. P2P applications are often designed to be entirely self-organizing. However, a small-scale control plane service can often simplify application design while improving performance and QoS. For example, we have designed a locality-aware P2P streaming system called DagStream, which allows peers to stream media data from nearby neighbors. To enable locality-aware neighbor selection, we have designed a control plane service called RandPeer, which manages membership information on behalf of P2P applications. The use of RandPeer not only simplifies the DagStream design, but also allows peers to quickly locate good neighbors, thus minimizing the impact of neighbor failures. Through the design of several P2P applications, we have discovered a general layered architecture for such applications called OCMA (Overlay Construction and Maintenance Architecture). Toward the end of this talk, I will briefly describe this architecture and contrast it with some alternative approaches to P2P application design.
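The control plane idea described above for RandPeer can be illustrated with a small Python sketch. This is not RandPeer's actual interface or algorithm, neither of which is given in this abstract. It assumes that each peer reports a coarse locality label (here called `region`, a hypothetical stand-in for whatever locality information the real service uses) and that a lookup returns random peers, preferring those in the caller's region. The class and method names are invented for illustration.

```python
import random
from collections import defaultdict


class MembershipService:
    """Hypothetical control plane service that tracks peers by region."""

    def __init__(self, seed=None):
        self._by_region = defaultdict(set)
        self._rng = random.Random(seed)

    def join(self, peer, region):
        self._by_region[region].add(peer)

    def leave(self, peer, region):
        self._by_region[region].discard(peer)

    def lookup(self, peer, region, k):
        """Return up to k random peers, preferring the caller's region."""
        local = sorted(self._by_region[region] - {peer})
        self._rng.shuffle(local)
        picked = local[:k]
        if len(picked) < k:
            # Not enough nearby peers: fall back to peers in other regions.
            remote = sorted(
                p
                for r, peers in self._by_region.items()
                if r != region
                for p in peers
                if p != peer
            )
            self._rng.shuffle(remote)
            picked += remote[: k - len(picked)]
        return picked
```

Under these assumptions, a peer that loses a neighbor can call `lookup` again to obtain a replacement, rather than discovering candidates through its own gossip or probing.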