Course outline
Class Notes
LectureFull page4-up
IntroductionPDFPDF
Arch and PerfPDFPDF
Virtual MemPDFPDF
Mem and Inst.PDFPDF
PipelinesPDFPDF
Cache ModelsPDFPDF
Parallel PerfPDFPDF
More on Parallel PerfPDFPDF

Home work
HW1

Presentation Schedule

CS598: Topics in High Performance Computing: Architecture, Algorithms, and Programming Models

William Gropp
Taught Fall, 2008

This class is an exploration of the relationships between algorithms, programming models, and computer architecture for high performance computing. Topics include the impact of modern architectural features on the complexity models used to inspire and evaluate algorithms, the impact of memory hierarchy and other features on both node performance and scalability, and the support for scalable and efficient code in programming models.

This course will discuss and attempt to answer the following questions:

Architecture
What are the features of modern architectures that impact performance of scientific applications? What features of scalable systems are important to algorithm designers? How do FPGAs, GPGPU, and similar approaches fit into the scientific computing picture?
Algorithms
How should complexity and performance be modeled? What changes at large numbers of processes? What changes with changing memory hierarchy? What are the consequences of these performance models when chosing the "best" algorithm?
Programming Models
What are the barriers to the adoption of high-performance algorithms? What role (good or bad) do current programming models play? Is there a component or compositional approach to programming high-performance applications in scientific computing?
The course will combine lectures with discussion of papers addressing some of these issues. Students are invited to nominate papers for consideration. This course does not assume a background in scientific computing; the algorithms that we will discuss will be reviewed as they come up in discussions.

An Example of the Challenge

Matrix-matrix multiply of dense matrices is a simple and exhaustively studied problem. The code for this operation in C and in Fortran is quite simple. In addition, there are many floating point operations for each data item, meaning that the performance on cache-based systems can be very good. But in practice, such performance is rarely observed from the natural C or Fortran versions of this elementary operation.

Try it yourself! Note that your laptop is capable of several gigaFLOPs.

  • Matrix-matrix multiply in Fortran
  • Matrix-matrix multiply in C
  • Matrix-matrix multiply in Fortran, using subroutine
  • (Caveat - these tests are written to be simple and relatively portable. Better versions will use finer-grain timers, better initialization of the memory hierarchy, etc.)

    We will discuss:

    1. The reasons for the observed performance, including the dependence on problem size and the discrepancy with "classic" complexity analysis that you will find in texts
    2. The impact on algorithms (both in terms of adapting algorithms to the hardware and in terms of selecting alternative algorithms)
    3. The role of programming models in enabling better architecture/algorithm matches.

    Project

    A term project is required for this course. Students will select an algorithm and analyze it with respect to one or more architectures and programming models. Students will make a proposal for a project, due October 17th. The final project report is due December 15th. Details about the project report are available here.

    • How does the algorithm fit the selected architecture(s)?
    • What is an appropriate performance model?
    • How well do the selected programming models express the algorithm?

    Some examples of projects:

    • For one of the following algorithms, analyze how well the natural implementation in several programming models works on one or more architectures:
      • Dense matrix-matrix multiply
      • Sparse matrix-vector multiply
      • Sparse matrix-matrix multiply
      • Stencil computations
      • Distributed hash
      • Map Reduce
    • For one of the following algorithms, consider alternatives that can solve the same problem to a similar degree of accuracy but with a better fit to one or more selected architectures
      • Global FFT (e.g., replaced with a different basis with better locality)
      • Conjugate gradient for solving linear systems on parallel machines (e.g., allow overlap of reduction operations with other operations)
    • For one algorithm, focus on programming model support for achieving performance with that algorithm while retaining the ability to maintain and modify the implementation.

    Papers

    Here are some papers that we may discuss. The links are to their official home where possible; in many cases, access from a Univesity IP address is easy. This list is under development; check back often.
    Architecture
    There are several good books on both serial and parallel computer architecture, along with numerous book chapters and online. The papers here were written for algorithm developers and talk about the realities of architecture in that context.
  • Bit reversal on uniprocessors (Alan Karp, SIAM Review, 1996)
  • Experimental Analysis of Algorithms (Catherine McGeoch, Notices of the American Mathematical Society, March 2001)
  • Reflections on the Memory Wall (Sally McKee, ACM Conference on Computing Frontiers, 2004)
  • Algorithms
    These papers consider some aspects of algorithms that must perform well on real hardware, as opposed to idealized hardware.
  • The Landscape of Parallel Computing Research: A View from Berkeley, particularly Section 3. (Tech report EECS-2006-183, 2006)
  • Cache-Oblivious Algorithms (Algorithms for Memory Hierarchies, LNCS 2625, pages 193-212, Springer Verlag)
  • Is Cache-Oblivious DGEMM Viable provides a different perspective on cache oblivious programs for what is probably the best-case for the cache-oblivious approach. This is a shorter version of An experimental comparison of cache-oblivious and cache-conscious programs
  • An analysis of sparse matrix-vector multiply is presented in this paper: Achieving high sustained performance in an unstructured mesh CFD application (SC1999)

    In parallel computing, the assignment of processes to processors can be an important step in achieving good performance. A good paper with results from a recent supercomputer is Topology Mapping for Blue Gene/L Supercomputer

    Sometimes the algorithm must compensate for a problem in the architecture. Extending Stability Beyond CPU Millennium in section 2 discusses how a molecular dynamics application worked around errors in the L1 cache by modifying the algorithm.

    Programming Models
    Many programming models have been proposed to address various goals. These papers look at some of the issues in programming for high performance while retaining productivity.
  • A comparative study of the NAS MG benchmark across parallel languages and architectures SC 2000
  • Programmability of the HPCS Languages: A Case Study with a Quantum Chemistry Kernel
  • Automated Transformation for Performance-Critical Kernels ACM SIGPLAN LCSD'07, Yi and Whaley
  • Further Reading

    The book Performance Optimization of Numerically Intensive Codes is a good place to find examples of performance optmization strategies applied to code that you can go and run, as well as examples of why this course is also discussing programming models.

    For a through discussion of the single node architecture, see the classic Computer Architecture: A Quantitative Approach

    LLCbench is a collection of performance benchmarks that may be useful in applying some of the ideas developed in this class.