Course outline
Class Notes
Home work
Presentation Schedule
|
CS598: Topics in High Performance Computing: Architecture, Algorithms, and
Programming Models
William Gropp
Taught Fall, 2008
This class is an exploration of the relationships between algorithms,
programming
models, and computer architecture for high performance computing.
Topics include the impact of modern architectural features on the
complexity models used to inspire and evaluate algorithms, the impact
of memory hierarchy and other features on both node performance and
scalability, and the support for scalable and efficient code in
programming models.
This course will discuss and attempt to answer the following questions:
-
Architecture
-
What are the features of modern architectures that impact
performance of scientific applications?
What features of scalable systems are important to algorithm
designers?
How do FPGAs, GPGPU, and similar approaches fit into the
scientific computing picture?
- Algorithms
-
How should complexity and performance be modeled?
What changes at large numbers of processes?
What changes with changing memory hierarchy?
What are the consequences of these performance models when chosing
the "best" algorithm?
- Programming Models
-
What are the barriers to the adoption of high-performance
algorithms?
What role (good or bad) do current programming models play?
Is there a component or compositional approach to programming
high-performance applications in scientific computing?
The course will combine lectures with discussion of papers addressing
some of these issues. Students are invited to nominate papers for
consideration. This course does not assume a background in scientific
computing; the algorithms that we will discuss will be reviewed as
they come up in discussions.
An Example of the Challenge
Matrix-matrix multiply of dense matrices is a simple and exhaustively
studied problem. The code for this operation in C and in Fortran is
quite simple. In addition, there are many floating point operations
for each data item, meaning that the performance on cache-based
systems can be very good. But in practice, such performance is rarely
observed from the natural C or Fortran versions of this elementary operation.
Try it yourself! Note that your laptop is capable of several
gigaFLOPs.
(Caveat - these tests are written to be simple and relatively
portable. Better versions will use finer-grain timers, better
initialization of the memory hierarchy, etc.)
We will discuss:
- The reasons for the observed performance, including the
dependence on problem size and the discrepancy with "classic"
complexity analysis that you will find in texts
- The impact on algorithms (both in terms of adapting algorithms to
the hardware and in terms of selecting alternative algorithms)
- The role of programming models in enabling better
architecture/algorithm matches.
Project
A term project is required for this course. Students will select an
algorithm and analyze it with respect to one or more architectures and
programming
models. Students will make a proposal for a project, due October
17th. The final project report is due December 15th.
Details about the project report are available here.
- How does the algorithm fit the selected architecture(s)?
- What is an appropriate performance model?
- How well do the selected programming models express the algorithm?
Some examples of projects:
- For one of the following algorithms, analyze how well the natural
implementation in several programming models works on one or more
architectures:
- Dense matrix-matrix multiply
- Sparse matrix-vector multiply
- Sparse matrix-matrix multiply
- Stencil computations
- Distributed hash
- Map Reduce
- For one of the following algorithms, consider alternatives that
can solve the same problem to a similar degree of accuracy but with
a better fit to one or more selected architectures
- Global FFT (e.g., replaced with a different basis with better locality)
- Conjugate gradient for solving linear systems on parallel
machines (e.g., allow overlap of reduction operations with other operations)
- For one algorithm, focus on programming model support for
achieving performance with that algorithm while retaining the
ability to maintain and modify the implementation.
Papers
Here are some papers that we may discuss. The links are to their
official home where possible; in many cases, access from a Univesity
IP address is easy. This list is under development; check back often.
- Architecture
-
There are several good books on both serial and parallel computer
architecture, along with numerous book chapters and online. The
papers here were written for algorithm developers and talk about the
realities of architecture in that context.
- Algorithms
- These papers consider some aspects of algorithms that must perform
well on real hardware, as opposed to idealized hardware.
An analysis of sparse matrix-vector multiply is presented in this paper:
Achieving high sustained performance in an unstructured mesh CFD application (SC1999)
In parallel computing, the assignment of processes to processors can be an important step in achieving good performance. A good paper with results from a recent supercomputer is
Topology Mapping for Blue Gene/L Supercomputer
Sometimes the algorithm must compensate for a problem in the architecture.
Extending Stability Beyond CPU Millennium in section 2 discusses how
a molecular dynamics application worked around errors in the L1 cache by modifying the algorithm.
- Programming Models
-
Many programming models have been proposed to address various goals.
These papers look at some of the issues in programming for high
performance while retaining productivity.
Further Reading
The
book Performance
Optimization of Numerically Intensive Codes is a good place to
find examples of performance optmization strategies applied to code
that you can go and run, as well as examples of why this course is
also discussing programming models.
For a through discussion of the single node architecture, see the
classic Computer Architecture: A Quantitative Approach
LLCbench is a
collection of performance benchmarks that may be useful in applying
some of the ideas developed in this class.
|