Refactoring for High
Performance Computing
Tunning parallel codes for performance is a necessary but tedious
activity that is not much helpted by current tools. Automated
code optimization (e.g., by a compiler) has limited capabilities.
Various projects have tried to develop interactive compilers where the
compiler provides feedback to the user that can then guide the compiler
transformations. Success has been limited: One reason is that the
intermediate representation form used by the compiler is not human
readable and it is hard to feed back to the user the result of
transformations performed. A promising alternative is to use a
refactoring infrastructure, as discussed in our recent
article: The transformations are
source-to-source, but a compiler infrastructure can be used to automate
much of the transfromation process; multiple source versions can be
maintained as one code artifact. We are curruntly investigating the use
of this approach.
Exascale Computing
If the current increase in supercomputing performance continues at
the same rate as in previous decades, then one can expect systems with
exascale performance in little more than a decade. Getting there will
require changes that are more significant than the change from vector
architectures to clusters of microprocessors that happened three
decades ago. In particular:
- Single single-thread performance is not increasing, an exascale
system will require billions of concurrent threads. It is not clear
which applications can scale to that level, and how a software stack
will be structured to manage such a level of parallelism
- Power restrictions will probably entail the massive use of hybrid
architectures, much lower CPU to memory ratio, and significantly
enhanced locality.
- Higher error rate (and the avoidance of costly error recovery
mechanisms) will require algorithmic fault tolerance.
- I am involved in several efforts to define the software research
agenda for exascale and start executing on it. Note that exascale may
be the end of the road, at least for a system using silicon technology.
We should start thinking seriously about alternative research model on
computational scioences that are not predicated on a continuous
increase in system performance.
Intel amd Microsoft are funding a center for research on multicore
computing at Illinois. I am the co-director of this center. Among the
main research directions in the center are:
- 3D visual interfaces: Improved visual interfaces
will enable new applications (such as those we explore in our
teleimmersion project) and enhance user-computer interaction. More
compute power is needed on clients to support 3D modeling and
rendering, gesture and face recognition, etc. The interactive use of
such tasks requires low latency, hence requires computating on the
client.
- Safe parallelism: We firmly believe that
parallel programs should be race-free by design. I also beleive that
nondeterminism is seldom, if ever, needed in parallel trasnformational
code (i.e., code where parallelism is introduced uniquely to enhance
performance, but is not inherent in the problem specification). A
strong focus in our research (mine included) is on languages that
ensure code is race-free and provide, by default, determinisitc
behavior.
- Easy performance: The goal of parallelism is
performance -- especially scalability (increasing performance with an
increasing number of cores). The creation of scalable code should be
significantly facilitated by enhanced programming environments.
NCSA recently won in a
competition to host a $200M petascale supercomputer at U.
Illinois. I am one of five co-PIs on this proposal, focusing on the
software architecture of Blue Waters. Among the main concerns of this
effort are:
- Making performance engineering a reality by developing a more
systematic methodology for performance prediction of large codes; the
goal is that codes be ready t run on Blue Waters as soon as the machine
is in operation; performance modeling should not be an exercise done by
a small number of expert teams, but an inherent part of code
development.
- Using effectively large SMP nodes: We are pushing changes in
MPI3, and looking at possible usages of PGAS languages and restricted
OpenMP dialects for that purpose
- Large-scale resource management: we are looking at support for
workflow and for better integration of storage management with CPU
scheduling.
- Improved routing: The topology of Blue Waters raises very
interesting questions about the best way of allocating partitions and
of routing in the system.
- System monitoring: We are looking at the best way to continuously
collect data on the health of the system and its performance, and
continuously montior the system for anomalous behaviors.
Classification for
Performance Tuning
In many cases, there is a
choice of multiple algorithms to solve a problem; and each algorithm
can be implemented in different ways. One wishes to make the choice
that yields the best performance, but this choice may be
platform-dependent and input-dependent. Neither choices (of algorithm
and implementation) can be done automatically, e.g., by a compiler.
We can consider the choice of an optimal algorithm or an optimal
implementation as a classification problem: each input is tagged by the
code that woks best for this input; one wishes to find a fast compute
classifier that associate (with high probability) a "good" code with
each input. It turns out that this apporach works well in practice. One
can use machine learning techniques to train a classifier, using inputs
for which we found, by brute force, the best code, and then use it at
execution time to select the right code.
We have shown this technique to be effective for the well-studies
Frequent Item Mining problem
Patterns of High Performance
Computing
There is significant work on
the development
of better parallel programming models and environments. However, the
evaluation
of this work is far from systematic. The purpose of our research is to
collect
in amore systematic manner a library of parallel programming patterns,
and use
them to evaluate the usefulness and expressiveness of existing and
proposed
parallel programming models and environments. A community effort to
help this
effort was started at the
patHPC workshop that took place in April 2005 at UIUC (work
funded by DOE as part of the
Center for Programming Models for
Scalable Parallel Computing).
PERCS
DARPA awarded IBM $53.5M in
funding to pursue research in High Productivity Computing Systems. The
work on the Productive, Easy-to-Use, Reliable
Computing System (PERCS) covers chip technology, architecture, OS,
compiler and programming environment. UIUC is a partner in this
research; the UIUC team includes Profs. Vikram
Adve, Ralph Johnson, David Padua, Marc Snir, and Josep Torrellas and
their students. I am involved with students in several aspects
of this project. In particular, we work on architectural extensions to
support key applications; work on benchmarks and performance modeling;
and work on compiler technology, in support of semi-automated, run-time
program tuning. (Work funded by DARPA; see recent publications).
Blue
Gene
IBM Research announced
in December
1999 a $100 million research initiative, to build “Blue Gene”, a Petaflop/s supercomputer. This level of
performance would
be achieved by connecting together a massive number (32K) of single
chip
multiprocessors, where each chip is a highly parallel, shared memory
multiprocessor. The main intended application for this machine is the
simulation of protein folding, a process that is key to the
understanding of
the fundamentals of life.
The exploitation of such a machine will
present challenging software problems. In particular, one needs to
ensure that
the machine will continue to operate for months, even though individual
components may fail during this time. One needs to develop new methods
for
programming and controlling this massive number of concurrently
executing
parallel threads. One needs to adapt software algorithms to achieve
peak
performance on a novel architecture with on chip memory and hardware
supported
concurrent multithreading.
I initiated the project and led the
system work in the initial phase. We demonstrated a speed-up in excess
of
300,000 in a simulation of protein folding in Blue Gene [ICS 2000].
The Blue Gene effort had a large number of
contributors, listed as authors in a
survey paper. The Blue Gene project underwent significant
technical and managerial changes a year after it
started and
is now focused on a different architecture and different set of
applications.
Scalable Parallel System
Architecture
Today, most large-scale
parallel
computers are built by assembling smaller ``commodity'' compute nodes.
Two
main architectures have emerged:
- Shared Memory
Multiprocessors (SMM's), where each
processor can access all memory (and all I/O), and all processors'
caches are maintained coherent. Such systems are usually controlled by
one operating system image. The prevalent parallel programming model is
thread parallelism: each processor runs a separate execution thread.
The threads run in a common address space, and communicate via shared
memory.
- Clusters,
where each node has a separate memory and a separate I/O subsystem, and
each node is controlled by a separate operating system image. The
prevalent parallel programming model is message passing: each processor
runs a process in a private address space; processes communicate with
each other via messages.
SMM's
typically offer tighter coupling both in terms of performance (higher
bandwidth, lower latency), and in terms of function (``everything
shared''
model). This facilitates the development of parallel applications. On
the down
side, tighter coupling restricts scalability -- commercial single image
operating
systems do not scale nowadays beyond 64-128 processors. Error isolation
is
harder in a tightly coupled system -- any failure of a disk, processor
or
memory, or any system software failure brings down the entire system.
The lack
of internal firewalls also prevent
concurrent
maintenance -- hence prevents continuous availability -- an
increasingly
important goal for large systems. Finally, tightly coupled SMM's
usually require homogeneous hardware and software, and thus are
harder to
upgrade.
Large SMM's
are built by assembling smaller SMM's into
NUMA
systems. Multiple nodes are attached via a high bandwidth, low latency
packet
switching System Area Network (SAN). This, essentially, is the same
structure
as for a cluster node. The main difference is the function and
performance of
the communication controller. A cluster communication controller
contains a
message passing engine that offloads much of the messaging protocol
from the
main processor(s). Such an engine handles the link level protocol and
low level
error recovery; it manages message queues, handles protection and
address
translation, and has one or more data mover (DMA) engines. This enables
support
of ``zero-copy'' protocols in user space, where data is directly copied
from
sender memory to receiver memory in user space. A NUMA
communication
controller will also handle the link protocol and low level error
recovery.
Rather than a messaging engine, it contains a coherence engine that
keeps track
of the location of cache lines checked out from local memory and
executes
required operations to maintain caches coherent. The two types of
controllers
provide much common functionality, and it is feasible and advantageous
to
design a controller that combines both functions. Such a combined
adapter provides
two new capabilities:
- A data mover engine
that can be used for memory to memory or memory to cache copying of
data in a large NUMA system. In a conventional NUMA system, data can be
moved only via load and store operations. This consumes precious CPU
cycles and overloads the critical resource in such a system, namely the
CPU to memory communication path. A data mover engine can prefetch data
so as to hide long latencies without tying up key resources.
- A coherence engine that
can be used to maintain coherent shared memory segments across
different operating system images. To do so, the coherence engine needs
additional protection mechanisms, to ensure that a faulty kernel will
not issue wild writes against the memory of another OS image, and to
ensure that an error in one system will not bring down other systems.
However, much of these protection mechanisms are build in messaging
engines, as they are used to communicate across separate OS images.
Such a combined adapter can support
efficiently a partitionable
shared
memory multiprocessor. Increasingly, vendors of large NUMA systems
provide
a dynamic partitioning capability on their system. The system's
physical
resources can be partitioned into several physical partitions, each
controlled
by a separate operating system image. The partition boundaries can be
moved
without rebooting affected OS images. Partitioning provides better
fault
isolation, and enables concurrent maintenance. A combined adapter
enables the
use of the shared memory interconnect for fast, efficient message
passing
across partitions. The main obstacle to the scaling up of NUMA systems
is
software, not hardware. Therefore, it is very likely that, in coming
years, the
ability of hardware developers to scale up SMM's
will
significantly outstrip the ability of OS and subsystem developers to
scale up
their software. With partitioning, it becomes possible to scale up NUMA
systems
beyond to the number of processors that can be supported by one OS
image. The
shared memory hardware can be exploited in various ways to support
limited
sharing for applications and subsystems that span multiple OS images.
In
particular, one can support efficiently a shared memory programming
model for
applications that span multiple OS images. In this model, each
processor runs a
process in a separate address space. Shared variables are kept in a
segment
that is mapped into the address space of each process, and is
maintained
coherent by hardware. Similarly, shared memory can be used to provide
efficient
global lock services, or an efficient global cache for shared disks.
One can expect that such partitionable
systems will become increasingly prevalent in
coming years, providing a convergence point for large-scale computer
architectures. Rather than being forced into a choice between ``shared
all''
and ``shared nothing'' architectures, users will dynamically select an
appropriate level of sharing.
Some of these ideas were
demonstrated in
the
Prism
project.
Java for High Performance
Numerical Computing
First proposed as a
mechanism for
enhancing Web content, Java has taken off as a serious general
purpose
programming language. Industry and academia alike have expressed great
interest
in using Java as a programming language for scientific and engineering
computations. Such usage has been hampered by the relatively inferior
performance of Java in numeric intensive computing applications, and by
the
lack of key features in the Java language. In the
Ninja
project, we have developed compiler technology and libraries that lead
to Java
numerical codes with performance comparable to Fortran
or C, the more traditional languages for this field.
The
Java Grande Forum,
reflecting in part results from our own research, has identified five
critical
Java Language and Java Virtual Machine issues related to Java's
applicability
to solving large computational problems in science and engineering.
Unless
these issues are resolved, it is unlikely that Java will be a
successful
language for numerical computing. The five critical issues are:
- Multidimensional
arrays: True rectangular multidimensional arrays are the most important
data structures for scientific and engineering computing.
- Complex arithmetic:
Complex numbers are an essential tool in many areas of science and
engineering. Computations with complex numbers need to be supported as
efficiently as computations with the primitive
real number types, float and double. The issue of high performance
computing with complex numbers is directly tied to the next issue.
- Lightweight classes:
The excessive overhead associated with manipulation of objects in Java
makes it difficult to efficiently support alternative arithmetic
systems, such as complex numbers, interval arithmetic, and decimal
arithmetic. The ability to manipulate certain objects as having just
value semantics is absolutely fundamental to achieve high performance
with these alternative arithmetic systems.
- Use of floating
point hardware: Achieving the highest possible level of performance on
numerical codes typically requires exploiting unique floating
point features in each processor. This is often at odds with the Java
goal of exact reproducibility of results in every platform.
- Operator overloading:
If multidimensional arrays and complex numbers (and other arithmetic
systems) are to be implemented in Java as a set of standard packages,
then operator overloading is necessary to make the use of these
packages more attractive to the application programmer.
Parallel
Programming Environments and Tools
My group at
IBM provided the
overall design and key components of the software architecture for the
IBM SP scalable parallel system.
Close to half of the compute power of the
Top 500
supercomputing sites is now provided by SP systems. SP is the main
platform
used by the
ASCI
program. Much of this work in described in
several articles published in
IBM System
Journal 34(2), 1995, which also list the main
contributors. The SP
is listed as one of
fourteen major innovations at IBM Research..
Some contributions are listed below.
Message Passing Libraries
I worked with several
collaborators on
the design and development of the MPL message
passing library
of native SP communication commands. Based on this work, I contributed
to the
design of
MPI
which has become the industry standard message passing interface (I
successfully wrote around half of the
MPI-1
standard and less successfully wrote several
chapters of the
MPI-2
standard.) Hubertus Franke
developed
MPI-F,
an early, complete,
high-performance implementation of MPI1 on the SP2. We designed and
implemented
jointly with researchers at the Parallel Systems Group at the
NAS facility (NASA
Ames Research
Center)
MPI-IO, a portable parallel I/O library that evolved to become part of
MPI2.
Parallel I/O
The Vesta parallel
file system prototype that was developed at IBM Research provided much
of the
technology for the first SP parallel file system product.
Performance Tools
UTE Unified Trace Environment is a powerful, trace
driven tool for studying the performance of parallel programs.
Recent Projects
A somewhat random
selection of old personal research topics.
Bayesian Induction
A Bayesian model of
induction uses the
following framework: A prior probability function Pr()
represents our initial belief Pr(A) in A, for each
empirical
statement A about the state of our world. Over time, we
accumulate
evidence, learning that statements e1, e2,..., en actually
holds true
in our world. As we do so, the prior probability Pr(A) or A
is replaced by the conditional probability Pr(A|e1,...,en). A
simple
example
for
such
framework would be a "world" that consists of an
infinite sequence of coin tosses, and statements that express
assertions about
this world, such as "the fifth toss is a head", "the ratio
between heads and tails converges to 1", etc. Suppose that we
systematically try coin toss after coin toss. We would like to believe
that
Bayesian induction works. I.e., if statement A actually holds
in our
world, then we would like Pr(A|e1,...,en) to converge to 1;
if
statement
A is false, then we would like the conditional
probability to converge to 0.
The good news: if we were not dogmatic
about A, i.e., if we initially assigned to A a probability 0<Pr(A)<1,
then
Bayesian
induction
works for A. That's as good as one
can hope
for: clearly if Pr(A)=0, then Pr(A|e1,...,en)=0; if
we assume
up front that A cannot be, then no amount of evidence will
change our
mind, similarly, if Pr(A)=1, so that we assume up front that A
must be, then no amount of evidence will change our mind.
The bad news: If the probability
function Pr() can be expressed in a formal mathematical
notation, then
there must be some empirical statement A that is assigned a
priori
probability zero: we cannot avoid some preconceptions. So, if the world
does
not fit our preconceptions, then we are in trouble.
Some simple example of a
"forced preconception" follows. Suppose that, in our simple world of
coin tosses, we would use the function Pr() to bet on
subsequent
tosses. I.e., if e1,e2,... is the sequence of outcomes of the
coin
tosses, after n tosses, we bet "head" if Pr("n+1
toss is head"|e1,...,en)>1/2, "tail",
otherwise. Consider the statement A="the bet will be
wrong
at every coin toss". If the function Pr() can be expressed in
a
formal mathematical notation, so can our statement be. This is an
empirical
statement on the sequence of coin tosses, and it is quite easy to see
that it
must be that Pr(A)=0. I.e., if we use Pr() to
measure our
prior belief in the state of the world then, to the least, we harbor
the
misconception that Pr() cannot continuously mislead us.
This seminal work is described in a
hard-to-read long
paper
published in the Journal of Symbolic Logic. The
work is quoted in Stanford’s online Encyclopedia of Philosophy
and seems to
continue to generate follow-up research by philosophers interested in
the
epistemology of inductive inference. (I was flattered to see a 1995
Master
thesis from CMU’s dept of philosophy entitled “A
Commentary on the First Three Sections of Gaifman and Snir's
1982 JSL Paper Concerning Probabilities Defined on a First Order
Empirical Language,"
by Timothy Herron.) Here is a
pointer
to a recent presentation on this work.
Memory Hierarchy
Conventional abstract
computing models
assume constant access time to memory. In practice, memory accesses may
take
100's of instruction cycles. Caches are key to the performance of
modern
computers, and good cache locality is key to the performance of
algorithms.
Therefore, it is important to develop abstract computing models that
reflect
the reality of a memory hierarchy.
In joint work with Aggarwal
and Chandra we considered first a
hierarchical memory model where memory hierarchy is representing assuming
that access to address a costs f(a), for some
monotonic
function f() (most results focus on logarithmic and
polynomial cost
functions). We show, in this model, how to develop optimal algorithms
for
various problems. A simple example is provided by the matrix
multiplication
problem (assuming the n^3 product computation). A
simple
recursive algorithm turns out to have optimum locality (within a
constant
factor): each matrix is split into 2x2 block submatrices;
the matrix product is expressed as a product of two 2x2
matrices
involving, recursively, products of submatrices.
This
algorithm
does
not
depend on the exact memory access cost function; Leiserson and co.
recently coined the term cache oblivious for such algorithms.
Furthermore, within a constant factor, an on-line LRU type memory
management
algorithm will work as well as off-line memory management: programmers
can be
effectively relieved of the need to explicitly copy data from one
storage
location to another.
In subsequent
work, we
refined the model to reflect the effect of spatial
locality: access to contiguous data is cheaper than access to
random
locations. Interestingly, on-line memory management is not effective in
handling spatial locality. Also, some problems do not admit cache
oblivious
algorithms (recent work by Bilardi and co.).
Shared Memory Programming
Optimizing compilers improve
code
performance by reordering instruction execution, when such reordering
does not
change the computation outcome. For sequential programs, it is well
understood
when such reordering is legal: one has to respect data dependencies
and control dependencies. The situation is more complicated
for
explicitly parallel programs. It is possible to come up with examples
of
transformations that would be correct when applied to each sequential
thread in
isolation, but that lead to an incorrect result when the threads
execute
concurrently and communicate via shared variables, due to violations of
the
shared memory semantics (we assume sequential consistency). In joint
work
with Shasha, we
developed a formalism that enables the specification of order
constraints
within each individual executing thread so that program transformations
that
respect these constraints within each thread will be correct. This
framework
also provides conditions for the preservation of the atomicity of
compound
operations. The recent popularity of Java, a programming language
with
explicit shared memory parallelism, has brought these set of issues
again to
the forefront, with an ongoing heated
discussion
on the correct memory model for Java.
Decision Trees
The decision tree
model is
often used to analyze the complexity of computations that are dominated
by test
and branch operations. This model is used to show that sorting requires
nlog(n) comparisons, or
that finding a
maximum requires n-1 comparisons. Note that these two lower
bounds are
different in nature. The first one is an information theoretic
argument: There are n! possible outcomes to sorting n
elements, therefore a decision tree that sorts has n! leaves,
and
depth log(n!). The second does not follow from an information
theoretic argument: maximum has n possible outcomes, so that
the
information theoretic lower bound is log(n). Rather, it is an
adversary
argument: the element that was picked as a maximum must have been
compared to
all other elements, otherwise one could change the inputs so that
another
element is the maximum, without changing the outcome of the
comparisons.
Information theoretic arguments are valid, irrespective of constraints
on the
input domain (as long as there are still n! possible
orderings) and
irrespective of the type of predicates used to test and branch. It is
not clear
that the same holds for problems such as maximum. In fact, as the well
know
riddle about finding one fake coin out of n shows, the
maximum (or
minimum) problem can be solved with fewer comparisons, using more
complex
tests, when the inputs can have only two values (good heavy coin and
fake light
coin).
Would the same be true if coins could
have more than two weights? In joint
work
with Moran and Manber
we have applied Ramsey's theorem to show that lower bounds obtained in
a model
where simple comparisons are used hold for a model where arbitrary
binary
predicates are used, provided that the input domain is large enough.
Thus, if
coins can have many different weights, n-1 tests are
necessary, even
if one is allowed to use arbitrary predicates. Note that
Ramsey's
numbers are really large; this results
tells nothing about the number of comparisons needed when the number of
possible values (coin weights) is small.
NYU
Ultracomputer
The
NYU Ultracomputer
was never an
ultracomputer nor
a supercomputer: it was a
paracomputer. It
is still
an open problem whether
paracomputers can
be
supercomputers.