Scalable Spectral Element Methods
LEFT: Early-time pressure distribution for simulation of coolant
flow in a 217-pin wire-wrapped subassembly, computed on 32768 processors
of the Argonne Leadership Computing Facility's BG/P using Nek5000.
The Reynolds number is Re ~ 10500, based on hydraulic diameter.
The mesh consists of 2.95 million spectral elements of order N=7
(~988 million gridpoints).
RIGHT: Strong scaling results for this problem
on BG/P. Vertical axis is the CPU time (s) for the first 50
timesteps. Horizontal axis is the number of processors.
80% parallel efficiency is realized for P=131072, which
corresponds to only 7500 points/processor.
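
The figures in the caption can be checked with a few lines of arithmetic. The sketch below assumes the usual strong-scaling definition of parallel efficiency, (T_ref * P_ref) / (T_P * P). The function names are illustrative, and the timings and reference processor count in the second call are hypothetical, since the text quotes only the resulting 80% figure and not the raw times.

```python
def points_per_processor(gridpoints, nprocs):
    """Average number of gridpoints owned by each processor."""
    return gridpoints / nprocs


def strong_scaling_efficiency(t_ref, p_ref, t, p):
    """Parallel efficiency at p processors relative to a reference run at p_ref."""
    return (t_ref * p_ref) / (t * p)


# Figures taken from the caption: ~988 million gridpoints, P = 131072.
print(round(points_per_processor(988e6, 131072)))  # 7538

# Hypothetical timings (NOT measured data), chosen only to show the formula.
print(strong_scaling_efficiency(t_ref=100.0, p_ref=32768, t=31.25, p=131072))  # 0.8
```

The first result is consistent with the "only 7500 points/processor" quoted above.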

The simulation pictured above is a watershed computation as
it is our first to exceed one million elements (2.95 M used)
and our first to reach roughly one billion gridpoints (0.988 B used).
The domain is a model of a 217-pin subassembly with wire-wrapped
pins. The pins partition the hexagonal canister into 438
(communicating) subchannels, each with a length Lz/Dh ~ 75, where
Dh is the hydraulic diameter of the subchannel. Thus, this
simulation is equivalent to LES of channel flow in a channel
of length Lz ~ 90000h, where h is the channel half-height,
save that this geometry is complex. Wires spiraling
around each pin serve not only to separate the pins but
also to induce interchannel mixing, thus mitigating local
hot spots.
The domain is bounded on six sides by canister walls and
is periodic in the axial direction with length
corresponding to a single period of the wire wrap.
Computational time on the IBM BG/P at the ALCF was provided
through the DOE Office of Science INCITE program.
The Nek5000 development and simulation effort is
supported by the DOE's Applied Mathematics Research and
AFCI Advanced Simulation and Modeling programs.
Realizing this degree of scalability required the following
developments:

A scalable spectral element multigrid solver for the pressure.

A fourth-generation coarse-grid solver (based on algebraic
multigrid).

Elimination of virtually all arrays scaling with global element count
(only two remain), which resulted in a memory-footprint reduction from
1 GB/proc to 90 MB/proc for P=65536.
These savings were enabled through James Lottes's
development of a custom, efficient, general-purpose all-to-all utility
based on the Crystal Router algorithm in the text of Fox et al. (1988).
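
To make the routing idea concrete, here is a minimal serial simulation of a crystal-router-style exchange. It is a sketch under stated assumptions and not the utility described above: it assumes P is a power of two, represents each message as a (destination, payload) pair, and the name `crystal_route` is illustrative. At each stage a rank exchanges with the partner whose number differs in one bit, forwarding every message whose destination lies on the partner's side of that hypercube dimension.

```python
def crystal_route(outboxes):
    """Simulate crystal-router-style routing on P = len(outboxes) ranks.

    outboxes[r] is a list of (dest, payload) pairs initially held by rank r.
    Returns (inboxes, stages), where inboxes[r] holds the messages for rank r.
    """
    nprocs = len(outboxes)
    # Assumption of this sketch: P is a power of two.
    assert nprocs > 0 and nprocs & (nprocs - 1) == 0
    held = [list(msgs) for msgs in outboxes]
    stages = 0
    bit = nprocs >> 1  # start with the highest hypercube dimension
    while bit:
        nxt = [[] for _ in range(nprocs)]
        for rank in range(nprocs):
            partner = rank ^ bit
            for dest, payload in held[rank]:
                # Keep the message if dest is on this rank's side of the
                # current dimension; otherwise hand it to the partner.
                target = rank if (dest & bit) == (rank & bit) else partner
                nxt[target].append((dest, payload))
        held = nxt
        bit >>= 1
        stages += 1
    return held, stages


# Example with 8 ranks, each sending two messages.
outboxes = [[((r + 3) % 8, f"from {r}"), ((5 * r) % 8, f"also from {r}")]
            for r in range(8)]
inboxes, stages = crystal_route(outboxes)
print(stages)  # 3
```

In this scheme each rank talks to only log2(P) partners, one per stage, and needs no array whose size grows with P. A message may take several hops before it arrives.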

Development of communication algorithms that can rapidly discover
and instantiate the required communication topology (again, using
the Crystal Router algorithm).
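
One common way to discover such a topology, shown here only as a serial sketch and not necessarily the scheme used in Nek5000, is a two-stage rendezvous. Each rank reports its global node ids to a "home" rank chosen by a simple hash (here `gid % P`, an assumption), and each home rank then tells every sharer which other ranks hold the same id. Both stages are all-to-all exchanges of the kind a crystal router can perform. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def discover_neighbors(local_ids):
    """local_ids[r] lists the global node ids present on rank r.

    Returns neighbors, where neighbors[r] maps each other rank to the
    sorted list of global ids it shares with rank r.
    """
    nprocs = len(local_ids)
    # Stage 1: every rank sends (gid, rank) to the home rank gid % P.
    home = [defaultdict(set) for _ in range(nprocs)]
    for rank, ids in enumerate(local_ids):
        for gid in ids:
            home[gid % nprocs][gid].add(rank)
    # Stage 2: each home rank informs every sharer of the other sharers.
    neighbors = [defaultdict(list) for _ in range(nprocs)]
    for table in home:
        for gid, sharers in table.items():
            for rank in sharers:
                for other in sharers:
                    if other != rank:
                        neighbors[rank][other].append(gid)
    return [{o: sorted(g) for o, g in nb.items()} for nb in neighbors]


# Four ranks arranged in a ring, sharing ids at their interfaces.
topology = discover_neighbors([[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [0, 7]])
print(topology[0])
```

No rank needs a global table: each one stores only the ids that hash to it and, afterwards, its own neighbor list.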

Development of a scalable partitioner. (Standard partitioners
were unable to meet our requirements.)
Our partitioner employs recursive spectral
bisection on the element graph, with element-to-element connectivity
measured by the number of vertices shared between adjacent elements.
This graph is consequently a "27-point" stencil, rather than 7-point,
which eliminates a large fraction of the disconnected subsets that
are problematic for mesh partitioning algorithms.
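
The following sketch illustrates the vertex-sharing graph and a single level of spectral bisection on a small structured box of hexahedral elements. It is illustrative only: the mesh generator and all function names are hypothetical, and the Fiedler vector is approximated with plain power iteration on (c*I - L) for brevity, a stand-in for whatever eigensolver a production partitioner would use. Applying `bisect` recursively to each half would give the full recursive bisection.

```python
from collections import defaultdict
from itertools import product


def hex_elements(nx, ny, nz):
    """Vertex-id sets for a structured nx*ny*nz box of hexahedral elements."""
    def vid(i, j, k):
        return i + (nx + 1) * (j + (ny + 1) * k)
    return [{vid(i + a, j + b, k + c) for a, b, c in product((0, 1), repeat=3)}
            for k, j, i in product(range(nz), range(ny), range(nx))]


def element_graph(elems):
    """weights[e][f] = number of vertices shared by elements e and f."""
    by_vertex = defaultdict(list)
    for e, verts in enumerate(elems):
        for v in verts:
            by_vertex[v].append(e)
    weights = [defaultdict(int) for _ in elems]
    for sharers in by_vertex.values():
        for e in sharers:
            for f in sharers:
                if e != f:
                    weights[e][f] += 1
    return weights


def fiedler_vector(weights, iters=2000):
    """Approximate Fiedler vector of the weighted graph Laplacian L (n >= 2)."""
    n = len(weights)
    deg = [sum(w.values()) for w in weights]
    c = 2.0 * max(deg)  # Gershgorin bound on the largest eigenvalue of L
    x = [float(e) for e in range(n)]
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]          # project out the constant vector
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]
        x = [c * x[e] - (deg[e] * x[e] - sum(w * x[f] for f, w in weights[e].items()))
             for e in range(n)]              # x <- (c*I - L) x
    return x


def bisect(weights):
    """Split elements into two equal halves by sorting on the Fiedler vector."""
    x = fiedler_vector(weights)
    order = sorted(range(len(x)), key=lambda e: x[e])
    half = len(order) // 2
    return sorted(order[:half]), sorted(order[half:])


# The center element of a 3x3x3 box touches 26 others (a 27-point stencil).
center = element_graph(hex_elements(3, 3, 3))[13]
print(len(center), sum(1 for w in center.values() if w == 4))  # 26 6

# One bisection of a 4x2x2 box cuts across the long direction.
part_a, part_b = bisect(element_graph(hex_elements(4, 2, 2)))
print(part_a, part_b)
```

Face neighbors share four vertices, edge neighbors two, and corner neighbors one, so the shared-vertex weights connect elements that a face-only (7-point) graph would treat as unrelated.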

Parallel I/O. We open and write a limited number of files in parallel
(typically 16-64). These files are written to distinct subdirectories
so that the number of files in the working directory is bounded, while
the number of files in each subdirectory scales only as the desired
number of outputs, independent of P or Q (the number of "writing" processors).
The processor space is divided into Q subsets, and each processor q' in
a given subset sends to the lowest numbered processor in that set, which
is designated as the I/O processor.
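
The rank-to-writer mapping can be sketched as follows. This assumes the Q subsets are contiguous, near-equal blocks of ranks, which the text does not specify. The function names and the directory and file naming pattern in `output_path` are hypothetical and chosen only for illustration.

```python
def io_rank(rank, nprocs, nwriters):
    """Lowest-numbered rank in the subset containing `rank`.

    Assumes ranks are split into `nwriters` contiguous, near-equal blocks.
    """
    subset = rank * nwriters // nprocs
    # Smallest r with r * nwriters // nprocs == subset, i.e. ceil(subset*P/Q).
    return (subset * nprocs + nwriters - 1) // nwriters


def output_path(subset, step):
    """Hypothetical layout: one subdirectory per writer, one file per output."""
    return f"out{subset:03d}/step{step:05d}.dat"


# With P = 8 and Q = 3, ranks 0, 3 and 6 act as the I/O processors.
print([io_rank(r, 8, 3) for r in range(8)])  # [0, 0, 0, 3, 3, 3, 6, 6]

# With P = 131072 and Q = 64, only 64 ranks ever touch the file system.
writers = {io_rank(r, 131072, 64) for r in range(131072)}
print(len(writers))  # 64
print(output_path(5, 12))  # out005/step00012.dat
```

With this layout the working directory holds Q subdirectories, and each subdirectory gains one file per output step regardless of P.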

Parallel visualization. We use the LLNL-based VisIt software
on ALCF's visualization platform, Eureka, which has direct
access to files generated on the BG/P.
VisIt has readily accommodated the billion-cell meshes resulting
from these runs.
The pictures below show some close-ups of the mesh and
a plot of the velocity magnitude near one of the bounding walls.




