I sometimes wonder about the relationship between my affection for a paper that I have written, the effort required to write it, and how much recognition the paper obtained. I think that the papers that took the most time are still the ones I like the best, even if sometimes they did not generate so much buzz in the community.
One paper that took a lot of time and of which I remain proud is "Parallel Algorithms for the Spectral Transform Method," which Pat Worley and I published in the SIAM Journal on Scientific Computing in 1997. Perhaps I just like it because it is my only publication in a SIAM journal. But I also think this must be one of the most thorough examinations ever performed of alternative approaches to parallelizing an important computational method.
The spectral transform method is (or at least was: it is less popular today) widely used in atmospheric modeling. A 3-D grid represents the state of the atmosphere (see figure: from Wikipedia). At each time step, the method performs a Fourier transform for each latitude and then a Legendre transform for each longitude. A weather or climate model must then compute the effect of so-called “physics” processes such as radiation transport, convection, and cloud physics, which can be calculated independently for each column. So there are basically three computational steps: Fourier, Legendre, and physics, each with strong data dependencies along a different dimension of the 3-D grid.
Alternative parallel algorithms are possible because the 3-D grid can be distributed over processors in different ways and at different times. When a transform step is applied, the data for each individual latitude (or longitude) and level can be on the same processor or be distributed over multiple processors. In the former case, the transform can proceed without communication; in the latter case, communication is required. In practice, the “physics” computations can only proceed when all data for each column is on the same processor. Thus, the programmer can choose either to retain a fixed mapping of data to processors (either a 1-D or 2-D decomposition in the horizontal), in which case one or both of the transforms require communication; or, alternatively, can transpose the 3-D grid from one mapping to another to avoid communication within each transform stage.
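To make the transpose option concrete, here is a minimal sketch in Python that simulates it on a single machine. It is a simplification of my own for illustration, not the code from the paper: the grid is only 2-D (rows standing in for latitudes, columns for longitudes, with vertical levels ignored), the "processors" are just list entries, the function names are hypothetical, and the grid dimensions are assumed to divide evenly by the number of processors.

```python
def row_block_distribute(grid, P):
    """Give each of P simulated processors a contiguous block of complete rows."""
    rows_per = len(grid) // P  # assumes len(grid) is divisible by P
    return [grid[p * rows_per:(p + 1) * rows_per] for p in range(P)]


def all_to_all_transpose(local_blocks):
    """Redistribute so each processor holds a block of complete columns.

    Returns the new per-processor blocks and the number of messages each
    processor would send (one to every other processor).
    """
    P = len(local_blocks)
    ncols = len(local_blocks[0][0])
    cols_per = ncols // P  # assumes ncols is divisible by P
    new_blocks = []
    for dest in range(P):
        block = []
        # dest collects its slice of columns from every source's rows
        for src in range(P):
            for row in local_blocks[src]:
                block.append(row[dest * cols_per:(dest + 1) * cols_per])
        new_blocks.append(block)
    return new_blocks, P - 1


# A 4x4 toy grid whose entries encode their (row, column) position.
grid = [[10 * r + c for c in range(4)] for r in range(4)]

by_rows = row_block_distribute(grid, 2)        # each row is local to one processor
by_cols, msgs = all_to_all_transpose(by_rows)  # each column is now local
```

With rows local, a transform along each row needs no communication; after the transpose, the same is true for a transform along each column. The price is the all-to-all exchange itself.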
These few variants result in a surprisingly large number of alternative parallel algorithms, with interesting tradeoffs. For example, when computing with D data on P processors, a parallel Fourier transform on distributed data must exchange D data in each of log P communication steps, whereas a simple all-to-all transpose, used to avoid this communication, must exchange D/P data in each of P-1 steps. The first approach communicates more data in fewer messages; the second, less data in more messages. Furthermore, each algorithm can be implemented in different ways, and the performance of each variant can vary greatly depending on such factors as problem size, number of nodes, and interconnection network architecture on the target parallel computer. Thus, choosing the best parallel algorithm for a specific problem and computer is not easy.
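The tradeoff can be sketched with a toy cost model. The snippet below is an illustration under stated assumptions, not the models from the paper: it assumes each message costs a fixed startup time `t_s` plus `t_w` per data item (both names and all numeric values here are hypothetical), takes the logarithm to be base 2, and uses only the step counts and message sizes quoted above.

```python
import math


def distributed_fft_comm_time(D, P, t_s, t_w):
    """log2(P) communication steps, each exchanging D data items."""
    return math.log2(P) * (t_s + t_w * D)


def transpose_comm_time(D, P, t_s, t_w):
    """P - 1 communication steps, each exchanging D / P data items."""
    return (P - 1) * (t_s + t_w * D / P)


D, P = 1024, 16

# Cheap message startup: the transpose's smaller total volume wins.
low_latency = (distributed_fft_comm_time(D, P, t_s=100, t_w=1),
               transpose_comm_time(D, P, t_s=100, t_w=1))

# Expensive message startup: the distributed transform's few messages win.
high_latency = (distributed_fft_comm_time(D, P, t_s=10000, t_w=1),
                transpose_comm_time(D, P, t_s=10000, t_w=1))
```

Even this crude model shows the ranking of the two approaches flipping as the ratio of startup cost to per-item cost changes, which is why the best choice depends on the target machine.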
What Pat and I did in this paper was to apply a method that I expounded upon in my parallel programming book. First, develop simple performance models for algorithmic variants that relate performance to key parameters such as problem size, number of processors, and the time required to transmit a message of a certain size (e.g., see Table). Second, use benchmark experiments to calibrate communication cost parameters. Third, compare a real parallel code (in this case, a 3-D and parallel version of Jim Hack and Rudi Jakob’s Spectral Transform Shallow Water Model, with the many different parallel algorithms included) to the models in many different situations. Our models predicted actual execution time with considerable accuracy in most cases, suggesting that the models can then be used to understand performance in other situations.
I continue to think that this approach, which we might today call application-algorithm-computer co-design (because such models make it easy to understand the consequences of changing any feature of application, algorithm, or computer), is a useful way to design and build parallel programs.
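As a rough illustration of the calibration step, the sketch below fits a two-parameter message cost model, time = `t_s` + `t_w` × size, to timing measurements by ordinary least squares. The linear form, the function name, and the parameter names are my assumptions for this example, and the "measurements" are synthetic values generated from known parameters, not results from our experiments.

```python
def fit_linear_cost(sizes, times):
    """Least-squares fit of times = t_s + t_w * sizes; returns (t_s, t_w)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    t_w = sxy / sxx                # slope: cost per data item
    t_s = mean_y - t_w * mean_x    # intercept: per-message startup cost
    return t_s, t_w


# Synthetic "benchmark" data generated from t_s = 50, t_w = 0.5 (made up).
sizes = [64, 256, 1024, 4096]
times = [50.0 + 0.5 * s for s in sizes]

t_s, t_w = fit_linear_cost(sizes, times)
```

In a real calibration the times would come from ping-pong or similar benchmarks on the target machine, and the fitted parameters would then be plugged into the performance model for each algorithmic variant.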
This paper was also my first exposure to the often lackadaisical pace of publication in computing. We submitted our paper in April 1994 and it finally appeared three years later. Amusingly, none of the supercomputers on which we ran our experiments were still in service by the time the paper was published. Of course, the principles and the models continue to be valid—just not the results of our parameter calibration experiments.