Abstract

Programs often exhibit more parallelism than is actually available in the target architecture. This thesis introduces and evaluates three methods ({\em loop unrolling}, {\em loop common expression elimination}, and {\em loop differencing}) for automatically transforming a parallel algorithm into a less parallel one that exploits only the parallelism available at run time. The resulting program performs less computation to produce its results; the improvement in running time does not come merely from second-order effects such as better use of the memory hierarchy or reduced overhead (though such optimizations can improve performance further). The asymptotic complexity is usually not reduced, but the constant factors can be lowered significantly, often by a factor of 4 or more. The basis for these methods is the detection of {\em loop common expressions}: common subexpressions that appear in different iterations of a parallel loop. The loop differencing method also permits computation of just the change in an expression's value from iteration to iteration.

We define the class of {\em generalized stencil computations}, in which loop common expressions can easily be found; each result combines $w$ operands, so a naive implementation requires $w$ operand evaluations and $w-1$ combining operations per result. Unrolling, together with the two-phase common subexpression elimination algorithm that we introduce, which significantly outperforms other common subexpression elimination algorithms, can reduce this cost to fewer than 2 operand evaluations and 3 combining operations per result. Loop common expression elimination decreases these costs to 1 and $\log w$, respectively; when combined with unrolling, they drop to 1 operand evaluation and 4 combining operations per result. Loop differencing reduces the per-result costs to 2 operand evaluations and 2 combining operations. We discuss the tradeoffs among these techniques and when each should be applied.

Such speedups are achievable because, while the maximally parallel implementation of an algorithm achieves the greatest speedup on a parallel machine with sufficiently many processors, it may be inefficient when run on a machine with too few processors. Serial implementations, on the other hand, run faster on single-processor computers but often contain dependences that prevent parallelization. Our methods combine the efficiency of good serial algorithms with the ease of writing, reading, debugging, and detecting parallelism in high-level programs.

Our three methods are primarily applicable to MIMD and SIMD implementations of data-parallel languages when the data set size exceeds the number of processors (including uniprocessor implementations), but they can also improve the performance of parallel programs without serializing them. The methods may be applied as an optimization in a parallelizing compiler after a serial program's parallelism has been exposed, and they are also applicable to some purely serial programs that manipulate arrays or other structured data.

The techniques have been implemented, and preliminary timing results are reported. Real-world computations are used as examples throughout, and an appendix lists further potential applications.
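
To make the per-result costs concrete, consider a windowed sum over an array, one of the simplest generalized stencil computations: each result combines $w$ consecutive elements with addition. The following minimal sketch in C (not code from the thesis; it assumes $w$ is a power of two, and the function and array names are illustrative only) shows how a doubling scheme shares half-width partial sums across iterations, reaching the 1 operand evaluation and $\log w$ combining operations per result cited above:

\begin{verbatim}
/* Windowed sums of width w (a power of two) over a[0..n-1],
   sharing half-width partial sums across iterations.  Each pass
   costs one addition per result and there are log2(w) passes,
   so each result costs 1 operand evaluation (the initial copy)
   plus log2(w) additions.  sum[] must have n elements. */
void windowed_sum_doubling(const double *a, double *sum, int n, int w)
{
    for (int i = 0; i < n; i++)
        sum[i] = a[i];                /* 1 operand evaluation */
    for (int width = 1; width < w; width *= 2)
        for (int i = 0; i + 2 * width <= n; i++)
            sum[i] += sum[i + width]; /* reuse half-width sums */
    /* Now sum[i] == a[i] + ... + a[i+w-1] whenever i + w <= n. */
}
\end{verbatim}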
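
Loop differencing, by contrast, computes each result from its predecessor by adding the element that enters the window and removing the one that leaves; this presumes the combining operator has an inverse (for addition, subtraction). A sketch under the same illustrative assumptions:

\begin{verbatim}
/* Windowed sums over a[0..n-1] via loop differencing.  After the
   first window, each result costs 2 operand evaluations and
   2 combining operations (one subtraction, one addition).
   Requires w <= n; result[] must have n - w + 1 elements. */
void windowed_sum_differencing(const double *a, double *result,
                               int n, int w)
{
    double s = 0.0;
    for (int j = 0; j < w; j++)       /* first window, computed naively */
        s += a[j];
    result[0] = s;
    for (int i = 1; i + w <= n; i++) {
        s = s - a[i - 1] + a[i + w - 1]; /* drop leaving element,
                                            add entering element */
        result[i] = s;
    }
}
\end{verbatim}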