Abstract: | A high-performance implementation is presented for three kernel routines commonly found in element-by-element preconditioned conjugate gradient finite element codes. These routines form the element stiffness matrices, form the element load vectors (or, for a non-linear problem, the element residual vectors), and apply element matrix–vector products. The present study considers tensor product elements of arbitrary mapping in 2-D, although the generalization to triangular and serendipity elements is straightforward. The implementation is most appropriate for high-p finite element methods, where the element matrices are relatively large and dense. The result is a set of high-performance kernels for superscalar architectures, on which these routines may otherwise be limited by memory bandwidth. Performance studies are presented for a representative superscalar microprocessor, the Intel i860. Because microprocessors of this type are at the heart of modern workstations and of several parallel supercomputing systems, this work is relevant across a variety of platforms. The resulting kernels yield both high performance on a variety of sequential architectures and a high degree of code portability through the Basic Linear Algebra Subprograms (BLAS). |
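As an illustration of the third kernel, the element matrix–vector product, the following is a minimal sketch of the generic gather, dense multiply, scatter-add pattern. It is not the implementation described in the paper. The function name `ebe_matvec`, the 1-D two-node example mesh, and the data layout are assumptions made only for this example. In a performance-oriented code the dense product in the inner step would typically be delegated to a BLAS routine such as DGEMV rather than written as an explicit loop.

```python
def ebe_matvec(element_matrices, connectivity, x):
    """Compute y = K x without assembling the global matrix K.

    element_matrices: one dense square matrix (list of rows) per element
    connectivity:     one tuple of global node indices per element
    x:                global vector
    (Hypothetical layout, chosen for illustration only.)
    """
    y = [0.0] * len(x)
    for k_e, nodes in zip(element_matrices, connectivity):
        # Gather: pull the element's local entries out of the global vector.
        x_e = [x[n] for n in nodes]
        # Dense element product y_e = K_e x_e (the step a BLAS call would replace).
        y_e = [sum(k_ij * x_j for k_ij, x_j in zip(row, x_e)) for row in k_e]
        # Scatter-add: accumulate the element contribution into the global result.
        for n, value in zip(nodes, y_e):
            y[n] += value
    return y


# Assumed example: two 1-D linear elements sharing node 1.
k_line = [[1.0, -1.0], [-1.0, 1.0]]
element_matrices = [k_line, k_line]
connectivity = [(0, 1), (1, 2)]
y = ebe_matvec(element_matrices, connectivity, [1.0, 2.0, 4.0])
```

For this small mesh the result agrees with multiplying by the assembled matrix [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]. For high-p elements the element matrices are much larger and denser, which is where a dense BLAS kernel pays off.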