VAX 6000 Series Vector Processor Programmer 's Guide Order Number: EK-60VAA-PG-001 This manual is intended for system and application programmers writing programs for the VAX 6000 system with a vector processor. Digital Equipment Corporation ------------------------------------------------------------ First Printing, June 1990 ------------------------------------------------------------ The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document. Any software described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license. No responsibility is assumed for the use or reliability of software or equipment that is not supplied by Digital Equipment Corporation or its affiliated companies. Restricted Rights: Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013. ------------------------------------------------------------ © Digital Equipment Corporation 1990. All rights reserved. Printed in U.S.A. ------------------------------------------------------------ The Reader's Comments form at the end of this document requests your critical evaluation to assist in preparing future documentation. The following are trademarks of Digital Equipment Corporation: DEC DIBOL UNIBUS DEC/CMS EduSystem VAX DEC/MMS IAS VAXcluster DECnet MASSBUS VMS DECsystem-10 PDP VT DECSYSTEM-20 PDT DECUS RSTS DECwriter RSX dt (a) This document was prepared with VAX DOCUMENT, Version 1.2. ------------------------------------------------------------ Contents ------------------------------------------------------------ PREFACE ix ------------------------------------------------------------ CHAPTER 1 VECTOR PROCESSING CONCEPTS 1-1 ------------------------------------------------------------ 1.1 SCALAR VS. VECTOR PROCESSING 1-2 1.1.1 Vector Processor Defined ------------------------------------------------------------ 1-3 1.1.2 Vector Operations ------------------------------------------------------------ 1-3 1.1.3 Vector Processor Advantages ------------------------------------------------------------ 1-6 ------------------------------------------------------------ 1.2 TYPES OF VECTOR PROCESSORS 1-6 1.2.1 Attached vs. Integrated Vector Processors ------------------------------------------------------------ 1-6 1.2.2 Memory vs. 
Register Integrated Vector Processors ------------------------------------------------------------ 1-8 ------------------------------------------------------------ 1.3 VECTORIZING COMPILERS 1-8 ------------------------------------------------------------ 1.4 VECTOR REGISTERS 1-9 ------------------------------------------------------------ 1.5 PIPELINING 1-11 ------------------------------------------------------------ 1.6 STRIPMINING 1-14 ------------------------------------------------------------ 1.7 STRIDE 1-15 ------------------------------------------------------------ 1.8 GATHER AND SCATTER INSTRUCTIONS 1-17 ------------------------------------------------------------ 1.9 COMBINING VECTOR OPERATIONS TO IMPROVE EFFICIENCY 1-18 1.9.1 Instruction Overlap ------------------------------------------------------------ 1-18 1.9.2 Chaining ------------------------------------------------------------ 1-18 ------------------------------------------------------------ 1.10 PERFORMANCE 1-19 1.10.1 Amdahl's Law ------------------------------------------------------------ 1-19 1.10.2 Vectorization Factor ------------------------------------------------------------ 1-21 iii Contents 1.10.3 Crossover Point ------------------------------------------------------------ 1-22 ------------------------------------------------------------ CHAPTER 2 VAX 6000 SERIES VECTOR PROCESSOR 2-1 ------------------------------------------------------------ 2.1 OVERVIEW 2-2 ------------------------------------------------------------ 2.2 BLOCK DIAGRAM 2-3 ------------------------------------------------------------ 2.3 VECTOR CONTROL UNIT 2-5 ------------------------------------------------------------ 2.4 ARITHMETIC UNIT 2-5 2.4.1 Vector Register File Chip ------------------------------------------------------------ 2-6 2.4.2 Vector Floating-Point Unit Chip ------------------------------------------------------------ 2-7 ------------------------------------------------------------ 2.5 LOAD/STORE UNIT 2-7 ------------------------------------------------------------ 2.6 VECTOR PROCESSOR REGISTERS 2-9 ------------------------------------------------------------ 2.7 MEMORY MANAGEMENT 2-11 2.7.1 Translation-Not-Valid Fault ------------------------------------------------------------ 2-11 2.7.2 Modify Flows ------------------------------------------------------------ 2-11 2.7.3 Memory Management Fault Priorities ------------------------------------------------------------ 2-12 2.7.4 Address Space Translation ------------------------------------------------------------ 2-12 2.7.5 Translation Buffer ------------------------------------------------------------ 2-12 ------------------------------------------------------------ 2.8 CACHE MEMORY 2-13 2.8.1 Cache Organization ------------------------------------------------------------ 2-13 2.8.2 Cache Coherency ------------------------------------------------------------ 2-16 ------------------------------------------------------------ 2.9 VECTOR PIPELINING 2-17 2.9.1 Vector Issue Unit ------------------------------------------------------------ 2-17 2.9.2 Load/Store Unit ------------------------------------------------------------ 2-18 2.9.3 Arithmetic Unit ------------------------------------------------------------ 2-19 ------------------------------------------------------------ 2.10 INSTRUCTION EXECUTION 2-21 iv Contents ------------------------------------------------------------ CHAPTER 3 OPTIMIZING WITH MACRO-32 3-1 ------------------------------------------------------------ 
3.1 VECTORIZATION 3-2 3.1.1 Using Vectorization Alone ------------------------------------------------------------ 3-2 3.1.2 Combining Decomposition with Vectorization ------------------------------------------------------------ 3-3 3.1.3 Algorithms ------------------------------------------------------------ 3-5 ------------------------------------------------------------ 3.2 CROSSOVER POINT 3-5 ------------------------------------------------------------ 3.3 SCALAR/VECTOR SYNCHRONIZATION 3-6 3.3.1 Scalar/Vector Instruction Synchronization (SYNC) ------------------------------------------------------------ 3-6 3.3.2 Scalar/Vector Memory Synchronization ------------------------------------------------------------ 3-7 3.3.2.1 Memory Instruction Synchronization (MSYNC) · 3-8 3.3.2.2 Memory Activity Completion Synchronization (VMAC) · 3-9 3.3.3 Memory Synchronization Within the Vector Processor (VSYNC) ------------------------------------------------------------ 3-9 3.3.4 Exceptions ------------------------------------------------------------ 3-10 3.3.4.1 Imprecise Exceptions · 3-10 3.3.4.2 Precise Exceptions · 3-11 ------------------------------------------------------------ 3.4 INSTRUCTION FLOW 3-11 3.4.1 Load Instruction ------------------------------------------------------------ 3-12 3.4.2 Store Instruction ------------------------------------------------------------ 3-13 3.4.3 Memory Management Okay (MMOK) ------------------------------------------------------------ 3-14 3.4.4 Gather/Scatter Instructions ------------------------------------------------------------ 3-14 3.4.5 Masked Load/Store, Gather/Scatter Instructions ------------------------------------------------------------ 3-15 ------------------------------------------------------------ 3.5 OVERLAP OF ARITHMETIC AND LOAD/STORE INSTRUCTIONS 3-15 3.5.1 Maximizing Instruction Execution Overlap ------------------------------------------------------------ 3-16 ------------------------------------------------------------ 3.6 OUT-OF-ORDER INSTRUCTION EXECUTION 3-18 ------------------------------------------------------------ 3.7 CHAINING 3-20 ------------------------------------------------------------ 3.8 CACHE 3-21 ------------------------------------------------------------ 3.9 STRIDE/TRANSLATION BUFFER MISS 3-22 ------------------------------------------------------------ v Contents 3.10 REGISTER REUSE 3-25 ------------------------------------------------------------ APPENDIX A ALGORITHM OPTIMIZATION EXAMPLES A-1 ------------------------------------------------------------ A.1 EQUATION SOLVERS A-2 ------------------------------------------------------------ A.2 SIGNAL PROCESSING--FAST FOURIER TRANSFORMS A-7 A.2.1 Optimized One-Dimensional Fast Fourier Transforms ------------------------------------------------------------ A-7 A.2.2 Optimized Two-Dimensional Fast Fourier Transforms ------------------------------------------------------------ A-9 ------------------------------------------------------------ GLOSSARY ------------------------------------------------------------ INDEX ------------------------------------------------------------ EXAMPLES 3-1 Overlapped Load and Arithmetic Instructions ------------------------------------------------------------ 3-16 3-2 Maximizing Instruction Execution Overlap ------------------------------------------------------------ 3-17 3-3 Effects of Register Conflict ------------------------------------------------------------ 3-18 3-4 Deferred Arithmetic Instruction Queue 
------------------------------------------------------------ 3-19 3-5 A Load Stalled due to an Arithmetic Instruction ------------------------------------------------------------ 3-19 3-6 Use of the Deferred Arithmetic Instruction Queue ------------------------------------------------------------ 3-20 3-7 Example of Chain Into Store ------------------------------------------------------------ 3-21 3-8 Matrix Multiply--Basic ------------------------------------------------------------ 3-24 3-9 Matrix Multiply--Improved ------------------------------------------------------------ 3-24 3-10 Matrix Multiply--Optimal ------------------------------------------------------------ 3-26 A-1 Core Loop of a BLAS 1 Routine Using Vector-Vector Operations ------------------------------------------------------------ A-3 A-2 Core Loop of a BLAS 2 Routine Using Matrix-Vector Operations ------------------------------------------------------------ A-5 A-3 Core Loop of a BLAS 3 Routine Using Matrix-Matrix Operations ------------------------------------------------------------ A-6 vi Contents ------------------------------------------------------------ FIGURES 1-1 Scalar vs. Vector Processing ------------------------------------------------------------ 1-5 1-2 Vector Registers ------------------------------------------------------------ 1-10 1-3 Vector Function Units ------------------------------------------------------------ 1-11 1-4 Pipelining a Process ------------------------------------------------------------ 1-12 1-5 Constant-Strided Vectors in Memory ------------------------------------------------------------ 1-16 1-6 Random-Strided Vectors in Memory ------------------------------------------------------------ 1-16 1-7 Vector Gather and Scatter Instructions ------------------------------------------------------------ 1-17 1-8 Computer Performance Dominated by Slowest Process ------------------------------------------------------------ 1-20 1-9 Computer Performance vs. 
Vectorized Code ------------------------------------------------------------ 1-21 2-1 Scalar/Vector Pair Block Diagram ------------------------------------------------------------ 2-3 2-2 FV64A Vector Processor Block Diagram ------------------------------------------------------------ 2-4 2-3 Vector Count, Vector Length, Vector Mask, and Vector Registers ------------------------------------------------------------ 2-10 2-4 Virtual Address Format ------------------------------------------------------------ 2-11 2-5 Address/Data Flow in Load/Store Pipeline ------------------------------------------------------------ 2-13 2-6 Cache Arrangement ------------------------------------------------------------ 2-14 2-7 Physical Address Division ------------------------------------------------------------ 2-14 2-8 Main Tag Memory Organization ------------------------------------------------------------ 2-15 2-9 Data Cache Logical Organization ------------------------------------------------------------ 2-15 2-10 Vector Processor Units ------------------------------------------------------------ 2-17 2-11 Vector Arithmetic Unit ------------------------------------------------------------ 2-20 A-1 Linpack Performance Graph, Double-Precision BLAS Algorithms ------------------------------------------------------------ A-4 A-2 Cooley-Tukey Butterfly Graph, One-Dimensional Fast Fourier Transform for N = 16 ------------------------------------------------------------ A-8 A-3 Optimized Cooley-Tukey Butterfly Graph, One-Dimensional Fast Fourier Transform for N = 16 ------------------------------------------------------------ A-9 A-4 One-Dimensional Fast Fourier Transform Performance Graph, Optimized Single-Precision Complex Transforms ------------------------------------------------------------ A-10 A-5 Two-Dimensional Fast Fourier Transforms Using N Column and N Row One-Dimensional Fast Fourier Transforms ------------------------------------------------------------ A-10 A-6 Two-Dimensional Fast Fourier Transforms Using a Matrix Transpose Between Each Set of N Column One-Dimensional Fast Fourier Transforms ------------------------------------------------------------ A-11 A-7 Two-Dimensional Fast Fourier Transform Performance Graph, Optimized Single-Precision Complex Transforms ------------------------------------------------------------ A-12 vii Contents ------------------------------------------------------------ TABLES 2-1 Memory Management Fault Prioritization ------------------------------------------------------------ 2-12 3-1 Qualifier Combinations for Parallel Vector Processing ------------------------------------------------------------ 3-3 viii ------------------------------------------------------------ Preface ------------------------------------------------------------ Intended Audience This manual is for the system or application programmer of a VAX 6000 system with a vector processor. ------------------------------------------------------------ Document Structure This manual has three chapters and an appendix, as follows: · Chapter 1, Vector Processing Concepts, describes basic vector concepts and how vector processing differs from scalar processing. · Chapter 2, VAX 6000 Series Vector Processor, gives an overview of the vector coprocessor and related vector features. · Chapter 3, Optimizing with MACRO-32, using MACRO-32 and FORTRAN programming examples, illustrates particular programming techniques that take advantage of the high performance of the VAX 6000 series vector processor. 
· Appendix A, Algorithm Optimization Examples, provides examples of optimization in two application areas: equation solvers and signal processing. · A Glossary and Index provide additional reference support. ix Preface ------------------------------------------------------------ VAX 6000 Series Documents Documents in the VAX 6000 series documentation set include: ------------------------------------------------------------ Title Order Number ------------------------------------------------------------ VAX 6000-400 Installation Guide EK-640EA-IN VAX 6000-400 Owner's Manual EK-640EA-OM VAX 6000-400 Mini-Reference EK-640EA-HR VAX 6000-400 System Technical User's Guide EK-640EB-TM VAX 6000-400 Options and Maintenance EK-640EB-MG VAX 6000 Series Upgrade Manual EK-600EB-UP VAX 6000 Series Vector Processor Owner's Manual EK-60VAA-OM VAX 6000 Series Vector Processor Programmer's Guide EK-60VAA-PG ------------------------------------------------------------ ------------------------------------------------------------ Associated Documents Other documents that you may find useful include: ------------------------------------------------------------ Title Order Number ------------------------------------------------------------ CIBCA User Guide EK-CIBCA-UG DEBNI Installation Guide EK-DEBNI-IN Guide to Maintaining a VMS System AA-LA34A-TE Guide to Setting Up a VMS System AA-LA25A-TE HSC Installation Manual EK-HSCMN-IN H4000 DIGITAL Ethernet Transceiver Installation Manual EK-H4000-IN H7231 Battery Backup Unit User's Guide EK-H7231-UG Installing and Using the VT320 Video Terminal EK-VT320-UG Introduction to VMS System Management AA-LA24A-TE KDB50 Disk Controller User's Guide EK-KDB50-UG RA90 Disk Drive User Guide EK-ORA90-UG RV20 Optical Disk Owner's Manual EK-ORV20-OM x Preface ------------------------------------------------------------ Title Order Number ------------------------------------------------------------ SC008 Star Coupler User's Guide EK-SC008-UG TK70 Streaming Tape Drive Owner's Manual EK-OTK70-OM TU81/TA81 and TU81 PLUS Subsystem User's Guide EK-TUA81-UG ULTRIX-32 Guide to System Exercisers AA-KS95B-TE VAX Architecture Reference Manual EY-3459E-DP VAX FORTRAN Performance Guide AA-PB75A-TE VAX Systems Hardware Handbook -- VAXBI Systems EB-31692-46 VAX Vector Processing Handbook EC-H0419-46 VAXBI Expander Cabinet Installation Guide EK-VBIEA-IN VAXBI Options Handbook EB-32255-46 Vector Processing Concepts Course EY-9876E-SG VMS Installation and Operations: VAX 6000 Series AA-LB36B-TE VMS Networking Manual AA-LA48A-TE VMS System Manager's Manual AA-LA00A-TE VMS VAXcluster Manual AA-LA27A-TE VMS Version 5.4 New and Changed Features Manual AA-MG29C-TE ------------------------------------------------------------ xi 1 ------------------------------------------------------------ Vector Processing Concepts This chapter presents a brief overview of vector processing concepts. Sections include: · Scalar vs. Vector Processing · Types of Vector Processors · Vectorizing Compilers · Vector Registers · Pipelining · Stripmining · Stride · Gather and Scatter Instructions · Combining Vector Operations to Improve Efficiency · Performance 1-1 Vector Processing Concepts ------------------------------------------------------------ 1.1 SCALAR VS. VECTOR PROCESSING Vector processing is a way to increase computer performance over that of a general-purpose computer for certain scientific applications. 
These include image processing, weather forecasting, and other applications that involve repeated operations on groups, or arrays, of elements. A vector processor is a computer optimized to execute the same instruction repeatedly. For example, consider the process of adding 50 to a set of 100 numbers. The advantage of a vector processor is its ability to perform this operation with a single instruction, thus saving significant processing time. In computer processors, a vector is a list of numbers, a set of data, or an array. A scalar is any single data item, having one value. A scalar processor is a traditional central processing unit (CPU) that performs operations on scalar numbers in sequential steps. These types of computers are known as single-instruction/single-data (SISD) computers because a single instruction can process only one data item at a time. A list of elements can be placed in an array. The array is defined by giving each element and its location. For example, a12 is the value located at row 1, column 2. The dimensions of the array are m and n. An array element is a single value in an array, such as a12 below.

    a11  a12  . . .  a1n
    a21  a22  . . .  a2n
     .    .           .
     .    .           .
     .    .           .
    am1  am2  . . .  amn

A one-dimensional array consists of all elements in a single row or single column. A one-dimensional array can be expressed as a single capital letter such as A, B, or C. Collectively, the elements within a vector are noted by A(I), B(I), C(I), and so forth. Example: B and C are vectors, where B = (-3, 0, 2) and C is the column vector with elements 9 and -5. The elements B(I) are -3, 0, and 2. The elements C(I) are 9 and -5. 1-2 Vector Processing Concepts ------------------------------------------------------------ 1.1.1 Vector Processor Defined A vector processor is a computer that operates on an entire vector with a single vector instruction. These types of computers are known as single-instruction/multiple-data (SIMD) computers because a single vector instruction can process a stream of data. A traditional scalar computer typically operates only on scalar values and must therefore process vectors sequentially. Since processing by a vector computer involves the concurrent execution of multiple arithmetic or logical operations, vectors can be processed many times faster than with a traditional computer using only scalar instructions. ------------------------------------------------------------ 1.1.2 Vector Operations A computer with vector processing capabilities does not automatically provide an increase in performance for all applications. The benefits of vector processing depend, to a large degree, on the specific techniques and algorithms of the application, as well as the characteristics of the vector-processing hardware. Operations can be converted to code to be run on a vector processor if they are identical operations on corresponding elements of data. That is, each operation is independent of the previous step, as follows:

    A(1) = B(1) + C(1)
    A(2) = B(2) + C(2)
    A(3) = B(3) + C(3)
    .
    .
    .
    A(n) = B(n) + C(n)

To create the vector, A(1:n), the same function [adding B(i) to C(i)] is performed on different elements of data, where i = 1,2,3 ... n. Notice that the equation for A(3) does not depend on A(1) or A(2). Therefore, all these equations could be sent to a vector processor to be solved using one instruction. Two vectors can be added together if both vectors are of the same order; that is, each vector has the same number of elements, n. The sum of two vectors is found by adding their corresponding elements.
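In FORTRAN terms, this independence is what allows a vectorizing compiler to replace an ordinary loop with vector instructions. The following is a minimal sketch; the subroutine name, array names, and length N are illustrative only:

      SUBROUTINE VECADD (A, B, C, N)
C     Element-wise add: every iteration is independent of the others,
C     so a vectorizing compiler can turn the loop into vector loads,
C     a single vector add, and a vector store.
      INTEGER N, I
      REAL A(N), B(N), C(N)
      DO 10 I = 1, N
         A(I) = B(I) + C(I)
   10 CONTINUE
      RETURN
      END

By contrast, a recurrence such as A(I) = A(I-1) + B(I) is not independent across iterations and cannot be converted to a single vector add.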
For example, if B = [2, -1, -3] and C = [3, 5, 0], then A =[2+3, -1+5, -3+0] = [5, 4, -3] 1-3 Vector Processing Concepts A scalar processor operates on single quantities of data. Consider the following operation: A(I) = B(I) + 50. As illustrated in Figure 1-1, five separate instructions must be performed, using one instruction per unit of time, for each value from 1 to I max (some CPUs may combine steps and use fewer units of time): 1 Load first element from location B. 2 Add 50 to the first element. 3 Store the result in location A. 4 Increment the counter. 5 Test the counter for index I max . If I max is reached, the operation is complete. If not, steps 1 through 5 are repeated. To calculate A(I) for 100 elements using these instructions (5 X 100), or 500 scalar instructions, takes 500 units of computer time. Since a vector processor operates on complete vectors of independent data at the same time, only three instructions are needed to perform the same operation A(I) using a vector processor: 1 Load the array from memory into the vector processor register B. 2 Add 50 to all elements in the array, placing the results in register A. 3 Store the entire vector back into memory. The flow of data optimizes the use of memory and reduces the overhead to perform each operation. Within the vector processor, much the same processing may occur as in the scalar processor, but the vector processor is optimized to do it faster. It is important to remember that vector operations generate the same result as scalar operations. 1-4 Vector Processing Concepts ------------------------------------------------------------ Figure 1-1 Scalar vs. Vector Processing ------------------------------------------------------------ ------------------------------------------------------------ 1-5 Vector Processing Concepts ------------------------------------------------------------ 1.1.3 Vector Processor Advantages Vector processors have the following advantages: · A vector processor can use the full bandwidth of the memory system for loading and storing an array. Unlike the scalar processor, which accepts single values at a time from memory, the vector processor accepts any number up to its limit, say 64 elements, at a time. The vector processor processes all the elements together and returns them to memory. · The vector processor eliminates the need to check the array index as often. Since all values, up to the vector processor limit, are operated upon at the same time, the vector processor does not have to check the index for each element and each operation. · The vector processor can free the scalar processor to do further scalar operations. While the vector processor is doing operations other than transferring data to or from memory, the scalar processor can do other functions. This process is in contrast to a scalar processor performing a math operation where the scalar processor must wait until the calculation is complete before proceeding. · The vector processor runs certain types of applications very fast, since it can be optimized for particular types of calculations. ------------------------------------------------------------ 1.2 TYPES OF VECTOR PROCESSORS Vector processors are classified according to two basic criteria: · How closely coupled they are to their scalar coprocessor--whether they are attached or integrated · How they retrieve vector data--whether they are memory or register processors ------------------------------------------------------------ 1.2.1 Attached vs. 
Integrated Vector Processors In general, there are two types of vector processors: attached and integrated. An attached vector processor (also known as an array processor) consists of auxiliary hardware attached to a host system that consists of some number of scalar processors. An attached vector processor, which generally has its own memory and instruction set, can also access data residing in the host main memory. It is typically attached by a standard I/O bus and is treated by a host processor as an 1-6 Vector Processing Concepts I/O device, controlled under program direction through special registers and operating asynchronously from the host. Program data is moved back and forth between the attached processor and the host with standard I/O operations. The host processor requires no special internal hardware to use an attached vector processor. There is no "pairing" of a host processor to an attached vector processor. A system can have multiple host scalar processors and one attached vector processor. Some systems can also have one host processor and a number of attached vector processors, all driven by a program executing on the host. Because it runs in parallel with its host scalar CPU, an attached vector processor can give good performance for the proper applications. However, attached vector processors can be difficult to program, and the need to use I/O operations to transfer program data can result in very high overhead when transferring data between processors. If the data format of the attached processor is different from that of the host system, input and output conversion of the data files will be required. To perform well on an attached vector processor, an application must have a high percentage of vector operations that need no I/O support from the host. Also, the computational time of those vector operations should be long compared to any required I/O operations. An integrated vector processor, on the other hand, consists of a coprocessor that is tightly coupled with its host scalar processor; the two processors are considered a pair. The scalar processor is specifically designed to support its vector coprocessor, and the vector processor instruction set is implemented as part of the host's native instruction set. The two processors share the same memory and transfer program instructions and data over a dedicated high-speed internal path. They may also share special CPU resources such as cache or translation buffer entries. Since they share a common memory, no I/O operations are needed to transfer data between them. Thus, programs with a high ratio of data access to arithmetic operations will perform more efficiently on an integrated vector processor than on an attached vector processor. An integrated vector processor can run synchronously or asynchronously with its scalar coprocessor, depending on the hardware implementation. When the scalar processor fetches and decodes a vector instruction, it passes the instruction to the vector processor. At that point, the scalar processor can either wait for the vector processor to complete the instruction, or it can continue executing and synchronize with the vector processor at a later time. Integrated processors that have this 1-7 Vector Processing Concepts ability to overlap vector and scalar operations can give better performance than those that do not. ------------------------------------------------------------ 1.2.2 Memory vs. 
Register Integrated Vector Processors There are two types of integrated vector processor architectures: memory- to-memory and register-to-register. In a memory-to-memory architecture, vector data is fetched directly from memory into the function units of the vector processing unit. Once the data is operated on, the results are returned directly to memory. With a register-to-register (or load/store) architecture, vector data is first loaded from memory into a set of high-speed registers. From there it is moved into the function units and operated on. The resulting data is not returned to the registers until all operations are complete, at which point the vector data is stored back in memory. For applications that use very long vectors (on the order of thousands of elements), a memory-to-memory architecture works quite well. Once the overhead involved in starting the vector operation is completed, results can be produced at the rate of one element per cycle. On the other hand, with a register-to-register architecture, only a limited segment of the array can be processed at once, and the load/store overhead (or latency) must be paid over and over. With long vectors, this overhead can reduce the performance advantage of high-speed registers. However, several hardware techniques can be implemented by a register- to-register architecture that can help amortize this load/store overhead. By using techniques such as chaining and instruction overlap, multiple operations can be executed concurrently on the same set of vector data while that data is still in the vector registers. Intermediate (temporary) values need not be returned to memory. Such techniques are not possible with a memory-to-memory architecture. ------------------------------------------------------------ 1.3 VECTORIZING COMPILERS Developing programs to take maximum advantage of a specific vector processor requires a great deal of knowledge of, and attention to, the particular vector computer hardware. Fortunately most applications that benefit from vector processing can be written in a high-level programming language, such as FORTRAN, and submitted to a vectorizing compiler for that language. The primary function of a vectorizing compiler is to analyze the source program for combinations of arithmetic operations 1-8 Vector Processing Concepts for which vectorization will yield correct results and generate vector instructions for those operations. If the compiler cannot be certain that a particular expression can be correctly vectorized, it will not vectorize that portion of the source code. The vectorizing compiler can reorganize sections of the program code (usually inside formal loops) that can be vectorized. Certain portions of all applications are nonvectorizable. Some programming techniques, by their nature, cannot be vectorized. For example, conditional branches into or out of a loop make it impossible for the compiler to know the range of the loop (that is, the vector length) before the code is executed. In other instances, there may be an unclear relationship between multiple references to the same memory location (called an unknown dependency). In such a relationship, the final value of the location may or may not depend on serial execution of the code, and the compiler does not have enough information to determine whether it can vectorize. Finally, there may be instances of constructs that could be vectorized but are not. 
The compiler may not be sophisticated enough to do so, the compiler may determine that vectorization would not be profitable in terms of performance, or the compiler may have insufficient information to determine that vectorization is safe. ------------------------------------------------------------ 1.4 VECTOR REGISTERS A scalar register is a location in fast memory where data is stored or status bits can be placed to be read at a later time. A register has a set length, say 32 bits, or four consecutive 8-bit bytes. A vector register is considerably larger. With the VAX, the vector register has a maximum length of 64 elements. Each element can contain up to 64 bits. The elements used can be enabled or disabled by setting bits in a Vector Mask Register (VMR). The programmer usually determines the range, or limits the number of elements used through a Vector Length Register (VLR) (see Figure 1-2). This range can vary, for example, from 0 to 64 elements. Of course, if the vector length = 0, no vector elements will be processed. 1-9 Vector Processing Concepts ------------------------------------------------------------ Figure 1-2 Vector Registers ------------------------------------------------------------ ------------------------------------------------------------ 1-10 Vector Processing Concepts ------------------------------------------------------------ 1.5 PIPELINING A vector function unit, or pipe, performs a specific function within a vector processor and operates independently of other function units. Some vector processors have three function units (see Figure 1-3): one for memory instructions, one for operations (such as add, subtract, and multiply), and one for miscellaneous instructions. Some vector processors have additional function units, specifically to perform additional arithmetic functions. The performance of the arithmetic and memory function units of a vector processor can be improved using instruction pipelining. Pipelining can be thought of as "assembly line" processing. If a complicated operation can be divided into smaller subprocesses that can then be executed independently by different parts of the function unit, the total time required to execute the operation repeatedly is substantially less than if the operation is performed serially. ------------------------------------------------------------ Figure 1-3 Vector Function Units ------------------------------------------------------------ ------------------------------------------------------------ 1-11 Vector Processing Concepts Figure 1-4 shows a concurrent execution of a process divided into four subprocesses. Each subprocess has five operations (A, B, C, D, and E), which start at different times. Operation A might be a load, operation B might be an add, ... and operation E might be a store. There is some overhead in starting the process, but once that overhead (or pipeline latency) is paid, the function unit produces one result per cycle. ------------------------------------------------------------ Figure 1-4 Pipelining a Process ------------------------------------------------------------ ------------------------------------------------------------ 1-12 Vector Processing Concepts Because most arithmetic and memory operations can be broken down into a series of one-cycle steps, the function units of a vector processor are generally pipelined. 
Thus, after initial pipeline latency, the function units can process an entire vector in the number of cycles equal to the length of the input vector--one vector element result per cycle. This time interval (known as a chime) is approximately equal (in cycles) to the length of the vector plus the pipeline latency. A vector instruction operates on an array of data, so the pipelined execution of vector instructions allows the overlap of multiple iterations of the same vector instruction operating on different data items. The pipeline length equals its number of segments. The maximum number of data elements operated on at any one time equals the pipeline length. Pipelining accommodates the variable array lengths found in vector instructions. Instruction pipelining can be enhanced by providing multiple parallel pipelines, which operate on different vector elements, within a function unit. As an example, assume a vector has 64 elements. If the vector processor has a function unit with four pipelines, the following processing can be executed in parallel: Pipe 0 operates on elements 0, 4, 8, ... , 60 Pipe 1 operates on elements 1, 5, 9, ... , 61 Pipe 2 operates on elements 2, 6, 10, ... , 62 Pipe 3 operates on elements 3, 7, 11, ... , 63 This obviously results in much faster execution than a single pipeline, giving four results per cycle instead of only one. After the pipeline latency, the 64 elements can be processed in 16 cycles rather than in 64. 1-13 Vector Processing Concepts ------------------------------------------------------------ 1.6 STRIPMINING An array longer than the maximum vector register length of a vector processor must be split into two or more subarrays or vectors, each of which can fit into a vector register. This procedure is known as stripmining (or sectioning), and it is performed automatically by a vectorizing compiler when the source program operates on loops longer than the maximum vector length. For example, suppose the following FORTRAN loop is vectorized to be run on a vector processor with registers that are 64 elements long: DO 20 I=1,350 A(I) = B(I) + C(I) 20 CONTINUE Because the vector registers can only hold 64 elements, the compiler vectorizes the loop by splitting the vector into six subvectors to be processed separately. As typically happens, one subvector in this example is shorter than the full length of a vector register. This short subvector is processed first. Conceptually, the compiler generates vector instructions for the following functions: DO I = 1,30 A(I) = B(I) + C(I) ENDDO DO I = 31,350,64 DO J = I,I+63 A(J)=B(J) + C(J) ENDDO ENDDO Note that the compiler must also generate code to set the Vector Length Register to 30 before processing the short vector and then reset it to 64 before processing the remaining vectors. 1-14 Vector Processing Concepts ------------------------------------------------------------ 1.7 STRIDE To a vector processor, a vector in memory is characterized by a start location, a length, and a stride. Stride is a measure of the number of memory locations between consecutive elements of a vector. A stride equal to the size of one vector element, known as unity stride, means that the vector elements are contiguous in memory. A constant stride greater than the size of one element means that the elements are noncontiguous but are evenly spaced in memory (see Figure 1-5). Most vector processors can load and store vectors that have constant stride. Not all vector data is constant-strided, however. 
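A common source of constant but non-unity stride is a two-dimensional array. FORTRAN stores arrays in column-major order, so stepping along a row touches memory at a fixed interval equal to the column length. The sketch below is illustrative only; the array shape and names are assumed:

      PROGRAM STRIDE
      INTEGER I, J
      REAL X(100,200), Y(100), Z(200)
C     ... assume X is initialized ...
C     Column access: X(1,5), X(2,5), ... are adjacent in memory,
C     so this loop reads a unity-stride vector.
      DO 10 I = 1, 100
         Y(I) = 2.0 * X(I,5)
   10 CONTINUE
C     Row access: X(3,1), X(3,2), ... are 100 elements apart
C     (400 bytes for F_floating data), a constant, non-unity stride.
      DO 20 J = 1, 200
         Z(J) = 2.0 * X(3,J)
   20 CONTINUE
      END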
In some vectors, the distance between consecutive elements in memory is not constant but varies for each pair of elements. Such vectors can be scattered throughout memory and are said to be random-strided or nonstrided (see Figure 1-6). For example, a sparse matrix is generally treated as a nonstrided vector. Some earlier vector processors did not support this kind of vector. On those systems, special software routines were required to gather the nonstrided vector into a temporary contiguous vector that could be accessed by constant-strided vector memory instructions. Today, most vector processors support nonstrided vectors with special load and store instructions called gather and scatter. 1-15 Vector Processing Concepts ------------------------------------------------------------ Figure 1-5 Constant-Strided Vectors in Memory ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ Figure 1-6 Random-Strided Vectors in Memory ------------------------------------------------------------ ------------------------------------------------------------ 1-16 Vector Processing Concepts ------------------------------------------------------------ 1.8 GATHER AND SCATTER INSTRUCTIONS To support random-strided vectors, gather and scatter instructions operate under control of a vector register that contains an index vector. For each element in the vector, the corresponding element in the index vector contains an offset from the start location of the vector in memory. The gather instruction uses these offsets to "load" the vector elements into the destination register, and the scatter instruction uses them to "store" the vector elements back into memory (see Figure 1-7). ------------------------------------------------------------ Figure 1-7 Vector Gather and Scatter Instructions ------------------------------------------------------------ ------------------------------------------------------------ 1-17 Vector Processing Concepts ------------------------------------------------------------ 1.9 COMBINING VECTOR OPERATIONS TO IMPROVE EFFICIENCY Some of the techniques available to increase vector instruction efficiency include overlapping and chaining. ------------------------------------------------------------ 1.9.1 Instruction Overlap Overlapping instructions involves combining two or more instructions to overlap their execution to save execution time. If a vector processor has independent function units, it can perform different operations on different operands simultaneously. Overlapping provides a significant gain in performance. If a register must be reused or if data is not yet available, overlapping may not be possible. ------------------------------------------------------------ 1.9.2 Chaining Chaining, a special form of instruction overlap, is possible with multiple function units. Chaining is passing the result of one operation in one function unit to another function unit. For example, an add instruction followed by a store command can "combine" so each element of the vector is stored as soon as the result is obtained. The processor does not have to wait for the add instruction to finish before storing the data. VADD V1,V2,V3 VSTORE V3 As results are generated by the add instruction, they are immediately available for input to the waiting store instruction. The store instruction can then begin processing the data. 
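As a source-level illustration, the loop below computes a multiply and an add on each element. The add results can chain directly into the store of D, so elements of D begin to be written before the arithmetic has finished for the whole vector; on processors that also chain between arithmetic units, the multiply results can chain into the add as well. This is a sketch only: the array names are illustrative, and whether a given compiler and processor chain these particular operations is implementation dependent.

      PROGRAM CHAIN
      INTEGER I
      REAL A(64), B(64), C(64), D(64)
C     ... assume A, B, and C are initialized ...
      DO 10 I = 1, 64
C        Multiply, then add, then store: each result can be passed to
C        the next function unit as soon as it is produced.
         D(I) = A(I) * B(I) + C(I)
   10 CONTINUE
      END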
Instruction chaining only works if all the data to be processed is available at the beginning of the pipeline. If the result of one operation must be used as input to another operation in the same data stream, instruction chaining cannot be used. 1-18 Vector Processing Concepts ------------------------------------------------------------ 1.10 PERFORMANCE The performance of scalar computers has been measured for some time using millions of instructions executed per second (MIPS). MIPS is not a good measure of speed for vector processors, since one instruction produces many results. Vector processor execution speed, instead, is measured in millions of floating-point operations per second (MFLOPS). Other abbreviations used are MegaFLOPS, MOPS (millions of operations per second), and RPM (results per microsecond). Some of the largest computers measure speed in gigaFLOPS, or billions of floating-point operations per second. The peak MFLOPS value is a vector processor's best theoretical performance, in terms of the maximum number of floating-point operations per second. For a vector processor having a processor cycle time of 5 nanoseconds and 1 arithmetic unit per pipeline, its peak MFLOPS performance (defined as 1 divided by the cycle time) is determined as follows:

    1 / (5 X 10^-9 seconds) = 0.2 X 10^9 operations per second = 200 MFLOPS

------------------------------------------------------------ 1.10.1 Amdahl's Law Amdahl's Law indicates that the performance of an application on a computer system with two or more processing rates is dominated by the slowest process. Vector processing is faster than scalar processing executing the same operation, yet the primary factor that determines a computer's speed is the speed of the scalar processor (see Figure 1-8). Amdahl's law, expressed as an equation, gives the time (T) to perform N operations as:

    T = N X (%scalar operations X time/scalar operation + %vector operations X time/vector operation)

1-19 Vector Processing Concepts ------------------------------------------------------------ Figure 1-8 Computer Performance Dominated by Slowest Process ------------------------------------------------------------ ------------------------------------------------------------ 1-20 Vector Processing Concepts ------------------------------------------------------------ 1.10.2 Vectorization Factor Some computer programs use only a portion of the code during the majority of the execution time. For example, a program may spend most of its time doing mathematical calculations, which comprise only 20% of its code. The vectorization factor may be defined as the percentage of the original scalar execution time that may be vectorized. Figure 1-9 shows how the performance increases as more code is vectorized. ------------------------------------------------------------ Figure 1-9 Computer Performance vs. Vectorized Code ------------------------------------------------------------ ------------------------------------------------------------ 1-21 Vector Processing Concepts Consider a scalar program that uses 20% of its code 70% of the time. If this 20% portion of the code is converted to vector processing code, the program is considered to have a vectorization factor of 70%.
If the time for the scalar operation is set to 1 and the time for a vector operation is 10%, we have: T = N * (.30 * 1 + .70 * .1) T = N * .37 If performance (P), equals operations performed (N) per unit time (T) then, with T = N * .37: P = N / T = N / (N * .37) = 1 / .37 = 2.7 The improved performance, shown in Figure 1-9, would be about 2.7 times faster than a scalar processor. Vectorization factors above 70% achieve performance above the same computer using scalar processing. The speedup ratio is defined as the vector performance divided by the scalar performance. ------------------------------------------------------------ 1.10.3 Crossover Point The crossover point is the vector length or number of elements at which the vector unit exceeds the performance of the scalar unit for a particular instruction or sequence. To achieve a performance improvement on a given vector processor, a vectorized application should have an average vector length that is larger than the crossover point for that processor and the vector operations used. The smaller the crossover point, the better. A crossover point of 11 means that DO loops below 11 elements are performed faster using a scalar processor than by using a vector processor. This point is a result of the overhead instructions and time required to set up the vector processor, process the data, and return the solution. This point varies from computer to computer. Vector operations add some startup overhead, putting a limit on the minimum number of elements in an array. For small arrays, the time to process and compile the data is usually longer than doing the same process on a scalar processor. 1-22 2 ------------------------------------------------------------ VAX 6000 Series Vector Processor This chapter describes the vector processor module for the VAX 6000 series. The basic hardware is briefly described and then the hardware components are discussed from the software perspective. The chapter includes the following sections: · Overview · Block Diagram · Vector Control Unit · Arithmetic Unit · Load/Store Unit · Vector Processor Registers · Memory Management · Cache Memory · Vector Pipelining · Instruction Execution 2-1 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.1 OVERVIEW The FV64A vector processor is a single-board option that implements the VAX vector instruction set. This module requires a scalar CPU module for operation. The scalar/vector pair implement the VAX instruction set plus the VAX vector instructions. Figure 2-1 is a block diagram of the scalar/vector pair. The vector processor occupies a slot adjacent to the scalar CPU on the XMI. The two processors are connected by the vector interface bus (VIB) cable. The C-chip on the scalar module provides the operand and control interface between the scalar CPU and the vector module. This interface is used to issue vector instructions to the vector module, which then executes the instruction, including all memory references necessary to load or store vector registers. The vector processor receives all instructions and returns status to the scalar CPU across the VIB. For memory references, the vector processor has its own independent path to main memory. The system supports multiple scalar CPUs with a single scalar/vector pair. For a single scalar/vector pair, two memory controllers are required. It also supports a dual scalar/vector pair, for which four memory controllers are required to support the memory traffic. 
2-2 VAX 6000 Series Vector Processor ------------------------------------------------------------ Figure 2-1 Scalar/Vector Pair Block Diagram ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ 2.2 BLOCK DIAGRAM The FV64A module is divided into three separate functional units: · Vector control unit · Arithmetic unit · Load/store unit All three functional units can operate independently. Figure 2-2 is a block diagram of the vector module. 2-3 VAX 6000 Series Vector Processor The FV64A chipset consists of five core chips, as follows: · Vector instruction issue and scalar/vector interface chip · Vector register file chip, 4 chips · Vector arithmetic data path, floating-point unit (FPU) chip, 4 chips · Load/Store--Vector module translation buffer, cache, and XMI interface controller chip · Clock generation chip (same as on scalar module) ------------------------------------------------------------ Figure 2-2 FV64A Vector Processor Block Diagram ------------------------------------------------------------ ------------------------------------------------------------ 2-4 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.3 VECTOR CONTROL UNIT When the vector control unit receives instructions, it buffers the instructions and controls instruction issue to other functional units in the vector module. The vector control unit is responsible for all scalar/vector communication. The vector control unit also contains the necessary register scoreboarding to control instruction overlap. The scoreboard implements the algorithms that permit chaining of arithmetic operations into store operations. To summarize, the vector control unit performs the following functions: · Interface to the scalar processor; receives instructions from the scalar module and also returns status. · Instruction issue. The vector control unit issues instructions to the other functional units of the vector module and maintains a register scoreboard for the detection of interinstruction dependencies. · Cache data (CD) bus master control. It relinquishes partial control to the load/store unit during execution of load/store instructions. · Implementation of the Vector Count Register (VCR), Vector Processor Status Register (VPSR), Vector Length Register (VLR), Vector Arithmetic Exception Register (VAER), and Vector Memory Activity Check Register (VMAC). ------------------------------------------------------------ 2.4 ARITHMETIC UNIT All register-to-register vector instructions are handled by the arithmetic unit. Each vector register file chip contains every fourth element of the vector registers, thus permitting four-way parallelism. These chips receive instructions from the vector contol unit and data from the cache or load/store, read operands from the registers, and write results back into the registers or into the mask register. If two 32-bit operands come over in a single 64-bit transfer, they can be read or written by two separate register file chips. The register set has four 64-bit ports (one read/write for memory data, two for read operands, and one for writing results). While one instruction is writing its results, a second can start reading its operands, thus hiding the instruction pipeline delay. Variations in pipeline length between instructions are smoothly handled so that no gaps exist in the flow of write data. 
The register file can hold two outstanding arithmetic instructions in its internal queue. The arithmetic unit executes two arithmetic instructions in about the time it takes one load or store operation to take place. 2-5 VAX 6000 Series Vector Processor The data from the register file chip flows to the vector FPU chip. Input data to the vector FPU chip comes over a 32-bit bus that is driven twice per cycle, and results are returned on a separate 32-bit bus that is driven once per cycle. The two operands for single-precision instructions can be passed in one cycle, while double-precision operands require two cycles. The FPU chip has a throughput of one cycle per single-precision operation, two cycles per double-precision operation, and 10 or 22 cycles for a single- or double-precision divide, respectively. Its pipeline delay varies for different operations; for example, the pipeline delay is 5 cycles for all longword-type instructions and is 6 cycles for all double-precision instructions except multiply. ------------------------------------------------------------ 2.4.1 Vector Register File Chip The vector register file chip is the interface between the floating-point processor and the rest of the vector module. Among its features are: · It contains one quarter of the storage needed to implement the vector registers defined by the VAX vector architecture (2 Kbytes per chip). · It provides four ports on the register file: a 64-bit read/write port to the CD bus for loads and stores, a 32-bit (64-bit internal) read port for operand A, a 32-bit (64-bit internal) read port for operand B, and a 32-bit (64-bit internal) write port for results. A load or store instruction can be writing or reading the registers at one port, an arithmetic instruction can be reading its operands out of two other ports, and another arithmetic instruction can be writing its results through still another port. All three operations can be done in parallel. When two longword operands are packed into a quadword, two separate vector register file chips can each select the appropriate longword. · It contains registers for holding two instructions, two scalar operands, the vector length embedded in each instruction, and the vector mask. · It performs the vector logical and vector merge instructions and formats integer operations so that they can be executed by the FPU. 2-6 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.4.2 Vector Floating-Point Unit Chip The FPU chip is a multi-stage pipelined floating-point processor. Among its features are: · VAX vector floating-point instructions and data types. The FPU implements instruction and data type support for all VAX vector floating-point instructions as well as the integer multiply operation. Floating-point data types F_, D_, and G_floating are supported. · High-throughput external interface. The FPU receives two 32-bit operands from the vector register file chip every cycle. It drives back a 32-bit result to the vector register file chip in the same cycle. · Based on the floating-point accelerator chip (the F-chip) on the scalar module. ------------------------------------------------------------ 2.5 LOAD/STORE UNIT When a load/store instruction is issued, the load/store unit becomes bus master and controls the internal cache data (CD) bus. Once a load/store instruction starts execution, no further instructions can be issued on the CD bus until it completes.
The load/store unit handles the memory reference instructions, the address translation, the cache management, and the memory bus interface. If a memory instruction uses register offsets, the offset register is first read into a buffer and then each element of the offset register is added to the base. This saves having to turn around the internal bus for each offset read. If a register offset is not used, addresses are generated by adding the stride to the base. This virtual address is then translated to a physical address by using an on-chip 136-entry, fully associative translation buffer (TB). Two entries are checked at once by an address predictor looking for "address translation successful" on the last element. The early prediction permits the scalar processor to be released and appear to be asynchronous on memory reference instructions. The load /store unit handles translation buffer misses on its own but returns the necessary status on invalid or swapped-out pages. Once the scalar processor corrects the situation, the instruction is retried from the beginning. Once a physical address is obtained, the load/store unit looks it up in the 32K entry tag store. The address is delayed and then passed to the 1-Mbyte cache data store. This delay permits cache lookup to complete before data is written to the cache on store operations. In parallel, the 2-7 VAX 6000 Series Vector Processor corresponding register file address is presented to the four register file chips. The data and addresses are automatically aligned for load and store operations to permit the correct reading and writing of the register file and cache data RAMs. Up to four cache misses can be outstanding before the read data for the first miss returns, and hits can be taken under misses. Cache parity errors cause the cache to be disabled, the instruction retried, and when the instruction completes, a soft error interrupt is sent to the scalar processor. A duplicate copy of the cache tag store is maintained for filtering cache invalidates from the main memory bus. The cache is write through, with a 32-element write buffer, and memory read instructions that hit in the cache can start while the memory write instructions are emptying the write buffer. The cache fill size is 32 bytes. The entire process is pipelined so that a new 64-bit word can be read or written each cycle. The load/store unit implements the following functions: · Execution of all load, store, gather, and scatter instructions. · Virtual address generation logic for memory references. · Virtual to physical address translation logic, using a translation buffer. A 136-entry TB is part of the load/store unit. The load/store unit also contains the data path and control necessary to implement full VAX memory management (with assistance from the scalar CPU). · Cache control. The load/store unit supports the tag and data store for a 1-Mbyte write-through data cache. It also supports a duplicate tag store for invalidate filtering. · XMI interface. The load/store unit serves as the interface between the vector module and the XMI bus. This includes support for four outstanding cache misses on read requests and a 32-entry write buffer to permit half the data from one store/scatter instruction to be held in the buffer. The performance of the high-speed CD bus can thus be isolated from the performance impact of the slower XMI bus. 
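At the source level, the indexed array references below are the kind of access the load/store unit services with its gather and scatter support, using a vector of element offsets rather than a constant stride. This is a sketch only; the names and sizes are assumed, and whether a compiler actually emits gather/scatter instructions for such a loop depends on its analysis:

      PROGRAM INDEXD
      INTEGER I, IX(64)
      REAL A(64), B(1000), C(1000)
C     ... assume IX holds valid element indexes and B is initialized ...
C     Gather: read B at the positions named by IX -- a random-strided
C     (indexed) load.
      DO 10 I = 1, 64
         A(I) = B(IX(I))
   10 CONTINUE
C     Scatter: write results back into C at the same positions -- a
C     random-strided (indexed) store.
      DO 20 I = 1, 64
         C(IX(I)) = 2.0 * A(I)
   20 CONTINUE
      END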
2-8 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.6 VECTOR PROCESSOR REGISTERS The vector processor has 16 data registers, each containing 64 elements numbered 0 through 63. Each element is 64 bits wide. A vector instruction that reads or writes longwords of F_floating or integer data reads bits <31:0> of each source element and writes bits <31:0> of each destination element. Other registers used with the data registers are the Vector Length, Vector Count, and Vector Mask Registers (see Figure 2-3). The 7-bit Vector Length Register (VLR) controls how many vector elements are processed. VLR is loaded prior to executing a vector instruction. Once loaded, VLR specifies the length of all subsequent vector instructions until VLR is loaded with a new value. The Vector Mask Register (VMR) has 64 bits, each bit corresponding to an element in a vector register. Bit <0> corresponds to vector element zero. The vector mask is used by the vector compare, merge, IOTA, and all masked instructions. The 7-bit Vector Count Register (VCR) receives the length of the offset vector generated by the IOTA instruction. VLR, VCR, and VMR are read and written by Move From/To Vector Processor (MFVP/MTVP) instructions. The Vector Count and Vector Length Registers are in the vector control unit. The Vector Mask Register and vector data registers are split across the four vector register file chips. 2-9 VAX 6000 Series Vector Processor ------------------------------------------------------------ Figure 2-3 Vector Count, Vector Length, Vector Mask, and Vector Registers ------------------------------------------------------------ ------------------------------------------------------------ 2-10 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.7 MEMORY MANAGEMENT The vector processor implements memory management as described in the VAX Architecture Reference Manual. The 32-bit virtual address is partitioned as shown in Figure 2-4. ------------------------------------------------------------ Figure 2-4 Virtual Address Format ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ 2.7.1 Translation-Not-Valid Fault If the V bit = 0 for a page table entry (PTE) which is being used for address translation, and no access violation (ACV) fault has occurred, then the vector module passes status back to the scalar CPU indicating a translation-not-valid (TNV) fault has occurred. ------------------------------------------------------------ 2.7.2 Modify Flows If the PTE for the page being accessed has V bit = 1, access is a write, no ACV fault has occurred, and the Modify (M) bit is not set, then the memory management unit enters the modify flows. The load/store unit sets the PTE M bit and continues. 2-11 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.7.3 Memory Management Fault Priorities Table 2-1 shows the priority order, from highest to lowest, by which the vector processor reports faults. 
------------------------------------------------------------
Table 2-1 Memory Management Fault Prioritization
------------------------------------------------------------
ACV  Alignment  TNV  I/O  Modify  Error Reported
------------------------------------------------------------
1    x          x    x    x       ACV vector, ACV parameter
1    1          x    x    x       ACV vector, align parameter
0    0          1    x    x       TNV vector, TNV parameter
1    0          0    1    x       ACV vector, IOREF parameter
0    0          0    0    1       Execute modify flows
0    0          0    0    0       None; reference OK
------------------------------------------------------------
------------------------------------------------------------ 2.7.4 Address Space Translation The memory management hardware translates virtual to physical addresses using the VAX Architecture Reference Manual requirements for vector processors. ------------------------------------------------------------ 2.7.5 Translation Buffer The translation buffer (TB) contains 136 page table entries (PTEs). The TB has 68 associative tags with two PTEs per tag. The TB uses a round-robin replacement algorithm. When a TB miss occurs, two PTEs (one quadword) are fetched from cache. If the fetch from cache results in a cache miss, eight PTEs (one hexword) are loaded into cache from main memory. Two PTEs are installed in the TB. The TB can be invalidated by executing a translation buffer flush. This is accomplished either by writing the VTBIA register or by writing the VTBIS register with the desired virtual address to invalidate a single location. 2-12 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.8 CACHE MEMORY The vector module implements a single-level, direct-mapped cache. In addition, the load/store unit can hold the data and addresses for one complete vector store or scatter instruction. Figure 2-5 shows the flow of address and data in the load/store pipeline. Each stage takes one or more cycles of the 44.44-ns vector module clock. The XMI stage is a multiple of 64 ns, and the time taken depends on the transaction type and the XMI bus activity. All memory references must flow through the cache stage. ------------------------------------------------------------ Figure 2-5 Address/Data Flow in Load/Store Pipeline ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ 2.8.1 Cache Organization The vector processor implements a 1-Mbyte, direct-mapped cache with a hexword fill size and a hexword allocation (block) size. The cache is read allocate, no-write allocate, and write through. There are 32K tags, and each tag maps one hexword block. Each tag contains one block valid bit, a 9-bit tag, and one parity bit. Each data block contains 32 bytes and 8 parity bits, one for each longword. 2-13 VAX 6000 Series Vector Processor Associated with each of the 32K main tags is a duplicate tag in the XMI interface. This tag is allocated in parallel with the main tag and is used for determining invalidates. All XMI write command/address cycles are compared with the duplicate tag data to determine if an invalidate should take place. The resulting invalidate is placed in a queue for subsequent processing in the main tag store. Figure 2-6 shows the cache arrangement. Figure 2-7 shows how the physical address is divided.
------------------------------------------------------------ Figure 2-6 Cache Arrangement ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ Figure 2-7 Physical Address Division ------------------------------------------------------------ ------------------------------------------------------------ The physical address passed to the cache is 27 bits long and is longword aligned. Bit <29> is never passed to the cache, because an I/O space reference generates a memory management exception in the translation buffer. Bits <28:20> are compared to the tag field. Bits <19:5> provide the row select for the cache, bits <4:3> supply the quadword address, and bit <2> supplies the longword address. 2-14 VAX 6000 Series Vector Processor Figure 2-8 shows how the main tag memory is arranged. The main tag is written with PA<28:20>, and the valid bit covers a hexword block. The parity bit covers the tag and valid bits. The duplicate tag memory is identical to the main tag memory. Figure 2-9 shows the organization of the cache data. Each cache block contains four quadwords, with eight longword parity bits. ------------------------------------------------------------ Figure 2-8 Main Tag Memory Organization ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ Figure 2-9 Data Cache Logical Organization ------------------------------------------------------------ ------------------------------------------------------------ 2-15 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.8.2 Cache Coherency All data cached by a processor must remain coherent with data in main memory. This means that any write done by a processor or I/O device must displace any copies of that data cached by other processors. The XMI interface in the load/store unit continuously monitors all XMI write transactions. When a write is detected, the address is compared with the contents of a duplicate tag store to determine if the write should displace data in the main cache. If the write requires that the main cache tag be invalidated, then an invalidate queue entry is generated. The duplicate tag store is a copy of the main tag store. When a main cache tag allocate is performed, the corresponding duplicate tag is also allocated. When an invalidate request is generated, the duplicate tag is immediately invalidated. This mechanism permits full bandwidth operation of the main cache without missing an invalidate request. The invalidate queue is 16 entries long. In its quiescent state the load/store unit can process invalidates faster than the XMI can generate them. However, during execution of a load or store instruction, the invalidate queue can fill to the point where normal processing must cease so that the queue can be emptied before it overflows; this mechanism is triggered when the queue reaches nine entries. 2-16 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.9 VECTOR PIPELINING The vector processor, which is fully pipelined, has three major function units: the vector issue unit, the load/store unit, and the arithmetic unit (see Figure 2-10). These function units operate independently and are fully pipelined. Vector instructions are received by the issue unit from the scalar processor.
The issue unit decodes the instruction, performs various checks, and issues the instruction to either the load/store unit or the arithmetic unit. At that point, the issue unit is finished with that instruction and control of the instruction passes to the function unit to which it was issued. ------------------------------------------------------------ Figure 2-10 Vector Processor Units ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ 2.9.1 Vector Issue Unit The vector issue unit acts as the controller for the vector pipeline. It handles vector instruction decomposition, resource availability checks, and vector instruction issue. When an instruction is received by the vector module from the scalar processor, the instruction is decomposed into its opcode and operands, and the availability of the requested resources is checked. All source and destination registers must be available before instruction execution begins. The function unit to be used by the instruction during execution must also be available. The instruction is not issued until all required resources are available. 2-17 VAX 6000 Series Vector Processor The availability of registers is handled by a method called scoreboarding. The rules governing register availability depend on the type of instruction to be issued. · For a load instruction, the register to be loaded must not be modified by any currently executing (arithmetic) instruction, and it must not be modified or used as input by any currently deferred (arithmetic) instruction. · For a store instruction, the register to be stored must not be modified by any currently executing or deferred instruction, but it may be in use as input. The exception is when a chain into store may occur. In this case the store instruction can be issued while the chaining arithmetic instruction is still executing. · For a scatter or gather instruction, the restrictions for a load or store instruction apply, but also the register containing the offset to be used in the scatter or gather instruction must not be modified by any currently executing or deferred instruction. · For a load or store under mask instruction, the restrictions for a load or store instruction apply, but also the mask register must not be modified by any currently executing or deferred instruction. · An arithmetic instruction may be issued as soon as the deferred instruction queue of the arithmetic unit is free. Register checking for these instructions is handled by the arithmetic unit. In general, there must be no outstanding writes to a needed register from prior instructions, and the destination register of the instruction must not be used by a currently deferred instruction. Once an instruction is issued, it may take multiple cycles before the result of the calculation is available. Meanwhile, in the next cycle the next instruction can be decoded and, if all its issue conditions are satisfied, it can be issued. ------------------------------------------------------------ 2.9.2 Load/Store Unit The load/store unit handles all cache and memory access for the vector module. The load/store unit includes a five-segment pipeline that can accept a new instruction every cycle. In general, the load/store pipeline handles a single element request at a time. The exception occurs when a load instruction is acting on single-precision, unity vector stride data. 
In this special case, consecutive elements are paired and then each pair is handled as a single request by the load/store pipeline. 2-18 VAX 6000 Series Vector Processor Once a load or store (or scatter or gather) instruction is issued, no further instructions may be issued until a Memory Management Okay (MMOK) is received. The scalar unit is also stalled until the MMOK is received. Chapter 3 suggests certain coding techniques to minimize the impact of this behavior as much as possible. ------------------------------------------------------------ 2.9.3 Arithmetic Unit The arithmetic unit is composed of two parts: the vector ALU and the vector FPU. The FPU performs all floating-point instructions as well as compare, shift, and integer multiply. (There is no integer divide vector instruction.) The ALU does everything else, including merge and logical instructions and all initial instruction handling. The ALU receives instructions from the vector issue unit. One instruction may be queued while another instruction is executing in the arithmetic unit. A queued (or deferred) instruction begins executing as soon as any current instruction completes. Some overlap of instructions is possible if both the current and the deferred instructions require the FPU and are not divide instructions. Also, the second instruction must not begin outputting results before the first instruction completes. The ALU decodes the instruction and determines the type of operation requested (see Figure 2-11). If the instruction is a Boolean or merge instruction, the ALU performs the required operation. Floating-point instructions, as well as integer, compare, and shift instructions, are sent to the vector FPU for execution. Once an instruction begins execution in the arithmetic unit, the number of cycles delay (startup time) before the first results are returned depends on the particular instruction executed. With the exception of any type of divide, all instructions return new results each cycle for single- precision data, or every other cycle for double-precision data, following the return of the first results. The total number of cycles required for an instruction to complete depends on the length of the vector and the particular instruction. 2-19 VAX 6000 Series Vector Processor ------------------------------------------------------------ Figure 2-11 Vector Arithmetic Unit ------------------------------------------------------------ ------------------------------------------------------------ An instruction continues executing until all results are completed. A deferred arithmetic instruction begins execution after the instruction in the pipeline completes or when all the following conditions are met: · The deferred instruction must not be a "short" instruction; that is, the vectors used by the instruction must be at least eight elements in length. · The current instruction must not be a "long" instruction; that is, the instruction must not require more than two cycles per element to execute. (The divide instructions are the only "long" instructions.) In other words, overlap of instruction execution can occur if the results of the deferred instruction will not be completed before the last results from the current instruction. The overlap of instructions will be particularly significant for shorter vectors. All instructions, except floating-point divide instructions, are fully pipelined. For increased performance all arithmetic instructions are executed by four parallel pipelines. 
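The overlap conditions just listed can be captured in a few lines. The following C sketch is illustrative only; it is not part of the original manual, and the type and function names are invented. It simply encodes the rule that a deferred arithmetic instruction may overlap the current one only if the deferred instruction is not "short" (at least eight elements) and the current instruction is not "long" (not a divide).
------------------------------------------------------------
#include <stdbool.h>

typedef struct {
    bool is_divide;          /* divides are the only "long" instructions */
    unsigned vector_length;  /* number of elements processed (1 to 64)   */
} arith_instr;

/* Overlap rule from Section 2.9.3: the deferred instruction must be at
 * least eight elements long, and the current instruction must not need
 * more than two cycles per element (that is, must not be a divide). */
static bool deferred_may_overlap(const arith_instr *current,
                                 const arith_instr *deferred)
{
    bool deferred_is_short = deferred->vector_length < 8;
    bool current_is_long   = current->is_divide;
    return !deferred_is_short && !current_is_long;
}
------------------------------------------------------------
When the rule holds, the deferred instruction's effective startup cost shrinks to as little as one cycle, as Section 2.10 notes.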
2-20 VAX 6000 Series Vector Processor ------------------------------------------------------------ 2.10 INSTRUCTION EXECUTION The vector pipeline is made up of a varying number of segments depending on the type of instruction being executed. Once an instruction is issued, the pipeline is under the control of the load/store unit or the arithmetic unit. The interaction between the different function units of the vector module can greatly affect the performance of vector instruction execution. The execution time of a vector instruction can be calculated using the following equation: FC + IC * round_up[VL / NPP] where FC is the fixed cost, IC is the incremental cost per vector element, NPP is the number of parallel pipelines, and VL is the length (number of elements) of the vector operand. This can be rewritten in terms of the data as: Startup_latency + Execution_time where Execution_time is a function of vector length. Note that the execution of D_ and G_floating (64-bit data) type arithmetic instructions (except divide) can only produce results every two cycles due to the bandwidth of the interconnect between the register file and the vector FPU, whereas F_floating type arithmetic instructions (except divide) produce results each cycle. The execution time of a sequence of instructions is not necessarily equal to the sum of the execution times of the individual instructions. Overlap can occur between arithmetic instructions and load/store instructions as well as between individual arithmetic instructions. For example, a sequence consisting of two arithmetic instructions followed by a load or store can have a total execution time only slightly longer than the execution time of the load or store alone, or equal to the total execution time of the two arithmetic instructions, whichever is longer. In the case of overlap between individual arithmetic instructions, a minimum of one cycle must elapse between the final result of the first instruction and the first result of the following instruction. In other words, when overlap occurs the total execution time decreases. For all overlapping arithmetic instructions, other than the first instruction to enter the empty pipeline, the effective fixed cost (or startup latency) is reduced to a minimum of one cycle. 2-21 ------------------------------------------------------------ 3 Optimizing with MACRO-32 This chapter discusses optimization features of the VAX 6000 series vector processor. Appendix A provides additional optimization examples. This chapter includes the following sections: · Vectorization · Crossover Point · Scalar/Vector Synchronization · Instruction Flow · Overlap of Arithmetic and Load/Store Instructions · Out-of-Order Instruction Execution · Chaining · Cache · Stride/Translation Buffer Miss · Register Reuse 3-1 Optimizing with MACRO-32 ------------------------------------------------------------ 3.1 VECTORIZATION Many loops that VAX FORTRAN decomposes can also be vectorized. VAX FORTRAN performs vectorization automatically whenever /VECTOR is specified at compilation. VAX FORTRAN can vectorize any FORTRAN-77 Standard-conforming program, and vectorized programs can freely call and be called by other programs (vectorized, not vectorized, and non-FORTRAN) as long as both abide by the VAX Calling Standard.
The VAX vector architecture supports most FORTRAN language features, as follows: · Data types: LOGICAL*4, INTEGER*4, REAL*4, REAL*8, COMPLEX*8, and COMPLEX*16 · Operators: +, -, *, / (floating point), and ** · All VAX FORTRAN intrinsic functions Although no VAX vector form exists for integer divide operations, VAX FORTRAN vectorizes them by converting them to floating-point operations. ------------------------------------------------------------ 3.1.1 Using Vectorization Alone Vectorize a program using the following iterative process: 1 Using /CHECK=BOUNDS, compile, debug, and run a scalar version of the program. 2 Compile and run the program using /VECTOR and the suitable vector-related qualifiers. 3 Evaluate execution time and results. The results should be algebraically equivalent to the scalar results; if not, check the DUMMY_ALIASES or ACCURACY_SENSITIVE settings. · If performance is adequate, stop. · If performance is inadequate, you have options similar to those with autodecomposition: - Check the /SHOW=LOOPS output to see if CPU-intensive loops vectorized. To vectorize effectively, source code must not contain certain inhibiting constructs such as unknown dependencies. However, you can use LSE diagnostics and add assertions to the source code to overcome dependencies. 3-2 Optimizing with MACRO-32 - Combine vectorization with decomposition. - Consider a solution at a higher level of the hierarchy. - Retest the program compiling with /VECTOR (or any combination with a parallel qualifier) and return to the start of step 3. ------------------------------------------------------------ 3.1.2 Combining Decomposition with Vectorization To produce code that executes in parallel and on vector processors, compile /VECTOR with a parallel qualifier. Table 3-1 lists the compilation combinations and their interpretations.
------------------------------------------------------------
Table 3-1 Qualifier Combinations for Parallel Vector Processing
------------------------------------------------------------
Combination: /VECTOR/PARALLEL=AUTOMATIC
Interpretation: Performs a dependence analysis on suitable loops and optimizes them for parallel-vector processing; chooses loops and prepares them for vector or parallel processing based on whether they will execute efficiently and produce correct results. In a nested structure, decomposition and vectorization may occur for multiple loops -- but no loop is decomposed inside a decomposed loop and no loop is vectorized inside a vectorized loop.
Combination: /VECTOR/PARALLEL=MANUAL
Interpretation: Performs a dependence analysis, optimization, and vectorization only; disqualifies from vectorization any loops preceded by the CPAR$ DO_PARALLEL directive; in these loops, parses the user-supplied directives.
Combination: /VECTOR/PARALLEL
Interpretation: Same as /VECTOR/PARALLEL=MANUAL.
Combination: /VECTOR/PARALLEL=(MANUAL,AUTOMATIC)
Interpretation: Same as /VECTOR/PARALLEL=AUTOMATIC except disqualifies loops preceded by CPAR$ DO_PARALLEL. In those loops, only user-supplied directives are parsed. Any loops contained in a manually decomposed loop are disqualified from autodecomposition but not vectorization.
------------------------------------------------------------
Both parallel and vector processing have certain tradeoff qualities, which affect the aggregate speedup of vector and parallel processing.
The speedup of combined vector-parallel processing will be somewhat less than the aggregate speedup of each because of these qualities; however, both CPU time and wall-clock time can be reduced most dramatically when vector and parallel processing are combined. The qualities involved are as follows: · Large amounts of vector load-stores can create a bottleneck in the system. On the other hand, small amounts of CPU work can cause the parallel processing startup overhead itself to become a bottleneck. Vector operations have smaller startup overhead than parallel processing, so they amortize this CPU expense much sooner. However, vector processing demands more from memory than parallel processing (on scalar CPUs) because one vector load or store can affect up to 64 elements, whereas a scalar load or store typically affects only one element. · Vector processing is "free" for the scalar CPUs because it is done on a vector processor; both wall-clock time and scalar CPU time are decreased. On the other hand, parallel processing is not free for the scalar CPUs; it can never decrease CPU time. But it can reduce wall-clock time more dramatically than vector processing. Vectorization can be effectively combined with decomposition: 1 Compile, debug, and run the program serially and in scalar. 2 Evaluate the algorithm and make suitable changes. 3 Unless your algorithm and system environment are especially suitable for parallel processing or you have already decomposed the program, compile, debug, and run the program using /VECTOR first. This is because vectorization is "free," as stated in this section. 4 Using /VECTOR/PARALLEL=AUTOMATIC, recompile, debug, and run the program. 5 Evaluate performance. · If performance is adequate, stop. · If performance is inadequate, review the /SHOW=LOOPS output and LSE diagnostics and modify the source code as needed for important loops that neither vectorized nor decomposed (this most probably will include adding assertions to resolve unknown dependencies). Then retest the program. If performance is still not acceptable, consider manually decomposing certain loops and look for other bottlenecks such as I/O or other performance inhibitors. 3-4 Optimizing with MACRO-32 ------------------------------------------------------------ 3.1.3 Algorithms At times it is necessary to consider the algorithm that is represented by the code to be optimized. Some algorithms are not as well suited to vectorization as others. It may be more effective to change the algorithm used or the way it is implemented rather than trying to optimize the existing code. Increasing the work performed in any single loop iteration and increasing the ratio of arithmetic to load/store instructions are two effective methods to consider when optimizing an algorithm for vectorization. Using unity stride rather than nonunity stride, and using longer vector lengths, are other approaches to consider. ------------------------------------------------------------ 3.2 CROSSOVER POINT For any given instruction or sequence of instructions, there is a particular vector length where the scalar and vector processing of equivalent operations yield the same performance. This vector length is referred to as the crossover point between scalar and vector processing for the given instruction or instruction sequence and varies depending on the particular instruction or sequence.
For vector lengths below the crossover point, scalar operations are faster; above the crossover point vector operations are more efficient. A low crossover point is considered a benefit, since it indicates that it is relatively easy to take advantage of the power of the vector processor. For any single, isolated vector instruction, the crossover point on the VAX 6000 is quite low, generally about 3 elements. But an instruction is not performed in isolation. Taken in the context of a routine or application, other factors affect the performance of the operations on short vectors, in particular whether the data of the short vector is used in other vector operations as well. In general, on the VAX 6000 vectorizing as much code as possible, including short vector length sections, leads to higher performance through more optimal use of cache. Specifically, once a set of data has been operated on by vector instructions, that data will be in the vector cache. A subsequent scalar operation on any of that same data will require that the data be moved out of the vector cache into the scalar cache. A vector operation would not require this data movement and thus is usually more efficient. Overall, the crossover point on the VAX 6000 is low enough that only for isolated operations on short vectors is scalar processing the faster alternative. 3-5 Optimizing with MACRO-32 ------------------------------------------------------------ 3.3 SCALAR/VECTOR SYNCHRONIZATION For most cases, it is desirable for a vector processor to operate concurrently with the scalar processor so as to achieve best performance. However, there are cases where the operation of the vector and scalar processors must be synchronized to ensure correct program results. Rather than forcing the vector processor to detect and automatically provide synchronization in these cases, the architecture provides software with special instructions to accomplish the synchronization. These instructions synchronize the following: · Exception reporting between the vector and scalar processors · Memory accesses between the scalar and vector processors · Memory accesses between multiple load/store units of the vector processor Software must determine when to use these synchronization instructions to ensure correct results. ------------------------------------------------------------ 3.3.1 Scalar/Vector Instruction Synchronization (SYNC) A mechanism for synchronization between the scalar and vector processors is provided by the SYNC instruction, which is implemented by a Move From Vector Processor (MFVP) instruction. SYNC allows software to ensure that the exceptions of previously issued vector instructions are reported before the scalar processor proceeds with the next instruction. SYNC detects both arithmetic exceptions and asynchronous memory management exceptions and reports these exceptions by taking the appropriate VAX instruction fault. Once it issues the SYNC, the scalar processor executes no further instructions until the SYNC completes or faults. When SYNC completes, a longword value (which is unpredictable) is returned to the scalar processor. The scalar processor writes the longword value to the scalar destination of the MFVP instruction and then proceeds to execute the next instruction. When SYNC faults, it is not completed by the vector processor, and the scalar processor does not write a longword value to the scalar destination of the MFVP instruction. 
Also, depending on the exception condition encountered, the SYNC itself takes either a vector processor disabled fault or a memory management fault. After the appropriate fault has been serviced, the SYNC may be returned to through a Return from Exception or Interrupt (REI) instruction. SYNC only affects the scalar/vector processor pair that executed it. It has no effect on other processors in a multiprocessor system. ------------------------------------------------------------ 3.3.2 Scalar/Vector Memory Synchronization The scalar processor and the vector processor can access memory at the same time during: · Asynchronous memory management mode · Synchronous memory management mode, after the vector processor indicates no memory management exceptions occurred When the scalar processor and the vector processor access memory at the same time, it may be desirable to synchronize their accesses. Using an MFVP from the MSYNC vector control register causes the scalar CPU to stall until previous memory accesses by either the vector processor or the scalar processor are completed and visible to the other. MSYNC is for user software. Scalar/vector memory synchronization allows software to ensure that the memory activity of the scalar/vector processor pair has ceased and that the resultant memory writes have been made visible to each processor in the pair before the pair's scalar processor proceeds with the next instruction. Two ways are provided to ensure scalar/vector memory synchronization: · Using MSYNC, which is implemented by the MFVP instruction · Using the Move From Processor Register (MFPR) instruction to read the Vector Memory Activity Check (VMAC) internal processor register In the following example, both the vector processor load instruction (VLDL) and the scalar processor move instruction (MOVF) would be using the same BASE memory. MSYNC ensures that the load instruction completes before beginning the move instruction. VLDL BASE, #4, V1 MSYNC R0 MOVF #^F3.0, BASE 3-7 Optimizing with MACRO-32 Scalar/vector memory synchronization does not mean that previously issued vector memory instructions have completed; it only means that the vector and scalar processor are no longer performing memory operations. While both VMAC and MSYNC provide scalar/vector memory synchronization, MSYNC performs significantly more than just that function. In addition, VMAC and MSYNC differ in their exception behavior. Note that scalar/vector memory synchronization only affects the processor pair that executed it. Other processors in a multiprocessor system are not affected. Scalar/vector memory synchronization does not ensure that the writes made by one scalar/vector pair are visible to any other scalar or vector processor. Software can make data visible and shared between a scalar/vector pair and other scalar and vector processors by using the mechanisms described in the VAX Architecture Reference Manual. Software must first make a memory write by the vector processor visible to its associated scalar processor through scalar/vector memory synchronization before making the write visible to other processors. Without performing this scalar/vector synchronization, it is unpredictable whether the vector memory write will be made visible to other processors even by the mechanisms described in the VAX Architecture Reference Manual. Note that waiting for VPSR to be clear does not guarantee that a vector write is visible to the scalar processor.
------------------------------------------------------------ 3.3.2.1 Memory Instruction Synchronization (MSYNC) Once it issues MSYNC, the scalar processor executes no further instructions until MSYNC completes or faults. When MSYNC completes, a longword value (which is unpredictable) is returned to the scalar processor, which writes it to the scalar destination of the MFVP instruction. The scalar processor then proceeds to execute the next instruction. Arithmetic and asynchronous memory management exceptions encountered by previous vector instructions can cause MSYNC to fault. When MSYNC faults, some previously issued scalar and vector memory instructions may not have finished. In this case, the scalar processor writes no longword value to the scalar destination of the MFVP. Depending on the exception encountered by the vector processor, the MSYNC takes a vector processor disabled fault or a memory management fault. After the fault has been serviced, the MSYNC may be returned to through a Return from Exception or Interrupt (REI) instruction. ------------------------------------------------------------ 3.3.2.2 Memory Activity Completion Synchronization (VMAC) Privileged software needs a way to ensure scalar/vector memory synchronization that will not result in any exceptions being reported. Reading the Vector Memory Activity Check (VMAC) internal processor register with the privileged Move From Processor Register (MFPR) instruction is provided for these situations. It is especially useful for context switching. Once an MFPR from VMAC is issued by the scalar processor, the scalar processor executes no further instructions until all vector and scalar memory activities have ceased; all resultant memory writes have been made visible to both the scalar and vector processor; and a longword value (which is unpredictable) is returned to the scalar processor. After writing the longword value to the scalar destination of the MFPR, the scalar processor then proceeds to execute the next instruction. Vector arithmetic and memory management exceptions of previous vector instructions never fault a privileged MFPR from the Vector Memory Activity Check Register and never suspend its execution. ------------------------------------------------------------ 3.3.3 Memory Synchronization Within the Vector Processor (VSYNC) The vector processor can concurrently execute a number of vector memory instructions through the use of multiple load/store paths to memory. When it is necessary to synchronize the accesses of multiple vector memory instructions, the MSYNC instruction can be used; however, there are cases for which this instruction does more than is needed. If it is known that only synchronization between the memory accesses of vector instructions is required, the Synchronize Vector Memory Access (VSYNC) instruction is more efficient. If a conflict results within the vector processor for accessing memory, a VSYNC instruction can be used. VSYNC ensures that the current memory access instruction is complete before executing another. This instruction does not affect scalar processor memory access instructions. VSYNC orders the conflicting memory accesses of vector memory instructions issued after VSYNC with those of vector memory instructions issued before VSYNC.
Specifically, VSYNC forces the access of a memory location by any subsequent vector memory instruction to wait for (depend upon) the completion of all prior conflicting accesses of that location by previous vector memory instructions. VSYNC does not have any synchronizing effect between scalar and vector memory access instructions. VSYNC also has no synchronizing effect between vector load instructions because multiple load accesses cannot conflict. It also does not ensure that previous vector memory management exceptions are reported to the scalar processor. ------------------------------------------------------------ 3.3.4 Exceptions There are two categories of exceptions within the vector processor: · Imprecise exceptions · Precise exceptions ------------------------------------------------------------ 3.3.4.1 Imprecise Exceptions Imprecise exceptions can occur within the vector processor while arithmetic instructions are executing. They may be caused by typical arithmetic problems such as division by zero or underflow. Because the vector processor can execute instructions out of order, it is not possible to determine the instruction that caused the exception from the updated program counter (PC). The PC in the scalar processor is pointing further down the instruction stream and cannot be backed up to point at the failing instruction. To report the exception condition in this case, the vector processor disables itself so that the scalar processor will take a vector disable fault when it attempts to dispatch a vector instruction. The vector disable fault handler then determines the cause. When this type of exception occurs, the vector controller sets a bit in the register mask in the Vector Arithmetic Exception Register (VAER) IPR to indicate the destination vector register which received data from the exception. It then informs the scalar CPU of the exception. When debugging code, it is often necessary to be able to find the precise instruction causing the problem. Inserting a SYNC instruction after each arithmetic instruction will cause the machine to run in precise mode, waiting for each instruction to complete before executing the next. However, it will run much slower than when imprecise exceptions are allowed to occur. 3-10 Optimizing with MACRO-32 ------------------------------------------------------------ 3.3.4.2 Precise Exceptions The vector processor produces precise exceptions for memory management faults. When a memory management exception occurs, microcode and operating system handlers are used to fix the exception. The vector processor cannot service Translation Not Valid and Access-Control Violation faults. To handle these exceptions, the vector processor passes sufficient state data back to the scalar CPU. Then if a memory management fault occurs, the microcode can build a vector exception frame on the stack so that vector processor memory management exceptions will be handled precisely and the faulting instruction restarted. To enforce synchronous operation, after a vector load/store operation is issued, the scalar CPU will not issue additional instructions until memory management has completed. To reduce the delay from the issue of a load/store instruction to the issue of the next instruction, the load/store unit has special logic which predicts when load/store instructions can proceed fault-free.
When the load/store unit knows it can perform all virtual-to-physical translations without incurring a memory management fault, it issues the MMOK signal to the vector control unit. The scalar CPU is then released to issue more instructions while the load/store unit completes the remainder of the data transfers. This mechanism reduces the overhead associated with providing precise memory management faults. ------------------------------------------------------------ 3.4 INSTRUCTION FLOW Vector instructions are read from the scalar CPU's I-stream. The scalar issue unit decodes the vector instructions and passes them to the vector CPU. The instructions are decoded by the vector control unit and then issued to the appropriate function unit through the internal bus. Before instruction issue, the instruction is checked against a register scoreboard to verify that it will not use a corrupted register or attempt to modify a register that is already in use. Load, store, scatter, and gather instructions are processed in the Load/Store chip. These instructions either fetch data from memory and place it in the vector register file or write data from the vector register file to memory. Arithmetic instructions are passed to the arithmetic pipelines by way of control registers in the register file chips. An arithmetic instruction has a fixed startup latency. To minimize the effects of this latency, the arithmetic pipelines support the ability to queue two arithmetic instructions. This permits the arithmetic pipeline controller to start the second instruction without any startup latency. The removal of startup latency for the second arithmetic instruction (the deferred arithmetic instruction) is a benefit in algorithms that require less than eight bytes/FLOP of load/store bandwidth. Typical algorithms benefit greatly from the ability to chain an arithmetic operation into a store operation. The vector control unit, along with the ALU, implements this capability. The following sections describe by instruction type the flow of instructions in the machine. ------------------------------------------------------------ 3.4.1 Load Instruction When a load instruction is received by the vector control unit, the destination vector register is checked against outstanding arithmetic instructions. A load instruction cannot begin execution until the register to which it will write is free. A register conflict may occur if the destination register of a load instruction is the same as one of the registers used by a preceding arithmetic instruction. If instruction execution overlap could occur if the load instruction were using a different register, then the register conflict can be eliminated by simply changing the register used. If there are no register usage conflicts, the instruction is dispatched to the load/store unit. An example of a memory access instruction in assembler notation is as follows: VLDL base, stride, Vc where: VLD = vector load (load memory data into vector register) L = longword (Q would equal quadword) base = beginning of first element stride = number of memory locations (bytes) between the starting address of the first element and the next element Vc = vector register destination result This instruction means: Load the vector register (Vc) from memory, starting at the base address (base), incrementing consecutive addresses by the stride in bytes. The load operation writes the data from memory into the destination register.
The store operation writes the data from the vector register back to memory. 3-12 Optimizing with MACRO-32 In the load/store instruction, the Vector Length Register (VLR) and the Vector Mask Register (VMR), together with the match true/false (T/F) setting (when the mask operate enable (MOE) bit is set), determine which elements to access in Vc. For longwords, only bits <31:0> may be accessed. The elements can be loaded or stored out of order, because there can be multiple load/store units and multiple paths to memory, a desirable property of vector processors. A Modify Intent (MI) bit may be used with the VLD instruction to improve performance for systems that use writeback caches. The MI bit is not used for store or scatter instructions. During a load operation, the first element in memory at the base address loads into the destination vector register. The next element in memory at the base address plus the stride loads into the next location in the destination vector register. With a vector load/store operation, the stride is constant, so that the third address in memory is the base address plus two times the stride. ------------------------------------------------------------ 3.4.2 Store Instruction When the vector control unit receives a store instruction, the source vector register is checked against outstanding arithmetic instructions. If there are no conflicts, the instruction is dispatched to the load/store unit. If the source for the store is the destination of the current arithmetic instruction, and the deferred arithmetic instruction does not conflict with the source vector register, and the arithmetic instruction is not a divide, then the vector control unit waits for a signal from the arithmetic unit to indicate that the store operation can start. The instruction is then dispatched to the load/store unit. During a store operation, the data moves in the opposite direction, from the destination vector register back to memory. The elements of the vector are placed back into memory at the base address plus a multiple of the stride, as shown in the following example: VLDL base,#4,V3 Load vector V3 from memory, starting at the "base" address and obtaining successive elements 4 bytes apart (stride = 4). VSTL V1,base,#16 Store vector V1 into memory starting at the "base" address and placing successive elements 16 bytes apart. 3-13 Optimizing with MACRO-32 The data from a store instruction is internally buffered in the chip. This offers the advantage of allowing cache hit load instructions to issue and complete while the write executes over the XMI. ------------------------------------------------------------ 3.4.3 Memory Management Okay (MMOK) When a memory reference occurs, control is turned over to memory management until an MMOK is returned indicating that all memory references were successful. An algorithm is used to predict when MMOK will be returned, to determine when new instructions can be issued. For every vector element a new last element virtual address is calculated based on the current element virtual address, the number of remaining elements, and the stride. Every element virtual address is compared to the calculated last element virtual address to determine whether both reside in the same two-page virtual address window. If they do reside within the same two pages and if the current virtual address has been successfully translated, then MMOK is asserted. If not, then the generation of virtual addresses continues.
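The MMOK prediction can be sketched in a few lines of C. This sketch is illustrative only; it is not DIGITAL code, the names are invented, and the "same two virtual page window" test is interpreted here as the last remaining element lying in the current 512-byte page or the page adjacent to it.
------------------------------------------------------------
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the MMOK prediction described in Section 3.4.3.
 * VAX pages are 512 bytes. */
#define VAX_PAGE_SHIFT 9

static bool mmok_can_assert(uint32_t current_va, unsigned remaining_elements,
                            int32_t stride_bytes, bool current_translated_ok)
{
    /* Assumes remaining_elements >= 1. */
    uint32_t last_va = current_va +
        (uint32_t)((int64_t)(remaining_elements - 1) * stride_bytes);
    uint32_t current_page = current_va >> VAX_PAGE_SHIFT;
    uint32_t last_page    = last_va    >> VAX_PAGE_SHIFT;
    uint32_t distance = (last_page > current_page) ? last_page - current_page
                                                   : current_page - last_page;

    /* MMOK can be asserted once the remaining elements all fall within the
     * current page or the page next to it and the current address has
     * already translated successfully. */
    return current_translated_ok && distance <= 1;
}
------------------------------------------------------------
For a unity stride load, the condition becomes true well before the last element is processed, which is why the scalar processor is released early in that case.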
------------------------------------------------------------ 3.4.4 Gather/Scatter Instructions An array whose subscript is an integer vector expression is "indirectly addressed." Indirect addressing appearing on the right side of an assignment statement is called a gather operation; on the left side it is known as a scatter, as shown in the following: DO 80 I = 1,95 J = IPICK(I) A(J) = B(J) + C(K(I)+3) * D(I) 80 CONTINUE Array A requires a scatter operation. B and C require gathers. Loops that contain references to a scattered array or stores into a gathered array have potential for data dependency, as shown in the following: DO 10 I = 1,N A(I) = B(I) + C(I) / D(ID(I)) B(IB(I)) = X(I) * Y(I) D(I) = E(I)**2 G(JG(I)) = 2. * G(NG(I)) 10 CONTINUE Potential data dependency exists for arrays B, D, and G. 3-14 Optimizing with MACRO-32 When a gather or scatter instruction is received by the vector control unit, the destination/source register is checked against outstanding arithmetic instructions. If there are no conflicts, the instruction is dispatched to the load/store unit. The load/store unit will then fetch the offset vector register. When this is complete, the vector control unit reissues the instruction and the gather/scatter operation takes place using the previously stored offset vector register to generate the virtual addresses. A gather instruction is used to collect memory data into vector registers when the memory data does not have a constant stride. Each memory address is formed from a base address plus an offset taken from a register of up to 64 offsets (the number used depends on VL). The elements are loaded nearly as fast as by a load instruction and are loaded sequentially into the destination register. (The scatter instruction stores the result back to memory using the same offsets.) ------------------------------------------------------------ 3.4.5 Masked Load/Store, Gather/Scatter Instructions The operation for masked memory instructions is identical to the unmasked versions except that the following operations are performed first. The vector controller checks if any outstanding arithmetic instructions will modify the mask register. If not, the vector controller reads the mask from the arithmetic unit and sends it to the load/store unit. The sequence is then performed as above. ------------------------------------------------------------ 3.5 OVERLAP OF ARITHMETIC AND LOAD/STORE INSTRUCTIONS Arithmetic instructions and load/store instructions may overlap because the functional units are independent. To achieve this overlap, the following conditions must be met: · The arithmetic instruction must be issued before the load/store instruction. · There must be no register conflict between the arithmetic and load/store instructions. In the following example, while the results of vector register 2, V2, are being calculated, vector register 4, V4, is being stored in memory. Consequently, this is referred to as overlapping instructions. VVADDL V1,V3,V2 VSTL V4,base,#4 3-15 Optimizing with MACRO-32 In the following examples, an I represents instruction issue time and an E represents instruction execution time. A series of periods represents wait time in the arithmetic unit for deferred instructions. Notice that these are not exact timing examples, since they do not correspond to individual instruction timings, but are for illustration purposes only.
In Example 3-1 the execution of the VLDL instruction does overlap the VVADDL instruction because there is no conflict in the destination vector registers, V3 and V1, for the add and load respectively. ------------------------------------------------------------ Example 3-1 Overlapped Load and Arithmetic Instructions ------------------------------------------------------------ VVADDL V1,V2,V3 IEEEEEEEE VLDL base,#4,V1 IEEEEEEEEEEEEEE ------------------------------------------------------------ ------------------------------------------------------------ 3.5.1 Maximizing Instruction Execution Overlap Three important hardware features help to maximize instruction overlap in the load/store unit. First, a load or store instruction can execute in parallel with up to two arithmetic instructions, provided the arithmetic instructions are issued first. Second, the chain into store sequence can reduce the perceived execution time of a store instruction. Finally, early detection of no memory faults allows scalar-to-vector communications to overlap with load or store instruction execution. In the first instruction sequence in Example 3-2 there is little overlapping of instructions, whereas in the second sequence the VVMULL and the second VLDL instructions overlap and require less total time to complete execution. The only difference between the two instruction sequences is the order in which they are issued. Because the VVMULL does not require the result of the second VLDL and can precede that instruction, a significant reduction in execution time is achieved. Another effective way to maximize the overlap of load/store instructions is to precede, wherever possible, all load and store instructions by at least two arithmetic instructions. In this way both the load/store pipeline and the arithmetic pipeline will be in use. 3-16 Optimizing with MACRO-32 ------------------------------------------------------------ Example 3-2 Maximizing Instruction Execution Overlap ------------------------------------------------------------ Instruction Sequence 1 VLDL base1,#4,V1 IEEEEEEEEE VLDL base2,#4,V2 IEEEEEEEEE VVMULL V3,V1,V1 IEEEEE VVADDL V1,V2,V2 I....EEEEE VSTL V2,base,#4 IEEEEEEEEE Instruction Sequence 2 VLDL base1,#4,V1 IEEEEEEEEE VVMULL V3,V1,V1 IEEEEE VLDL base2,#4,V2 IEEEEEEEEE VVADDL V1,V2,V2 IEEEEE VSTL V2,base,#4 IEEEEEEEEE ------------------------------------------------------------ A load instruction cannot begin execution until the register to which it will write is free. A register conflict may occur if the destination register of a load instruction is the same as one of the registers used by a preceding arithmetic instruction. If instruction execution overlap could occur if the load instruction were using a different register, then the register conflict can be eliminated by simply changing the register used. Example 3-3 shows the effects of register conflict. In the first instruction sequence the VLDL instruction must wait until the VVADDL instruction completes and the VVMULL instruction begins because VLDL will change the contents of one of the registers that provides input to the deferred VVMULL instruction. In the second instruction sequence it is possible to take advantage of the deferred arithmetic instruction queue and overlap the VLDL and arithmetic instruction execution because the VLDL instruction does not change the registers used by the arithmetic instructions. By simply changing the register to which the VLDL will write, the total execution time for the instruction sequence is reduced. 
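The register-availability rule behind these examples (stated in Section 2.9.1) can be written down directly. The following C sketch is illustrative only and is not part of the original manual; the names and the bitmask representation are invented. It returns true when a load's destination register passes the scoreboard check against the currently executing and deferred arithmetic instructions.
------------------------------------------------------------
#include <stdbool.h>
#include <stdint.h>

/* Scoreboard rule for a load: the destination register must not be written
 * by the currently executing arithmetic instruction and must not be read
 * or written by the deferred one.  Registers V0..V15 are tracked as bits. */
typedef struct {
    uint16_t writes;   /* bit i set: instruction writes Vi */
    uint16_t reads;    /* bit i set: instruction reads Vi  */
} vreg_use;

static bool load_may_issue(unsigned dest_reg,
                           const vreg_use *executing,
                           const vreg_use *deferred)
{
    uint16_t dest = (uint16_t)(1u << dest_reg);
    if (executing && (executing->writes & dest))
        return false;               /* would corrupt a result still in flight */
    if (deferred && ((deferred->writes | deferred->reads) & dest))
        return false;               /* deferred instruction still needs it    */
    return true;
}
------------------------------------------------------------
Applied to Example 3-3, the check fails for a load into V1 because the deferred VVMULL still reads V1, but succeeds for V5, which is exactly why changing the destination register restores the overlap.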
The locality of reference of data plays an important role in determining the performance of load/store operations. Unity stride load and store instructions are the most efficient. For this reason, whenever possible data should be stored in the sequential order in which it is usually referenced. 3-17 Optimizing with MACRO-32 ------------------------------------------------------------ Example 3-3 Effects of Register Conflict ------------------------------------------------------------ Instruction Sequence 1 VVADDL V1,V2,V3 IEEEEE VVMULL V1,V2,V4 I....EEEEE VLDL base,#4,V1 IEEEEEEEEE Instruction Sequence 2 VVADDL V1,V2,V3 IEEEEE VVMULL V1,V2,V4 I....EEEEE VLDL base,#4,V5 IEEEEEEEEE ------------------------------------------------------------ Nonunity stride loads and stores can have a significantly higher impact on the performance level of the XMI memory bus as compared to unity stride operations. A far greater number of memory references are required for nonunity stride than is the case for unity stride. If the ratio of cache miss load/store to arithmetic instructions is sufficiently high and nonunity stride is used, bus bandwidth can become the limiting performance factor. ------------------------------------------------------------ 3.6 OUT-OF-ORDER INSTRUCTION EXECUTION The deferred instruction queue (of length 1) associated with the arithmetic unit allows the vector issue unit to queue one instruction to the arithmetic unit while that unit is still executing a previous instruction. The issue unit checks the status of this queue when it does the functional unit availability check for an instruction. (Both the deferred and currently executing instructions are checked for register availability.) This frees the issue unit to process another instruction rather than having to wait for the arithmetic unit to complete its current instruction. Example 3-4 shows the use of the deferred arithmetic instruction queue. If a deferred instruction queue was not implemented, the VVMULL instruction could not be issued until the VVADDL was completed (or nearly completed). The VLDL instruction would then not issue until after the VVMULL was issued and would complete much later than in the deferred instruction case. Once the VLDL instruction is issued, no other instructions may be issued. The overlap of instruction execution made possible by the deferred instruction queue can significantly reduce the total execution time. The VLDL instruction can overlap the deferred VVMULL instruction because there are no register conflicts between the two instructions. 3-18 Optimizing with MACRO-32 ------------------------------------------------------------ Example 3-4 Deferred Arithmetic Instruction Queue ------------------------------------------------------------ Instruction Sequence VVADDL V1,V2,V3 VVMULL V3,V1,V4 VLDL base,#4,V2 Execution without Deferred Instruction Queue Issue VVADDL IEEEEEEEE Issue VVMULL IEEEEEEEE Issue VLDL IEEEEEEEEEEEEEE Execution with Deferred Instruction Queue Issue VVADDL IEEEEEEEE Issue deferred VVMULL I.......EEEEEEEE Issue VLDL IEEEEEEEEEEEEEE ------------------------------------------------------------ In Example 3-5 the VLDL instruction cannot begin before VVMULL because VVMULL needs data in V3 before the VLDL takes place. 
------------------------------------------------------------ Example 3-5 A Load Stalled due to an Arithmetic Instruction ------------------------------------------------------------ VVADDL V1,V2,V3 IEEEEEEEE VVMULL V3,V4,V5 I.......EEEEEEEE VLDL base,#4,V3 IEEEEEEEEEEEEEE ------------------------------------------------------------ To take advantage of the deferred instruction queue, close attention to instruction ordering and register use is required. Generally, a divide or two other arithmetic instructions should precede each load or store instruction. (In the case of divide instructions, multiple load/store instructions can be overlapped with a single divide instruction.) This is not always possible, since initial loads are usually necessary and there may not be two arithmetic instructions per load/store. Also, some instruction ordering is dictated by the use of the data. But even with these restrictions, it is still important to watch for potential instruction execution overlap. 3-19 Optimizing with MACRO-32 Example 3-6 is another example of the use of a deferred arithmetic instruction. In this case, a divide instruction is followed by an add and then a load. The deferred instruction queue and the length of the divide instruction combine to "hide" the load instruction (that is, the execution time of the load instruction does not contribute to the total execution time of the instruction sequence). Note also that the divide instruction completes after the load completes. Out of order completion of instructions is possible. ------------------------------------------------------------ Example 3-6 Use of the Deferred Arithmetic Instruction Queue ------------------------------------------------------------ Instruction Sequence VVDIVL V1,V2,V3 VVADDL V3,V1,V4 VLDL base,#4,V5 Execution without Deferred Instruction Queue Issue VVDIVL IEEEEEEEEEEEEEEEEEEEE Issue VVADDL IEEEEEEEE Issue VLDL IEEEEEEEEEEEEEE Execution with Deferred Instruction Queue Issue VVDIVL IEEEEEEEEEEEEEEEEEEEE Issue deferred VVADDL I...................EEEEEEEE Issue VLDL IEEEEEEEEEEEEEE ------------------------------------------------------------ ------------------------------------------------------------ 3.7 CHAINING Vector operands are generally read from and written to the vector register file. An exception to this process occurs when a store instruction is waiting for the results of a currently executing arithmetic instruction. (Divide instructions are not included in this exception because they do not have the same degree of pipelining as the other instructions.) As results are generated by the arithmetic instruction and are ready to be written to the register file, they are also immediately available for input to the waiting store instruction. Therefore, the store instruction can begin processing the data before the arithmetic instruction has completed. This process is called "chain into store." The store instruction will not overrun the arithmetic instruction because the store instruction cannot process data faster than the arithmetic unit can generate results. 3-20 Optimizing with MACRO-32 In Example 3-7, the VSTL instruction requires the result of the VVADDL instruction and without chain into store would have to wait for the VVADDL to complete before beginning the store operation. The use of chain into store allows the VSTL operation to begin after the first result of the add is complete, while the VVADDL is still executing and greater overlap of instruction execution is the result. 
The instruction sequence requires a shorter period of time to complete. The coordination of the arithmetic operation and the VSTORE for a chain into store is handled by the vector arithmetic unit and depends on a number of factors such as vector length. ------------------------------------------------------------ Example 3-7 Example of Chain Into Store ------------------------------------------------------------ Instruction Sequence VVADDL V1,V2,V3 VVMULL V1,V2,V4 VSTL V3,base,#4 Execution without Chain into Store: Issue VVADDL IEEEEEEEE Issue deferred VVMULL I.......EEEEEEEE Issue VSTL IEEEEEEEEEEEEEE Execution with Chain into Store: Issue VVADDL IEEEEEEEE Issue deferred VVMULL I.......EEEEEEEE Issue VSTL IEEEEEEEEEEEEEE ------------------------------------------------------------ ------------------------------------------------------------ 3.8 CACHE With the 1-Mbyte vector cache, up to four load operations with cache misses can be queued at one time. The pipeline continues processing vector element loads until a fourth cache miss occurs. At that point the cache miss queue is full and the pipeline stalls. The pipeline remains stalled until one of the cache misses is serviced. Cache misses on a load instruction degrade the performance of the load/store pipeline. 3-21 Optimizing with MACRO-32 A cache miss is serviced by a hexword fill. On the XMI, a hexword transfer is 80 percent efficient since one address is sent to receive four quadwords of data. An octaword transfer is 67 percent efficient since one address is sent to receive two quadwords of data. A quadword transfer is only 50 percent efficient since one address is sent to receive one quadword of data. For this reason, stores are more efficient with unity stride than with nonunity stride. A larger piece of memory can be referenced by a single address so that fewer memory references are required. In the case of load instructions, the comparison of unity and nonunity stride is less straightforward. A nonunity stride cache miss load causes a full hexword to be read from memory even though the load requires only a longword or quadword of data. If the additional data is not referenced by subsequent load instructions, then the nonunity stride load is much less efficient than a unity stride load. If subsequent loads do reference the extra data, then nonunity stride load performance improves due to high cache hit rates for the subsequent loads. For double-precision data there is little degradation due to nonunity stride in this case. For single-precision data, unity stride loads will show significantly higher performance because of the load/store pipeline optimization for single- precision unity stride loads. ------------------------------------------------------------ 3.9 STRIDE/TRANSLATION BUFFER MISS A vector 's stride is the number of memory locations (bytes) between the starting address of consecutive vector elements. A vector with a stride of 1 is contiguous; it has no gaps in memory between vector elements. Consider the vector arrays A and B in the following DO loop. Vector A has a stride of 1; vector B has a stride of 2. DO 100 I=1,5 A(I) = B(I*2) 100 CONTINUE When a translation buffer (TB) miss occurs, two PTEs (1 quadword) are fetched from cache. If this fetch results in a cache miss, then a hexword (eight PTEs) is loaded into cache from memory but only two PTEs are installed in the TB. This handling of TB misses has a large effect on the performance of nonunity stride vectors. 
A stride of two pages (256 longwords or 128 quadwords) or more can result in a TB miss for each data item. A stride of eight pages (1024 longwords or 512 quadwords) or more can result in a TB miss that can cause a cache miss for each data item. Unity stride is 3-22 Optimizing with MACRO-32 most efficient in that it runs sequentially through the data and makes full use of all PTEs fetched.
An example of how to avoid large vector strides can be seen in a simple matrix multiplication problem:
      DO I = 1, N
        DO J = 1, N
          DO K = 1, N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          ENDDO
        ENDDO
      ENDDO
If coded as written, there is a choice of which variable to vectorize on. If the "K" variable is chosen, array A will access FORTRAN rows that are nonunity stride. This choice also means that for every K, a reduction operation is required to sum the product of A and B into the C array. Although reduction functions vectorize, they are less efficient than other methods. A better choice is to vectorize on either I or J. J is not the best candidate because it involves nonunity stride for both the B and the C arrays. For large values of N, this is an inefficient use of the bus bandwidth, the translation buffer, and the cache. Clearly the optimal solution is to vectorize on the I variable.
Example 3-8 shows a first attempt to code the matrix multiplication in MACRO pseudocode for vectors. Although this example uses unity stride, it is far from optimal. Notice that it is not necessary to load and store C for different values of K because C is dependent only on the I and J variables. By removing the load and store of C from the inner loop, the bytes/FLOP ratio (loads and stores : arithmetics) drops from 12 to 2 down to 4 to 2. Example 3-9 shows an improved version. 3-23 Optimizing with MACRO-32
------------------------------------------------------------
Example 3-8 Matrix Multiply--Basic
------------------------------------------------------------
      msync R0                ;synchronize with scalar
LOOP: vldl A(I,K),#4,V0       ;col of A is loaded into V0
      vsmulf B(K,J),V0,V0     ;V0 gets the product of V0
                              ;and the scalar value B(K,J)
      vldl C(I,J),#4,V1       ;col of C gets loaded into V1
      vvaddf V0,V1,V1         ;V1 gets V0 summed with V1
      vstl V1,C(I,J),#4       ;V1 is stored back into C
      INC K                   ;increment K by one
      IF (K < N) GOTO LOOP    ;Loop for all values of K
      INC J                   ;increment J by one
      IF (J < N) GOTO LOOP    ;Loop for all values of J
      INC I, RESET J          ;increment I by vector length
      IF (I < N) GOTO LOOP    ;Loop for all values of I
      msync R0                ;synchronize with scalar
------------------------------------------------------------
------------------------------------------------------------
Example 3-9 Matrix Multiply--Improved
------------------------------------------------------------
        msync R0                ;synchronize with scalar
IJLOOP: vldl C(I,J),#4,V1       ;col of C gets loaded into V1
KLOOP:  vldl A(I,K),#4,V0       ;col of A is loaded into V0
        vsmulf B(K,J),V0,V0     ;V0 gets the product of V0
                                ;and the scalar value B(K,J)
        vvaddf V0,V1,V1         ;V1 gets V0 summed with V1
        INC K                   ;increment K by one
        IF (K < N) GOTO KLOOP   ;Loop for all values of K
        vstl V1,C(I,J),#4       ;V1 is stored back into C
        INC J, RESET K          ;increment J by one
        IF (J < N) GOTO IJLOOP  ;Loop for all values of J
        INC I, RESET J          ;increment I by vector length
        IF (I < N) GOTO IJLOOP  ;Loop for all values of I
        msync R0                ;synchronize with scalar
------------------------------------------------------------
3-24 Optimizing with MACRO-32
------------------------------------------------------------
3.10 REGISTER REUSE The concept used in Example 3-9 to
reuse the data when it has already been loaded into a vector register is known as register reuse. Register reuse can be extended further by using all available vector registers to decrease the bytes/FLOP ratio and improve performance. With maximum register reuse, programs on the VAX 6000 Model 400 vector processor can approach a peak single-precision performance of 90 MFLOPs and a peak double-precision performance of 45 MFLOPs. To implement register reuse for matrix multiply, the J loop must be unrolled. By precomputing 14 partial results, using only the first column of A with 14 different columns of B, it is possible to use 14 vector registers (instead of 14 memory locations) to hold the partial results. Thus, all N rows of B can be accessed in groups of 14 columns to compute the first 14 columns of C. When the final row of B is reached, the results are chained into a store into array C. Then the next set of 14 columns of C will be calculated. The unrolling depth of 14 is chosen because of the number of vector registers. Example 3-10 shows the MACRO pseudocode to accomplish this for values of N <= 64. Although the code length is longer, the performance is greatly improved by the segments of code that are purely vector arithmetics. The bytes/FLOP ratio has dropped to better than 4 to 14, allowing the algorithm to approach peak vector speeds. When implemented in matrix solvers, speedups greater than 25 have been realized in a VAX 6000 Model 410 vector processor computer system. 3-25 Optimizing with MACRO-32 ------------------------------------------------------------ Example 3-10 Matrix Multiply--Optimal ------------------------------------------------------------ msync R0 ;synchronize with scalar mtvlr #N ; ; loop2: ;first segment ; vldl A(I,K),#4,v0 ; ;mul ; vsmulf B(K,J),v0,v2 ; vsmulf B(K,J+1),v0,v3 ; vsmulf B(K,J+2),v0,v4 ; vsmulf B(K,J+3),v0,v5 ; vsmulf B(K,J+4),v0,v6 ; vsmulf B(K,J+5),v0,v7 ; vsmulf B(K,J+6),v0,v8 ; vsmulf B(K,J+7),v0,v9 ; vsmulf B(K,J+8),v0,v10 ; vsmulf B(K,J+9),v0,v11 ; vsmulf B(K,J+10),v0,v12 ; vsmulf B(K,J+11),v0,v13 ; vsmulf B(K,J+12),v0,v14 ; vsmulf B(K,J+13),v0,v15 ; ; update ; INC K loop1: ; ;load col of A ; vldl A(I,K),#4,v0 ; ;mul and add ; vsmulf B(K,J),v0,v1 vvaddf v1,v2,v2 ; vsmulf B(K,J+1),v0,v1 vvaddf v1,v3,v3 ; vsmulf B(K,J+2),v0,v1 vvaddf v1,v4,v4 ; vsmulf B(K,J+3),v0,v1 ------------------------------------------------------------ Example 3-10 Cont'd on next page 3-26 Optimizing with MACRO-32 ------------------------------------------------------------ Example 3-10 (Cont.) Matrix Multiply--Optimal ------------------------------------------------------------ vvaddf v1,v5,v5 ; vsmulf B(K,J+4),v0,v1 vvaddf v1,v6,v6 ; vsmulf B(K,J+5),v0,v1 vvaddf v1,v7,v7 ; vsmulf B(K,J+6),v0,v1 vvaddf v1,v8,v8 ; vsmulf B(K,J+7),v0,v1 vvaddf v1,v9,v9 ; vsmulf B(K,J+8),v0,v1 vvaddf v1,v10,v10 ; vsmulf B(K,J+9),v0,v1 vvaddf v1,v11,v11 ; vsmulf B(K,J+10),v0,v1 vvaddf v1,v12,v12 ; vsmulf B(K,J+11),v0,v1 vvaddf v1,v13,v13 ; vsmulf B(K,J+12),v0,v1 vvaddf v1,v14,v14 ; vsmulf B(K,J+13),v0,v1 vvaddf v1,v15,v15 ; ; update ; INC K IF (K < N) GOTO LOOP1 ;Loop for all values of K ; ;last element loopa1: ; ;load col of A ; vldl A(I,K),#4,v0 ; ;mul, add and store ; ------------------------------------------------------------ Example 3-10 Cont'd on next page 3-27 Optimizing with MACRO-32 ------------------------------------------------------------ Example 3-10 (Cont.) 
Matrix Multiply--Optimal ------------------------------------------------------------ vsmulf B(K,J),v0,v1 vvaddf v1,v2,v2 vstl v2,C(I,J),#4 ; vsmulf B(K,J+1),v0,v1 vvaddf v1,v3,v3 vstl v3,C(I,J+1),#4 ; vsmulf B(K,J+2),v0,v1 vvaddf v1,v4,v4 vstl v4,C(I,J+2),#4 ; vsmulf B(K,J+3),v0,v1 vvaddf v1,v5,v5 vstl v5,C(I,J+3),#4 ; vsmulf B(K,J+4),v0,v1 vvaddf v1,v6,v6 vstl v6,C(I,J+4),#4 ; vsmulf B(K,J+5),v0,v1 vvaddf v1,v7,v7 vstl v7,C(I,J+5),#4 ; vsmulf B(K,J+6),v0,v1 vvaddf v1,v8,v8 vstl v8,C(I,J+6),#4 ; vsmulf B(K,J+7),v0,v1 vvaddf v1,v9,v9 vstl v9,C(I,J+7),#4 ; vsmulf B(K,J+8),v0,v1 vvaddf v1,v10,v10 vstl v10,C(I,J+8),#4 ; vsmulf B(K,J+9),v0,v1 vvaddf v1,v11,v11 vstl v11,C(I,J+9),#4 ; vsmulf B(K,J+10),v0,v1 vvaddf v1,v12,v12 vstl v12,C(I,J+10),#4 ; vsmulf B(K,J+11),v0,v1 vvaddf v1,v13,v13 vstl v13,C(I,J+11),#4 ; vsmulf B(K,J+12),v0,v1 vvaddf v1,v14,v14 vstl v14,C(I,J+12),#4 ; vsmulf B(K,J+13),v0,v1 vvaddf v1,v15,v15 vstl v15,C(I,J+13),#4 ; ; update ; ------------------------------------------------------------ Example 3-10 Cont'd on next page 3-28 Optimizing with MACRO-32 ------------------------------------------------------------ Example 3-10 (Cont.) Matrix Multiply--Optimal ------------------------------------------------------------ RESET K RESET I INC J by 14 IF (J < N) GOTO LOOP2 ;Loop for all values of K msync R0 ;synchronize with scalar ------------------------------------------------------------ 3-29 A ------------------------------------------------------------ Algorithm Optimization Examples This appendix illustrates how the characteristics of the VAX 6000 series vector processor can be used to build optimized routines for this system and how the algorithm and its implementation can change the performance of an application on the VAX 6000 processor. The VAX 6000 series vector processor delivers high performance for computationally intensive applications. Based on CMOS technology, the VAX 6000 Model 400 vector processor is capable of operating at peak speeds of 90 MFLOPs single precision and 45 MFLOPs double precision. Linear algebra and signal processing applications that utilize the various hardware features have demonstrated vector speedups between 3 and 35 over the scalar VAX 6000 CPU times. With the integrated vector processing available on the VAX 6000 series, the performance of computationally intensive applications may now approach that of supercomputers. Algorithm changes can alter the data access patterns to more efficiently use the memory subsystem, can increase the average vector length, and can minimize the number of vector operations required. By applying Amdahl's Law of vectorization, performance can be improved by increasing the percentage of code that is vectorized. Four basic optimization methods that take advantage of the processing power of VAX 6000 series system include: · Rearrange code for maximum vectorization of the inner loop and remove data dependencies within the loop · Vectorize across contiguous memory locations to produce unity stride vectors for increased cache hit rates and optimized cache miss handling · Reuse the data already loaded into the vector registers as frequently as possible to reduce the number of vector load and store operations · Maximize instruction execution overlap by pairing arithmetic instructions between load and store instructions wherever possible Further information on optimization techniques in FORTRAN can be found in the VAX FORTRAN Performance Guide available with the FORTRAN-High Performance Option. 
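The following FORTRAN fragment is a minimal sketch of the second of these methods (the routine name, array names, bounds, and the trivial computation are illustrative only and are not taken from the examples in this appendix). The same work is written first with the loops ordered so that the inner, vectorizable loop has nonunity stride, and then with the loops interchanged so that it sweeps down a column with unity stride:

      SUBROUTINE SWEEP(A, B)
C     Hypothetical fragment contrasting nonunity and unity stride
C     access to the same 100 by 100 REAL arrays (names and bounds
C     are illustrative).
      REAL A(100,100), B(100,100)
      INTEGER I, J
C     Nonunity stride: the inner loop varies the column subscript,
C     so successive elements seen by a vectorized inner loop are
C     100 longwords (400 bytes) apart.
      DO 20 I = 1, 100
         DO 10 J = 1, 100
            A(I,J) = 2.0*B(I,J)
   10    CONTINUE
   20 CONTINUE
C     Unity stride: with the loops interchanged, the inner loop runs
C     down a column and successive elements are adjacent in memory.
      DO 40 J = 1, 100
         DO 30 I = 1, 100
            A(I,J) = 2.0*B(I,J)
   30    CONTINUE
   40 CONTINUE
      RETURN
      END

For longword data, the second ordering lets the compiler generate unity stride vector loads and stores, which improves cache hit rates and makes full use of each PTE and hexword fetched, as described in Chapter 3.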
A-1 Algorithm Optimization Examples Two groups of applications that have high vector processing potential include equation solvers and signal processing routines. For example, computational fluid dynamics, finite element analysis, molecular dynamics, circuit simulation, quantum chromodynamics, and economic modeling applications use various types of simultaneous or differential equation solvers. Applications such as air pollution modeling, seismic analysis, weather forecasting, radar imaging, speech and image processing, and many other scientific and engineering applications use signal processing routines, such as fast Fourier transforms (FFT), to obtain solutions. ------------------------------------------------------------ A.1 EQUATION SOLVERS Equation solvers generally fall into four categories: general rectangle, symmetric, hermitian, and tridiagonal. The most common benchmark used to measure a computer system's ability to solve a general rectangular system of linear equations is Linpack. The Linpack benchmarks, developed at Argonne National Laboratory, measure the performance across different computer systems while solving dense systems of 100, 300, and 1000 linear equations. These benchmarks are currently written to call subroutines from the Linpack library. The subroutines, in turn, call the basic linear algebra subroutines (BLAS) at the lowest level. For each benchmark size, there are different optimization rules which govern the type of changes permitted in the Linpack report. Optimizations to the BLAS routines are always allowed. Modifications can be made to the FORTRAN source or by supplying the routine in macrocode. Algorithm changes are only allowed for the largest problem size, the solution to a system of 1000 linear equations. The smallest problem size uses a two-dimensional array that is 100 by 100. The benchmarks are written to use Gaussian elimination for solving 100 simultaneous equations. This two-step method features a factorization routine, xGEFA, and a solver, xGESL. Both are column- oriented algorithms and use vector-vector level 1 BLAS routines. Column orientation increases program efficiency because it improves locality of data based on the way FORTRAN stores arrays. As shown in Example A-1, the BLAS level 1 routines allow the user to schedule the instructions optimally in vector macrocode. Deficiencies in BLAS 1 routines include frequent synchronization, a large calling overhead, and more vector load and store operations in comparison to other vector arithmetic operations. A-2 Algorithm Optimization Examples ------------------------------------------------------------ Example A-1 Core Loop of a BLAS 1 Routine Using Vector-Vector Operations ------------------------------------------------------------ xAXPY - computes Y(I) = Y(I) + aX(I) where x = precision = F, D, G MSYNC ;synchronize with scalar LOOP: VLDx X(I),std,VR0 ;X(I) is loaded into VR0 VSMULx a,VR0,VR0 ;VR0 gets the product of VR0 ;and the scalar value "a" VLDx Y(I),std,VR1 ;Y(I) get loaded into VR1 VVADDx VR0,VR1,VR1 ;VR1 gets VR0 summed with VR1 VSTx VR1,Y(I),std ;VR1 is stored back into Y(I) INC I ;increment I by vector length IF (I < SIZ) GOTO LOOP ;Loop for all values of I MSYNC ;synchronize with scalar ------------------------------------------------------------ The performance of the Linpack 100 by 100 benchmark, which calls the routine in Example 3-7 showing execution without chain into store, shows how an algorithm with approximately 80 percent vectorization can be limited by the scalar portion. 
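For reference, the FORTRAN operation that such a BLAS 1 kernel implements is a single loop; the following is a hypothetical sketch (the routine name, argument list, and use of REAL data are illustrative and are not taken from the Linpack sources). Its simplicity is also the source of the deficiencies noted above: each element requires two loads and a store but only a multiply and an add.

      SUBROUTINE AXPYF(N, A, X, Y)
C     Hypothetical FORTRAN form of the xAXPY kernel shown in
C     Example A-1: Y(I) = Y(I) + A*X(I), single precision.
C     The routine name and arguments are illustrative.
      INTEGER N, I
      REAL A, X(N), Y(N)
      DO 10 I = 1, N
         Y(I) = Y(I) + A*X(I)
   10 CONTINUE
      RETURN
      END

A vectorizing compiler would stripmine this loop into vector-length sections, producing code of the same general shape as the macrocode in Example A-1.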
One form of Amdahl's Law relates the percentage of vectorized code to the percentage of scalar code to define an overall vector speedup. The speedup is the ratio between scalar runtime and vector runtime and is described by the following formula:

Vector Speedup = Time Scalar / ((%scalar * Time Scalar) + (%vector * Time Vector))

Under Amdahl's Law, the maximum vector speedup possible for a program that is 80 percent vectorized, assuming an infinitely fast vector processor, is:

Vector Speedup = 1.0 / ((0.2 * 1.0) + (0.8 * 0)) = 1.0 / 0.2 = 5.0

As shown in Figure A-1, the Model 400 processor achieves a vector speedup of approximately 3 for the 100 by 100 Linpack benchmark when using the BLAS 1 subroutines. The benchmark follows Amdahl's Law closely because its data is small enough to fit in the vector processor's 1-Mbyte cache and, therefore, it incurs very little overhead due to the memory hierarchy. A-3 Algorithm Optimization Examples
------------------------------------------------------------
Figure A-1 Linpack Performance Graph, Double-Precision BLAS Algorithms
------------------------------------------------------------
Refer to the printed version of this book, EK-60VAA-PG.
------------------------------------------------------------
For the Linpack 300 by 300 benchmark, optimizations include the use of routines that are equivalent to matrix-vector level 2 BLAS routines. Example A-2 details the core loop of a BLAS 2 routine. BLAS 2 routines make better use of cache and translation buffers than the BLAS 1 routines do. Also, BLAS 2 routines have a better ratio between vector arithmetics and vector loads and stores. The larger matrix size increases the average vector length. Performance is improved by amortizing the time to decode instructions across a larger work load. By removing one vector load and one vector store from the innermost loop, the BLAS 2 routine has a better ratio of arithmetic operations to load and store operations than BLAS 1 routines. Although the 300 by 300 array fits into the vector processor's 1-Mbyte cache, not all the cache can be mapped by its translation buffer. By changing the sequence in which this routine is called in the program, the data access patterns can be altered to better use the vector unit's translation buffer. Thus, higher performance is obtained. The percent of vectorization increases primarily because of the increase in the matrix size from 100 by 100 to 300 by 300. With a vector fraction of approximately 95 percent (for which Amdahl's Law limits the speedup to 1.0/0.05 = 20, even with an infinitely fast vector processor), Figure A-1 shows the speedup improvement in the 300 by 300 benchmark when using methods based on BLAS 2 routines. With a matrix-vector algorithm, the 300 by 300 benchmark yields speedups of between 10 and 12 over its scalar counterpart.
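At the source level, the operation behind such a BLAS 2 kernel is a matrix-vector update. The following hypothetical FORTRAN sketch (the routine and array names are illustrative) shows it written so that the inner loop runs down a column of M with unity stride; the hand-coded version in Example A-2 additionally keeps a strip of Y in a vector register across the entire J loop, which is what removes one vector load and one vector store from the innermost loop.

      SUBROUTINE MVUPD(N, M, X, Y)
C     Hypothetical matrix-vector update Y = Y + M*X (names are
C     illustrative).  Each pass of the inner loop is a unity stride
C     sweep down column J of M; written this way, Y is loaded and
C     stored once per column, which Example A-2 avoids by holding a
C     strip of Y in VR0 for the whole J loop.
      INTEGER N, I, J
      REAL M(N,N), X(N), Y(N)
      DO 20 J = 1, N
         DO 10 I = 1, N
            Y(I) = Y(I) + M(I,J)*X(J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END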
A-4 Algorithm Optimization Examples
------------------------------------------------------------
Example A-2 Core Loop of a BLAS 2 Routine Using Matrix-Vector Operations
------------------------------------------------------------
xGEMV - computes Y(I) = Y(I) + X(J)*M(I,J) where x = precision = F, D, G
        MSYNC                    ;synchronize with scalar
ILOOP:  VLDx Y(I),std,VR0        ;Y(I) is loaded into VR0
JLOOP:  VLDx M(I,J),std,VR1      ;VR1 gets a column of M(I,J)
        VSMULx X(J),VR1,VR2      ;VR2 gets the product of VR1
                                 ;and X(J) as a scalar
        VVADDx VR0,VR2,VR0       ;VR0 gets VR0 summed with VR2
        INC J
        IF (J < SIZ) GOTO JLOOP  ;Loop for all values of J
        VSTx VR0,Y(I),std        ;VR0 gets stored into Y(I)
        INC I
        IF (I < SIZ) GOTO ILOOP  ;Loop for all values of I
        MSYNC                    ;synchronize with scalar
------------------------------------------------------------
There are no set rules to follow when solving the largest problem size, a set of 1000 simultaneous equations. One potential tool for optimizing this benchmark is the LAPACK library, developed by Argonne National Laboratory in conjunction with the University of Illinois Center for Supercomputing Research and Development (CSRD). The LAPACK library features equation-solving algorithms that will block the data array into sections that fit into a given cache size. The LAPACK library calls not only the BLAS 1 and BLAS 2 routines but also a third level of BLAS, called matrix-matrix BLAS or the BLAS level 3. Example A-3 shows that a matrix-matrix multiply is at the heart of one BLAS 3 routine. The matrix multiplication computation can be blocked for modern architectures with cache memories. Highly efficient vectorized matrix multiplication routines have been written for the VAX vector architecture. For example, a double precision 64 by 64 matrix multiplication achieves over 85 percent of the peak MFLOPS on the Model 400 system. Performance can be further improved with other methods that increase the reuse of data while it is contained in the vector registers. For example, loop unrolling can be done until all the vector registers have been fully utilized. Partial results can be formed within the innermost A-5 Algorithm Optimization Examples
------------------------------------------------------------
Example A-3 Core Loop of a BLAS 3 Routine Using Matrix-Matrix Operations
------------------------------------------------------------
xGEMM - computes Y(I,J) = Y(I,J) + X(I,K)*M(K,J) where x = precision = F, D, G
        MSYNC                     ;synchronize with scalar
IJLOOP: VLDx Y(I,J),std,VR0       ;Y(1:N,J) gets loaded into VR0
KLOOP:  VLDx X(I,K),std,VR1       ;X(1:N,K) gets loaded into VR1
        VSMULx M(K,J),VR1,VR2     ;VR2 gets the product of VR1
                                  ;and M(K,J) as a scalar
        VVADDx VR0,VR2,VR0        ;VR0 gets VR0 summed with VR2
        INC K                     ;increment K by one
        IF (K < SIZ) GOTO KLOOP
        RESET K                   ;reset K
        VSTx VR0,Y(I,J),std       ;VR0 gets stored into Y(I,J)
        INC I                     ;increment I by vector length
        IF (I < SIZ) GOTO IJLOOP
        INC J                     ;increment J by one
        RESET I                   ;reset I
        IF (J < SIZ) GOTO IJLOOP
        MSYNC                     ;synchronize with scalar
------------------------------------------------------------
loop to minimize the loads and stores required. Because both rows and columns are traversed, the algorithm can be blocked for cache size. The VAX 6000 Model 400 exhibits vector speedups greater than 35 for the 64 by 64 matrix multiplication described above. Although the overall performance of the 1000 by 1000 size benchmark is less than a single 64 by 64 matrix multiplication, it does indicate the potential performance when blocking is used.
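The blocking idea mentioned above can be sketched in FORTRAN as follows. This is a hypothetical illustration only (the routine name, the block size NB, and the simple loop structure are illustrative; a production BLAS 3 routine would also unroll the loops and reuse vector registers as described in Section 3.10). The intent is that one NB by NB block of C, together with the panels of A and B that update it, stays in the 1-Mbyte cache while it is reused.

      SUBROUTINE MMBLK(N, NB, A, B, C)
C     Hypothetical blocked matrix multiply, C = C + A*B.  NB is the
C     block size, chosen so that a block of C and the panels of A
C     and B that touch it fit in the vector cache.
      INTEGER N, NB, I, J, K, IB, JB, KB
      REAL A(N,N), B(N,N), C(N,N)
      DO 60 JB = 1, N, NB
         DO 50 KB = 1, N, NB
            DO 40 IB = 1, N, NB
C              Update one block of C; the inner loop over I is
C              unity stride down the columns of A and C.
               DO 30 J = JB, MIN(JB+NB-1,N)
                  DO 20 K = KB, MIN(KB+NB-1,N)
                     DO 10 I = IB, MIN(IB+NB-1,N)
                        C(I,J) = C(I,J) + A(I,K)*B(K,J)
   10                CONTINUE
   20             CONTINUE
   30          CONTINUE
   40       CONTINUE
   50    CONTINUE
   60 CONTINUE
      RETURN
      END

Choosing NB so that the working set remains within the cache is exactly the block-size question discussed below for the 1000 by 1000 problem.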
Improving the performance of this benchmark is most challenging because the 1000 by 1000 matrix requires about eight times the vector cache size of 1 Mbyte. Further analysis is being conducted to determine the most efficient block size to use, that would maximize the use of BLAS 3 and remain within the size of the cache for a given block of code. The vectorized fraction increases to approximately 98 percent for the 1000 by 1000 benchmark. The proportion of vector arithmetics relative to vector load and stores is much improved for the BLAS 3s. Although the cache is exceeded, performance more than doubles when using a method that can block data based on the BLAS 3 algorithms. Therefore, the performance of the VAX 6000 Model 400 on the blocked Linpack 1000 A-6 Algorithm Optimization Examples by 1000 obtained a vector speedup of approximately 25, as shown in Figure A-1. ------------------------------------------------------------ A.2 SIGNAL PROCESSING--FAST FOURIER TRANSFORMS The Fourier transform decomposes a waveform, or more generally, a collection of data, into component sine and cosine representation. The discrete Fourier transform (DFT) of a data set of length N performs the transformation following the strict mathematical definition which requires O(N**2) floating-point operations. The fast Fourier transform (FFT), developed by Cooley and Tukey in 1965, reduced the number of operations to O(N x LOG[N]), improving computational speed significantly. Figure A-2 shows that the complex data in the bottom butterfly is multiplied in each stage by the appropriate weight. The result is then added to the top butterfly and subtracted from the bottom butterfly. If the algorithm is left in this configuration, it must use nonunity stride vectors, very short vectors, or masked arithmetic operations to perform the very small butterflies. ------------------------------------------------------------ A.2.1 Optimized One-Dimensional Fast Fourier Transforms The bit-reversal process that permutes the data to a form that enables the Cooley-Tukey algorithm to work is also shown in Figure A-2. When using vectors, a common approach to performing the bit-reversal reordering is to use vector gather or scatter instructions. These instructions allow vector loads and stores to be performed using an index register. Vector loads and stores require a constant stride. However, vector gather and scatter operations allow the user to build a vector of offsets to support indirect addressing in vector mode. Both gather and scatter instructions are available with VAX vectors. A vector implementation of the FFT algorithm has been developed that is well suited for the VAX vector architecture. One optimization made to the algorithm involves moving the bit-reversal section of the code to a place where the data permutation will benefit vector processing. By doing so, two goals are accomplished. First, the slower vector gather operations are moved to the center of the algorithm such that the data will already be in the vector cache. In Figure A-2, the first FFT stage starts out with large butterfly distances. After each stage the butterfly distance is halved. For the optimized version shown in Figure A-3, the bit-reversal permutation A-7 Algorithm Optimization Examples ------------------------------------------------------------ Figure A-2 Cooley-Tukey Butterfly Graph, One-Dimensional Fast Fourier Transform for N = 16 ------------------------------------------------------------ Refer to the printed version of this book, EK-60VAA-PG. 
------------------------------------------------------------ is performed as close to the center as possible, when the stage number = LOG(N)/2. To complete the algorithm, the butterfly distances now increase again. Second, this process entirely eliminates the need for short butterflies. Another optimization made to the FFT algorithm is the use of a table lookup method to access the sine and cosine factors, which reduces repetitive calls to the computationally intensive trigonometric functions. The initialization of this trigonometric table has been fully vectorized but shows only a modest factor of 2 performance gain. To build the table, a first order linear recurrence loop is formed that severely limits vector speedup. Because this calculation is only done once, it becomes negligible for multiple calls to the one-dimensional FFTs and for all higher dimensional FFTs. The benchmark shown in Figure A-4 was looped and includes the calculation of the trigonometric table performed once for each FFT data length. A-8 Algorithm Optimization Examples ------------------------------------------------------------ Figure A-3 Optimized Cooley-Tukey Butterfly Graph, One-Dimensional Fast Fourier Transform for N = 16 ------------------------------------------------------------ Refer to the printed version of this book, EK-60VAA-PG. ------------------------------------------------------------ Reusing data in the vector registers also saves vector processing time. The VAX vector architecture provides 16 vector registers. If all 16 registers are used carefully, data can be reused by two successive butterfly stages without storing and reloading the data. With half the number of loads and stores, the vector performance almost doubles. ------------------------------------------------------------ A.2.2 Optimized Two-Dimensional Fast Fourier Transforms The optimized one-dimensional FFT can be used to compute multidimensional FFTs. Figure A-5 shows how an N by N two- dimensional FFT can be computed by performing N one-dimensional column FFTs and then N one-dimensional row FFTs. The same routine can be called for column or row access FFTs by simply varying the stride parameter that is passed to the routine. (Note: In FORTRAN, the column A-9 Algorithm Optimization Examples ------------------------------------------------------------ Figure A-4 One-Dimensional Fast Fourier Transform Performance Graph, Optimized Single-Precision Complex Transforms ------------------------------------------------------------ Refer to the printed version of this book, EK-60VAA-PG. ------------------------------------------------------------ ------------------------------------------------------------ Figure A-5 Two-Dimensional Fast Fourier Transforms Using N Column and N Row One-Dimensional Fast Fourier Transforms ------------------------------------------------------------ Refer to the printed version of this book, EK-60VAA-PG. ------------------------------------------------------------ access is unity stride and the row access has a stride of the dimension of the array.) A-10 Algorithm Optimization Examples For improved performance on VAX vector systems, the use of a matrix transpose can dramatically increase the vector processing performance of two-dimensional FFTs for large values of N (that is, N > 256). The difference between unity stride and nonunity stride is the key performance issue. Figure A-6 shows that a vectorized matrix transpose can be performed after each set of N one-dimensional FFTs. 
The computation will be equivalent to Figure A-2 but with a matrix transpose: each one-dimensional FFT will be column access which is unity stride. The overhead of transposing the matrix becomes negligible for large values of N. ------------------------------------------------------------ Figure A-6 Two-Dimensional Fast Fourier Transforms Using a Matrix Transpose Between Each Set of N Column One-Dimensional Fast Fourier Transforms ------------------------------------------------------------ Refer to the printed version of this book, EK-60VAA-PG. ------------------------------------------------------------ A-11 Algorithm Optimization Examples When the value of N is relatively small (that is, N < 256), the two- dimensional FFT can be computed by calling a one-dimensional FFT of length N**2. The small two-dimensional FFT can achieve performance equal to that of the aggregate size one-dimensional FFT by linearizing the data array. Figure A-7 shows the tradeoff between using the linearized two-dimensional routine (for small N) and the transposed method (for large N) to maintain high performance across all data sizes. The optimization of an algorithm that vectorizes poorly in its original form has been shown. The resulting algorithm yields much higher performance on the VAX 6000 Model 400 processor. High performance is due to the unique way the algorithm touches contiguous memory locations and its effort to maximize the vector length. The implementation described above always uses unity stride vectors and always results in a vector length of 64 for FFT lengths greater than 2K (2 x 1024). ------------------------------------------------------------ Figure A-7 Two-Dimensional Fast Fourier Transform Performance Graph, Optimized Single-Precision Complex Transforms ------------------------------------------------------------ Refer to the printed version of this book, EK-60VAA-PG. ------------------------------------------------------------ A-12 ------------------------------------------------------------ Glossary accumulator: A register that accumulates data for arithmetic or logic operations. ALU: Arithmetic Logic Unit, a subset of the operation instruction function unit that performs arithmetic and logical operations, usually in binary form. Amdahl's Law: A mathematical equation that states that a system is dominated by its slowest process. arithmetic exception: A software error that occurs while performing an arithmetic or floating point operation. array: Elements or data arranged in rows and columns. array processor: A vector processor consisting of auxiliary hardware attached to a host CPU. It is typically attached by an I/O bus and is treated by the host as a foreign I/O device. Also called an attached vector processor. asynchronous: Pertaining to events that are scheduled without any specific time reference; not synchronized to a master clock. For example, while performing arithmetic operations, the vector processor operates by an asynchronous schedule to that of the scalar processor, which is free to perform other operations. benchmark: A program used to evaluate the performance of a computer for a given application. cache miss: The case when the processor cannot find an item in the cache that it needs to perform an operation; also called a cache fault. When this happens, the item is fetched from main memory at the slower main memory speed. 
chaining: A form of instruction overlap that uses a special hardware path to load the output of one function unit directly into the input of a second function unit, as well as into the destination register of the first instruction.
concurrent: Occurring during the same interval of time; may or may not be simultaneous. Glossary-1 Glossary
control word operand: The portion of the instruction that indicates which registers to use and enables or disables certain functions.
crossover point: The vector length at which the vector processor's performance exceeds that of the scalar processor.
data dependency: The case when data for one arithmetic instruction depends on the result of a previous instruction so both instructions cannot execute at the same time.
decomposition: Part of the compilation process that prepares code for parallel or vector processing. It includes dependency analysis, recognition of parallel or vector inhibitors, and code restructuring. Decomposition can be automatic (controlled entirely by the compiler), directed (controlled by compiler directives or statements in the source code), or by a combination of the two.
dependency analysis: An evaluation of how data flows through memory. The analysis is performed to identify data dependencies in a program unit and to determine the requirements they place on decomposition.
first-order linear recurrence: A cyclic data dependency in a linear function that involves one variable.
function unit: A section of the vector processor (or any other processor) that performs a specific function and operates independently from other units. Typically, a function unit performs related operations; for example, an add function unit may also perform subtraction since the operations are similar. Also called pipe.
gather: The collection of data from memory starting with a base address plus an offset address; the instruction that performs this action, placing the data into sequential locations in a vector register.
integrated vector processor: A vector processor consisting of a coprocessor that is tightly coupled with its host scalar processor.
interleaving: Using multiple memory boards in main memory so that one or more processors can concurrently access data that is distributed among different boards.
IOTA: An instruction to generate a compressed vector of offset addresses for use in a gather or scatter instruction. Glossary-2 Glossary
latency: The time that elapses from an element entering a pipeline until the first result is produced.
linear recurrence: A cyclic data dependency in a linear function.
LINPACK: An industry-wide benchmark used to measure the performance characteristics of various computers. Unlike some benchmarks, LINPACK is a real application package doing linear algebra calculations.
Livermore FORTRAN kernels: A set of loops used to measure the performance characteristics of various computers and the efficiency of vectorizing compilers. The 24 kernels are fragments of programs from scientific and engineering applications. Also known as Livermore Loops.
loop fusion: A transformation that takes two separate DO loops and makes a single DO loop out of them.
loop rolling: A transformation that combines parts of DO loops to allow for vectorization.
loop unrolling: A transformation that separates certain DO loops into smaller loops to reduce overhead.
loop-carried dependency: A data dependency that crosses iteration boundaries or that crosses between a loop and serial code. The dependency would not exist in the absence of the loop.
loop-independent dependency: A data dependency that occurs whether or not a loop iterates.
mask register: A vector control register used to select the elements of a vector that will participate in a particular operation.
megaflops: A measure of the performance of a computer--the rate at which the computer executes floating-point operations. Expressed in terms of "millions of floating-point operations per second." Known as MFLOPS.
memory bandwidth: The range of speeds at which a memory bus or path can carry data. The lowest speed is usually 0 (no data) and is, therefore, normally not mentioned. For example, a memory bus that can provide data at speeds from 0 to 512 megabytes per second is said to have a bandwidth of 512 Mbytes/s.
memory bank: An individual memory board. Groups of memory banks make up interleaved high-speed main memory. Glossary-3 Glossary
MFLOPS: See megaflops.
MIPS: A measure of the performance of a computer--the rate at which the computer executes instructions. Expressed in terms of "millions of instructions per second."
MTF: Match true/false: When masked operations are enabled, only elements for which the Vector Mask Register bit matches true (or false, depending on the condition) are operated upon.
optimizer: A program that scans code and changes the sequence or placement of the code to allow it to run faster and more efficiently on the scalar or vector processor. The program also reports dependencies that it cannot resolve.
overlapping: Executing two or more instructions so part of their execution occurs at the same time. For example, while one instruction is performing an arithmetic operation, another is storing results to memory.
parallelization: The part of the compilation process that prepares a section of source code for parallel processing.
parallel processing: Concurrent execution of multiple segments from the same program image.
peak MFLOPS: The theoretical maximum number of floating-point operations per second; a vector processor's best attainable performance.
pipe, or pipeline: A section of a processor that performs a specific function. For example, one pipe can be used for addition and subtraction, one for multiply, and one for load/store operations. Each operates independently from the others. Also called function unit.
pipeline length: The number of segments of a pipeline within one function unit, which is the limit of the number of elements that can be executed in that function unit at one time.
pipelining: A technique used in high-performance processors whereby a stream of data is processed in contiguous subtasks at separate stations along the pipeline.
recursion: A process in which one of its steps makes use of the results of steps from an earlier statement. Also called recurrence.
scalar: A single element or number. Glossary-4 Glossary
scalar operand: The symbolic expression representing the scalar data accessed when an operation is executed; for example, the input data or argument.
scatter: The process of storing data into memory starting with a base address plus an offset address; the instruction that performs this action, placing the data back into memory.
second-order linear recurrence: Two cyclic data dependencies occurring in a linear function.
speedup ratio: The vector processor performance divided by the scalar processor performance; indicates how many times faster the computer is with a vector processor installed than without it.
store: To move data from vector register to memory; for example, the VSTx command moves the results from the vector register back to memory.
stride: The number of memory locations (bytes) between the starting address of consecutive vector elements; for example: A vector has a starting address, a length of 10 elements, and a stride of 4 bytes between the start of each element.
stripmining: The process of splitting a vector into two or more subvectors, each of which will fit in a vector register, so each is processed sequentially by the vector processor.
sustained megaflops: The average floating-point performance achieved on a computer during some reference application (or benchmark).
translation buffer: A hardware or software mechanism to remember successive virtual address translations and virtual page addresses, used to save time when referencing memory locations on the same memory page.
translation buffer miss: An occurrence where the processor cannot translate the virtual address using the current contents of the translation buffer. In such a case, the processor is forced to load new information into the translation buffer to furnish the address.
unknown dependency: An unclear relationship between multiple references to a memory location; a relationship in which the final value of the location may or may not depend on serial execution of the code involved.
vector: A data structure composed of scalar elements with the same data type and organized as a simple linear sequence. Glossary-5 Glossary
vector instruction: A native computer instruction that recognizes a vector as a native data structure and that can operate on all the elements of a vector concurrently.
vector length register (VLR): A 7-bit register that controls the number of elements used in the vector registers, from 0 to 64.
vector load: To move data from memory to the vector registers. For example, the VLDx command moves data to the vector registers.
vector mask register (VMR): A 64-bit register that enables and disables the use of individual elements within a vector register.
vector operand: A string of scalar data items with the same data type that are processed in a single operation.
vector processor: A processor that operates on vectors, making use of pipelines that overlap key functional operations and performing the same operations repeatedly, to achieve high processing speeds.
vector processing: Execution of vector operations on a vector processor. A single vector operation is capable of modifying each element in a vector operand concurrently.
vector register: A high-speed buffer contained in the vector processor's CPU, consisting of word- or address-length bit sequences that are directly accessible by the processor. A VAX vector register can hold 64 elements of 64 bits each.
vectorizable: Capable of being converted to code that can be processed on a vector processor.
vectorization: Part of the compilation process that prepares a section of source code for vector processing.
vectorization factor: In a program, the fraction of code that can be converted to run on a vector processor. For example, a program with a vectorization factor above 70% will perform well on a system that has a vector processor.
vector-scalar operation: An operation such as add, subtract, and so forth, in which a scalar number operates with each element of a vector register and places results in matching elements of another vector register.
Glossary-6 Glossary vector-vector operation: An operation in which each element of one vector register operates with the corresponding element of a second vector register and then places the results in matching elements of a third vector register. Glossary-7 ------------------------------------------------------------ Index ------------------------------------------------------------ A ------------------------------------------------------------ Amdahl's Law · 1-19, A-3 Arithmetic data path chip · 2-4 instructions · 2-5 unit · 2-5 to 2-18, 2-19, 2-21 Arithmetic pipeline · 3-11 Array · 1-2, 1-6, 1-8, 1-13 Array index · 1-6 Attached vector processor · 1-6, 1-7 ------------------------------------------------------------ B ------------------------------------------------------------ Basic linear algebra subroutines (BLAS) · A-2 BLAS 2 routines · A-4 BLAS 3 routine · A-5 BLAS level 1 · A-2 Bus master · 2-5 ------------------------------------------------------------ C ------------------------------------------------------------ Cache · 1-7, 2-5, 2-7, 2-8, 2-12 to 2-16, 2-18, 3-5, A-4 Cache miss · 3-18, 3-21, 3-22 Chaining · 1-8, 1-18, 2-5 Chain into store · 2-18, 3-16 Chime · 1-13 Concurrent execution · 1-3, 1-12 Cooley-Tukey algorithm · A-7 butterfly graph · A-9 Crossover point · 1-22, 3-5 ------------------------------------------------------------ D ------------------------------------------------------------ Data cache · 2-8 registers · 2-9 Data dependencies · A-1 Deferred instruction · 2-20, 3-16, 3-20 Discrete Fourier transform · A-7 Double-precision · 2-6 Duplicate tag · 2-14, 2-15 tag store · 2-16 D_floating · 2-21 ------------------------------------------------------------ E ------------------------------------------------------------ Element · 1-2 Equation solvers · A-2 Exception reporting · 3-6 Execution time · 2-21 ------------------------------------------------------------ F ------------------------------------------------------------ Fast Fourier transform · A-7 Floating-point operations · 1-19 FORTRAN · 1-8, 3-2 Fourier transform · A-7 Function unit · 1-11, 1-18 F_floating · 2-21 ------------------------------------------------------------ G ------------------------------------------------------------ Gather · 1-15, 1-17, 2-19 Gather instruction · 3-15, A-7 G_floating · 2-21 ------------------------------------------------------------ I ------------------------------------------------------------ Imprecise exceptions · 3-10 Index vector · 1-17 Inhibiting constructs · 3-2 Instruction chaining · 1-18 decomposition · 2-17 execution overlap · 3-12 time · 3-16 Index-1 Index Instruction (Cont.) 
issue time · 3-16 overlap · 1-8, 1-18, 2-5 Integrated vector processor · 1-7 Invalidate queue · 2-16 ------------------------------------------------------------ L ------------------------------------------------------------ LAPACK library · A-5 Linpack · A-2 Load instruction · 3-15 Load/store instruction · 2-7, 3-13 pipeline · 2-18 unit · 2-7, 2-11, 2-16, 2-18, 2-21, 3-11 to 3-15 Locality of reference of data · 3-17 Longword · 2-6, 3-13 Loop unrolling · A-5 ------------------------------------------------------------ M ------------------------------------------------------------ Mask operate enable (MOE) · 3-13 register · 3-15 Masked memory instruction · 3-15 Matrix multiplication · 3-23, A-5 transpose · A-11 Maximize instruction overlap · A-1 Memory management · 2-11 exception · 2-14 exceptions · 3-8 fault · 3-11 fault priorites · 2-12 Memory management exceptions · 3-10 Memory Management Okay (MMOK) · 2-19, 3-14 Memory-to-memory architecture · 1-8 MFLOPS · 1-19 MIPS · 1-19 Modify intent bit (MI) · 3-13 Move From Vector Processor (MFVP) instruction · 3-6 ------------------------------------------------------------ N ------------------------------------------------------------ Nonunity stride · 3-18 ------------------------------------------------------------ O ------------------------------------------------------------ Offset · 1-17 vector register · 3-15 Overhead · 1-22 Overlap · 1-8, 1-13, 1-18, 2-20, 2-21, 3-15 Overlapping instructions · 3-15 ------------------------------------------------------------ P ------------------------------------------------------------ Page table entry · 2-11 Parallel pipelines · 1-13 Parity bit · 2-15 errors · 2-8 Peak MFLOPS · 1-19 Performance · 1-2, 1-3, 1-7 to 1-9, 1-11, 1-19, 1-21, 1-22 Pipe · 1-11 Pipeline · 1-18, 2-5, 2-17, 2-18, 2-21, 3-21 latency · 1-12, 1-13 Pipelining · 1-11, 1-13 Precise exceptions · 3-11 Program counter (PC) · 3-10 ------------------------------------------------------------ Q ------------------------------------------------------------ Quadword · 2-6, 2-14 ------------------------------------------------------------ R ------------------------------------------------------------ Register conflict · 3-12, 3-15, 3-17 file chip · 2-4 to 2-7, 2-9 offsets · 2-7 reuse · 3-25 Register length · 1-14 Register reuse · A-1, A-9 Register-to-register architecture · 1-8 Index-2 Index Return from Exception or Interrupt (REI) instruction · 3-7 ------------------------------------------------------------ S ------------------------------------------------------------ Scalar/vector memory synchronization · 3-7 to 3-9 Scalar/vector synchronization · 3-6 Scatter · 1-15, 1-17, 2-19 Scatter instruction · 3-13, 3-14, A-7 Scoreboarding · 2-5, 2-18 Sectioning · 1-14 SIMD · 1-3 Single-precision · 2-6 Speedup ratio · 1-22 Store operation · 3-13 Stride · 1-15, 3-13 Stripmining · 1-14 Subvector · 1-14 Synchronization · 3-6 Synchronize Vector Memory Access (VSYNC) instruction · 3-9 SYNC instruction · 3-6 ------------------------------------------------------------ T ------------------------------------------------------------ Translation buffer · A-4 Translation buffer (TB) · 2-7, 2-12, 2-14, 3-22 Translation-Not-Valid fault · 2-11 Trigonometric functions · A-8 Two-dimensional fast Fourier transforms · A-10 ------------------------------------------------------------ U ------------------------------------------------------------ Unity stride · 1-15, 3-18, 3-22, A-1 Unknown dependency · 1-9 ------------------------------------------------------------ V 
------------------------------------------------------------ VAX instruction set · 2-2 Vector · 1-2 Arithmetic Exception Register · 2-5 cache · 3-21 control unit · 2-5, 2-9 Vector (Cont.) Count Register · 2-5, 2-9 issue unit · 2-17, 2-19 length · 1-9, 3-21 Length Register · 1-9, 2-5, 2-9 Length Register (VLR) · 3-13 Mask Register · 1-9, 2-9 Mask Register (VMR) · 3-13 Memory Activity Check Register · 2-5 Processor Status Register · 2-5 register · 1-14 register file · 3-20 Vectorization factor · 1-21, 1-22 Vectorizing compiler · 1-8, 1-14 VIB · 2-2 Virtual address · 2-7, 2-8, 2-11, 3-14 VSTL instruction · 3-21 VSYNC instruction · 3-9 VVADDL instruction · 3-21 ------------------------------------------------------------ W ------------------------------------------------------------ Wall-clock time · 3-4 Writeback cache · 3-13 ------------------------------------------------------------ X ------------------------------------------------------------ XMI bus · 2-8, 2-13 interface · 2-16 Index-3