Fortran is a column-major language, which means that all the elements of a column in a two-dimensional matrix are adjacent in memory.
double precision :: matrix(1:2, 1:3) Memory map: 2000 matrix(1,1) 2008 matrix(2,1) 2016 matrix(1,2) 2024 matrix(2,2) 2032 matrix(1,3) 2040 matrix(2,3)
As discussed in Chapter 13, Computer Hardware, most modern computers use virtual memory, which extends the apparent size of RAM using swap space on disk.
When a program accesses a given memory address, it is really using a virtual address, and the data stored at that virtual address could be in RAM or in swap space. If the data is in RAM, it will take nanoseconds to access. If it is in swap, it will take milliseconds, about 1,000,000 times longer.
Because it is more efficient to access disk in large chunks than in individual bytes, virtual memory is divided into pages. If a particular virtual address is in RAM, then the entire page surrounding it is in RAM, and if it is in swap, the entire page is in swap.
The goal when writing code is to minimize the frequency of accessing a different page than the last one accessed. If we access a value that is in swap, then we have to go to disk, which is expensive. At this point, the system moves the entire page into RAM, not just the value. This is called a page fault. If the next memory access is in the same page, there is no chance of causing a page fault.
Every time a page fault occurs, your program has to wait for the slow disk access, and loses a lot of time:
4 ms on a CPU that runs 3 billion instructions per sec = 4 ms 1 sec 3 billion inst x ------- x -------------- = 12,000,000 instructions 1000 ms 1 sec
Hence, if we structure our programs so that they do as many consecutive memory references as possible within the same page, we will reduce the number of page faults, and increase program performance.
This is where it is useful to know that Fortran is a column-major language. Since each column of a two-dimensional array is grouped together in memory, much or all of the column will be in the same page. Therefore, if our program works within a column for as long as possible, instead of jumping to a different column every time, it may run faster.
Example 27.1. Comparison of Row-major and Col-major Access
Initialize a 20,000 * 20,000 matrix of doubles (400,000,000 doubles * 8 bytes = 3.2 gigabytes).
The code below traverses the matrix in row-major order, i.e. is completely sweeps one row before going to the next. In other words, it access memory addresses sequentially over the entire 2D array. This program took an average of 1.8 seconds to run over several trials on a 2.6 GHz Core i5 processor with 8 GiB RAM.
#include <stdio.h> #include <stdlib.h> #include <sysexits.h> #define ROWS 20000 #define COLS ROWS int main(int argc,char *argv[]) { static double mat[ROWS][COLS]; size_t row, col; for (row = 0; row < ROWS; ++row) { for (col = 0; col < COLS; ++col) mat[row][col] = row + col; } // Access a random element of the matrix to prevent smart optimizers // from eliminating the loop entirely when the results are not used printf("%f\n", mat[random() % ROWS][random() % COLS]); return EX_OK; }
If we switch the inner and outer loops, the program will traverse the matrix in column-major order, i.e. it will sweep each column before going to a new one, which means it accesses a different column (and hence a different page) every time it accesses the array. This causes memory addresses to be accessed non-sequentially. This slight change to the code caused the program to take an average of 10.5 seconds instead of 1.8.
#include <stdio.h> #include <stdlib.h> #include <sysexits.h> #define ROWS 20000 #define COLS ROWS int main(int argc,char *argv[]) { static double mat[ROWS][COLS]; size_t row, col; for (col = 0; col < COLS; ++col) { for (row = 0; row < ROWS; ++row) mat[row][col] = row + col; } // Access a random element of the matrix to prevent smart optimizers // from eliminating the loop entirely when the results are not used printf("%f\n", mat[random() % ROWS][random() % COLS]); return EX_OK; }
Note that when reading or writing a matrix, we don't have much choice but to access it in row-major order. This doesn't make that much difference, since I/O is slow anyway. Traversal order is most important when processing a matrix entirely in memory. (Matrix addition, multiplication, Gauss elimination, etc.)