A cache-optimal algorithm is one that minimizes the number of cache misses.

By way of example, first consider the straightforward matrix multiplication algorithm to multiply n x n matrices A and B to obtain C:

OrdMult(A, B, C, n):
    for i = 0 to n-1
        for j = 0 to n-1
            for k = 0 to n-1
                C_ij += A_ik * B_kj;
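The same triple loop can be written out as a minimal C sketch (the function name `ord_mult` and row-major storage are our assumptions; note that each entry of C accumulates n products, so the caller must zero C first):

```c
#include <stddef.h>

/* Straightforward O(n^3) multiplication of row-major n x n matrices.
 * C must be zeroed by the caller; each C[i][j] accumulates n products. */
void ord_mult(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            /* += rather than = so blocked callers can reuse this routine */
            C[i*n + j] += sum;
        }
}
```

The inner loop walks A along a row (stride 1, cache-friendly) but B down a column (stride n), which is exactly where the misses in the analysis below come from.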

If we assume the (Z, L) ideal-cache model, we get the following cache-complexity analysis:

Since for each of the n rows of A we must read all of B in again (assuming B is too large to fit in the cache), we incur n^2/L cache misses per row of A. Thus, the cache complexity of OrdMult is Θ(n^3/L). OrdMult is **not** cache-optimal.

A cache-optimal algorithm for matrix multiplication is as follows:

First, view the matrices A and B as consisting of n/s x n/s sub-matrices, each of size s x s. We choose s such that three s x s matrices fit into the cache at the same time, i.e. s = Θ(sqrt(Z)). (Clearly, this choice means the algorithm is cache-aware, not cache-oblivious.) We then multiply these sub-matrices in blocks:

BlockMult(A, B, C, n, s):
    for i = 0 to n/s - 1
        for j = 0 to n/s - 1
            for k = 0 to n/s - 1
                OrdMult(A_ik, B_kj, C_ij, s);
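A minimal C sketch of the blocked scheme, under the assumptions that n is a multiple of the tile size s and matrices are stored row-major (the function name `block_mult` is ours; in a cache-aware setting s would be chosen offline so that three s x s tiles fit in the cache):

```c
#include <stddef.h>

/* Blocked (tiled) matrix multiplication. n must be a multiple of s.
 * The three outer loops pick an (i,j,k) triple of s x s tiles; the
 * three inner loops are an ordinary multiply of those tiles,
 * accumulating into C. C must be zeroed by the caller. */
void block_mult(const double *A, const double *B, double *C,
                size_t n, size_t s)
{
    for (size_t i = 0; i < n; i += s)
        for (size_t j = 0; j < n; j += s)
            for (size_t k = 0; k < n; k += s)
                /* tile multiply: C[i..][j..] += A[i..][k..] * B[k..][j..] */
                for (size_t ii = i; ii < i + s; ii++)
                    for (size_t jj = j; jj < j + s; jj++) {
                        double sum = 0.0;
                        for (size_t kk = k; kk < k + s; kk++)
                            sum += A[ii*n + kk] * B[kk*n + jj];
                        C[ii*n + jj] += sum;
                    }
}
```

The inner triple loop touches only three s x s tiles at a time, which is what lets the whole working set stay resident in the cache.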

Since each OrdMult call reads three s x s matrices into the cache, it incurs Θ(s^2/L) cache misses. Thus, the overall cache complexity is Θ((n/s)^3 * s^2/L) = Θ(n^3/(sL)) = Θ(n^3/(sqrt(Z)L)).

BlockMult is cache-optimal.

In practice, OrdMult's running time grows more like n^4 because of cache misses; BlockMult runs much closer to the theoretical Θ(n^3) bound.

Exercise for the reader:

Divide-and-conquer matrix multiplication is a simple algorithm in which we divide each matrix into four equally sized sub-matrices:

C_11 = A_11 * B_11 + A_12 * B_21
C_12 = A_11 * B_12 + A_12 * B_22
...

Prove that divide-and-conquer matrix multiplication, a cache-oblivious algorithm, is cache-optimal (i.e. that its cache complexity is Θ(n^3/(sqrt(Z)L))).
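For concreteness, the recursion can be sketched in C as follows. This is our illustration, not part of the exercise: the function name `dc_mult` and the row-stride parameter `ld` are our assumptions, n is assumed to be a power of two, and quadrants are addressed by pointer offset so no copying occurs:

```c
#include <stddef.h>

/* Cache-oblivious divide-and-conquer multiply, C += A * B.
 * Matrices are row-major with fixed row stride ld, so each quadrant
 * is just a pointer offset into the original storage.
 * n must be a power of two; C must be zeroed by the top-level caller. */
void dc_mult(const double *A, const double *B, double *C,
             size_t n, size_t ld)
{
    if (n == 1) {                       /* base case: scalar multiply-add */
        C[0] += A[0] * B[0];
        return;
    }
    size_t h = n / 2;
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h*ld,   *A22 = A + h*ld + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h*ld,   *B22 = B + h*ld + h;
    double       *C11 = C,          *C12 = C + h,
                 *C21 = C + h*ld,   *C22 = C + h*ld + h;

    /* Eight recursive half-size multiplies, one per term of the
     * equations above, e.g. C11 += A11*B11 + A12*B21. */
    dc_mult(A11, B11, C11, h, ld);  dc_mult(A12, B21, C11, h, ld);
    dc_mult(A11, B12, C12, h, ld);  dc_mult(A12, B22, C12, h, ld);
    dc_mult(A21, B11, C21, h, ld);  dc_mult(A22, B21, C21, h, ld);
    dc_mult(A21, B12, C22, h, ld);  dc_mult(A22, B22, C22, h, ld);
}
```

The algorithm never mentions Z or L; once the sub-problems become small enough to fit in the cache, the recursion stops missing on its own, which is the crux of the proof.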