A cache-optimal algorithm is one whose number of cache misses is asymptotically minimal among all algorithms for the problem.

By way of example, first consider the straightforward matrix multiplication algorithm for multiplying two n x n matrices A and B to obtain C (assume C starts out zeroed):

OrdMult(A, B, C, n):
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C_ij = C_ij + A_ik * B_kj;
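
As a concrete sketch, the same algorithm might be written in C as follows (the flat row-major layout and the function name are assumptions for illustration, not part of the original pseudocode):

#include <stddef.h>

/* Naive triple-loop multiply: C = A * B for n x n row-major matrices.
 * C is assumed to be zero-initialized by the caller.
 * For a fixed i and j, the inner loop walks down column j of B, so once
 * B no longer fits in cache almost every B access touches a new line. */
void ord_mult(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
}
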
If we assume the (Z,L) ideal-cache model (a fully associative cache holding Z words, organized into lines of L words each, with an optimal replacement policy), we get the following cache-complexity analysis:

Assume the matrices are too large to fit in the cache, i.e. n^2 >> Z. Then for each of the n rows of A we must read all of B in again, which costs Θ(n^2/L) cache misses per row of A. Thus, the cache complexity of OrdMult is Θ(n^3/L). OrdMult is not cache-optimal.

A cache-optimal algorithm for matrix multiplication is as follows:

First, view the matrices A, B, and C as consisting of (n/s) x (n/s) sub-matrices, each of size s x s (assume for simplicity that s divides n). We choose s so that three s x s sub-matrices fit in the cache at the same time, i.e. s = Θ(sqrt(Z)). (Clearly, this choice means that the algorithm is cache-aware rather than cache-oblivious.) We then multiply these sub-matrices block by block:

BlockMult(A, B, C, n, s):
  for i = 1 to n/s
    for j = 1 to n/s
      for k = 1 to n/s
        OrdMult(A_ik, B_kj, C_ij, s);

Here A_ik, B_kj, and C_ij denote s x s sub-matrices, and OrdMult accumulates into C_ij.
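
A C sketch of the blocked scheme might look like this (assumptions for illustration: row-major layout, a hand-picked block size S, and n divisible by S; a real implementation would tune S to the target cache and handle ragged edges):

#include <stddef.h>

#define S 64  /* assumed block size; pick it so three S x S blocks of doubles fit in cache, i.e. S ~ sqrt(Z/3) */

/* Blocked multiply: C = A * B for n x n row-major matrices, n divisible by S.
 * C is assumed to be zero-initialized by the caller.
 * Each (i0, j0, k0) iteration multiplies one S x S block of A by one S x S
 * block of B and accumulates into an S x S block of C, so the three blocks
 * stay resident in cache for the whole inner computation. */
void block_mult(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i0 = 0; i0 < n; i0 += S)
        for (size_t j0 = 0; j0 < n; j0 += S)
            for (size_t k0 = 0; k0 < n; k0 += S)
                for (size_t i = i0; i < i0 + S; i++)
                    for (size_t j = j0; j < j0 + S; j++)
                        for (size_t k = k0; k < k0 + S; k++)
                            C[i*n + j] += A[i*n + k] * B[k*n + j];
}
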
Since each OrdMult call brings three s x s sub-matrices into the cache, it incurs Θ(s^2/L) cache misses. Thus, the overall cache complexity is Θ((n/s)^3 * s^2/L) = Θ(n^3/(sL)) = Θ(n^3/(sqrt(Z)L)).

This matches the known lower bound of Ω(n^3/(sqrt(Z)L)) cache misses for matrix multiplication (due to Hong and Kung), so BlockMult is cache-optimal.

In practice, the measured running time of OrdMult on large matrices grows noticeably faster than n^3 (often looking closer to n^4), and this is because of cache misses. BlockMult runs much closer to the theoretical Θ(n^3) behaviour.
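
One way to see this for yourself is a small timing harness like the one below (a sketch, assuming the ord_mult and block_mult functions from the earlier sketches are compiled in the same file; the size N is an arbitrary choice for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void ord_mult(const double *A, const double *B, double *C, size_t n);
void block_mult(const double *A, const double *B, double *C, size_t n);

#define N 1024  /* assumed test size, large enough that the matrices exceed the cache */

int main(void)
{
    double *A = calloc((size_t)N * N, sizeof *A);
    double *B = calloc((size_t)N * N, sizeof *B);
    double *C = calloc((size_t)N * N, sizeof *C);
    if (!A || !B || !C) return 1;

    /* Fill A and B with arbitrary values. */
    for (size_t i = 0; i < (size_t)N * N; i++) {
        A[i] = (double)rand() / RAND_MAX;
        B[i] = (double)rand() / RAND_MAX;
    }

    clock_t t0 = clock();
    ord_mult(A, B, C, N);
    printf("OrdMult:   %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    for (size_t i = 0; i < (size_t)N * N; i++) C[i] = 0.0;

    t0 = clock();
    block_mult(A, B, C, N);
    printf("BlockMult: %.2f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    free(A); free(B); free(C);
    return 0;
}
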

Exercise for the reader:
Divide-and-conquer matrix multiplication is a simple algorithm in which we divide each matrix into four equally-sized sub-matrices and recurse:

C_11 = A_11 * B_11 + A_12 * B_21
C_12 = A_11 * B_12 + A_12 * B_22
...
Prove that divide-and-conquer matrix multiplication, a cache-oblivious algorithm, is cache-optimal, i.e. that its cache complexity is Θ(n^3/(sqrt(Z)L)).
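
As a starting point, here is a sketch of the recursive scheme in C (assumptions for illustration: row-major layout with a fixed leading dimension ld, n a power of two, and a small cutoff below which it falls back to the naive loops; the cache-complexity proof is still yours to do):

#include <stddef.h>

#define CUTOFF 32  /* assumed base-case size */

/* Recursive multiply-accumulate: C += A * B, where A, B, C are n x n
 * sub-matrices embedded in larger row-major matrices with leading
 * dimension ld. The recursion never consults Z or L, which is what
 * makes the algorithm cache-oblivious. */
static void rec_mult(const double *A, const double *B, double *C,
                     size_t n, size_t ld)
{
    if (n <= CUTOFF) {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                for (size_t k = 0; k < n; k++)
                    C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
        return;
    }
    size_t h = n / 2;
    const double *A11 = A,          *A12 = A + h;
    const double *A21 = A + h*ld,   *A22 = A + h*ld + h;
    const double *B11 = B,          *B12 = B + h;
    const double *B21 = B + h*ld,   *B22 = B + h*ld + h;
    double       *C11 = C,          *C12 = C + h;
    double       *C21 = C + h*ld,   *C22 = C + h*ld + h;

    /* C_11 = A_11*B_11 + A_12*B_21, and likewise for the other quadrants. */
    rec_mult(A11, B11, C11, h, ld);  rec_mult(A12, B21, C11, h, ld);
    rec_mult(A11, B12, C12, h, ld);  rec_mult(A12, B22, C12, h, ld);
    rec_mult(A21, B11, C21, h, ld);  rec_mult(A22, B21, C21, h, ld);
    rec_mult(A21, B12, C22, h, ld);  rec_mult(A22, B22, C22, h, ld);
}
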
