A cache-optimal algorithm is one that minimizes the number of cache misses.

By way of example, first consider the straightforward matrix multiplication algorithm to multiply n x n matrices A and B to obtain C:

```OrdMult(A, B, C, n):
for i = 0 to n
for j = 0 to n
for k = 0 to n
C_ij = A_ik * B_kj;
```
If we assume the (Z,L) ideal-cache model, we get the following cache-complexity analysis:

Since for each of the n rows of A we must read in B all over again, we have n^2/L cache misses for each row of A. Thus, the cache complexity of OrdMult is Θ(n^3/L). OrdMult is not cache-optimal.

A cache-optimal algorithm for matrix multiplication is as follows:

First, view the matrices A and B as consisting of n/s x n/s sub-matrices, each of size s x s. We choose s such that three s x s matrices will fit into memory at the same time, i.e. s = Θ(sqrt(Z)). (Clearly, this assumption means that the algorithm is cache-aware and not a cache-oblivious algorithm.) We will then multiply these sub-matrices in blocks:

```BlockMult(A, B, C, n, s):
for i = 0 to n/s
for j = 0 to n/s
for k = 0 to n/s
OrdMult(A_ik, B_kj, C_ij, s);
```
Since each OrdMult call reads three s x s sized matrices into memory, it incurs s^2/ L cache misses. Thus, the overall cache complexity is Θ((n/s)^3*s^2/L) = Θ(n^3/(sL)) = Θ(n^3/(sqrt(Z)L)).

BlockMult is cache-optimal.

In practice, OrdMult is about O(n^4), and this is because of cache misses. BlockMult runs much closer to the theoretical Θ(n^3) value.

```C_11 = A_11 * B_11 + A_12 * B_21