Optimizing Chunk Access

In general, one should access memory in a contiguous manner, but sometimes it is impossible to do so without incurring other performance degradations. In this post I'll talk about one possible way to circumvent this.

How cache lines (roughly) work

First, let's think about why contiguous access is optimal. Whenever you need to access a memory address, the cache is first asked whether it holds that memory. Eventually you will get a cache miss, meaning the memory you're accessing isn't in any cache. Whenever there's a cache miss, the processor has to read from main memory, and memory around that address will be read as well. For example, the Sandy Bridge architecture employed in many modern Intel processors has a 64-byte cache line, meaning that whenever you access main memory, you'll bring a 64-byte chunk of memory into the cache (maybe not all at once). If we're accessing memory contiguously, subsequent accesses within each 64-byte chunk will (likely) be cache hits. Want to know how all of this works in detail? Ulrich Drepper has your back.
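Curious what the cache line size is on your own machine? On Linux with glibc you can ask sysconf; here's a quick sketch (the _SC_LEVEL1_DCACHE_LINESIZE query is a glibc extension, so don't count on it being portable):

#include <unistd.h>   // sysconf
#include <iostream>

int main() {
    // Reports the L1 data cache line size in bytes (64 on most modern
    // x86 CPUs); may return 0 or -1 if the value isn't known
    const long line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    std::cout << "L1 cache line size: " << line_size << " bytes\n";
    return 0;
}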

So just how big of a deal is a cache miss? Peter Norvig provided rough estimates of various operations:

Operation                       Time (ns)
execute typical instruction     1.0
fetch from L1 cache memory      0.5
fetch from L2 cache memory      7.0
fetch from main memory          100.0

Actual timings will likely vary quite a bit, but expect roughly an order of magnitude difference each time you move up a level in the memory hierarchy. See the entire table on Norvig's site.

What can we do?

Suppose we store our chunk data in YZX order; that is, the greatest stride is along the y axis, followed by z, and finally x. Optimal iteration would look like this:

for(int y = 0; y < RADIUS; ++y) {
    for(int z = 0; z < RADIUS; ++z) {
        for(int x = 0; x < RADIUS; ++x) {
            const auto &voxel = chunk.voxel(x, y, z);
            // Do stuff with voxel...
        }
    }
}
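For concreteness, here's a minimal sketch of what a chunk with flat YZX storage might look like; Chunk, RADIUS, and the choice of short for voxel data are all assumptions for illustration:

#include <vector>

struct Chunk {
    // Voxels per axis; the value here is assumed
    static constexpr int RADIUS = 16;

    // Flat YZX storage: y has the largest stride (RADIUS*RADIUS),
    // then z (RADIUS), then x (1)
    std::vector<short> voxels = std::vector<short>(RADIUS * RADIUS * RADIUS);

    const short &voxel(int x, int y, int z) const {
        return voxels[RADIUS * (RADIUS * y + z) + x];
    }
};

With this layout, consecutive x values sit next to each other in memory, which is exactly why x belongs in the innermost loop above.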

If we can process voxels in an order-independent manner, that's how we should do it. We can't always access memory in a cache-efficient manner, though. For example, greedy meshing accesses voxel data in slices. In many cases like these, we can alleviate some of the pain by prefetching data into the cache. We are in control of the code, and we know our access patterns, so we can warm up the cache with data we know we will soon use. This is similar to how a processor prefetches instructions.

Clang and GCC have a special function to prefetch data: __builtin_prefetch. We pass in the memory address to prefetch, whether we need it for reading (0) or writing (1), and the temporal locality of the prefetch (0, 1, 2, or 3, where 0 means ephemeral and 3 means "keep it around in all caches as long as possible").
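For instance, a read-only prefetch of data we expect to reuse heavily might look like this (warm_cache is an illustrative name, not a real API):

// Hint that the cache line holding *ptr will soon be read, and that
// it should stick around in all cache levels (locality = 3)
void warm_cache(const short *ptr) {
    __builtin_prefetch(ptr, 0, 3);
}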

Here's an example of an access pattern that is inefficient, yet gets nearly a 2x speedup from prefetching:

// https://gist.github.com/thegedge/55dab0bfa87296926dc0

#include <chrono>
#include <iostream>
#include <vector>

// Width of each dimension
constexpr int N = 200;

// Number of iterations for the test
constexpr int NUM_ITERATIONS = 100;

// Data type we'll use for testing
using DataType = short;

// 1D array index from a 3D coordinate
int index(int x, int y, int z) {
    return N*(N*y + z) + x;
}

void test1() {
    using namespace std::chrono;

    // Heap-allocate: N*N*N shorts is ~16 MB, far too large for the stack
    const std::vector<DataType> data(N*N*N, 0);
    int sum = 0;

    const auto start = high_resolution_clock::now();
    for(int iteration = 0; iteration < NUM_ITERATIONS; ++iteration) {
        for(int x = 0; x < N; ++x) {
            for(int z = 0; z < N; ++z) {
                for(int y = 0; y < N; ++y) {
                    if(x > 0 && x + 1 < N) {
                        sum += data[index(x - 1, y, z)] / 2;
                        sum += data[index(x, y, z)];
                        sum += data[index(x + 1, y, z)] / 2;
                    } else {
                        sum += data[index(x, y, z)];
                    }
                }
            }
        }
    }

    const auto elapsed = high_resolution_clock::now() - start;
    const auto ms = duration_cast<milliseconds>(elapsed).count();
    std::cout << sum << " in " << ((1.0 * ms) / NUM_ITERATIONS) << "ms per loop\n";
}

void test2() {
    using namespace std::chrono;

    // Heap-allocate: N*N*N shorts is ~16 MB, far too large for the stack
    const std::vector<DataType> data(N*N*N, 0);
    int sum = 0;

    const auto start = high_resolution_clock::now();
    for(int iteration = 0; iteration < NUM_ITERATIONS; ++iteration) {
        for(int x = 0; x < N; ++x) {
            for(int z = 0; z < N; ++z) {
                for(int y = 0; y < N; ++y) {
                    if(x > 0 && x + 1 < N) {
                        sum += data[index(x - 1, y, z)] / 2;
                        sum += data[index(x, y, z)];
                        sum += data[index(x + 1, y, z)] / 2;
                    } else {
                        sum += data[index(x, y, z)];
                    }

                    // Prefetch the voxel we'll need 20 iterations ahead in
                    // the y loop (read-only, ephemeral). Prefetch
                    // instructions don't fault, so running past the end of
                    // the data near y = N is harmless in practice.
                    const int prefetch_index = index(x, y + 20, z);
                    __builtin_prefetch(data.data() + prefetch_index, 0, 0);
                }
            }
        }
    }

    const auto elapsed = high_resolution_clock::now() - start;
    const auto ms = duration_cast<milliseconds>(elapsed).count();
    std::cout << sum << " in " << ((1.0 * ms) / NUM_ITERATIONS) << "ms per loop\n";
}

int main() {
    test1();
    test2();
    return 0;
}
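To try it yourself, something like g++ -O2 -std=c++11 prefetch.cpp should do with either GCC or Clang (the file name is arbitrary). Results will vary with compiler, optimization level, and hardware, so measure on your own machine: a poorly placed prefetch can evict data you still need and slow things down instead.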