There are three common reasons for aligning memory addresses with cachelines:
- Improving Performance
- Maintaining Atomicity
- Preventing False Sharing
The third reason has been widely discussed online, so we will not cover it here. Instead, we will focus on the first two reasons.
Cross-Line Performance⌗
Let’s start with the following program. It first allocates an 8K buffer. Note that the starting address returned by `malloc` is only aligned to 8 or 16 bytes, so we manually align it to 64 bytes for easier manipulation. The allocated 8K is the size of two pages, so regardless of the starting address the buffer contains a page boundary, which is a well-aligned point that is useful for testing. The rest of the program simply writes data repeatedly and measures the total time to compute the average time per store.
The program accepts an input parameter `offset`, which makes the writes start at `buf_pageend + offset`; this parameter therefore controls the starting position of the write operation.
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define BUF_SIZE 8192
#define ROUND 100000000UL

int main(int argc, char **argv)
{
    char *buf, *buf_pageend;
    unsigned long i __attribute__((aligned(64)));
    long offset __attribute__((aligned(64)));
    struct timespec start = {0, 0}, end = {0, 0};
    double start_ns, end_ns;

    if (argc != 2) {
        printf("missing args\n");
        exit(-1);
    }
    offset = atol(argv[1]); /* may be negative */

again:
    buf = malloc(BUF_SIZE);
    /* Align up to the next page boundary */
    buf_pageend = (char *)(((unsigned long)buf + 4095) & 0xfffffffffffff000UL);
    /* Ensure enough room before pageend in case offset is negative */
    if (buf_pageend - buf < 1024) {
        /* Intentionally no free(): freeing may hand back the same block */
        goto again;
    }
    memset(buf, 0, BUF_SIZE); /* Fault both pages in up front */

    printf("&i = %lx, &offset=%lx\n", (unsigned long)&i, (unsigned long)&offset);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ROUND; i++) {
        *((volatile unsigned long *)(buf_pageend + offset)) = 0; /* 8-byte store */
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    start_ns = start.tv_sec * 1000000000.0 + start.tv_nsec;
    end_ns = end.tv_sec * 1000000000.0 + end.tv_nsec;
    printf("ns per store: %lf\n", (end_ns - start_ns) / ROUND);
    return 0;
}
Now that we’ve gone through the program, let’s start testing:
On my system, the cacheline size of the L1 cache is 64 bytes:
$ getconf LEVEL1_DCACHE_LINESIZE
64
When memory is loaded into the cache, it is loaded as a whole cacheline, aligned to a 64-byte boundary; it is never loaded at a smaller granularity. For example, if you access address `0x2`, what actually gets loaded is the 64-byte block from `0x0` to `0x3F`.
In our program, `buf_pageend` points to the page boundary, which is aligned to 4K and therefore also to 64B (cacheline-aligned). So when we write anywhere within `[buf_pageend, buf_pageend+63]`, we touch only one cacheline, and the speed is the fastest.
$ ./a.out 0; ./a.out 1; ./a.out 56
&i = 7ffccd268780, &offset=7ffccd2687c0
ns per store: 0.429188
&i = 7ffc0f704140, &offset=7ffc0f704180
ns per store: 0.436388
&i = 7ffce5fedbc0, &offset=7ffce5fedc00
ns per store: 0.434455
However, if we pass in 57, then since we are writing 8 bytes, the write covers `[buf_pageend+57, buf_pageend+64]`. Notice that the last byte spills into the next cacheline, so the speed drops. The same applies to values 58 through 63.
$ ./a.out 57; ./a.out 58; ./a.out 63
&i = 7fff250ea3c0, &offset=7fff250ea400
ns per store: 0.847032
&i = 7ffe82952d40, &offset=7ffe82952d80
ns per store: 0.851936
&i = 7fff046c6f00, &offset=7fff046c6f40
ns per store: 0.844823
If you pass in 64, you are back to writing within a single cacheline (`[buf_pageend+64, buf_pageend+127]`), and the speed recovers.
$ ./a.out 64
&i = 7ffd12741400, &offset=7ffd12741440
ns per store: 0.438385
Why does cross-line access slow down? Because a single write now has to touch two cachelines to complete, roughly doubling the time per store.
If the access not only crosses cachelines but also crosses pages, it gets even slower. For example, if the first 4 bytes of an 8-byte write fall at the end of one page and the last 4 bytes at the beginning of the next, the two virtual pages typically map to non-contiguous physical pages. x86 L1 data caches are VIPT (Virtually Indexed, Physically Tagged), so when the CPU looks up a cacheline it must translate the virtual address to a physical one to match the tag. Because our data spans two cachelines, the CPU translates the address and writes the first cacheline, then translates again to write the second. This translation is slow, and slower still on a TLB miss. A cross-page write therefore pays both the cross-cacheline penalty and the cross-page penalty, resulting in a significant slowdown.
$ ./a.out -1; ./a.out -7
&i = 7fff143b3400, &offset=7fff143b3440
ns per store: 10.084426
&i = 7ffcfb6bb8c0, &offset=7ffcfb6bb900
ns per store: 10.089472
Of course, if you pass in `-8`, the write falls entirely within the previous cacheline, and the speed is fast again.
Atomicity⌗
We know that the MESI cache-coherence protocol keeps data consistent when multiple CPUs read and write the same cacheline, which means reads and writes within a single cacheline are inherently atomic. But how does cross-line access break that atomicity? Let’s look at the following program.
The main process creates 80 child processes, and all parent and child processes share a memory region. Forty of the children repeatedly write 8 bytes of all `f`s to an address, while the other forty repeatedly write 8 bytes of all `0`s. The main process repeatedly reads those 8 bytes. If atomicity is preserved, the main process should only ever see all `0`s or all `f`s; any other value means atomicity has been violated.
The program accepts an input parameter `offset`, which makes the writes start at `buf + offset`, allowing us to control the starting position of the write operation.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define BUF_SIZE 1024

int main(int argc, char **argv)
{
    char *buf;
    int pid, status;
    int count = 80;
    int i, j;
    int write_count = 10000000;
    int read_count = 30000000;
    unsigned long data;
    int offset;

    if (argc != 2) {
        printf("missing args\n");
        exit(-1);
    }
    offset = atoi(argv[1]);

    buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memset(buf, 0xdd, BUF_SIZE);
    /* mmap returns page-aligned memory, so buf is already 64B-aligned;
     * keep the mask as an explicit safety net */
    buf = (char *)(((unsigned long)buf + 63) & 0xffffffffffffffc0UL);

    for (i = count; i; i--) {
        pid = fork();
        if (pid == 0) {
            /* Child: odd pids write all f's, even pids write all 0's */
            if (getpid() % 2)
                data = 0xffffffffffffffffUL;
            else
                data = 0x0UL;
            for (j = write_count; j; j--)
                *(volatile unsigned long *)(buf + offset) = data;
            exit(0);
        }
    }

    /* Parent: look for torn values; volatile keeps the load inside the loop */
    for (j = read_count; j; j--) {
        data = *(volatile unsigned long *)(buf + offset);
        if (data != 0xffffffffffffffffUL && data != 0)
            printf("non-atomic: %016lx\n", data);
    }
    while ((pid = wait(&status)) > 0)
        ;
    return 0;
}
If you pass `offset` values like 0, 1, ..., 56, the 8-byte writes clearly fall within `[buf, buf+63]`, i.e., within a single cacheline, so there is no issue with atomicity.
$ ./a.out 0; ./a.out 10; ./a.out 56;
As we can see, there is no output, indicating that no non-atomic situations occurred.
However, if you pass in 57, 58, ..., 63, the 8-byte write spans two cachelines, and there is no atomicity guarantee between the store to the first line and the store to the second. As a result, the 80 child processes can interfere with each other. For example, with an offset of 60, half of the value lands in one cacheline and the other half in another, so the main process may read a mix of two processes' data, such as `0xffffffff00000000` or `0x00000000ffffffff`. All `0`s or all `f`s can also be produced by torn writes, but those cases are indistinguishable from atomic ones just by looking.
$ ./a.out 60;
non-atomic: ffffffff00000000
non-atomic: 00000000ffffffff
non-atomic: 00000000ffffffff
non-atomic: 00000000ffffffff
non-atomic: ffffffff00000000
non-atomic: 00000000ffffffff
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
non-atomic: ffffffff00000000
To avoid such situations, you would need a `lock`-prefixed instruction at the assembly level. The best solution, however, is simply to keep the data aligned.
Compiler Optimization⌗
Code compiled with `-O0` is unoptimized and full of unrelated performance bottlenecks: instruction-address misalignment, branch mispredictions, pipeline bubbles, store-forwarding stalls, and so on. These overheads are large enough to obscure the cross-line difference, so the programs above must be compiled with `-O2` to eliminate the irrelevant bottlenecks.
Additionally, under `-O2` the compiler unrolls constant-count loops and performs dead-store elimination, so intermediate assignments may be removed or collapsed into a single store. That is why the assignment is written as:
*(volatile unsigned long *)(buf + offset) = xxx;
The `volatile` qualifier tells the compiler not to optimize the statement away, even when the same value is written repeatedly: every store must be performed exactly as written.