The Role of Aligning Memory Addresses With Cachelines

There are three common reasons for aligning memory addresses with cachelines:

Improving Performance
Maintaining Atomicity
Preventing False Sharing

The third reason has been widely discussed online, so we will not cover it here. Instead, we will focus on the first two reasons.

Cross-Line Performance

Let’s start with the following program. It first allocates 8K of memory. Note that the starting address returned by malloc is typically aligned to only 8 or 16 bytes, so we manually align it to 64 bytes, the cacheline size, to make the offsets easier to control. The allocation is 8K, the size of two 4K pages, so regardless of where it starts it must contain a page boundary. That boundary is a convenient, well-aligned reference point for testing. The rest of the program simply writes data repeatedly and measures the total time, from which it calculates the average time per operation.
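The alignment arithmetic described above can be sketched as follows. This is a minimal illustration, not the article's actual benchmark: the helper names `align_up`, `crosses_cacheline`, and `first_page_boundary` are hypothetical, and the 64-byte cacheline and 4K page size are assumptions taken from the text rather than values queried from the system.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE_SIZE 64u    /* assumed cacheline size, as in the text */
#define ASSUMED_PAGE_SIZE 4096u /* assumed page size (8K = two pages) */

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Nonzero if an access of `size` bytes starting at `addr` touches two cachelines. */
static int crosses_cacheline(uintptr_t addr, size_t size)
{
    return (addr / CACHELINE_SIZE) != ((addr + size - 1) / CACHELINE_SIZE);
}

/*
 * Return the first page boundary at or after buf. For an 8K buffer this
 * always lies less than 4K past the start, so at least one full page of
 * the buffer follows it.
 */
static unsigned char *first_page_boundary(unsigned char *buf)
{
    return (unsigned char *)align_up((uintptr_t)buf, ASSUMED_PAGE_SIZE);
}
```

A test write can then be placed at a chosen offset relative to the boundary. For example, an 8-byte write starting 4 bytes before a multiple of 64 straddles two cachelines, while the same write starting 8 bytes before it does not.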


Revisiting the Volatile Keyword in C

The Most Common Usage

When a variable is declared volatile, it tells the compiler that even though the code currently being compiled does not modify the variable, the contents of the corresponding memory may change for other reasons. There are many possible reasons. For example, the memory location behind the variable may be mapped to a peripheral port through memory-mapped I/O. In that case we are really accessing a hardware register, and the value of that register can change independently of the program.
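As a small illustration of the memory-mapped register case, the sketch below polls a status register through a volatile pointer. The register layout, the `STATUS_READY` bit, and the function name `wait_ready` are all hypothetical; on real hardware the pointer would come from the platform's register map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STATUS_READY 0x1u /* hypothetical "device ready" bit */

/*
 * Poll a status register until the ready bit is set or max_polls reads
 * have been made. Because the pointee is volatile, the compiler must
 * perform a real load on every iteration instead of reading the value
 * once and reusing it.
 * Returns 1 if the bit was observed, 0 on timeout.
 */
static int wait_ready(const volatile uint32_t *status_reg, unsigned max_polls)
{
    assert(status_reg != NULL);
    for (unsigned i = 0; i < max_polls; i++) {
        if (*status_reg & STATUS_READY)
            return 1;
    }
    return 0;
}
```

Without the volatile qualifier, an optimizing compiler would be free to hoist the load out of the loop, since nothing in the loop body writes to the location.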


The Principle of Workingset in the Linux Kernel

Overview of Basic Concepts

The concept of workingset was first introduced by Professor Peter Denning in 1968. The fundamental idea is that programs exhibit temporal locality in their memory usage. At time $t$, the workingset of a program is considered to be the set of pages accessed by the process during the interval $(t-\tau, t)$, denoted as $W(t, \tau)$, where $\tau$ is known as the workingset parameter, a user-defined value. The original paper suggested that memory not accessed within $\tau$ could be reclaimed through sampling.
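To make the definition concrete, here is a minimal sketch that computes the size of $W(t, \tau)$ over a recorded page-reference trace. It assumes discrete time with one page reference per tick and treats the window as the last $\tau$ references ending at index `t`; the function name `working_set_size` and the trace representation are hypothetical and are not how the kernel tracks the workingset.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the distinct pages referenced in the window of at most tau
 * references ending at index t (inclusive). trace[i] is the page number
 * referenced at tick i. The window is clamped at the start of the trace.
 */
static size_t working_set_size(const int *trace, size_t t, size_t tau)
{
    assert(trace != NULL);
    size_t start = (t + 1 >= tau) ? t + 1 - tau : 0;
    size_t distinct = 0;

    for (size_t i = start; i <= t; i++) {
        int seen = 0;
        /* A page counts only the first time it appears in the window. */
        for (size_t j = start; j < i; j++) {
            if (trace[j] == trace[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

For the trace 1, 2, 1, 3, 2, 2, 4 with `t = 6` and `tau = 3`, the window covers the references 2, 2, 4, so the workingset contains two pages. Pages 1 and 3 fall outside the window and would be candidates for reclaim.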


Understanding the Implementation of qspinlock in the Linux Kernel

The qspinlock is implemented on top of the MCS spinlock. MCS stands for Mellor-Crummey and Scott, the surnames of its two creators, John Mellor-Crummey and Michael Scott. The main idea is to have each spinner spin on its own per-CPU variable, thereby avoiding constant cacheline bouncing between CPUs. Cacheline bouncing occurs when all CPUs spin-wait on the same lock variable and therefore read it repeatedly. When one CPU unlocks, the variable is modified, which invalidates the cached copy on every other CPU, and each of them must then re-read the variable. This creates significant performance overhead. The MCS lock mitigates this by having each CPU spin on its own dedicated variable, so waiters no longer contend on a single lock variable.
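The idea of spinning on a private variable can be sketched in userspace with C11 atomics. This is a simplified illustration of the classic MCS algorithm, not the kernel's qspinlock or its MCS code: the type and function names are hypothetical, the queue node is supplied by the caller rather than taken from a per-CPU area, and the spin loops omit any pause or relax hint.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next; /* successor in the wait queue */
    atomic_bool locked;              /* this waiter's private spin flag */
};

struct mcs_lock {
    _Atomic(struct mcs_node *) tail; /* last waiter, or NULL if free */
};

static void mcs_init(struct mcs_lock *lock)
{
    assert(lock != NULL);
    atomic_init(&lock->tail, NULL);
}

static void mcs_acquire(struct mcs_lock *lock, struct mcs_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&node->locked, true, memory_order_relaxed);

    /* Append ourselves to the queue; the previous tail is our predecessor. */
    struct mcs_node *prev =
        atomic_exchange_explicit(&lock->tail, node, memory_order_acq_rel);
    if (prev == NULL)
        return; /* queue was empty: lock acquired immediately */

    /* Link behind the predecessor, then spin only on our own flag. */
    atomic_store_explicit(&prev->next, node, memory_order_release);
    while (atomic_load_explicit(&node->locked, memory_order_acquire))
        ;
}

static void mcs_release(struct mcs_lock *lock, struct mcs_node *node)
{
    struct mcs_node *next =
        atomic_load_explicit(&node->next, memory_order_acquire);
    if (next == NULL) {
        /* No visible successor: try to swing the tail back to NULL. */
        struct mcs_node *expected = node;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor swapped the tail but has not linked itself yet. */
        while ((next = atomic_load_explicit(&node->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    /* Hand over the lock by clearing only the successor's flag. */
    atomic_store_explicit(&next->locked, false, memory_order_release);
}
```

Each thread passes its own `struct mcs_node` to `mcs_acquire` and the same node to `mcs_release`. The only shared location every waiter touches is `tail`, and it is written once per acquisition. The unlock writes to a single successor's flag, so only that one waiter's cacheline is invalidated.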
