Benchmarking in Cycles
How to measure low-latency codepaths with rdtsc, rdtscp, and lfence.
Low-latency code demands benchmarking at a very granular level with high resolution. If you want to know where code actually spends its time, you need a measurement that is both precise and low overhead.
Why high-level timers fall short
Elapsed-time benchmarking is familiar and easy to understand, but when you are trying to benchmark code that may take only a few hundred cycles, it starts to break down.
High-level timers rely on lower-level calls that can introduce hundreds of cycles of overhead. That level of overhead is unacceptable for very small measurements because it can end up dominating the result.
Elapsed-time benchmarking is also harder to compare across machines. Running the same benchmark on two machines operating at different frequencies will produce different elapsed-time outputs. That can create the illusion of more optimized code when in reality it is just a faster machine.
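To make that concrete, consider the same 300-tick region measured on two machines. The small sketch below (the helper name and the TSC frequencies are assumptions for illustration, not values the counter reports portably) shows how the wall-clock numbers diverge even though the cycle count is identical:

```rust
// Hypothetical helper: convert an elapsed TSC tick count to
// nanoseconds, given an assumed TSC frequency in Hz.
fn ticks_to_ns(ticks: u64, tsc_hz: u64) -> f64 {
    ticks as f64 * 1e9 / tsc_hz as f64
}

fn main() {
    // The same 300-tick measurement reads very differently as
    // elapsed time on a 3.0 GHz machine versus a 4.5 GHz machine.
    println!("{:.1} ns", ticks_to_ns(300, 3_000_000_000)); // 100.0 ns
    println!("{:.1} ns", ticks_to_ns(300, 4_500_000_000)); // 66.7 ns
}
```

Comparing raw tick counts sidesteps that distortion, which is part of why cycle-level measurement travels better between machines.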
The timestamp counter
Modern x86 processors expose a 64-bit counter called the timestamp counter (TSC) that increments at a fixed rate.
On most modern systems, the TSC is invariant: it ticks at a constant rate regardless of frequency scaling and power states, and it remains consistent across cores, so it behaves like a stable time base for measurement.
You can read the value of this counter before and after the execution of some code to see how many TSC ticks elapsed while that code ran. You can check whether your system supports this with the following:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# Should print tsc
dmesg | grep -i tsc
# Look for "tsc: Refined TSC clocksource calibration"
grep -o 'constant_tsc\|nonstop_tsc' /proc/cpuinfo | head -2
# Should print constant_tsc and nonstop_tsc if both are supported
With the TSC, you can read the counter before executing a block of code and then again after to measure the elapsed TSC ticks for that region. On x86, this is done with rdtsc and rdtscp. However, you cannot just place them before and after the code block and assume the result is correct. A proper setup looks like this:
#[inline(always)]
fn rdtsc_start() -> u64 {
    unsafe {
        // Ensure all earlier instructions have completed before we
        // read the starting timestamp.
        std::arch::x86_64::_mm_lfence();
        std::arch::x86_64::_rdtsc()
    }
}

#[inline(always)]
fn rdtscp_end() -> u64 {
    unsafe {
        let mut aux = 0u32;
        // rdtscp waits for earlier instructions to complete before
        // reading the counter; aux receives the IA32_TSC_AUX value.
        let tsc = std::arch::x86_64::__rdtscp(&raw mut aux);
        // Keep later instructions from starting inside the window.
        std::arch::x86_64::_mm_lfence();
        tsc
    }
}

pub fn main() {
    let start = rdtsc_start();
    // something
    let end = rdtscp_end();
    let ticks = end - start;
}
There is more going on here than just starting and stopping a timer. The two key pieces are the lfence and the use of rdtsc at the start and rdtscp at the end.
Out-of-order execution and lfence
CPUs are free to execute instructions out of order whenever doing so does not change the program’s visible behavior. This keeps the pipeline busy, but it can distort benchmarking if the timestamp reads are allowed to move relative to the code being measured.
Suppose we want to benchmark only I4 and I5 in the following sequence:
I0 I1 I2 rdtsc I4 I5 rdtscp I6
The CPU is free to reorder instructions, so the timestamp reads might not occur exactly where we expect. For example, it may determine that rdtsc can execute before I1, which would cause I1 and I2 to be included in the measurement even though they should not be.
We need a way to fence off the measured region and prevent that movement. That is why lfence is used: it does not execute until every earlier instruction has completed locally, and no later instruction begins executing until the lfence itself completes. The lfence before rdtsc therefore ensures that earlier instructions finish before we read the starting TSC value.
rdtsc vs rdtscp
The starting measurement is taken with rdtsc, while the ending measurement is taken with rdtscp.
A plain rdtsc is not serializing, so the CPU is free to move it around relative to nearby instructions. At the start of the measured region, that is handled by placing an lfence right before it. That fence ensures earlier work does not leak into the timing window.
At the end of the region, we use rdtscp instead of another plain rdtsc because rdtscp is partially serializing. It will not let the timestamp read happen before earlier instructions in the measured region have completed, which makes it much better suited for closing the timing window. We then place a final lfence after it so that later instructions do not begin executing too early and blur the end of the measurement.
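Putting the two halves together, one way to package the pattern is a small wrapper that times a closure. This is a hypothetical sketch (the `measure` name and shape are not from any library), and it only compiles on x86_64:

```rust
#[inline(always)]
fn rdtsc_start() -> u64 {
    unsafe {
        // Fence so earlier work cannot leak into the timing window.
        std::arch::x86_64::_mm_lfence();
        std::arch::x86_64::_rdtsc()
    }
}

#[inline(always)]
fn rdtscp_end() -> u64 {
    unsafe {
        let mut aux = 0u32;
        let tsc = std::arch::x86_64::__rdtscp(&mut aux);
        std::arch::x86_64::_mm_lfence();
        tsc
    }
}

// Hypothetical wrapper: run a closure and return its result along
// with the elapsed TSC ticks.
fn measure<T>(f: impl FnOnce() -> T) -> (T, u64) {
    let start = rdtsc_start();
    let out = f();
    let end = rdtscp_end();
    (out, end.wrapping_sub(start))
}

fn main() {
    let (sum, ticks) = measure(|| (0u64..1_000).sum::<u64>());
    println!("sum = {sum}, ticks = {ticks}");
}
```

Returning the closure's result keeps the compiler from deciding the measured work is dead code and deleting it.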
A few practical notes
Reading the TSC correctly is only part of the job. It gives you a much better measurement primitive, but it does not automatically make the benchmark good.
If you want trustworthy results, you still need to warm up the code path, benchmark many times, isolate cores, and control affinity. The measurement itself may be low overhead, but the system you are measuring is still full of moving pieces.
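As a sketch of that discipline (the harness and names here are hypothetical, not a standard API), you can warm up first, take many samples, and report the minimum, which is the sample least contaminated by interference:

```rust
use std::hint::black_box;

// Hypothetical harness: `sample` runs the measured region once and
// returns its cost in TSC ticks (e.g. built on rdtsc/rdtscp).
fn bench(mut sample: impl FnMut() -> u64, warmup: usize, iters: usize) -> u64 {
    // Warm caches and branch predictors before recording anything.
    for _ in 0..warmup {
        black_box(sample());
    }
    // The minimum over many runs approximates the uncontended cost.
    (0..iters).map(|_| sample()).min().unwrap()
}

fn main() {
    // Stand-in sampler with canned tick counts; a real one would
    // time actual code with the TSC primitives above.
    let mut vals = [90u64, 40, 70].into_iter().cycle();
    let best = bench(|| vals.next().unwrap(), 3, 3);
    println!("min ticks = {best}"); // 40
}
```

Whether minimum, median, or a full distribution is the right summary depends on what you are optimizing for; the point is to look at many samples rather than one.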
Used this way, the TSC gives you a much more reliable way to measure very small code regions than naive wall-clock timing.