Writing a small file blocks for 20 ms

I discovered that on my Ubuntu 22 server, writing to a file often incurs a delay of around 20 ms, even when only a few bytes are written.

Here is some basic code that demonstrates the problem:

#include <iostream>
#include <fstream>
#include <chrono>
#include <filesystem>
#include <thread>

using namespace std;

void go_c() {
  FILE *out = fopen("hello.txt", "w");
  fputs("hello", out);
  fclose(out);
}

void go_cpp () {
  ofstream out("hello.txt");
  out<<"hello"<<endl;
}

double test(void (*f)()) {
  typedef chrono::time_point <chrono::steady_clock> tp;

  tp t0 = chrono::steady_clock::now();
  f();
  tp t1 = chrono::steady_clock::now();

  return chrono::duration<double>(t1-t0).count() * 1000; //milliseconds
}

void bench(void (*f)(), const char *txt, int delay_ms) {
  filesystem::remove("hello.txt");

  for (int i=0;i<5;i++) {
    double t = test(f);
    cerr<<i<<": "<<txt<<", time = "<<t<<" ms"<<endl;
    this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
  }

  cerr<<endl;
}

int main () {
  bench(go_c, "C Write", 0);
  bench(go_cpp, "C++ Write", 0);
  bench(go_c, "C Write with delay", 2500);
  bench(go_cpp, "C++ Write with delay", 2500);

  return 0;
}

And here is the output:

ubuntu@captain:~/scratch$ g++ -o write3 write3.cpp -O2 -Wall
ubuntu@captain:~/scratch$ ./write3
0: C Write, time = 0.09978 ms
1: C Write, time = 21.9316 ms
2: C Write, time = 0.185957 ms
3: C Write, time = 0.140212 ms
4: C Write, time = 0.139051 ms

0: C++ Write, time = 0.145766 ms
1: C++ Write, time = 0.091845 ms
2: C++ Write, time = 0.139618 ms
3: C++ Write, time = 0.130834 ms
4: C++ Write, time = 0.132217 ms

0: C Write with delay, time = 0.048674 ms
1: C Write with delay, time = 0.23875 ms
2: C Write with delay, time = 20.8626 ms
3: C Write with delay, time = 8.4307 ms
4: C Write with delay, time = 19.4026 ms

0: C++ Write with delay, time = 17.1555 ms
1: C++ Write with delay, time = 17.5887 ms
2: C++ Write with delay, time = 18.9792 ms
3: C++ Write with delay, time = 25.8653 ms
4: C++ Write with delay, time = 20.7998 ms

The delay seems more likely to happen if there is a small pause between attempts, and also if the file already exists.

(Apparently I can improve my server's performance by deleting a file before writing to it, which seems illogical.)

The hardware is a 6-core Xeon 2386G with dual NVMe SSDs in software RAID; the machine runs almost nothing else and has no other users.


See the attached screenshot for the output of uname -a and dpkg --list | grep -E "libc6?-(dev|bin)".


Answer

This answer is meant to explain the observed behavior and to investigate what is going on.

TL;DR: the problem clearly comes from the active power state of the NVMe device, which can be tuned so that a huge wake-up overhead is not paid. The overhead shows up late because of asynchronous IO operations and deferred waits in the Linux kernel. Please read the section “NVMe device power management” (at the end) for the fix.


Under the hood: basic profiling

First of all, I can partially reproduce the effect on my machine (Debian with Linux kernel 5.15.0-2, a 6-core Intel i5-9600KF and an NVMe “Samsung 970 Evo Plus” SSD) under some conditions. More specifically, a latency of ~6 ms is visible only during the second C test (2: C Write, time =), and only when I do not run GCC just before to generate the binary. All other timings are below 1 ms (generally in the range 0.1-0.4 ms). Updating the kernel to version 5.18.0-2 reduced the latency to 2.5 ms (+/- 0.5 ms). The timings are stable and deterministic enough to report this information.

A quick analysis using the great Linux perf tool shows that a significant portion of the time is spent in kernel scheduling-related functions when the latency spike happens (seen with perf record). Fortunately, we can also analyze what the scheduler does with perf. The command perf sched record ./a.out records scheduling information, and perf sched timehist -Vg displays the schedule (note that the perf command is sometimes suffixed with its version on some systems, e.g. perf_5.4). Here is an example on my machine:

           time    cpu  0123456  task name                       wait time  sch delay   run time
                                 [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  -------  ------------------------------  ---------  ---------  ---------
   95730.123845 [0005]       i   <idle>                              0.000      0.000      0.000 
   95730.123845 [0002]    s      perf_5.4[55033]                     0.000      0.000      0.000 
   95730.124721 [0000]  i        <idle>                              0.000      0.000      0.000 
   95730.124725 [0000]  s        gmain[805/804]                      0.000      0.025      0.004 
   95730.124761 [0000]  i        <idle>                              0.004      0.000      0.036 
   95730.124765 [0000]  s        gmain[805/804]                      0.036      0.025      0.003 
   95730.124789 [0003]     i     <idle>                              0.000      0.000      0.000 
   95730.124795 [0003]     s     kworker/u12:2-e[45157]              0.000      0.023      0.006 
   95730.124822 [0002]    i      <idle>                              0.000      0.000      0.976 
   95730.124833 [0000]  i        <idle>                              0.003      0.000      0.068 
   95730.124834 [0005]       s   a.out[55040]                        0.000      0.026      0.988 
   95730.124836 [0000]  s    |   gmain[805/804]                      0.068      0.038      0.003 
   95730.124838 [0002]    s  |   sudo[54745]                         0.000      0.028      0.015 
   95730.124849 [0003]     i |   <idle>                              0.006      0.000      0.053 
   95730.124854 [0003]     s |   kworker/u12:2-e[45157]              0.053      0.027      0.004 
   95730.124886 [0002]    i  |   <idle>                              0.015      0.000      0.048 
   95730.124901 [0002]    s  |   sudo[54745]                         0.048      0.033      0.015 
   95730.124916 [0003]     i |   <idle>                              0.004      0.000      0.062 
   95730.124922 [0003]     s |   kworker/u12:2-e[45157]              0.062      0.024      0.005 
   95730.124945 [0004]      i|   <idle>                              0.000      0.000      0.000 
   95730.124987 [0004]      s|   gnome-terminal-[8464]               0.000      0.024      0.042 
   95730.127461 [0003]     i |   <idle>                              0.005      0.000      2.539 
   95730.127474 [0005]       i   <idle>                              0.988      0.000      2.639 
   95730.127475 [0003]     s     kworker/u12:2-e[45157]              2.539      0.023      0.013 
   95730.127516 [0000]  i  |     <idle>                              0.003      0.000      2.679 
   95730.127519 [0000]  s  |     gmain[805/804]                      2.679      0.027      0.003 
   95730.127530 [0005]     | s   a.out[55040]                        2.639      0.001      0.056 
   95730.127537 [0003]     i |   <idle>                              0.013      0.000      0.062 
   95730.127549 [0005]       i   <idle>                              0.056      0.000      0.018 
   95730.127550 [0003]     s     kworker/u12:2-e[45157]              0.062      0.023      0.013 
   95730.127566 [0004]      i    <idle>                              0.042      0.000      2.578 
   95730.127568 [0004]      s    kworker/u12:4-e[54041]              0.000      0.026      0.002 
   95730.127585 [0002]    i      <idle>                              0.015      0.000      2.683 
   95730.127585 [0000]  i        <idle>                              0.003      0.000      0.065 
   95730.127588 [0000]  s        gmain[805/804]                      0.065      0.026      0.003 
   95730.127591 [0005]       s   a.out[55040]                        0.018      0.001      0.042 
   95730.127595 [0002]    s  |   sudo[54745]                         2.683      0.043      0.009 
   95730.127605 [0004]      i|   <idle>                              0.002      0.000      0.037 
   95730.127617 [0005]       i   <idle>                              0.042      0.000      0.026 
   95730.127618 [0004]      s    kworker/u12:4-e[54041]              0.037      0.028      0.013 
   95730.127635 [0003]     i     <idle>                              0.013      0.000      0.085 
   95730.127637 [0003]     s     kworker/u12:2-e[45157]              0.085      0.027      0.002 
   95730.127644 [0003]     i     <idle>                              0.002      0.000      0.007 
   95730.127647 [0003]     s     kworker/u12:2-e[45157]              0.007      0.000      0.002 
   95730.127650 [0003]     i     <idle>                              0.002      0.000      0.003 
   95730.127652 [0003]     s     kworker/u12:2-e[45157]              0.003      0.000      0.001 
   95730.127653 [0003]     i     <idle>                              0.001      0.000      0.001 
   95730.127659 [0003]     s     kworker/u12:2-e[45157]              0.001      0.000      0.006 
   95730.127662 [0002]    i      <idle>                              0.009      0.000      0.067 
   95730.127662 [0000]  i        <idle>                              0.003      0.000      0.073 
   95730.127666 [0000]  s        gmain[805/804]                      0.073      0.036      0.003 
   95730.127669 [0003]     i     <idle>                              0.006      0.000      0.010 
   95730.127672 [0004]      i    <idle>                              0.013      0.000      0.053 
   95730.127673 [0003]     s     kworker/u12:2-e[45157]              0.010      0.000      0.004 
   95730.127674 [0004]      s    kworker/u12:4-e[54041]              0.053      0.026      0.002 
   95730.127676 [0004]      i    <idle>                              0.002      0.000      0.001 
   95730.127678 [0004]      s    kworker/u12:4-e[54041]              0.001      0.001      0.002 
   95730.127679 [0002]    s      sudo[54745]                         0.067      0.052      0.016 
   95730.127692 [0001]   i       <idle>                              0.000      0.000      0.000 
   95730.127717 [0001]   s       gnome-terminal-[8464]               2.704      0.019      0.024 
   95730.127725 [0005]       s   a.out[55040]                        0.026      0.001      0.107 
   95730.127755 [0002]    i  |   <idle>                              0.016      0.000      0.075 

The time on the left is in seconds, and the 0123456 column shows the schedule of the active tasks on the cores: s means the task is scheduled and i means it is interrupted. I added the | symbols to make it easier to see when the tracked process is running (a.out is the program executing your code). The execution time printed by the program is 2.68278 ms, so we are basically looking for a 0.0027-second gap in the timestamps (note that I find the idle timestamps unreliable, since they appear to indicate the end of the idle period rather than its beginning, i.e. the moment the task is suspended).

The schedule shows that the process runs for 0.988 ms, is then interrupted for 2.639 ms, runs again for 0.056 ms, is interrupted again for 0.018 ms, runs again, etc. (the rest of the execution is not shown for the sake of clarity). The first interruption matches the reported spike timing very well (especially since we should also count the time for the process to warm up again and for the scheduler to perform the context switch).

During the interruption of the program, two tasks are woken up: a kernel thread called kworker/u12:2-e and a task called gmain, which certainly belongs to gnome-shell. The kernel thread starts when the program is interrupted and is itself interrupted when the program resumes (with a 7 us delay). Also note that the kernel thread takes 2.539 ms to run.

Perf can provide some information about kernel calls: you can do that by adding the options --kernel-callchains --call-graph dwarf to sched record. Unfortunately, the results are not very useful in this case. The only useful information is that the kernel call chain io_schedule <- folio_wait_bit_common <- folio_wait_writeback <- truncate_inode_partial_folio is executed while the program is interrupted during the spike. This proves that the program is interrupted because of an IO operation. You can also add the --wakeups flag to see the wakeup events and observe that the suspiciously slow kernel thread is woken up by the target program (while the previous ones are woken up by other tasks, typically gmain or gnome-terminal).

strace -T ./a.out can be used to track the time spent in system calls, and it clearly shows that the third call to openat is slow on my machine. Here is the interesting part (reformatted for the sake of clarity):

unlink:             0.000072 s
openat:             0.000047 s
newfstatat:         0.000007 s
write:              0.000044 s
close:              0.000006 s
[...] (write x 7)
openat:             0.000019 s
newfstatat:         0.000005 s
write:              0.000011 s
close:              0.000022 s
[...] (write x 7)
openat:             0.002334 s  <----- latency spike
newfstatat:         0.000057 s
write:              0.000080 s
close:              0.000052 s
[...] (write x 7)
openat:             0.000021 s
newfstatat:         0.000005 s
write:              0.000029 s
close:              0.000014 s
[...]

Based on the gathered information, we can clearly say that system calls like openat or close do not always cause the program to 1. be interrupted and 2. start a kernel thread doing the actual operation on the SSD. Instead, IO calls appear to be aggregated/cached in RAM, and the completion/synchronization with the SSD is only performed at specific moments. The latency spike happens only when a kernel thread does the work and the task is interrupted. My guess is that IO operations are performed in RAM (certainly asynchronously) and that the kernel sometimes flushes/syncs the in-RAM data to the SSD, which is what takes a few milliseconds. The reason for such a delay is unclear. In any case, it means the operation is probably latency bound.
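
If this guess is right, paying the flush cost synchronously should move the latency back into the measured region. Below is a minimal experiment sketch (not part of the original benchmark; it uses the POSIX open/write/fsync calls) that one could substitute for go_c to check this: fsync blocks until the page-cache data has been written back to the device.

// Hypothetical variant of go_c (an assumption, not from the original post): pay the
// write-back cost inside the measured region instead of leaving it to a kernel thread.
#include <fcntl.h>
#include <unistd.h>

void go_c_sync() {
  int fd = open("hello.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return;                                 // keep the sketch simple: bail out on error
  if (write(fd, "hello", 5) != 5) { /* short write ignored in this sketch */ }
  fsync(fd);                                          // wait until the data reaches the device
  close(fd);
}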


Into the abyss: kernel profiling

To understand exactly what is going on, one unfortunately needs to trace kernel threads, and possibly even the SSD driver stack, which is a bit tricky. The simplest way to do that seems to be the Linux Function Tracer (aka ftrace). Note that tracing all kernel functions is very expensive and hides the cost of the expensive ones, so the granularity should be adjusted. Kernel traces quickly become monstrously big, and function names are often not very helpful. On top of that, tracing a kernel thread is not easy because its pid is unknown before the request is made, and operations are done in a multi-threaded context (and concurrently on each core due to context switches).

I tried it on my machine, and here is a simplified profiling trace of the program (keeping only calls >= 1 us and omitting IRQ/fault kernel calls for the sake of clarity):

__x64_sys_unlink();               92.569 us
__x64_sys_openat();               48.103 us
__x64_sys_newfstatat();            2.609 us
__x64_sys_write();                33.606 us
exit_to_user_mode_prepare();      12.517 us
__x64_sys_write();                 8.277 us
__x64_sys_write();                 2.482 us
__x64_sys_write();                 2.257 us
__x64_sys_write();                 2.240 us
__x64_sys_write();                 5.987 us
__x64_sys_write();                 5.090 us
__x64_sys_openat();               77.006 us    <------ Fast
__x64_sys_newfstatat();            2.433 us
__x64_sys_write();                43.509 us
exit_to_user_mode_prepare();      83.260 us
__x64_sys_write();                 5.688 us
__x64_sys_write();                 6.339 us
__x64_sys_write();                 4.521 us
__x64_sys_write();                 3.006 us
__x64_sys_write();                 4.309 us
__x64_sys_write();                 3.472 us
__x64_sys_write();                 2.669 us
__x64_sys_openat() {
    [CONTEXT SWITCH: a.out-73884 => <idle>-0]
    [MISSING PART: KERNEL THREAD]
    [CONTEXT SWITCH: <idle>-0 => a.out-73884]
}                               2441.794 us    <----- Latency spike
__x64_sys_newfstatat();            3.007 us
__x64_sys_write();                74.643 us
exit_to_user_mode_prepare();      64.822 us
__x64_sys_write();                24.032 us
__x64_sys_write();                 3.002 us
__x64_sys_write();                 2.408 us
__x64_sys_write();                 4.181 us
__x64_sys_write();                 3.662 us
__x64_sys_write();                 2.381 us
__x64_sys_write();                23.284 us
__x64_sys_openat();               79.258 us
__x64_sys_newfstatat();           27.363 us
__x64_sys_write();                45.040 us
[...]

The kernel trace proves that a context switch happens during __x64_sys_openat (i.e. the syscall made by the fopen call) and that the latency spike happens at the same time.

Deeper tracing shows the function responsible for the wait:

__x64_sys_openat
  do_sys_openat2
    do_filp_open
      path_openat
        do_truncate
          notify_change
            ext4_setattr
              truncate_pagecache
                truncate_inode_pages_range
                  truncate_inode_partial_folio
                    folio_wait_writeback
                      folio_wait_bit
                        io_schedule
                          schedule
                            [task interruption] <---- takes ~95% of the time

Meanwhile, the first call to openat goes through truncate_inode_pages_range but not truncate_inode_partial_folio, so there is no task interruption and no kernel thread needed to complete the task. In fact, every call to openat on “hello.txt” causes truncate_inode_pages_range to be called, but only two calls to truncate_inode_partial_folio are made over the first 5 calls to fopen. This function always calls schedule in practice, but only the first call is slow (subsequent calls take 20-700 us with an average of 30 us). The name of the truncate_pagecache function tends to confirm that a cache is involved, but this does not explain why subsequent calls to schedule are faster.
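
This truncation path also lines up with the observation in the question that deleting the file before writing avoids the delay: when the file is removed first, fopen(..., "w") creates a fresh inode and there is presumably no dirty folio whose write-back must be awaited. Here is a small sketch of that variant (hypothetical; go_c_no_truncate is not from the original post, and it only sidesteps the wait in this micro-benchmark rather than fixing the underlying cause discussed below):

#include <cstdio>

// Hypothetical variant (assumption): remove the old file so that the subsequent
// fopen(..., "w") creates a new inode instead of truncating one that may still
// have folios waiting to be written back.
void go_c_no_truncate() {
  std::remove("hello.txt");                // drop the old inode; nothing left to truncate
  FILE *out = std::fopen("hello.txt", "w");
  if (!out) return;
  std::fputs("hello", out);
  std::fclose(out);
}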

When tracing the kernel threads, I ended up with traces like:

finish_task_switch.isra.0:   0.615 us
preempt_count_sub:           0.111 us
wq_worker_running:           0.246 us
_raw_spin_lock_irq:          0.189 us
process_one_work:           24.092 us <----- Actual kernel thread computation

This basically shows that the most important part (>95% of the time) is missing from the profiling traces. Unfortunately, tracing functions like the above (as well as using eBPF tools like ext4slower-bpfcc) resulted in very inconsistent timings, mainly because of the way the time is measured (the absolute time is missing). One needs to trace the IO-based subsystems to understand this further.

The event profiling of the lower-level NVMe stack (using the ftrace-based trace-cmd tool) shows that the first NVMe request is slow and subsequent ones are fast:

nvme_setup_cmd:    173732.202096  <----- First request started
nvme_sq:           173732.204539  <----- ~2.5 ms delay
nvme_complete_rq:  173732.204543
nvme_setup_cmd:    173732.204873  <----- Second request started
nvme_sq:           173732.204892  <----- No delay
nvme_complete_rq:  173732.204894
nvme_setup_cmd:    173732.205240
nvme_sq:           173732.205257  <----- Same here
nvme_complete_rq:  173732.205259
[...]

The fact that the second openat call is slow is probably due to synchronization (the IO scheduler waits for the previous request to complete). The most probable explanation is that the NVMe device enters its sleep mode when no request has been sent for a relatively long time, and it takes time to wake up.


NVMe device power management

These sleep modes are called Active Power States. They can be listed using the command smartctl -a /dev/nvme0 (from the smartmontools package):

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     3.40W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0100W       -        -    4  4  4  4     2000    8000

The latency of power states 3 and 4 is quite large, but their power consumption is also far smaller than that of the others. This can be controlled with the nvme command (from the nvme-cli package), more specifically its get-feature and set-feature sub-commands. You can find more information about this here. In my case, I simply wrote 1000 into the file /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us so that the latency cannot exceed 1 ms (this requires root privileges). Note that this file is reset when the machine reboots.
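
For completeness, the same setting can also be applied programmatically; the sketch below simply writes the value into the sysfs file mentioned above (the nvme0 path and the 1000 us value come from my machine, root privileges are still required, and the setting does not survive a reboot):

#include <fstream>

// Hedged sketch: equivalent to `echo 1000 > /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us`.
// The device path is machine-specific.
bool set_nvme_latency_tolerance_us(unsigned tolerance_us) {
  std::ofstream f("/sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us");
  if (!f) return false;                    // typically fails without root privileges
  f << tolerance_us << '\n';
  return static_cast<bool>(f);
}
// Usage: set_nvme_latency_tolerance_us(1000);  // cap the wake-up latency at ~1 ms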

WARNING: preventing the SSD from switching to deep sleep modes can reduce battery life on notebooks and increase the device temperature. In pathological cases (i.e. a poor NVMe controller), this could damage the device if it is not properly cooled. That being said, most devices are protected against this by a throttling strategy.

Once the power QoS is modified, the latency spike is gone. Et voilà! I get the following program output:

0: C Write, time = 0.289867 ms
1: C Write, time = 0.285233 ms
2: C Write, time = 0.225163 ms
3: C Write, time = 0.222544 ms
4: C Write, time = 0.212254 ms
[...]

Note that this explains why the latency differs from one machine to another (as does the idle time needed before the device enters its sleep mode), and why running GCC just beforehand prevented the latency spike (the compilation keeps the SSD busy, so it does not go to sleep).
