I discovered that on my Ubuntu 22 server, attempting to write to a file often induces a delay of around 20 ms, even when writing only a few bytes.
Here is some basic code that demonstrates the problem:
#include <cstdio>
#include <iostream>
#include <fstream>
#include <chrono>
#include <filesystem>
#include <thread>

using namespace std;

void go_c() {
    FILE *out = fopen("hello.txt", "w");
    fputs("hello", out);
    fclose(out);
}

void go_cpp() {
    ofstream out("hello.txt");
    out << "hello" << endl;
}

double test(void (*f)()) {
    typedef chrono::time_point<chrono::steady_clock> tp;
    tp t0 = chrono::steady_clock::now();
    f();
    tp t1 = chrono::steady_clock::now();
    return chrono::duration<double>(t1 - t0).count() * 1000; // milliseconds
}

void bench(void (*f)(), const char *txt, int delay_ms) {
    filesystem::remove("hello.txt");
    for (int i = 0; i < 5; i++) {
        double t = test(f);
        cerr << i << ": " << txt << ", time = " << t << " ms" << endl;
        this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
    }
    cerr << endl;
}

int main() {
    bench(go_c, "C Write", 0);
    bench(go_cpp, "C++ Write", 0);
    bench(go_c, "C Write with delay", 2500);
    bench(go_cpp, "C++ Write with delay", 2500);
    return 0;
}
And here is the output:
ubuntu@captain:~/scratch$ g++ -o write3 write3.cpp -O2 -Wall
ubuntu@captain:~/scratch$ ./write3
0: C Write, time = 0.09978 ms
1: C Write, time = 21.9316 ms
2: C Write, time = 0.185957 ms
3: C Write, time = 0.140212 ms
4: C Write, time = 0.139051 ms
0: C++ Write, time = 0.145766 ms
1: C++ Write, time = 0.091845 ms
2: C++ Write, time = 0.139618 ms
3: C++ Write, time = 0.130834 ms
4: C++ Write, time = 0.132217 ms
0: C Write with delay, time = 0.048674 ms
1: C Write with delay, time = 0.23875 ms
2: C Write with delay, time = 20.8626 ms
3: C Write with delay, time = 8.4307 ms
4: C Write with delay, time = 19.4026 ms
0: C++ Write with delay, time = 17.1555 ms
1: C++ Write with delay, time = 17.5887 ms
2: C++ Write with delay, time = 18.9792 ms
3: C++ Write with delay, time = 25.8653 ms
4: C++ Write with delay, time = 20.7998 ms
It seems more likely to happen if there is a bit of delay between attempts, and also more likely to happen if the file already exists.
(It seems I can improve my server performance by deleting a file before I write to it. This seems illogical.)
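For example, here is a sketch of what I mean (the helper name go_c_unlink is mine, not part of the benchmark above):

#include <cstdio>
#include <filesystem>

// Sketch of the workaround: delete the file before writing it.
// Counter-intuitively, the first write to a freshly created file
// does not exhibit the ~20 ms spike on my server.
void go_c_unlink() {
    std::filesystem::remove("hello.txt"); // delete first
    FILE *out = fopen("hello.txt", "w");  // the write now goes to a brand-new file
    fputs("hello", out);
    fclose(out);
}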
The hardware is a 6-core Xeon 2386G with dual NVMe SSDs in software RAID, running almost nothing else and with no other users.
See the attached screenshot showing the result of uname -a and dpkg --list | grep -E "libc6?-(dev|bin)".
Answer
This answer provides information about the observed behavior and investigates what is actually going on.
TL;DR: the problem clearly comes from the active power states of the NVMe device, which can be tuned so that you do not pay a huge wake-up overhead. The overhead shows up late because the kernel performs IO operations asynchronously and defers the associated waits. Please read the section “NVMe device power management” (at the end) for the fix.
Under the hood: basic profiling
First of all, I can partially reproduce the effect on my machine (Debian with Linux kernel 5.15.0-2, a 6-core Intel i5-9600KF and a Samsung 970 Evo Plus NVMe SSD) under some conditions. More specifically, the ~6 ms latency is visible only during the second C test (the line starting with 2: C Write, time =), and only when I do not run GCC just beforehand to generate the binary. All timings are below 1 ms otherwise (generally in the range 0.1-0.4 ms). Updating the kernel to version 5.18.0-2 reduced the latency to 2.5 ms (+/- 0.5 ms). The timings are sufficiently stable and deterministic to report this information.
A quick analysis using the great Linux perf tool shows that a significant portion of the time is spent in kernel scheduling-related functions when the latency spike happens (using perf record). Fortunately, we can also analyze what the scheduler does with perf. The command perf sched record ./a.out records scheduling information, and perf sched timehist -Vg displays the schedule (note that the perf command is sometimes suffixed with a version on some systems, e.g. perf_5.4). Here is an example on my machine:
           time    cpu  0123456  task name                       wait time  sch delay   run time
                                 [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  -------  ------------------------------  ---------  ---------  ---------
   95730.123845 [0005]       i   <idle>                              0.000      0.000      0.000
   95730.123845 [0002]    s      perf_5.4[55033]                     0.000      0.000      0.000
   95730.124721 [0000]  i        <idle>                              0.000      0.000      0.000
   95730.124725 [0000]  s        gmain[805/804]                      0.000      0.025      0.004
   95730.124761 [0000]  i        <idle>                              0.004      0.000      0.036
   95730.124765 [0000]  s        gmain[805/804]                      0.036      0.025      0.003
   95730.124789 [0003]     i     <idle>                              0.000      0.000      0.000
   95730.124795 [0003]     s     kworker/u12:2-e[45157]              0.000      0.023      0.006
   95730.124822 [0002]    i      <idle>                              0.000      0.000      0.976
   95730.124833 [0000]  i        <idle>                              0.003      0.000      0.068
   95730.124834 [0005]       s   a.out[55040]                        0.000      0.026      0.988
   95730.124836 [0000]  s    |   gmain[805/804]                      0.068      0.038      0.003
   95730.124838 [0002]    s  |   sudo[54745]                         0.000      0.028      0.015
   95730.124849 [0003]     i |   <idle>                              0.006      0.000      0.053
   95730.124854 [0003]     s |   kworker/u12:2-e[45157]              0.053      0.027      0.004
   95730.124886 [0002]    i  |   <idle>                              0.015      0.000      0.048
   95730.124901 [0002]    s  |   sudo[54745]                         0.048      0.033      0.015
   95730.124916 [0003]     i |   <idle>                              0.004      0.000      0.062
   95730.124922 [0003]     s |   kworker/u12:2-e[45157]              0.062      0.024      0.005
   95730.124945 [0004]      i|   <idle>                              0.000      0.000      0.000
   95730.124987 [0004]      s|   gnome-terminal-[8464]               0.000      0.024      0.042
   95730.127461 [0003]     i |   <idle>                              0.005      0.000      2.539
   95730.127474 [0005]       i   <idle>                              0.988      0.000      2.639
   95730.127475 [0003]     s     kworker/u12:2-e[45157]              2.539      0.023      0.013
   95730.127516 [0000]  i    |   <idle>                              0.003      0.000      2.679
   95730.127519 [0000]  s    |   gmain[805/804]                      2.679      0.027      0.003
   95730.127530 [0005]      |s   a.out[55040]                        2.639      0.001      0.056
   95730.127537 [0003]     i |   <idle>                              0.013      0.000      0.062
   95730.127549 [0005]       i   <idle>                              0.056      0.000      0.018
   95730.127550 [0003]     s     kworker/u12:2-e[45157]              0.062      0.023      0.013
   95730.127566 [0004]      i    <idle>                              0.042      0.000      2.578
   95730.127568 [0004]      s    kworker/u12:4-e[54041]              0.000      0.026      0.002
   95730.127585 [0002]    i      <idle>                              0.015      0.000      2.683
   95730.127585 [0000]  i        <idle>                              0.003      0.000      0.065
   95730.127588 [0000]  s        gmain[805/804]                      0.065      0.026      0.003
   95730.127591 [0005]       s   a.out[55040]                        0.018      0.001      0.042
   95730.127595 [0002]    s  |   sudo[54745]                         2.683      0.043      0.009
   95730.127605 [0004]      i|   <idle>                              0.002      0.000      0.037
   95730.127617 [0005]       i   <idle>                              0.042      0.000      0.026
   95730.127618 [0004]      s    kworker/u12:4-e[54041]              0.037      0.028      0.013
   95730.127635 [0003]     i     <idle>                              0.013      0.000      0.085
   95730.127637 [0003]     s     kworker/u12:2-e[45157]              0.085      0.027      0.002
   95730.127644 [0003]     i     <idle>                              0.002      0.000      0.007
   95730.127647 [0003]     s     kworker/u12:2-e[45157]              0.007      0.000      0.002
   95730.127650 [0003]     i     <idle>                              0.002      0.000      0.003
   95730.127652 [0003]     s     kworker/u12:2-e[45157]              0.003      0.000      0.001
   95730.127653 [0003]     i     <idle>                              0.001      0.000      0.001
   95730.127659 [0003]     s     kworker/u12:2-e[45157]              0.001      0.000      0.006
   95730.127662 [0002]    i      <idle>                              0.009      0.000      0.067
   95730.127662 [0000]  i        <idle>                              0.003      0.000      0.073
   95730.127666 [0000]  s        gmain[805/804]                      0.073      0.036      0.003
   95730.127669 [0003]     i     <idle>                              0.006      0.000      0.010
   95730.127672 [0004]      i    <idle>                              0.013      0.000      0.053
   95730.127673 [0003]     s     kworker/u12:2-e[45157]              0.010      0.000      0.004
   95730.127674 [0004]      s    kworker/u12:4-e[54041]              0.053      0.026      0.002
   95730.127676 [0004]      i    <idle>                              0.002      0.000      0.001
   95730.127678 [0004]      s    kworker/u12:4-e[54041]              0.001      0.001      0.002
   95730.127679 [0002]    s      sudo[54745]                         0.067      0.052      0.016
   95730.127692 [0001]   i       <idle>                              0.000      0.000      0.000
   95730.127717 [0001]   s       gnome-terminal-[8464]               2.704      0.019      0.024
   95730.127725 [0005]       s   a.out[55040]                        0.026      0.001      0.107
   95730.127755 [0002]    i  |   <idle>                              0.016      0.000      0.075
The time on the left is in seconds, and the 0123456 column shows the schedule of the active tasks on the cores: s means the task is scheduled and i means it is interrupted. I added the | symbols to make it easier to see when the tracked process is running (a.out is the program executing your code). The execution time printed by the program is 2.68278 ms, so we are basically looking for a 0.0027-second gap in the timestamps (though I find the idle timestamps unreliable, since they appear to indicate the end of the idle time rather than its beginning, i.e. when the task was suspended).
The schedule shows that the process runs for 0.988 ms, is interrupted for 2.639 ms, runs again for 0.056 ms, is interrupted again for 0.018 ms, runs again, and so on (the rest of the execution is not shown for the sake of clarity). The first interruption matches the reported spike timing very well (especially since we should include the time for the process to warm up again and for the scheduler to perform the context switch).
During the interruption of the program, two tasks are awakened: a kernel thread called kworker/u12:2-e and a task called gmain, which certainly belongs to gnome-shell. The kernel thread starts when the program is interrupted and is itself interrupted when the program resumes (with a 7 µs delay). Also note that the kernel thread takes 2.539 ms to run.
Perf can provide some information about kernel calls: add the options --kernel-callchains --call-graph dwarf to sched record. Unfortunately, the results are not very useful in this case. The only useful information is that the kernel functions io_schedule <- folio_wait_bit_common <- folio_wait_writeback <- truncate_inode_partial_folio are called when the program is interrupted during the spike. This proves the program is interrupted because of an IO operation. You can also add the --wakeups flag to see the wakeup events: they show that the suspicious slow kernel thread is awakened by the target program (and the previous ones are awakened by other tasks, typically gmain or gnome-terminal).
strace -T ./a.out can be used to track the duration of the system calls, and we can clearly see that the third call to openat is slow on my machine. Here is the interesting part (reformatted for the sake of clarity):
unlink:     0.000072 s
openat:     0.000047 s
newfstatat: 0.000007 s
write:      0.000044 s
close:      0.000006 s
[...] (write x 7)
openat:     0.000019 s
newfstatat: 0.000005 s
write:      0.000011 s
close:      0.000022 s
[...] (write x 7)
openat:     0.002334 s    <----- latency spike
newfstatat: 0.000057 s
write:      0.000080 s
close:      0.000052 s
[...] (write x 7)
openat:     0.000021 s
newfstatat: 0.000005 s
write:      0.000029 s
close:      0.000014 s
[...]
Based on the gathered information, we can clearly say that system calls like openat or close do not always cause the program to 1. be interrupted and 2. start a kernel thread doing the actual operation on the SSD. Instead, IO calls appear to be somehow aggregated/cached in RAM, and the completion/synchronization with the SSD is only done at specific moments. The latency spike only happens when a kernel thread does the work and the task is interrupted. My guess is that the IO operations are done in RAM (certainly asynchronously) and that the kernel sometimes flushes/syncs the in-RAM data to the SSD, which is what takes a few milliseconds. The reason for such a delay is unclear. In any case, it means the operation is probably latency bound.
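As a side note, one way to test this asynchronous-flush hypothesis (a sketch of mine, not part of the original benchmark; the name go_c_sync is hypothetical) is to force the writeback explicitly with fsync. The flush cost should then show up in the write path itself rather than in a later openat:

#include <fcntl.h>
#include <unistd.h>

// Sketch: like go_c, but force the data to reach the device before closing.
// fsync blocks until the page-cache writeback completes, so the flush cost
// (including a potential device wake-up) is paid here, synchronously.
void go_c_sync() {
    int fd = open("hello.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return;
    if (write(fd, "hello", 5) != 5) { close(fd); return; }
    fsync(fd);  // wait for the writeback now
    close(fd);
}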
In the abysses: kernel profiling
To understand exactly what is going on, one unfortunately needs to trace kernel threads, and possibly even the SSD driver stack, which is a bit tricky. The simplest way to do that seems to be the Linux Function Tracer (aka ftrace). Note that tracing all kernel functions is very expensive and hides the cost of the expensive ones, so the granularity should be adjusted. Kernel traces quickly become monstrously big, and function names are often not very helpful. On top of that, tracing kernel threads is not easy, because a kernel thread's PID is unknown before the request is made, and operations are done in a multi-threaded context (and concurrently on each core, due to context switches).
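For reference, ftrace is driven by writing to tracefs pseudo-files. Here is a minimal session sketch (my own illustration, assuming tracefs is mounted at /sys/kernel/tracing and the program is run as root):

#include <cstdlib>
#include <fstream>

// Write a value into a tracefs control file.
static void write_knob(const char *path, const char *value) {
    std::ofstream(path) << value;
}

int main() {
    write_knob("/sys/kernel/tracing/current_tracer", "function_graph"); // trace call graphs
    write_knob("/sys/kernel/tracing/tracing_on", "1");                  // start tracing
    std::system("./a.out");                                             // run the benchmark
    write_knob("/sys/kernel/tracing/tracing_on", "0");                  // stop tracing
    // The result can then be read from /sys/kernel/tracing/trace.
    return 0;
}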
I tried it on my machine, and here is a simplified profiling trace of the program (showing only calls >= 1 µs, and omitting IRQ/fault kernel calls, for the sake of clarity):
__x64_sys_unlink();                    92.569 us
__x64_sys_openat();                    48.103 us
__x64_sys_newfstatat();                 2.609 us
__x64_sys_write();                     33.606 us
exit_to_user_mode_prepare();           12.517 us
__x64_sys_write();                      8.277 us
__x64_sys_write();                      2.482 us
__x64_sys_write();                      2.257 us
__x64_sys_write();                      2.240 us
__x64_sys_write();                      5.987 us
__x64_sys_write();                      5.090 us
__x64_sys_openat();                    77.006 us    <------ Fast
__x64_sys_newfstatat();                 2.433 us
__x64_sys_write();                     43.509 us
exit_to_user_mode_prepare();           83.260 us
__x64_sys_write();                      5.688 us
__x64_sys_write();                      6.339 us
__x64_sys_write();                      4.521 us
__x64_sys_write();                      3.006 us
__x64_sys_write();                      4.309 us
__x64_sys_write();                      3.472 us
__x64_sys_write();                      2.669 us
__x64_sys_openat() {
    [CONTEXT SWITCH: a.out-73884 => <idle>-0]
    [MISSING PART: KERNEL THREAD]
    [CONTEXT SWITCH: <idle>-0 => a.out-73884]
}                                    2441.794 us    <----- Latency spike
__x64_sys_newfstatat();                 3.007 us
__x64_sys_write();                     74.643 us
exit_to_user_mode_prepare();           64.822 us
__x64_sys_write();                     24.032 us
__x64_sys_write();                      3.002 us
__x64_sys_write();                      2.408 us
__x64_sys_write();                      4.181 us
__x64_sys_write();                      3.662 us
__x64_sys_write();                      2.381 us
__x64_sys_write();                     23.284 us
__x64_sys_openat();                    79.258 us
__x64_sys_newfstatat();                27.363 us
__x64_sys_write();                     45.040 us
[...]
The kernel trace proves that a context switch happens during a __x64_sys_openat (i.e. the syscall made by the fopen call), and that the latency spike happens at this time.
Deeper tracing shows the call chain responsible for the wait:
__x64_sys_openat
  do_sys_openat2
    do_filp_open
      path_openat
        do_truncate
          notify_change
            ext4_setattr
              truncate_pagecache
                truncate_inode_pages_range
                  truncate_inode_partial_folio
                    folio_wait_writeback
                      folio_wait_bit
                        io_schedule
                          schedule
                            [task interruption]    <---- takes ~95% of the time
Meanwhile, the first call to openat calls truncate_inode_pages_range but not truncate_inode_partial_folio, so there is no task interruption and no kernel thread needed to complete the task. In fact, every call to openat on “hello.txt” causes truncate_inode_pages_range to be called, but only two calls to truncate_inode_partial_folio are made over the first 5 calls to fopen. This function always calls schedule in practice, but only the first call is slow (subsequent calls take 20-700 µs, with an average of 30 µs). The truncate_pagecache function tends to confirm that there is a cache, but this does not explain why subsequent calls to schedule are faster.
When tracing the kernel threads, I ended up with traces like:
finish_task_switch.isra.0:    0.615 us
preempt_count_sub:            0.111 us
wq_worker_running:            0.246 us
_raw_spin_lock_irq:           0.189 us
process_one_work:            24.092 us    <----- Actual kernel thread computation
This basically shows that the most important part of the time (>95%) is missing from the profiling traces. Unfortunately, tracing functions like the above (as well as using eBPF tools like ext4slower-bpfcc) resulted in very inconsistent timings, mainly because of the way the time is measured (absolute timestamps are missing). One needs to trace the IO-based subsystems to understand this further.
Event profiling of the lower-level NVMe stack (using the ftrace-based trace-cmd tool) shows that the first NVMe request is slow while subsequent ones are fast:
nvme_setup_cmd:    173732.202096    <----- First request started
nvme_sq:           173732.204539    <----- ~2.5 ms delay
nvme_complete_rq:  173732.204543
nvme_setup_cmd:    173732.204873    <----- Second request started
nvme_sq:           173732.204892    <----- No delay
nvme_complete_rq:  173732.204894
nvme_setup_cmd:    173732.205240
nvme_sq:           173732.205257    <----- Same here
nvme_complete_rq:  173732.205259
[...]
The fact that the second openat call is slow is probably due to synchronization (the IO scheduler waits for the previous request to be completed). The most probable explanation is that the NVMe device enters a sleep mode when no requests have been sent for a relatively long time, and it takes time to wake up.
NVMe device power management
These sleep modes are called Active Power States. They can be listed using the command smartctl -a /dev/nvme0 (from the smartmontools package):
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     3.40W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0100W       -        -    4  4  4  4     2000    8000
The latency of power states 3 and 4 is quite big, but their power consumption is also far smaller than that of the others. This can be controlled using the nvme command (from the nvme-cli package), more specifically its get-feature and set-feature sub-commands. You can get more information about this here. In my case, I just wrote 1000 into the file /sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us so that the latency cannot exceed 1 ms (this requires root privileges). Note that this file is reset when the machine reboots.
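For reference, here is that write expressed as a small program (a sketch consistent with the benchmark's language; adjust nvme0 to your device and run it as root):

#include <fstream>
#include <iostream>

int main() {
    // Cap the acceptable power-state latency at 1000 us (1 ms), so the deep
    // (high-exit-latency) sleep states are no longer used by the device.
    std::ofstream qos("/sys/class/nvme/nvme0/power/pm_qos_latency_tolerance_us");
    if (!qos) {
        std::cerr << "Cannot open the sysfs file (root privileges required)\n";
        return 1;
    }
    qos << 1000 << std::endl;
    return 0;
}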
WARNING: note that preventing the SSD from switching to deep sleep modes can reduce battery life on notebooks and increase the device temperature. In pathological cases (i.e. a poor NVMe controller), this could damage the device if it is not properly cooled. That being said, most devices are protected against this with a throttling strategy.
Once the power QoS is modified, the latency spike is gone. Et voilà! I get the following program output:
0: C Write, time = 0.289867 ms
1: C Write, time = 0.285233 ms
2: C Write, time = 0.225163 ms
3: C Write, time = 0.222544 ms
4: C Write, time = 0.212254 ms
[...]
Note that this explains why the latency is not the same from one machine to another (and likewise the idle time needed before the device enters a sleep mode), and why running GCC just beforehand prevented the latency spike (its IO kept the device awake).