Dirty pagecache writeback and vm.dirty_* parameters
What is "dirty pagecache"?
Before the system can work with data stored on a hard disk, it must first read the data from the disk into RAM. The memory allocated for this data is called the pagecache.
When something modifies data on the hard disk, usually via the write() system call (or similar), the changes are at first made only in RAM, and the affected pages in the pagecache are marked as dirty.
What is "dirty writeback"?
Naturally, you want the modifications you have made to be written to the hard disk as well. Otherwise you would lose them on shutdown. The act of writing the modifications stored in dirty pages back to the hard disk is called dirty writeback or flushing.
How does writeback work?
Reading data from and writing data to a hard disk is generally considered expensive in terms of time. Therefore the kernel does not write back every change immediately, but waits until more dirty data has accumulated so that it can be flushed all at once.
Several parameters control when and how writeback is performed.
The most significant are:
- dirty_bytes / dirty_ratio
- dirty_background_bytes / dirty_background_ratio
- dirty_writeback_centisecs
- dirty_expire_centisecs
Writeback can generally be divided into three stages:
- 1st: Periodic
  The kernel threads responsible for flushing dirty data are woken periodically (at an interval based on dirty_writeback_centisecs) to flush data that has been dirty for at least dirty_expire_centisecs.
- 2nd: Background
  Once there is enough dirty data to cross the threshold based on the dirty_background_* parameters, the kernel tries to flush as much as possible, or at least enough to get back under the background threshold.
  Note that this is done asynchronously in the kernel flusher threads. It creates some overhead, but it does not necessarily impact the workflow of other applications (hence the name: background).
- 3rd: Active
  If the threshold based on the dirty_* parameters (usually reasonably higher than the background threshold) is crossed, applications are producing dirty data faster than the flusher threads can write it back. To prevent the system from running out of memory, the tasks that produce dirty data are blocked in the write() system call (and similar), actively waiting for the data to be flushed.
Note: In RHEL 5 and earlier, the flusher threads were named pdflush. In RHEL 6 and later they are named flush-XX.
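The three stages can be illustrated with a deliberately simplified model. This is only a sketch of the description above, not of the kernel's implementation; the function names and the sample numbers are hypothetical.

```python
# Simplified model of the three writeback stages described above.
# All names and numbers are illustrative assumptions, not kernel code.

def expired_pages(page_ages_cs, dirty_expire_centisecs):
    """Periodic stage: return the ages (in centiseconds) old enough to flush."""
    return [age for age in page_ages_cs if age >= dirty_expire_centisecs]


def writeback_stage(dirty, background_threshold, dirty_threshold):
    """Classify the amount of dirty data (same unit for all three arguments)."""
    if dirty > dirty_threshold:
        return "active"       # writers are blocked in write() until data is flushed
    if dirty > background_threshold:
        return "background"   # flusher threads write back asynchronously
    return "periodic"         # only expired data is flushed on periodic wake-ups


if __name__ == "__main__":
    MiB = 1024 * 1024
    print(expired_pages([100, 2999, 3000, 4500], 3000))
    for dirty in (50 * MiB, 300 * MiB, 900 * MiB):
        print(dirty // MiB, "MiB ->", writeback_stage(dirty, 200 * MiB, 800 * MiB))
```

The model treats the two thresholds as hard boundaries, which follows the description in this article; a real kernel may start slowing writers down more gradually.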
Control parameters
- dirty_writeback_centisecs
  Specifies the interval between periodic wake-ups of the flusher threads, in hundredths of a second.
- dirty_expire_centisecs
  Specifies the age, in hundredths of a second, after which dirty data is considered old enough to be flushed by periodic writeback.
- dirty_background_bytes / dirty_background_ratio
  Specifies the threshold, in bytes (_bytes) or as a percentage of dirty-able* memory (_ratio), at which the kernel flusher threads start writing back dirty data in the background.
  Note that only one of the two is in effect at any time. (See "Exclusiveness of ratio vs. bytes" below.)
- dirty_bytes / dirty_ratio
  Specifies the threshold, in bytes (_bytes) or as a percentage of dirty-able* memory (_ratio), at which a process generating dirty data will itself start writeback in the write() system call (or similar).
  Note that only one of the two is in effect at any time. (See "Exclusiveness of ratio vs. bytes" below.)
* Dirty-able memory denotes all memory that can potentially be allocated for pagecache and become dirty (free + reclaimable).
Note: In RHEL 5 and earlier, the ratio percentages were calculated from the total amount of memory.
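As a worked example of how a bytes/ratio pair could translate into an actual threshold, consider the sketch below. It assumes that a non-zero _bytes value is used directly and that otherwise the _ratio percentage is applied to the dirty-able memory; the function name and the memory size are hypothetical.

```python
# Sketch: deriving an effective threshold from one bytes/ratio pair.
# Assumption: a non-zero *_bytes value takes precedence; otherwise the
# *_ratio percentage of dirty-able memory is used.

def effective_threshold(bytes_value, ratio, dirtyable_bytes):
    """Return the threshold in bytes for one bytes/ratio parameter pair."""
    if bytes_value:
        return bytes_value
    return dirtyable_bytes * ratio // 100


if __name__ == "__main__":
    GiB = 1024 ** 3
    dirtyable = 8 * GiB  # hypothetical amount of free + reclaimable memory
    background = effective_threshold(0, 10, dirtyable)  # dirty_background_ratio = 10
    blocking = effective_threshold(0, 20, dirtyable)    # dirty_ratio = 20
    print("background threshold:", background, "bytes")
    print("dirty threshold:", blocking, "bytes")
```

Because dirty-able memory changes as the system runs, a threshold derived from a ratio moves with it, while a threshold given in bytes stays fixed.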
All these parameters can be inspected and modified during runtime (without rebooting) in /proc/sys/vm/....
Reference: Kernel source documentation ".../Documentation/sysctl/vm.txt"
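To inspect the current values programmatically, something like the following sketch could be used. It assumes each parameter file under /proc/sys/vm holds a single integer; the helper name read_dirty_params is hypothetical, and files that are missing or unreadable are simply skipped.

```python
import os

# Sketch: reading the writeback parameters from procfs.
# Assumption: each file holds a single integer value.

PARAMS = (
    "dirty_bytes",
    "dirty_ratio",
    "dirty_background_bytes",
    "dirty_background_ratio",
    "dirty_writeback_centisecs",
    "dirty_expire_centisecs",
)


def read_dirty_params(base="/proc/sys/vm"):
    """Return {name: int} for every parameter file found under base."""
    values = {}
    for name in PARAMS:
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # file missing, unreadable, or not an integer
    return values


if __name__ == "__main__":
    for name, value in read_dirty_params().items():
        print(f"{name} = {value}")
```

Reading these files normally needs no special privileges; writing to them requires root.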
Parameter limits
Some of these parameters have min/max limits. If you attempt to set them outside of their limits, an error is reported and no changes are made.
- dirty_ratio, dirty_background_ratio : Minimum = 0 | Maximum = 100
- dirty_background_bytes : Minimum = 1
- dirty_bytes : Minimum = 2 * PAGE_SIZE
- dirty_expire_centisecs : Minimum = 0
Note: These limits might change between kernel versions. The most reliable way to verify them is to check the source file ".../kernel/sysctl.c" of the specific kernel.
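The limits listed above can be written down as a small lookup table to sanity-check a value before writing it. This is only a sketch: the PAGE_SIZE of 4096 bytes and the helper name within_limits are assumptions, and the limits themselves may differ on a given kernel.

```python
# Sketch: checking a proposed value against the limits listed above.
# Assumption: PAGE_SIZE is 4096 bytes; limits can differ between kernels.

PAGE_SIZE = 4096  # hypothetical; the real value is os.sysconf("SC_PAGE_SIZE")

# name -> (minimum, maximum); None means no maximum is listed
LIMITS = {
    "dirty_ratio": (0, 100),
    "dirty_background_ratio": (0, 100),
    "dirty_background_bytes": (1, None),
    "dirty_bytes": (2 * PAGE_SIZE, None),
    "dirty_expire_centisecs": (0, None),
}


def within_limits(name, value):
    """Return True if value lies inside the listed limits for name."""
    low, high = LIMITS[name]
    if value < low:
        return False
    return high is None or value <= high


if __name__ == "__main__":
    print(within_limits("dirty_ratio", 101))
    print(within_limits("dirty_bytes", 2 * PAGE_SIZE))
```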
Exclusiveness of ratio vs. bytes
For each ratio/bytes pair, the kernel uses only one of the two parameters. Setting one of them sets the other to 0.
For example, when current configuration is:
dirty_bytes = 0
dirty_ratio = 20
After running, for example, "echo <num> > /proc/sys/vm/dirty_bytes", the new configuration becomes:
dirty_bytes = <num>
dirty_ratio = 0
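The same behaviour can be mimicked with a toy model in which setting one field of a pair zeroes the other. The class name DirtyThreshold and the value used are hypothetical; this only mirrors the example above and is not kernel code.

```python
# Toy model of the ratio/bytes exclusiveness; not kernel code.

class DirtyThreshold:
    """Holds one bytes/ratio pair; setting one field zeroes the other."""

    def __init__(self, ratio=20):
        self.bytes = 0
        self.ratio = ratio

    def set_bytes(self, value):
        self.bytes = value
        self.ratio = 0

    def set_ratio(self, value):
        self.ratio = value
        self.bytes = 0


if __name__ == "__main__":
    t = DirtyThreshold(ratio=20)
    t.set_bytes(268435456)  # hypothetical value (256 MiB)
    print("dirty_bytes =", t.bytes)
    print("dirty_ratio =", t.ratio)
```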