Ceph - VM hangs when transferring large amounts of data to RBD disk

Solution Verified

Environment

  • Red Hat Ceph Storage 1.3
  • Red Hat Ceph Storage 1.2.3
  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6
  • Virtual Machine using Qemu-KVM
  • Libvirt
  • Ceph Block Device (RBD)
  • Large Number of OSDs

Issue

  • The virtual machine boots up with no issues, a storage disk from the Ceph cluster (RBD) can be attached to the VM, and a file system can be created on it. Small files (< 1 GB) transfer without issue, but when moderately sized files (> 1 GB) are transferred, the VM slows to a halt and eventually locks up.
  • If Ceph client-side logging is enabled in the [client] section of ceph.conf, the client-side logs contain:
7f0b769e7700 -1 -- 192.168.128.30:0/2021513 >> 192.168.128.35:6800/24374 pipe(0x7f0bcabc0000 sd=-1 :0 s=1 pgs=0 cs=0 l=1 c=0x7f0bc55e1ce0).connect couldn't created socket (24) Too many open files
  • The following kernel messages are generated:
Sep  1 16:04:15 xxxx-rds kernel: INFO: task jbd2/vdf-8:2362 blocked for more than 120 seconds.
Sep  1 16:04:15 xxxx-rds kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep  1 16:04:15 xxxx-rds kernel: jbd2/vdf-8      D ffff88023fd13680     0  2362      2 0x00000080
Sep  1 16:04:15 xxxx-rds kernel: ffff8800b40c7bb8 0000000000000046 ffff8800bab116c0 ffff8800b40c7fd8
Sep  1 16:04:15 xxxx-rds kernel: ffff8800b40c7fd8 ffff8800b40c7fd8 ffff8800bab116c0 ffff88023fd13f48
Sep  1 16:04:15 xxxx-rds kernel: ffff88023ff5e958 0000000000000002 ffffffff811f8310 ffff8800b40c7c30
Sep  1 16:04:15 xxxx-rds kernel: Call Trace:
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff811f8310>] ? generic_block_bmap+0x70/0x70
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff8160955d>] io_schedule+0x9d/0x130
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff811f831e>] sleep_on_buffer+0xe/0x20
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff81607330>] __wait_on_bit+0x60/0x90
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff811f8310>] ? generic_block_bmap+0x70/0x70
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff816073e7>] out_of_line_wait_on_bit+0x87/0xb0
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff81098270>] ? autoremove_wake_function+0x40/0x40
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff811f98da>] __wait_on_buffer+0x2a/0x30
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffffa01b776b>] jbd2_journal_commit_transaction+0x175b/0x19a0 [jbd2]
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff810125c6>] ? __switch_to+0x136/0x4a0
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffffa01bbda9>] kjournald2+0xc9/0x260 [jbd2]
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff81098230>] ? wake_up_bit+0x30/0x30
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffffa01bbce0>] ? commit_timeout+0x10/0x10 [jbd2]
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff8109726f>] kthread+0xcf/0xe0
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff81613cfc>] ret_from_fork+0x7c/0xb0
Sep  1 16:04:15 xxxx-rds kernel: [<ffffffff810971a0>] ? kthread_create_on_node+0x140/0x140
  • This issue has also been seen when formatting a large number of RBD devices directly attached to instances with mkfs.xfs. mkfs.xfs blocks and prints the following hung_task_timeout message to /var/log/messages:
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel: INFO: task mkfs.xfs:2393 blocked for more than 120 seconds.
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel: mkfs.xfs        D 0000000000000000     0  2393   2387 0x00000083
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  ffff880fe416f930 0000000000000082 ffff880fe4c7a280 ffff880fe416ffd8
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  ffff880fe416ffd8 ffff880fe416ffd8 ffff880fe4c7a280 ffff88103fc54780
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  0000000000000000 7fffffffffffffff 0000000000000000 0000000000000000
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel: Call Trace:
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff8163a909>] schedule+0x29/0x70
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff816385f9>] schedule_timeout+0x209/0x2d0
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81058aaf>] ? kvm_clock_get_cycles+0x1f/0x30
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff810d814c>] ? ktime_get_ts64+0x4c/0xf0
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81639f3e>] io_schedule_timeout+0xae/0x130
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81639fd8>] io_schedule+0x18/0x20
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff8121cf93>] do_blockdev_direct_IO+0xc03/0x2620
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81218bd0>] ? I_BDEV+0x10/0x10
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff8121ea05>] __blockdev_direct_IO+0x55/0x60
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81218bd0>] ? I_BDEV+0x10/0x10
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81219227>] blkdev_direct_IO+0x57/0x60
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81218bd0>] ? I_BDEV+0x10/0x10
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff8116affd>] generic_file_direct_write+0xcd/0x190
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff8116b3dc>] __generic_file_aio_write+0x31c/0x3e0
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81219b2a>] blkdev_aio_write+0x5a/0xd0
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff811dddad>] do_sync_write+0x8d/0xd0
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff811de5cd>] vfs_write+0xbd/0x1e0
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff8110b554>] ? __audit_syscall_entry+0xb4/0x110
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff811df06f>] SyS_write+0x7f/0xe0
Jun 07 11:25:02 volformattest-rhel7.localdomain kernel:  [<ffffffff81645b12>] tracesys+0xdd/0xe2

Resolution

  • Check the number of file descriptors the QEMU process is using while the data is transferring, as it is most likely over the limit:
pid=<pid of the qemu process>

cat /proc/$pid/limits
echo /proc/$pid/fd/* | wc -w
  • On the RBD client, review the qemu.conf file located in /etc/libvirt; it contains the following configuration options:
max_files = 0
max_processes = 0
  • 0 is the default and uses the soft limit of the OS, which is 1024 on RHEL 7. Override this by setting a positive integer value for max_files and max_processes. To calculate the correct values for your environment, use the following rough formula:

     - Each connection uses 1 socket and 2 threads (1 reader, 1 writer).
     - Multiply that by (number of OSDs + number of MONs) and add some extra (perhaps 50, to be generous) for qemu's own threads plus the librbd/librados threads; that gives a rough maximum for threads.
     - For fds, it is mainly just 1 per connection (so number of OSDs + number of MONs), plus a few more for qemu itself.
     - For example, in the worst case an RBD image is mapped across 100 OSDs; that could mean 200 threads and 100 sockets, plus some more for that qemu instance.
    
  • max_files (the FD limit) is per qemu process, and the formula above covers the worst case for one RBD image (when the image is talking to all OSDs in the cluster). If multiple RBD images are attached to one qemu instance, multiply the formula's value by the number of RBD images attached to that instance.

  • max_processes is a system-wide setting for the qemu user. First multiply the formula's value (again, the worst case for one RBD image) by the number of RBD images attached to one instance, then add the same calculation for every other instance running on that qemu host to arrive at the right value for max_processes.

 - The man page for the getrlimit function confirms this: the open-files limit is per process, while the process limit applies system-wide per user.
   # man 2 getrlimit
    
       RLIMIT_NOFILE
              Specifies a value one greater than the maximum file descriptor number that can be opened by this process.  Attempts (open(2), pipe(2), dup(2), etc.)  to exceed  this  limit  yield  the  error
              EMFILE.  (Historically, this limit was named RLIMIT_OFILE on BSD.)

       RLIMIT_NPROC
              The  maximum  number of processes (or, more precisely on Linux, threads) that can be created for the real user ID of the calling process.  Upon encountering this limit, fork(2) fails with the
              error EAGAIN.  This limit is not enforced for processes that have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability.
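The formula above can be sketched as shell arithmetic. The cluster size, image count, and instance count below are hypothetical example inputs, not values taken from this article:

```shell
# Rough worst-case limits per the formula above (assumed example inputs).
osds=100        # OSDs in the cluster
mons=3          # MONs in the cluster
images=2        # RBD images attached to one qemu instance
instances=4     # qemu instances on this host
slack=50        # headroom for qemu's own threads/fds

conns=$((osds + mons))          # worst case: one image talks to every OSD + MON
max_files=$(( (conns + slack) * images ))                     # per qemu process
max_processes=$(( (2 * conns + slack) * images * instances )) # per qemu user
echo "max_files=$max_files max_processes=$max_processes"
```

With these example numbers the result is max_files=306 and max_processes=2048; the values would go into /etc/libvirt/qemu.conf, followed by a libvirtd restart to take effect.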

Root Cause

  • The default soft limit of file descriptors (FDs) in Red Hat Enterprise Linux is 1024. If this setting is not overridden in the qemu.conf file, that number is used. When an RBD image is created, it maps to only a few OSDs, because it is thinly provisioned. As large amounts of data are transferred, RBD maps objects to additional OSDs, opening connections to more OSDs in the cluster and not closing the FDs and threads until the operation completes. Each connection needs 1 socket and 2 threads (1 reader, 1 writer). With large RBD images, heavy data transfer, and a large number of OSDs in the cluster, this can create enough threads and FDs to exceed the limit. Follow the Resolution section to set the correct values for max_files and max_processes.
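This connection growth can be observed directly: each OSD/MON connection shows up as one socket fd under /proc. A minimal sketch, using a hypothetical helper name (socket_count is not a Ceph or libvirt tool):

```shell
# Count how many of a process's open fds are sockets; each open
# Ceph OSD/MON connection accounts for one of them.
socket_count() {
    ls -l "/proc/$1/fd" 2>/dev/null | grep -c 'socket:'
}
# Example: socket_count "$(pgrep -f qemu-kvm | head -n1)"
```

Run repeatedly during a large transfer, the count climbs as the image touches more OSDs.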

Diagnostic Steps

  • Have a Ceph cluster with a large number of OSDs.
  • Have a VM using qemu with an RBD image presented to the VM using libvirt.
  • Transfer a large data file to the RBD image.
  • Monitor the fd count of the qemu PID as it approaches the default limit of 1024.
  • As it nears that value, the VM will lock up.
  • Modify max_files and max_processes in the qemu.conf file using the formula given in the Resolution section.
  • Transfer a large data file to the VM again and monitor the qemu process fd count.
  • The fd count should now exceed the previous default of 1024, and the VM should not lock up.
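The fd monitoring in the steps above can be sketched as two small helpers (hypothetical names; compare fd_count against the soft "Max open files" value that fd_limit extracts):

```shell
# Number of fds the process currently has open.
fd_count() {
    ls "/proc/$1/fd" | wc -l
}
# The process's soft limit on open files (the 1024 default, unless raised).
fd_limit() {
    awk '/Max open files/ {print $4}' "/proc/$1/limits"
}
# Example: watch -n5 "ls /proc/<qemu pid>/fd | wc -l"
```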

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.