Is it safe to add LVM cache device to existing LVM volume group while filesystem is mounted?

I have an existing LVM volume group with a 10 TB logical volume mounted as an ext4 filesystem that is actively in use.

Is it safe to run the command lvconvert --type cache --cachepool storage/lvmcache-data storage/data while the ext4 filesystem is already mounted on storage/data? (The storage/lvmcache-data volume has previously been configured with lvconvert --type cache-pool --cachemode writeback --poolmetadata storage/lvmcache-metadata storage/lvmcache-data, in case it makes a difference.)

I would assume that yes, it is safe to add a cache on the fly to an online volume with a mounted filesystem, but I couldn't find documentation either way.

This isn't clearly documented anywhere by the LVM authors, but according to https://blog.delouw.ch/2020/01/29/using-lvm-cache-for-storage-tiering/:

Another benefit of dm-cache over dm-writecache is that the cache can
be created, activated and destroyed online.

That means that as long as you are using the dm-cache module instead of the dm-writecache module, it should be safe to add and remove the LVM cache while the logical volume is mounted.

Note that the LVM cachemode setting writeback is not the same thing as dm-writecache.
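
If you want to double check which kernel target a cached volume ended up using, you can ask both LVM and device mapper. This is only a small sketch, assuming the storage/data names used later in this answer; the exact output differs between versions:

    # Segment type as reported by LVM; a dm-cache backed cache volume
    # reports segtype "cache".
    lvs -o lv_name,segtype storage/data

    # Kernel device mapper target actually in use for the volume:
    # "cache" means dm-cache, "writecache" means dm-writecache.
    dmsetup table storage-data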

In addition, Red Hat documentation at https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.5/html/administration_guide/sect-lvm_cache#idm140401735629408 says the following:

19.8.3. Configuring LVM Cache

Adding or removing cache pools can be done on active volumes, even with mounted filesystems in use. However, there is overhead to the operation and performance impacts will be seen, especially when removing a cache volume in writeback mode, as a full data sync will need to occur.

I also verified this with the following test:

  1. Create 4 additional storage devices on a virtual machine: sda (2 GB), sdb (4 GB), sdc (4 GB) and sdd (1 GB). The sizes of these devices are not important; I used different sizes to illustrate the flexibility of LVM. You can pretend that the smaller sdd is the fastest device and will be used as the cache.

  2. Build an LVM storage out of sda, sdb, sdc taking all the extents from all the devices (the volume group is called storage and logical volume is called data for this example):

    pvcreate /dev/sda /dev/sdb /dev/sdc
    vgcreate storage /dev/sda /dev/sdb /dev/sdc
    lvcreate -l100%FREE -n data storage
    mkfs.ext4 /dev/mapper/storage-data
    mkdir -p /root/test
    mount /dev/mapper/storage-data /root/test
    

    In the real world, I would recommend creating partitions that are a bit smaller than the whole device and using those partitions as LVM physical volumes. This makes it easier to replace a device, because nominally "1 TB" devices from different manufacturers may differ by a couple of megabytes. For SSDs I prefer to keep the last ~100 MB unpartitioned so that I can create identically sized partitions on different SSD devices. As a bonus, the SSD can use this never-written area of the disk as extra wear-leveling area. If you use cheap drives, I would recommend leaving 10–20% never used, because cheap drives typically have much less wear-leveling area outside the user-accessible area. Leaving some user-accessible area untouched (or freed with TRIM) allows the firmware to use it for wear leveling, which prolongs the life of the drive and typically improves its performance.

  3. Start two fio test loops in parallel in two separate terminals in the directory /root/test:

    First loop:

    while fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randwrite --size=500m --blocksize=4k --ioengine=libaio --fsync=1m --iodepth=1 --direct=1 --numjobs=1 --group_reporting --verify=sha1 --do_verify=0 --verify_state_save=1 --verify_backlog=1024 && fio --name TEST --eta-newline=5s --filename=fio-tempfile.dat --rw=randread --size=500m --blocksize=4k --ioengine=libaio --fsync=1m --iodepth=1 --direct=1 --numjobs=1 --group_reporting --verify=sha1 --do_verify=1 --verify_state_save=1 --verify_backlog=1024 --verify_dump=1 --verify_only ; do printf "\nOK -- %s\n\n\n" "$(date --iso=sec)"; done
    

    Second loop (in another terminal):

    while fio --name TEST --eta-newline=5s --filename=fio-tempfile2.dat --rw=randwrite --size=500m --blocksize=4k --ioengine=libaio --fsync=1m --iodepth=4 --direct=1 --numjobs=4 --group_reporting --verify=sha1 --do_verify=0 --verify_state_save=1 --verify_backlog=1024 && fio --name TEST --eta-newline=5s --filename=fio-tempfile2.dat --rw=randread --size=500m --blocksize=4k --ioengine=libaio --fsync=1m --iodepth=2 --direct=1 --numjobs=8 --group_reporting --verify=sha1 --do_verify=1 --verify_state_save=1 --verify_backlog=1024 --verify_dump=1 --verify_only ; do printf "\nOK -- %s\n\n\n" "$(date --iso=sec)"; done
    

    These create two files called fio-tempfile.dat and fio-tempfile2.dat which are continuously written by a total of 5 parallel jobs, and the contents of the files are verified on every read pass. I tested with dd that if you modify even a single byte, the loop will detect the error:

    dd if=/dev/zero of=fio-tempfile.dat seek=1000000 count=1 bs=1 conv=notrunc
    

    Once an error is detected, you can restart the loop and it will keep testing and verifying the storage until stopped or until an error is found.

  4. Add a new cache device (sdd) to the existing storage while the above test loops keep running, to demonstrate that accessing the filesystem is safe:

    pvcreate /dev/sdd
    vgextend storage /dev/sdd
    lvcreate -n lvmcache-data -l 98%FREE storage /dev/sdd
    lvcreate -n lvmcache-metadata -l 50%FREE storage /dev/sdd
    lvconvert --type cache-pool --cachemode writeback --poolmetadata storage/lvmcache-metadata storage/lvmcache-data
    lvconvert --type cache --cachepool storage/lvmcache-data storage/data
    

    The last command adds the LVM cache to the volume on the fly without causing data corruption, and the cache persists across system reboots without issues. The reason for allocating only 98% of the space to the cache data and 50% of the remaining space (about 1%) to the cache metadata is that building a cache pool out of these requires a bit of free space in the volume group, or the conversion will fail. You could also use a cachevol instead of a cachepool, but 3rd party software typically only supports cachepool because it is the older method. Both have identical performance; a cachepool is just more complex to build but has better interoperability with 3rd party repair and diagnostics software, which is why I prefer it. The cachepool mode also supports using separate devices for metadata and data, which could improve performance if you have multiple really fast devices. (A minimal cachevol sketch is included further below, after the notes about uncaching.)

  5. If you later want to remove the cache device, you can do the following on the fly without data corruption:

    lvconvert --uncache storage/data
    

    This will take quite a long time if the LVM cache is in active use (as in the example above with the test loops running) and it will keep printing status lines such as

    Flushing 15610 blocks for cache storage/data.
    Flushing 11514 blocks for cache storage/data.
    Flushing 7418 blocks for cache storage/data.
    Flushing 5481 blocks for cache storage/data.
    ...
    Flushing 1 blocks for cache storage/data.
    Logical volume "lvmcache-data_cpool" successfully removed
    Logical volume storage/data is not cached.
    

    It seems that flushing may stall for a long time and keep displaying the same number of unflushed blocks, but you just have to keep waiting. The filesystem mounted on top of the LVM volume keeps working at all times. (The lvs sketch at the end of this step shows one way to watch the flush progress in more detail.)

    I didn’t verify what happens if power is lost while the uncache operation is in progress. I would assume that LVM boots with the cache still in use and that you can simply re-run the uncache operation.
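
    If you want to see how many dirty blocks are still waiting to be flushed (for example while --uncache is running, or just to keep an eye on a writeback cache), the cache reporting fields of lvs can be used. This is only a sketch, assuming the storage/data names from this example and a reasonably recent lvm2 that provides these fields:

    # Show cache usage and the number of dirty (not yet flushed) blocks;
    # -a also lists the hidden cache sub-volumes.
    lvs -a -o lv_name,cache_total_blocks,cache_used_blocks,cache_dirty_blocks storage

    # Re-run every 10 seconds to watch the flush progress.
    watch -n 10 'lvs -o lv_name,cache_dirty_blocks storage/data'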

Note that after the uncache command, both the cache data and cache metadata logical volumes are removed (freed without any history), so if you want to re-attach the cache device, you have to build it from scratch (all the lvcreate and lvconvert commands from step 4). The cache device itself will still be part of the volume group after the uncache operation, so you don't need to redo that part.
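
If the simpler cachevol method mentioned in step 4 is enough for your use case, here is a minimal sketch of it. It assumes an lvm2 version new enough to support --cachevol (2.03 or later); the LV name lvmcache and the 900m size are only illustrative:

    # cachevol variant: a single LV on the fast device holds both the
    # cache data and the cache metadata, so no separate metadata LV and
    # no cache-pool conversion step are needed.
    lvcreate -n lvmcache -L 900m storage /dev/sdd
    lvconvert --type cache --cachevol storage/lvmcache --cachemode writeback storage/data

    # Removal works the same way as with a cache pool:
    lvconvert --uncache storage/data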

And as usual, always have a complete, verified, up-to-date backup before messing with any important data!

The above LVM cache setup will look like the following according to lsblk -sp:

NAME                                             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
/dev/mapper/storage-data                         253:3    0    10G  0 lvm  /root/test
├─/dev/mapper/storage-lvmcache--data_cpool_cdata 253:0    0   996M  0 lvm  
│ └─/dev/sdd                                       8:48   0     1G  0 disk 
├─/dev/mapper/storage-lvmcache--data_cpool_cmeta 253:1    0    12M  0 lvm  
│ └─/dev/sdd                                       8:48   0     1G  0 disk 
└─/dev/mapper/storage-data_corig                 253:2    0    10G  0 lvm  
  ├─/dev/sda                                       8:0    0     2G  0 disk 
  ├─/dev/sdb                                       8:16   0     4G  0 disk 
  └─/dev/sdc                                       8:32   0     4G  0 disk 
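
You can get the matching view from the LVM side, too. A small sketch, assuming the same storage volume group:

    # LVM's view of the same stack: -a also shows the hidden sub-volumes
    # such as data_corig and the cache pool's _cdata/_cmeta volumes.
    lvs -a -o lv_name,segtype,devices storage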

Some additional tips on LVM cache usage:

You can tune the LVM cache a bit, even though the logic that selects what to keep in the cache is fully automatic. See man lvmcache for full details. Some examples:

  • List current cache settings (default values will not be listed):

    lvs -o cachesettings storage/data
    
  • Clear all cache settings (use defaults for everything):

    lvchange --cachesettings '' storage/data
    
  • Tune the cache to start flushing the write cache to the backing storage when more than 10% of the cache is used for write buffering:

    lvchange --cachesettings 'high_watermark=10' storage/data
    
  • Tune the cache to keep flushing the write cache as long as there is anything left to flush, once flushing has been started for any reason:

    lvchange --cachesettings 'low_watermark=0' storage/data
    
  • Tune the cache to automatically pause flushing for 50 ms if the backing storage is accessed (to avoid adding flushing latency to normal access):

    lvchange --cachesettings 'pause_writeback=50' storage/data
    
  • Automatically flush even a small amount of data to the backing storage once it has been in the cache for more than 60 seconds:

    lvchange --cachesettings 'autocommit_time=60000' storage/data
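
If you want to see which values are actually in effect in the kernel, including the defaults that lvs -o cachesettings leaves out, the kernel-side reporting fields can be queried. This is a sketch and assumes a reasonably recent lvm2 that provides these fields:

    # Show the cache policy and settings as currently used by the kernel,
    # including default values not stored in the LVM metadata.
    lvs -o lv_name,kernel_cache_policy,kernel_cache_settings storage/data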
    
Answered By: Mikko Rantalainen