Why is a filesystem unmounted but still in use?

I have been using ext4 filesystems for a long time, and this is the first time I have seen such weird behavior from an ext4 filesystem.

There is an ext4 filesystem on /dev/dm-2.

An I/O error happened in the underlying device, and the filesystem was remounted read-only.
That is fine and expected given the configuration.
But for some unknown reason, it is now impossible to completely unmount the filesystem.

The command umount /the/mount/point returned with success. Further runs of that command say "Not mounted".

The mount entry is gone from the output of the mount command. The filesystem is not mounted anywhere else.
But several things show that it is still in use.

First: I can’t see the usual EXT4-fs: unmounting filesystem text in dmesg. In fact, there is nothing at all in dmesg.

Second thing (it speaks for itself that something is wrong):

root# cat /proc/meminfo | grep -i dirty
Dirty:           9457728 kB
root# time sync

real    0m0.012s
user    0m0.000s
sys     0m0.002s
root# cat /proc/meminfo | grep -i dirty
Dirty:           9453632 kB

Third thing: the debug directory /sys/fs/ext4/dm-2 still exists.
I tried writing "1" to /sys/fs/ext4/dm-2/simulate_fail in the hope that it would bring the filesystem down, but it does nothing and shows nothing in dmesg.

Finally, the fourth thing, which makes the device unusable:

root# e2fsck -fy /dev/dm-2
e2fsck 1.46.5 (30-Dec-2021)
/dev/dm-2 is in use.
e2fsck: Cannot continue, aborting.

I understand that it is possible to reboot, etc. This question is not about solving a simple newbie problem. I want somebody experienced with the ext4 filesystem to help me understand what can cause this behavior.

The dm-2 device is not mounted anywhere else, not bind-mounted, not in use by anything else.

Nothing else was using the dirty cache at the moment I measured it with cat /proc/meminfo | grep -i dirty.

The umount call that succeeded was not an MNT_DETACH (no -l flag was used). Despite that, it succeeded almost immediately, which is weird. The mount point is no longer mounted, but as described above, it is easy to see that the filesystem is NOT unmounted.

Update: as A.B suggested, I checked whether the filesystem is still mounted in a different namespace. I didn’t mount it in a different namespace, so I didn’t expect to see anything. Surprisingly, it was mounted in a different namespace, namely this one (username changed):

4026533177 mnt       1 3411291 an-unrelated-nonroot-user       xdg-dbus-proxy --args=43

I tried to enter that namespace and unmount it using nsenter -t 3411291 -m -- umount /the/mount/point

This resulted in Segmentation fault (Core dumped), and the following in dmesg:

[970130.866738] Buffer I/O error on dev dm-2, logical block 0, lost sync page write
[970130.867925] EXT4-fs error (device dm-2): ext4_mb_release_inode_pa:4846: group 9239, free 2048, pa_free 4
[970130.870291] Buffer I/O error on dev dm-2, logical block 0, lost sync page write
[970130.949466] divide error: 0000 [#1] PREEMPT SMP PTI
[970130.950677] CPU: 49 PID: 4118804 Comm: umount Tainted: P        W  OE      6.1.68-missmika #1
[970130.953056] Hardware name: OEM X79G/X79G, BIOS 4.6.5 08/02/2022
[970130.953121] RIP: 0010:mb_update_avg_fragment_size+0x35/0x120
[970130.953121] Code: 41 54 53 4c 8b a7 98 03 00 00 41 f6 44 24 7c 80 0f 84 9a 00 00 00 8b 46 14 48 89 f3 85 c0 0f 84 8c 00 00 00 99 b9 ff ff ff ff <f7> 7e 18 0f bd c8 41 89 cd 41 83 ed 01 0f 88 ce 00 00 00 0f b6 47
[970130.957139] RSP: 0018:ffffb909e3123a28 EFLAGS: 00010202
[970130.957139] RAX: 000000000000082a RBX: ffff91140ac554d8 RCX: 00000000ffffffff
[970130.957139] RDX: 0000000000000000 RSI: ffff91140ac554d8 RDI: ffff910ead74f800
[970130.957139] RBP: ffffb909e3123a40 R08: 0000000000000000 R09: 0000000000004800
[970130.957139] R10: ffff910ead74f800 R11: ffff9114b7126000 R12: ffff910eb31d2000
[970130.957139] R13: 0000000000000007 R14: ffffb909e3123b80 R15: ffff911d732beffc
[970130.957139] FS:  00007f6d94ab4800(0000) GS:ffff911d7fcc0000(0000) knlGS:0000000000000000
[970130.957139] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[970130.957139] CR2: 00003d140602f000 CR3: 0000000365690002 CR4: 00000000001706e0
[970130.957139] Call Trace:
[970130.957139]  <TASK>
[970130.957139]  ? show_regs.cold+0x1a/0x1f
[970130.957139]  ? __die_body+0x24/0x70
[970130.957139]  ? __die+0x2f/0x3b
[970130.957139]  ? die+0x34/0x60
[970130.957139]  ? do_trap+0xdf/0x100
[970130.957139]  ? do_error_trap+0x73/0xa0
[970130.957139]  ? mb_update_avg_fragment_size+0x35/0x120
[970130.957139]  ? exc_divide_error+0x3f/0x60
[970130.957139]  ? mb_update_avg_fragment_size+0x35/0x120
[970130.957139]  ? asm_exc_divide_error+0x1f/0x30
[970130.957139]  ? mb_update_avg_fragment_size+0x35/0x120
[970130.957139]  ? mb_set_largest_free_order+0x11c/0x130
[970130.957139]  mb_free_blocks+0x24d/0x5e0
[970130.957139]  ? ext4_validate_block_bitmap.part.0+0x29/0x3e0
[970130.957139]  ? __getblk_gfp+0x33/0x3b0
[970130.957139]  ext4_mb_release_inode_pa.isra.0+0x12e/0x350
[970130.957139]  ext4_discard_preallocations+0x22e/0x490
[970130.957139]  ext4_clear_inode+0x31/0xb0
[970130.957139]  ext4_evict_inode+0xba/0x750
[970130.989137]  evict+0xd0/0x180
[970130.989137]  dispose_list+0x39/0x60
[970130.989137]  evict_inodes+0x18e/0x1a0
[970130.989137]  generic_shutdown_super+0x46/0x1b0
[970130.989137]  kill_block_super+0x2b/0x60
[970130.989137]  deactivate_locked_super+0x39/0x80
[970130.989137]  deactivate_super+0x46/0x50
[970130.989137]  cleanup_mnt+0x109/0x170
[970130.989137]  __cleanup_mnt+0x16/0x20
[970130.989137]  task_work_run+0x65/0xa0
[970130.989137]  exit_to_user_mode_prepare+0x152/0x170
[970130.989137]  syscall_exit_to_user_mode+0x2a/0x50
[970130.989137]  ? __x64_sys_umount+0x1a/0x30
[970130.989137]  do_syscall_64+0x6d/0x90
[970130.989137]  ? syscall_exit_to_user_mode+0x38/0x50
[970130.989137]  ? __x64_sys_newfstatat+0x22/0x30
[970130.989137]  ? do_syscall_64+0x6d/0x90
[970130.989137]  ? exit_to_user_mode_prepare+0x3d/0x170
[970130.989137]  ? syscall_exit_to_user_mode+0x38/0x50
[970130.989137]  ? __x64_sys_close+0x16/0x50
[970130.989137]  ? do_syscall_64+0x6d/0x90
[970130.989137]  ? exc_page_fault+0x8b/0x180
[970130.989137]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[970130.989137] RIP: 0033:0x7f6d94925a3b
[970130.989137] Code: fb 43 0f 00 f7 d8 64 89 01 48 83 c8 ff c3 90 f3 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 c1 43 0f 00 f7 d8
[970130.989137] RSP: 002b:00007ffdd60f7d08 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[970130.989137] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007f6d94925a3b
[970130.989137] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055ca1c6f7d60
[970130.989137] RBP: 000055ca1c6f7b30 R08: 0000000000000000 R09: 00007ffdd60f6a90
[970130.989137] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[970130.989137] R13: 000055ca1c6f7d60 R14: 000055ca1c6f7c40 R15: 000055ca1c6f7b30
[970130.989137]  </TASK>
[970130.989137] Modules linked in: 88x2bu(OE) erofs dm_zero zram ext2 hfs hfsplus xfs kvdo(OE) dm_bufio mikasecfs(OE) simplefsplus(OE) melon(OE) mikatest(OE) iloveaki(OE) tls vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) ip6t_REJECT nf_reject_ipv6 ip6t_rt ipt_REJECT nf_reject_ipv4 xt_recent xt_tcpudp nft_limit xt_limit xt_addrtype xt_pkttype nft_chain_nat xt_MASQUERADE xt_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables binfmt_misc nfnetlink nvidia_uvm(POE) nvidia_drm(POE) intel_rapl_msr intel_rapl_common nvidia_modeset(POE) sb_edac nls_iso8859_1 x86_pkg_temp_thermal intel_powerclamp coretemp nvidia(POE) snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi cfg80211 joydev snd_hda_intel input_leds snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_midi rapl snd_seq_midi_event snd_rawmidi intel_cstate serio_raw pcspkr snd_seq video wmi snd_seq_device snd_timer drm_kms_helper fb_sys_fops snd syscopyarea sysfillrect sysimgblt soundcore
[970130.989137]  ioatdma dca mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr parport_pc ppdev lp parport drm efi_pstore ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear crct10dif_pclmul hid_generic crc32_pclmul ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 usbhid cdc_ether aesni_intel usbnet uas hid crypto_simd r8152 cryptd usb_storage mii psmouse ahci i2c_i801 r8169 lpc_ich libahci i2c_smbus realtek [last unloaded: 88x2bu(OE)]
[970131.024615] ---[ end trace 0000000000000000 ]---
[970131.203209] RIP: 0010:mb_update_avg_fragment_size+0x35/0x120
[970131.204344] Code: 41 54 53 4c 8b a7 98 03 00 00 41 f6 44 24 7c 80 0f 84 9a 00 00 00 8b 46 14 48 89 f3 85 c0 0f 84 8c 00 00 00 99 b9 ff ff ff ff <f7> 7e 18 0f bd c8 41 89 cd 41 83 ed 01 0f 88 ce 00 00 00 0f b6 47
[970131.207841] RSP: 0018:ffffb909e3123a28 EFLAGS: 00010202
[970131.209048] RAX: 000000000000082a RBX: ffff91140ac554d8 RCX: 00000000ffffffff
[970131.210284] RDX: 0000000000000000 RSI: ffff91140ac554d8 RDI: ffff910ead74f800
[970131.211512] RBP: ffffb909e3123a40 R08: 0000000000000000 R09: 0000000000004800
[970131.212749] R10: ffff910ead74f800 R11: ffff9114b7126000 R12: ffff910eb31d2000
[970131.213977] R13: 0000000000000007 R14: ffffb909e3123b80 R15: ffff911d732beffc
[970131.215181] FS:  00007f6d94ab4800(0000) GS:ffff911d7fcc0000(0000) knlGS:0000000000000000
[970131.216370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[970131.217553] CR2: 00003d140602f000 CR3: 0000000365690002 CR4: 00000000001706e0
[970131.218740] note: umount[4118804] exited with preempt_count 1

The machine still works, and it is possible to sync other filesystems:

root# sync -f /
root#

But a global sync is not possible:

root# sync
(goes into D state forever)

The dirty cache related to that ghost filesystem is not gone, and the filesystem is still "mounted".

What can be the cause of these issues?

Asked By: melonfsck

||

Disclaimer: I can’t and won’t explain in this answer why a partial kernel failure was triggered. This looks like a kernel bug, possibly triggered by the I/O error conditions.

TL;DR

A filesystem can remain in use when a new mount namespace inherits a mounted filesystem from the original mount namespace, but the propagation settings between the two do not cause an unmount in the original namespace to propagate to the new namespace. The command findmnt -A -o +PROPAGATION adds a column showing the propagation status of every visible mountpoint.
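
As a rough illustration of where that propagation status comes from, here is a minimal sketch that reads it from mountinfo-format text (the format of /proc/self/mountinfo). The function name mount_propagation is made up for this example. It assumes the optional fields (shared:N, master:N, ...) sit between the sixth field and the "-" separator, as described in proc(5), and it reports a mount with no such field as private.

```shell
#!/bin/sh
# mount_propagation is a hypothetical helper, not an existing tool.
# It reads mountinfo-format text on stdin and prints the propagation
# tags of the mount whose mount point (field 5) equals the argument.
mount_propagation() {
    awk -v target="$1" '
        $5 == target {
            tags = ""
            # Optional fields start at field 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++)
                tags = tags (tags == "" ? "" : ",") $i
            # No optional field at all means the mount is private.
            print (tags == "" ? "private" : tags)
        }'
}

# Possible use on a live system:
#   mount_propagation /mnt/propagation/test < /proc/self/mountinfo
```

Mount points containing spaces are escaped in mountinfo (for example as \040), which this sketch does not handle.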

Normally this is not supposed to happen in a systemd environment, because systemd makes / a shared mount very early (rather than the kernel default of private), thus allowing unmounts to propagate within their shared group. I would thus expect this to happen more easily in a non-systemd environment, or whenever a tool explicitly uses --make-private on some mounts. --make-private still has its uses, especially for virtual pseudo-filesystems.

One way to prevent this from happening could be to make such a mountpoint shared with mount --make-shared ... before a new mount namespace is created.

I made an experiment to illustrate what happens with shared versus non-shared mounts. I tried to make sure the experiment works the same in a systemd and in a non-systemd environment.

Experiment

This can be reproduced as shown below (some values such as /dev/loop0 have to be adapted):

# truncate -s $((2**20)) /tmp/test.raw
# mkfs.ext4 -Elazy_itable_init=0,lazy_journal_init=0 -L test /tmp/test.raw
mke2fs 1.47.0 (5-Feb-2023)

Filesystem too small for a journal
Discarding device blocks: done                            
Creating filesystem with 1024 1k blocks and 128 inodes

Allocating group tables: done                            
Writing inode tables: done                            
Writing superblocks and filesystem accounting information: done

# losetup -f --show /tmp/test.raw 
/dev/loop0
# mkdir -p /mnt/propagation/test

The following bind mount turns the directory into a mountpoint, so that the propagation can later be changed for the experiment without having to alter the whole system:

# mount --bind /mnt/propagation /mnt/propagation

Now different experiments can have different outcomes.

The unshare(1) man page says:

unshare since util-linux version 2.27 automatically sets
propagation to private in a new mount namespace to make sure that
the new namespace is really unshared. It’s possible to disable this
feature with option --propagation unchanged. Note that private is
the kernel default.

Other tools might do otherwise. Here we’ll change the underlying /mnt/propagation mountpoint instead and always use --propagation unchanged. This avoids getting different results for this experiment on non-systemd (kernel default: / is private) and systemd (systemd default: / is shared) systems.

  1. with shared

    # mount --make-shared /mnt/propagation
    # mount /dev/loop0 /mnt/propagation/test
    # ls /mnt/propagation/test
    lost+found
    # cat /proc/self/mountinfo | grep /mnt/propagation/test
    862 854 7:0 / /mnt/propagation/test rw,relatime shared:500 - ext4 /dev/loop0 rw
    

    Have a second (root) shell and unshare into a new mount namespace (I’ll change the prompt to NMNS# to distinguish it):

    # unshare -m --propagation unchanged --
    NMNS# cat /proc/self/mountinfo | grep /mnt/propagation/test
    1454 1453 7:0 / /mnt/propagation/test rw,relatime shared:500 - ext4 /dev/loop0 rw
    NMNS# cd /mnt/propagation/test
    

    The same shared:500 links the mount in the two namespaces: unmounting from one will unmount it from the other.

    In the original shell (in the original mount namespace) unmount it:

    # umount /mnt/propagation/test
    umount: /mnt/propagation/test: target is busy.
    

    Free the resource usage:

    NMNS# cd /
    
    # umount /mnt/propagation/test
    # 
    

    This time it worked.

    Observe that it also disappeared in the new mount namespace:

    NMNS# cat /proc/self/mountinfo | grep /mnt/propagation/test
    NMNS# 
    

    The kernel log (dmesg) will show that the filesystem was unmounted (everywhere), e.g.:

    EXT4-fs (loop0): unmounting filesystem e74e0353-ace0-4eff-86ae-30e288db853e.
    

    Quit the shell in the new mount namespace to clean up.

  2. with private

    # mount --make-private /mnt/propagation
    # mount /dev/loop0 /mnt/propagation/test
    # cat /proc/self/mountinfo | grep /mnt/propagation/test
    857 854 7:0 / /mnt/propagation/test rw,relatime - ext4 /dev/loop0 rw
    

    Not shared anymore.

    Elsewhere:

    # unshare -m --propagation unchanged --
    NMNS# cat /proc/self/mountinfo | grep /mnt/propagation/test
    1454 1453 7:0 / /mnt/propagation/test rw,relatime - ext4 /dev/loop0 rw
    NMNS# echo $$
    232529
    
    # umount /mnt/propagation/test
    # e2fsck /dev/loop0
    e2fsck 1.47.0 (5-Feb-2023)
    /dev/loop0 is in use.
    e2fsck: Cannot continue, aborting.
    
    
    
    # 
    

    The filesystem stayed mounted in the new mount namespace.

    To find such rogue namespace(s) from the original one, one can run something like this:

    # for pid in $(lsns --noheadings -t mnt -o PID); do nsenter -t "$pid" -m -- findmnt /mnt/propagation/test && echo $pid; done
    nsenter: failed to execute findmnt: No such file or directory
    TARGET                SOURCE     FSTYPE OPTIONS
    /mnt/propagation/test /dev/loop0 ext4   rw,relatime
    232529
    # 
    

    Note: the message nsenter: failed to execute findmnt: No such file or directory appeared because one of the mount namespaces belonged to a running LXC container where findmnt was not available. The loop did find the PID of the process in the new namespace that has the mountpoint (in real cases, this could be another PID in the same mount namespace; it doesn’t matter). In extreme cases, a dedicated command able to change mount namespace, check mounts and perform (u)mounts all in one would be required.

    This mount can be removed either by removing the remaining holding resource (PID 232529), which might be needed if the process actively uses the mounted filesystem (which would prevent umount from succeeding), or by unmounting it in this namespace:

    # nsenter -t 232529 -m -- umount /mnt/propagation/test
    # e2fsck /dev/loop0
    e2fsck 1.47.0 (5-Feb-2023)
    test: clean, 11/128 files, 58/1024 blocks
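
The nsenter loop above depends on findmnt existing inside each namespace. A possible alternative, sketched here under assumptions, reads /proc/PID/mountinfo from the original namespace instead, since that file lists the mounts visible to that process. The name find_mount_holders and its optional second argument (an alternative proc root, only there so the function can be tried on sample data) are hypothetical. The first argument may be a mount point (field 5) or a major:minor device number (field 3); matching on the device number may be more robust when the other namespace sees the filesystem under a different path.

```shell
#!/bin/sh
# find_mount_holders is a hypothetical helper, not an existing tool.
# Usage: find_mount_holders TARGET [PROC_ROOT]
# Prints "PID NAMESPACE" once per mount namespace that still has TARGET.
find_mount_holders() {
    target=$1
    proc_root=${2:-/proc}      # assumption: override only for testing
    seen=" "
    for info in "$proc_root"/[0-9]*/mountinfo; do
        [ -r "$info" ] || continue
        pid=${info%/mountinfo}
        pid=${pid##*/}
        # Field 3 is major:minor, field 5 is the mount point.
        awk -v t="$target" '$3 == t || $5 == t { found = 1 }
                            END { exit !found }' "$info" 2>/dev/null || continue
        # Identify the namespace by its ns/mnt link; fall back to the PID.
        ns=$(readlink "$proc_root/$pid/ns/mnt" 2>/dev/null) || ns="pid:$pid"
        case $seen in
            *" $ns "*) continue ;;   # namespace already reported
        esac
        seen="$seen$ns "
        printf '%s %s\n' "$pid" "$ns"
    done
}

# Possible use on a live system (as root):
#   find_mount_holders /mnt/propagation/test
#   find_mount_holders 7:0
```

Reading other users' /proc/PID/mountinfo and ns/mnt links generally requires root, and processes may exit while the loop runs, which is why unreadable entries are simply skipped.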
    


Answered By: A.B