Bind mount points to the wrong NVMe device after the device is powered off and on

I am developing an all-flash storage application, and I found that a bind mount behaves strangely when an NVMe device is powered off and back on.
Distro: SUSE Linux Enterprise Server 15 SP4, kernel 5.14.21-150400.24.46-default

Bind-mount the partition device node /dev/nvme10n1p1 onto /mnt/10n1p1:

# mount --bind /dev/nvme10n1p1 /mnt/10n1p1
# lsblk /mnt/10n1p1
NAME       MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme10n1p1 259:11   0  3.6T  0 part

# stat /dev/nvme10n1p1
  File: /dev/nvme10n1p1
  Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: 5h/5d   Inode: 23620       Links: 1     Device type: 103,b

# stat /mnt/10n1p1
  File: /mnt/10n1p1
  Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: 5h/5d   Inode: 23620       Links: 1     Device type: 103,b
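The "Device type: 103,b" field in the stat output is the device number in hexadecimal, so it is not obvious at a glance that it matches the 259:11 reported by lsblk. A minimal sketch for printing it in decimal, assuming GNU stat (the function names hex_to_devnum and devnum are made up for this example):

```shell
# Convert the hexadecimal major and minor that stat prints into decimal.
hex_to_devnum() {
    printf '%d:%d\n' "0x$1" "0x$2"
}

# Print the device number behind a special file as decimal MAJOR:MINOR.
# Relies on GNU stat, whose %t and %T formats give major and minor in hex.
devnum() {
    hex=$(stat -c '%t %T' "$1") || return 1
    hex_to_devnum "${hex% *}" "${hex#* }"
}
```

Running devnum on the device node and on the bind mount point should print the same pair, since both are the same inode. For the values above, hex_to_devnum 103 b gives 259:11.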

Power the NVMe device off and back on within a short time to simulate a hot-plug event or a power surge.

# ls -lat /sys/block | grep "nvme10n1"
.../0000:be:00.0/nvme/nvme2/nvme10n1
# lspci -vmms 0000:be:00.0
...PhySlot:        168
# PHYSLOT=168
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT}/power

After the power cycle, the bind mount points to the device that has just been powered on, which now appears under a new name.

# lsblk /mnt/10n1p1
NAME         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme30n2     259:11   0  3.6T  0 disk
└─nvme30n2p1 259:51   0  3.6T  0 part

# stat /dev/nvme30n2
  File: /dev/nvme30n2
  Size: 0               Blocks: 0          IO Block: 4096   block special file
Device: 5h/5d   Inode: 24836       Links: 1     Device type: 103,b

After further testing, I found that the bind mount can even end up pointing to a different drive, if that drive is powered on shortly after the original drive is powered off.

# mount --bind  /dev/nvme0n1p1 /mnt/0n1
# lsblk 0n1
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme0n1p1 259:44   0  3.6T  0 part
# nvme id-ctrl /dev/nvme0n1 | grep sn
sn        : PHLJ043200234P0DGN
# nvme id-ctrl /dev/nvme1n1 | grep sn
sn        : PHLJ043105AU4P0DGN

# PHYSLOT_0=195
# PHYSLOT_1=194
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT_1}/power
# sleep 5
# date && echo 0|sudo tee /sys/bus/pci/slots/${PHYSLOT_0}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT_1}/power
# sleep 5
# date && echo 1|sudo tee /sys/bus/pci/slots/${PHYSLOT_0}/power

# lsblk /mnt/0n1
NAME       MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
nvme31n2p1 259:44   0  3.6T  0 part

# nvme id-ctrl /dev/nvme31n2 | grep sn
sn        : PHLJ043105AU4P0DGN
# nvme id-ctrl /dev/nvme32n2 | grep sn
sn        : PHLJ043200234P0DGN

I noticed that after the power cycle the mount point’s inode number differs from the new device node’s inode number, but the two inodes carry the same minor number. I think this is why the original mount point can access the new drive, which causes data corruption. My guess is that the mount point holds a reference to the inode of the powered-off drive, so that inode is never destroyed. The new drive then powers on and takes the reclaimed minor number, which is the same as the original drive’s. I’m not sure whether this behavior is expected, a limitation, or a bug in bind mounts.
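One way to check this guess from user space is to ask sysfs which block device currently owns the device number that the mount point's inode carries. The following is only a sketch, assuming GNU stat and the usual /sys/dev/block/MAJOR:MINOR symlinks; devnum_owner and path_owner are hypothetical helper names, and the optional sysfs root argument exists only so the lookup can be tried against a fake tree.

```shell
# Print the kernel name of the block device that currently owns MAJOR:MINOR.
# The optional second argument (default /sys) is a hypothetical hook so the
# lookup can be pointed at a fake tree for testing.
devnum_owner() {
    link="${2:-/sys}/dev/block/$1"
    [ -L "$link" ] || return 1
    basename "$(readlink "$link")"
}

# Print the current owner of the device number that a path refers to.
# GNU stat reports %t (major) and %T (minor) in hexadecimal.
path_owner() {
    maj=$(stat -c '%t' "$1") || return 1
    min=$(stat -c '%T' "$1") || return 1
    devnum_owner "$(printf '%d:%d' "0x$maj" "0x$min")"
}
```

Recording the output of path_owner for the mount point right after mounting and comparing it again after the power cycle would show the change. In the first experiment, 259:11 was owned by nvme10n1p1 before and by nvme30n2 afterwards.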

Asked By: Bofan Liu

||

It turns out this is a Linux kernel block device (bdev) lifecycle bug, and it is fixed in version 5.15.

Related patch linked here.
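To get a rough idea of whether a machine might be affected, the running kernel release can be compared against 5.15. This is only a hint, not a definitive test: distribution kernels often backport fixes, so an older version number does not prove the bug is present. The helper name kernel_older_than is made up for this sketch.

```shell
# Succeed if kernel release $1 (e.g. "5.14.21-150400.24.46-default") is
# older than MAJOR.MINOR given as $2 and $3. Only major and minor are compared.
kernel_older_than() {
    rel=${1%%-*}        # drop the distro suffix after the first '-'
    major=${rel%%.*}    # text before the first '.'
    rest=${rel#*.}      # text after the first '.'
    minor=${rest%%.*}   # text up to the next '.'
    [ "$major" -lt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -lt "$3" ]; }
}

if kernel_older_than "$(uname -r)" 5 15; then
    echo "kernel $(uname -r) predates 5.15; check whether the fix was backported"
fi
```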

Answered By: Bofan Liu