VM "not a bootable disk" after upgrade from Promox 6.x to 7.x

Today I decided to upgrade Proxmox Virtual Environment (PVE)
from V6.2.4 to V7.1. 
I did it according to the following steps:

  1. First upgrade to the most recent 6.x version with apt-get update followed by apt-get dist-upgrade.
  2. Then I followed the official guide: Upgrade Proxmox VE from 6.x to 7.0.

During the upgrade I had some errors about "disk not found",
which worried me, but the upgrade just continued.
(Maybe they are related to the issues I’m now running into.)
After the upgrade finished, I restarted the host to complete it.
The LXC containers (start on boot) came up without any issue.
Some VMs (start on boot) also worked right away.

[Screenshot, presumably of the boot or upgrade output.]

Some of the VMs gave the error Boot failed: not a bootable disk in the console and kept rebooting. After some Googling I found a post saying that it sometimes helps to reboot the host,
so I did that, and after the second reboot all the VMs gave that error.
I Googled for hours and found a lot of similar issues.
The only thing that worked for most of the VMs was restoring the backup.
Unfortunately, this does not work for the most important machine,
the mail server.
It contains 170 GB of mail, and the only backups I have are the Proxmox backups (7 images). None of them works; they all give the same issue.
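
(For reference: restoring one of those Proxmox backups can also be done from the host shell instead of the web interface. A minimal sketch; the archive name, the target VMID 141 and the storage name are only examples:)

    # Restore a vzdump backup to a NEW VMID so the original (broken) disk is left untouched
    # (archive path, VMID and storage are placeholders - adjust to your environment)
    qmrestore /mnt/pve/backup-server/dump/vzdump-qemu-140-2021_12_20-02_00_01.vma.zst 141 \
        --storage vm_instances --unique 1

Restoring to a new VMID leaves the broken, but possibly still repairable, original disk in place.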

The questions:

  1. How can I make the VM boot again?
    If it’s not fixable, is there a way to get into the disk so I can recover the data?
    (See the sketch after this list.)
  2. How did this occur? Is it my fault? Is it a bug? Is it a known issue?
  3. I don’t dare to reboot Proxmox anymore,
    as I’m scared that other VMs will also break permanently.
    How can I be sure that this won’t happen again?
    Or at least, how can I make sure the backups work?
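
(As a sketch for question 1: the VM’s virtual disk can be inspected directly from the Proxmox host. The volume ID comes from the VM config further down; the device paths and the mount point below are only examples.)

    # Ask Proxmox where the virtual disk actually lives (volume ID taken from the VM config)
    pvesm path vm_instances:vm-140-disk-0

    # If that prints a block device (LVM/LVM-thin/ZFS), inspect it directly:
    fdisk -l /dev/pve/vm-140-disk-0          # example path - use whatever pvesm printed

    # If it prints a raw image file instead, attach it with partition scanning:
    losetup --find --show --partscan /path/to/vm-140-disk-0.raw
    fdisk -l /dev/loop0

    # If partitions are visible, mount one read-only and copy the data off:
    mkdir -p /mnt/vm140
    mount -o ro /dev/loop0p1 /mnt/vm140

(In this particular case the partition table turned out to be gone, see UPDATE2 below, so a plain mount would not have worked anyway.)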

Some important facts:

  • I’m 100% sure all the VMs worked fine before the Proxmox upgrade
  • I created a backup of each machine to a network share (backup server)
    before I executed any upgrade-related command
  • The backups were made via the Proxmox web interface
  • All the VMs run Ubuntu 20.04 LTS, including the mail server
  • I tried to set the BIOS to UEFI (without success; see the note after this list)
  • I’m not a Proxmox pro user, so if extra data is required,
    please explain how I can get it, to avoid unnecessary posts
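
(A note on the UEFI attempt: the firmware type is a per-VM option in Proxmox, sketched below for VM 140. Switching a guest that was installed for legacy BIOS/MBR boot to OVMF will generally not make it boot, which presumably is why this attempt did not help.)

    # Show the current firmware type (no "bios:" line means the default, SeaBIOS)
    qm config 140 | grep -i bios

    # Switch to UEFI firmware (OVMF) and back again - only useful for guests that were
    # actually installed in UEFI mode; a legacy MBR/GRUB install will not boot under OVMF
    qm set 140 --bios ovmf
    qm set 140 --bios seabios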

The error:

    Machine UUID a238b981-27dd-4ebd-acee-1a9ee97d66a1
    Booting from Hard Disk...
    Boot failed: not a bootable disk
    

[Manually transcribed from a console screenshot.]

pveversion -v output

root@hv1:/home/axxmin# pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.13.19-2-pve)
pve-manager: 7.1-8 (running version: 7.1-8/5b267f33)
pve-kernel-helper: 7.1-6
pve-kernel-5.13: 7.1-5
pve-kernel-5.4: 6.4-11
pve-kernel-5.3: 6.1-6
pve-kernel-5.13.19-2-pve: 5.13.19-4
pve-kernel-5.4.157-1-pve: 5.4.157-1
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
ceph-fuse: 14.2.21-1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.0
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-5
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.0-14
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.0-4
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.2.0-3
proxmox-backup-client: 2.1.2-1
proxmox-backup-file-restore: 2.1.2-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-4
pve-cluster: 7.1-2
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-3
pve-ha-manager: 3.3-1
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.12.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.1-pve3

Backup configuration from the backup in question:

balloon: 6144
boot: cdn
bootdisk: sata0
cores: 2
ide2: none,media=cdrom
memory: 12288
name: axx-mcow-srv01
net0: virtio=4E:86:95:6A:FC:46,bridge=vmbr20,firewall=1
numa: 0
onboot: 1
ostype: l26
sata0: vm_instances:vm-140-disk-0,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=a238b981-27dd-4ebd-acee-1a9ee97d66a1
sockets: 1
vmgenid: 971eb84d-7502-4a68-97af-66c595c011b9
#qmdump#map:sata0:drive-sata0:vm_instances:raw:

The VM config

root@hv1:~# qm config 140
balloon: 6144
boot: cdn
bootdisk: sata0
cores: 2
ide2: none,media=cdrom
memory: 12288
name: axx-mcow-srv01
net0: virtio=4E:86:95:6A:FC:46,bridge=vmbr20,firewall=1
numa: 0
onboot: 1
ostype: l26
sata0: vm_instances:vm-140-disk-0,size=200G
scsihw: virtio-scsi-pci
smbios1: uuid=a238b981-27dd-4ebd-acee-1a9ee97d66a1
sockets: 1
vmgenid: 66102f99-158b-451b-a8e2-187ebed7b183
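
One detail worth noting: both configs still use the legacy boot: cdn / bootdisk: sata0 syntax. That is still accepted, but the boot order can also be pinned explicitly with the newer order= syntax; a minimal sketch (not claimed to be the cause of the problem):

    # Show the current boot settings, then pin the order explicitly (quotes keep the
    # semicolons away from the shell). This only controls which devices SeaBIOS tries;
    # it cannot help if the disk itself has lost its partition table.
    qm config 140 | grep -E '^(boot|bootdisk):'
    qm set 140 --boot 'order=sata0;ide2;net0'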

UPDATE1

I found a thread, After Backup – boot failed: not a bootable disk,
on the Proxmox Support Forum, which matches my understanding of the problem.
As I don’t have a separate backup of the partition table,
I found a "boot-repair disk" and tried it.
This did not fix my VM,
but it gave me some extra information, which may be useful.

screenshot – "useful" information(?)

screenshot – more information


UPDATE2

After a boot with GParted Live, I can say that the partition table is gone.
I tried to rebuild it with testdisk
(see Restore Damaged or Corrupted Linux Partition Table),
but this is not working.

[Screenshot: TestDisk results.]

Is it possible to create a VM with exactly the same configuration (incl. disk size), install Ubuntu on it (with the same disk settings, as we always use the defaults) and copy that partition table to the "broken" server?
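
(In principle, sfdisk can dump the partition table of an identically sized and identically partitioned reference disk and write it to the broken one. A rough sketch, with device names purely illustrative, and with the big caveat that the layouts must really match, otherwise this makes things worse. The answer below took the testdisk route instead.)

    # On the freshly installed reference VM (same disk size, default Ubuntu layout):
    sfdisk --dump /dev/sda > sda-partition-table.dump

    # On the broken VM, after copying the dump file over, write the table back:
    sfdisk /dev/sda < sda-partition-table.dump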

Asked By: CodeNinja


I could "fix" the partition table so the server booted again. 
After that I was able to migrate all the stuff to a new fresh server
to make sure everything works as it should. 
I did it with the steps below.

  1. Boot the server with the boot repair disk.
  2. Close the automatically started tool "boot repair".
  3. Open a terminal window and execute the command sudo testdisk /dev/sda
    This will start a (command line) tool to scan your disk (sda) for partitions.
  4. Confirm the disk with Proceed.
  5. Select the partition table type that was used.
    I had a "BIOS" disk, so I needed Intel.
    If you have a "UEFI" disk, select EFI GPT.
  6. Go for Analyse to analyse your disk.
  7. Continue with Quick search to execute a quick scan. 
    Depending on the disk, this can take seconds to an hour.
  8. I think in some cases a quick scan will be enough, but I noticed that a deeper search is more accurate and increases the chance of fixing the disk.
    In my case the quick scan was not enough.

The deep search will probably find multiple partitions.
You can browse through the files of a partition by pressing P.
You can go back by selecting .. or pressing Q, but be careful;
press Q one time too many and you need to start over again!

Find the partition with the etc directory inside
and mark it as a P (primary) partition with the arrow keys
(Left or Right).
In my case there was a second partition with a Linux-like file structure,
so I set this one as "Primary bootable" (*),
but I noticed that this partition is not always found.
If it’s not found, it is (in my test) enough to mark the partition
with the etc directory as primary (P).
(See the screenshot of my test machine below; a condensed recap of the whole menu path follows after it.)
Confirm the settings with Enter (continue),
followed by Write to write the new partition table to the disk.

[Screenshot of TestDisk on my test machine.]
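
To summarise, here is the menu path described above condensed into one block (the device name is only an example; treat this as a sketch rather than an exact recipe, since every disk looks different):

    # From the boot-repair live environment (device name is an example):
    sudo testdisk /dev/sda

    # Menu path used above:
    #   Proceed  ->  Intel           (MBR/"BIOS" disk; pick "EFI GPT" for UEFI disks)
    #   Analyse  ->  Quick Search    (run Deeper Search if nothing usable shows up)
    #   Mark the Linux root partition (the one containing /etc) as P with the
    #   Left/Right arrow keys and, if present, the boot partition as * (Primary bootable)
    #   Enter (continue)  ->  Write  ->  confirm, then reboot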

After the partition table has been written, open the boot repair tool
(which we closed at the beginning) to fix the Linux boot loader (GRUB).
You can perform the fix with Recommended repair.
If this option is not available, restart the boot repair tool,
as it probably was still active during the partition table scan.

[Boot Repair dialog, showing the "Recommended repair" button.]
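
If the boot repair tool is not available for some reason, roughly the same GRUB repair can be done by hand from any live environment. A minimal sketch, assuming a single-disk legacy-BIOS Ubuntu guest with its root filesystem on /dev/sda1; adjust the partition numbers to whatever TestDisk restored:

    # Mount the restored root filesystem plus the pseudo-filesystems GRUB needs
    mount /dev/sda1 /mnt
    mount --bind /dev  /mnt/dev
    mount --bind /proc /mnt/proc
    mount --bind /sys  /mnt/sys

    # Reinstall GRUB to the disk's MBR and regenerate its configuration
    chroot /mnt grub-install /dev/sda
    chroot /mnt update-grub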

I could fix our mail server like this.
For the machines I had already restored from a working backup,
I restored their (broken) backups onto some test machines and repeated the procedure.
I was able to fix all (5) of the machines I tested this on.

I hope this answer can help someone else.

Answered By: CodeNinja

Thank you VERY much for posting your solution – saved my life!

Situation: I adjusted the CPU resources of 5 Debian LAMP VMs in a Proxmox 7.2-7 environment and restarted them to make the changes effective. NONE of them came back to life; they all failed to boot with the well-known "not a bootable disk" error.
In my case, two details of the repair process were different:

sudo testdisk /dev/sda failed with the message "unable to open file or device /dev/sda: Not a directory", although the device in question is sda.
Resolution: I entered just sudo testdisk,
and after the logging options I could select /dev/sda.

The Quick Search resulted in this:

    Partition               Start        End    Size in sectors
>*Linux                    0  32 33   121 157 36    1951744 [boot]
P Linux                  121 157 37  3161  11 50   48828416 [system]
P Linux Swap            3161  11 51  3647  99 36    7813120
L Linux                 3647 164 38 13054  10 12  151113728 [data1]

I did not need to change or toggle anything;
I just went to the next screen with Enter, and then chose Write.
From there it worked as described above.

Maybe my small hints can help someone, too.

This is being tracked as Proxmox bug 2874.

Answered By: sry