[GH-ISSUE #39] Host is killed / pool filled up by snapshot creation when free zfs space is lower than written bytes in dataset? #34

Closed
opened 2026-02-26 17:44:04 +03:00 by kerem · 13 comments
Owner

Originally created by @lameduckonit on GitHub (Aug 31, 2020).
Original GitHub issue: https://github.com/Corsinvest/cv4pve-autosnap/issues/39

We had a crash of a whole host in our cluster (Proxmox 6.2), in the following scenario:

  • Different VMs were running on the host
  • free disk space was about 2 TB
  • the largest VM dataset consumed 2 TB
  • on day X, the remaining free space was lower than the written bytes of the dataset being snapshotted
  • at the moment the snapshot was taken, the free space was completely absorbed by the snapshot
  • the snapshot itself showed a size of 0 B; after deleting this seemingly empty snapshot, there was 1 TB of additional space
  • later we realized that other 0-byte snapshots taken at that point also consumed as much space as the original dataset they were taken from
  • please have a look at the statistics in the picture: nothing happens all week, but on this particular day the disk space implodes the moment the snapshots are taken

We have not yet reproduced the scenario - this is only a guess - but the impact was really severe...
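The core failure mode reported here is a snapshot being created while the pool's free space is lower than the dataset's written bytes. A minimal pre-flight check could guard against that. This is only a sketch, not part of cv4pve-autosnap; the byte counts are placeholders, which on a live system would come from `zfs get -Hp -o value written <dataset>` and `zpool get -Hp -o value free <pool>`:

```shell
#!/bin/sh
# Hypothetical pre-flight check: refuse to snapshot when the dataset's
# "written" bytes are not clearly below the pool's free space.
# On a real host the two numbers would be obtained with, e.g.:
#   written=$(zfs get -Hp -o value written rpool/data/vm-2103-disk-1)
#   free=$(zpool get -Hp -o value free rpool)
safe_to_snapshot() {
    written=$1   # bytes written to the dataset since the last snapshot
    free=$2      # bytes free in the pool
    if [ "$written" -lt "$free" ]; then
        echo "ok"
    else
        echo "refuse"
    fi
}

# Demo with the rough numbers from the report: ~2 TB written, ~2 TB free.
safe_to_snapshot 2199023255552 2199023255552   # written == free -> refuse
safe_to_snapshot 107374182400  2199023255552   # 100 GB written  -> ok
```

A stricter variant could require free space to exceed written bytes by a safety margin, since RAIDZ2 parity and metadata overhead make the raw comparison optimistic.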

![grafik](https://user-images.githubusercontent.com/70493353/91672398-9b9cf800-eb2e-11ea-867e-2a27f3cad7ff.png)

kerem closed this issue 2026-02-26 17:44:04 +03:00
Author
Owner

@franklupo commented on GitHub (Aug 31, 2020):

Can you give me the autosnap execute string?

Author
Owner

@lameduckonit commented on GitHub (Aug 31, 2020):

In the crontab on a VM that sends the request to a host of the cluster, the following was used:

/bin/cv4pve-autosnap --host=10.10.10.10 --username=snapshots@pve --password='xxx' --vmid=all snap --keep=7 --label='D-'

Does this information help?
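For reference, a crontab entry invoking that command might look like the following; the daily 22:00 schedule is an assumption, since the report does not state when the job ran:

```shell
# Hypothetical /etc/crontab entry running cv4pve-autosnap nightly at 22:00;
# host, credentials, and label are taken verbatim from the command quoted above.
0 22 * * * root /bin/cv4pve-autosnap --host=10.10.10.10 --username=snapshots@pve --password='xxx' --vmid=all snap --keep=7 --label='D-'
```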

Author
Owner

@franklupo commented on GitHub (Aug 31, 2020):

Do you use ZFS?

Author
Owner

@lameduckonit commented on GitHub (Aug 31, 2020):

Yes, a RAIDZ2 ZFS pool with compression and 6 TB net space (8x1TB SSD). Each VM owns a dataset in this pool. The light green parts in the image above are free space (a little more than 2 TB in the beginning); the darker turquoise is the used space in TB of the whole pool.

Author
Owner

@franklupo commented on GitHub (Sep 1, 2020):

Hi,
execute the command "zfs list -t snapshot" and attach the result.

best regards

Author
Owner

@lameduckonit commented on GitHub (Sep 1, 2020):

We had to repair the host; there are no snapshots on it right now. We stopped using this tool for now.

I made no screenshots of the ZFS state when the crash happened, but including the other snapshots there had been about 2 TB of free space in the pool. The old snapshots themselves had all been very small; for the 2 TB VM in question it was only a few hundred MB, since that VM (file storage) is not used very much. The other snapshots had not been that big either; everything seemed to run fine before.

Author
Owner

@franklupo commented on GitHub (Sep 1, 2020):

The autosnaps use the same mechanism as the Proxmox VE web UI.
I think the problem is the content variation of the VMs.
We have never encountered this problem in years of operation. We never use ZFS compression in production.

Author
Owner

@lameduckonit commented on GitHub (Sep 1, 2020):

I do not think that the data changes and the growth of the VM snapshots played a role. My understanding of ZFS snapshots is: at the moment a snapshot is taken, its size is 0, and from then on it grows. If you take a new snapshot in the chain, the older snapshots stop growing. (So it is quite different from snapshots in VMware or other systems.)

To clarify what happened:

  • Just before the new snapshots were taken, there had been 2 TB of free space in the pool

  • a few seconds after the snapshot was taken, the host started crashing with no space left in the pool

  • the VMs weren't in use when the snapshots were taken; it was in the evening, no one was working on them or maintaining them, and CPU was at 0-1%

  • the snapshots from 24 h earlier took only a few GB for all VMs, incl. the big one

  • the snapshots from the crash showed a size of 0 B in zfs list -t snapshot

  • after removing the 0-byte snapshot of the 2 TB VM, 1 TB of free space reappeared in the pool

  • after removing the 0-byte snapshot of another 500 GB VM, a total of 1.8 TB of free space was back in the pool

We have used ZFS compression for about 3 years on all of our hosts and also on the separate backup storage; we never had such a problem before. It also worked fine with the previous ZoL 0.6.

But maybe we also never tried to snapshot a dataset that is bigger than the free space in the pool, and we never used automatic snapshots, which might throw many snapshot commands at the filesystem in a short time.

Maybe it is a combination of compression / RAIDZ2 storage calculation / snapshots - I don't know yet - but it had better never happen again. It's really a bad feeling, losing a whole host during a daily snapshot/backup routine :)

Author
Owner

@franklupo commented on GitHub (Sep 1, 2020):

If you create a ZFS snapshot manually, does it work, or are you having problems?

Author
Owner

@franklupo commented on GitHub (Sep 1, 2020):

Execute these commands and attach the results:
zfs list
zfs list -t snapshot
df -h

Best regards

Author
Owner

@lameduckonit commented on GitHub (Sep 3, 2020):

Snapshots work, but we have stopped using them for the moment.

zfs list

NAME                              USED  AVAIL     REFER  MOUNTPOINT
rpool                            3.09T  3.34T      238K  /rpool
rpool/ROOT                       3.07G  3.34T      219K  /rpool/ROOT
rpool/ROOT/pve-1                 3.07G  3.34T     3.07G  /
rpool/data                       3.08T  3.34T      347K  /rpool/data
rpool/data/basevol-10000-disk-0   637M  15.4G      637M  /rpool/data/basevol-10000-disk-0
rpool/data/basevol-10001-disk-0   880M  15.1G      880M  /rpool/data/basevol-10001-disk-0
rpool/data/subvol-304-disk-0      110G  89.7G      110G  /rpool/data/subvol-304-disk-0
rpool/data/subvol-304-disk-1     14.8G  10.2G     14.8G  /rpool/data/subvol-304-disk-1
rpool/data/subvol-309-disk-0     50.2G   790M     50.2G  /rpool/data/subvol-309-disk-0
rpool/data/subvol-311-disk-0     4.34G  45.7G     4.34G  /rpool/data/subvol-311-disk-0
rpool/data/subvol-315-disk-0     1.04G   299G     1.04G  /rpool/data/subvol-315-disk-0
rpool/data/subvol-401-disk-0     5.76G  10.2G     5.76G  /rpool/data/subvol-401-disk-0
rpool/data/subvol-510-disk-0     26.9G  8.08G     26.9G  /rpool/data/subvol-510-disk-0
rpool/data/vm-1502-disk-0        93.9G  3.34T     93.9G  -
rpool/data/vm-1814-disk-0         272G  3.34T      272G  -
rpool/data/vm-2103-disk-0         132G  3.40T     69.5G  -
rpool/data/vm-2103-disk-1        2.03T  3.34T     2.03T  -
rpool/data/vm-2104-disk-0         132G  3.41T     61.9G  -
rpool/data/vm-2301-disk-0        14.4G  3.34T     14.4G  -
rpool/data/vm-327-disk-0         37.0G  3.34T     37.0G  -
rpool/data/vm-603-disk-0         49.0G  3.34T     49.0G  -
rpool/data/vm-701-disk-0         39.2G  3.34T     39.2G  -
rpool/data/vm-902-disk-0         91.7G  3.34T     91.7G  -

zfs list -t snapshot

NAME                                       USED  AVAIL     REFER  MOUNTPOINT
rpool/data/basevol-10000-disk-0@__base__     0B      -      637M  -
rpool/data/basevol-10001-disk-0@__base__  18.3K      -      880M  -

df -h

Filesystem                       Size  Used Avail Use% Mounted on
udev                             189G     0  189G   0% /dev
tmpfs                             38G   59M   38G   1% /run
rpool/ROOT/pve-1                 3.4T  3.1G  3.4T   1% /
tmpfs                            189G   54M  189G   1% /dev/shm
tmpfs                            5.0M     0  5.0M   0% /run/lock
tmpfs                            189G     0  189G   0% /sys/fs/cgroup
rpool                            3.4T  256K  3.4T   1% /rpool
rpool/ROOT                       3.4T  256K  3.4T   1% /rpool/ROOT
rpool/data                       3.4T  384K  3.4T   1% /rpool/data
rpool/data/basevol-10001-disk-0   16G  881M   16G   6% /rpool/data/basevol-10001-disk-0
rpool/data/basevol-10000-disk-0   16G  638M   16G   4% /rpool/data/basevol-10000-disk-0
rpool/data/subvol-304-disk-0     200G  111G   90G  56% /rpool/data/subvol-304-disk-0
rpool/data/subvol-304-disk-1      25G   15G   11G  60% /rpool/data/subvol-304-disk-1
rpool/data/subvol-309-disk-0      51G   51G  791M  99% /rpool/data/subvol-309-disk-0
rpool/data/subvol-311-disk-0      50G  4.4G   46G   9% /rpool/data/subvol-311-disk-0
rpool/data/subvol-315-disk-0     300G  1.1G  299G   1% /rpool/data/subvol-315-disk-0
rpool/data/subvol-401-disk-0      16G  5.8G   11G  36% /rpool/data/subvol-401-disk-0
rpool/data/subvol-510-disk-0      35G   27G  8.1G  77% /rpool/data/subvol-510-disk-0
/dev/fuse                         30M  136K   30M   1% /etc/pve
tmpfs                             38G     0   38G   0% /run/user/0
Author
Owner

@franklupo commented on GitHub (Sep 3, 2020):

Execute this command and attach the result:
zfs get all

best regards

Author
Owner

@lameduckonit commented on GitHub (Sep 3, 2020):

..there it is:

[zfs_get_all.txt](https://github.com/Corsinvest/cv4pve-autosnap/files/5169548/zfs_get_all.txt)

Thank you for your efforts!
