Mirror of https://github.com/Corsinvest/cv4pve-autosnap.git (synced 2026-04-25 17:05:48 +03:00)
[GH-ISSUE #39] Host is killed / pool filled up by snapshot creation when free zfs space is lower than written bytes in dataset? (#34)
Originally created by @lameduckonit on GitHub (Aug 31, 2020).
Original GitHub issue: https://github.com/Corsinvest/cv4pve-autosnap/issues/39
We had a crash of a whole host in our cluster (Proxmox 6.2) in the following scenario.
We have not yet reconstructed the scenario - it is only a guess... but the impact was really severe...
@franklupo commented on GitHub (Aug 31, 2020):
Can you give me the autosnap execute string?
@lameduckonit commented on GitHub (Aug 31, 2020):
In the crontab on a VM that sends the request to a host of the cluster, the following was used:
/bin/cv4pve-autosnap --host=10.10.10.10 --username=snapshots@pve --password='xxx' --vmid=all snap --keep=7 --label='D-'
Does this information help?
@franklupo commented on GitHub (Aug 31, 2020):
Use zfs?
@lameduckonit commented on GitHub (Aug 31, 2020):
Yes, a RaidZ2 ZFS pool with compression and 6TB net space (8x1TB SSD). Each VM owns a dataset in this pool. The light green parts in the image above are free space (a little more than 2TB in the beginning); the darker turquoise is the used space in TB of the whole pool.
@franklupo commented on GitHub (Sep 1, 2020):
Hi,
execute the command "zfs list -t snapshot" and attach the result.
best regards
@lameduckonit commented on GitHub (Sep 1, 2020):
We had to repair the host - there are no snapshots on the host right now. We have stopped using this tool for now.
I made no screenshots of the ZFS state as the crash happened, but including the other snapshots there had been about 2TB of free space in the pool. The old snapshots themselves had all been very small - from the affected 2TB VM it was only a few hundred MB, since the VM (file storage) is not used very much. The other snapshots had not been that big either - everything seemed to run fine before.
@franklupo commented on GitHub (Sep 1, 2020):
The autosnap uses the same snapshot mechanism as the Proxmox VE web UI.
I think the problem is content variation in the VMs.
We have never encountered this problem in years of operation; we never use ZFS compression in production.
@lameduckonit commented on GitHub (Sep 1, 2020):
I do not think that the data changes and the growth of the VMs' snapshots played a role. My understanding of ZFS snapshots is: at the point a snapshot is taken, its size is 0 - and from then on it grows. If you take a new snapshot in the chain, the older snapshots stop growing. (So it is quite different from snapshots in VMware or other systems.)
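The mental model described above - a fresh snapshot starts at 0 bytes and grows as the live data diverges - can be sketched with a toy copy-on-write model. This is a hypothetical illustration in Python, not real ZFS code; it also shows how space shared between two snapshots is charged to neither snapshot's own USED column, so `zfs list -t snapshot` can report 0B for snapshots that nonetheless hold space:

```python
# Toy model of ZFS copy-on-write snapshot accounting (illustration only,
# not real ZFS code).  A snapshot freezes the block set the dataset
# references at creation time; the per-snapshot USED column counts only
# blocks unique to that snapshot, so a snapshot can report ~0B while
# still holding space that is shared with other snapshots.

class Dataset:
    def __init__(self):
        self.next_block = 0
        self.live = set()        # blocks referenced by the live dataset
        self.snaps = {}          # snapshot name -> frozen block set

    def write(self, n):
        """Overwrite n blocks: the oldest live blocks are replaced by new ones."""
        for b in sorted(self.live)[:n]:
            self.live.discard(b)
        for _ in range(n):
            self.live.add(self.next_block)
            self.next_block += 1

    def snapshot(self, name):
        self.snaps[name] = frozenset(self.live)

    def snap_used(self, name):
        """Blocks referenced ONLY by this snapshot (like 'used' in zfs list)."""
        others = set(self.live)
        for other, blocks in self.snaps.items():
            if other != name:
                others |= blocks
        return len(self.snaps[name] - others)

    def used_by_snapshots(self):
        """Total space that deleting ALL snapshots would free."""
        held = set()
        for blocks in self.snaps.values():
            held |= blocks
        return len(held - self.live)

ds = Dataset()
ds.write(100)                      # dataset references 100 blocks
ds.snapshot("daily-1")
print(ds.snap_used("daily-1"))     # 0  - a fresh snapshot occupies no space
ds.write(30)                       # overwrite 30 blocks
print(ds.snap_used("daily-1"))     # 30 - the snapshot now pins the old data
ds.snapshot("daily-2")
ds.write(10)                       # overwrite 10 more blocks
print(ds.snap_used("daily-2"))     # 0  - those blocks are ALSO in daily-1,
                                   #      so neither snapshot's USED shows them
print(ds.used_by_snapshots())      # 40 - yet the snapshots together hold 40
```

The last two prints mirror the observation below: a snapshot can show 0 bytes in `zfs list -t snapshot` while its data is still counted against the pool, because shared space only appears in aggregate properties such as `usedbysnapshots`.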
To clarify what happened:
- Just before the moment the new snapshots were taken, there had been 2TB of free space in the pool.
- A few seconds after the snapshots were taken, the host started crashing with no space left in the pool.
- The VMs weren't in use when the snapshots were taken; it was in the evening, and no one was working on them or maintaining them. CPU was at 0-1%.
- The snapshots from 24h before took only a few GB for all VMs, incl. the big one.
- The snapshots from the crash showed a size of 0 bytes in zfs list -t snapshot.
- After removing the 0-byte snapshot of the 2TB VM, 1TB of free space was recovered in the pool.
- After removing the 0-byte snapshot of another 500GB VM, in sum 1.8TB of free space was back in the pool.
We have used ZFS compression for about 3 years on all of our hosts and also on the separate backup storage - we never had such a problem before. With the previous ZoL 0.6 it also worked fine.
But maybe we also never tried to snapshot a dataset which is bigger than the free space in the pool - and we never used automatic snapshots, which might issue many snapshot commands to the filesystem in a short time.
Maybe it is a combination of compression / raidz2 storage calculation / snapshots - I don't know yet - but it had better never happen again - it's really a bad feeling, losing a whole host during a daily snapshot/backup routine :)
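Until the root cause is reconstructed, a defensive pre-flight check before the cron job fires could compare the pool's free space against each dataset's written bytes, which is exactly the relationship the issue title asks about. A hedged sketch in Python - the sample zfs output, pool name `tank`, and safety margin are assumptions for illustration; in practice the tables would come from running the `zfs list -Hp` commands shown in the comments:

```python
# Pre-flight check sketch: flag datasets whose bytes written since the
# last snapshot could overrun the pool's free space.  The sample output
# below is hypothetical; in practice it would come from running:
#   zfs list -Hp -o name,avail -d 0 tank
#   zfs list -Hp -o name,written -r tank

SAMPLE_AVAIL = "tank\t2199023255552\n"          # pool free space: 2 TiB

SAMPLE_WRITTEN = (
    "tank\t0\n"
    "tank/vm-100-disk-0\t2748779069440\n"       # 2.5 TiB written since last snap
    "tank/vm-101-disk-0\t104857600\n"           # 100 MiB
)

MARGIN = 1.2   # assumed safety factor for raidz/compression overhead

def parse(table):
    """Parse tab-separated `zfs list -Hp` output into {name: bytes}."""
    out = {}
    for line in table.strip().splitlines():
        name, value = line.split("\t")
        out[name] = int(value)
    return out

def unsafe_datasets(avail_table, written_table, margin=MARGIN):
    """Return datasets whose pending writes could overrun the pool's free space."""
    avail = sum(parse(avail_table).values())
    return [name for name, written in parse(written_table).items()
            if written * margin > avail]

print(unsafe_datasets(SAMPLE_AVAIL, SAMPLE_WRITTEN))
# ['tank/vm-100-disk-0']
```

Such a check would let the cron job skip (or alert on) the one oversized dataset instead of snapshotting everything and filling the pool.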
@franklupo commented on GitHub (Sep 1, 2020):
If you create a ZFS snapshot manually, does it work or are you having problems?
@franklupo commented on GitHub (Sep 1, 2020):
Execute these commands and attach the result.
zfs list
zfs list -t snapshot
df -h
Best regards
@lameduckonit commented on GitHub (Sep 3, 2020):
Snapshots work - but at the moment we stopped using them
zfs list
zfs list -t snapshot
df -h
@franklupo commented on GitHub (Sep 3, 2020):
execute and attach result
zfs get all
best regards
@lameduckonit commented on GitHub (Sep 3, 2020):
...there it is:
zfs_get_all.txt
Thank you for your efforts!