[GH-ISSUE #1] Logger shows "Need to balance: True" but nothing happens #3

Closed
opened 2026-02-26 17:46:04 +03:00 by kerem · 9 comments

Originally created by @mattv8 on GitHub (May 4, 2022).
Original GitHub issue: https://github.com/cvk98/Proxmox-load-balancer/issues/1

Is this expected behavior?

```
INFO | START ***Load-balancer!***
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update
```

I have two nodes, which are already nearly balanced, so that could be the reason. See my screenshot below:
![image](https://user-images.githubusercontent.com/9312603/166742131-536a3b32-8729-4731-81bb-ae21a9e15927.png)


@cvk98 commented on GitHub (May 4, 2022):

It depends on many factors.
It may turn out that you have one option for migration, but the VM has a CD-ROM connected or its HDD is located on the node's local storage. In that case the balancer finds one option that would improve the situation, but cannot carry it out at the stage of checking whether the migration is possible. Output in DEBUG mode can tell more about what is happening.
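
For reference, here is a minimal sketch (not taken from plb.py; the host, API token, node, and VM ID are placeholders) of how the Proxmox migrate-precondition endpoint can be queried to see which local resources would block a migration:

```python
# Hedged sketch: query the migrate-precondition endpoint for a VM.
# HOST, TOKEN, NODE and VMID are placeholders, not values from this issue.
import requests

HOST = "https://pve1.example.com:8006"            # hypothetical cluster address
TOKEN = "PVEAPIToken=root@pam!balancer=<secret>"  # hypothetical API token
NODE, VMID = "PVE2", 202

resp = requests.get(
    f"{HOST}/api2/json/nodes/{NODE}/qemu/{VMID}/migrate",
    headers={"Authorization": TOKEN},
    verify=False,  # only acceptable for self-signed test clusters
)
resp.raise_for_status()
data = resp.json()["data"]

# Field names follow the public API docs; exact contents vary by PVE version.
print("local disks:    ", data.get("local_disks"))
print("local resources:", data.get("local_resources"))
```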


@cvk98 commented on GitHub (May 5, 2022):

In the README, I added the requirement that all nodes share common storage.


@mattv8 commented on GitHub (May 16, 2022):

Sorry for the delay. I do have common storage between all nodes; in fact, they are all identical: same CPUs, RAM, and storage. However, something strange is still happening. The algorithm sees that it needs to balance and finds an option, but the migration never actually happens, and the algorithm gets stuck in an infinite loop:

```
root@PVE1:~# python3 ~/Proxmox-load-balancer/plb.py
INFO | START ***Load-balancer!***
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 1
DEBUG | Starting vm_migration
DEBUG | VM:202 migration from PVE2 to "recipient"
DEBUG | The VM:202 has [{'is_tpmstate': 0, 'replicate': 1, 'cdrom': 0, 'volid': 'shared-zfs:vm-202-disk-1', 'drivename': 'efidisk0', 'is_unused': 0, 'is_vmstate': 0, 'size': 1048576, 'referenced_in_config': 1, 'shared': 0}, {'shared': 0, 'referenced_in_config': 1, 'size': 4194304, 'is_unused': 0, 'drivename': 'tpmstate0', 'is_vmstate': 0, 'volid': 'shared-zfs:vm-202-disk-2', 'cdrom': 0, 'is_tpmstate': 1, 'replicate': 1}]
INFO | Waiting 10 seconds for cluster information update
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
```

What do you think is holding it up? This is Proxmox Virtual Environment 7.2-3 with the latest pull from this repo.


@cvk98 commented on GitHub (May 17, 2022):

In theory:

  1. The script decides that the cluster is unbalanced.
  2. It goes through all the migration options and finds one that would improve the situation.
  3. It tries to migrate the selected VM, but cannot because of local VM resources: "The VM:202 has...".
  4. It decides again that the cluster is not balanced (for some reason VM:202 is no longer selected).
  5. BUT: any remaining migration would increase sum_of_deviations, so in this case sorted_variants is empty.

Here it would be necessary to add another algorithm that chooses a bad (but not critical) option; after that, the balancer would go back to working in the normal mode (a rough sketch of the idea follows below).
![image](https://user-images.githubusercontent.com/88323643/168769464-2884c262-16b3-48e1-bef7-1c8035d2ff28.png)
Such a cluster cannot be balanced with improvements alone. We need to make it worse first so that new options open up.
It's not difficult to implement, but I have nowhere to test it. Maybe I'll add this as an option.
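
As a rough illustration of that fallback (hypothetical names and thresholds, not code from this repository), the idea is: when no improving option exists, pick the candidate whose increase in sum_of_deviations is smallest and stays below some cap.

```python
# Hypothetical sketch of the "bad but not critical" fallback described above.
# `variants` maps a candidate migration to the resulting sum of deviations;
# none of these names come from the actual plb.py implementation.
def choose_variant(variants, current_deviation, max_worsening=0.05):
    """Prefer an improving migration; otherwise take the least-bad tolerable one."""
    improving = {v: d for v, d in variants.items() if d < current_deviation}
    if improving:
        return min(improving, key=improving.get)

    # No improving option (sorted_variants would be empty in the script):
    # allow a controlled worsening so that new options can open up afterwards.
    tolerable = {v: d for v, d in variants.items()
                 if d - current_deviation <= max_worsening}
    return min(tolerable, key=tolerable.get) if tolerable else None


# Example: every option makes things slightly worse, so the least-bad one wins.
options = {("vm202", "PVE1"): 0.12, ("vm305", "PVE1"): 0.09}
print(choose_variant(options, current_deviation=0.08))  # -> ('vm305', 'PVE1')
```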


@mattv8 commented on GitHub (May 17, 2022):

Ah ha! Interesting, thanks for the explanation. I'm sure this is somewhat difficult to test and implement, since you must iteratively migrate and check, and migration takes time and compute resources.

I will look more into the algorithm when I have time to see if I can contribute. For now, I need to figure out why the API isn't starting the migration when it hits the vm_migration() function. It's as if the API call isn't responding properly.


@cvk98 commented on GitHub (May 18, 2022):

`pvesh get /nodes/PVE2/qemu/202/migrate` will show the local resources that prevent migration.
`pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1` is the CLI analog of the HTTP request that the script makes.
If this command does not start the migration, then the script will not be able to do it either.
Using this link, you can view the migration options and change them in the script to suit your needs: https://pve.proxmox.com/pve-docs/api-viewer/#/nodes/{node}/qemu/{vmid}/migrate
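
For illustration, here is a minimal Python sketch of that same HTTP request (placeholder host and API token; the parameters mirror the pvesh call above and are listed in the API viewer):

```python
# Hedged sketch: the HTTP analog of
#   pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1
# HOST and TOKEN are placeholders, not values from this issue.
import requests

HOST = "https://pve1.example.com:8006"
TOKEN = "PVEAPIToken=root@pam!balancer=<secret>"

resp = requests.post(
    f"{HOST}/api2/json/nodes/PVE2/qemu/200/migrate",
    headers={"Authorization": TOKEN},
    data={"target": "PVE1", "online": 1},
    verify=False,  # only acceptable for self-signed test clusters
)
resp.raise_for_status()
print("migration task:", resp.json()["data"])  # UPID of the started migration task
```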


@cvk98 commented on GitHub (May 18, 2022):

Changes will need to be made in this block:
![image](https://user-images.githubusercontent.com/88323643/168960604-471aa217-c106-4a4d-a3ee-50a91b8690b5.png)


@cvk98 commented on GitHub (May 22, 2022):

I hope I was able to help you.


@mattv8 commented on GitHub (May 23, 2022):

Thank you, yes, very helpful! Fine to close this as it is not an issue. I'm still testing in my environment; I'll report back if I have any more issues.
