[GH-ISSUE #1] Logger shows "Need to balance: True" but nothing happens #3

Closed
opened 2026-02-26 17:46:04 +03:00 by kerem · 9 comments

Originally created by @mattv8 on GitHub (May 4, 2022).
Original GitHub issue: https://github.com/cvk98/Proxmox-load-balancer/issues/1

Is this expected behavior?

```
INFO | START ***Load-balancer!***
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update
```

I have two nodes, which are already nearly balanced, so that could be the reason. See my screenshot below:
![image](https://user-images.githubusercontent.com/9312603/166742131-536a3b32-8729-4731-81bb-ae21a9e15927.png)


@cvk98 commented on GitHub (May 4, 2022):

It depends on many factors.
It may turn out that you have one option for migration, but the VM has a CD-ROM connected or its HDD is located on the node's local storage. In that case the balancer finds one option that would improve the situation, but cannot carry it out at the stage of checking whether the migration is possible. Output in DEBUG mode can tell more about what is happening.
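
For reference, here is a minimal sketch (not taken from plb.py; the host, API token, node, and VM ID are placeholders) of how the Proxmox migrate-precondition endpoint can be queried to see which local resources would block a migration:

```python
# Hedged sketch: query the migrate-precondition endpoint for a VM.
# HOST, TOKEN, NODE and VMID are placeholders, not values from this issue.
import requests

HOST = "https://pve1.example.com:8006"            # hypothetical cluster address
TOKEN = "PVEAPIToken=root@pam!balancer=<secret>"  # hypothetical API token
NODE, VMID = "PVE2", 202

resp = requests.get(
    f"{HOST}/api2/json/nodes/{NODE}/qemu/{VMID}/migrate",
    headers={"Authorization": TOKEN},
    verify=False,  # only acceptable for self-signed test clusters
)
resp.raise_for_status()
data = resp.json()["data"]

# Field names follow the public API docs; exact contents vary by PVE version.
print("local disks:    ", data.get("local_disks"))
print("local resources:", data.get("local_resources"))
```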


@cvk98 commented on GitHub (May 5, 2022):

In the README, I added the requirement that all nodes share common storage.


@mattv8 commented on GitHub (May 16, 2022):

Sorry for the delay. I do have common storage between all nodes; in fact, they are all identical: same CPUs, RAM, and storage. However, something strange is still happening. The algorithm sees that it needs to balance and finds an option, but the migration never actually happens, and the algorithm gets stuck in an infinite loop:

```
root@PVE1:~# python3 ~/Proxmox-load-balancer/plb.py
INFO | START ***Load-balancer!***
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 1
DEBUG | Starting vm_migration
DEBUG | VM:202 migration from PVE2 to "recipient"
DEBUG | The VM:202 has [{'is_tpmstate': 0, 'replicate': 1, 'cdrom': 0, 'volid': 'shared-zfs:vm-202-disk-1', 'drivename': 'efidisk0', 'is_unused': 0, 'is_vmstate': 0, 'size': 1048576, 'referenced_in_config': 1, 'shared': 0}, {'shared': 0, 'referenced_in_config': 1, 'size': 4194304, 'is_unused': 0, 'drivename': 'tpmstate0', 'is_vmstate': 0, 'volid': 'shared-zfs:vm-202-disk-2', 'cdrom': 0, 'is_tpmstate': 1, 'replicate': 1}]
INFO | Waiting 10 seconds for cluster information update
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
```

What do you think is holding it up? This is Proxmox Virtual Environment 7.2-3 with the latest pull from this repo.


@cvk98 commented on GitHub (May 17, 2022):

In theory:

  1. The script decides that the cluster is unbalanced.
  2. It goes through all the migration options and finds one that would improve the situation.
  3. It tries to migrate the selected VM, but cannot because of local VM resources: "The VM:202 has...".
  4. It decides again that the cluster is not balanced (for some reason VM:202 is no longer selected).
  5. BUT: any remaining migration would increase sum_of_deviations, so in this case sorted_variants is empty.

Here it would be necessary to add another algorithm that chooses a bad (but not critical) option; after that, the balancer would go back to working in the normal mode (a rough sketch of the idea follows below).
![image](https://user-images.githubusercontent.com/88323643/168769464-2884c262-16b3-48e1-bef7-1c8035d2ff28.png)
Such a cluster cannot be balanced with improvements alone. We need to make it worse first so that new options open up.
It's not difficult to implement, but I have nowhere to test it. Maybe I'll add this as an option.
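
As a rough illustration of that fallback (hypothetical names and thresholds, not code from this repository), the idea is: when no improving option exists, pick the candidate whose increase in sum_of_deviations is smallest and stays below some cap.

```python
# Hypothetical sketch of the "bad but not critical" fallback described above.
# `variants` maps a candidate migration to the resulting sum of deviations;
# none of these names come from the actual plb.py implementation.
def choose_variant(variants, current_deviation, max_worsening=0.05):
    """Prefer an improving migration; otherwise take the least-bad tolerable one."""
    improving = {v: d for v, d in variants.items() if d < current_deviation}
    if improving:
        return min(improving, key=improving.get)

    # No improving option (sorted_variants would be empty in the script):
    # allow a controlled worsening so that new options can open up afterwards.
    tolerable = {v: d for v, d in variants.items()
                 if d - current_deviation <= max_worsening}
    return min(tolerable, key=tolerable.get) if tolerable else None


# Example: every option makes things slightly worse, so the least-bad one wins.
options = {("vm202", "PVE1"): 0.12, ("vm305", "PVE1"): 0.09}
print(choose_variant(options, current_deviation=0.08))  # -> ('vm305', 'PVE1')
```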


@mattv8 commented on GitHub (May 17, 2022):

Ah ha! Interesting, thanks for the explanation. I'm sure this is somewhat difficult to test and implement, since you must iteratively migrate and check, and migration takes time and compute resources.

I will look more into the algorithm when I have time to see if I can contribute. For now, I need to figure out why the API isn't starting the migration when it hits the vm_migration() function. It's as if the API call isn't responding properly.


@cvk98 commented on GitHub (May 18, 2022):

`pvesh get /nodes/PVE2/qemu/202/migrate` will show the local resources that prevent migration.
`pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1` is the CLI analog of the HTTP request that the script makes.
If this command does not start the migration, then the script will not be able to do it either.
Using this link, you can view the migration options and change them in the script to suit your needs: https://pve.proxmox.com/pve-docs/api-viewer/#/nodes/{node}/qemu/{vmid}/migrate
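
For illustration, here is a minimal Python sketch of that same HTTP request (placeholder host and API token; the parameters mirror the pvesh call above and are listed in the API viewer):

```python
# Hedged sketch: the HTTP analog of
#   pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1
# HOST and TOKEN are placeholders, not values from this issue.
import requests

HOST = "https://pve1.example.com:8006"
TOKEN = "PVEAPIToken=root@pam!balancer=<secret>"

resp = requests.post(
    f"{HOST}/api2/json/nodes/PVE2/qemu/200/migrate",
    headers={"Authorization": TOKEN},
    data={"target": "PVE1", "online": 1},
    verify=False,  # only acceptable for self-signed test clusters
)
resp.raise_for_status()
print("migration task:", resp.json()["data"])  # UPID of the started migration task
```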


@cvk98 commented on GitHub (May 18, 2022):

Changes will need to be made in this block:
![image](https://user-images.githubusercontent.com/88323643/168960604-471aa217-c106-4a4d-a3ee-50a91b8690b5.png)


@cvk98 commented on GitHub (May 22, 2022):

I hope I was able to help you.


@mattv8 commented on GitHub (May 23, 2022):

Thank you, yes, very helpful! Fine to close this as it is not an issue. I'm still testing in my environment; I'll report back if I have any more issues.
