Mirror of https://github.com/hibiken/asynq.git (synced 2026-04-25 23:15:51 +03:00)
[GH-ISSUE #1046] [BUG] Scheduler Tasks Frequently Skipped When Large Number of Handlers Registered #2522
Originally created by @lingcoder on GitHub (May 7, 2025).
Original GitHub issue: https://github.com/hibiken/asynq/issues/1046
Originally assigned to: @hibiken, @kamikazechaser on GitHub.
Describe the bug
We have encountered a critical issue where scheduled tasks managed by the Scheduler are often skipped and not executed as expected. Our application registers around 25 Scheduler tasks and about 30 regular worker handlers. The problem occurs even though we are running a single asynq instance with asynq.Config.Concurrency set to 20,000.
Environment (please complete the following information):
asynq package version: v0.25.1
To Reproduce
Steps to reproduce the behavior:
Set asynq.Config.Concurrency to 20,000.
Expected behavior
All Scheduler tasks should be executed according to their schedule without being skipped.
Screenshots
If applicable, add screenshots or logs here to help explain your problem.
Additional context
We have tried adjusting the concurrency and monitoring resource usage, but the issue persists. This bug is causing critical scheduled jobs to be missed, which impacts our application's reliability.
We suspect that this issue may be related to the load on Redis or the application instance, and possibly the high number of registered scheduled tasks and task handlers. We did not experience this problem when the number of scheduled tasks was smaller, but it started to appear as the number increased recently. This is especially problematic for tasks that are scheduled to run only once per day—if they are skipped, it has a significant impact on our business.
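For context, here is a minimal sketch of the kind of setup described in this report, using asynq's documented Scheduler and Server APIs. The Redis address, task type name, and cron spec are placeholders, not taken from the reporter's actual configuration; a real deployment would register ~25 such entries and ~30 handlers.

```go
package main

import (
	"context"
	"log"

	"github.com/hibiken/asynq"
)

func main() {
	redis := asynq.RedisClientOpt{Addr: "localhost:6379"} // placeholder address

	// The Scheduler enqueues a task each time a cron spec fires; each
	// Register call is analogous to one of the ~25 scheduled tasks.
	scheduler := asynq.NewScheduler(redis, nil)
	if _, err := scheduler.Register("0 0 * * *", asynq.NewTask("report:daily", nil)); err != nil {
		log.Fatal(err)
	}

	// A Server process consumes the enqueued tasks with the configured
	// concurrency (20,000 in the report; flagged below as unusually high).
	srv := asynq.NewServer(redis, asynq.Config{Concurrency: 20000})
	mux := asynq.NewServeMux()
	mux.HandleFunc("report:daily", func(ctx context.Context, t *asynq.Task) error {
		// IO-intensive batch work would go here.
		return nil
	})

	go func() {
		if err := scheduler.Run(); err != nil {
			log.Fatal(err)
		}
	}()
	if err := srv.Run(mux); err != nil {
		log.Fatal(err)
	}
}
```

Note that the Scheduler only enqueues tasks at the scheduled moments; if the Scheduler process itself is stalled when a cron spec fires, that occurrence is not retried later, which matches the "missed entirely, no trace" symptom described below.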
@wangz1x commented on GitHub (May 8, 2025):
@lingcoder Will the number of handlers or the time each task takes affect the result?
Now I have
I haven't noticed any "skipped" tasks.
But I'm still quite concerned because the new feature will depend on the scheduler.
@kamikazechaser commented on GitHub (May 15, 2025):
20,000 is an unusually high (and arbitrary) concurrency setting for a single instance. Could you describe your setup, i.e.
Could you describe this in more detail? What state are the tasks in: https://github.com/hibiken/asynq/wiki/Life-of-a-Task#task-lifecycle
@lingcoder commented on GitHub (May 23, 2025):
@kamikazechaser
Thanks for your quick response! Here is some additional background and clarifications:
Workload: Our tasks are quite IO-intensive and involve a lot of database batch processing.
Deployment: We are running on AWS EKS, and Redis is hosted on AWS ElastiCache (Redis 7.0.x).
Monitoring: We have monitoring in place for our infrastructure, but we haven't established detailed metrics specifically for asynq yet. I will try to gather more specifics, but unfortunately our operations engineer recently left, so we are currently limited in this regard.
Skipped/Unfired Tasks: The skipped scheduled tasks do not appear in our application logs at all – it's as if those scheduled moments are missed entirely, rather than the job being delayed long enough to eventually appear. The cron expressions we use specify precise run points, not just intervals (e.g., 0 0 * * *). Sadly, when a task is skipped, it leaves no log or trace (failed or otherwise), and I haven’t yet checked the exact Redis state at the skipped times.
On Premises / Small Scale: When we ran fewer scheduled tasks earlier, everything seemed fine. The problems started after scaling up the schedules.
Issue Scope: At this stage, as a developer (not infra), I don't have deep-dive visibility into the infra/runtime/metrics, but this issue has high business impact for us.
I'm opening this issue partly to find out if anyone else has encountered similar problems when scaling up the number of scheduled tasks/handlers, particularly with a relatively high IO workload and on managed Redis. If this is a known limit or has a recommended workaround, it would help us justify prioritizing actions or design changes internally.
If you or others have insight into what to further investigate, or if there are known scheduler/accounting/concurrency limits in asynq, please point me in the right direction.
I will also try to arrange for someone to further analyze and investigate the issue.
Thanks again for your support and for an awesome project!
@kamikazechaser commented on GitHub (May 23, 2025):
20k is too high for a single instance. I would set it to min(file_descriptors, max_db_connection_pool_size, max_redis_connection_pool_size), whichever is lowest, then make changes from there. Then scale it horizontally across multiple machines while also scaling Redis and the DB.
@lingcoder commented on GitHub (May 23, 2025):
Thanks for your suggestion! I’ve already reduced it to 1k for now. If the issue persists, I’ll adjust further as you recommended. Appreciate your help!
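The sizing rule suggested above can be sketched in Go. The limit values here are placeholders for illustration; in practice they come from the process file-descriptor limit (ulimit -n), the database connection pool size, and the Redis client pool size.

```go
package main

import "fmt"

// recommendedConcurrency returns the smallest of the three resource
// limits, per the min(file_descriptors, max_db_connection_pool_size,
// max_redis_connection_pool_size) rule.
func recommendedConcurrency(fileDescriptors, dbPoolSize, redisPoolSize int) int {
	c := fileDescriptors
	if dbPoolSize < c {
		c = dbPoolSize
	}
	if redisPoolSize < c {
		c = redisPoolSize
	}
	return c
}

func main() {
	// Placeholder limits for illustration only.
	fmt.Println(recommendedConcurrency(1024, 100, 200)) // prints 100
}
```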