[GH-ISSUE #985] UX Wart: 504 error when long-running request times out in web UI #2122

Closed
opened 2026-03-01 17:56:38 +03:00 by kerem · 8 comments
Owner

Originally created by @kylrth on GitHub (May 27, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/985

I reverse proxy my public archive through NGINX. If I request a long-running action from the web UI (like if I click "pull" when I've selected 50 archives), the browser waits for a response from the server, which will only come when the entire job is completed. Since that's going to take a while, the request times out and NGINX serves a "504 gateway time-out" page.

The server should probably respond when the request is received, not when it's completed.

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.4.0-113-generic-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.10.4         valid     /usr/local/bin/python3.10                                                   
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py          
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2022.04.08     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v100.0.4896.127  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data                                                                       
 √  SOURCES_DIR           4 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           35 files        valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             672.0 KB        valid     ./index.sqlite3

@mAAdhaTTah commented on GitHub (May 29, 2022):

I ran into this too, and I ended up configuring my reverse proxy not to time out. The page can't load until the process is completed (it's not a background task), so I'm not sure your solution would work.
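For reference, extending the proxy timeout in NGINX looks roughly like this (the upstream name and 300s value are examples; adjust to your deployment):

```nginx
location / {
    proxy_pass http://archivebox:8000;  # example upstream
    # Give long-running admin actions time to finish before NGINX gives up:
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
```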


@kylrth commented on GitHub (May 29, 2022):

Ok, thanks for the recommendation. I'll probably do that for now. What I'm suggesting is to *make* it a background task, because that seems more appropriate. It's fine if not, though.


@mAAdhaTTah commented on GitHub (May 29, 2022):

Feasibly, you could do this via the CLI with some combo of `schedule` and/or a cronjob. Although I can't seem to find it now, there are designs for improving the background-job capabilities of ArchiveBox, though I wouldn't expect that to land anytime soon, given the slow pace of development on the project right now.
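As a sketch, a cron entry that periodically re-runs the archiver outside the request cycle could look like this (paths and schedule are placeholders; `archivebox update` re-pulls existing snapshots, and `archivebox schedule --help` shows the built-in scheduler):

```crontab
# Example: re-pull existing snapshots nightly at 03:00 (paths are placeholders)
0 3 * * * cd /path/to/archivebox/data && archivebox update >> ./logs/cron.log 2>&1
```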


@pirate commented on GitHub (Jun 9, 2022):

The archive task actually continues just fine even if the user navigates away after the 504, so I haven't prioritized fixing this, but I've been aware of it for a while. It was convenient to run the archive task from the main request thread without forking: if it finishes in time for the response, the UX is normal, and if it 504s and the user refreshes, it also works. It's just a UX wart that long-running tasks show the error.

One easy way to solve this is to use Django's little-known post-request pattern, where you subclass and override `HttpResponse.close()` to run a function after the response is returned (that way we don't have to add a whole async task runner or scheduling system like Celery/dramatiq): https://gist.github.com/pirate/c4deb41c16793c05950a6721a820cde9
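A minimal sketch of that pattern (using a plain stand-in base class here so it runs standalone; in Django the subclass would extend `django.http.HttpResponse` the same way):

```python
class Response:
    """Stand-in for django.http.HttpResponse: Django calls close() on the
    response object after the body has been sent to the client."""
    def close(self):
        pass  # Django flushes and ends the connection here


class ResponseWithCallback(Response):
    """Return the response immediately, then run deferred work afterwards."""
    def __init__(self, callback=None):
        super().__init__()
        self.callback = callback

    def close(self):
        super().close()      # the client already has the response at this point
        if self.callback:
            self.callback()  # long-running archive work happens post-response


# Usage sketch: the view returns instantly; the job runs after the response.
events = []
resp = ResponseWithCallback(callback=lambda: events.append("archived"))
resp.close()
```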

Another way is to use `StreamingHttpResponse` to return 90% of the response HTML immediately, with a last chunk on completion that runs some JS to trigger a page refresh: https://gist.github.com/pirate/79f84dfee81ba0a38b6113541e827fd5
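Roughly, the streamed body would be a generator like this (illustrative names; in Django it would be wrapped in a `StreamingHttpResponse`):

```python
def archive_progress_stream(run_job):
    """Body generator for a streaming response: yield most of the page
    immediately, then the final chunk once the long-running job completes,
    telling the browser to refresh and show the finished result."""
    yield "<html><body><p>Archiving in progress...</p>"  # sent right away
    run_job()  # the long-running archive task runs between chunks
    yield "<script>location.reload()</script></body></html>"


# Usage sketch with a no-op job standing in for the archive task:
chunks = list(archive_progress_stream(lambda: None))
```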


@pirate commented on GitHub (Jan 19, 2024):

This was improved on the `/add/` page in v0.7.2; the UI now auto-redirects back to the Snapshots page. We should still implement improvements for the other long-running admin actions, though...


@Routhinator commented on GitHub (Oct 31, 2025):

I'll note that I landed on this because the Reset option does time out, and after several tests I can say that the page is not reset and updated with the latest snapshot unless the connection *does not time out*. I have a snapshot with an IP block on the singlepage output, and when I reset it, the updated copy that was gathered is discarded after the 504 and the old copy remains.

If I use Re-snapshot, things work, but then I need to go delete the old snapshot to keep just one copy.


@Routhinator commented on GitHub (Oct 31, 2025):

Actually, this may be something else. I finally got proxy timeouts set long enough for Reset to complete, and it still never updated the Chrome singlepage output. Running Re-snapshot works. Not sure why it never updates with Reset, even though the logs show it re-snapshotting the URL.


@pirate commented on GitHub (Dec 29, 2025):

This is fixed on `dev`.
