[GH-ISSUE #209] Check page status (up/down) and don't attempt archiving on inaccessible pages #3161

Closed
opened 2026-03-14 21:22:03 +03:00 by kerem · 3 comments

Originally created by @pirate on GitHub (Apr 3, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/209

On every pass through, perform a simple GET request to check whether the server is responsive, and save the response code in the archive JSON.

If the code is 400 or higher, don't attempt to run the archive methods, as they will only save broken error pages (e.g. a 404).
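
A minimal sketch of what such a check could look like, using only the Python standard library. The names `fetch_status`, `should_archive`, and `record_status`, and the `status_code` key, are hypothetical and do not come from the ArchiveBox codebase. It treats any status of 400 or above as an error, which is an assumption about the intended threshold.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Send a GET request and return the HTTP status code, or None if no response arrived."""
    request = urllib.request.Request(url, method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status code is still available.
        return err.code
    except OSError:
        # URLError, timeouts, and refused connections are all OSError subclasses.
        return None


def should_archive(status: Optional[int]) -> bool:
    """Return True if the archive methods should run for this status code."""
    if status is None:
        return False  # the server did not respond at all
    return status < 400


def record_status(link: dict, status: Optional[int]) -> dict:
    """Return a copy of a link entry with the status stored under a hypothetical key."""
    updated = dict(link)
    updated["status_code"] = status
    return updated
```

A caller would run `fetch_status(url)` once per pass, store the result with `record_status`, and skip the archive methods whenever `should_archive` returns `False`.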

Related to: https://github.com/pirate/ArchiveBox/pull/203
CC: @shakkhar

I'm just saving this issue to make sure I don't forget to add this to the next release.


@cdvv7788 commented on GitHub (Aug 12, 2020):

A `HEAD` request may be enough for this (it fulfills the purpose while being lighter).
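
As a rough illustration of the lighter variant, the sketch below probes with `HEAD` first and retries with `GET` only when the server signals that it does not support `HEAD` (405 or 501). The function names `probe` and `head_then_get` are hypothetical, and the fallback rule is an assumption rather than anything decided in this issue.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def probe(url: str, method: str, timeout: float = 10.0) -> Optional[int]:
    """Send one request with the given method and return the status code, or None."""
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except OSError:
        return None  # no response: DNS failure, timeout, refused connection


def head_then_get(
    url: str,
    probe_fn: Callable[[str, str], Optional[int]] = probe,
) -> Optional[int]:
    """Try HEAD first; fall back to GET if the server rejects the HEAD method."""
    status = probe_fn(url, "HEAD")
    if status in (405, 501):
        # 405 Method Not Allowed / 501 Not Implemented: HEAD is unsupported here.
        status = probe_fn(url, "GET")
    return status
```

Passing `probe_fn` as a parameter keeps the fallback logic testable without network access.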


@cdvv7788 commented on GitHub (Aug 12, 2020):

The conclusion here is that it is better to spend some additional time trying to archive the URL, even if only a single method succeeds, than to add accessibility checks. Those checks can get complicated (SSL handshakes, HTTP response codes, HTTP methods, etc.).
In short, it currently works as expected: every archive method is run against the URL without any prior sanity checks.


@pirate commented on GitHub (Aug 12, 2020):

If there are enough complaints, maybe we can make this an option in the future, but for now I think, as @cdvv7788 said, it's better to try a down URL a few unnecessary times than to potentially miss archiving it because it only responds to certain types of requests in certain situations.
