mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #209] Check page status (up/down) and don't attempt archiving on inaccessible pages #1649
Originally created by @pirate on GitHub (Apr 3, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/209
On every pass through, perform a simple GET request to check whether the server is responsive, and save the response code in the archive JSON.
If the code is 400 or above, don't run the archive methods, as they would only save broken 404 pages.
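The pre-check described above could be sketched roughly like this. This is a minimal stdlib-only illustration, not ArchiveBox's actual code; `check_url_status` and `should_archive` are hypothetical helper names:

```python
import urllib.request
import urllib.error

def check_url_status(url, timeout=10):
    """Issue a simple GET and return the HTTP status code,
    or None if the server could not be reached at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # The server responded, but with an error status (4xx/5xx)
        return e.code
    except (urllib.error.URLError, TimeoutError):
        # DNS failure, connection refused, timeout, etc.
        return None

def should_archive(status):
    # Skip archiving when the server is down or returns an error page
    return status is not None and status < 400
```

The returned status code could then be stored alongside the snapshot metadata, and archive methods skipped whenever `should_archive` is false.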
Related to: https://github.com/pirate/ArchiveBox/pull/203
CC: @shakkhar
I'm just saving this issue to make sure I don't forget to add this to the next release.
@cdvv7788 commented on GitHub (Aug 12, 2020):
A HEAD request may be enough for this (fulfills the purpose while being lighter).
@cdvv7788 commented on GitHub (Aug 12, 2020):
The conclusion about this is that it is better to spend some additional time trying to archive the URL, even if it only works with a single method, than to add checks for accessibility. Those can get complicated (SSL handshakes, HTTP response codes, HTTP methods, etc.).
In short, it currently works as expected: every archive method will be run against the URL without any prior sanity checks.
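For reference, the lighter HEAD probe @cdvv7788 suggested earlier in the thread could be sketched as follows (a stdlib-only illustration; `head_status` is a hypothetical helper name, not part of ArchiveBox):

```python
import urllib.request
import urllib.error

def head_status(url, timeout=10):
    """Probe a URL with a HEAD request: yields the same status
    information as a GET, but no response body is transferred."""
    req = urllib.request.Request(url, method='HEAD')
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # Server reachable, but returned an error status (4xx/5xx)
        return e.code
    except (urllib.error.URLError, TimeoutError):
        # Unreachable: DNS failure, refused connection, timeout
        return None
```

Note the caveat raised above still applies: some servers reject or mishandle HEAD requests even when GET works, which is one of the complications that led to dropping the pre-check idea.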
@pirate commented on GitHub (Aug 12, 2020):
If there are enough complaints maybe we can make this an option in the future, but I think for now, as @cdvv7788 said, it's better to try a down URL a few unnecessary times than to potentially miss archiving it because it only responds to certain types of requests in certain situations.