[GH-ISSUE #328] Broken layout: debian.org #237

Closed
opened 2026-03-01 14:41:44 +03:00 by kerem · 4 comments

Originally created by @onlyjob on GitHub (Mar 15, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/328

Current HEAD of master is unable to correctly capture the layout of https://debian.org with Chromium 80.0.3987.132 built on Debian 10.3, running on Debian 10.3, Python 3.7.3.

kerem closed this issue 2026-03-01 14:41:44 +03:00

@pirate commented on GitHub (Mar 16, 2020):

ArchiveBox isn't expected to reproduce every page perfectly. wget output often isn't perfect, which is why ArchiveBox also saves the PDF, screenshot, and DOM dump separately. Have you looked at all the outputs, or is it just the wget one that's broken?
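
A quick way to see which of those separate outputs were actually written for one snapshot is sketched below. This is only an illustration: the function name `check_outputs` is hypothetical, and the file names `output.pdf`, `screenshot.png`, and `output.html` are assumptions that may not match what a given ArchiveBox version writes, so adjust them as needed.

```bash
#!/usr/bin/env bash
# Report which archive outputs exist (and are non-empty) in one snapshot directory.
# ASSUMPTION: the three file names below are illustrative, not confirmed by this
# thread; change them to match what your ArchiveBox version produces.
check_outputs() {
    local snapshot_dir="$1"
    local name
    for name in output.pdf screenshot.png output.html; do
        # -s is true only when the file exists and has a size greater than zero
        if [ -s "$snapshot_dir/$name" ]; then
            echo "ok       $name"
        else
            echo "missing  $name"
        fi
    done
}
```

Calling `check_outputs path/to/snapshot` prints one line per expected file, which makes it easier to say whether only the wget copy is broken or whether the other methods failed too.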

If there's an error message that's ArchiveBox related and not caused by the origin, I'm happy to reopen the ticket.
If you need perfect-fidelity archives, I recommend checking out some of the alternatives here: https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects


@onlyjob commented on GitHub (Mar 16, 2020):

I don't understand why this issue is closed. A reasonably straightforward web site was captured improperly (I don't know whether wget or something else is to blame).
This bug report is for you to check why the structure of the web site was not reproduced.
This is a legitimate problem that might have a solution. I'd appreciate it if you could investigate and try to reproduce the problem instead of just closing the bug report.


@pirate commented on GitHub (Mar 16, 2020):

I'm sorry, but as a matter of policy I don't investigate archive fidelity issues when it's not an ArchiveBox bug (i.e. it's broken output from wget or Chrome). I'd be chasing down hundreds of people's individual HTML/CSS archiving bugs if I did, and I simply don't have the time to fix wget's bugs for them. The ethos of ArchiveBox is to provide multiple archive outputs for each site in order to cover the cases where an individual method fails. We may also switch away from wget entirely in the future, to https://github.com/rockdaboot/wget2 or Wpull.

You're welcome to attempt archiving the site yourself manually with wget, and if the issue is reproducible that way you can open an issue on the wget issue tracker to get it resolved on their side.

wget --no-verbose \
              --adjust-extension \
              --convert-links \
              --force-directories \
              --backup-converted \
              --span-hosts \
              --no-parent \
              -e robots=off \
              --restrict-file-names=windows \
              --timeout=60 \
              --warc-file=archive.warc \
              --page-requisites \
              --compression=auto \
              --no-check-certificate \
              --no-hsts \
              "https://debian.org"

If it's not reproducible with wget on its own, then it's indeed an ArchiveBox or config issue, and I'd be happy to reopen the ticket, but I need more info from you to do so. Please provide the exact command you ran, along with any config options you're using and a screenshot of the broken output, so I can investigate further.
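
When comparing the manual wget result with the broken capture, one rough check is whether the stylesheets referenced by the saved HTML exist on disk, since a missing stylesheet is a common cause of a broken layout. The sketch below relies on documented `--convert-links` behaviour (links to downloaded files become relative, links to files that were not downloaded become absolute URLs). The function name `check_stylesheets` is hypothetical, and the grep/sed matching is a heuristic that assumes double-quoted `href` attributes in lowercase `<link>` tags, not a real HTML parser.

```bash
#!/usr/bin/env bash
# List the stylesheet hrefs in a saved HTML file and report, for each one,
# whether it points at a local file, a missing local file, or a remote URL.
# ASSUMPTION: each <link ...> tag is lowercase, fits on one line, and uses a
# double-quoted href; pages that break these assumptions will be under-reported.
check_stylesheets() {
    local html_file="$1"
    local base_dir href
    base_dir=$(dirname "$html_file")
    grep -o '<link[^>]*>' "$html_file" \
        | grep -i 'stylesheet' \
        | sed -n 's/.*href="\([^"]*\)".*/\1/p' \
        | while IFS= read -r href; do
            case "$href" in
                # Absolute or protocol-relative URL: not rewritten to a local copy
                http://*|https://*|//*) echo "remote   $href" ;;
                *)
                    if [ -f "$base_dir/$href" ]; then
                        echo "local    $href"
                    else
                        echo "missing  $href"
                    fi
                    ;;
            esac
        done
}
```

For example, `check_stylesheets www.debian.org/index.html` could be run against both captures (the path is an assumption and depends on redirects and on `--force-directories`). Lines marked `remote` or `missing` in one capture but `local` in the other would point at where the two runs diverge.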


@onlyjob commented on GitHub (Mar 17, 2020):

Fair enough, thanks.
I tried the wget command that you suggested, and it captured the web site layout successfully.

Also, 83197ef failed to detect the page title (`Failed: Unable to detect page title`), while the latest release, 0.2.4, captured the title correctly.

I'm just evaluating ArchiveBox, so I tried to capture https://debian.org using the command `echo https://debian.org | archivebox` without any additional config options.
