mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #328] Broken layout: debian.org #237
Originally created by @onlyjob on GitHub (Mar 15, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/328
Current HEAD of master is unable to correctly capture the layout of https://debian.org with Chromium 80.0.3987.132 built on Debian 10.3, running on Debian 10.3, Python 3.7.3.
@pirate commented on GitHub (Mar 16, 2020):
It's not expected to be able to perfectly reproduce every page. wget output often isn't perfect, which is why it also saves the PDF, screenshot, and DOM dump separately. Have you looked at all the outputs, or is it just the wget one that's broken?
If there's an error message that's ArchiveBox related and not caused by the origin, I'm happy to reopen the ticket.
If you need perfect fidelity archives I recommend checking out some of the alternatives here: https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#Web-Archiving-Projects
@onlyjob commented on GitHub (Mar 16, 2020):
I don't understand why this issue is closed. A reasonably straightforward web site captured improperly (I don't know whether wget is to blame or anything else).
This bug report is for you to check why structure of the web site was not reproduced.
This is a legitimate problem that might have a solution. I'd appreciate if you could investigate and try to reproduce the problem instead of just closing the bug report.
@pirate commented on GitHub (Mar 16, 2020):
I'm sorry but as a matter of policy I don't investigate archive fidelity issues when it's not an ArchiveBox bug (i.e. it's broken output from wget or Chrome). I'd be chasing down hundreds of people's individual HTML/CSS archiving bugs if I did, I simply don't have the time to fix wget's bugs for them. The ethos of archivebox is to provide multiple archive outputs for each site in order to cover the cases where an individual method fails. We may also switch away from wget entirely in the future to https://github.com/rockdaboot/wget2 or Wpull.
You're welcome to attempt archiving the site yourself manually with wget, and if the issue is reproducible that way you can open an issue on the wget issue tracker to get it resolved on their side.
If it's not reproducible with wget on its own, then it's indeed an ArchiveBox or config issue, and I'd be happy to reopen the ticket, but I need more info from you to do so. (please provide the exact command you ran along with any config options you're using, and a screenshot of the broken output so I can investigate further)
@onlyjob commented on GitHub (Mar 17, 2020):
Fair enough, thanks.
I tried the wget command that you've suggested and it captured the web site layout successfully.
Also 83197ef failed to detect page title (Failed: Unable to detect page title) while latest release 0.2.4 could capture the title correctly.
I'm just evaluating ArchiveBox so I tried to capture https://debian.org using command
echo https://debian.org | archivebox
without any additional config options.
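For anyone repeating the standalone wget check discussed in this thread, a minimal sketch of it follows. The exact wget command referred to above is not shown in the thread, so the flag set here is an assumption: these are standard GNU wget options for saving a single page with its assets, not necessarily the ones ArchiveBox passes. The helper name `build_wget_command` and the `wget-test` output directory are hypothetical.

```python
import shlex
from typing import List


def build_wget_command(url: str, output_dir: str = "wget-test") -> List[str]:
    """Build a wget argument list for saving one page plus its assets.

    Assumption: this flag set is illustrative and may differ from what
    ArchiveBox itself runs.
    """
    return [
        "wget",
        "--page-requisites",   # also fetch the CSS, images, and scripts the page needs
        "--convert-links",     # rewrite links so the saved copy works offline
        "--adjust-extension",  # add .html/.css suffixes where they are missing
        "--span-hosts",        # allow assets served from other hostnames
        "--no-parent",         # never ascend above the starting directory
        f"--directory-prefix={output_dir}",  # hypothetical output location
        url,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_wget_command("https://debian.org")))
```

If the layout is intact when the printed command is run by hand but broken in the ArchiveBox snapshot, that points at ArchiveBox or its configuration rather than at wget.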