Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-25 17:16:00 +03:00)
[GH-ISSUE #227] Archive Method: Chrome headless attempts to re-archive static file formats #154
Originally created by @pigmonkey on GitHub (Apr 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/227
A number of my source URLs are PDF files. Looking through my ArchiveBox logs, I see Chromium timing out when it attempts to print these to PDFs. I can recreate this issue by executing Chromium with the URL myself.
Initially I thought this might have something to do with running `--print-to-pdf` on a PDF file, but the same timeout occurs with just the `--headless` switch. Without `--headless`, Chromium opens the URL fine. (That GPU error is just some Chromium cruft.)

Since the Chromium PDF generation happens after the URL is fetched with wget, I think we should just inspect the fetched file and, if it is already a PDF, not attempt to execute Chromium on it.
In fact, it's probably better to inspect the file and only execute Chromium if the file format appears in a (configurable?) whitelist. If one of the URLs is an mp3 file, we also wouldn't want to try to generate a PDF via Chromium. If one of the URLs is a text file, I personally would not want the overhead of creating a PDF of it, but maybe some people might.
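The check being proposed could look something like the following sketch. This is not ArchiveBox's actual code; the signature table, the `PRINTABLE_TYPES` whitelist, and the function names are all illustrative, and a real implementation would likely use `file(1)` or a magic library with a far larger signature database.

```python
# Sketch (not ArchiveBox's real code): sniff the fetched file's leading
# bytes and only hand it to Chromium's --print-to-pdf if the content type
# is one we actually want rendered. Signatures here are illustrative.
MAGIC_SIGNATURES = {
    b'%PDF-': 'application/pdf',
    b'\xff\xd8\xff': 'image/jpeg',
    b'\x89PNG\r\n\x1a\n': 'image/png',
    b'ID3': 'audio/mpeg',
}

# Only these sniffed types get printed to PDF; a real implementation
# might make this whitelist configurable, as suggested above.
PRINTABLE_TYPES = {'text/html', None}  # None = unrecognized, assume a page

def sniff_type(path):
    """Return a MIME type guessed from the file's leading bytes, or None."""
    with open(path, 'rb') as f:
        head = f.read(16)
    for sig, mime in MAGIC_SIGNATURES.items():
        if head.startswith(sig):
            return mime
    return None

def should_print_to_pdf(path):
    """Skip Chromium for files already sniffed as static formats."""
    return sniff_type(path) in PRINTABLE_TYPES
```

Under this scheme a fetched file beginning with `%PDF-` would be sniffed as `application/pdf` and Chromium would never be invoked on it, while an unrecognized (presumably HTML) file would still be printed.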
It might make sense to apply the same whitelist to WARC generation. WARC generation does currently work on my PDF URLs, but I don't feel a compelling need for it.
This should be done using something like `file(1)`, rather than trying to guess based on the file extension.

If relevant, I am running:
`6eff6f4`

@pigmonkey commented on GitHub (Apr 30, 2019):
To accomplish file inspection with pure python, as per #177, we could use https://github.com/ahupp/python-magic
@pirate commented on GitHub (May 2, 2019):
Ah, this is a bug; it should ignore PDF files automatically (as well as all other static file formats).

Do you mind checking out v0.4.0 and seeing if the issue still happens?
@pigmonkey commented on GitHub (May 2, 2019):
I see. It is caused by the lack of a comma after m3u8.
https://github.com/pirate/ArchiveBox/blob/master/archivebox/util.py#L73
The bug is not present in your comment. Maybe you copy/pasted from another branch.
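The class of bug being described is Python's implicit concatenation of adjacent string literals: with the comma missing, two list entries silently fuse into one. A simplified illustration (not the actual contents of `util.py`):

```python
# Simplified illustration of the missing-comma bug: Python concatenates
# adjacent string literals, so the two extensions silently merge into a
# single bogus tuple entry and neither one matches on its own.
STATIC_EXTENSIONS = (
    'mp4',
    'm3u8'   # <-- missing comma after 'm3u8'
    'pdf',
)

assert STATIC_EXTENSIONS == ('mp4', 'm3u8pdf')  # 'm3u8' and 'pdf' merged
assert 'pdf' not in STATIC_EXTENSIONS           # PDFs no longer whitelisted
```

No syntax error is raised, which is why the bug is easy to miss in review.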
I'm not sure how to check out a pull request. Is that the same as the `django` branch?

@pirate commented on GitHub (May 2, 2019):
Ah, I indeed copy-pasted from the `django` branch, where I had it already fixed (yes, `django` is the same branch as that PR; you can just check it out and `pip install -e .` to test it). I just pushed a fix to `master` so it's fixed there immediately as well: github.com/pirate/ArchiveBox@500534f4be. Thanks for reporting this!

@pigmonkey commented on GitHub (May 2, 2019):
Thanks. This does solve the problem for most of my bookmarked PDFs.
Depending on the extension is fragile. For instance, this URL from my bookmarks is a PDF but does not get recognized as a static file: https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=6731&context=etd
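The failure mode can be shown with a toy extension-based check. The function below is hypothetical, not ArchiveBox's real `is_static_file()`; it just demonstrates why any check that inspects only the URL path misses a PDF served through a CGI script.

```python
from urllib.parse import urlparse
from os.path import splitext

# Hypothetical extension-based check (not the real is_static_file()):
# it looks only at the URL path's extension, so a PDF served via a CGI
# endpoint with no '.pdf' in the path is not recognized as static.
STATIC_EXTENSIONS = ('.pdf', '.mp3', '.mp4', '.png', '.jpg')

def looks_static_by_extension(url):
    path = urlparse(url).path
    return splitext(path)[1].lower() in STATIC_EXTENSIONS

print(looks_static_by_extension('https://example.com/paper.pdf'))  # True
print(looks_static_by_extension(
    'https://scholarcommons.usf.edu/cgi/viewcontent.cgi'
    '?referer=https://www.google.com/&httpsredir=1&article=6731&context=etd'
))  # False, even though the response body is a PDF
```

The second URL's path ends in `.cgi`, so no extension-based heuristic can classify it correctly; only content inspection of the fetched bytes can.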
I see you do have a TODO comment concerning this in `is_static_file()` on both branches. Is it worth tracking that in a separate issue?