[GH-ISSUE #227] Archive Method: Chrome headless attempts to re-archive static file formats #154

Closed
opened 2026-03-01 14:41:06 +03:00 by kerem · 5 comments
Owner

Originally created by @pigmonkey on GitHub (Apr 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/227

A number of my source URLs are PDF files. Looking through my ArchiveBox logs, I see Chromium timing out when it attempts to print these to PDFs. I can recreate this issue by executing Chromium with the URL myself:

```
$ /usr/bin/chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf https://www.sailrite.com/PDF/Thread%20and%20Needle%20Recommendations.pdf
[0430/163501.155864:ERROR:viz_main_impl.cc(170)] Exiting GPU process due to errors during initialization
[0430/163601.102090:INFO:headless_shell.cc(308)] Timeout
```

Initially I thought this might have something to do with running `--print-to-pdf` on a PDF file, but the same timeout occurs with just the `--headless` switch.

```
$ /usr/bin/chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --timeout=60000 https://www.sailrite.com/PDF/Thread%20and%20Needle%20Recommendations.pdf
[0430/164115.368262:ERROR:viz_main_impl.cc(170)] Exiting GPU process due to errors during initialization
```

Without `--headless`, Chromium opens the URL fine. (That GPU error is just some Chromium cruft.)

Since the Chromium PDF generation happens after the URL is fetched with wget, I think we should just inspect the fetched file and, if it is already a PDF, not attempt to execute Chromium on it.

In fact, it's probably better to inspect the file and only execute Chromium if the file format appears in a (configurable?) whitelist. If one of the URLs is an mp3 file, we also wouldn't want to try to generate a PDF via Chromium. If one of the URLs is a text file, I personally would not want the overhead of creating a PDF of it, but maybe some people might.

It might make sense to apply the same whitelist to WARC generation. WARC generation does currently work on my PDF URLs, but I don't feel a compelling need for it.

This should be done using something like `file(1)`, rather than trying to guess based on file extension.
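The `file(1)` approach sketched above could look something like the following. This is illustrative only (the temp-file setup stands in for a wget-fetched file, and the MIME whitelist is abbreviated); it is not ArchiveBox's actual logic. `file --mime-type -b` prints just the detected type, e.g. `application/pdf`.

```shell
# Sketch: skip headless Chrome when the fetched file is not an HTML page,
# using file(1)'s content-based MIME detection instead of the URL extension.
fetched=$(mktemp)
printf '%%PDF-1.4\n' > "$fetched"     # stand-in for a wget-fetched PDF

mime=$(file --mime-type -b "$fetched")
case "$mime" in
    text/html|application/xhtml+xml)
        echo "renderable page: run headless Chrome" ;;
    *)
        echo "static file ($mime): skip Chrome" ;;
esac
rm -f "$fetched"
```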

(please fill out the following information, feel free to delete sections if they're not applicable or if long issue templates annoy you)

If relevant, I am running:

  • OS: Arch Linux
  • ArchiveBox version: 6eff6f4
  • Python version: Python 3.7.3
  • Chrome version: Chromium 74.0.3729.108 Arch Linux

@pigmonkey commented on GitHub (Apr 30, 2019):

To accomplish file inspection with pure python, as per #177, we could use https://github.com/ahupp/python-magic
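python-magic is a binding to libmagic, the same detection engine behind `file(1)`. As a zero-dependency sketch of the idea for the PDF case specifically, checking the file's magic bytes is enough; the function name below is illustrative, and python-magic would generalize this check to any file type.

```python
import tempfile

def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF magic bytes (%PDF-).

    Minimal stdlib stand-in: python-magic/libmagic would detect the
    MIME type of any file, not just PDFs.
    """
    with open(path, 'rb') as f:
        return f.read(5) == b'%PDF-'

# Demo: a PDF saved under a misleading extension is still detected.
with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as tmp:
    tmp.write(b'%PDF-1.4 ...')
print(looks_like_pdf(tmp.name))   # True, despite the .html extension
```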


@pirate commented on GitHub (May 2, 2019):

Ah this is a bug, it should ignore PDF files automatically (as well as all other staticfile formats):

```python
STATICFILE_EXTENSIONS = {
    # 99.999% of the time, URLs ending in these extensions are static files
    # that can be downloaded as-is, not html pages that need to be rendered
    # in headless chrome and re-archived as pdf/png/html
    'gif', 'jpeg', 'jpg', 'png', 'tif', 'tiff', 'wbmp', 'ico', 'jng', 'bmp',
    'svg', 'svgz', 'webp', 'ps', 'eps', 'ai',
    'mp3', 'mp4', 'm4a', 'mpeg', 'mpg', 'mkv', 'mov', 'webm', 'm4v',
    'flv', 'wmv', 'avi', 'ogg', 'ts', 'm3u8',
    'pdf', 'txt', 'rtf', 'rtfd', 'doc', 'docx', 'ppt', 'pptx', 'xls', 'xlsx',
    'atom', 'rss', 'css', 'js', 'json',
    'dmg', 'iso', 'img',
    'rar', 'war', 'hqx', 'zip', 'gz', 'bz2', '7z',

    # Less common extensions to consider adding later
    # jar, swf, bin, com, exe, dll, deb
    # ear, hqx, eot, wmlc, kml, kmz, cco, jardiff, jnlp, run, msi, msp, msm,
    # pl pm, prc pdb, rar, rpm, sea, sit, tcl tk, der, pem, crt, xpi, xspf,
    # ra, mng, asx, asf, 3gpp, 3gp, mid, midi, kar, jad, wml, htc, mml

    # These are always treated as pages, not as static files, never add them:
    # html, htm, shtml, xhtml, xml, aspx, php, cgi
}
is_static_file = lambda url: extension(url).lower() in STATICFILE_EXTENSIONS  # TODO: use mime type detection in addition to extension
```

```python
@enforce_types
def should_save_pdf(link: Link, out_dir: Optional[str]=None) -> bool:
    out_dir = out_dir or link.link_dir
    if is_static_file(link.url):
        return False

    if os.path.exists(os.path.join(out_dir, 'output.pdf')):
        return False

    return SAVE_PDF
```

Do you mind checking out [v0.4.0](https://github.com/pirate/ArchiveBox/pull/207) and seeing if the issue still happens?


@pigmonkey commented on GitHub (May 2, 2019):

I see. It is caused by the lack of a comma after `m3u8`.

https://github.com/pirate/ArchiveBox/blob/master/archivebox/util.py#L73

The bug is not present in your comment; maybe you copy/pasted from another branch.
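The failure mode here is worth spelling out: Python implicitly concatenates adjacent string literals, so a missing comma does not raise a syntax error. It silently fuses two extensions into one bogus set entry, losing both real ones. A minimal reproduction (abbreviated set, not the real one):

```python
# The comma after 'm3u8' is missing, so Python concatenates the two
# adjacent literals into a single string before building the set.
broken = {
    'flv', 'wmv', 'avi', 'ogg', 'ts', 'm3u8'   # <- missing comma here
    'pdf', 'txt',
}
print('pdf' in broken)       # False
print('m3u8' in broken)      # False
print('m3u8pdf' in broken)   # True
```

This is why neither `m3u8` nor `pdf` URLs were recognized as static files.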

I'm not sure how to check out a pull request. Is that the same as the `django` branch?


@pirate commented on GitHub (May 2, 2019):

Ah, I indeed copy-pasted from the `django` branch where I had it already fixed (yes, `django` is the same branch as that PR; you can just check it out and `pip install -e .` to test it).

I just pushed a fix to `master` to fix it immediately as well though: https://github.com/pirate/ArchiveBox/commit/500534f4be87e94f05d9cf6063babd4faa5145cc, thanks for reporting this!


@pigmonkey commented on GitHub (May 2, 2019):

Thanks. This does solve the problem for most of my bookmarked PDFs.

Depending on the extension is fragile. For instance, this URL from my bookmarks is a PDF but does not get recognized as a static file: https://scholarcommons.usf.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=6731&context=etd

I see you do have a TODO comment concerning this in `is_static_file()` on both branches. Is it worth tracking that in a separate issue?
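The fragility is easy to demonstrate with a toy re-implementation. The `extension()` helper below is a hypothetical stand-in for ArchiveBox's actual one (last dot-suffix of the URL path, ignoring the query string), and the whitelist is abbreviated; the point is that a PDF served through a CGI script is invisible to extension matching.

```python
from urllib.parse import urlparse

STATICFILE_EXTENSIONS = {'pdf', 'mp3', 'png'}   # abbreviated for the demo

def extension(url: str) -> str:
    """Hypothetical stand-in for ArchiveBox's extension() helper."""
    basename = urlparse(url).path.rsplit('/', 1)[-1]
    return basename.rsplit('.', 1)[-1] if '.' in basename else ''

is_static_file = lambda url: extension(url).lower() in STATICFILE_EXTENSIONS

print(is_static_file('https://example.com/paper.pdf'))   # True

# The viewcontent.cgi URL from above serves a PDF, but its extension is 'cgi':
url = ('https://scholarcommons.usf.edu/cgi/viewcontent.cgi'
       '?referer=https://www.google.com/&httpsredir=1&article=6731&context=etd')
print(extension(url), is_static_file(url))   # cgi False
```

Only content-based detection (libmagic / `file(1)`) would classify that URL correctly.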
