[GH-ISSUE #1004] Bug: Cannot archive pdf page. #3649

Closed
opened 2026-03-14 23:54:00 +03:00 by kerem · 3 comments
Owner

Originally created by @TheAnachronism on GitHub (Jul 25, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1004

Describe the bug

I was trying to archive a URL that just returned a PDF file. I have only singlefile and PDF enabled, but archivebox didn't archive anything except the title, headers, and the favicon.

Steps to reproduce

  1. Run archivebox in docker with everything disabled except singlefile and pdf
  2. Archive a URL that returns a PDF, for example https://ifs.host.cs.st-andrews.ac.uk/Books/SE9/Web/ExtraChaps/Documentation.pdf
  3. See what actually gets saved in the archive directory

Screenshots or log output

Result of doing the same with the cli:

[i] [2022-07-25 21:53:38] ArchiveBox v0.6.3: archivebox add https://ifs.host.cs.st-andrews.ac.uk/Books/SE9/Web/ExtraChaps/Documentation.pdf
    > /data

[+] [2022-07-25 21:53:38] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1658786018-import.txt
    > Parsed 1 URLs from input (Generic TXT)                                                                                                          
    > Found 1 new URLs not already in index

[*] [2022-07-25 21:53:38] Writing 1 links to main index...
    √ ./index.sqlite3                                                                                                                                 

[▶] [2022-07-25 21:53:38] Starting archiving of 1 snapshots in index...

[+] [2022-07-25 21:53:38] "ifs.host.cs.st-andrews.ac.uk/Books/SE9/Web/ExtraChaps/Documentation.pdf"
    https://ifs.host.cs.st-andrews.ac.uk/Books/SE9/Web/ExtraChaps/Documentation.pdf
    > ./archive/1658786018.446854
      > title
      > favicon                                                                                                                                       
      > headers                                                                                                                                       
        4 files (241.4 KB) in 0:00:01s                                                                                                                

[√] [2022-07-25 21:53:40] Update of 1 pages complete (1.71 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.10.0-15-amd64-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.10.4         valid     /usr/local/bin/python3.10                                                   
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py          
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 -  WGET_BINARY           -               disabled  /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 -  READABILITY_BINARY    -               disabled  /node/node_modules/readability-extractor/readability-extractor              
 -  MERCURY_BINARY        -               disabled  /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v100.0.4896.127  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            6 files         valid     /data                                                                       
 √  SOURCES_DIR           994 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           1239 files      valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             5.2 MB          valid     ./index.sqlite3
kerem closed this issue 2026-03-14 23:54:06 +03:00

@TheAnachronism commented on GitHub (Jul 25, 2022):

Since ArchiveBox is seen as a self-hosted version of archive.org, and archive.org has no problems archiving a PDF, I believe ArchiveBox should be able to do that too.


@pirate commented on GitHub (Jul 26, 2022):

You need to enable the wget extractor to download static files directly from their URL. The PDF extractor is designed to convert pages that aren't PDFs into PDF archives, so it is redundant when the file is already a PDF. Downloading static files like that is what the wget extractor is for.
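
A minimal sketch of that change, assuming the `SAVE_WGET` config option is the toggle for the wget extractor in this ArchiveBox version and that the commands run inside the data directory (`/data` in the report above). In a Docker setup each command would need to be wrapped in the matching `docker run` or `docker compose run` invocation, which is not shown here.

```shell
# Enable the wget extractor (assumption: SAVE_WGET is the relevant toggle).
archivebox config --set SAVE_WGET=True

# Confirm the stored value before archiving again.
archivebox config --get SAVE_WGET

# Add a PDF URL; with wget enabled, the file itself should be downloaded.
archivebox add 'https://ifs.host.cs.st-andrews.ac.uk/Books/SE9/Web/ExtraChaps/Documentation.pdf'
```

Because the URL from the report is already in the index, it may be skipped as a duplicate when added again, so the existing snapshot may need to be updated or removed first.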


@TheAnachronism commented on GitHub (Jul 27, 2022):


I could have come up with that solution myself 🤦‍♂️
Thanks for the quick answer
