[GH-ISSUE #1055] Bug: SingleFile was not able to archive the page #658

Closed
opened 2026-03-01 14:45:21 +03:00 by kerem · 2 comments

Originally created by @nickali on GitHub (Nov 28, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1055

Describe the bug

Using the latest stable docker image for arm64, no matter what site I try to archive, SingleFile always gives me the same error: SingleFile was not able to archive the page. All the other options seem to work fine.

Steps to reproduce

  1. Add+
  2. Choose Auto-detect parser, depth=0, SingleFile for Archive methods
  3. Submit.

Screenshots or log output

Log after submission

Log

[+] [2022-11-28 06:38:05] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1669617485-import.txt
    > Parsed 1 URLs from input (Generic TXT)
    > Found 1 new URLs not already in index
[*] [2022-11-28 06:38:05] Writing 1 links to main index...
    √ ./index.sqlite3
[▶] [2022-11-28 06:38:05] Starting archiving of 1 snapshots in index...
[+] [2022-11-28 06:38:05] "www.betanews.com"
    https://www.betanews.com
    > ./archive/1669617485.841526
      > singlefile
        Extractor failed:
            SingleFile was not able to archive the page
            Got single-file response code: 0.
            TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
        Run to see full output:
            cd /data/archive/1669617485.841526;
            /node/node_modules/single-file/cli/single-file --browser-executable-path=chromium "--browser-args=[\"--headless\", \"--no-sandbox\", \"--disable-gpu\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"]" https://www.betanews.com singlefile.html
        2 files (238.3 KB) in 0:00:00s
[√] [2022-11-28 06:38:06] Update of 1 pages complete (0.23 sec)
    - 0 links skipped
    - 1 links updated
    - 1 links had errors
    Hint: To manage your archive in a Web UI, run:
        archivebox server 0.0.0.0:8000

This is probably easier to read:

Screen Shot 2022-11-28 at 1 38 28 AM

In the logs, in the CMD STR, I see this:

CMD STR

Log

/node/node_modules/single-file/cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless\", \"--no-sandbox\", \"--disable-gpu\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://www.betanews.com singlefile.html

Again, a screenshot in case it is easier to read:

Screen Shot 2022-11-28 at 1 31 18 AM

If I pull up a shell in the docker instance and run the above command, I get:

bash: syntax error near unexpected token `('

If I remove the user-agent from that command, I get 'Unexpected end of JSON.'
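One possible reading of these two errors, offered as a sketch rather than a confirmed diagnosis: the CMD STR has no outer quotes around the `--browser-args=[...]` value, so bash interprets it when it is pasted into a shell. The unquoted `(` in the user agent is a syntax error. Without the user agent, the shell splits the value on spaces, so `--browser-args` receives only a fragment such as `["--headless",`, which would plausibly fail JSON parsing. The snippet below uses a shortened argument list and a hypothetical `count_args` helper (not part of ArchiveBox or single-file) to show the difference quoting makes.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical helper for illustration only: it prints how many
# arguments the shell passed, then a "|", then the first argument.
count_args() {
  printf '%s|%s\n' "$#" "$1"
}

# Unquoted, as pasted from the CMD STR (shortened): the shell splits on spaces,
# so the JSON array is broken into separate arguments.
unquoted=$(count_args --browser-args=[\"--headless\", \"--no-sandbox\"])

# Single-quoted: the whole JSON array reaches the program as one argument.
quoted=$(count_args '--browser-args=["--headless", "--no-sandbox"]')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

If this reading is right, wrapping the whole `--browser-args=[...]` value in quotes (as the "Run to see full output" line in the log does) should let the command be run by hand inside the container.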

I tried setting the CHROME_USER_AGENT to

- CHROME_USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36"

under archivebox -> environment in docker-compose.yml, but I get the same 'Unexpected end of JSON' error.

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.15.0-1022-oracle-aarch64-with-glibc2.28 aarch64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data
 √  SOURCES_DIR           6 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           1 files         valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3

docker-compose was left untouched (except for testing out the user agent change once).

Running Ubuntu 22.04.1
Docker version: Docker version 20.10.21, build baeda1f

kerem closed this issue 2026-03-01 14:45:21 +03:00

@pirate commented on GitHub (Nov 29, 2022):

Can you try running the most basic case outside of docker to confirm singlefile works on that URL:

npm install single-file

single-file https://www.betanews.com singlefile.html

# open singlefile.html

Also check whether the exit code 0 means it actually succeeded (and the error message is shown anyway) by checking for ./data/archive/<timestamp>/singlefile.html in one of the failed snapshot output folders. If singlefile.html is present and valid, then it's just a bug in the error output / output parsing and not actually in singlefile.
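A check along these lines could be scripted roughly as follows. This is only a sketch: `check_singlefile` is a hypothetical helper name, it assumes the snapshot folders sit directly under `./data/archive`, and it approximates "valid" as "exists and is non-empty".

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each snapshot folder under the given archive
# directory, report whether singlefile.html exists and is non-empty.
check_singlefile() {
  local archive_dir="$1" snap
  for snap in "$archive_dir"/*/; do
    # Skip the unexpanded glob when the directory has no subfolders.
    [ -d "$snap" ] || continue
    if [ -s "${snap}singlefile.html" ]; then
      echo "present: ${snap}singlefile.html"
    else
      echo "missing: ${snap}singlefile.html"
    fi
  done
}

# Default to the path mentioned above; pass another directory as the first argument.
check_singlefile "${1:-./data/archive}"
```

A "present" line for a snapshot that was reported as failed would point at the error reporting rather than at singlefile itself; opening the file is still needed to confirm that its contents are sensible.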


@nickali commented on GitHub (Nov 30, 2022):

The /data/archive directory has the json file and the index.html with the archive types, but no HTML file with contents of the website.

I should have been clearer: betanews is not the only site causing issues; every site I tried fails the same way.

I installed single-file outside of docker using npm and tried the following:

single-file https://betanews.com --back-end=puppeteer --browser-executable-path $(which chromium) betanews.html

The betanews.html contained the html as expected.
