[GH-ISSUE #1386] Support: singlefile & readability fail to work #845

Open
opened 2026-03-01 14:46:44 +03:00 by kerem · 9 comments
Owner

Originally created by @ghost on GitHub (Mar 25, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1386

For every snapshot I try, singlefile and readability fail. I assume readability may fail due to lack of the singlefile.html.

Error for singlefile:

SingleFile was not able to archive the page

If I run the command it tries to run in terminal I get:

bash: syntax error near unexpected token `('

The raw command I copied from the log section and ran to get the above:

/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://web.archive.org/web/20240301170542/https://www.roadandtrack.com/car-culture/a46975496/behind-f1-velvet-curtain/ singlefile.html

This error occurs for every article I try to archive.

Readability error is:

Readability was not able to archive the page (invalid JSON)

But, again, since I see a reference to singlefile.html in the command I expect solving the above will solve this.

The output of archivebox version is:

0.7.2
ArchiveBox v0.7.2 BUILD_TIME=2024-02-28 11:27:51 1709137671
IN_DOCKER=False IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.0-101-generic-x86_64-with-glibc2.35 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=False FS_USER=1000:1000 FS_PERMS=755
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.10.12        valid     /usr/bin/python3.10                                                         
 √  SQLITE_BINARY         v2.6.0          valid     /usr/lib/python3.10/sqlite3/dbapi2.py                                       
 √  DJANGO_BINARY         v3.1.14         valid     /home/ahermitforhire/.local/lib/python3.10/site-packages/django/__init__.py 
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /home/ahermitforhire/.local/bin/archivebox                                  

 √  CURL_BINARY           v7.81.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.2         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v12.22.9        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.54         valid     ./node_modules/single-file-cli/single-file                                  
 √  READABILITY_BINARY    v0.0.11         valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/parser/cli.js                                     
 √  GIT_BINARY            v2.34.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.12.30     valid     /home/ahermitforhire/.local/bin/yt-dlp                                      
 √  CHROME_BINARY         v122.0.6261.94  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ahermitforhire/.local/lib/python3.10/site-packages/archivebox         
 √  TEMPLATES_DIR         3 files         valid     /home/ahermitforhire/.local/lib/python3.10/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                                        

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                                        
 -  COOKIES_FILE          -               disabled  None                                                                        

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /mnt/media/ArchiveBox                                                       
 √  SOURCES_DIR           11 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           5 files         valid     ./archive                                                                   
 √  CONFIG_FILE           238.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             328.0 KB        valid     ./index.sqlite3  

@pirate commented on GitHub (Mar 25, 2024):

Try running this:

/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium 'https://web.archive.org/web/20240301170542/https://www.roadandtrack.com/car-culture/a46975496/behind-f1-velvet-curtain/' singlefile.html 

But also, you are archiving a URL that's already on the Internet Archive? You can try it, but we don't really support that very well. You may want to follow this issue if you do that a lot: https://github.com/ArchiveBox/ArchiveBox/issues/160


@ghost commented on GitHub (Mar 25, 2024):

If I do that in terminal I get:

Unexpected token '?'

Note: the error I described happens on ANY URL I try to add as mentioned in my initial post, not just archive.org links. For example:

/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://www.theguardian.com/us-news/2016/aug/30/us-national-parks-fire-lookout-forest-wildfire singlefile.html

Gets:

bash: syntax error near unexpected token `('

(I noticed in my initial post the code block removed the symbol before the parenthesis and I have edited to reflect that)

Also, I don't plan on using the terminal over the web interface to add new snapshots. The only reason I ran the command in terminal was to get more details of the error, so I'd like to see what can be done to solve this to enable the use of the web UI. Thanks!


@pirate commented on GitHub (Mar 25, 2024):

Can you screenshot the terminal running the command and getting this error: Unexpected token '?'

(Manually remove the user-agent args when running that copy-pasted command; the quote escaping is what's causing a bunch of the errors you're seeing, such as: syntax error near unexpected token `(')


@RuiQui commented on GitHub (Jul 25, 2024):

> Can you screenshot the terminal running the command and getting this error: Unexpected token '?'
>
> (Manually remove the user-agent args when running that copy-pasted command; the quote escaping is what's causing a bunch of the errors you're seeing, such as: syntax error near unexpected token `(')

I encountered the same issue while running in Docker. I used the command from the web logs:

/app/node_modules/single-file-cli/single-file --browser-executable-path=chromium-browser --browser-args=[\"--headless=new\", \"--no-sandbox\", \"--no-zygote\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--window-size=1440,2000\", \"--autoplay-policy=no-user-gesture-required\", \"--no-first-run\", \"--use-fake-ui-for-media-stream\", \"--use-fake-device-for-media-stream\", \"--disable-sync\", \"--user-agent=Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Mobile Safari/537.36 Edg/126.0.0.0\", \"--window-size=1440,2000\", \"--user-data-dir=/tmp/test\"] https://google.com singlefile.html

The result was:

bash: syntax error near unexpected token `('

When I modified the command to:

/app/node_modules/single-file-cli/single-file --browser-executable-path=chromium-browser --browser-args="[\"--headless=new\", \"--no-sandbox\", \"--no-zygote\", \"--disable-dev-shm-usage\", \"--disable-software-rasterizer\", \"--run-all-compositor-stages-before-draw\", \"--hide-scrollbars\", \"--window-size=1440,2000\", \"--autoplay-policy=no-user-gesture-required\", \"--no-first-run\", \"--use-fake-ui-for-media-stream\", \"--use-fake-device-for-media-stream\", \"--disable-sync\", \"--user-agent=Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Mobile Safari/537.36 Edg/126.0.0.0\", \"--window-size=1440,2000\", \"--user-data-dir=/tmp/test\"]" https://google.com singlefile.html

The program ran correctly. The difference lies in whether double quotes are used for the --browser-args parameter.
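
The effect of the quotes can be reproduced without single-file at all. The snippet below is only an illustrative sketch: `count_args` is a hypothetical helper (not part of ArchiveBox or single-file) and the user-agent string is shortened. It shows that a quoted `--browser-args` value reaches the program as one argument, while the unquoted parenthesis is rejected by bash before anything runs.

```shell
# count_args is a hypothetical helper (not part of ArchiveBox or single-file):
# it prints the number of arguments the shell passed to it.
count_args() { echo "$#"; }

# Quoted: the whole JSON list, spaces and parentheses included,
# arrives as ONE argument.
quoted=$(count_args --browser-args='["--headless=new", "--user-agent=Mozilla/5.0 (X11; Linux x86_64)"]')
echo "quoted: $quoted argument(s)"

# Unquoted: bash stops at the bare "(" before running anything.
# The command runs in a child bash so the syntax error cannot abort this script.
if ! bash -c 'echo --user-agent=Mozilla/5.0 (X11; Linux x86_64)' 2>/dev/null; then
  echo "unquoted: rejected by bash at parse time"
fi
```

Single quotes are used here so that the inner double quotes need no backslashes; the double-quoted form with `\"` shown above is equivalent.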


@pirate commented on GitHub (Aug 12, 2024):

That's a quirk of how the helper terminal commands outputted for debugging differ between docker/non-docker setups.

I believe currently those helper commands assume the user is running the command in docker, but as you encountered there is an extra layer of shell escaping that needs to be added/removed to handle quotes properly outside Docker.

I might improve it in a future release by changing the helper commands depending on whether IN_DOCKER=True/False, but to be honest it's a low priority for me for now.
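
As a rough sketch of what a copy-pasteable helper command could look like, each argument can be wrapped in single quotes before it is printed. This is not ArchiveBox's implementation: `shell_quote` is a hypothetical helper, and the URL and shortened user-agent are placeholders.

```shell
# shell_quote is a hypothetical helper, not ArchiveBox's implementation.
# It wraps one argument in single quotes and rewrites each embedded ' as '\''.
shell_quote() {
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

arg='--browser-args=["--headless=new", "--user-agent=Mozilla/5.0 (X11; Linux x86_64)"]'
quoted_arg=$(shell_quote "$arg")

# Print a command line that can be pasted into a terminal as-is.
echo "single-file $quoted_arg https://example.com singlefile.html"

# Round trip: re-parsing the quoted text yields the original argument.
eval "set -- $quoted_arg"
[ "$1" = "$arg" ] && echo "round trip ok"
```

Single-quoting every argument avoids having to decide which characters are special, at the cost of slightly noisier output.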

Given that single-file ran correctly for you when executed directly, but not when it's run within ArchiveBox, that points to the issue being either in ArchiveBox's code or in your dependency configuration.

Can you share the full output of the following commands so I can investigate further:

cd /mnt/media/ArchiveBox
archivebox config

cat package.json

which -a node
node --version

which -a single-file
/app/node_modules/single-file-cli/single-file --version

@Scrub000 commented on GitHub (Oct 14, 2024):

This doesn't work in the default Docker image either.


@pirate commented on GitHub (Oct 14, 2024):

There are many reasons why it might fail on a particular website; it doesn't work on all sites, which is why we provide many extractors. Is it failing on all websites or just a few specific ones? @Scrub000

Also, please try the dev version, archivebox/archivebox:dev, and see if that works (not on your main collection, just test it with a new empty data dir).


@darrylo commented on GitHub (Oct 27, 2024):

> If I do that in terminal I get:
>
> Unexpected token '?'
>
> Note: the error I described happens on ANY URL I try to add as mentioned in my initial post, not just archive.org links. For example:

I just encountered this error, and it was because my version of node was too old (your archivebox version output shows the same problem: NODE_BINARY v12.22.9). I was using a roughly one-year-old version of Ubuntu, but it had node version 12.something. After removing the old version and upgrading to the latest version (20.18.0), this problem went away.
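
`Unexpected token '?'` is the kind of SyntaxError an older Node.js raises on newer JavaScript syntax such as the `??` operator, which Node parses only from version 14 onward. That this is the exact syntax single-file-cli trips over is an assumption, as is the threshold below; the tool may need something newer than 14. A minimal check, with `version_major` as a hypothetical helper:

```shell
# min_major=14 is an assumption: it is the first Node.js major release that
# parses ?. and ??. single-file-cli may require a newer version than this.
min_major=14

# version_major is a hypothetical helper: "v12.22.9" -> "12".
version_major() {
  v="${1#v}"
  echo "${v%%.*}"
}

if command -v node >/dev/null 2>&1; then
  current=$(version_major "$(node --version)")
  if [ "$current" -lt "$min_major" ]; then
    echo "node $current is too old to parse ?. and ?? syntax"
  else
    echo "node $current can parse ?. and ?? syntax"
  fi
else
  echo "node not found on PATH"
fi
```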


@pirate commented on GitHub (Oct 27, 2024):

Yes, old node is a very common issue for people running on bare metal, because Debian often ships with an extremely outdated node.
