mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1386] Support: singlefile & readability fail to work #845
Originally created by @ghost on GitHub (Mar 25, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1386
For every snapshot I try, singlefile and readability fail. I assume readability may fail due to lack of the singlefile.html.
Error for singlefile:
SingleFile was not able to archive the page
If I run the command it tries to run in terminal I get:
bash: syntax error near unexpected token `('
The raw command I copied from the log section and ran to get the above:
/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://web.archive.org/web/20240301170542/https://www.roadandtrack.com/car-culture/a46975496/behind-f1-velvet-curtain/ singlefile.html
This error occurs for every article I try to archive.
Readability error is:
Readability was not able to archive the page (invalid JSON)
But, again, since I see a reference to singlefile.html in the command, I expect solving the above will solve this.
The output of archivebox version is:
@pirate commented on GitHub (Mar 25, 2024):
Try running this:
But also you are archiving a URL that's already on the internet archive? You can try it but we don't really support that very well. You may want to follow this issue if you do that a lot: https://github.com/ArchiveBox/ArchiveBox/issues/160
@ghost commented on GitHub (Mar 25, 2024):
If I do that in terminal I get:
Unexpected token '?'
Note: the error I described happens on ANY URL I try to add, as mentioned in my initial post, not just archive.org links. For example:
/mnt/media/ArchiveBox/node_modules/single-file-cli/single-file --browser-executable-path=chromium --browser-args=[\"--headless=new\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36 ArchiveBox/0.7.2 (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"] https://www.theguardian.com/us-news/2016/aug/30/us-national-parks-fire-lookout-forest-wildfire singlefile.html
Gets:
bash: syntax error near unexpected token `('
(I noticed in my initial post the code block removed the symbol before the parenthesis and I have edited to reflect that)
Also, I don't plan on using the terminal over the web interface to add new snapshots. The only reason I ran the command in terminal was to get more details of the error, so I'd like to see what can be done to solve this to enable the use of the web UI. Thanks!
@pirate commented on GitHub (Mar 25, 2024):
Can you screenshot the terminal running the command and getting this error:
Unexpected token '?'
(Manually remove the user agent args when running that copy-pasted command, as the quote escaping is what's causing a bunch of the errors you're seeing, e.g. error near unexpected token `(')
@RuiQui commented on GitHub (Jul 25, 2024):
I encountered the same issue while running in Docker. I used the command from the web logs:
The result was:
When I modified the command to:
The program ran correctly. The difference lies in whether double quotes are used for the --browser-args parameter.
@pirate commented on GitHub (Aug 12, 2024):
That's a quirk of how the helper terminal commands outputted for debugging differ between docker/non-docker setups.
I believe currently those helper commands assume the user is running the command in docker, but as you encountered there is an extra layer of shell escaping that needs to be added/removed to handle quotes properly outside Docker.
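The quoting problem described here can be illustrated with a minimal sketch. It is not ArchiveBox's actual helper-command code; the argument list, the shortened user agent, and both function names are hypothetical. It only shows how an argument list can be rendered as a string that a POSIX shell reads back unchanged, using Python's standard shlex module.

```python
import shlex

# Hypothetical argument list, loosely modelled on the command quoted in this thread.
argv = [
    "single-file",
    "--browser-executable-path=chromium",
    '--browser-args=["--headless=new", "--user-agent=Mozilla/5.0 (X11; Linux x86_64)"]',
    "https://example.com/",
    "singlefile.html",
]


def naive_command(args):
    """Join arguments with spaces and no quoting; spaces and quotes inside an argument are not preserved."""
    return " ".join(args)


def shell_safe_command(args):
    """Quote each argument so a POSIX shell passes it through as a single, unchanged word."""
    return shlex.join(args)


if __name__ == "__main__":
    print(naive_command(argv))
    print(shell_safe_command(argv))
```

Pasting the naive form into a shell splits the --browser-args value into several words, while the quoted form round-trips to the original list.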
I might improve it in a future release by changing the helper commands depending on whether IS_DOCKER=True/False, but to be honest for now it's a low priority for me.
Given that single-file ran correctly for you when executed directly, but not when it's run within ArchiveBox, that points to the issue being either in ArchiveBox's code or in your dependency configuration.
Can you share the full output of the following commands so I can investigate further:
@Scrub000 commented on GitHub (Oct 14, 2024):
This doesn't work in the default Docker image either.
@pirate commented on GitHub (Oct 14, 2024):
There are many reasons why it might fail on a particular website; it doesn't work on all sites, which is why we provide many extractors. Are you encountering failures on all websites or just a few specific ones? @Scrub000
Also please try the dev version, archivebox/archivebox:dev, and see if that works (not on your main collection, just test it with a new empty data dir).
@darrylo commented on GitHub (Oct 27, 2024):
I just encountered this error and it was because my version of node was too old (and this is also shown in your archivebox version). I was using something like a 1-year old version of Ubuntu, but it had node version 12.something. After removing the old version and upgrading to the latest version (20.18.0), this problem went away.
@pirate commented on GitHub (Oct 27, 2024):
Yes, old node is a very common issue for people running on bare metal because Debian often ships with extremely outdated node.
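A quick way to check for an outdated node on a bare-metal install is sketched below. This is not part of ArchiveBox; the function names and the MIN_NODE_MAJOR threshold are assumptions. The threshold of 14 is picked only because the ?. and ?? operators, a plausible source of the Unexpected token '?' error, are not supported by Node.js 12. The version single-file-cli actually requires may be higher.

```python
import re
import subprocess

# Assumption: 14 is a hypothetical minimum; check the real requirement of your tools.
MIN_NODE_MAJOR = 14


def parse_node_version(text):
    """Parse output such as 'v12.22.9' into a tuple of ints, or None if unrecognized."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    return tuple(int(part) for part in match.groups()) if match else None


def is_too_old(text, min_major=MIN_NODE_MAJOR):
    """Return True when the version is missing, unparseable, or below min_major."""
    version = parse_node_version(text)
    return version is None or version[0] < min_major


def installed_node_version():
    """Return the raw output of `node --version`, or '' if node is not on PATH."""
    try:
        result = subprocess.run(
            ["node", "--version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return ""
    return result.stdout


if __name__ == "__main__":
    raw = installed_node_version()
    status = "too old or missing" if is_too_old(raw) else "ok"
    print((raw.strip() or "node not found") + " -> " + status)
```

With the versions mentioned in this thread, a 12.x release would be flagged and 20.18.0 would pass.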