[GH-ISSUE #450] Bugfix: 4.16 - No readability output when archiving #298

Closed
opened 2026-03-01 14:42:13 +03:00 by kerem · 3 comments

Originally created by @winteriscariot on GitHub (Aug 18, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/450

Describe the bug

When I archive a link, there's no readability directory:

/archive/1597758864/readability/content.html <- this path is not created, even though the ArchiveBox web UI shows a link to it. The link then returns a 404 because the readability directory doesn't exist.

Steps to reproduce

Run ArchiveBox with the following env flags:

OUTPUT_DIR=/home/winteriscariot/.archivebox
CHROME_USER_DATA_DIR=/home/winteriscariot/.config/chromium/Default
CHROME_BINARY=/usr/bin/chromium
SUBMIT_ARCHIVE_DOT_ORG=False
FETCH_MEDIA=False
OUTPUT_PERMISSIONS=755
COOKIES_FILE=/home/winteriscariot/cookies.txt

Screenshots or log output

Software versions

  • OS: Arch Linux
  • ArchiveBox version: ArchiveBox v0.4.16
  • Python version: Python 3.8.5
  • Chrome version: Chromium 84.0.4147.105 Arch Linux
kerem closed this issue 2026-03-01 14:42:13 +03:00

@winteriscariot commented on GitHub (Aug 18, 2020):

It looks like some methods are being skipped and I'm not sure why:

[+] [2020-08-18 14:07:46] "www.marketwatch.com/news/story.asp?guid=%7B4F246FDA-E151-11EA-827E-551CAC965F72%7D"
    http://www.marketwatch.com/news/story.asp?guid=%7B4F246FDA-E151-11EA-827E-551CAC965F72%7D
    > ./archive/1597759665
      > title
      > favicon                                                                                                        
      > wget                                                                                                           
      X singlefile                                                                                                     
      > pdf
      > screenshot                                                                                                     
      > dom                                                                                                            
      X readability                                                                                                    
      X git
      > media
      > archive_org                                                                                                    
                                                                                                                       
[√] [2020-08-18 14:08:01] Update of 1 pages complete (15.70 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors
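
To make output like the above easier to compare across runs, the per-method lines can be grouped by their leading marker. This is only a sketch: the helper name `group_methods_by_marker` is hypothetical, and the code makes no claim about what ArchiveBox means by `>` or `X`; it just collects the method names found after each marker.

```python
import re

# Matches indented method lines such as "      > wget" or "      X readability".
# The archive path line ("    > ./archive/...") does not match because it
# starts with a character that \w does not accept.
METHOD_LINE = re.compile(r"^\s+([>X])\s+(\w+)\s*$")


def group_methods_by_marker(log_text):
    """Return a dict mapping each marker ('>' or 'X') to the method names listed under it."""
    groups = {">": [], "X": []}
    for line in log_text.splitlines():
        match = METHOD_LINE.match(line)
        if match:
            marker, method = match.groups()
            groups[marker].append(method)
    return groups


if __name__ == "__main__":
    sample = (
        "    > ./archive/1597759665\n"
        "      > title\n"
        "      X singlefile\n"
        "      > dom\n"
        "      X readability\n"
        "      X git\n"
    )
    print(group_methods_by_marker(sample))
```

For the run pasted above, the `X` group would hold singlefile, readability, and git.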

Is there an env option I have to set in order to get the Readability output to properly fire? Or is there a prereq that I missed that I need to install?

EDIT: I'm on Arch Linux and installed the 'readability-cli' package, which installs /usr/bin/readable. After searching through the code, I tried the following with no luck; the results are the same (i.e., no readability link):

env SAVE_READABILITY=True READABILITY_BINARY=/usr/bin/readable archivebox add '<url>'
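
One thing worth ruling out with a command like the one above is whether the configured binary actually resolves to an executable. The following is a minimal sketch using only the Python standard library. `READABILITY_BINARY` comes from the command above; the `SINGLEFILE_BINARY` variable and the default names `readability-extractor` and `single-file` are assumptions made for illustration and are not confirmed by this thread. The helper `resolve_binary` is hypothetical and not part of ArchiveBox.

```python
import os
import shutil


def resolve_binary(env_var, default_name, environ=None):
    """Return the path of the executable named by env_var (or default_name), or None if not found."""
    environ = os.environ if environ is None else environ
    candidate = environ.get(env_var) or default_name
    # shutil.which accepts either a bare name (searched on PATH) or an explicit path.
    return shutil.which(candidate, path=environ.get("PATH"))


if __name__ == "__main__":
    # Assumed variable names and default binary names; adjust to your setup.
    checks = {
        "READABILITY_BINARY": "readability-extractor",
        "SINGLEFILE_BINARY": "single-file",
    }
    for env_var, default_name in checks.items():
        found = resolve_binary(env_var, default_name)
        print(f"{env_var}: {found or 'not found'}")
```

A result of "not found" only shows that no executable was located; a found path does not by itself show that the tool is one ArchiveBox can drive.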

@pirate commented on GitHub (Aug 18, 2020):

Try the latest version of ArchiveBox, v0.4.17; it adds some help text explaining how to install SingleFile and Readability:

pip install --upgrade archivebox
npm install -g 'git+https://github.com/pirate/readability-extractor.git'
npm install -g 'git+https://github.com/gildas-lormeau/SingleFile.git'

(they also work out-of-the-box in Docker)


@winteriscariot commented on GitHub (Aug 18, 2020):

Hm, yeah, it looks like the Docker version is working without any hiccups. I can slot this into my workflow, so this should fix my issues.

Thanks for the response! Great project you have here. :)
