[GH-ISSUE #1298] Bug: readability extractor fails in 0.7.1 docker with ERROR: illegal operation on a directory #3821

Closed
opened 2026-03-15 00:34:25 +03:00 by kerem · 6 comments

Originally created by @bramnet on GitHub (Dec 19, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1298

Describe the bug

When running the docker version of ArchiveBox, the readability extractor consistently fails. When looking at logs, it's because it is attempting to use a deprecated module and is failing.

Steps to reproduce

  1. Start docker image archivebox/archivebox:0.7
  2. Add and manually pull a URL
  3. The readability extractor consistently fails with the error listed below.

Screenshots or log output

```
archivebox-app-1  |       > readability
archivebox-app-1  |         Extractor failed:
archivebox-app-1  |              Readability was not able to archive the page
archivebox-app-1  |             (node:5898) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
archivebox-app-1  |             (Use `node --trace-deprecation ...` to show where the warning was created)
archivebox-app-1  |             node:internal/process/promises:289
archivebox-app-1  |             triggerUncaughtException(err, true /* fromPromise */);
```

ArchiveBox version

```
0.7.1+editable
ArchiveBox v0.7.1+editable Cpython Linux Linux-6.1.21-v8+-aarch64-with-glibc2.36 aarch64
DEBUG=False IN_DOCKER=True IN_QEMU=False IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=True FS_USER=911:911 FS_PERMS=644 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.6         valid     /usr/local/bin/python3.11
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.1          valid     /usr/local/bin/archivebox

 √  CURL_BINARY           v8.4.0          valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget
 √  NODE_BINARY           v21.1.0         valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v1.1.18         valid     /app/node_modules/single-file-cli/single-file
 √  READABILITY_BINARY    v0.0.9          valid     /app/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2023.10.13     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v119.0.6045.9   valid     /usr/bin/chromium-browser
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None
 -  COOKIES_FILE          -               disabled  None

[i] Data locations:
 √  OUTPUT_DIR            7 files @       valid     /data
 √  SOURCES_DIR           332 files       valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           1186 files      valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             19.6 MB         valid     ./index.sqlite3
```

@pirate commented on GitHub (Dec 19, 2023):

The punycode thing is just a warning, not an error. If it's failing there's likely some other error later on causing it.

Can you try running the readability command it shows and post the full output?

I also tried upgrading readability in the `:dev` docker image; you can pull the latest version and give that a try as well: https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch
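
To illustrate the distinction above between the deprecation warning and the real failure, here is a minimal Python sketch that separates Node.js deprecation-warning lines from everything else in captured stderr. The helper name `split_node_stderr` and the matching rules are assumptions made for illustration; this is not part of ArchiveBox or readability-extractor.

```python
def split_node_stderr(stderr: str) -> tuple[list[str], list[str]]:
    """Split captured Node.js stderr into (warning_lines, other_lines).

    Hypothetical helper: it only recognises the two line shapes that
    appear in this thread's log output.
    """
    warnings, others = [], []
    for line in stderr.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        # Deprecations look like "(node:PID) [DEPxxxx] DeprecationWarning: ..."
        # and are followed by a "(Use `node --trace-deprecation ...`" hint.
        if "DeprecationWarning:" in stripped or stripped.startswith("(Use `node --trace-"):
            warnings.append(stripped)
        else:
            others.append(stripped)
    return warnings, others


# Sample text modelled on the log output quoted in this issue.
SAMPLE = """\
(node:15469) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
node:internal/process/promises:289
[Error: EISDIR: illegal operation on a directory, read] {
"""

if __name__ == "__main__":
    warning_lines, other_lines = split_node_stderr(SAMPLE)
    print("warnings:", warning_lines)
    print("possible errors:", other_lines)
```

Filtering this way leaves only the lines worth investigating, such as the `EISDIR` line, once the harmless `punycode` notice is set aside.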


@bramnet commented on GitHub (Dec 19, 2023):

Running the readability command directly returns the following:

```
(node:15469) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[Error: EISDIR: illegal operation on a directory, read] {
  errno: -21,
  code: 'EISDIR',
  syscall: 'read'
}
```

I don't have the ability to build and run the dev image at this very moment (busy IRL), I'll try this later and report my findings.
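
For readers unfamiliar with `EISDIR`: it is the POSIX error raised when a program tries to read a directory as if it were a regular file. The sketch below reproduces the same condition in Python on a POSIX system. The function name `read_as_file` is hypothetical, and the example only demonstrates the error class; it does not show what readability-extractor does internally.

```python
import errno
import tempfile


def read_as_file(path: str) -> str:
    """Try to read `path` as a regular file and report what happened."""
    try:
        with open(path, "rb") as handle:
            handle.read()
        return "ok"
    except IsADirectoryError as exc:
        # Python maps errno EISDIR to IsADirectoryError on POSIX systems.
        # Node.js reports the same condition with code 'EISDIR'.
        return errno.errorcode[exc.errno]


def demo() -> str:
    # A temporary directory stands in for the unexpected directory.
    with tempfile.TemporaryDirectory() as tmp:
        return read_as_file(tmp)


if __name__ == "__main__":
    print(demo(), errno.EISDIR)
```

On Linux `errno.EISDIR` is 21, which matches the `errno: -21` in the output above (Node reports the value negated).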


@pirate commented on GitHub (Dec 19, 2023):

Looks like it tried to run on a directory? It's probably being called incorrectly, or there is an unexpected directory taking the place of the expected HTML files in the snapshot folder. Can you post a screenshot of the `./archive/<timestamp>` folder's contents for the failing snapshot?

No need to build the image, just pull the published `archivebox/archivebox:dev` and run it.
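
The check suggested above (is a directory sitting where an HTML file is expected?) can also be scripted. This is a rough sketch under stated assumptions: `classify_entries` is a hypothetical helper, and the file names in `demo()` are illustrative only, since this thread does not say which files the readability extractor actually reads.

```python
import tempfile
from pathlib import Path


def classify_entries(snapshot_dir, expected_files):
    """Map each expected file name to 'file', 'directory', or 'missing'."""
    root = Path(snapshot_dir)
    report = {}
    for name in expected_files:
        entry = root / name
        if entry.is_dir():
            report[name] = "directory"  # would trigger EISDIR if read as a file
        elif entry.is_file():
            report[name] = "file"
        else:
            report[name] = "missing"
    return report


def demo():
    # Build a throwaway snapshot-like folder; the names are assumptions.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "index.html").write_text("<html></html>")
        (root / "output.html").mkdir()  # simulate a directory in a file's place
        return classify_entries(root, ["index.html", "output.html", "singlefile.html"])


if __name__ == "__main__":
    for name, kind in demo().items():
        print(f"{name}: {kind}")
```

Pointing `classify_entries` at a real `./archive/<timestamp>` folder with the file names you expect would flag any entry that is a directory instead of a file.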


@bramnet commented on GitHub (Dec 19, 2023):

I haven't had the chance to pull the dev image and try running it yet, but the directory contents are below:

```
favicon.ico   htmltotext.txt  index.json  mercury      warc
headers.json  index.html      media       output.html
```

@bramnet commented on GitHub (Dec 20, 2023):

I ran dev, readability extracted successfully without any issues as far as I can tell.


@pirate commented on GitHub (Dec 20, 2023):

Great, going to close this for now as fixed in 0.7.2 (dev) https://github.com/ArchiveBox/ArchiveBox/pull/1297 then. Let me know if you have any further issues.

Note: readability seems to have gotten slower in their latest release, so you may need to increase the default timeout a bit until they speed it up: `archivebox config --set TIMEOUT=120` or higher (up from 60 seconds by default).
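
As a generic illustration of why a slower tool can start failing under a fixed timeout, the sketch below runs a child process with a time limit using Python's standard `subprocess` module. It is not ArchiveBox's actual implementation; `run_with_timeout` is a hypothetical helper, and the child commands are stand-ins for a slow extractor.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run `cmd`; return its exit code, or None if it exceeded the timeout."""
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed once this elapses
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.returncode


if __name__ == "__main__":
    fast = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(fast, 10))   # finishes well within the limit
    print(run_with_timeout(slow, 0.5))  # exceeds the limit
```

A command that finishes in time returns its exit code, while one that runs past the limit is reported as timed out; raising the limit is the workaround described in the note above.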
