[GH-ISSUE #765] Bug: Wrong path used for Readability content when doing an index-only update #484

Open
opened 2026-03-01 14:44:03 +03:00 by kerem · 0 comments

Originally created by @berezovskyi on GitHub (Jun 7, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/765

Describe the bug

When doing a first-time index with the newly added Sonic service, the update fails with this error:

[X] An Exception ocurred reading the indexable content=[Errno 2] No such file or directory: '/data/archive/1609984440.646491//data/archive/1609984440.646491/readability/content.txt':

I suspect the duplication happens at https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/search/utils.py#L12, but I am not sure why. Maybe the output variable is wrong and should contain only a relative path, or use_pwd should be False.
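
To illustrate the suspicion, here is a minimal sketch of how a doubled path like the one in the error could arise. It assumes the snapshot directory and an already-absolute output value are joined with plain string formatting. The function names `naive_join` and `safe_join` and their parameters are hypothetical and not taken from the ArchiveBox source.

```python
from pathlib import PurePosixPath


def naive_join(pwd: str, output: str, extra_path: str = "") -> str:
    """Hypothetical: join path pieces with string formatting.

    If `output` is already absolute, the result contains `pwd` twice.
    """
    fpath = f"{pwd}/{output}"
    if extra_path:
        fpath = f"{fpath}/{extra_path}"
    return fpath


def safe_join(pwd: str, output: str, extra_path: str = "") -> str:
    """Hypothetical alternative: join with pathlib.

    pathlib discards the earlier components when a later one is absolute,
    so this works for both relative and absolute `output` values.
    """
    return str(PurePosixPath(pwd) / output / extra_path)


pwd = "/data/archive/1609984440.646491"
absolute_output = "/data/archive/1609984440.646491/readability"

# Reproduces the shape of the path from the error message.
print(naive_join(pwd, absolute_output, "content.txt"))

# Yields a single, valid path whether `output` is absolute or relative.
print(safe_join(pwd, absolute_output, "content.txt"))
print(safe_join(pwd, "readability", "content.txt"))
```

If this guess is right, either storing only a relative output path or skipping the snapshot directory prefix for absolute values would avoid the duplication.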

Steps to reproduce

  1. Use an existing install of ArchiveBox that runs using Docker deployment (without Compose).
  2. Switch to a Docker Compose setup described in https://github.com/ArchiveBox/ArchiveBox/wiki/Docker and add Sonic.
  3. Run docker-compose run archivebox update --index-only as instructed.

Screenshots or log output

[X] An Exception ocurred reading the indexable content=[Errno 2] No such file or directory: '/data/archive/1609984440.646491//data/archive/1609984440.646491/readability/content.txt':

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.8.0-53-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            14 files        valid     /data
 √  SOURCES_DIR           327 files       valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           5007 files      valid     ./archive
 √  CONFIG_FILE           676.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             77.6 MB         valid     ./index.sqlite3