[GH-ISSUE #847] Bug: Readability failure aborts archiving process with exception #527

Closed
opened 2026-03-01 14:44:20 +03:00 by kerem · 0 comments
Owner

Originally created by @herrbischoff on GitHub (Sep 14, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/847

Describe the bug

Attempting to archive https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702 aborts the entire process with an unhandled exception instead of recording the failure and continuing. This suggests that the error handling around the extractors is incomplete.
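
To make the expected behavior concrete, here is a minimal sketch of "continue with an error": each link is processed in its own try/except, so one failing extractor does not stop the rest of the run. `archive_all`, `extract`, and `flaky` are hypothetical names used for illustration only; this is not ArchiveBox's actual code.

```python
from typing import Callable, Dict, Iterable


def archive_all(urls: Iterable[str], extract: Callable[[str], None]) -> Dict[str, str]:
    """Run `extract` on every URL and collect failures instead of aborting."""
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            extract(url)
        except Exception as err:
            # Record the error for this link and move on to the next one.
            failures[url] = f"{type(err).__name__}: {err}"
    return failures


def flaky(url: str) -> None:
    """Stand-in extractor (assumption): fails for any URL containing 'bad'."""
    if "bad" in url:
        raise TypeError("a bytes-like object is required, not 'str'")


result = archive_all(["https://example.com/ok", "https://example.com/bad"], flaky)
print(result)
```

With this shape, the failing link shows up in `result` while the other link is still processed.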

Steps to reproduce

  1. Ran ArchiveBox with the following config:
[SERVER_CONFIG]
SECRET_KEY = [REDACTED]

[ARCHIVE_METHOD_OPTIONS]
RESOLUTION = 1440,4320
YOUTUBEDL_BINARY = /usr/local/bin/yt-dlp

[GENERAL_CONFIG]
TIMEOUT = 1200

and the command

archivebox add https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702
  2. Relevant output:
[√] [2021-09-14 00:54:24] "Dead white man's clothes: How fast fashion is turning parts of Ghana into toxic landfill - ABC News"
    https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702
    √ ./archive/1631453820.320194
      > readability
    ! Failed to archive link: Exception: Exception in archive_methods.save_readability(Link(url=https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702))

Traceback (most recent call last):
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 114, in archive_link
    log_archive_method_finished(result)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/logging_util.py", line 435, in log_archive_method_finished
    hints = hints if isinstance(hints, (list, tuple)) else hints.split('\n')
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/archivebox/.local/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/cli/archivebox_update.py", line 119, in main
    update(
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/main.py", line 783, in update
    archive_links(to_archive, overwrite=overwrite, **archive_kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/home/archivebox/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_readability(Link(url=https://www.abc.net.au/news/2021-08-12/fast-fashion-turning-parts-ghana-into-toxic-landfill/100358702))
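
The first traceback is consistent with `hints` being a `bytes` object when `hints.split('\n')` runs, since splitting bytes with a `str` separator raises exactly this `TypeError`. That is an inference from the traceback, not something confirmed against the ArchiveBox source. The sketch below reproduces the error in isolation and shows one defensive way to normalize such a value. `normalize_hints` is a hypothetical helper, not an ArchiveBox function.

```python
from typing import List, Sequence, Union

Hints = Union[str, bytes, Sequence[Union[str, bytes]]]


def normalize_hints(hints: Hints) -> List[str]:
    """Hypothetical helper: coerce hints into a list of str lines."""
    if isinstance(hints, bytes):
        # Decode first so that the str separator below is valid.
        hints = hints.decode("utf-8", errors="replace")
    if isinstance(hints, str):
        return hints.split("\n")
    # Lists and tuples: decode any bytes items, keep str items as they are.
    return [
        h.decode("utf-8", errors="replace") if isinstance(h, bytes) else str(h)
        for h in hints
    ]


# Reproduce the error from the traceback outside of ArchiveBox.
try:
    b"line one\nline two".split("\n")
    reproduced = ""
except TypeError as err:
    reproduced = str(err)

print(reproduced)
print(normalize_hints(b"line one\nline two"))
```

Decoding with `errors="replace"` is a design choice for log output: a hint with invalid UTF-8 is still displayed rather than triggering a second exception inside the logging path.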

ArchiveBox version

ArchiveBox v0.6.2
Cpython FreeBSD FreeBSD-13.0-RELEASE-p4-amd64-64bit-ELF amd64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/home/archivebox/.local/bin/archivebox
 √  PYTHON_BINARY         v3.8.10         valid     /usr/local/bin/python3.8
 √  DJANGO_BINARY         v3.1.13         valid     /usr/home/archivebox/.local/lib/python3.8/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.78.0         valid     /usr/local/bin/curl
 √  WGET_BINARY           v1.21           valid     /usr/local/bin/wget
 √  NODE_BINARY           v14.17.0        valid     /usr/local/bin/node
 √  SINGLEFILE_BINARY     v0.3.29         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.3          valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.32.0         valid     /usr/local/bin/git
 √  YOUTUBEDL_BINARY      v2021.06.09     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v92.0.4515.159  valid     /usr/local/bin/chrome
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/local/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/home/archivebox/.local/lib/python3.8/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /usr/home/archivebox/.local/lib/python3.8/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            9 files         valid     /var/db/archivebox
 √  SOURCES_DIR           48 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           1474 files      valid     ./archive
 √  CONFIG_FILE           861.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             13.3 MB         valid     ./index.sqlite3