[GH-ISSUE #1140] Bug: Exception TypeError: a bytes-like object is required, not 'str' in readability log_archive_method_finished #713

Closed
opened 2026-03-01 14:45:44 +03:00 by kerem · 4 comments

Originally created by @jfinkhaeuser on GitHub (Apr 19, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1140

The readability extractor got an exception, but looking at the exception, it's probably best addressed in the logging utility.

Describe the bug

I'm running the docker-compose setup. I've added a few URLs with --index-only, and have a script that runs an update to fill in the archiving. On one of the URLs, the error below was raised.

It's clear that hints has a different type than the code expects. While the calling function may need to be fixed, the log utility should probably never error out like this and should instead convert hints as necessary; at least that's how I'd approach this.
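
A minimal sketch of the kind of conversion suggested above, not the fix that actually shipped. The helper name `normalize_hints` and the choice of UTF-8 with replacement characters are assumptions made for illustration; the idea is only that bytes input gets decoded before it is split, so the logger cannot raise on it.

```python
from typing import List, Sequence, Union

Hint = Union[str, bytes, bytearray]


def normalize_hints(hints: Union[Hint, Sequence[Hint], None]) -> List[str]:
    """Coerce hints into a list of text lines (hypothetical helper).

    Accepts a str, bytes-like value, list/tuple of either, or None,
    and never raises just because the input was bytes.
    """
    if hints is None:
        return []
    # Decode a single bytes-like value first so it can be split as text.
    if isinstance(hints, (bytes, bytearray)):
        hints = bytes(hints).decode("utf-8", errors="replace")
    if isinstance(hints, str):
        return hints.split("\n")
    # Otherwise treat it as a sequence and decode each item as needed.
    return [
        bytes(h).decode("utf-8", errors="replace")
        if isinstance(h, (bytes, bytearray))
        else str(h)
        for h in hints
    ]
```

With a helper like this, the logging function could call `normalize_hints(hints)` in place of the conditional `split` shown in the traceback below.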

Steps to reproduce

Simply adding a URL with the readability extractor enabled.
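
The failing expression can also be reproduced outside ArchiveBox. The sketch below copies the expression from the traceback into a stand-in function (`split_hints` is a hypothetical name) and feeds it a bytes value. That hints arrives as bytes is inferred from the error message; where the bytes come from inside the readability extractor is not shown in this report.

```python
def split_hints(hints):
    # Same expression as the line reported at logging_util.py line 435.
    return hints if isinstance(hints, (list, tuple)) else hints.split('\n')


# Assumed input: a bytes value instead of str (for example, undecoded
# output from a subprocess).
hints = b"first hint\nsecond hint"

try:
    split_hints(hints)
    message = None
except TypeError as err:
    # bytes.split() only accepts a bytes-like separator, so passing the
    # str '\n' raises the TypeError seen in the log output.
    message = str(err)

print(message)
```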

Screenshots or log output

Traceback (most recent call last):
  File "/app/archivebox/extractors/__init__.py", line 114, in archive_link
    log_archive_method_finished(result)
  File "/app/archivebox/logging_util.py", line 435, in log_archive_method_finished
    hints = hints if isinstance(hints, (list, tuple)) else hints.split('\n')
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_update.py", line 119, in main
    update(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 783, in update
    archive_links(to_archive, overwrite=overwrite, **archive_kwargs)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-3.10.108-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              


[i] Data locations:

@Mrgove10 commented on GitHub (Jun 24, 2023):

Same bug here 

[+] Adding URL: http://example.com/
Internal Server Error: /add/
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/archivebox/extractors/__init__.py", line 114, in archive_link
    log_archive_method_finished(result)
  File "/usr/local/lib/python3.10/dist-packages/archivebox/logging_util.py", line 435, in log_archive_method_finished
    hints = hints if isinstance(hints, (list, tuple)) else hints.split('\n')
TypeError: a bytes-like object is required, not 'str'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.10/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/django/contrib/auth/mixins.py", line 109, in dispatch
    return super().dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/django/views/generic/base.py", line 98, in dispatch
    return handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/django/views/generic/edit.py", line 142, in post
    return self.form_valid(form)
  File "/usr/local/lib/python3.10/dist-packages/archivebox/core/views.py", line 286, in form_valid
    add(**input_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/archivebox/main.py", line 624, in add
    archive_links(new_links, overwrite=False, **archive_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/usr/local/lib/python3.10/dist-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_readability(Link(url=http://example.com/))
"POST /add/ HTTP/1.1" 500 145

Version:

**ArchiveBox v0.6.2
Cpython Linux Linux-5.15.107-2-pve-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.10.6         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/dist-packages/django/bin/django-admin.py          
 √  CURL_BINARY           v7.81.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.2         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v12.22.9        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.0.34         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.6          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.34.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v108.0.5359.40  valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/local/lib/python3.10/dist-packages/archivebox                          
 √  TEMPLATES_DIR         3 files         valid     /usr/local/lib/python3.10/dist-packages/archivebox/templates                
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /home/archivebox/archivebox                                                 
 √  SOURCES_DIR           11 files        valid     ./sources                                                                   
 √  LOGS_DIR              2 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           5 files         valid     ./archive                                                                   
 √  CONFIG_FILE           106.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             300.0 KB        valid     ./index.sqlite3**

@pirate commented on GitHub (Jun 28, 2023):

Sorry @Mrgove10, I haven't gotten around to fixing this yet. It should be safe to ignore if you don't mind restarting the crawl after this point using `archivebox update --resume ...`

You can also temporarily disable readability for the first pass with `archivebox config --set SAVE_READABILITY=False`, then enable it again on a second pass so that it doesn't block other archiving work.


@Mrgove10 commented on GitHub (Jul 1, 2023):

Thanks for the resolution! I have temporarily disabled readability until the fix :)


@pirate commented on GitHub (Jan 19, 2024):

Should be fixed in v0.7.2. Comment back here if you're still having issues!
