[GH-ISSUE #999] Bug: UnicodeDecodeError when archiving site #624

Closed
opened 2026-03-01 14:45:05 +03:00 by kerem · 2 comments
Owner

Originally created by @InnovativeInventor on GitHub (Jul 16, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/999

Describe the bug

Archiving a certain website (https://www.thedrive.com/the-war-zone/new-radars-are-giving-old-air-force-f-16s-capabilities-like-never-before) raises a UnicodeDecodeError, which halts the update process. Ideally, there should be a way to skip archives that error out.
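To illustrate the "skip errored archives" request, here is a minimal sketch of a batch loop that records a failure and moves on instead of aborting. It is not ArchiveBox's code: archive_all, archive_one, and fake_archive are hypothetical names invented for this example, and the failing URL is a placeholder.

```python
from typing import Callable, Iterable, List, Tuple


def archive_all(
    urls: Iterable[str],
    archive_one: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Archive every URL, recording failures instead of aborting the run."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for url in urls:
        try:
            archive_one(url)
        except Exception as err:
            # Deliberately broad: one bad link should not stop the batch.
            failed.append((url, f"{type(err).__name__}: {err}"))
            continue
        succeeded.append(url)
    return succeeded, failed


def fake_archive(url: str) -> None:
    """Hypothetical stand-in extractor that fails like the reported traceback."""
    if "bad" in url:
        # 0xd0 starts a two-byte UTF-8 sequence, but "(" is not a continuation byte.
        b"\xd0(".decode("utf-8")


ok, errors = archive_all(
    ["https://example.com/good", "https://example.com/bad"],
    fake_archive,
)
print("archived:", ok)
print("skipped:", errors)
```

Collecting the failures in a list keeps them visible at the end of the run, so skipped links can be retried later rather than silently dropped.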

Steps to reproduce

archivebox init
archivebox setup     # interactive setup
archivebox add www.thedrive.com/the-war-zone/new-radars-are-giving-old-air-force-f-16s-capabilities-like-never-before

Screenshots or log output

[+] [2022-07-16 03:26:32] "www.thedrive.com/the-war-zone/new-radars-are-giving-old-air-force-f-16s-capabilities-like-never-before"
    https://www.thedrive.com/the-war-zone/new-radars-are-giving-old-air-force-f-16s-capabilities-like-never-before
    > ./archive/1657940207.365172
      > title
      > favicon
      > headers
      > singlefile
        Extractor timed out after 60s.
        Run to see full output:
            [redacted]

      > pdf
      > screenshot
      > dom
      > wget
      > readability
      > mercury
      > media
    ! Failed to archive link: Exception: Exception in archive_methods.save_media(Link(url=https://www.thedrive.com/the-war-zone/new-radars-are-giving-old-air-force-f-16s-capabilities-like-never-before))

Traceback (most recent call last):
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/extractors/__init__.py", line 109, in archive_link
    result = method_function(link=link, out_dir=out_dir)
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/extractors/media.py", line 74, in save_media
    index_texts = [
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/extractors/media.py", line 75, in <listcomp>
    text_file.read_text(encoding='utf-8').strip()
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/pathlib.py", line 1267, in read_text
    return f.read()
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 25433: invalid continuation byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/redacted/.pyenv/versions/3.9.10/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/cli/archivebox_update.py", line 119, in main
    update(
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/main.py", line 783, in update
    archive_links(to_archive, overwrite=overwrite, **archive_kwargs)
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_media(Link(url=https://www.thedrive.com/the-war-zone/new-radars-are-giving-old-air-force-f-16s-capabilities-like-never-before))
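
The traceback shows the failure coming from a strict read_text(encoding='utf-8') call in media.py. As a hedged sketch of one way to tolerate such files (not necessarily the fix the project adopted), the helper below tries a strict decode first and falls back to replacing undecodable bytes. The name read_text_lenient and the sample file are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def read_text_lenient(path: Path) -> str:
    """Decode a file as UTF-8, substituting U+FFFD for undecodable bytes."""
    raw = path.read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Report where the bad byte is, then decode without raising.
        print(f"{path.name}: bad byte 0x{raw[err.start]:02x} at offset {err.start}")
        return raw.decode("utf-8", errors="replace")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file containing the same kind of invalid sequence.
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"caption \xd0(text")
    text = read_text_lenient(sample)

print(repr(text))
```

Replacing bad bytes loses the original character, so this trades fidelity for robustness; it suits text that only feeds a search index, less so text that must be preserved exactly.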

ArchiveBox version

ArchiveBox v0.6.2
Cpython Darwin macOS-12.2.1-arm64-arm-64bit arm64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /redacted/.pyenv/versions/3.9.10/bin/archivebox
 √  PYTHON_BINARY         v3.9.10         valid     /redacted/.pyenv/versions/3.9.10/bin/python3.9
 √  DJANGO_BINARY         v3.1.14         valid     /redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.77.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.3         valid     /opt/homebrew/bin/wget
 √  NODE_BINARY           v18.6.0         valid     /opt/homebrew/bin/node
 √  SINGLEFILE_BINARY     v1.0.11         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.4          valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.32.0         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /opt/homebrew/bin/youtube-dl
 √  CHROME_BINARY         v103.0.5060.114  valid     "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
 √  RIPGREP_BINARY        v13.0.0         valid     /opt/homebrew/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /redacted/.pyenv/versions/3.9.10/lib/python3.9/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            11 files        valid     /redacted/git/offline-commute-reading
 √  SOURCES_DIR           6 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           18 files        valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             812.0 KB        valid     ./index.sqlite3
kerem closed this issue 2026-03-01 14:45:05 +03:00

@pirate commented on GitHub (Jul 20, 2022):

Duplicate of https://github.com/ArchiveBox/ArchiveBox/issues/991


@turian commented on GitHub (Sep 12, 2022):

I believe I fixed this in https://github.com/ArchiveBox/ArchiveBox/pull/1026

TL;DR, until that's merged:

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn't work and you still get crap UnicodeDecodeErrors, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now. Or pip install from my branch: https://github.com/turian/ArchiveBox/tree/feature/kludge-984-UTF8-bug
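
The path /usr/bin/yt-dlp may not exist on every system; the original report, for instance, lists youtube-dl under /opt/homebrew/bin. As a small sketch, the snippet below looks up whichever downloader is on PATH and prints a config line to paste into ArchiveBox.conf. The helper name find_downloader is an assumption for illustration, and it only locates a binary without checking that ArchiveBox accepts it.

```python
import shutil
from typing import Optional, Sequence


def find_downloader(
    candidates: Sequence[str] = ("yt-dlp", "youtube-dl"),
) -> Optional[str]:
    """Return the absolute path of the first candidate found on PATH, if any."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


path = find_downloader()
if path:
    print(f"YOUTUBEDL_BINARY={path}")
else:
    print("Neither yt-dlp nor youtube-dl was found on PATH")
```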
