[GH-ISSUE #1014] Bug: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 264619: invalid start byte #3657

Closed
opened 2026-03-14 23:56:21 +03:00 by kerem · 2 comments

Originally created by @turian on GitHub (Aug 12, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1014

Describe the bug

While archiving many URLs, ArchiveBox failed on one with the following error:


[+] [2022-08-12 18:04:45] "www.ashra.com/news.php?m=A"
    http://www.ashra.com/news.php?m=A
    > ./archive/1660315483.000912
      > favicon
      > headers
      > singlefile
      > pdf
        Extractor timed out after 60s.
        Run to see full output:
            cd /data/archive/1660315483.000912;
            /usr/bin/chromium --headless --no-sandbox --disable-gpu --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --single-process --no-zygote "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/605.1.15 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)" --window-size=1440,2000 --timeout=60000 --print-to-pdf http://www.ashra.com/news.php?m=A

      > screenshot
      > dom
      > wget
      > title
      > readability
      > mercury
      > media
    ! Failed to archive link: Exception: Exception in archive_methods.save_media(Link(url=http://www.ashra.com/news.php?m=A))

Traceback (most recent call last):
  File "/app/archivebox/extractors/__init__.py", line 109, in archive_link
    result = method_function(link=link, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/media.py", line 75, in save_media
    index_texts = [
  File "/app/archivebox/extractors/media.py", line 76, in <listcomp>
    text_file.read_text(encoding='utf-8').strip()
  File "/usr/local/lib/python3.10/pathlib.py", line 1133, in read_text
    return f.read()
  File "/usr/local/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 264619: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 109, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 642, in add
    archive_links(new_links, overwrite=False, **archive_kwargs)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_media(Link(url=http://www.ashra.com/news.php?m=A))
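
The first traceback ends in a strict `read_text(encoding='utf-8')` call on a text file read by the media extractor, which raises as soon as it meets a byte such as `0xb3` that cannot start a UTF-8 sequence. Below is a minimal sketch of that failure and of one lenient alternative, assuming it is acceptable to replace undecodable bytes when building index text. `read_text_lenient` and the sample file name are hypothetical illustrations, not ArchiveBox code, and this is not necessarily the fix the project adopted:

```python
import tempfile
from pathlib import Path


def read_text_lenient(path: Path) -> str:
    """Decode as UTF-8, substituting U+FFFD for undecodable bytes instead of raising.

    Hypothetical helper for illustration only; not part of ArchiveBox.
    """
    return path.read_text(encoding="utf-8", errors="replace").strip()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file containing the same stray byte as in the report (0xb3).
    sample = Path(tmp) / "subtitles.vtt"
    sample.write_bytes(b"caption \xb3 text\n")

    # Strict decoding fails the same way as the traceback above.
    try:
        sample.read_text(encoding="utf-8")
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True

    # Lenient decoding keeps the readable text and marks the bad byte.
    lenient = read_text_lenient(sample)
```

`errors="ignore"` would drop the bad byte silently instead; `errors="replace"` is used here so the damage stays visible in the indexed text.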

Steps to reproduce

Run archivebox add with this URL:

www.ashra.com/news.php?m=A

Screenshots or log output

see above

ArchiveBox version

archivebox@Ubuntu-2204-jammy-amd64-base:~/archivebox_data$ docker run -v $PWD:/data archivebox/archivebox version
ArchiveBox v0.6.3
Cpython Linux Linux-5.15.0-25-generic-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.10.4         valid     /usr/local/bin/python3.10
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.30.2         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2022.04.08     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v101.0.4951.41  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            6 files         valid     /data
 √  SOURCES_DIR           1 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           234 files       valid     ./archive
 √  CONFIG_FILE           307.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             3.2 MB          valid     ./index.sqlite3
kerem closed this issue 2026-03-14 23:56:26 +03:00

@pirate commented on GitHub (Aug 20, 2022):

Duplicate of:
https://github.com/ArchiveBox/ArchiveBox/issues/984
https://github.com/ArchiveBox/ArchiveBox/issues/991


@turian commented on GitHub (Sep 12, 2022):

I believe I fixed this in https://github.com/ArchiveBox/ArchiveBox/pull/1026

TL;DR, until that's merged:

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn't work and you still get crap UnicodeDecodeErrors, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now. Or pip install from my branch: https://github.com/turian/ArchiveBox/tree/feature/kludge-984-UTF8-bug
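
To check whether a snapshot still contains files that would trigger the error, something like the following could be used. It is a rough diagnostic sketch, not an ArchiveBox command: `find_non_utf8_files` is a hypothetical helper, and the glob patterns passed to it are assumptions about which text files the media extractor reads.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def find_non_utf8_files(root: Path, patterns: Iterable[str]) -> List[Tuple[Path, int]]:
    """Return (path, byte offset) for every matching file that is not valid UTF-8.

    Hypothetical helper; the offset is where strict decoding first fails,
    comparable to the "position" shown in the UnicodeDecodeError message.
    """
    bad = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError as err:
                bad.append((path, err.start))
    return bad


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # One valid UTF-8 file and one containing the stray 0xb3 byte.
    (root / "ok.vtt").write_bytes("fine \u00e9\n".encode("utf-8"))
    (root / "bad.description").write_bytes(b"abc\xb3def")

    # The patterns are assumed for illustration; adjust them to the files in your archive.
    report = [
        (path.name, offset)
        for path, offset in find_non_utf8_files(root, ["*.vtt", "*.description"])
    ]
```

Pointing `root` at a snapshot folder such as ./archive/1660315483.000912 would list the offending files and the byte offsets to inspect.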