[GH-ISSUE #1542] Bug: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte #915

Open
opened 2026-03-01 14:47:15 +03:00 by kerem · 8 comments
Owner

Originally created by @JPeroutek on GitHub (Oct 16, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1542

Describe the bug

While attempting to archive a URL, I get multiple failures on the child pages (added with depth = 1), along the lines of:
'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Steps to reproduce

  1. Install ArchiveBox using Docker Compose (default installation, PiHole enabled).
  2. Add the following URL with Depth = 1: https://www.carburetor-parts.com/carburetor-repair-manuals
  3. Note that almost all files fail when fetching the title and on the `archive_org` step.

Screenshots or log output

ArchiveBox version

0.7.2
ArchiveBox v0.7.2 COMMIT_HASH=315c9f3 BUILD_TIME=2024-04-24 22:47:02 1713998822
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=911:911 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11                                           
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                         
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py          
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /usr/local/bin/archivebox                                           

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl                                                       
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                       
 √  NODE_BINARY           v20.12.2        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v1.1.46         valid     /app/node_modules/single-file-cli/single-file
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js
 √  GIT_BINARY            v2.39.2         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2023.12.30     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v124.0.6367.29  valid     /usr/bin/chromium-browser
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None
 -  COOKIES_FILE          -               disabled  None

[i] Data locations:
 √  OUTPUT_DIR            8 files @       valid     /data
 √  SOURCES_DIR           88 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           157 files       valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             1.2 MB          valid     ./index.sqlite3

@pirate commented on GitHub (Oct 16, 2024):

Thanks for reporting.

Are you able to share the failing URL or perhaps an anonymized similar URL?

Non-English, non-UTF-8 pages present a variety of encoding challenges, so it's hard to debug without a specific URL.

Also, can you try it with `archivebox/archivebox:0.8.5rc44` (in a new empty /data dir, just to test)? There are some recent improvements that might fix this bug.


@JPeroutek commented on GitHub (Oct 16, 2024):

@pirate Oof, forgot to put the URL in there. Added.

Here it is again, just for ease of finding it:

https://www.carburetor-parts.com/carburetor-repair-manuals

I'll give the rc image a shot too and report back.


@JPeroutek commented on GitHub (Oct 16, 2024):

Couldn't get it running with the specified version.

Docker Compose file: https://gist.github.com/JPeroutek/42c2db312fa8540f185c53f5482ddc97

Error log in ArchiveBox Docker container: https://gist.github.com/JPeroutek/ae169c63bace4bc203a935b142eb6905

@pirate commented on GitHub (Oct 17, 2024):

Ah ok, that error is separate from the original issue. It comes from failing to create a Unix socket file in the data dir, which is a permissions problem with the volume mounted to /data.

Are you on macOS, Windows, or Linux?
Also, could you try commenting out the data volume bind mount line in docker-compose.yml for one run?


@JPeroutek commented on GitHub (Oct 17, 2024):

Windows 10.

Tried to run it after commenting out all the volume mounts to `./data`, but couldn't get it to start. Looks like it wants me to run the `Init` step, but since the data folder is unbound, I can't go in and run `docker compose run archivebox init`.

Error message from ArchiveBox container: https://gist.github.com/JPeroutek/e3f163c09944867f6292dbdcd04a08a5

@JPeroutek commented on GitHub (Oct 17, 2024):

In the end I don't think the original issue is a showstopper for me. It looks like it's mostly preventing ArchiveBox from fetching titles for PDF documents.


@pirate commented on GitHub (Oct 17, 2024):

Ah, that's useful to know. It's maybe related to how it's trying to parse the response body for a `<title>` tag. There is logic that's supposed to default to the file name if the file is a binary blob like a PDF, but perhaps it broke at some point. I'll take a look.
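
For illustration only, the fallback described in this comment could look roughly like the sketch below. This is not ArchiveBox's actual code: `guess_title` and its parameters are hypothetical names, and the checks for a PDF (the `%PDF-` magic bytes or an `application/pdf` content type) are assumptions about how a binary response might be detected.

```python
import re
from urllib.parse import unquote, urlsplit


def guess_title(url: str, body: bytes, content_type: str = "") -> str:
    """Hypothetical helper: use <title> for HTML-like bodies, else the file name."""
    # Assumption: treat the response as binary if it carries the PDF magic
    # bytes or the server declared it as a PDF.
    looks_binary = body.startswith(b"%PDF-") or "application/pdf" in content_type.lower()
    if not looks_binary:
        # errors="replace" never raises, so stray non-UTF-8 bytes
        # cannot abort the title lookup.
        text = body.decode("utf-8", errors="replace")
        match = re.search(r"<title[^>]*>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
        if match and match.group(1).strip():
            return match.group(1).strip()
    # Fall back to the last path segment of the URL (query string ignored),
    # and to the URL itself if the path has no file name.
    name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    return name or url


if __name__ == "__main__":
    pdf_url = "https://cdn.shopify.com/s/files/1/0560/3803/1433/files/Warranty_-_EN.pdf?v=1716443094"
    print(guess_title(pdf_url, b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))  # prints Warranty_-_EN.pdf
```

Decoding with `errors="replace"` is one way to keep a single undecodable byte from failing the whole step; whether that is the right trade-off for ArchiveBox is a separate question.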

@tensor5g commented on GitHub (Apr 6, 2025):

I have the same issue, trying to archive a PDF directly (maybe this isn't supported?)

https://cdn.shopify.com/s/files/1/0560/3803/1433/files/Warranty_-_EN.pdf?v=1716443094

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
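
A possible explanation for this exact message, offered as a guess rather than a confirmed diagnosis: many PDF files begin with a version header such as `%PDF-1.4`, followed by a comment line of high bytes (commonly `%âãÏÓ`) that marks the file as binary. If the first bytes of such a file are decoded as UTF-8, Python fails at the byte `0xe2` at offset 10. The sketch below assumes that byte layout; the actual files in this thread were not inspected.

```python
# Assumed start of a PDF file: version header, newline, then a comment
# line made of bytes >= 0x80 (here the common "%\xe2\xe3\xcf\xd3").
pdf_head = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"

message = None
try:
    pdf_head.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xe2 announces a 3-byte UTF-8 sequence, but the next byte (0xe3)
    # is not a continuation byte, so strict decoding stops here.
    message = str(err)

print(message)
# 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
```

If this is what is happening, the failure comes from treating a binary response as text, not from anything unusual about the page's language or character set.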