mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1542] Bug: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte #3934
Originally created by @JPeroutek on GitHub (Oct 16, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1542
Describe the bug
While attempting to archive a URL, I get multiple failures on the child pages (added with depth = 1), along the lines of
Steps to reproduce
archive_org step
Screenshots or log output
ArchiveBox version
@pirate commented on GitHub (Oct 16, 2024):
Thanks for reporting.
Are you able to share the failing URL or perhaps an anonymized similar URL?
Non-English non-utf8 pages present a variety of encoding challenges, it's hard to debug without a specific url.
Also, can you try it with archivebox/archivebox:0.8.5rc44 (in a new empty /data dir, just to test)? There are some recent improvements that might fix this bug.
@JPeroutek commented on GitHub (Oct 16, 2024):
@pirate Oof, forgot to put the URL in there. Added.
Here it is again, just for ease of finding it:
https://www.carburetor-parts.com/carburetor-repair-manuals
I'll give the rc image a shot too and report back.
@JPeroutek commented on GitHub (Oct 16, 2024):
Couldn't get it running with the specified version.
Docker Compose file
Error log in ArchiveBox Docker container
@pirate commented on GitHub (Oct 17, 2024):
Ah ok that error is separate from the original issue, but it's from it failing to create a Unix socket file in the data dir, it's a permissions problem with the volume mounted to /data.
Are you on macOS, windows, or Linux?
Also could you try just commenting out the data volume bind mount line in docker-compose.yml for one run.
@JPeroutek commented on GitHub (Oct 17, 2024):
Windows 10.
Tried to run it after commenting out all the volume mounts to ./data, but couldn't get it to start. Looks like it wants me to run the Init step, but since the data folder is unbound, I can't go in and run docker compose run archivebox init.
Error message from Archivebox container
@JPeroutek commented on GitHub (Oct 17, 2024):
In the end, I don't think the original issue is a showstopper for me; it looks like it's mostly preventing ArchiveBox from fetching titles for PDF documents.
@pirate commented on GitHub (Oct 17, 2024):
Ah that's useful to know, it's maybe related to how it's trying to parse the response body for a
@tensor5g commented on GitHub (Apr 6, 2025):
I have the same issue, trying to archive a PDF directly (maybe this isn't supported?)
https://cdn.shopify.com/s/files/1/0560/3803/1433/files/Warranty_-_EN.pdf?v=1716443094
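For anyone landing here with the same traceback, here is a minimal sketch of why a PDF body can produce this exact message. It assumes the title step decodes the raw response body as strict UTF-8, which the thread does not confirm. Many PDF writers emit a binary comment line (%âãÏÓ) right after the %PDF-1.x header, which puts byte 0xe2 at offset 10, followed by a byte that is not a valid UTF-8 continuation. PDF_HEAD, strict_decode_error, and decode_for_title are hypothetical names for illustration, not ArchiveBox code.

```python
from typing import Optional

# Hypothetical sample: the first bytes of a typical PDF file. The second
# line is a binary comment that many PDF writers add so that tools treat
# the file as binary rather than text.
PDF_HEAD = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def strict_decode_error(body: bytes) -> Optional[str]:
    """Return the error message from a strict UTF-8 decode, or None on success."""
    try:
        body.decode("utf-8")
    except UnicodeDecodeError as err:
        return str(err)
    return None


def decode_for_title(body: bytes) -> Optional[str]:
    """Hypothetical guard: skip title parsing for PDFs, decode leniently otherwise."""
    if body.startswith(b"%PDF-"):
        # Binary document: there is no HTML <title> to extract.
        return None
    # Replace undecodable bytes instead of raising.
    return body.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(strict_decode_error(PDF_HEAD))
    print(decode_for_title(PDF_HEAD))
    print(decode_for_title(b"<title>Manuals</title>"))
```

If the failure really does come from this path, checking the magic bytes (or the Content-Type header) before decoding would avoid the error for PDFs, and errors="replace" would keep non-UTF-8 HTML pages from aborting the step.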