mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1373] Bug: UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 110372: surrogates not allowed when trying to render unprintable filesystem path in view #3860
Originally created by @Finkregh on GitHub (Mar 6, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1373
Describe the bug
Access impossible due to unicode issue
Steps to reproduce
I don't know what exactly changed/happened.
Screenshots or log output
ArchiveBox version
logs with DEBUG=True
@pirate commented on GitHub (Mar 6, 2024):
Looks like you archived a URL that contains unprintable UTF-8 bytes (possibly from a broken emoji/accented character/Cyrillic/Chinese/Arabic/etc.) and it ended up in a filesystem path, so it's failing when trying to render the path in the public view.
In the short term you can find/strip all special UTF-8 characters in filenames using this script I wrote: `strip_bad_filename_characters.sh`, or a program like `detox` (`apt install detox; man detox`).
In the long term ArchiveBox should fix this by force-normalizing all filenames to UTF-8 form-D on creation so this doesn't happen in the future.
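For context, one common way a character like '\udcf6' ends up in a Python string: on POSIX systems Python decodes filenames with the `surrogateescape` error handler, so a stray byte 0xF6 becomes the lone surrogate U+DCF6, which strict UTF-8 encoding then rejects. The sketch below only illustrates that mechanism; the filename is made up and none of this is ArchiveBox code.

```python
# Hypothetical filename containing a byte that is not valid UTF-8 on its own
# (0xF6 is "ö" in Latin-1, but an incomplete sequence in UTF-8).
raw_name = b"caf\xf6.html"

# "surrogateescape" maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw_name.decode("utf-8", errors="surrogateescape")


def is_renderable(text):
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The escape is reversible, so the original bytes can always be recovered.
roundtrip = name.encode("utf-8", errors="surrogateescape")
```

Here `is_renderable(name)` is False, which is the same failure mode as the UnicodeEncodeError in the issue title: the string is fine as a path, but cannot be written into a UTF-8 response.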
@Finkregh commented on GitHub (Mar 7, 2024):
Additional thoughts on this:
the char in question is https://www.unicodepedia.com/unicode/low-surrogates/dcf6/trail-surrogate-dcf6/
If I look at the files/folders and grep through the list for non-ASCII:
I can't see that char.
If I move the folders in question away I still get the same issue:
I'd rather not run detox as it would rename all sorts of files and then the archive would be broken. Same with the script you linked.
Is there any way to narrow this down to where the actual file is? Perhaps even more debug output than DEBUG=True?
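A read-only way to narrow this down, without renaming anything, might be to walk the data directory and list only the names that cannot be encoded as strict UTF-8. This is a minimal sketch, assuming the bad bytes are in a file or directory name; `find_unencodable_paths` is a hypothetical helper, not part of ArchiveBox.

```python
import os


def find_unencodable_paths(root):
    """Return paths under root whose file or directory name is not valid UTF-8.

    os.walk() with a str root yields names decoded with "surrogateescape",
    so invalid bytes show up as lone surrogates that strict encoding rejects.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                name.encode("utf-8")
            except UnicodeEncodeError:
                hits.append(os.path.join(dirpath, name))
    return hits
```

Printing each hit with `ascii(path)` avoids triggering the same encoding error on the terminal. An empty result suggests the bad character lives inside a file or the database rather than in a name.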
@Finkregh commented on GitHub (Mar 17, 2024):
Interestingly, I did a `sqlite3 database.db '.dump' > foo.sql` (besides some strace), which led to not having the issue anymore. I wonder what that did and if something went wrong inside the sqlite file before.
I'd still be interested in getting to know how to debug this :)
edit: I also moved back all the archive data which I suspected of causing issues, and it still works.
@Finkregh commented on GitHub (Mar 17, 2024):
Aaand it's back... o_O?
I read your pretty nice upgrading documentation that explains what `init` does. So I ran it and everything works. Still guessing in the direction of some sqlite issue...
And I tried to get `django-debug-toolbar==3.2.4` to run but ran into an exception:
edit: restarted the server and the issue is back. Now running init again w/o restart, let's see
edit2: still broken :|
@pirate commented on GitHub (Mar 21, 2024):
Good approach trying to narrow down the failing request with `django-debug-toolbar`, not sure why it failed, I'll take a look. You can also try disabling most of the panes that it uses as they're often individually buggy and not all panes are needed to track down a broken request: `archivebox/core/settings.py:165` `DEBUG_TOOLBAR_PANELS` (you can comment out almost everything in there, I'd start by disabling `'debug_toolbar.panels.cache.CachePanel'`). There are also middlewares that can be added to log requests specifically: https://github.com/Rhumbix/django-request-logging
We can also keep trying the more direct approach to find where the offending bytes are recorded on the filesystem or in sqlite, before spelunking through the ArchiveBox code, maybe something like:
@Finkregh commented on GitHub (Mar 26, 2024):
Tried with request-logging:
Sqlite glob with non-ASCII returns all sorts of stuff, not that char.
I tried with this and it returned nothing:
Edit: the grep did find some files, I moved them away and nothing changed :(
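Since a GLOB for non-ASCII matches every legitimately accented value, a narrower check of the SQLite file could look only for values that are not valid UTF-8, or that contain a literal `\udc..` escape. This is a rough sketch under those assumptions; `find_suspect_values` and `is_suspect` are hypothetical helpers, and the database path is whatever your index file is called.

```python
import sqlite3


def is_suspect(raw):
    """Flag bytes that are invalid UTF-8 or contain a literal '\\udcXX' escape."""
    if b"\\udc" in raw.lower():
        return True
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def find_suspect_values(db_path):
    """Return (table, column, raw_bytes) for every suspect value in the database."""
    conn = sqlite3.connect(db_path)
    # Hand back TEXT values as raw bytes so that nothing is decoded for us.
    conn.text_factory = bytes
    hits = []
    try:
        tables = [row[0].decode("utf-8") for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")]
        for table in tables:
            cursor = conn.execute('SELECT * FROM "{}"'.format(table))
            columns = [description[0] for description in cursor.description]
            for row in cursor:
                for column, value in zip(columns, row):
                    # BLOB columns are bytes too, so binary data can be a false positive.
                    if isinstance(value, bytes) and is_suspect(value):
                        hits.append((table, column, value))
    finally:
        conn.close()
    return hits
```

The function only reads, so it should be safe to run against a copy of the database while the server is stopped.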
@pirate commented on GitHub (Mar 26, 2024):
damn... ok. I guess I might have to fix it the harder way: changing the renderer to handle this.
Before we go debugging too much further can you help double check these super quick:
Related issues:
@Finkregh commented on GitHub (Mar 26, 2024):
@Finkregh commented on GitHub (Mar 26, 2024):
FYI the debug toolbar:
with only `debug_toolbar.panels.request.RequestPanel` in the `DEBUG_TOOLBAR_PANELS`
@Finkregh commented on GitHub (Mar 26, 2024):
`$ LC_ALL=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 archivebox server --nothreading '[::]:8080'` leads to the same issue as before
@Finkregh commented on GitHub (Mar 26, 2024):
I pulled the whole debug block in the settings.py to the bottom of the file and added `ERROR_LOG="/tmp/err.log"`. Now the server starts and throws the same 500 error as before w/o any debug toolbar :D
@Finkregh commented on GitHub (Apr 5, 2024):
Anything else I could try?
If I tried something like moving a directory in data/ away, testing, and retrying after moving the next one:
Would that work? Should I do something additionally in that loop?
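Moving one directory at a time works but needs one test per snapshot; halving the set each round needs far fewer. The sketch below shows only the bisection logic, assuming a single snapshot directory is enough to trigger the error. `is_broken` is a hypothetical callback that you would implement yourself (for example: leave only the given directories in place, reload the page, and report whether it still returns the 500).

```python
def find_culprit(items, is_broken):
    """Return one item that triggers the failure, using repeated halving.

    Assumes is_broken(items) is True for the full list and that the failure
    is caused by at least one item on its own, not by a combination.
    """
    candidates = list(items)
    while len(candidates) > 1:
        first_half = candidates[: len(candidates) // 2]
        if is_broken(first_half):
            candidates = first_half
        else:
            # The first half is clean, so a culprit must be in the rest.
            candidates = candidates[len(first_half):]
    return candidates[0]
```

With a few thousand snapshot directories this takes roughly a dozen checks instead of thousands. If more than one directory is bad, remove the one it finds and run it again.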
@Finkregh commented on GitHub (May 6, 2024):
I'm now running this after moving all directories from `archive` to `broken`:
This is one folder I identified:
So a grep for literally `udcf6` would have shown the issue from the beginning...
I'll leave the script running and update again if I come across another issue.
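A plain grep for `udcf6` also matches valid emoji escapes such as `\ud83d\udcf6`, where the low surrogate is the second half of a proper pair. A slightly stricter search, sketched here, flags only `\udcXX` escapes that are not preceded by a high-surrogate escape. It assumes the escape is stored as literal text in JSON files; the `.json` suffix filter and the helper name `scan_tree` are assumptions for illustration.

```python
import os
import re

# Literal "\udcXX" text that is NOT the second half of a "\ud8XX\udcXX" style pair.
LONE_LOW_SURROGATE = re.compile(
    rb"(?<!\\ud[89ab][0-9a-f]{2})\\udc[0-9a-f]{2}", re.IGNORECASE)


def scan_tree(root, suffixes=(".json",)):
    """Return files under root that contain a lone low-surrogate escape."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            if LONE_LOW_SURROGATE.search(data):
                hits.append(path)
    return hits
```

This is a heuristic: a JSON string holding an escaped backslash followed by `udcf6` would also match, so treat hits as candidates to inspect, not as proof.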
@pirate commented on GitHub (May 7, 2024):
Argh, it was wget path detection all along! That part of the codebase causes so many nasty surprises, see https://github.com/ArchiveBox/ArchiveBox/issues/549
Working around and reverse-engineering wget's absurdly complicated mapping of URLs to filepaths is consistently one of the most troublesome, labor-intensive parts of running this entire project.
I think I'll abandon trying to support unicode in filepaths entirely and just change wget to use `--restrict-file-names=ascii --content-disposition` and also stop trying to auto-detect wget's output location like this, it's caused so many hard-to-debug headaches like this one. (It turns out that, counterintuitively, `windows` is more restrictive than `ascii`, and that I already tried this in the past and reverted it.)
@pirate commented on GitHub (May 7, 2024):
Should be tentatively fixed in https://github.com/ArchiveBox/ArchiveBox/pull/1424
I added workaround logic in `wget_output_path()` to fall back to the parent dir if the html path contains unprintable unicode.
@clb92 commented on GitHub (Jul 29, 2024):
Something very similar just happened to me again:
After removing snapshots via the CLI for half an hour, I tracked it down to a Google search URL.
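Instead of removing snapshots one by one, it may be quicker to screen the URL list first. The sketch below flags URLs whose percent-escapes do not decode to valid UTF-8. That such escapes are what broke this particular Google search URL is an assumption, not something confirmed in this thread, and `has_non_utf8_escapes` is a hypothetical helper.

```python
from urllib.parse import unquote_to_bytes


def has_non_utf8_escapes(url):
    """Return True if the percent-decoded URL is not valid UTF-8."""
    try:
        unquote_to_bytes(url).decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True
```

For example, a query like `q=caf%F6` (Latin-1 encoded "ö") is flagged, while `q=caf%C3%B6` (the UTF-8 encoding of the same character) is not.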
@dot-mike commented on GitHub (Aug 3, 2025):
Hit this issue as well. Apparently ArchiveBox does not like URLs with special characters in them.
File logs: