mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #441] Browsers attempting to autodetect encoding leads to Unicode rendering issues in some replayed extractor outputs #1804
Originally created by @MartinMSPedersen on GitHub (Aug 14, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/441
Describe the bug
When archiving Danish-language pages from https://politiken.dk/, Unicode characters are rendered incorrectly.
Steps to reproduce
https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
Screenshots or log output
Software versions
github.com/pirate/ArchiveBox@aa085cdb60
@pirate commented on GitHub (Aug 14, 2020):
In which output are you seeing the wrong characters? The wget output? Can you post a screenshot?
@rfletcher commented on GitHub (Aug 16, 2020):
Another example: https://sixcolors.com/post/2020/07/automate-this-how-hot-does-it-feel/
Archived with 0.4.14 in docker. The source site (left) and "html dump" archive are pictured below. Note the apostrophes and an em dash are mangled on the right. The wget output text shows the same.
@cdvv7788 commented on GitHub (Aug 17, 2020):
@rfletcher can you please share the exact command you are running, and can you try running `archivebox version` to confirm the version? I was not able to reproduce the issue with that page.
@MartinMSPedersen Can you please share some screenshots of your output?
@rfletcher commented on GitHub (Aug 21, 2020):
Sorry for the delay. The info you've asked for is below.
I've played around a bit more and I really don't know what to make of this. I'm on a mac, running 10.15.6. When I open output.html for the sixcolors.com URL in either Safari or Chrome (both up to date), it renders as above, with the wrong characters. Oddly though, when I use Finder's "quick look" feature to view the same file, it's rendered correctly. I'm pretty surprised that Quick Look and Safari would show a different result, but they do. Here's another screenshot, showing Quick Look on the left and Safari on the right.
I would have chalked this up to a macOS/Safari bug at this point, except that Quick Look is the only place I've seen this file rendered correctly. Multiple browsers show something other than the original text.
@rfletcher commented on GitHub (Aug 21, 2020):
I got curious and installed more browsers. All running their latest versions on macOS Catalina. (✅ = rendered correctly, ❌ = incorrectly)
❌ Chrome
❌ Safari
✅ Quick Look (macOS Finder)
✅ Firefox
✅ Edge
✅ Brave
I don't know if there's anything you can do to the HTML file to make it render more consistently, but I'm not so sure there's an ArchiveBox bug here, exactly.
Here's the actual HTML file I used for testing: output.html.zip
@MartinMSPedersen commented on GitHub (Sep 3, 2020):
Sorry for the delay.
Here are some screenshots that I think are useful.
First a version from archive.org where the unicode is encoded correctly.

And the version from ArchiveBox viewed in Firefox.

Here we can see that firefox believes the html is encoded in windows-1252 which is not correct.

Same result on google-chrome
If I choose the singlefile version then it is encoded correctly as UTF-8.

@cdvv7788 commented on GitHub (Sep 10, 2020):
@MartinMSPedersen @rfletcher just to confirm, which extractor's output is having the issue? Wget?
@rfletcher commented on GitHub (Sep 10, 2020):
My current versions:
In my case it looks like these outputs are using the wrong encoding for the sixcolors.com URL (as viewed in Safari on macOS):
These show expected output:
All three of the bad HTML documents show `document.characterSet` as `"windows-1252"`. The rest show `"UTF-8"`.
I think what might be happening is that the original page has the encoding information set in a response header (my example URL definitely includes `content-type: text/html; charset=UTF-8`), but when the HTML body is saved locally without headers that explicit encoding information is lost. At that point it's up to the renderer to guess the encoding, and some are getting it wrong.
@cdvv7788 commented on GitHub (Sep 10, 2020):
Yes, that is our guess too. Thanks for the information!
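For anyone following along, the suspected mechanism is easy to reproduce outside a browser. The snippet below is only an illustration, not ArchiveBox code: the sample string is made up, and it simply decodes the same UTF-8 bytes once as windows-1252 (what the browsers reportedly guessed) and once as UTF-8 (what the response header declared).

```python
# Illustrative sample text (an assumption, not taken from the archived pages):
# a curly apostrophe, an em dash, and a Danish letter.
original = "It\u2019s hot \u2014 reng\u00f8ring"

# The bytes as saved to disk: UTF-8, with no header left to say so.
raw_bytes = original.encode("utf-8")

# What a renderer displays if it guesses windows-1252 for those bytes.
misread = raw_bytes.decode("cp1252")

# What it displays when it is told (or correctly guesses) UTF-8.
correct = raw_bytes.decode("utf-8")

print("guessed windows-1252:", misread)
print("declared UTF-8:      ", correct)
```

The first line printed shows the same kind of mangled apostrophes and dashes as in the screenshots, which is consistent with the encoding information being lost rather than the bytes themselves being corrupted.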
@MartinMSPedersen commented on GitHub (Sep 19, 2020):
Maybe I should close this issue now?
@pirate commented on GitHub (Sep 22, 2020):
@MartinMSPedersen no, we're still thinking about how to solve this by either storing and replaying headers or converting the encoding on-disk to UTF-8.
We're currently stuck on reproducing the issue reliably, as it only happens when visiting the pages directly, but not when they're iframed. Our suspicion is that this is a subtle behavior of Chrome's automatic encoding detection, and our solution will involve nudging Chrome in the right direction or finding out why it's autodetecting differently based on whether the content is iframed or not.
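As a rough sketch of the "convert the encoding on-disk to UTF-8" option mentioned above: the helper names below (`charset_from_content_type`, `to_utf8_with_meta`) are hypothetical and not part of ArchiveBox, and inserting a `<meta charset="utf-8">` tag is an added assumption about how a browser could be nudged toward the right encoding once the headers are gone. It assumes the original `Content-Type` header value was captured at archive time.

```python
import codecs
import re
from email.message import Message

# Matches an opening <head> tag but not <header>.
HEAD_RE = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
META_TAG = '<meta charset="utf-8">'


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    try:
        codecs.lookup(charset)  # fall back if Python does not know the codec
    except LookupError:
        return default
    return charset


def to_utf8_with_meta(raw_html, content_type):
    """Decode with the header's charset, re-encode as UTF-8, declare it in-page.

    Hypothetical helper: it does not remove any stale charset declaration
    already present further down in the document.
    """
    text = raw_html.decode(charset_from_content_type(content_type), errors="replace")
    match = HEAD_RE.search(text)
    if match:
        text = text[:match.end()] + META_TAG + text[match.end():]
    else:
        text = META_TAG + text
    return text.encode("utf-8")


if __name__ == "__main__":
    body = "<html><head><title>r\u00f8d</title></head></html>".encode("cp1252")
    print(to_utf8_with_meta(body, "text/html; charset=windows-1252").decode("utf-8"))
```

The other option, storing the headers and replaying them when serving the snapshot, would leave the file on disk untouched, but would not help when the file is opened directly from the filesystem.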