mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1119] Bug: UTF-8 is not supported for websites with html lang attribute #2211
Originally created by @PovilasID on GitHub (Mar 10, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1119
Describe the bug
Characters that require UTF-8 are not rendered correctly in the archive of a website that does not explicitly declare a UTF-8 charset and instead relies on the html lang attribute.
Please make UTF-8 the default charset, or add an option to force UTF-8 as the default.
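As an illustration of the symptom described above, here is a minimal Python sketch. It assumes (the thread does not confirm this) that the saved page bytes are valid UTF-8 and that the viewer falls back to a legacy single-byte encoding because no charset is declared. The sample word and the `decode_as` helper are hypothetical, chosen only for the demonstration.

```python
# Sketch: the same UTF-8 bytes decoded with the right and the wrong charset.
# Assumption: the viewer falls back to Windows-1252 when no charset is declared.

text = "Žemė"  # sample Lithuanian word, chosen only for illustration
raw = text.encode("utf-8")  # the bytes a UTF-8 page would contain


def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes with the given encoding, replacing undecodable bytes."""
    return data.decode(encoding, errors="replace")


correct = decode_as(raw, "utf-8")   # round-trips to the original text
garbled = decode_as(raw, "cp1252")  # each UTF-8 byte becomes its own character

print(correct)
print(garbled)
```

Each two-byte UTF-8 character turns into two unrelated characters under the single-byte fallback, which is consistent with the garbled letters reported for the wget output.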
Steps to reproduce
Screenshots or log output
Archive https://delfi.lt/ and open the wget output of the snapshot.
ArchiveBox version
v0.6.2
@pirate commented on GitHub (Mar 10, 2023):
Please post the full verbatim output of `archivebox version`, not just the version number. Also please upload the archived HTML wget output or a screenshot.
Most likely this is a wget issue, as ArchiveBox has no say in how wget internally implements encoding detection.
@PovilasID commented on GitHub (Mar 10, 2023):
Screenshot:

`archivebox version` output:
UTF-8 used to be broken for all websites; now it works for most but not all, so there must have been a change in the code that validates tags or otherwise alters how wget behaves. https://lrt.lt was a website that had issues with UTF-8 characters in the past.
@pirate commented on GitHub (Mar 11, 2023):
Thanks for posting those. Can you try the latest dev branch version? There are some updates:
`archivebox/archivebox:dev`
https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch
@PovilasID commented on GitHub (Mar 11, 2023):
No effect after pulling the `dev` Docker image, forcing a rebuild, and re-snapshotting the website... well, not completely no effect. Singlepage started working, so that is nice: it handles UTF-8 characters correctly. The wget output, however, still shows the same errors in place of the characters that require UTF-8.
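Until the wget output is fixed, one possible workaround is to post-process the saved HTML so that it declares its charset explicitly. The sketch below is a hypothetical helper, not part of ArchiveBox or wget: `ensure_utf8_meta` and both regular expressions are assumptions made for illustration. It only helps if the saved bytes really are UTF-8; if wget transcoded the page incorrectly, adding the tag will not repair it.

```python
import re

# Matches an existing charset declaration, e.g. <meta charset="..."> or
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
CHARSET_RE = re.compile(rb"<meta[^>]+charset\s*=", re.IGNORECASE)
# Matches the opening <head> tag (but not <header>).
HEAD_RE = re.compile(rb"<head(\s[^>]*)?>", re.IGNORECASE)


def ensure_utf8_meta(html: bytes) -> bytes:
    """Return html with a UTF-8 meta tag added if no charset is declared."""
    if CHARSET_RE.search(html):
        return html  # a charset is already declared; leave the page alone
    tag = b'<meta charset="utf-8">'
    match = HEAD_RE.search(html)
    if match:
        # Insert right after <head> so the declaration appears early.
        return html[:match.end()] + tag + html[match.end():]
    # No <head> found: prepend the tag as a best effort.
    return tag + html


# Example page that relies on the lang attribute only.
page = '<html lang="lt"><head><title>Žinios</title></head></html>'.encode("utf-8")
fixed = ensure_utf8_meta(page)
print(fixed.decode("utf-8"))
```

Working on bytes rather than decoded text avoids having to guess the encoding before the declaration is in place, and running the helper twice leaves the page unchanged.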