[GH-ISSUE #980] Bug: UTF 8 implemented wrong in readability mode #2119

Closed
opened 2026-03-01 17:56:37 +03:00 by kerem · 2 comments

Originally created by @PovilasID on GitHub (May 22, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/980

Describe the bug

When a website containing non-ASCII (UTF-8 encoded) characters is archived, the "readability" output (listed next to the WARC output) does not display those characters correctly.

Steps to reproduce

In an ArchiveBox instance, archive a website that contains UTF-8 characters (e.g. https://www.lrt.lt/ ).
In readability mode, instead of characters like "š" you get "Å¡", and instead of "ė" you get "Ä—", etc.
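
For reference, the garbled strings above are what you would get if UTF-8 bytes were decoded as Windows-1252. That is only a guess at the cause, since the issue does not say where in ArchiveBox the wrong decoding happened. A minimal Python sketch of the effect follows; the helper names `mojibake` and `repair` are made up for illustration and are not part of ArchiveBox:

```python
# Hypothetical helpers for illustration only; not ArchiveBox code.
# Assumption: the readability output decoded UTF-8 bytes as Windows-1252 (cp1252).

def mojibake(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse the mistake: recover the original bytes, then decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


# "š" is U+0161, which is the two bytes C5 A1 in UTF-8.
# Read as cp1252, those bytes are two separate characters.
print(mojibake("š"))          # Å¡
print(repair(mojibake("š")))  # š
```

The same round trip turns "ė" (U+0117, bytes C4 97) into the second garbled pair from the report, which is why a missing or wrong charset declaration is a plausible suspect.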

Screenshots or log output

ArchiveBox version

v0.6.2


@pirate commented on GitHub (Jan 19, 2024):

Is this still happening for you on 0.7.2?


@PovilasID commented on GitHub (Jan 19, 2024):

There is an update, the first since 2021! Woohoo!
I updated and tested: it looks like it is now handling UTF-8 characters correctly.
