[GH-ISSUE #1119] Bug: UTF-8 is not supported for websites with html lang attribute #702

Open
opened 2026-03-01 14:45:37 +03:00 by kerem · 4 comments

Originally created by @PovilasID on GitHub (Mar 10, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1119

Describe the bug

Characters that require UTF-8 are not rendered correctly for a website that does not declare a UTF-8 charset explicitly and instead relies on the html lang attribute (https://www.w3schools.com/tags/att_lang.asp).
Please make UTF-8 the default charset, or add an option to force UTF-8 as the default.
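
A minimal sketch of the symptom described above, assuming the viewer falls back to a legacy single-byte encoding when the saved page declares no charset. Latin-1 is chosen here only for illustration, and the sample word is hypothetical rather than taken from delfi.lt.

```python
# Illustration only: the same UTF-8 bytes decoded with and without the right charset.
text = "Žinios"  # hypothetical Lithuanian sample text

raw = text.encode("utf-8")        # bytes as they would be stored in the saved page
garbled = raw.decode("latin-1")   # what an assumed legacy fallback would display
correct = raw.decode("utf-8")     # what an explicit UTF-8 declaration would yield

print(garbled)  # Å½inios
print(correct)  # Žinios
```

The lang attribute only names the language of the content. It does not tell a parser which byte encoding to use, which is why the two decodings can differ.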

Steps to reproduce

Screenshots or log output

Archive https://delfi.lt/ and open the wget output of the snapshot.

ArchiveBox version

v0.6.2


@pirate commented on GitHub (Mar 10, 2023):

Please post the full verbatim output of `archivebox version`, not just the version number. Also please upload the archived html wget output or a screenshot.

Most likely this is a wget issue, as archivebox has no say in how wget internally implements encoding detection.
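
One way to check whether a saved page declares its encoding at all is sketched below. This is a hypothetical helper, not part of ArchiveBox or wget. It uses a simplified regular expression rather than the full HTML encoding-sniffing algorithm, and it ignores the HTTP Content-Type header and byte-order marks. The function name `declared_charset` and the 1024-byte window are assumptions made for the example.

```python
import re
from typing import Optional

# Matches <meta charset="..."> and the "charset=..." part of the http-equiv form.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def declared_charset(html: bytes) -> Optional[str]:
    """Return the charset named in a <meta> tag near the top of the page, or None."""
    # Assumption: only the first 1024 bytes are inspected.
    match = _META_CHARSET.search(html[:1024])
    return match.group(1).decode("ascii").lower() if match else None


# A page that relies only on the lang attribute declares no charset.
lang_only = b'<html lang="lt"><head><title>x</title></head></html>'
with_meta = b'<html lang="lt"><head><meta charset="UTF-8"></head></html>'

print(declared_charset(lang_only))
print(declared_charset(with_meta))
```

Running it against the wget output of an affected snapshot would show whether the page carries a charset declaration or leaves the choice to the viewer.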


@PovilasID commented on GitHub (Mar 10, 2023):

Screenshot: https://user-images.githubusercontent.com/396243/224439497-348247dc-684f-49a1-972e-812655a95217.png

`archivebox version` output

ArchiveBox v0.6.2
Cpython Linux Linux-5.11.0-1028-oracle-aarch64-with-glibc2.28 aarch64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            6 files         valid     /data                                                                       
 √  SOURCES_DIR           14 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           8 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             272.0 KB        valid     ./index.sqlite3 

Previously, UTF-8 did not work on all websites; now it works for most but not all, so there must have been a change in the code that validates tags or does something else to alter how wget works. https://lrt.lt is a website that had issues with UTF-8 characters in the past.


@pirate commented on GitHub (Mar 11, 2023):

Thanks for posting those. Can you try the latest dev branch version? There are some updates: `archivebox/archivebox:dev` https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch


@PovilasID commented on GitHub (Mar 11, 2023):

No effect after pulling the `dev` docker image, forcing a rebuild, and re-snapshotting the website... well, not completely no effect. SingleFile started working, so that is nice: it handles UTF-8 characters correctly. However, the wget output still shows the same errors in place of the characters that require UTF-8.
