[GH-ISSUE #441] Browsers attempting to autodetect encoding leads to Unicode rendering issues in some replayed extractor outputs

kerem commented

2026-03-01 17:53:47 +03:00

Owner

Originally created by @MartinMSPedersen on GitHub (Aug 14, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/441

Describe the bug

When archiving Danish-language pages from https://politiken.dk/, Unicode characters are rendered incorrectly.

Steps to reproduce

1. Archive this page: https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
2. Compare the archived version with the live version in a browser: https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
3. The three Danish characters æ, ø and å are rendered incorrectly.

Screenshots or log output

archive add 'https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt'

[i] [2020-08-14 11:34:08] ArchiveBox v0.4.13: archivebox add https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt < /dev/stdin
    > /data

[+] [2020-08-14 11:34:09] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597404849-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-08-14 11:34:09] Writing 2 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

[▶] [2020-08-14 11:34:09] Collecting content for 1 Snapshots in archive...

[+] [2020-08-14 11:34:09] "politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt"
    https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt
    > ./archive/1597404849
      > title
      > favicon
      > wget
      > singlefile
      > pdf
        Failed:
            Exception Failed to chmod: output.pdf does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1597404849;
            chromium --headless --no-sandbox --disable-gpu --disable-dev-shm-usage --disable-software-rasterizer "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf https://politiken.dk/oekonomi/arbejdsmarked/art5521310/Reng%C3%B8ringsassistenter-blev-tvunget-op-i-ilmarch-med-749-kmt

      > screenshot
      > dom
      > media
      > archive_org

[√] [2020-08-14 11:34:38] Update of 1 pages complete (29.46 sec)
    - 0 links skipped
    - 0 links updated
    - 1 links had errors

    Hint: To view your archive index, open:
        /data/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-14 11:34:38] Writing 2 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

Software versions

Using the docker image based on commit: https://github.com/pirate/ArchiveBox/tree/aa085cdb60d835c0c4fc07a4983c328a39cc9292
ArchiveBox version: v0.4.13


kerem added the size: hard, touches: data/schema/architecture, and status: wip labels

2026-03-01 17:53:47 +03:00

kerem commented

2026-03-01 17:53:47 +03:00

Author

Owner

@pirate commented on GitHub (Aug 14, 2020):

In which output are you seeing the wrong characters? The wget output? Can you post a screenshot?


kerem commented

2026-03-01 17:53:47 +03:00

Author

Owner

@rfletcher commented on GitHub (Aug 16, 2020):

Another example: https://sixcolors.com/post/2020/07/automate-this-how-hot-does-it-feel/

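Archived with 0.4.14 in Docker. The source site (left) and the "html dump" archive (right) are pictured below. Note that the apostrophes and an em dash are mangled on the right. The wget output shows the same problem.

Screenshot: https://user-images.githubusercontent.com/39261/90342613-7f9a4200-dfd7-11ea-8610-dc3404ba391f.png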


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@cdvv7788 commented on GitHub (Aug 17, 2020):

@rfletcher can you please share the exact command you are running, and can you try running archivebox version to confirm the version? I was not able to reproduce the issue with that page.
@MartinMSPedersen Can you please share some screenshots of your output?


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@rfletcher commented on GitHub (Aug 21, 2020):

> @rfletcher can you please share the exact command you are running, and can you try running archivebox version to confirm the version? I was not able to reproduce the issue with that page.

Sorry for the delay. The info you've asked for is below.

I've played around a bit more and I really don't know what to make of this. I'm on a Mac running macOS 10.15.6. When I open output.html for the sixcolors.com URL in either Safari or Chrome (both up to date), it renders as above, with the wrong characters. Oddly though, when I use Finder's Quick Look feature to view the same file, it renders correctly. I'm pretty surprised that Quick Look and Safari would show different results, but they do. Here's another screenshot, showing Quick Look on the left and Safari on the right.

I would have chalked this up to a macOS/Safari bug at this point, except that Quick Look is the only place I've seen this file rendered correctly. Multiple browsers show something other than the original text.

Screenshot: https://user-images.githubusercontent.com/39261/90840231-71298e80-e327-11ea-97c5-68dc264c7601.png

$ docker run  -v '/mnt/backup/web:/data' -it nikisweeting/archivebox:0.4.14 version
ArchiveBox v0.4.14

[i] Dependency versions:
 √  PYTHON_BINARY          /usr/local/bin/python                                                        v3.8.5          valid 
 √  DJANGO_BINARY          /usr/local/lib/python3.8/site-packages/django/bin/django-admin.py            v3.0.8          valid 
 √  CURL_BINARY            /usr/bin/curl                                                                v7.64.0         valid 
 √  WGET_BINARY            /usr/bin/wget                                                                v1.20.1         valid 
 √  SINGLEFILE_BINARY      /node/node_modules/.bin/single-file                                          v0.1.0          valid 
 √  READABILITY_BINARY     /node/node_modules/.bin/readability-extractor                                v0.1.0          valid 
 √  GIT_BINARY             /usr/bin/git                                                                 v2.20.1         valid 
 √  YOUTUBEDL_BINARY       /usr/local/bin/youtube-dl                                                    v2020.07.28     valid 
 √  CHROME_BINARY          /usr/bin/chromium                                                            v83.0.4103.116  valid 

[i] Code locations:
 √  REPO_DIR               /app                                                                         25 files        valid 
 √  PYTHON_DIR             /app/archivebox                                                              19 files        valid 
 √  TEMPLATES_DIR          /app/archivebox/themes/legacy                                                6 files         valid 

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR                                                                                -               disabled 
 -  COOKIES_FILE                                                                                        -               disabled 

[i] Data locations:
 √  OUTPUT_DIR             /data                                                                        11 files        valid 
 √  SOURCES_DIR            /data/sources                                                                45 files        valid 
 √  LOGS_DIR               /data/logs                                                                   0 files         valid 
 √  ARCHIVE_DIR            /data/archive                                                                17 files        valid 
 √  CONFIG_FILE            /data/ArchiveBox.conf                                                        535.0 Bytes     valid 
 √  SQL_INDEX              /data/index.sqlite3                                                          164.0 KB        valid 
 √  JSON_INDEX             /data/index.json                                                             238.6 KB        valid 
 √  HTML_INDEX             /data/index.html                                                             27.6 KB         valid

$ docker run  -v '/mnt/backup/web:/data' -it nikisweeting/archivebox:0.4.14 add 'https://sixcolors.com/post/2020/07/automate-this-how-hot-does-it-feel/'
[i] [2020-08-20 23:51:28] ArchiveBox v0.4.14: archivebox add https://sixcolors.com/post/2020/07/automate-this-how-hot-does-it-feel/
    > /data

[+] [2020-08-20 23:51:31] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597967491-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-08-20 23:51:31] Writing 17 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html

[▶] [2020-08-20 23:51:31] Collecting content for 1 Snapshots in archive...

[+] [2020-08-20 23:51:31] "sixcolors.com/post/2020/07/automate-this-how-hot-does-it-feel"
    https://sixcolors.com/post/2020/07/automate-this-how-hot-does-it-feel/
    > ./archive/1597967491
      > title
      > favicon
      > wget
      > singlefile
      > pdf
      > screenshot
      > dom
      > readability
      X git
      > media
      > archive_org

[√] [2020-08-20 23:51:54] Update of 1 pages complete (23.41 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

    Hint: To view your archive index, open:
        /data/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-20 23:51:55] Writing 17 links to main index...
    √ /data/index.sqlite3
    √ /data/index.json
    √ /data/index.html


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@rfletcher commented on GitHub (Aug 21, 2020):

> I would have chalked this up to a macOS/Safari bug at this point, except that Quick Look is the only place I've seen this file rendered correctly. Multiple browsers show something other than the original text.

I got curious and installed more browsers. All running their latest versions on macOS Catalina. (✅ = rendered correctly, ❌ = incorrectly)

❌ Chrome
❌ Safari
✅ Quick Look (macOS Finder)
✅ Firefox
✅ Edge
✅ Brave

I don't know if there's anything you can do to the HTML file to make it render more consistently, but I'm not so sure there's an ArchiveBox bug here, exactly.

Here's the actual HTML file I used for testing: output.html.zip (https://github.com/pirate/ArchiveBox/files/5106255/output.html.zip)


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@MartinMSPedersen commented on GitHub (Sep 3, 2020):

Sorry for the delay.
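Here are some screenshots that I think are useful.

First, a version from archive.org where the Unicode characters are rendered correctly.

Screenshot (archive_org_version): https://user-images.githubusercontent.com/1326261/92091048-7af3cd00-edd0-11ea-8440-7c19adee4a02.png

And the version from ArchiveBox viewed in Firefox.

Screenshot (local_version_firefox): https://user-images.githubusercontent.com/1326261/92091051-7b8c6380-edd0-11ea-8e96-30aeb1103521.png

Here we can see that Firefox believes the HTML is encoded in windows-1252, which is not correct.
The result is the same in Google Chrome.

Screenshot (local_version_source): https://user-images.githubusercontent.com/1326261/92091053-7b8c6380-edd0-11ea-9440-bc4e22141b21.png

If I choose the SingleFile version, it is correctly detected as UTF-8.

Screenshot (single_file_version): https://user-images.githubusercontent.com/1326261/92091055-7c24fa00-edd0-11ea-901a-5cd6fcd999a1.png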
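
One way to see why a browser falls back to guessing is to check whether the saved file declares its encoding in-band at all. The sketch below is only an illustration, not part of ArchiveBox: `declared_encoding` is a hypothetical helper, and the 1024-byte scan window is an assumption based on how browsers typically prescan the start of a document for a `<meta>` charset declaration.

```python
import re

# Byte-order marks that browsers honour before any other in-band signal.
BOMS = {
    b"\xef\xbb\xbf": "utf-8",
    b"\xff\xfe": "utf-16-le",
    b"\xfe\xff": "utf-16-be",
}

# Matches <meta charset="..."> as well as the charset=... part of the
# older <meta http-equiv="Content-Type" content="..."> form.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def declared_encoding(html_bytes, scan_limit=1024):
    """Return the encoding the document declares in-band, or None.

    Hypothetical helper: scan_limit is an assumed prescan window.
    """
    for bom, name in BOMS.items():
        if html_bytes.startswith(bom):
            return name
    match = META_CHARSET.search(html_bytes[:scan_limit])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

A result of `None` for a saved output.html would be consistent with the screenshots above: with no BOM, no `<meta>` declaration, and no HTTP header, the browser has nothing to go on except its own heuristics.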


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@cdvv7788 commented on GitHub (Sep 10, 2020):

@MartinMSPedersen @rfletcher just to confirm, which extractor's output has the issue? Wget?


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@rfletcher commented on GitHub (Sep 10, 2020):

My current versions:

ArchiveBox 0.4.21
Safari 13.1.2 (latest)
macOS 10.15.6 (latest)

In my case it looks like these outputs are using the wrong encoding for the sixcolors.com URL (as viewed in Safari on macOS):

❌ Wget WARC
❌ Chrome HTML
❌ Readability

These show expected output:

✅ Chrome SingleFile
✅ Archive.org
✅ Original
✅ Chrome PDF
✅ Chrome screenshot

All three of the bad HTML documents show document.characterSet as "windows-1252". The rest show "UTF-8".

I think what might be happening is that the original page has the encoding information set in a response header (my example URL definitely includes content-type: text/html; charset=UTF-8), but when the HTML body is saved locally without headers that explicit encoding information is lost. At that point it's up to the renderer to guess the encoding, and some are getting it wrong.
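
The symptom described here can be reproduced without a browser: take UTF-8 bytes and decode them as windows-1252, which is what a renderer effectively does when it guesses wrong. This is a minimal sketch; `misdecode` is a name made up for illustration.

```python
def misdecode(text, actual="utf-8", guessed="windows-1252"):
    """Encode text as `actual`, then decode the same bytes as `guessed`.

    errors="replace" covers the few byte values windows-1252 leaves undefined.
    """
    return text.encode(actual).decode(guessed, errors="replace")


# The three Danish letters from the original report.
print(misdecode("æøå"))      # Ã¦Ã¸Ã¥

# A typographic apostrophe (U+2019), as in the sixcolors.com example.
print(misdecode("\u2019"))   # â€™
```

Each non-ASCII character occupies two or three bytes in UTF-8, and windows-1252 maps every one of those bytes to its own character, which is why a single letter turns into a short run of unrelated symbols.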


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@cdvv7788 commented on GitHub (Sep 10, 2020):

Yes, that is our guess too. Thanks for the information!


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@MartinMSPedersen commented on GitHub (Sep 19, 2020):

Maybe I should close this issue now?


kerem commented

2026-03-01 17:53:48 +03:00

Author

Owner

@pirate commented on GitHub (Sep 22, 2020):

@MartinMSPedersen no, we're still thinking about how to solve this by either storing and replaying headers or converting the encoding on-disk to UTF-8.

We're currently stuck on reproducing the issue reliably, as it only happens when visiting the pages directly, but not when they're iframed. Our suspicion is that this is a subtle behavior of Chrome's automatic encoding detection, and our solution will involve nudging Chrome in the right direction or finding out why it autodetects differently depending on whether the content is iframed.
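
The second option mentioned above, converting the encoding on disk to UTF-8, could look roughly like the following. This is a sketch under stated assumptions rather than ArchiveBox's actual fix: `charset_from_content_type` and `normalize_to_utf8` are hypothetical names, and the sketch assumes the original `Content-Type` header value was kept at archive time.

```python
import re

CHARSET_IN_HEADER = re.compile(r"charset\s*=\s*[\"']?([^\s;\"']+)", re.IGNORECASE)
# Matches <head> or <head ...attrs>, but not <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
META_TAG = '<meta charset="utf-8">'


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    match = CHARSET_IN_HEADER.search(content_type or "")
    return match.group(1).lower() if match else default


def normalize_to_utf8(body, content_type):
    """Hypothetical helper: decode using the header's charset, then
    re-encode as UTF-8 with an explicit <meta charset> declaration."""
    try:
        text = body.decode(charset_from_content_type(content_type), errors="replace")
    except LookupError:
        # Unknown charset label: fall back to UTF-8 (an assumption).
        text = body.decode("utf-8", errors="replace")
    if META_TAG not in text.lower():
        match = HEAD_OPEN.search(text)
        if match:
            # Put the declaration first in <head> so it is seen early.
            text = text[: match.end()] + META_TAG + text[match.end():]
    return text.encode("utf-8")
```

The sketch leaves documents without a `<head>` element untouched apart from transcoding, and it does not rewrite a conflicting `<meta>` declaration already present in the page; a real implementation would need to handle both cases.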

