[GH-ISSUE #52] Archive Method: Wget: Detect and fix downloaded file encodings to utf-8 #3056

Closed
opened 2026-03-14 20:47:49 +03:00 by kerem · 4 comments

Originally created by @pirate on GitHub (Nov 3, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/52

Best guess for encoding:

```bash
$ file -bi archive/1509683331/blog.wesleyac.com/posts/two-interview-questions.html
application/gzip; charset=binary
```

First un-gzip any gzipped files:

```bash
gunzip < archive/1509683331/blog.wesleyac.com/posts/two-interview-questions.html > archive/1509683331/blog.wesleyac.com/posts/two-interview-questions.decoded.html
```

Then re-detect encoding and normalize to UTF-8:

```bash
iconv -f ISO-8859-1 -t UTF-8 archive/1509683331/blog.wesleyac.com/posts/two-interview-questions.html > archive/1509683331/blog.wesleyac.com/posts/two-interview-questions.decoded.html
```
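The steps described in this issue (un-gzip, detect the encoding, convert to UTF-8) could be chained roughly as follows. This is a minimal sketch, not ArchiveBox's implementation: the function name `normalize_to_utf8` and its two file arguments are hypothetical, and "try strict UTF-8 first, otherwise assume ISO-8859-1" is an assumption standing in for real encoding detection. The conversion reads the decompressed copy, so gzipped bytes are never handed to `iconv`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# normalize_to_utf8 SRC DEST  (hypothetical helper, for illustration only)
# Writes a UTF-8 copy of SRC to DEST, un-gzipping SRC first if needed.
normalize_to_utf8() {
    local src="$1" dest="$2" tmp
    tmp="$(mktemp)"

    # Step 1: un-gzip if gzip accepts the stream, otherwise copy it unchanged.
    if gzip -t < "$src" 2>/dev/null; then
        gunzip < "$src" > "$tmp"
    else
        cp "$src" "$tmp"
    fi

    # Step 2: a strict UTF-8 to UTF-8 pass fails on invalid byte sequences,
    # which works as a crude "is this already UTF-8?" check.
    if ! iconv -f UTF-8 -t UTF-8 "$tmp" > "$dest" 2>/dev/null; then
        # Step 3 (assumption): treat anything else as ISO-8859-1.
        iconv -f ISO-8859-1 -t UTF-8 "$tmp" > "$dest"
    fi

    rm -f "$tmp"
}
```

It might be called as `normalize_to_utf8 page.html page.decoded.html` (both file names are placeholders). ISO-8859-1 accepts any byte sequence, so the fallback never fails, but it will silently mis-decode pages that use some other legacy charset. A charset declared in the HTML itself is also left untouched by this sketch.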

@Offirmo commented on GitHub (Mar 19, 2019):

Real case: I archived http://catb.org/jargon/html/index.html

Original:
![Screen Shot 2019-03-19 at 16 57 01](https://user-images.githubusercontent.com/603503/54584022-381e2800-4a6a-11e9-990f-9d0da8848033.png)

Archived:
![Screen Shot 2019-03-19 at 16 57 08](https://user-images.githubusercontent.com/603503/54584026-3d7b7280-4a6a-11e9-9828-db1aa2a212fa.png)

too bad...

@pirate commented on GitHub (Apr 16, 2019):

@Offirmo don't worry, the byte-for-byte data from the server is saved correctly in the WARC. Even though it's displaying incorrectly now, as we add encoding fixes later on it will update and fix older previously mangled/badly-decoded archives to display correctly.
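
Why keeping the raw bytes matters can be shown with a small sketch. It does not read a WARC; the temporary file simply stands in for the untouched server response, and the sample text is made up for illustration.

```bash
# UTF-8 bytes for "café", standing in for the byte-for-byte server response.
raw="$(mktemp)"
printf 'caf\303\251' > "$raw"

# Decoding with the wrong source charset produces mojibake ("cafÃ©")...
mangled="$(iconv -f ISO-8859-1 -t UTF-8 "$raw")"

# ...but the stored bytes are unchanged, so decoding them again with the
# correct charset still yields "café".
fixed="$(iconv -f UTF-8 -t UTF-8 "$raw")"

printf '%s\n%s\n' "$mangled" "$fixed"
```

A bad decode only affects the derived copy, so it can be redone at any time as long as the original bytes are kept.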


@cdvv7788 commented on GitHub (Jul 16, 2020):

@Offirmo I just tried with the django branch: `archivebox add http://catb.org/jargon/html/index.html` and it seems to be working correctly now.
@pirate I guess this has been fixed since. Can you please confirm and close the issue if that is the case?

@pirate commented on GitHub (Jul 16, 2020):

@Offirmo if you see any further encoding issues on the latest `django` version feel free to comment back here and I can reopen this ticket.