[GH-ISSUE #257] Error if the character code is Shift_JIS #3201

Closed
opened 2026-03-14 21:35:47 +03:00 by kerem · 4 comments

Originally created by @matoken on GitHub (Aug 16, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/257

Error if the character code is Shift_JIS

An error occurs on some Japanese pages.

Steps to reproduce

The error occurs on the following pages:

https://www.mbc.co.jp/news/
http://167.86.112.42/hello_sjis.html

Screenshots or log output

$ ./archive https://www.mbc.co.jp/news/
[*] [2019-08-16 18:53:59] Downloading https://www.mbc.co.jp/news/
[!] Failed to download https://www.mbc.co.jp/news/

	 'utf-8' codec can't decode byte 0x8e in position 181: invalid start byte

The page that triggers the error appears to be encoded in Shift_JIS (an older Japanese character encoding).

$ curl -s https://www.mbc.co.jp/news/ | grep -i charset=
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
<script type="text/javascript" src="js/scrollsmoothly.js" charset="utf-8"></script>
<link rel="stylesheet" type="text/css" href="/css/mbc_menu_import.css" charset="Shift-JIS">
<SCRIPT language="JavaScript" src="/js/mbcmenu.js" charset="Shift-JIS"></SCRIPT>
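
The grep output shows that the page declares its own encoding, so a downloader that decodes every response as UTF-8 will fail on it. Below is a minimal sketch of charset-aware decoding, assuming the declaration sits within the first kilobyte of the document. The helper names `sniff_meta_charset` and `decode_html` are hypothetical; they are not ArchiveBox functions and this is not necessarily how the problem was later fixed.

```python
import codecs
import re
from typing import Optional

# Matches charset=... both in <meta charset="..."> and in
# <meta http-equiv="content-type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, scan_bytes: int = 1024) -> Optional[str]:
    """Return the first charset label declared near the top of the page, if any."""
    match = _META_CHARSET.search(raw[:scan_bytes])
    return match.group(1).decode("ascii") if match else None


def decode_html(raw: bytes, default: str = "utf-8") -> str:
    """Decode with the declared charset when Python knows it, else fall back."""
    declared = sniff_meta_charset(raw)
    encoding = default
    if declared:
        try:
            codecs.lookup(declared)  # raises LookupError for unknown labels
            encoding = declared
        except LookupError:
            pass
    # errors="replace" keeps the download going even if a few bytes are bad.
    return raw.decode(encoding, errors="replace")
```

For the page above, `sniff_meta_charset` would pick up the `Shift_JIS` label from the first `<meta>` tag. A production version would also consult the HTTP `Content-Type` header, which this sketch ignores.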

The error also occurs with a minimal Shift_JIS page that I created:

$ echo '<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />
</head>
<body>
こんにちは
</body>
</html>' | iconv -f UTF8 -t SJIS > hello_sjis.html
$ ./archive http://167.86.112.42/hello_sjis.html
[*] [2019-08-16 19:02:10] Downloading http://167.86.112.42/hello_sjis.html
[!] Failed to download http://167.86.112.42/hello_sjis.html

     'utf-8' codec can't decode byte 0x82 in position 103: invalid start byte
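
The failure can be reproduced without any network access. The sketch below rebuilds the same test page in memory, encodes it as Shift_JIS, and then attempts a UTF-8 decode. The names `PAGE` and `try_utf8` are made up for this illustration. In Shift_JIS the hiragana こ is the byte pair 0x82 0xB1, and 0x82 is a continuation byte in UTF-8, which is why the decoder reports an invalid start byte.

```python
from typing import Optional

# Same content as hello_sjis.html, including the trailing newline added by echo.
PAGE = (
    "<html>\n"
    "<head>\n"
    '<meta http-equiv="content-type" content="text/html; charset=Shift_JIS" />\n'
    "</head>\n"
    "<body>\n"
    "こんにちは\n"
    "</body>\n"
    "</html>\n"
)

raw = PAGE.encode("shift_jis")


def try_utf8(data: bytes) -> Optional[UnicodeDecodeError]:
    """Return the decode error raised for `data`, or None if it is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc
    return None


err = try_utf8(raw)
# The failing offset is the first byte of こ, matching the log above
# (byte 0x82 in position 103).
print(err)

# Decoding with the declared encoding round-trips cleanly.
print(raw.decode("shift_jis") == PAGE)
```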

Software versions

  • OS: Debian GNU/Linux 10 (buster) amd64
  • ArchiveBox version: e2b054ae7
  • Python version: 3.7.3 (Debian package 3.7.3-1)
  • Chrome version: 73.0.3683.75 (Debian package 73.0.3683.75-1)

@pirate commented on GitHub (Aug 17, 2019):

Ahhh, encoding problems: they're never-ending and really hard to solve 100% correctly for all cases. Unfortunately I can't promise I'll get around to this anytime soon, as there are too many other important issues in the queue. Sorry for the trouble. May I suggest finding a site that lets you proxy a page and convert its character encoding by passing a URL, then archiving that site instead? As a last resort, Google Translate might do the trick.
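
The conversion step behind that workaround can also be sketched locally. The snippet below is only an illustration of re-encoding a page to UTF-8 before it is handed to an archiver; it is not something proposed in this thread. The function name `transcode_to_utf8` is hypothetical, and the sketch assumes the source encoding is already known and that the first `charset=` declaration in the document belongs to the page itself.

```python
import re

# Captures "charset=" (plus an optional opening quote) so only the label is replaced.
_CHARSET_DECL = re.compile(r'(charset\s*=\s*["\']?)[A-Za-z0-9_\-]+', re.IGNORECASE)


def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode a page as UTF-8 and update its first charset declaration."""
    text = raw.decode(source_encoding)
    # Only the first declaration is rewritten: later ones may describe
    # external scripts or stylesheets that are still in the old encoding.
    text = _CHARSET_DECL.sub(r"\g<1>utf-8", text, count=1)
    return text.encode("utf-8")
```

Rewriting the declaration matters because a browser that still reads `charset=Shift_JIS` would misinterpret the converted UTF-8 bytes.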


@cdvv7788 commented on GitHub (Jul 20, 2020):

@matoken Can you please check if the content can be archived now using the django branch? I still see some issue with the title in the index, but the content seems to be archived correctly.


@matoken commented on GitHub (Jul 21, 2020):

I installed the django branch in a new environment and tried it out.
I tried some sites that use Shift_JIS and archiving seems to work.
The title in the index is sometimes shown as a URL and sometimes garbled.

Screenshot (20200722_06:07:36-3864896-edit): https://user-images.githubusercontent.com/582400/88111251-8ad09c80-cbe8-11ea-8fd1-24b609e36682.jpg

@pirate commented on GitHub (Jul 22, 2020):

Try the django branch now; it should be fixed in #378. If you still see any problems, comment back here and I'll reopen the issue.
