[GH-ISSUE #32] Can't run archive.py due to UTF-8 encoding issues #3042

Closed
opened 2026-03-14 20:43:57 +03:00 by kerem · 15 comments

Originally created by @movanet on GitHub (Jul 4, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/32

Any ideas why?

[2017-07-04 05:45:26] Starting archive from /root/bookmark-archiver/downloads/ril_export.html export file.
Traceback (most recent call last):
  File "./archive.py", line 521, in <module>
    create_archive(export_file, service=export_type, resume=resume_from)
  File "./archive.py", line 487, in create_archive
    dump_index(links, service)
  File "./archive.py", line 398, in dump_index
    f.write(index_html.format(*template_vars))
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f4c2' in position 3530: ordinal not in range(128)

@pirate commented on GitHub (Jul 4, 2017):

Try pulling and running it again. I've been refactoring over the last couple hours, so you probably pulled a broken version, sorry!

@movanet commented on GitHub (Jul 4, 2017):

Thanks. It still fails:

[+] [2017-07-04 07:23:07] Starting archive from /root/bookmark-archiver/downloads/ril_export.html export file.
Traceback (most recent call last):
  File "./archive.py", line 88, in <module>
    create_archive(export_file, service=export_type, resume=resume_from)
  File "./archive.py", line 55, in create_archive
    dump_index(links, service)
  File "/root/bookmark-archiver/index.py", line 16, in dump_index
    link_html = f.read()
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 393: ordinal not in range(128)

@pirate commented on GitHub (Jul 4, 2017):

Ok sweet, at least it's failing in a different place though, which makes me think it's due to lacking hardcoded encodings.

I just updated all the `open()` calls to manually specify `encoding='utf-8'`. Please pull and try again, lemme know how it goes. What system are you running this on by the way?
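
As an illustration of what passing `encoding='utf-8'` changes, here is a minimal sketch. It is not the project's actual code: the scratch file and the HTML snippet are made up for the demonstration. Without an explicit `encoding`, Python 3.5 picks the default from the locale, which is where the `ascii` codec in the tracebacks comes from.

```python
import os
import tempfile

# Text containing the character from the first traceback (U+1F4C2).
html = "<li>\U0001F4C2 example bookmark</li>"

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "index.html")

# An explicit encoding makes the result independent of the system locale.
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

# Read the raw bytes back to see how the character was stored on disk.
with open(path, "rb") as f:
    raw = f.read()

print(round_tripped == html)
print(raw)
```

In UTF-8 that character is stored as four bytes starting with `0xf0`, which matches the byte the second traceback failed to decode as ASCII.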

@movanet commented on GitHub (Jul 4, 2017):

It started to work, but it got stuck at the Chrome version check. Should I update Chromium?

./archive.py ril_export.html
[+] [2017-07-04 07:33:46] Starting archive from ril_export.html export file.
[*] [2017-07-04 07:33:48] Created archive index with 1699 links.
[*] Checking Dependencies:
/usr/bin/chromium-browser
[X] Chrome version must be 59 or greater for headless PDF and screenshot saving
    See https://github.com/pirate/bookmark-archiver for help.

@movanet commented on GitHub (Jul 4, 2017):

btw this is my chromium version:
Chromium 58.0.3029.110 Built on Ubuntu , running on Ubuntu 16.04

@pirate commented on GitHub (Jul 4, 2017):

Yes, you cannot run Chrome headless unless you have a newer version of chromium or google-chrome. Simply run `apt upgrade chromium-browser` to upgrade.

@movanet commented on GitHub (Jul 4, 2017):

Strange. Perhaps it's not yet available for my Ubuntu?

apt upgrade chromium-browser
Reading package lists... Done
Building dependency tree
Reading state information... Done
chromium-browser is already the newest version (58.0.3029.110-0ubuntu0.16.04.1281).

@movanet commented on GitHub (Jul 4, 2017):

Tried downloading Chromium from https://github.com/scheib/chromium-latest-linux/blob/master/ and modifying the env, but it doesn't work...

env CHROME_BINARY=/root/bookmark-archiver/chromium-latest-linux/484087/chrome-linux/chrome ./archive.py ril_export.html
[+] [2017-07-04 08:03:42] Starting archive from ril_export.html export file.
[*] [2017-07-04 08:03:44] Created archive index with 1699 links.
[*] Checking Dependencies:
/root/bookmark-archiver/chromium-latest-linux/484087/chrome-linux/chrome
/root/bookmark-archiver/chromium-latest-linux/484087/chrome-linux/chrome: error while loading shared libraries: libgtk-3.so.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "./archive.py", line 88, in <module>
    create_archive(export_file, service=export_type, resume=resume_from)
  File "./archive.py", line 64, in create_archive
    check_dependencies()
  File "/root/bookmark-archiver/config.py", line 47, in check_dependencies
    if int(version) < 59:
ValueError: invalid literal for int() with base 10: ''
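
The `ValueError` above occurs because the binary failed to start (missing `libgtk-3.so.0`), so the version string handed to `int()` was empty. A minimal sketch of a more defensive check follows. It is not the code in `config.py`; both function names are hypothetical.

```python
import re
import subprocess


def parse_major_version(version_output):
    """Return the major version found in `--version` output, or None."""
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def chrome_major_version(binary):
    """Run `binary --version`; return None if it cannot run or prints nothing usable."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_major_version(result.stdout.decode("utf-8", errors="replace"))


print(parse_major_version("Chromium 58.0.3029.110 Built on Ubuntu , running on Ubuntu 16.04"))
print(parse_major_version(""))
```

A caller could then treat `None` as "Chrome could not be run" and report that separately from "Chrome is too old".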

@pirate commented on GitHub (Jul 4, 2017):

What is the output of `/root/bookmark-archiver/chromium-latest-linux/484087/chrome-linux/chrome --version`?

@movanet commented on GitHub (Jul 6, 2017):

chromium-browser --version
Chromium 58.0.3029.110 Built on Ubuntu , running on Ubuntu 16.04

@movanet commented on GitHub (Jul 6, 2017):

Installed Google Chrome and it seems to be working, so I guess the problem was with chromium-browser. I am closing this.

/bookmark-archiver# env CHROME_BINARY=/usr/bin/google-chrome ./archive.py ril_export.html

[+] [2017-07-06 10:11:49] Starting archive from ril_export.html export file.
[*] [2017-07-06 10:11:53] Created archive index with 1699 links.
[*] Checking Dependencies:
/usr/bin/google-chrome
/usr/bin/wget
/usr/bin/curl
[+] [1497864202 (2017-06-19 05:23)]

@movanet commented on GitHub (Jul 6, 2017):

sorry, another unicode error:

~/bookmark-archiver# ./archive.py ril_export.html
[*] [2017-07-06 11:03:01] Starting archive from ril_export.html export file.
[+] [2017-07-06 11:03:07] Created archive index with 1699 links.
[*] Checking Dependencies:
/usr/bin/chromium-browser
/usr/bin/wget
/usr/bin/curl
[+] [1497864202 (2017-06-19 05:23)] "Helios4 - Your own private cloud": kobol.io/helios4/
- Downloading full site
0.9% (1/60sec)Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/bookmark-archiver/config.py", line 145, in progress_bar
    seconds,
UnicodeEncodeError: 'ascii' codec can't encode character '\u2588' in position 15: ordinal not in range(128)
wget output:
Converting links in kobol.io/helios4/css/custom.css... nothing to do.
Converting links in kobol.io/helios4/css/owl.carousel.css... 0-1
Converting links in kobol.io/helios4/css/socicon.css... 1-0
Converting links in kobol.io/helios4/css/iconsmind.css... 3-0
Converting links in kobol.io/helios4/css/bootstrap.css... 0-5
Converting links in kobol.io/helios4/css/interface-icons.css... 6-0
Converting links in kobol.io/helios4/css/theme.css... 1-0
Converting links in kobol.io/helios4/css/font-mulilato.css... nothing to do.
Converted links in 9 files in 0.03 seconds.
Run to see full output: cd pocket/archive/1497864202; wget --timestamping --adjust-extension --no-parent --page-requisites --convert-links http://kobol.io/helios4/
Failed: Exception Failed to wget download
- Printing PDF
0.9% (1/60sec)Process Process-2:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/bookmark-archiver/config.py", line 145, in progress_bar
    seconds,
UnicodeEncodeError: 'ascii' codec can't encode character '\u2588' in position 15: ordinal not in range(128)
- Snapping Screenshot
0.9% (1/60sec)Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/bookmark-archiver/config.py", line 145, in progress_bar
    seconds,
UnicodeEncodeError: 'ascii' codec can't encode character '\u2588' in position 15: ordinal not in range(128)
- Submitting to archive.org
0.9% (1/60sec)Process Process-4:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/bookmark-archiver/config.py", line 145, in progress_bar
    seconds,
UnicodeEncodeError: 'ascii' codec can't encode character '\u2588' in position 15: ordinal not in range(128)
- Fetching Favicon
0.9% (1/60sec)Process Process-5:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/bookmark-archiver/config.py", line 145, in progress_bar
    seconds,
UnicodeEncodeError: 'ascii' codec can't encode character '\u2588' in position 15: ordinal not in range(128)
- Creating link info file
[X] Archive creation stopped.
Continue where you left off by running:
./archive.py ril_export.html pocket 1497840833
Traceback (most recent call last):
  File "./archive.py", line 91, in <module>
    create_archive(export_file, service=export_type, resume=resume_from)
  File "./archive.py", line 69, in create_archive
    raise e
  File "./archive.py", line 59, in create_archive
    dump_website(link, service)
  File "/root/bookmark-archiver/fetch.py", line 260, in dump_website
    print('[{green}+{reset}] [{timestamp} ({time})] "{title}": {blue}{base_url}{reset}'.format(**link, **ANSI))
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 120: ordinal not in range(128)

@pirate commented on GitHub (Jul 6, 2017):

Try running the script like this:

env PYTHONIOENCODING=utf-8 python3 archive.py export.html

Also post back with the output of these:

python3 -c "import sys; print(sys.stdout.encoding)"
python3 -c "print('日本語██')"
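
For context, a small sketch of what these checks exercise. It assumes only the standard library and reuses `\u2588`, the progress-bar character from the tracebacks. The first part reproduces the failure without touching the terminal; the second starts a child interpreter with `PYTHONIOENCODING` set and reads back the encoding it reports.

```python
import os
import subprocess
import sys

block = "\u2588"  # FULL BLOCK, the character the progress bar prints

# Encoding to ASCII fails exactly as in the tracebacks above.
try:
    block.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as err:
    ascii_error = err

# Encoding to UTF-8 succeeds and yields three bytes.
utf8_bytes = block.encode("utf-8")

# Start a child interpreter with PYTHONIOENCODING set and ask for its stdout encoding.
env = dict(os.environ, PYTHONIOENCODING="utf-8")
child = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    stdout=subprocess.PIPE,
)
child_encoding = child.stdout.decode("ascii").strip()

print(type(ascii_error).__name__, utf8_bytes, child_encoding)
```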

@pirate commented on GitHub (Jul 6, 2017):

You can also just try pulling and running it again; I added instructions to fix this problem. It's fairly rare for this to still happen in 2017, since most distros default to a UTF-8 locale by now. I'm surprised that you're seeing this issue on Ubuntu 16.04.

if sys.stdout.encoding != 'UTF-8':
    print('[X] Your system is running python3 scripts with a bad locale setting: {} (it should be UTF-8).'.format(sys.stdout.encoding))
    print('    To fix it, add the line "export PYTHONIOENCODING=utf8" to your ~/.bashrc file (without quotes)')
    print('')
    print('    Confirm that it\'s fixed by opening a new shell and running:')
    print('        python3 -c "import sys; print(sys.stdout.encoding)"   # should output UTF-8')
    print('')
    print('    Alternatively, run this script with:')
    print('        env PYTHONIOENCODING=utf8 ./archive.py export.html')
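
One caveat with the check above: it compares against the exact string `'UTF-8'`, while the suggested fix sets `PYTHONIOENCODING=utf8`. Python may report that value back as `utf8`, in which case the strict comparison could still fail; this is an assumption worth verifying on the target interpreter. A sketch of a spelling-insensitive comparison follows; `is_utf8` is a hypothetical helper, not part of the project.

```python
import codecs
import sys


def is_utf8(encoding_name):
    """Return True if the name refers to UTF-8, whatever its spelling or case."""
    if not encoding_name:
        return False
    try:
        # codecs.lookup() resolves aliases such as 'utf8' and 'UTF-8' to one codec.
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        return False


# stdout may be replaced by an object without an encoding, hence getattr().
stdout_is_utf8 = is_utf8(getattr(sys.stdout, "encoding", None))
print(stdout_is_utf8)
```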

@pirate commented on GitHub (Jul 25, 2017):

If you're still having trouble feel free to comment back and I'll re-open this. For now I'm closing this issue due to inactivity.
