[PR #936] Fix bin_version: set LANG=C when calling executables to avoid parsing localized output #2806

Closed
opened 2026-03-01 18:00:48 +03:00 by kerem · 0 comments

Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/936

State: closed
Merged: Yes


Summary

bin_version calls an external executable with --version and parses its output. The title extractor uses the resulting version string to build its user agent string.

Executables may print their output in a localized language, but a user agent string can only contain latin-1 characters, so a localized version string results in this error:

        Extractor failed:
            UnicodeEncodeError 'latin-1' codec can't encode characters in position 201-202: ordinal not in range(256)
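
The same exception can be reproduced outside ArchiveBox by encoding a localized version line as latin-1. This is a minimal sketch: the `ExampleBot/1.0 (...)` user agent format and the `encodes_as_latin1` helper are hypothetical illustrations, not ArchiveBox's actual code.

```python
# Hypothetical user agent built from a localized version line.
# The "ExampleBot/1.0 (...)" format is illustrative only.
version_line = "GNU Wget 1.21,於 linux-gnu 上編譯。"
user_agent = f"ExampleBot/1.0 ({version_line})"


def encodes_as_latin1(value: str) -> bool:
    """Return True if value contains only characters representable in latin-1."""
    try:
        value.encode("latin-1")
        return True
    except UnicodeEncodeError:
        # CJK characters have code points above 255, which latin-1 cannot represent.
        return False


localized_ok = encodes_as_latin1(user_agent)
ascii_ok = encodes_as_latin1("ExampleBot/1.0 (GNU Wget 1.21)")
```

Here `localized_ok` is false because of the Chinese characters in the version line, while the ASCII-only string encodes without error.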

This PR fixes the problem by running the executables with the environment variable LANG=C, so that their --version output is not localized.
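
The approach can be sketched roughly as follows. This is not the PR's actual diff: `c_locale_env` and `version_of` are hypothetical names, the regular expression is an assumed way to pick out a version number, and the Python interpreter stands in for wget so the example runs anywhere.

```python
import os
import re
import subprocess
import sys
from typing import Optional


def c_locale_env() -> dict:
    """Copy the current environment and force LANG=C (hypothetical helper)."""
    env = dict(os.environ)
    env["LANG"] = "C"
    return env


def version_of(binary: str) -> Optional[str]:
    """Run `binary --version` with LANG=C and return the first dotted
    version number on the first output line, or None if there is none."""
    result = subprocess.run(
        [binary, "--version"],
        capture_output=True,
        text=True,
        env=c_locale_env(),
        timeout=10,
    )
    # Some tools print their version to stderr, so fall back to it.
    lines = (result.stdout or result.stderr).strip().splitlines()
    if not lines:
        return None
    match = re.search(r"\d+(?:\.\d+)+", lines[0])
    return match.group(0) if match else None


# The running Python interpreter is used here only as a portable stand-in.
python_version = version_of(sys.executable)
```

One caveat that goes beyond what the PR states: on POSIX systems LC_ALL takes precedence over LANG, so if LC_ALL is already set in the environment, a stricter variant might need to override it as well.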

Example of running wget --version when LANG="zh_TW.UTF-8":

GNU Wget 1.21,於 linux-gnu 上編譯。

-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls 
+ntlm +opie +psl +ssl/openssl 

Wgetrc: 
    /etc/wgetrc (系統)
語系: 
    /usr/share/locale 
編譯: 
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc" 
    -DLOCALEDIR="/usr/share/locale" -I. -I../../src -I../lib 
    -I../../lib -Wdate-time -D_FORTIFY_SOURCE=2 -DHAVE_LIBSSL -DNDEBUG 
    -g -O2 -ffile-prefix-map=/build/wget-OM48Vs/wget-1.21=. 
    -fstack-protector-strong -Wformat -Werror=format-security 
    -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -g -Wall 
連結: 
    gcc -DHAVE_LIBSSL -DNDEBUG -g -O2 
    -ffile-prefix-map=/build/wget-OM48Vs/wget-1.21=. 
    -fstack-protector-strong -Wformat -Werror=format-security 
    -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -g -Wall -Wl,-Bsymbolic-functions 
    -Wl,-z,relro -Wl,-z,now -lpcre2-8 -luuid -lidn2 -lssl -lcrypto -lz 
    -lpsl ftp-opie.o openssl.o http-ntlm.o ../lib/libgnu.a 

版權所有 (C) 2015 自由軟體基金會
GPLv3+ 授權:GNU GPL 第三版或更新版本
<http://www.gnu.org/licenses/gpl.html>。
此為自由軟體:您能自由修改與重散布它。
在法律允許的範圍內沒有任何擔保。

最初由 Hrvoje Niksic <hniksic@xemacs.org> 編寫。
請將漏洞報告和問題寄到 <bug-wget@gnu.org>。

Related issues

Changes these areas

  • [x] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk