[GH-ISSUE #864] Bug: URLs with parentheses are mishandled by archivebox add #2043

Open
opened 2026-03-01 17:56:02 +03:00 by kerem · 3 comments

Originally created by @WesleyAC on GitHub (Sep 28, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/864

Describe the bug

URLs like `https://en.wikipedia.org/wiki/APL_(programming_language)` are silently truncated to `https://en.wikipedia.org/wiki/APL_` when added.

Steps to reproduce

  1. `archivebox init`
  2. `echo "https://en.wikipedia.org/wiki/APL_(programming_language)" | archivebox add` OR `archivebox add "https://en.wikipedia.org/wiki/APL_(programming_language)"`
  3. `archivebox list` now shows `https://en.wikipedia.org/wiki/APL_ "APL - Wikipedia"`, which is incorrect.
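
For illustration, the truncation in step 3 is what one would expect from any extractor that pulls URLs out of free text with a regex that treats parentheses as delimiters. The sketch below is an assumption about the general mechanism, not ArchiveBox's actual code; `NAIVE_URL_RE` and `extract_urls` are hypothetical names.

```python
import re

# Hypothetical pattern, NOT taken from ArchiveBox: a URL is assumed to end at
# whitespace, quotes, angle brackets, square brackets, or parentheses.
NAIVE_URL_RE = re.compile(r'https?://[^\s<>"\'\[\]()]+')


def extract_urls(text: str) -> list[str]:
    """Return every substring of `text` that the naive pattern matches."""
    return NAIVE_URL_RE.findall(text)


found = extract_urls("https://en.wikipedia.org/wiki/APL_(programming_language)")
print(found)  # ['https://en.wikipedia.org/wiki/APL_']
```

Excluding parentheses is a common choice for free-text extraction, because URLs embedded in prose or markdown are often wrapped in them, but it cuts off URLs whose path legitimately contains `(` and `)`.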

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.10.66-x86_64-with-glibc2.33 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /nix/store/xfqwr74qli0fm186dj5006960mn5pw38-archivebox-0.6.2/bin/archivebox 
 √  PYTHON_BINARY         v3.9.6          valid     /nix/store/i1m8r7mv8h47wr850cdsxksy22lv6gsz-python3-3.9.6/bin/python3.9     
 √  DJANGO_BINARY         v3.1.7          valid     /nix/store/889dn8m5c562vy949frmxdna4kxfm8rf-python3.9-Django-3.1.7/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.76.1         valid     /nix/store/blaac45yvgljsy15jdxgvxxqs6w5yhqj-curl-7.76.1-bin/bin/curl        
 -  WGET_BINARY           -               disabled  /run/current-system/sw/bin/wget                                             
 √  NODE_BINARY           v14.17.6        valid     /nix/store/zqgmd79n5p0mdaw4sbvkv7gvrmks76a2-nodejs-14.17.6/bin/node         
 √  SINGLEFILE_BINARY     v0.3.31         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.3          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 -  GIT_BINARY            -               disabled  /run/current-system/sw/bin/git                                              
 -  YOUTUBEDL_BINARY      -               disabled  /nix/store/vmfzdc3wrnhhklk8fm5zrz342vp9kwd4-python3.9-youtube-dl-2021.06.06/bin/youtube-dl
 √  CHROME_BINARY         v93.0.4577.82   valid     /run/current-system/sw/bin/chromium-browser                                 
 √  RIPGREP_BINARY        v12.1.1         valid     /run/current-system/sw/bin/rg                                               

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /nix/store/xfqwr74qli0fm186dj5006960mn5pw38-archivebox-0.6.2/lib/python3.9/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /nix/store/xfqwr74qli0fm186dj5006960mn5pw38-archivebox-0.6.2/lib/python3.9/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /home/wesleyac/code/notebook/data/archivebox                                
 √  SOURCES_DIR           204 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           157 files       valid     ./archive                                                                   
 √  CONFIG_FILE           246.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             1.4 MB          valid     ./index.sqlite3                                                             

@WesleyAC commented on GitHub (Sep 28, 2021):

Looks like this is related to #235 / #287, but since those are closed, I think it makes sense to file this.


@WesleyAC commented on GitHub (Sep 28, 2021):

Oh, I see: I need to manually give it the `--parser` option. Could that be mentioned somewhere more visible in the docs? Maybe in the input formats section of the readme (https://github.com/ArchiveBox/ArchiveBox#input-formats)?

@WesleyAC commented on GitHub (Sep 28, 2021):

Specifically, it'd be great to have an example like:

archivebox add --parser url_list < list_of_urls.txt
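
As a rough sketch of why a one-URL-per-line parser avoids the truncation: if each non-empty line is taken as a whole URL, nothing ever has to guess where the URL ends. This is an assumption about how such a parser could work, not ArchiveBox's `url_list` implementation; `parse_url_list` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def parse_url_list(text: str) -> list[str]:
    """Treat every non-empty line as one complete URL (hypothetical helper)."""
    urls = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parts = urlsplit(candidate)
        # Keep only lines that look like absolute http(s) URLs.
        if parts.scheme in ("http", "https") and parts.netloc:
            urls.append(candidate)
    return urls


sample = (
    "https://en.wikipedia.org/wiki/APL_(programming_language)\n"
    "\n"
    "https://example.com/a\n"
)
urls = parse_url_list(sample)
print(urls)
```

Here the parentheses survive because the line boundary, not a character class, decides where each URL stops.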