[GH-ISSUE #1545] Bug: netscape_html fails due to parsing sources-*-import.txt instead of bookmarks.html #2428

Open
opened 2026-03-01 17:58:58 +03:00 by kerem · 4 comments
Owner

Originally created by @hrdl-github on GitHub (Oct 17, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1545

Describe the bug

The netscape_html parser fails because it parses the sources-*-import.txt file, which contains the bookmark file's filename, instead of the bookmark file itself.
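
To make the failure mode concrete, here is a minimal sketch of what happens when a bookmark parser is handed a text file holding a path rather than the exported HTML. The regular expression, the `extract_links` helper, and the sample inputs are all assumptions made for illustration; this is not ArchiveBox's actual parser code.

```python
import re

# Simplified pattern for Netscape bookmark entries.
# Assumption for illustration only, not ArchiveBox's real netscape_html parser.
BOOKMARK_LINK = re.compile(
    r'<A\s[^>]*HREF="([^"]+)"[^>]*>([^<]*)</A>',
    re.IGNORECASE,
)


def extract_links(text: str) -> list[tuple[str, str]]:
    """Return (url, title) pairs for every bookmark anchor found in text."""
    return [(m.group(1), m.group(2)) for m in BOOKMARK_LINK.finditer(text)]


# What the parser should see: the exported bookmark file's contents.
bookmark_export = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<DL><p>
    <DT><A HREF="https://example.com/" ADD_DATE="1729175801">Example</A>
</DL><p>
"""

# What it sees in the reported bug: a sources file holding the filename.
sources_file_contents = "/tmp/firefox_bookmarks_export.html\n"

links_from_export = extract_links(bookmark_export)
links_from_sources = extract_links(sources_file_contents)

print(links_from_export)   # one (url, title) pair
print(links_from_sources)  # empty: a bare path contains no bookmark anchors
```

A path string has no anchor tags, so any parser that expects bookmark markup finds nothing to import.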

Steps to reproduce

  1. Export bookmarks, e.g. using LibreWolf or Firefox.
  2. Run archivebox add /tmp/firefox_bookmarks_export.html, optionally with --parser netscape_html.

ArchiveBox version

0.8.5rc44
ArchiveBox v0.8.5rc44 COMMIT_HASH=unknown BUILD_TIME=2024-10-17 16:36:41 1729175801
IN_DOCKER=False IN_QEMU=False ARCH=x86_64 OS=Linux 
PLATFORM=Linux-6.11.CENSORED-with-glibc2.40 PYTHON=Cpython (venv)
EUID=1000:1000 UID=1000:1000 PUID=1000:1000 FS_UID=1000:1000 FS_PERMS=644 FS_ATOMIC=True
FS_REMOTE=False
DEBUG=False IS_TTY=False SUDO=False ID=CENSORED SEARCH_BACKEND=ripgrep 
LDAP=False

 Binary Dependencies:
 √  pip                   24.2.0       lib_pip    ./lib/x86_64-linux/pip/venv/bin/pip
 √  pipx                  1.7.1        lib_pip    ./lib/x86_64-linux/pip/venv/bin/pipx
 √  python                3.12.7       venv_pip   ~/python/bin/python
 √  sqlite                2.6.0        venv_pip   ~/python/lib/python3.12/site-packages/django/db/backends/sqlite3/base.py
 √  django                5.1.2        venv_pip   ~/python/lib/python3.12/site-packages/django/__init__.py
 √  node                  22.9.0       env        /usr/bin/node
 √  npm                   10.9.0       env        /usr/bin/npm
 √  npx                   10.9.0       env        /usr/bin/npx
 √  playwright            1.47.1       sys_pip    /usr/bin/playwright
 √  puppeteer             23.6.0       lib_npm    ./lib/x86_64-linux/npm/node_modules/.bin/puppeteer
 √  rg                    14.1.1       env        /usr/bin/rg
 √  chrome                129.0.6668   env        /usr/bin/chromium
 √  curl                  8.10.1       env        /usr/bin/curl
 √  git                   2.47.0       env        /usr/bin/git
 √  postlight-parser      2.2.3        sys_npm    /usr/bin/postlight-parser
 √  readability-extractor 0.0.11       lib_npm    /usr/bin/readability-extractor
 √  single-file           2.0.64       lib_npm    /usr/bin/single-file
 √  wget                  1.24.5       env        /usr/bin/wget
 √  yt-dlp                2024.10.7    env        ~/python/bin/yt-dlp
 √  ffmpeg                7.0.2        env        /usr/bin/ffmpeg

 Package Managers:
 √  sys_pip     /usr/bin/pip                                         UID=1000 PATH=/usr…
 -  pipx        not available                                        UID=1000 PATH=
 √  venv_pip    ~/python/abox/bin/pip                                UID=1000 PATH=~/py…
 √  lib_pip     ./lib/x86_64-linux/pip/venv/bin/pip                  UID=1000 PATH=./li…
 √  sys_npm     /usr/bin/npm                                         UID=1000 PATH=/usr…
 √  lib_npm     /usr/bin/npm                                         UID=1000 PATH=./li…
 √  playwright  /usr/bin/playwright                                  UID=0    PATH=./li…
 √  puppeteer   /usr/bin/npx                                         UID=1000 PATH=./li…
 √  env         /usr/bin/which                                       UID=1000 PATH=~/py…
 -  brew        not available                                        UID=1000 PATH=
 -  apt         not available                                        UID=0    PATH=

 Code locations:
 √  PACKAGE_DIR           31 files        valid     ~/python/lib/python3.12/site-packages/archivebox
 √  TEMPLATES_DIR         4 files         valid     ~/python/lib/python3.12/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  missing         unused    ./user_templates                    
 -  USER_PLUGINS_DIR      missing         unused    ./user_plugins                      
 √  LIB_DIR               4 files         valid     ./lib/x86_64-linux                  

 Data locations:
 √  DATA_DIR              15 files        valid     ~/archive                           
 √  CONFIG_FILE           139.0 Bytes     valid     ./ArchiveBox.conf                   
 √  SQL_INDEX             3.8 MB          valid     ./index.sqlite3                     
 √  QUEUE_DATABASE        92.0 KB         valid     ./queue.sqlite3                     
 √  ARCHIVE_DIR           18 files        valid     ./archive                           
 √  SOURCES_DIR           16 files        valid     ./sources                           
 √  PERSONAS_DIR          1 files         valid     ./personas                          
 √  LOGS_DIR              5 files         valid     ./logs                              
 √  TMP_DIR               4 files         valid     ./tmp/44fbc4c1                      

@pirate commented on GitHub (Oct 17, 2024):

Thanks for reporting. I'll get it fixed.


@pirate commented on GitHub (Oct 21, 2024):

I'm considering removing support for directly passing file URLs as CLI args (see https://github.com/ArchiveBox/ArchiveBox/issues/1363#issuecomment-2050736969), as it's led to lots of parser ambiguity in the past. I'm leaning towards preferring piping instead.

Are you able to do this instead:

archivebox add --parser=netscape_html < /tmp/firefox_bookmarks_export.html
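
For anyone scripting imports, the same stdin redirection can be sketched in Python. The command line is the one from this comment; the helper names `build_add_command` and `add_from_file`, and the optional `command` override, are hypothetical and exist only for this example.

```python
import subprocess


def build_add_command(parser: str = "netscape_html") -> list[str]:
    """Build the command shown in the thread: archivebox add --parser=<parser>."""
    return ["archivebox", "add", f"--parser={parser}"]


def add_from_file(path, command=None) -> subprocess.CompletedProcess:
    """Run the command with the file's contents on stdin, like `< path` in a shell.

    `command` is a hypothetical override so the helper can be exercised
    without ArchiveBox installed; by default the archivebox command is used.
    """
    cmd = command if command is not None else build_add_command()
    with open(path, "rb") as handle:
        # The open file handle becomes the child's stdin, so no path is
        # passed as an argument and no shell is involved.
        return subprocess.run(cmd, stdin=handle, capture_output=True, check=True)
```

Calling `add_from_file("/tmp/firefox_bookmarks_export.html")` from inside an ArchiveBox data directory would then be equivalent to the shell command above, assuming `archivebox` is on the PATH.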

@hrdl-github commented on GitHub (Oct 22, 2024):

Yes, redirecting stdin works fine.


@pirate commented on GitHub (Oct 22, 2024):

Ok, I will document the change in the upcoming release. Please use stdin redirection moving forward for any local files. Args will only support URLs in the next release.
