[GH-ISSUE #1657] Feature Request: More robust export_browser_history.sh #2501

Closed
opened 2026-03-01 17:59:28 +03:00 by kerem · 4 comments
Owner

Originally created by @pcrockett on GitHub (Feb 16, 2025).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1657

Originally assigned to: @pirate on GitHub.

What type of suggestion are you making?

Proposing a new feature

What is the problem that your feature request solves?

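Looking at available sources (https://github.com/ArchiveBox/ArchiveBox/wiki/Quickstart#2-get-your-list-of-urls-to-archive), archiving browser history requires running export_browser_history.sh (https://github.com/ArchiveBox/ArchiveBox/blob/dev/bin/export_browser_history.sh).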

However I see a few issues:

  • It looks like this was written for macOS only. Linux users have to figure out how to use the script manually.
  • There's a sqlite syntax error for the Firefox export.
  • The script fails silently. Depending on the error it will just generate an empty file, do nothing, etc. and may generate no helpful output.
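
As a rough illustration of the first and third points, a script can branch on the kernel name and complain on stderr instead of silently doing nothing. This is only a sketch: the function name `firefox_profiles_dir` is hypothetical and not taken from the actual script, and the paths are the usual default Firefox profile locations on macOS and Linux, which may differ for Snap, Flatpak, or custom installs.

```shell
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper (not from export_browser_history.sh): print the default
# Firefox profiles directory for a kernel name as reported by `uname -s`.
firefox_profiles_dir() {
    local os="${1:-$(uname -s)}"
    case "$os" in
        Darwin) printf '%s\n' "$HOME/Library/Application Support/Firefox/Profiles" ;;
        Linux)  printf '%s\n' "$HOME/.mozilla/firefox" ;;
        *)
            # Fail loudly rather than producing an empty export.
            echo "error: unsupported OS '$os'; pass the database path explicitly" >&2
            return 1
            ;;
    esac
}
```

Calling `firefox_profiles_dir` with no argument uses the current OS; an unknown OS returns a non-zero status, which `set -eo pipefail` turns into an immediate, visible failure.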

What is your proposed solution?

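I'm a bit of a Bash nerd and would love to make this work with Linux and Firefox at least. I've already started here: https://github.com/ArchiveBox/ArchiveBox/compare/dev...pcrockett:ArchiveBox:fix/export-browser-history?expand=1. Is this kind of contribution something you would take?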

Side notes:

  • This branch seeks to fix all the issues I've found so far. I have split the commits up in a logical way as well, so they're easy to review one-by-one.
  • I do not have a mac to test with, so you will definitely want to test these changes on a mac before merging.
  • I am only considering installing Chromium to get that working on Linux. Not sure if I will yet.

What hacks or alternative solutions have you tried to solve the problem?

Pass the full file name to the script after the --firefox argument. But that still fails with a sqlite syntax error.

Share the entire output of the archivebox version command for the current version you are using.

0.7.3
ArchiveBox v0.7.3 COMMIT_HASH=069aabc BUILD_TIME=2024-12-15 09:54:03 1734256443
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.13.2-arch1-1-x86_64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=911:911 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.11        valid     /usr/local/bin/python3.11                             
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py           
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.3          valid     /usr/local/bin/archivebox                             

 √  CURL_BINARY           v8.10.1         valid     /usr/bin/curl                                         
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                         
 √  NODE_BINARY           v20.18.1        valid     /usr/bin/node                                         
 √  SINGLEFILE_BINARY     v1.1.54         valid     /app/node_modules/single-file-cli/single-file         
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /app/node_modules/@postlight/parser/cli.js            
 √  GIT_BINARY            v2.39.5         valid     /usr/bin/git                                          
 √  YOUTUBEDL_BINARY      v2024.12.13     valid     /usr/local/bin/yt-dlp                                 
 √  CHROME_BINARY         v131.0.6778.33  valid     /usr/bin/chromium-browser                             
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                       
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                             
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                  

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                  
 -  COOKIES_FILE          -               disabled  None                                                  

[i] Data locations:
 √  OUTPUT_DIR            5 files @       valid     /data                                                 
 √  SOURCES_DIR           5 files         valid     ./sources                                             
 √  LOGS_DIR              2 files         valid     ./logs                                                
 √  ARCHIVE_DIR           4 files         valid     ./archive                                             
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                     
 √  SQL_INDEX             244.0 KB        valid     ./index.sqlite3

This is on the latest dev branch. The last time this script was touched was in github.com/ArchiveBox/ArchiveBox@aa5533b80f

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [x] It would be nice to have eventually
  • [x] I'm willing to start a PR to develop this myself
  • [ ] I have donated money to go towards fixing this issue

Mini Survey

  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up
  • [ ] I would pay $10/mo for a hosted version of ArchiveBox if it had this feature
kerem closed this issue 2026-03-01 17:59:28 +03:00

@pirate commented on GitHub (Feb 17, 2025):

Before writing any new code, can you try reverting that PR and seeing if https://github.com/ArchiveBox/ArchiveBox/pull/1152/files

Also you should check out the latest ArchiveBox browser extension PR, it adds support for importing from browser history through the extension UI now: https://github.com/ArchiveBox/archivebox-browser-extension/pull/31


@pcrockett commented on GitHub (Feb 17, 2025):

My code is already based on the commit that you linked. That commit fixed one sqlite syntax error, but left another syntax error above it (should be SELECT '[' instead of SELECT \"[\").

The first syntax error probably wasn't caught because the script wasn't using set -eo pipefail, which is another thing my implementation adds.
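
To illustrate why the missing `set -eo pipefail` could hide such a failure, here is a minimal sketch. It assumes the failing command sits in a pipeline and is followed by further commands. `false` stands in for the failing `sqlite3` call, and the function names are hypothetical.

```shell
#!/usr/bin/env bash

# Stand-in for the export step: `false` plays the role of a failing sqlite3
# call whose output is piped onward. Purely illustrative.
run_without_checks() {
    # The pipeline's status is that of `cat` (0), so the script carries on.
    bash -c 'false | cat > /dev/null; echo "export finished"'
}

run_with_checks() {
    # pipefail makes the pipeline report the failure; -e aborts the script.
    bash -c 'set -eo pipefail; false | cat > /dev/null; echo "export finished"'
}

run_without_checks                                # prints "export finished"
run_with_checks || echo "aborted with status $?"  # prints "aborted with status 1"
```

Without the options, the run reports success even though the first command in the pipeline failed. With them, it stops before claiming to be finished.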

I will indeed check out that browser extension, thanks.


UPDATE: Checked out the extension. I plan to use it going forward, but this script is more useful to those who want to retroactively import their browser history into ArchiveBox.


@pirate commented on GitHub (Feb 18, 2025):

I opened a PR to track your fixes: #1661. Can you check the diff and let me know if it looks ready for review/merge? Thanks!


@pcrockett commented on GitHub (Feb 19, 2025):

Ready for review, with a few comments:

  • Probably want to change the PR title.
  • I got Chromium working on Linux as well as Firefox.
  • I included support for proprietary Chrome as well, and I'm 90% sure it works, but I didn't test the Chrome part because I didn't want to install it... 😬
  • You should definitely test this on a mac. Don't skip that; it's totally possible I broke something for macOS.

There are probably other things that could be improved, but this is a good step in the right direction and we don't want to overengineer something that's probably a very minor part of the project.
