[GH-ISSUE #1638] Add option to share cookies in /api/v1/cli/[add | update | schedule] calls #2491

Open
opened 2026-03-01 17:59:24 +03:00 by kerem · 0 comments

Originally created by @datoslabs on GitHub (Jan 20, 2025).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1638

Originally assigned to: @pirate on GitHub.

What type of suggestion are you making?

Modification of existing behavior

What is the problem that your feature request solves?

Hi,

Would it be possible to add an optional "cookies":["string"] field to the JSON request body of the /api/v1/cli/[add | update | schedule] API calls, and allow applicable extractors to load and use the provided cookies? The cookies would be for one-time use only and should not be cached after extraction. This would let ArchiveBox extract pages using "dynamic" cookies supplied by the requestor, and would let the Chrome extension send the current/active tab's cookies when requesting to archive the current page.

Please note that while browser-enabled AI/LLM agents like https://github.com/browser-use/browser-use and https://github.com/unclecode/crawl4ai can automatically recognize and dismiss cookie consent or newsletter signup popups/overlays, I feel that sharing cookies and allowing ArchiveBox extractors to dynamically load cookies from different users would support more use cases when ArchiveBox is shared among a group of users.
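To make the request concrete, here is a minimal sketch of what the request body could look like. The "cookies" field is the proposal in this issue and is not part of the current API. The helper name `build_add_request`, the `urls` key, and the `name=value` cookie string format are assumptions made for illustration only.

```python
import json
from typing import Dict, List, Optional


def build_add_request(urls: List[str], cookies: Optional[List[str]] = None) -> Dict[str, object]:
    """Build a JSON body for a hypothetical POST /api/v1/cli/add call.

    `cookies` is the optional field proposed in this issue; it is omitted
    entirely when no cookies are supplied, so existing clients are unaffected.
    """
    body: Dict[str, object] = {"urls": list(urls)}  # "urls" key is an assumption
    if cookies:
        # Proposed field: one string per cookie, assumed here to be "name=value".
        body["cookies"] = list(cookies)
    return body


if __name__ == "__main__":
    example = build_add_request(
        ["https://example.com/article"],
        ["consent=accepted", "newsletter_dismissed=1"],
    )
    print(json.dumps(example, indent=2))
```

Leaving the field out when it is empty keeps the change backward compatible: a request without "cookies" would behave exactly as it does today.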

What is your proposed solution?

I have tried setting up cookies on my ArchiveBox Docker container over VNC as documented in the wiki; however, keeping cookies up to date for new websites in order to bypass consent or newsletter subscription popups is laborious. If we add an optional "cookies":["string"] field to the JSON request body of the /api/v1/cli/[add | update | schedule] API calls, requestors (including the ArchiveBox Chrome extension) would have the option to submit their current session cookies as part of the request, for one-time use by the applicable extractors.
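One way the "one-time use, not cached" behaviour could work is sketched below. This is not how ArchiveBox is implemented. It assumes the submitted cookies are `name=value` strings, writes them to a temporary Netscape-format cookies file (the format that tools such as wget, curl and yt-dlp can load), and deletes the file as soon as extraction finishes. The function name `one_time_cookies_file` and the `domain` parameter are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def one_time_cookies_file(domain: str, cookies: List[str]) -> Iterator[str]:
    """Yield the path of a temporary Netscape-format cookies file.

    The file exists only inside the `with` block and is removed afterwards,
    even if the extraction raises, so the cookies are never cached.
    """
    fd, path = tempfile.mkstemp(suffix=".cookies.txt")
    try:
        with os.fdopen(fd, "w") as f:
            f.write("# Netscape HTTP Cookie File\n")
            for cookie in cookies:
                # Assumed input format: "name=value"
                name, _, value = cookie.partition("=")
                # Fields: domain, include-subdomains, path, secure, expiry, name, value
                fields = [domain, "FALSE", "/", "FALSE", "0", name.strip(), value.strip()]
                f.write("\t".join(fields) + "\n")
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    with one_time_cookies_file("example.com", ["consent=accepted"]) as cookies_path:
        # An extractor would be pointed at cookies_path here.
        print(open(cookies_path).read())
    print("still on disk:", os.path.exists(cookies_path))
```

Scoping the file to a context manager ties the cookie lifetime to a single extraction run, which matches the request that cookies from different users should not persist or leak into later requests.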

What hacks or alternative solutions have you tried to solve the problem?

Besides using noVNC to update ArchiveBox's Chrome user profile, I recently began experimenting with using and modifying https://github.com/browser-use/browser-use and https://github.com/unclecode/crawl4ai, which use browser-enabled AI/LLM agents, to perform extraction outside of ArchiveBox for select pages.

Share the entire output of the archivebox version command for the current version you are using.

0.8.5rc51
ArchiveBox v0.8.5rc51 COMMIT_HASH=63bf902 BUILD_TIME=2024-10-24 06:32:16 1729751536
IN_DOCKER=True IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.8.0-49-generic-x86_64-with-glibc2.36 PYTHON=Cpython
EUID=1000:0 UID=0:0 PUID=1000:1000 FS_UID=1000:1000 FS_PERMS=644 FS_ATOMIC=True FS_REMOTE=True
DEBUG=False IS_TTY=True SUDO=True ID=723cdf77:4743bd15 SEARCH_BACKEND=sonic LDAP=False

 Binary Dependencies:
 √  python                3.11.10      sys_pip    /usr/local/bin/python3.11
 √  django                5.1.2        sys_pip    /usr/local/lib/python3.11/site-packages/django/__init__.py
 √  sqlite                2.6.0        sys_pip    /usr/local/lib/python3.11/site-packages/django/db/backends/sqlite3/base.py
 √  pip                   24.0.0       sys_pip    /usr/local/bin/pip
 √  pipx                  1.1.0        sys_pip    /usr/bin/pipx
 √  node                  22.10.0      apt        /usr/bin/node
 √  npm                   10.9.0       apt        /usr/bin/npm
 √  npx                   10.9.0       apt        /usr/bin/npx
 √  playwright            1.48.0       sys_pip    /usr/local/bin/playwright
 √  puppeteer             23.6.0       lib_npm    ~/.npm/bin/puppeteer
 √  ldap                  3.4.4        sys_pip    /usr/local/lib/python3.11/site-packages/ldap/__init__.py
 √  rg                    13.0.0       apt        /usr/bin/rg
 √  sonic                 1.4.9        env        /usr/local/bin/sonic
 √  chrome                130.0.6723   env        /usr/bin/chromium-browser
 √  curl                  8.10.1       apt        /usr/bin/curl
 √  git                   2.39.5       apt        /usr/bin/git
 √  postlight-parser      2.2.3        sys_npm    ~/.npm/bin/postlight-parser
 √  readability-extractor 0.0.11       lib_npm    ~/.npm/bin/readability-extractor
 √  single-file           1.1.54       lib_npm    ~/.npm/bin/single-file
 √  wget                  1.21.3       apt        /usr/bin/wget
 √  yt-dlp                2024.10.22   sys_pip    /usr/local/bin/yt-dlp
 √  ffmpeg                5.1.6        env        /usr/bin/ffmpeg

 Package Managers:
 √  env         /usr/bin/which                                       UID=1000 PATH=~/.npm/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 √  apt         /usr/bin/apt-get                                     UID=0    PATH=/usr/bin:/bin
 -  brew        not available                                        UID=1000 PATH=
 √  sys_pip     /usr/local/bin/pip                                   UID=1000 PATH=/usr/bin:/root/.local/bin:/usr/local/bin
 -  venv_pip    not available                                        UID=1000 PATH=/tmp/NotInsideAVenv/lib/bin
 -  lib_pip     not available                                        UID=1000 PATH=./lib/x86_64-linux-docker/pip/venv/bin
 √  sys_npm     /usr/bin/npm                                         UID=1000 PATH=~/.npm/bin
 -  lib_npm     /usr/bin/npm                                         UID=1000 PATH=./lib/x86_64-linux-docker/npm/node_modules/.bin:./node_modules/.bin:~/.npm/bin
 √  playwright  /usr/local/bin/playwright                            UID=0    PATH=./lib/x86_64-linux-docker/bin:~/.npm/bin:/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin…
 √  puppeteer   /usr/bin/npx                                         UID=1000 PATH=./lib/x86_64-linux-docker/bin

 Code locations:
 √  PACKAGE_DIR           39 files        valid     /app/archivebox
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  missing         unused    ./user_templates
 -  USER_PLUGINS_DIR      missing         unused    ./user_plugins
 √  LIB_DIR               1 files         valid     /usr/share/archivebox/lib

 Data locations:
 √  DATA_DIR              18 files @      valid     /data
 √  CONFIG_FILE           139.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             26.6 MB         valid     ./index.sqlite3
 √  QUEUE_DATABASE        92.0 KB         valid     ./queue.sqlite3
 √  ARCHIVE_DIR           796 files       valid     ./archive
 √  SOURCES_DIR           828 files       valid     ./sources
 √  PERSONAS_DIR          1 files         valid     ./personas
 √  LOGS_DIR              5 files         valid     ./logs
 √  TMP_DIR               4 files         valid     /tmp/archivebox

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually
  • [ ] I'm willing to start a PR to develop this myself
  • [ ] I have donated money to go towards fixing this issue

Mini Survey

  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up
  • [ ] I would pay $10/mo for a hosted version of ArchiveBox if it had this feature