[GH-ISSUE #966] Bug: Chrome history export does not work with synced history entries #2113

Open
opened 2026-03-01 17:56:34 +03:00 by kerem · 1 comment
Owner

Originally created by @avf on GitHub (Apr 18, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/966

Describe the bug

When exporting the Chrome history using bin/export_browser_history.sh, any entries that were visited on a different device (and have already been synced to the local history) are not included in the export. They are included only after the site has been visited again on the device where the export script is run.

This is unfortunate, since I'm running ArchiveBox on a headless server, and I want it to automatically archive my history of websites I visit on other devices, especially mobile devices.

Steps to reproduce

  1. Turn on Chrome sync on devices A and B, and make sure history is synced
  2. Go to device B, visit a site/URL you've never visited before
  3. Back on device A, wait a bit for the history to sync. Verify that the sync is complete by opening the full history in Chrome (History -> Show Full History). Do not visit the site on device A, only make sure it shows up in the history.
  4. On device A, export the history using the export script.
  5. In the exported JSON file, the site/URL you visited on device B will not be included.

ArchiveBox version

I was using the export script included with the following release.

ArchiveBox v0.6.2
Cpython Linux Linux-5.10.25-linuxkit-aarch64-with-glibc2.28 aarch64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data
 √  SOURCES_DIR           0 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           0 files         valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3
Author
Owner

@avf commented on GitHub (Apr 18, 2022):

So, I did a little more digging and it seems that the history stored in <CHROME_PROFILE_DIR>/History does not contain any synced URLs. In fact, it seems that the synced history from other devices is not persisted in any way. I checked it by doing the following:

  1. Turn on Chrome sync on devices A and B, and make sure history is synced
  2. Go to device B, visit a site/URL you've never visited before
  3. Back on device A, wait a bit for the history to sync. Verify that the sync is complete by opening the full history in Chrome (History -> Show Full History). Do not visit the site on device A, only make sure it shows up in the history.
  4. On device A, completely quit Chrome.
  5. On device A, turn off the internet entirely.
  6. On device A, re-open Chrome, and open the history. The site/URL will not show up.

So this means of course the export script can't extract this data from the SQLite DB that Chrome creates, since it is not stored there.
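The same conclusion can be checked directly against the database rather than through the Chrome UI. The following is a minimal sketch, not the export script itself: it assumes the History file is a SQLite database with a `urls` table that has a `url` column, and the function name `url_in_history` is hypothetical. It queries a temporary copy because Chrome may hold a lock on the live file while it is running.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def url_in_history(history_db: Path, url: str) -> bool:
    """Return True if `url` has a row in the `urls` table of a History DB.

    Assumption: the schema has a `urls` table with a `url` column.
    """
    # Work on a temporary copy so a running Chrome instance is not disturbed
    # and a locked live file does not make the query fail.
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / "History.copy"
        shutil.copy2(history_db, copy_path)
        conn = sqlite3.connect(copy_path)
        try:
            row = conn.execute(
                "SELECT 1 FROM urls WHERE url = ? LIMIT 1", (url,)
            ).fetchone()
        finally:
            conn.close()
    return row is not None
```

Calling `url_in_history(Path("<CHROME_PROFILE_DIR>/History"), "<URL visited on device B>")` on device A after step 3 would be expected to return `False` if synced entries really are not written to the local database; the profile directory location is platform-specific and is left as a placeholder here.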

Is there any workaround so that I can still export the full synced history in a scheduled way?
