[GH-ISSUE #1594] Feature Request: Support csv output files from browser-history python package #2464

Open
opened 2026-03-01 17:59:13 +03:00 by kerem · 2 comments
Owner

Originally created by @1over137 on GitHub (Nov 14, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1594

Originally assigned to: @pirate on GitHub.

What type of suggestion are you making?

Proposing a new feature

What is the problem that your feature request solves?

The export_history script seems to be broken for me, and I cannot get it to run for Firefox.

What is your proposed solution?

The browser-history Python package works nicely. It would be easier to support importing its output than to maintain a separate script.

The CSV format is simple: just datetime, link, title.
Example:
2024-11-13 20:00:00-05:00,https://www.youtube.com/,YouTube
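
To make the format concrete, here is a minimal sketch of parsing one such row with the Python standard library. It assumes the column order stated above (datetime, link, title) and no header row; `parse_history_row` is a hypothetical helper name, not part of browser-history or ArchiveBox.

```python
import csv
import io
from datetime import datetime, timezone


def parse_history_row(line):
    """Parse one 'datetime,link,title' CSV line into (datetime, url, title).

    Hypothetical helper for illustration; assumes exactly three columns.
    """
    # csv.reader handles quoted titles that contain commas
    row = next(csv.reader(io.StringIO(line)))
    # fromisoformat keeps the UTC offset, so the visit date is not lost
    visited_at = datetime.fromisoformat(row[0])
    return visited_at, row[1], row[2]


visited_at, url, title = parse_history_row(
    "2024-11-13 20:00:00-05:00,https://www.youtube.com/,YouTube"
)
print(visited_at.astimezone(timezone.utc).isoformat())  # 2024-11-14T01:00:00+00:00
print(url, title)  # https://www.youtube.com/ YouTube
```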

What hacks or alternative solutions have you tried to solve the problem?

I can just import the links, but that would lose the date information.

What version of ArchiveBox are you currently using?

➜  ~ archivebox version
0.7.2
ArchiveBox v0.7.2 BUILD_TIME=2024-11-13 21:34:08 1731551648
IN_DOCKER=False IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-6.11.6-arch1-1-x86_64-with-glibc2.40 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=False FS_USER=1000:1000 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/bin/python3.11
 √  SQLITE_BINARY         v2.6.0          valid     /usr/lib/python3.11/sqlite3/dbapi2.py
 √  DJANGO_BINARY         v3.1.14         valid     ./.local/share/pipx/venvs/archivebox/lib/python3.11/site-packages/django/__init__.py
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     ./.local/share/pipx/venvs/archivebox/bin/archivebox

 √  CURL_BINARY           v8.11.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.25.0         valid     /usr/bin/wget
 √  NODE_BINARY           v23.1.0         valid     /usr/bin/node
 X  SINGLEFILE_BINARY     ?               invalid   single-file
 X  READABILITY_BINARY    ?               invalid   readability-extractor
 X  MERCURY_BINARY        ?               invalid   postlight-parser
 √  GIT_BINARY            v2.47.0         valid     /usr/bin/git
 X  YOUTUBEDL_BINARY      ?               invalid   yt-dlp
 √  CHROME_BINARY         v131.0.6778.69  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v14.1.1         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     ./.local/share/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     ./.local/share/pipx/venvs/archivebox/lib/python3.11/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None
 -  COOKIES_FILE          -               disabled  None


[i] Data locations: (not in a data directory)

[!] Warning: Missing 4 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False

    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False

    ! MERCURY_BINARY: postlight-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False

    ! YOUTUBEDL_BINARY: yt-dlp (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_YOUTUBEDL=False

How badly do you want this new feature?

  • It's an urgent deal-breaker, I can't live without it
  • It's important to add it in the near-mid term future
  • It would be nice to have eventually
  • I'm willing to work on a PR to develop this myself
  • I have donated money to go towards fixing this issue

Mini Survey

  • I like ArchiveBox so far / would recommend it to a friend
  • I've had a lot of difficulty getting ArchiveBox set up
  • I would pay $10/mo for a hosted version of ArchiveBox if it had this feature

@pirate commented on GitHub (Nov 14, 2024):

Ok, in the meantime you can use this quick script to convert the CSV to a JSONL format that ArchiveBox can ingest:

# pip install pytz

import csv
import json
from datetime import datetime
import pytz

def convert_csv_to_jsonl(input_file, output_file):
    # Open files for reading and writing
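    # newline='' lets the csv module handle line endings; UTF-8 keeps non-ASCII titles intact
    with open(input_file, 'r', newline='', encoding='utf-8') as csv_file, open(output_file, 'w', encoding='utf-8') as jsonl_file: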
        # Read CSV without headers since they weren't provided in the example
        csv_reader = csv.reader(csv_file)
        
        for row in csv_reader:
            # Parse the timestamp and convert to UTC
            local_dt = datetime.fromisoformat(row[0])
            utc_dt = local_dt.astimezone(pytz.UTC)
            
            # Create dictionary with required structure
            record = {
                "url": row[1],
                "title": row[2],
                "created_at": utc_dt.strftime("%Y-%m-%dT%H:%M:%S+0000")
            }
            
            # Write the JSON line to output file
            jsonl_file.write(json.dumps(record) + '\n')

if __name__ == "__main__":
    convert_csv_to_jsonl('example_input_urls.csv', 'output.jsonl')  # edit this to point it to your input file

Then pipe the JSONL into archivebox like so:
archivebox add --parser=jsonl < output.jsonl
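
For reference, a small sketch of what one line of `output.jsonl` should look like for the example row from the issue. It mirrors the record structure of the script above, but swaps `pytz.UTC` for the standard library's `datetime.timezone.utc`, which should be equivalent for this conversion; `row_to_jsonl` is a hypothetical name used only for illustration.

```python
import json
from datetime import datetime, timezone


def row_to_jsonl(row):
    """Turn one [datetime, url, title] row into a JSONL line (hypothetical helper)."""
    # Convert the offset-aware timestamp to UTC, as the script above does
    utc_dt = datetime.fromisoformat(row[0]).astimezone(timezone.utc)
    record = {
        "url": row[1],
        "title": row[2],
        "created_at": utc_dt.strftime("%Y-%m-%dT%H:%M:%S+0000"),
    }
    return json.dumps(record)


line = row_to_jsonl(["2024-11-13 20:00:00-05:00", "https://www.youtube.com/", "YouTube"])
print(line)
# {"url": "https://www.youtube.com/", "title": "YouTube", "created_at": "2024-11-14T01:00:00+0000"}
```

Checking a single line this way before running the import makes it easy to confirm that the timestamps were converted to UTC as expected.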


@1over137 commented on GitHub (Nov 14, 2024):

There is no jsonl parser. Would the json one work?

Edit: I noticed now that this is present in the rc version. Will use that for now.
