[GH-ISSUE #785] Bug: Pinboard JSON parser doesn't keep original bookmarked timestamp when adding URLs #2007

Open
opened 2026-03-01 17:55:46 +03:00 by kerem · 6 comments
Owner

Originally created by @gouku on GitHub (Jul 7, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/785

Type

  • [ ] General question or discussion
  • [ ] Propose a brand new feature
  • [x] Request modification of existing behavior or design

What is the problem that your feature request solves

I'd like to migrate from Pinboard to ArchiveBox. I have over 10,000 bookmarks going back to 2009. I exported them as JSON and imported them into ArchiveBox, but found that all bookmarks end up with the same timestamp.

docker-compose run archivebox add < ~/archivebox/pinboard_export.2021.07.07_06.33.json

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

The import process could keep the original timestamp that's already in the JSON export:

...
{
  "href":"https:\/\/github.com\/yuliskov\/SmartTubeNext",
  "time":"2021-06-05T18:55:20Z",
  "shared":"no",
  "toread":"no",
  "tags":"youtube android"
},
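For context, Pinboard's `time` field is ISO 8601 UTC, while ArchiveBox keys snapshots by an epoch-seconds timestamp string. A minimal, hypothetical sketch of the conversion an importer would need to perform (not ArchiveBox's actual parser code):

```python
from datetime import datetime, timezone

# Hypothetical sketch: convert a Pinboard export entry's "time" field
# into the epoch-seconds string ArchiveBox uses as a Snapshot timestamp.
entry = {
    "href": "https://github.com/yuliskov/SmartTubeNext",
    "time": "2021-06-05T18:55:20Z",
    "shared": "no",
    "toread": "no",
    "tags": "youtube android",
}

# Pinboard's timestamps are always UTC with a trailing "Z".
bookmarked = datetime.strptime(entry["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
timestamp = str(bookmarked.timestamp())
print(timestamp)  # → 1622919320.0
```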

What hacks or alternative solutions have you tried to solve the problem?

Manually edit 10,000 imported bookmarks? I don't think so...

How badly do you want this new feature?

  • [x] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [ ] I'm willing to contribute dev time / money to fix this issue
  • [ ] I like ArchiveBox so far / would recommend it to a friend
  • [x] I've had a lot of difficulty getting ArchiveBox set up

@gouku commented on GitHub (Jul 7, 2021):

There is also another issue that breaks all tags when importing bookmarks. I found a Pocket-related issue (https://github.com/ArchiveBox/ArchiveBox/issues/725), but it affects Pinboard as well. These two issues currently block me from migrating from Pinboard.


@pirate commented on GitHub (Jul 7, 2021):

It's already supposed to be using the original timestamps, so it must be a bug in that code: https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/parsers/generic_json.py#L38 or something weird with your JSON export.

Can you post the first dozen lines of log output from when you add it? I want to make sure that it's actually being recognized as JSON and not just Generic TXT.

➜ ~/D/o/A/data ⎇ (dev) # archivebox add --depth=1 'https://slatestarcodex.com/#234234' --overwrite
[i] [2021-07-07 10:11:35] ArchiveBox v0.6.3: archivebox add --depth=1 https://slatestarcodex.com/#234234 --overwrite
    > /Users/squash/Documents/opt/ArchiveBox/data

[+] [2021-07-07 10:11:36] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1625652696-import.txt
    > Parsed 1 URLs from input (Generic TXT)                # <---------------------- this is the line I care about

[*] Starting crawl of 1 sites 1 hop out from starting point
    > Downloading https://slatestarcodex.com/#234234 contents
    > Saved verbatim input to sources/1625652696.173032-crawl-slatestarcodex.com.txt
    > Parsed 1012 URLs from input (Generic TXT)
    > Found 299 new URLs not already in index

[*] [2021-07-07 10:11:36] Writing 299 links to main index...
    √ ./index.sqlite3

[▶] [2021-07-07 10:11:37] Starting archiving of 300 snapshots in index...

[+] [2021-07-07 10:11:37] "slatestarcodex.com/#234234"
    https://slatestarcodex.com/#234234
    > ./archive/1625652696.173032
      > title
      > favicon
      > headers
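The distinction matters because ArchiveBox tries its parsers in sequence and falls back to treating input as plain text. A much-simplified illustration of that kind of format detection (hypothetical, not the actual `generic_json.py` implementation):

```python
import json

def detect_format(text: str) -> str:
    # Simplified sketch: if the input parses as strict JSON, use the JSON
    # importer; otherwise fall back to scraping URLs out of plain text.
    try:
        json.loads(text)
        return "Generic JSON"
    except json.JSONDecodeError:
        return "Generic TXT"

print(detect_format('[{"href": "https://example.com"}]'))  # → Generic JSON
print(detect_format('https://example.com\n'))              # → Generic TXT
```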

@gouku commented on GitHub (Jul 7, 2021):

I think it's recognized as JSON:

run archivebox add < ~/archivebox/pinboard_export.2021.07.07_07.35.json 
Creating archivebox_archivebox_run ... done
[i] [2021-07-07 07:43:25] ArchiveBox v0.6.2: archivebox add
    > /data

[+] [2021-07-07 07:43:28] Adding 17213 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1625643808-import.txt
    > Parsed 17212 URLs from input (Generic JSON)
    > Found 16933 new URLs not already in index

[*] [2021-07-07 07:43:42] Writing 16933 links to main index...
Traceback (most recent call last):
  File "/app/archivebox/index/sql.py", line 41, in write_link_to_sql_index
    info["timestamp"] = Snapshot.objects.get(url=link.url).timestamp
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 589, in update_or_create
    obj = self.select_for_update().get(**kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 429, in get
    raise self.model.DoesNotExist(
core.models.DoesNotExist: Snapshot matching query does not exist.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 103, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 602, in add
    write_main_index(links=new_links, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/__init__.py", line 232, in write_main_index
    write_sql_main_index(links, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/sql.py", line 88, in write_sql_main_index
    write_link_to_sql_index(link)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/sql.py", line 46, in write_link_to_sql_index
    snapshot, _ = Snapshot.objects.update_or_create(url=link.url, defaults=info)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 594, in update_or_create
    obj, created = self._create_object_from_params(kwargs, params, lock=True)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 610, in _create_object_from_params
    obj = self.create(**params)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 447, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 753, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 790, in save_base
    updated = self._save_table(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 895, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/base.py", line 933, in _do_insert
    return manager._insert(
  File "/usr/local/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1254, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1397, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python3.9/site-packages/django/db/backends/sqlite3/base.py", line 413, in execute
    return Database.Cursor.execute(self, query, params)
django.db.utils.OperationalError: database is locked

Also, I got errors while importing bookmarks; only about 200 bookmarks were imported.


@pirate commented on GitHub (Aug 9, 2023):

Replicating my comment here from the PR:

We already have a field for the original timestamp, it's Snapshot.bookmarked (aka Snapshot.timestamp, the snapshot primary key)! .added is for when Django first became aware of the URL, and .bookmarked is for any timestamp read in from an import source like bookmarks, RSS, pocket feed, etc. We need both to remain separate, but we could add an option to prefer displaying .bookmarked or .added in the UI in different places.

(So any bugfix should fix the code that sets Snapshot.timestamp, not change Snapshot.added.) Sorry for the confusion; I guess we could have clearer comments in the code explaining the differences between each field.
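A hypothetical, much-simplified sketch of the field semantics described above (not the actual ArchiveBox model definition) may make the distinction concrete:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Snapshot:
    url: str
    timestamp: str      # aka "bookmarked": time from the import source (also the primary key)
    added: datetime     # when this ArchiveBox install first became aware of the URL

def snapshot_from_import(url: str, bookmarked: datetime) -> Snapshot:
    # Per the comment above: a fix belongs in the code that sets .timestamp;
    # .added should keep meaning "when this install first saw the URL".
    return Snapshot(
        url=url,
        timestamp=str(bookmarked.timestamp()),
        added=datetime.now(timezone.utc),
    )

snap = snapshot_from_import(
    "https://github.com/yuliskov/SmartTubeNext",
    datetime(2021, 6, 5, 18, 55, 20, tzinfo=timezone.utc),
)
print(snap.timestamp)  # → 1622919320.0
```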


@pirate commented on GitHub (Feb 22, 2024):

Maybe we could use this to build a better pinboard importer: https://github.com/davep/tinboard


@jimwins commented on GitHub (Mar 2, 2024):

The feature suggested in issue #1367 would allow sorting/display by the bookmarked date instead of added. (I think that should probably be the default, but I'll make that case over there when I've thought it through more completely.)

We should add some test cases to verify that the timestamps are being parsed and stored correctly, but I have verified via manual testing that it does work with the JSON parser including the fixes on the dev branch.

Issue #1188 (about duplicate timestamps on imports) is another issue that can cause problems with Pinboard imports because of the assumption that specific timestamps are associated with a unique URL. This should be fixed when bookmarked (aka timestamp) is no longer used as a unique identifier for snapshots, which is issue #74. (That will also lay the groundwork for having multiple snapshots of the same URL, which is issue #179.)
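To illustrate the collision problem behind issue #1188: Pinboard records bookmark times at one-second resolution, so two bookmarks can share a timestamp, which breaks any scheme that uses the timestamp as a unique snapshot key. One illustrative workaround pattern (not what ArchiveBox actually does; the real fix is tracked in issue #74) is to disambiguate duplicates with a fractional suffix:

```python
def uniquify_timestamps(timestamps):
    # Illustrative only: append increasing fractional parts so that
    # second-resolution timestamps can still serve as unique keys.
    seen = set()
    out = []
    for ts in timestamps:
        candidate = ts
        bump = 0
        while candidate in seen:
            bump += 1
            candidate = f"{ts}.{bump:06d}"
        seen.add(candidate)
        out.append(candidate)
    return out

print(uniquify_timestamps(["1622919320", "1622919320", "1622919320"]))
# → ['1622919320', '1622919320.000001', '1622919320.000002']
```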
