[GH-ISSUE #1188] Bug: Adding two different links with the same timestamp (e.g. from JSON) errors out and stops the entire import #3757

Closed
opened 2026-03-15 00:21:26 +03:00 by kerem · 3 comments

Originally created by @melyux on GitHub (Jul 22, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1188

Describe the bug

If you have two links with the same timestamp, ArchiveBox throws this error:

AssertionError: Cannot merge two links with different URLs ...

and stops the import. If the links don't match, it should just make separate records, not stop the entire import process. I see that the directories for these two snapshots are already created in a way that resolves the conflict: the timestamp for the "duplicate" snapshot's directory is incremented by 1, so we get 1611619200.0 and 1611619201.0. That part is good.
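
The expected behaviour (one bad link should not abort the rest of the import) can be sketched as follows. This is only an illustration under stated assumptions, not ArchiveBox's code: `Link`, `archive_one`, and `archive_all` are hypothetical names, and `archive_one` merely mimics the failing assertion by rejecting a timestamp already taken by a different URL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    # Hypothetical stand-in for a link record, not ArchiveBox's real schema.
    url: str
    timestamp: str


def archive_one(link: Link, seen: dict[str, str]) -> None:
    # Hypothetical archiver: mimics the failing check by refusing a timestamp
    # that is already taken by a different URL.
    existing = seen.get(link.timestamp)
    assert existing in (None, link.url), (
        f'Cannot merge two links with different URLs ({existing} != {link.url})'
    )
    seen[link.timestamp] = link.url


def archive_all(links: list[Link]) -> tuple[list[Link], list[tuple[Link, str]]]:
    """Archive every link, collecting failures instead of aborting the run."""
    seen: dict[str, str] = {}
    done: list[Link] = []
    failed: list[tuple[Link, str]] = []
    for link in links:
        try:
            archive_one(link, seen)
            done.append(link)
        except AssertionError as err:
            # Record the failure and continue with the next link.
            failed.append((link, str(err)))
    return done, failed
```

With this shape, the conflicting link ends up in `failed` while every link after it is still processed.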

Steps to reproduce

  1. Import two links with the same timestamp using `--parser json`.
  2. Notice that the pull/import process stops after throwing the error `AssertionError: Cannot merge two links with different URLs`.
  3. Notice that any links after this are not pulled.

Screenshots or log output

Source:

[
  ...
  {
    "url": "https://google.com",
    "title": "Google",
    "created": "2021-01-26T00:00:00+0000"
  },
  {
    "url": "https://yahoo.com",
    "title": "Yahoo",
    "created": "2021-01-26T00:00:00+0000"
  },
  ...
]
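
As a possible workaround until this is fixed, an import file could be checked for colliding timestamps before it is handed to ArchiveBox. The sketch below assumes the `created` field is what ends up as the snapshot timestamp (which matches the `1611619200.0` directory name mentioned above) and that all entries use the date format shown in this file; `find_timestamp_collisions` is a hypothetical helper, not part of ArchiveBox.

```python
import json
from collections import defaultdict
from datetime import datetime


def find_timestamp_collisions(entries: list[dict]) -> dict[float, list[str]]:
    """Return {epoch_timestamp: [urls]} for timestamps shared by different URLs."""
    by_ts: dict[float, list[str]] = defaultdict(list)
    for entry in entries:
        # Assumes the "2021-01-26T00:00:00+0000" format used in the file above.
        created = datetime.strptime(entry["created"], "%Y-%m-%dT%H:%M:%S%z")
        by_ts[created.timestamp()].append(entry["url"])
    # Keep only timestamps that map to more than one distinct URL.
    return {ts: urls for ts, urls in by_ts.items() if len(set(urls)) > 1}


if __name__ == "__main__":
    sample = json.loads("""
    [
      {"url": "https://google.com", "title": "Google", "created": "2021-01-26T00:00:00+0000"},
      {"url": "https://yahoo.com", "title": "Yahoo", "created": "2021-01-26T00:00:00+0000"}
    ]
    """)
    for ts, urls in find_timestamp_collisions(sample).items():
        print(ts, urls)
```

Entries flagged this way could then be given slightly different `created` values by hand before importing.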

Log:

[*] [2023-07-22 03:42:04] Archiving 269/1742 URLs from added set...

[▶] [2023-07-22 03:42:04] Starting archiving of 269 snapshots in index...
    ! Failed to archive link: AssertionError: Cannot merge two links with different URLs (google.com != yahoo.com)

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/cli/archivebox_add.py", line 109, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/main.py", line 660, in add
    archive_links(new_links, overwrite=False, **archive_kwargs)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/extractors/__init__.py", line 200, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/extractors/__init__.py", line 96, in archive_link
    link = load_link_details(link, out_dir=out_dir)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/index/__init__.py", line 350, in load_link_details
    return merge_links(existing_link, link)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/archivebox/index/__init__.py", line 63, in merge_links
    assert a.base_url == b.base_url, f'Cannot merge two links with different URLs ({a.base_url} != {b.base_url})'
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Cannot merge two links with different URLs (google.com != yahoo.com)

ArchiveBox version

0.6.3
ArchiveBox v0.6.3 40ddd33 Cpython Linux Linux-6.1.0-10-amd64-x86_64-with-glibc2.31 x86_64
DEBUG=False IN_DOCKER=True IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=False FS_PERMS=644 1000:1000 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.4         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v18.16.1        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.0.44         valid     /usr/lib/node_modules/single-file-cli/single-file                           
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.07.06     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v114.0.5735.198  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              


[i] Data locations:

@neel-suthar commented on GitHub (Jan 20, 2024):

@pirate Is this because we use the timestamp value to create the output directory?


@pirate commented on GitHub (Jan 20, 2024):

Yes, timestamp is currently the unique key for snapshots. Because it has millisecond-level resolution, we can always bump it by a few ms if there are conflicts (and even add more decimals). Resolving conflicts here and deduping correctly has historically been a big source of complexity in the ArchiveBox internals.

This will change in the future when we add official support for taking multiple snapshots of the same url over time https://github.com/ArchiveBox/ArchiveBox/issues/179 and when we switch to using UUIDs for unique keys https://github.com/ArchiveBox/ArchiveBox/issues/74
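
The bump-on-conflict idea described in this comment can be sketched roughly as below. This is an illustration only, not the actual ArchiveBox implementation: `dedupe_timestamps` and the 1 ms `step` are assumptions, and `Decimal` is used here simply to avoid float rounding when adding small increments.

```python
from decimal import Decimal


def dedupe_timestamps(
    links: list[tuple[str, str]],
    step: Decimal = Decimal("0.001"),  # assumed bump size: 1 ms
) -> dict[str, str]:
    """Map each timestamp key to one URL, bumping the key by `step` on conflict."""
    taken: dict[str, str] = {}  # timestamp key -> url
    for url, ts in links:
        key = Decimal(ts)
        # Bump while the key is already owned by a *different* URL.
        while str(key) in taken and taken[str(key)] != url:
            key += step
        # The same URL at the same key is a true duplicate and is kept once.
        taken.setdefault(str(key), url)
    return taken
```

For the two links in this issue, the first keeps `1611619200.0` and the second is moved to `1611619200.001`, so neither is merged with the other.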


@pirate commented on GitHub (Jan 8, 2026):

This should be fixed on `dev`: timestamp is no longer the unique key, and snapshots are now just organized by UUID.

`dev` isn't stable yet, so don't upgrade big existing collections, but stay tuned for the next release.
