[GH-ISSUE #640] Bug: Fix merge_snapshots() creating a new snapshot dir instead of merging with an existing one during init and add --overwrite #3419

Open
opened 2026-03-14 22:46:00 +03:00 by kerem · 1 comment
Owner

Originally created by @pirate on GitHub (Jan 31, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/640

Probably caused by this change to merge_links(a, b), which makes it just return a instead of merging the directories properly:

archivebox/index/__init__.py:
commit: 8c4ae73 (Cristian Wed Dec 23 14:51:42 2020)

### Link filtering and checking

@enforce_types
def merge_snapshots(a: Model, b: Model) -> Model:
    """deterministially merge two snapshots, favoring longer field values over shorter,
    and "cleaner" values over worse ones.
    TODO: Check if this makes sense with the new setup
    """
+    return a
    assert a.base_url == b.base_url, f'Cannot merge two links with different URLs ({a.base_url} != {b.base_url})'

    # longest url wins (because a fuzzy url will always be shorter)
    url = a.url if len(a.url) > len(b.url) else b.url

    # best title based on length and quality
    possible_titles = [
        title
        for title in (a.title, b.title)
        if title and title.strip() and '://' not in title
    ]
    title = None
    if len(possible_titles) == 2:
        title = 
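
For illustration, here is a minimal sketch of the merge rules visible in the diff above once the early `return a` is gone. It is not ArchiveBox's actual implementation: `SnapshotSketch` and `merge_snapshots_sketch` are hypothetical names, the real function takes Django models, and it also has to deal with the snapshot directories on disk, which this sketch ignores. The quoted diff is cut off at `title =`, so the tie-break between two usable titles (longest wins) is an assumption based on the docstring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SnapshotSketch:
    """Hypothetical stand-in for the snapshot model; not ArchiveBox's class."""
    base_url: str
    url: str
    title: Optional[str] = None


def merge_snapshots_sketch(a: SnapshotSketch, b: SnapshotSketch) -> SnapshotSketch:
    # No early `return a` here, so the merge logic below is actually reached.
    assert a.base_url == b.base_url, (
        f'Cannot merge two links with different URLs ({a.base_url} != {b.base_url})'
    )

    # Longest url wins, as in the quoted diff.
    url = a.url if len(a.url) > len(b.url) else b.url

    # Keep only non-empty titles that do not look like URLs, as in the quoted diff.
    possible_titles = [
        title
        for title in (a.title, b.title)
        if title and title.strip() and '://' not in title
    ]

    # ASSUMPTION: the original tie-break is truncated in the quoted diff.
    # Following the docstring ("favoring longer field values"), take the longest.
    title = max(possible_titles, key=len) if possible_titles else None

    return SnapshotSketch(base_url=a.base_url, url=url, title=title)


# Two records for the same page: `a` has a URL-like title, `b` a real one.
a = SnapshotSketch(
    base_url='example.com/page',
    url='http://example.com/page',
    title='http://example.com/page',
)
b = SnapshotSketch(
    base_url='example.com/page',
    url='https://example.com/page',
    title='Example Page',
)
merged = merge_snapshots_sketch(a, b)
```

With the early `return a` in place, the result would simply be `a`, including its URL-like title; without it, the merged record takes the longer URL and the cleaner title from `b`.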
Author
Owner

@cdvv7788 commented on GitHub (Jan 31, 2021):

Oh, @pirate in the refactor this had a TODO too. Sorry.
