[GH-ISSUE #72] Archiver tries to merge/detects conflict between two bookmarks which differ only in query string #3071

Closed
opened 2026-03-14 20:52:40 +03:00 by kerem · 7 comments

Originally created by @cdzombak on GitHub (Mar 14, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/72

I've been running the archiver locally against my Pinboard RSS feed to test #71, and I noticed that every time I run it, it asks if I want to clean up the archive. If I do that, it seems to detect that two different pages (both YouTube video pages) need to be merged, and then (rightly) warns me there's a conflict.

I've uploaded the two conflicting files, along with the corresponding JSON entries from archive/index.json, at https://dropbox.dzombak.com/bookmark-archiver/ for examination.

Those JSON entries in index.json are:

{
    "timestamp": "1518721395",
    "url": "https://www.youtube.com/watch?v=Kh0Y2hVe_bw",
    "domain": "www.youtube.com",
    "base_url": "www.youtube.com/watch",
    "tags": "humor is:video birds",
    "title": "WHAT ARE BIRDS - YouTube",
    "sources": [
        "downloads/feeds.pinboard.in.txt",
        "/Users/cdzombak/tmp/pinboard/pinboard-public.rss"
    ],
    "type": "youtube"
},
// ...
{
    "timestamp": "1519258772",
    "url": "https://www.youtube.com/watch?v=KhYfe4R2Es0&app=desktop",
    "domain": "www.youtube.com",
    "base_url": "www.youtube.com/watch",
    "tags": "swiftlang extensions bestpractices api design ux is:video",
    "title": "#Pragma Conference 2017 - Soroush Khanlou - You Deserve Nice Things - YouTube",
    "sources": [
        "downloads/feeds.pinboard.in.txt",
        "/Users/cdzombak/tmp/pinboard/pinboard-public.rss"
    ],
    "type": "youtube"
},

I've seen a similar problem on a server archiving a larger bookmark collection, with pages from entirely different websites. Once my current archive job completes I'll add more examples from there.

I'm running this commit on my fork (https://github.com/cdzombak/bookmark-archiver/commit/b5c6d34860d1303c696867de12750deac65eb757), which was branched off master in this repo a few hours ago. Locally I'm running macOS 10.13/Python 3.6.4; on my server I have Ubuntu 16.04/Python 3.5.2.

I started to dig into the code, but I haven't yet been able to follow what the cleanup code is doing. I do note that these bookmarks have unique timestamps.


@cdzombak commented on GitHub (Mar 14, 2018):

I've identified another case where the archiver detects that two bookmarks need to be merged: https://dropbox.dzombak.com/bookmark-archiver/googlegroups-conflict.zip

In this case, they are two Google Groups links. Their JSON blobs from archive/index.json:

{
    "sources": [
        "/home/cdzombak/pinboard-2018-03-10.json",
        "/home/cdzombak/pinboard-2018-03-12.json"
    ],
    "tags": "aviation sr-71",
    "title": "The King of Speed: SR-71 Blackbird - Google Groups",
    "timestamp": "1329104889",
    "base_url": "groups.google.com/forum/",
    "type": null,
    "domain": "groups.google.com",
    "url": "https://groups.google.com/forum/?fromgroups#!topic/rec.aviation.stories/ueI6JKeEomo"
},
// ...
{
    "sources": [
        "/home/cdzombak/pinboard-2018-03-10.json",
        "/home/cdzombak/pinboard-2018-03-12.json"
    ],
    "tags": "annarbor a2council fiber internet",
    "title": "a2geeks discussion on Ann Arbor's I-net municipal fiber replacement project",
    "timestamp": "1491805757",
    "base_url": "groups.google.com/forum/#!msg/a2geeks/4nH1Xz82GDw/jF3ejlSuDgAJ",
    "type": null,
    "domain": "groups.google.com",
    "url": "https://groups.google.com/forum/#!msg/a2geeks/4nH1Xz82GDw/jF3ejlSuDgAJ"
},

@cdzombak commented on GitHub (Mar 14, 2018):

In this case, the content on disk is in sync with index.json and index.html.

I've identified another bug, too, where the contents on disk are out of sync with the indices. I'll file that separately after gathering more information.


@cdzombak commented on GitHub (Mar 14, 2018):

I think this bug is that find_link considers links to be equivalent if there's a link whose base URL is a substring of another's URL: https://github.com/pirate/bookmark-archiver/blob/master/util.py#L252-L255
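The substring rule described above can be reproduced with a minimal sketch. This is not the actual code from util.py: the function name `find_link_by_substring` and its signature are hypothetical, and only the "base URL is a substring of another URL" condition is taken from the comment. The data comes from the two Google Groups entries quoted earlier.

```python
# Hypothetical sketch of the matching rule described in the comment above.
# NOT the real find_link from util.py; the name and signature are assumptions.
def find_link_by_substring(folder_url, links):
    """Return the first link whose base_url is a substring of folder_url."""
    for link in links:
        if link["base_url"] in folder_url:
            return link
    return None


# The two Google Groups entries from index.json (relevant fields only).
sr71 = {
    "base_url": "groups.google.com/forum/",
    "url": "https://groups.google.com/forum/?fromgroups#!topic/rec.aviation.stories/ueI6JKeEomo",
}
a2geeks = {
    "base_url": "groups.google.com/forum/#!msg/a2geeks/4nH1Xz82GDw/jF3ejlSuDgAJ",
    "url": "https://groups.google.com/forum/#!msg/a2geeks/4nH1Xz82GDw/jF3ejlSuDgAJ",
}

# Looking up the a2geeks URL returns the SR-71 entry, because the short
# base_url "groups.google.com/forum/" is contained in the longer URL.
match = find_link_by_substring(a2geeks["url"], [sr71, a2geeks])
print(match is sr71)
```

Under this rule the lookup for the a2geeks URL returns the SR-71 entry, which is consistent with the archiver treating the two unrelated bookmarks as candidates for merging.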


@pirate commented on GitHub (Mar 15, 2018):

I recently added the cleanup code and it's likely still buggy. The whole idea of "cleanup" to attempt to migrate from the old bookmark archiver output structure to new is a flawed idea, and I decided to move away from it by adding a SQL db with migrations. That branch (https://github.com/pirate/bookmark-archiver/tree/django) is still in progress and will take several months, but I will try to squeeze in a bugfix in the meantime.

Sorry to cause you this headache, I think I'll just remove the cleanup code for now since it's causing both the incorrect timestamps and merging issues.


@pirate commented on GitHub (Mar 15, 2018):

The merging of urls with different query strings was intentional behavior, but I will change it. URLs are currently only deduplicated based on base_url (which doesn't include query string).
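To illustrate why deduplicating on base_url collapses the two YouTube bookmarks, here is a minimal sketch. The `base_url` helper below is an assumption that reproduces the base_url values shown in the index.json entries above (scheme removed, everything from the first `?` dropped); it is not taken from the project's source.

```python
from collections import defaultdict


# Assumed derivation: strip the scheme, then cut at the first "?".
# This matches the base_url values quoted in the issue, but the real
# implementation may differ.
def base_url(url):
    without_scheme = url.split("://", 1)[-1]
    return without_scheme.split("?", 1)[0]


urls = [
    "https://www.youtube.com/watch?v=Kh0Y2hVe_bw",
    "https://www.youtube.com/watch?v=KhYfe4R2Es0&app=desktop",
]

# Group the URLs by their base_url, as a base_url-keyed dedupe would.
groups = defaultdict(list)
for url in urls:
    groups[base_url(url)].append(url)

for key, members in groups.items():
    print(key, len(members))
```

Both videos end up under the single key `www.youtube.com/watch`, so a dedupe keyed on base_url sees them as one bookmark.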


@cdzombak commented on GitHub (Mar 15, 2018):

> The whole idea of "cleanup" to attempt to migrate from the old bookmark archiver output structure to new is a flawed idea, and I decided to move away from it by adding a SQL db with migrations.

Cool!

> Sorry to cause you this headache, I think I'll just remove the cleanup code for now since it's causing both the incorrect timestamps and merging issues.

No worries. As noted in #74, though, I don't think this cleanup code is responsible for the issues when re-running the archiver on an export that includes duplicate timestamps.

> The merging of urls with different query strings was intentional behavior, but I will change it. URLs are currently only deduplicated based on base_url (which doesn't include query string).

I think changing that makes sense. Pinboard, at least, seems to consider the query string / fragment as part of the URL, and treats URLs that vary in these components as unique.
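One way to treat the query string and fragment as part of a bookmark's identity is sketched below. `dedupe_key` is a hypothetical helper and is not the change made in the eventual fix; it only assumes that scheme differences should be ignored while everything else in the URL is kept.

```python
from urllib.parse import urlsplit


# Hypothetical dedupe key: host + path + query + fragment, scheme ignored.
def dedupe_key(url):
    parts = urlsplit(url)
    key = parts.netloc + parts.path
    if parts.query:
        key += "?" + parts.query
    if parts.fragment:
        key += "#" + parts.fragment
    return key


# The four URLs reported in this issue.
urls = [
    "https://www.youtube.com/watch?v=Kh0Y2hVe_bw",
    "https://www.youtube.com/watch?v=KhYfe4R2Es0&app=desktop",
    "https://groups.google.com/forum/?fromgroups#!topic/rec.aviation.stories/ueI6JKeEomo",
    "https://groups.google.com/forum/#!msg/a2geeks/4nH1Xz82GDw/jF3ejlSuDgAJ",
]

keys = {dedupe_key(url) for url in urls}
print(len(keys))
```

With this key all four URLs stay distinct, while `http://` and `https://` variants of the same page would still be treated as one bookmark.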


@pirate commented on GitHub (Apr 17, 2018):

Fixed as of: 0099849
