mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #72] Archiver tries to merge/detects conflict between two bookmarks which differ only in query string #49
Originally created by @cdzombak on GitHub (Mar 14, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/72
I've been running the archiver locally against my Pinboard RSS feed to test #71, and I noticed that every time I run it, it asks if I want to clean up the archive. If I do that, it seems to detect that two different pages (both YouTube video pages) need to be merged, and then (rightly) warns me there's a conflict.
I've uploaded the two conflicting files, along with the corresponding JSON entries from archive/index.json, at https://dropbox.dzombak.com/bookmark-archiver/ for examination.
Those JSON entries in index.json are:
I've seen a similar problem on a server archiving a larger bookmarks collection, with pages from entirely different websites. Once my current archive job completes I'll add more examples from there.
I'm running this commit on my fork, which was branched off master in this repo a few hours ago. Locally I'm running on macOS 10.13/Python 3.6.4; on my server I have Ubuntu 16.04/Python 3.5.2.
I started to dig into the code, but I haven't yet been able to follow what the cleanup code is doing. I do note that these bookmarks have unique timestamps.
@cdzombak commented on GitHub (Mar 14, 2018):
I've identified another case where the archiver detects that two bookmarks need to be merged: https://dropbox.dzombak.com/bookmark-archiver/googlegroups-conflict.zip
In this case, they are two Google Groups links. Their JSON blobs from archive/index.json:
@cdzombak commented on GitHub (Mar 14, 2018):
In this case, the content on disk is in sync with index.json and index.html.
I've identified another bug, too, where the contents on disk are out of sync with the indices. I'll file that separately after gathering more information.
@cdzombak commented on GitHub (Mar 14, 2018):
I think this bug is that find_link considers links to be equivalent if there's a link whose base URL is a substring of another's URL: https://github.com/pirate/bookmark-archiver/blob/master/util.py#L252-L255
@pirate commented on GitHub (Mar 15, 2018):
I recently added the cleanup code and it's likely still buggy. The whole idea of "cleanup" to attempt to migrate from the old bookmark archiver output structure to new is a flawed idea, and I decided to move away from it by adding a SQL db with migrations. That branch is still in progress and will take several months, but I will try to squeeze in a bugfix in the meantime.
Sorry to cause you this headache, I think I'll just remove the cleanup code for now since it's causing both the incorrect timestamps and merging issues.
@pirate commented on GitHub (Mar 15, 2018):
The merging of urls with different query strings was intentional behavior, but I will change it. URLs are currently only deduplicated based on base_url (which doesn't include query string).
@cdzombak commented on GitHub (Mar 15, 2018):
Cool!
No worries. As noted in #74, though, I don't think this cleanup code is responsible for the issues when re-running the archiver on an export that includes duplicate timestamps.
I think changing that makes sense. Pinboard, at least, seems to consider the query string / fragment as part of the URL, and treats URLs that vary in these components as unique.
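For illustration, here is a minimal sketch of the two behaviours discussed in this thread: matching links by base-URL substring, and deduplicating on a key that drops the query string. The helper names (base_url, substring_match, dedupe) and the sample URLs are hypothetical. They are not taken from the bookmark-archiver code and only approximate the behaviour described above.

```python
from urllib.parse import urlsplit


def base_url(url):
    """Hypothetical helper: host + path, with scheme, query string and fragment dropped."""
    parts = urlsplit(url)
    return parts.netloc + parts.path


def substring_match(url_a, url_b):
    """Assumed approximation of the reported check: two links count as the
    same if one link's base URL occurs anywhere inside the other's URL."""
    return base_url(url_a) in url_b or base_url(url_b) in url_a


def dedupe(urls, key):
    """Keep the first URL seen for each distinct key."""
    seen = {}
    for url in urls:
        seen.setdefault(key(url), url)
    return list(seen.values())


# Made-up URLs that differ only in their query string.
videos = [
    "https://www.youtube.com/watch?v=aaaaaaaaaaa",
    "https://www.youtube.com/watch?v=bbbbbbbbbbb",
]

print(substring_match(*videos))               # True: treated as one link
print(len(dedupe(videos, key=base_url)))      # 1: query string ignored
print(len(dedupe(videos, key=lambda u: u)))   # 2: full URL used as the key
```

Keying on the full URL keeps both bookmarks, which matches how Pinboard is described as treating them.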
@pirate commented on GitHub (Apr 17, 2018):
Fixed as of:
0099849