mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #725] Pocket and Pinboard imports causing tags to be split incorrectly into individual characters w/ broken hyphenation #459
Labels
No labels
expected: maybe someday
expected: next release
expected: release after next
expected: unlikely unless contributed
good first ticket
help wanted
pull-request
scope: all users
scope: windows users
size: easy
size: hard
size: medium
size: medium
status: backlog
status: blocked
status: done
status: idea-phase
status: needs followup
status: wip
status: wontfix
touches: API/CLI/Spec
touches: configuration
touches: data/schema/architecture
touches: dependencies/packaging
touches: docs
touches: js
touches: views/replayers/html/css
why: correctness
why: functionality
why: performance
why: security
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
starred/ArchiveBox#459
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @pirate on GitHub (Apr 27, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/725
A simple bug due to a
.split()orset()somewhere on the tags_str instead of the tags list. Should be easy to fix.We should also add a filter to prevent emptystring / whitespace-only tags:

@philippemilink commented on GitHub (May 24, 2021):
I have a bug similar to this one, so instead of opening a new issue, I will complete this bug report.
With a JSON input, I want to assign several different tags to a link. I found I just have to seperate them with a comma. Imported links are correctly tagged, but for each letter in tags, a tag is created (as reported in the initial issue).
To reproduce:
urls.json:
@pirate commented on GitHub (May 25, 2021):
Definitely the same bug, there's an extra
.split()or a string is being interpreted as a list/iterator somewhere. Should be relatively simple to fix, just add print statements before/after tags get parsed and created to find where the split is happening.See
archivebox/main.py: add()andparsers/__init__.pyto get started.@tmladek commented on GitHub (Jul 6, 2021):
Follow up question, is it possible to import the links with the bug as it is now, and then fix the tags later? Or anything I could do to the JSON to fix it manually?
@cdzombak commented on GitHub (Nov 16, 2021):
This also affects Pinboard HTML exports imported using the
netscape_htmlparser. I imported my Pinboard export like this:cat ./pinboard.html | docker-compose -f ~/docker-compose.yml run --rm archivebox add --parser=netscape_htmlAnd got a bunch of single-char tags:
@tmladek commented on GitHub (Nov 16, 2021):
I can confirm it's not just HTML exports but also JSON exports (see issue #782).
It's unfortunate that this issue is still present, it prevents me from using archivebox :(
@pirate commented on GitHub (Nov 16, 2021):
I understand the annoyance, I've been seeing this too lately with my imports. Will have to track down what changed recently that could affect this.
@hannah98 commented on GitHub (Dec 1, 2021):
I did a little digging because I would love to get my pinboard.json file imported as well.
I'm not sure if this will help, but this is what I found:
I ran through the debugger when adding a json file, and stepped through each line until I wound up in this method:
github.com/ArchiveBox/ArchiveBox@84b927e3e5/archivebox/core/models.py (L249-L255)where I noticed it was processing one letter of a tag at a time. The entrypoint to this function looks like it is expecting tags to be a list.
There are two places that call this function:
github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L47)github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L113)It looks like the first one passes in the tag string:
github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L36-L38)While the second one sets up a list to pass in:
github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L107-L109)I'm not really a python guy, but it looks like the first call also needs to setup the tags as a list before passing them to the
save_tagsmethod?Feel free to delete this if it doesn't help. Cheers!
@hannah98 commented on GitHub (Dec 20, 2021):
I tried out what I suggested in the previous comment in my test environment and it seems to have fixed the problem. All I did was copy the method to split out the tags into a list from further down in the same file.
One note on the tests: Without my changes, there are 8 failing tests for me. After my changes, there are still the same 8 faililng tests, so my pull request did not introduce any new test failiures.
@pirate commented on GitHub (Dec 21, 2021):
I believe @hannah98's PR fixed this, thanks so much for that! Please test on
:devand let me know if any of you have any issues.@domstubbs commented on GitHub (Dec 21, 2021):
I’ve just tested a new docker container on the dev tag and the single character issue is fixed. If I have a Pinboard tag attribute like
foo barI still end up with one tag namedfoo barrather than separatefooandbartags, but I assume that is the expected behaviour.For whatever reason the singlefile parser is now timing out for all requests, but that must be unrelated.
@hannah98 commented on GitHub (Dec 21, 2021):
Hi @domstubbs - I had been testing with a modified export from Pinboard that had spaces turned into commas in the tags attribute, so I didn't catch this in my testing.
I think this is a separate issue (than tags getting split into individual characters), but I will try take a look at this as well and see if I can figure something out.
If this helps, I created this script a while ago (I haven't polished it for public release, so take it for what it is):
https://gist.github.com/hannah98/79c99ef8ed868be54e12aff9f5bfe275
I used it to put commas in between each my tags in the pinboard json export - it creates a file named fixed.json which I would then pass into ArchiveBox.
That may get you by for now, but I'll poke at the ArchiveBox code and see if I can do something similar.
@hannah98 commented on GitHub (Dec 21, 2021):
I looked at the Pinboard requirements for tags:
https://pinboard.in/faq/#tag_allowed_chars
I was able to modify the generic_json parser to split tags on commas AND spaces, and this will work for pinboard exports based on the tag requirement above. But I don't know if this will break the generic_json parser for other sources besides pinboard... I'm not sure how to proceed on this.
@domstubbs commented on GitHub (Dec 21, 2021):
Fantastic, thanks! I was going to have a bash at something at some point but it looks like that should get me most if not all of the way. It seems like it should be possible to run a single JSON import and then use the RSS feed for ongoing updates, or at least that’s what I was hoping to work my way towards.
If a fix/modification could be added to the core that would be fantastic, but as you say, perhaps it needs to be paired with a something like a
TAG_WHITESPACE_SPLITsetting for anyone that would prefer to retain the current behaviour.@hannah98 commented on GitHub (Dec 21, 2021):
I added a replace on the generic_json parser to replace spaces with commas. Commas are later split in the sql.py script when adding tags. I ran a test of my 8,000 pinboard bookmarks and all tags were parsed successfully.
Here is the simple change I made:
github.com/hannah98/ArchiveBox@4464a4ef4fIt does the job for pinboard exports, but it is unknown if this would break the parsing of exports from other sources.
I could take a crack at using the
TAG_WHITESPACE_SPLIToption as well...@hannah98 commented on GitHub (Dec 21, 2021):
And here are my changes to implement the
TAB_WHITESPACE_SPLIToption.github.com/hannah98/ArchiveBox@b2609f0dc9I've tested it with and without the option set. The option defaults to false (to retain backward compatibility). When the option is not set, or set to False, my tags from pinboard end up like
foo bar. When the option is set to True, my tags from pinboard end up likefooandbarindividually.At this point I think the owner, @pirate, would have to chime in. Maybe it's the case that all tags should be split on spaces and then my first, oneline option would be good. But the second option provides a way for it to be toggled.
I can submit a PR for either one.
@pirate commented on GitHub (Dec 23, 2021):
I'd accept a PR to add an option like
TAG_SEPARATORS = ' ,;'which splits on any of those characters, that way ppl can tune it to their needs and it's generalizable to handle spaces/commas/etc. all in the same way.@hannah98 commented on GitHub (Dec 30, 2021):
Just submitted PR #911 to introduce the
TAG_SEPARATORSoption. It will default to a comma, which is what tags have already been split on, so if this option is not given, there should be no noticeable changes to existing users.As mentioned in the PR, I can supply the text for the wiki page that describes how to use this option.