[GH-ISSUE #725] Pocket and Pinboard imports causing tags to be split incorrectly into individual characters w/ broken hyphenation #3479

Closed
opened 2026-03-14 23:06:48 +03:00 by kerem · 17 comments
Owner

Originally created by @pirate on GitHub (Apr 27, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/725

A simple bug due to a .split() or set() somewhere on the tags_str instead of the tags list. Should be easy to fix.

image

We should also add a filter to prevent emptystring / whitespace-only tags:
image

Originally created by @pirate on GitHub (Apr 27, 2021). Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/725 A simple bug due to a `.split()` or `set()` somewhere on the tags_str instead of the tags list. Should be easy to fix. ![image](https://user-images.githubusercontent.com/511499/116295736-68ffd980-a767-11eb-8794-beb101d9703d.png) We should also add a filter to prevent emptystring / whitespace-only tags: ![image](https://user-images.githubusercontent.com/511499/116295924-9d739580-a767-11eb-80db-a29c1f32d394.png)
Author
Owner

@philippemilink commented on GitHub (May 24, 2021):

I have a bug similar to this one, so instead of opening a new issue, I will complete this bug report.

With a JSON input, I want to assign several different tags to a link. I found I just have to seperate them with a comma. Imported links are correctly tagged, but for each letter in tags, a tag is created (as reported in the initial issue).

To reproduce:

# Configuration isn't really relevant, just faster to test:
SAVE_SINGLEFILE=False USE_CHROME=False USE_NODE=False USE_YOUTUBEDL=False USE_READABILITY=False USE_MERCURY=False archivebox init --setup
archivebox config --set SAVE_SINGLEFILE=False USE_CHROME=False USE_NODE=False USE_YOUTUBEDL=False USE_READABILITY=False USE_MERCURY=False SAVE_ARCHIVE_DOT_ORG=False
archivebox add --parser json < urls.json

urls.json:

[
    {"url": "https://archivebox.io/", "tags": "test,archivebox"},
    {"url": "https://fr.wikipedia.org/wiki/Gallium", "tags": "test,wikipedia"}
]

image

<!-- gh-comment-id:847099184 --> @philippemilink commented on GitHub (May 24, 2021): I have a bug similar to this one, so instead of opening a new issue, I will complete this bug report. With a JSON input, I want to assign several different tags to a link. I found I just have to seperate them with a comma. Imported links are correctly tagged, but for each letter in tags, a tag is created (as reported in the initial issue). To reproduce: ```bash # Configuration isn't really relevant, just faster to test: SAVE_SINGLEFILE=False USE_CHROME=False USE_NODE=False USE_YOUTUBEDL=False USE_READABILITY=False USE_MERCURY=False archivebox init --setup archivebox config --set SAVE_SINGLEFILE=False USE_CHROME=False USE_NODE=False USE_YOUTUBEDL=False USE_READABILITY=False USE_MERCURY=False SAVE_ARCHIVE_DOT_ORG=False archivebox add --parser json < urls.json ``` urls.json: ```json [ {"url": "https://archivebox.io/", "tags": "test,archivebox"}, {"url": "https://fr.wikipedia.org/wiki/Gallium", "tags": "test,wikipedia"} ] ``` ![image](https://user-images.githubusercontent.com/5911232/119366278-1572a400-bcb1-11eb-9b14-297e123ae949.png)
Author
Owner

@pirate commented on GitHub (May 25, 2021):

Definitely the same bug, there's an extra .split() or a string is being interpreted as a list/iterator somewhere. Should be relatively simple to fix, just add print statements before/after tags get parsed and created to find where the split is happening.

See archivebox/main.py: add() and parsers/__init__.py to get started.

<!-- gh-comment-id:848109293 --> @pirate commented on GitHub (May 25, 2021): Definitely the same bug, there's an extra `.split()` or a string is being interpreted as a list/iterator somewhere. Should be relatively simple to fix, just add print statements before/after tags get parsed and created to find where the split is happening. See `archivebox/main.py: add()` and `parsers/__init__.py` to get started.
Author
Owner

@tmladek commented on GitHub (Jul 6, 2021):

Follow up question, is it possible to import the links with the bug as it is now, and then fix the tags later? Or anything I could do to the JSON to fix it manually?

<!-- gh-comment-id:874601978 --> @tmladek commented on GitHub (Jul 6, 2021): Follow up question, is it possible to import the links with the bug as it is now, and then fix the tags later? Or anything I could do to the JSON to fix it manually?
Author
Owner

@cdzombak commented on GitHub (Nov 16, 2021):

This also affects Pinboard HTML exports imported using the netscape_html parser. I imported my Pinboard export like this: cat ./pinboard.html | docker-compose -f ~/docker-compose.yml run --rm archivebox add --parser=netscape_html

And got a bunch of single-char tags:

Screen Shot 2021-11-16 at 10 37 02 AM
<!-- gh-comment-id:970396521 --> @cdzombak commented on GitHub (Nov 16, 2021): This also affects Pinboard HTML exports imported using the `netscape_html` parser. I imported my Pinboard export like this: `cat ./pinboard.html | docker-compose -f ~/docker-compose.yml run --rm archivebox add --parser=netscape_html` And got a bunch of single-char tags: <img width="458" alt="Screen Shot 2021-11-16 at 10 37 02 AM" src="https://user-images.githubusercontent.com/102904/142016462-6bdd8e5b-ed06-43d0-a40a-83d9b097b739.png">
Author
Owner

@tmladek commented on GitHub (Nov 16, 2021):

I can confirm it's not just HTML exports but also JSON exports (see issue #782).

It's unfortunate that this issue is still present, it prevents me from using archivebox :(

<!-- gh-comment-id:970400148 --> @tmladek commented on GitHub (Nov 16, 2021): I can confirm it's not just HTML exports but also JSON exports (see issue #782). It's unfortunate that this issue is still present, it prevents me from using archivebox :(
Author
Owner

@pirate commented on GitHub (Nov 16, 2021):

I understand the annoyance, I've been seeing this too lately with my imports. Will have to track down what changed recently that could affect this.

<!-- gh-comment-id:970607760 --> @pirate commented on GitHub (Nov 16, 2021): I understand the annoyance, I've been seeing this too lately with my imports. Will have to track down what changed recently that could affect this.
Author
Owner

@hannah98 commented on GitHub (Dec 1, 2021):

I did a little digging because I would love to get my pinboard.json file imported as well.

I'm not sure if this will help, but this is what I found:
I ran through the debugger when adding a json file, and stepped through each line until I wound up in this method:
github.com/ArchiveBox/ArchiveBox@84b927e3e5/archivebox/core/models.py (L249-L255)

where I noticed it was processing one letter of a tag at a time. The entrypoint to this function looks like it is expecting tags to be a list.

There are two places that call this function:

github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L47)

github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L113)

It looks like the first one passes in the tag string:
github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L36-L38)

While the second one sets up a list to pass in:
github.com/ArchiveBox/ArchiveBox@32764347ce/archivebox/index/sql.py (L107-L109)

I'm not really a python guy, but it looks like the first call also needs to setup the tags as a list before passing them to the save_tags method?

Feel free to delete this if it doesn't help. Cheers!

<!-- gh-comment-id:983879811 --> @hannah98 commented on GitHub (Dec 1, 2021): I did a little digging because I would love to get my pinboard.json file imported as well. I'm not sure if this will help, but this is what I found: I ran through the debugger when adding a json file, and stepped through each line until I wound up in this method: https://github.com/ArchiveBox/ArchiveBox/blob/84b927e3e5fb8da93fb86a9070291495a7563a35/archivebox/core/models.py#L249-L255 where I noticed it was processing one letter of a tag at a time. The entrypoint to this function looks like it is expecting tags to be a list. There are two places that call this function: https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L47 https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L113 It looks like the first one passes in the tag string: https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L36-L38 While the second one sets up a list to pass in: https://github.com/ArchiveBox/ArchiveBox/blob/32764347ce2e59919f763c552bd3e250f49c2f5b/archivebox/index/sql.py#L107-L109 I'm not really a python guy, but it looks like the first call also needs to setup the tags as a list before passing them to the ```save_tags``` method? Feel free to delete this if it doesn't help. Cheers!
Author
Owner

@hannah98 commented on GitHub (Dec 20, 2021):

I tried out what I suggested in the previous comment in my test environment and it seems to have fixed the problem. All I did was copy the method to split out the tags into a list from further down in the same file.

One note on the tests: Without my changes, there are 8 failing tests for me. After my changes, there are still the same 8 faililng tests, so my pull request did not introduce any new test failiures.

<!-- gh-comment-id:998007440 --> @hannah98 commented on GitHub (Dec 20, 2021): I tried out what I suggested in the previous comment in my test environment and it seems to have fixed the problem. All I did was copy the method to split out the tags into a list from further down in the same file. One note on the tests: Without my changes, there are 8 failing tests for me. After my changes, there are still the same 8 faililng tests, so my pull request did not introduce any new test failiures.
Author
Owner

@pirate commented on GitHub (Dec 21, 2021):

I believe @hannah98's PR fixed this, thanks so much for that! Please test on :dev and let me know if any of you have any issues.

<!-- gh-comment-id:998483666 --> @pirate commented on GitHub (Dec 21, 2021): I believe @hannah98's PR fixed this, thanks so much for that! Please [test on `:dev`](https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch) and let me know if any of you have any issues.
Author
Owner

@domstubbs commented on GitHub (Dec 21, 2021):

I’ve just tested a new docker container on the dev tag and the single character issue is fixed. If I have a Pinboard tag attribute like foo bar I still end up with one tag named foo bar rather than separate foo and bar tags, but I assume that is the expected behaviour.

For whatever reason the singlefile parser is now timing out for all requests, but that must be unrelated.

<!-- gh-comment-id:998754843 --> @domstubbs commented on GitHub (Dec 21, 2021): I’ve just tested a new docker container on the dev tag and the single character issue is fixed. If I have a Pinboard tag attribute like `foo bar` I still end up with one tag named `foo bar` rather than separate `foo` and `bar` tags, but I assume that is the expected behaviour. For whatever reason the singlefile parser is now timing out for all requests, but that must be unrelated.
Author
Owner

@hannah98 commented on GitHub (Dec 21, 2021):

Hi @domstubbs - I had been testing with a modified export from Pinboard that had spaces turned into commas in the tags attribute, so I didn't catch this in my testing.

I think this is a separate issue (than tags getting split into individual characters), but I will try take a look at this as well and see if I can figure something out.

If this helps, I created this script a while ago (I haven't polished it for public release, so take it for what it is):
https://gist.github.com/hannah98/79c99ef8ed868be54e12aff9f5bfe275

I used it to put commas in between each my tags in the pinboard json export - it creates a file named fixed.json which I would then pass into ArchiveBox.

That may get you by for now, but I'll poke at the ArchiveBox code and see if I can do something similar.

<!-- gh-comment-id:998787589 --> @hannah98 commented on GitHub (Dec 21, 2021): Hi @domstubbs - I had been testing with a modified export from Pinboard that had spaces turned into commas in the tags attribute, so I didn't catch this in my testing. I think this is a separate issue (than tags getting split into individual characters), but I will try take a look at this as well and see if I can figure something out. If this helps, I created this script a while ago (I haven't polished it for public release, so take it for what it is): https://gist.github.com/hannah98/79c99ef8ed868be54e12aff9f5bfe275 I used it to put commas in between each my tags in the pinboard json export - it creates a file named fixed.json which I would then pass into ArchiveBox. That may get you by for now, but I'll poke at the ArchiveBox code and see if I can do something similar.
Author
Owner

@hannah98 commented on GitHub (Dec 21, 2021):

I looked at the Pinboard requirements for tags:
https://pinboard.in/faq/#tag_allowed_chars

Any characters except comma or whitespace are allowed in tags, including emoji.

I was able to modify the generic_json parser to split tags on commas AND spaces, and this will work for pinboard exports based on the tag requirement above. But I don't know if this will break the generic_json parser for other sources besides pinboard... I'm not sure how to proceed on this.

<!-- gh-comment-id:998825990 --> @hannah98 commented on GitHub (Dec 21, 2021): I looked at the Pinboard requirements for tags: https://pinboard.in/faq/#tag_allowed_chars > Any characters except comma or whitespace are allowed in tags, including emoji. I was able to modify the generic_json parser to split tags on commas AND spaces, and this will work for pinboard exports based on the tag requirement above. But I don't know if this will break the generic_json parser for other sources besides pinboard... I'm not sure how to proceed on this.
Author
Owner

@domstubbs commented on GitHub (Dec 21, 2021):

Fantastic, thanks! I was going to have a bash at something at some point but it looks like that should get me most if not all of the way. It seems like it should be possible to run a single JSON import and then use the RSS feed for ongoing updates, or at least that’s what I was hoping to work my way towards.

If a fix/modification could be added to the core that would be fantastic, but as you say, perhaps it needs to be paired with a something like a TAG_WHITESPACE_SPLIT setting for anyone that would prefer to retain the current behaviour.

<!-- gh-comment-id:998848879 --> @domstubbs commented on GitHub (Dec 21, 2021): Fantastic, thanks! I was going to have a bash at something at some point but it looks like that should get me most if not all of the way. It seems like it should be possible to run a single JSON import and then use the RSS feed for ongoing updates, or at least that’s what I was hoping to work my way towards. If a fix/modification could be added to the core that would be fantastic, but as you say, perhaps it needs to be paired with a something like a `TAG_WHITESPACE_SPLIT` setting for anyone that would prefer to retain the current behaviour.
Author
Owner

@hannah98 commented on GitHub (Dec 21, 2021):

I added a replace on the generic_json parser to replace spaces with commas. Commas are later split in the sql.py script when adding tags. I ran a test of my 8,000 pinboard bookmarks and all tags were parsed successfully.

Here is the simple change I made:

github.com/hannah98/ArchiveBox@4464a4ef4f

It does the job for pinboard exports, but it is unknown if this would break the parsing of exports from other sources.

I could take a crack at using the TAG_WHITESPACE_SPLIT option as well...

<!-- gh-comment-id:998855340 --> @hannah98 commented on GitHub (Dec 21, 2021): I added a replace on the generic_json parser to replace spaces with commas. Commas are later split in the sql.py script when adding tags. I ran a test of my 8,000 pinboard bookmarks and all tags were parsed successfully. Here is the simple change I made: https://github.com/hannah98/ArchiveBox/commit/4464a4ef4f21f9bf46759ea7f2ed05b425156dbf It does the job for pinboard exports, but it is unknown if this would break the parsing of exports from other sources. I could take a crack at using the ```TAG_WHITESPACE_SPLIT``` option as well...
Author
Owner

@hannah98 commented on GitHub (Dec 21, 2021):

And here are my changes to implement the TAB_WHITESPACE_SPLIT option.

github.com/hannah98/ArchiveBox@b2609f0dc9

I've tested it with and without the option set. The option defaults to false (to retain backward compatibility). When the option is not set, or set to False, my tags from pinboard end up like foo bar. When the option is set to True, my tags from pinboard end up like foo and bar individually.

At this point I think the owner, @pirate, would have to chime in. Maybe it's the case that all tags should be split on spaces and then my first, oneline option would be good. But the second option provides a way for it to be toggled.

I can submit a PR for either one.

<!-- gh-comment-id:998894498 --> @hannah98 commented on GitHub (Dec 21, 2021): And here are my changes to implement the ```TAB_WHITESPACE_SPLIT``` option. https://github.com/hannah98/ArchiveBox/commit/b2609f0dc9be76f60cb6fde1ab97c6a81ce03376 I've tested it with and without the option set. The option defaults to false (to retain backward compatibility). When the option is not set, or set to False, my tags from pinboard end up like ```foo bar```. When the option is set to True, my tags from pinboard end up like ```foo``` and ```bar``` individually. At this point I think the owner, @pirate, would have to chime in. Maybe it's the case that all tags should be split on spaces and then my first, oneline option would be good. But the second option provides a way for it to be toggled. I can submit a PR for either one.
Author
Owner

@pirate commented on GitHub (Dec 23, 2021):

I'd accept a PR to add an option like TAG_SEPARATORS = ' ,;' which splits on any of those characters, that way ppl can tune it to their needs and it's generalizable to handle spaces/commas/etc. all in the same way.

<!-- gh-comment-id:1000540546 --> @pirate commented on GitHub (Dec 23, 2021): I'd accept a PR to add an option like `TAG_SEPARATORS = ' ,;'` which splits on any of those characters, that way ppl can tune it to their needs and it's generalizable to handle spaces/commas/etc. all in the same way.
Author
Owner

@hannah98 commented on GitHub (Dec 30, 2021):

Just submitted PR #911 to introduce the TAG_SEPARATORS option. It will default to a comma, which is what tags have already been split on, so if this option is not given, there should be no noticeable changes to existing users.

As mentioned in the PR, I can supply the text for the wiki page that describes how to use this option.

<!-- gh-comment-id:1003173069 --> @hannah98 commented on GitHub (Dec 30, 2021): Just submitted PR #911 to introduce the ```TAG_SEPARATORS``` option. It will default to a comma, which is what tags have already been split on, so if this option is not given, there should be no noticeable changes to existing users. As mentioned in the PR, I can supply the text for the wiki page that describes how to use this option.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/ArchiveBox#3479
No description provided.