[GH-ISSUE #891] Bug: Pinboard RSS import doesn't split tags by whitespace #3572

Closed
opened 2026-03-14 23:32:20 +03:00 by kerem · 1 comment
Owner

Originally created by @cdzombak on GitHub (Nov 16, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/891

Describe the bug

After importing a link directly from a Pinboard RSS feed, the link has a single tag, which is all the Pinboard tags joined by hyphens.

ArchiveBox appears to be taking Pinboard's whitespace-separated tag list and, rather than splitting it on whitespace, replacing the spaces with hyphens.
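
To make the difference concrete, here is a minimal sketch of the two behaviors. The function names and the sample tag string are hypothetical and are not taken from the ArchiveBox source; the second function only mimics what the symptoms suggest.

```python
def split_tags(raw: str) -> list[str]:
    """Expected behavior: one tag per whitespace-separated token."""
    # str.split() with no argument splits on any run of whitespace
    # and drops empty strings.
    return raw.split()


def hyphenate_tags(raw: str) -> list[str]:
    """Observed behavior (as it appears from the outside): a single tag
    in which the spaces have been replaced by hyphens."""
    return ["-".join(raw.split())]


if __name__ == "__main__":
    raw = "python archiving self-hosted"  # hypothetical Pinboard tag string
    print("expected:", split_tags(raw))
    print("observed:", hyphenate_tags(raw))
```

Note that a tag which already contains a hyphen (such as `self-hosted` here) cannot be told apart from a joined pair once the spaces are replaced, so the merged tag cannot be reliably split back afterwards.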

Example: given this Pinboard link: https://pinboard.in/u:cdzombak/b:f06aca53892f with several tags…

Screen Shot 2021-11-16 at 10 27 21 AM

ArchiveBox imported the tags like this:

Screen Shot 2021-11-16 at 10 25 24 AM

Steps to reproduce

  1. Import some tagged Pinboard links using the pinboard-rss parser: curl -s https://feeds.pinboard.in/rss/secret:REMOVED/u:cdzombak/ | docker-compose -f ~/docker-compose.yml run --rm archivebox add --parser=pinboard_rss
  2. Observe that the imported links are not tagged correctly
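
For anyone checking what the feed itself contains before it reaches ArchiveBox, the following sketch extracts tags from a single feed item. It assumes the tags arrive as one space-separated string in a Dublin Core `dc:subject` element; the sample item is made up for illustration, and `item_tags` is a hypothetical helper, not ArchiveBox's parser.

```python
import xml.etree.ElementTree as ET

# Assumption: tags live in <dc:subject> under the Dublin Core namespace.
DC_SUBJECT = "{http://purl.org/dc/elements/1.1/}subject"

# Hypothetical feed item, not copied from a real Pinboard feed.
SAMPLE_ITEM = """
<item xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>Example bookmark</title>
  <link>https://example.com/</link>
  <dc:subject>python archiving self-hosted</dc:subject>
</item>
"""


def item_tags(item_xml: str) -> list[str]:
    """Return the item's tags, split on whitespace."""
    item = ET.fromstring(item_xml.strip())
    subject = item.find(DC_SUBJECT)
    if subject is None or not subject.text:
        return []  # untagged item
    return subject.text.split()


if __name__ == "__main__":
    print(item_tags(SAMPLE_ITEM))
```

If the real feed yields several separate tags this way while the imported snapshot shows one hyphenated tag, the problem is on the import side rather than in the feed.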

ArchiveBox version

$ docker-compose run --rm archivebox version
ArchiveBox v0.6.3
Cpython Linux Linux-5.10.0-0.bpo.8-amd64-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.8          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.13         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /data
 √  SOURCES_DIR           5 files         valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           681 files       valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             9.2 MB          valid     ./index.sqlite3
kerem closed this issue 2026-03-14 23:32:25 +03:00
Author
Owner

@pirate commented on GitHub (Nov 17, 2021):

I'm going to merge this with #725 as it seems this and similar issues are affecting tag-splitting across multiple parsers and I might as well fix them all together.
