[GH-ISSUE #874] Bookmarks/links that archive PDF content are not searchable with url:<substring> #569

Closed
opened 2026-03-02 11:50:57 +03:00 by kerem · 1 comment
Owner

Originally created by @ahgraber on GitHub (Jan 12, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/874

Describe the Bug

I'm using Hoarder to archive whitepapers from arxiv.org. When I bookmark a paper (https://arxiv.org/pdf/1706.03762), Hoarder creates an "asset" bookmark that stores the address in 'sourceUrl', rather than a "link" bookmark with a 'url' field. As a result, when I search for bookmarks with url:arxiv.org, none of the whitepapers are found.

Steps to Reproduce

  1. Create a PDF bookmark (https://arxiv.org/pdf/1706.03762)
  2. Search for url:arxiv.org

Expected Behaviour

The url: search qualifier also matches against an asset bookmark's sourceUrl, so url:arxiv.org finds the archived PDFs.

Screenshots or Additional Context

The first object is a PDF (asset) bookmark; the second is a link bookmark.

[
  {
    "id": "erie3afnpzcymf1i9oh9bauj",
    "createdAt": "2025-01-11T01:19:09.000Z",
    "title": null,
    "archived": false,
    "favourited": false,
    "taggingStatus": "success",
    "note": null,
    "summary": null,
    "tags": [{}],
    "content": {
      "type": "asset",
      "assetType": "pdf",
      "assetId": "b81dfb4a-cfb7-466f-bd27-7cd85f0c2741",
      "fileName": "1706.03762",
      "sourceUrl": "https://arxiv.org/pdf/1706.03762"
    },
    "assets": [
      {
        "id": "b81dfb4a-cfb7-466f-bd27-7cd85f0c2741",
        "assetType": "bookmarkAsset"
      }
    ]
  },
  {
    "id": "ss5bifixtn4o8pyqf3ke375e",
    "createdAt": "2025-01-12T01:29:42.000Z",
    "title": null,
    "archived": false,
    "favourited": false,
    "taggingStatus": "success",
    "note": null,
    "summary": null,
    "tags": [{}],
    "content": {
      "type": "link",
      "url": "https://magazine.sebastianraschka.com/p/understanding-large-language-models",
      "title": "Understanding Large Language Models",
      "description": "Explore the transformative power of large language models in AI. Dive into a curated reading list for ML enthusiasts. Discover the impact of transformers on NLP, vision, and biology.",
      "imageUrl": "...",
      "imageAssetId": "23500314-1405-4356-ab71-b763c02ce119",
      "favicon": "...",
      "htmlContent": "...",
      "crawledAt": "2025-01-12T01:29:46.000Z"
    },
    "assets": [
      {
        "id": "23500314-1405-4356-ab71-b763c02ce119",
        "assetType": "bannerImage"
      }
    ]
  }
]
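
To make the expected behaviour concrete, here is a minimal sketch of a URL match that falls back to 'sourceUrl' for asset bookmarks. The type names and the functions `searchableUrl` and `matchesUrlQualifier` are hypothetical, modelled only on the two JSON objects above; they are not Hoarder's actual schema or search code, and the case-insensitive substring match is an assumption about how url: behaves.

```typescript
// Hypothetical shapes based on the JSON in this report (not Hoarder's real types).
type LinkContent = { type: "link"; url: string };
type AssetContent = {
  type: "asset";
  assetType: string;
  sourceUrl?: string | null;
};
type BookmarkContent = LinkContent | AssetContent;

// Pick the URL a `url:` qualifier could be matched against:
// `url` for link bookmarks, `sourceUrl` for asset bookmarks.
function searchableUrl(content: BookmarkContent): string | null {
  switch (content.type) {
    case "link":
      return content.url;
    case "asset":
      return content.sourceUrl ?? null;
  }
}

// Assumed semantics: case-insensitive substring match on the chosen URL.
function matchesUrlQualifier(content: BookmarkContent, needle: string): boolean {
  const url = searchableUrl(content);
  return url !== null && url.toLowerCase().includes(needle.toLowerCase());
}
```

With this fallback, the PDF bookmark above would match url:arxiv.org, while an asset without a 'sourceUrl' (for example, a directly uploaded file) would simply not match.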

Device Details

No response

Exact Hoarder Version

0.21.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem
kerem 2026-03-02 11:50:57 +03:00
  • closed this issue
  • added the bug label
Author
Owner

@MohamedBassem commented on GitHub (Jan 12, 2025):

yeah, that's a bug, thanks for the report!
