[PR #2488] feat: Add content quality assessment and platform-specific renderers #2129

opened 2026-03-02 12:00:41 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/karakeep-app/karakeep/pull/2488
Author: @MohamedBassem
Created: 2/15/2026
Status: 🔄 Open

Base: main ← Head: claude/reader-view-non-articles-1eQ0i


📝 Commits (6)

  • 124fe4c feat: improve reader view for non-article pages
  • a325f0d Merge branch 'main' into claude/reader-view-non-articles-1eQ0i
  • 57b370c fix parsing
  • 9e5c6cf Merge branch 'main' into claude/reader-view-non-articles-1eQ0i
  • 29b6c73 revert renderers
  • 4232568 change recommendation

📊 Changes

13 files changed (+3402 additions, -5 deletions)

View changed files

📝 apps/web/components/dashboard/preview/LinkContentSection.tsx (+73 -3)
📝 apps/web/lib/i18n/locales/en/translation.json (+3 -0)
📝 apps/workers/scripts/parseHtmlSubprocess.ts (+46 -0)
📝 apps/workers/workers/crawlerWorker.ts (+8 -2)
📝 apps/workers/workers/utils/parseHtmlSubprocessIpc.ts (+1 -0)
➕ packages/db/drizzle/0080_add_content_quality.sql (+1 -0)
➕ packages/db/drizzle/meta/0080_snapshot.json (+3247 -0)
📝 packages/db/drizzle/meta/_journal.json (+7 -0)
📝 packages/db/schema.ts (+3 -0)
📝 packages/open-api/karakeep-openapi-spec.json (+8 -0)
📝 packages/sdk/src/karakeep-api.d.ts (+2 -0)
📝 packages/shared/types/bookmarks.ts (+1 -0)
📝 packages/trpc/models/bookmarks.ts (+2 -0)

📄 Description

Summary

This PR enhances the bookmark preview system by introducing content quality assessment for crawled articles and adding specialized renderers for popular platforms (GitHub, Reddit, Stack Overflow, Hacker News).

Key Changes

  • Content Quality Assessment: Added contentQuality field to track whether extracted article content is "good" or "poor"

    • New assessContentQuality() function in the crawler worker evaluates the extracted readability content
    • Assessment is based on text length, paragraph count, and link density
    • Result is stored in the database and exposed through the API
  • Platform-Specific Renderers: Implemented custom content renderers for:

    • GitHub: Displays repository, issue, and pull request information with branch icon
    • Reddit: Shows subreddit and post details with message icon
    • Stack Overflow: Renders question information with help icon
    • Hacker News: Displays item information with newspaper icon
  • Smart Default View Selection: Updated LinkContentSection to intelligently choose the best preview based on:

    1. Custom renderer availability (YouTube, X, GitHub, etc.)
    2. Content quality ("good" content → reader view)
    3. Available assets (archive → screenshot → fallback to reader)
  • Database & Schema Updates:

    • Added contentQuality column to bookmarkLinks table
    • Updated Zod schemas, TypeScript types, and API specifications
    • New migration: 0080_add_content_quality.sql
  • Internationalization: Added translation keys for content quality messaging
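
The default view selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in LinkContentSection.tsx: the names `PreviewKind`, `LinkPreviewInput`, `ContentRenderer`, `githubRenderer`, and `pickDefaultPreview`, as well as the GitHub URL regex, are assumptions made up for this example. Only the precedence order (custom renderer, then "good" content, then archive, screenshot, and reader as fallback) comes from the description.

```typescript
// Hypothetical sketch: all type and function names here are assumptions.
type ContentQuality = "good" | "poor";
type PreviewKind = "custom" | "reader" | "archive" | "screenshot";

interface LinkPreviewInput {
  url: string;
  // null for bookmarks crawled before the contentQuality column existed
  contentQuality: ContentQuality | null;
  hasArchive: boolean;
  hasScreenshot: boolean;
}

interface ContentRenderer {
  id: string;
  // Returns true when this renderer knows how to display the URL
  canRender(url: string): boolean;
}

// Example renderer; the regex is an illustrative guess at a repository URL shape.
const githubRenderer: ContentRenderer = {
  id: "github",
  canRender: (url) =>
    /^https?:\/\/(www\.)?github\.com\/[^/]+\/[^/]+/.test(url),
};

function pickDefaultPreview(
  input: LinkPreviewInput,
  renderers: ContentRenderer[],
): PreviewKind {
  // 1. A matching custom renderer wins over everything else.
  if (renderers.some((r) => r.canRender(input.url))) return "custom";
  // 2. Content assessed as "good" goes to the reader view.
  if (input.contentQuality === "good") return "reader";
  // 3. Otherwise prefer available assets: archive, then screenshot.
  if (input.hasArchive) return "archive";
  if (input.hasScreenshot) return "screenshot";
  // 4. Nothing better is available, so fall back to the reader view.
  return "reader";
}
```

With this ordering, a bookmark whose contentQuality is null (crawled before the migration) behaves like a "poor" one and falls through to the asset checks.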

Implementation Details

  • URL extraction uses regex patterns to identify platform-specific content types
  • Renderers follow a consistent pattern, each exposing canRender() and a component function
  • Content quality assessment uses JSDOM to parse HTML and analyze text metrics
  • Backward compatible: contentQuality is nullable for existing bookmarks
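
To make the three metrics concrete, here is a rough sketch of how a "good"/"poor" decision could be derived from text length, paragraph count, and link density. It is not the PR's assessContentQuality(): the thresholds are invented for illustration, the helper names `stripTags` and `innerTexts` are hypothetical, and the HTML is scanned with regular expressions instead of JSDOM purely to keep the example dependency-free.

```typescript
// Hypothetical sketch: thresholds and helper names are assumptions, not from the PR.
type ContentQuality = "good" | "poor";

interface QualityThresholds {
  minTextLength: number; // minimum number of visible characters
  minParagraphs: number; // minimum number of non-empty <p> elements
  maxLinkDensity: number; // max share of text that sits inside <a> elements
}

const DEFAULT_THRESHOLDS: QualityThresholds = {
  minTextLength: 500,
  minParagraphs: 3,
  maxLinkDensity: 0.5,
};

// Remove tags and collapse whitespace. The PR parses with JSDOM; a regex
// stands in here only so the sketch runs without dependencies.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Collect the visible text of every <tag>...</tag> element (non-nested match).
function innerTexts(html: string, tag: string): string[] {
  const re = new RegExp(
    "<" + tag + "\\b[^>]*>([\\s\\S]*?)</" + tag + ">",
    "gi",
  );
  const texts: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = re.exec(html)) !== null) {
    texts.push(stripTags(match[1]));
  }
  return texts;
}

function assessContentQuality(
  html: string,
  t: QualityThresholds = DEFAULT_THRESHOLDS,
): ContentQuality {
  const text = stripTags(html);
  if (text.length < t.minTextLength) return "poor";

  const paragraphs = innerTexts(html, "p").filter((p) => p.length > 0);
  if (paragraphs.length < t.minParagraphs) return "poor";

  // Link density: characters inside links divided by all visible characters.
  const linkChars = innerTexts(html, "a").reduce((n, s) => n + s.length, 0);
  const linkDensity = text.length === 0 ? 0 : linkChars / text.length;
  return linkDensity > t.maxLinkDensity ? "poor" : "good";
}
```

Pages that are mostly navigation (long text, but nearly all of it inside links) are rated "poor" by the link density check even when they pass the length and paragraph checks.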

https://claude.ai/code/session_01Pwf8TNbR26KSW7WxveMymS


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
