[PR #2525] feat(workers): add Bilibili metascraper (phase 1 reader-view crawling) #2141

Open
opened 2026-03-02 12:00:44 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/karakeep-app/karakeep/pull/2525
Author: @CircleCrop
Created: 2/27/2026
Status: 🔄 Open

Base: main ← Head: codex/route


📝 Commits (7)

  • 1bb630e feat(workers): add bilibili metascraper plugin and harden readable HTML handling
  • 1d70a6f feat(parser): enhance HTML sanitization with referrer policy options
  • 9c71e22 feat(workers): harden bilibili metascraper metadata resolution
  • db6d500 feat(metascraper-bilibili): add fallback image for non-cover pages
  • b865925 feat(metascraper-bilibili): enhance error handling and logging for API requests
  • 5fbc04b feat(metascraper-bilibili): add avatar image exclusion in image normalization
  • 0444a41 feat(parser): enhance HTML sanitization with referrer policy validation

📊 Changes

3 files changed (+3251 additions, -16 deletions)

View changed files

📝 .gitignore (+1 -0)
➕ apps/workers/metascraper-plugins/metascraper-bilibili.ts (+3206 -0)
📝 apps/workers/scripts/parseHtmlSubprocess.ts (+44 -16)

📄 Description

Does this PR include all my current changes?

Yes.

Compared against upstream/main, this PR currently contains all changes on codex/route:

  • 6 commits
  • 3 files changed
    • .gitignore
    • apps/workers/metascraper-plugins/metascraper-bilibili.ts
    • apps/workers/scripts/parseHtmlSubprocess.ts

Summary

This PR adds phase-1 Bilibili crawling support focused on:

  • metadata completeness
  • Reader-ready HTML output
  • resilient fallback behavior under risk-control/captcha scenarios

No DB migration. No API contract changes.

Motivation

Generic metadata extraction is unreliable on Bilibili pages (especially dynamic/opus) because of anti-bot measures. This PR adds a dedicated plugin so Karakeep can produce stable metadata and readable content.

Scope

Supported URL types

  • Video: /video/BV..., /video/av..., bangumi ep/ss, festival with bvid
  • Dynamic/Opus: /opus/{id}, t.bilibili.com/{id}, /h5/dynamic/detail/{id}
  • Article/Column: /read/cv{cvid}, mobile read URLs
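
A minimal sketch of how the URL shapes listed above could be routed to a target type. `classifyBilibiliUrl`, `BilibiliTarget`, and every regular expression here are hypothetical and not taken from the plugin source; the `/bangumi/play/` and `/festival/` path prefixes are assumptions, and mobile read URLs are left out.

```typescript
// Hypothetical routing sketch; names and patterns are illustrative, not the plugin's code.
type TargetKind = "video" | "dynamic" | "article";

interface BilibiliTarget {
  kind: TargetKind;
  id: string;
}

// Returns capture group 1 of the first match, or null when there is no match.
function firstGroup(re: RegExp, text: string): string | null {
  const match = re.exec(text);
  const group = match ? match[1] : undefined;
  return group ?? null;
}

function classifyBilibiliUrl(raw: string): BilibiliTarget | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not a parseable absolute URL
  }
  const host = url.hostname.toLowerCase();
  if (host !== "bilibili.com" && !host.endsWith(".bilibili.com")) return null;
  const path = url.pathname;

  // t.bilibili.com/{id}
  if (host === "t.bilibili.com") {
    const id = firstGroup(/^\/(\d+)\/?$/, path);
    return id ? { kind: "dynamic", id } : null;
  }

  // /video/BV..., /video/av..., bangumi ep/ss, festival pages carrying ?bvid=
  const videoId =
    firstGroup(/^\/video\/(BV[0-9A-Za-z]+|av\d+)/, path) ??
    firstGroup(/^\/bangumi\/play\/((?:ep|ss)\d+)/, path) ??
    (path.startsWith("/festival/") ? url.searchParams.get("bvid") : null);
  if (videoId) return { kind: "video", id: videoId };

  // /opus/{id} and /h5/dynamic/detail/{id}
  const dynamicId =
    firstGroup(/^\/opus\/(\d+)/, path) ??
    firstGroup(/^\/h5\/dynamic\/detail\/(\d+)/, path);
  if (dynamicId) return { kind: "dynamic", id: dynamicId };

  // /read/cv{cvid}
  const articleId = firstGroup(/^\/read\/cv(\d+)/, path);
  return articleId ? { kind: "article", id: articleId } : null;
}
```

Returning `null` for anything unrecognised would let a generic metadata path handle the page instead.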

Not in scope

  • No new DB schema
  • No new REST/tRPC routes
  • No login-required/private-content guarantee

What changed

1) New Bilibili metascraper plugin

  • Added apps/workers/metascraper-plugins/metascraper-bilibili.ts
  • Provides metadata + readableContentHtml
  • Routes target type by URL and resolves via Bilibili APIs

2) Dynamic fallback chain (risk-aware)

  • opus/detail (WBI2 + WBI)
  • web-dynamic/v1/detail
  • desktop/v1/detail (desktop cookie)
  • legacy dynamic_svr/get_dynamic_detail
  • article fallback when opus payload indicates article target
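
The chain above amounts to trying endpoints in a fixed order until one yields a usable payload. A rough sketch, assuming each endpoint is wrapped in an async function that returns `null` when it has nothing usable; `DynamicResolver`, `resolveWithFallback`, and the payload shape are hypothetical, and the plugin may be structured differently.

```typescript
// Hypothetical sketch of an ordered fallback chain; not the plugin's actual code.
interface DynamicPayload {
  title: string;
  html: string;
}

interface DynamicResolver {
  label: string; // e.g. "opus/detail" or "web-dynamic/v1/detail"
  resolve: (id: string) => Promise<DynamicPayload | null>;
}

interface ResolveResult {
  payload: DynamicPayload;
  via: string; // label of the step that succeeded
  failed: string[]; // labels of the steps tried before it
}

async function resolveWithFallback(
  id: string,
  chain: readonly DynamicResolver[],
): Promise<ResolveResult | null> {
  const failed: string[] = [];
  for (const step of chain) {
    try {
      const payload = await step.resolve(id);
      if (payload) return { payload, via: step.label, failed };
    } catch {
      // A thrown error (network failure, blocked request) is treated like an empty result.
    }
    failed.push(step.label);
  }
  return null; // bounded: every step was tried exactly once
}
```

Recording `via` and `failed` makes it easy to log which endpoint finally served a crawl.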

3) Metadata quality and defaults

  • ISO date normalization (datePublished, article dateModified)
  • safer title fallbacks for dynamic pages
  • default fallback image for dynamic/article when cover is missing
  • avatar URL exclusion for *.hdslb.com/bfs/face/*
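
A sketch of the date normalization and avatar exclusion described above. It assumes timestamps arrive as Unix seconds and that image candidates are plain URL strings; `toIsoDate`, `isAvatarUrl`, and `pickImage` are hypothetical names, not functions from the plugin.

```typescript
// Hypothetical helpers; names and input shapes are assumptions, not the plugin's code.

/** Converts a Unix timestamp in seconds to an ISO 8601 string, or undefined if unusable. */
function toIsoDate(seconds: unknown): string | undefined {
  if (typeof seconds !== "number" || !Number.isFinite(seconds) || seconds <= 0) {
    return undefined;
  }
  const date = new Date(seconds * 1000);
  // Out-of-range values produce an invalid Date, for which toISOString() would throw.
  return Number.isNaN(date.getTime()) ? undefined : date.toISOString();
}

/** True for URLs of the form *.hdslb.com/bfs/face/*, i.e. user avatars. */
function isAvatarUrl(raw: string): boolean {
  try {
    // Accept protocol-relative URLs such as //i0.hdslb.com/...
    const url = new URL(raw.startsWith("//") ? `https:${raw}` : raw);
    return url.hostname.endsWith(".hdslb.com") && url.pathname.startsWith("/bfs/face/");
  } catch {
    return false;
  }
}

/** Picks the first candidate that is not an avatar, else the fallback image. */
function pickImage(candidates: readonly string[], fallback: string): string {
  return candidates.find((candidate) => !isAvatarUrl(candidate)) ?? fallback;
}
```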

4) Worker parser integration

  • Registered metascraper-bilibili in parseHtmlSubprocess.ts
  • Extended DOMPurify options to preserve referrerpolicy in sanitized HTML
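
One way to preserve `referrerpolicy` while still validating its value, as a sketch. `ADD_ATTR` is an existing DOMPurify option; `ALLOWED_REFERRER_POLICIES`, `normalizeReferrerPolicy`, and `sanitizeOptions` are hypothetical names, and this is not necessarily how `parseHtmlSubprocess.ts` does it.

```typescript
// Hypothetical sketch; not the code from parseHtmlSubprocess.ts.

// Policy tokens defined by the W3C Referrer Policy specification.
const ALLOWED_REFERRER_POLICIES: ReadonlySet<string> = new Set([
  "no-referrer",
  "no-referrer-when-downgrade",
  "origin",
  "origin-when-cross-origin",
  "same-origin",
  "strict-origin",
  "strict-origin-when-cross-origin",
  "unsafe-url",
]);

/** Returns the canonical policy token, or null when the value should be dropped. */
function normalizeReferrerPolicy(value: string): string | null {
  const token = value.trim().toLowerCase();
  return ALLOWED_REFERRER_POLICIES.has(token) ? token : null;
}

// Options object intended for DOMPurify.sanitize(html, sanitizeOptions).
// ADD_ATTR extends the default attribute allow-list instead of replacing it.
const sanitizeOptions = {
  ADD_ATTR: ["referrerpolicy"],
};
```

The validator could be called from a DOMPurify `uponSanitizeAttribute` hook, setting `data.keepAttr = false` when `data.attrName` is `referrerpolicy` and the value does not normalize. That wiring is an assumption.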

5) Misc

  • .gitignore now ignores .mcp-vector-search

Behavior notes

  • Videos are expected to keep the real cover from the video payload.
  • Dynamic/article pages with no valid cover use the fallback image instead of ad-like or irrelevant images.
  • Under risk control (response codes -352/-403), dynamic resolution retries and then downgrades to fallback endpoints.
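
The retry-then-downgrade behavior in the last note can be written as a small pure decision function. This sketch assumes -352 and -403 arrive in the `code` field of the API response and that `0` means success; `RISK_CONTROL_CODES`, `nextAction`, and the retry budget are hypothetical, as is the choice to downgrade immediately on unrelated errors.

```typescript
// Hypothetical decision helper; not the plugin's actual code.
const RISK_CONTROL_CODES: ReadonlySet<number> = new Set([-352, -403]);

type Action = "accept" | "retry" | "downgrade";

/**
 * Decides what to do after an endpoint call.
 * `attempt` is the 1-based number of the call that just finished.
 */
function nextAction(code: number, attempt: number, maxAttempts: number): Action {
  if (code === 0) return "accept";
  // Retries are spent only on risk-control codes; other errors move on immediately.
  if (!RISK_CONTROL_CODES.has(code)) return "downgrade";
  return attempt < maxAttempts ? "retry" : "downgrade";
}
```

With a budget of two attempts, a first `-352` yields `"retry"` and a second yields `"downgrade"`, which moves resolution to the next fallback endpoint.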

Testing

  • pnpm --filter @karakeep/workers typecheck
  • Manual crawl verification:
    • video (BV/av)
    • dynamic (text/image/forward)
    • article (cv)
    • risk-control fallback path

Risks / Limitations

  • Public endpoints may still be blocked intermittently by region/IP quality.
  • Dynamic payload shape may change; parser is best-effort with bounded fallback.

Screenshots

Card Preview

image

Metadata

image

Content Preview

Although the screenshot shows a captcha page, the reader view still displays the correct content.

image image

Acknowledgements

This PR’s endpoint selection and fallback strategy were informed by:

  • Nemo2011/bilibili-api (https://github.com/Nemo2011/bilibili-api)

Checklist

  • [x] No breaking API changes
  • [x] No DB migration
  • [x] Worker typecheck passes
  • [x] Added fallback behavior for risk-control scenarios

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
