starred/ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-25 17:16:00 +03:00

[GH-ISSUE #1094] Feature Request: Adding a `DOM_WHITELIST` to fine tune the `depth` parameter #2195

Closed

opened 2026-03-01 17:57:12 +03:00 by kerem · 2 comments

kerem commented

2026-03-01 17:57:12 +03:00

Owner

Originally created by @diego898 on GitHub (Feb 7, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1094

Type

Propose a brand new feature

What is the problem that your feature request solves

Archiving "structured discussion" pages like a HackerNews post

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

I'm a frequent reader of a discussion and link aggregator called [HackerNews](https://news.ycombinator.com/). I frequently want to archive both the main link and the discussion on HN.

Right now, I have to open the discussion page for a post and then click through to the main URL, so that I end up with two URLs that I can then archive.

It would be great if I could specify `depth=1` and use something like a `DOM_WHITELIST` to capture only the DOM element called `titlelink`, and not any of the other million links on a HackerNews post (parent, next, etc. on each comment).
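To make the request concrete, here is a minimal sketch of the filtering such a whitelist might perform. It is not ArchiveBox code: `DOM_WHITELIST` does not exist in ArchiveBox, and the helper names, the sample HTML, and the assumption that the target link is an `<a>` tag with the class `titlelink` are all hypothetical and taken only from the description above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class WhitelistedLinkParser(HTMLParser):
    """Collect href values only from <a> tags that carry a whitelisted class."""

    def __init__(self, base_url, allowed_classes):
        super().__init__()
        self.base_url = base_url
        self.allowed_classes = set(allowed_classes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A valueless class attribute is reported as None, so fall back to "".
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.allowed_classes.intersection(classes):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_whitelisted_links(html, base_url, allowed_classes=("titlelink",)):
    """Hypothetical helper: return only the links matched by the whitelist."""
    parser = WhitelistedLinkParser(base_url, allowed_classes)
    parser.feed(html)
    return parser.links


# Made-up sample markup; the real page structure is an assumption.
SAMPLE_HTML = """
<a href="https://example.com/article" class="titlelink">Some article</a>
<a href="item?id=1">parent</a>
<a href="item?id=2">next</a>
"""

if __name__ == "__main__":
    for url in extract_whitelisted_links(SAMPLE_HTML, "https://news.ycombinator.com/"):
        print(url)
```

With a filter like this, a depth-1 crawl would follow only the title link and skip the per-comment navigation links.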

How badly do you want this new feature?

It would be nice to have eventually

I like ArchiveBox so far / would recommend it to a friend

kerem closed this issue 2026-03-01 17:57:12 +03:00 and added the `touches: configuration`, `size: medium`, `help wanted`, and `expected: unlikely unless contributed` labels

kerem commented

2026-03-01 17:57:13 +03:00

Author

Owner

@pirate commented on GitHub (Feb 19, 2023):

This is a reasonable idea but unlikely to be implemented anytime soon to be honest. I have too much other work to allow ArchiveBox to feature creep into being a full-fledged crawler instead of just an archiving engine / data warehouse. I recommend finding another crawling/scraping solution (like scrapy) and piping the outputted URLs into ArchiveBox.
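The piping step suggested here could look roughly like the sketch below. It assumes that `archivebox add` accepts newline-separated URLs on standard input, that the `archivebox` executable is on `PATH`, and that the script runs from inside an ArchiveBox data directory. The function names are hypothetical, and the URLs are expected to come from whatever external crawler or scraper is used.

```python
import subprocess


def build_stdin_payload(urls):
    """Deduplicate URLs (keeping first-seen order) and join them one per line."""
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    return "\n".join(unique) + "\n" if unique else ""


def archive(urls):
    """Send the URLs to `archivebox add` on stdin (assumes ArchiveBox is installed)."""
    payload = build_stdin_payload(urls)
    if payload:
        # check=True raises CalledProcessError if archivebox exits non-zero.
        subprocess.run(["archivebox", "add"], input=payload, text=True, check=True)
```

Keeping the payload construction separate from the subprocess call means the URL handling can be checked without ArchiveBox installed.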

kerem commented

2026-03-01 17:57:13 +03:00

Author

Owner

@diego898 commented on GitHub (Feb 19, 2023):

Understood, thanks!
