[GH-ISSUE #1094] Feature Request: Adding a DOM_WHITELIST to fine tune the depth parameter #2195

Closed
opened 2026-03-01 17:57:12 +03:00 by kerem · 2 comments

Originally created by @diego898 on GitHub (Feb 7, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1094

Type

  • Propose a brand new feature

What is the problem that your feature request solves

Archiving "structured discussion" pages like a HackerNews post

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

I'm a frequent reader of HackerNews (https://news.ycombinator.com/), a discussion and link aggregator. I often want to archive both the main link and the discussion on HN.

Right now, I have to open the discussion page for a post and then click through to the main URL, which gives me two URLs that I can then archive.

It would be great if I could specify `depth=1` and use something like a `DOM_WHITELIST` to follow only the DOM element called `titlelink`, and not any of the many other links on a HackerNews post (parent, next, etc. on each post).
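To make the request concrete, here is a minimal Python sketch of the kind of filtering described above: given a page's HTML, keep only the links whose `<a>` tag carries a chosen CSS class. This is not an ArchiveBox feature; `ClassLinkExtractor` and `extract_links` are hypothetical names, and the class name `titlelink` is taken from this request and may not match HackerNews' actual markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ClassLinkExtractor(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class, base_url):
        super().__init__()
        self.wanted_class = wanted_class  # e.g. "titlelink" (assumed class name)
        self.base_url = base_url          # used to resolve relative hrefs
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.wanted_class in classes:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, wanted_class, base_url):
    """Return absolute URLs of all anchors that have `wanted_class`."""
    parser = ClassLinkExtractor(wanted_class, base_url)
    parser.feed(html)
    return parser.links
```

Called on a saved discussion page, `extract_links(html, "titlelink", page_url)` would return just the main story link and skip the per-comment navigation links.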

How badly do you want this new feature?

  • It would be nice to have eventually

  • I like ArchiveBox so far / would recommend it to a friend

@pirate commented on GitHub (Feb 19, 2023):

This is a reasonable idea, but to be honest it's unlikely to be implemented anytime soon. I have too much other work to let ArchiveBox feature-creep into a full-fledged crawler instead of just an archiving engine / data warehouse. I recommend finding another crawling/scraping solution (like scrapy) and piping the URLs it outputs into ArchiveBox.
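As a rough illustration of that workflow, the sketch below takes URLs produced by some external scraper and pipes them into `archivebox add` on stdin, one per line. It assumes the `archivebox` CLI is installed and that the script runs inside an initialized ArchiveBox data directory; `build_stdin` and `archive` are hypothetical helper names, not part of ArchiveBox.

```python
import subprocess


def build_stdin(urls):
    """Deduplicate URLs (keeping first-seen order) and join them one per line."""
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    return "\n".join(unique) + "\n" if unique else ""


def archive(urls, depth=0):
    """Pipe URLs into `archivebox add`; returns None if there is nothing to add."""
    payload = build_stdin(urls)
    if not payload:
        return None
    # Assumes `archivebox` is on PATH and the current directory is a data dir.
    return subprocess.run(
        ["archivebox", "add", f"--depth={depth}"],
        input=payload,
        text=True,
        check=True,
    )
```

With `depth=0`, only the URLs passed in are archived, so the scraper alone decides which links are followed.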


@diego898 commented on GitHub (Feb 19, 2023):

Understood, thanks!
