mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1094] Feature Request: Adding a DOM_WHITELIST to fine tune the depth parameter #2195
Originally created by @diego898 on GitHub (Feb 7, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1094
Type
What is the problem that your feature request solves
Archiving "structured discussion" pages like a HackerNews post
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
I'm a frequent reader of a discussion+link aggregator called HackerNews. I frequently want to archive both the main link, and the discussions on HN.
Right now, I have to open the discussion page for a post and then click through to the main URL, so that I end up with two URLs that I can then archive.
It would be great if I can specify a `depth=1`, and maybe use a `DOM_WHITELIST` or something to only capture the DOM element called `titlelink`, and not any of the other million links on a HackerNews post (parent, next, etc on each post).
How badly do you want this new feature?
@pirate commented on GitHub (Feb 19, 2023):
This is a reasonable idea but unlikely to be implemented anytime soon to be honest. I have too much other work to allow ArchiveBox to feature creep into being a full-fledged crawler instead of just an archiving engine / data warehouse. I recommend finding another crawling/scraping solution (like scrapy) and piping the outputted URLs into ArchiveBox.
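The workaround suggested above (extract the URLs with a separate tool, then pipe them into ArchiveBox) could look roughly like the sketch below. It is not an ArchiveBox feature: `AllowlistLinkParser` and `urls_to_archive` are hypothetical names, and it assumes the story link is an `<a>` tag carrying the `titlelink` class mentioned in the request, which may not match HackerNews's current markup.

```python
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AllowlistLinkParser(HTMLParser):
    """Collect hrefs from <a> tags that carry one of the allowed CSS classes."""

    def __init__(self, base_url, allowed_classes):
        super().__init__()
        self.base_url = base_url
        self.allowed_classes = set(allowed_classes)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # Attribute values can be None for valueless attributes
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if href and self.allowed_classes.intersection(classes):
            # Resolve relative links against the page URL
            self.links.append(urljoin(self.base_url, href))


def urls_to_archive(page_url, html, allowed_classes=("titlelink",)):
    """Return the page itself plus only the allowlisted links, in order."""
    parser = AllowlistLinkParser(page_url, allowed_classes)
    parser.feed(html)
    # dict.fromkeys de-duplicates while preserving order
    return list(dict.fromkeys([page_url, *parser.links]))


if __name__ == "__main__" and len(sys.argv) > 1:
    page_url = sys.argv[1]
    with urlopen(page_url) as resp:
        page_html = resp.read().decode("utf-8", errors="replace")
    # One URL per line, ready to pipe into another tool
    print("\n".join(urls_to_archive(page_url, page_html)))
```

Saved under a name such as `hn_links.py` (hypothetical), the output can be piped into ArchiveBox, which accepts URLs on stdin: `python hn_links.py '<discussion URL>' | archivebox add`. Because the filtering happens before ArchiveBox sees the URLs, no `depth=1` crawl is needed.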
@diego898 commented on GitHub (Feb 19, 2023):
Understood thanks!