[GH-ISSUE #778] Feature Request: add BDfR as a new extractor for archiving Reddit content #3513

Open
opened 2026-03-14 23:19:34 +03:00 by kerem · 2 comments
Owner

Originally created by @pirate on GitHub (Jul 2, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/778

Discussed in https://github.com/ArchiveBox/ArchiveBox/discussions/754

Originally posted by BlipRanger May 24, 2021
Just wanted to make a quick mention of [BDfR](https://github.com/aliparlakci/bulk-downloader-for-reddit) as a cool project that might make for a good starting point for the unrolling of reddit comments/posts as mentioned in the roadmap. They currently support grabbing a variety of media types from the post as well as the comments/text in a separate (json) file. I've been working on an [addon](https://github.com/BlipRanger/bdfr-html) for it lately and I think it's a pretty great project with well-maintained code. If nothing else, they have really good examples of working with reddit data which could be useful! Just wanted to bring that to your attention!

I'd love to add [BDfR](https://github.com/aliparlakci/bulk-downloader-for-reddit) as an extractor for Reddit content (and something similar for Twitter too https://github.com/ArchiveBox/ArchiveBox/issues/345) but am somewhat swamped with work and travel for the near future.

If you @BlipRanger or anyone else wants to add it as an extractor (matching the style of our other extractors, e.g. [`archivebox/extractors/media.py`](https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/media.py) is a great example to copy), I'd be happy to review PRs!

We have some good instructions for contributing a new extractor and getting started with ArchiveBox development in general:

- https://github.com/ArchiveBox/ArchiveBox/blob/dev/README.md#contributing-a-new-extractor
- https://github.com/ArchiveBox/ArchiveBox/blob/dev/README.md#archivebox-development
- https://github.com/ArchiveBox/ArchiveBox/blob/dev/.github/CONTRIBUTING.md
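
For anyone picking this up, here is a rough sketch of what the core of such an extractor might look like. It is not ArchiveBox's actual extractor interface: the names `should_save_reddit`, `build_bdfr_command`, and `save_reddit` are hypothetical, and it assumes BDfR is installed with a `bdfr` command on the `PATH` that accepts an `archive` subcommand and a `--link` option (check this against the BDfR version you install). A real PR would follow the structure of `archivebox/extractors/media.py` instead.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional
from urllib.parse import urlparse

# Hostnames treated as Reddit links (an illustrative list, not exhaustive).
REDDIT_HOSTS = {"reddit.com", "www.reddit.com", "old.reddit.com", "redd.it"}


def should_save_reddit(url: str) -> bool:
    """Return True only if the URL's hostname is a known Reddit host."""
    host = (urlparse(url).hostname or "").lower()
    return host in REDDIT_HOSTS


def build_bdfr_command(url: str, out_dir: Path, binary: str = "bdfr") -> List[str]:
    """Build the argument list for archiving one submission.

    Assumption: `bdfr archive <dir> --link <url>` writes the post and its
    comments into <dir>. Verify against your installed BDfR version.
    """
    return [binary, "archive", str(out_dir), "--link", url]


def save_reddit(url: str, out_dir: Path, timeout: int = 60) -> Optional[subprocess.CompletedProcess]:
    """Run BDfR for a Reddit URL; return None when skipped."""
    if not should_save_reddit(url):
        return None  # not a Reddit link, nothing to do
    if shutil.which("bdfr") is None:
        return None  # BDfR is not installed, skip instead of crashing
    out_dir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(
        build_bdfr_command(url, out_dir),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Keeping the URL check and the command construction as separate pure functions means both can be unit-tested without BDfR installed or network access; only `save_reddit` touches the filesystem and spawns a process.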

@pirate commented on GitHub (Oct 20, 2023):

We use Mercury (recently renamed `postlight`) as an extractor already, and they're rapidly adding extractors on their side for many different kinds of sites, so we should get these improvements with no effort required on the archivebox side:

- Reddit threads: https://github.com/postlight/parser/pull/746
- HN threads: https://github.com/postlight/parser/pull/745
- Twitter threads: https://github.com/postlight/parser/pull/622

@rmelotte commented on GitHub (Jun 24, 2024):

It looks like the postlight project has no recent activity unfortunately (no PR reviews at least)...
Is there any plan to replace it with something else, or integrate the existing Reddit and HN PRs in a different way?
