mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-25 09:06:02 +03:00

[GH-ISSUE #778] Feature Request: add BDfR as a new extractor for archiving Reddit content #3513

Open

opened 2026-03-14 23:19:34 +03:00 by kerem · 2 comments

kerem commented

2026-03-14 23:19:34 +03:00

Owner

Originally created by @pirate on GitHub (Jul 2, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/778

Discussed in https://github.com/ArchiveBox/ArchiveBox/discussions/754

Originally posted by BlipRanger on May 24, 2021:
Just wanted to make a quick mention of BDfR (https://github.com/aliparlakci/bulk-downloader-for-reddit) as a cool project that might make for a good starting point for the unrolling of reddit comments/posts as mentioned in the roadmap. They currently support grabbing a variety of media types from the post as well as the comments/text in a separate (json) file. I've been working on an addon (https://github.com/BlipRanger/bdfr-html) for it lately and I think it's a pretty great project with well-maintained code. If nothing else, they have really good examples of working with reddit data which could be useful! Just wanted to bring that to your attention!

I'd love to add BDfR as an extractor for Reddit content (and something similar for Twitter too https://github.com/ArchiveBox/ArchiveBox/issues/345) but am somewhat swamped with work and travel for the near future.

If you @BlipRanger or anyone else wants to add it as an extractor (matching the style of our other extractors, e.g. archivebox/extractors/media.py (https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/extractors/media.py) is a great example to copy), I'd be happy to review PRs!

We have some good instructions for contributing a new extractor and getting started with ArchiveBox development in general:

- https://github.com/ArchiveBox/ArchiveBox/blob/dev/README.md#contributing-a-new-extractor
- https://github.com/ArchiveBox/ArchiveBox/blob/dev/README.md#archivebox-development
- https://github.com/ArchiveBox/ArchiveBox/blob/dev/.github/CONTRIBUTING.md
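
For anyone picking this up, here is a rough sketch of what the core of a BDfR-based extractor might look like. It is not ArchiveBox's actual extractor API: the function names (`should_save_reddit`, `build_bdfr_cmd`, `save_reddit`) are hypothetical and only loosely follow the should-save/save split used by the existing extractors. The `bdfr archive ... --link ... --format json` invocation is based on my reading of the BDfR README and should be checked against the installed version.

```python
import re
import subprocess
from pathlib import Path
from typing import List, Optional

# Assumption: only reddit.com (any subdomain) and redd.it URLs are handled.
REDDIT_URL = re.compile(
    r"^https?://([a-z0-9-]+\.)?(reddit\.com|redd\.it)/", re.IGNORECASE
)


def should_save_reddit(url: str) -> bool:
    """Return True if the URL looks like a Reddit link."""
    return bool(REDDIT_URL.match(url))


def build_bdfr_cmd(url: str, out_dir: Path, binary: str = "bdfr") -> List[str]:
    """Build the BDfR command line (flags assumed from the BDfR README)."""
    return [binary, "archive", str(out_dir), "--link", url, "--format", "json"]


def save_reddit(
    url: str, out_dir: Path, timeout: int = 120
) -> Optional[subprocess.CompletedProcess]:
    """Run BDfR for a single URL; return None when the URL is not a Reddit link."""
    if not should_save_reddit(url):
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical wiring: a real extractor would wrap this result in whatever
    # result type ArchiveBox expects and record the command, status, and output.
    return subprocess.run(
        build_bdfr_cmd(url, out_dir),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Keeping the URL check and the command construction separate from the subprocess call makes both easy to unit-test without BDfR installed.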

kerem added the labels "touches: configuration" and "touches: dependencies/packaging" on 2026-03-14 23:19:34 +03:00

kerem commented

2026-03-14 23:19:44 +03:00

Author

Owner

@pirate commented on GitHub (Oct 20, 2023):

We already use Mercury (recently renamed postlight) as an extractor, and they're rapidly adding extractors on their side for many different kinds of sites, so we should get these improvements with no effort required on the ArchiveBox side:

Reddit threads: https://github.com/postlight/parser/pull/746
HN threads: https://github.com/postlight/parser/pull/745
Twitter threads: https://github.com/postlight/parser/pull/622

kerem commented

2026-03-14 23:19:49 +03:00

Author

Owner

@rmelotte commented on GitHub (Jun 24, 2024):

It looks like the postlight project has no recent activity unfortunately (no PR reviews at least)...
Is there any plan to replace it with something else, or integrate the existing Reddit and HN PRs in a different way?
