mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1012] Feature Request: OCR archived PDF files to extract titles and full-text contents #2146
Originally created by @turian on GitHub (Aug 12, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1012
Type
What is the problem that your feature request solves
The PDF is saved, but without a title or any of its text, which makes it impossible to search.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
At the very least, the title of this PDF "Boilerplate Removal using a Neural Sequence Labeling Model" should be in the index.
Better yet, the full PDF text should be available too for searching.
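As a rough illustration of the title part of this request: when a PDF carries a title in its metadata, it can be read with the pdfinfo tool from poppler. The sketch below is not ArchiveBox code. It assumes pdfinfo is installed and on the PATH, and the helper names parse_title and pdf_title are hypothetical. Many PDFs have no metadata title at all, in which case it returns None and the title would have to come from the text instead.

```python
import subprocess
from typing import Optional


def parse_title(pdfinfo_output: str) -> Optional[str]:
    """Return the value of the 'Title:' line in pdfinfo output, or None if absent or empty."""
    for line in pdfinfo_output.splitlines():
        # Split on the first colon only, so titles that contain colons stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


def pdf_title(pdf_path: str, pdfinfo_binary: str = "pdfinfo") -> Optional[str]:
    """Run pdfinfo on one file and return its metadata title, if any."""
    result = subprocess.run(
        [pdfinfo_binary, pdf_path],
        capture_output=True,
        text=True,
        check=True,  # raise if pdfinfo fails, e.g. on a corrupt file
    )
    return parse_title(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parsing can be checked without any PDF tooling installed.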
What hacks or alternative solutions have you tried to solve the problem?
Using a Unix PDF-to-text tool or similar to manually convert all PDFs in the archive, but I'm not sure how to write the results back to the archive.
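The manual conversion described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a supported ArchiveBox workflow. It assumes poppler's pdftotext is installed, that the archived PDFs live somewhere under a single archive directory, and that writing a .txt file next to each PDF is acceptable. The function names are hypothetical. It does not register anything in the ArchiveBox index, and whether the extra files become searchable depends on the search backend in use.

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def run_pdftotext(pdf: Path, txt: Path) -> None:
    """Convert one PDF to plain text with poppler's pdftotext."""
    subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)


def convert_archive(
    archive_dir: Path,
    convert: Callable[[Path, Path], None] = run_pdftotext,
) -> List[Path]:
    """Write a .txt file next to every PDF under archive_dir that lacks one."""
    written = []
    for pdf in sorted(archive_dir.rglob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        if txt.exists():
            continue  # already converted on an earlier run
        convert(pdf, txt)
        written.append(txt)
    return written
```

Because existing .txt files are skipped, the function can be re-run after new snapshots are added. Note that pdftotext only reads an existing text layer, so scanned PDFs would still need OCR.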
How badly do you want this new feature?
@turian commented on GitHub (Aug 12, 2022):
I have added sponsorship
@turian commented on GitHub (Aug 12, 2022):
For me, this issue is a deal-breaker if it will never be fixed. If a fix is coming, I can start archiving now and expect to get it later.
@pirate commented on GitHub (Aug 20, 2022):
This is a good idea. This could either be a whole new extractor to implement (similar to readability/mercury in that it takes the output of a previous extractor and processes it to extract text), or a final step in the existing PDF extractor. If anyone wants to implement this I'd welcome proposals/PRs, but I'm unlikely to have time to do it myself in the next 6mo.
Maybe something like this:
@turian commented on GitHub (Aug 27, 2022):
It appears that pd3f is a better PDF parser, because it uses OCR for really old PDFs that aren't text.
@kylemclaren commented on GitHub (Jan 30, 2023):
Hey folks, this thread has gone a bit stale. I would also like to know whether this feature can be expected in a future release. To be clear, I want PDF content to be searchable.
@pirate commented on GitHub (Jan 31, 2023):
The status currently is that I think this is a good idea, but I can't promise I'll build it myself in the near future.
I unfortunately have limited time for ArchiveBox and a backlog of higher priority things at the moment.
In the meantime I'd be willing to accept PRs for a solution, and I've already outlined a potential approach here if anyone wants to take a stab at implementing this: https://github.com/ArchiveBox/ArchiveBox/issues/1012#issuecomment-1221371065
The development docs are here:
@pirate commented on GitHub (Jun 13, 2023):
A stopgap solution to search inside of PDF text right now is to replace the RIPGREP_BINARY with ripgrep-all, further discussion here: https://github.com/ArchiveBox/ArchiveBox/issues/1091
If anyone knows of any youtube-dl/yt-dlp equivalent for extracting paper PDFs from a given URL, please comment here! A yt-dlp -> paper-dl equivalent solution that can take a DOI number or URL and extract the paper PDF, and metadata about the date published, authors, abstract, etc. into JSON, that would be amazing.
@pirate commented on GitHub (Mar 1, 2024):
Related:
New Extractor Idea: scihub-dl to auto-detect inline DOI numbers and download academic paper PDFs #720
@benmuth commented on GitHub (Mar 16, 2024):
@pirate pd3f seems like the most capable tool, but the docs say that it takes 8GB. Just wanted to check if we're okay with a dependency that heavy before I start trying it out.
@pirate commented on GitHub (Mar 17, 2024):
Nah too heavy haha, everything archivebox needs currently is ~500mb (including OS and Python itself).
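For anyone who wants to try the ripgrep-all stopgap mentioned earlier in the thread outside of ArchiveBox first, a minimal sketch follows. It assumes the ripgrep-all binary is installed as rga and that it accepts ripgrep's usual flags and exit codes. The function names and the archive path are hypothetical.

```python
import subprocess
from typing import List


def build_search_command(pattern: str, archive_dir: str, binary: str = "rga") -> List[str]:
    """Build a command that lists files under archive_dir whose content matches pattern."""
    return [binary, "--files-with-matches", "--ignore-case", pattern, archive_dir]


def search_archive(pattern: str, archive_dir: str, binary: str = "rga") -> List[str]:
    """Run the search and return matching file paths (empty list if nothing matches)."""
    result = subprocess.run(
        build_search_command(pattern, archive_dir, binary),
        capture_output=True,
        text=True,
    )
    # Assumed ripgrep-style exit codes: 0 = matches found, 1 = no matches, other = error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip())
    return result.stdout.splitlines()
```

A call such as search_archive("sequence labeling", "./archive") would then return the paths of archived files, PDFs included, whose contents match.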