[GH-ISSUE #1012] Feature Request: OCR archived PDF files to extract titles and full-text contents #2146

Open
opened 2026-03-01 17:56:51 +03:00 by kerem · 10 comments

Originally created by @turian on GitHub (Aug 12, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1012

Type

  • [ ] General question or discussion
  • [X] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

```
echo 'https://arxiv.org/pdf/2004.14294.pdf' | archivebox add
```

The PDF is saved, but without a title or any of its text, which makes it impossible to search.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

At the very least, the title of this PDF, "Boilerplate Removal using a Neural Sequence Labeling Model", should be in the index.

Better yet, the full PDF text should be available for searching too.

What hacks or alternative solutions have you tried to solve the problem?

Using a Unix PDF-to-text tool or similar to manually convert all PDFs in the archive. But I'm not sure how to write the results back to the archive.

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [X] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [X] I'm willing to contribute dev time (https://github.com/ArchiveBox/ArchiveBox#archivebox-development) / money (https://github.com/sponsors/pirate) to fix this issue
  • [X] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up
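
The manual workaround mentioned in the issue can be sketched as a small script. This is a minimal sketch, not ArchiveBox's own code: it assumes the poppler-utils `pdftotext` CLI is installed, that the archived PDFs live somewhere below the folder passed in, and that writing a sidecar `.txt` file next to each PDF is acceptable. The function names are hypothetical. It does not register the text in ArchiveBox's index, which is exactly the open question raised in the issue.

```python
import shutil
import subprocess
from pathlib import Path


def find_pdfs(archive_dir):
    """Return every *.pdf file below archive_dir, sorted for stable output."""
    return sorted(p for p in Path(archive_dir).rglob('*.pdf') if p.is_file())


def pdftotext_command(pdf_path):
    """Build the pdftotext call that writes <name>.txt next to <name>.pdf."""
    pdf_path = Path(pdf_path)
    return ['pdftotext', '-layout', str(pdf_path), str(pdf_path.with_suffix('.txt'))]


def convert_all(archive_dir):
    """Convert each PDF to a sidecar text file; return the PDFs that succeeded."""
    if shutil.which('pdftotext') is None:
        raise RuntimeError('pdftotext (poppler-utils) was not found on PATH')
    converted = []
    for pdf in find_pdfs(archive_dir):
        result = subprocess.run(pdftotext_command(pdf), capture_output=True)
        if result.returncode == 0:
            converted.append(pdf)
    return converted
```

Calling `convert_all('archive')` from the data directory would leave a text file beside every PDF it could read, which a plain-text search tool can then pick up.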

@turian commented on GitHub (Aug 12, 2022):

I have added sponsorship


@turian commented on GitHub (Aug 12, 2022):

For me, this issue is a deal-breaker if it is never going to be addressed. If it is coming, I can start archiving now and expect to get the fix later.


@pirate commented on GitHub (Aug 20, 2022):

This is a good idea. This could either be a whole new extractor to implement (similar to readability/mercury in that it takes the output of a previous extractor and processes it to extract text), or a final step in the existing PDF extractor. If anyone wants to implement this I'd welcome proposals/PRs, but I'm unlikely to have time to do it myself in the next 6mo.

Maybe something like this:

```python3
from PyPDF2 import PdfReader  # PyPDF2 >= 2.0 naming; the older PdfFileReader API is deprecated

# accumulated text of all pages
archive_texts = ''

# open the PDF in binary mode; the context manager closes the file for us
with open('example.pdf', 'rb') as pdf_file:
    # create a PDF reader object
    pdf_reader = PdfReader(pdf_file)

    # print the number of pages in the PDF file
    print(len(pdf_reader.pages))

    # extract the text from each page in turn
    for page in pdf_reader.pages:
        archive_texts += (page.extract_text() or '') + '\n'

# save archive_texts output as txt file / pass it to sonic for indexing
```
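
For the title half of the request, one possible fallback is to take the first substantial line of the extracted text. The heuristic below is an assumption made for illustration (neither ArchiveBox nor PyPDF2 provides it), and `guess_title` and `min_length` are hypothetical names. PDF metadata, when present, would normally be a better source, and some papers start with a header or sidebar line rather than the title.

```python
def guess_title(text, min_length=10):
    """Return the first line of `text` that looks like a title, or None.

    Heuristic (an assumption, not ArchiveBox behaviour): skip blank or very
    short lines such as page numbers, and collapse runs of whitespace.
    """
    for line in text.splitlines():
        candidate = ' '.join(line.split())
        if len(candidate) >= min_length:
            return candidate
    return None
```

With the extraction step above, `guess_title(archive_texts)` would give a candidate title to store alongside the text.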

@turian commented on GitHub (Aug 27, 2022):

It appears that pd3f (https://github.com/pd3f/pd3f) is a better PDF parser, because it uses OCR for really old PDFs that aren't text.
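
The distinction drawn here, between PDFs with an embedded text layer and scanned PDFs that are only images, can be checked cheaply before reaching for an OCR tool. The sketch below is illustrative only: `needs_ocr`, `min_chars_per_page` and `max_empty_ratio` are hypothetical names with arbitrary thresholds, and it does not reflect how pd3f makes this decision. It takes the per-page strings returned by a text extractor and flags the document for OCR when most pages are nearly empty.

```python
def needs_ocr(page_texts, min_chars_per_page=25, max_empty_ratio=0.5):
    """Return True when most pages carry almost no extractable text."""
    if not page_texts:
        return True  # nothing was extracted at all
    empty = sum(
        1 for text in page_texts
        if len((text or '').strip()) < min_chars_per_page
    )
    return empty / len(page_texts) > max_empty_ratio
```

Only documents for which this returns `True` would need the slower OCR path; text-based PDFs could stay on plain extraction.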

@kylemclaren commented on GitHub (Jan 30, 2023):

Hey folks, this thread has gone a bit stale. I would also like to know whether this feature can be expected in a future release. To be clear, I want PDF content to be searchable.


@pirate commented on GitHub (Jan 31, 2023):

The status currently is that I think this is a good idea, but I can't promise I'll build it myself in the near future.
I unfortunately have limited time for ArchiveBox and a backlog of higher priority things at the moment.

In the meantime I'd be willing to accept PRs for a solution, and I've already outlined a potential approach here if anyone wants to take a stab at implementing this: https://github.com/ArchiveBox/ArchiveBox/issues/1012#issuecomment-1221371065

The development docs are here:

  • https://github.com/ArchiveBox/ArchiveBox#archivebox-development
  • https://github.com/ArchiveBox/ArchiveBox#contributing-a-new-extractor

@pirate commented on GitHub (Jun 13, 2023):

A stopgap solution to search inside of PDF text right now is to replace the RIPGREP_BINARY with ripgrep-all, further discussion here: https://github.com/ArchiveBox/ArchiveBox/issues/1091
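
To get a feel for what the stopgap buys, ripgrep-all can also be run directly against the archive folder. This is a rough sketch, assuming ripgrep-all is installed under its usual `rga` executable name and that it forwards ripgrep's `--files-with-matches` flag. The helper names and the `archive` path are placeholders, not part of ArchiveBox.

```python
import shutil
import subprocess


def rga_command(pattern, archive_dir, binary='rga'):
    """Build a ripgrep-all call that lists files whose contents match."""
    return [binary, '--files-with-matches', pattern, str(archive_dir)]


def search_archive(pattern, archive_dir, binary='rga'):
    """Return the paths of matching files, or [] when nothing matches."""
    if shutil.which(binary) is None:
        raise RuntimeError(f'{binary} was not found on PATH')
    result = subprocess.run(
        rga_command(pattern, archive_dir, binary),
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

For example, `search_archive('Sequence Labeling', 'archive')` would list archived files, PDFs included, that contain the phrase.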

If anyone knows of any youtube-dl/yt-dlp equivalent for extracting paper PDFs from a given URL, please comment here! A paper-dl counterpart to yt-dlp that can take a DOI number or URL and extract the paper PDF, plus metadata (date published, authors, abstract, etc.) as JSON, would be amazing.
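
As a starting point for such a tool, the identification step can be sketched with the standard library alone. This is only an illustration: `identify_paper` is a hypothetical helper, the DOI pattern matches the common `10.<registrant>/<suffix>` shape rather than validating DOIs fully, and nothing is downloaded. It returns a small JSON document naming the identifier it found.

```python
import json
import re

# arXiv-style URLs such as https://arxiv.org/pdf/2004.14294.pdf
ARXIV_RE = re.compile(r'arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(v\d+)?', re.IGNORECASE)
# Modern DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+')


def identify_paper(text):
    """Return a JSON string naming the arXiv ID or DOI found in `text`."""
    arxiv = ARXIV_RE.search(text)
    if arxiv:
        return json.dumps({'type': 'arxiv', 'id': arxiv.group(1)})
    doi = DOI_RE.search(text)
    if doi:
        # trailing punctuation usually belongs to the sentence, not the DOI
        return json.dumps({'type': 'doi', 'id': doi.group(0).rstrip('.,;)')})
    return json.dumps({'type': None, 'id': None})
```

Fetching the PDF and the richer metadata (authors, abstract, publication date) would be a separate step built on top of the identifier.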


@pirate commented on GitHub (Mar 1, 2024):

Related: New Extractor Idea: scihub-dl to auto-detect inline DOI numbers and download academic paper PDFs #720 (https://github.com/ArchiveBox/ArchiveBox/issues/720)

@benmuth commented on GitHub (Mar 16, 2024):

@pirate pd3f seems like the most capable tool, but the docs (https://pd3f.com/docs/pd3f/installation/) say that it takes 8GB. Just wanted to check if we're okay with a dependency that heavy before I start trying it out.

@pirate commented on GitHub (Mar 17, 2024):

Nah, too heavy haha. Everything ArchiveBox needs currently is ~500 MB (including the OS and Python itself).
