[GH-ISSUE #720] New Extractor Idea: scihub-dl to auto-detect inline DOI numbers and download academic paper PDFs #455

Open
opened 2026-03-01 14:43:44 +03:00 by kerem · 12 comments
Owner

Originally created by @pirate on GitHub (Apr 24, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/720

New extractor idea: SCIHUB

e.g. take this academic paper for example: https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1

If a full paper PDF is available on Sci-Hub, e.g. https://sci-hub.se/https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1, it could be downloaded to a ./archive/<timestamp>/scihub/ output folder.

# try downloading via verbatim URL first
$ scihub.py -d https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1
DEBUG:Sci-Hub:Successfully downloaded file with identifier https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1

We could also look for a DOI number in the page URL or page HTML contents, e.g. 10.1016/j.cub.2019.11.030, using a regex and try downloading that.

# otherwise try downloading via any regex-extracted bare DOI numbers on the page or in the URL
$ scihub.py -d '10.1016/j.cub.2019.11.030'
DEBUG:Sci-Hub:Successfully downloaded file with identifier 10.1016/j.cub.2019.11.030

$ ls
c28dc1242df6f931c29b9cd445a55597-.cub.2019.11.030.pdf
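
As a rough illustration of that regex step, here is a minimal sketch (the pattern and the find_dois helper are hypothetical illustrations, not part of scihub.py):

```python
import re

# Common DOI shape: "10." + 4-9 digit registrant code + "/" + suffix.
# This covers most modern CrossRef DOIs but is not a full validator.
DOI_RE = re.compile(r'10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+')

def find_dois(text: str) -> list[str]:
    """Return de-duplicated DOI candidates found in a URL or HTML blob."""
    seen, results = set(), []
    for match in DOI_RE.findall(text):
        doi = match.rstrip('.,;)')  # strip trailing prose punctuation
        if doi and doi not in seen:
            seen.add(doi)
            results.append(doi)
    return results
```

Each candidate could then be handed to the downloader exactly as in the scihub.py invocation above.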

New Dependencies:

  • https://github.com/zaytoun/scihub.py

New Extractors:

  • extractors/scihub.py

New Config Options:

  • SAVE_SCIHUB=True

@danisztls commented on GitHub (Apr 27, 2021):

Captchas would be a problem.


@pirate commented on GitHub (Apr 27, 2021):

In my testing so far it's not a problem as long as you're not doing dozens of papers at a time.


@jezcope commented on GitHub (Aug 27, 2023):

Two more avenues to explore:

  1. Zotero has a big collection of translators to get full text from many different publishers and repositories https://github.com/zotero/translators
  2. Unpaywall has a huge database of open access copies of articles https://unpaywall.org/

Would be great to capture full text for academic publications where it's available.


@pirate commented on GitHub (Jan 20, 2024):

Parsing DOIs out of PDF/HTML content:

  • https://github.com/MicheleCotrufo/pdf2doi
  • https://www.findingyourway.io/blog/2019/03/13/2019-03-13_extracting-doi-from-text/
  • https://stackoverflow.com/questions/27910/finding-a-doi-in-a-document-or-page

Found good candidate extractor dependencies to try for the major free scientific paper DBs:

  • https://github.com/Tishacy/SciDownl
  • https://pypi.org/project/arxiv-dl/
  • https://github.com/viown/libgen-dl
  • https://github.com/unpywall/unpywall
  • https://openaccessbutton.org/?from=integration&id=
  • https://github.com/bibcure/scihub2pdf
  • https://github.com/dheison0/annas-archive-api

and more via...

  • https://github.com/mrossinek/cobib ⭐️ (awesome bibliography manager that supports DOI search)
  • https://gist.github.com/taskylizard/5ba73bf97dccf159316edcf4c6520856#-academic-papers
  • https://github.com/cipher387/osint_stuff_tool_collection
  • https://www.reddit.com/r/FREEMEDIAHECKYEAH/wiki/tools-index

Something like this might be interesting for linking together citations too: https://inciteful.xyz/

@benmuth commented on GitHub (Feb 11, 2024):

I have a rough script for this working (just using scihub.py as a module and downloading a pdf with the url->doi fallback). I had to slightly modify scihub.py to get it to work, but I was thinking we can just add the modified version to the vendor directory or something. It looks like scihub.py isn't available on PyPI anyway.

Should I just wait for the plugin system? Or I can create a repo with the script and modified scihub.py so it can be looked at and improved in the meantime. Or I can try to minimally integrate it with ArchiveBox (just get it working as a oneshot?) and open a PR for feedback and testing, whatever makes sense.


@pirate commented on GitHub (Feb 11, 2024):

If you can post it in a gist and share the link that would be helpful @benmuth!


@benmuth commented on GitHub (Feb 13, 2024):

Here's what I have so far https://gist.github.com/benmuth/b2c12cbb40ca4d8183c6f17f819e2f2d @pirate

Usage:

python scihub-extractor.py -d <DOI|URL> -o <output directory>

or

python scihub-extractor.py -f <text file with newline separated DOIs|URLs> -o <output directory>

It should either:

  • download the paper directly identified by the DOI or URL you gave it, or
  • if you gave it a URL but it can't directly find a paper through it, parse the page's HTML for DOIs and attempt to download all DOIs that are found.

In the second case, we can probably be smarter about only downloading the DOIs intended (based on title or something?), but it's pretty dumb right now.
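
That two-branch fallback could be sketched roughly like this, with download_paper standing in for the gist's Sci-Hub download call (both helper names here are hypothetical, not the gist's actual API):

```python
import re
import urllib.request

DOI_RE = re.compile(r'10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+')

def fetch_with_fallback(identifier, download_paper):
    """Try the identifier directly; if that fails and it's a URL,
    scrape the page HTML for DOIs and try each of those instead.
    Returns the list of identifiers successfully downloaded."""
    try:
        download_paper(identifier)
        return [identifier]
    except Exception:
        if not identifier.startswith('http'):
            return []  # bare DOI that failed; nothing left to try
    html = urllib.request.urlopen(identifier).read().decode('utf-8', errors='replace')
    downloaded = []
    for doi in dict.fromkeys(DOI_RE.findall(html)):  # de-dupe, keep order
        try:
            download_paper(doi)
            downloaded.append(doi)
        except Exception:
            continue  # e.g. captcha or paper not on this mirror
    return downloaded
```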


@pirate commented on GitHub (Feb 21, 2024):

@benmuth it might take me a month or more till I'm able to merge this, as I'm working on some paid ArchiveBox projects right now for a client that are taking up most of my time. Don't worry, I'm keen to merge it though, I've been wanting this feature personally for a long time.


@benmuth commented on GitHub (Feb 21, 2024):

> @benmuth it might take me a month or more till I'm able to merge this, as I'm working on some paid ArchiveBox projects right now for a client that are taking up most of my time. Don't worry, I'm keen to merge it though, I've been wanting this feature personally for a long time.

No worries! In the meantime maybe I can keep adding to it. I've kept it simple for demonstration purposes, but it seems straightforward to add some of the ideas you linked here, like the PDF parsing.


@pirate commented on GitHub (Mar 22, 2024):

@benmuth after thinking about this a bit more, I think it could be fun to release your scihub-dl script as its own dedicated PyPI package!

Publishing a package is great resume candy and it'll make this tool usable to a much wider audience than just ArchiveBox users. Then we can just add it to pyproject.toml as a requirement for ArchiveBox and call it like any other command-line tool extractor.


Here's how I'm imagining it could work (starting simple at first, lots of this can wait for v2,v3,etc.):

  • papers-dl parse [--match=doi|pmid|issn|isbn|url|*] <path>

    expose a parsing CLI that can take an HTML/text file and regex through it for any DOI/PMID/ISSN/ISBN/URL/etc. paper identifiers, outputting a line for each match to stdout

    • --match=doi|pmid|... specify what patterns to search for (multiple allowed if --json|--csv are passed)
    • --raw | --json | --csv whether to output one identifier per line as raw text,
      as JSONL {id: '...', type: 'doi', ...}, or as CSV <id>,<type>,<title>,<url>,...
    • <path> path to local file to try and parse for identifiers e.g. ./singlefile_archive.html or ./some_citations.txt
  • papers-dl search [--doi|--title|--url|--author|--publishedAfter|--publishedBefore] <query>

    expose a paper searching and metadata retrieval CLI that can search the most common research databases by DOI/PMID/ISSN/ISBN/title/URL/etc. and return info as JSON for each result

    • --providers=auto|arxiv.org|sci-hub.ru|annas-archive.org|... specify what index to search (the default auto can try all of them and return first/best matching result)
    • --fields=title,authors,description,publishDate,downloadURLs,... which metadata parameters to try and fetch
    • --raw | --json | --csv format output as one line of raw text per result, CSV, or JSONL
    • --doi|--issn|--title|... <query> specify one or more queries that are AND-ed together, e.g.
      --doi='10.1145/3375633' --title='bananas' --url='https://doi.org/.*' --publishedAfter=2021-01-01
  • papers-dl fetch [--providers=...] [--formats=pdf,epub,...] [--doi|--title|--url] <query>

    expose paper downloading CLI that can take an identifier and download the corresponding document as PDF/html/epub/etc.

    • --providers=auto "by any means necessary" try all scihub mirrors, arxiv-dl, annas-archive, archive.org, ...
    • --providers=arxiv.org,sci-hub.ru,... try specific scihub mirror, annas-archive mirror, etc. other provider
    • --formats=pdf,html,epub,md,json (specify what filetypes to try downloading for the given paper, json could write a .json file containing the metadata that would show up from a search)
    • --doi|--issn|--url|... <query> can take an exact downloadURL (found via search), or re-use same query search system as search and run the download on the first/best result
  • papers-dl citations [--incoming] [--outgoing] [--authors] <query> (someday)

    definitely not a priority, but it would be super cool to someday be able to use this to graph-search over scientific publications, finding related DOIs based on incoming/outgoing citations, similar authors, institutions, funding sources, keywords, etc. (powered by Annas-Archive, DOI.org, Google Scholar, https://github.com/mrossinek/cobib or other existing search tools)
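
As a rough sketch, the parse subcommand's surface above could be wired up with argparse along these lines (flag names follow the proposal; the identifier patterns are simplified placeholders, not real validators):

```python
import argparse
import json
import re

# Illustrative patterns only; real PMID/ISBN handling needs stricter validation.
PATTERNS = {
    'doi': r'10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+',
    'pmid': r'\bPMID:?\s*(\d{4,9})\b',
    'url': r'https?://\S+',
}

def main(argv=None):
    parser = argparse.ArgumentParser(prog='papers-dl')
    sub = parser.add_subparsers(dest='command', required=True)
    parse_cmd = sub.add_parser('parse', help='extract identifiers from a file')
    parse_cmd.add_argument('--match', default='doi', help='comma-separated: doi,pmid,url')
    parse_cmd.add_argument('--json', dest='as_json', action='store_true',
                           help='emit JSONL instead of one raw identifier per line')
    parse_cmd.add_argument('path', help='local HTML/text file to scan')
    args = parser.parse_args(argv)

    with open(args.path, encoding='utf-8', errors='replace') as f:
        text = f.read()
    for kind in args.match.split(','):
        for match in re.finditer(PATTERNS[kind], text):
            ident = match.group(1) if match.groups() else match.group(0)
            if args.as_json:
                print(json.dumps({'id': ident, 'type': kind}))
            else:
                print(ident)

if __name__ == '__main__':
    main()
```

The search/fetch subcommands would hang off the same subparser, so each provider backend stays an independent module behind a shared CLI.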


Here are some examples as inspiration for CLI interface, PyPI packaging setup, README, etc.:

  • https://pypi.org/project/arxiv-dl/
  • https://pypi.org/project/scidownl/
  • https://github.com/dheison0/annas-archive-api

Your tool could even call out to these and others ^ if you want to be an all-in-one paper / research downloader.

We can keep brainstorming on names; scihub-dl is a bit narrow, as I'm imagining this tool could eventually use Anna's Archive, Libgen, Project Gutenberg, Google Scholar, Archive.org, etc., so something broader like research-dl/papers-dl/sci-dl/lit-dl might be better.


What do you think? Does it sound like a project that you'd be excited about?

You could own the package yourself and help maintain it, or we could do it under the @ArchiveBox org and I'd be down to handle support/maintenance with you as a co-owner/admin on the repo, I'm down for either.

I think if done well, this tool has the potential to be super popular. I'm sure it'd get a bunch of attention on GitHub/HN/Reddit, because there are so many people trying to scrape scientific research for AI training right now.


@benmuth commented on GitHub (Mar 23, 2024):

@pirate Yeah, I think that's a great idea, I'd be happy to try to work on this. I think a more comprehensive tool should definitely exist. Thanks for the overview, that's really helpful.

As for ownership, I'm not really sure. Maybe I can start it and I can transfer ownership to ArchiveBox if it grows and the maintenance burden proves too much for whatever reason? I don't feel strongly about it either.

I'll start a repo with what I have so far soon. I think I'll go with papers-dl or sci-dl for now.


@amosip commented on GitHub (May 3, 2024):

Is this happening? I would be very interested in using a tool like this.
