Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-25 09:06:02 +03:00)
[GH-ISSUE #720] New Extractor Idea: scihub-dl to auto-detect inline DOI numbers and download academic paper PDFs #1964
Originally created by @pirate on GitHub (Apr 24, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/720
New extractor idea: SCIHUB
e.g. take this academic paper for example: https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1

If a full paper PDF is available on Sci-Hub, e.g. https://sci-hub.se/https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1, it could be downloaded to a ./archive/<timestamp>/scihub/ output folder.

We could also look for a DOI number in the page URL or page HTML contents, e.g. 10.1016/j.cub.2019.11.030, using a regex, and try downloading that.

New Dependencies:

New Extractors:
- extractors/scihub.py

New Config Options:
- SAVE_SCIHUB=True

@danisztls commented on GitHub (Apr 27, 2021):
Captchas would be a problem.
@pirate commented on GitHub (Apr 27, 2021):
In my testing so far it's not a problem as long as you're not doing dozens of papers at a time.
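The DOI-detection step proposed in the issue could be sketched with a small regex. This is a minimal illustration, not ArchiveBox code; the pattern is an assumption loosely based on CrossRef's recommended DOI-matching regex:

```python
import re

# Assumed DOI pattern (10.<registrant>/<suffix>), loosely based on
# CrossRef's recommended regex; the suffix character class is a guess.
DOI_RE = re.compile(r'\b(10\.\d{4,9}/[-._;()/:A-Za-z0-9]+)')

def extract_dois(text: str) -> list[str]:
    """Return unique DOI-like strings found in a URL or HTML snippet, in order."""
    seen: dict[str, None] = {}
    for match in DOI_RE.findall(text):
        # strip punctuation that often trails a DOI in running text
        seen.setdefault(match.rstrip('.,;'))
    return list(seen)

html = '<meta name="citation_doi" content="10.1016/j.cub.2019.11.030">'
print(extract_dois(html))  # → ['10.1016/j.cub.2019.11.030']
```

In a real extractor this would run over both the page URL and the fetched HTML, then hand any hits to the downloader.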
@jezcope commented on GitHub (Aug 27, 2023):
Two more avenues to explore:
It would be great to capture full text for academic publications where it's available.
@pirate commented on GitHub (Jan 20, 2024):
Parsing DOIs out of PDF/HTML content:
Found good candidate extractor dependencies to try for the major free scientific paper DBs:
and more via...
Something like https://inciteful.xyz/ might also be interesting for linking together citations.
@benmuth commented on GitHub (Feb 11, 2024):
I have a rough script for this working (just using scihub.py as a module and downloading a PDF with the URL -> DOI fallback). I had to slightly modify scihub.py to get it to work, but I was thinking we can just add the modified version to the vendor directory or something. It looks like scihub.py isn't available on PyPI anyway.

Should I just wait for the plugin system? Or I can create a repo with the script and modified scihub.py so it can be looked at and improved in the meantime. Or I can try to minimally integrate it with ArchiveBox (just get it working as a oneshot?) and open a PR for feedback and testing, whatever makes sense.

@pirate commented on GitHub (Feb 11, 2024):
If you can post it in a gist and share the link that would be helpful @benmuth!
@benmuth commented on GitHub (Feb 13, 2024):
Here's what I have so far https://gist.github.com/benmuth/b2c12cbb40ca4d8183c6f17f819e2f2d @pirate
Usage:
or
It should either
or
In the second case, we can probably be smarter about only downloading the DOIs intended (based on title or something?), but it's pretty dumb right now.
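The URL -> DOI fallback described above could look roughly like this. This is a hypothetical sketch, not benmuth's actual gist: the function name, the mirror URL, and the "try the page URL first, then each extracted DOI" ordering are all assumptions:

```python
# Assumption: any working Sci-Hub mirror accepts both a full article URL
# and a bare DOI appended to its base URL.
SCIHUB_MIRROR = 'https://sci-hub.se'

def candidate_urls(page_url: str, dois: list[str]) -> list[str]:
    """Try the full page URL on the mirror first, then fall back to each DOI."""
    urls = [f'{SCIHUB_MIRROR}/{page_url}']
    urls += [f'{SCIHUB_MIRROR}/{doi}' for doi in dois]
    return urls

print(candidate_urls(
    'https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1',
    ['10.1016/j.cub.2019.11.030'],
))
```

A real implementation would request each candidate in order and stop at the first response that actually contains a PDF.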
@pirate commented on GitHub (Feb 21, 2024):
@benmuth it might take me a month or more till I'm able to merge this, as I'm working on some paid ArchiveBox projects right now for a client that are taking up most of my time. Don't worry, I'm keen to merge it though, I've been wanting this feature personally for a long time.
@benmuth commented on GitHub (Feb 21, 2024):
No worries! In the meantime maybe I can keep adding to it. I've kept it simple for demonstration purposes, but it seems straightforward to add some of the ideas you linked here, like the PDF parsing.
@pirate commented on GitHub (Mar 22, 2024):
@benmuth after thinking about this a bit more, I think it could be fun to release your scihub-dl script as its own dedicated PyPI package!

Publishing a package is great resume candy and it'll make this tool usable to a much wider audience than just ArchiveBox users. Then we can just add it to pyproject.toml as a requirement for ArchiveBox and call it like any other command-line tool extractor.

Here's how I'm imagining it could work (starting simple at first; lots of this can wait for v2, v3, etc.):

papers-dl parse [--match=doi|pmid|issn|isbn|url|*] <path>
- --match=doi|pmid|... specify what patterns to search for (multiple allowed if --json|--csv are passed)
- --raw | --json | --csv whether to output one identifier per line as raw text, as JSONL {id: '...', type: 'doi', ...}, or as CSV <id>,<type>,<title>,<url>,...
- <path> path to a local file to try and parse for identifiers, e.g. ./singlefile_archive.html or ./some_citations.txt

papers-dl search [--doi|--title|--url|--author|--publishedAfter|--publishedBefore] <query>
- --providers=auto|arxiv.org|sci-hub.ru|annas-archive.org|... specify what index to search (the default auto can try all of them and return the first/best matching result)
- --fields=title,authors,description,publishDate,downloadURLs,... which metadata parameters to try and fetch
- --raw | --json | --csv format output as one line of raw text per result, CSV, or JSONL
- --doi|--issn|--title|... <query> specify one or more queries that are AND-ed together, e.g. --doi='10.1145/3375633' --title='bananas' --url='https://doi.org/.*' --publishedAfter=2021-01-01

papers-dl fetch [--providers=...] [--formats=pdf,epub,...] [--doi|--title|--url] <query>
- --providers=auto "by any means necessary": try all Sci-Hub mirrors, arxiv-dl, annas-archive, archive.org, ...
- --providers=arxiv.org,sci-hub.ru,... try a specific Sci-Hub mirror, Anna's Archive mirror, or another provider
- --formats=pdf,html,epub,md,json specify what filetypes to try downloading for the given paper (json could write a .json file containing the metadata that would show up from a search)
- --doi|--issn|--url|... <query> can take an exact downloadURL (found via search), or re-use the same query system as search and run the download on the first/best result

papers-dl citations [--incoming] [--outgoing] [--authors] <query> (someday)

Here are some examples as inspiration for CLI interface, PyPI packaging setup, README, etc.:
Your tool could even call out to these and others ^ if you want to be an all-in-one paper / research downloader.
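For a concrete sense of the subcommand layout sketched above, here is a rough argparse skeleton. All names mirror the proposed (hypothetical) papers-dl interface; only two subcommands and a few flags are shown:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton for the proposed papers-dl subcommands (illustrative only)."""
    parser = argparse.ArgumentParser(prog='papers-dl')
    sub = parser.add_subparsers(dest='command', required=True)

    # papers-dl parse [--match=...] <path>
    parse_cmd = sub.add_parser('parse', help='extract identifiers from a local file')
    parse_cmd.add_argument('path')
    parse_cmd.add_argument('--match', default='doi', help='doi|pmid|issn|isbn|url|*')
    parse_cmd.add_argument('--format', choices=['raw', 'json', 'csv'], default='raw')

    # papers-dl fetch [--providers=...] [--formats=...] <query>
    fetch_cmd = sub.add_parser('fetch', help='download a paper by identifier')
    fetch_cmd.add_argument('query')
    fetch_cmd.add_argument('--providers', default='auto')
    fetch_cmd.add_argument('--formats', default='pdf')

    return parser

args = build_parser().parse_args(['fetch', '10.1016/j.cub.2019.11.030', '--formats=pdf'])
print(args.command, args.query)  # → fetch 10.1016/j.cub.2019.11.030
```

Each subcommand handler would then dispatch to the matching provider backends.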
We can keep brainstorming on names; scihub-dl is a bit narrow, as I'm imagining this tool could eventually use Anna's Archive, Libgen, Project Gutenberg, Google Scholar, Archive.org, etc., so something broader like research-dl / papers-dl / sci-dl / lit-dl might be better.

What do you think? Does it sound like a project that you'd be excited about?

You could own the package yourself and help maintain it, or we could do it under the @ArchiveBox org and I'd be down to handle support/maintenance with you as a co-owner/admin on the repo; I'm down for either.

I think if done well, this tool has the potential to be super popular. I'm sure it'd get a bunch of attention on GitHub/HN/Reddit because there are so many people trying to scrape scientific research for AI training right now.
@benmuth commented on GitHub (Mar 23, 2024):
@pirate Yeah, I think that's a great idea, I'd be happy to try to work on this. I think a more comprehensive tool should definitely exist. Thanks for the overview, that's really helpful.
As for ownership, I'm not really sure. Maybe I can start it and I can transfer ownership to ArchiveBox if it grows and the maintenance burden proves too much for whatever reason? I don't feel strongly about it either.
I'll start a repo with what I have so far soon. I think I'll go with papers-dl or sci-dl for now.

@amosip commented on GitHub (May 3, 2024):
Is this happening? I would be very interested in using a tool like this.