[PR #1450] [CLOSED] papers-dl support #1431

Closed
opened 2026-03-01 14:49:46 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1450
Author: @benmuth
Created: 6/6/2024
Status: Closed

Base: dev ← Head: papers-dl


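📝 Commits (10+)

03980dc Add papers-dl options to config
ee74e06 Add papers-dl extractor
a8a57db Integrate papers-dl
24ecae2 Merge branch 'dev' into papers-dl
b4c7ba0 Add get_output_path function to papers_dl
39afd59 Change papers to papers_dl
6da7cd7 Make papers-dl verbose
48004af Add permission hack for pdf2doi
c2cb15e Make migrations
59ec122 Add missing comma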

📊 Changes

12 files changed (+221 additions, -3 deletions)

📝 Dockerfile (+6 -0)
📝 archivebox/config.py (+19 -0)
📝 archivebox/config_stubs.py (+3 -0)
➕ archivebox/core/migrations/0023_alter_archiveresult_extractor.py (+36 -0)
➕ archivebox/core/migrations/0027_merge_20240610_1731.py (+14 -0)
📝 archivebox/extractors/__init__.py (+2 -0)
➕ archivebox/extractors/papers_dl.py (+122 -0)
📝 archivebox/index/html.py (+2 -1)
📝 archivebox/index/schema.py (+3 -1)
📝 archivebox/templates/core/snapshot.html (+11 -0)
📝 pyproject.toml (+1 -0)
📝 tests/fixtures.py (+2 -1)

📄 Description

Summary

This PR adds a basic papers-dl extractor (https://pypi.org/project/papers-dl/) for downloading scientific papers.

papers-dl currently supports only Sci-Hub; support for other repositories is planned.

The known issues with papers-dl are performance and reliability. Searching the known Sci-Hub aliases for a paper is slow and occasionally causes timeouts during an ArchiveBox add job. Other failures also occur occasionally, even when repeating a search for a paper that was found before. Potential problems with captchas are not yet addressed.
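
To illustrate the timeout concern, here is a minimal sketch of bounding a slow external command so that one stuck search cannot stall a whole add job. It is not the code in this PR: `run_with_timeout` and `RunResult` are hypothetical names, and `cmd` is a placeholder for whatever command line the extractor builds.

```python
import subprocess
import sys
from typing import List, NamedTuple


class RunResult(NamedTuple):
    """Hypothetical result type for this sketch."""
    status: str  # "succeeded", "failed", or "timed_out"
    output: str  # captured stdout (empty on timeout)


def run_with_timeout(cmd: List[str], timeout: float) -> RunResult:
    """Run cmd and give up after `timeout` seconds.

    subprocess.run() kills the child process and raises
    TimeoutExpired when the timeout elapses.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return RunResult("timed_out", "")
    status = "succeeded" if proc.returncode == 0 else "failed"
    return RunResult(status, proc.stdout)


if __name__ == "__main__":
    # Stand-in command; a real caller would pass the extractor's command.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], timeout=10))
```

Reporting a timeout as its own status, separate from an ordinary failure, would let a caller decide whether a retry is worthwhile.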

papers-dl can fetch papers by several types of identifier, but I still need to add parsing for those to ArchiveBox. For now, only URLs work. Extractor functions expect Links that look like URLs, so I'm not sure how best to pass identifiers like a DOI or ISBN to the papers_dl extractor function.
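
One possible direction for the identifier question, sketched under assumptions: classify the input first, then wrap a DOI in a https://doi.org/ resolver URL so that it still looks like a URL to code that expects one. `classify_identifier` and `to_link_url` are hypothetical helpers, not part of ArchiveBox or papers-dl. The DOI pattern is a common heuristic rather than a full validator, and the ISBN check only looks at length, not the check digit.

```python
import re
from typing import Optional, Tuple
from urllib.parse import quote, urlparse

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_isbn(text: str) -> Optional[str]:
    """Return a bare ISBN-10/ISBN-13 string, or None (no checksum check)."""
    digits = re.sub(r"[\s-]", "", text).upper()
    if re.fullmatch(r"\d{9}[\dX]", digits) or re.fullmatch(r"\d{13}", digits):
        return digits
    return None


def classify_identifier(text: str) -> Tuple[str, str]:
    """Return (kind, value), kind being url, doi, isbn, or unknown."""
    text = text.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return ("url", text)
    # Accept an optional "doi:" prefix.
    candidate = text[4:].strip() if text.lower().startswith("doi:") else text
    if DOI_RE.match(candidate):
        return ("doi", candidate)
    isbn = normalize_isbn(text)
    if isbn is not None:
        return ("isbn", isbn)
    return ("unknown", text)


def to_link_url(text: str) -> Optional[str]:
    """Map an identifier to a URL-shaped string where that is possible."""
    kind, value = classify_identifier(text)
    if kind == "url":
        return value
    if kind == "doi":
        # Percent-encode everything except "/" in the DOI.
        return "https://doi.org/" + quote(value, safe="/")
    return None  # ISBNs have no single canonical resolver URL
```

ISBNs are left unmapped here on purpose, since no single resolver URL plays the role that doi.org plays for DOIs; they would need to be passed to the extractor some other way.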

Related issues

#720

Changes these areas

  • [ ] Bugfixes
  • [x] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
