[PR #1450] [CLOSED] `papers-dl` support #1431

Closed

opened 2026-03-01 14:49:46 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1450
Author: @benmuth
Created: 6/6/2024
Status: ❌ Closed

Base: dev ← Head: papers-dl

📝 Commits (10+)

03980dc Add papers-dl options to config
ee74e06 Add papers-dl extractor
a8a57db Integrate papers-dl
24ecae2 Merge branch 'dev' into papers-dl
b4c7ba0 Add get_output_path function to papers_dl
39afd59 Change papers to papers_dl
6da7cd7 Make papers-dl verbose
48004af Add permission hack for pdf2doi
c2cb15e Make migrations
59ec122 Add missing comma

📊 Changes

12 files changed (+221 additions, -3 deletions)

📝 Dockerfile (+6 -0)
📝 archivebox/config.py (+19 -0)
📝 archivebox/config_stubs.py (+3 -0)
➕ archivebox/core/migrations/0023_alter_archiveresult_extractor.py (+36 -0)
➕ archivebox/core/migrations/0027_merge_20240610_1731.py (+14 -0)
📝 archivebox/extractors/__init__.py (+2 -0)
➕ archivebox/extractors/papers_dl.py (+122 -0)
📝 archivebox/index/html.py (+2 -1)
📝 archivebox/index/schema.py (+3 -1)
📝 archivebox/templates/core/snapshot.html (+11 -0)
📝 pyproject.toml (+1 -0)
📝 tests/fixtures.py (+2 -1)

📄 Description

Summary

This PR adds basic support for papers-dl (https://pypi.org/project/papers-dl/) as an extractor for downloading scientific papers.

At this point, papers-dl supports only Sci-Hub; support for other repositories is planned.

Known issues with papers-dl include performance and reliability. Searching the known Sci-Hub aliases for a paper is slow and occasionally causes timeouts during an ArchiveBox add job. Other failures also occur occasionally, even when repeating a search for a paper that was already found. Potential problems with captchas are currently unaddressed.
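
As a rough illustration of the timeout problem described above, here is a minimal sketch of bounding a slow external command so that one stalled search cannot hang a whole add job. The helper `run_with_timeout` and the stand-in command `SLOW_CMD` are hypothetical names used only for this example; this is not the code from the PR, and it does not assume any particular papers-dl command-line flags.

```python
import subprocess
import sys

# Stand-in for a slow external fetch (hypothetical; not a real papers-dl invocation).
SLOW_CMD = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_with_timeout(cmd, timeout):
    """Run cmd, giving up after `timeout` seconds, and report a simple status dict."""
    try:
        # subprocess.run kills the child process and raises TimeoutExpired on timeout.
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"status": "failed", "reason": f"timed out after {timeout}s"}
    if proc.returncode != 0:
        return {"status": "failed", "reason": f"exit code {proc.returncode}"}
    return {"status": "succeeded", "reason": ""}
```

Calling `run_with_timeout(SLOW_CMD, 0.5)` reports a failure instead of blocking for the full five seconds, which is the behavior an extractor would want when an alias search stalls.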

papers-dl supports fetching papers by several types of identifier, but I still need to add parsing for those to ArchiveBox. For now, only URLs work. Extractor functions expect Links that look like URLs, so I'm not sure how best to pass identifiers such as a DOI or ISBN to the papers_dl extractor function.
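
One possible starting point for the identifier question above is to classify the raw input before it reaches the extractor. The sketch below is only an assumption about how that could look: `classify_identifier`, `is_isbn`, and `DOI_RE` are hypothetical names, not part of ArchiveBox or papers-dl, and the DOI pattern is a loose heuristic rather than a full validator.

```python
import re
from urllib.parse import urlparse

# Loose DOI heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"^10\.[0-9]{4,9}/\S+$")


def is_isbn(text: str) -> bool:
    """Return True if text is an ISBN-10 or ISBN-13 with a valid check digit."""
    digits = re.sub(r"[\s-]", "", text).upper()
    if re.fullmatch(r"[0-9]{9}[0-9X]", digits):
        # ISBN-10: weights 10..1, "X" counts as 10, sum must be divisible by 11.
        total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(digits))
        return total % 11 == 0
    if re.fullmatch(r"[0-9]{13}", digits):
        # ISBN-13: alternating weights 1 and 3, sum must be divisible by 10.
        total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
        return total % 10 == 0
    return False


def classify_identifier(text: str) -> str:
    """Classify input as 'url', 'doi', 'isbn', or 'unknown'."""
    text = text.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if text.lower().startswith("doi:"):
        text = text[4:].strip()
    if DOI_RE.match(text):
        return "doi"
    if is_isbn(text):
        return "isbn"
    return "unknown"
```

With a classifier like this, URL inputs could keep flowing through the existing Link path unchanged, while DOI and ISBN inputs would still need a decision about how they are represented in a Link.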

Related issues

#720

Changes these areas

[ ] Bugfixes
[x] Feature behavior
[ ] Command line interface
[ ] Configuration options
[ ] Internal architecture
[ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
