starred/ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-25 09:06:02 +03:00

[GH-ISSUE #1608] Support: v0.7.2 hitting ratelimits when archiving `bespoke.altroconsumo.it` #962

Open

opened 2026-03-01 14:47:34 +03:00 by kerem · 1 comment

kerem commented

2026-03-01 14:47:34 +03:00

Owner

Originally created by @Axel303 on GitHub (Dec 5, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1608

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

Hello,

We are trying to archive some URLs and we receive an error for every extractor that we select.

![Image](https://github.com/user-attachments/assets/ecfbde61-112d-4633-ac1a-0055a7d6af1e)

As you can see from the screenshot above, getting headers returns HTTPSConnectionPool(host='bespoke.altroconsumo.it', port=443): Max retries exceeded with url: /2024/friggitricearianew?key=f6P4XylFQlAFspeM2eigFmY3hrs&site_name=CLH2E2M50DGAirfryerBestPract (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6cb05ad8d0>: Failed to establish a new connection: [Errno 111] Connection refused')).

For the other extractor types, we get the same error message, but as a screenshot or even as HTML.

Let me know if you need more information.
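On Linux, `[Errno 111]` is `ECONNREFUSED`: the TCP connection was rejected before any HTTP request was sent. A minimal sketch for telling a refused connection apart from a timeout or a DNS failure from the same machine follows. The helper names `classify_connect_error` and `probe` are hypothetical and not part of ArchiveBox, and a "refused" result does not by itself say whether the cause is rate-limiting, a firewall, or a server that is down.

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a socket-level exception to a coarse label (hypothetical helper)."""
    if isinstance(exc, ConnectionRefusedError) or exc.errno == errno.ECONNREFUSED:
        return "refused"   # remote end (or a middlebox) actively rejected the TCP connection
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"   # no answer within the timeout, e.g. packets silently dropped
    if isinstance(exc, socket.gaierror):
        return "dns"       # hostname could not be resolved
    return "other"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a plain TCP connection and report 'open' or the failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify_connect_error(exc)
```

Calling `probe("bespoke.altroconsumo.it")` from the ArchiveBox host and again from a different network is one way to check whether only the archiving machine is being refused.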

Steps to reproduce

1. Started ArchiveBox by running: `docker run -v $PWD:/data -p 8000:8000 archivebox/archivebox:latest`
2. Went to the https://127.0.0.1:8000/add/ page in Google Chrome
3. Typed 'https://bespoke.altroconsumo.it/2024/friggitricearianew?key=f6P4XylFQlAFspeM2eigFmY3hrs&site_name=CLH2E2M50DGAirfryerBestPract' into the 'Add URL' input field
4. Clicked the 'Add+' button
5. Checked the created snapshot: the result of the archiving is "error_connection" refused.

Logs or errors

archivebox add 'https://bespoke.altroconsumo.it/2024/friggitricearianew?key=f6P4XylFQlAFspeM2eigFmY3hrs&site_name=CLH2E2M50DGAirfryerBestPract'
[i] [2024-12-05 14:34:46] ArchiveBox v0.7.2: archivebox add https://bespoke.altroconsumo.it/2024/friggitricearianew?key=f6P4XylFQlAFspeM2eigFmY3hrs&site_name=CLH2E2M50DGAirfryerBestPract
> /home/archivebox/data1
 
[+] [2024-12-05 14:34:47] Adding 1 links to index (crawl depth=0)...
> Saved verbatim input to sources/1733409287-import.txt
> Parsed 1 URLs from input (Generic TXT)
> Found 0 new URLs not already in index
 
[*] [2024-12-05 14:34:47] Writing 0 links to main index...
    √ ./index.sqlite3

ArchiveBox Version

0.7.2
ArchiveBox v0.7.2 BUILD_TIME=2024-11-13 09:29:28 1731486568
IN_DOCKER=False IN_QEMU=False ARCH=x86_64 OS=Linux PLATFORM=Linux-5.15.0-97-generic-x86_64-with-glibc2.35 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=False FS_USER=1001:1001 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=ripgrep LDAP=False

How did you install the version of ArchiveBox you are using?

pip

What operating system are you running on?

Linux (Ubuntu/Debian/Arch/Alpine/etc.)

What type of drive are you using to store your ArchiveBox data?

- [ ] `data/` is on a local SSD or NVMe drive
- [x] `data/` is on a spinning hard drive or external USB drive
- [ ] `data/` is on a network mount (e.g. NFS/SMB/CIFS/etc.)
- [ ] `data/` is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/OneDrive, etc.)

Docker Compose Configuration

We are not using a Docker Compose configuration.

ArchiveBox Configuration

[SERVER_CONFIG]
SECRET_KEY = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
[DEPENDENCY_CONFIG]
CHROME_BINARY = /home/archivebox/.cache/ms-playwright/chromium-1140/chrome-linux/chrome


kerem added the `why: performance` label 2026-03-01 14:47:34 +03:00

kerem commented

2026-03-01 14:47:35 +03:00

Author

Owner

@pirate commented on GitHub (Dec 6, 2024):

You've been rate-limited by the site you're trying to archive for making too many requests in a short time. Wait 24hr and try again.

If you keep getting rate-limited, I recommend archiving with only 1 extractor initially, then updating all the URLs with a 2nd extractor some hours later, then a 3rd later, etc., so you don't overwhelm the site. The >=v0.8.5 releases have more improvements to make rate-limiting easier; follow these issues for updates:

- https://github.com/ArchiveBox/ArchiveBox/issues/1475
- https://github.com/ArchiveBox/ArchiveBox/issues/191
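The staggered approach described above can be sketched as a small planner that emits one command per extractor, spaced a fixed number of hours apart. This is only an illustration under assumptions: it assumes the `--extract` option of `archivebox add` and `archivebox update` works as described in the v0.7 CLI help (check `archivebox add --help` and `archivebox update --help` on your install), the extractor names and the 6-hour gap are example values, and `Pass` and `staggered_passes` are hypothetical names. The script only builds the commands; it does not run them.

```python
import shlex
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Pass:
    delay: timedelta  # how long after the first pass this one should start
    command: str      # shell command for this pass


def staggered_passes(url, extractors, gap_hours=6.0):
    """Plan one archiving pass per extractor, spaced gap_hours apart."""
    passes = []
    for i, extractor in enumerate(extractors):
        # First pass adds the URL; later passes update the existing snapshot.
        # ASSUMPTION: --extract is accepted by both subcommands in your version.
        subcommand = "add" if i == 0 else "update"
        argv = ["archivebox", subcommand, f"--extract={extractor}", url]
        passes.append(Pass(timedelta(hours=gap_hours * i), shlex.join(argv)))
    return passes


if __name__ == "__main__":
    # Example extractor names; adjust to the ones enabled in your setup.
    plan = staggered_passes("https://example.com/page", ["headers", "wget", "screenshot"])
    for p in plan:
        print(f"+{p.delay}: {p.command}")
```

Each printed line can then be scheduled by hand or with a tool such as `at` or cron, so that only one extractor hits the site in any given window.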

kerem referenced this issue

2026-03-01 17:56:33 +03:00

[GH-ISSUE #962] Bug: Running archivebox update --index-only doesn't upgrade Snapshot index.{html,json} files #2108

kerem referenced this issue

2026-03-14 23:45:42 +03:00

[GH-ISSUE #962] Bug: Running archivebox update --index-only doesn't upgrade Snapshot index.{html,json} files #3620
