mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #136] Lots of 403 Forbiddens #3113
Originally created by @sbrl on GitHub (Jan 31, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/136
Describe the bug
I'm getting a bunch of random 403 Forbidden errors. The curious thing, though, is that the resulting archive output looks to be complete.
This still happens if I tweak archivebox to invoke `wget` with `-e robots=off`, and if I go 'superstealth' - see the script below.

Here's an example URL: https://www.destroyallsoftware.com/talks/wat
Here's the pair of scripts I'm using to archive things:
archive-custom
archive-url
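(The `archive-custom` and `archive-url` scripts themselves were attached on GitHub and are not included inline here. For orientation only, a generic "stealth" wget invocation of the kind being discussed might be sketched as follows; the flag choices are common wget options, not the author's actual scripts:)

```shell
#!/bin/sh
# Hypothetical sketch of a "stealth" archive command (NOT the author's
# archive-custom/archive-url scripts). -e robots=off tells wget to
# ignore robots.txt; --user-agent spoofs a browser UA, which reduces
# (but does not eliminate) 403 Forbidden responses from some servers.
url="${1:-https://www.destroyallsoftware.com/talks/wat}"
ua="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"

# Build the command as a string and print it instead of running it,
# so the sketch is safe to execute without network access.
cmd="wget --page-requisites --convert-links --adjust-extension \
-e robots=off --user-agent=\"$ua\" \"$url\""

echo "$cmd"
```

Even with flags like these, individual sub-resources can still return 403, which is the behavior described in the comment below.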
Steps to reproduce
Steps to reproduce the behavior:
Screenshots or log output
If applicable, use screenshots or copy/pasted terminal output to help explain your problem.
Software versions (please complete the following information):
@pirate commented on GitHub (Feb 1, 2019):
That's actually fairly normal: often one or two resources on a page will be blocked for various reasons, but that doesn't prevent the rest of the page from downloading.
Generally, if it says `FINISHED` with some megabytes downloaded in the output, then it ran successfully.

@sbrl commented on GitHub (Feb 2, 2019):
Ah ok! Thanks for the clarification :-)