[GH-ISSUE #158] Archive Method: wget has issues when archiving gamestar.de #111

Closed
opened 2026-03-01 14:40:42 +03:00 by kerem · 2 comments

Originally created by @Powerbless on GitHub (Mar 3, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/158

When I try to archive https://gamestar.de, the archived website (not the screenshot or PDF) looks like a very old HTML site. When I use wget locally to download the site with my own wget command, everything works fine. Does ArchiveBox use the "--execute robots=off" flag?

My wget command that works:
"C:\wget\wget.exe" --mirror -c --recursive --level 1 --timestamping --page-requisites --html-extension --convert-links --execute robots=off --directory-prefix=.\gamestar\ --span-hosts --domains=gamestar.de,www.gamestar.de https://www.gamestar.de

kerem closed this issue 2026-03-01 14:40:42 +03:00

@pirate commented on GitHub (Mar 3, 2019):

This is the wget command that ArchiveBox uses, and we do ignore robots exclusions using the -e robots=off flag.

wget --server-response \
     --no-verbose \
     --adjust-extension \
     --convert-links \
     --force-directories \
     --backup-converted \
     --span-hosts \
     --no-parent \
     -e robots=off \
     --restrict-file-names=unix \
     --timeout=60 \
     --warc-file=warc.gz \
     --page-requisites \
     --no-check-certificate \
     --no-hsts \
     https://example.com/url/to/archive/goes/here.html

We don't use --mirror or --level 1, though. Maybe you can test your command with those flags removed, and the ArchiveBox one with those flags added. I can also experiment with adding those flags and see how that affects wget behavior on other sites.
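
One way to set up that comparison is a small script that prints both variants, so they can be inspected side by side or run in separate directories. This is only a minimal sketch, not something ArchiveBox ships: the `run_variant` helper, the `RUN` switch, and the directory names are made up for illustration, while the flag list is copied from the command above. Keep in mind that `--mirror` is shorthand for `-r -N -l inf --no-remove-listing`, so the `--level 1` that follows it should cap the recursion depth at one.

```bash
#!/bin/sh
# Sketch: build the two wget invocations to compare.
# By default the commands are only printed; set RUN=1 to execute them.

url='https://www.gamestar.de'

# Flags copied from the ArchiveBox command quoted above.
base_flags='--server-response --no-verbose --adjust-extension --convert-links
  --force-directories --backup-converted --span-hosts --no-parent
  -e robots=off --restrict-file-names=unix --timeout=60
  --warc-file=warc.gz --page-requisites --no-check-certificate --no-hsts'

# Flags from the reporter's command that ArchiveBox does not use.
extra_flags='--mirror --level 1'

# run_variant DIR FLAG...
# Prints the wget command for one variant. With RUN=1 it also runs the
# command inside DIR (a hypothetical scratch directory), so each variant
# keeps its own page tree and WARC file.
run_variant() {
  dir=$1
  shift
  echo "wget $* $url"
  if [ "${RUN:-0}" = 1 ]; then
    mkdir -p "$dir" && (cd "$dir" && wget "$@" "$url")
  fi
}

# The flag strings are left unquoted on purpose so the shell splits them
# into separate arguments.
run_variant without-mirror $base_flags
run_variant with-mirror $base_flags $extra_flags
```

Saved as, say, `compare.sh` (a hypothetical name), `sh compare.sh` prints the two commands, and `RUN=1 sh compare.sh` downloads both variants into `without-mirror/` and `with-mirror/` so the resulting pages can be opened and compared.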


@pirate commented on GitHub (Jul 24, 2020):

Please give this a try on the latest django branch (which contains the latest versions of wget, youtubedl, curl, etc.). If you're still seeing issues, comment back here and I'll reopen the ticket.

git checkout django
git pull
docker build . -t archivebox
docker run -v "$PWD/output":/data archivebox init
docker run -v "$PWD/output":/data archivebox add 'https://gamestar.de'