[GH-ISSUE #92] Use rel=canonical links #1574

Closed
opened 2026-03-01 17:51:51 +03:00 by kerem · 10 comments

Originally created by @ghost on GitHub (Sep 3, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/92

In order to avoid creating duplicate search results, please use rel=canonical links in the page headers to indicate to search engines that the site is just hosting a copy.


@pirate commented on GitHub (Sep 6, 2018):

Unfortunately, since the output is just static HTML, we don't always have control over the headers served. What I can do is add the rel=canonical header to the suggested nginx config.

There's also an argument to be made for allowing pages to show up in search results in case the originals go down, but I'll let the users decide how they want to serve their archives.


@ghost commented on GitHub (Sep 6, 2018):

The rel=canonical can be added as a link tag in the HTML and is intended to indicate that the current URL is a copy of the original.

The reason I bring this up is that I was about to submit a DMCA takedown request to Google for one such URL: full republishing of articles is prohibited under my site's terms of use unless the canonical link is present and the copy does not show up in search results.


@pirate commented on GitHub (Sep 11, 2018):

Is the content hosted on my archive (https://archive.sweeting.me) or someone else's? I'm happy to help find the server owner and relay a takedown request.

I'm aware that it can be added to the HTML, but I'd like to avoid modifying the HTML on disk to be different from the original source if possible.
For anyone using nginx to serve the archive we can add a header:

```nginx
add_header Link "<$scheme://$http_host$request_uri>; rel=\"canonical\"";
```
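For clarity, the nginx variables above expand to the URL of the request itself; a small sketch of the header value that directive produces (the function and example hostname are purely illustrative, not part of ArchiveBox):

```python
def link_header(scheme: str, http_host: str, request_uri: str) -> str:
    """Build the Link header value emitted by the nginx add_header directive,
    substituting $scheme, $http_host, and $request_uri the way nginx would."""
    return f'<{scheme}://{http_host}{request_uri}>; rel="canonical"'

# e.g. a request for an archived page on a hypothetical archive host:
value = link_header("https", "archive.example.org", "/archive/123/index.html")
# value == '<https://archive.example.org/archive/123/index.html>; rel="canonical"'
```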

@ghost commented on GitHub (Sep 12, 2018):

It's on yours: https://archive.sweeting.me/archive/1518642320/index.html

But that's not really the point. The problem is that the archive page has no contact information on it, and many people won't bother hunting down the site owner; they will simply file a DMCA takedown request with Google and the hosting company, potentially causing trouble for the archive owner.

You could fix the whole thing by adding this to the `<head>` of the HTML document:

```html
<link rel="canonical" href="https://original/site/here" />
```

Alternatively, a robots.txt file could be generated that disallows indexing, just like the Wayback Machine does (https://web.archive.org/robots.txt).
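For reference, a minimal robots.txt that asks all crawlers to skip the archived pages might look like this (the /archive/ path is illustrative; use whatever prefix your archive is served under):

```
User-agent: *
Disallow: /archive/
```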


@pirate commented on GitHub (Sep 12, 2018):

robots.txt is a good idea, especially since it doesn't modify the archived HTML. I'll add that now.

I'd definitely like to follow the Wayback Machine's example wherever possible.


@pirate commented on GitHub (Sep 12, 2018):

Fixed in 8a23358fc82f09853a3721d5f36ae4adac32b84b, and published: https://archive.sweeting.me/robots.txt


@gwern commented on GitHub (Oct 28, 2019):

I realize this is a closed issue, but I'd like to note that this is a feature I want. I've been considering hosting a large set of mirrors of external links on my site as an anti-link-rot mechanism, and I feel it's both good for my site (for SEO reasons) to make the relationship clear, and morally appropriate to encode the credit explicitly. Simply putting a robots.txt ban in place (which I would do as well) doesn't solve the latter problem, and it only partially solves the former, since it's inevitable that some of these mirrors will spread into the wild (people will discover the original has link-rotted and copy the version they know exists); robots.txt only stops crawlers, not people.


@pirate commented on GitHub (Oct 28, 2019):

Oh, actually I can reopen this: since the new >=v0.4 system has a built-in Django server, we actually can control which headers are served with archived content. I won't promise it'll be added immediately with the release of v0.4, but I'll try to get it in by v0.4.5 or v0.5 at the latest.
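A minimal sketch of what serving that header could look like (the function name and the headers-dict interface here are hypothetical stand-ins for a Django response's header mapping, not ArchiveBox's actual API):

```python
def add_canonical_link_header(headers: dict, original_url: str) -> dict:
    """Attach an HTTP Link header pointing search engines at the original URL.

    `headers` stands in for a Django HttpResponse's header mapping; in a real
    view or middleware you would set response["Link"] the same way before
    returning the archived content.
    """
    headers["Link"] = f'<{original_url}>; rel="canonical"'
    return headers

# Example: serving an archived copy of an article
hdrs = add_canonical_link_header({}, "https://example.com/article")
# hdrs["Link"] == '<https://example.com/article>; rel="canonical"'
```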


@gwern commented on GitHub (Oct 28, 2019):

Thanks. I think for my use I can hack around it using sed (this might drop in duplicate canonicals, because searching my existing ArchiveBox archives shows a fair number of canonical links already defined, but oh well), so it's not an urgent need by any means.
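A sketch of that hack in Python rather than sed (the function is illustrative, not part of ArchiveBox), which also sidesteps the duplicate-canonical problem by skipping pages that already declare one:

```python
import re

def add_canonical(html: str, original_url: str) -> str:
    """Insert a rel=canonical link just after <head>, unless the page
    already declares one (avoiding duplicate canonicals)."""
    if re.search(r'rel=["\']canonical["\']', html, re.IGNORECASE):
        return html  # page already has a canonical link; leave it alone
    tag = f'<link rel="canonical" href="{original_url}" />'
    return re.sub(r'(<head[^>]*>)', r'\1' + tag, html, count=1, flags=re.IGNORECASE)

# Usage: rewrite one archived page
page = add_canonical("<html><head></head><body>...</body></html>",
                     "https://example.com/original")
```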


@pirate commented on GitHub (Jul 16, 2020):

This is done in #366! Thanks @cdvv7788!
