[GH-ISSUE #92] Use rel=canonical links #1574

Closed
opened 2026-03-01 17:51:51 +03:00 by kerem · 10 comments

Originally created by @ghost on GitHub (Sep 3, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/92

In order to avoid creating duplicate search results, please use rel=canonical links in the page headers to indicate to search engines that the site is just hosting a copy.


@pirate commented on GitHub (Sep 6, 2018):

Unfortunately, since the output is just static HTML, we don't always have control over the headers served. What I can do is add the rel=canonical header to the suggested nginx config.

There's also an argument to be made for allowing pages to show up in search results in case the originals go down, but I'll let the users decide how they want to serve their archives.


@ghost commented on GitHub (Sep 6, 2018):

The rel=canonical can be added as a link tag in the HTML and is intended to indicate that the current URL is a copy of the original.

The reason I bring this up is that I was about to submit a DMCA takedown request to Google for one such URL: full republishing of articles is prohibited under my site's terms of use unless the canonical link is present and the copy does not show up in search results.


@pirate commented on GitHub (Sep 11, 2018):

Is the content hosted on my archive (https://archive.sweeting.me) or someone else's? I'm happy to help find the server owner and relay a takedown request.

I'm aware that it can be added to the HTML, but I'd like to avoid modifying the HTML on disk to be different from the original source if possible.
For anyone using nginx to serve the archive we can add a header:

```nginx
add_header Link "<$scheme://$http_host$request_uri>; rel=\"canonical\"";
```
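For clarity, the nginx variables above expand to the URL of the request itself; a small sketch of the header value that directive produces (the function and example hostname are purely illustrative, not part of ArchiveBox):

```python
def link_header(scheme: str, http_host: str, request_uri: str) -> str:
    """Build the Link header value emitted by the nginx add_header directive,
    substituting $scheme, $http_host, and $request_uri the way nginx would."""
    return f'<{scheme}://{http_host}{request_uri}>; rel="canonical"'

# e.g. a request for an archived page on a hypothetical archive host:
value = link_header("https", "archive.example.org", "/archive/123/index.html")
# value == '<https://archive.example.org/archive/123/index.html>; rel="canonical"'
```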

@ghost commented on GitHub (Sep 12, 2018):

It's on yours: https://archive.sweeting.me/archive/1518642320/index.html

But that's not really the point. The problem is that the archive page has no contact information on it, and many people won't bother hunting down the site owner; they will simply file a DMCA takedown request with Google and the hosting company, potentially causing trouble for the archive owner.

You could fix the whole thing by adding this to the `<head>` of the HTML document:

```html
<link rel="canonical" href="https://original/site/here" />
```

Alternatively, a robots.txt file could be generated that disallows indexing, just like the Wayback Machine does (https://web.archive.org/robots.txt).
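For reference, a minimal robots.txt that asks all crawlers to skip the archived pages might look like this (the /archive/ path is illustrative; use whatever prefix your archive is served under):

```
User-agent: *
Disallow: /archive/
```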


@pirate commented on GitHub (Sep 12, 2018):

robots.txt is a good idea, especially since it doesn't modify the archived HTML. I'll add that now.

I'd definitely like to follow the Wayback Machine's example wherever possible.


@pirate commented on GitHub (Sep 12, 2018):

Fixed in 8a23358fc82f09853a3721d5f36ae4adac32b84b, and published: https://archive.sweeting.me/robots.txt


@gwern commented on GitHub (Oct 28, 2019):

I realize this is a closed issue, but I'd like to note that this is a feature I want. I've been considering hosting a large set of mirrors of external links on my site as an anti-link-rot mechanism, and I feel it's both good for my site (for SEO reasons) to make the relationship clear, and morally appropriate to encode the credit explicitly. Simply putting a robots.txt ban in place (which I would do as well) doesn't solve the latter problem, and it only partially solves the former, since it's inevitable that some of these mirrors will spread into the wild (people will discover the original has link-rotted and copy the version they know exists); robots.txt only stops crawlers, not people.


@pirate commented on GitHub (Oct 28, 2019):

Oh, actually I can reopen this: since the new >=v0.4 system has a built-in Django server, we actually can control which headers are served with archived content. I won't promise it'll be added immediately with the release of v0.4, but I'll try to get it in by v0.4.5 or v0.5 at the latest.
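A minimal sketch of what serving that header could look like (the function name and the headers-dict interface here are hypothetical stand-ins for a Django response's header mapping, not ArchiveBox's actual API):

```python
def add_canonical_link_header(headers: dict, original_url: str) -> dict:
    """Attach an HTTP Link header pointing search engines at the original URL.

    `headers` stands in for a Django HttpResponse's header mapping; in a real
    view or middleware you would set response["Link"] the same way before
    returning the archived content.
    """
    headers["Link"] = f'<{original_url}>; rel="canonical"'
    return headers

# Example: serving an archived copy of an article
hdrs = add_canonical_link_header({}, "https://example.com/article")
# hdrs["Link"] == '<https://example.com/article>; rel="canonical"'
```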


@gwern commented on GitHub (Oct 28, 2019):

Thanks. I think for my use I can hack around it using sed (this might drop in duplicate canonicals, because searching my existing ArchiveBox archives shows a fair number of canonical links already defined, but oh well), so it's not an urgent need by any means.
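A sketch of that hack in Python rather than sed (the function is illustrative, not part of ArchiveBox), which also sidesteps the duplicate-canonical problem by skipping pages that already declare one:

```python
import re

def add_canonical(html: str, original_url: str) -> str:
    """Insert a rel=canonical link just after <head>, unless the page
    already declares one (avoiding duplicate canonicals)."""
    if re.search(r'rel=["\']canonical["\']', html, re.IGNORECASE):
        return html  # page already has a canonical link; leave it alone
    tag = f'<link rel="canonical" href="{original_url}" />'
    return re.sub(r'(<head[^>]*>)', r'\1' + tag, html, count=1, flags=re.IGNORECASE)

# Usage: rewrite one archived page
page = add_canonical("<html><head></head><body>...</body></html>",
                     "https://example.com/original")
```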


@pirate commented on GitHub (Jul 16, 2020):

This is done in #366! Thanks @cdvv7788!
