mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #92] Use rel=canonical links #1574
Originally created by @ghost on GitHub (Sep 3, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/92
To avoid creating duplicate search results, please use rel=canonical links in the headers to indicate to search engines that the site is just hosting a copy.
@pirate commented on GitHub (Sep 6, 2018):
Unfortunately, since the output is just static HTML, we don't always have control over the headers served. What I can do is add the rel=canonical header to the suggested nginx config.
There's also an argument to be made for allowing pages to show up in search results in case the originals go down, but I'll let the users decide how they want to serve their archives.
@ghost commented on GitHub (Sep 6, 2018):
The rel=canonical can be added as a link tag in the HTML and is intended to indicate that the current URL is a copy of the original.
The reason I bring this up is that I was about to submit a DMCA takedown to Google for one such URL, as the full republishing of articles is prohibited under my site's terms of use unless the canonical link is present, and it doesn't show up in the search results.
@pirate commented on GitHub (Sep 11, 2018):
Is the content hosted on my archive or someone else's? I'm happy to help find the server owner and relay a takedown request.
I'm aware that it can be added to the html, but I'd like to avoid modifying the html on disk to be different from the original source if possible.
For anyone using nginx to serve the archive we can add a header:
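The nginx snippet that originally followed is not preserved in this mirror. A hypothetical directive for this purpose, using the HTTP Link header syntax from RFC 8288 with the canonical relation from RFC 6596, might look like the following (the URL is a placeholder, since a static config cannot know each page's original URL):

```nginx
# Hypothetical sketch: serve archived pages with a Link header declaring
# the original URL as canonical. example.com/original-page is a placeholder;
# a real setup would need to generate the href per archived page.
location /archive/ {
    add_header Link '<https://example.com/original-page>; rel="canonical"' always;
}
```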
@ghost commented on GitHub (Sep 12, 2018):
It's on yours: https://archive.sweeting.me/archive/1518642320/index.html
But that's not really the point. The problem here is that the archive page has no contact information on it, and many people are not going to bother hunting down the site owner; they are just going to file a DMCA takedown request with Google and the hosting company, potentially causing trouble for the archive owner.
You could fix the whole thing by adding a canonical link tag to the <head> part of the HTML document. Alternatively, a robots.txt file could be generated that disallows indexing, just like the Wayback Machine does it.
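The tag in question was lost in this mirror; as a hedged reconstruction, it would look something like this (the href is a placeholder standing in for the archived page's original URL):

```html
<!-- Hypothetical example: example.com/original-page is a placeholder
     for the URL the archived copy was captured from -->
<link rel="canonical" href="https://example.com/original-page">
```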
@pirate commented on GitHub (Sep 12, 2018):
robots.txt is a good idea, especially since it doesn't modify the archive HTML. I'll add that now. I'd definitely like to follow the Wayback Machine's example wherever possible.
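The published file's contents aren't reproduced in this mirror; a typical crawler-blocking robots.txt, of the kind archive sites commonly use, is simply:

```
User-agent: *
Disallow: /
```

This tells compliant crawlers not to index anything under the site, without touching the archived HTML itself.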
@pirate commented on GitHub (Sep 12, 2018):
Fixed in 8a23358fc8, and published: https://archive.sweeting.me/robots.txt
@gwern commented on GitHub (Oct 28, 2019):
I realize this is a closed issue, but I'd like to comment that this is a feature I'd like. I've been considering hosting a large set of mirrors of external links on my site as an anti-link-rot mechanism, and I feel it's both good for my site (for SEO reasons) to make the relationship clear, and morally appropriate to encode the credit explicitly. Simply putting a robots.txt ban in place (which I would do as well) doesn't solve the latter problem, and the former is only partially solved by a robots.txt ban, since it is inevitable that some of these mirrors will spread into the wild (as people discover the original has link-rotted and copy the version they know exists) - robots.txt only stops crawlers, not people.
@pirate commented on GitHub (Oct 28, 2019):
Oh actually I can reopen this: since the new >=v0.4 system has a Django server built in, we actually can control which headers are served with archived content. I won't promise it'll be added immediately with the release of v0.4, but I'll try to get it in by v0.4.5 or v0.5 at the latest.
@gwern commented on GitHub (Oct 28, 2019):
Thanks. I think for my use, I can hack around it by using sed (this might drop in duplicate canonicals, because searching my existing ArchiveBox archives, there are a fair number of canonical links already defined, but oh well), so it's not an urgent need by any means.
@pirate commented on GitHub (Jul 16, 2020):
This is done in #366! Thanks @cdvv7788!
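For reference, and strictly as a hypothetical sketch (not the actual change merged in #366), attaching such a header from a Django-style middleware could look like this; the `original_archived_url` attribute is an assumed name for wherever the view records the source URL:

```python
# Hypothetical sketch only -- NOT the ArchiveBox implementation from #366.
# A Django-style middleware that adds a Link: rel="canonical" header
# pointing at the originally archived URL, assuming the view has stored
# that URL on the request as `original_archived_url`.

class CanonicalLinkMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        # Django responses support dict-style header access
        response = self.get_response(request)
        original_url = getattr(request, "original_archived_url", None)
        if original_url:
            # RFC 6596 canonical relation, expressed as an HTTP Link header
            response["Link"] = f'<{original_url}>; rel="canonical"'
        return response
```

The same header could equally be set per-view; a middleware just keeps the concern in one place.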