Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-26 01:26:00 +03:00)
[GH-ISSUE #704] Feature Request: Deduplicate files on archives #3463
Originally created by @Dryusdan on GitHub (Apr 13, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/704
Hello :)
What is the problem that your feature request solves
When archiving a lot of pages, some files are often identical across those pages. The problem is that these duplicate files take up more and more space even though their contents are the same.
There are solutions on the filesystem side (ZFS, for example), but on the application side it is more complex.
I'm thinking of using rdfind, coupled with a script, to turn duplicate files into hard links. That way, if you delete the original page, you don't lose the files shared with other pages. But I'm afraid of confusing ArchiveBox in the future with my tricks ^^
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
I think duplicate files could be stored once in a "global folder", with each archive linking to them (hard link or symlink). Duplicate files share the same MD5 hash, and storing each hash in the DB would make it quick to find duplicates without a lot of IO.
What hacks or alternative solutions have you tried to solve the problem?
I haven't tried anything yet, but I think rdfind could be used to find duplicates, or each file could be hashed.
How badly do you want this new feature?
(Yes, both: it's a nice-to-have, but my disk space says it's important ^^)
@Dryusdan commented on GitHub (Apr 13, 2021):
(Mini question: why isn't ArchiveBox 0.6 on the PyPI repo currently?)
@pirate commented on GitHub (Apr 13, 2021):
I'm in the process of fixing an issue with the auto-build worker by moving it to GitHub Actions, but if it takes me more than a day to fix, I'll just roll the release by hand 😓
@Dryusdan commented on GitHub (Apr 13, 2021):
Ah, thanks for your answer.
Good luck!
@pirate commented on GitHub (Apr 13, 2021):
Issues related to the content hashing / Merkle tree / deduping process:
@Dryusdan commented on GitHub (Apr 13, 2021):
Oops, I can't find those files.
I'll close this issue about dedup. Sorry.
@pirate commented on GitHub (Apr 13, 2021):
https://pypi.org/project/archivebox/0.6.2/ 👍
@Dryusdan commented on GitHub (Apr 13, 2021):
I read the entire rdfind man page, and I think it's a good solution: calling it via subprocess would allow deduplication without a lot of work:
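A minimal sketch of what that subprocess call might look like. This assumes the rdfind binary is installed and on PATH, and that its `-makehardlinks` and `-dryrun` options behave as described in its man page; the function names and the archive path are hypothetical.

```python
# Sketch: invoke rdfind via subprocess to replace duplicate files with
# hard links in place. Hypothetical helper, not part of ArchiveBox.
import shutil
import subprocess

def rdfind_command(archive_dir: str, dry_run: bool = False) -> list:
    """Build the rdfind invocation.

    -makehardlinks true  replaces each duplicate with a hard link
    -dryrun true         only reports what would change, without touching files
    """
    return [
        "rdfind",
        "-dryrun", "true" if dry_run else "false",
        "-makehardlinks", "true",
        archive_dir,
    ]

def dedupe_with_rdfind(archive_dir: str, dry_run: bool = False) -> None:
    """Run rdfind over archive_dir, raising if it is not installed or fails."""
    if shutil.which("rdfind") is None:
        raise RuntimeError("rdfind is not installed")
    subprocess.run(rdfind_command(archive_dir, dry_run), check=True)
```

Running with `dry_run=True` first is a sensible precaution, since hard-linking is destructive: rdfind deletes each duplicate and replaces it with a link to the first occurrence.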
@pirate commented on GitHub (Apr 13, 2021):
It's a good solution, but I think I'd rather have users manage that process themselves for now than build it into ArchiveBox. Hardlinks/symlinks are not well supported on all platforms and filesystems, and many people use ArchiveBox on weird filesystems (Docker overlayfs, NFS, FUSE, network mounts, Windows file shares, etc.) that don't even support fsync, let alone hard links. Also, the more "special" the setup is and the farther it strays from a flat folder structure, the more likely it is to break over time as filesystems and specifications change, which defeats the purpose of having a long-term durable archive.
@Dryusdan commented on GitHub (Apr 14, 2021):
Ah, I see... :/
It's a big and complex problem :/
@pirate commented on GitHub (Apr 12, 2022):
Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting
Contributions/suggestions welcome there.