Mirror of https://github.com/ArchiveBox/ArchiveBox.git (synced 2026-04-26 01:26:00 +03:00)
[GH-ISSUE #704] Feature Request: Deduplicate files on archives #3463
Originally created by @Dryusdan on GitHub (Apr 13, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/704
Hello :)
What is the problem that your feature request solves
When archiving a lot of pages, some files are often identical across those pages. The problem is that these duplicate files take up more and more space even though their contents are the same.
There are solutions on the filesystem side (ZFS, for example), but on the application side it is more complex.
I'm thinking of using rdfind, coupled with a script, to turn duplicate files into hard links. That way, if you delete the original page, you don't lose the files shared with other pages. But I'm afraid of confusing ArchiveBox in the future with my tricks ^^
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
I think duplicate files could be stored once in a "global folder", with each archive linking to them (hard link or symlink). Duplicate files share the same MD5 hash, and storing each hash in the DB would make it quick to find duplicates without a lot of IO.
What hacks or alternative solutions have you tried to solve the problem?
I haven't tried anything yet, but I think rdfind could be used to find duplicates, or each file could be hashed.
How badly do you want this new feature?
(Yes, both: it's a nice-to-have, but my disk space says it's important ^^)
@Dryusdan commented on GitHub (Apr 13, 2021):
(Mini question: why isn't ArchiveBox 0.6 on the PyPI repo currently?)
@pirate commented on GitHub (Apr 13, 2021):
I'm in the process of fixing an issue with the auto-build worker by moving it to GitHub Actions, but if it takes me more than a day to fix, I'll just roll the release by hand 😓
@Dryusdan commented on GitHub (Apr 13, 2021):
Ah, thanks for your answer.
Good luck!
@pirate commented on GitHub (Apr 13, 2021):
Issues related to the content hashing / Merkle tree / deduping process:
@Dryusdan commented on GitHub (Apr 13, 2021):
Oops, I can't find those files.
I'll close this issue about dedup. Sorry.
@pirate commented on GitHub (Apr 13, 2021):
https://pypi.org/project/archivebox/0.6.2/ 👍
@Dryusdan commented on GitHub (Apr 13, 2021):
I read the entire rdfind man page, and I think it's a good solution: calling it via subprocess would allow deduplication without a lot of work:
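A minimal sketch of what that subprocess call might look like. This assumes the rdfind binary is installed and on PATH, and that its `-makehardlinks` and `-dryrun` options behave as described in its man page; the function names and the archive path are hypothetical.

```python
# Sketch: invoke rdfind via subprocess to replace duplicate files with
# hard links in place. Hypothetical helper, not part of ArchiveBox.
import shutil
import subprocess

def rdfind_command(archive_dir: str, dry_run: bool = False) -> list:
    """Build the rdfind invocation.

    -makehardlinks true  replaces each duplicate with a hard link
    -dryrun true         only reports what would change, without touching files
    """
    return [
        "rdfind",
        "-dryrun", "true" if dry_run else "false",
        "-makehardlinks", "true",
        archive_dir,
    ]

def dedupe_with_rdfind(archive_dir: str, dry_run: bool = False) -> None:
    """Run rdfind over archive_dir, raising if it is not installed or fails."""
    if shutil.which("rdfind") is None:
        raise RuntimeError("rdfind is not installed")
    subprocess.run(rdfind_command(archive_dir, dry_run), check=True)
```

Running with `dry_run=True` first is a sensible precaution, since hard-linking is destructive: rdfind deletes each duplicate and replaces it with a link to the first occurrence.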
@pirate commented on GitHub (Apr 13, 2021):
It's a good solution, but I think I'd rather have users manage that process themselves for now than build it into ArchiveBox. Hardlinks/symlinks are not well supported on all platforms and filesystems, and many people use ArchiveBox on weird filesystems (Docker overlayfs, NFS, FUSE, network mounts, Windows file shares, etc.) that don't even support fsync, let alone hard links. Also, the more "special" the setup is and the farther it strays from a flat folder structure, the more likely it is to break over time as filesystems and specifications change, which defeats the purpose of having a long-term durable archive.
@Dryusdan commented on GitHub (Apr 14, 2021):
Ah, I see... :/
It's a big and complex problem :/
@pirate commented on GitHub (Apr 12, 2022):
Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting
Contributions/suggestions welcome there.