this post was submitted on 23 Jun 2024
82 points (100.0% liked)

Deduplication tool (lemmy.world)
submitted 4 months ago* (last edited 4 months ago) by Agility0971@lemmy.world to c/linux@lemmy.ml
 

I'm in the process of setting up a proper backup solution. However, over the years I've copy-pasted my home directory from different systems a few times as a quick and dirty solution. Now I have to pay off my technical debt and remove the duplicates. I'm looking for a deduplication tool that works like this:

  • accept a destination directory
  • source locations should be deleted after the operation
  • if two files have the same content, delete the redundant copy
  • if two files have different content but the same name, move the file and rename it to avoid a name collision

I tried doing it in Nautilus, but it does not look at the file content, only the file name. E.g. if two photos have the same content but different names, it will still create a redundant copy.
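The requirements above could be sketched roughly as follows. This is a minimal, untested-in-production Python sketch, not an existing tool: the function names (`merge_dedup`, `file_digest`, `free_name`) are made up for illustration, and it assumes "same content" means an identical SHA-256 digest, that symlinks should be left alone, and that the source and destination directories do not contain each other.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def free_name(target: Path) -> Path:
    """Return target, or target with a numeric suffix if the name is taken."""
    candidate, n = target, 1
    while candidate.exists():
        candidate = target.with_name(f"{target.stem}_{n}{target.suffix}")
        n += 1
    return candidate


def merge_dedup(source: Path, dest: Path) -> None:
    """Move unique files from source into dest; delete content duplicates."""
    dest.mkdir(parents=True, exist_ok=True)
    # Index what is already in the destination by content, not by name.
    seen = {
        file_digest(p)
        for p in dest.rglob("*")
        if p.is_file() and not p.is_symlink()
    }
    # Walk bottom-up so emptied directories can be removed as we go.
    for root, _dirs, files in os.walk(source, topdown=False):
        for name in files:
            src = Path(root) / name
            if src.is_symlink() or not src.is_file():
                continue  # assumption: symlinks and special files stay put
            digest = file_digest(src)
            if digest in seen:
                src.unlink()  # same content already kept: drop the copy
                continue
            # New content: keep the relative path, rename on a name clash.
            target = free_name(dest / src.relative_to(source))
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(target))
            seen.add(digest)
        try:
            os.rmdir(root)  # only succeeds if the directory is now empty
        except OSError:
            pass
```

Calling `merge_dedup(Path("old-home-copy"), Path("merged"))` once per old copy (both paths are placeholders) would fold each copy into the same destination. It would be wise to try it on a throwaway copy of the data first, since the source files are deleted as it goes.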

Edit: Some comments suggested using duperemove on btrfs. This replaces identical file content with references to the same location on disk. This is not what I intend; I want to remove the redundant files completely.

Edit 2: Another quite cool solution is to use hardlinks. It replaces all occurrences of the same data with a hardlink. Then the redundant directories can be traversed and whatever is a link can be deleted. The remaining files will be unique. I'm not going for this myself as I don't trust myself to write a bug-free implementation.
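For reference, the cleanup half of the hardlink idea might look something like the sketch below. It assumes some other tool has already replaced identical files with hardlinks, and that none of the files had extra hardlinks before that step (otherwise those would be deleted too). The function name `delete_linked_files` is hypothetical.

```python
import os
import stat
from pathlib import Path


def delete_linked_files(redundant_dir: Path) -> list[Path]:
    """Delete regular files under redundant_dir that have another hardlink.

    The link count is read right before each deletion, so if every link to
    some data lives inside redundant_dir, the last one has a count of 1 by
    the time it is visited and is kept.
    """
    removed = []
    for root, _dirs, files in os.walk(redundant_dir):
        for name in files:
            path = Path(root) / name
            info = path.lstat()  # lstat: do not follow symlinks
            if stat.S_ISREG(info.st_mode) and info.st_nlink > 1:
                path.unlink()
                removed.append(path)
    return removed
```

Whatever is left in the redundant directory afterwards has a link count of 1, i.e. its content exists nowhere else, and can then be moved over by hand.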

[–] biribiri11@lemmy.ml 2 points 4 months ago (1 children)

As said previously, Borg is a full deduplicating incremental archiver complete with compression. You can use relative paths temporarily to build up your backups and a full backup history, then use something like pika to browse the archives to ensure a complete history.

[–] Agility0971@lemmy.world -1 points 4 months ago (2 children)

I did not ask for a backup solution, but for a deduplication tool

[–] biribiri11@lemmy.ml 3 points 4 months ago* (last edited 4 months ago)

Tbf you did start your post with

I’m in the process of starting a proper backup

So you’re going to end up with at least a few people talking about how to onboard your existing backups into a proper backup solution (like borg). Your bullet points can probably be organized into a shell script with sync, but why? A proper backup solution with a full backup history is going to be way more useful than dumping all your files into a directory and renaming them in case something clobbers. I don’t see the point in doing anything other than tarring your old backups and using borg import-tar. It feels like you’re trying to go from one half-baked, odd backup solution to another, instead of just going with a full, complete solution.

[–] rotopenguin@infosec.pub -2 points 4 months ago* (last edited 4 months ago)

Use rm with the redundant files option.

rm -rf /