Freedups searches through the directories you specify. When it finds two identical files, it hard links them together. The two (or more) files still exist in their respective directories, but only one copy of the data is stored on disk; all of the directory entries point to the same data blocks. This allows you to reclaim space on your drive. It's that simple. Run it every night from a cron job.

Why you'd want to use it:
- You have multiple copies of a source code tree on your system. Freedups will link any identical files together and ignore any files that changed between versions.
- You have multiple copies of the file COPYING in /usr/doc or /usr/share/doc.
- Depending on your system, the following might be good places to try linking (the size in parentheses is the amount saved on a very basic RedHat 7.3 install; you'll probably get even more savings):
	freedups /lib/kbd					(463K)
	freedups /usr/doc /usr/share/doc
	freedups /usr/src/linux*
	freedups /usr/src/pcmcia-cs*
	freedups /usr/share					(8.6M)
	freedups /usr/lib					(97K)
	freedups /usr/man /usr/share/man
	freedups /usr/share/locale /etc/locale			(652K)
	freedups /usr/share/scrollkeeper /var/lib/scrollkeeper	(719K)
- Directories holding files that are only read are good candidates.

You might also find some space savings by deleting the /usr/share/locale/country_code/LC_MESSAGES/*.mo files in country_codes you don't need.

Things to watch out for:
- You'll need to use the _full path_, starting with /, to any files or directories you want freedups to search. If you don't, you'll likely get an error like "cannot stat file".
- Remember that you now have multiple directory entries pointing at one block of data. Depending on what editor you use, when you change one of the files you may be changing the others as well. See below for a list of applications and whether they automatically handle hardlinks or not.
- For the above reason, you probably don't want to create links to any backup copies on the drive.
- If the files are on different partitions, it's not possible to create a hardlink between them. Freedups handles this gracefully.
- Directories holding files that might be written to are generally not good candidates. Similarly, avoid directories holding security-related files. /etc is a bad choice on both counts.
- If you run freedups without the --datesequal=yes option, freedups may link files with different modification times together. If you later use "rpm -Va" (or the equivalent Debian system verify command), it may report that the timestamps on some files have changed. If this is _all_ that has changed, this is a cosmetic problem only. For example, the following is cosmetic and not indicative of a modified file:
	.......T   /usr/share/automake/COPYING

Can I run this and just see what would have been linked together without modifying anything?

Sure. In fact, unless you put -a on the command line, that's _all_ freedups will do. By default, it won't actually change anything; it'll just tell you what the approximate space savings would be.

Does this really save any space?

It really depends on whether you have duplicate files on your filesystem or not. I've personally recovered ~3G on my main drive from hardlinking identical files in the various kernel trees I have there. One user reports saving ~2G simply from hardlinking identical files downloaded by a p2p file sharing program.

Does this slow down the system like the drive compression programs?

No. No files are compressed with this tool. It only instructs the filesystem to keep one copy of two or more identical files and have all their directory entries point at the sole copy of the actual file data. In fact, for certain operations (such as using diff between two freedup'd directory trees), the system runs much, much faster. File reads should _not_ become slower. Running freedups can take quite a while, but it can certainly be run off-hours or when the system is generally idle.
It can be run under nice to give other programs priority.

Do I have to run this as root?

Not at all. As long as you own the files, freedups runs just fine as a normal user.

What has to be true for two files to get linked together?

- They have to be files (i.e. not character or block devices, no pipes, no directories, no symlinks).
- They have to have at least one byte. I don't want to link all 0 byte files on the system together.
- They have to have the same size.
- They have to have the same user owner, group owner and mode. Skirting this requirement would raise _serious_ security considerations. If you want to link two files that currently differ in owner or mode, use chown or chmod to make their owners or modes identical and re-run freedups.
- They have to be readable by the current user.
- The contents of the files have to be identical.
- Optionally (--minsize=1000), the files have to be larger than the given number of bytes.
- Optionally (--datesequal=yes), the files have to have identical modification timestamps.
- Optionally (--filenamesequal=yes), the filenames have to be identical (in different directories, obviously).
- They have to be on the same partition.
- That partition must support hardlinks. Ext2, ext3 and reiserfs do. I'm pretty sure fat/vfat/msdos do not. If you know whether another linux filesystem supports hardlinks or not, please let me know.

I think I have a bunch of files that should be linked together, but freedups doesn't link them. Why not?

Walk through the above list of criteria for a given pair of files in question. Which one fails? To examine a pair of files, look at the output from:

	ls -ali firstfile secondfile

which looks like:

	2097229 -rw-rw-r--    1 wstearns wstearns        4 Mar 11 16:09 firstfile
	2097673 -rw-------    1 nobody   nobody          5 Mar 11 16:10 secondfile

The columns are: inode number, file mode, number of links to this inode, user owner, group owner, file size, modification date, modification time, and filename.
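The checklist above lends itself to a mechanical check. The following is only a minimal sketch and is not part of freedups itself: the function name why_not_linked is hypothetical, it assumes GNU stat (for the -c option), and it leaves out the optional --minsize, --datesequal and --filenamesequal tests.

```shell
#!/bin/bash
# Hypothetical helper, NOT part of freedups: print which of the basic
# linking criteria a pair of files fails. Assumes GNU stat (-c option).
why_not_linked() {
    local a="$1" b="$2" f spec name problems=0

    # Per-file checks: regular file (not a symlink), non-empty, readable.
    for f in "$a" "$b"; do
        if [ -L "$f" ] || [ ! -f "$f" ]; then
            echo "$f: not a regular file"; problems=1
        elif [ ! -s "$f" ]; then
            echo "$f: empty file"; problems=1
        elif [ ! -r "$f" ]; then
            echo "$f: not readable"; problems=1
        fi
    done
    [ "$problems" -eq 0 ] || return 1

    # Same device and inode means the two names are already hard linked.
    if [ "$(stat -c %d:%i "$a")" = "$(stat -c %d:%i "$b")" ]; then
        echo "already linked (same inode)"
        return 0
    fi

    # Compare partition (device), size, owner, group and mode in turn.
    for spec in "%d device" "%s size" "%u owner" "%g group" "%a mode"; do
        name="${spec#* }"
        if [ "$(stat -c "${spec%% *}" "$a")" != "$(stat -c "${spec%% *}" "$b")" ]; then
            echo "different $name"; problems=1
        fi
    done

    # Finally, compare the contents byte for byte.
    cmp -s "$a" "$b" || { echo "different contents"; problems=1; }

    [ "$problems" -eq 0 ] && echo "all basic criteria met"
    return "$problems"
}
```

After sourcing the function, "why_not_linked firstfile secondfile" prints one line per failed criterion. Whether freedups performs its own checks in this order is not something this sketch claims.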
The two files in the ls listing above wouldn't be linked because their modes are different, they're owned by different users, they're owned by different groups, and they have different sizes (so must have different contents). Depending on options, they may also be disqualified because their modification times and filenames are different.

That said, if you do come up with files that legitimately should be linked but aren't, please email me so I can fix freedups.

Can this be safely run more than once?

Definitely. Freedups is smart enough to recognize that two files are already linked together and just moves on to the next pair. For this reason, running it twice on the exact same set of files won't save any more space.

Are there different ways to do this?

Sure.
- Rewrite this in a more efficient language.
- When copying a directory tree, hard link the files during the copy:
	cp -av --link linux-2.1.anything.orig linux-2.1.anything
  Many thanks to the Kernel FAQ and Janos Farkas for that trick.
- Delete truly unneeded files.
- Use CVS or Bitkeeper; the latter, at least, can save substantial amounts of space.

How can I test that the program is working?

Try the following:

	[wstearns@sparrow wstearns]$ cd /tmp
	[wstearns@sparrow /tmp]$ mkdir duptest
	[wstearns@sparrow /tmp]$ cd duptest
	[wstearns@sparrow duptest]$ echo Hi there. >test1
	[wstearns@sparrow duptest]$ cp -p test1 test2
	[wstearns@sparrow duptest]$ ls -ali test1 test2
	1885113 -rw-rw-r--    1 wstearns wstearns       10 Feb 28 00:55 test1
	1885114 -rw-rw-r--    1 wstearns wstearns       10 Feb 28 00:55 test2

Note the different inode numbers - the total space used by these two files is 20 bytes (actually 2 filesystem blocks, but that's a detail).

	[wstearns@sparrow duptest]$ freedups ./test1 ./test2
	Options chosen: None
	About to check for links in " ./test1 ./test2"
	10: Would have linked ./test2 and ./test1
	Total space would have saved: 10
	(An overestimate if more than two files would have been linked together.)
By default, it just reports what the savings would have been.

	[wstearns@sparrow duptest]$ freedups -a ./test1 ./test2
	Options chosen: ActuallyLink
	About to check for links in " ./test1 ./test2"
	10 Linked ./test2 and ./test1
	Total space saved: 10
	(Small risk of overcounting space saved if linked files have different times.)
	[wstearns@sparrow duptest]$ ls -ali test1 test2
	1885114 -rw-rw-r--    2 wstearns wstearns       10 Feb 28 00:55 test1
	1885114 -rw-rw-r--    2 wstearns wstearns       10 Feb 28 00:55 test2

Now both files share a single inode, so all but one copy is freed and the free space rises accordingly.

For more examples, run freedups with the "-h" help option.

Application list

This list of applications shows whether they handle unlinking a file before saving to it. I made an attempt on each to find an option that allows one to change this behavior, but may not have found one. Contributions and corrections are gratefully accepted. Here's how to test:

	[wstearns@sparrow wstearns]$ cd /tmp
	[wstearns@sparrow /tmp]$ mkdir linktest
	[wstearns@sparrow /tmp]$ cd linktest
	[wstearns@sparrow linktest]$ echo Hi there >test1
	[wstearns@sparrow linktest]$ ln -f test1 test2
	[wstearns@sparrow linktest]$ ls -ali test*
	1885112 -rw-rw-r--    2 wstearns wstearns        9 Mar  5 12:52 test1
	1885112 -rw-rw-r--    2 wstearns wstearns        9 Mar  5 12:52 test2
	[wstearns@sparrow linktest]$ myprogram test1
	#Replace myprogram with the program under test.
	#In this program, add some characters to the file and save your changes.
	[wstearns@sparrow linktest]$ ls -ali test*
	1885112 -rw-rw-r--    2 wstearns wstearns       19 Mar  5 12:54 test1
	1885112 -rw-rw-r--    2 wstearns wstearns       19 Mar  5 12:54 test2

The fact that the two files still share an inode and both changed in content means that the link between test1 and test2 was preserved.
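The same comparison can be scripted rather than read off the ls output. This is a rough sketch only: same_inode, report, save_in_place and save_via_rename are hypothetical names, and the two "save" functions are stand-ins that imitate the two common strategies (rewriting the existing file versus writing a new file and renaming it over the old name), not the code of any program in the list.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if both names refer to the same inode
# (the -ef test compares device and inode numbers).
same_inode() {
    [ "$1" -ef "$2" ]
}

# Print the verdict in the same words the application list uses.
report() {
    if same_inode "$1" "$2"; then
        echo "preserves link"
    else
        echo "unlinks"
    fi
}

# Stand-in for a program that rewrites the existing file in place,
# so every hard-linked name sees the new contents.
save_in_place() {
    echo "more text" >> "$1"
}

# Stand-in for a program that writes a new file and renames it over
# the old name; the other hard-linked names keep the old contents.
save_via_rename() {
    local tmp="$1.tmp.$$"
    cp -p "$1" "$tmp" && echo "more text" >> "$tmp" && mv -f "$tmp" "$1"
}
```

In the linktest directory, run the program under test on test1 (or one of the stand-ins, e.g. "save_in_place test1"), then "report test1 test2".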
If, instead, the second ls listing looks like this:

	[wstearns@sparrow linktest]$ ls -ali test*
	2236994 -rw-rw-r--    2 wstearns wstearns       19 Mar  5 12:54 test1
	1885112 -rw-rw-r--    2 wstearns wstearns        9 Mar  5 12:52 test2

this means the program unlinked test1 before saving the changes. Note that neither behavior is "correct"; it's just that you may prefer one over the other while working on a given file.

Editor			Action on save	Notes
abiword-0.7.11		preserves link
bash-1.14.7's ">"	preserves link
bash-1.14.7's ">>"	preserves link
emacs-20.7		preserves link
gedit-0.9.2		preserves link
gnotepad+-1.3.1		preserves link	#When "write backup file" turned off
gnotepad+-1.3.1		unlinks		#When "write backup file" turned on
gnumeric-0.58		preserves link
gxedit-1.23		preserves link
jove-4.16.0.24		preserves link
kedit-1.1.2		preserves link	#When "Backup Copies" turned off
kedit-1.1.2		unlinks		#When "Backup Copies" turned on
lyx-0.12.0		preserves link
mcedit-4.5.51		preserves link	#~/.mc/ini: editor_option_save_mode=0 (Save mode=quick save)
mcedit-4.5.51		unlinks		#~/.mc/ini: editor_option_save_mode=1 (Save mode=safe save)
netscape-4.76		unlinks		#Editor in netscape-communicator
nedit-5.1.1		preserves link
patch-2.5.4		unlinks
rpm-4.0			unlinks		#on "-U" upgrade, at least.
rsync-2.3.2		unlinks		#on server, hardlink is unlinked when a new version sent
vim-5.1			preserves link
wordperfect-7.0		preserves link	#"Original document backup" has no effect; always preserves link.
xedit-3.3.2		preserves link

Contacts and credits.

Please send comments, suggestions, bug reports, patches, and/or additions to the filesystem or applications list to William Stearns.

Many thanks to Kevin Burton for his constructive suggestions, most of which made it into v0.3.0. Sorry, Kevin, it's still written in bash. :-)