Dupseek

An interactive command-line Perl program that finds and removes duplicate files.

Algorithm

A few strategies are possible for finding duplicate files in a big set, such as a heavily populated directory.

One of the most widely used strategies groups files by size (files of different sizes cannot be identical) and then computes a short digital fingerprint, such as an MD5 checksum, for each file. Files with different fingerprints are certainly different, and files with the same fingerprint are very probably identical. To be sure, possible duplicates can then be checked further.
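
The size-then-fingerprint strategy described above can be sketched as follows. This is only an illustration in Python, not code taken from Dupseek; the function names md5_of and duplicates_by_fingerprint and the block size are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def md5_of(path, block_size=1 << 16):
    """Return the MD5 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def duplicates_by_fingerprint(paths):
    """Group paths by size, then by MD5; return groups with 2+ members."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[md5_of(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

Note that every file sharing its size with another file is read in full, even when it differs from the others in its first bytes.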

Dupseek does something different:

This algorithm is much more efficient than competing approaches when dealing with large files of the same size. When files differ, reading usually stops after very few reads.
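
To illustrate why reading can stop early, here is a minimal Python sketch of one way to compare same-size files block by block, splitting a group as soon as a block differs. It is an assumption-laden illustration of the general idea, not Dupseek's actual code: the function name duplicates_by_block_compare, the fixed block size and the strictly sequential read order are all choices made for the example.

```python
import os
from collections import defaultdict


def duplicates_by_block_compare(paths, block_size=4096):
    """Split same-size files into groups by comparing them block by block.

    A file drops out as soon as one of its blocks matches no other file
    in its group, so nothing more of it is read.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    result = []
    # Each pending item is (file size, offset to read next, group of paths).
    pending = [(size, 0, group) for size, group in by_size.items()
               if len(group) > 1]
    while pending:
        size, offset, group = pending.pop()
        if offset >= size:
            result.append(sorted(group))  # every block matched
            continue
        by_block = defaultdict(list)
        for path in group:
            # Reopening on every pass keeps the sketch short; a real tool
            # would manage its open file handles more carefully.
            with open(path, "rb") as handle:
                handle.seek(offset)
                by_block[handle.read(block_size)].append(path)
        for subgroup in by_block.values():
            if len(subgroup) > 1:
                pending.append((size, offset + block_size, subgroup))
    return sorted(result)
```

With this scheme, two large files that differ in their first block cost one small read each, whereas a fingerprint would require reading both files completely.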

Partial execution

Dupseek (and destroy) can be interrupted at any moment. The user is then presented with partial results and can either intervene manually or go on with the reading and computation, on a group-by-group basis. Since subsequent reads happen sparsely in the file, if some files are still in the same group after many iterations, they are most probably identical, unless the differences are very small.
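
The interrupt-and-inspect behaviour can be pictured with a small Python sketch. It assumes a hypothetical generator, here called snapshots, that yields the current candidate groups after each round of reads; neither the name nor the structure comes from Dupseek itself.

```python
def run_with_partial_results(snapshots):
    """Consume group snapshots until they run out or the user hits Ctrl-C.

    Returns (latest_snapshot, finished). When finished is False, the
    snapshot holds partial results: the groups may still contain files
    that differ in parts not yet read.
    """
    latest = None
    try:
        for latest in snapshots:
            pass  # each snapshot replaces the previous one
    except KeyboardInterrupt:
        return latest, False
    return latest, True
```

A caller could show the partial groups to the user and then decide, group by group, whether to continue the comparison or act on what is already known.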

Platforms

Dupseek was reported to run on the following platforms:

Dependencies

Dupseek was developed with Perl 5.6.1 and was also tested with Perl 5.8.4. It relies on the following modules:

License

Dupseek (and destroy) is Copyright Antonio Bellezza 2003-2005. It is released under the GPL v2. Here is the license notice:


 This program is free software; you can redistribute it and/or modify
 it under the terms of version 2 of the GNU General Public License
 as published by the Free Software Foundation;
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.

Beware

The program destroys files. Starting with version 1.1, it can also do so automatically, and mistakes can happen on either the user's or the programmer's part. Be warned!

Usage

dupseek -h outputs a help page.

Hit Ctrl-C to interrupt interactive execution and be presented with partial results.

Credits

I would like to thank Henry Laxen for sending me his patch implementing batch processing and option parsing (see credits.txt).

My thanks also go to Glenn Powers for extensive testing on Mac OS X and pointing out the problem with changing files/directories.

Download

The latest version is Dupseek version 1.3 (September 24, 2005). The file is a tgz archive containing the development files and the stand-alone program dupseek, which is the only file you need as a user.

You can also download the older releases:

Dupseek version 1.2 (March 7, 2005)
Dupseek version 1.1 (June 27, 2003)
Dupseek version 1.0 (June 6, 2003)

Bugs

Further work

If I had more spare time, I would like to add a graphical user interface, possibly managed by a separate process or thread, so that the user could interact with the program while it runs, without interrupting the main loop. Since the program works well enough for my needs now, I will probably leave it as is, but any contribution is welcome.