Google


   


You are here: CodeIdol.com > Other > Ruby Cookbook > System Administration > Finding Duplicate Files

SAVE
Digg
Shown on del.icio.us del.icio.us
See Whos Talking About This on Technorati Technorati
I've Reddit reddit

Recipe 23.7. Finding Duplicate Files

Problem

You want to find the duplicate files that are taking up all the space on your hard drive.

Solution

The simple solution is to group the files by size and then by their MD5 checksum. Two files are presumed identical if they have the same size and MD5 sum.

The following program takes a list of directories on the command line, and prints out all sets of duplicate files. You can pass a different code block into each_set_of_ duplicates for different behavior: for instance, to prompt the user about which of the duplicates to keep and which to delete.

	#!/usr/bin/ruby
	# find_duplicates.rb
	
	require 'find'
	require 'digest/md5'

	def each_set_of_duplicates(*paths)
	  sizes = {}
	  Find.find(*paths) do |f|
	   (sizes[File.size(f)] ||= []) << f if File.file? f
	  end
	  sizes.each do |size, files|
	    next unless files.size > 1
	    md5s = {}
	    files.each do |f|
	      digest = Digest::MD5.hexdigest(File.read(f))
	      (md5s[digest] ||= []) << f
	    end
	    md5s.each { |sum, files| yield files if files.size > 1 }
	  end
	end

	each_set_of_ 
duplicates(*ARGV) do |f|
	  puts " 
Duplicates: #{f.join(", ")}"
	end

Discussion

This is one task that can't be handled with a simple Find.find code block, because it's trying to figure out which files have certain relationships to each other. Find.find takes care of walking the file tree, but it would be very inefficient to try to make a single trip through the tree and immediately spit out a set of duplicates. Instead, we group the files by size and then by their MD5 checksum.

The MD5 checksum is a short binary string used as a stand-in for the contents of a file. It's commonly used to verify that a huge file was downloaded without errors. It's not impossible for two different files to have an MD5 sum, but unless someone is deliberately trying to trick you, it's almost impossible to have two files with the same size and the same MD5 sum.

Calculating a MD5 sum is very expensive: it means performing a mathematical calculation on the entire contents of the file. Grouping the files by size beforehand greatly reduces the number of sums that must be calculated, but that's still a lot of I/O. Even if two similarly sized files differ in the first byte, the code above will read the entire files.

Here's a different version of the same program that takes an incremental approach like that seen in Recipe 6.10. When it thinks a set of files might contain duplicates, it makes repeated calls to a method called eliminate_non_duplicates. The duplicates are yielded and the nonduplicates discarded over the course of these calls.

	#!/usr/bin/ruby
	# find_duplicates2.rb

	require 'find'
	BLOCK_SIZE = 1024*8

	def each_set_of_duplicates(*paths, &block)
	  sizes = Hash.new {|h, k| h[k] = [] }
	  Find.find(*paths) { |f| sizes[File.size(f)] << f if File.file? f }

	  sizes.each_pair do |size, files|
	    next unless files.size > 1
	    offset = 0
	    files = [files]
	    while !files.empty? && offset <= size
	      files = eliminate_non_ 
duplicates(files, size, offset, &block)
	      offset += BLOCK_SIZE
	    end
	  end
	end

The method eliminate_non_ duplicates takes lists of files that might contain duplicates. It reads each file an eight-kilobyte block at a time, and compares just one block of each file. Files whose blocks don't match the corresponding blocks of any other file are discarded; they're not duplicates. All files with the same block are put into a new list of possible duplicates, and sent back to each_set_of_duplicates.

If two files are not duplicates, eliminate_non_duplicates will eventually find a block where they differ. Otherwise, it will eventually read the last block of each file and confirm them as duplicates.

	def eliminate_non_duplicates(partition, size, offset)
	  possible_duplicates = []
	  partition.each do |possible_duplicate_set|
	    blocks = Hash.new {|h, k| h[k] = [] }
	    possible_duplicate_set.each do |f|
	      block = open(f, 'rb') do |file|
	        file.seek(offset)
	        file.read(BLOCK_SIZE)
	      end
	      blocks[block || ''] << f
	    end
	    blocks.each_value do |files|
	      if files.size > 1
	        if offset+BLOCK_SIZE >= size
	          # We know these are duplicates.
	          yield files
	        else
	          # We suspect these are duplicates, but we need to compare
	          # more blocks of data.
	          possible_duplicates << files
	        end
	      end
	    end
	  end
	 return possible_duplicates
	end

	each_set_of_duplicates(*ARGV) do |f|
	  puts "Duplicates: #{f.join(", ")}"
	end

This code is more complicated, but in real-world situations, it's considerably faster. Most files of the same size are not duplicates, and it's cheaper to find this out by reading eight kilobytes than by reading many megabytes and then performing two MD5 sums. This solution also eliminates any last possibility that each_set_of_ duplicates will claim two files are duplicates when they're not.

See Also


SAVE
Digg
Shown on del.icio.us del.icio.us
See Whos Talking About This on Technorati Technorati
I've Reddit reddit

You are here: CodeIdol.com > Other > Ruby Cookbook > System Administration > Finding Duplicate Files


ADBRITE ads links
   
Related tags







Popular Categories
Unix books and guides

AJAX popular information
C# language guides
Windows books and cookbooks

.......








Business Key Top Sites

be number one
rate your site




    С 2009 года мы стали переводить структура сайта на различные языки. Сайт теперь будет содержать книги не только на английском языке, но также и на других европейских языках, в том числе и на Русском языке.

    Русский Polski Francais Deutsch
    support sitemap terms

© CodeIdol Labs, 2007 - 2009