Feb. 1, 2011, 4:11 p.m.
posted by bruce
Comparing Directory Trees
Engineers can be a paranoid sort (but you didn't hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there's still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it's tough to have a lot of faith in tools that occasionally trash files and seem to crash my Windows machine every third Tuesday of the month. When push comes to shove, it's nice to be able to verify that data copied to a backup CD is the same as the originalor at least to spot deviations from the originalas soon as possible. If a backup is ever needed, it will be really needed.
Because data CDs are accessible as simple directory trees, we are once again in the realm of tree walkersto verify a backup CD, we simply need to walk its top-level directory. If our script is general enough, we will also be able to use it to verify other copy operations as welle.g., downloaded tar files, hard-drive backups, and so on. In fact, the combination of the cpall script we saw earlier and a general tree comparison would provide a portable and scriptable way to copy and verify data sets.
We've already written a generic walker class (the visitor module), but it won't help us here directly: we need to walk two directories in parallel and inspect common files along the way. Moreover, walking either one of the two directories won't allow us to spot files and directories that exist only in the other. Something more custom seems in order here.
Finding Directory Differences
Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem reduces to comparing corresponding files in each tree. In general, though, the trees can have arbitrarily different shapes, depths, and so on.
More generally, the contents of a directory in one tree may have more or fewer entries than the corresponding directory in the other tree. If those differing contents are filenames, there is no corresponding file to compare with; if they are directory names, there is no corresponding branch to descend through. In fact, the only way to detect files and directories that appear in one tree but not the other is to detect differences in each level's directory.
In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested and simpler operation, let's start by coding and debugging a single-directory comparison of filenames in Figure.
Given listings of names in two directories, this script simply picks out unique names in the first and unique names in the second, and reports any unique names found as differences (that is, files in one directory but not the other). Its comparedirs function returns a true result if no differences were found, which is useful for detecting differences in callers.
Let's run this script on a few directories; differences are detected and reported as names unique in either passed-in directory pathname. Notice that this is only a structural comparison that just checks names in listings, not file contents (we'll add the latter in a moment):
C:\temp>python %X%\system\filetools\dirdiff.py examples cpexamples Comparing examples to cpexamples Directory lists are identical C:\temp>python %X%\system\filetools\dirdiff.py examples\PyTools cpexamples\PyTools Comparing examples\PyTools to cpexamples\PyTools Files unique to examples\PyTools ... visitor.py C:\temp>python %X%\system\filetools\dirdiff.py examples\System\Filetools cpexamples\System\Filetools Comparing examples\System\Filetools to cpexamples\System\Filetools Files unique to examples\System\Filetools ... dirdiff2.py Files unique to cpexamples\System\Filetools ... cpall.py
>>> L1 = [1, 3, 5, 7, 9] >>> L2 = [2, 3, 6, 8, 9] >>> from dirdiff import unique >>> unique(L1, L2) # items in L1 but not in L2 [1, 5, 7] >>> unique(L2, L1) # items in L2 but not in L1 [2, 6, 8]
These two lists have objects 3 and 9 in common; the rest appear only in one of the two. When applied to directories, unique items represent tree differences, and common items are names of files or subdirectories that merit further comparisons or traversals. There are other ways to code this operation; see the dirdiff variants on the book's examples distribution for a few of them.
In fact, in Python 2.4 and later, we could also use the built-in set object type if we don't care about the order in the results (we'll use our own functions for now to avoid requiring users to upgrade):
>>> S1 = set([1, 3, 5, 7, 9]) >>> S2 = set([2, 3, 6, 8, 9]) >>> S1 - S2 # difference: unique set([1, 5, 7]) >>> S2 - S1 set([8, 2, 6]) >>> S1 & S2 # intersection: common set([9, 3])
Finding Tree Differences
Now all we need is a tree walker that applies dirdiff at each level to pick out unique files and directories, explicitly compares the contents of files in common, and descends through directories in common. Figure fits the bill.
At each directory in the tree, this script simply runs the dirdiff tool to detect unique names, and then compares names in common by intersecting directory lists. It uses recursive function calls to traverse the tree and visits subdirectories only after comparing all the files at each level so that the output is more coherent to read (the trace output for subdirectories appears after that for files; it is not intermixed).
Notice the misses list, added in the third edition of this book; it's very unlikely, but not impossible, that the same name might be a file in one directory and a subdirectory in the other. Also notice the blocksize variable; as in the tree copy script we saw earlier, instead of blindly reading entire files into memory all at once, we limit each read to grab up to 1 MB at a time, just in case any files in the directories are too big to be loaded into available memory. Without this limit, I ran into MemoryError exceptions on some machines with a prior version of this script that read both files all at once, like this:
bytes1 = open(path1, 'rb').read( ) bytes2 = open(path2, 'rb').read( ) if bytes1 == bytes2: ...match... else: ...difference...
This code was simpler, but is less practical for very large files that can't fit into your available memory space (consider CD and DVD image files, for example). In the new version's loop, the file reads return what is left when there is less than 1 MB present or remaining and return empty strings at end-of-file. Files match if all blocks read are the same, and they reach end-of-file at the same time. On some platforms, you may also want to detect and skip certain kinds of special files in order to be fully general, but these were not in my trees, so they are not in my script.
Running the Script
Since we've already studied the tree-walking tools this script employs, let's jump right into a few example runs. When run on identical trees, status messages scroll during the traversal, and a No diffs found. message appears at the end:
C:\temp>python %X%\system\filetools\diffall.py examples cpexamples -------------------- Comparing examples to cpexamples Directory lists are identical Comparing contents -------------------- Comparing examples\old_Part2 to cpexamples\old_Part2 Directory lists are identical Comparing contents -------------------- ...more lines deleted... -------------------- Comparing examples\EmbExt\Regist to cpexamples\EmbExt\Regist Directory lists are identical Comparing contents -------------------- Comparing examples\PyTools to cpexamples\PyTools Directory lists are identical Comparing contents ======================================== No diffs found.
I run this with the verbose flag passed in as False; use TRue to watch more status messages fly by. To show how differences are reported, we need to generate a few. Let's run the global search-and-replace script we met earlier in order to change a few files scattered about one of the trees (seeI told you global replacement could trash your files!):
C:\temp\examples>python %X%\PyTools\visitor_replace.py -exec SPAM Are you sure?y ... 1355 => .\PyTools\visitor_find_quiet1.py 1356 => .\PyTools\fixeoln_one.doc.txt Visited 1356 files Changed 8 files: .\package.csh .\README-PP3E.txt .\readme-old-pp1E.txt .\temp .\remp .\Internet\Cgi-Web\fixcgi.py .\System\Processes\output.txt .\PyTools\cleanpyc.py
While we're at it, let's remove a few common files so that directory uniqueness differences show up on the scope too; the following three removal commands will make two directories differ (the last two commands impact the same directory in different trees):
C:\temp>rm cpexamples\PyTools\visitor.py C:\temp>rm cpexamples\System\Filetools\dirdiff2.py C:\temp>rm examples\System\Filetools\cpall.py
Now, rerun the comparison walker to pick out differences and redirect its output report to a file for easy inspection. The following lists just the parts of the output report that identify differences. In typical use, I inspect the summary at the bottom of the report first, and then search for the strings "DIFF" and "unique" in the report's text if I need more information about the differences summarized; this could be more user-friendly, but it does the job for me:
C:\temp>python %X%\system\filetools\diffall.py examples cpexamples > diffs C:\temp>type diffs -------------------- Comparing examples to cpexamples Directory lists are identical Comparing contents package.csh DIFFERS README-PP3E.txt DIFFERS readme-old-pp1E.txt DIFFERS temp DIFFERS remp DIFFERS -------------------- Comparing examples\old_Part2 to cpexamples\old_Part2 Directory lists are identical Comparing contents -------------------- ... -------------------- Comparing examples\Internet\Cgi-Web to cpexamples\Internet\Cgi-Web Directory lists are identical Comparing contents fixcgi.py DIFFERS -------------------- ... -------------------- Comparing examples\System\Filetools to cpexamples\System\Filetools Files unique to examples\System\Filetools ... dirdiff2.py Files unique to cpexamples\System\Filetools ... cpall.py Comparing contents -------------------- ... -------------------- Comparing examples\System\Processes to cpexamples\System\Processes Directory lists are identical Comparing contents output.txt DIFFERS -------------------- ... -------------------- Comparing examples\PyTools to cpexamples\PyTools Files unique to examples\PyTools ... visitor.py Comparing contents cleanpyc.py DIFFERS ======================================== Diffs found: 10 - files differ at examples\package.csh - cpexamples\package.csh - files differ at examples\README-PP3E.txt - cpexamples\README-PP3E.txt - files differ at examples\readme-old-pp1E.txt - cpexamples\readme-old-pp1E.txt - files differ at examples\temp - cpexamples\temp - files differ at examples\remp - cpexamples\remp - files differ at examples\Internet\Cgi-Web\fixcgi.py - cpexamples\Internet\Cgi-Web\fixcgi.py - unique files at examples\System\Filetools - cpexamples\System\Filetools - files differ at examples\System\Processes\output.txt - cpexamples\System\Processes\output.txt - unique files at examples\PyTools - cpexamples\PyTools - files differ at examples\PyTools\cleanpyc.py - cpexamples\PyTools\cleanpyc.py
I added line breaks and tabs in a few of these output lines to make them fit on this page, but the report is simple to understand. Ten differences were foundthe eight files we changed (trashed) with the replacement script, and the two directories we threw out of sync with the three rm remove commands.
Verifying CD backups
So how does this script placate CD backup paranoia? To double-check my CD writer's work, I run a command such as the following. I can also use a command like this to find out what has been changed since the last backup. Again, since the CD is "G:" on my machine when plugged in, I provide a path rooted there; use a root such as /dev/cdrom or /mnt/cdrom on Linux:
C:\temp>python %X%\system\filetools\diffall.py examples g:\PP3rdEd\examples\PP3E > exdiffs091500 C:\temp>more exdiffs091500 -------------------- Comparing examples to g:\PP3rdEd\examples\PP3E Files unique to examples ... .cshrc Comparing contents tounix.py DIFFERS -------------------- Comparing examples\old_Part2 to g:\PP3rdEd\examples\PP3E\old_Part2 Directory lists are identical Comparing contents -------------------- ...more visitor_fixeoln.py DIFFERS visitor_fixnames.py DIFFERS ======================================== Diffs found: 41 - unique files at examples - g:\PP3rdEd\examples\PP3E - files differ at examples\tounix.py - g:\PP3rdEd\examples\PP3E\tounix.py ...more
The CD spins, the script compares, and a summary of 41 differences appears at the end of the report (in this case, representing changes to the examples tree since the latest backup CD was burned). For an example of a full difference report, see the file exdiffs091500 in the book's examples distribution. More typically, this is what turns up for most of my example backups; files with a leading "." are not copied to the CD:
C:\temp>python %X%\System\Filetools\diffall.py examples g:\PP3rdEd\examples\PP3E ... -------------------- Comparing examples\Config to g:\PP3rdEd\examples\PP3E\Config Files unique to examples\Config ... .cshrc Comparing contents ======================================== Diffs found: 1 - unique files at examples\Config - g:\PP3rdEd\examples\PP3E\Config
And to be really sure, I run the following global comparison command against the true book directory to verify the entire book tree backup on CD:
C:\>python %X%\System\Filetools\diffall.py PP3rdEd G:\PP3rdEd -------------------- Comparing PP3rdEd to G:\PP3rdEd Files unique to G:\PP3rdEd ... examples.tar.gz Comparing contents README.txt DIFFERS -------------------- ...more -------------------- Comparing PP3rdEd\examples\PP3E\Config to G:\PP3rdEd\examples\PP3E\Config Files unique to PP3rdEd\examples\PP3E\Config ... .cshrc Comparing contents -------------------- ...more -------------------- Comparing PP3rdEd\chapters to G:\PP3rdEd\chapters Directory lists are identical Comparing contents ch01-intro.doc DIFFERS ch04-os3.doc DIFFERS ch05-gui1.doc DIFFERS ch06-gui2.doc DIFFERS -------------------- ...more ======================================== Diffs found: 11 - unique files at PP3rdEd - G:\PP3rdEd - files differ at PP3rdEd\README.txt - G:\PP3rdEd\README.txt ...more
This particular run indicates that I've changed a README file, four chapter files, and a bunch more since the last backup; if run immediately after making a backup, only the .cshrc unique file shows up on diffall radar. This global comparison can take a few minutes. It performs byte-for-byte comparisons of all chapter files and screenshots, the examples tree, and more, but it's an accurate and complete verification. Given that this book tree contained roughly 119 MB of data in 7,300 files and 570 directories the last time I checked, a more manual verification procedure without Python's help would be utterly impossible.
After writing this script, I also started using it to verify full backups of my laptops onto an external hard-drive device. To do so, I run the cpall copy script we met earlier, and then the comparison script here to check results and get a list of files that didn't copy correctly. The last time I did this, this procedure copied and compared 225,000 files and 15,000 directories in 20 GB of spacenot the sort of task that lends itself to manual labor!
Here are the magic incantations on my Windows laptop. f:\ is a partition on my external hard drive, and you shouldn't be surprised if each of these commands runs for half an hour or more on currently common hardware. A drag-and-drop copy takes at least as long, assuming it works at all.
C:\...\System\Filetools>cpall.py c:\ f:\ > f:\copy-log.txt C:\...\System\Filetools>diffall.py f:\ c:\ > f:\diff-log.txt
Finally, it's worth noting that this script still only detects differences in the tree but does not give any further details about individual file differences. In fact, it simply loads and compares the binary contents of corresponding files with string comparisons. It's a simple yes/no result.
If and when I need more details about how two reported files actually differ, I either edit the files or run the file-comparison command on the host platform (e.g., fc on Windows/DOS, diff or cmp on Unix and Linux). That's not a portable solution for this last step; but for my purposes, just finding the differences in a 1,300-file tree was much more critical than reporting which lines differ in files flagged in the report.
Of course, since we can always run shell commands in Python, this last step could be automated by spawning a diff or fc command with os.popen as differences are encountered (or after the traversal, by scanning the report summary). The output of these system calls could be displayed verbatim, or parsed for relevant parts.
We also might try to do a bit better here by opening text files in text mode to ignore line-terminator differences, but it's not clear that such differences should be ignored (what if the caller wants to know whether line-end markers have been changed?). For example, after downloading a web site with an FTP script we'll meet in Chapter 14, the diffall script detected a discrepancy between the local copy of a file and the one at the remote server. To probe further, I simply ran some interactive Python code:
>>> a = open('lp2e-updates.html', 'rb').read( ) >>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read( ) >>> a == b False
This verifies that there really is a binary difference in the downloaded and local versions of the file; to see whether it's because a Unix or DOS line-end snuck into the file, try again in text mode so that line ends are all mapped to the standard \n character:
>>> a = open('lp2e-updates.html', 'r').read( ) >>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'r').read( ) >>> a == b True
Sure enough; now, to find where the difference is, the following code checks character by character until the first mismatch is found (in binary mode, so we retain the difference):
>>> a = open('lp2e-updates.html', 'rb').read( ) >>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read( ) >>> for (i, (ac, bc)) in enumerate(zip(a, b)): ... if ac != bc: ... print i, repr(ac), repr(bc) ... break ... 37966 '\r' '\n'
This means that at byte offset 37,966, there is a \r in the downloaded file, but a \n in the local copy. This line has a DOS line end in one and a Unix line end in the other. To see more, print text around the mismatch:
>>> for (i, (ac, bc)) in enumerate(zip(a, b)): ... if ac != bc: ... print i, repr(ac), repr(bc) ... print repr(a[i-20:i+20]) ... print repr(b[i-20:i+20]) ... break ... 37966 '\r' '\n' 're>\r\ndef min(*args):\r\n tmp = list(arg' 're>\r\ndef min(*args):\n tmp = list(args'
Apparently, I wound up with a Unix line end at one point in the local copy and a DOS line end in the version I downloadedthe combined effect of the text mode used by the download script itself (which translated \n to \r\n) and years of edits on both Linux and Windows PDAs and laptops (I probably coded this change on Linux and copied it to my local Windows copy in binary mode). Code such as this could be integrated into the diffall script to make it more intelligent about text files and difference reporting.
Because Python excels at processing files and strings, it's even possible to go one step further and code a Python equivalent of the fc and diff commands. In fact, much of the work has already been done; the standard library module difflib, new as of Python 2.1, could make this task simple. See the Python library manual for details and usage examples.
We could also be smarter by avoiding the load and compare steps for files that differ in size, and we might use a smaller block size to reduce the script's memory requirements. For most trees, such optimizations are unnecessary; reading multimegabyte files into strings is very fast in Python, and garbage collection reclaims the space as you go.
Since such extensions are beyond both this script's scope and this chapter's size limits, though, they will have to await the attention of a curious reader.