Splitting and Joining Files

Like most kids, mine spend a lot of time on the Internet. As far as I can tell, it's the thing to do these days. Among this latest generation, computer geeks and gurus seem to be held in the same sort of esteem that my generation once held rock stars. When kids disappear into their rooms, the chances are good that they are hacking on computers, not mastering guitar riffs. It's probably healthier than some of the diversions of my own misspent youth, but that's a topic for another kind of book.

If you have teenage kids and computers, or know someone who does, you probably know that it's not a bad idea to keep tabs on what those kids do on the Web. Type your favorite four-letter word in almost any web search engine and you'll understand the concern: it's much better stuff than I could get during my teenage career. To sidestep the issue, only a few of the machines in my house have Internet feeds.

While they're on one of these machines, my kids download lots of games. To avoid infecting our Very Important Computers with viruses from public-domain games, though, my kids usually have to download games on a computer with an Internet feed and transfer them to their own computers to install. The problem is that game files are not small; they are usually much too big to fit on a floppy (and burning a CD takes away valuable game-playing time).

If all the machines in my house ran Linux, this would be a nonissue. There are standard command-line programs on Unix for chopping a file into pieces small enough to fit on a floppy (split), and others for putting the pieces back together to re-create the original file (cat). Because we have all sorts of different machines in the house, though, we needed a more portable solution.[*]

[*] As I'm writing the third edition of this book, I should probably note that some of this background story is now a bit dated. Some six years later, floppies have largely gone the way of the parallel port and the dinosaur. Moreover, burning a CD is no longer as painful as it once was, there are new options today such as large flash memory cards and wireless home networks, and the configuration of my home computers isn't what it once was. For that matter, some of my kids are no longer kids (though they've retained backward compatibility with the past).

Splitting Files Portably

Since all the computers in my house run Python, a simple portable Python script came to the rescue. The Python program below distributes a single file's contents among a set of part files and stores those part files in a directory.


# split a file into a set of parts; join.py puts them back together;
# this is a customizable version of the standard Unix split command-line
# utility; because it is written in Python, it also works on Windows and
# can be easily modified; because it exports a function, its logic can
# also be imported and reused in other applications;

import sys, os
kilobytes = 1024
megabytes = kilobytes * 1000
chunksize = int(1.4 * megabytes)                   # default: roughly a floppy

def split(fromfile, todir, chunksize=chunksize):
    if not os.path.exists(todir):                  # caller handles errors
        os.mkdir(todir)                            # make dir, read/write parts
    else:
        for fname in os.listdir(todir):            # delete any existing files
            os.remove(os.path.join(todir, fname))
    partnum = 0
    input = open(fromfile, 'rb')                   # use binary mode on Windows
    while 1:                                       # eof=empty string from read
        chunk = input.read(chunksize)              # get next part <= chunksize
        if not chunk: break
        partnum  = partnum+1
        filename = os.path.join(todir, ('part%04d' % partnum))
        fileobj  = open(filename, 'wb')
        fileobj.write(chunk)                       # store this chunk in its part
        fileobj.close()                            # or simply open().write()
    input.close()
    assert partnum <= 9999                         # join sort fails if 5 digits
    return partnum

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print 'Use: split.py [file-to-split target-dir [chunksize]]'
    else:
        if len(sys.argv) < 3:
            interactive = 1
            fromfile = raw_input('File to be split? ')       # input if clicked
            todir    = raw_input('Directory to store part files? ')
        else:
            interactive = 0
            fromfile, todir = sys.argv[1:3]                  # args in cmdline
            if len(sys.argv) == 4: chunksize = int(sys.argv[3])
        absfrom, absto = map(os.path.abspath, [fromfile, todir])
        print 'Splitting', absfrom, 'to', absto, 'by', chunksize

        try:
            parts = split(fromfile, todir, chunksize)
        except:
            print 'Error during split:'
            print sys.exc_info()[0], sys.exc_info()[1]
        else:
            print 'Split finished:', parts, 'parts are in', absto
        if interactive: raw_input('Press Enter key') # pause if clicked

By default, this script splits the input file into chunks that are roughly the size of a floppy disk, perfect for moving big files between electronically isolated machines. Most importantly, because this is all portable Python code, this script will run on just about any machine, even ones without their own file splitter. All it requires is an installed Python. Here it is at work splitting the Python 1.5.2 self-installer executable on Windows:

C:\temp>echo %X%              shorthand shell variable 

C:\temp>ls -l py152.exe 
-rwxrwxrwa   1 0        0        5028339 Apr 16  1999 py152.exe

C:\temp>python %X%\System\Filetools\split.py -help 
Use: split.py [file-to-split target-dir [chunksize]]

C:\temp>python %X%\System\Filetools\split.py py152.exe pysplit 
Splitting C:\temp\py152.exe to C:\temp\pysplit by 1433600
Split finished: 4 parts are in C:\temp\pysplit

C:\temp>ls -l pysplit 
total 9821
-rwxrwxrwa   1 0        0        1433600 Sep 12 06:03 part0001
-rwxrwxrwa   1 0        0        1433600 Sep 12 06:03 part0002
-rwxrwxrwa   1 0        0        1433600 Sep 12 06:03 part0003
-rwxrwxrwa   1 0        0         727539 Sep 12 06:03 part0004

Each of these four generated part files represents one binary chunk of the file py152.exe, a chunk small enough to fit comfortably on a floppy disk. In fact, if you add the sizes of the generated part files given by the ls command, you'll come up with 5,028,339 bytes, exactly the same as the original file's size. Before we see how to put these files back together again, let's explore a few of the splitter script's finer points.
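The size arithmetic is easy to check in code. The following is a minimal sketch written for a recent Python 3, not part of split.py; part_layout is a hypothetical helper name, and the two sizes are copied from the session above:

```python
def part_layout(filesize, chunksize):
    """Return (number of parts, size of the last part) for a split."""
    full, rest = divmod(filesize, chunksize)    # whole chunks, leftover bytes
    if rest:
        return full + 1, rest                   # leftover lands in a final part
    return full, (chunksize if full else 0)     # divides evenly (or empty file)

filesize  = 5028339                             # py152.exe, per the ls listing
chunksize = 1433600                             # default echoed by split.py

count, last = part_layout(filesize, chunksize)
print(count, last)                                  # 4 727539
print((count - 1) * chunksize + last == filesize)   # True
```

Three full-size parts plus one 727,539-byte remainder account for every byte of the original.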

Operation modes

This script is designed to input its parameters in either interactive or command-line mode; it checks the number of command-line arguments to find out the mode in which it is being used. In command-line mode, you list the file to be split and the output directory on the command line, and you can optionally override the default part file size with a third command-line argument.

In interactive mode, the script asks for a filename and output directory at the console window with raw_input and pauses for a key press at the end before exiting. This mode is nice when the program file is started by clicking on its icon; on Windows, parameters are typed into a pop-up DOS box that doesn't automatically disappear. The script also shows the absolute paths of its parameters (by running them through os.path.abspath) because they may not be obvious in interactive mode. We'll see examples of other split modes at work in a moment.

Binary file access

This code is careful to open both input and output files in binary mode (rb, wb), because it needs to portably handle things like executables and audio files, not just text. In Chapter 4, we learned that on Windows, text-mode files automatically map \r\n end-of-line sequences to \n on input and map \n to \r\n on output. For true binary data, we really don't want any \r characters in the data to go away when read, and we don't want any superfluous \r characters to be added on output. Binary-mode files suppress this \r mapping when the script is run on Windows and so avoid data corruption.
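To see the difference for yourself, here is a small sketch that assumes a recent Python 3, where text mode translates \r\n to \n on input on every platform, not just Windows. The scratch file name is made up for the demo:

```python
import os, tempfile

data = b'spam\r\neggs\r\n'                     # bytes containing \r\n pairs
path = os.path.join(tempfile.mkdtemp(), 'sample.bin')   # hypothetical scratch file

with open(path, 'wb') as f:                    # binary mode: bytes stored as is
    f.write(data)

with open(path, 'rb') as f:                    # binary mode: bytes returned as is
    as_binary = f.read()

with open(path, 'r', encoding='ascii') as f:   # text mode: line ends translated
    as_text = f.read()

print(as_binary == data)                       # True
print(len(data), len(as_text))                 # 12 10
```

The two bytes that vanish in text mode are exactly the \r characters a binary file cannot afford to lose.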

Manually closing files

This script also goes out of its way to manually close its files. For instance:

fileobj  = open(partname, 'wb')
fileobj.write(chunk)
fileobj.close()

As we also saw in Chapter 4, these three lines can usually be replaced with this single line:

open(partname, 'wb').write(chunk)

This shorter form relies on the fact that the current Python implementation automatically closes files for you when file objects are reclaimed (i.e., when they are garbage collected, because there are no more references to the file object). In this line, the file object would be reclaimed immediately, because the open result is temporary in an expression and is never referenced by a longer-lived name. Similarly, the input file is reclaimed when the split function exits.

As I was writing this chapter, though, there was some possibility that this automatic-close behavior might go away in the future. Moreover, the Jython Java-based Python implementation does not reclaim unreferenced objects as immediately as the standard Python. If you care about the Java port, if your script may create many files in a short amount of time, and if it may run on a machine that limits the number of open files per program, then close files manually. The close calls in this script have never been necessary for my purposes, but because the split function in this module is intended to be a general-purpose tool, it accommodates such worst-case scenarios.
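In Python releases newer than the ones this script was written for, the with statement offers a middle road: it closes the file at a known point without an explicit close call. A minimal sketch, assuming a recent Python; write_part is a hypothetical helper, not a function in split.py:

```python
import os, tempfile

def write_part(partname, chunk):
    """Write one chunk and make sure the file is closed before returning."""
    with open(partname, 'wb') as fileobj:      # closed on block exit, even on error
        fileobj.write(chunk)
    return fileobj.closed                      # file object reports its own state

partname = os.path.join(tempfile.mkdtemp(), 'part0001')   # scratch location
closed = write_part(partname, b'some bytes')
print(closed)                                  # True
```

Because the close no longer depends on garbage collection, this form behaves the same way on implementations that reclaim objects lazily.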

Joining Files Portably

Back to moving big files around the house: after downloading a big game program file, my kids generally run the previous splitter script by clicking on its name in Windows Explorer and typing filenames. After a split, they simply copy each part file onto its own floppy, walk the floppies upstairs, and re-create the split output directory on their target computer by copying files off the floppies. Finally, the script below is clicked or otherwise run to put the parts back together.


# join all part files in a dir created by split.py, to re-create file.
# This is roughly like a 'cat fromdir/* > tofile' command on unix, but is
# more portable and configurable, and exports the join operation as a
# reusable function.  Relies on sort order of filenames: must be same
# length.  Could extend split/join to pop up Tkinter file selectors.

import os, sys
readsize = 1024

def join(fromdir, tofile):
    output = open(tofile, 'wb')
    parts  = os.listdir(fromdir)
    parts.sort( )
    for filename in parts:
        filepath = os.path.join(fromdir, filename)
        fileobj  = open(filepath, 'rb')
        while 1:
            filebytes = fileobj.read(readsize)
            if not filebytes: break
            output.write(filebytes)                # append block to output file
        fileobj.close()
    output.close()

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print 'Use: join.py [from-dir-name to-file-name]'
    else:
        if len(sys.argv) != 3:
            interactive = 1
            fromdir = raw_input('Directory containing part files? ')
            tofile  = raw_input('Name of file to be recreated? ')
        else:
            interactive = 0
            fromdir, tofile = sys.argv[1:]
        absfrom, absto = map(os.path.abspath, [fromdir, tofile])
        print 'Joining', absfrom, 'to make', absto

        try:
            join(fromdir, tofile)
        except:
            print 'Error joining files:'
            print sys.exc_info()[0], sys.exc_info()[1]
        else:
            print 'Join complete: see', absto
        if interactive: raw_input('Press Enter key') # pause if clicked

After running the join script, my kids still may need to run something like zip, gzip, or tar to unpack an archive file, unless it's shipped as an executable;[*] but at least they're much closer to seeing the Starship Enterprise spring into action. Here is a join in progress on Windows, combining the split files we made a moment ago:

[*] It turns out that the zip, gzip, and tar commands can all be replaced with pure Python code today. The gzip module in the Python standard library provides tools for reading and writing compressed gzip files, usually named with a .gz filename extension. It can serve as an all-Python equivalent of the standard gzip and gunzip command-line utility programs. This built-in module uses another module called zlib that implements gzip-compatible data compressions. In recent Python releases, the zipfile module can be imported to make and use ZIP format archives (zip is an archive and compression format, gzip is a compression scheme), and the tarfile module allows scripts to read and write tar archives. See the Python library manual for details.

C:\temp>python %X%\System\Filetools\join.py -help
Use: join.py [from-dir-name to-file-name]

C:\temp>python %X%\System\Filetools\join.py pysplit mypy152.exe
Joining C:\temp\pysplit to make C:\temp\mypy152.exe
Join complete: see C:\temp\mypy152.exe

C:\temp>ls -l mypy152.exe py152.exe
-rwxrwxrwa   1 0        0        5028339 Sep 12 06:05 mypy152.exe
-rwxrwxrwa   1 0        0        5028339 Apr 16  1999 py152.exe

C:\temp>fc /b mypy152.exe py152.exe
Comparing files mypy152.exe and py152.exe
FC: no differences encountered

The join script simply uses os.listdir to collect all the part files in a directory created by split, and sorts the filename list to put the parts back together in the correct order. We get back an exact byte-for-byte copy of the original file (proved by the DOS fc command in the code; use cmp on Unix).
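If neither fc nor cmp is at hand, the standard library's filecmp module can run the same check from Python itself. This is a small sketch assuming a recent Python; the three scratch files are made up for the demo and stand in for an original file and its rebuilt copies:

```python
import filecmp, os, tempfile

workdir = tempfile.mkdtemp()
first  = os.path.join(workdir, 'first.bin')    # hypothetical file names
second = os.path.join(workdir, 'second.bin')
third  = os.path.join(workdir, 'third.bin')

samples = [(first,  b'\x00\x01\x02'),
           (second, b'\x00\x01\x02'),          # same bytes as first
           (third,  b'\x00\x01\xff')]          # last byte differs
for path, payload in samples:
    with open(path, 'wb') as f:
        f.write(payload)

# shallow=False forces a comparison of contents, not just os.stat details
same      = filecmp.cmp(first, second, shallow=False)
different = filecmp.cmp(first, third,  shallow=False)
print(same, different)                         # True False
```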

Some of this process is still manual, of course (I haven't quite figured out how to script the "walk the floppies upstairs" bit yet), but the split and join scripts make it both quick and simple to move big files around. Because this script is also portable Python code, it runs on any platform to which we care to move split files. For instance, my kids typically download both Windows and Linux games; since this script runs on either platform, they're covered.

Reading by blocks or files

Before we move on, there are a couple of details worth underscoring in the join script's code. First of all, notice that this script deals with files in binary mode but also reads each part file in blocks of 1 KB each. In fact, the readsize setting here (the size of each block read from an input part file) has no relation to chunksize in split.py (the total size of each output part file). As we learned in Chapter 4, this script could instead read each part file all at once:

filebytes = open(filepath, 'rb').read( )

The downside to this scheme is that it really does load all of a file into memory at once. For example, reading a 1.4 MB part file into memory all at once with the file object read method generates a 1.4 MB string in memory to hold the file's bytes. Since split allows users to specify even larger chunk sizes, the join script plans for the worst and reads in terms of limited-size blocks. To be completely robust, the split script could read its input data in smaller chunks too, but this hasn't become a concern in practice (recall that as your program runs, Python automatically reclaims strings that are no longer referenced, so this isn't as wasteful as it might seem).
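The block-at-a-time idea can also be packaged as a generator. The sketch below assumes a recent Python 3 and is not the join script's own code; read_blocks is a hypothetical name, and an in-memory io.BytesIO object stands in for a part file opened in binary mode:

```python
import io

def read_blocks(fileobj, readsize=1024):
    """Yield successive blocks of at most readsize bytes until end of file."""
    # iter(callable, sentinel) keeps calling read until it returns b''
    for block in iter(lambda: fileobj.read(readsize), b''):
        yield block

source = io.BytesIO(b'x' * 2500)               # stand-in for a 2,500-byte part
sizes = [len(block) for block in read_blocks(source)]
print(sizes)                                   # [1024, 1024, 452]
```

At no point does more than one readsize block need to be in memory, no matter how large the part file is.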

Sorting filenames

If you study this script's code closely, you may also notice that the join scheme it uses relies completely on the sort order of filenames in the parts directory. Because it simply calls the list sort method on the filenames list returned by os.listdir, it implicitly requires that filenames have the same length and format when created by split. The splitter uses zero-padding notation in a string formatting expression ('part%04d') to make sure that filenames all have the same number of digits at the end (four), much like this list:

>>> list = ['xx008', 'xx010', 'xx006', 'xx009', 'xx011', 'xx111']
>>> list.sort( )
>>> list
['xx006', 'xx008', 'xx009', 'xx010', 'xx011', 'xx111']

When sorted, the leading zero characters in small numbers guarantee that part files are ordered for joining correctly. Without the leading zeros, join would fail whenever there were more than nine part files, because the first digit would dominate:

>>> list = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']
>>> list.sort( )
>>> list
['xx10', 'xx11', 'xx111', 'xx6', 'xx8', 'xx9']

Because the list sort method accepts a comparison function as an argument, we could in principle strip off digits in filenames and sort numerically:

>>> list = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']
>>> list.sort(lambda x, y: cmp(int(x[2:]), int(y[2:])))
>>> list
['xx6', 'xx8', 'xx9', 'xx10', 'xx11', 'xx111']

But that still implies that all filenames must start with the same length substring, so this doesn't quite remove the file-naming dependency between the split and join scripts. Because these scripts are designed to be two steps of the same process, though, some dependencies between them seem reasonable.
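For comparison, here is how the same numeric sort might look with a key function, the form that recent Python releases expect in place of a comparison function. It is only a sketch: trailing_number is a hypothetical helper, and it assumes every name ends in digits:

```python
import re

names = ['xx8', 'xx10', 'xx6', 'xx9', 'xx11', 'xx111']

# key function: called once per name, returns the value to sort by
by_slice = sorted(names, key=lambda name: int(name[2:]))
print(by_slice)          # ['xx6', 'xx8', 'xx9', 'xx10', 'xx11', 'xx111']

def trailing_number(name):
    """Hypothetical helper: the integer formed by the digits ending a name."""
    return int(re.search(r'\d+$', name).group())

mixed = ['part8', 'p10', 'chunk6']             # prefixes of different lengths
print(sorted(mixed, key=trailing_number))      # ['chunk6', 'part8', 'p10']
```

Matching the trailing digits drops the same-length-prefix requirement, though the scripts must still agree that part numbers come last in the name.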

Usage Variations

Let's run a few more experiments with these Python system utilities to demonstrate other usage modes. When run without full command-line arguments, both split and join are smart enough to input their parameters interactively. Here they are chopping and gluing the Python self-installer file on Windows again, with parameters typed in the DOS console window:

C:\temp>python %X%\System\Filetools\split.py
File to be split? py152.exe
Directory to store part files? splitout
Splitting C:\temp\py152.exe to C:\temp\splitout by 1433600
Split finished: 4 parts are in C:\temp\splitout
Press Enter key

C:\temp>python %X%\System\Filetools\join.py
Directory containing part files? splitout
Name of file to be recreated? newpy152.exe
Joining C:\temp\splitout to make C:\temp\newpy152.exe
Join complete: see C:\temp\newpy152.exe
Press Enter key

C:\temp>fc /B py152.exe newpy152.exe
Comparing files py152.exe and newpy152.exe
FC: no differences encountered

When these program files are double-clicked in a file explorer GUI, they work the same way (there are usually no command-line arguments when they are launched this way). In this mode, absolute path displays help clarify where files really are. Remember, the current working directory is the script's home directory when clicked like this, so the name tempsplit actually maps to a source code directory; type a full path to make the split files show up somewhere else:

 [in a pop-up DOS console box when split is clicked]
File to be split? c:\temp\py152.exe 
Directory to store part files? tempsplit 
Splitting c:\temp\py152.exe to C:\PP3rdEd\examples\PP3E\System\Filetools\
tempsplit by 1433600
Split finished: 4 parts are in C:\PP3rdEd\examples\PP3E\System\Filetools\
Press Enter key

 [in a pop-up DOS console box when join is clicked]
Directory containing part files? tempsplit 
Name of file to be recreated? c:\temp\morepy152.exe 
Joining C:\PP3rdEd\examples\PP3E\System\Filetools\tempsplit to make
Join complete: see c:\temp\morepy152.exe
Press Enter key

Because these scripts package their core logic in functions, though, it's just as easy to reuse their code by importing and calling from another Python component:

>>> from PP3E.System.Filetools.split import split
>>> from PP3E.System.Filetools.join  import join
>>> numparts = split('py152.exe', 'calldir')
>>> numparts
4
>>> join('calldir', 'callpy152.exe')
>>> import os
>>> os.system(r'fc /B py152.exe callpy152.exe')
Comparing files py152.exe and callpy152.exe
FC: no differences encountered
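If you'd like to try the same round trip without the book's package on your path, the following compact sketch does it end to end. It is a Python 3 rendition written for illustration, not the listings above: split_file and join_files are hypothetical names, the test data is synthetic, and clearing of an existing parts directory is left out for brevity:

```python
import os, tempfile

def split_file(fromfile, todir, chunksize):
    """Write fromfile's bytes into todir as part0001, part0002, and so on."""
    os.makedirs(todir, exist_ok=True)
    partnum = 0
    with open(fromfile, 'rb') as source:
        while True:
            chunk = source.read(chunksize)     # next part, at most chunksize bytes
            if not chunk:
                break                          # empty result means end of file
            partnum += 1
            partname = os.path.join(todir, 'part%04d' % partnum)
            with open(partname, 'wb') as part:
                part.write(chunk)
    return partnum

def join_files(fromdir, tofile, readsize=1024):
    """Concatenate the files in fromdir, in sorted name order, into tofile."""
    with open(tofile, 'wb') as output:
        for name in sorted(os.listdir(fromdir)):
            with open(os.path.join(fromdir, name), 'rb') as part:
                while True:
                    block = part.read(readsize)
                    if not block:
                        break
                    output.write(block)

workdir  = tempfile.mkdtemp()                  # scratch area for the demo
original = os.path.join(workdir, 'original.bin')
partsdir = os.path.join(workdir, 'parts')
rebuilt  = os.path.join(workdir, 'rebuilt.bin')

with open(original, 'wb') as f:
    f.write(bytes(range(256)) * 40)            # 10,240 bytes of sample data

numparts = split_file(original, partsdir, 3000)
join_files(partsdir, rebuilt)

with open(original, 'rb') as a, open(rebuilt, 'rb') as b:
    identical = (a.read() == b.read())
print(numparts, identical)                     # 4 True
```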

A word about performance: all the split and join tests shown so far process a 5 MB file, but they take at most one second of real wall-clock time to finish on my Windows 98 300 and 650 MHz laptop computers, plenty fast for just about any use I could imagine. (They run even faster after Windows has cached information about the files involved, and they would be even quicker on a more modern computer.) Both scripts run just as fast for other reasonable part file sizes too; here is the splitter chopping up the file into 500,000- and 50,000-byte parts:

C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 500000
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 500000
Split finished: 11 parts are in C:\temp\tempsplit

C:\temp>ls -l tempsplit
total 9826
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0001
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0002
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0003
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0004
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0005
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0006
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0007
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0008
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0009
-rwxrwxrwa   1 0        0         500000 Sep 12 06:29 part0010
-rwxrwxrwa   1 0        0          28339 Sep 12 06:29 part0011

C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 50000
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 50000
Split finished: 101 parts are in C:\temp\tempsplit

C:\temp>ls tempsplit
part0001  part0014  part0027  part0040  part0053  part0066  part0079  part0092
part0002  part0015  part0028  part0041  part0054  part0067  part0080  part0093
part0003  part0016  part0029  part0042  part0055  part0068  part0081  part0094
part0004  part0017  part0030  part0043  part0056  part0069  part0082  part0095
part0005  part0018  part0031  part0044  part0057  part0070  part0083  part0096
part0006  part0019  part0032  part0045  part0058  part0071  part0084  part0097
part0007  part0020  part0033  part0046  part0059  part0072  part0085  part0098
part0008  part0021  part0034  part0047  part0060  part0073  part0086  part0099
part0009  part0022  part0035  part0048  part0061  part0074  part0087  part0100
part0010  part0023  part0036  part0049  part0062  part0075  part0088  part0101
part0011  part0024  part0037  part0050  part0063  part0076  part0089
part0012  part0025  part0038  part0051  part0064  part0077  part0090
part0013  part0026  part0039  part0052  part0065  part0078  part0091

The split can take longer to finish, but only if the part file's size is set small enough to generate thousands of part files; splitting into 1,006 parts works but runs slower (though machines today are quick enough that you probably won't notice):

C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 5000 
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 5000
Split finished: 1006 parts are in C:\temp\tempsplit

C:\temp>python %X%\System\Filetools\join.py tempsplit mypy152.exe 
Joining C:\temp\tempsplit to make C:\temp\mypy152.exe
Join complete: see C:\temp\mypy152.exe

C:\temp>fc /B py152.exe mypy152.exe 
Comparing files py152.exe and mypy152.exe
FC: no differences encountered

C:\temp>ls -l tempsplit 
 ...1,000 lines deleted...
-rwxrwxrwa   1 0        0           5000 Sep 12 06:30 part1001
-rwxrwxrwa   1 0        0           5000 Sep 12 06:30 part1002
-rwxrwxrwa   1 0        0           5000 Sep 12 06:30 part1003
-rwxrwxrwa   1 0        0           5000 Sep 12 06:30 part1004
-rwxrwxrwa   1 0        0           5000 Sep 12 06:30 part1005
-rwxrwxrwa   1 0        0           3339 Sep 12 06:30 part1006

Finally, the splitter is also smart enough to create the output directory if it doesn't yet exist and to clear out any old files there if it does exist. Because the joiner combines whatever files exist in the output directory, this is a nice ergonomic touch. If the output directory were not cleared before each split, it would be too easy to forget that a prior run's files are still there. Given that my kids are running these scripts, they need to be as forgiving as possible; your user base may vary, but perhaps not by much.

C:\temp>python %X%\System\Filetools\split.py py152.exe tempsplit 700000 
Splitting C:\temp\py152.exe to C:\temp\tempsplit by 700000
Split finished: 8 parts are in C:\temp\tempsplit

C:\temp>ls -l tempsplit
total 9827
-rwxrwxrwa   1 0        0         700000 Sep 12 06:32 part0001
-rwxrwxrwa   1 0        0         700000 Sep 12 06:32 part0002
-rwxrwxrwa   1 0        0         700000 Sep 12 06:32 part0003
 ...only new files here...
-rwxrwxrwa   1 0        0         700000 Sep 12 06:32 part0006
-rwxrwxrwa   1 0        0         700000 Sep 12 06:32 part0007
-rwxrwxrwa   1 0        0         128339 Sep 12 06:32 part0008
