Packing and Unpacking Files







Many moons ago (about 10 years), I used machines that had no tools for bundling files into a single package for easy transport. Here is the situation: you have a large set of text files lying around that you need to transfer to another computer. These days, tools like tar are widely available for packaging many files into a single file that can be copied, uploaded, mailed, or otherwise transferred in a single step. As mentioned in an earlier footnote, even Python itself has grown to support zip and tar archives in the standard library (see the zipfile and tarfile modules in the library reference).
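
For comparison with the hand-rolled scripts that follow, here is a minimal sketch of the same bundling job done with the standard library's tarfile module. It is written for Python 3, unlike the Python 2 listings in this section, and the function names bundle and unbundle are illustrative assumptions rather than part of the book's examples.

```python
import os
import tarfile

def bundle(archive_path, filenames):
    """Pack the named files into one uncompressed tar archive."""
    with tarfile.open(archive_path, 'w') as archive:
        for name in filenames:
            # store each file under its base name so that extraction
            # does not depend on the packer's directory layout
            archive.add(name, arcname=os.path.basename(name))

def unbundle(archive_path, target_dir):
    """Extract every member of the archive into target_dir."""
    with tarfile.open(archive_path, 'r') as archive:
        # extractall trusts the member names stored in the archive,
        # so use it only on archives you created yourself
        archive.extractall(target_dir)
```

Calling bundle('packed.tar', ['spam.txt', 'eggs.txt']) on one machine and unbundle('packed.tar', 'unpack') on the other would play the roles of the packer and unpacker scripts shown below.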

Before I managed to install such tools on my PC, though, portable Python scripts served just as well. The following script, textpack.py, copies all of the files listed on the command line to the standard output stream, separated by marker lines.

PP3E\System\App\Clients\textpack.py

#!/usr/local/bin/python
import sys                           # load the system module
marker = ':'*10 + 'textpak=>'        # hopefully unique separator

def pack( ):
    for name in sys.argv[1:]:         # for all command-line arguments
        input = open(name, 'r')       # open the next input file
        print marker + name           # write a separator line
        print input.read( ),          # and write the file's contents

if __name__ == '__main__': pack()    # pack files listed on cmdline

The first line in this file is a Python comment (#...), but it also gives the path to the Python interpreter using the Unix executable-script trick discussed in Chapter 3. If we give textpack.py executable permission with a Unix chmod command, we can pack files by running this program file directly from a shell console and redirecting its standard output stream to the file in which we want the packed archive to show up:

C:\...\PP3E\System\App\Clients\test>type spam.txt
SPAM
spam

C:\......\test>python ..\textpack.py spam.txt eggs.txt ham.txt > packed.all

C:\......\test>type packed.all
::::::::::textpak=>spam.txt
SPAM
spam
::::::::::textpak=>eggs.txt
EGGS
::::::::::textpak=>ham.txt
ham

Running the program this way creates a single output file called packed.all, which contains all three input files, with a header line giving the original file's name before each file's contents. Combining many files into one file in this way makes it easy to transfer in a single step: only one file need be copied to floppy, emailed, and so on. If you have hundreds of files to move, this can be a big win.

After such a file is transferred, though, it must somehow be unpacked on the receiving end to re-create the original files. To do so, we need to scan the combined file line by line, watching for header lines left by the packer to know when a new file's contents begin. Another simple Python script, textunpack.py, shown next, does the trick.

PP3E\System\App\Clients\textunpack.py

#!/usr/local/bin/python
import sys
from textpack import marker                     # use common separator key
mlen = len(marker)                              # filenames after markers

for line in sys.stdin.readlines( ):             # for all input lines
    if line[:mlen] != marker:
        print line,                             # write real lines
    else:
        sys.stdout = open(line[mlen:-1], 'w')   # or make new output file

We could code this in a function like we did in textpack, but there is little point in doing so here; as written, the script relies on standard streams, not function parameters. Run this in the directory where you want unpacked files to appear, with the packed archive file piped in on the command line as the script's standard input stream:

C:\......\test\unpack>python ..\..\textunpack.py < ..\packed.all

C:\......\test\unpack>ls
eggs.txt  ham.txt   spam.txt

C:\......\test\unpack>type spam.txt
SPAM
spam
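
The packer and unpacker above work purely through standard streams. As a rough sketch of the same marker-line technique in importable form, the version below takes its streams as parameters. It is written for Python 3, and the names MARKER, pack, and unpack, along with the target_dir parameter, are assumptions made for illustration. Unlike the original, it also appends a newline to any file that lacks a final one, so that the next marker always starts its own line.

```python
import os

MARKER = ':' * 10 + 'textpak=>'      # same separator text as textpack.py

def pack(filenames, output):
    """Write each file to output, preceded by a marker line with its name."""
    for name in filenames:
        with open(name, 'r') as source:
            text = source.read()
        output.write(MARKER + name + '\n')
        output.write(text)
        if text and not text.endswith('\n'):
            output.write('\n')       # keep the next marker at a line start

def unpack(lines, target_dir='.'):
    """Re-create files from packed lines; return the names created."""
    created = []
    current = None                   # file being written, if any
    try:
        for line in lines:
            if line.startswith(MARKER):
                if current:
                    current.close()
                name = line[len(MARKER):].rstrip('\n')
                current = open(os.path.join(target_dir, name), 'w')
                created.append(name)
            elif current:            # ignore text before the first marker
                current.write(line)
    finally:
        if current:
            current.close()
    return created
```

Because output and lines can be any file-like object, the same functions work with sys.stdout and sys.stdin, with open files, or with in-memory buffers in a test.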

Packing Files "++"

So far so good; the textpack and textunpack scripts made it easy to move lots of files around without lots of manual intervention. They are prime examples of what are often called tactical scripts: programs you code quickly for a specific task.

But after playing with these and similar scripts for a while, I began to see commonalities that almost cried out for reuse. For instance, almost every shell tool I wrote had to scan command-line arguments, redirect streams to a variety of sources, and so on. Further, almost every command-line utility wound up with a different command-line option pattern, because each was written from scratch.

The following few classes are one solution to such problems. They define a class hierarchy that is designed for reuse of common shell tool code. Moreover, because of the reuse going on, every program that ties into this hierarchy sports a common look-and-feel in terms of command-line options, environment variable use, and more. As usual with object-oriented systems, once you learn which methods to override, such a class framework provides a lot of work and consistency for free.

And once you start thinking in such ways, you make the leap to more strategic development modes, writing code with broader applicability and reuse in mind. The packapp.py module listed next, for instance, adapts the textpack script's logic for integration into this hierarchy.

PP3E\System\App\Clients\packapp.py

#!/usr/local/bin/python
######################################################
# pack text files into one, separated by marker line;
# % packapp.py -v -o target src src...
# % packapp.py *.txt -o packed1
# >>> apptools.appRun('packapp.py', args...)
# >>> apptools.appCall(PackApp, args...)
######################################################

from textpack import marker
from PP3E.System.App.Kinds.redirect import StreamApp

class PackApp(StreamApp):
    def start(self):
        StreamApp.start(self)
        if not self.args:
            self.exit('packapp.py [-o target]? src src...')
    def run(self):
        for name in self.restargs( ):
            try:
                self.message('packing: ' + name)
                self.pack_file(name)
            except:
                self.exit('error processing: ' + name)
    def pack_file(self, name):
        self.setInput(name)
        self.write(marker + name + '\n')
        while 1:
            line = self.readline( )
            if not line: break
            self.write(line)

if __name__ == '__main__':  PackApp().main()

Here, PackApp inherits members and methods that handle:

  • Operating system services

  • Command-line processing

  • Input/output stream redirection

from the StreamApp class, imported from another Python module file (listed later in this section). StreamApp provides a "read/write" interface to redirected streams and a standard "start/run/stop" script execution protocol. PackApp simply redefines the start and run methods for its own purposes and reads and writes itself to access its standard streams. Most low-level system interfaces are hidden by the StreamApp class; in OOP terms, we say they are encapsulated.

This module can both be run as a program and imported by a client (remember, Python sets a module's name to __main__ when it's run directly, so it can tell the difference). When run as a program, the last line creates an instance of the PackApp class and starts it by calling its main method, a method call exported by StreamApp to kick off a program run:

C:\......\test>python ..\packapp.py -v -o packedapp.all spam.txt eggs.txt ham.txt
PackApp start.
packing: spam.txt
packing: eggs.txt
packing: ham.txt
PackApp done.

C:\......\test>type packedapp.all
::::::::::textpak=>spam.txt
SPAM
spam
::::::::::textpak=>eggs.txt
EGGS
::::::::::textpak=>ham.txt
ham

This has the same effect as the textpack.py script, but command-line options (-v for verbose mode, -o to name an output file) are inherited from the StreamApp superclass. The unpacker, unpackapp.py, listed next, looks similar when migrated to the object-oriented framework, because the very notion of running a program has been given a standard structure.

PP3E\System\App\Clients\unpackapp.py

#!/usr/bin/python
###########################################
# unpack a packapp.py output file;
# % unpackapp.py -i packed1 -v
# apptools.appRun('unpackapp.py', args...)
# apptools.appCall(UnpackApp, args...)
###########################################

from textpack import marker
from PP3E.System.App.Kinds.redirect import StreamApp

class UnpackApp(StreamApp):
    def start(self):
        StreamApp.start(self)
        self.endargs( )              # ignore more -o's, etc.
    def run(self):
        mlen = len(marker)
        while True:
            line = self.readline( )
            if not line:
                break
            elif line[:mlen] != marker:
                self.write(line)
            else:
                name = line[mlen:].strip( )
                self.message('creating: ' + name)
                self.setOutput(name)

if __name__ == '__main__':  UnpackApp().main()

This subclass redefines the start and run methods to do the right thing for this script: prepare for and execute a file unpacking operation. All the details of parsing command-line arguments and redirecting standard streams are handled in superclasses:

C:\......\test\unpackapp>python ..\..\unpackapp.py -v -i ..\packedapp.all
UnpackApp start.
creating: spam.txt
creating: eggs.txt
creating: ham.txt
UnpackApp done.

C:\......\test\unpackapp>ls
eggs.txt  ham.txt   spam.txt

C:\......\test\unpackapp>type spam.txt
SPAM
spam

Running this script does the same job as the original textunpack.py, but we get command-line flags for free (-i specifies the input file). In fact, there are more ways to launch classes in this hierarchy than I have space to show here. A command-line pair, -i -, for instance, makes the script read its input from stdin, as though it were simply piped or redirected in the shell:

C:\......\test\unpackapp>type ..\packedapp.all | python ..\..\unpackapp.py -i -
creating: spam.txt
creating: eggs.txt
creating: ham.txt
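
To make the flag handling concrete, here is a small sketch of how a lone flag such as -v and a flag/value pair such as -i - can be pulled out of an argument list. It mirrors the getopt and getarg methods of the App class listed later in this section, but as standalone Python 3 functions. The names get_option and get_argument are hypothetical, and the narrower exception handling is a choice made for this sketch.

```python
def get_option(args, tag):
    """Return True and remove tag if the flag (e.g. '-v') is in args."""
    if tag in args:
        args.remove(tag)
        return True
    return False

def get_argument(args, tag, default=None):
    """Return the value after tag (e.g. '-i name') and remove both items."""
    try:
        pos = args.index(tag)
        value = args[pos + 1]
    except (ValueError, IndexError):     # tag absent, or no value after it
        return default
    del args[pos:pos + 2]                # consume the tag and its value
    return value

args = ['-v', '-i', '-', 'spam.txt']
verbose = get_option(args, '-v')
source = get_argument(args, '-i')
print(verbose, source, args)             # prints: True - ['spam.txt']
```

Because each call removes what it recognizes, whatever is left in the list afterward can be treated as plain filenames, which is how the packer finds its source files.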

Application Hierarchy Superclasses

This section lists the source code of StreamApp and App, the classes that do all of this extra work on behalf of PackApp and UnpackApp. We don't have space to go through all of this code in detail, so be sure to study these listings on your own for more information. It's all straight Python code.

I should also point out that the classes listed in this section are just the ones used by the object-oriented mutations of the textpack and textunpack scripts. They represent just one branch of an overall application framework class tree, which you can study on this book's examples distribution (browse its directory, PP3E\System\App). Other classes in the tree provide command menus, internal string-based file streams, and so on. You'll also find additional clients of the hierarchy that do things like launch other shell tools and scan Unix-style email mailbox files.

StreamApp: adding stream redirection

StreamApp adds a few command-line arguments (-i, -o) and input/output stream redirection to the more general App root class listed later in this section; App, in turn, defines the most general kinds of program behavior, to be inherited in Examples 6-8, 6-9, and 6-10 (i.e., in all classes derived from App).

PP3E\System\App\Kinds\redirect.py

################################################################################
# App subclasses for redirecting standard streams to files
################################################################################

import sys
from PP3E.System.App.Bases.app import App

################################################################################
# an app with input/output stream redirection
################################################################################

class StreamApp(App):
    def __init__(self, ifile='-', ofile='-'):
        App.__init__(self)                              # call superclass init
        self.setInput( ifile or self.name + '.in')      # default i/o filenames
        self.setOutput(ofile or self.name + '.out')     # unless '-i', '-o' args

    def closeApp(self):                                 # not __del__
        try:
            if self.input != sys.stdin:                 # may be redirected
                self.input.close( )                         # if still open
        except: pass
        try:
            if self.output != sys.stdout:               # don't close stdout!
                self.output.close( )                        # input/output exist?
        except: pass

    def help(self):
        App.help(self)
        print '-i <input-file |"-">  (default: stdin  or per app)'
        print '-o <output-file|"-">  (default: stdout or per app)'

    def setInput(self, default=None):
        file = self.getarg('-i') or default or '-'
        if file == '-':
            self.input = sys.stdin
            self.input_name = '<stdin>'
        else:
            self.input = open(file, 'r')            # cmdarg | funcarg | stdin
            self.input_name = file                  # cmdarg '-i -' works too

    def setOutput(self, default=None):
        file = self.getarg('-o') or default or '-'
        if file == '-':
            self.output = sys.stdout
            self.output_name = '<stdout>'
        else:
            self.output = open(file, 'w')           # error caught in main( )
            self.output_name = file                 # make backups too?

class RedirectApp(StreamApp):
    def __init__(self, ifile=None, ofile=None):
        StreamApp.__init__(self, ifile, ofile)
        self.streams = sys.stdin, sys.stdout
        sys.stdin    = self.input                 # for raw_input, stdin
        sys.stdout   = self.output                # for print, stdout

    def closeApp(self):                           # not __del__
        StreamApp.closeApp(self)                  # close files?
        sys.stdin, sys.stdout = self.streams      # reset sys files

################################################################################
# to add as a mix-in (or use multiple-inheritance...)
################################################################################

class RedirectAnyApp:
    def __init__(self, superclass, *args):
        superclass.__init__(self, *args)
        self.super   = superclass
        self.streams = sys.stdin, sys.stdout
        sys.stdin    = self.input                 # for raw_input, stdin
        sys.stdout   = self.output                # for print, stdout

    def closeApp(self):
        self.super.closeApp(self)                 # do the right thing
        sys.stdin, sys.stdout = self.streams      # reset sys files

App: the root class

The top of the hierarchy knows what it means to be a shell application, but not how to accomplish a particular utility task (those parts are filled in by subclasses). App, listed next, exports commonly used tools in a standard and simplified interface and a customizable start/run/stop method protocol that abstracts script execution. It also turns application objects into file-like objects: when an application reads itself, for instance, it really reads whatever source its standard input stream has been assigned to by other superclasses in the tree (such as StreamApp).

PP3E\System\App\Bases\app.py

################################################################################
# an application class hierarchy, for handling top-level components;
# App is the root class of the App hierarchy, extended in other files;
################################################################################

import sys, os, traceback
class AppError(Exception): pass                            # errors raised here

class App:                                                 # the root class
    def __init__(self, name=None):
        self.name    = name or self.__class__.__name__    # the lowest class
        self.args    = sys.argv[1:]
        self.env     = os.environ
        self.verbose = self.getopt('-v') or self.getenv('VERBOSE')
        self.input   = sys.stdin
        self.output  = sys.stdout
        self.error   = sys.stderr                     # stdout may be piped
    def closeApp(self):                               # not __del__: ref's?
        pass                                          # nothing at this level
    def help(self):
        print self.name, 'command-line arguments:'    # extend in subclass
        print '-v (verbose)'

    ##############################
    # script environment services
    ##############################

    def getopt(self, tag):
        try:                                    # test "-x" command arg
            self.args.remove(tag)               # not real argv: > 1 App?
            return 1
        except:
            return 0
    def getarg(self, tag, default=None):
        try:                                    # get "-x val" command arg
            pos = self.args.index(tag)
            val = self.args[pos+1]
            self.args[pos:pos+2] = []
            return val
        except:
            return default                      # None: missing, no default
    def getenv(self, name, default=''):
        try:                                    # get "$x" environment var
            return self.env[name]
        except KeyError:
            return default
    def endargs(self):
        if self.args:
            self.message('extra arguments ignored: ' + repr(self.args))
            self.args = []
    def restargs(self):
        res, self.args = self.args, []          # no more args/options
        return res
    def message(self, text):
        self.error.write(text + '\n')           # stdout may be redirected
    def exception(self):
        return tuple(sys.exc_info( )[:2])        # the last exception type,data
    def exit(self, message='', status=1):
        if message:
            self.message(message)
        sys.exit(status)
    def shell(self, command, fork=0, inp=''):
        if self.verbose:
            self.message(command)                         # how about ipc?
        if not fork:
            os.system(command)                            # run a shell cmd
        elif fork == 1:
            return os.popen(command, 'r').read( )             # get its output
        else:                                             # readlines too?
            pipe = os.popen(command, 'w')
            pipe.write(inp)                               # send it input
            pipe.close( )

    #################################################
    # input/output-stream methods for the app itself;
    # redefine in subclasses if not using files, or
    # set self.input/output to file-like objects;
    #################################################

    def read(self, *size):
        return self.input.read(*size)
    def readline(self):
        return self.input.readline( )
    def readlines(self):
        return self.input.readlines( )
    def write(self, text):
        self.output.write(text)
    def writelines(self, text):
        self.output.writelines(text)

    ###################################################
    # to run the app
    # main( ) is the start/run/stop execution protocol;
    ###################################################

    def main(self):
        res = None
        try:
            self.start( )
            self.run( )
            res = self.stop( )               # optional return val
        except SystemExit:                    # ignore if from exit( )
            pass
        except:
            self.message('uncaught: ' + str(self.exception( )))
            traceback.print_exc( )
        self.closeApp( )
        return res

    def start(self):
        if self.verbose: self.message(self.name + ' start.')
    def stop(self):
        if self.verbose: self.message(self.name + ' done.')
    def run(self):
        raise AppError, 'run must be redefined!'
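
The start/run/stop protocol in main can be hard to see among the other services in the listing above. The following is a pared-down Python 3 sketch of just that protocol. MiniApp, Shout, Quitter, and the log attribute are hypothetical names invented for this illustration; they are not part of the PP3E class tree.

```python
import sys
import traceback

class MiniApp:
    """Hypothetical root class that keeps only the execution protocol."""
    def __init__(self):
        self.name = self.__class__.__name__
        self.log = []                      # records protocol steps for the demo

    def start(self):
        self.log.append(self.name + ' start.')
    def run(self):
        raise NotImplementedError('run must be redefined!')
    def stop(self):
        self.log.append(self.name + ' done.')
    def close(self):
        self.log.append('closed')

    def main(self):
        result = None
        try:
            self.start()
            self.run()
            result = self.stop()           # optional return value
        except SystemExit:                 # a deliberate exit is not an error
            pass
        except Exception:
            traceback.print_exc()
        self.close()                       # cleanup runs on every path
        return result

class Shout(MiniApp):
    def run(self):                         # the only method a client must supply
        self.log.append('HELLO')

class Quitter(MiniApp):
    def run(self):
        sys.exit(1)                        # skips stop(), but close() still runs
```

Running Shout().main() walks through start, run, stop, and close in that order, while Quitter().main() shows that an early exit inside run still reaches the cleanup step.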

Why use classes here?

Now that I've listed all this code, some readers might naturally want to ask, "So why go to all this trouble?" Given the amount of extra code in the object-oriented version of these scripts, it's a perfectly valid question. Most of the code in the superclass listings is general-purpose logic, designed to be used by many applications. Still, that doesn't explain why the packapp and unpackapp object-oriented scripts are larger than the original equivalent textpack and textunpack non-object-oriented scripts.

The answers will become more apparent after the first few times you don't have to write code to achieve a goal, but there are some concrete benefits worth summarizing here:


Encapsulation

StreamApp clients need not remember all the system interfaces in Python, because StreamApp exports its own unified view. For instance, arguments, streams, and shell variables are split across Python modules (e.g., sys.argv, sys.stdout, os.environ); in these classes, they are all collected in the same single place.


Standardization

From the shell user's perspective, StreamApp clients all have a common look-and-feel, because they inherit the same interfaces to the outside world from their superclasses (e.g., -i and -v flags).


Maintenance

As an added benefit of encapsulation, all of the common code in the App and StreamApp superclasses must be debugged only once. Moreover, localizing code in superclasses makes it easier to understand and change in the future. Only one copy of the code implements a system operation, and we're free to change its implementation in the future without breaking code that makes use of it.


Reuse

Such a framework can provide an extra precoded utility that we would otherwise have to recode in every script we write (command-line argument extraction, for instance). That holds true now and will hold true in the future: services added to the App root class become immediately usable and customizable among all applications derived from this hierarchy.


Utility

Because file access isn't hardcoded in PackApp and UnpackApp, they can easily take on new behavior just by changing the class they inherit from. Given the right superclass, PackApp and UnpackApp could just as easily read and write to strings or sockets as to text files and standard streams.
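
The last point can be sketched in a few lines. In the Python 3 example below, a client class reads and writes only through its own readline and write methods, so swapping in-memory strings for files takes no change to the client's logic. StreamHolder and UpperCaser are hypothetical stand-ins for App and a client class, not the book's code.

```python
import io
import sys

class StreamHolder:
    """Hypothetical stand-in for the stream methods that App provides."""
    def __init__(self, input=None, output=None):
        self.input = sys.stdin if input is None else input
        self.output = sys.stdout if output is None else output
    def readline(self):
        return self.input.readline()
    def write(self, text):
        self.output.write(text)

class UpperCaser(StreamHolder):
    def run(self):
        while True:
            line = self.readline()         # never touches files directly
            if not line:
                break
            self.write(line.upper())

source = io.StringIO('spam\neggs\n')       # a string posing as an input file
target = io.StringIO()                     # collects output in memory
UpperCaser(source, target).run()
print(repr(target.getvalue()))             # prints: 'SPAM\nEGGS\n'
```

Any object with compatible readline and write methods would serve equally well, which is what makes string-based or socket-based variants possible.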

Although it's not obvious until you start writing larger class-based systems, code reuse is perhaps the biggest win for class-based programs. For instance, in Chapter 11, we will reuse the object-oriented-based packer and unpacker scripts by invoking them from a menu GUI like so:

from PP3E.System.App.Clients.packapp import PackApp
...get dialog inputs, glob filename patterns
app = PackApp(ofile=output)            # run with redirected output
app.args = filenames                   # reset cmdline args list
app.main( )


from PP3E.System.App.Clients.unpackapp import UnpackApp
...get dialog input
app = UnpackApp(ifile=input)            # run with input from file
app.main( )                             # execute app class

Because these classes encapsulate the notion of streams, they can be imported and called, not just run as top-level scripts. Further, their code is reusable in two ways: not only do they export common system interfaces for reuse in subclasses, but they can also be used as software components, as in the previous code listing. See the PP3E\Gui\Shellgui directory for the full source code of these clients.

Python doesn't impose object-oriented programming, of course, and you can get a lot of work done with simpler functions and scripts. But once you learn how to structure class trees for reuse, going the extra object-oriented mile usually pays off in the long run.


