FILE PROCESSING

The basics of opening a file were covered in Chapters 4 and 8. Once a file is open you have a file object, which you then use to access the file's contents. The methods supported by a file object closely mimic the functions and operators we would normally use in Perl, and features that might otherwise need special handling, such as line-based input (as opposed to free-form or byte-based input), are supported natively by the file object.

The basic Perl functions and their equivalent Python file methods are listed in the table below. In this section we'll look at the specifics of reading, writing, and locating our position within a file.

Reading and writing files in Perl and Python

Perl operation       Python equivalent         Description
-------------------  ------------------------  -------------------------------------------
read() or sysread()  f.read([count])           Read count bytes
$line = <FILE>       f.readline()              Read a single line
@lines = <FILE>      f.readlines()             Read all the lines
print FILE $string   f.write(string)           Write string
print FILE @lines    f.writelines(list)        Write the lines in list
close(FILE)          f.close()                 Close the file
tell(FILE)           f.tell()                  Get the current file pointer
seek()               f.seek(offset [, where])  Seek to offset relative to where
POSIX::isatty()      f.isatty()                Returns 1 if file is an interactive terminal
IO::Handle->flush()  f.flush()                 Flush the output buffers
truncate FILE, size  f.truncate([size])        Truncates file to size
fileno(FILE)         f.fileno()                Return the integer file descriptor
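
The less frequently used methods in the table can be tried out together. What follows is a minimal sketch rather than anything from the original scripts: the scratch file, its name, and the sample bytes are assumptions made purely for illustration, and the file is opened in binary mode so that sizes are plain byte counts.

```python
import os
import tempfile

# Hypothetical scratch file; the name and contents are illustrative only.
path = os.path.join(tempfile.mkdtemp(), 'scratch.dat')

out = open(path, 'wb+')            # read/write, binary
out.write(b'0123456789')
out.flush()                        # push buffered output to the operating system
descriptor = out.fileno()          # integer descriptor, as with Perl's fileno()
interactive = out.isatty()         # false here: a disk file is not a terminal
out.truncate(4)                    # keep only the first four bytes
out.seek(0)                        # back to the start before reading
remaining = out.read()             # b'0123'
out.close()
```

On current interpreters isatty() returns a boolean rather than the integer 1 shown in the table, but it can be tested in exactly the same way.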

Reading

Perl supports two ways of reading from a file: line by line using the <FILE> construct, or in blocks of bytes using read(), sysread(), and related functions. As the table above shows, most of the basic Perl file-handling operations are supported by similar methods in Python.

Line by line

If you want to emulate the functionality of the Perl line input construct:

$line = <FILE>;

then you need to call the Python readline() method on an open file object:

line = file.readline()

The method reads a single line from the file and is platform-aware, so the same operation should work for reading standard text files on any platform. Note that, like <FILE>, readline() also returns the line termination sequence; use string.strip() or string.rstrip() to remove it (see Chapter 10 for more information).
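
As a small illustration of the terminator being kept, here is a sketch that reads one line and then strips it. The sample file and its contents are assumptions for the example, and it uses the rstrip() string method, which is available on current interpreters, in place of the string module function.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

myfile = open(path)
raw = myfile.readline()        # 'first line\n': the terminator is still attached
clean = raw.rstrip('\n')       # 'first line': roughly what chomp() does in Perl
myfile.close()
```

Passing '\n' to rstrip() removes only the newline; calling it with no argument would also remove any trailing spaces and tabs.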

Getting all the lines

You can get all of the lines from a file using the readlines() method on an active file object. This works in the same way as the <FILE> construct when used in list context. For example, we can import an entire file into an array in Perl using:

open(FILE, "myfile.txt");
@lines = <FILE>;

In Python we can rewrite this as:

myfile = open('myfile.txt')
lines = myfile.readlines()

To read the entire contents of a file into a string object use the read() method without any arguments. For example:

myfiledata = myfile.read()

This is equivalent to using <FILE> in scalar context in Perl when $/ has been set to undef. A common trick for this in Perl is to create a new block and localize $/:

{
    local $/;
    $myfiledata = <FILE>;
}
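
To make the difference between the two whole-file approaches concrete, the sketch below reads the same file once with readlines() and once with read(). The file name and contents are assumptions for the example.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
with open(path, 'w') as out:
    out.write('alpha\nbeta\ngamma\n')

with open(path) as myfile:
    lines = myfile.readlines()     # a list, one string per line, terminators kept

with open(path) as myfile:
    myfiledata = myfile.read()     # the whole file as a single string
```

Joining the list back together with ''.join(lines) gives the same string as read(), since neither method discards the line terminators.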

Byte by byte

We can also use the read() method to get a specific number of bytes from a file. For example, to read a 512 byte record from a file we'd use a statement like this:

record = file.read(512)

This is identical to the read() function in Perl; we can perform the same operation as the Python fragment above in Perl using:

read FILE, $record, 512;
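
A fixed-size read is usually wrapped in a loop. The following is a sketch under stated assumptions: the read_records() helper, the file name, and the sample data are all hypothetical, and the file is opened in binary mode so that each read counts bytes.

```python
import os
import tempfile

RECORD_SIZE = 512   # the record length used in the text above

# Hypothetical data file: two full records and one short trailing chunk.
path = os.path.join(tempfile.mkdtemp(), 'records.dat')
with open(path, 'wb') as out:
    out.write(b'A' * RECORD_SIZE + b'B' * RECORD_SIZE + b'C' * 100)

def read_records(filename, size=RECORD_SIZE):
    """Return the file's contents as a list of chunks of at most size bytes."""
    records = []
    datafile = open(filename, 'rb')
    while 1:
        record = datafile.read(size)
        if not record:             # an empty result means end of file
            break
        records.append(record)
    datafile.close()
    return records

records = read_records(path)
sizes = [len(record) for record in records]   # [512, 512, 100]
```

Note that the last chunk can be shorter than the requested size, so code that expects complete records should check the length of what it gets back.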

End of file

Python does not support the notion of an "end of file" status for an open file object. Although an EOFError exception does exist, it is raised by the input() and raw_input() built-in functions when they reach end of file while reading input from the keyboard.

Instead, the read() and readline() methods return an empty string when they see the end of the file. This means that Python naturally supports the special behavior provided by Perl when using the while(<FILE>) construct.

For example, in Perl to read and print all of the lines from a file we'd probably use:

open(FILE, 'myfile.txt');
while(<FILE>)
{
    print;
}
close(FILE);

which we can rewrite in Python as:

myfile = open('myfile.txt')
while 1:
    line = myfile.readline()
    if not line: break
    print line,
myfile.close()

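The if test breaks us out of the loop when readline() returns an empty string. A blank line in the file does not trigger this, because it is still returned with its line terminator. The alternative is to use a for loop and the readlines() method to step through each individual line: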

myfile = open('myfile.txt')
for line in myfile.readlines():
    print line,
myfile.close()
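
The distinction between a blank line and the end of the file is easy to check. This sketch assumes a small sample file with a blank second line; the file name and contents are illustrative only.

```python
import os
import tempfile

# Hypothetical sample file whose second line is blank.
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')
with open(path, 'w') as out:
    out.write('one\n\nthree\n')

seen = []
myfile = open(path)
while 1:
    line = myfile.readline()
    if not line:                   # '' is returned only at end of file
        break
    seen.append(line)              # the blank line arrives as '\n', which is true
after_eof = myfile.readline()      # further calls keep returning ''
myfile.close()
```

All three lines end up in seen, including the blank one, so the loop does not stop early.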

Processing example

Processing a file in Python requires more than just the basic ability to read lines or blocks of data. Because Python does not include the built-in ability to split or otherwise manipulate strings beyond basic sequence syntax, we need to import a few modules to perform the operations we'd do natively in Perl. In all other respects the basics of processing remain the same: we read in a line, extract the components we want, and work on them before moving to the next line.

As an example, here's a very simple script in Perl which uses split() to count the accesses per host and per URL in a standard web log:

my (%hostaccess, %urlaccess) = ((),());

if (@ARGV < 1)
{
    print "Usage: $0 logfile\n";
    exit(1);
}

open(FILE, $ARGV[0]) or die "Woah! Couldn't open $ARGV[0]: $!\n";

while(<FILE>)
{
    @splitline = split;
    if (@splitline < 10)
    {
        print;
        next;
    }
    ($host,$ident,$user,$time,$offset,$req,
     $loc,$httpver,$success,$bytes) = @splitline;

    $hostaccess{$host} ++;
    $urlaccess{$loc} ++;
}

foreach my $host (sort keys %hostaccess)
{
    print "$host: $hostaccess{$host}\n";
}

foreach my $loc (sort keys %urlaccess)
{
    print "$loc: $urlaccess{$loc} \n";
}

For comparison, here's the same script in Python:

import string
import sys

def cmpval(tuple1, tuple2):
    return cmp(tuple2[1],tuple1[1])

hostaccess = {}
urlaccess = {}

if len(sys.argv) < 2:
    print "Usage:",sys.argv[0],"logfile"
    sys.exit(1)

try:
    file = open(sys.argv[1])
except IOError:
    # open() raises IOError if the file is missing or unreadable
    print "Whoa!","Couldn't open the file",sys.argv[1]
    sys.exit(1)

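while 1:
    line = file.readline()
    if line:
        splitline = string.split(line)
        if len(splitline) < 10:
            # Not a full log entry: echo the raw line, as the Perl version does
            print line,
            continue
        (host,ident,user,time,offset,req,
         loc,httpver,success,bytes) = splitline
        # A KeyError means this is the first time we've seen the key
        try:
            hostaccess[host] = hostaccess[host] + 1
        except KeyError:
            hostaccess[host] = 1
        try:
            urlaccess[loc] = urlaccess[loc] + 1
        except KeyError:
            urlaccess[loc] = 1
    else:
        # readline() returned an empty string: end of file
        break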
hosts = hostaccess.items()
hosts.sort(lambda f, s: cmp(s[1], f[1]))

for host, count in hosts:
    print host, ": ", count

urls = urlaccess.items()
urls.sort(cmpval)

for url, count in urls:
    print url, ": ", count

There are a few major differences between the two scripts, most of which we've already covered elsewhere, but to recap:

  • Python requires that we import the sys module – required for accessing the command line arguments and exit() function.

  • We import the string module to gain access to the split() function so that we can extract the input.

  • The try statement is used in place of Perl's or die idiom to determine whether we can open the file correctly.

  • The hostaccess and urlaccess dictionaries need to be sorted, first by extracting a list of key/value pairs and second by using a comparison function (I've used both a lambda-defined function and a separate cmpval function) to sort the list according to the value in each tuple.
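
For readers on current interpreters, where cmp() and comparison-function sorting are no longer available, the counting and sorting steps can be sketched another way. This is not the script above but an alternative under stated assumptions: count_accesses() and by_count() are hypothetical helper names, the sample log lines are invented, and it swaps in dict.get() and sorted() with a key function for the try/except and cmp-based techniques.

```python
def count_accesses(lines):
    """Count accesses per host and per URL from web log lines."""
    hostaccess = {}
    urlaccess = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 10:       # skip anything that is not a full log entry
            continue
        host, loc = fields[0], fields[6]
        # get() supplies 0 the first time a key is seen
        hostaccess[host] = hostaccess.get(host, 0) + 1
        urlaccess[loc] = urlaccess.get(loc, 0) + 1
    return hostaccess, urlaccess

def by_count(counts):
    """Return (key, count) pairs, highest count first, ties ordered by key."""
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

# Invented sample entries in the ten-field layout the scripts above expect.
sample = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.2 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2000:13:57:12 -0700] "GET /about.html HTTP/1.0" 200 512',
    'short malformed line',
]
hosts, urls = count_accesses(sample)
host_report = by_count(hosts)
url_report = by_count(urls)
```

Negating the count in the key function gives the same descending order that the cmp(s[1], f[1]) comparison produces.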

Writing

Writing back to a file in Perl is easy: in nearly all cases we can use print to send the output to a specific filehandle. Python's print statement has not always been able to do this, although the facility was added in version 2.0.

Using write() or writelines()

The write() and writelines() methods are the most obvious way of writing information back to a file in Python. Both can be used to write binary data and, unlike print, they do not automatically add a newline to each string written to the file. The write() method writes a single string; despite its name, the writelines() method simply writes each string in a list to the file and adds no line terminators of its own. This means that the Perl statement:

print FILE 'Some text';

is directly equivalent to:

file.write('Some text')

and

print FILE @lines;

is directly equivalent to:

file.writelines(lines)
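
Because writelines() adds nothing between the strings, a list of lines without terminators ends up run together. The sketch below shows both cases; the file name and the sample list are assumptions for the example.

```python
import os
import tempfile

# Hypothetical output file and sample data.
path = os.path.join(tempfile.mkdtemp(), 'file.out')
lines = ['first', 'second', 'third']

with open(path, 'w') as out:
    out.writelines(lines)                  # written exactly as given
with open(path) as infile:
    run_together = infile.read()           # 'firstsecondthird'

with open(path, 'w') as out:
    # Add the terminators ourselves, as Perl code would with "\n"
    out.writelines([line + '\n' for line in lines])
with open(path) as infile:
    separated = infile.read()              # 'first\nsecond\nthird\n'
```

This pairs naturally with readlines(), which keeps the terminators: a list read with readlines() can be passed straight back to writelines() unchanged.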

Using print

New in Python 2.0 is the ability to write directly to an open file with print, without using the write() or writelines() methods. You can use this as a direct replacement for the print FILEHANDLE construct in Perl. For example, we can rewrite the Perl fragment:

open(FILE, '>file.out');
print FILE "Some kind of error probably occurred\n";

in Python and this becomes:

file = open('file.out','w')
print >>file, 'Some kind of error probably occurred'

Because we're using print, the newline character is appended automatically, so the Python string does not need the explicit \n used in Perl. If you don't want the newline, end the print statement with a trailing comma, which stops Python from appending the line termination sequence.
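
On Python 3 interpreters print is a function, and the redirection described above is spelled with a file keyword argument instead of >>. The sketch below shows that form only as a point of comparison; it is not the Python 2.0 syntax this section describes, and the file name is an assumption for the example.

```python
import os
import tempfile

# Hypothetical output file.
path = os.path.join(tempfile.mkdtemp(), 'file.out')

with open(path, 'w') as out:
    # The newline is appended automatically, as with the print statement
    print('Some kind of error probably occurred', file=out)
    # end='' suppresses it, the rough counterpart of the trailing comma
    print('no newline here', file=out, end='')

with open(path) as infile:
    written = infile.read()
```

The write() and writelines() methods behave the same way in both generations of the language, so they remain the portable choice.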

Changing position

The seek() and tell() functions in Perl move to and return the current position within a given file, and the Python seek() and tell() methods work in exactly the same way. Even the where argument, which determines the point the offset is measured from (start of file, current position, or end of file), is the same. However, in the Python versions covered here you have to use the numeric values (0, 1, and 2 respectively), as there are no handy definitions for them; Python 2.5 and later provide os.SEEK_SET, os.SEEK_CUR, and os.SEEK_END.
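
The three values of the where argument can be seen side by side in a short sketch. The file name and its ten bytes of sample data are assumptions for the example, and the file is opened in binary mode so that offsets are plain byte counts.

```python
import os
import tempfile

# Hypothetical data file containing ten known bytes.
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
with open(path, 'wb') as out:
    out.write(b'0123456789')

datafile = open(path, 'rb')
datafile.seek(4, 0)                # where=0: offset from the start of the file
from_start = datafile.read(2)      # b'45'
position = datafile.tell()         # 6, just past the bytes we read
datafile.seek(-3, 1)               # where=1: relative to the current position
from_current = datafile.read(1)    # b'3'
datafile.seek(-2, 2)               # where=2: relative to the end of the file
from_end = datafile.read()         # b'89'
end_position = datafile.tell()     # 10, the size of the file
datafile.close()
```

Seeking to offset 0 with where set to 2 and then calling tell() is a common way of finding a file's size without reading it.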

