Strings





Strings

In C and some other languages, the term string refers to a variablelength array of characters, defined by a starting point and by a string-termination character marking the end. Programmers use library functions to process strings and also work directly with the low-level representation. In Java, strings are a higher-level abstraction with built-in language support whose representation is hidden. To work at a lower level of abstraction, programmers convert strings to concrete representations such as arrays of characters and back. In this section, we consider some examples that illustrate these points.

Strings are valuable data structures, for two basic reasons. First, many computing applications involve processing textual data, which can be represented directly with strings. Second, many computer systems provide direct and efficient access to bytes of memory, which correspond directly to characters in strings. That is, in a great many situations, the string abstraction matches needs of the application to the capabilities of the machine.

The abstract notion of a sequence of characters could be implemented in many ways. For example, we could use a linked list. That choice would exact a cost of one reference per character, but could make more efficient operations such as concatenating together two long strings to make a third. Many algorithms are based upon the concrete array-based implementation of C-style strings, but many others are also amenable to the other representations. Generally, we assume that we can perform the following two operations in constant time:

  • Index a string (access its kth character for a given k)

  • Determine the number of characters in a string

We make these assumptions despite the fact that the first does not hold for strings represented as linked lists; the second does not hold for C-style strings; and we can have no guarantee that either will hold for Java strings, because their representation is hidden. In cases where performance is critical or an algorithm is best expressed using a particular representation, we can usually convert to a given representation without much difficulty. We return to the topic of representing strings (in considerable detail) at the beginning of Part 6, which is entirely devoted to string processing.

One of the most important operations that we perform on strings is the compare operation, which tells us which of two strings would appear first in the dictionary. This operation is provided in the compareTo method in the String class; but for purposes of discussion, we assume an idealized dictionary (since the actual rules for strings that contain punctuation, uppercase and lowercase letters, numbers, and so forth are rather complex) and compare strings character-by-character, from beginning to end. This ordering is called lexicographic order. We also use the compare method to tell whether strings are equal—by convention, the compare method returns a negative number if the first parameter string appears before the second in the dictionary, returns 0 if they are equal, and returns a positive number if the first appears after the second in lexicographic order. It is critical to take note that doing equality testing is not the same as determining whether two string references are equal—if two string referencess are equal, then so are the referenced strings (they are the same string), but we also could have different string references that point to equal strings (identical sequences of characters). Numerous applications involve storing information as strings, then processing or accessing that information by comparing the strings, so the compare operation is a particularly critical one.

Program 3.14 String search

This method discovers all occurrences of the pattern string given as its first parameter in the (presumably much larger) text string given as its second parameter.

For each starting position i in the text, we try matching the substring starting at that position with the pattern, testing for equality character by character. Whenever we reach the end of p successfully, we increment the count of the number of occurrences of the pattern in the text.

static int countMatches(String p, String s) 
  { 
    int cnt = 0, M = p.length(), N = s.length(); 
    if (M > N) return 0; 
    for (int i = 0; i < N;i ++) 
      { int j; 
        for (j = 0; j < M; j++) 
          if (s.charAt(i+j) != p.charAt(j)) break; 
        if (j == p.length()) cnt++; 
      } 
    return cnt; 
  } 

Program 3.14 is an implementation of a simple string-processing task, which counts the number of occurences of a short pattern string within a long text string. Several sophisticated algorithms have been developed for this task (see Chapter 23), but this simple one illustrates several of the basic facilities available for processing strings in Java. In practice, as we shall see, we are normally more interested in variations of this program that tell us where in the text the matches occur (see Exercises 3.56 and 3.57).

String processing provides a convincing example of the need to be knowledgeable about the performance of library methods. Consider an implementation where determining the length of a string takes time proportional to the length of the string, as it does in C-style strings. Ignoring this fact can lead to severe performance problems. For example, if we have such a length method and we code a for loop with

for (i = 0; i <s.length(); i++) 

then the time required is proportional to at least the square of the length of s, no matter what code is in the body of the loop! This cost can be considerable, even prohibitive: Running such code to check whether this book (which has more than 1 million characters) contains a certain word would require trillions of instructions. Problems such as this one are difficult to detect because a program might work fine when we are debugging it for small strings, but then slow down or even never finish when it goes into production. Moreover, we can avoid such problems only if we know about them—we have no way of knowing if some remote or future Java user will encounter a slow length method, because the Java string representation is hidden from us.

This kind of error is called a performance bug, because the code can be verified to be correct, but it does not perform as efficiently as we (implicitly) expect. Before we can even begin the study of efficient algorithms, we must be certain to have eliminated performance bugs of this type. Although built-in language support and standard libraries have many virtues, we must be wary of the dangers of using them for simple methods of this kind. In this particular case, there is not much to worry about, because typical Java implementations have source code which we can examine to check assumptions about performance, and they also typically have constant-time string indexing and length methods.

Java strings are immutable—they cannot be changed. When we use the + operator to concatenate two strings, we create a new string with the result. Accordingly, when implementing algorithms that involve substantial changes to a String, we convert the string to an array of characters, perform the computation, and convert the resulting array of characters back to a String. This approach provides concise code with performance guarantees.

Program 3.15 is an example of such a computation which replaces sequences of blanks in a string by a single blank. Conceptually, we might think of implementing this by, every time we encounter a sequence of blanks, moving the tail of the string to the left to cover all but one of them. Indeed, the System library provides an arraycopy operation that we could use as the basis for such an implementation. We do not use this approach because it could be very slow: for example, if we have a huge string with plenty of blanks, the running time approaches the product of the length of the string and the number of sequences of blanks, which is excessive and perhaps prohibitive. By contrast, the implementation in Program 3.15 runs in time proportional to the length of the string.

Program 3.15 anipulating strings

This method returns a string that is the same as the string given as its parameter except that each sequence of blank characters is replaced with a single blank character. Since Java String objects cannot be changed, this program illustrates a typical approach that we use to manipulate strings: copy into a character array (using the toCharArray method from the String class), then change the character array, then create a new String with the result (using a String constructor).

The index N refers to the next character in the result: the main loop copies the next string character into a[N] if it is not a blank or if a[N-1] is not a blank.

static String squeeze(String s) 
  { 
    char[] a = s.toCharArray(); 
    intN=1; 
    for (int i = 1; i < a.length; i++) 
      { 
        a[N] = a[i]; 
        if (a[N] != '
							') N++; 
        else if (a[N-1] != '
							') N++; 
      } 
    return new String(a, 0, N); 
  } 

Memory allocation for strings is typically more difficult than for linked lists because strings vary in size; but again, the Java system takes care of all the details. Indeed, a fully general mechanism to reserve space for strings is neither more nor less than the general memory allocation mechanism required for all Java objects. As mentioned in Section 3.6, various algorithms, whose performance characteristics are system and machine dependent, have been developed for this problem. Often, memory allocation is a less severe problem when we are working with strings than it might first appear, because we often work with references to the strings, rather than with the characters themselves.

Exercises

graphics/icon01.gif 3.56 Modify Program 3.14 to print out all character positions in the text where a substring is found that matches the pattern.

graphics/icon01.gif 3.57 Modify Program 3.14 to return a linked list with all the character positions in the text where a substring is found that matches the pattern.

graphics/icon01.gif 3.58 Write a method that takes a string as an parameter and prints out a table giving, for each character that occurs in the string, the character and its frequency of occurrence.

graphics/icon01.gif 3.59 Write a method that checks whether a given string is a palindrome (reads the same backward or forward), ignoring blanks. For example, your program should report success for the string if i had a hifi.

3.60 Write a method that takes a string as a parameter and reads a sequence of words (sequences of characters separated by blank space) from standard input, printing out those that appear as substrings somewhere in the parameter string.

3.61 Write a method that takes a string and a character as parameters and returns a string that is the result of removing all occurrences of the character from the string. Your implementation should run in time proportional to the length of the string.

3.62 Write an efficient method that finds the length of the longest sequence of blanks in a given string, examining as few characters in the string as possible. Hint: Your program should become faster as the length of the sequence of blanks increases.

3.63 Write a version of Program 3.15 that, instead of a character array, uses an interface that supports just two methods: one that returns the next character in a string, and another that appends a character to a string.

3.64 Implement your interface from Exercise 3.63 using a character array representation, and test it with your client program from Exercise 3.63.

3.65 Implement your interface from Exercise 3.63 using a linked-list representation, and test it with your client program from Exercise 3.63.


 Python   SQL   Java   php   Perl 
 game development   web development   internet   *nix   graphics   hardware 
 telecommunications   C++ 
 Flash   Active Directory   Windows