Program: Finding Fresh Links







The fresh-links.php program shown here is a modification of stale-links.php (Recipe 13.17) that produces a list of links and their last-modified times. If the server on which a URL lives doesn't provide a last-modified time, the program prints only the link's status. If the program can't retrieve the URL successfully, it prints the status code it got when it tried to retrieve the URL. Run the program by passing it a URL to scan for links:

% php fresh-links.php http://www.oreilly.com
https://epoch.oreilly.com/account/default.orm: MOVED: https://epoch.oreilly.com/lib/p_sso.orm?d=account
https://epoch.oreilly.com/shop/cart.orm: OK
http://www.oreilly.com/: OK; Last Modified: Mon, 08 May 2006 22:11:04 GMT
http://oreillynet.com/: OK
http://www.oreilly.com/store/: OK
http://safari.oreilly.com: OK
http://conferences.oreillynet.com/: OK
http://www.oreillylearning.com: OK
http://academic.oreilly.com: MOVED: http://academic.oreilly.com/index.csp
...

This output is from a run of the program at about 11:48 P.M. GMT on May 8, 2006. Most links aren't accompanied by a last-modified time; this means the server didn't provide one, so the page is probably dynamic. The link to http://www.oreilly.com/ shows that page being about 90 minutes old. The link to http://academic.oreilly.com shows that it has been moved elsewhere, as reported by the output of stale-links.php in Recipe 13.17.
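The age of a page can be worked out from its Last-Modified header and the time of the request. The following is a minimal sketch, not part of fresh-links.php; the helper name page_age_minutes() is hypothetical, and the request time is taken from the sample run above.

```php
<?php
// Hypothetical helper: whole minutes between a Last-Modified header value
// and the moment the URL was requested (a Unix timestamp).
function page_age_minutes($lastModifiedHeader, $requestedAt) {
    $modifiedAt = strtotime($lastModifiedHeader);
    if ($modifiedAt === false) {
        return null; // the header value could not be parsed as a date
    }
    return (int) floor(($requestedAt - $modifiedAt) / 60);
}

// Values from the sample run: requested at about 23:48 GMT on May 8, 2006
$requestedAt = gmmktime(23, 48, 0, 5, 8, 2006);
print page_age_minutes('Mon, 08 May 2006 22:11:04 GMT', $requestedAt) . " minutes\n";
```

For the sample values this prints 96 minutes, which matches the "about 90 minutes" estimate.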

The program to find fresh links is conceptually almost identical to the program to find stale links. It uses the same techniques to pull links out of a page; however, it uses the HTTP_Request class instead of cURL to retrieve URLs. The code to get the base URL specified on the command line is inside a loop so that it can follow any redirects that are provided and easily return the final URL in a redirect chain.
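The redirect-following loop can be illustrated without touching the network. This is a sketch under stated assumptions: follow_redirects() and the $fetch callback are hypothetical names, and the callback stands in for the HTTP_Request call by looking up a canned response for each URL.

```php
<?php
// Hypothetical sketch: follow Location headers until a non-redirect
// response arrives or the redirect budget is used up.
// $fetch stands in for a real HTTP call; it takes a URL and returns
// array('code' => int, 'location' => string).
function follow_redirects($url, $fetch, $maxRedirects = 10) {
    $result = null;
    while ($maxRedirects > 0) {
        $response = call_user_func($fetch, $url);
        // Remember the URL actually fetched and the code it returned
        $result = array('url' => $url, 'code' => $response['code']);
        $isRedirect = ($response['code'] >= 300) && ($response['code'] < 400);
        if (! ($isRedirect && strlen($response['location']))) {
            break; // final URL in the chain
        }
        $url = $response['location'];
        $maxRedirects--;
    }
    return $result;
}

// Canned responses (made-up URLs) forming a two-step redirect chain
$canned = array(
    'http://example.com/a' => array('code' => 301, 'location' => 'http://example.com/b'),
    'http://example.com/b' => array('code' => 302, 'location' => 'http://example.com/c'),
    'http://example.com/c' => array('code' => 200, 'location' => ''),
);
$fetch = function ($url) use ($canned) { return $canned[$url]; };

$result = follow_redirects('http://example.com/a', $fetch);
print $result['url'] . ' => ' . $result['code'] . "\n";
```

The redirect budget keeps a chain that loops back on itself from running forever; when the budget is exhausted, the last response fetched is returned as is.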

Once a page has been retrieved, each linked URL is retrieved with the HEAD method. In addition to printing a new location for moved links, the program prints the Last-Modified header if it's available.
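The way a response is turned into a line of output can be isolated in a small function. The sketch below uses the same status rules as the program, but describe_link() is a hypothetical helper introduced here for illustration; the program itself does this work inline.

```php
<?php
// Hypothetical helper: build the status text for one link from its
// response code and the Location and Last-Modified header values.
function describe_link($code, $location = '', $lastModified = '') {
    if (($code >= 200) && ($code < 300)) {
        // 2xx response codes mean the page is OK
        $status = 'OK';
    } else if (($code >= 300) && ($code < 400)) {
        // 3xx response codes mean redirection
        $status = 'MOVED';
        if (strlen($location)) {
            $status .= ": $location";
        }
    } else {
        // Other response codes mean errors
        $status = "ERROR: $code";
    }
    // Append the Last-Modified value only when the server sent one
    if (strlen($lastModified)) {
        $status .= "; Last Modified: $lastModified";
    }
    return $status;
}

print describe_link(200, '', 'Mon, 08 May 2006 22:11:04 GMT') . "\n";
print describe_link(302, 'http://academic.oreilly.com/index.csp') . "\n";
print describe_link(404) . "\n";
```

Keeping this logic in a function that takes plain values makes it possible to check the output format without sending any requests.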

fresh-links.php

<?php
error_reporting(E_ALL);
require_once 'HTTP/Request.php';

if (! isset($_SERVER['argv'][1])) {
    die("No URL provided.\n");
}

$url = $_SERVER['argv'][1];

// Load the page
$r = load_with_http_request($url);

if (! strlen($r->getResponseBody())) {
    die("No page retrieved from $url\n");
}

// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
              'numeric-entities' => true);
$xml = tidy_repair_string($r->getResponseBody(), $opts);
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');

// Compute the Base URL for relative links.
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
    $baseURL = $nodeList->item(0)->nodeValue;
}
// No <base href=""/>, so build the Base URL from $url
else {
    $URLParts = parse_url($r->_url->getURL());
    if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
        $basePath = '';
    } else {
        $basePath = preg_replace('#/[^/]*$#','',$URLParts['path']);
    }
    // parse_url() returns credentials under the 'user' and 'pass' keys
    if (isset($URLParts['user']) || isset($URLParts['pass'])) {
        $auth = isset($URLParts['user']) ? $URLParts['user'] : '';
        $auth .= ':';
        $auth .= isset($URLParts['pass']) ? $URLParts['pass'] : '';
        $auth .= '@';
    } else {
        $auth = '';
    }
    $baseURL = $URLParts['scheme'] . '://' .
               $auth . $URLParts['host'] .
               $basePath;
}

// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();

// Grab all links
$links = $xpath->query('//xhtml:a/@href');

foreach ($links as $node) {
    $link = $node->nodeValue;
    // Resolve relative links
    if (! preg_match('#^(http|https|mailto):#', $link)) {
        if (((strlen($link) == 0)) || ($link[0] != '/')) {
            $link = '/' . $link;
        }
        $link = $baseURL . $link;
    }
    // Skip this link if we've seen it already
    if (isset($seenLinks[$link])) {
        continue;
    }
    // Mark this link as seen
    $seenLinks[$link] = true;
    // Print the link we're visiting
    print $link.': ';
    flush();

    $r = load_with_http_request($link, 'HEAD');
    // Decide what to do based on the response code
    // 2xx response codes mean the page is OK

    if (($r->getResponseCode() >= 200) && ($r->getResponseCode() < 300)) {
        $status = 'OK';
    }
    // 3xx response codes mean redirection
    else if (($r->getResponseCode() >= 300) && ($r->getResponseCode() < 400)) {
        $status = 'MOVED';
        if (strlen($location = $r->getResponseHeader('location'))) {
            $status .= ": $location";
        }
    }
    // Other response codes mean errors
    else {
        $status = "ERROR: {$r->getResponseCode()}";
    }
    if (strlen($lastModified = $r->getResponseHeader('last-modified'))) {
        $status .= "; Last Modified: $lastModified";
    }
    // Print what we know about the link
    print "$status\n";
}

function load_with_http_request($url, $method = 'GET') {
    if ($method == 'GET') {
        $done = false; $max_redirects = 10;
        while ((! $done) && ($max_redirects > 0)) {
            $r = new HTTP_Request($url);
            $r->sendRequest();
            $responseCode = $r->getResponseCode();
            if (($responseCode >= 300) && ($responseCode < 400) &&
                strlen($location = $r->getResponseHeader('location'))) {
                    $url = $location;
                    $max_redirects--;
            } else {
                $done = true;
            }
        }
    } else {
        $r = new HTTP_Request($url);
        $r->setMethod(HTTP_REQUEST_METHOD_HEAD);
        $r->sendRequest();
    }
    return $r;
}
?>
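
The link-extraction step can be tried on its own, without Tidy or a network connection. The following sketch assumes the DOM extension is available and uses a made-up XHTML document in place of the Tidy output; the URLs in it are placeholders.

```php
<?php
// Hypothetical sample document; the real program gets its XHTML from Tidy.
$xml = <<<XML
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title><base href="http://www.example.com/docs"/></head>
  <body>
    <a href="/store/">Store</a>
    <a href="http://safari.example.com">Safari</a>
    <a name="top">No href here</a>
  </body>
</html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
// XHTML elements live in a namespace, so the queries need a prefix for it
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// Collect the href attribute of every <a> element that has one
$hrefs = array();
foreach ($xpath->query('//xhtml:a/@href') as $node) {
    $hrefs[] = $node->nodeValue;
}
// The <base> element, when present, supplies the base URL
$base = $xpath->query('//xhtml:base/@href')->item(0)->nodeValue;

print "Base: $base\n";
print implode("\n", $hrefs) . "\n";
```

The anchor without an href attribute is skipped because the query selects attributes rather than elements.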


