Track the Media's Attention Span over Time

Visualize media trends by counting the total number of Yahoo! News mentions of a specific phrase over a series of dates.

The nature of news is that it reports about what's new in the world each day. But in the rush to bring the latest news to the public, news organizations often have a pack mentality. The news being covered by the top media outlets this week is different from what was covered last week. And sometimes, looking at what news organizations decide to cover can be more interesting than the news itself.

In that spirit, this hack is about tracking a topic's ebbs and flows through the news cycle. Because Yahoo! News brings together over 7,000 different news sources from around the world into one site, it's the perfect place to spot trends and track what the media is tracking. One drawback to tracking 7,000 news sources is that storage of that information becomes an issue. So Yahoo! News stores only the last 30 days' worth of articles. But in our 24-hour-a-day news world, 30 days ago can seem like ancient history.

This hack was inspired by "Tracking Result Counts over Time" [Hack #63] from the first edition of Google Hacks. If you'd like to see how to implement a similar hack for Google Search results, track down a copy of the first edition of Google Hacks.


The key to being able to track a keyword in news articles over time is being able to isolate articles by day. Luckily, the Yahoo! News advanced search interface (http://news.search.yahoo.com/news/advanced) gives the option to limit searches by time. So, if you want just the stories about Apple from March 1, 2005, the advanced search interface lets you bring them up by specifying March 1 as the start and end date. Another great feature of the advanced search interface is the ability to limit your search to a specific category. By selecting Technology in addition to specifying the date, you can be sure to weed out stories about the fruit that grows on trees and stick with stories about the company that makes computers.

Once you isolate stories to a particular day, you can find out how many stories contained the term you're interested in on that day. For example, there were 143 technology stories that mentioned Apple on March 1, 2005, but only 115 stories mentioned Apple on March 2. At the time of this writing, the news data available at Yahoo! Search Web Services doesn't include the ability to limit requests to a specific date, so this hack uses screen scraping to gather the data.

Screen scraping involves programmatically downloading the HTML for a web page and picking through the source to find the bits of information you're looking for. It is a notoriously brittle process, because it relies on finding patterns within the HTML. If Yahoo! decides to change its HTML tomorrow, the code in this hack that picks up the total results for a query will fail. The risk is limited here, because we're interested in only one bit of data on Yahoo! News search results pages: the estimated total number of articles for our query. The figure below shows the bit we're looking for in a search for Apple stories from March 1, 2005.

Searching through the almost 250 lines of HTML in a results page, you can pick out the total results number from this line:

	<em>Results <strong>1 - 10</strong> of about <strong>143</strong> for 
	<strong>apple</strong>.</em>

Armed with the pattern to find the total results, you can assemble the code.
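
Before wiring the pattern into a larger script, it can help to try it against the sample line on its own. The following is a minimal sketch, not the hack's final code: the extract_total helper is a hypothetical name, and the handling of thousands separators (such as 1,430) is an assumption about how larger counts might be displayed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the estimated total from a results page, or 0 if the
# pattern is not found (for example, after an HTML redesign).
# extract_total is a hypothetical helper name.
sub extract_total {
    my ($html) = @_;
    return 0 unless defined $html;
    if ($html =~ m{of about <strong>([\d,]+)</strong>}i) {
        # Assumption: large counts may contain commas, e.g. "1,430"
        (my $total = $1) =~ tr/,//d;
        return $total;
    }
    return 0;
}

# The sample line from the results page
my $sample = '<em>Results <strong>1 - 10</strong> of about '
           . '<strong>143</strong> for <strong>apple</strong>.</em>';

print extract_total($sample), "\n";   # prints 143
```

Returning 0 when nothing matches keeps the output usable, but it also means a layout change at Yahoo! would look like a day with no coverage, so a long run of zeros is worth a second look.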

Total results of a Yahoo! News search for "apple"


The Code

Though you can't limit search by date with the Yahoo! Search Web Services, this code relies on the fact that Yahoo! News search pages at the web site have stable, predictable URLs for date-specific searches. Sticking with our example, here are the relevant pieces from a URL for Apple articles from March 1:

	http://news.search.yahoo.com/news/search?va=apple&smonth=3&sday=1&emonth=3&eday=1

As you can see, the va variable holds the query, smonth and sday the start date, and emonth and eday the end date. Knowing this pattern, you can construct a query for any time period you'd like.
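
As a rough illustration of building such a URL for an arbitrary span, here is a small sketch. The parameter names (va, smonth, sday, emonth, eday) come from the URL above; the news_url helper is hypothetical, and its hand-rolled percent-encoding assumes a plain ASCII query.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a Yahoo! News search URL for one query and one date span.
# news_url is a hypothetical helper; only the parameter names are
# taken from the URL shown in the text.
sub news_url {
    my ($query, $smonth, $sday, $emonth, $eday) = @_;

    # Percent-encode everything except unreserved URL characters
    # (assumes an ASCII query)
    (my $escaped = $query) =~
        s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;

    return "http://news.search.yahoo.com/news/search?"
         . join("&",
               "va=$escaped",
               "smonth=$smonth", "sday=$sday",
               "emonth=$emonth", "eday=$eday");
}

# All Apple stories from the first week of March
print news_url("apple", 3, 1, 3, 7), "\n";
```

Setting the start and end to the same day, as the hack does, narrows the count to a single date.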

You'll need a few modules for this hack: LWP::Simple to fetch the Yahoo! News page, Date::Manip to work with dates, and CGI to read the name=value arguments from the command line. Add the following code to a file named track_news.pl:

	#!/usr/bin/perl
	# track_news.pl
	# Builds a Yahoo! News URL for every day
	# between the specified start and end dates, returning
	# the date and estimated total results as a CSV list.
	# usage: track_news.pl query="{query}" start={date} end={date}
	# where dates are of the format: yyyy-mm-dd, e.g. 2005-03-30

	use strict;
	use Date::Manip; 
	use LWP::Simple qw(!head); 
	use CGI qw/:standard/;

	# No Yahoo! Application ID is needed: this script scrapes the
	# Yahoo! News site rather than calling Yahoo! Search Web Services

	# Get the query 
	my $query = param('query');

	# Set the News category to search tech articles 
	# Alternates: top, world, politics, entertainment, business 
	# more at: http://news.search.yahoo.com/news/advanced 
	my $category = "technology";

	# Regular expression to check the date format (yyyy-mm-dd)
	my $date_regex = '(\d{4})-(\d{1,2})-(\d{1,2})';
	
	# Make sure the query and both dates are present and well formed
	( param('query') and (param('start') || '') =~ /^$date_regex$/
		and (param('end') || '') =~ /^$date_regex$/ ) or 
		die qq{usage: track_news.pl query="{query}" start={date} end={date}\n};

	# Set timezone, parse incoming dates
	Date_Init("TZ=PST");
	my $start_date = ParseDate(param('start'));
	my $end_date = ParseDate(param('end'));

	# Print the CSV column titles
	print qq{"date","count"\n};

	# Loop through the dates; Date_Cmp compares two parsed dates
	while (Date_Cmp($start_date, $end_date) <= 0) { 
		my $month = int UnixDate($start_date, "%m"); 
		my $day = int UnixDate($start_date, "%d");
		my $date_f = UnixDate($start_date,"%y-%m-%d");
		my $total;

		# Construct a Yahoo! News URL, escaping the query so that
		# spaces and punctuation survive in the URL
		my $news_url = "http://news.search.yahoo.com/news/search?";
		$news_url .= "ei=UTF-8";
		$news_url .= "&va=" . CGI::Util::escape($query);
		$news_url .= "&cat=$category";
		$news_url .= "&catfilt=1";
		$news_url .= "&pub=1";
		$news_url .= "&smonth=$month";
		$news_url .= "&sday=$day";
		$news_url .= "&emonth=$month";
		$news_url .= "&eday=$day";

		# Make the request
		my $news_response = get($news_url);

		# Find the number of results; a failed fetch or a changed
		# page layout both fall through to zero
		if (defined $news_response
			and $news_response =~ m!of about <strong>(.*?)</strong>!i) {
			$total = $1;
		} else {
			$total = 0;
		}

		# Print out results
		print
			'"',
			$date_f,
			qq{","$total"\n};
		# Add a day, and continue the loop
		$start_date = DateCalc($start_date, " + 1 day");
	}
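
The argument check in track_news.pl looks only at the shape of each date, so a value such as 2005-02-30 would pass it. A stricter check might look like the sketch below, which uses the core Time::Local module. The is_real_date helper is a hypothetical name and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Return 1 only for a real calendar date written as yyyy-mm-dd.
# is_real_date is a hypothetical helper, not part of track_news.pl.
sub is_real_date {
    my ($text) = @_;
    return 0 unless defined $text;
    my ($y, $m, $d) = $text =~ /^(\d{4})-(\d{1,2})-(\d{1,2})$/
        or return 0;
    # timegm() dies on out-of-range months and days, so trap the error
    return eval { timegm(0, 0, 0, $d, $m - 1, $y); 1 } ? 1 : 0;
}

for my $date ('2005-03-30', '2005-02-30', '2005-13-01') {
    printf "%s %s\n", $date, is_real_date($date) ? "ok" : "invalid";
}
```

A check like this could run before the dates are handed to ParseDate, so that a typo stops the script with the usage message instead of producing an empty or misleading CSV file.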

Running the Hack

Run the script from a command line, specifying the query term and dates. Here's the query for Apple news between March 1 and March 10, 2005:

	track_news.pl query="apple" start=2005-03-01 end=2005-03-10

Of course, by the time you're reading this, these dates are out of the 30-day window, so you'll need to replace them with dates that fall into the range Yahoo! News can deliver.

If you'd like to pipe the script output to a text file, simply call it like so:

	track_news.pl query="apple" start=2005-03-01 end=2005-03-10 > apple.csv

The results will look like this:

	"date","count"
	"05-03-01","147"
	"05-03-02","111"
	"05-03-03","112"
	"05-03-04","173"
	"05-03-05","27"
	"05-03-06","51"
	"05-03-07","181"
	"05-03-08","171"
	"05-03-09","111"
	"05-03-10","130"

Just glancing at this list, you can see that Apple media coverage started off strong, dropped sharply on March 5 and 6 (a Saturday and Sunday), and then came back with a vengeance on Monday, March 7. It's tough to pinpoint a reason for the weekday differences, but unusual spikes might be a way to spot changes that will affect the company.
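
One way to make sense of the swings is to label each date with its day of the week. The sketch below does this with the core Time::Local module for a few rows copied from the sample output. The weekday helper is hypothetical, and it assumes every two-digit year in the output belongs to the 2000s.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @day_names = qw(Sun Mon Tue Wed Thu Fri Sat);

# Turn a "yy-mm-dd" date from the script output into a weekday name.
# weekday is a hypothetical helper; two-digit years are assumed to
# fall in the 2000s.
sub weekday {
    my ($date) = @_;
    my ($yy, $mm, $dd) = split /-/, $date;
    my $epoch = timegm(0, 0, 0, $dd, $mm - 1, 2000 + $yy);
    return $day_names[ (gmtime($epoch))[6] ];
}

# A few rows copied from the sample output
my @rows = (
    ["05-03-04", 173],
    ["05-03-05", 27],
    ["05-03-06", 51],
    ["05-03-07", 181],
);

for my $row (@rows) {
    my ($date, $count) = @$row;
    printf "%s %s %4d\n", $date, weekday($date), $count;
}
```

With these rows, the two lowest counts land on Sat and Sun, which suggests that at least part of the dip reflects the news week and not anything specific to Apple.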

Working with the Results

With a short list, it's easy to see where the spikes in media mentions are. But with longer lists, it might help to have a visual representation of the data. If you send the script output to a .csv file, you can simply double-click the file to open it with Excel. The chart wizard can give you a quick overview, such as the one for the entire month of March 2005 shown in the figure below.

Excel graph tracking tech news mentioning "apple"


As you can see, the mentions of Apple across technology stories in the month of March dip and peak at the beginning and end of the work week.
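
If Excel isn't handy, a rough text chart can give a similar overview. This is only a sketch under stated assumptions: parse_rows and bar are hypothetical helper names, the CSV lines are pasted in from the sample output instead of read from a file, and the scale of one mark per ten stories is arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse lines of the form "date","count", skipping the header row.
# parse_rows and bar are hypothetical helper names.
sub parse_rows {
    my (@lines) = @_;
    my @rows;
    for my $line (@lines) {
        next unless $line =~ /^"([\d-]+)","(\d+)"/;
        push @rows, [$1, $2];
    }
    return @rows;
}

# One '#' per $per_mark stories, rounded down
sub bar {
    my ($count, $per_mark) = @_;
    return '#' x int($count / $per_mark);
}

# A few lines copied from the sample output
my @csv = split /\n/, <<'END';
"date","count"
"05-03-04","173"
"05-03-05","27"
"05-03-06","51"
"05-03-07","181"
END

for my $row (parse_rows(@csv)) {
    my ($date, $count) = @$row;
    printf "%s %-20s %d\n", $date, bar($count, 10), $count;
}
```

To chart a real run, the here-document could be replaced with lines read from the .csv file the script produced.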


