Fine-Tune Your Data Collection

One of the most important steps during implementation is fine-tuning your data collection to suit your specific needs.

One thing web measurement in no way lacks is available data: there are hundreds of primary reports that can be generated, and thousands of secondary reports become available when you begin to drill down and cross-tab within the data. While some paint the plethora of data as "good news," the converse is often true: there is definitely such a thing as "too much information" in web measurement.

This is one reason key performance indicators [Hack #94] are such a valuable management tool: they help simplify data presentation and dissemination. After you have carefully considered your data needs before you set everything up [Hack #14], the next step is to fine-tune the data you collect so that you can make effective use of the KPI framework.

From a technical standpoint, the decisions you make about data collection are driven by your choice between using web server logfiles and JavaScript page tags. The sections below describe ways to eliminate some of the clutter in your data under each approach.

Web Server Logfiles

One of the first steps in reducing clutter is to log only data that you might like to eventually analyze. In the web measurement world, a web server logfile [Hack #22] refers to a combination of as many as four individual files: error logs, access logs, referrer logs, and agent logs. Fortunately, the combined and extended log formats used by Apache, Internet Information Server, and other popular web servers remove the need to process four separate files by combining useful elements into a single entry in the access log (often called the NCSA Extended or "combined" log format).

The combined logfile looks something like this:

	- frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "" "Mozilla/4.08 [en] (Win98; I ;Nav)"

An excellent overview of log formats is available in HTTP: The Definitive Guide (O'Reilly).
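To make the field layout of a combined-format entry concrete, here is a minimal Python sketch that splits one line into named fields. The regular expression, the `parse_combined` helper, and the sample entry are illustrative assumptions rather than part of any particular analytics tool; the client address and referrer in the sample are placeholder values.

```python
import re

# One named group per field of the NCSA combined log format.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A size of "-" means no response body was sent; treat it as zero bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["status"] = int(fields["status"])
    return fields


# Hypothetical sample entry; host and referrer are placeholder values.
SAMPLE = (
    '192.0.2.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

entry = parse_combined(SAMPLE)
print(entry["request"], entry["status"], entry["size"])
```

Once each entry is a dictionary, deciding which fields you actually need to keep becomes a matter of choosing keys, which is the same decision the rest of this section asks you to make at collection time.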

The combined logfile provides a record of every request for every resource made to the web server. You're technically not required to record every hit [Hack #1] in the logfile. In fact, many people choose to exclude image requests from logging to cut down on the sheer volume of requests logged. In Apache, a change to your logging configuration (handled by mod_log_config) will allow you to exclude image (GIF, PNG, or JPG) requests from your primary logfile:

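	# Flag requests whose URI ends in .gif, .png, or .jpg
	SetEnvIf Request_URI "(\.gif|\.png|\.jpg)$" image
	# Write flagged (image) requests to their own log
	CustomLog logs/images.log common env=image
	# Write everything else to the primary access log
	CustomLog logs/access.log common env=!image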

The described functionality also requires mod_setenvif, which may need to be installed by your system administrator.

Making this change to your Apache configuration and restarting Apache will exclude requests for images from your primary access log and instead write them to a separate file (images.log), which can periodically be erased to save space.

Keep in mind that making this change will cause your image requests to be logged in a separate file, resulting in extra work if for some reason you need that information at a later date. Make this change cautiously and consciously.
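Before deploying a pattern like the one above, it can help to check which request URIs it would actually divert. The following Python sketch applies the same regular expression to a handful of made-up URIs; the `is_image_request` and `split_requests` helpers and the sample list are hypothetical, and the sketch assumes the pattern is applied as a case-sensitive, unanchored search, which is how SetEnvIf treats it.

```python
import re

# Same pattern as in the SetEnvIf directive (case-sensitive, searched
# anywhere in the URI, with "$" pinning the match to the end).
IMAGE_RE = re.compile(r"(\.gif|\.png|\.jpg)$")


def is_image_request(uri):
    """True if the URI would be flagged with the 'image' variable."""
    return IMAGE_RE.search(uri) is not None


def split_requests(uris):
    """Partition URIs the way the two CustomLog lines would."""
    images, pages = [], []
    for uri in uris:
        (images if is_image_request(uri) else pages).append(uri)
    return images, pages


# Hypothetical request URIs, for illustration only.
sample_uris = [
    "/index.html",
    "/img/logo.gif",
    "/img/photo.jpeg",
    "/img/BANNER.PNG",
    "/products/chart.png",
]

images, pages = split_requests(sample_uris)
print("images.log:", images)
print("access.log:", pages)
```

In this sample, the `.jpeg` and uppercase `.PNG` requests stay in the primary log because the pattern does not cover them. If your site uses such names, the pattern would need extending, or mod_setenvif's case-insensitive SetEnvIfNoCase directive could be used instead.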

JavaScript Page Tags

Deciding which data to keep and which to ignore when you're using a JavaScript page tag is trickier than simply tweaking your web server configuration files. Most often, you'll need to consult with your vendor since they're usually the ones dealing with data collection and storage. You may be thinking to yourself, "Now, why would I want to go messing with my page tags?" There is a good reason to consider tag modification.

Because JavaScript page tags [Hack #28] always come with some performance overhead, if you can identify which data you're pretty sure you won't need, you can ask your vendor to build a smaller file that excludes unnecessary code. Smaller files equal faster downloads for the users of your site. Until every customer is on high-speed broadband, smaller code is better code. Period.

While the majority of tag vendors have optimized the bulk of their code to live in the browser's cache, this code still needs to be loaded on the first visit or any time a visitor clears his cache. Put another way, most vendors have employed a "round-trip" strategy that places a small JavaScript file in your pages that sets some variables and then calls a larger script. The script that is called is placed into your pages on the "return trip" and is also usually stored in the visitor's browser cache. This allows easier maintenance on the bulk of the JavaScript and a faster load on subsequent page views. Still, that first page view, when your visitor is required to download the external file, can sometimes be a doozy!
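A rough back-of-the-envelope calculation shows why the first page view costs more than later ones. The Python sketch below compares the bytes transferred on a first visit (stub plus external script) with a cached visit (stub only). All sizes and link speeds are hypothetical figures chosen for illustration, and the estimate ignores latency and protocol overhead.

```python
def download_seconds(size_bytes, link_kbps):
    """Transfer time for a payload, ignoring latency and protocol overhead."""
    return (size_bytes * 8) / (link_kbps * 1000)


# Hypothetical sizes, for illustration only.
STUB_BYTES = 1_000             # small inline tag that sets variables
FULL_SCRIPT_BYTES = 25_000     # larger external script as shipped
TRIMMED_SCRIPT_BYTES = 15_000  # same script with unused code removed


def page_tag_cost(script_bytes, link_kbps, cached):
    """Seconds spent downloading tag code for a single page view."""
    payload = STUB_BYTES + (0 if cached else script_bytes)
    return download_seconds(payload, link_kbps)


for label, kbps in (("56 kbps modem", 56), ("1.5 Mbps DSL", 1500)):
    first = page_tag_cost(FULL_SCRIPT_BYTES, kbps, cached=False)
    trimmed = page_tag_cost(TRIMMED_SCRIPT_BYTES, kbps, cached=False)
    later = page_tag_cost(FULL_SCRIPT_BYTES, kbps, cached=True)
    print(f"{label}: first view {first:.2f}s, "
          f"trimmed first view {trimmed:.2f}s, cached view {later:.2f}s")
```

Under these assumptions the saving from a trimmed script shows up only on uncached views, and it is most noticeable on slow connections.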

Every little bit of code reduction improves the initial performance of a page tag, which matters both for the accuracy of measurement and, in some instances, for the visitor's perception of how quickly the page loads. Just as you should always strive to present visitors with the most optimized page possible, any code used on your site should also be optimized.

Here are some categories you should consider eliminating:

Technographic data

The code that gathers data like monitor color depth, JavaScript status, and Java status is suspect because technographic data is rarely valuable [Hack #74]. Unless you have a really good reason to track details like monitor color depth, you should consider having this code stripped out.

Data about plug-ins

Unless you're deploying some kind of specific program to collect data about plug-ins [Hack #73], you should consider having this code removed.

File downloads

If you're absolutely sure you don't have any downloadable files on your site (no PDFs, EXEs, or DOC/XLS/PPT files that anyone would be requesting), this code is often suitable for removal. If in doubt, keep this code in place for 90 days to be absolutely sure you don't have downloadable files. Once you're sure that no such files exist, consider removing this code.

Commerce data

If you know you're not doing any online commerce and you're not going to be either leveraging commerce variables in some other fashion or assigning a dollar value to a nonrevenue event [Hack #39], removing code for collecting commerce data can often reduce file sizes significantly.

Again, you want to be careful to never change your JavaScript page tag without your vendor's help (bad things will almost assuredly happen). Still, every few bytes you can save in your page tag are bytes you can use elsewhere to delight your visitors!

Why Not Just Collect Everything?

Given all you've read so far, you may be tempted to hedge your bets and just collect everything. This is not a bad idea in theory, but reality dictates that there is a cost associated with data collection, one you should be conscious of to prevent surprises down the road. Depending on your data collection and storage strategy, you need to consider both the time it takes to reprocess data and the costs of collection and storage.

It takes time to reprocess data.

Given the volumes of data you can collect from webserver logs, page tags, commerce applications, CRM databases, and the like, surely you can appreciate the problem associated with "Can I get a summary of the last two years' data?" By taking the time to streamline data collection, you will save yourself time if you need to look back and generate historical reports.

There are costs associated with collection and storage.

While disk I/O and storage get less expensive every year, it's unlikely that they will ever be free, regardless of whether you house the data internally or keep it with a hosted provider. In fact, many hosted analytics vendors put a limit on the amount of time they keep your data (or at least keep it in a readily accessible format). From a financial perspective, ongoing data storage is a clear case of "more is more."
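To get a feel for how quickly raw data accumulates, and how much reprocessing a two-year lookback implies, a simple estimate is enough. The Python sketch below compares a year of logfile growth with and without image requests. The traffic figures and the average entry size are hypothetical assumptions, not measurements.

```python
def yearly_log_bytes(requests_per_day, bytes_per_entry, days=365):
    """Approximate logfile growth over a year, before any compression."""
    return requests_per_day * bytes_per_entry * days


# Hypothetical traffic profile, for illustration only.
PAGE_VIEWS_PER_DAY = 100_000
IMAGES_PER_PAGE = 10
BYTES_PER_ENTRY = 200  # rough size of one log line

all_hits = yearly_log_bytes(
    PAGE_VIEWS_PER_DAY * (1 + IMAGES_PER_PAGE), BYTES_PER_ENTRY
)
pages_only = yearly_log_bytes(PAGE_VIEWS_PER_DAY, BYTES_PER_ENTRY)

print(f"Logging every hit: {all_hits / 1e9:.1f} GB per year")
print(f"Logging pages only: {pages_only / 1e9:.1f} GB per year")
```

Substituting your own request counts and entry sizes gives a first approximation of both the storage bill and the volume of data any historical report would have to reread.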

Finally, always remember, when in doubt, ask your vendor. They will undoubtedly have an opinion on what you should collect and how long that information should be kept. Don't necessarily take their word as gospel, but don't discount their opinions without consideration.
