Feb. 27, 2011, 1:45 a.m.
posted by quicksort
Fine-Tune Your Data Collection
One of the most important steps during implementation is fine-tuning your data collection to suit your specific needs.
One thing web measurement in no way lacks is available data: there are hundreds of primary reports that can be generated and thousands of secondary reports available when you begin to drill down and cross-tab within the data. While some paint the plethora of data as "good news," the converse is often true: there is definitely such a thing as "too much information" in web measurement.
This is one reason key performance indicators [Hack #94] are such a valuable management tool: they help simplify data presentation and dissemination. After you have carefully considered your data needs before you set everything up [Hack #14], the next step is to fine-tune the data you collect so that you can make effective use of the KPI framework.
Web Server Logfiles
One of the first steps in reducing clutter is to log only data that you might like to eventually analyze. In the web measurement world, a web server logfile [Hack #22] refers to a combination of as many as four individual files: error logs, access logs, referrer logs, and agent logs. Fortunately, the combined and extended log formats used by Apache, Internet Information Server, and other popular web servers remove the need to process four separate files by combining useful elements into a single entry in the access log (often called the NCSA Extended or "combined" log format).
The combined logfile looks something like this (from http://httpd.apache.org/docs/logs.html#combined):
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
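To make the structure of such an entry concrete, here is a minimal Python sketch that splits one combined-format line into named fields. The regular expression and the names COMBINED_RE and parse_combined are illustrative assumptions, not part of Apache or of any analytics tool, and the pattern deliberately ignores edge cases such as escaped quotes inside a field.

```python
import re

# Fields of the combined format, in order: host, identity, user, timestamp,
# request line, status, bytes sent, referrer, user agent.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line):
    """Return a dict of fields for one combined-format line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    return match.groupdict() if match else None

# The sample entry from the Apache documentation.
SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" '
          '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

if __name__ == "__main__":
    fields = parse_combined(SAMPLE)
    print(fields["request"], "referred by", fields["referrer"])
```

The referrer and agent fields at the end of the entry are the ones that would otherwise live in separate referrer and agent logs.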
The combined logfile provides a record of every request for every resource made to the web server. You're technically not required to record every hit [Hack #1] in the logfile. In fact, many people choose to exclude image requests from logging to cut down on the sheer volume of requests logged. In Apache, a change to the server configuration (using the SetEnvIf directive from mod_setenvif and the CustomLog directive from mod_log_config) will allow you to exclude image (GIF, PNG, or JPG) requests from your logfile:
SetEnvIf Request_URI "(\.gif|\.png|\.jpg)$" image
CustomLog logs/images.log common env=image
CustomLog logs/access.log common env=!image
Making this configuration change and restarting Apache will exclude requests for images from your primary access log and instead write them to a separate file (images.log), which can periodically be erased to save space.
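Before relying on the expression, it can help to check which requests it actually catches. The following Python sketch approximates the test outside of Apache; the function names are hypothetical, and stripping the query string before matching is an assumption about how the server evaluates Request_URI, so confirm against your own server.

```python
import re

# Same expression as in the SetEnvIf directive.
IMAGE_RE = re.compile(r"(\.gif|\.png|\.jpg)$")

def is_image_request(request_line):
    """Approximate the SetEnvIf test for a request line such as 'GET /a.gif HTTP/1.0'."""
    parts = request_line.split()
    if len(parts) < 2:
        return False
    # Assumption: the match is made against the path without the query string.
    path = parts[1].split("?", 1)[0]
    return IMAGE_RE.search(path) is not None

def split_requests(request_lines):
    """Partition request lines into (pages, images), mirroring the two CustomLog targets."""
    pages, images = [], []
    for line in request_lines:
        (images if is_image_request(line) else pages).append(line)
    return pages, images

if __name__ == "__main__":
    pages, images = split_requests([
        "GET /index.html HTTP/1.0",
        "GET /apache_pb.gif HTTP/1.0",
        "GET /photo.JPG HTTP/1.1",
    ])
    print("pages:", pages)
    print("images:", images)
```

Because the match is case-sensitive, a request for /photo.JPG stays in the primary log, as does anything ending in .jpeg. If that matters on your site, Apache's SetEnvIfNoCase directive or a broader expression may be worth considering.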
Every little bit of code reduction improves the initial performance of a page tag, something important to the accuracy of measurement and, in some instances, to the perceived page load time in the visitor's browser. Just as you should always strive to present visitors with the most optimized page possible, any measurement code used on your site should also be optimized.
Here are some categories you should consider eliminating:
Data about plug-ins
Unless you're deploying some kind of specific program to collect data about plug-ins [Hack #73], you should consider having this code removed.
Data about file downloads
If you're absolutely sure you don't have any downloadable files on your site (no PDFs, EXEs, or DOC/XLS/PPT files that anyone would be requesting), this code is often suitable for removal. If in doubt, keep this code in place for 90 days to be absolutely sure you don't have downloadable files. Once you're sure that no such files exist, consider removing this code.
Commerce data
If you know you're not doing any online commerce and you're not going to be either leveraging commerce variables in some other fashion or assigning a dollar value to a nonrevenue event [Hack #39], removing code for collecting commerce data can often reduce file sizes significantly.
Why Not Just Collect Everything?
Given all you've read so far, you may be tempted to hedge your bets and just collect everything. This is not a bad idea in theory, but reality dictates that there is a cost associated with data collection, one you should be conscious of to prevent surprises down the road. Depending on your data collection and storage strategy, you need to consider the time it takes to reprocess data and the costs of collection and storage.
It takes time to reprocess data.
Given the volumes of data you can collect from web server logs, page tags, commerce applications, CRM databases, and the like, surely you can appreciate the problem posed by a request like "Can I get a summary of the last two years' data?" By taking the time to streamline data collection, you will save yourself time if you need to look back and generate historical reports.
There are costs associated with collection and storage.
While disk I/O and storage get less expensive every year, it's unlikely that they will ever be free, regardless of whether you house the data internally or keep it with a hosted provider. In fact, many hosted analytics vendors put a limit on the amount of time they keep your data (or at least keep it in a readily accessible format). From a financial perspective, ongoing data storage is a clear case of "more is more."
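A rough back-of-the-envelope calculation can make the storage side of this tangible. The Python sketch below is only an illustration: the function name and every figure in it (one million requests a day, 250 bytes per log entry, half of all requests being images) are hypothetical assumptions, so substitute numbers measured on your own site.

```python
def yearly_log_bytes(requests_per_day, avg_line_bytes, retained_fraction=1.0):
    """Estimate bytes of raw log data written per year.

    retained_fraction is the share of requests still logged after
    exclusions (1.0 means every request is logged).
    """
    return round(requests_per_day * avg_line_bytes * 365 * retained_fraction)

if __name__ == "__main__":
    # Hypothetical site: 1,000,000 requests a day at about 250 bytes per entry.
    everything = yearly_log_bytes(1_000_000, 250)
    # Same site, assuming half of all requests are images that are no longer logged.
    trimmed = yearly_log_bytes(1_000_000, 250, retained_fraction=0.5)
    print(f"log everything: {everything / 1e9:.2f} GB per year")
    print(f"exclude images: {trimmed / 1e9:.2f} GB per year")
```

The same arithmetic applies to reprocessing: whatever you keep is also what you have to read back when someone asks for a two-year summary.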
Finally, always remember, when in doubt, ask your vendor. They will undoubtedly have an opinion on what you should collect and how long that information should be kept. Don't necessarily take their word as gospel, but don't discount their opinions without consideration.