Tracking Web Site Traffic

Part 1: An explanation of terms that often cause confusion.

Sooner or later, nearly everyone with a web site wants to know how much traffic the site is getting. Other information, such as how long visitors stay on your site, which pages they visit, and how often they return, will also help you develop a better site. There are various ways of gathering this information, some more accurate than others. Here, we'll take a look at what's in a log file, various methods of analyzing log data, and the problems you will encounter when analyzing the data. But first, let's define some terms that often cause confusion.

Hits vs Pageviews

When someone visits your page, their browser sends a number of requests to your web server. One request is for the HTML file, but separate requests are also sent for each of the other elements that make up the web page - graphics files, audio files, and so on. Each of these requests is called a hit. A hit is a request to the server for a file, not for a page.

As you can see, counting hits is not the same as tracking pageviews. It takes multiple hits to view a single page. A pageview is one access of a page as a whole, so a page made up of an HTML file and three graphics generates four hits but only one pageview.
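
The difference can be sketched in a few lines of Python. This is only an illustration: the function name, the file names, and the rule that a request counts as a page when it ends in .html, .htm, or a slash are all assumptions, not part of any particular analysis tool.

```python
from pathlib import PurePosixPath

# Assumption: only these extensions count as pages; adjust for your own site.
PAGE_EXTENSIONS = {".html", ".htm"}

def count_hits_and_pageviews(requested_paths):
    """Return (hits, pageviews) for a list of requested file paths."""
    # Every request is a hit, whatever kind of file it asks for.
    hits = len(requested_paths)
    # Only requests for a page (or a directory index) count as pageviews.
    pageviews = sum(
        1 for path in requested_paths
        if path.endswith("/")
        or PurePosixPath(path).suffix.lower() in PAGE_EXTENSIONS
    )
    return hits, pageviews

# One visit to a page that contains two images and a sound file.
requests = ["/page.html", "/logo.gif", "/photo.jpg", "/intro.wav"]
print(count_hits_and_pageviews(requests))  # (4, 1)
```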

Methods of Tracking Traffic
Hit Counters
Free hit counters are available to use on your web pages. I recommend against using counters for the following reasons:

Most counters record hits rather than pageviews.
Counters don't provide information on when and how viewers are using your site.
The counter adds weight to the page, increasing its load time.
Visitors care about content, not how much traffic a page gets.
Counters are rarely visually pleasing and tend to distract from the design and content of the page.

Log File Analysis

Nearly all web servers maintain log files - text files that list each request made to the server. Analyzing log data can give you a good idea of where your site visitors are coming from, which pages they are visiting, how long they stay, and which browsers they are using. Before signing on with a hosting company, make sure they offer access to raw log files. Even if you don't need them immediately, sooner or later you'll be glad to have them.

Next, let's take a look at what's in a log file. Then we'll look at methods of extracting and compiling the relevant information with software, online services, and even a few do-it-yourself techniques. Then last, but not least, we'll look at problems with analyzing the data.

Part 2: What's in a Log File?

There are a number of different log file formats but they are all fairly similar. The most common format is called CLF (Common Logfile Format). There are also different types of log files - access, referer, error, and agent are the primary ones.

The following examples were taken from my own log files. You can check with your hosting company to find out the format they provide or just have a look at the raw data. It's not hard to decipher.

Access Log
Analyzing the access log will give you information about who visited your site, which pages they visited, and how long they stayed on the site. This is useful information in determining whether or not your site is working as you intend.

The record below shows the visitor's IP number or hostname, the date and time of the request, the command received from the client, the status code returned, the size of the document transferred, the referring page, and the browser and operating system the visitor was using.

nas-112-52.slc.navinet.net - - [29/Jan/2000:17:17:12 -0500] "GET page.html HTTP/1.1" 200 23443 "http://www.mydomain.com/page.html" "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"
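
If you want to pull these fields out yourself, a regular expression is usually enough. The sketch below is one possible approach, written in Python and fitted to the sample record above. The pattern, the field names, and the function name are assumptions of this example, and your host's format may differ.

```python
import re
from datetime import datetime

# Assumption: one named group per field of the sample record above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_access_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the text fields that are easier to work with as real values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = ('nas-112-52.slc.navinet.net - - [29/Jan/2000:17:17:12 -0500] '
          '"GET page.html HTTP/1.1" 200 23443 '
          '"http://www.mydomain.com/page.html" '
          '"Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"')
record = parse_access_line(sample)
print(record["host"], record["status"], record["size"])
```

Lines that do not fit the pattern return None rather than raising an error, so a damaged line does not stop the whole run.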

Referer Log
The referer log contains referral information - the source that referred the visitor to your site. If the referrer was a search engine, you will also find the keywords that were entered to find your site - very useful information. Here are some example records.

The record below shows that the visitor followed a link from somedomain.com to the index page of my site.

http://www.somedomain.com/page.html -> /

This record shows that the visitor came to my site from a search engine link. Notice the keyword data is included in the record.

http://search.yahoo.com/bin/search?p=design+tips -> /
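
Pulling the keywords out of a record like this takes only a little code. The following Python sketch assumes the "referer -> page" layout shown above and guesses that the search terms are carried in a query parameter named p, q, or query. Both the parameter list and the function names are assumptions of this example.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: query parameter names that commonly hold search terms.
SEARCH_PARAMS = ("p", "q", "query")

def parse_referer_line(line):
    """Split a 'referer -> page' record into its two parts."""
    referer, _, target = line.partition(" -> ")
    return referer.strip(), target.strip()

def search_keywords(referer):
    """Return the list of search words in a referer URL, or [] if none."""
    query = parse_qs(urlsplit(referer).query)
    for name in SEARCH_PARAMS:
        if name in query:
            # parse_qs has already decoded '+' into spaces.
            return query[name][0].split()
    return []

referer, target = parse_referer_line(
    "http://search.yahoo.com/bin/search?p=design+tips -> /"
)
print(target, search_keywords(referer))  # / ['design', 'tips']
```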

Agent Log
This log provides information on which browser and operating system was used to access your site.

Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)

Error Log
The error log provides a record of errors generated by the server and sent back to the client.

The record below shows the type of server, date and time of the error, client identification, explanation of the error code generated by the server, and the path to the file that caused the error.

apache: [Sun Jan 30 10:09:57 2000] [error] [client 195.238.2.162] File does not exist: /u/web/mydomain/favicon.ico
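
A record like this can be broken into its parts in much the same way as an access log record. The Python sketch below is fitted to the sample line above. The pattern and the field names are assumptions of this example, and other servers or versions may write errors differently.

```python
import re
from datetime import datetime

# Assumption: layout matches the sample record shown above.
ERROR_PATTERN = re.compile(
    r'(?P<server>\w+): \[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'\[client (?P<client>[^\]]+)\] (?P<message>.*)'
)

def parse_error_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = ERROR_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%a %b %d %H:%M:%S %Y")
    return record

sample = ("apache: [Sun Jan 30 10:09:57 2000] [error] "
          "[client 195.238.2.162] File does not exist: "
          "/u/web/mydomain/favicon.ico")
error = parse_error_line(sample)
# Split the message into the explanation and the path it refers to.
reason, _, path = error["message"].partition(": ")
print(error["client"], reason, path)
```

Counting how often each path appears in records like this is a quick way to find broken links and missing files.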

As you can see, log files contain a wealth of information about how your visitors are using your site. The next topic we will explore is how to extract the relevant data from the log files and compile it into a usable format.

Part 3: Methods of Analyzing Log Data

There are a number of options available for extracting and compiling the information from a log file. The method you choose should be based on your specific needs. For some sites, the path visitors take and the number of pageviews are essential. Other webmasters may need to know where their visitors are coming from, what browsers they are using, or which keywords were used to find their site in the search engines.

Commercial Software
Commercial software products are available that provide sophisticated analysis with charts, graphs, and reports that make data easily digestible. Commercial products range in price from hundreds to thousands of dollars. Modified versions of commercial products are often provided at no charge by hosting companies.
Some of the most popular log analysis software has been developed by WebTrends. Their software provides in-depth analysis and extensive management capabilities. HitList is also a good bet if your primary focus is general analysis. There are online reviews available to help you find the perfect software.

Freeware and Shareware Scripts
This type of software is usually suitable for smaller sites but may require some modification to meet your specific requirements. If you don't need charts or in-depth analysis, this is a good option. CGI City has compiled an extensive list of freeware and shareware log file analysis scripts.

Online Services

If you don't have CGI capabilities on your site, an online service may be the ideal solution. Services such as HitBox or eXTReMe Tracker offer both free and paid plans. The free services typically require displaying advertising on your page. More online service options are listed in the Traffic Analysis Tools resource on this site.

Part 4: Problems with Analyzing Log Data


Analyzing log data seems straightforward enough. It's just a matter of counting and compiling information from a simple text file. One problem, though, is that many requests never reach the server and so never make it into the access log. Other requests are logged but shouldn't be counted. Then there is the problem of how to identify a unique visitor. Here are some of the challenges you will face in log data analysis.

Caching
Netscape and Internet Explorer both use caching. If a requested page resides in the cache, the browser may retrieve the page from the cache rather than making a new request to the server.


Proxy Servers
Many Internet Service Providers (including AOL) use proxy servers. Proxy servers also cache pages and will check to see if a page is in the cache before sending the request for a page on to the server.

AOL makes extensive use of proxy servers, routing users through various proxy servers according to its own proprietary scheme. To better understand how AOL handles traffic, read the AOL Guide for Webmasters.

IP Address Considerations
Some analysis tools track unique visitors by assuming that each distinct combination of IP address and browser is one unique visitor. Unfortunately, things just aren't that simple. Each time someone connects to the Internet through an ISP, they will most likely be given a different IP address, so one person can show up under several addresses. Also, services such as AOL reassign addresses as users disconnect in order to make the most efficient use of their available IP addresses. This means that two different visitors can appear with the same IP address.
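
To make the heuristic concrete, here is a minimal Python sketch of the IP-plus-browser approach. The function name, the record layout, and the addresses are made up for illustration. Because of the problems described above, the result is only an estimate and can be wrong in either direction.

```python
def count_unique_visitors(records):
    """Count distinct (host, agent) pairs found in the records.

    This is the simple heuristic only. Reassigned addresses and shared
    proxy servers mean it can both overcount and undercount real people.
    """
    return len({(r["host"], r["agent"]) for r in records})

# Hypothetical records: the first two share an address and a browser.
agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"
records = [
    {"host": "10.0.0.1", "agent": agent},
    {"host": "10.0.0.1", "agent": agent},
    {"host": "10.0.0.2", "agent": agent},
]
print(count_unique_visitors(records))  # 2
```

The first two records are counted as one visitor even though, on a service that reassigns addresses, they could belong to two different people.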

Robots
Robots are computer programs that run automatically, visiting sites for the purpose of cataloging Web pages for search engines. Visits by robots are not as significant as pageviews from actual site visitors, so they should not be counted as regular traffic. It is possible to filter out hits from the major robots, but be aware that new ones appear all the time (as well as ones you've never heard of).
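
One common way to filter robots is to look for telltale words in the browser identification string. The Python sketch below shows the idea. The marker list, the function names, and the "ExampleBot" agent are assumptions for illustration, and any real list would need regular updating for the reason given above.

```python
# Assumption: illustrative substrings only, not a complete or current list.
ROBOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def is_robot(agent):
    """Return True if the agent string contains a known robot marker."""
    agent = agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)

def human_records(records):
    """Keep only the records whose agent does not look like a robot."""
    return [r for r in records if not is_robot(r["agent"])]

# Hypothetical records: one browser and one made-up robot.
records = [
    {"agent": "Mozilla/4.0 (compatible; MSIE 5.01; Windows 98)"},
    {"agent": "ExampleBot/1.0"},
]
print(len(human_records(records)))  # 1
```

Robots that identify themselves as ordinary browsers will slip through a filter like this, so treat the filtered numbers as an approximation.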

Frames
As I understand it, Netscape and Internet Explorer do not pass referral information in the same way for pages displayed in frames, so treat referer data for framed pages with caution.