
Making a simple traffic analytics page



Big data has become a very important part of doing business, but there's one particular type of data that every business has to deal with, and that's web site traffic data. Web analytics, as it's commonly known, is crucial in positioning a web site because it shows you where your users came from, what kind of device they use, their browser type, and whether they clicked on an ad to get to your site. This is all important data to know if you want to run a successful online business.

In the early days of the web, there used to be a lot of different analytics options, but these days it's safe to say that the vast majority of sites use Google Analytics. This is because it's well known, integrates with other Google products, and is supported out of the box by many web site builders, making it easy to implement. However, I think it has two big flaws. The first is that it's difficult to get actionable data from it. Over the years, many additional features were added, and the result is that the interface is cluttered and can be hard to navigate. Only SEO experts really take full advantage of the Google interface. The second issue is that Google Analytics is client based, meaning it's the user's browser that reports to Google, not the web server, so anyone using an ad blocking plugin or blocking third party cookies won't be reported correctly.

So a while back I decided to write my own PHP based analytics script. It works very well for my business, and I'm sharing it in this blog post. I published the code for free on GitHub here: https://github.com/dendory/tracking. Note that I wrote it for our own use, so it may not fit anyone else's particular needs, but I share it to show an alternative to using Google Analytics.

The database

This system is divided into three parts. The first is a single MySQL database and table, which is where the data is stored. I deployed a new database on our internal RDS instance and made a table named log with the following fields:

+---------+-------------+------+-----+---------+-------+
| Field   | Type        | Null | Key | Default | Extra |
+---------+-------------+------+-----+---------+-------+
| date    | varchar(25) | NO   |     | NULL    |       |
| ip      | varchar(20) | NO   |     | NULL    |       |
| url     | varchar(50) | YES  |     | NULL    |       |
| query   | varchar(50) | YES  |     | NULL    |       |
| ref     | varchar(50) | YES  |     | NULL    |       |
| browser | varchar(20) | YES  |     | NULL    |       |
+---------+-------------+------+-----+---------+-------+
 Any database will do, as long as it has enough resources to hold all the data.
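
For reference, a table matching those fields can be created with a statement along these lines (a sketch derived from the layout above; adjust types and sizes to your needs):

CREATE TABLE log (
    date    VARCHAR(25) NOT NULL,
    ip      VARCHAR(20) NOT NULL,
    url     VARCHAR(50),
    query   VARCHAR(50),
    ref     VARCHAR(50),
    browser VARCHAR(20)
);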

The collection script

The first script handles data collection. It's named tracking.php and gets included in every page across the web site. For any PHP deployment this can easily be done with:

<?php include 'tracking.php'; ?>

The tracking script itself is fairly simple and I'll go through it here. First, there's a check for bots, calls from localhost, and load balancer health checks:

<?php
if($_SERVER['REMOTE_ADDR'] != "127.0.0.1" && strpos($_SERVER['HTTP_USER_AGENT'], 'Bot') === false && strpos($_SERVER['HTTP_USER_AGENT'], 'bot') === false && strpos($_SERVER['HTTP_USER_AGENT'], 'ELB-HealthChecker') === false)
{

Next, the script connects to the database. Needless to say, you will need to put your own database information here:

    $db = new mysqli("localhost", "my_user", "my_password", "tracking") or die("");

After that, it truncates some of the information, both to save space in the database and on the page once we display the data. The bottom line is that I didn't care to know more than the domain of the site our users came from, so the first 30 characters is plenty.

    $ref = (strlen($_SERVER['HTTP_REFERER']) > 29) ? substr($_SERVER['HTTP_REFERER'], 0, 27) . '...' : $_SERVER['HTTP_REFERER'];
    $url = (strlen(strtok($_SERVER["REQUEST_URI"], '?')) > 29) ? substr(strtok($_SERVER["REQUEST_URI"], '?'), 0, 27) . '...' : strtok($_SERVER["REQUEST_URI"], '?');
    $query = (strlen($_SERVER['QUERY_STRING']) > 29) ? substr($_SERVER['QUERY_STRING'], 0, 27) . '...' : $_SERVER['QUERY_STRING'];
    $date = date("Y-m-d H:i:s");

Then it parses the user-agent string for well-known browser fingerprints, with a distinction for mobile devices:

    $browser = "question";
    if(strpos($_SERVER['HTTP_USER_AGENT'], 'Trident') !== false) { $browser = "internet-explorer"; }
    if(strpos($_SERVER['HTTP_USER_AGENT'], 'Opera') !== false) { $browser = "opera"; }
    if(strpos($_SERVER['HTTP_USER_AGENT'], 'Chrome') !== false) { $browser = "chrome"; }
    if(strpos($_SERVER['HTTP_USER_AGENT'], 'Firefox') !== false) { $browser = "firefox"; }
    if(strpos($_SERVER['HTTP_USER_AGENT'], 'Safari') !== false) { $browser = "safari"; }
    if(strpos($_SERVER['HTTP_USER_AGENT'], 'Edge') !== false) { $browser = "edge"; }
    if(strpos($_SERVER['HTTP_USER_AGENT'], 'Mobile') !== false || strpos($_SERVER['HTTP_USER_AGENT'], 'Android') !== false) { $browser = "mobile"; }

Once we have all of our variables, we insert them into the database:

    if($stmt = $db->prepare("INSERT INTO log (date, ip, url, query, ref, browser) VALUES (?, ?, ?, ?, ?, ?)"))
    {
        $stmt->bind_param("ssssss", $date, $_SERVER['REMOTE_ADDR'], $url, $query, $ref, $browser);
        $stmt->execute();
        $stmt->close();
    }

And finally, we close the connection.

    $db->close();
}
?>

That's all there is to it. Note that there isn't any error handling, because this code runs on public web pages and there's no point in showing errors to the user. If data stops coming into the database, it'll be pretty obvious something happened. Also note that the browser parsing code could probably be more precise, but I don't particularly care about detecting every user's browser with 100% accuracy (which isn't possible anyway), so a general idea is good enough here.

The end result

The last part of this system is the web page that displays the results, making use of the Bootstrap framework for prettiness and Font Awesome for the icons. I won't go through the entire script since it's pretty long, but you can look through index.php to see what's happening. I'll just point out a few useful parts.

First, an HTML table is built to display each entry in the log. The script connects to the same database and fetches the data. As it displays each row, it also fills two variables:

$src = array('google' => 0, 'twitter' => 0, 'facebook' => 0, 'linkedin' => 0, 'other' => 0, 'internal' => 0, 'organic' => 0);
$dst = array('the-voip-pack' => 0, 'the-data-pack' => 0, 'the-cloud-audit-pack' => 0, 'the-devops-pack' => 0, 'the-startup-pack' => 0, 'main' => 0, 'other' => 0);

These arrays contain the traffic sources and destination page names that I care about. If you use this script, you would likely want to adjust at least the destinations.
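
To give an idea of how those counters get filled, here is a minimal sketch of the fetch-and-classify loop. This is not the exact code from index.php; the internal domain example.com is just a placeholder, and how each bucket (for instance organic) is defined is up to you:

$db = new mysqli("localhost", "my_user", "my_password", "tracking");
$res = $db->query("SELECT date, ip, url, query, ref, browser FROM log ORDER BY date DESC");
while($result = $res->fetch_assoc())
{
    // Bucket the referrer into one of the $src counters
    if($result['ref'] == "") { $src['organic'] += 1; }
    elseif(strpos($result['ref'], 'example.com') !== false) { $src['internal'] += 1; }
    elseif(strpos($result['ref'], 'google') !== false) { $src['google'] += 1; }
    elseif(strpos($result['ref'], 'facebook') !== false) { $src['facebook'] += 1; }
    elseif(strpos($result['ref'], 'linkedin') !== false) { $src['linkedin'] += 1; }
    elseif(strpos($result['ref'], 'twitter') !== false || strpos($result['ref'], 't.co') !== false) { $src['twitter'] += 1; }
    else { $src['other'] += 1; }

    // Bucket the landing page into one of the $dst counters
    $page = 'other';
    if($result['url'] == '/') { $page = 'main'; }
    foreach(array_keys($dst) as $key)
    {
        if($key != 'main' && $key != 'other' && strpos($result['url'], $key) !== false) { $page = $key; }
    }
    $dst[$page] += 1;

    // ... echo the HTML table row for this entry here ...
}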

To find out whether someone came to the site from an ad, we look for the well known query string parameters that the popular advertising platforms Google, LinkedIn and Facebook set:

    if(strpos($result['query'], 'gclid=') !== false) { echo "<i class='fa fa-google'></i>"; }
    if(strpos($result['query'], 'lrsc=') !== false) { echo "<i class='fa fa-linkedin'></i>"; }
    if(strpos($result['query'], 'fbclid=') !== false) { echo "<i class='fa fa-facebook'></i>"; }

Once the table is filled, we use Google's free charting API to create two pie charts based on the two arrays above, one breaking down traffic sources and one breaking down landing pages.
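
One way to draw such a chart with the Google Charts loader looks like this (a sketch, not necessarily the exact approach in index.php; the $dst chart works the same way):

<script src="https://www.gstatic.com/charts/loader.js"></script>
<div id="src_chart"></div>
<script>
// Draw the traffic sources pie chart from the $src counters
google.charts.load('current', {packages: ['corechart']});
google.charts.setOnLoadCallback(function()
{
    var data = google.visualization.arrayToDataTable([
        ['Source', 'Visits'],
        <?php foreach($src as $name => $count) { echo "['" . $name . "', " . $count . "],"; } ?>
    ]);
    var chart = new google.visualization.PieChart(document.getElementById('src_chart'));
    chart.draw(data, {title: 'Traffic sources'});
});
</script>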

Now, what are some of the insights I can gain from this data? Here's a good example. One visitor's entries caught my eye because that IP address ended up buying one of our packs. By doing a quick search on the data for that address, I can easily see exactly the path the person took. First, they clicked on a Google ad at 3am on Thursday, landed on the main page, then clicked on our Cloud Audit pack. The next day they did a Google search for our site, landed on the main page again, went back to the Cloud Audit page, and ended up purchasing the product.
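
A search like that can be done directly against the table, for example (with a placeholder address):

SELECT date, url, query, ref, browser FROM log WHERE ip = '203.0.113.42' ORDER BY date;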

Things missing

This is a fairly simple project that I completed over a weekend, and there are things it can't do. One of the additions that I will likely make is some type of archival system. Right now data just accumulates, and the page will take longer and longer to load as the data grows. This can easily be remedied with a script that runs once a month and moves older data out of the main table.
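
For example, a monthly job could move anything older than a month into an archive table (a sketch; log_archive is a hypothetical copy of the log table's structure):

-- Move rows older than one month into the archive table, then remove them from the live table
INSERT INTO log_archive SELECT * FROM log WHERE date < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 MONTH), '%Y-%m-%d %H:%i:%s');
DELETE FROM log WHERE date < DATE_FORMAT(DATE_SUB(NOW(), INTERVAL 1 MONTH), '%Y-%m-%d %H:%i:%s');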

Another useful insight would be geo-location. There are free databases out there that map IP addresses to physical locations, down to the city and state. I could easily add a column to the table and have a script that runs daily to look up each IP address and fill in a likely location.
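
The schema part of that change is trivial; the daily lookup script itself would depend on whichever IP-to-location database is used:

ALTER TABLE log ADD COLUMN location VARCHAR(50) NULL;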

The bottom line is that this system works for us and is extendable. I've never been one to use a popular option just because it's popular, and I think that's the basis of what it means to be an innovator.