Harvesting cPanel Raw Access Logs

For those of you using cPanel as the control panel for your websites, a wealth of information is readily available via cPanel’s ‘Raw Access Logs’. The cPanel log files are continually updated with data. Each logged visit includes information about the user agent, IP address, HTTP response, request URI, response size, and a whole lot more. To help you make use of this potentially valuable information, here is a quick tutorial on accessing and interpreting your cPanel raw access logs. It’s a powerful tool to have in your web-dev toolbelt.

Part One: Grab ’em

To grab a copy of your raw access logs, log into cPanel and click on the "Raw Access Logs" icon. Within the Raw Access Log interface, scroll through the list of available log files and download the raw access log(s) of your choice.

Exit cPanel and navigate to your local copy of the raw access log, which should have been downloaded as a zipped/g-zipped file (i.e., .zip or .gz file extension), with a name similar to accesslog_your-domain.com_4_20_2007.gz.

Unzip the file and extract its contents, which should be a single file named your-domain.com. Rename the file by appending a .log or .txt extension to the file name, so that it opens in a text editor. Alternatively, if the extracted file does not end in .com, .net, or a similar extension, no rename is necessary; simply open it via right-click » ‘Open With…’.
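If you would rather skip the manual unzip-and-rename routine, a .gz log can also be read directly with a few lines of Python. This is only a minimal sketch: the function name read_log_lines is made up for illustration, and it assumes the download is a gzip file rather than a .zip archive.

```python
import gzip


def read_log_lines(path):
    """Yield each log entry from a gzip-compressed access log as a string."""
    # "rt" opens the archive in text mode; errors="replace" keeps stray bytes
    # in request URIs or user-agent strings from stopping the read.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Calling read_log_lines("accesslog_your-domain.com_4_20_2007.gz") and looping over the result would then hand you one log entry at a time, no renaming required.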

That’s all there is to it. If you understand how to interpret the contents of your Raw Access Log, you’re solid gold, baby. Otherwise, continue reading for a brief tutorial to get you started with the basics.

Part Two: Use ’em

To examine your raw access log, open the file in a decent text editor. Personally, I prefer WordPad with word-wrap disabled (View menu » Options » Text tab » Word wrap » No wrap). This optimizes viewing via log-entry alignment and data-pattern visibility.

Once the access log is open, you should see many log entries, each one resembling this (taken from an actual perishablepress.com log):

crawl-66-249-65-82.googlebot.com - - [26/Mar/2007:02:31:03 -0400] "GET /press/archives/ HTTP/1.1" 200 60280 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

That’s the money right there. Once you understand what everything means, you can begin harvesting your cPanel raw access logs all day long. Or whenever it might be necessary ;)
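For those who prefer code to squinting, here is a rough sketch of how such an entry might be split into named fields. It assumes the log follows Apache’s "combined" log format, which the sample entry appears to match. The names LOG_PATTERN and parse_entry are invented for this example, and the pattern is a simplification that will not cope with escaped quotes inside a field.

```python
import re

# Assumed layout: host, two placeholder fields, [timestamp], "request",
# status, size, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_entry(line):
    """Return a dict of the entry's fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# The sample entry from this article.
SAMPLE = (
    'crawl-66-249-65-82.googlebot.com - - [26/Mar/2007:02:31:03 -0400] '
    '"GET /press/archives/ HTTP/1.1" 200 60280 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

fields = parse_entry(SAMPLE)
print(fields["host"], fields["status"], fields["size"])
```

Every value comes back as a string, so convert the status and size to integers before doing any arithmetic with them.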

Part Three: Learn ’em

To understand the data provided by raw access logs, let’s analyze the previous log entry. The line begins with the identity of the visiting agent: the hostname resolved from its IP address. If no hostname is available, the IP itself is generally recorded. Often, the IP is embedded in the hostname, whether prepended, appended, or by some other method. In the example, we would infer 66.249.65.82 as the IP for our busy friend, the googlebot. The two hyphens that follow are placeholders for the remote identity and authenticated user name, which are rarely recorded.
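The inference itself is easy to sketch in code. The helper below, ip_from_hostname (a made-up name), looks for four dash-separated numbers in a hostname and joins them with dots. Treat the result as a guess, not a verification: hostname conventions vary, and some providers write the octets in reverse order.

```python
import re

# Four groups of 1-3 digits separated by dashes, not touching other digits.
DASHED_IP = re.compile(r'(?<!\d)(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})(?!\d)')


def ip_from_hostname(hostname):
    """Return an IPv4 address embedded in a hostname as dashed octets, or None."""
    match = DASHED_IP.search(hostname)
    if not match:
        return None
    octets = match.groups()
    # Reject anything that cannot be a valid IPv4 octet.
    if any(int(octet) > 255 for octet in octets):
        return None
    return ".".join(octets)


print(ip_from_hostname("crawl-66-249-65-82.googlebot.com"))  # 66.249.65.82
```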

Next up, the date and time of the agent visit are recorded. In this case, googlebot dropped in for a spell on March 26th, 2007, quite early in the morning. Note the standardized time format, which includes any relevant time-offset information (-0400 in this case).
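Because the time format is standardized, it converts cleanly into a proper date object. Here is a small sketch using Python’s standard datetime module; parse_log_time is a name invented for this example, and the month abbreviation is assumed to be English (the %b directive depends on the locale).

```python
from datetime import datetime, timezone


def parse_log_time(stamp):
    """Convert a log timestamp such as '26/Mar/2007:02:31:03 -0400' to a datetime."""
    # %z consumes the offset, so the result is timezone-aware.
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")


visit = parse_log_time("26/Mar/2007:02:31:03 -0400")
print(visit.isoformat())                           # 2007-03-26T02:31:03-04:00
print(visit.astimezone(timezone.utc).isoformat())  # 2007-03-26T06:31:03+00:00
```

Converting everything to UTC, as in the last line, makes it easier to compare entries against logs kept in a different time zone.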

The next bit of data describes the type of HTTP request issued by the agent. Although GET (e.g., content download) and POST (e.g., comment upload) are by far the most common values for this field, don’t be surprised if something unexpected jumps out at you. After all, there are all kinds of nutballs out there in cyberspace, doing things that average users just don’t do. In our example, we see that googlebot downloaded a copy of the Perishable Press Archive page. Hmmm..

Immediately following the HTTP method, we find the resource associated with the request. Generally, this field displays the relative path to a file, image, or dynamic query. Again, in our example, googlebot hit /press/archives/, which is our main archives page. After the requested resource, the HTTP version is given, which is 1.1 in our example.
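The quoted request is really three pieces separated by spaces: method, resource, and HTTP version. A quick sketch of pulling them apart follows; split_request is a made-up name, and the second sample request is invented purely to show a query string.

```python
from urllib.parse import urlsplit


def split_request(request):
    """Split a logged request into method, path, query, and HTTP version."""
    parts = request.split(" ")
    if len(parts) != 3:
        # Malformed requests do turn up in logs; flag them rather than guess.
        return None
    method, resource, version = parts
    target = urlsplit(resource)
    return {
        "method": method,
        "path": target.path,
        "query": target.query,
        "version": version,
    }


print(split_request("GET /press/archives/ HTTP/1.1"))
# Invented request, included only to show how a dynamic query is separated.
print(split_request("GET /?s=logs HTTP/1.1"))
```

Returning None for anything that is not exactly three parts is a deliberate choice here: odd requests are often the interesting ones, so it pays to set them aside for a closer look.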

The next numerical value, 200 in this case, is a three-digit code that specifies the resulting status of the request. Typical values for this code are 200 (success) and 404 (not found). To learn more about HTTP Status codes, refer to our article, HTTP Error Codes. The unit-less number following the response code indicates the total size (in bytes) of data downloaded for the request (60280 bytes in our example).
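Status codes and sizes get more interesting in bulk. The sketch below tallies how often each status code appears and adds up the bytes sent. The names are made up for illustration, the second log line is invented (it is not from a real log), and the pattern simply grabs the first status/size pair that follows a closing quote.

```python
import re
from collections import Counter

# A closing quote, then the three-digit status, then the size (or "-").
STATUS_AND_SIZE = re.compile(r'" (\d{3}) (\d+|-) ')


def summarize(lines):
    """Return (Counter of status codes, total bytes sent) for the given entries."""
    statuses = Counter()
    total_bytes = 0
    for line in lines:
        match = STATUS_AND_SIZE.search(line)
        if not match:
            continue  # skip lines that do not look like log entries
        status, size = match.groups()
        statuses[status] += 1
        if size != "-":  # "-" means no size was recorded
            total_bytes += int(size)
    return statuses, total_bytes


lines = [
    # The sample entry from this article.
    'crawl-66-249-65-82.googlebot.com - - [26/Mar/2007:02:31:03 -0400] '
    '"GET /press/archives/ HTTP/1.1" 200 60280 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    # Invented entry, for illustration only.
    '203.0.113.7 - - [26/Mar/2007:02:32:10 -0400] '
    '"GET /missing-page/ HTTP/1.1" 404 512 "-" "ExampleBot/1.0"',
]

statuses, total_bytes = summarize(lines)
print(statuses["200"], statuses["404"], total_bytes)  # 1 1 60792
```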

The next portion of our log entry specifies the referrer, if any. Looking closely at our example, we see a null value (i.e., the quoted hyphen, "-") in the referrer field. Apparently, our busy little googlebot dropped in from nowhere (i.e., directly). Although null-referrer values are common for robots, actual URI data is generally available from human users.

Finally, the formal identity and credentials of the user-agent are specified, as provided. Along with identity, user-agent data also may include version number, website resource, and compatibility information. In our example, googlebot identified itself as Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html), which is just swell. Keep in mind, however, that user-agent spoofing is not uncommon, and therefore a healthy dose of skepticism may prove advantageous. To verify a specific agent’s identity, try a Reverse-IP Lookup via a site such as kloth.net.
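The same kind of lookup can be scripted. The sketch below is one common approach, not the only one: resolve the IP to a hostname, check that the hostname belongs to a domain you trust, then resolve that hostname forward and confirm it maps back to the same IP. The function name verify_agent and the suffix list are assumptions made for this example, and the resolvers are passed in as parameters so the logic can be tried without touching the network.

```python
import socket


def verify_agent(ip, allowed_suffixes,
                 reverse=socket.gethostbyaddr,
                 forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves to an allowed domain and back again."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no reverse DNS record, or the lookup failed
    # Suffixes should start with a dot so "evilgooglebot.com" does not pass.
    if not hostname.lower().endswith(tuple(allowed_suffixes)):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The forward lookup must include the address we started with.
    return ip in addresses
```

For the entry in our example, the call would look something like verify_agent("66.249.65.82", [".googlebot.com"]), with the caveat that the answer depends on live DNS records at the time you ask.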

Part Four: Wrap ’em

Well, that’s it for the 101 course. Try a few Google searches for more in-depth information on the art and science of harvesting and analyzing online statistical data, such as that provided by cPanel raw access logs. Of course, statistical analysis is a vast arena, and there are many tools available for interpreting log data and even automating the process. Hopefully, this article has demonstrated how even a casual investigation may reap an abundance of useful information.