AWStats and CloudFront Logs

How does one use AWStats to analyze AWS CloudFront logs?

AWStats is a widely used tool to analyze website logs, but unfortunately there is not much information available on how to use it with AWS CloudFront's standard logs. The AWStats documentation seems to assume that you are using AWStats on the actual web server generating the logs, or at least that you have access to normal web server logs. That isn't the case when using CloudFront. I was able to find a single blog post from 2011 documenting how to process CloudFront logs with AWStats, and although that post was helpful, I believe more needs to be said about how to shoehorn CloudFront logs into something AWStats can use. This blog post documents what I learned while getting this to work.

Download CloudFront Logs

Obviously the first step is to obtain the log files. Standard CloudFront logging writes the log files to an S3 bucket. I use the aws s3 sync command to download only new log files to my computer. This accomplishes the same task as the Python code in that other blog post I found.

aws s3 sync <s3 bucket> <local directory> --exclude "*" --include "*/*2021-*"

The log data is stored in many small gzip files. My use of the --exclude and --include parameters limits the syncing to files from this year.
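For example, with a hypothetical bucket name (the local directory matches the one used in the script below), the full command might look like this:

aws s3 sync s3://example-cloudfront-logs /local/DATA/aws_cloudfront_logs --exclude "*" --include "*/*2021-*"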

Combine CloudFront Logs

The second step is to combine all of the data into one log file. One could try to combine them with a command like zcat *.gz > /tmp/combined_logs.log but there are too many little files for that to work:

$ gzip *.gz > /tmp/combined_logs.log
-bash: /usr/bin/gzip: Argument list too long

Instead I wrote a bash script that reads each gzip file one at a time and appends it to a single merged log file. The script also does some data cleaning to remove unneeded columns and comment lines. It sorts the logs as well, since properly sorted logs are necessary for AWStats to work. The sorting could have been done with logresolvemerge.pl, but I chose not to do it that way.
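For reference, a sketch of the logresolvemerge.pl approach would look something like the following; the path to the AWStats tools directory varies by distribution, so treat it as an assumption:

perl /usr/share/awstats/tools/logresolvemerge.pl /tmp/log1.log /tmp/log2.log > /tmp/merged_logs.log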

My actual script is below. There's some extra nonsense in there to mark each gzip file as "done" so I don't read the same log messages over and over. The result of this script could have been accomplished in other ways. The critical outcome is that everything is combined into one file and that the logs are sorted by timestamp.

#!/bin/bash
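# Usage: merge_cloudfront_logs.sh <site>   (the script name is illustrative)
# $1 selects the subdirectory of log files and names the merged output file.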

LOG_DIR="/local/DATA/aws_cloudfront_logs"

# create an empty output file
TEMP_OUTPUT="/tmp/temp_merged.${1}.log"
if [ -f "$TEMP_OUTPUT" ]
then
    rm "$TEMP_OUTPUT"
fi
touch "$TEMP_OUTPUT"

# loop through all gzipped files
LOG_FILES="$LOG_DIR/${1}/*.gz"
for f in $LOG_FILES
do
    # append contents to merged output file
    # if log file is not flagged as done
    f2="$f.done"
    if [ ! -f "$f2" ]
    then
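        # strip CloudFront comment lines, join date and time into one
        # field, and keep only the columns AWStats needs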
        zcat "$f" | grep -v "^#" | awk 'BEGIN{ FS="\t"; OFS="\t" } { print $1 " " $2, $4, $5, $6, $8, $9, $10, $11 }' >> "$TEMP_OUTPUT"
        touch "$f2"
    fi
done

# sort the output and remove the temp file
sort "$TEMP_OUTPUT" > "/tmp/merged_logs.${1}.log"
rm "$TEMP_OUTPUT"

AWStats Configuration

The most confusing part of the AWStats instructions is the configuration. The awstats_configure.pl script seems to want to extract information from the web server's configuration, but without an actual web server, that approach won't work. There aren't many hints as to how to create a configuration file manually.

Below is the actual AWStats configuration file I am using for this website, saved to /var/www/cgi-bin/awstats.root.conf. The file needs to be in the same location as awstats.pl. When I first installed AWStats, the relevant files were found in /usr/share/awstats/wwwroot/cgi-bin/, but after setting up an Apache web server, they are now in /var/www/cgi-bin/.

LogFile=/tmp/merged_logs.root.log
LogType=W
LogFormat="%time2 %bytesd %host %method %url %code %referer %ua"
LogSeparator="\t"
DNSLookup=0
DynamicDNSLookup=1
DirData="/var/www/data/root"
DirCgi="/cgi-bin"
DirIcons="/icon"
SiteDomain="ixora.io"
HostAliases="localhost 127.0.0.1 REGEX[^.*\.ixora\.io$]"
AllowToUpdateStatsFromBrowser=0
AllowFullYearView=1
LoadPlugin="geoip GEOIP_STANDARD /usr/share/GeoIP/GeoIP.dat"
LoadPlugin="tooltips"
LoadPlugin="graphgooglechartapi"

There are several things in this file that need to be explained:

  • The LogFormat value is shorter because of the awk filtering I did in the bash script. If I didn't do that, there would be about 20 extra columns that I'd have to tell AWStats to ignore using the %other identifier. If your log files are different from mine, refer to the LogFormat documentation for help.

  • The %time2 field requires the concatenation of the first two CloudFront fields, date and time. That explains the $1 " " $2 bit in my awk program and the use of sed in that other blog post (see the example after this list).

  • I had to change the permissions of /var/www/data/ so that my user could write to it. By default, that directory is owned by root.

  • I installed and configured MaxMind's GeoIP database so AWStats could map IP addresses to countries. I had to explicitly tell AWStats to load and use that plugin.

  • The other two plugins provide extra information and make the charts easier to use. Each plugin is loaded with its own LoadPlugin line. The plugin names must match the filenames in /usr/share/awstats/plugins/, not the names shown in the list of AWStats plugins.
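To make the %time2 point concrete, here is a hypothetical CloudFront record (only the first eleven tab-separated fields are shown, with made-up values) next to the line the awk program produces from it:

# raw CloudFront record (tab-separated)
2021-05-01  12:34:56  IAD89-C1  2048  203.0.113.7  GET  d111111abcdef8.cloudfront.net  /index.html  200  -  Mozilla/5.0

# merged line: date and time joined by a space, extra columns dropped
2021-05-01 12:34:56  2048  203.0.113.7  GET  /index.html  200  -  Mozilla/5.0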

If I weren't viewing the AWStats reports through my browser, this configuration would be simpler. The most important things to keep in mind are that the configuration file needs to be in the same directory as awstats.pl and that the user running that script must have write access to the DirData directory.

AWStats Data Compilation

If the configuration is done correctly, AWStats can be run with this command:

$ /var/www/cgi-bin/awstats.pl -config=root -update
Create/Update database for config "/var/www/cgi-bin/awstats.root.conf" by AWStats version 7.8 (build 20200416)
From data in log file "/tmp/merged_logs.root.log"...
Phase 1 : First bypass old records, searching new record...
Direct access to last remembered record is out of file.
So searching it from beginning of log file...
Phase 2 : Now process new records (Flush history on disk after 20000 hosts)...
Jumped lines in file: 0
Parsed lines in file: 767
Found 0 dropped records,
Found 0 comments,
Found 0 blank records,
Found 0 corrupted records,
Found 0 old records,
Found 767 new qualified records.

This puts some text data files in /var/www/data/root/. That data is used to create the actual reports.
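AWStats names its history files after the month, year, and configuration, so as a rough illustration (the exact listing depends on your data) the directory ends up looking something like this:

$ ls /var/www/data/root/
awstats042021.root.txt  awstats052021.root.txt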

AWStats Report Generation

The final step is to generate the actual reports. That can be done from the terminal using commands found in the Building and Reading Reports documentation, or through the CGI script by visiting https://localhost/cgi-bin/awstats.pl?config=root in a browser.
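For the terminal route, awstats.pl can write a single static HTML report to standard output using its -output and -staticlinks flags; the output path here is only an example:

$ /var/www/cgi-bin/awstats.pl -config=root -output -staticlinks > /tmp/awstats.root.html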

AWStats reports don't provide everything I ever wanted to know about my web traffic, but then again, neither did Google Analytics. At least with this approach the data is under my control and I can answer any question I want to by doing a little bit of programming.
