wwwstat: Process a sequence of NCSA httpd 1.2 access_log files and output
         a summary of the access statistics in a nice HTML format.
         The program oldwwwstat handles NCSA httpd 1.1 and earlier.

Copyright (c) 1994 Regents of the University of California.

==========================================================================
Licensing and Distribution Information:

This software has been developed by Roy Fielding as part of the Arcadia
project at the University of California, Irvine. Wwwstat was originally
based on a multi-server statistics program called fwgstat-0.035 by
Jonathan Magid (jem@sunsite.unc.edu) which, in turn, was heavily based on
xferstats (packaged with the version 17 of the Wuarchive FTP daemon) by
Chris Myers (chris@wugate.wustl.edu).

Those parts of wwwstat derived from fwgstat and xferstats are in the
public domain. As such, this software and all derivations will always be
free to the general public. The latest version of wwwstat can always be
obtained at

     http://www.ics.uci.edu/WebSoft/wwwstat/

or by anonymous ftp from

     ftp://liege.ics.uci.edu/pub/arcadia/wwwstat/

The wwwstat package and those portions developed exclusively at the
University of California are covered by the above copyright notice.

Redistribution and use in source and binary forms are permitted, subject
to the restriction noted below, provided that the above copyright notice
and this paragraph and the following paragraphs are duplicated in all
such forms and that any documentation, advertising materials, and other
materials related to such distribution and use acknowledge that the
software was developed in part by the University of California, Irvine.
The name of the University may not be used to endorse or promote products
derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
Use of this software in any way or in any form, source or binary, is not
allowed in any country which prohibits disclaimers of any implied
warranties of merchantability or fitness for a particular purpose or any
disclaimers of a similar nature.

IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR
DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING
OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION (INCLUDING, BUT NOT
LIMITED TO, LOST PROFITS) EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

==========================================================================
Installation instructions:

 1. Get the wwwstat package from the distribution site (above).
    Normally, it will be in the form of a compressed unix tar file.
    If it has not already been decompressed by your WWW client,
    then do one of:

         % uncompress wwwstat-1.0.tar.Z
         % gunzip wwwstat-1.0.tar.gz

    depending on which compressed version you downloaded.

 2. Move the resulting wwwstat-1.0.tar file to the directory above where
    you want to install wwwstat, cd to that directory, and do

         % tar xvf wwwstat-1.0.tar

    to create the directory ./wwwstat-1.0 containing the following:

         Changes       -- the list of known problems and version information
         README        -- this file
         country-codes -- a table of Internet domains and their country names
         example.html  -- an example of what wwwstat output should look like
         old2newlog    -- A tool for converting httpd 1.1 logs to 1.2 format
         oldwwwstat    -- wwwstat for old NCSA httpd 1.0 or 1.1 servers
         wwwstat       -- the main perl script

    If you are already using NCSA httpd 1.2, delete the oldwwwstat script.

 3. Configure the wwwstat script(s) to match the server configuration and
    default options desired for your site. You will probably need to
    change the following (with any text editor).
    The first line (it should point to your perl executable)

         #!/usr/public/bin/perl

    The following variables set in the first section of code:

         $OutputTitle       # The output document's HTML Title.
         $LastSummary       # The URL of the previous summary period
         $ServerHome        # The server's default home page.
         $countrycodefile   # The location of the country-codes file.
         $access_log        # The location of your default server access log
         $srm_conf          # The location of your server configuration file
         $zcat              # The name of your "uncompress to stdout" program
         $zhandle           # The file extensions that indicate compressed files
         $AppendToLocalhost # If address in log entry is one word (a local host),
                            # what should be appended? (e.g. ".sub.dom.ain")
         $mydom1            # Identify the last two components of your local
         $mydom2            # hostname addresses for special treatment
         $HeadEstimate      # Estimated size of the response headers

    The defaults for options that can be overridden by the command line:

         $LocalFullAddress  # Show full address for hosts in my domain?
         $OthersFullAddress # Show full address for hosts outside my domain?
         $ShowUnresolved    # Show all unresolved addresses?

 4. Make sure the script is executable:

         % chmod 755 wwwstat

 5. That's it. You should now be able to run wwwstat, e.g.

         % wwwstat > results.html

 6. If you have some old (prior to 1.2) logfiles that you want converted
    to the new format, you will also need to customize the old2newlog
    script (most variables are the same as those above).

Usage information can be obtained via the -h option.

==========================================================================
Usage: (NOTE - oldwwwstat has a different set of options)

   wwwstat [-helLoOuUrvx] [-s srmfile] [-i pathname]
           [-a IP_address] [-c code] [-d date] [-t hour] [-n archive_name]
           [-A IP_address] [-C code] [-D date] [-T hour] [-N archive_name]
           [logfile ...] [logfile.gz ...] [logfile.Z ...]

 Display Options:
     -h  Help -- just display this message and quit.
     -e  Display all invalid log entries on STDERR.
     -l  Do display full IP address of clients in my domain.
     -L  Don't (i.e. strip the machine name from local addresses).
     -o  Do display full IP address of clients from other domains.
     -O  Don't (i.e. strip the machine name from non-local addresses).
     -u  Do display IP address from unresolved domain names.
     -U  Don't (i.e. group all "unresolved" addresses under that name).
     -r  Display table of requests by each remote ident or authuser.
     -v  Verbose display (to STDERR) of each log entry processed.
     -x  Display all requests of nonexistent files to STDERR.

 Input Options:
     -s  Get the server directives from the following srm.conf file.
     -i  Include the following file (assumed to be a prior wwwstat output).
     ... Process the sequence of logfiles (compressed if extension (gz|Z|z)).

 Search Options (include in summary only those log entries):
     -a  Containing a hostname/IP address matching the given perl regexp.
     -A  Not containing  "     "     "     "     "     "     "     "
     -c  Containing a server response code matching the given perl regexp.
     -C  Not containing  "     "     "     "     "     "     "     "
     -d  Containing a date ("Feb  2 1994") matching the given perl regexp.
     -D  Not containing  "     "     "     "     "     "     "     "
     -t  Containing an hour ("00" -- "23") matching the given perl regexp.
     -T  Not containing  "     "     "     "     "     "     "     "
     -n  Containing an archive (URL) name matching perl regexp (except +.).
     -N  Not containing  "     "     "     "     "     "     "     "

Note that the Search Options allow for full use of perl regular
expressions (with the exception that the -a, -A, -n and -N options treat
'+' and '.' characters as normal alphabetic characters). The following
description of perl regular expressions is mostly from the Perl Reference
by Johan Vromans:

   Each character matches itself, unless it is one of the special
   characters ^$+?.*()[]{}|\

     ^      at start of pattern, anchors pattern to the beginning of
            the string being matched.
     $      at end of pattern, anchors pattern to the end of the
            string being matched.
     .      matches any arbitrary character, but not a newline.
     (...)  groups a series of pattern elements to a single element.
     +      matches the preceding pattern element one or more times.
     ?      matches zero or one times.
     *      matches zero or more times.
     {N,M}  denotes the minimum N and maximum M match count.
            {N} means exactly N times; {N,} means at least N times.
     [...]  denotes a class of characters to match. [^...] negates
            the class. Inside a class, '-' indicates a range of
            characters.
     (...|...|...) matches one of the alternatives.

   Non-alphanumerics can be escaped from their special meaning using
   a backslash (\). Backslash is also used to form more special
   characters:

     \w     matches alphanumeric, including `_',
     \W     matches non-alphanumeric.
     \b     matches word boundaries,
     \B     matches non-boundaries.
     \s     matches whitespace,
     \S     matches non-whitespace.
     \d     matches numeric,
     \D     matches non-numeric.

 Examples:                                # Summarize:
   wwwstat -a '.com$'                     # only reqs from US commercial orgs.
   wwwstat -a '^simplon.ics.uci.edu$'     # only reqs from that hostname
   wwwstat -A '^simplon.ics.uci.edu$'     # no reqs from that hostname
   wwwstat -c '302'                       # only redirected requests
   wwwstat -c '^5'                        # only reqs resulting in server errors
   wwwstat -C '200'                       # only unsuccessful requests
   wwwstat -d ' [1-7] '                   # only the first week of the month
   wwwstat -d ' ([89]|1[0-4]) '           # only the second week of the month
   wwwstat -d ' (1[5-9]|2[01]) '          # only the third week of the month
   wwwstat -d ' 2[2-8] '                  # only the fourth week of the month
   wwwstat -d ' (29|30|31) '              # only the leftover days of the month
   wwwstat -d 'Feb'                       # only February log entries
   wwwstat -d '1994'                      # only year 1994 log entries
   wwwstat -D 'Apr'                       # no entries from April
   wwwstat -t '00'                        # only reqs between midnight and 1am
   wwwstat -T '12'                        # no reqs between noon and 1pm
   wwwstat -n '.gif$'                     # only those reqs with a gif extension
   wwwstat -n '^/~user/'                  # only those reqs under user's directory
   wwwstat -N '/hidden/'                  # no reqs for files under "hidden" dirs

Depending on your unix shell, some special characters may need to be
escaped on the command line to avoid
shell interpretation.

The intention is that wwwstat be run by a wrapper program as a crontab
entry at midnight, with its output redirected to a temporary file which
can then be moved to the site's summary file. The temporary file is
necessary to avoid erasing your published file during wwwstat's
processing (which would look very odd if someone tried to GET it from
your web). See below for example crontab entries.

One of the nicest things about wwwstat is that it does not make any
changes to or write any files in the server directories. Thus, this
program can be safely run by any user with read access to the httpd
server's access_log and srm.conf files. This allows people to do
specialized summaries of just the things they are interested in.

This program could easily be modified to run as a CGI script, but that is
not recommended for slow processors or heavily utilized servers unless
some effort is made to keep the active log file very small (e.g. by using
the -i option to bootstrap prior output of wwwstat).

==========================================================================
Frequently Asked Questions

1. Why is all that legalese necessary? Isn't wwwstat free?

   The above legalese exists because others have abused the privilege of
   using free software. Because this software was developed by an
   employee of the University of California, we must protect ourselves
   from lawsuits by those who would abuse our legal system for personal
   gain, regardless of any actual damages. To our knowledge, no damage
   has ever been caused by this program.

   Wwwstat is distributed free of charge and will remain so as long as it
   is legally possible. If you are not distributing the program to
   others, there is no need for you to include mention of the University
   of California in its output. However, I would prefer that you leave in
   the reference to wwwstat's distribution site (at the bottom of the
   output) so that others can know where to get the original program.

   Wwwstat is in use around the world. If you have translated the output
   to another language (e.g. German, French, Maori), I encourage you to
   share those translations with others or mail them to me (Roy Fielding)
   so that I can provide special patch files for each language.

2. Will you be developing a version for other httpd's, e.g. CERN,
   Plexus, ...

   Obviously, versions of this program would also be nice for the Plexus
   and CERN servers. However, I found that much of the logic for finding
   file names was just too specific to the NCSA server to justify all the
   other work of making this general. Although this should now be easier
   given the common logfile format, I don't have the time to install all
   those servers just to see how to do it. Feel free to do so yourself.

3. Why use a separate program (oldwwwstat) for prior log formats?
   Why not just use a command-line option or examine the log content?

   Because prior versions of wwwstat required a great deal of special
   file-handling capability to find file size information. Since that is
   no longer needed, it would be a waste to leave it in. Eventually, all
   systems will migrate to the new format (or something like it), and
   having to maintain the old code without being able to test it is just
   plain silly.

4. How do I set up a crontab script to run wwwstat nightly?

   Well, that depends on how your system's crontab works, but on mine
   (a Sun 4 running SunOS 4.1.2) I can edit the crontab with the command

        % crontab -e

   I have the following entry for my nightly script:

        0 0 * * * /dc/ud/www/etc/update-stats

   and the following is my update-stats script (thanks to Hal Varian)
----------------------------------------
#!/bin/sh
/dc/ud/www/bin/wwwstat > /tmp/wwwstats.html
mv -f /tmp/wwwstats.html /dc/ud/www/documentroot/Admin/wwwstats.html
exit
----------------------------------------

   Here is another script submitted by LMD/T/AD Roar Smith:
   (NOTE I have not tried this myself, but it looks good).
---BEGIN wwwstat.cron-------------------
#!/bin/sh -fh
#
# wwwstat.cron
#
# Created:   1994-03-11 by LMD/T/AD Roar Smith (lmdrsm@lmd.ericsson.se)
# Modified:  1994-03-22 by LMD/T/AD Roar Smith (lmdrsm@lmd.ericsson.se)
#            Wrote comments.
# Modified:  1994-04-05 by LMD/T/AD Roar Smith (lmdrsm@lmd.ericsson.se)
#            Bug fix for first day of month.
#
# Copyright: This program is in the Public Domain.
#
#
# Run this script just after midnight on every day of the month.
#
# Example crontab entries:
# --------------------------------------------------
# 1 0 * * * /library/WWW/wwwstat/wwwstat.cron
# --------------------------------------------------
#
{
program="/library/WWW/wwwstat/wwwstat-0.3/wwwstat"
httpd="/usr/local/etc/httpd/httpd"
statdir="/library/WWW/stats"
statfile="wwwstats.html"
tmpfile="/tmp/wwwstats.$$"
accessfile="/var/adm/httpd_access_log"
errorfile="/var/adm/httpd_error_log"
pidfile="/var/adm/httpd.pid"
umask 022

day="`/bin/date +'%d'`"
month="`/bin/date +'%m'`"
# Shifting by the current month number leaves last month's name in $1.
set -- Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov
shift $month
lmonth="$1"

if [ "$day" -eq 1 ]; then
#
# First kill HTTP daemon to avoid interference
# (only if a process id was actually read from the pid file)
#
  httpdpid=`/bin/cat "$pidfile"`
  if [ -n "$httpdpid" ]; then
    /bin/kill -TERM "$httpdpid"
  fi
#
# Copy Access and Error logfiles
#
  cp -p "$accessfile" "$accessfile.$lmonth"
  cp -p "$errorfile" "$errorfile.$lmonth"
  /usr/etc/chown root.daemon "$accessfile.$lmonth"
  /usr/etc/chown root.daemon "$errorfile.$lmonth"
#
# Empty Access and Error logfiles
#
  echo -n >"$accessfile"
  echo -n >"$errorfile"
#
# Restart HTTP daemon
#
  (cd / ; "$httpd")
#
# Run stats program
#
  $program -d "$lmonth" "$accessfile.$lmonth" >"$tmpfile" &&
  /bin/mv "$tmpfile" "$statdir/$lmonth.$statfile" &&
  /usr/etc/chown root.daemon "$statdir/$lmonth.$statfile"
#
# Copy this as current stats file
  /bin/cp -p "$statdir/$lmonth.$statfile" "$statdir/$statfile"
else
#
# Run stats program
#
  $program >"$tmpfile" &&
  /bin/mv "$tmpfile" "$statdir/$statfile" &&
  /usr/etc/chown root.daemon "$statdir/$statfile"
fi
} 2>&1 | mail webmaster 2>&1 1>/dev/null
exit
---END wwwstat.cron---------------------

6. What is the general procedure for monthly resetting of the access_log?

   Again, that depends a great deal on how your site is set up and how
   frequent the accesses are to your server. My site gets about 15000
   requests a month, so I just do the following at the beginning of each
   month (the example is for April):

   % cp httpd/logs/access_log oldlogs/Mar_access_log
   % vi oldlogs/Mar_access_log
        -- then delete all entries that are not from March or that
           are obviously corrupted.
   % wwwstat -e oldlogs/Mar_access_log > /tmp/Mar.wwwstats.html
        -- this creates the full monthly summary for March and at the
           same time (the -e option) lists out any other corrupted
           entries that I may want to delete from the log.
   % mv /tmp/Mar.wwwstats.html documentroot/Admin/Mar.wwwstats.html
        -- to publish the summary on my web.
   % gzip -9 oldlogs/Mar_access_log
        -- use compress if you don't have gzip.
   % cd httpd/logs
   % mv access_log access_log.tmp
        -- if using a standalone type server, send a kill -1 to the
           httpd process so that it creates a new access_log. This is
           not necessary for inetd servers.
   % vi access_log.tmp
        -- then delete all entries from March (should now be left with
           only April entries, since this is repeated monthly).
   % cat access_log >> access_log.tmp
   % mv -f access_log.tmp access_log
        -- the above two commands should be done in quick succession
           to avoid missing a new entry, and then followed by a
           kill -1 to the httpd process if running in standalone.

7. My server load is HUGE and wwwstat runs out of memory, what can I do?

   The only solution I can recommend is to use the -i option and
   bootstrap wwwstat's output every day -- set up a process which purges
   the logfile every night and creates a wwwstat output file which can be
   included the next day, and so on.
   The process would do something like:

   % mv -f httpd/logs/access_log /tmp/access_log
        -- if server is standalone, restart it with
   % kill -1 `cat httpd/logs/httpd.pid`
   % wwwstat -i docroot/stats/current.html /tmp/access_log > /tmp/wwwout
   % mv -f docroot/stats/current.html docroot/stats/previous.html
   % mv -f /tmp/wwwout docroot/stats/current.html
   % cat /tmp/access_log >> archived_log
   % rm -f /tmp/access_log

==========================================================================
If you have any suggestions, bug reports, fixes, or enhancements, send
them to the author, Roy Fielding. See the file Changes for known problems
and complete version information.

This work has been sponsored in part by the Advanced Research Projects
Agency under Grant Number MDA972-91-J-1010. This software does not
necessarily reflect the position or policy of the U.S. Government and no
official endorsement should be inferred. Their support is appreciated.
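
As a closing illustration of the temporary-file approach recommended in
the Usage section, here is a minimal sketch of a wrapper that replaces the
published summary only when the command producing it succeeds. It is not
part of the wwwstat package: the function name publish_summary and the
paths in the usage comment are hypothetical, and it assumes a POSIX shell.

```shell
#!/bin/sh
# Hypothetical helper (not part of wwwstat): run a command, capture its
# output in a temporary file next to the target, and move the result
# into place only if the command exited successfully.
#
# usage: publish_summary TARGET COMMAND [ARGS...]
publish_summary() {
    target=$1
    shift
    tmpfile="$target.tmp.$$"
    if "$@" > "$tmpfile"; then
        # Both files are in the same directory, so mv is a rename and
        # readers never see a half-written summary.
        mv -f "$tmpfile" "$target"
    else
        # Leave the previously published file untouched.
        rm -f "$tmpfile"
        return 1
    fi
}

# Example use from a crontab wrapper (paths are assumptions):
#   publish_summary /dc/ud/www/documentroot/Admin/wwwstats.html \
#       /dc/ud/www/bin/wwwstat
```

Unlike the update-stats script in FAQ 4, this sketch keeps the temporary
file in the target's own directory instead of /tmp, which assumes the user
running it can write there.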