README for analog1.2.6


Introduction

This README describes analog1.2.6. For the latest version of analog, see the analog home page. You may want to try out the latest beta test for version 2 of analog instead of this version. (The beta versions don't seem to contain many bugs, and do contain a lot of extra features).

This program analyses logfiles in both the common log format and NCSA old format from WWW servers. It is designed to be fast on long logfiles and to produce attractive statistics. For more details, see the

For examples of the output see

This program may be freely distributed and modified provided full credit is given to Stephen Turner (sret1@cam.ac.uk), and that this condition is retained. However, please only distribute it intact, including the domains file and this README. No warranty of any sort is given or implied for this program.

What's new?

This section describes the main changes in each version of analog. If you are using analog for the first time, you can skip this section.
1.2.6
Minor bug fix; will only affect those with corrupt logfiles.
1.2.5
Minor bug fix for weekly report.
1.2.4
Patch for Spyglass server logfile format.
1.2.3
A couple of bug fixes (wild subdomains sometimes caused crashes).
-v option now gives the version number.
1.2.2
Patch for proxy servers: http:// not translated to http:/
1.2
Can configure columns in reports to give percentage requests and number of bytes.
Wild subdomains (e.g., *.com).
Nameless subdomains.
Subdomains now listed in alphabetical order.
Proper support for numerical hostnames in HOSTIGNORE, HOSTONLY, SUBDOMAIN and alphabetical sorting.
New BASEURL command allowing statistics to be displayed on other servers.
Output always says how things are sorted.
"Last 7 days" now behaves sensibly with TO.
Filenames containing /../, /./ and // translated.
Header and footer options removed from form (for security reasons).
1.1
Form interface introduced.
ASCII output now possible as well as HTML.
Output file can now be specified in the configuration file.
FROM and TO commands more powerful.
DEBUG and BACKGROUND introduced.
One bug fix: alphabetical sorting doesn't now swap some hostnames.
List of primes included in distribution.
1.0
Only minor changes since 0.94beta.
0.94beta
New configuration variables SEPCHAR and REPORTORDER.
New configuration commands WITHARGS and WITHOUTARGS.
New commandline options +-A and +-x. (Config.: ALL and GENERAL).
Logfile entries with - as the return code are now regarded as successes, not corrupt entries.
Fixed bugs in host report when aliases or numerical hosts are present.
Documentation rewritten.
0.93beta
Approximate hostname counting now possible in fixed memory.
New configuration commands ISPAGE and ISNOTPAGE.
New commandline option -v.
New configuration command WEEKBEGINSON.
Proper error message when memory exceeded.
Program split into several files.
0.92beta
New reports introduced: hostname, full daily, and weekly.
FROM and TO commands introduced.
Header and footer files introduced.
More helpful warning messages.
Ability to read configuration instructions from stdin.
Subdomain commands moved from domains file to configuration file.
Makefile provided.
0.91beta
Configuration file introduced, enabling many new options.
Some bug fixes and speed improvements.
Ability to print "top n" reports (rather than "everything higher than n").
Request report can print only pages.
Ability to try and resolve numerical addresses.
Now less fussy about the format of the domains file.
Logo added.
README converted to HTML.
0.9beta
More speed improvements, and some bug fixes.
Introduced -u option.
Introduced subdomain analysis.
Included "not modified" replies as successes, not redirects.
First public release at 0.9beta3.
0.89beta
Commandline arguments.
Efficiency improvements.
Host count and "last 7 day" statistics.
0.8beta
Initial program, just default options.

Compiling and running the program

If you want to get on with trying out the program straight away, you can leave most of this README until another time. The one thing you need to do is to look at the file analhead.h. It contains the user-settable options, most of which you can leave alone for the moment. You will need to check the values of DOMAINSFILE and LOGFILE, and you will want to change HOSTNAME and HOSTURL.

When you have done that, compile the program by typing

make
(It may take a while as the program is rather big). If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again.

Then just type

analog
to run the program. To send the output to a particular file instead of to the screen, type, e.g.,
analog > outfile.html

Customising analog

Pretty soon you will want to customise the output of analog to your personal preferences. How to do that is explained in this section. There are lots of options, so this section is rather long. (However, you can bypass this section to some extent if you set up a form interface to allow you to choose the main options from a Web page).

There are three ways in which customising can be done. First, the file analhead.h contains various settable parameters. These can be changed before compiling the program. They are explained in that file, so they will not be documented again here.

Secondly, there are commandline options, given after the commandname in the usual way. So, for example, the command

analog +d
uses the +d option to tell analog to include a daily summary in its output. All the commandline options are explained below.

Thirdly, you can tell analog to use a configuration file to read in extra options. This is specified by means of the commandline option +g. For example,

analog +gextra.conf
tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline options). If +G is used instead of +g, the default configuration file as specified in analhead.h is read first, then the one specified after +G. You can specify standard input as the configuration file by the options +g- and +G-.

The configuration file can contain several commands on separate lines; any text after a hash (#) on a line is ignored as a comment. So the following is an example of a configuration file.

DAILY      OFF   # We don't want a daily summary
FULLDAILY  ON    # We want a full daily report instead 
An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space. The various commands which can occur in the configuration file are explained below.
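
The comment and quoting rules above can be sketched in a few lines of Python. This is only an illustration, not analog's own parser: the function name parse_config is hypothetical, and Python's shlex module is assumed to be close enough to analog's rules (it also interprets backslashes, which analog may not).

```python
import shlex

def parse_config(text):
    """Split configuration text into (command, [arguments]) pairs."""
    commands = []
    for line in text.splitlines():
        # comments=True drops everything after an unquoted '#';
        # quoted arguments keep their spaces and hashes.
        tokens = shlex.split(line, comments=True)
        if tokens:
            commands.append((tokens[0], tokens[1:]))
    return commands

example = """\
DAILY      OFF   # We don't want a daily summary
FULLDAILY  ON    # We want a full daily report instead
MARKCHAR   '#'   # quoted, so the hash is an argument
"""
for command, arguments in parse_config(example):
    print(command, arguments)
```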

Why three separate methods to specify options? Although some options can be set in two or even three ways, the three methods have different functions. The file analhead.h contains default values for the variables, which you want always to apply when you don't set anything else. The configuration file is appropriate for options you often use. For example, I run three jobs every night to calculate different sets of statistics from our server; each of these different formats is controlled by a configuration file. Commandline options, on the other hand, are the quickest thing to use if you want to run the program on line, or if you want to override one of the options set in a configuration file.

In order to use the three separate methods together, you have to know which takes precedence over which. The default values in analhead.h have the lowest priority. They are overridden by the values in the configuration file (if two configuration files are read, the one specified by +G takes precedence over the default one). And they in turn are overridden by the commandline arguments. If two contradictory options are specified in one configuration file or on the commandline, the later one is obeyed.
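
As a rough model of this precedence, each source of options can be thought of as a layer applied on top of the previous one. The sketch below is an illustration only; the function name and the dictionary representation are assumptions, not how analog stores its settings.

```python
def effective_options(compiled_defaults, default_config, extra_config, commandline):
    """Merge option layers; later layers override earlier ones."""
    options = {}
    # Order matters: analhead.h defaults first, then the default
    # configuration file, then the file given after +G, then the commandline.
    for layer in (compiled_defaults, default_config, extra_config, commandline):
        options.update(layer)
    return options

settings = effective_options(
    {"DAILY": "ON", "PAGEWIDTH": "80"},   # analhead.h
    {"DAILY": "OFF"},                     # default configuration file
    {"PAGEWIDTH": "65"},                  # file given after +G
    {"DAILY": "ON"},                      # commandline (+d)
)
print(settings)
```

Running analog -v remains the reliable way to see what the program itself will use.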

If this is all a bit confusing, just run

analog -v [other options]
That will tell you what the values of all the variables will be based on analhead.h, the configuration options and the commandline options.

Now we shall look at options which affect one of the reports; after that we shall see options which affect several or all of the reports.


General Summary

Program started at Mon-26-Jun-1995 17:09 local time.
Analysed requests from Thu-28-Jul-1994 20:31 to Mon-26-Jun-1995 17:09 (332.8 days).
Total completed requests: 368 063 (12 872)
Total failed requests: 4 089 (139)
Total redirected requests: 35 277 (1 838)
Average requests per day: 1 219 (2 121)
Number of distinct files requested: 966 (336)
Number of distinct hosts served: 28 589 (1 589)
Number of new hosts served in last 7 days: 1 037
Corrupt logfile entries: 869
Total bytes transferred: 1 852 029 300 (85 752 881)
Average bytes transferred per day: 5 544 997 (12 250 411)
(Figures in parentheses refer to the last 7 days).

The general summary can be turned off by the commandline option -x or the configuration command

GENERAL OFF
or on by +x or GENERAL ON. If the general summary is off, all the `Go To' links in the output are also omitted.

The figures in parentheses refer to the last 7 days. They can be turned on and off with +7 and -7 or the configuration command

LASTSEVEN ON    # or OFF

Counting hosts is something which can take a lot of memory (we have to remember the name of every host that has accessed our server). If memory is a problem, you can turn the host counting off with the commandline option -s or the configuration command

COUNTHOSTS OFF
Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using +ss or
COUNTHOSTS APPROX
and you can specify the amount of memory to be used by
APPROXHOSTSIZE 100000  # or whatever number, in bytes
About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used.
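
To give a feel for how distinct hosts can be counted approximately in a fixed amount of memory, here is a sketch of one well-known technique (a hashed bitmap, sometimes called linear counting). This README does not say which method analog uses, so the algorithm, the function name and the use of MD5 as the hash are all assumptions made for illustration.

```python
import hashlib
import math

def approx_host_count(hosts, size_bytes=100000):
    """Estimate the number of distinct hosts using a fixed-size bitmap."""
    nbits = size_bytes * 8
    bitmap = bytearray(size_bytes)
    for host in hosts:
        # Hash each hostname to one bit; repeated hosts hit the same bit.
        digest = hashlib.md5(host.encode("utf-8")).digest()
        pos = int.from_bytes(digest[:8], "big") % nbits
        bitmap[pos // 8] |= 1 << (pos % 8)
    zero_bits = nbits - sum(bin(byte).count("1") for byte in bitmap)
    if zero_bits == 0:
        return float("inf")  # bitmap is full; no estimate is possible
    # The fraction of bits still zero tells us how many distinct
    # items were inserted, allowing for collisions.
    return -nbits * math.log(zero_bits / nbits)

# 5000 distinct hosts, each seen three times, at 3 bytes per host.
hosts = ["host%d.example.com" % i for i in range(5000)] * 3
estimate = approx_host_count(hosts, size_bytes=15000)
print(round(estimate))
```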

Monthly report

Each + represents 1000 requests, or part thereof.

   month: #reqs
--------  -----
Nov 1994: 24784: +++++++++++++++++++++++++
Dec 1994: 32767: +++++++++++++++++++++++++++++++++
Jan 1995: 37656: ++++++++++++++++++++++++++++++++++++++
Feb 1995: 41666: ++++++++++++++++++++++++++++++++++++++++++
Mar 1995: 45113: ++++++++++++++++++++++++++++++++++++++++++++++
...
The monthly report can be turned on and off with +m and -m or
MONTHLY ON  # or OFF

The value of + can be specified by a number after the +m option; e.g., +m1000 for the above display. If you specify +m0 (or if 0 is the default setting from analhead.h) the program will choose something sensible automatically. The equivalent configuration command is

MONTHLYUNIT 1000   # or 0, or whatever
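
The rule "each + represents 1000 requests, or part thereof" amounts to rounding up. A minimal sketch follows (the function name bar is hypothetical, and the automatic choice made for a unit of 0 is not modelled because it is not described here):

```python
def bar(requests, unit, mark="+"):
    """Return one mark per `unit` requests, counting a part unit as a whole."""
    # -(-a // b) is integer division rounded up.
    return mark * -(-requests // unit)

print("Nov 1994: %5d: %s" % (24784, bar(24784, 1000)))
```

With a unit of 1000, 24784 requests give 25 marks, and a day with no requests gives an empty bar.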

Weekly report

week beg.: #reqs
---------  -----
24/Jul/94:   187: +
31/Jul/94:  3909: +++++++++++++++++
 7/Aug/94:  3550: ++++++++++++++++
14/Aug/94:  3920: +++++++++++++++++
21/Aug/94:  5220: +++++++++++++++++++++
...
This is configured in exactly the same way as the previous report, but with +W and -W in place of +m and -m, and configuration commands WEEKLY and WEEKLYUNIT.

Daily summary

day: #reqs
---  -----
Sun: 29488: ++++++++++++++++++++
Mon: 55680: ++++++++++++++++++++++++++++++++++++++
Tue: 58162: +++++++++++++++++++++++++++++++++++++++
Wed: 59157: ++++++++++++++++++++++++++++++++++++++++
Thu: 61907: ++++++++++++++++++++++++++++++++++++++++++
Fri: 60827: +++++++++++++++++++++++++++++++++++++++++
Sat: 32573: ++++++++++++++++++++++
Again as before, with +d and -d, and DAILY and DAILYUNIT.

Daily report

     date: #reqs
---------  -----
28/Jul/94:    11: +
29/Jul/94:   174: ++++
30/Jul/94:     2: +
31/Jul/94:     0: 
 1/Aug/94:   104: +++
 2/Aug/94:   517: +++++++++++
...
This report has one line for each day from the first to the last request, so it can be very large. The appropriate commands are +D, -D, FULLDAILY and FULLDAILYUNIT.

Hourly summary

hr: #reqs
--  -----
 0: 12245: ++++++++++++++++++++++++++++++++++++++++++
 1: 10163: ++++++++++++++++++++++++++++++++++
 2:  9137: ++++++++++++++++++++++++++++++++
 3:  8899: ++++++++++++++++++++++++++++++
 4:  8070: ++++++++++++++++++++++++++++
 5:  7713: ++++++++++++++++++++++++++
...
+h, -h, HOURLY and HOURLYUNIT are the appropriate commands for this report.

Domain report

  #reqs :  %bytes : domain
--------  --------  ------
 103125 :  46.58% : .uk (United Kingdom)
( 64982):( 35.45%):     cam.ac.uk (University of Cambridge)
( 47138):( 20.55%):       statslab.cam.ac.uk
  49290 :  12.49% : .edu (USA Educational)
  54879 :   9.35% : .com (USA Commercial)
  39812 :   6.97% : (Numerical domains)
  15186 :   2.84% : .de (Germany)
...
This report is turned on and off with the commandline options +o and -o, or the configuration command
DOMAIN ON  # or OFF

The report can be sorted by number of requests, percentage of bytes, or alphabetically. This is achieved on the commandline by adding a letter after the +o option; +or, +ob or +oa respectively. In the configuration file, the command

DOMSORTBY BYREQUESTS  # or BYBYTES, or ALPHABETICAL
can be given.

You can control which columns appear in the domain report by means of the configuration command DOMCOLS. There are four possible columns of data: number of requests due to each domain (R), percentage of requests (r), number of bytes (B) and percentage of bytes (b). A command like

DOMCOLS Rrb
would instruct the program to display columns for number of requests, percentage of requests, and percentage of bytes in that order for each domain. (This can only be done in the configuration file, not on the commandline).

The report can be listed to any required depth by putting a number after the +o, +or, +ob or +oa option. If sorting is by requests or alphabetical, the number is interpreted as the minimum number of requests required to get onto the report. If sorting is by bytes, it is hundredths of a percent of bytes. For example, +oa15 will list all domains with at least 15 requests, sorted alphabetically, whereas +ob15 will list all domains with at least 0.15% of the traffic, sorted by bytes. If a negative number is given, a `top n' report is calculated; so, for example, +or-20 will list the 20 domains with the highest numbers of requests. The number can also be supplied by means of the configuration command

DOMFLOOR 15  # or -20, or whatever
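
The three meanings of the floor can be summarised in a short sketch. It is an illustration of the rules just described, not analog's code; the function name and the row layout are assumptions, and for alphabetical sorting with a negative floor it assumes the `top n' is chosen by number of requests.

```python
def apply_floor(rows, sort_by, floor):
    """rows: list of (name, requests, bytes); sort_by: 'requests', 'bytes' or 'alphabetical'."""
    total_bytes = sum(row[2] for row in rows)
    by_size = (lambda row: -row[2]) if sort_by == "bytes" else (lambda row: -row[1])
    if floor < 0:
        # Negative floor: keep only the top n entries.
        kept = sorted(rows, key=by_size)[:-floor]
    elif sort_by == "bytes":
        # Positive floor, sorting by bytes: hundredths of a percent of all bytes.
        kept = [row for row in rows if 10000 * row[2] >= floor * total_bytes]
    else:
        # Positive floor otherwise: minimum number of requests.
        kept = [row for row in rows if row[1] >= floor]
    if sort_by == "alphabetical":
        return sorted(kept, key=lambda row: row[0])
    return sorted(kept, key=by_size)

rows = [("uk", 100, 5000), ("com", 50, 4000), ("de", 10, 1000)]
print(apply_floor(rows, "alphabetical", 15))  # like +oa15
print(apply_floor(rows, "requests", -2))      # like +or-2
```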

Subdomains can be specified for each domain. This can only be done in the configuration file. The syntax of the command is

SUBDOMAIN subdomain subdomain_name
If the subdomain name has spaces in, it must be enclosed in quotes. The subdomain name can be omitted, indicating a nameless subdomain. For example, to produce the above output, I would include the following lines in the configuration file
SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk

Numerical subdomains (which have most significant part on the left) can also occur. They will look like

131   The Ever-Popular 131 domain
131.111   # Nameless

Also subdomains with wildcards in can occur. The following are examples:

SUBDOMAIN *.edu       # mit.edu, umn.edu  etc.
SUBDOMAIN 131.111.*   # 131.111.1, 131.111.2 etc.
SUBDOMAIN %           # all top-level numerical domains, from 1 to 255
If you specify wild subdomains, you will probably want to set quite a high SUBDOMFLOOR (see below).

There is a command NOTSUBDOMAIN to erase a previously requested subdomain. For example, you can write

NOTSUBDOMAIN *.edu
NOTSUBDOMAIN cam.ac.uk
However, if you request, for example, *.edu, then NOTSUBDOMAIN mit.edu will be unable to override it.

There is a configuration command SUBDOMFLOOR to specify how much traffic or how many requests a subdomain needs to be included in the output. It works the same way as the DOMFLOOR command above, except it can't be negative.

Within a domain, subdomains will be output in alphabetical order.

The domain report relies on having a domains file available, listing which geographical locations correspond to which domains. Which file is to be used as the domains file can be specified by the commandline option -ffilename or the configuration command

DOMAINSFILE domainsfile
The correct format of the domains file is explained in a separate section.

Host report

#reqs: %bytes: host
-----  ------  ----
   10:  0.03%: zlsm03.arcs.ac.at
   11:  0.04%: iki10.boku.ac.at
    1:       : oeh1.boku.ac.at
    2:  0.01%: dopefish.esi.ac.at
    1:       : piassun1.joanneum.ac.at
...
This is much the same as the domain report, with commandline options +S and -S, and configuration commands FULLHOSTS, HOSTSORTBY, HOSTFLOOR and HOSTCOLS. Note that in this report, alphabetical sorting is by domain as most significant part. This report can be very long and slow to sort, and should be used with a high floor if at all.

Directory report

 #reqs: %bytes: directory
------  ------  ---------
237985: 35.40%: /~sret1/
 18596: 17.60%: /~rrw1/
  3574: 11.89%: /~richard/
  2376:  7.92%: /~steve/
 13518:  7.42%: /Dept/
...
Again, this is much the same as the domain report, with commandline options +i and -i, and configuration commands DIRECTORY, DIRSORTBY, DIRFLOOR and DIRCOLS. There is one further variable for this report, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like
 #reqs: %bytes: directory
------  ------  ---------
 43772: 72.06%: /~sret1/backgammon/
173426: 19.93%: /~sret1/backgammon/bitmaps/
 11298:  4.14%: /~sret1/
  5322:  1.71%: /~sret1/backgammon/books/
  2773:  1.22%: /~sret1/images/
   728:  0.66%: /~sret1/backgammon/clubs/
...
This can be specified by the commandline option +l3 or the configuration command
DIRLEVEL 3

Request report

#reqs: %bytes: filename
-----  ------  --------
33980: 23.66%: /~sret1/backgammon/main.html
21162:  2.69%: /~sret1/backgammon/bitmaps/board.xbm
20422:  0.49%: /~sret1/backgammon/bitmaps/dice1.xbm
20187:  0.49%: /~sret1/backgammon/bitmaps/dice2b.xbm
12690:  0.86%: /
 8457:  1.09%: /header.gif
 7198:  0.81%: /~sret1/coldlist.html
 5461:  0.48%: /home.xbm
 3550:  0.32%: /~sret1/home.html
 3370:  0.23%: /~mcmc/html/
...
Commandline options +r and -r, and configuration commands REQUEST, REQSORTBY, REQFLOOR and REQCOLS work analogously to the last three reports. There are also various options to control which files are printed and which are given links.

In fact, if the commandline option +r is used, only pages will be displayed in the report. If you want to list all files, including, for example, graphics, then you should use +R instead; alternatively, if neither +r nor +R is specified on the commandline, the configuration command

REQTYPE PAGES  # or ALL
will control whether pages or all files are listed.

There are three possible modes of linking in the request report; you can link to none of the files, or pages only, or all files. The commandline options for these are -k, +k and +kk respectively; or you can use the configuration command

PAGELINKS OFF   # or ON, or ALL
There is also a related command BASEURL to specify a URL to prepend to the links. For example, if
BASEURL http://www.statslab.cam.ac.uk
were specified, then /~sret1/analog/ would be linked to http://www.statslab.cam.ac.uk/~sret1/analog/. This is useful if you want to display the statistics on a different server than the one they belong to.

You can also specify in the configuration file what should be counted as a page in the requests report (thus giving you complete control over what goes in the report, or what is linked to). At the beginning, the following are `pages': *.html, *.htm, *.shtml, *.shtm, *.html3, *.ht3 and directories (*/). The command

ISPAGE filename
will specify that some other file is a `page'. Filenames can begin with an asterisk (*) as a wild card; so, for example,
ISPAGE *.ps
ISPAGE *.ps.gz
would mean that Postscript files and compressed Postscript files are to be regarded as pages. You can also use
ISNOTPAGE filename
to specify that something which would otherwise be a page is not to be regarded as a page.
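
The page test can be pictured as follows. This is a sketch under assumptions: the function names are hypothetical, and it assumes that when ISPAGE and ISNOTPAGE commands conflict the later one wins, in line with the general rule for contradictory options.

```python
DEFAULT_PAGES = ["*.html", "*.htm", "*.shtml", "*.shtm", "*.html3", "*.ht3", "*/"]

def matches(pattern, filename):
    """A leading asterisk matches any beginning; otherwise names must be equal."""
    if pattern.startswith("*"):
        return filename.endswith(pattern[1:])
    return filename == pattern

def is_page(filename, rules=()):
    """rules: ("ISPAGE" or "ISNOTPAGE", pattern) pairs in configuration order."""
    page = any(matches(pattern, filename) for pattern in DEFAULT_PAGES)
    for command, pattern in rules:
        if matches(pattern, filename):
            page = (command == "ISPAGE")  # assumed: the later command wins
    return page

rules = [("ISPAGE", "*.ps"), ("ISPAGE", "*.ps.gz"), ("ISNOTPAGE", "/~sret1/home.html")]
for name in ["/paper.ps.gz", "/home.xbm", "/~sret1/home.html", "/~mcmc/html/"]:
    print(name, is_page(name, rules))
```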

Analysing parts of the logfile

The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command

analog logfile.log
will use that logfile for its report. You can also write
analog -
to use standard input as the logfile. This is useful in constructing pipes; for example, if you want to analyse an old compressed logfile, you could type
gzcat logfile.old.gz | analog -
(gzcat might be called zcat on some systems). You can also specify which logfile to use in the configuration file by means of a command like
LOGFILE logfile.log   # or ...
LOGFILE stdin         # for standard input

There are various commands which instruct the program only to analyse part of the logfile. These are all configuration commands only; they have no commandline or analhead.h equivalents.

First, you can instruct the program only to take into account certain files. This is done by means of the FILEONLY and FILEIGNORE commands. Asterisks can appear at the end of the filenames specified, as wildcards. For example, the configuration

FILEONLY /~sret1/*
FILEIGNORE /~sret1/backgammon/*
FILEIGNORE /~sret1/home.html
would instruct the program to examine only my files, excluding my backgammon files and home page. (This should not be confused with excluding them from the request report, which still includes them in other reports; this excludes them altogether from the whole analysis).

There are similar commands HOSTONLY and HOSTIGNORE to analyse only the requests from certain sites. Here an asterisk can occur at the start of a hostname. For example,

HOSTIGNORE emu.pmms.cam.ac.uk
HOSTIGNORE *.statslab.cam.ac.uk
would ignore accesses from emu and from my site (including statslab.cam.ac.uk itself). For numerical domains, the asterisk occurs at the end, not the beginning: for example
HOSTIGNORE 131.111.20.*   # ignore unresolved addresses from statslab
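
The matching rules for hostnames can be sketched like this. The function name is hypothetical and the code only models the behaviour described above (leading asterisk for names, trailing asterisk for numerical addresses); it is not taken from analog.

```python
def host_matches(pattern, host):
    """Return True if host is covered by a HOSTONLY/HOSTIGNORE style pattern."""
    if pattern.startswith("*."):
        suffix = pattern[1:]  # e.g. ".statslab.cam.ac.uk"
        # The domain itself counts as well as every host inside it.
        return host.endswith(suffix) or host == suffix[1:]
    if pattern.endswith(".*"):
        # Numerical addresses: most significant part is on the left.
        return host.startswith(pattern[:-1])
    return host == pattern

for host in ["lion.statslab.cam.ac.uk", "statslab.cam.ac.uk", "emu.pmms.cam.ac.uk"]:
    print(host, host_matches("*.statslab.cam.ac.uk", host))
```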

Finally, there are commands to analyse only a subset of the dates in the logfile. The simplest usage is FROM yymmdd and TO yymmdd. So, for example, to analyse only requests in July 1995 I would use the configuration

FROM 950701
TO   950731
Also each of the pairs of digits can be preceded by - and the month and date can be preceded by + to represent time relative to the current date. This allows constructions like
FROM -01-00+01   # from tomorrow last year
TO -00-0131  # to the end of last month (OK even if last month
             # didn't have 31 days)
FROM -00-00-56
TO   -00-00-01  #statistics for the last 8 weeks
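
One way to read these date specifications is sketched below. It is an interpretation of the examples above rather than analog's actual code: the function name is hypothetical, absolute two-digit years are assumed to mean 19yy, and relative parts are assumed to be applied in the order year, month, day.

```python
import calendar
import re
from datetime import date, timedelta

def parse_analog_date(spec, today):
    """Turn a FROM/TO argument such as 950701 or -00-0131 into a date."""
    found = re.fullmatch(r"(-?\d\d)([+-]?\d\d)([+-]?\d\d)", spec)
    if found is None:
        raise ValueError("bad date specification: %r" % spec)
    yy, mm, dd = found.groups()
    # A sign in front of a pair means "relative to today".
    year = today.year + int(yy) if yy[0] == "-" else 1900 + int(yy)
    if mm[0] in "+-":
        months = today.month - 1 + int(mm)
        year += months // 12
        month = months % 12 + 1
    else:
        month = int(mm)
    last_day = calendar.monthrange(year, month)[1]
    if dd[0] in "+-":
        base = date(year, month, min(today.day, last_day))
        return base + timedelta(days=int(dd))
    # An absolute day past the end of the month is pulled back to the last day.
    return date(year, month, min(int(dd), last_day))

today = date(1995, 6, 26)
for spec in ["950701", "-01-00+01", "-00-0131", "-00-00-56"]:
    print(spec, parse_analog_date(spec, today))
```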

If a TO command is given, the figures for the last 7 days refer to the time until then.


Aliases etc.

There are commands to give aliases for filenames and hostnames. The configuration line
FILEALIAS file1 file2
says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately. It also understands that .. means `parent directory,' . means `this directory' and // is the same as /, and translates those filenames to their canonical forms.
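
The built-in translations can be illustrated with a small sketch. It mimics the behaviour described above and is not analog's own routine; the function name is hypothetical.

```python
from urllib.parse import unquote

def canonical_filename(name):
    """Translate escapes, '.', '..', '//' and a trailing index.html."""
    name = unquote(name)  # %7E becomes ~
    is_directory = name.endswith(("/", "/.", "/.."))
    parts = []
    for segment in name.split("/"):
        if segment in ("", "."):
            continue          # '//' and '/./' add nothing
        if segment == "..":
            if parts:
                parts.pop()   # step up to the parent directory
            continue
        parts.append(segment)
    result = "/" + "/".join(parts)
    if is_directory and result != "/":
        result += "/"
    if result.endswith("/index.html"):
        result = result[:-len("index.html")]
    return result

print(canonical_filename("/%7Esret1//analog/./index.html"))
```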

If * is placed at the end of the first entry, then all filenames starting with file1 will be changed to start with file2. So, for example, after the command

FILEALIAS /~sret1/statprog/* /~sret1/analog/
a filename looking like /~sret1/statprog/statprog/stat.c will be understood as /~sret1/analog/statprog/stat.c. (Note that the conversion is done only once for each filename; you don't get /~sret1/analog/analog/stat.c).
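
The once-only prefix replacement can be sketched as follows. The function name and the list-of-pairs representation are assumptions; the sketch also assumes that the first matching alias is the one used.

```python
def apply_file_alias(filename, aliases):
    """aliases: list of (file1, file2) pairs from FILEALIAS lines."""
    for old, new in aliases:
        if old.endswith("*"):
            prefix = old[:-1]
            if filename.startswith(prefix):
                # Replace the prefix once and stop; the result is not re-examined.
                return new + filename[len(prefix):]
        elif filename == old:
            return new
    return filename

aliases = [("/~sret1/statprog/*", "/~sret1/analog/")]
print(apply_file_alias("/~sret1/statprog/statprog/stat.c", aliases))
```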

A pair of related commands is WITHARGS and WITHOUTARGS. Normally any arguments given as part of a URL (after a question mark) are ignored. However, if a configuration command like

WITHARGS /cgi-bin/prog.cgi
is given, then the arguments to that file will form part of the filename. So /cgi-bin/prog.cgi?a and /cgi-bin/prog.cgi?b will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks can again occur at the end of the filename, for example in commands like
WITHARGS /cgi-bin/*
There is also a parallel command WITHOUTARGS; for example,
WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi
would read the arguments for all files in /cgi-bin/ except spam.cgi.

There is a command HOSTALIAS, similar to FILEALIAS, which is useful if your server records local hostnames in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use

HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS www lion.statslab.cam.ac.uk
HOSTALIAS www.statslab.cam.ac.uk lion.statslab.cam.ac.uk
Again, only one conversion is done per host, which is why I need both the second and the third line. There is no wildcard conversion for this command.

One more related command is the command to tell analog to try and look up the names of hosts that appear only as numerical addresses in the logfile; so, for example, 131.111.20.59 will be translated to lion.statslab.cam.ac.uk. Note, however, that not all hosts have names, or we may not be able to discover them. The commandline option to try and translate numerical addresses is +1 (or -1 to turn it off); the equivalent configuration command is

NUMLOOKUP ON  # or OFF
Looking up hostnames is a slow business. If this option is used, be prepared for analog to take a very long time to compile its report.

Layout

The final group of options is those which affect the layout of the output. First, you can choose whether you want ASCII (plain text) or HTML output, using the commandline option +a or -a, or the configuration commands
ASCII ON   # or OFF
HTML OFF   # or ON  (equivalent to previous line)
If you choose ASCII output, some of the other options are ignored, but it should be obvious which ones they will be.

You can select the file for the output to be sent to in the configuration file or on the commandline. So instead of

analog > outfile.html
you can use the commandline option +Ooutfile.html or the configuration command
OUTFILE outfile.html

There is a configuration command REPORTORDER which specifies which order the reports should occur in. The usage is a line like

REPORTORDER hDdWmoSir
This says that the reports should occur in the order hourly summary (h), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i) and request report (r). It is important to include all the above nine letters exactly once each.
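
A quick way to check a REPORTORDER string before using it is sketched below. The helper is hypothetical and simply enforces the "all nine letters exactly once" rule stated above.

```python
REPORT_LETTERS = {
    "h": "hourly summary", "D": "daily report", "d": "daily summary",
    "W": "weekly report", "m": "monthly report", "o": "domain report",
    "S": "host report", "i": "directory report", "r": "request report",
}

def parse_report_order(order):
    """Return the report names in order, or raise if a letter is missing or repeated."""
    if sorted(order) != sorted(REPORT_LETTERS):
        raise ValueError("REPORTORDER must contain each of hDdWmoSir exactly once")
    return [REPORT_LETTERS[letter] for letter in order]

print(parse_report_order("hDdWmoSir"))
```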

There is a commandline option +A to include all reports (particular ones can then be omitted with -d or whatever); likewise -A omits all reports (and particular ones can then be included). The equivalent configuration command is

ALL ON  # or OFF
Note that order is important; for example, +i -A +r will include the request report but not the directory report.
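
The effect of ordering can be seen by processing the options from left to right, as in this sketch. It is a simplified model (the function name is hypothetical, and only the report letters and A are handled), not analog's argument parser.

```python
REPORTS = "hDdWmoSir"

def enabled_reports(args, initial=()):
    """Apply +x / -x report options in order and return the set of enabled reports."""
    enabled = set(initial)
    for arg in args:
        sign, letter = arg[:1], arg[1:2]
        if sign not in ("+", "-") or not letter:
            continue  # not a report option (e.g. a logfile name)
        if letter == "A":
            enabled = set(REPORTS) if sign == "+" else set()
        elif letter in REPORTS:
            if sign == "+":
                enabled.add(letter)
            else:
                enabled.discard(letter)
    return enabled

print(sorted(enabled_reports(["+i", "-A", "+r"])))
```

Here +i is cancelled by the later -A, so only the request report remains.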

The title line of the output page contains three adjustable variables. First, the logo in the top left hand corner can be turned on or off, or any other logo substituted (for example, your organisation's logo). This is accomplished by the commandline arguments -p (no logo: mnemonic, p for picture), +p (use the default logo) and +pURL (use the logo at the given URL). The equivalent configuration commands are

LOGO     ON    # or OFF
LOGOURL  url   # where it is
The organisation name on the title line can be specified by means of the option -nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option -uURL; use -u- if you don't want any link. The equivalent configuration options are
HOSTNAME  name  # must be in quotes if it contains spaces
HOSTURL   URL
HOSTURL   -     # for no link

A header file and footer file can be inserted near the top and bottom of your output. These should be written in HTML, and can contain anything you want. Possible uses include providing information about your organisation or about the way the statistics were calculated, linking to related pages, and no doubt many other things. The commandline options to achieve this are +Hfilename and +Ffilename, or -H and -F to turn them off, and the configuration commands are

HEADERFILE filename
FOOTERFILE none      # if you don't want one

There is also a configuration command to use a certain image as the background to the output page. If you insist on using one it should be small, otherwise people with slow lines won't be able to load your page, and it should not stop people with low resolution monochrome screens being able to read your page. The command is

BACKGROUND none   # preferably!
BACKGROUND URL    # to use that URL

A command like

WEEKBEGINSON SUNDAY
says which day should be regarded as the first day of the week. This is used in the daily report, daily summary and weekly report.

There is a command SEPCHAR to say which character should separate each group of three digits in long numbers. For example,

SEPCHAR ,
will give 123,456,789, whereas
SEPCHAR ' '
will give 123 456 789.
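
The grouping itself is easy to reproduce, for instance when post-processing figures to match analog's output. A minimal sketch, with a hypothetical function name:

```python
def group_digits(number, sepchar):
    """Separate each group of three digits with sepchar."""
    # Python groups with commas; swap in the requested separator.
    return "{:,}".format(number).replace(",", sepchar)

print(group_digits(123456789, ","))
print(group_digits(123456789, " "))
```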

The character which is used in the barcharts in some of the reports can be changed to, for example, a hash by -c# or

MARKCHAR '#'  # put in quotes so that it isn't a comment

Those graphical reports also need to know how many characters wide the output page is. Although a normal page is 80 characters wide, for Web pages about -w65 or

PAGEWIDTH 65
seems to be about right.

Finally, there is a debugging command, for printing (to stderr) problems with your logfile. There are currently three levels of debugging: 0 for no debugging, 1 for printing corrupt logfile lines (prepended by "C:"), and 2 which also prints hosts for which the domain is unknown (prepended by "U:"). The commandline option for level 1 debugging is +V1 (V for verbose) and the configuration command is

DEBUG 1
You can also use commandline options +V for level 1 and -V for level 0.

The domains file

The file domains.tab, to translate internet country codes to locations, should have come with the program. If you haven't got one, you can download one from http://www.statslab.cam.ac.uk/~sret1/analog/analog/domains.tab. It should be in the following format:
ad   Andorra
ae   United Arab Emirates
[...]
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.

Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,

uk  United Kingdom  # God save the Queen!
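
For anyone generating or checking a domains file with a script, the format described above can be read as in the following sketch. The function name is hypothetical, and representing a ? name as None is a choice made for this example.

```python
def parse_domains(text):
    """Map lower-cased domain codes to location names (None if the name is hidden)."""
    domains = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        parts = line.split(None, 1)           # arbitrary space after the code
        code = parts[0].lower()
        name = parts[1].strip() if len(parts) > 1 else ""
        # A name starting with ? means: recognise the domain, print no name.
        domains[code] = None if not name or name.startswith("?") else name
    return domains

sample = """\
ad   Andorra
AE   United Arab Emirates
uk  United Kingdom  # God save the Queen!
zz   ?
"""
print(parse_domains(sample))
```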

The form interface

Another way to run analog is via the form interface; this allows users to select which options they want via a Web page.

To set up the form interface, go to the directory where the analog source code lives, and follow these steps.

  1. In analhead.h, make sure that FORMPROG is set to be the URL of the form processing program, which will be wherever cgi-bin programs live on your server; normally in the cgi-bin directory.
  2. Edit the top of analform.c to indicate where the analog program lives (the program name within your computer's whole filespace, not a URL).
  3. Type make form.
  4. Move the program analform.cgi to the place you specified as the FORMPROG. Make sure it is executable by the server.
  5. Make sure analog itself is executable by the server too, and that domains.tab is readable.
  6. The file analogform.html is the actual form interface; move it to wherever you want people to get at it. Make sure it is world readable.

If the third step above fails to generate a form, you can generate one yourself by means of the command analog -form +Oanalogform.html. You might also want to run this command yourself if you want to supply different default options from normal for the form user: if you run the command with extra commandline or configuration file options, they will be respected in the construction of the form.

If the form doesn't seem to work, check the following:

  1. Look in the server's error_log for clues.
  2. Do other cgi-bin programs work on your server?
  3. Are all the files in the right places, with the right access permissions, as specified above?
  4. Try running the form program by hand. Type setenv QUERY_STRING "xq=1" (C Shell) or export QUERY_STRING="xq=1" (other shells), then run analform from the shell.
  5. If the local time doesn't seem to be correct in the output, you may have to set the timezone yourself in the form. Four lines from the bottom, there is a line like <input type=hidden name="TZ" value=""> For the value you should insert your timezone, in standard format. Usually this looks like your winter timezone name, followed by hours west of Greenwich, followed by your summer timezone name. So the East Coast of the USA should have value="EST5EDT", and Germany value="MEZ-1MESZ".
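
The timezone value described in the last item can be assembled mechanically. The sketch below, in Python, is only an illustration: the function names are made up, and it assumes whole hours and the simple name/offset/name pattern given above.

```python
def tz_value(winter_name, hours_west, summer_name=""):
    """Winter timezone name, hours west of Greenwich, summer timezone name."""
    # Places east of Greenwich have a negative number of hours west.
    return f"{winter_name}{hours_west}{summer_name}"


def hidden_tz_input(value):
    """The hidden form line with the timezone value filled in."""
    return f'<input type=hidden name="TZ" value="{value}">'


east_coast = tz_value("EST", 5, "EDT")   # 5 hours west of Greenwich
germany = tz_value("MEZ", -1, "MESZ")    # 1 hour east, so -1
line = hidden_tz_input(east_coast)
```

The point to notice is the sign convention: the number counts hours west, so European timezones get a minus sign.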

Whenever you change the default options for your analog, it is better, although not essential, to remake the form.

Note that you probably want to restrict access to the form and form program to certain users; if they are world readable there could be considerable load on your server as well as potential confidentiality problems. Exactly how to do this depends on which server you are running. You might also want to specify a default configuration file in analhead.h (which the form user cannot override except where options are provided on the form) or remove some options from the form.


Frequently asked questions

When I try and compile analog, it gives me an error.
Look in the Makefile to see if you need to include any extra libraries.
Also, make sure you are using an ANSI C compiler (like gcc) or have included the right CFLAGS in the Makefile to turn on the ANSI option in a compiler like cc.
Why don't I get such-and-such a report in the output even though I asked for it? (or why don't I get the subdomains I requested in the domain report?)
Maybe the floor for the report is set too high. For example, if you ask for a request report for all pages with at least 50 accesses and no page has that many, no report will be produced. See also the next question.
Why doesn't such-and-such a file appear in the request report?
You've asked for only pages, and this file is not a page. The remedy is to use REQTYPE ALL to list all files in the request report, or ISPAGE to say that this file is a `page.'
Why don't the total requests in the request report add up to the grand total?
See the previous two questions.
Why are directories listed in the request report?
They are not directories; they are pages with the same name as the directory. (They arise because if you ask the server for a directory, it typically returns the page in that directory called index.html.)
Why don't I get the pretty graphs like on the analog home page?
The graphs are only in the new beta test version.
Why are no data on bytes included in my output?
You have some old-style logfile lines that do not include that information, so the analysis cannot be done.
Can I ignore all gifs in the analysis?
No, but it's probably not what you want to do anyway (that would make the total bytes transferred go wrong, for example). If you just want them not to appear in the request report, read about the configuration commands REQTYPE and ISPAGE above.
Can I change the ink colour of the output page?
No, and you can't make the top request blink either. For such a widespread program, it's only appropriate to use true HTML, not things which one company has added on to HTML on its products.
Can you extrapolate from the current month's partial data to produce a prediction for the whole month, based on the rate so far?
No. There are too many problems in trying to produce anything sensible, especially near the beginning of the month. Different days of the week and different times of day cause lots of problems. I would prefer to produce raw accurate data than suspect derived data.
I ran out of memory when trying to run analog. What can I do?
Try using approximate (instead of exact) hostname counting with the +ss option, or turning hostname counting off altogether with -s.
I have some old compressed logfiles and a current logfile. How can I analyse them both together?
The command gzcat or zcat has an option -f to uncompress compressed files and leave other files alone, and then stick them all together. So
gzcat -f log1.gz log2.gz log3 | analog -
is the required command.
My logfile is getting too big. Can analog record some of the data in a convenient format so that it can read it without having to process the whole logfile again, and I can throw away the old logfile?
No, at least not yet. Try compressing your old logfile entries instead using gzip -9 (see the previous question). Because of the high number of repeated strings in the logfile, compression is very efficient.
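
If neither gzcat nor zcat on your system has the -f option mentioned above, its effect can be imitated. The following Python sketch is an assumption-laden stand-in, not part of analog: it handles gzip files only (not files made by compress), decides by looking at the first two bytes of each file, and the function names are made up.

```python
import gzip
import shutil
import sys

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip file


def open_maybe_gzip(path):
    """Open a file for reading bytes, uncompressing it if it is gzipped."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")
    return open(path, "rb")


def concatenate(paths, out):
    """Write the (uncompressed) contents of each file to out, in order."""
    for path in paths:
        with open_maybe_gzip(path) as src:
            shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # Files are taken from the command line and written to standard output.
    concatenate(sys.argv[1:], sys.stdout.buffer)
```

Saved under a name of your choice, the script takes the logfiles as arguments, oldest first, and its output can be piped into analog - just as in the gzcat example.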

Warnings

Lines with filenames longer than a certain limit (which can be specified in analhead.h) are regarded as corrupt lines and discarded.

If we are doing a `top n' report and two entries tie for nth place, only one will be printed.

The reported `running time' is elapsed real time, not CPU time.

If you specify +oa-10 you really do get the top ten domains alphabetically. This is almost certainly useless!

You can sort a report by requests even when you have turned off the request columns. This may confuse your readers.

The behaviour of FILEALIAS a b; FILEALIAS b c is undefined.


Known bugs

The choice of floor for the reports is not done correctly: in particular, if you change the method of sorting from the default, you should also then give the desired floor explicitly. (For the domain report, that includes the SUBDOMFLOOR).

The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate.

Do not alias a file to itself (e.g., FILEALIAS /home.html /home.html) or a host to itself, or it will get lost.
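
Both alias pitfalls (aliasing something to itself, and the chained FILEALIAS a b; FILEALIAS b c mentioned under Warnings) are easy to check for before running analog. The Python sketch below is a hypothetical helper, not part of analog; it assumes you have already extracted the alias commands from your configuration as (from, to) pairs.

```python
def check_aliases(pairs):
    """Return a list of warnings for a list of (from, to) alias pairs."""
    problems = []
    sources = {old for old, _ in pairs}
    for old, new in pairs:
        if old == new:
            # Aliasing something to itself makes it get lost.
            problems.append(f"{old} is aliased to itself")
        elif new in sources:
            # The target is itself aliased: the result is undefined.
            problems.append(f"{old} -> {new} is chained: {new} is itself aliased")
    return problems


chained = check_aliases([("a", "b"), ("b", "c")])
selfish = check_aliases([("/home.html", "/home.html")])
clean = check_aliases([("/index.html", "/")])
```

An empty list means that neither problem was found; the same check applies equally to host aliases.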


Wishlist

I always welcome mail about analog (my e-mail address is sret1@cam.ac.uk): whether it works on your system (yes, even if it does!), bug reports, or requests for new features. If you send me mail, I shall keep you informed about future releases.

I am happy to help people who have trouble with analog, but please read the FAQ and list of known bugs first. Also, you might be able to diagnose the problem yourself if you run

analog -v [your usual options]
which lists the values of all variables. But if you still can't get it to work, ask me. It helps me to find bugs, and to know where the documentation is unclear. When submitting bug reports, please include the version number (which you can find out by the command analog -v).

Acknowledgements

Thanks are due to the author of getstats, Kevin Hughes. We (and other people) have found that getstats gets buggy and very slow when the logfile gets big, but you may notice that my output (although not my program) is based on his to some extent.

Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.


Stephen Turner
University of Cambridge Statistical Laboratory
E-mail: sret1@cam.ac.uk

Page last modified: 08-Jun-96