
Official pgLOGd Home

Download the Current Version: 2.3 Release

Latest News:

April 29, 2006 pgLOGd End of Life

I have decided that further development of pgLOGd is not something I have time for, mostly because any time I could spend on pgLOGd I am now devoting to its successor, dbWebLog.

March 17, 2006 pgLOGd does not work with Postgres 8.x

I have received reports that pgLOGd does not work properly against Postgres 8.x. This is not really any big surprise to me, since it has been over 2 years since I actively worked on it. Sorry, sometimes that's the way it goes with a one-man Open Source project: needing to eat (i.e. make real money) takes priority, as does having had two children in the last 3 years. I'm sure the problems are due to the Postgres C API changing between major releases, but I have not had time to confirm this or update pgLOGd.

Description

pgLOGd, simply put, is a program that takes web server (Apache) log entries and sends them to a database. It is called pgLOGd because of the database it was designed to function with, PostgreSQL. PostgreSQL is sometimes abbreviated as pg, this program LOGs entries, and it runs as a daemon (hence the d).

Who should use pgLOGd?

Almost anyone who runs a web server can use pgLOGd (see the requirements); however, sites that need 24x7 up-time on their web servers, or that run many virtual hosts, will benefit the most from pgLOGd.

What does it cost?

Nothing, it's Open Source (basically free). The code is released under the BSD Open Source License.

Requirements

In this version (2.3), the requirements are as follows:

  1. A PostgreSQL database installation.

  2. A Web Server capable of writing its log entries to a file and with a customizable log entry format. pgLOGd was developed with Apache in mind, and that is the recommended web server.

  3. A C compiler. After all, you get the source! The GNU gcc compiler is recommended (comes standard with most U*IX OSes.)

  4. A multi-tasking OS; FreeBSD is my preference and the development platform.

Why PostgreSQL?

Three primary reasons:

  1. The same reason most of us use the things we do: because we like them! I like PostgreSQL, so I use it. If you don't like PostgreSQL, I would dare to wager that you have never used it...

  2. PostgreSQL is fast. Yes, I said fast! As Ford® says: Have you driven PostgreSQL lately? If not, I highly recommend you check it out at PostgreSQL's homepage.

  3. It provides the asynchronous connection and query processing needed to make pgLOGd robust and fast. See the Features and README for details.
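
For the curious, the asynchronous behavior comes from libpq's non-blocking interface. Here is a minimal, self-contained sketch of the idea; this is illustrative C, not pgLOGd's actual source, and the connection string is a placeholder:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGresult *res;

    /* Placeholder connection string; adjust for your setup. */
    PGconn *conn = PQconnectdb("dbname=pglogd");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        return 1;
    }
    PQsetnonblocking(conn, 1);      /* sends to the server will not block */

    /* Dispatch the query and return immediately... */
    PQsendQuery(conn, "SELECT now()");

    /* ...then poll for the result. A real daemon would select() on
       PQsocket(conn) rather than spin like this. */
    while (PQisBusy(conn))
        PQconsumeInput(conn);

    while ((res = PQgetResult(conn)) != NULL) {
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            printf("%s\n", PQgetvalue(res, 0, 0));
        PQclear(res);
    }
    PQfinish(conn);
    return 0;
}

Compile against libpq, e.g. cc sketch.c -I/usr/local/pgsql/include -L/usr/local/pgsql/lib -lpq.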

Where do I get pgLOGd?

Download the Source.

Features

Here I will list the prominent features of pgLOGd and briefly describe each one. For more detail about these features and how they came to be, see the pgLOGd README.

Installation and Configuration

Setting up and operating pgLOGd is very straightforward; however, there are some areas that may leave you asking why. Please see the README for complete details on the reasoning behind pgLOGd.

Assumptions

Never assume! ;-) But, I will assume:

Precompile

Edit the Makefile if your PostgreSQL installation is not in /usr/local/pgsql.
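
For reference, the lines in question look something like this (the same lines the Red Hat notes further down tell you to change); point PGDIR at the base of your PostgreSQL installation:

PGDIR=/usr/local/pgsql
CFLAGS = -I${PGDIR}/include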

PostgreSQL Backend

Create an entry in the /path/to/postgres/data/pg_hba.conf file to allow connections from whatever machine your web server is running on. If your postmaster and web server are on the same box, this step can be skipped.

Make sure your postmaster is running with the -i option (listen for TCP/IP connections); this is only needed if your web server is not running on the same box as the postmaster.
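
The exact pg_hba.conf syntax differs between PostgreSQL releases, so treat this as a sketch only. In the 7.x series a host record takes a database name, an address/mask pair, and an authentication type, so a line like the following (the address is a placeholder for your web server's IP) would allow connections to the pglogd database:

host    pglogd    192.168.1.20    255.255.255.255    trust

Check the comments at the top of your own pg_hba.conf for the exact column layout, and prefer an authenticated method over trust where possible.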

Create a database (the default name is pglogd, but call it what you want) and create the required tables within it. An SQL script, pglogd_tables.sql, is included with the source to do this:

# su - postgres
$ createdb pglogd
$ psql pglogd < pglogd_tables.sql
$ exit

Compiling and Running pgLOGd

To build the pglogd and pglogctl binaries:

# make

Solaris users:

# make -DSOLARIS

OpenBSD users:

# make -DOPENBSD

Copy them wherever you wish:

# cp pglogd /usr/local/sbin/
# cp pglogctl /usr/local/bin/

Edit the pglogd.conf configuration file and copy it wherever you wish.

pgLOGd default options are as follows:

The name of the configuration file is not important; just make sure you pass the same name to pglogd with the -c parameter:

# cp pglogd.conf /usr/local/etc/pglogd.conf

Start pglogd:

# /path/to/binary/pglogd -c /path/to/config/file/pglogd.conf

If there are any problems you will get an error message. Once the daemon has detached there is no controlling terminal, so errors are written to the error log, typically /var/log/pglogd.log unless you changed the location.

Make sure that pgLOGd is always running before Apache starts, and make sure it is shut down after Apache!

Apache Configuration

Add these lines to your httpd.conf file:

LogFormat "%t %T %>s %b %m %v %h \"%U\" \"%{Referer}i\" \"%{User-agent}i\" \"%r\" %l %u" pglogd

Add this entry for each site you want pgLOGd to record log entries for:

CustomLog "/path/to/apache/logs/pglogd_fifo" pglogd

If you changed the location and name of the FIFO file, make the adjustment here as well.
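
Putting the pieces together, a minimal configuration might look like this (the server name and paths are examples only; the LogFormat line appears once in the main server configuration):

LogFormat "%t %T %>s %b %m %v %h \"%U\" \"%{Referer}i\" \"%{User-agent}i\" \"%r\" %l %u" pglogd

<VirtualHost *>
    ServerName www.example.com
    DocumentRoot /www/example
    CustomLog "/path/to/apache/logs/pglogd_fifo" pglogd
</VirtualHost>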

Restart Apache, usually done with:

# /path/to/apache_binary/apachectl graceful

Contributed Notes

Linux RedHat-7.1 notes provided by Calvin Dodge

NOTE: The addition of the -s flag in 2.1beta should solve the startup problem described below.

*** advice for Red Hat users (assuming they installed PostgreSQL from RPMs) ***

Be sure to have postgresql-devel installed.

In the Makefile, change the following lines:

"PGDIR=/usr/local/pgsql" to "PGDIR=/usr"

"CFLAGS = -I${PGDIR}/include" to "CFLAGS = -I/usr/include/pgsql"

Pglogd needs to be running before Apache starts, but after PostgreSQL starts.

Unfortunately, the RedHat scripts start up Apache before PostgreSQL (they have the same start number (85), and "httpd" sorts before "postgresql").

My solution was to:

  1. Edit /etc/init.d/postgresql - changing "chkconfig 85 15" to "chkconfig 80 20" (so it will start before and shutdown after Apache)

  2. chkconfig --del postgresql

  3. chkconfig --add postgresql

  4. chkconfig --level 345 postgresql on

  5. chkconfig --level 0126 postgresql off

  6. Create a pglogd script for the /etc/init.d directory. I copied the apache script and edited it to remove irrelevant information and to provide for "start", "stop", and "restart". Its chkconfig line has the numbers "83 17", so it will start after PostgreSQL and before Apache. (A minimal sketch of such a script follows this list.)

  7. chkconfig --add pglogd

  8. chkconfig --level 345 pglogd on

  9. chkconfig --level 0126 pglogd off
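
For reference, here is a minimal sketch of such an init script. The paths are assumptions based on the installation section above, and the pidof lookup is my addition (pgLOGd itself is stopped with a plain kill -TERM):

#!/bin/sh
#
# pglogd    Feed Apache log entries to PostgreSQL.
#
# chkconfig: 345 83 17
# description: pgLOGd log daemon

case "$1" in
  start)
    /usr/local/sbin/pglogd -s -c /usr/local/etc/pglogd.conf
    ;;
  stop)
    kill -TERM `pidof pglogd`
    ;;
  restart)
    $0 stop
    sleep 1
    $0 start
    ;;
  *)
    echo "Usage: pglogd {start|stop|restart}"
    exit 1
    ;;
esac
exit 0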

Using pgLOGd

Syntax:

pglogd [-s] [-c <config file>]

Option            Description
-s                Start pgLOGd without attempting to make a database connection. This can be useful when you need to start pgLOGd at start-up, before the database is online.
-c <config file>  Specify the full path to the configuration file.

To start:

# /path/to/binary/pglogd [-s] [-c <configuration file>]

To stop:

# kill -TERM [pid]

NOTE: The overflow file only contains log entries that have passed parsing and are not checked again prior to processing. Adding entries to the overflow file manually is not recommended! If you decide you just have to do this, the format requires one pgLOGd style entry per line, each line not more than 16,768 bytes in length, and each line terminated with a single new-line character (character 10.) A better way to get log entries into pgLOGd manually would be to simply dump the file to the FIFO:

# cat [some_file_in_pgLOGd_format] > /path/to/apache/logs/pglogd_fifo

You can do this any time pgLOGd is running, however, I would not recommend doing this with really large files during peak traffic times.

pgLOGd will write errors and other information to its log file, and it should not be too noisy as far as logging goes.

Using pglogctl

pglogctl can be used to generate log files from the pgLOGd database. The output format is the standard Combined Log Format.

Description:

Moves records from the log entries table into the temp entries table, which has indexes. Also facilitates creation of Combined Log Format log files from the temp table.

Syntax:

pglogctl [-o | -m | -d | -p] domain startdate days

Option       Description
-o           Output Combined Log Format from the temp entries table
-m           Move records from the log entries table to the temp entries table
-d           Delete records from the temp entries table
-p           Print values to be used based on command line options
domain       Domain to use, required
startdate    Format: mm/dd/yyyy or mm.dd.yyyy. Start time will be 00:00:00 and end time will be 23:59:59
days         Number of days, including the start date, to process. Value can be negative.
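
For example, here is a hypothetical session that builds a week of Combined Log Format output for one site and then cleans up (the domain and date are invented, and redirecting -o as shown assumes it writes to standard output):

# pglogctl -p www.example.com 01/01/2002 7
# pglogctl -m www.example.com 01/01/2002 7
# pglogctl -o www.example.com 01/01/2002 7 > www.example.com-access_log
# pglogctl -d www.example.com 01/01/2002 7

Running -p first prints what the other commands would operate on, which is a cheap sanity check before moving records around.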

Performance

The initial tests here were done prior to the implementation of overflow logging and asynchronous non-blocking operation. Needless to say, pgLOGd was capable of keeping up then; it should not have any problems now. Note, this test is very old.

Test Results

Tested on a Dual P2-333, 128MB RAM, 9GB SCSI (Seagate Barracuda)

A quick insert test indicated that PostgreSQL can do about 11 to 28 inserts per second, depending on record size and number of indexes. Even at 11 per second, that is still nearly a million hits per day (11 x 86,400 seconds = 950,400), so pglogd should not have too much trouble keeping up even on a heavily loaded server. Also, since the entries table does not have any indexes, the high end of the performance curve is realized, which means pglogd can easily keep up with a very heavy traffic site.

If 28 inserts per second is not enough, then start your postmaster with these two options:

-o -F

That will shut off fsync, and the insert rate jumps to about 800 per second!! What you lose by turning off fsync is durability (the D in PostgreSQL's ACID guarantees): a crash can lose recently committed data. But unless you are getting over 2 million hits in a 24 hour period, you should not have to do that.
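
Combined with the -i option from the installation section, the invocation would look something like this (the data directory path is a placeholder):

# postmaster -i -o -F -D /usr/local/pgsql/data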

README

Here I will attempt to explain why pgLOGd exists and the reasoning and theories behind its madness. If you agree, disagree, or have insight to share, by all means don't hesitate to email me.

Contents

  1. The Problem
  2. Configuration and Rotation
  3. Solutions
  4. Review
  5. The Design of Something Better
  6. A New Log Format
  7. Theory of Operation

The Problem

pgLOGd was, like many things, developed to resolve a problem for which there did not seem to be a complete solution. The problem is with the routine maintenance and configuration of the web server logs, particularly:

There are also several smaller problems that pgLOGd currently addresses or will address in the near future:

Each of these points undoubtedly has a solution available, in one form or another, but it usually requires several command line utilities, CRON jobs, and administrator time to accomplish the tasks. Not to mention the frequency (daily, weekly, monthly) at which the tasks must be performed.

Configuration and Rotation

Configuration is not too bad if a system has already been devised, put in place, and is consistent. For example, names and locations of log files have already been decided, policy for rotation based on allowed disk space and bandwidth has been determined, and access to log files established. This can easily become quite an administrative chore (or nightmare) as the site count increases.

Log file rotation was the primary motivator for the creation of pgLOGd. Some sites will undoubtedly have far more traffic than others, so when do you rotate? On a busy site you might have to rotate the logs hourly, every 12 hours, or daily. On a smaller site, maybe once a week or once a month is sufficient. Also, when do your customers expect the logs to be available? Up to the minute (not possible with standard logs), twice a day, daily, weekly? So now a policy and schedule has to be set up for each site based on the site's traffic and customer expectations.

Solutions

The Apache Group does not provide any built-in solution, but they do provide all kinds of hooks and options (modules, external programs, excellent server configuration, etc.) One solution to the problem is to use the Apache option that lets you pipe the log entries to an external program, like cronolog. Cronolog is a nice little program that will automatically write the log entry to a file based on a date scheme that you can configure. This seemed to be the savior; cronolog appeared to solve all the problems:

But cronolog has some drawbacks:

Review

A quick review of where we are and how we got here:

The primary motivation for writing pgLOGd was a need to rotate Apache log files without stopping the server. Several other solutions were looked at and decided against:

Aside from the two additional processes per Apache child, there is the time needed to start those processes; a rather small cost, but on a busy web server every clock tick counts. Nothing seemed acceptable.

The Design of Something Better

There had to be a better way to rotate log files. A way that was similar to writing directly to files, just as fast as writing to files, but that didn't add a bunch of system processes and overhead into the mix. Well, I couldn't find one, so I wrote one. Enter pgLOGd.

I needed something with the following principles:

pgLOGd was designed and written to adhere to each of these principles, which basically becomes its feature list. With the implementation of the fall-back logging, pgLOGd can handle log entries as fast as the web server can send them, just as if it were logging to a file, even if the database connection cannot keep up or goes completely down!

A New Log Format

One of the first things you will undoubtedly notice when setting up pgLOGd is the requirement of a custom log format instead of the generally accepted Common Log Format. Why is this? Well, first take a look at the Common Log Format:

"%h %l %u %t \"%r\" %>s %b"

Now take a look at all the parameters available for customizing a log entry (these tokens are for the Apache web server; other web servers' formatting will most likely be different):

%...a: Remote IP-address
%...A: Local IP-address
%...B: Bytes sent, excluding HTTP headers.
%...b: Bytes sent, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.
%...c: Connection status when response is completed. 'X' = connection aborted before the response completed. '+' = connection may be kept alive after the response is sent. '-' = connection will be closed after the response is sent.
%...{FOOBAR}e: The contents of the environment variable FOOBAR
%...f: Filename
%...h: Remote host
%...H: The request protocol
%...{Foobar}i: The contents of Foobar: header line(s) in the request sent to the server.
%...l: Remote logname (from identd, if supplied)
%...m: The request method
%...{Foobar}n: The contents of note "Foobar" from another module.
%...{Foobar}o: The contents of Foobar: header line(s) in the reply.
%...p: The canonical Port of the server serving the request
%...P: The process ID of the child that serviced the request.
%...q: The query string (prepended with a ? if a query string exists, otherwise an empty string)
%...r: First line of request
%...s: Status. For requests that got internally redirected, this is the status of the *original* request --- %...>s for the last.
%...t: Time, in common log format time format (standard English format)
%...{format}t: The time, in the form given by format, which should be in strftime(3) format. (potentially localized)
%...T: The time taken to serve the request, in seconds.
%...u: Remote user (from auth; may be bogus if return status (%s) is 401)
%...U: The URL path requested, not including any query string.
%...v: The canonical ServerName of the server serving the request.
%...V: The server name according to the UseCanonicalName setting.

There is quite a bit more useful information available that is not included in the Common Log Format. The first question that comes to mind is: why so little information in the Common Log Format? One can only speculate, but it was most likely designed way back when log files did not grow very fast or very big, and when they were probably read by humans. The current state of the Internet makes reading raw log files almost unheard of, and largely unnecessary given the availability of many free and commercial log analyzer programs.

So why a new format?

So why deviate from the normal, the Common Log Format? The primary reason has to do with parsing the log entry; other reasons include the need to record some of the useful information that is not part of the Common Log Format. First, the parsing problem: take another look at the Common Log Format, followed by a typical line from a log file (the sample line is actually in the popular combined variant, which appends the referer and user-agent):

"%h %l %u %t \"%r\" %>s %b"

10.0.0.1 - - [04/Sep/2001:19:34:59 -0400] "GET /index.html?userdata=badstuff HTTP/1.1" 200 9206 "http://10.0.0.1/index.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

The first parameter is the remote host. Not too bad for parsing, especially if HostnameLookups is off (which it should be for any production or high volume web server), and an IP address is very easy to parse.

The second parameter is the remote logname. This is practically useless since visitors to your web site will probably not even be running an ident server. But worse than being practically useless is the fact that if a person chooses to set up their own ident server they can, and that means they can control the data that would be logged. This is bad. What if I configured my ident server to supply an ident name of:

- - [04/Sep/200

Then the log entry would look something like this:

10.0.0.1 - - [04/Sep/200 - [04/Sep/2001:19:34:59 -0400] "GET /index.html?userdata=badstuff HTTP/1.1" 200 9206 "http://10.0.0.1/index.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

Typically an ident entry is limited to about 15 characters, but that is plenty for me to do damage. Unless a log parser is very intelligent it will get choked up right here and most likely skip the line as invalid. This is because Common Log Format uses space characters as delimiters between the remote host, remote logname, and remote user; but space characters could be part of the ident (and remote user) data. What harm does this do? Well, if I don't want a company tracking me on their web site I could configure my ident server to report a logname similar to the one above and their log analyzer will most likely /dev/null the log entries generated by my activity on their web server. Unless someone digs into the logs by hand, I could bang away at their web site without being noticed, and maybe I'm a cracker trying to gain illegal access...

The third entry, remote user, is just as bad. Again, client-supplied data right up front in the log entry. Remember, this data is not checked anywhere; it is passed straight from the remote client, so it could contain control characters as well.

There may well be an excellent log analyzer out there that can overcome these problems, but the price of that protection is speed. Such an analyzer, if possible, would be very slow, and that is not an option with huge log files on high traffic web servers. There is a better way...

The pgLOGd Log Format Solution

Here is the format required by pgLOGd:

%t %T %>s %b %m %v %h \"%U\" \"%{Referer}i\" \"%{User-agent}i\" \"%r\" %l %u

Notice where the remote logname and remote user are placed. The format is designed to be parsed quickly by a computer and log information goes from the most trusted data to the least trusted (data from the remote client.) The format also includes useful data that cannot be attained by using the Common Log Format, such as the time it took the server to service the request. This can be very useful to the system admin so they know when it is time to invest in faster CPUs, more memory, or faster disk arrays.
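
Reading the format left to right against the token table above, the ordering is visible: server-generated fields first, client-supplied fields last:

%t                  time the request was received
%T                  seconds taken to serve the request
%>s                 final status code
%b                  bytes sent, CLF style
%m                  request method
%v                  canonical ServerName (which virtual site)
%h                  remote host
\"%U\"                URL path requested
\"%{Referer}i\"       Referer header (client-supplied)
\"%{User-agent}i\"    User-agent header (client-supplied)
\"%r\"                first line of the request (client-supplied)
%l                  remote logname from identd (client-influenced)
%u                  remote user (client-influenced)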

Don't worry though, pgLOGd's log format is a superset of the Common Log Format, so a Common Log Format log file can be produced to feed into your favorite log analyzer.

Theory of Operation

For those who are interested in how pgLOGd is intended to work, read on.

The problem seemed impossible, and the solution only came after some digging around through man pages and with some consulting of "Advanced Programming in the UNIX Environment" by W. Richard Stevens (RIP). It seems that U*IX, since the beginning of time (well at least BSD-4.2), has had these nice little things called FIFOs or Named Pipes that allow communication between non-related processes.

A FIFO, once created, looks and smells just like a file and is handled as if it were a file (they use the standard open(), read(), write(), and close() file operators.) The only caveats to the FIFO are:

So it seemed pretty straightforward: make a daemon that listens on a known FIFO and have Apache log all entries to that FIFO. The FIFO looks and acts like a file, so it is fast and Apache treats it just like a regular log file. Almost there! The only other problem was that all entries were being written to the same log file (the FIFO), so for web servers hosting many virtual sites there needed to be a way to determine which entries were for which sites. That solution came with Apache's ability to make custom log file formats, and particularly with the %v parameter. That about solved all the problems, and it was off to code the daemon.
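
A stripped-down illustration of the reader side of that design (plain C with error handling and the daemon machinery omitted; the FIFO path is the placeholder used in the installation section, and the buffer size echoes pgLOGd's 16,768-byte entry limit):

#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define FIFO_PATH "/path/to/apache/logs/pglogd_fifo"

int main(void)
{
    char buf[16768];
    ssize_t n;
    int fd;

    /* Create the FIFO if it does not already exist. */
    if (mkfifo(FIFO_PATH, 0644) == -1 && errno != EEXIST)
        return 1;

    /* A plain open() blocks until a writer (Apache) shows up; a real
       daemon opens O_RDONLY|O_NONBLOCK and waits in select() instead. */
    fd = open(FIFO_PATH, O_RDONLY);
    if (fd == -1)
        return 1;

    /* Each read() returns one or more newline-terminated log entries. */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        write(STDOUT_FILENO, buf, (size_t)n);

    close(fd);
    return 0;
}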

Initially error log entries were to be logged as well, but since error logs are not configurable and can be written to by any number of different modules, there is no fixed format. We will have to wait and see if The Apache Group changes this in the near future. Until then error logs will have to be dealt with in the usual ways, but they should not grow too much unless a site is having serious problems.

pgLOGd Operation

This is a verbal description of what goes on inside pgLOGd. For implementation details, consult the source code.

pgLOGd begins by making sure it can access or create each of the fundamental components it requires:

Currently pgLOGd logs messages and errors directly to a file (instead of to a logging facility like syslog.) No error checking of any kind is done on this log file. Eventually pgLOGd will include options to take advantage of system logging such as syslog.

With these checks complete, pgLOGd next calls fork() to begin the transition to a daemon process. The parent exits and the child process sets itself as the session leader. Next, all inherited file descriptors are closed and three signals are captured:

These signals currently all do the same thing, cause pgLOGd to shut down. In the future, SIGHUP will cause pgLOGd to reread its configuration file, and SIGINT may perform some other tasks such as re-establishing the database connection, write its current state to the log file, etc.
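
In outline, the sequence is the classic daemon skeleton sketched below. This is not pgLOGd's actual source, and I am assuming the three signals are SIGHUP, SIGINT, and SIGTERM, based on this section and the kill -TERM example earlier:

#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

static void on_signal(int sig)
{
    (void)sig;
    got_signal = 1;         /* all three currently mean: shut down */
}

static void become_daemon(void)
{
    pid_t pid = fork();
    int fd;

    if (pid < 0)
        exit(1);
    if (pid > 0)
        exit(0);            /* parent exits... */
    setsid();               /* ...child becomes the session leader */

    /* Close the inherited standard descriptors (pgLOGd closes all
       inherited descriptors). */
    for (fd = 0; fd < 3; fd++)
        close(fd);

    signal(SIGHUP,  on_signal);   /* assumed signal set */
    signal(SIGINT,  on_signal);
    signal(SIGTERM, on_signal);
}

int main(void)
{
    become_daemon();
    while (!got_signal)
        pause();            /* stand-in for the select() loop */
    return 0;
}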

At this point pgLOGd enters a select() and waits for any of several things to happen:

Once an action is detected by select(), pgLOGd enters the state logic. A state machine, as it is typically known, is basically a set of logical states that a program can be in at any given time. Not all states are wired to all other states, so depending on certain conditions, only certain actions are possible. For pgLOGd, one of those states might be "database connection down", and from that state pgLOGd cannot get to the "write entry to database" state.
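
As a toy rendering of that example in C (the state names and helper stubs are invented for illustration; pgLOGd's real state set is larger):

#include <stdbool.h>

enum state { DB_UP, DB_DOWN };

/* Stubs standing in for the real work. */
static bool write_to_db(void)       { return true;  }
static bool try_reconnect(void)     { return false; }
static void write_to_overflow(void) { }

static enum state handle_entry(enum state s)
{
    switch (s) {
    case DB_UP:
        /* the "write entry to database" state is reachable from here */
        return write_to_db() ? DB_UP : DB_DOWN;
    case DB_DOWN:
        /* it is not reachable from here: spill the entry to the
           overflow file and try to bring the connection back up */
        write_to_overflow();
        return try_reconnect() ? DB_UP : DB_DOWN;
    }
    return s;
}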

pgLOGd's states are pretty straight forward, and the primary ones are described here:

If at any time an unrecoverable error is encountered, pgLOGd will write its current state to the log file and exit. Examples of unrecoverable errors are system function call failures such as: malloc(), read(), write(), and select(). Encountering an error caused by any of the aforementioned functions failing is currently not something pgLOGd can recover from.

pgLOGd stays in the select() loop until one of several events happens. Any of these events will cause pgLOGd to shut down:

A signal will cause pgLOGd to perform a "graceful" shut down, meaning it will close all its connections and shut down properly. An unrecoverable error may or may not allow pgLOGd to perform a "graceful" shut down, but it will certainly try.