Wednesday, 9 September 2009
Multipart tutorial: a centralized mail log parser
Anyone who attended the 4. Mailserver-Konferenz 2009 in Berlin and listened to my talk probably knows what I'm talking about. Everyone else can get an idea from the slides of my presentation, The Big Picture: Der OSS-Mailcluster von Raiffeisen Online.
- create an astonishingly fast central realtime log search for our customer support team
- provide the same tool, with a prettier presentation and strong security measures and filters, to our VIP customers for realtime mail log access in their web backend
- and finally, just to prove how amazing it is: realtime display of all rejected, quarantined and delivered mail in our customers' webmail backend - for each of our 30,000-40,000 mailboxes
One of the most important components of this system is its central log parser. My strategy was as follows:
- each Postfix and Amavis instance sends its syslog messages to a central syslog server
- this central syslog server writes all log lines (Postfix and Amavis instances mixed) to a pipe that is read by a log parsing daemon
- traditional log files are still being directly written to disk and securely stored as required by your government (especially here in Italy those laws are subject to regular changes)
- aggregation happens in "real time": the daemon builds one object in memory for each mail, adds the additional information it learns line by line, and stores the object in the database once it is sure it has received all log lines related to that message
- please note that there could be hundreds of thousands of lines between the first and the last line related to a mail, and events that occurred later could appear earlier, as there are multiple cores on multiple hosts working on one single mail
- the most complicated part of this challenge was writing a daemon with a "let's wait to see if the missing line will arrive"-logic
- this daemon should be able to catch most errors and also have a bulletproof garbage collection
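To make the "wait for the missing line" idea and the garbage collection more concrete, here is a minimal Python sketch. It is not my PHP implementation; the names (MailRecord, Aggregator, grace, max_age) and the timeout values are assumptions made up for illustration. A record is stored only after it has been marked complete and no further line has arrived for a grace period; records that never complete are flushed once they have been silent for too long.

```python
import time
from dataclasses import dataclass, field


@dataclass
class MailRecord:
    """Everything learned so far about one mail (hypothetical structure)."""
    queue_id: str
    fields: dict = field(default_factory=dict)
    complete: bool = False   # True once a "final" log line has been seen
    last_seen: float = 0.0   # arrival time of the most recent related line


class Aggregator:
    def __init__(self, store, grace=5.0, max_age=3600.0):
        self.store = store        # callable that persists a finished record
        self.grace = grace        # seconds to wait for late lines after completion
        self.max_age = max_age    # seconds of silence before giving up on a record
        self.pending = {}         # queue_id -> MailRecord

    def feed(self, queue_id, info, final=False, now=None):
        """Merge the information parsed from one log line into its record."""
        now = time.time() if now is None else now
        rec = self.pending.setdefault(queue_id, MailRecord(queue_id))
        rec.fields.update(info)
        rec.complete = rec.complete or final
        rec.last_seen = now

    def collect(self, now=None):
        """Store records that are done waiting; doubles as garbage collection."""
        now = time.time() if now is None else now
        for queue_id, rec in list(self.pending.items()):
            idle = now - rec.last_seen
            if (rec.complete and idle >= self.grace) or idle >= self.max_age:
                self.store(self.pending.pop(queue_id))
```

In a real daemon collect() would be called periodically, for example from a timer thread, while feed() is called for every parsed line. A late line that arrives after the "final" one simply resets the grace period.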
And, please don't laugh: my current implementation is written in PHP. Nonetheless it works quite well, and I have also been able to find a workaround for a nasty memory leak (it should be fixed in PHP 5.3, I haven't tried yet). On our live system the daemon is watched by a monitoring tool. If its memory footprint exceeds a certain limit, it gets a kill signal, dumps its current data to disk and restarts itself. Scary, but it works.
As this daemon is far from perfect, I decided a long time ago to rewrite it from scratch as an OSS project. My employer gave me permission to do so, as everyone here agrees that this is the best strategy to get the most out of this valuable component.
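The dump-on-kill workaround could look roughly like this in Python. This is only a sketch under assumptions: the function names, the JSON format and the use of SIGTERM are chosen for illustration, and here an external supervisor is assumed to start the process again after it exits.

```python
import json
import signal
import sys


def dump_state(pending, path):
    """Write the in-memory mail records to disk as JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(pending, fh)


def load_state(path):
    """Read records back after a restart; start empty if there is no dump."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def install_term_handler(pending, path):
    """On SIGTERM, dump the current state and exit cleanly."""
    def handler(signum, frame):
        dump_state(pending, path)
        sys.exit(0)
    signal.signal(signal.SIGTERM, handler)
```

On startup the daemon would call load_state() first, so mails that were still incomplete when the signal arrived are not lost.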
Choosing a programming language
It needs to be developed in a scripting/programming language that supports threads (for garbage collection and other regular tasks). Current candidates are Perl, Python and C#. My personal favorite is Python, as I stopped writing Perl a long time ago (I still use it from time to time), and being a C# evangelist seems to be a little bit risky these days.
I did some first tests with threads in Python a while ago; those were my very first steps with Python. Even though I was able to build what I wanted, it didn't fully convince me. I've always heard that Python is sooo fantastic, especially for OOP - my first impression was a different one. Ever tried to implement patterns such as Singleton & Co. in Python? Ever tried to use protected / private variables and functions in a way that someone extending your class cannot violate the rules? To tell the truth, I was disappointed. Really disappointed. I wish I could have Python's everything-is-an-object philosophy coupled with the way OOP looks in PHP 5.3 (which finally provides late static bindings!!). Shall I really give C# a chance?
Design principles
It's nearly 02:00 AM and too late to write down everything that is floating around in my mind, which is why I decided to start this as a multipart tutorial. I'll start by asking for feedback, then plan the daemon and code it step by step. A similar attempt has been started on this blog for an RFC-compliant mail autoresponder (in German).
Some first thoughts, and then I'll go to bed:
- the main parser should be able to decide which of the available parsers to engage for the current line
- right now I have written such parsers for Postfix, Amavisd-new and Perdition
- I'd like to implement the observer pattern for data storage, so that multiple data backends can react to the "mail information is completed" event
- anyone who wants to could, for example, write a dedicated backend feeding a custom reputation database
- I'd like to provide at least two example data backends, one of them suitable for very large distributed setups and the other one for small setups on single hosts
- web backends and other tools are not planned right now, all this daemon will be useful for is realtime mail log correlation
- ...
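A rough Python sketch of the first and third thought, the dispatching main parser and the observer pattern for data backends, might look like this. Everything here is hypothetical: the class names, the method names and above all the two regular expressions, which only cover a simplified form of Postfix and Amavis log lines.

```python
import re


class PostfixParser:
    """Claims lines such as 'postfix/smtpd[1234]: 3F2A1B7C: client=...'."""
    pattern = re.compile(
        r"postfix/\w+\[\d+\]: (?P<queue_id>[0-9A-F]+): (?P<rest>.*)")

    def parse(self, line):
        m = self.pattern.search(line)
        return m.groupdict() if m else None


class AmavisParser:
    """Claims lines such as 'amavis[4321]: (04321-01) Passed CLEAN'."""
    pattern = re.compile(
        r"amavis\[\d+\]: \((?P<session>[\w-]+)\) (?P<rest>.*)")

    def parse(self, line):
        m = self.pattern.search(line)
        return m.groupdict() if m else None


class MainParser:
    def __init__(self, parsers):
        self.parsers = parsers    # tried in order for every line
        self.observers = []       # data backends interested in finished mails

    def subscribe(self, callback):
        """Register a backend for the 'mail information is completed' event."""
        self.observers.append(callback)

    def handle(self, line):
        """Return (parser name, parsed data) from the first parser that matches."""
        for parser in self.parsers:
            result = parser.parse(line)
            if result is not None:
                return type(parser).__name__, result
        return None               # no parser felt responsible for this line

    def mail_completed(self, record):
        """Notify every subscribed backend about a finished mail record."""
        for callback in self.observers:
            callback(record)
```

A database backend and a reputation backend would both simply be passed to subscribe(); the parser itself does not need to know what they do with the record.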
Folks, that's enough for today - I really need to get some sleep! Feedback is more than welcome. In particular, I'd like to ask you:
- are you aware of other similar projects (I'd rather not re-invent the wheel)?
- do you have suggestions regarding the hot what's-the-best-programming-language topic (no flame war please!)?
- what would you like this parser to be able to do (please have a look at the slides mentioned at the beginning to get an idea of what it is able to do right now)?
All other related questions / concerns / suggestions are obviously more than welcome!
Tom