I smell bacon!

For years I have been interested in parsing logs.  I did a fair amount of it in a past life, and I think Pig changes pretty well everything.  Back in the day, I wrote some parsing in Perl and tried to figure out how to group things, collate, and += these logs into a database.  Well, for me, Pig makes the parsing pretty slick.

My Pig scripts generally look like this:

REGISTER coolJarOne.jar;

REGISTER secondJar.jar;

Those jars can contain user-defined functions (UDFs) for doing custom log parsing.
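Registering a jar just puts its classes on Pig's classpath; the UDFs inside are still invoked by their full class names unless you give them an alias.  A little sketch of the alias trick, using the com.loghelper.MyLength UDF that shows up in the script below:

DEFINE MyLength com.loghelper.MyLength();

After that, later statements can just say MyLength(...) instead of spelling out the whole package.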

Then, let’s slurp in a log

raw = LOAD 'myLog.txt' USING org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader() AS (remoteAddr, remoteLogname, user, dayTime, method, uri, proto, status, bytes, referer, userAgent);

So that one line will parse one of Apache's combined logs.  And yeah, I wrote it.  It isn't committed yet, but I hope to submit a patch tonight-ish.
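Before writing the rest of the script, it's handy to sanity-check the load.  A sketch using standard Pig operators (the LIMIT keeps DUMP from spewing the whole log at you):

DESCRIBE raw;
firstFew = LIMIT raw 10;
DUMP firstFew;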

Then, this little bit of Pig Latin pulls out all the referers from my log:

refererRaw1 = FOREACH raw GENERATE MyDateExtractor(dayTime), remoteAddr, referer;
refererRaw2 = FILTER refererRaw1 BY com.loghelper.MyLength($2) > 1;
refererRaw3 = GROUP refererRaw2 BY ($0, $1, $2);
refererRaw4 = FOREACH refererRaw3 GENERATE FLATTEN($0), COUNT($1);
STORE refererRaw4 INTO 'log-referers.txt' USING PigStorage();

So log-referers.txt is a tab-delimited file, which is rather easy to parse and += into MySQL.  The beauty is that you can supposedly do it all in parallel.  I haven't done that yet, but hopefully soon.
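When I do get around to the parallel bit, Pig's PARALLEL clause asks for multiple reduce tasks on a given operator.  A sketch, not something I've run yet:

-- e.g., spread the grouping across 10 reducers
refererGrouped = GROUP refererRaw2 BY $0 PARALLEL 10;

The rest of the script stays the same; Hadoop handles splitting up the work.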

A few exciting things have been happening for me.

  1. My Pig patches (which I call bacon) have been committed.  Check it out.
  2. I tied in my Pig stuff to parse logs from good old holaservers.
  3. I tied into the Google Visualization API and am now generating some pretty killer graphs.
  4. Tonight I took a friend's combined log, ran it through my system, and folded it into the above graphs.

'Course, all the Pig and Hadoop (the parallel-processing stuff) work will hopefully be the basis of good old loghelper.
