Optimizing the program generating access log reports


nkrgupta
11-02-2005, 06:42 AM
Hi,
I'll try my best to make this as understandable as possible. Please do ask if any clarification is required.
I wrote this program to create a report, from the Apache access log files, of the top entry pages and top exit pages of our site. The program takes a date as an argument from the shell, searches for the log files of that date, parses them and produces the output. At that time we had 20 servers (each server has a separate access log file), but the number of servers has shot up to almost 150 (and the traffic has increased about 5-7 times). Quite understandably, my program can no longer handle this amount of data: after processing the log files of about 120 servers, it terminates for want of memory on the server.

So, could anyone go through it and suggest a better way of doing what is being done (maybe some module, or just a simpler and less memory-intensive way, or anything else that would improve the efficiency of the program)? I know the program is not bug-free, as the report generated may not be 100% correct, though most of it is reliable. But pointers on increasing the processing capacity of the program are what I am looking for, more than those questioning the logic of the program (all are welcome though).

I am including the code as-is with comments, wherever needed (pardon me for not using strict!).

The point to be noted is that the program terminates in the first while loop itself, before the initial hash is completely built. So the optimizations need to be done there.

#!/usr/bin/perl


use Date::Calc ( ":all" );
my ($yy1,$mm1,$dd1, $hh,$min,$sec, $doy,$dow,$dst) = Localtime();
my ($yy,$mm,$dd); ##declared here so that they are still in scope after the if/else
if (!$ARGV[0])
{
($yy,$mm,$dd) = Add_Delta_Days($yy1,$mm1,$dd1,-1);
$mm = "0$mm" if($mm < 10);
$dd = "0$dd" if($dd < 10);
}
else
{
($mm,$dd,$yy) = split(/\//,$ARGV[0]);
}
my $file_date=$yy.$mm.$dd;
my $path="/reports/cgi-bin/ServerLogs"; ##path to all the log files

my $txt_format="e_$mm$dd$yy";
my $first_file="temp.txt"; ##temp file
my $final_file="e_$mm$dd$yy.txt"; ##final text file to be generated

open (DB, ">>$first_file") or print("Could not create txt File<br>\n\n");


undef %hash;

&servers;

foreach $server(@servers)
{
my $logfile="$path/www$server.access.$file_date.log"; ##format of log files names

print "www$server..\n";
open (F, $logfile) or print ("\tCould not open $logfile\n");

while (<F>)
{
chomp; ##chomp $_ (the line just read); $line is never set in this loop

##parsing each line of log files
my ($host, $ident_user, $auth_user, $date, $time, $time_zone, $method, $url, $protocol, $status, $bytes, $referer, $agent) = /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.+?) (\S+)\" (\S+) (\S+) \"([^\"]+)\" \"([^\"]+)\"$/;


##this if block ignores all files prgrams which need to be ignored
if ($referer =~ /http:\/\/cards*./i || $url =~ /(\/view\/|\/internal\/auto\/|gotoad\.pl|TrackRecord\.pl|HpTrack\.pl|bottrack\.pl|mantle\.pl|uptime\.pl|customtrack\.pl|popup\.pl|\.dll\?|TrackThumb.pl|\.asp\?|nl\.pl|ah-ha\.pl|nc\.pl|\/cgi-bin\/cards\/help\/|searchmain\.pl|sprinks\.pl|popup4\.pl|ReturnCardData\.pl|\/cgi-bin\/cards\/|\/newcards\/|\/card\/\d\d\/\d\d\/\d\d\/\d\d\/Z|assocstats\.pl|putsp\.pl|newmember\.pl|putmail\.pl|\/interface\/|GetCount\.pl|RetDS\.pl|thankyou\.pl|prevcus\.pl|cust\.pl|lostpass\.pl|Return|putcus\.pl|chgencat\.pl|catcont\.pl|putgen\.pl|choosevent\.pl|mailtag\.pl)/i || $url !~ /(\.html|\/|\/\?|\.pl|\.pl\.new|\d{12,}|rd\/\?\/.+|rd1\/\?\/.+)$/i) {
$NbOfLinesDropped++;
next;
}


if ($url =~/^http:\/\/w+\.our-website\.com\//)
{
$url=~ s/http:\/\/w+\.our-website\.com\//\//; ##substituting sitename with '/'
}
elsif ($url =~/^http:\/\/w+\d{1,2}\.our-website\.com\//)
{
$url=~ s/http:\/\/w+\d{1,2}\.our-website\.com\//\//; ##substituting sitename with '/'
}
if ($url=~/(\/card\/\d{2}\/\d{2})(\/\d{2}\/\d{2}\/\w{2})(\d{1})(.*)/)
{
$url=$1."/".$3; ##doing another editing, only when the card pattern actually matched
}

$hash{$host}{$time}=$url; ##BUILDING THE HASH



}
close(F);


}

##THE following loop creates a file which has the hash created above, in a format which can be sorted according to time
foreach $ho(keys(%hash))
{
foreach $nt (keys %{$hash{$ho}})
{
my ($h,$m,$s)=split(':',$nt);
my $time=$h.$m.$s;
my $str=$ho."|^~".$time."|^~".$hash{$ho}{$nt};
print DB "$str\n";
}
}
undef %hash;
close(DB);

`sort $first_file -o $first_file`; ##sorting of the file according to time
open (R, "$first_file") or print("Could not create txt File<br>\n\n");
open (W, ">$final_file") or print("Could not create txt File<br>\n\n");
undef %final;

my $entry="Entry";
my $exit="Exit";

##The following while loop creates the final hash of all the entry and exit pages.
while (my $line=<R>)
{
chomp($line);
my ($ip,$time,$url)=split('\|\^\~',$line);
if (!exists $final{$ip}{$entry})
{
$final{$ip}{$entry}=$url;
}
$final{$ip}{$exit}=$url;
}

##the following loop writes all the entry and exit pages to a file
foreach $k (keys(%final))
{
if ($final{$k}{$entry} =~ /\.pl/)
{
$final{$k}{$entry}=~/(.*\/)(\w+\.pl)(.*)/;
$final{$k}{$entry}=$2;
}
if ($final{$k}{$exit}=~/\.pl/)
{
$final{$k}{$exit}=~/(.*\/)(\w+\.pl)(.*)/;
$final{$k}{$exit}=$2;
}
$string="$final{$k}{$entry}|^~$final{$k}{$exit}";
print W "$string\n";

}
undef %final;
close(R);
close(W);
open(RT,"$final_file") or print ("Error");
undef %entry;
undef %exit;

##Creating the individual hashes for entry and exit pages with the no. of times each page is called as the value of the hash
while (my $line=<RT>)
{
chomp($line);
my ($ent,$ext)=split('\|\^\~',$line);
$entry{$ent}++;
$exit{$ext}++;
}
close(RT);
my $ent_txt="ent_$mm$dd$yy";
my $ext_txt="ext_$mm$dd$yy";

open(ENT,">$ent_txt") or print ("Error");
open(EXT,">$ext_txt") or print ("Error");

##Writing entry and exit pages with no. of times called in separate text files
foreach $k (keys %entry)
{
print ENT $k."|^~".$entry{$k}."\n";
}
undef %entry;
foreach $k (keys %exit)
{
print EXT $k."|^~".$exit{$k}."\n";
}
undef %exit;
close(ENT);
close(EXT);

print "Archiving $final_file\n";
my $tarname="/reports/accesslog_db/".$txt_format.".tgz";
my $execstring="tar -zcf ".$tarname." ".$ent_txt." ".$ext_txt;
print($execstring);
system($execstring);
print("\n");
unlink $first_file;
unlink $final_file;
unlink $ent_txt;
unlink $ext_txt;
print "$tarname created \n";


sub servers ## @servers for online server
{
undef @servers;
open READ,"servers.dat" or die "No Serverlist: $!";
my $line=<READ>;
chomp($line); ##strip the trailing newline so that the last server name is clean
close(READ);
@servers=split(/,/,$line);
}


Thank You Very much
Naveen (http://naveenhere.blogspot.com)

FishMonger
11-02-2005, 07:44 AM
I haven't yet taken the time to read and analyze your script, but have you searched CPAN for a module that parses Apache log files? The first one that comes to mind is Apache::ParseLog.

http://search.cpan.org/~akira/Apache-ParseLog-1.02/ParseLog.pm

It's a little old, written in 1998, but it looks promising and I'm sure there are other modules on cpan that can help. There should also be several options available from http://sourceforge.net

nkrgupta
11-02-2005, 08:02 AM
Hi FishMonger,
There is no module on CPAN (including the one you mentioned) that I have left out of consideration. But none of them works the way I want, in terms of giving relevant output. I've been working on this project (producing various access log reports) for the past few months and have found that the way I'm parsing is probably the only way to get customised output; any deviation from it would not help us.

Also, don't you think that even if a module is used to parse the log files, the memory will saturate at some point, which needs to be tackled in a different way?

Still i'm reviewing the modules once again, with the hope of getting a breakthrough.

Thanks
Naveen
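
On the point about memory saturating whichever parser is used: one possible way to tackle it (a sketch only, not something proposed by either poster) is to stop storing every request. The report only needs the first and the last page per host, so a single small record per host can replace $hash{$host}{$time}. The sample records, %seen and track() below are hypothetical names for illustration, and the time comparison assumes that all lines belong to one date, as in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample records (host, hh:mm:ss, url); in the real script these
# would come from the parsed log lines of every server, in any order.
my @records = (
    [ '10.0.0.1', '10:15:00', '/b.html' ],
    [ '10.0.0.2', '09:00:00', '/x.html' ],
    [ '10.0.0.1', '08:30:00', '/a.html' ],
    [ '10.0.0.1', '23:59:59', '/c.html' ],
);

# One small record per host: [first_time, entry_url, last_time, exit_url].
my %seen;

# Update the record for one host. Zero-padded hh:mm:ss strings compare
# correctly with the string operators lt and ge.
sub track {
    my ($host, $time, $url) = @_;
    my $rec = $seen{$host} ||= [ $time, $url, $time, $url ];
    @$rec[0, 1] = ($time, $url) if $time lt $rec->[0];
    @$rec[2, 3] = ($time, $url) if $time ge $rec->[2];
    return;
}

track(@$_) for @records;

# Count how often each page was an entry page and an exit page.
my (%entry, %exit);
for my $rec (values %seen) {
    $entry{ $rec->[1] }++;
    $exit{ $rec->[3] }++;
}

print "entry $_ $entry{$_}\n" for sort keys %entry;
print "exit $_ $exit{$_}\n"   for sort keys %exit;
```

With this layout the memory used grows with the number of distinct hosts rather than with the number of requests, and the temporary file and the external sort are no longer needed for this particular report.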

FishMonger
11-03-2005, 03:32 PM
Ok, I'm just starting to take a look at this, and if the script is terminating during the first while loop, it's most likely due to the regexes. In particular, the ones used in checking the $referer.

That second regex is way, way too long. Due to all of its alternations, it will have a lot of backtracking to do, which will tie up memory and most likely increase CPU usage. Since most of what those regexes are doing is ruling out a number of Perl scripts, it would be more efficient to put those script names into a lookup table, i.e. a hash, and do a check against it, i.e. if (exists $hash{'scriptname'}).

Here are some modules that will help.
Break up the URL into separate parts
http://search.cpan.org/~gaas/URI-1.35/URI.pm
http://search.cpan.org/~rosch/URI-Find-0.16/lib/URI/Find.pm
http://search.cpan.org/~rjp/URI-Find-Rule-0.8/lib/URI/Find/Rule.pm

Debugging tool: find the memory usage of Perl variables
http://search.cpan.org/~dsugal/Devel-Size-0.63/Size.pm

You might want to look at using a tied hash instead of keeping the hash in memory.
http://search.cpan.org/~nwclark/perl-5.8.7/lib/Tie/Hash.pm
http://search.cpan.org/~mhx/Tie-Hash-Indexed-0.03/lib/Tie/Hash/Indexed.pm
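
As a rough illustration of the tied-hash idea, here is a minimal sketch using SDBM_File, which ships with Perl. It is only one of several possible backends and was not named in the thread. The file location, the sample records and the joined "host|time" key are assumptions made for the example; a DBM file stores flat strings only, so the two-level hash has to be flattened, and SDBM has a fairly small limit on the size of each key/value pair.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;                  # DBM backend that ships with Perl
use File::Temp qw(tempdir);

# Hypothetical location for the on-disk hash; the directory is removed at exit.
my $dir    = tempdir(CLEANUP => 1);
my $dbfile = "$dir/hits";

# Tie the hash to a DBM file so that its entries live on disk, not in RAM.
tie(my %hits, 'SDBM_File', $dbfile, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $dbfile: $!";

# Hypothetical sample records (host, time, url).
my @records = (
    [ '10.0.0.1', '08:30:00', '/a.html' ],
    [ '10.0.0.1', '10:15:00', '/b.html' ],
);

# $hash{$host}{$time} = $url becomes one flat key "host|time".
$hits{ join '|', @$_[0, 1] } = $_->[2] for @records;

# Read the entries back in sorted key order.
my @lines;
for my $key (sort keys %hits) {
    my ($host, $time) = split /\|/, $key;
    push @lines, "$host $time $hits{$key}";
}
print "$_\n" for @lines;

untie %hits;
```

Note that "sort keys %hits" still pulls every key into memory, so for very large logs the keys would have to be walked with each() or the file sorted externally, as the original script already does.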

nkrgupta
11-04-2005, 04:50 AM
Hi Fishmonger,
I sincerely appreciate the time and effort you've put in for my script. Thanks for the valuable suggestions and i'm sure they're gonna work. So i'm playing with all that u've mentioned and will come back to you in either scenario.

Thanks Again,
Naveen (http://naveenhere.blogspot.com)

nkrgupta
11-04-2005, 05:47 AM
Hi Fishmonger
Issues Galore... Well, sort of -

As for your suggestion to create a hash of all the Perl scripts which need to be checked for omission or inclusion, how do I implement the whole process? For example, now I'm doing -

$url="http://www.website.com/cgi-bin/TrackRecord.pl";
if ($url=~/TrackRecord\.pl/) ##this is actually the WAY WAY TOO LONG regex
{
next;
}

So do I have to put the complete $url (i.e. not only the name of the Perl script) in the hash to do if (exists $hash{$url})? If yes, then what about the several matches like /view/ or /auto/ etc., which fall in the middle of $url?

OR, have you suggested the URI.pm module for that same purpose, of breaking up $url into its different parts (viz. <scheme>://<authority><path>?<query>#<fragment>), and will it suffice for all conditions?

I came across this regex (http://search.cpan.org/~gaas/URI-1.35/URI.pm#PARSING_URIs_WITH_REGEXP) which is the official alternative to URI.pm, and saw that it also did the same with elan. Which one would be better - the module or the regex?

Also, could you give a simple example of doing this using URI.pm?
And for what specific purpose did you mention the modules URI::Find and URI::Find::Rule? Where would I need to find URLs in text?

Thank You,
Naveen

FishMonger
11-04-2005, 08:32 AM
Assuming you've created a hash containing the (lowercase) names of the scripts to include/exclude as keys and either 0 or 1 as their values, you can do something like this:
# put these use statements at the beginning of the script
use URI;
use File::Basename;

while (<F>)
{
##parsing each line of log files
my ($host, $ident_user, $auth_user, $date, $time, $time_zone, $method, $url, $protocol, $status, $bytes, $referer, $agent) = /^(\S+) (\S+) (\S+) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] \"(\S+) (.+?) (\S+)\" (\S+) (\S+) \"([^\"]+)\" \"([^\"]+)\"$/;

my $uri = URI->new($referer);
my $script_name = lc(basename($uri->path));

if ( exists $scripts{$script_name} && $scripts{$script_name} == 1) {
$NbOfLinesDropped++;
next;
}

## ... rest of the loop body as in the original script ...
}

Of course, you'd need to expand that to include your other requirements, but the combination of those 2 modules and a couple of simple regexes will make it much easier and more efficient than your current regex.

It's 12:30am here, so I'll check back later after I get some sleep.
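
One way the lookup table could be combined with a short regex for the path fragments (/view/ and the like) is sketched below. It uses only File::Basename and a plain substitution to strip the query string, rather than URI.pm, so that it runs on a bare Perl install. The names %skip_script, $skip_path and skip_url() are made up for the example, and only a few of the script names from the original regex are listed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);

# Lookup table of script names to drop, keyed in lower case.
# Only a few names from the original regex are listed here.
my %skip_script = map { lc($_) => 1 }
    qw(gotoad.pl TrackRecord.pl HpTrack.pl bottrack.pl);

# Path fragments are not file names, so they stay in one short regex.
my $skip_path = qr{/view/|/internal/auto/|/interface/}i;

# Return 1 if a request for $url should be left out of the report, else 0.
sub skip_url {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*//s;    # drop query string and fragment
    return 1 if $path =~ $skip_path;
    return exists $skip_script{ lc(basename($path)) } ? 1 : 0;
}

for my $url ('http://www.website.com/cgi-bin/TrackRecord.pl?id=5',
             '/view/index.html',
             '/cgi-bin/search.pl') {
    printf "%-55s %s\n", $url, skip_url($url) ? 'skip' : 'keep';
}
```

The hash lookup costs the same no matter how many script names are listed, so only the handful of directory-style patterns still go through the regex engine.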

FishMonger
11-07-2005, 12:50 AM
Hi Naveen,

Checking in to see if those suggestions I made helped.

I noticed that as you parse each line, you capture a number of variables that are never used, at least not in the code you posted. You can reduce the amount of memory and cpu usage by only capturing the items you need and nothing more.
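
As an illustration of capturing only what is needed, the sketch below keeps the shape of the original pattern but captures just the host, time, URL and referer. The variable name $line_re and the sample log line are invented for the example; the real logs may need the pattern adjusted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same field layout as the original regex, but only four capture groups.
# With /x, literal spaces are written as "\ " and comments are allowed.
my $line_re = qr{
    ^(\S+)\ \S+\ \S+\                     # host (captured), ident, auth user
    \[[^:]+:(\d+:\d+:\d+)\ [^\]]+\]\      # date, time (captured), time zone
    "\S+\ (.+?)\ \S+"\                    # method, URL (captured), protocol
    \S+\ \S+\                             # status, bytes
    "([^"]+)"\ "[^"]+"$                   # referer (captured), user agent
}x;

# Hypothetical log line in the combined log format.
my $sample = '10.0.0.1 - - [02/Nov/2005:06:42:10 +0530] '
           . '"GET /index.html HTTP/1.1" 200 1234 '
           . '"http://www.example.com/" "Mozilla/4.0"';

my ($host, $time, $url, $referer) = $sample =~ $line_re
    or die "Line did not match\n";

print "$host $time $url $referer\n";
```

Compiling the pattern once with qr// and reusing it inside the loop also avoids rebuilding the regex for every line.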

Since I don't have access to your logs, I can't do any real testing/debugging, but see if these modifications help. There are additional alterations that should be made, but this should get you started.
[Attachment 4019]

nkrgupta
11-07-2005, 05:03 AM
Hi Fishmonger,
I can't thank you enough for the work which you put in for us guys out here... Sorry for not responding to your earlier suggestions.. But I couldn't actually test them coz since the day I posted the query, the log files had shrunk a bit n so the stuff was being done normally again. So maybe I'll have to wait till a few days before Christmas to see how things work out with the modified script. Till then, my previous program seems to be okay. And I'll also take a look at the script you've posted n try to make the best possible combo from all the suggestions you've made.

It's 10 am here n just entered office n getting over the Monday morning blues... :D

Thanks a lot once more.. Will surely get back
Naveen

nkrgupta
11-14-2005, 07:18 AM
Hi FishMonger,
With reference to the modified script you had posted above, I have used your logic to generate another report (not the one I discussed earlier), which shows the number of clicks from one page of the site to another. I had to modify certain portions of your script for want of statements like "use URI", "my" in a few places, etc. But the major problem I'm facing is that the script names mentioned in %scripts are not being ignored in the while loop, due to which the report is totally awry! I tried it the way you had written (writing hash keys in lower case and using the function 'lc' to convert and then check) and also by writing the hash keys in the same case as the script names. But neither format seems to work. Also, I see that none of the validations on line no. 91 of the attached script are working. My guess is that there's something fishy between line nos. 85-88 due to which it's happening. For debugging purposes I had output $script_name and $URL (on line no. 89) to a different file, the output of which I'm also attaching.

Can you figure out where's the bug?

Thank You
Naveen

FishMonger
11-14-2005, 02:38 PM
I'll take a look at it after I get into work, but I may need the portion of your log file that was used to generate the example you provided, so I can run some tests.

nkrgupta
11-15-2005, 04:32 AM
Hi,
I've pm'ed to you the details of the sample log file.

Thanks
Naveen