View Full Version : Php code for traffic analysis
gorilla1
10-10-2002, 06:11 PM
I am using a host that provides the traffic analysis reports shown here:
http://ftp.ics.uci.edu/pub/websoft/wwwstat/
However, this software does not have the basic thing I want, which is stats on how many visits by day, by hour, how many unique visits, etc. Is there PHP code around to analyze access_log and get this kind of report? Or am I better off writing a hit counter PHP routine in the index.php or index.html file on my site?
G
firepages
10-10-2002, 06:34 PM
I have an Apache log parser on my 'todo' list, but I think it's the thought of all that regex which keeps knocking it back up the list :)
There are a load of ready-made logging utilities in PHP (hotscripts.com), but it grinds at me to duplicate data which is already available via the Apache logs, and all the best log parsers are in Perl.
So if anyone wants to supply the regex... :)
mordred
10-10-2002, 07:32 PM
I would like to, I find regular expressions an amusing and often quite challenging task. Only that you don't recall what you were trying to do if you let them lie around for some months... line noise madness... :) Just post your specific question.
gorilla1, writing a hit counter with PHP is very easy. But if you want to track your visitors' click paths, you may be better served by phpOpenTracker (http://www.phpopentracker.de/?lang=en). I have not used it myself, but from what I've read in the PHP community, it looks like a very good and cleanly designed application.
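For readers wondering what such a hit counter might look like, here is a minimal sketch. It assumes a plain text file holds the running total; the function name bumpCounter and the file name are made up for illustration and come from nobody in this thread. It only counts page loads, so it cannot answer the by-day, by-hour or unique-visitor questions on its own.

```php
<?php
// Hypothetical helper: add one to the number stored in $file and return the new total.
function bumpCounter(string $file): int
{
    // 'c+' opens for read/write, creates the file if missing, and does not truncate it.
    $fh = fopen($file, 'c+');
    if ($fh === false) {
        throw new RuntimeException("cannot open $file");
    }
    flock($fh, LOCK_EX);                          // one request at a time
    $count = (int) stream_get_contents($fh) + 1;  // an empty file counts as 0
    ftruncate($fh, 0);
    rewind($fh);
    fwrite($fh, (string) $count);
    fflush($fh);
    flock($fh, LOCK_UN);
    fclose($fh);
    return $count;
}
```

Calling something like bumpCounter(__DIR__ . '/hits.txt') at the top of index.php would bump the total on every request, provided the web server can write to that file.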
gorilla1
10-10-2002, 08:08 PM
Mordred, Interesting, thanks. I lean the same direction as Firepages, though - if the data is already being gathered, seems a shame to collect it all over again... As for regex, would be willing to help, though I have no experience with it.
G
firepages
10-11-2002, 05:48 PM
Hey Mordred, if you are game for it... :) the default Apache
custom log format is space separated with some fields quote
delimited...
eg
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
202.59.70.55 - - [07/Oct/2002:00:49:53 -0600] "GET / HTTP/1.0" 200 1327 "http://www.php.net/manual/en/tutorial.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
which breaks down to
%h > 202.59.70.55 (always valid ip format)
%l > remote logname as reported by identd, not a logfile name; I have never actually seen this set, it is '-' unless IdentityCheck is enabled
%u > REMOTE_USER if logged in via htaccess etc, eg 'username' ,defaults to '-'
%t > '[07/Oct/2002:00:49:53 -0600]'
"%r" > "GET / HTTP/1.0" (always enclosed in quotes)
%>s > request status , integer i.e. '200' , '404' etc
%b > size_in_bytes, integer '1327'
"%{Referer}i" > HTTP_REFERER , "http://www.php.net/manual/en/tutorial.php" (always enclosed in quotes)
"%{User-Agent}i" > HTTP_USER_AGENT "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" (always enclosed in quotes )
default for an unset value is '-' (without the single quotes)
if it were not for the possible spaces in the quote
delimited fields one could just explode the line
parsing the time and user_agent etc is straightforward
enough, so splitting the original line into a workable array
is the main issue.
After that it's a straight race for speed between Perl and PHP
to see if it's worth writing a custom interface in PHP for log parsing :)
gorilla1 - sorry for hijacking your thread :)
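As a rough illustration of the "parsing the time is straightforward" remark, the %t field (with the square brackets already stripped) can be handed to PHP's date parser. This is only a sketch: the helper name parseLogTime is hypothetical, and it assumes the default Apache timestamp layout shown in the sample line above.

```php
<?php
// Hypothetical helper: turn the %t field (brackets removed) into a date object.
// Returns null when the text does not match Apache's default layout.
function parseLogTime(string $field): ?DateTimeImmutable
{
    // d/M/Y:H:i:s = 07/Oct/2002:00:49:53, O = numeric offset such as -0600
    $dt = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $field);
    return $dt === false ? null : $dt;
}

$dt = parseLogTime('07/Oct/2002:00:49:53 -0600');
echo $dt->format('Y-m-d'), "\n"; // day bucket:  2002-10-07
echo $dt->format('H'), "\n";     // hour bucket: 00
```

The formatted day and hour strings are in the server's logged offset, which is usually what a by-day or by-hour report wants.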
mordred
10-11-2002, 09:10 PM
Ok, here's a first quick hack. I decided to design the regexp just to match text content of the log details, not validating the contents of each detail since that would cut down on performance even more.
// Sample line in the log format described above.
$logStr = '202.59.70.55 - - [07/Oct/2002:00:49:53 -0600] "GET / HTTP/1.0" 200 1327 "http://www.php.net/manual/en/tutorial.php" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"';

// Split one log line into named fields.
// Returns an empty array if the line does not match the expected format.
function splitLogString($line) {
    $logArray = array();
    // Groups: 1 %h, 2 %l, 3 %u, 4 %t, 5 %r, 6 %>s, 7 %b,
    // 9 referer and 11 user agent (optional; groups 8 and 10 include the quotes).
    $regexp = '/^(.+?) (.+?) (.+?) \\[(.+?)\\] "(.+?)" (\\d+) (\-|\\d+)( "(.+?)")?( "(.+?)")?\\s*$/i';
    if (preg_match($regexp, $line, $matches)) {
        $logArray["ip"] = $matches[1];
        $logArray["name"] = $matches[2];        // %l
        $logArray["remote_user"] = $matches[3];
        $logArray["date_time"] = $matches[4];   // without the square brackets
        $logArray["http"] = $matches[5];
        $logArray["status"] = $matches[6];
        $logArray["length"] = $matches[7];      // '-' when no size was logged
        if (isset($matches[9])) {
            $logArray["referer"] = $matches[9];
        }
        if (isset($matches[11])) {
            $logArray["user_agent"] = $matches[11];
        }
    }
    return $logArray;
}
Using this function in conjunction with a small snippet that parses a larger log file (about 1.8 MB) ...
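// Parse every line of access.log and return all results in one array.
function searchLog() {
    $lines = file("access.log");   // reads the whole file into memory
    $completeLog = array();
    for ($i = 0; $i < count($lines); $i++) {
        $completeLog[] = splitLogString($lines[$i]);
    }
    return $completeLog;
}
var_dump(searchLog());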
...it hits the maximum execution time after working 30 seconds and retrieving 18,320 lines of log information... which is kinda slow... but it's a proof-of-concept, optimization of the performance should be the last step.
firepages
10-13-2002, 07:50 AM
Pretty cool Mordred, works a treat,
so the next issue is that when trying to parse a logfile of 45MB I actually kill Apache ... that's on Windows, I am sure on *NIX it would not die but it's a red flag at least...
so I am about to try and fopen and read the logfile 1 line at a time which should be no problem, I am not sure if file() actually loads the whole array into memory but I assume it tries to, so a line at a time is good as we can reuse the variable.... but then we still have all that parsed data in memory which is probably the killer, so how would you suggest going about storing the data?
we could update a DB as we parse say in blocks of 100 or 1000 to avoid $x000 queries :) - or do the same to a flatfile?
in fact how would you suggest storing the data .. flatfiles are good... who wants a $xMB database lying around, but of course then parsing those flatfiles for data is again a pain in the ...
or would you just store statistics ?
pick your brains some more please ?
why ?? well to help answer gorilla1's question of course & nothing at all to do with the fact that I have for a while wanted to make a client-side GUI that connects to an online server and grabs (via XML-RPC/whatever?) parsed web-server statistics, nothing to do with that at all ok.
PS I just got for the first time ever a full list of referers (all the PERL stats systems just give the last 30/40 or whatever) for a given month which was fun and enlightening so cheers !
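The line-at-a-time idea, combined with the "just store statistics" option, could look roughly like the sketch below. It reads from an already opened stream with fgets() and keeps only running counters, so memory use does not grow with the number of log lines (the set of distinct IPs still does). The function name summarizeLog and the result keys are invented for this example, and counting distinct IPs is only a crude stand-in for unique visits.

```php
<?php
// Hypothetical helper: walk a log stream line by line and keep only totals.
function summarizeLog($stream): array
{
    $hitsPerHour = array_fill(0, 24, 0);
    $hitsPerDay  = [];
    $ips         = [];
    // Only the host and the date/hour part of %t are needed for these counters.
    $re = '/^(\S+) \S+ \S+ \[(\d{2})\/(\w{3})\/(\d{4}):(\d{2}):/';

    while (($line = fgets($stream)) !== false) {
        if (!preg_match($re, $line, $m)) {
            continue; // skip anything that does not look like a log entry
        }
        $day = "$m[2]/$m[3]/$m[4]";
        $hitsPerHour[(int) $m[5]]++;
        $hitsPerDay[$day] = ($hitsPerDay[$day] ?? 0) + 1;
        $ips[$m[1]] = true; // array keys act as a set of distinct hosts
    }

    return [
        'per_hour'   => $hitsPerHour,
        'per_day'    => $hitsPerDay,
        'unique_ips' => count($ips),
    ];
}

// Usage with an in-memory stream holding the sample line from this thread;
// a real run would pass fopen('access.log', 'r') instead.
$fh = fopen('php://memory', 'w+');
fwrite($fh, '202.59.70.55 - - [07/Oct/2002:00:49:53 -0600] "GET / HTTP/1.0" 200 1327' . "\n");
rewind($fh);
$stats = summarizeLog($fh);
fclose($fh);
```

The trade-off is the one raised above: the totals are cheap to keep, but questions that were not anticipated when the counters were chosen need another pass over the log.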
gorilla1
10-13-2002, 01:45 PM
If you don't mind my interrupting, for this purpose I agree with you that it seems like a nuisance to have to create a database. Flat file is good.
G
mordred
10-13-2002, 04:29 PM
Ok, I tormented my brain for a while with the question how to store the parsed results of the log files.
Initially I had the same idea as you both had, storing the relevant info in flat files and avoid talking to a database which should save a lot of work, but... I think differently now.
After thinking it through, I came to a couple of different approaches.
1. Store the info in flat files.
That sounds good, but does not accomplish much. As you said firepages, parsing the data of these flat files is a PITA indeed. In fact, we just copy the log files to a different format. And yet we need to have all this information in memory to work upon it (say, retrieve all Netscape users in April that stayed for more than one minute on page xy.html).
2. Serialize a multidimensional array
Same as in 1.), only that the work of parsing the flat file is done by PHP's unserialize() function. Both 1 and 2 suffer from two practical limits: the size of an array in memory and the limit of the execution time.
3. Insert parsed info at regular intervals into a database
As a result of 1 and 2, I now favor this approach. I envision a table that has fields corresponding to the log format we parsed. A multi-insert SQL statement inserts the info at a given interval, perhaps after reading 1000 lines.
The main advantages of this will show when working on that data later: we can use SQL to construct our analyzing statements, and since MySQL is damn fast in selecting rows, we should have both better functionality and speed as opposed to 1 and 2. With flat files, by contrast, we would have to write a multitude of fancy array-handling functions to accomplish the analyzing tasks.
So my vote definitely goes for using a MySQL database for storing the logged information. Every thing else I thought of suffers from performance and redundancy issues. :)
What's your opinion on my ramblings?
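One possible shape for the multi-insert statement from approach 3 is sketched here. It only builds the SQL text and the flat parameter list for a batch of parsed lines; the function name buildBatchInsert, the table name access_log and the column names are assumptions for illustration, not a schema anyone in the thread has settled on.

```php
<?php
// Hypothetical helper: build one multi-row INSERT with placeholders for a batch
// of parsed log lines. $rows is a list of arrays keyed by column name.
function buildBatchInsert(string $table, array $columns, array $rows): array
{
    // One "(?, ?, ...)" group per row.
    $group = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $group))
    );

    // Flatten the values in the same order as the placeholders.
    $params = [];
    foreach ($rows as $row) {
        foreach ($columns as $col) {
            $params[] = $row[$col] ?? null; // missing fields become NULL
        }
    }
    return [$sql, $params];
}

// Example batch of two rows (made-up values); a real run would collect
// about 1000 parsed lines before flushing.
[$sql, $params] = buildBatchInsert('access_log', ['ip', 'status'], [
    ['ip' => '202.59.70.55', 'status' => '200'],
    ['ip' => '202.59.70.55', 'status' => '404'],
]);
```

With a database connection at hand, the pair could be passed to a prepared statement, for example $pdo->prepare($sql)->execute($params) with PDO, once per batch.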
firepages
10-16-2002, 11:04 AM
sorry I did not post back earlier - had to do some work !yuk!
my thinking on this is stuck between a rock and a hard place....
agreed that when working on the data MySQL is going to pay dividends....
but then if the logs are 10, 20, ... etc. MB, that's going to become quite a hefty database, although I am sure they will take up far less space in the DB than as flatfiles. Plus, of course, archiving historical data is going to be a pain as well (if in DB)
so I was thinking OK, parse the log for relevant data and just store that (a summary) in the DB, not every individual byte of data, but that brings us back to the original problem of memory use whilst parsing the data, and only having summary information in the DB kind of limits the flexibility of manipulation of that data.
then :) & this is my current thinking, use a temp table to add all the data to the DB - adding in 1000 line blocks etc as you mentioned, now the data should be available for crunching without causing any major heartache for the server or DB...
then parse that data via PHP/SQL (which I assume will be quite quick) and perhaps add summary data to another table (browser stats, referers etc etc) and put all the sticky stuff (hourly analysis etc) into preformatted flatfiles or similar?
I mean I do personally look at hourly/daily statistics but only occasionally, normally I am most interested in referers, search terms, errors, hack attacks etc so I would like this data to be available quickly and with the ability to format it, i.e. via SQL, but that's me and I am sure others use their stats in different ways.
or do we keep it all in the DB ? <head hurts />
anyway just to check my suspicions I will add the 45 MB logfile I have to the DB as discussed above and see how much space is saved
So do you think a hybrid system is a possibility or does this defeat the purpose ??