
View Full Version : Limitations of a WAMP stack, or is my code just hosed?



fuzzy1
03-25-2009, 10:08 PM
Hey All,
Having a time with a little Craigslist search experiment on a WAMP stack:

XP Pro, 2.8GHz Xeon Dual Core, 2GB RAM and minimal additional apps.
Apache/2.2.6 (Win32), MySQL 5.0.45-community-nt, and PHP 5.2.5, with loaded extensions: bcmath, calendar, com_dotnet, ctype, session, filter, ftp, hash, iconv, json, odbc, pcre, Reflection, date, libxml, standard, tokenizer, zlib, SimpleXML, dom, SPL, wddx, xml, xmlreader, xmlwriter, apache2handler, mbstring, mysql, mysqli, PDO, pdo_sqlite, SQLite.

I've set up a 3-table test database (users, user_searches, and search_hits).
In users, I've added 1000 dummy users.
In user_searches, I've added 10 CL searches for each user (classic muscle cars) across all known CL US cities.

With the following script, I can scrape 1000 user searches (new CL posts with today's date) in about 5 minutes, but I'd like to improve on that considerably. It occurred to me to attempt concurrent instances of the script, but I find that -- try as I might -- I can only manage two concurrent instances. Beyond that, additional instances lock up until one of the preceding instances finishes iterating. All the while, CPU usage hovers around 8-12% and memory around 500MB (plenty to spare?), and it doesn't appear to be a failing of bandwidth either. It seems this is a failing of my httpd.conf instead -- or, more specifically, an inherent limitation of a WAMP stack in general, owing to the MaxRequestsPerChild / ThreadsPerChild parameters on a Windows configuration?

Could this be improved upon by running a LAMP stack instead, employing the prefork or worker MPM, or is there just something inherently wrong with my code / approach?
<?php
session_start();
session_regenerate_id();//different session id for each instance
$session = session_id();
echo $session;
require_once('includes/DbConnector.php');
$connector = new DbConnector();
mysql_select_db('cl_search_assist') or die ("Connection Failed ". mysql_error()); // Run the query.

$lock= "UPDATE user_searches SET locked ='$session'
WHERE active_status='1'
AND locked='0'
AND date_modified < DATE_SUB(NOW(),INTERVAL 10 MINUTE)
ORDER BY user_id DESC, date_entered DESC
LIMIT 1000";
$lock_result = mysql_query($lock);
if (!$lock_result) {
die($lock.' lock failed: ' . mysql_error());
}

$page='1';
if(isset($_GET['page'])){
$page=$_GET['page']+1;
}
$remaining = "SELECT COUNT(id) FROM user_searches WHERE active_status='1' AND locked='0'";
$result = mysql_query($remaining) or die(mysql_error());
$row = mysql_fetch_array($result);
$remains = $row['COUNT(id)'];// "un-locked active searches"
if($remains >1){
echo "<script>
lz='this.php?page=$page';
//window.open(lz,'$page','width=400,height=200,toolbar=yes,location=no,directories=no,status=yes,menubar=yes,scrollbars=yes,copyhistory=no,resizable=yes');
</script>";
}

$start_limit =0;
$ses_duration=0;
$query_limit ="0,100";
$itts ='0';
$errors = array();
Get_CL_Posts($session, $query_limit,$start_limit,$ses_duration);

function Get_CL_Posts($session, $query_limit,$start_limit,$ses_duration){
// $frame_id = $_GET['frame_id'];
$now = strtotime("now");
$time_start = microtime(true);
global $errors, $itts;
########################
# GET ACTIVE SEARCHES #
########################
$SelectUserSearches ="SELECT * FROM user_searches
WHERE active_status='1'
AND locked='$session'
ORDER BY user_id DESC, date_entered DESC
LIMIT $query_limit";
$result = mysql_query($SelectUserSearches) or die ("SelectUserSearches query failed " .mysql_error()); // Run the query.
$num_rows = mysql_num_rows($result);
if($num_rows!='0'){
echo "<h1>Conducting ".$num_rows." user searches</h1>";
set_time_limit( 60 * 5 );
$insert="INSERT INTO search_hits (user_name, user_id, search_title, post_title, post_id, search_id, hit_link, date_entered) VALUES";
while ($row = mysql_fetch_assoc($result)){
$search_id = $row["id"];
$active_status = $row["active_status"];//active searches by this user
$search_title=$row['search_title'];
$user_id= $row['user_id'];
$user_name= $row['user_name'];
$cl_query=$row['cl_query'];
$city_url=$row['city_url'];
$catAbbreviation=$row['catAbbreviation'];
$addOne=$row['addOne'];
$addTwo=$row['addTwo'];
$addThree=$row['addThree'];
$addFour=$row['addFour'];
$addFive=$row['addFive'];
$srchType=$row['srchType'];
$alert_count=$row['alert_count'];
$locked = $row["locked"];//

$user_search = "?query=".str_replace(" ", "+", $cl_query); //%20 ?
$search_url= $city_url."search/".$catAbbreviation.$user_search;
$mods=array('addOne'=>$addOne,'addTwo'=>$addTwo,'addThree'=>$addThree,'addFour'=>$addFour,'addFive'=>$addFive,'srchType'=>$srchType);
foreach($mods as $key=>$value){
if($value !=''){$search_url.="&".$key."=".$value;}
}

$ctx = stream_context_create(array( 'http' => array( 'timeout' => 1)));// 1-second timeout per request
$data = @file_get_contents($search_url, false, $ctx);
$itts = $itts+1;
// echo $data;
if ($data === false || $data == '') {
echo "<h2>There was a problem with the http request!</h2>";//remove for production !!!
array_push($errors,$search_url);
echo "<font color='red'>".$search_url."</font><br>";//update in db and continue instead!!!!
continue;
}

$today = "/".date('M j')." - <a href=";
$search = $today."([^`]*?)<\/a><\/i>/";
preg_match_all ($search, $data, $matches);
$i=0;
foreach($matches[0] as $key =>$value){
####################################################
//hit_link = full CL list text with date and links
####################################################
$replace = '<a href="/';
$alt = '<a target="_blank" href="'.$city_url;
$hit_link = str_replace($replace, $alt, $value);
//echo $hit_link."\n<br>";
##########################################
//post_title = CL post list TITLE only
##########################################
$strip1 = explode('.html">', $value);
$strip2 = explode('-</a>', $strip1[1]);
$post_title = $strip2[0];
################################
//post_id = CL post id number
################################
$pre =date('M j')." - <a href=\"/";
$prePost = str_replace($pre, "",$strip1[0]);
$post= explode('/', $prePost);
$post_id = $post[1];
//echo $post_id;

$check_dupes ="SELECT * FROM search_hits WHERE user_id='$user_id' AND post_id='$post_id'";
/* un-comment for production
$dupes_result = mysql_query($check_dupes) or die ("Check Dupes failed " .mysql_error()); // Run the query.
$num_rows=mysql_num_rows($dupes_result);
if($num_rows!='0'){
continue;
}
else{
$date_entered = date('Y-m-d H:i:s', $now);
$insert.="('$user_name', '$user_id', '$search_title', '$post_title', '$post_id', '$search_id', '$hit_link', '$date_entered' ),";
//echo $hit_link."<br />";
}*/
echo $user_name." --> ".$hit_link."<br />";//remove for production
}//end foreach $matches[0]
}//end mysql_fetch_assoc($result)
}//end if numrows
//else echo "<script>alert('numrows =0')</script>";
$time_end = microtime(true);
$time = $time_end - $time_start;
$ses_duration+=$time;
echo "<br><br><font color ='red'> Elapsed time: ".$time." seconds. -- Session Total: ".$ses_duration."</font><br><br>";
if($start_limit <=900){
$start_limit=$start_limit+100;
$query_limit="$start_limit,100";
Get_CL_Posts($session, $query_limit,$start_limit,$ses_duration);
}
}//end Get_CL_Posts
echo "There were ".count($errors)." errors in ".$itts." iterations.";
$insert="INSERT INTO search_hits (user_name, user_id, search_title, post_title, post_id, search_id, hit_link, date_entered)
VALUES('$user_name', '$user_id', '$search_title', '$post_title', '$post_id', '$search_id', '$hit_link', '$date_entered' )";
$insert = rtrim($insert, ",");
/*
$result = mysql_query($insert);
if (!$result) {
die('Invalid query: ' . mysql_error());
}else {echo "success";}
*/

?>

Inigoesdr
03-26-2009, 06:49 AM
Yes, the problem with your approach is that scraping is generally very inefficient using a web scripting language. Making this many requests is going to be slow no matter what you do. Yes, using a LAMP stack should be faster, but you're still limited by bandwidth (not just throughput, but the size and quantity of the requests) and the HTTP protocol. Using a dedicated server with a decent-size bandwidth limit and network pipe will improve the time as well. You aren't being affected by MaxRequestsPerChild or ThreadsPerChild unless you are spawning the processes from your browser and they are set to 1. You probably aren't gaining anything by running an MPM anyway. If you're going to launch multiple processes, it would be more efficient to remove Apache from the process altogether and just run the script through the PHP CLI.
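On the WAMP box described above, that might look like the following batch sketch (the php.exe path and the scrape.php name are assumptions; only the cl_search_assist folder comes from the thread):

```shell
:: Run the scraper through the PHP CLI binary -- no Apache involved.
:: START returns immediately, so several instances run concurrently.
start "cl1" C:\wamp\bin\php\php.exe -f C:\wamp\www\cl_search_assist\scrape.php
start "cl2" C:\wamp\bin\php\php.exe -f C:\wamp\www\cl_search_assist\scrape.php
```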

Good luck!

fuzzy1
03-26-2009, 07:28 PM
it would be more efficient to remove Apache from the process altogether and just run the script through the PHP CLI.
Fantastic! Thank you, Inigoesdr! Finally a little light on the subject.
Of course, as is so often the case, fresh answers do tend to breed fresh and burning questions like -- PHP CLI? :eek::eek::eek:
I didn't know there was any such thing!
While I look further into it under my own steam, I would be eternally grateful if you could offer just a bit more of your interpretation of the concept meantime?
Thanks Again!

Inigoesdr
03-27-2009, 04:21 AM
Of course, as is so often the case, fresh answers do tend to breed fresh and burning questions like -- PHP CLI? :eek::eek::eek:
I didn't know there was any such thing!
Yes, indeed. The PHP manual (http://php.net/features.commandline) has a pretty in-depth explanation of what the CLI is and how to use it. Basically, it's an executable that runs the PHP interpreter engine without the need for a server (such as Apache) to invoke it. This allows you to run commands and execute files from the command line, which uses fewer resources and potentially allows you to interact with PHP from other programs.
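As a concrete sketch, a CLI-aware version of the entry point might start like this (the argument handling, variable names, and defaults are illustrative, not from the thread):

```php
<?php
// Refuse to run under Apache; php_sapi_name() reports 'cli' when
// the script is invoked as `php script.php`.
if (php_sapi_name() !== 'cli') {
    die("Run this from the command line, not the browser.\n");
}

// Command-line arguments arrive in $argv; $argv[0] is the script name.
$start = isset($argv[1]) ? (int) $argv[1] : 0;   // row offset for this instance
$limit = isset($argv[2]) ? (int) $argv[2] : 100; // batch size per instance

// STDERR keeps progress chatter separate from normal output.
fwrite(STDERR, "This instance will scrape searches from offset $start\n");
echo "LIMIT $start,$limit\n"; // the LIMIT clause this instance would use
```

Each concurrent instance could then be handed its own offset, e.g. `php scrape.php 100 100`, instead of relying on $_GET['page'].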

fuzzy1
03-27-2009, 05:41 AM
Quite!
IMHO the PHP manuals are THE benchmark for ALL manuals everywhere. Simply unparalleled. I have already pared down my script a bit and have successfully run it from DOS in 4 or 5 concurrent instances, and from a batch file as well. Still, I'm having a bit of a time effecting a seamless and automagic initiation of consecutive instances -- as need be -- from within the PHP script.
It was my thought to replace the following bit from the script above
$remaining = "SELECT COUNT(id) FROM user_searches WHERE active_status='1' AND locked='0'";
$result = mysql_query($remaining) or die(mysql_error());
$row = mysql_fetch_array($result);
$remains = $row['COUNT(id)'];// "un-locked active searches"
if($remains >1){
echo "<script>
lz='this.php?page=$page';
window.open(lz,'$page','width=400,height=200,toolbar=yes,location=no,directories=no,status=yes,menubar=yes,scrollbars=yes,copyhistory=no,resizable=yes');
</script>";
}
with a variation something along the lines shown below...
$remaining = "SELECT COUNT(id) FROM user_searches WHERE active_status='1' AND locked='0'";
$result = mysql_query($remaining) or die(mysql_error());
$row = mysql_fetch_array($result);
$remains = $row['COUNT(id)'];// "un-locked active searches"
if($remains >1){
$bat ="C:\\wamp\\www\\cl_search_assist\\init.bat";
echo exec($bat);
}
I've run at this in several different configurations, but have yet to get it quite right. It seems that exec() behaves somewhat differently in the CLI, but I haven't been able to pin it down. Any thoughts? Thanks again! I do think this just may fit the bill.
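For what it's worth, exec() blocks until the launched command exits, so each instance would wait on the batch file it spawns. A sketch of a non-blocking launch (the spawn_worker() helper and the scrape.php name are assumptions, not from the thread):

```php
<?php
// Launch another CLI instance without waiting for it to finish.
// exec() blocks until the child exits; on Windows, popen() with
// START /B backgrounds the child and pclose() returns immediately.
function spawn_worker($script) {
    if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') {
        pclose(popen('start /B php ' . escapeshellarg($script), 'r'));
    } else {
        // On a LAMP box: detach by redirecting output and backgrounding.
        exec('php ' . escapeshellarg($script) . ' > /dev/null 2>&1 &');
    }
}

// In place of exec($bat) above, something like:
// if ($remains > 1) { spawn_worker('C:\\wamp\\www\\cl_search_assist\\scrape.php'); }
```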

fuzzy1
03-27-2009, 06:46 PM
the problem with your approach is that scraping is generally very inefficient using a web scripting language. Also, I would be remiss if I didn't inquire... are you aware of a viable alternative to scraping???
Thanks!



EZ Archive Ads Plugin for vBulletin Copyright 2006 Computer Help Forum