  1. #1
    Regular Coder
    Join Date
    Apr 2006
    Posts
    311
    Thanks
    17
    Thanked 0 Times in 0 Posts

    Limitations of a WAMP stack, or is my code just hosed?

    Hey All,
    Having a time with a little Craigslist search experiment on a WAMP stack:
    XP Pro, 2.8GHz Xeon dual core, 2GB RAM, and minimal additional apps.
    Apache/2.2.6 (Win32), MySQL 5.0.45-community-nt, and PHP 5.2.5, with loaded extensions: bcmath, calendar, com_dotnet, ctype, session, filter, ftp, hash, iconv, json, odbc, pcre, Reflection, date, libxml, standard, tokenizer, zlib, SimpleXML, dom, SPL, wddx, xml, xmlreader, xmlwriter, apache2handler, mbstring, mysql, mysqli, PDO, pdo_sqlite, SQLite
    I've set up a three-table test database (users, user_searches, and search_hits).
    In users, I've added 1000 dummy users.
    In user_searches, I've added 10 CL searches for each user (classic muscle cars) across all known CL US cities.

    With the following script, I can scrape for 1000 user searches (new CL posts with today's date) in about 5 minutes, but I would like to improve on that considerably. It occurred to me to run concurrent instances of the script, but try as I might, I can only manage two at once. Any additional instances lock up until one of the preceding instances finishes iterating. All the while, CPU usage hovers at around 8-12% and memory at around 500MB (plenty to spare?), and it doesn't appear to be a lack of bandwidth either. It seems the problem lies in my httpd.conf, or more specifically in an inherent limitation of a WAMP stack in general, owing to the MaxRequestsPerChild / ThreadsPerChild parameters on a Windows configuration?

    Could this be improved by running a LAMP stack instead, using the prefork or worker MPM, or is there just something inherently wrong with my code / approach?
    PHP Code:
    <?php
    session_start();
    session_regenerate_id(); // different session id for each instance
    $session = session_id();
    echo $session;
    require_once('includes/DbConnector.php');
    $connector = new DbConnector();
    mysql_select_db('cl_search_assist') or die("Connection Failed " . mysql_error()); // Select the database.

    $lock = "UPDATE user_searches SET locked = '$session'
        WHERE active_status = '1'
            AND locked = '0'
            AND date_modified < DATE_SUB(NOW(), INTERVAL 10 MINUTE)
            ORDER BY user_id DESC, date_entered DESC
            LIMIT 1000";

    $lock_result = mysql_query($lock);
    if (!$lock_result) {
        die($lock . ' lock failed: ' . mysql_error());
    }
        
    $page = 1;
    if (isset($_GET['page'])) {
        $page = (int) $_GET['page'] + 1; // cast to int before echoing it into the script below
    }
    $remaining = "SELECT COUNT(id) FROM user_searches WHERE active_status='1' AND locked='0'";
    $result = mysql_query($remaining) or die(mysql_error());
    $row = mysql_fetch_array($result);
    $remains = $row['COUNT(id)']; // "un-locked active searches"
    if ($remains > 1) {
        echo "<script>
                lz='this.php?page=$page';
                //window.open(lz,'$page','width=400,height=200,toolbar=yes,location=no,directories=no,status=yes,menubar=yes,scrollbars=yes,copyhistory=no,resizable=yes');
            </script>";
    }

    $start_limit = 0;
    $ses_duration = 0;
    $query_limit = "0,100";
    $itts = 0;
    $errors = array();
    Get_CL_Posts($session, $query_limit, $start_limit, $ses_duration);

    function Get_CL_Posts($session, $query_limit, $start_limit, $ses_duration) {
        // $frame_id = $_GET[frame_id];
        $now = strtotime("now");
        $time_start = microtime(true);
        global $errors, $itts;
        
        #######################
        # GET ACTIVE SEARCHES #
        #######################
        $SelectUserSearches = "SELECT * FROM user_searches
                WHERE active='1'
                AND locked='$session'
                ORDER BY user_id DESC, date_entered DESC
                LIMIT $query_limit";
        $result = mysql_query($SelectUserSearches) or die("SelectUserSearches query failed " . mysql_error()); // Run the query.
        $num_rows = mysql_num_rows($result);
        if ($num_rows != '0') {
            echo "<h1>Conducting " . $num_rows . " user searches</h1>";
            set_time_limit(60);
            $insert = "INSERT INTO search_hits (user_name, user_id, search_title, post_title, post_id, search_id, hit_link, date_entered) VALUES";
            while ($row = mysql_fetch_assoc($result)) {
                
                $search_id = $row["id"];
                $active_status = $row["active_status"]; // active searches by this user
                $search_title = $row['search_title'];
                $user_id = $row['user_id'];
                $user_name = $row['user_name'];
                $cl_query = $row['cl_query'];
                $city_url = $row['city_url'];
                $catAbbreviation = $row['catAbbreviation'];
                $addOne = $row['addOne'];
                $addTwo = $row['addTwo'];
                $addThree = $row['addThree'];
                $addFour = $row['addFour'];
                $addFive = $row['addFive'];
                $srchType = $row['srchType'];
                $alert_count = $row['alert_count'];
                $locked = $row["locked"];

                $user_search = "?query=" . str_replace(" ", "+", $cl_query); // %20 ?
                $search_url = $city_url . "search/" . $catAbbreviation . $user_search;
                $mods = array('addOne' => $addOne, 'addTwo' => $addTwo, 'addThree' => $addThree, 'addFour' => $addFour, 'addFive' => $addFive, 'srchType' => $srchType);
                foreach ($mods as $key => $value) {
                    if ($value != '') {
                        $search_url .= "&" . $key . "=" . $value;
                    }
                }
                
                
                $ctx = stream_context_create(array('http' => array('timeout' => 1)));
                $data = @file_get_contents($search_url, false, $ctx); // SET TIMELIMIT
                $itts = $itts + 1;
                // echo $data;
                if ($data == '') {
                    echo "<h2>There was a problem with the http request!</h2>"; // remove for production !!!
                    array_push($errors, $search_url);
                    echo "<font color='red'>" . $search_url . "</font><br>"; // update in db and continue instead!!!!
                    continue;
                }

                
                $today = "/" . date('M j') . " - <a href=";
                $search = $today . "([^`]*?)<\/a><\/i>/";
                preg_match_all($search, $data, $matches);
                $i = 0;
                foreach ($matches[0] as $key => $value) {
                    ####################################################
                    // hit_link = full CL list text with date and links
                    ####################################################
                    $replace = '<a href="/';
                    $alt = '<a target="_blank" href="' . $city_url;
                    $hit_link = str_replace($replace, $alt, $value);
                    // echo $hit_link."\n<br>";

                    ##########################################
                    // post_title = CL post list TITLE only
                    ##########################################
                    $strip1 = explode('.html">', $value);
                    $strip2 = explode('-</a>', $strip1[1]);
                    $post_title = $strip2[0];

                    ################################
                    // post_id = CL post id number
                    ################################
                    $pre = date('M j') . " - <a href=\"/";
                    $prePost = str_replace($pre, "", $strip1[0]);
                    $post = explode('/', $prePost);
                    $post_id = $post[1];
                    // echo $post_id;

                    $check_dupes = "SELECT * FROM search_hits WHERE user_id='$user_id' AND post_id='$post_id'";
                    /* un-comment for production
                    $dupes_result = mysql_query($check_dupes) or die ("Check Dupes failed " .mysql_error()); // Run the query.
                    $num_rows=mysql_num_rows($dupes_result);
                    if($num_rows!='0'){
                        continue;
                    }
                    else{
                        $date_entered = date('Y-m-d H:i:s', $now);
                        $insert.="('$user_name', '$user_id', '$search_title', '$post_title', '$post_id', '$search_id', '$hit_link', '$date_entered' ),";
                        //echo $hit_link."<br />";
                    }*/
                    echo $user_name . " --> " . $hit_link . "<br />"; // remove for production
                } // end foreach $matches[0]
            
            } // end mysql_fetch_assoc($result)
        } // end if numrows
        // else echo "<script>alert('numrows =0')</script>";
        $time_end = microtime(true);
        $time = $time_end - $time_start;
        $ses_duration += $time;
        echo "<br><br><font color='red'> Elapsed time: " . $time . " seconds. -- Session Total: " . $ses_duration . "</font><br><br>";
        if ($start_limit <= 900) {
            $start_limit = $start_limit + 100;
            $query_limit = "$start_limit,100";
            Get_CL_Posts($session, $query_limit, $start_limit, $ses_duration);
        }
    } // end Get_CL_Posts
    echo "There were " . count($errors) . " errors in " . $itts . " iterations.";
    $insert = "INSERT INTO search_hits (user_name, user_id, search_title, post_title, post_id, search_id, hit_link, date_entered)
     VALUES('$user_name', '$user_id', '$search_title', '$post_title', '$post_id', '$search_id', '$hit_link', '$date_entered' )";
    $insert = rtrim($insert, ",");
    /*
    $result = mysql_query($insert);
    if (!$result) {
        die('Invalid query: ' . mysql_error());
    }else {echo "success";}
        */

    ?>
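
The query-string assembly in the script above (the str_replace(" ", "+", ...) call and the $mods loop) could also be handed to PHP's built-in http_build_query(), which URL-encodes every key and value. A minimal sketch, assuming the same column names as the script; the helper name build_search_url and the sample values are hypothetical:

```php
<?php
/**
 * Build a search URL from one user_searches row (hypothetical helper).
 */
function build_search_url($city_url, $catAbbreviation, $cl_query, $mods)
{
    $params = array('query' => $cl_query);
    foreach ($mods as $key => $value) {
        // Skip empty modifiers, much like the original loop does.
        if ($value !== '' && $value !== null) {
            $params[$key] = $value;
        }
    }
    // http_build_query() encodes spaces as "+" and escapes reserved characters.
    return $city_url . 'search/' . $catAbbreviation . '?' . http_build_query($params, '', '&');
}
```

For example, build_search_url('http://example.org/', 'cta', 'muscle car', array('addOne' => '', 'srchType' => 'T')) gives http://example.org/search/cta?query=muscle+car&srchType=T.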

  • #2
    Super Moderator Inigoesdr's Avatar
    Join Date
    Mar 2007
    Location
    Florida, USA
    Posts
    3,647
    Thanks
    2
    Thanked 406 Times in 398 Posts
    Yes, the problem with your approach is that scraping with a web scripting language is generally very inefficient. Making this many requests is going to be slow no matter what you do. Yes, a LAMP stack should be faster, but you're still limited by bandwidth (not just throughput, but the size and quantity of the requests) and by the HTTP protocol. Using a dedicated server with a decent-sized bandwidth limit and network pipe will improve the time as well. You aren't being affected by MaxRequestsPerChild or ThreadsPerChild unless you are spawning the processes from your browser and they are set to 1. You probably aren't gaining anything by running a different MPM anyway. If you're going to launch multiple processes, it would be more efficient to remove Apache from the process altogether and just run the script through the PHP CLI.

    Good luck!

  • #3
    Regular Coder
    Join Date
    Apr 2006
    Posts
    311
    Thanks
    17
    Thanked 0 Times in 0 Posts
    it would be more efficient to remove apache from the process altogether and just run the script through the PHP CLI.
    Fantastic! Thank You Inigoesdr! Finally a little light on the subject.
    Of course, as is so often the case, fresh answers do tend to breed fresh and burning questions like -- PHP CLI?
    I didn't know there was any such thing!
    While I look further into it under my own steam, I would be eternally grateful if you could offer just a bit more of your interpretation of the concept in the meantime.
    Thanks Again!
    Last edited by fuzzy1; 03-27-2009 at 04:44 AM.

  • #4
    Super Moderator Inigoesdr's Avatar
    Join Date
    Mar 2007
    Location
    Florida, USA
    Posts
    3,647
    Thanks
    2
    Thanked 406 Times in 398 Posts
    Quote Originally Posted by fuzzy1 View Post
    Of course, as is so often the case, fresh answers do tend to breed fresh and burning questions like -- PHP CLI?
    I didn't know there was any such thing!
    Yes, indeed. The PHP manual has a pretty in-depth explanation of what the CLI is and how to use it. Basically, it's an executable that runs the PHP interpreter without needing a server (such as Apache) to invoke it. This allows you to run commands and execute files from the command line, which uses fewer resources and potentially allows you to interact with PHP from other programs.
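
As a rough illustration of the above: a script run through the CLI receives its command-line arguments in $_SERVER['argv'] and has no browser session, so each instance needs some other unique value to write into the locked column. A minimal sketch, assuming a hypothetical file name worker.php and a hypothetical helper lock_token(); neither comes from the original script:

```php
<?php
// worker.php (hypothetical name). Run as: php worker.php 3

/**
 * Return a token identifying this worker.
 * Uses the first command-line argument when given, otherwise the process ID.
 */
function lock_token($argv)
{
    $id = isset($argv[1]) ? $argv[1] : getmypid();
    return 'worker-' . $id;
}

// PHP_SAPI is 'cli' when the script is started from the command line.
if (PHP_SAPI === 'cli') {
    $args = isset($_SERVER['argv']) ? $_SERVER['argv'] : array();
    echo 'Lock token: ', lock_token($args), "\n";
}
```

Several instances could then be started from a batch file or from separate command prompts, each with its own argument, for example php worker.php 1 and php worker.php 2.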

  • #5
    Regular Coder
    Join Date
    Apr 2006
    Posts
    311
    Thanks
    17
    Thanked 0 Times in 0 Posts
    Quite!
    IMHO the PHP Manuals are THE benchmark for ALL manuals everywhere. Simply Unparalleled. I have already pared down my script a bit and have successfully run it from DOS in 4 or 5 concurrent instances, and from a batch file as well. Still, I'm having a bit of a time effecting a seamless and automagic initiation of consecutive instances -- as need be -- from within the PHP script.
    It was my thought to replace the following bit from the script above
    PHP Code:
    $remaining = "SELECT COUNT(id) FROM user_searches WHERE active_status='1' AND locked='0'";
    $result = mysql_query($remaining) or die(mysql_error());
    $row = mysql_fetch_array($result);
    $remains = $row['COUNT(id)']; // "un-locked active searches"
    if ($remains > 1) {
        echo "<script>
            lz='this.php?page=$page';
            window.open(lz,'$page','width=400,height=200,toolbar=yes,location=no,directories=no,status=yes,menubar=yes,scrollbars=yes,copyhistory=no,resizable=yes');
        </script>";
    }

    with a variation something along the lines as shown below...
    PHP Code:
    $remaining = "SELECT COUNT(id) FROM user_searches WHERE active_status='1' AND locked='0'";
    $result = mysql_query($remaining) or die(mysql_error());
    $row = mysql_fetch_array($result);
    $remains = $row['COUNT(id)']; // "un-locked active searches"
    if ($remains > 1) {
        $bat = "C:\\wamp\\www\\cl_search_assist\\init.bat";
        echo exec($bat);
    }

    I've run at this in several different configurations but have yet to get it quite right. It seems that exec() behaves somewhat differently in the CLI, but I haven't been able to pin it down. Any thoughts? Thanks again! I do think this just may fit the bill.
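
One possible reason for the difficulty: exec() waits until the command it starts has finished, so the calling script cannot carry on while the batch file runs. A commonly used Windows idiom is to hand the command to "start /B" through popen() and close the handle straight away. A minimal sketch, assuming php is on the PATH; the script name worker.php and both helper names are hypothetical, and this has not been tested on the setup described above:

```php
<?php
/**
 * Build a Windows command line that starts a PHP script in the background.
 * The empty "" is a placeholder window title, so that "start" does not
 * mistake a quoted argument for the title.
 */
function background_command($script, $workerId)
{
    return 'start "" /B php ' . escapeshellarg($script) . ' ' . (int) $workerId;
}

/**
 * Launch the worker without waiting for it to finish (Windows only).
 */
function launch_worker($script, $workerId)
{
    $handle = popen(background_command($script, $workerId), 'r');
    if ($handle === false) {
        return false;
    }
    pclose($handle);
    return true;
}

// Usage (hypothetical path):
// launch_worker('C:\\wamp\\www\\cl_search_assist\\worker.php', 2);
```

Keeping the command string in its own function makes it easy to echo and inspect before anything is actually launched.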
    Last edited by fuzzy1; 03-27-2009 at 04:46 AM.

  • #6
    Regular Coder
    Join Date
    Apr 2006
    Posts
    311
    Thanks
    17
    Thanked 0 Times in 0 Posts
    the problem with your approach is that scraping is generally very inefficient using a web scripting language.
    Also, I would be remiss if I didn't inquire... are you aware of a viable alternative to scraping???
    Thanks!

