PDA

View Full Version : Efficient Sorting.


NancyJ
08-12-2006, 05:08 PM
Hey guys, need a bit of brainstorming help. We're having a problem with an application - its basically a hotel availability listing that gets data from 2 external sources, then compiles the results into a single listing, sorted by cheapest price.
I've taken a few chunks out of the code - just to keep it to the point and not overload you with info that you dont need.
The problem is that searches are using way too much CPU, like up to 40% for a single search - now I'm assuming the problem is the sorting by price and compositing the two arrays but I could be wrong - what worries me is that I've seen 1 search that took 13s, 12.1% CPU, 8% Memory and only returned a single result. Now how can that be?
Anyway - this is the code

//soap call to get data from first source goes here

//check if we got a response
if($this->response)
{

foreach($this->response as $offer)
{
///the data we get back is quite complex its structure is basically a Hotel, then a listing of all available rooms and prices
$arrHotels[(string)$offer->Hotel->Id] = array("HotelName"=>$offer->Hotel->Name, "Resort"=>$offer->Hotel->Region->Name, "Stars"=> $offer->Hotel->Stars, "Thumbnail"=>$offer->Hotel->Photo->ThumbnailUrl, "Accom" =>$offer->Hotel->Type, "Type"=> "OHG", "Rooms"=>array());
//$offer->Result is the rooms
foreach($offer->Result as $result)
{
$arrHotels[(string)$offer->Hotel->Id]["Rooms"][]= array("RoomDesc" => str_ireplace(array(" pax", " ssv", " sv", " mv"), array(" passengers", " sea side view", " sea view", " mountain view") ,$result->Room->RoomType->Text). " ".$result->Room->MealType->Text, "Price"=>$result->Room->SellingPrice->Amount);
}
}

//sort OHG data into cheapest

$startTime = microtime(true);

foreach($arrHotels as $key => $val)
{
$numOfRooms = count($val['Rooms']);
$rooms +=$numOfRooms;
$arrPrices = array();
for($i=0; $i< $numOfRooms; $i++)
{
$arrPrices[] = $val['Rooms'][$i]['Price'];
}
asort($arrPrices);
$tempPrices= array();
foreach($arrPrices as $key2=>$val2)
{
$tempPrices[] = $val['Rooms'][$key2];
}
$arrHotels[$key]['Rooms'] = $tempPrices;

$arrCheapestPrices[$key] = $arrHotels[$key]['Rooms'][0]['Price'];
$arrOHGHotels[$key] = $arrHotels[$key]['HotelName'];
}

asort($arrCheapestPrices);


}

//dump OHG hotel data into array (easier than renaming code later)
$arrHotels1 = $arrHotels;
//clear arrHotels array.
$arrHotels = array();
$start = time();

//Med Hotels search
//dont search 1star - not worth it
$min = $_GET['r']>1?$_GET['r']:2;
//loop through each star rating or until 15 seconds passed
for($s=5; $s>=$min; $s--)
{
if(time()-$start < 15)
{
//CURL request to get data from second source goes here

$xml = @simplexml_load_string($rawData);

$hotels = $xml->Reservation;

if($hotels->Hotel)
{
foreach($hotels->Hotel as $hotel)
{

if(is_array($arrHotels[(string)$hotel['ID']]))
{
if((float)$hotel->Star_Category >= $s)
{
if(is_numeric(substr($hotel->Room_Title, 0,1)))
{
//if the first character is space we need to strip off everything from the first . or space.
$pos = strpos($hotel->Room_Title, ".");
if($pos === false)
{
$pos = strpos($hotel->Room_Title, " ");
}
$hotel->Room_Title = substr($hotel->Room_Title, $pos+1);
}
$arrHotels[(string)$hotel['ID']]["Rooms"][]= array("RoomDesc"=>$hotel->Room_Title, "Price"=>$hotel->Price);
}
}
else
{
if(is_numeric(substr($hotel->Room_Title, 0,1)))
{
//if the first character is space we need to strip off everything from the first . or space.
$pos = strpos($hotel->Room_Title, ".");
if($pos === false)
{
$pos = strpos($hotel->Room_Title, " ");
}
$hotel->Room_Title = substr($hotel->Room_Title, $pos+1);
}

$arrHotels[(string)$hotel['ID']]= array("HotelName"=>str_replace(" (".ucwords(strtolower($hotel->Area)).")", "",$hotel->Hotel_Name), "Resort"=>ucwords(strtolower($hotel->City)), "Stars"=>$hotel->Star_Category, "Thumbnail"=>$hotel->Thumbnail, "Accom" => "Hotel", "Type" => "MedHotels", "Rooms" => array(array("RoomDesc"=>$hotel->Room_Title, "Price"=>$hotel->Price)));


}
}
}
unset($xml);
}
}

//sort by price again
foreach($arrHotels as $key => $val)
{
$numOfRooms =count($val['Rooms']);
$rooms +=$numOfRooms;
$arrPrices = array();
for($i=0; $i< $numOfRooms; $i++)
{
$arrPrices[]=$val['Rooms'][$i]['Price'];
}
asort($arrPrices);
$tempPrices= array();
foreach($arrPrices as $key2=>$val2)
{
$tempPrices[]= $val['Rooms'][$key2];
}
$arrHotels[$key]['Rooms']=$tempPrices;

$arrCheapestPrices[$key]= $arrHotels[$key]['Rooms'][0]['Price'];
}
asort($arrCheapestPrices);


//ok we have 2 arrays now - we need to remove duplicates.
//medhotels list should be smallest because of star rating issue so we'll loop through that one
foreach($arrHotels as $key2 => $hotel)
{

$key = array_search($hotel['HotelName'], $arrOHGHotels);
if($key!= false)
{
//we have duplicate
//check for cheapest
if($arrCheapestPrices[$key]>$hotel['Rooms'][0]['Price'])
{
//we need to replace that data
$arrCheapestPrices[$key]= $hotel['Rooms'][0]['Price'];
//we also need to change the type to OHG because of the key being used to indentify it
$hotel['Type']= "OHG";
$arrHotels1[$key]= $hotel;


}
}
else
{
//no duplicate so lets just plop it on the end
$arrHotels1[$key2]= $hotel;
}
}

asort($arrCheapestPrices);

$this->hotels = $arrHotels1;
$this->prices = $arrCheapestPrices;
unset($arrHotels1);
unset($arrHotels);
unset($arrCheapestPrices);


So can anyone help with my CPU load issues or making this sort less intensive without losing functionality?

marek_mar
08-12-2006, 07:31 PM
I don't really know what this is doing. I mean it even sends a request.

NancyJ
08-12-2006, 09:01 PM
I don't really know what this is doing. I mean it even sends a request.
Huh? Sends a request? Not sure what you mean.

What it does is:

Loops through the hotels and puts the important data into an array
For each Hotel it loops through the rooms and adds that info to the array
Loops through the new hotels array,
For each hotel, loops through the rooms and sorts them by price
Stores the cheapest price in an array with the entries in the prices and hotels arrays having matching keys
Then does the same for the second feed
Then Loop through the smallest array and check if it is the same hotel as one in the other array, if it is then compare the cheapest prices, if the second hotel is cheaper than the first then replace the entry with the cheaper one and update the prices array.
Otherwise add the entry onto the end of the array.
Then asort the prices array to get the display order.

marek_mar
08-12-2006, 09:44 PM
Well you have comments in that code that suggest that a request is made.

Wouldn't it be easier to get all the data first and then sort it?

NancyJ
08-12-2006, 09:57 PM
Well you have comments in that code that suggest that a request is made.

Wouldn't it be easier to get all the data first and then sort it?
Yes we make a SOAP request to the 1st data source then a CURL request to a second data supplier. I took those code chunks out because they're not important to the problem and they contain sensitive data.

What difference would it make to do it in a different order?

marek_mar
08-12-2006, 10:02 PM
Yes well you're timing the connections so the sorting itself would be a lot faster without them. :) I'm pretty sure that if you'd sort one bigger array it would be faster that sorting smaller ones and then joining them. Ultimately you could use a database or request the data pre-ordered.

NancyJ
08-12-2006, 10:13 PM
Even if the data came preordered - it comes from 2 sources - they have to be composited together and still be ordered by price.
I dont see the value in inserting the data into a database just to request it out again sorted.
Yes well you're timing the connections so the sorting itself would be a lot faster without them.
Yes ofcourse it would be faster, there would be no data to sort at all. And strangely enough, sitting waiting for the results to be returned isnt that stressful on the CPU. The timer is because its better to return a limited result set within 30 seconds than to return the full set within 2 minutes. Its just to make sure that if the data supplier is running slow, just to take what we've got and output that, rather than try to get the rest.

marek_mar
08-12-2006, 10:35 PM
Yes ofcourse it would be faster, there would be no data to sort at all. And strangely enough, sitting waiting for the results to be returned isnt that stressful on the CPU. The timer is because its better to return a limited result set within 30 seconds than to return the full set within 2 minutes. Its just to make sure that if the data supplier is running slow, just to take what we've got and output that, rather than try to get the rest.
Sitting there and doing nothing isn't stressfull on the CPU (becouse nothing is being done) but the memory that willstay at the level before it started waiting.

By using a database I ment to actually store it in there not just to sort rows. Though when you have 2 data providers this may not be an option.

Still you could combine the arrays with array_merge() and array_diff() though your arrays are rather complex so that might not work out too well. array_multisort() could be helpful becouse it would sort the price array the way you want it and re-order the other arrays (hotel name, room number and such) to match the price array.

NancyJ
08-12-2006, 10:53 PM
For the moment I'm only concerned with the CPU usage - not the memory. The memory levels are mostly fine. The biggest result set I've seen so far is 716 rooms in 189 Hotels.
No a database is not an option - this is live availability data. Storing it in a db would be impossible, regardless of the number of data suppliers since the data is live - it would have to be updated for every combination every second. Thats just not feasable.
I would imagine array_multisort would be a much more complex and CPU intensive operation that sorting a single dimension array... but it might be worth a try. array_diff() might make array_merge possible - I will need to run some benchmarking on that.

NancyJ
08-12-2006, 11:00 PM
It is difficult to benchmark though. You can do the same search twice and get very different results. Eg. the 189 Hotels 716 Rooms earlier today took 42% CPU and 11.3% memory, returned result within 16 seconds
Now I've run the same search again and got the same number of results but in 23 seconds with a cpu usage of only 12% - server load currently 1.8
I just dont understand it.

marek_mar
08-12-2006, 11:34 PM
Well I don't know how you get the statistics but you should make sure if it is php (or Apache) that is cousing the high CPU usage. Your arrays are actually quite big and this mightalso be a bit problematic.

Phill
08-12-2006, 11:51 PM
Is using an in-memory database like SQLite not an option?

It would at least externalise the sorting from your PHP process, that might help.

marek_mar
08-13-2006, 12:56 AM
You could also cache the data you get and store it somewhere so at least you wom't have to get them every request. If you'd choose a database for it you would be able to sort faster aswell. You would update the data every hour or so.

zealotgi
08-13-2006, 01:50 AM
I suggest using either a selection sort or an insertion sort

firepages
08-13-2006, 07:09 AM
I suspect the time lag is in the fetching of the external data rather than processing it, what are the benchmarks if you simply fetch and display the results from the 2 external sources ?

or run your code using cached data sources rather than fresh requests.. what does that do to processing time?

+ for larger result sets using the DB may indeed be an option, creating temprary tables in MySQL is faster than you may think...

NancyJ
08-13-2006, 08:18 PM
This is proving impossible to benchmark -
the results are screwy.
I'm getting more results using less memory and CPU - now the only reason you would get more results on one search than another is if it gave up waiting on one of the data sources.... so maybe it is the waiting that using the cpu and memory not the sorting - but I cant see why.
Interestingly though, using unset doesnt free up any memory... why is that? I took out all the unsets (just to see what happened) and there was no reduction in memory usage.

...gonna run some more tests

Phill
08-13-2006, 11:46 PM
Interestingly though, using unset doesnt free up any memory... why is that? I took out all the unsets (just to see what happened) and there was no reduction in memory usage.

I'm not an expert on any of this, but I think with memory, sometimes a piece of memory can be flagged as being 'available' even if it isn't freed as such. So, it will be unloaded as and when another application needs it, rather than you seeing an immediate drop in usage.

Not sure about that though, so don't quote me on it!