
View Full Version : load data from huge file into an array as fast as possible



Rooseboom
02-25-2007, 08:23 AM
Hi,

I've got certain data which I can't store in a db (it is too much data, don't ask) so I'll store it in an optimized file structure. Now I want to read the data from that file and get it into a specific array format:

$file_array[$id_a][$id_b] = $value; with data like:

$file_array[1023][50123435] = 10023;
$file_array[1023][50035768] = 00234;
$file_array[1023][50003452] = 00037;
$file_array[1023][50002345] = 00002;
$file_array[566978][50000343] = 023493;
$file_array[566978][50123435] = 004543;
$file_array[566978][50003452] = 000039;

The number of items in this array can get up to 2 million!! It currently takes about 4 seconds to load 1.6M items into the array from a file which is stored like:

1023 => array ( '50123435' => '10023', '50035768' => '00234', '50003452' => '00037', '50002345' => '00002')
566978 => array ( '50000343' => '023493', '50123435' => '004543', '50003452' => '000039')

Is there a better way to store the data in the file from which I can fill the array more quickly (<1sec.)?
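
Would something as simple as serialize()/unserialize() be the way to go, for example? A rough sketch of what I mean (the file name is just a placeholder):

// written once, when the data set is generated:
file_put_contents('/path/to/data.ser', serialize($file_array));

// loaded on each request:
$file_array = unserialize(file_get_contents('/path/to/data.ser'));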

Thanks!

kenetix
02-27-2007, 04:04 PM
Databases are supposed to load MUCH faster than flat files, as they're designed to retrieve data directly from the disk sectors, unlike flat files where access has to go through certain OS restrictions. I'd suggest using a database and changing the field types to allow you to store large amounts of data.

You might also want to try splitting the data into separate tables; for example, forum 'subject' and forum 'message' are broken into two separate tables to keep the individual table sizes down.

Indexing the database tables might also help; most current database systems have indexing features that allow faster retrieval and more efficient sorting of data.
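
Something along these lines would let MySQL hand you just the slice you need instead of the whole data set (the table and column names below are made up for illustration):

// hypothetical schema, one row per (id_a, id_b) pair:
// CREATE TABLE file_values (
//     id_a  INT UNSIGNED NOT NULL,
//     id_b  INT UNSIGNED NOT NULL,
//     value INT UNSIGNED NOT NULL,
//     PRIMARY KEY (id_a, id_b)
// );

$db = new mysqli('localhost', 'user', 'pass', 'mydb');

// fetch only the rows for one id_a instead of loading everything
$result = $db->query('SELECT id_b, value FROM file_values WHERE id_a = 1023');
$slice = array();
while ($row = $result->fetch_assoc()) {
    $slice[$row['id_b']] = $row['value'];
}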

MySQL is capable of handling very large databases. phpBB's forums are an example: they have over 10 million threads, and the site isn't slow to load at all.

ralph l mayo
02-27-2007, 04:23 PM
Denormalized file structures can be much faster than a database when you don't care about things like transactions and atomicity, the overhead of which dwarfs whatever operating-system costs the database doesn't also have to pay.

IIRC phpBB archives threads to separate tables so only a small number of however many millions total are ever a factor in most queries.

Per the OP, please post how you're reading the data now and what the file structure is. Ideally you'd load the data into memory once and serve it many times, meaning load time wouldn't be so much of a concern as random access time, which is probably acceptable. Maybe someone can clarify how you can share memory in PHP like you can in the mod_* extensions.
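
The shmop extension is one way to do it without leaving PHP, though I haven't measured it at this scale; a rough sketch (the key and segment size are arbitrary):

$key  = 0xDA7A;           // arbitrary shared memory key
$size = 64 * 1024 * 1024; // segment size, must be large enough for the serialized data

// writer: create the segment and store the serialized array (run once)
$shm = shmop_open($key, 'c', 0644, $size);
shmop_write($shm, serialize($file_array), 0);

// reader (a later request): attach, read, strip the unused tail, unserialize
$shm = shmop_open($key, 'a', 0, 0);
$file_array = unserialize(rtrim(shmop_read($shm, 0, shmop_size($shm)), "\0"));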

If this really needs to be fast you could write an Apache module in C to read the data and share it with PHP clients.

CFMaBiSmAd
02-27-2007, 04:52 PM
Since you don't indicate what this data is, how it's used, or how often it gets updated..., we can only guess, but I'll guess that you only display, access, or process a small, select portion of it on any web page that you output to a browser. If so, you can save yourself a lot of loading time by using a database and only accessing the data you need on any particular page.

marek_mar
02-28-2007, 12:48 AM
Using a database with a properly structured and indexed table should do the trick.

GJay
02-28-2007, 12:55 AM
You can put things in memory with something like memcached or with the APC caching extension. With memcache:


$memcache = new Memcache;
$memcache->connect('localhost', 11211) or die('Could not connect');

if (false === ($data = $memcache->get('big_data'))) {
    $data = get_lots_of_data(); // this takes a long time
    $memcache->set('big_data', $data);
}

or with apc:


if (false === ($data = apc_fetch('big_data'))) {
    $data = get_lots_of_data(); // this takes a long time
    apc_store('big_data', $data);
}

http://php.net/memcache
http://php.net/apc

marek_mar
02-28-2007, 01:05 AM
.. or with streams:
http://www.php.net/manual/en/wrappers.php.php
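
e.g. reading the file through a stream and parsing it line by line instead of pulling it all in at once; assuming a tab-separated "id_a id_b value" layout (just an example format), something like:

$file_array = array();
$fp = fopen('/path/to/data.txt', 'r'); // any stream wrapper can be used here, e.g. compress.zlib://
while (($line = fgets($fp)) !== false) {
    list($id_a, $id_b, $value) = explode("\t", rtrim($line));
    $file_array[$id_a][$id_b] = $value;
}
fclose($fp);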

Rooseboom
04-19-2007, 05:22 PM
Sorry for my late response; I've been away, and the subject recently became a priority again.

In total, many gigabytes (>250GB) of these numbers are stored, and each request (several every second) asks for a different part of the total.

A single request can involve up to 2M items, stored after each => array like this:

1023 => array ( '50123435' => '10023', '50035768' => '00234', '50003452' => '00037', '50002345' => '00002')
566978 => array ( '50000343' => '023493', '50123435' => '004543', '50003452' => '000039')

This is just a very small sample.
I now turn them into an array to use for calculations:

$file_array[1023][50123435] = 10023;
$file_array[1023][50035768] = 00234;
$file_array[1023][50003452] = 00037;
$file_array[1023][50002345] = 00002;
$file_array[566978][50000343] = 023493;
$file_array[566978][50123435] = 004543;
$file_array[566978][50003452] = 000039;

Once loaded into $file_array, some calculations are done: items that appear in more than one list have their values added together:

both 1023 and 566978 have item 50123435 with total value 10023+004543
both 1023 and 566978 have item 50003452 with total value 00037+000039

Every item is evaluated like this, and only items that appear in all of the lists (there can be two, three, four, five, etc.) are finally sent back.
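
In pseudo-PHP the evaluation step boils down to something like this (simplified sketch, not my actual code):

// keep only the item ids that occur in every list, then total their values
$lists  = array_values($file_array);
$common = array_shift($lists);                     // start from the first list
foreach ($lists as $list) {
    $common = array_intersect_key($common, $list); // drop ids missing from any other list
}

$totals = array();
foreach ($common as $id_b => $unused) {
    $sum = 0;
    foreach ($file_array as $list) {
        $sum += (int) $list[$id_b];                // values are zero-padded strings
    }
    $totals[$id_b] = $sum;
}
// e.g. $totals[50123435] == 10023 + 4543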


