PDA

View Full Version : Full text searching


quadrant6
04-06-2004, 09:32 AM
I have a database table with 30 columns which will hold around 70,000 records.

Up to 12 of the table columns will need to be searchable from a web page (PHP). There will be a drop down menu next to each search input field (title,description etc) with 2 options for searching "exact phrase" and "all words"

Searching for the "exact phrase" can be done with:

WHERE column LIKE %$user_input%;

And for "all words", I've added a fulltext index on those columns and use:

WHERE (description,title etc) MATCH AGAINST ('$user_input')

There are downsides to this fulltext searching though so I'm wondering if there is another way enable this sort of search ("all words") ?

raf
04-06-2004, 10:16 AM
In such case, i just build the where clause dynamically.
Like (supposing it's a space delimited list)


$sql="select ... FROM table";
$clause = ' WHERE 1=2';
$collection=explode(' ',$user_input);
foreach($collection as $keyword){
$clause .= " OR title LIKE '%" . $keyword . "%'";
}
$sql .= $clause

or if you have multiple columns:

$sql="select ... FROM table";
$columns=array('title', 'description', 'blabla');
$clause = ' WHERE 1=2';
$collection=explode(' ',$user_input);
foreach($collection as $keyword){
foreach($columns as $var){
$clause .= " OR " . $var . " LIKE '%" . $keyword . "%'";
}
}
$sql .= $clause


Note : before treating the userinput like this, make sure you check the user-input for wildcards, for the number of keyword, that you trim it, that you remove common words like 'the, or, and, a, an, ...'

bcarl314
04-06-2004, 01:51 PM
I have a database table with 30 columns which will hold around 70,000 records.



This sounds like it may not be normalized. You may want to look into normalization of your database to improve it's effeciency.

Just a thought

raf
04-06-2004, 08:53 PM
This sounds like it may not be normalized. You may want to look into normalization of your database to improve it's effeciency.

Just a thought
:confused: You can tell that from the fact that the table has 30 columns?
And why should it be normalised? If the db needs to be optimized for fast searches, then a big flat table will probably be better suited.

quadrant6
04-14-2004, 11:45 AM
Thanks raf :thumbsup:

bcarl314
04-15-2004, 07:39 PM
:confused: You can tell that from the fact that the table has 30 columns?
And why should it be normalised? If the db needs to be optimized for fast searches, then a big flat table will probably be better suited.

Just a guess. Very rarely have I seen databases in 3rd normal form or higher with tables that have more than 20 fields. Granted I'm not saying that there's not a need for a 30 column table, its just been my experience that all too often people view database tables as spreadsheets and do not look at the underlying data structure.

I was simply pointing out the fact that they may want to double check it.

Oh yeah, and my magic 8-ball said "Hard to say" when I asked it if his tables were normalized. :D

mat
05-19-2004, 02:27 AM
Raf,

I was wondering if you could help with a problem I'm having using the search (building the SQL dynamically and using LIKE %..)

At the moment If I search for 'amb', I get 'gamble'. This is, I take it because of the 'LIKE % '".input.''%'.

Is there a way around this? so that is only matches words? i.e: If i enter into the keywords box: 'amb, utu' it only brings back results with both of those words and not gamble, future etc?

:confused:

mat
05-19-2004, 04:46 AM
After a bit of search the only solution seems to be via regular expressions:

(^|[^a-zA-Z])search_string([^a-zA-Z]|$)

SELECT * WHERE name REGEXP '(^|[^a-zA-Z])search_string([^a-zA-Z]|$)'";

firepages
05-19-2004, 06:43 AM
I wouldn't be searching the table at all , with that much data you will be better off with a couple of keyword and lookup tables . ( and supporting indexing scripts;) )

mat
05-19-2004, 07:32 AM
Could you give more info on how that is done please?

You are right, The search is very slow now.

firepages
05-19-2004, 08:45 AM
ok , well several ways :D

This is the way I do it, which was basically snarfed from the logic of PHPdig , I looked at several other existing systems but this made more sense for me....

3 tables
#
# Table structure for table `sh_engine`
#

CREATE TABLE `sh_engine` (
`spider_id` int(9) NOT NULL default '0',
`key_id` int(9) NOT NULL default '0',
`weight` int(9) NOT NULL default '0',
KEY `spider_id` (`spider_id`),
KEY `key_id` (`key_id`)
) TYPE=MyISAM;

# --------------------------------------------------------

#
# Table structure for table `sh_keywords`
#

CREATE TABLE `sh_keywords` (
`key_id` int(9) NOT NULL auto_increment,
`keyword` varchar(50) NOT NULL default '',
PRIMARY KEY (`key_id`),
UNIQUE KEY `keyword` (`keyword`)
) TYPE=MyISAM;

# --------------------------------------------------------

#
# Table structure for table `sh_spider`
#

CREATE TABLE `sh_spider` (
`spider_id` int(9) NOT NULL auto_increment,
`title` varchar(150) NOT NULL default '',
`path` varchar(50) NOT NULL default '0',
`uid` int(9) NOT NULL default '0',
`created` timestamp(14) NOT NULL,
PRIMARY KEY (`spider_id`),
KEY `path` (`path`)
) TYPE=MyISAM;


2 scripting issues , 1 how to fill the tables , 2 how to search the tables

1 is the easier if more complex as you have to write a spider , though its still easier as we at least have a fixed goal ... counting words !

[note that some systems spider as they go along , e.g. each time a page/product/data is added the search DB is updated , others spider the whole site at once (I prefer the all at once thing myself), others do both !!]

So you need to write a script that crawls through your database/and or site which grabs the unique wordlist for each given record or page.


during this `spidering` process ...

...every time you find a new word that does not already exist in the sh_keywords database add it.
you may want to filter out common or useless words ,'the' ,'a', ' .NET' etc ;)


...every time you find what constitutes a new page , record it in sh_spider , title and path are used for display/link purposes (uid and created are optional here) , so sh_spider ends up as a site map of your site , each record being a unique page (how you handle dynamic pages (prod_id=1 etc) is up to you)


in a given page (sh_spider.spider_id ) you want to add to the sh_engine the keywords you found (their id in the keyword table not the word itself !) + the spider_id of the current page + the relevance of the word in that page (e.g. how many times it was found).. e.g.
word 'PHP' (sh_keywords.key_id = 10)was found 50 times in the page (sh_spider.spider_id =5 ) you would
INSERT INTO sh_engine VALUES( 5 , 10 , 50 );

how you do the weighting (here I just put the word count in) is up to you and debatable , you may wish to apply further rules to that.

...........................

ok so now you have some scary looking tables (keywords and engine will be quite big but the main table sh_engine is lean to search)

how you actually query those tables depends on many things :D , but a simplistic (yet effective) approach would be...

First off get the ( filtered and trim()ed ) search keywords from your user input ,
Grab those keyword id's from the sh_keyword table.
Next you search for those keyword_id's in the sh_engine table JOINing the sh_spider as you go (oredering by relevance) and there you have a basic set of search results.

so you end up with 2 or 3 queries to get the search results (as you may want to grab some data from each of the pages found for display purposes (that path is in sh_spider)

but they are quite efficient queries, you can play with how you seach the results to come up with ways of recognising `natural phases` etc though thats when things start to slow down , though its amazing how accurate even simple keyword search results can appear even using the basic approach above.


Thats the best way I can explain it off the cuff , any questions please ask + note that this is only one way of acheiving stuff , I can't say its the best way , only that it seemed the best to me and that it has translated into results quite well for me.
The major disadvantage to this system is the size of the search tables, that is however only of concern from a disk space perspective , not the actual efficiency.

raf
05-19-2004, 11:36 AM
Mat,

If you wanna do better then a keywordsearch (which is the same as saying you wanna do better then almost all other searchfeatures on almost all other sites), then you'll need some specials. Like making an optimised searchtable like firepages outlined.

My initial thought would also be regex, but maybe you can do it just like

select var from table where var2 LIKE '% keyword1%keyword2%keyword3 %'

so this means you'll first make a % delimited list of your 'keywords', and then add a space in front and at the end of it, and use that as a value.

but maybe you're just making it hard on yourself or maybe i just don't understand your situation.