
View Full Version : PHP "Artificial Intelligence" - Site Search that learns??



Pennimus
06-04-2007, 10:27 AM
Over the weekend I've been working on a theoretical model for a PHP/MySQL driven site search consisting of multiple scripts and databases. As I developed the idea it became apparent that what I was imagining is a system that could "learn" about misspellings of the entries in my database and so, in theory, become more and more efficient as time goes on.

I'm working on this because the users of one site I'm working on often have a poor grasp of English. In addition, spellings for many of the entries in the site's database aren't standardised, so there isn't really one version that is absolutely correct.

While doing some research on advanced matching I became aware of the similar_text() function. It seems to offer something quite powerful, but it comes with stark warnings about how server-intensive it is. Because my anticipated use of the search feature on the site is *huge*, I wanted to minimise the server load from each request as much as possible. One route for doing this seemed to be to make the search system learn about misspellings on the fly, so that as time goes on the need for similar_text() becomes less and less.
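
For anyone who hasn't come across it, this is roughly the kind of brute-force comparison I'm trying to avoid running on every search - just a throwaway sketch, with a made-up candidate list and a made-up 60% cut-off rather than anything from my flow chart:

<?php
// Throwaway sketch only - $candidateTitles and the 60% cut-off are placeholders.
$query = 'beleive';
$candidateTitles = array('believe', 'receive', 'benign');

$best = '';
$bestPercent = 0;
foreach ($candidateTitles as $title) {
    similar_text($query, $title, $percent); // third argument receives the match %
    if ($percent > $bestPercent) {
        $bestPercent = $percent;
        $best = $title;
    }
}

if ($bestPercent >= 60) {
    echo "Did you mean \"$best\"? ($bestPercent% similar)";
}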

That's enough waffle for now. I've mocked up a flow chart (http://www.adambunn.co.uk/Search%20Flow%20Chart.gif) that shows the processes the various scripts run through when a search is made. There's also a key describing the necessary databases. The reason I'm posting this on CodingForums is that I know there are some great minds here, and no doubt there are things I've missed or things that could be made more efficient; maybe it's already been done, or maybe some people want to collaborate on actually developing the scripts, etc.

http://www.adambunn.co.uk/Search%20Flow%20Chart.gif

Note 1: The flow chart is set up with my specific needs in mind, where queries are made on one field which is just the title of the entry - essentially a dictionary-like function. However, I'm sure it could be adapted for a wider site search.

Note 2: While creating this flow chart I was convinced that similar_text() was actually called similar_string() - so wherever I mention similar_string() assume I mean similar_text().

marek_mar
06-04-2007, 01:47 PM
You should use soundex()/metaphone() instead. They are much faster and more useful, as they return a value you can compare against. It won't be a "learning" search engine, though.
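
A quick sketch of what I mean by a comparable value - because the functions return a key, you could store it in its own indexed column and match misspellings with a plain WHERE clause instead of comparing the query against every row:

<?php
echo soundex('believe'), "\n";   // B410
echo soundex('beleive'), "\n";   // B410 - identical key, so the two spellings match
echo metaphone('believe'), "\n"; // BLF
echo metaphone('beleive'), "\n"; // BLF - identical again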

printf
06-04-2007, 02:22 PM
Normally what I do is take the search keywords and first run them through a spell-check routine. If I get a match for all keywords, I just stem the words and perform the database search. If some words don't match, I get the closest (3-5) suggestions for the words that didn't match, or the single suggestion when the spell checker returns exactly one match (like litttle == little, or poeple == people), and add them to the search request - using relevance, so the suggestions without an exact single suggested word follow the suggestions that do have an exact single-word match - then stem the search words and perform the search!

Afterwards, I add each misspelled word to the auto-change word list, so the next time a spell check is done, the auto-change list replaces any misspelled word that has already been flagged, while the spell checking is being done. This way the search system learns as more searches are done. For closed-in applications like a forum search system, I maintain a unique list of words for each user; this way the search engine learns how each user likes to search for things. There is a lot more involved in a refined closed-in search system, because you can add all kinds of neat stuff that makes the search system work for each user the way they want.
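
To make that a bit easier to follow, here's a stripped-down outline of the flow - the auto_change table, spell_suggestions() and stem_word() are names I've made up for this post, not anything from my actual class:

<?php
// Stripped-down outline only. The auto_change table, spell_suggestions() and
// stem_word() are placeholder names for this post, not my real class.

function prepare_keywords(PDO $db, array $keywords)
{
    $ready = array();

    foreach ($keywords as $word) {
        // 1. Auto-change list first: misspellings already learned from past searches.
        $stmt = $db->prepare('SELECT correct FROM auto_change WHERE wrong = ?');
        $stmt->execute(array($word));
        if ($correct = $stmt->fetchColumn()) {
            $ready[] = $correct;
            continue;
        }

        // 2. Otherwise run the spell checker. An empty result means the word was
        //    spelled correctly; one or more suggestions get added to the search,
        //    with a single suggestion treated as an exact fix.
        $suggestions = spell_suggestions($word); // hypothetical helper
        if (count($suggestions) === 0) {
            $ready[] = $word;
        } else {
            $ready = array_merge($ready, array_slice($suggestions, 0, 3));

            // Remember the misspelling so that step 1 can catch it next time.
            $db->prepare('REPLACE INTO auto_change (wrong, correct) VALUES (?, ?)')
               ->execute(array($word, $suggestions[0]));
        }
    }

    // Stem everything, then perform the actual database search on the result.
    return array_map('stem_word', $ready);
}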

Pennimus
06-04-2007, 02:23 PM
That looks interesting... thanks.

The way I'm reading it, it looks like I could replace the similar_text() part of the script with something built around metaphone(), but still carry on with the "learning" part of things as planned.

Or would you anticipate that simply using metaphone() for all searches would actually be quicker and less server intensive than building a database with matched misspellings?

My thinking is that by doing that, PHP will usually only have to consult 1 or 2, or occasionally 3, fields before finding a match, without having to resort to using metaphone() at all.

What are your thoughts?


EDIT: In reply to printf.

I need to take a little time to swallow what you just said... but from my first reading it looks like you more or less have the same system as the one I'm theorising, perhaps with the different parts executed in a slightly different order. Do you use similar_text or soundex/metaphone for the spellchecking?

printf
06-04-2007, 03:39 PM
No, I don't use any of the PHP linguistic functions, because they don't really match the way most misspelled words are actually written. Most of those functions use typographical-error reasoning on its own, when phonetic similarity most times needs to be included in the mix as well.

Look at this example; it's pretty bad, but it appears 2,016 times in 214,324 searches.


avhe

The (3) nearest matches are (ave, ache, av he), but if you include phonetically similar reasoning you get the exact match (have). Here's another, similar one...

found 862 times in 103,873 searches.


aveh

The (3) nearest matches are (ave, aves, aver), but if you include phonetically similar reasoning you get the exact match (have)

These are just simple examples, but they can screw a search result up big time, because they will match what the user really doesn't want, or fail to match what the user does want. Obviously you wouldn't be searching for (have) - it's just an example - but less common words cause the same problems, and that ruins searches; it's something the search engine can fix if good reasoning is being learned from each search. I track all search results (just the misspelled words), so I get an idea of which misspelled words appear and how frequently they appear, and so I know when they need to be added to the auto-change list so the searches return better matching relevance.
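
The tracking itself is nothing fancy - something along these lines, with the table and column names invented for this post:

<?php
// Illustration only - the search_typos table is an invented name, and typo is
// assumed to be its unique key. Each time the spell checker flags a word, bump
// its counter; once a typo shows up often enough, it's worth promoting to the
// auto-change list.

function log_misspelling(PDO $db, $typo, $bestSuggestion)
{
    $stmt = $db->prepare(
        'INSERT INTO search_typos (typo, suggestion, hits)
         VALUES (?, ?, 1)
         ON DUPLICATE KEY UPDATE hits = hits + 1'
    );
    $stmt->execute(array($typo, $bestSuggestion));
}

// Reviewing candidates for the auto-change list is then a single query, e.g.
// SELECT typo, suggestion, hits FROM search_typos WHERE hits > 500 ORDER BY hits DESC;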

When I go back to the office on Wednesday I'll get you a copy of my search / spell-check class; they are both learning engines that work really well! I don't have an example search engine up at the moment so I can't show how that works, but you can try the spell checker's auto-change feature. In a spell-checker environment, you use the auto-change feature so the user doesn't have to waste time changing common mistakes that the spell checker can fix automatically, because it has learned the mistake from seeing it many times before!

// place this in the box...


audeince
audiance
availalbe
aveh
avhe
awya
aywa
bakc
balence
ballance
baout
barin
bcak
beacuse
becasue
becaus
lerans
levle
libary
lible
librery
lief
lieing
liekd
liesure
lieutenent
liev
likly
lisense
litature
literture
littel
litttle
liuke
tjhe
tje

http://ws.ya-right.com/spell.php

Pennimus
06-04-2007, 04:23 PM
Yes, this is the gist of what I wanted to achieve, except without having to add the misspellings to the auto-change field (or whatever you want to call it) manually - in my proposed system it would be handled automatically (or maybe you also do it automatically and it just wasn't clear to me), incorporating user feedback from when you serve a "did you mean xyz?" result. This is important because it means the whole system scales very well.
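
Just to be concrete about that feedback step, the handler would look something like this - the table name and request parameters are placeholders, not lifted from my flow chart:

<?php
// Placeholder sketch of the "did you mean xyz?" feedback handler. When the user
// clicks the suggestion link, the confirmed mapping is stored so future searches
// can skip the expensive matching entirely. All names are invented for this post.

if (isset($_GET['original'], $_GET['confirmed'])) {
    $db = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

    $stmt = $db->prepare(
        'INSERT INTO learned_misspellings (misspelling, correct_title, confirmations)
         VALUES (?, ?, 1)
         ON DUPLICATE KEY UPDATE confirmations = confirmations + 1'
    );
    $stmt->execute(array($_GET['original'], $_GET['confirmed']));
}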

But it does mean I need to decide on a method for deciphering the misspellings in the first place - similar_text(), or metaphone(). Ultimately I suppose only testing can determine which one will be more accurate for my particular database.
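
If it helps anyone trying the same comparison, the test I have in mind is nothing more elaborate than this (the word pairs are just placeholders taken from the examples above):

<?php
// Throwaway comparison harness - the word pairs are placeholders.
$pairs = array(
    'avhe'    => 'have',
    'litttle' => 'little',
    'libary'  => 'library',
);

foreach ($pairs as $typo => $intended) {
    similar_text($typo, $intended, $percent);
    $phoneticHit = (metaphone($typo) === metaphone($intended)) ? 'yes' : 'no';
    printf("%-8s -> %-8s  similar_text: %5.1f%%  metaphone match: %s\n",
           $typo, $intended, $percent, $phoneticHit);
}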

Anyway, I look forward to seeing your script.

matak
06-04-2007, 08:28 PM
If you need help with the idea of making a search engine, there are a few thoughts I had in mind, and I would like to join whatever community you've got, or help you start a new one, when it comes to search engines.

Most people here who read about the future of the web probably realise that it's all going to be in databases. So the primary thing when building a search engine is having a massive database of information.

The second and most obvious thing is building a smart search engine that will correct misspelled words. Maybe it would also be good to run a similar engine when storing information in the database, because the people who misspell aren't only the people who search, but also the people who write websites.

The third thing, and probably the only thing that would set this "new" search engine apart from existing ones, is the user approach, along with website reviews and ratings. When you compare the biggest search engines now, you can see that their logic is fast, but not all that correct.

Most of the engines rate sites based on the number of visitors, and that can be manipulated with massive advertising and other ways of promoting websites that don't have the needed quality. A new search engine should be smart, fast, and user friendly. It could never support as many searches as the best engines today, but the user side would give the engine some benefits, because users would personally rate sites and would work for the engine. Like one big happy family :).

aedrin
06-04-2007, 08:32 PM
Most of the engines rate sites based on the number of visitors

I don't know of any big search engine that does this.

What the next evolution in search engines will be has already been discovered, so there is little value in discussing what it is. The problem is getting there.

matak
06-04-2007, 08:48 PM
What will be the next evolution in search engines has already been discovered

Can you share that information? I tend to miss the obvious things :rolleyes:

Do you mean it's going to be something like that OpenCyc thing?

aedrin
06-04-2007, 09:30 PM
I can't think of the exact buzzword they use at the moment, but the basic thought is this.

Currently, HTML has little meaning. Something can be marked as <p>, or <h1>, but you can assume nothing from this. It could be a paragraph, or a header, but it could also be used as many different things. It's the concept behind 'markup' versus 'meaning'. <b> denotes something is bold. <strong> denotes that something has a stronger meaning.

If search engines could read this information correctly, search results would be a lot more valid. And this is what the next evolution would be: a search engine that understands the content of websites.

Granted, search engines already try to implement this, but until HTML gets updated it is still guesswork. And even then, all existing websites would need to be updated.

matak
06-04-2007, 09:39 PM
The idea that comes to my mind... riiight, oh here it is: a search engine that reads CSS and flags CSS with excessive design code. Going way off here... :)

Why did people make it so complex? You have h1 for the main heading and so forth down to h6, p for paragraphs, b (strong), etc.

Hm, maybe it would be a really good idea to look into that CSS/TAG denote thingy :D

E.g. once I searched for something about ADSL - what it's about - and among the results I found a quite simple website by one Finnish professor, on which she had written a WOW article explaining ADSL technology. If I had found that maybe 10 years ago, I would be the main ADSL provider in my country today :no_appropriate_smiley:

Design killed the web! :D

aedrin
06-04-2007, 10:11 PM
Correct.

While all these graphics can entertain you, the real purpose of the internet was to share information and make it available to everyone.

Few people use it for this purpose nowadays, and many won't even consider a clean website that shows information in its most efficient state - a state without any design elements (basic markup is required, of course).

Fumigator
06-04-2007, 10:22 PM
The term is "semantic HTML" I believe... (google it)

matak
06-04-2007, 10:42 PM
Do you guys think that with XML, the option to hide the source, and HTML, the whole web is going to become fully commercial?
I believe the internet has gone in a bad direction, from exchanging information to selling information. :mad:

aedrin
06-04-2007, 10:47 PM
The term is "semantic HTML" I believe... (google it)

Ah, yeah.

I'm not good with buzzwords. ;)


Do you guys think that with XML, the option to hide the source, and HTML, the whole web is going to become fully commercial?

Such is the fate of most successful free services. Either the owner is unable to operate because of costs, or they are tempted by money. Either way, at one point or another the service will sell out and become part of a large corporation. It is slowed by the fact that the internet is technically not owned by anyone. However, it is 'controlled' by all ISPs.

matak
06-04-2007, 11:10 PM
@Fumigator

I googled that semantic markup... and whoa, I didn't know about half of the things there; can't wait to use them on my next template.

2nd result (http://brainstormsandraves.com/articles/semantics/structure/)

'bout future

The internet is a really complex technology. It's sad that it's controlled by our ISPs, which are in turn controlled depending on which backbone they use; the backbones are again ISPs, just on a greater level.
But when you look at it, we are all connected by wires, and I think that maybe not in the near future, but sometime far away, it's possible someone will find a way to use those wires to connect computers with a new protocol that can't be controlled. This line of thinking is giving me symptoms I don't like, so I'll give up right now :rolleyes:.

Yes, search engines - that's what we were talking about... lol

marek_mar
06-05-2007, 01:30 AM
That looks interesting... thanks.

The way I'm reading it, it looks like I could replace the similar_text() part of the script with something built around metaphone(), but still carry on with the "learning" part of things as planned.

Or would you anticipate that simply using metaphone() for all searches would actually be quicker and less server intensive than building a database with matched misspellings?

My thinking is that by doing that, PHP will usually only have to consult 1 or 2, or occasionally 3, fields before finding a match, without having to resort to using metaphone() at all.

What are your thoughts?
You should try pspell (http://www.php.net/pspell) for spell checking. It should work much faster than any pure-PHP implementation. A database of misspellings would be vast (as each word can have many misspellings). There is also a difference between finding words that are spelled similarly and words that are pronounced similarly (metaphone checks the latter).
Again, the learning bit won't be there unless by "learning" you mean that a user adds a word here or there. That is not the computer learning. Making a computer learn is a lot more difficult than that. :p
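
Basic usage looks something like this (it needs the pspell extension and an aspell English dictionary installed, so treat it as a sketch):

<?php
// Requires the pspell extension and an aspell dictionary for 'en'.
$dictionary = pspell_new('en');

$word = 'beleive';
if (!pspell_check($dictionary, $word)) {
    $suggestions = pspell_suggest($dictionary, $word);
    echo 'Did you mean: ' . implode(', ', array_slice($suggestions, 0, 3));
}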

Pennimus
06-05-2007, 09:48 AM
Aw man, my thread's been 'jacked :rolleyes:

I'm not interested in building a web search engine, so I'll take my leave of this thread.

marek_mar: Yes, clearly it doesn't actually "learn", hence the preponderance of quotes whenever you or I have used the word "learn" in this thread. :thumbsup: But thanks for all your suggestions... I'm now better equipped to go away and work on this.

aedrin
06-05-2007, 04:25 PM
I'm not interested in building a web search engine, so I'll take my leave of this thread.

I thought that is what this whole thread was about? Ideas about building a better search engine?

Don't just call it hijacked because you weren't part of the discussion. It only applies when the topic has been changed.

Pennimus
06-05-2007, 04:47 PM
Well, I was specifically talking about site search - not web search, which is an entirely different thing. When you start getting into semantic HTML, web crawling and so on it really goes far beyond site search into a whole new realm.

I started the thread specifically to discuss the searching and "learning" functions I was thinking about and had laid out in the flow chart I linked to. Now that the topic has evidently "moved on", as it were - I'm not bitter about that at all :thumbsup: It's just that, as I said, I'm not interested in web search...

aedrin
06-05-2007, 05:10 PM
I was specifically talking about site search - not web search

My bad.

People tend to use site, web and website interchangeably, so I did not pick up on the scope.


