string similarities percentage


Beck
04-23-2003, 12:26 AM
I found a function a few months back that takes two strings and returns their percent similarity. I guess that means something like

similar('washington', 'wachington'); (note the difference in spelling)

might return about .98 or so. Does anyone know either the algorithm taking place or where I can get that function? I really wish I had kept it and saved it somewhere.

Thanks in advance!

Vladdy
04-23-2003, 01:04 PM
If you treat X as a character's position in the string and Y as its ASCII (or Unicode) code, you can program a correlation function... :thumbsup:
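
A rough sketch of what that could look like, assuming "correlation" means the Pearson correlation of the two strings' character codes taken position by position. The name charCodeCorrelation is made up for illustration, and only the overlapping positions are compared.

```javascript
// Pearson correlation between the character codes of a and b,
// taken position by position over the shorter string's length.
// Returns NaN when there are fewer than two positions to compare
// or when either string has no variation (e.g. 'aaaa').
function charCodeCorrelation(a, b) {
    var n = Math.min(a.length, b.length);
    if (n < 2) return NaN;

    var i, meanA = 0, meanB = 0;
    for (i = 0; i < n; i++) {
        meanA += a.charCodeAt(i);
        meanB += b.charCodeAt(i);
    }
    meanA /= n;
    meanB /= n;

    var cov = 0, varA = 0, varB = 0, da, db;
    for (i = 0; i < n; i++) {
        da = a.charCodeAt(i) - meanA;
        db = b.charCodeAt(i) - meanB;
        cov += da * db;
        varA += da * da;
        varB += db * db;
    }
    return cov / Math.sqrt(varA * varB);
}
```

The result runs from -1 to 1 rather than 0 to 1, so it would still need rescaling before it could be shown as a percentage.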

Ökii
04-23-2003, 02:55 PM
quick guess

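function testem(a, b)
{
    // Declare everything with var so nothing leaks into the global scope.
    var len = Math.max(a.length, b.length);
    var matches = 0;
    // Count the positions where both strings have the same character.
    for (var i = 0; i < len; i++)
    {
        matches += (a.charAt(i) == b.charAt(i)) ? 1 : 0;
    }
    alert(a + ' is ' + Math.round((matches / len) * 100) + '% similar to ' + b);
}
testem('washington', 'wachington'); // 9 of 10 positions match, so this reports 90%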

Vladdy
04-23-2003, 03:35 PM
The problem with this approach is that it can only catch a substituted letter. It fails completely as soon as a character is inserted or dropped, in cases like:

"washington" " washington"
"washington" "waschington"
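
One common way to cope with insertions like these is edit distance (Levenshtein distance), which counts the fewest single-character insertions, deletions and substitutions needed to turn one string into the other. Below is a minimal sketch, not necessarily the function Beck originally found; the names levenshtein and similarity, and the choice to divide by the longer string's length, are assumptions for illustration.

```javascript
// Levenshtein edit distance between a and b, computed row by row
// so only two rows of the table are kept in memory.
function levenshtein(a, b) {
    var prev = [];
    var curr, i, j, cost;
    // Distance from the empty prefix of a to each prefix of b.
    for (j = 0; j <= b.length; j++) prev[j] = j;

    for (i = 1; i <= a.length; i++) {
        curr = [i]; // distance from a's first i characters to the empty string
        for (j = 1; j <= b.length; j++) {
            cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
            curr[j] = Math.min(
                prev[j] + 1,        // delete a character from a
                curr[j - 1] + 1,    // insert a character from b
                prev[j - 1] + cost  // substitute, or keep a match
            );
        }
        prev = curr;
    }
    return prev[b.length];
}

// Scale the distance into a 0..1 similarity score.
function similarity(a, b) {
    var longest = Math.max(a.length, b.length);
    if (longest == 0) return 1; // two empty strings are identical
    return 1 - levenshtein(a, b) / longest;
}
```

With this, similarity('washington', 'wachington') comes out at 0.9, and both pairs above score about 0.91, since each is a single insertion measured against an 11-character string.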

liorean
04-23-2003, 05:19 PM
One problem here is how to handle pronunciation similarities and how to quantify positional distance, not only character code. By ASCII code, ti and si (compare lotion - explosion) are closer to each other than se, ze or ce (compare offense - defence; colonize - colonise), yet in pronunciation there is hardly any difference in either case. How about the distance between the keys on the "qwerty" keyboard layout?
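
As a rough sketch of the keyboard idea: place each letter on a grid and take the straight-line distance between two keys. The row strings, the row offsets (approximate stagger in key widths) and the names keyPosition and keyDistance are all assumptions for illustration, and only the 26 letter keys are covered.

```javascript
// Letter rows of a US qwerty layout, top to bottom.
var QWERTY_ROWS = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm'];
// Approximate horizontal stagger of each row, in key widths (an assumption).
var ROW_OFFSETS = [0, 0.25, 0.75];

// Returns {x, y} for a letter key, or null for anything else.
function keyPosition(ch) {
    if (ch.length != 1) return null;
    ch = ch.toLowerCase();
    for (var row = 0; row < QWERTY_ROWS.length; row++) {
        var col = QWERTY_ROWS[row].indexOf(ch);
        if (col != -1) return { x: col + ROW_OFFSETS[row], y: row };
    }
    return null;
}

// Straight-line distance between two keys, in key widths.
// Returns null if either character is not a letter key.
function keyDistance(c1, c2) {
    var p1 = keyPosition(c1);
    var p2 = keyPosition(c2);
    if (!p1 || !p2) return null;
    var dx = p1.x - p2.x;
    var dy = p1.y - p2.y;
    return Math.sqrt(dx * dx + dy * dy);
}
```

Neighbouring keys such as q and w are 1 apart, so a small distance could be treated as a likely typo and weighted as a cheaper mismatch than two keys on opposite sides of the keyboard.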

Beck
04-23-2003, 10:43 PM
yeah, I'm not real sure on the standard, which makes this kind of a strange algorithm request (considering if you don't know the rules then how do you apply them, right?). What exactly makes two strings "similar"?

The spelling idea was a good one. I did think to compare the letters position by position, like Ökii tried, but as Vladdy points out, once there is an insertion into the string the characters no longer line up, so a word that is off by only 1 character receives a nearly 100% NONsimilarity, even though it's in fact very similar.

I wouldn't know how to judge key distances on the qwerty keyboard (gosh, I've been looking for an excuse to type the word qwerty for a long time :p ) using javascript except manually, but that sounds offhand kinda ridiculous and tedious.

I'm also pretty clueless about handling ASCII or Unicode manipulation. Hope someone else has that wired or knows of a very straightforward tutorial they can post a link to.
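
For reference, the character-code part mostly comes down to two built-ins, charCodeAt and String.fromCharCode. A small sketch follows; the helper names charCodes and codeDifference are made up for illustration.

```javascript
// charCodeAt gives the numeric code of the character at an index,
// and String.fromCharCode turns a code back into a character.
var codeOfA = 'A'.charCodeAt(0);          // 65
var letter = String.fromCharCode(97);     // 'a'

// Collect the codes of every character in a string.
function charCodes(s) {
    var codes = [];
    for (var i = 0; i < s.length; i++) {
        codes.push(s.charCodeAt(i));
    }
    return codes;
}

// How far apart two single characters are by code.
function codeDifference(c1, c2) {
    return Math.abs(c1.charCodeAt(0) - c2.charCodeAt(0));
}
```

For example, codeDifference('t', 's') is 1 while codeDifference('s', 'c') is 16, which is the kind of number a code-based comparison would have to work with.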

Thanks for the ideas, though, guys! I knew this was the forum to bring that one to.