automatically convert keywords into links

09-16-2003, 11:10 PM
I'm working on a rather large application that will include a help function with definitions etc.

I'm planning to set up a table whose rows hold the keywords and their synonyms, abbreviations, plurals, ... plus a short description and a definition.
(Or maybe 2 tables: one with a primary key, short description and definition, and another with one row for each keyword/synonym/plural/... and the PK of table 1 as a foreign key.)

Now, my problem is that I want to write a function that automatically converts these keywords, in all my pages, into links to the definition page, with the primary key value from that table in the query string. They should appear as links: if the user clicks one, he gets the definition; if he hovers over it, he gets the short description (like a title).
This should be done at runtime, and only on specific parts of the page (not the title etc.).

I have absolutely no idea what would be the best way to write a function that does this.

I could select the complete table and dump it into 2 arrays: one array containing all keywords, synonyms, ... and one array containing the corresponding links, like ("<a href=\"lexicon.php?pk=10\" title=\"De eerste dag van de maand\">Begindag</a>")

and then run a str_replace on all strings in the page where I want this conversion, but I think there must be a more efficient way to do this.
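For reference, a minimal sketch of that two-array idea, assuming the table rows have already been fetched into `$rows`. The column names `keyword`, `pk` and `short`, the second sample row and the helper name `linkKeywordsNaive()` are made up for illustration; they are not from any existing code.

```php
<?php
// Hypothetical rows, as they might come back from the keyword table.
// Only the first row mirrors the example above; the second is invented.
$rows = [
    ['keyword' => 'Begindag', 'pk' => 10, 'short' => 'De eerste dag van de maand'],
    ['keyword' => 'Einddag',  'pk' => 11, 'short' => 'De laatste dag van de maand'],
];

// Build the two parallel arrays: keywords and their link markup.
$search  = [];
$replace = [];
foreach ($rows as $row) {
    $search[]  = $row['keyword'];
    $replace[] = sprintf(
        '<a href="lexicon.php?pk=%d" title="%s">%s</a>',
        $row['pk'],
        htmlspecialchars($row['short'], ENT_QUOTES),
        htmlspecialchars($row['keyword'], ENT_QUOTES)
    );
}

// Replace every keyword in $text with its link, one keyword after another.
function linkKeywordsNaive(string $text, array $search, array $replace): string
{
    return str_replace($search, $replace, $text);
}

echo linkKeywordsNaive('De Begindag is vast.', $search, $replace), "\n";
```

Note that `str_replace()` is case-sensitive and also matches inside longer words and inside HTML attributes, so this is only a starting point.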

Does anyone have any ideas?

09-16-2003, 11:15 PM
Wow! I believe this is the first time I've ever seen raf ask for help! Couldn't you use the preg_replace() function?


09-16-2003, 11:27 PM
So basically anywhere in a page where a designated keyword exists you want to automatically associate and hyperlink that word to another page as well as put in a short title description?

Kind of like CNN does with Company/Business/Corporation names?


Not sure if they are doing this automatically but whenever they put a big business name in their articles it gets hyperlinked to their stock market info.

09-16-2003, 11:37 PM
Well you would have to parse each page. Define the starting point to be after the <body> tag and stop at the </body> tag. Probably will want to ignore keywords that exist as part of other links, img sources, etc.

There is really no more efficient way than using str_replace or preg_replace. Use str_replace unless you need to get fancy with it.

As for making it more efficient... hmmmm, just thinking out loud here, but maybe you could make it learn each time it parses a page. Each time it parses the page, you could permanently change each keyword into its link, store the keywords found in the page in your database together with the page's last-modified date, and then next time, if the page has not been modified, there is no need to parse it again... or something along those lines.
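A rough sketch of that idea, using a file on disk instead of the database to keep it short. The function name `renderWithCache()`, the file-based storage and the `$linker` callable (any function that turns raw text into linked text) are assumptions for illustration only.

```php
<?php
// Return the linked version of $sourceFile, re-parsing it only when the
// source is newer than the stored, already-linked copy in $cacheFile.
function renderWithCache(string $sourceFile, string $cacheFile, callable $linker): string
{
    // PHP caches stat() results; clear them so filemtime() is current.
    clearstatcache();

    // Reuse the stored copy if it is at least as new as the source page.
    if (is_file($cacheFile) && filemtime($cacheFile) >= filemtime($sourceFile)) {
        return file_get_contents($cacheFile);
    }

    // Otherwise parse the page once and store the result for next time.
    $linked = $linker(file_get_contents($sourceFile));
    file_put_contents($cacheFile, $linked);

    return $linked;
}
```

A stored copy like this also goes stale when the keyword list itself changes, so the last update time of the keywords would have to be compared as well.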

09-17-2003, 12:25 AM
Hi raf,

you might want to take a look at this (http://www.codingforums.com/showthread.php?s=&threadid=22416) thread. Mordred came up with a solution to highlight a certain keyword in a page without messing up the HTML that contains the same keyword, for example in a class attribute.

It is not exactly what you are looking for, but I think it is related to your problem and should help you to make a start!

Cheers, Michiel

09-17-2003, 01:03 AM
Regular expressions are definitely tools suited for this purpose. However, they have their limitations. They are not particularly aware of their context and can't substitute for a real parser. Some problems you need to watch out for:

1.) Because regexp replacements are done sequentially, you could end up overwriting changes made by a previous regexp! Example: first you replace "apple wine" with "apple [alcoholic liquid]", then you search for "al" to replace it with "[al bundee]", and you get... "apple [[al bundee]coholic liquid]". It looks perhaps a bit obscure, but it's a real problem, and if you insert markup, it can screw up your whole page.

2.) It's quite hard to distinguish between "real" text and text in HTML attributes. The script Michiel linked to tries to replace only those words that stand alone and are not part of HTML tags. However, the application for which this script was written was fairly simple and did not contain much markup - things will be different in your case.
I later encountered a problem with URLs that were replaced. You have to employ this pattern instead, which narrows the chances for replacement accidents:

$find[] = '/(\s|\b)(?<![\<\.])' . preg_quote($t[$i], '/') . '(\s|\b)(?!\>)/i';

If you can, do those replacements only on raw text. That'll make things easier. And Spookster's thoughts about caching the output are very important, if we're talking about efficiency here.
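One possible way to sidestep both problems, sketched under some assumptions: split the markup into tags and text, skip the tags and anything inside an existing link, and replace all keywords in one pass so that inserted markup is never scanned again. The function name `linkKeywordsInText()` and the `$defs` structure are hypothetical. The sketch assumes plain ASCII keywords that start and end with a word character, and the simple tag pattern is still no substitute for a real parser (a `>` inside an attribute value would confuse it).

```php
<?php
/**
 * Link keywords in the text between tags only, in a single pass.
 * $defs maps lower-case keyword => ['pk' => int, 'short' => string]
 * (a hypothetical structure chosen for this sketch).
 */
function linkKeywordsInText(string $html, array $defs): string
{
    if ($defs === []) {
        return $html;
    }

    // Longest keywords first, so "apple wine" is tried before "apple".
    $keywords = array_map('strval', array_keys($defs));
    usort($keywords, function ($a, $b) {
        return strlen($b) <=> strlen($a);
    });

    // One alternation pattern for all keywords, whole words only.
    $quoted = array_map(function ($k) {
        return preg_quote($k, '/');
    }, $keywords);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

    // Odd indexes hold tags, even indexes hold the text between them.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $insideLink = false;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // A tag: only track whether we are inside an existing <a>.
            if (preg_match('/^<a[\s>]/i', $part)) {
                $insideLink = true;
            } elseif (preg_match('/^<\/a\s*>/i', $part)) {
                $insideLink = false;
            }
            continue;
        }
        if ($insideLink || $part === '') {
            continue;
        }
        // Single pass over the text; the matched spelling is kept as-is.
        $parts[$i] = preg_replace_callback($pattern, function ($m) use ($defs) {
            $def = $defs[strtolower($m[0])];
            return sprintf(
                '<a href="lexicon.php?pk=%d" title="%s">%s</a>',
                $def['pk'],
                htmlspecialchars($def['short'], ENT_QUOTES),
                $m[0]
            );
        }, $part);
    }

    return implode('', $parts);
}
```

Because every keyword is matched by the same pattern in one pass, a short keyword can no longer be replaced inside markup that an earlier replacement inserted.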

09-17-2003, 01:41 AM
Originally posted by mordred
And Spookster's thoughts about caching the output are very important, if we're talking about efficiency here.

lol I use and design lookup caches all the time at work and it didn't even occur to me that that is what I had just described.. :D

09-17-2003, 09:50 AM
Thanks guys. All great ideas.

I don't know about the caching, because this won't be a fixed list of keywords, so even if I did some sort of caching I'd need to run the page through my function each time it is called, to replace the new keywords. So I would need to check the caching datetime and then build an array with the more recent keywords, and I would need some regex to avoid replacing keywords inside the <a> tags in my strings (with my current setup I'm planning on keeping links out of text paragraphs, but if I stored the processed pages, there would be links in the text paragraphs). On top of that, almost all content will be pulled from the db, and I'm planning on working with user profiles to show users only the info they need (you know, more detailed, in-depth info for experts and just a more general, shorter explanation for general users), so there's a good chance that caching these pages won't pay off.

I may have to change my opinion later on, but for now, I don't think the advantages will be big enough to build such a construction.

I'm gonna look into mordred's thread and then just give it a try to see how efficient it all is.
I'm gonna create and store the arrays in a separate file (SSI) each time the keyword table is updated. Also, inside my content tables, I'm gonna add a "replace" column which indicates whether that record's values need to be run through the function. I think that if the arrays don't need to be created each time, this would cut out most of the processing time.
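A minimal sketch of how such a generated file could work, assuming the keyword data is already available as a PHP array. The function names `writeKeywordFile()` and `loadKeywordFile()` and the shape of `$defs` are made up for illustration.

```php
<?php
// To be called whenever the keyword table is updated (hypothetical hook):
// dump the array as PHP source that simply returns it.
function writeKeywordFile(array $defs, string $file): void
{
    $code = "<?php\nreturn " . var_export($defs, true) . ";\n";

    // Write to a temporary file first and rename it afterwards, so a page
    // request never includes a half-written file.
    $tmp = $file . '.tmp';
    file_put_contents($tmp, $code);
    rename($tmp, $file);
}

// On each page request: no query and no array building, just an include.
function loadKeywordFile(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    return include $file;
}
```

With this setup the database is only touched when the keywords change; every page request just includes the generated file.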

Anyway, thanks for the input, and if you get new ideas (or if you have some ideal code just lying around), please let me know.