Matching similar items and best practice

11-21-2011, 10:30 AM
Morning all

I've been thinking of ways to match different database records (let's call them XXX), and give them a "match" score. Whether this is a percentage, or simply "yes" or "no", I haven't decided, but I don't think it'll make much of a difference. One important thing to note is that I would expect a table of approximately 10000 records. Since a lot of these matches will be calculated via tags, we could then potentially have a hell of a lot of queries/joins.

So far I have come up with three solutions:

1) Write a function to do the matching, and when viewing an XXX have the server calculate the scores live. This is the most up-to-date method, although potentially very slow if the site actually grows to levels I'd be happy with.

2) Use a CRON job during quieter times to calculate matches from one XXX to another. We could then store these in a new table, say 'matches', with the two XXXIDs and a float/integer score. Obviously this means the matches are delayed somewhat, and the table could be huge: 10000 * 10000 * 0.5 = 50 million rows. Unless of course you only keep matches above a certain threshold.
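To give a feel for option 2, here's a rough sketch in Python (standing in for the PHP I'd actually use) of what the batch job might do: score each pair of XXXs by tag overlap — Jaccard similarity, purely as an example metric — and keep only pairs above a threshold before writing them to the 'matches' table. The item IDs, tags and threshold are all made up.

```python
from itertools import combinations

# Hypothetical data: item_id -> set of tags. In reality this would be
# loaded from the database by the CRON job.
items = {
    1: {"php", "sql"},
    2: {"php", "sql", "cron"},
    3: {"css"},
}
THRESHOLD = 50                  # keep matches of 50% or better

matches = []
for a, b in combinations(items, 2):          # each unordered pair once
    tags_a, tags_b = items[a], items[b]
    # Jaccard similarity as a percentage: shared tags / all tags.
    score = round(100 * len(tags_a & tags_b) / len(tags_a | tags_b))
    if score >= THRESHOLD:
        matches.append((a, b, score))        # rows for the 'matches' table

print(matches)  # only the pair (1, 2) clears the threshold
```

Note that `combinations` already visits each unordered pair exactly once, which is where the 0.5 factor in the size estimate comes from.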

3) Use a CRON job as above, but store the data in another field in the XXXs table. This field would contain a string of related XXXs (limited to however many we decide to keep), possibly something like "356,89;6456,88;14,78;", in which the XXX with ID 6456 has a match of 88%. Again, this would eliminate low matches.

Or... secret option!

4) Stick an additional field, 'lastUpdated' or something, in the XXX table. If the record was updated in the last 24/48/168 hours, we run option 1) and generate a fresh set of matches; if not, we conserve our power and simply read the stored matches.
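Option 4 could be sketched like this, again in Python standing in for PHP. The row shape, field names (`last_updated`, `cached_matches`) and the `compute_matches` callback are all hypothetical, just to show the freshness check:

```python
import time

MAX_AGE = 24 * 3600            # 24 hours; could be 48h or a week instead

def get_matches(row, compute_matches, now=None):
    """Return this XXX's match list, recomputing it (per option 1) only
    if the record itself changed within the last MAX_AGE seconds."""
    now = time.time() if now is None else now
    if now - row["last_updated"] <= MAX_AGE:
        # Record changed recently, so its cached matches may be stale:
        # regenerate and re-cache them.
        row["cached_matches"] = compute_matches(row)
    return row["cached_matches"]
```

For example, a row last touched long ago just returns its cached list, while a recently edited row triggers a fresh calculation.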

Thanks in advance for any input. Which would you choose and why? Maybe I'm missing something. I appreciate this is not specific to PHP but it's what I'll be using. What do you call XXX in this case? Entity? Widget? Thing? This is more of a theoretical debate rather than a request for code.

11-21-2011, 10:31 AM
P.S. On my phone, so apologies for spelling errors etc.!

11-21-2011, 06:16 PM
This sounds to me more like a normalization issue. 10,000 is not very many records, and a SQL query will be more than sufficient to handle it, but if the data isn't indexed properly you will have slow lookups.
You refer to the use of tagging. This isn't really a problem: create a flattening (junction) table between a record and its tags to implement a many-to-many relationship. Then you can retrieve all records that have been tagged with a certain tag or tags, and use whatever aggregate functions in SQL you need.
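To make the flattening-table idea concrete, here's a self-contained sketch using Python's sqlite3 (any SQL engine would do). The table and column names (`items`, `tags`, `item_tags`) and the sample data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, label TEXT);
    -- The 'flattening' (junction) table: one row per item/tag pair.
    CREATE TABLE item_tags (
        item_id INTEGER REFERENCES items(id),
        tag_id  INTEGER REFERENCES tags(id),
        PRIMARY KEY (item_id, tag_id)   -- also acts as an index for joins
    );
""")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, "alpha"), (2, "beta"), (3, "gamma")])
conn.executemany("INSERT INTO tags VALUES (?, ?)",
                 [(1, "php"), (2, "sql"), (3, "tagging")])
conn.executemany("INSERT INTO item_tags VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 3)])

# Aggregate query: for a given item, rank the other items by how many
# tags they share with it (a simple self-join on the junction table).
query = """
    SELECT other.item_id, COUNT(*) AS shared_tags
    FROM item_tags AS mine
    JOIN item_tags AS other
      ON other.tag_id = mine.tag_id AND other.item_id != mine.item_id
    WHERE mine.item_id = ?
    GROUP BY other.item_id
    ORDER BY shared_tags DESC
"""
rows = conn.execute(query, (1,)).fetchall()
print(rows)  # → [(2, 2)]: item 2 shares both of item 1's tags
```

Because the junction table is indexed by its primary key, the self-join stays fast even at the 10,000-record scale mentioned above.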
If you have data that looks like '356,89;6456,88;14,78;', then yes, you have a major normalization issue, and you must follow one of the approaches you've listed above, preferably during quiet time. Un-normalized data cannot be indexed usefully (the index won't make any sense for what you want it to do), so you'd need to scan every record available to develop the proper results. Indexed data, on the other hand, does not suffer from this problem.