
View Full Version : Perform complex calculations entirely within MySQL



PikaD
01-23-2012, 03:04 PM
I currently have this working by passing bucket-loads of data to PHP and then doing all the calculations there. I want (and most likely need) to perform all the calculations using MySQL only. I know that it is possible to write functions and procedures in MySQL but I've never done it before and every time I sit down to get this written I confuse myself into a crying mess of self-disappointment. So....

Explanation of database:
Let's start with a database diagram:
http://i.imgur.com/Th3NL.jpg

You can immediately disregard the users table. We'll not bother with that in this question.

So this database holds information drawn from web pages. You can see the links table holds the URL and title, as well as the maximum term frequency for any term in that document.

The object table is just to allow both users and links to be foreign primary keys in the occurences table. BTW, I am aware of the constant mis-spelling of occurrences as occurences! =)

The occurences table holds the majority of data. The table has a primary foreign key of linkoruser (though just ignore user for this). It has word_id which refers to the dictionary table, tag_id which refers to the tags table and finally score which is the number of occurrences of that word, of that tag type, in that document.

The tags table has a tag id, the name of a tag (eg. title, or H1) and its weight which is normally set between 1 and 10.

The dictionary table has a word id, the actual word, and the inverse document frequency of that word. If you care what an IDF is then it's the number of web pages, divided by the number of web pages containing that specific word somewhere.

Explanation of the system's function:
Given a link ID (we'll call it BaseID), all other links are compared to it for similarity (cosine similarity) and the IDs, URLs and titles of all links are shown to the user, ordered from most to least similar.

Scores for each word are:
((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf
i.e. the word's number of occurrences in the document, multiplied by the tag weighting for this particular word's tag, divided by the maximum term frequency for that document. That result is then multiplied by the word's inverse document frequency to give the final score for that word, in that context.
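In SQL terms, that per-word score is roughly something like this (illustrative only, using the table and column names above, and limited to one example document):

SELECT o.linkoruser AS doc_id, d.word,
((o.score * t.weight) / l.max_tf) * d.word_idf AS word_score
FROM occurences AS o
JOIN tags AS t ON t.tag_id = o.tag_id
JOIN links AS l ON l.idlinks = o.linkoruser
JOIN dictionary AS d ON d.word_id = o.word_id
WHERE o.linkoruser = 726;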

The similarity is done via cosine similarity, best shown in the image below.
http://i.imgur.com/n3Bp8.jpg

So each page has its similarity to the BaseID, and the system shows the user all the page IDs, URLs and Titles, ordered by similarity from most to least similar.


Explanation of the problem:
Currently, massive amounts of data are dragged out of MySQL and handed over to PHP, which then does all the processing.

This is because I only know my way around basic MySQL. I know it is possible to use functions and procedures, but I am getting mighty tangled.

What I would like is to pass a BaseID to MySQL and have it return the page IDs, URLs and Titles, ordered by similarity from most to least similar.
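In other words, something along the lines of the following (a hypothetical procedure name, just to show the interface I'm after):

CALL getSimilarLinks(726); -- hypothetical; should return idlinks, url, title ordered by similarity, most similar first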

I know this is a massive question, and nobody's getting paid to sit here and churn out solutions. So I really appreciate that you've even read this far down!

Thanks in advance!

P.S. If you want a download of the database so you can have a fiddle, it's available here:
http://dl.dropbox.com/u/22289145/linksitev2.sql

Old Pedant
01-23-2012, 08:42 PM
Ummm...occurences table must also have a document id? Oh...I see...that's the linkoruser field, right?

This could probably be done without using temp tables, but it's likely easier to code by creating at least one and probably two temp tables as part of the stored procedure.

Hmmm...or maybe we should just add another field to the occurences table? No, wait. You already have the score in there.

Hmmm...not sure this isn't easier than you think.

Let me cogitate a while.

Old Pedant
01-23-2012, 08:50 PM
Okay, explain this:


The occurences table holds ... score which is the number of occurrences of that word, of that tag type, in that document.
...
Scores for each word are:
((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf

You use "scores" to mean two different things. Wouldn't "frequency" have been a better name for the occurences.score field?

Why can't you pre-compute the final score for each occurrence?? I don't see anything in your formula there which would be time-variant. Am I missing something?

Hmmm...or does word_idf change over time, as new documents are added? Looks like it. Still, you could pre-compute all but that last multiply, no?

The more tables you can leave out of the massive ugly JOIN that this will need, the better. So being able to leave out tags and links would be a big help.

Old Pedant
01-24-2012, 03:45 AM
One more question: So the summations used must take into account any word found in *EITHER* of the two documents, right?

But no, that's not really true. If a word has a "final score" of zero in either table (e.g., it's not there) then multiplying 0 by non-zero is still zero. So for the numerator we only care about words that are indeed in both documents. And for each of the sums in the denominator, we won't even *see* a zero value.

Hmmm...I can see this thing is going to take forever, but it doesn't seem very complex, actually.

So far, I have done these two operations on your occurences table:

alter table occurences add pscore float;

UPDATE occurences, tags, links
SET occurences.pscore = 1.0 * occurences.score * tags.weight / links.max_tf
WHERE occurences.tag_id = tags.tag_id
AND occurences.linkoruser = links.idlinks;

Adding that "pscore" field and then calculating it took several minutes for each operation, so that is several minutes that won't have to be repeated in each computation.

It looks to me like it's going to be worth creating at least one TEMP table: The first operation will be to go out and get the ||A|| value for each document and store that in the TEMP table. Though it's tempting to add it to the LINKS table.

You certainly aren't going to have two people running this query at the same time are you? And/or you aren't going to add a new document while the results are being generated? (Doing so would alter the dictionary.word_idf value midway through the operation, thus invalidating all work done to that point.)

Old Pedant
01-24-2012, 04:08 AM
Okay, I've now done these steps:


CREATE TABLE squares (
idlinks INT PRIMARY KEY,
sumofsquares FLOAT,
similarity FLOAT );

INSERT INTO squares ( idlinks, sumofsquares )
SELECT links.idlinks, SUM( POWER( (occurences.pscore * dictionary.word_idf), 2 ) ) AS ss
FROM links, occurences, dictionary
WHERE links.idlinks = occurences.linkoruser
AND occurences.word_id = dictionary.word_id;

When I convert this into a stored procedure, either the squares table will be a temp table or I will truncate it before doing the INSERT.

Does that math make sense to you so far?

[Pre-calculating the pscore values was a tremendous time saver.]

Old Pedant
01-24-2012, 08:31 AM
Okay, I completed it. But something seems to be wrong.

The similarity for link 726 *TO ITSELF* is not even close to the highest similarity score.

I've been over the numbers three times now, and they all seem right. But why doesn't the similarity score then make sense?

FWIW, here are the top 10 (highest) similarity scores to 726:


mysql> select * from similarities order by similarity desc limit 10;
+---------+-----------------+
| idlinks | similarity |
+---------+-----------------+
| 1933 | 0.000000118274 |
| 2869 | 0.000000109086 |
| 720 | 0.0000000935016 |
| 4516 | 0.0000000910589 |
| 2504 | 0.0000000862625 |
| 5677 | 0.0000000843 |
| 1132 | 0.0000000838255 |
| 2792 | 0.0000000828706 |
| 3799 | 0.0000000826505 |
| 5202 | 0.0000000824886 |
+---------+-----------------+
and for comparison:

mysql> select * from similarities where idlinks = 726;
+---------+-----------------+
| idlinks | similarity |
+---------+-----------------+
| 726 | 0.0000000527725 |
+---------+-----------------+


I'm going to try rerunning the numbers again, later, but I think it's time for you to chime in.

Old Pedant
01-24-2012, 08:43 AM
Here's a sample dump of some of the data that goes into creating the numbers.

pscore is that precalculated value of occurences.score * tags.weight / links.max_tf

What do you think?


mysql> select d.word, o.pscore, d.word_idf, o.pscore * d.word_idf, o.score
-> from dictionary as d, occurences as o
-> where d.word_id = o.word_id
-> and o.linkoruser = 725
-> limit 20;
+----------+----------+----------+-----------------------+-------+
| word | pscore | word_idf | o.pscore * d.word_idf | score |
+----------+----------+----------+-----------------------+-------+
| new | 0.333333 | 4376 | 1458.666710138321 | 1 |
| what | 0.333333 | 1619 | 539.6666827499866 | 1 |
| usernam | 0.333333 | 156 | 52.000001549720764 | 1 |
| password | 0.666667 | 436 | 290.6666753292084 | 2 |
| member | 0.333333 | 1287 | 429.0000127851963 | 1 |
| happen | 0.333333 | 766 | 255.33334094285965 | 1 |
| updat | 0.333333 | 2353 | 784.3333567082882 | 1 |
| world | 0.333333 | 2447 | 815.6666909754276 | 1 |
| expert | 0.333333 | 268 | 89.33333599567413 | 1 |
| favorit | 0.333333 | 618 | 206.0000061392784 | 1 |
| friend | 0.333333 | 2087 | 695.6666873991489 | 1 |
| twitter | 0.666667 | 2242 | 1494.6667112112045 | 1 |
| twitter | 0.666667 | 2242 | 1494.6667112112045 | 1 |
| twitter | 1 | 2242 | 2242 | 3 |
| email | 0.333333 | 2264 | 754.666689157486 | 1 |
| celebr | 0.333333 | 1308 | 436.00001299381256 | 1 |
| sign | 0.333333 | 2908 | 969.3333622217178 | 1 |
| sign | 0.666667 | 2908 | 1938.6667244434357 | 1 |
| sign | 1 | 2908 | 2908 | 3 |
| forgot | 0.333333 | 166 | 55.33333498239517 | 1 |
+----------+----------+----------+-----------------------+-------+

PikaD
01-24-2012, 12:25 PM
Here's a sample dump of some of the data that goes into creating the numbers.

pscore is that precalculated value of occurences.score * tags.weight / links.max_tf

What do you think?


mysql> select d.word, o.pscore, d.word_idf, o.pscore * d.word_idf, o.score
-> from dictionary as d, occurences as o
-> where d.word_id = o.word_id
-> and o.linkoruser = 725
-> limit 20;
+----------+----------+----------+-----------------------+-------+
| word | pscore | word_idf | o.pscore * d.word_idf | score |
+----------+----------+----------+-----------------------+-------+
| new | 0.333333 | 4376 | 1458.666710138321 | 1 |
| what | 0.333333 | 1619 | 539.6666827499866 | 1 |
| usernam | 0.333333 | 156 | 52.000001549720764 | 1 |
| password | 0.666667 | 436 | 290.6666753292084 | 2 |
| member | 0.333333 | 1287 | 429.0000127851963 | 1 |
| happen | 0.333333 | 766 | 255.33334094285965 | 1 |
| updat | 0.333333 | 2353 | 784.3333567082882 | 1 |
| world | 0.333333 | 2447 | 815.6666909754276 | 1 |
| expert | 0.333333 | 268 | 89.33333599567413 | 1 |
| favorit | 0.333333 | 618 | 206.0000061392784 | 1 |
| friend | 0.333333 | 2087 | 695.6666873991489 | 1 |
| twitter | 0.666667 | 2242 | 1494.6667112112045 | 1 |
| twitter | 0.666667 | 2242 | 1494.6667112112045 | 1 |
| twitter | 1 | 2242 | 2242 | 3 |
| email | 0.333333 | 2264 | 754.666689157486 | 1 |
| celebr | 0.333333 | 1308 | 436.00001299381256 | 1 |
| sign | 0.333333 | 2908 | 969.3333622217178 | 1 |
| sign | 0.666667 | 2908 | 1938.6667244434357 | 1 |
| sign | 1 | 2908 | 2908 | 3 |
| forgot | 0.333333 | 166 | 55.33333498239517 | 1 |
+----------+----------+----------+-----------------------+-------+



I just checked this forum now. Three things I really need to say:
1) I haven't read all this yet.
2) I am amazed someone is helping! All the other forums were mean to me ;)
3) Old Pedant is truly an outstanding username.

I will read through all this ASAP. Thanks for everything so far O.P.

PikaD
01-24-2012, 01:31 PM
Okay, explain this:

You use "scores" to mean two different things. Wouldn't "frequency" have been a better name for the occurences.score field?

Why can't you pre-compute the final score for each occurrence?? I don't see anything in your formula there which would be time-variant. Am I missing something?

Hmmm...or does word_idf change over time, as new documents are added. Looks like it. Still, you could pre-compute all but that last multiply, no?

The more tables you can leave out of the massive ugly JOIN that this will need, the better. So being able to leave out tags and links would be a big help.



Yes, frequency would indeed be better. I'm working with code previously written by someone else. I also think it would help if they'd spelled 'occurrences' correctly, but such is life.

Anyhoo, word_idf can indeed change as more pages are added to the database. Similarly, tag weight can be changed by the administrator at any time, which negates your pre-calculation of pscore I'm afraid. I'm just sorry I wasn't at a PC to catch this message when you asked.



One more question: So the summations used must take into account any word found in *EITHER* of the two documents, right?

But no, that's not really true. If a word has a "final score" of zero in either table (e.g., it's not there) then multiplying 0 by non-zero is still zero. So for the numerator we only care about words that are indeed in both documents. And for each of the sums in the denominator, we won't even *see* a zero value.

Hmmm...I can see this thing is going to take forever, but it doesn't seem very complex, actually.

So far, I have done these two operations on your occurences table:

alter table occurences add pscore float;

UPDATE occurences, tags, links
SET occurences.pscore = 1.0 * occurences.score * tags.weight / links.max_tf
WHERE occurences.tag_id = tags.tag_id
AND occurences.linkoruser = links.idlinks;

Adding that "pscore" field and then calculating it took several minutes for each operation, so that is several minutes that won't have to be repeated in each computation.

It looks to me like it's going to be worth creating at least one TEMP table: The first operation will be to go out and get the ||A|| value for each document and store that in the TEMP table. Though it's tempting to add it to the LINKS table.

You certainly aren't going to have two people running this query at the same time are you? And/or you aren't going to add a new document while the results are being generated? (Doing so would alter the dictionary.word_idf value midway through the operation, thus invalidating all work done to that point.)

Erm, in theory the system might be used by more than one user, though I think MySQL queues users, right? And yes, if a new doc is added the results are inaccurate, but the larger the database the smaller the inaccuracy.



Okay, I completed it. But something seems to be wrong.

The similarity for link 726 *TO ITSELF* is not even close to the highest similarity score.

I've been over the numbers three times now, and they all seem right. But why doesn't the similarity score then make sense.

FWIW, here are the top 10 (highest) similarity scores to 726:


mysql> select * from similarities order by similarity desc limit 10;
+---------+-----------------+
| idlinks | similarity |
+---------+-----------------+
| 1933 | 0.000000118274 |
| 2869 | 0.000000109086 |
| 720 | 0.0000000935016 |
| 4516 | 0.0000000910589 |
| 2504 | 0.0000000862625 |
| 5677 | 0.0000000843 |
| 1132 | 0.0000000838255 |
| 2792 | 0.0000000828706 |
| 3799 | 0.0000000826505 |
| 5202 | 0.0000000824886 |
+---------+-----------------+
and for comparison:

mysql> select * from similarities where idlinks = 726;
+---------+-----------------+
| idlinks | similarity |
+---------+-----------------+
| 726 | 0.0000000527725 |
+---------+-----------------+


I'm going to try rerunning the numbers again, later, but think it's time for you to chime in.

Chiming. Those similarities are very low. If it's any consolation the current system has a bug where occasionally relevancy is higher than 1.0! I'm sure this bug can be hunted.


Here's a sample dump of some of the data that goes into creating the numbers.

pscore is that precalculated value of occurences.score * tags.weight / links.max_tf

What do you think?


mysql> select d.word, o.pscore, d.word_idf, o.pscore * d.word_idf, o.score
-> from dictionary as d, occurences as o
-> where d.word_id = o.word_id
-> and o.linkoruser = 725
-> limit 20;
+----------+----------+----------+-----------------------+-------+
| word | pscore | word_idf | o.pscore * d.word_idf | score |
+----------+----------+----------+-----------------------+-------+
| new | 0.333333 | 4376 | 1458.666710138321 | 1 |
| what | 0.333333 | 1619 | 539.6666827499866 | 1 |
| usernam | 0.333333 | 156 | 52.000001549720764 | 1 |
| password | 0.666667 | 436 | 290.6666753292084 | 2 |
| member | 0.333333 | 1287 | 429.0000127851963 | 1 |
| happen | 0.333333 | 766 | 255.33334094285965 | 1 |
| updat | 0.333333 | 2353 | 784.3333567082882 | 1 |
| world | 0.333333 | 2447 | 815.6666909754276 | 1 |
| expert | 0.333333 | 268 | 89.33333599567413 | 1 |
| favorit | 0.333333 | 618 | 206.0000061392784 | 1 |
| friend | 0.333333 | 2087 | 695.6666873991489 | 1 |
| twitter | 0.666667 | 2242 | 1494.6667112112045 | 1 |
| twitter | 0.666667 | 2242 | 1494.6667112112045 | 1 |
| twitter | 1 | 2242 | 2242 | 3 |
| email | 0.333333 | 2264 | 754.666689157486 | 1 |
| celebr | 0.333333 | 1308 | 436.00001299381256 | 1 |
| sign | 0.333333 | 2908 | 969.3333622217178 | 1 |
| sign | 0.666667 | 2908 | 1938.6667244434357 | 1 |
| sign | 1 | 2908 | 2908 | 3 |
| forgot | 0.333333 | 166 | 55.33333498239517 | 1 |
+----------+----------+----------+-----------------------+-------+



The PSCORE seems to be a third, two thirds or one. Perhaps there's something there?


Also I think I'm missing out on some of your calculations.
Your squares table says:

SELECT links.idlinks, SUM( POWER( (occurences.pscore * dictionary.word_idf), 2 ) ) AS ss
FROM links, occurences, dictionary
WHERE links.idlinks = occurences.linkoruser
AND occurences.word_id = dictionary.word_id;

Which confuses me (maybe because I'm bad at maths!) as it doesn't seem to calculate similarity. Maybe I'm not seeing all of your calculations, or maybe I'm just being dumb.
To calculate similarity you create a list of all words that appear in either the base document or the document you are comparing it with. We will call these DOC1 and DOC2 respectively.

For each DOC/word pairing you take its true score, i.e. ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf.
If the word does not appear in one of the documents then its score is zero.

Let's assume the documents have a total of 6 words appearing in both. Obviously we'll be seeing a score of 1.0, but this is more about explaining the sum than the result.

We'll represent these doc/word pairing scores as D1W1, D2W1, D1W2, D2W2, D1W3, D2W3, D1W4, D2W4, D1W5, D2W5, D1W6, D2W6

Now the top part (T) of the similarity calculation is:
T = (D1W1 * D2W1) + (D1W2 * D2W2) + (D1W3 * D2W3) + (D1W4 * D2W4) + (D1W5 * D2W5) + (D1W6 * D2W6)

The bottom left (BL) is:
BL = SquareRoot((D1W1*D1W1) + (D1W2*D1W2) + (D1W3*D1W3) + (D1W4*D1W4) + (D1W5*D1W5) + (D1W6*D1W6))

The bottom right (BR) is:
BR = SquareRoot((D2W1*D2W1) + (D2W2*D2W2) + (D2W3*D2W3) + (D2W4*D2W4) + (D2W5*D2W5) + (D2W6*D2W6))

Making the bottom (B)
B = BL * BR

And the full similarity:
Similarity = T / B

As seen in http://i.imgur.com/n3Bp8.jpg
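To put the same thing in SQL terms, a rough (untested) sketch for a single pair of documents might look like this, where doc_scores is just a helper view I'm imagining that holds the per-word scores described above:

CREATE OR REPLACE VIEW doc_scores AS
SELECT o.linkoruser AS doc_id, o.word_id,
((o.score * t.weight) / l.max_tf) * d.word_idf AS ws
FROM occurences AS o
JOIN tags AS t ON t.tag_id = o.tag_id
JOIN links AS l ON l.idlinks = o.linkoruser
JOIN dictionary AS d ON d.word_id = o.word_id;

-- similarity of document 726 to document 720: T divided by (BL * BR)
SELECT SUM(a.ws * b.ws)
/ ( SQRT( (SELECT SUM(ws * ws) FROM doc_scores WHERE doc_id = 726) )
* SQRT( (SELECT SUM(ws * ws) FROM doc_scores WHERE doc_id = 720) ) ) AS similarity
FROM doc_scores AS a
JOIN doc_scores AS b ON b.word_id = a.word_id
WHERE a.doc_id = 726
AND b.doc_id = 720;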

PikaD
01-24-2012, 06:50 PM
Ooh, I just noticed a mistake I made in my description which explains a lot.
For each word's Inverse Document Frequency it requires a calculation:

Inverse Document Frequency = ( SELECT COUNT(idlinks) FROM links ) / dictionary.word_idf

I can see this would greatly improve things.
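So the word_idf column is really storing the number of pages containing the word, and the true IDF has to be worked out per query, roughly like this (untested sketch):

SELECT d.word,
( SELECT COUNT(idlinks) FROM links ) / d.word_idf AS idf
FROM dictionary AS d
WHERE d.word = 'twitter'; -- e.g. one of the words from the dump above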

Old Pedant
01-24-2012, 09:16 PM
AHA! That last message indeed helps explain a lot, I think.

Regarding tag weight being changeable: Unless this happens frequently, it's still better to have the pscore precalculated. Just rerun the query that creates the pscore whenever the tag weights are changed.

The squares table is *ONLY* finding the values of ||A|| (and ||B||) as shown in your post. That is, *ONLY* the sum-of-the-squares for one document.

Those sums-of-squares are also "fixed" for any given table until such time as a tag weight changes or word_idf changes.

Whether they get recalculated each time or you leave them in place until a change in tag weight or addition of another document occurs is up to you.

It is only the product (the dividend of the final formula) that depends on WHICH idlinks is chosen as the base. So you can do all the other calculations once and then you only have to calculate the product once for each base idlinks. Again, supposing that tag weights and word_idf don't change.

Here's the sad part: We had a power outage just after I did all of the above last night and I hadn't saved my final work. So I need to reconstruct it.

Oh...and one major goof on my part: I forgot to do the square roots of the sums! *THAT* would make a *HUGE* difference.

So...more later, but I have to get some work done right now.

Old Pedant
01-24-2012, 09:18 PM
So this formula is actually wrong:
((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf

Yes?

It should, instead, be
( ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf ) / COUNT(idlinks)

Right?

PikaD
01-24-2012, 09:40 PM
So this formula is actually wrong:
((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf

Yes?

It should, instead, be
( ((occurences.score * tags.weight) / links.max_tf) * dictionary.word_idf ) / COUNT(idlinks)

Right?

Not quite, score is:

((occurences.score * tags.weight) / links.max_tf) * (COUNT(idlinks) / dictionary.word_idf)


If you precalculate the score, and do a mass-recalculation upon tag weight change, it could also be an option to have the IDF precalculated, and have an option for the administrator to 'rebuild' the scores (and obviously do an automatic rebuild whenever tag weights are changed).

There has to be a fair pile of calculation done at the time of request though, because there will be a considerable number of different similarity comparisons. If comparing everything to 1 item requires a table of all things, then comparing everything to everything requires a table the size of links SQUARED. Clearly impractical.

Old Pedant
01-24-2012, 09:50 PM
There has to be a fair pile of calculation done at the time of request though, because there will be a considerable amount of different similarity comparisons. If comparing everything to 1 item requires a table of all things, then comparing everything to everything requires a table the size of links SQUARED. Clearly impractical.

Yes, but as I said, it is only the PRODUCTS part of the equations (the dividend, the product A.B in your diagram) that changes depending on which base idlinks is used.

And yesterday I was able to calculate all 5106 of those products in around 30 seconds (I didn't time it, but it certainly wasn't a minute). And that's on my 4 year old machine. (Well, it is dual processor, but one of the early cheap ones...only paid $400 for it 4 years ago.)

Old Pedant
01-24-2012, 09:50 PM
The calculation of the final similarity scores then becomes trivial, taking way less than a second.

Old Pedant
01-24-2012, 09:51 PM
Even rebuilding all my "temp" tables, I doubt that the whole process would take more than 90 seconds. Maybe only 60 to 70 seconds. With the DB as you have it. How does that compare to the PHP timing?

PikaD
01-24-2012, 09:58 PM
The calculation of the final similarity scores then becomes trivial, taking way less than a second.

I think you're my new hero.

PikaD
01-24-2012, 10:03 PM
PHP timing to show top and bottom 10 similar links is between 60 and 120 seconds using an i5 @ 4.4Ghz. So yeah, that's better!

I know the sum of the squares calculation is what makes up the bottom left and right parts of the calculation, so in your version you're suggesting that only the top part of the equation needs calculating each time, and the score numbers are already precalculated, yes?

Old Pedant
01-24-2012, 10:30 PM
Well, I must have made a big mistake yesterday!

Because today I'm seeing it take hundreds of seconds to create the "squares" table.

It's working, but it's slow.

Old Pedant
01-24-2012, 10:33 PM
I know the sum of the squares calculation is what makes up the bottom left and right parts of the calculation, so in your version you're suggesting that only the top part of the equation needs calculating each time, and the score numbers are already precalculated, yes?

Mostly yes. I was only precalculating the pscore as I showed yesterday. So as part of each sum I still have to multiply by linkCount and divide by word_idf and then square each of those.

Hmmm...now I'm wondering if it might not be worth it to do all of that ahead of time.

Old Pedant
01-24-2012, 10:35 PM
Ouch. If you are really getting times that fast using PHP, then I think maybe this experiment in doing it in SQL is doomed to failure.

I'm not sure how I can make it much more efficient.

Old Pedant
01-24-2012, 10:38 PM
Okay, I've now done these steps:


CREATE TABLE squares (
idlinks INT PRIMARY KEY,
sumofsquares FLOAT,
similarity FLOAT );

INSERT INTO squares ( idlinks, sumofsquares )
SELECT links.idlinks, SUM( POWER( (occurences.pscore * dictionary.word_idf), 2 ) ) AS ss
FROM links, occurences, dictionary
WHERE links.idlinks = occurences.linkoruser
AND occurences.word_id = dictionary.word_id;

[Pre-calculating the pscore values was a tremendous time saver.]

OW OW OW OW!

No wonder it was so fast!

I left out the GROUP BY! That SELECT should end with

GROUP BY links.idlinks

NO WONDER the results were so bogus!

Helps to do the work a second time, so you notice mistakes like that.

*sigh*.
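For the record, here is that SELECT again with the GROUP BY in place (the linkCount / word_idf correction from earlier still needs folding in, as in the procedure I post below):

INSERT INTO squares ( idlinks, sumofsquares )
SELECT links.idlinks, SUM( POWER( (occurences.pscore * dictionary.word_idf), 2 ) ) AS ss
FROM links, occurences, dictionary
WHERE links.idlinks = occurences.linkoruser
AND occurences.word_id = dictionary.word_id
GROUP BY links.idlinks;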

Old Pedant
01-24-2012, 10:39 PM
Calculating the sumsofsquares the *RIGHT* way now takes a tad over 7 minutes!!!

PikaD
01-24-2012, 10:49 PM
Calculating the sumsofsquares the *RIGHT* way now takes a tad over 7 minutes!!!

But that's to build all the tables required for calculations, right?

Once that's built, is the actual comparison query faster?

Old Pedant
01-24-2012, 11:38 PM
Well, I need to re-run the whole thing from scratch, to get timings, but at least now I think the results are right.

726 was my baseid.



mysql> select * from similarities order by similarity limit 10;
+---------+---------------------------+
| idlinks | similarity |
+---------+---------------------------+
| 844 | 0.00000005012831145140039 |
| 1783 | 0.00000005012831145140039 |
| 5172 | 0.00000009280917927588555 |
| 2654 | 0.00000009280920811111671 |
| 2800 | 0.00000009280920811111671 |
| 4100 | 0.00000009280920811111671 |
| 5225 | 0.00000009280920811111671 |
| 5459 | 0.00000009280920811111671 |
| 754 | 0.00000011761214611482076 |
| 3982 | 0.00000015829979618323346 |
+---------+---------------------------+
10 rows in set (0.01 sec)

mysql> select * from similarities order by similarity desc limit 10;
+---------+----------------------+
| idlinks | similarity |
+---------+----------------------+
| 726 | 1.000002046777577 |
| 732 | 0.16055284080459034 |
| 736 | 0.1299248544985623 |
| 4047 | 0.02652796074826159 |
| 703 | 0.02420583129816516 |
| 771 | 0.019268659509927132 |
| 2494 | 0.016417334600793036 |
| 715 | 0.01631614545748908 |
| 974 | 0.015567806984892331 |
| 5713 | 0.011873962711117889 |
+---------+----------------------+
10 rows in set (0.01 sec)

And notice that 726 is, not surprisingly, most similar to itself. I guess the actual value should be exactly 1, but not surprising that some inaccuracy slipped in with all those calculations.

By the by, I did all this with DOUBLE PRECISION. I don't think it would be significantly faster to use FLOAT, but if you don't need high accuracy it might be worth trying that.

Old Pedant
01-24-2012, 11:47 PM
WOW!

So after getting the numbers for a baseid of 726, I then turned around and asked for a baseid of 2800.

Look:


mysql> call getSimilaritiesWithPrecalulations(2800);
+---------+-----------------------+
| idlinks | similarity |
+---------+-----------------------+
| 2800 | 1.0000000000000002 |
| 2644 | 0.030299368732139086 |
| 2958 | 0.030299368732139086 |
| 3196 | 0.030135317570824507 |
| 2926 | 0.029907818703917916 |
| 5168 | 0.0018337482684913616 |
| 2654 | 0.0009181360618612229 |
| 4100 | 0.0009181360618612229 |
| 5225 | 0.0009181360618612229 |
| 5459 | 0.0009181360618612229 |
+---------+-----------------------+
10 rows in set (3.94 sec)

UNDER 4 SECONDS!

So...

It takes around 8 minutes to build everything up to the squares table. And you would need to do that anytime the tags were changed or you added a document.

But once you do that, you can get the numbers for any baseid of your choice in 4 seconds.

I'm looking to see if I can improve on the 8 minutes. But that's going to have to wait until tomorrow.

Old Pedant
01-24-2012, 11:48 PM
Here's the code as it now exists. Again, I think some tweaks are possible.



delimiter //

DROP PROCEDURE IF EXISTS precalculateForSimilarities;
//
CREATE PROCEDURE precalculateForSimilarities( )
BEGIN

DECLARE linkCount INT;
SELECT COUNT(idlinks) INTO linkCount
FROM links;

DROP TABLE IF EXISTS squares;
CREATE TABLE squares (
idlinks INT PRIMARY KEY,
sumofsquares DOUBLE PRECISION );

DROP TABLE IF EXISTS products;
CREATE TABLE products (
idlinks INT PRIMARY KEY,
product DOUBLE PRECISION );

DROP TABLE IF EXISTS similarities;
CREATE TABLE similarities (
idlinks INT PRIMARY KEY,
similarity DOUBLE PRECISION );

UPDATE occurences, tags, links
SET occurences.pscore = 1.0 * occurences.score * tags.weight / links.max_tf
WHERE occurences.tag_id = tags.tag_id
AND occurences.linkoruser = links.idlinks;

INSERT INTO squares ( idlinks, sumofsquares )
SELECT links.idlinks, SUM( POWER( (occurences.pscore * linkCount / dictionary.word_idf), 2 ) ) AS ss
FROM links, occurences, dictionary
WHERE links.idlinks = occurences.linkoruser
AND occurences.word_id = dictionary.word_id
GROUP BY links.idlinks;

END;
//

DROP PROCEDURE IF EXISTS getSimilaritiesWithPrecalulations;
//
CREATE PROCEDURE getSimilaritiesWithPrecalulations( baseid INT )
BEGIN

DECLARE linkCount INT;
SELECT COUNT(idlinks) INTO linkCount
FROM links;

TRUNCATE TABLE products;

INSERT INTO products( idlinks, product )
SELECT L2.idlinks, SUM( O1.pscore * O2.pscore * POW( linkCount, 2 ) / POW( D.word_idf, 2 ) ) AS pp
FROM links AS L1, occurences AS O1,
links AS L2, occurences AS O2, dictionary AS D
WHERE L1.idlinks = O1.linkoruser
AND L2.idlinks = O2.linkoruser
AND D.word_id = O1.word_id
AND D.word_id = O2.word_id
AND L1.idlinks = baseid
GROUP BY L2.idlinks;

TRUNCATE TABLE similarities;

INSERT INTO similarities( idlinks, similarity )
SELECT P.idlinks, P.product / ( SQRT( S1.sumofsquares ) * SQRT( S2.sumofsquares ) )
FROM products AS p, squares AS S1, squares AS S2
WHERE p.idlinks = S2.idlinks
AND S1.idlinks = baseid
GROUP BY P.idlinks;

SELECT * FROM similarities
ORDER BY similarity DESC LIMIT 10;

END;
//

DROP PROCEDURE IF EXISTS getSimilarities;
//
CREATE PROCEDURE getSimilarities( baseid INT )
BEGIN
CALL precalculateForSimilarities;
CALL getSimilaritiesWithPrecalulations( baseid );
END;
//

delimiter ;


The code in red is only there for convenience. Once you have run the SP(s), you can make inquiries of the similarities table to your heart's content. Remembering that it's only valid for one base id.
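Typical usage, assuming the procedures above compile as posted:

CALL precalculateForSimilarities(); -- the slow part: rebuild pscore and the squares table after tag/document changes
CALL getSimilaritiesWithPrecalulations(726); -- the per-baseid part: fills products and similarities, then shows the top 10
SELECT * FROM similarities ORDER BY similarity LIMIT 10; -- e.g. the 10 LEAST similar links for the same baseid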

Old Pedant
01-24-2012, 11:52 PM
Just out of curiosity, how often *do* you change the tags values and/or add another document to the links table?

Old Pedant
01-25-2012, 03:06 AM
Well, it's a good thing nobody has accused me of having any brains.

Look here:


mysql> call getSimilaritiesWithPrecalulations(2216);
+---------+-----------------------+
| idlinks | similarity |
+---------+-----------------------+
| 2216 | 1.0000003244760411 |
| 1410 | 0.008372131422705589 |
| 1597 | 0.0074875093271385374 |
| 1203 | 0.006851973297245124 |
| 1611 | 0.0059732967858094664 |
| 2805 | 0.0059732967858094664 |
| 3096 | 0.0059732967858094664 |
| 1245 | 0.00562390980707817 |
| 1532 | 0.0036334177294031927 |
| 929 | 0.0035453084018750055 |
+---------+-----------------------+
10 rows in set (5 min 8.04 sec)

Query OK, 0 rows affected (5 min 8.06 sec)


Double DOH on me. The length of time it takes to get the products (the A.B dividend) is dependent on how many words (occurences rows) there are for the baseid!

The ones that were taking 4 seconds (and one that took 0.37 of a second) all had only a handful of words.

Above is the time for the WORST baseid, 2216, as it has the highest count of occurences.

So...

If you really can get times of 1 to 2 minutes using PHP code, I think you are going to want to stick with it. But do me a favor: Find out how long baseid 2216 takes.

p.s.: I did improve the code slightly from what I posted before, but clearly not enough to matter much.

PikaD
01-25-2012, 03:18 AM
Just out of curiosity, how often *do* you change the tags values and/or add another document to the links table?

First off, Old Pedant - you are my hero.

Secondly, I will be testing that code out tomorrow and seeing what magic can be conjured.

Documents get added almost by the minute, however an update every few hours or so is entirely acceptable. Tag weighting is done by the admin at will, sometimes a few times an hour, sometimes nothing for weeks. It all depends on how results are coming out in cognitive relevance (ie, how relevant a human reckons things are) in comparison to the automated results set.
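Side thought: one way to get that "every few hours" rebuild without involving PHP at all might be MySQL's event scheduler (an untested sketch, assuming the scheduler is enabled and your precalculateForSimilarities procedure is in place):

SET GLOBAL event_scheduler = ON;

CREATE EVENT IF NOT EXISTS rebuild_similarity_precalc
ON SCHEDULE EVERY 4 HOUR
DO CALL precalculateForSimilarities();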

I have PMed you something as a thank you for all your help so far ;)

Old Pedant
01-25-2012, 05:18 AM
Well, I'm really disappointed with my latest efforts to improve the performance.

I tried doing this (linkCount being the procedure variable that holds COUNT(idlinks)):


UPDATE occurences, tags, links, dictionary
SET occurences.pscore = ( 1.0 * occurences.score * tags.weight / links.max_tf ) * linkCount / dictionary.word_idf
WHERE occurences.tag_id = tags.tag_id
AND occurences.linkoruser = links.idlinks
AND occurences.word_id = dictionary.word_id;

Because once you do that, then the calculation of sumofsquares is dirt simple:


TRUNCATE TABLE squares;

INSERT INTO squares ( idlinks, sumofsquares )
SELECT linkoruser, SUM( pscore * pscore ) AS ss
FROM occurences
GROUP BY linkoruser;

And indeed only takes 15 seconds or so.

I was thinking that would be something you would do each time the tags changed or a document was added, but that UPDATE query takes on the order of 30 minutes to complete!

Hmmm...I wonder what EXPLAIN will tell me about it... Ugh. Nothing useful beyond what I would hope for: each intermediate step is as efficient as it can be, according to EXPLAIN.

Still thinking...

PikaD
01-26-2012, 01:34 AM
Well, I'm really disappointed with my latest efforts to improve the performance.

I tried doing this:


UPDATE occurences, tags, links, dictionary
SET occurences.pscore = ( 1.0 * occurences.score * tags.weight / links.max_tf ) * linkCount / dictionary.word_idf
WHERE occurences.tag_id = tags.tag_id
AND occurences.linkoruser = links.idlinks
AND occurences.word_id = dictionary.word_id;

Because once you do that, then the calculation of sumofsquares is dirt simple:


TRUNCATE TABLE squares;

INSERT INTO squares ( idlinks, sumofsquares )
SELECT linkoruser, SUM( pscore * pscore ) AS ss
FROM occurences
GROUP BY linkoruser;

And indeed only takes 15 seconds or so.

Thinking that would be something you would do each time the tags changed or a document was added, but that UPDATE query takes on the order of 30 minutes to complete!

Hmmm...I wonder what EXPLAIN will tell me about it... Ugh. Nothing useful. What I would hope for. Each intermediate step is as efficient as it can be, according to EXPLAIN.

Still thinking...

I'm having some trouble mirroring what I have with what you have as a database. Could you do me an SQL export of your version so I can replace my DB with your version and investigate it further?

Thanks!

Old Pedant
01-26-2012, 02:15 AM
I didn't touch the *DATA* in that file you showed in your original post. I downloaded and imported it untouched. So if it's still there, you can use it.

I'll clean up the stored procedures, though, and shoot them to you again, clean.

I've got them in a state of flux right now as I was trying different things. Give me a while.

PikaD
01-26-2012, 10:49 PM
Well I've got it working and I thought you'd like to see the timings.


mysql> CALL getSimilarities(2800);
+---------+-----------------------+
| idlinks | similarity |
+---------+-----------------------+
| 2800 | 0.9999999999999999 |
| 2644 | 0.03029936909391529 |
| 2958 | 0.03029936909391529 |
| 3196 | 0.03013531794568269 |
| 2926 | 0.029907819092343718 |
| 5168 | 0.0018337483122279452 |
| 2654 | 0.0009181360837897326 |
| 4100 | 0.0009181360837897326 |
| 5225 | 0.0009181360837897326 |
| 5459 | 0.0009181360837897326 |
+---------+-----------------------+
10 rows in set (1 min 34.04 sec)

Query OK, 0 rows affected (1 min 34.05 sec)

mysql> CALL getSimilarities(2801);
+---------+---------------------+
| idlinks | similarity |
+---------+---------------------+
| 2767 | 1.1458806903092202 |
| 2801 | 1.1458806903092202 |
| 2165 | 0.2777225335938566 |
| 2392 | 0.2777225335938566 |
| 3019 | 0.2590013454635094 |
| 3072 | 0.2590013454635094 |
| 2492 | 0.24944945486161393 |
| 2985 | 0.24600440930985706 |
| 2795 | 0.22618299390350902 |
| 2685 | 0.22493720872271245 |
+---------+---------------------+
10 rows in set (1 min 53.39 sec)

Query OK, 0 rows affected (1 min 53.41 sec)

mysql> CALL getSimilaritiesWithPrecalculations(2800);
+---------+-----------------------+
| idlinks | similarity |
+---------+-----------------------+
| 2800 | 0.9999999999999999 |
| 2644 | 0.03029936909391529 |
| 2958 | 0.03029936909391529 |
| 3196 | 0.03013531794568269 |
| 2926 | 0.029907819092343718 |
| 5168 | 0.0018337483122279452 |
| 2654 | 0.0009181360837897326 |
| 4100 | 0.0009181360837897326 |
| 5225 | 0.0009181360837897326 |
| 5459 | 0.0009181360837897326 |
+---------+-----------------------+
10 rows in set (0.81 sec)

Query OK, 0 rows affected (0.83 sec)

mysql> CALL getSimilaritiesWithPrecalculations(2801);
+---------+---------------------+
| idlinks | similarity |
+---------+---------------------+
| 2767 | 1.1458806903092202 |
| 2801 | 1.1458806903092202 |
| 2165 | 0.2777225335938566 |
| 2392 | 0.2777225335938566 |
| 3019 | 0.2590013454635094 |
| 3072 | 0.2590013454635094 |
| 2492 | 0.24944945486161393 |
| 2985 | 0.24600440930985706 |
| 2795 | 0.22618299390350902 |
| 2685 | 0.22493720872271245 |
+---------+---------------------+
10 rows in set (29.29 sec)

Query OK, 0 rows affected (29.31 sec)

mysql> CALL getSimilaritiesWithPrecalculations(2803);
+---------+---------------------+
| idlinks | similarity |
+---------+---------------------+
| 2304 | 1.2304943390561895 |
| 2803 | 1.2304943390561895 |
| 3343 | 0.25026730659837726 |
| 2266 | 0.1907925413812479 |
| 2611 | 0.1907925413812479 |
| 4698 | 0.16645138635228038 |
| 4695 | 0.16358905087732806 |
| 5739 | 0.1626278979891907 |
| 5303 | 0.161396669689247 |
| 2057 | 0.11465883609516067 |
+---------+---------------------+
10 rows in set (16.58 sec)

Query OK, 0 rows affected (16.60 sec)

This is my functions code, looking like so:


delimiter //

DROP PROCEDURE IF EXISTS precalculateForSimilarities;
//
CREATE PROCEDURE precalculateForSimilarities( )
BEGIN

DECLARE linkCount INT;
SELECT COUNT(idlinks) INTO linkCount
FROM links;

DROP TABLE IF EXISTS squares;
CREATE TABLE squares (
idlinks INT PRIMARY KEY,
sumofsquares DOUBLE PRECISION );

UPDATE occurences, tags, links, dictionary
SET occurences.pscore = ( 1.0 * occurences.score * tags.weight / links.max_tf ) * (linkCount / dictionary.word_idf)
WHERE occurences.tag_id = tags.tag_id
AND occurences.linkoruser = links.idlinks
AND occurences.word_id = dictionary.word_id;

TRUNCATE TABLE squares;

INSERT INTO squares ( idlinks, sumofsquares )
SELECT linkoruser, SUM( pscore * pscore ) AS ss
FROM occurences
GROUP BY linkoruser;

END;
//

DROP PROCEDURE IF EXISTS getSimilaritiesWithPrecalculations;
//
CREATE PROCEDURE getSimilaritiesWithPrecalculations( baseid INT )
BEGIN

DECLARE linkCount INT;
SELECT COUNT(idlinks) INTO linkCount
FROM links;

DROP TABLE IF EXISTS products;
CREATE TEMPORARY TABLE products (
idlinks INT PRIMARY KEY,
product DOUBLE PRECISION );

DROP TABLE IF EXISTS similarities;
CREATE TEMPORARY TABLE similarities (
idlinks INT PRIMARY KEY,
similarity DOUBLE PRECISION );

TRUNCATE TABLE products;

INSERT INTO products( idlinks, product )
SELECT L2.idlinks, SUM( O1.pscore * O2.pscore) AS pp
FROM links AS L1, occurences AS O1,
links AS L2, occurences AS O2, dictionary AS D
WHERE L1.idlinks = O1.linkoruser
AND L2.idlinks = O2.linkoruser
AND D.word_id = O1.word_id
AND D.word_id = O2.word_id
AND L1.idlinks = baseid
GROUP BY L2.idlinks;

TRUNCATE TABLE similarities;

INSERT INTO similarities( idlinks, similarity )
SELECT P.idlinks, P.product / ( SQRT( S1.sumofsquares ) * SQRT( S2.sumofsquares ) )
FROM products AS p, squares AS S1, squares AS S2
WHERE p.idlinks = S2.idlinks
AND S1.idlinks = baseid
GROUP BY P.idlinks;

SELECT * FROM similarities
ORDER BY similarity DESC LIMIT 10;

END;
//

DROP PROCEDURE IF EXISTS getSimilarities;
//
CREATE PROCEDURE getSimilarities( baseid INT )
BEGIN
CALL precalculateForSimilarities;
CALL getSimilaritiesWithPrecalculations( baseid );
END;
//

delimiter ;

Times are better than PHP but still pretty beastly.

my ibdata1 file is over 6GB (sooo huge) - is that likely to have much impact?
I tried to mysqldump the database but it kept crashing out with an error. If the ibdata1 file size is unimportant to performance I will file it under 'will get around to it eventually'.

Thoughts? Ideas? Improvements?

Old Pedant
01-27-2012, 09:29 AM
6GB is big but not outrageous for MySQL. Depends a lot on how much RAM you have on the machine, I would think.

One thing that would help would be to specify all those newly created tables as "ENGINE MyISAM". I forgot to do that.

One thing I did notice. Latest version of the SP's are such that you *could* use TEMPORARY tables. Which would mean different people could be making queries at the same time with different baseid's. Dunno if that is important to you.
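For example, a minimal sketch of the MyISAM change, applied to the squares table:

DROP TABLE IF EXISTS squares;
CREATE TABLE squares (
idlinks INT PRIMARY KEY,
sumofsquares DOUBLE PRECISION ) ENGINE = MyISAM;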

PikaD
01-27-2012, 10:33 AM
6GB is big but not outrageous for MySQL. Depends a lot on how much RAM you have on the machine, I would think.

One thing that would help would be to specify all those newly created tables as "ENGINE MyISAM". I forgot to do that.

One thing I did notice. Latest version of the SP's are such that you *could* use TEMPORARY tables. Which would mean different people could be making queries at the same time with different baseid's. Dunno of that is important to you.

I'll make the change to MyISAM and see if it improves.

I did use temporary tables for products and similarities in my version above. I moved their CREATEs out of the first function and put them in as CREATE TEMPORARY in the second one.

Old Pedant
01-27-2012, 10:59 PM
I did use temporary tables for products and similarities in my version above. I moved their CREATEs out of the first function and put them in as CREATE TEMPORARY in the second one.

Haven't had time to mess with it today or late yesterday. But that's exactly what I was going to do, too. Great minds run in the same gutter.


