Help! How to find the similarity between users in Twitter ?
I am working on a project about data mining. my company has given me 6 million dummy customer info of twitter. I was assigned to find out the similarity between any two users. can anyone could give me some ideas how to deal with the large community data? Thanks in advance
Problem : I use the tweets & hashtag info(hashtags are those words highlighted by user) as the two criteria to measure the similarity between two different users. Since the large number of users, and especially there may be millions of hastags & tweets of each user. Can anyone tell me a good way to fast calculate the similarity between two users? I have tried to use FT-IDF to calculate the similarity between two different users, but it seems infeasible. can anyone have a very super algorithm or good ideas which could make me fast find all the similarities between users?
For example:
user A's hashtag = (cat, bull, cow, chicken, duck)
user B's hashtag =(cat, chicken, cloth)
user C's hashtag = (lenovo, Hp, Sony)
clearly, C has no relation with A, so it is not necessary to calculate the similarity to waste time, we may filter out all those unrelated user first before calculate the similarity. in fact, more than 90% of the total users are unrelated with a particular user. How to use hashtag as criteria to fast find those potential similar user group of A? is this a good idea? or we just directly calculate the relative similarity between A and all other users? what algorithm would be the fastest and customized algorithm for the problem?
Since, presumably, the tweets are recorded in a database somewhere, it seems obvious to me that you should first approach this looking for a database solution. I see no reason, in any case, that JavaScript (which is what this forum is for) would ever be involved.
As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.
__________________
An optimist sees the glass as half full.
A pessimist sees the glass as half empty.
A realist drinks it no matter how much there is.
js can handle this just fine, alone in node.js, or as a map reduce function in a nosql system.
it seems to me that you simply need to transpose your data from groups of hashes per user into groups of users per hash.
you can then solve this problem system-wide by simply concatenating the users list of each hash mentioned by user A into one big repetitive array of users. probably IDs or something...
you can sort() that list by count, and remove duplicates.
slot #0 should obviously be the user A, having 100% participation in the selected pool of hashes, but slot #1 would be the 2nd closest, slot #2 the third closest, etc until you reach the users with only one hash in common with user A.
it would use double the RAM of the orig, but should provide near-instant execution to cover that up-front RAM cost.
you could just run that function for every user to find out who is common to who system-wide (if needed).
if your two tables are in globals the whole time it shouldn't take very long at all in V8 to calculate the whole list of lists.
side note:
i once had a complex comparison that required an sql subquery, and took about 8-9 mins to run. it was looking for missing two-way relationships. when i instead exported the DB as JSON and ran the thing in javascript after building an on-demand LUT as described above, i could calculate the same thing in 200ms instead of almost ten mins... this let the workers of the project i was working on, an inventory of about 80,000 media objects, run this report on-demand instead of having me do it twice a day, greatly speeding progress. I know DBs are the go to, but they aren't panaceas, and it CAN pay to explore other options.
__________________ my site (updated 5/13) STATS (2013/5) HTML5:90.2% MOB:15.2% IE7:0.5% IE8:8.4% IE9:8.5% IE10:8.5%
Since, presumably, the tweets are recorded in a database somewhere, it seems obvious to me that you should first approach this looking for a database solution. I see no reason, in any case, that JavaScript (which is what this forum is for) would ever be involved.
As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.
Dear Old Pedant
Yes, the project professor ask me to use C language to handle it. I post it in JavaScript section, because this section is very active, I have more chance to meet a programming pro, who may give me a very good and feasible way to do it. Actually ,this is a data mining problem, How to deal with the data and data analysis, this problem's essential part is the idea, if the idea is clear how to do it, the code implementation should no a problem. I hope some smart guy can give me some hints. thank you.
js can handle this just fine, alone in node.js, or as a map reduce function in a nosql system.
it seems to me that you simply need to transpose your data from groups of hashes per user into groups of users per hash.
you can then solve this problem system-wide by simply concatenating the users list of each hash mentioned by user A into one big repetitive array of users. probably IDs or something...
you can sort() that list by count, and remove duplicates.
slot #0 should obviously be the user A, having 100% participation in the selected pool of hashes, but slot #1 would be the 2nd closest, slot #2 the third closest, etc until you reach the users with only one hash in common with user A.
it would use double the RAM of the orig, but should provide near-instant execution to cover that up-front RAM cost.
you could just run that function for every user to find out who is common to who system-wide (if needed).
if your two tables are in globals the whole time it shouldn't take very long at all in V8 to calculate the whole list of lists.
side note:
i once had a complex comparison that required an sql subquery, and took about 8-9 mins to run. it was looking for missing two-way relationships. when i instead exported the DB as JSON and ran the thing in javascript after building an on-demand LUT as described above, i could calculate the same thing in 200ms instead of almost ten mins... this let the workers of the project i was working on, an inventory of about 80,000 media objects, run this report on-demand instead of having me do it twice a day, greatly speeding progress. I know DBs are the go to, but they aren't panaceas, and it CAN pay to explore other options.
Dear rnd me
you suggestion sounds good and useful. I am a University student, my professor ask me to use C language to handle it. I may not clearly state my problem.
The hashtag is the words highlighted by users, which can reflect the users interests in some extent,if two users have some common hashtags, it means they have some hidden similarity/interest, which also means they have some hidden similarity. However, in Twitter, 90% of the users are not active, most of them haven't any hashtag, instead of directly calculate the similarity between all users, I would like to filter out all those un-related users(for a particular user, those users haven't common hashtags/no hashtags).To do this, I use an array to store all users, then for each user,use a linkedlist to chain all the related users group(for example, all those users who have some common hashtags with user A, we say they are the related group to A). Then second step is to calculate the similarity amount the related group for user A,and get the relative similarity value between A and all other users of the group. repeat this step for each users in twitter, so that we could get relative similarity value between any user in Twitter. next step is to find the related similarity value between all users, then according to the similarity value to build a graph(similarity =0 means no link between the two users), finally using this graph go through the fast unfold community detection algorithm to find the clusters of the social website.
In fact, the hashtag is just only one of the measurements of the similarity between users, I categorise the measurement into content similarity, link similarity and meta-data similarity.
Content similarity: hashtags and tweets
Link similarity: follower-following relationship between two users and the number of times they have retweeted, mentioned or replied to each other
meta-data similarity: location, gender and age
content similarity = hashtags similarity + tweets similaity, and use the same way to get link similarity and meta-data similarity
the total similarity = content similarity + Link similarity + meta-data similarity
the total similarity is probabaly a digit value, which is something like the weigted link I Want to get for building the similarity graph of the twitter
Yes, the project professor ask me to use C language to handle it. I post it in JavaScript section, because this section is very active, I have more chance to meet a programming pro, who may give me a very good and feasible way to do it.
heh. i don't think too many professional programmers hang out in javascript forums; a lot of them think JS sucks. Does the simple transposed look-up-table routine i described not solve your problem?
__________________ my site (updated 5/13) STATS (2013/5) HTML5:90.2% MOB:15.2% IE7:0.5% IE8:8.4% IE9:8.5% IE10:8.5%
RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.
A lot different than RdnMe's 80,000 media objects.
I am not saying that a SQL approach is the only way. But if you want to solve the problem quickly--and I would assume that is true since this is for a class (you did say "professor"?)--then SQL seems to me to be the answer. It feels like a one-day (or less) project if done using SQL.
But you never did describe what the RESULTS are supposed to be. Recall that I wrote
Quote:
As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.
So...just what is it you are supposed to be able to report?
Find the 100 people who have the most hashtags in common? Or what?
Anyway, as a SQL problem ir really is simple:
Code:
CREATE TABLE users (
userid INT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50),
... other user info ...
);
CREATE TABLE hashtags (
tagid INT AUTO_INCREMENT PRIMARY KEY,
tag VARCHAR(50)
);
CREATE TABLE userTags (
userid INT,
tagid INT,
PRIMARY KEY (userid, tagid),
CONSTRAINT FOREIGN KEY usertags_user (userid) REFERENCES users(userid),
CONSTRAINT FOREIGN KEY usertags_tag (tagid) REFERENCES hashtags(tagid)
);
And to find, say, the 20 people with the most hash tag matches *TO A GIVEN USER*:
SELECT u2.username, COUNT(*) AS matches
FROM users AS u1, users AS u2, usertags AS T1, usertags AS T2
WHERE u1.username = 'Fred Flintstone' /* Fred being the given user in this demo */
AND u1.userid = T1.userid
AND T1.tagid = T2.tagid
AND T1.userid <> T2.userid
GROUP BY u2.username
ORDER BY matches DESC
LIMIT 20;
__________________
An optimist sees the glass as half full.
A pessimist sees the glass as half empty.
A realist drinks it no matter how much there is.
RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.
my problem had fewer objects, but it needed a full table scan per object to calculate the field we needed.
if he is using C, i don't think he would use SQL to do the analysis.
doing some back of the envelope math, i think you can handle 6 million users with my routine.
given each unique hash tag as a long integer (4 bytes) and an average of 19 or 20 tags per user (using one slot as a userID):
6000000 * 4 * 20 = ~480MB
for the LUT you need an array of all the hashes. lets say there are 100K unique hashes at 20 bytes each:
100000 * 20 = 2MB
you then need the transposed data (users by hash). you can refer to each user as a long integer. let's pretend that each hash has 50 users on average:
100000 * 50 * 4 = ~20MB
that's only ~500 megabytes, no problem for an off-the-shelf laptop...
__________________ my site (updated 5/13) STATS (2013/5) HTML5:90.2% MOB:15.2% IE7:0.5% IE8:8.4% IE9:8.5% IE10:8.5%
But those will be essentially the same numbers that the MySQL DB and query will see in the usertags table.
So you will likely get the results faster each time...*AFTER* you write the C code...but I will get the first results faster by not *HAVING* to write the C code.
It all depends on what his goal is, which we so far don't know.
Oh...and invoking MySQL from C code is essentially the same as invoking it from PHP. The primary interface to MySQL is, indeed, C.
__________________
An optimist sees the glass as half full.
A pessimist sees the glass as half empty.
A realist drinks it no matter how much there is.
I get the feeling that you might be seriously over-estimating this customer.
LOL! Yeah, we probably both are.
And maybe he really does have to write it all in C as a proof he can code in that language. (Though why anybody would be taking a course in C in this day and age is beyond me. *AT LEAST* he should be taking C++, if not Java or C#. Some language that is just a *tad* more modern!)
__________________
An optimist sees the glass as half full.
A pessimist sees the glass as half empty.
A realist drinks it no matter how much there is.
RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.
A lot different than RdnMe's 80,000 media objects.
I am not saying that a SQL approach is the only way. But if you want to solve the problem quickly--and I would assume that is true since this is for a class (you did say "professor"?)--then SQL seems to me to be the answer. It feels like a one-day (or less) project if done using SQL.
But you never did describe what the RESULTS are supposed to be. Recall that I wrote
So...just what is it you are supposed to be able to report?
Find the 100 people who have the most hashtags in common? Or what?
Anyway, as a SQL problem ir really is simple:
Code:
CREATE TABLE users (
userid INT AUTO_INCREMENT PRIMARY KEY,
username VARCHAR(50),
... other user info ...
);
CREATE TABLE hashtags (
tagid INT AUTO_INCREMENT PRIMARY KEY,
tag VARCHAR(50)
);
CREATE TABLE userTags (
userid INT,
tagid INT,
PRIMARY KEY (userid, tagid),
CONSTRAINT FOREIGN KEY usertags_user (userid) REFERENCES users(userid),
CONSTRAINT FOREIGN KEY usertags_tag (tagid) REFERENCES hashtags(tagid)
);
And to find, say, the 20 people with the most hash tag matches *TO A GIVEN USER*:
SELECT u2.username, COUNT(*) AS matches
FROM users AS u1, users AS u2, usertags AS T1, usertags AS T2
WHERE u1.username = 'Fred Flintstone' /* Fred being the given user in this demo */
AND u1.userid = T1.userid
AND T1.tagid = T2.tagid
AND T1.userid <> T2.userid
GROUP BY u2.username
ORDER BY matches DESC
LIMIT 20;
Dear Old Pedant,
Yes, I am a University student, However, the project is not a class project, it's a research project. and My professor ask me to implement it with C code. The final goal here is to find the similarity relative value between users, the hashtag is just only one of the measurements of the similarity between users, I categorise the measurement into content similarity, link similarity and meta-data similarity.
Content similarity: hashtags and tweets
Link similarity: follower-following relationship between two users and the number of times they have retweeted, mentioned or replied to each other
meta-data similarity: location, gender and age
content similarity = hashtags similarity + tweets similaity, and use the same way to get link similarity and meta-data similarity
the total similarity = content similarity + Link similarity + meta-data similarity
the total similarity is probabaly a digit value, which is something like the weigted link I Want to get for building the similarity graph of the twitter
After I get this weighted graph, I will use it to go through my fast unfold community detection algorithm to find those potential Clusters in twetter. The focus here is to design a good program, which could fast compute the similarity (total similarity) value between users, and filter out those no related users, and finally get something like similarity graph text file. for example:
usera_id userb_id similarity vale
12 15 0.6
. . .
. . .
. . .
each line contains a pair of integers a b
it means there is an edge from a to b (means some similarity)
After I get something like above text file, I will pass the text file to the fast unfold algorithm to find the potential clusters in the website.
contains, 36 trillion entries, I hope you aren't asked to print it out
To add to the above discussion, I think a bit of pre-processing is in order here. Do you know how many unique hashtags there are in the dataset? Do you know how many hashtags are referenced only once? Do you know how many users have zero hashtags in common with anyone else? Those numbers will be simple to find using rnd_me's suggestion of collecting the information via hashtag rather than user, and can be used to drastically (I suspect) reduce the size of this 6M x 6M "similarity matrix" you are looking to create.