Flash Website Builder- Trendy Site Builder is a Flash Site Building tool that helps users build stunning websites. Check Out Custom Custom Logo Design by LogoBee. Website Design and Free Logo Templates available.
 CodingForums.com Help! How to find the similarity between users in Twitter ?

Before you post, read our: Rules & Posting Guidelines

Enjoy an ad free experience by logging in. Not a member yet? Register.
 12-27-2012, 08:19 PM PM User | #1 ldaneil New to the CF scene   Join Date: Dec 2012 Location: Singapore Posts: 6 Thanks: 1 Thanked 0 Times in 0 Posts Help! How to find the similarity between users in Twitter ? I am working on a project about data mining. my company has given me 6 million dummy customer info of twitter. I was assigned to find out the similarity between any two users. can anyone could give me some ideas how to deal with the large community data? Thanks in advance Problem : I use the tweets & hashtag info(hashtags are those words highlighted by user) as the two criteria to measure the similarity between two different users. Since the large number of users, and especially there may be millions of hastags & tweets of each user. Can anyone tell me a good way to fast calculate the similarity between two users? I have tried to use FT-IDF to calculate the similarity between two different users, but it seems infeasible. can anyone have a very super algorithm or good ideas which could make me fast find all the similarities between users? For example: user A's hashtag = (cat, bull, cow, chicken, duck) user B's hashtag =(cat, chicken, cloth) user C's hashtag = (lenovo, Hp, Sony) clearly, C has no relation with A, so it is not necessary to calculate the similarity to waste time, we may filter out all those unrelated user first before calculate the similarity. in fact, more than 90% of the total users are unrelated with a particular user. How to use hashtag as criteria to fast find those potential similar user group of A? is this a good idea? or we just directly calculate the relative similarity between A and all other users? what algorithm would be the fastest and customized algorithm for the problem?
 12-27-2012, 09:05 PM PM User | #2 Old Pedant Supreme Master coder!     Join Date: Feb 2009 Posts: 24,955 Thanks: 75 Thanked 4,308 Times in 4,275 Posts Since, presumably, the tweets are recorded in a database somewhere, it seems obvious to me that you should first approach this looking for a database solution. I see no reason, in any case, that JavaScript (which is what this forum is for) would ever be involved. As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users. __________________ An optimist sees the glass as half full. A pessimist sees the glass as half empty. A realist drinks it no matter how much there is.
 12-28-2012, 08:52 AM PM User | #3 rnd me Senior Coder     Join Date: Jun 2007 Location: Urbana Posts: 3,973 Thanks: 10 Thanked 535 Times in 517 Posts js can handle this just fine, alone in node.js, or as a map reduce function in a nosql system. it seems to me that you simply need to transpose your data from groups of hashes per user into groups of users per hash. you can then solve this problem system-wide by simply concatenating the users list of each hash mentioned by user A into one big repetitive array of users. probably IDs or something... you can sort() that list by count, and remove duplicates. slot #0 should obviously be the user A, having 100% participation in the selected pool of hashes, but slot #1 would be the 2nd closest, slot #2 the third closest, etc until you reach the users with only one hash in common with user A. it would use double the RAM of the orig, but should provide near-instant execution to cover that up-front RAM cost. you could just run that function for every user to find out who is common to who system-wide (if needed). if your two tables are in globals the whole time it shouldn't take very long at all in V8 to calculate the whole list of lists. side note: i once had a complex comparison that required an sql subquery, and took about 8-9 mins to run. it was looking for missing two-way relationships. when i instead exported the DB as JSON and ran the thing in javascript after building an on-demand LUT as described above, i could calculate the same thing in 200ms instead of almost ten mins... this let the workers of the project i was working on, an inventory of about 80,000 media objects, run this report on-demand instead of having me do it twice a day, greatly speeding progress. I know DBs are the go to, but they aren't panaceas, and it CAN pay to explore other options. __________________ my site (updated 13/9/26) BROWSER STATS [% share] (2013/10/31) IE7:0.5, IE8:8.6, IE9:5.3, IE10:12.3, FF:17.7, CH:41.8, SF:8.1, MOBILE:20.4 Last edited by rnd me; 12-28-2012 at 09:07 AM..
12-28-2012, 08:15 PM   PM User | #4
ldaneil
New to the CF scene

Join Date: Dec 2012
Location: Singapore
Posts: 6
Thanks: 1
Thanked 0 Times in 0 Posts

Quote:
 Originally Posted by Old Pedant Since, presumably, the tweets are recorded in a database somewhere, it seems obvious to me that you should first approach this looking for a database solution. I see no reason, in any case, that JavaScript (which is what this forum is for) would ever be involved. As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.
Dear Old Pedant

Yes, the project professor ask me to use C language to handle it. I post it in JavaScript section, because this section is very active, I have more chance to meet a programming pro, who may give me a very good and feasible way to do it. Actually ,this is a data mining problem, How to deal with the data and data analysis, this problem's essential part is the idea, if the idea is clear how to do it, the code implementation should no a problem. I hope some smart guy can give me some hints. thank you.

12-28-2012, 08:25 PM   PM User | #5
ldaneil
New to the CF scene

Join Date: Dec 2012
Location: Singapore
Posts: 6
Thanks: 1
Thanked 0 Times in 0 Posts

Quote:
 Originally Posted by rnd me js can handle this just fine, alone in node.js, or as a map reduce function in a nosql system. it seems to me that you simply need to transpose your data from groups of hashes per user into groups of users per hash. you can then solve this problem system-wide by simply concatenating the users list of each hash mentioned by user A into one big repetitive array of users. probably IDs or something... you can sort() that list by count, and remove duplicates. slot #0 should obviously be the user A, having 100% participation in the selected pool of hashes, but slot #1 would be the 2nd closest, slot #2 the third closest, etc until you reach the users with only one hash in common with user A. it would use double the RAM of the orig, but should provide near-instant execution to cover that up-front RAM cost. you could just run that function for every user to find out who is common to who system-wide (if needed). if your two tables are in globals the whole time it shouldn't take very long at all in V8 to calculate the whole list of lists. side note: i once had a complex comparison that required an sql subquery, and took about 8-9 mins to run. it was looking for missing two-way relationships. when i instead exported the DB as JSON and ran the thing in javascript after building an on-demand LUT as described above, i could calculate the same thing in 200ms instead of almost ten mins... this let the workers of the project i was working on, an inventory of about 80,000 media objects, run this report on-demand instead of having me do it twice a day, greatly speeding progress. I know DBs are the go to, but they aren't panaceas, and it CAN pay to explore other options.
Dear rnd me

you suggestion sounds good and useful. I am a University student, my professor ask me to use C language to handle it. I may not clearly state my problem.

The hashtag is the words highlighted by users, which can reflect the users interests in some extent,if two users have some common hashtags, it means they have some hidden similarity/interest, which also means they have some hidden similarity. However, in Twitter, 90% of the users are not active, most of them haven't any hashtag, instead of directly calculate the similarity between all users, I would like to filter out all those un-related users(for a particular user, those users haven't common hashtags/no hashtags).To do this, I use an array to store all users, then for each user,use a linkedlist to chain all the related users group(for example, all those users who have some common hashtags with user A, we say they are the related group to A). Then second step is to calculate the similarity amount the related group for user A,and get the relative similarity value between A and all other users of the group. repeat this step for each users in twitter, so that we could get relative similarity value between any user in Twitter. next step is to find the related similarity value between all users, then according to the similarity value to build a graph(similarity =0 means no link between the two users), finally using this graph go through the fast unfold community detection algorithm to find the clusters of the social website.

In fact, the hashtag is just only one of the measurements of the similarity between users, I categorise the measurement into content similarity, link similarity and meta-data similarity.

Content similarity: hashtags and tweets

Link similarity: follower-following relationship between two users and the number of times they have retweeted, mentioned or replied to each other

meta-data similarity: location, gender and age

content similarity = hashtags similarity + tweets similaity, and use the same way to get link similarity and meta-data similarity

the total similarity = content similarity + Link similarity + meta-data similarity

the total similarity is probabaly a digit value, which is something like the weigted link I Want to get for building the similarity graph of the twitter

website. something look like this:
http://www.xatlantis.ch/examples/gra...h1_example.png

12-28-2012, 08:45 PM   PM User | #6
rnd me
Senior Coder

Join Date: Jun 2007
Location: Urbana
Posts: 3,973
Thanks: 10
Thanked 535 Times in 517 Posts
Quote:
 Originally Posted by ldaneil Yes, the project professor ask me to use C language to handle it. I post it in JavaScript section, because this section is very active, I have more chance to meet a programming pro, who may give me a very good and feasible way to do it.
heh. i don't think too many professional programmers hang out in javascript forums; a lot of them think JS sucks. Does the simple transposed look-up-table routine i described not solve your problem?
__________________
my site (updated 13/9/26)
BROWSER STATS [% share] (2013/10/31) IE7:0.5, IE8:8.6, IE9:5.3, IE10:12.3, FF:17.7, CH:41.8, SF:8.1, MOBILE:20.4

12-28-2012, 08:46 PM   PM User | #7
Old Pedant
Supreme Master coder!

Join Date: Feb 2009
Posts: 24,955
Thanks: 75
Thanked 4,308 Times in 4,275 Posts
RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.

A lot different than RdnMe's 80,000 media objects.

I am not saying that a SQL approach is the only way. But if you want to solve the problem quickly--and I would assume that is true since this is for a class (you did say "professor"?)--then SQL seems to me to be the answer. It feels like a one-day (or less) project if done using SQL.

But you never did describe what the RESULTS are supposed to be. Recall that I wrote
Quote:
 As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.
So...just what is it you are supposed to be able to report?

Find the 100 people who have the most hashtags in common? Or what?

Anyway, as a SQL problem ir really is simple:
Code:
```CREATE TABLE users (
userid INT AUTO_INCREMENT PRIMARY KEY,
... other user info ...
);

CREATE TABLE hashtags (
tagid INT AUTO_INCREMENT PRIMARY KEY,
tag VARCHAR(50)
);

CREATE TABLE userTags (
userid INT,
tagid INT,
PRIMARY KEY (userid, tagid),
CONSTRAINT FOREIGN KEY usertags_user (userid) REFERENCES users(userid),
CONSTRAINT FOREIGN KEY usertags_tag  (tagid)  REFERENCES hashtags(tagid)
);

And to find, say, the 20 people with the most hash tag matches *TO A GIVEN USER*:

FROM users AS u1, users AS u2, usertags AS T1, usertags AS T2
WHERE u1.username = 'Fred Flintstone' /* Fred being the given user in this demo */
AND u1.userid = T1.userid
AND T1.tagid = T2.tagid
AND T1.userid <> T2.userid
ORDER BY matches DESC
LIMIT 20;```
__________________
An optimist sees the glass as half full.
A pessimist sees the glass as half empty.
A realist drinks it no matter how much there is.

 Users who have thanked Old Pedant for this post: ldaneil (12-30-2012)
12-28-2012, 08:59 PM   PM User | #8
rnd me
Senior Coder

Join Date: Jun 2007
Location: Urbana
Posts: 3,973
Thanks: 10
Thanked 535 Times in 517 Posts
Quote:
 Originally Posted by Old Pedant RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.
my problem had fewer objects, but it needed a full table scan per object to calculate the field we needed.

if he is using C, i don't think he would use SQL to do the analysis.

doing some back of the envelope math, i think you can handle 6 million users with my routine.

given each unique hash tag as a long integer (4 bytes) and an average of 19 or 20 tags per user (using one slot as a userID):
6000000 * 4 * 20 = ~480MB

for the LUT you need an array of all the hashes. lets say there are 100K unique hashes at 20 bytes each:
100000 * 20 = 2MB

you then need the transposed data (users by hash). you can refer to each user as a long integer. let's pretend that each hash has 50 users on average:
100000 * 50 * 4 = ~20MB

that's only ~500 megabytes, no problem for an off-the-shelf laptop...
__________________
my site (updated 13/9/26)
BROWSER STATS [% share] (2013/10/31) IE7:0.5, IE8:8.6, IE9:5.3, IE10:12.3, FF:17.7, CH:41.8, SF:8.1, MOBILE:20.4

Last edited by rnd me; 12-28-2012 at 09:01 PM..

 12-28-2012, 09:14 PM PM User | #9 Old Pedant Supreme Master coder!     Join Date: Feb 2009 Posts: 24,955 Thanks: 75 Thanked 4,308 Times in 4,275 Posts Your numbers seem reasonable. But those will be essentially the same numbers that the MySQL DB and query will see in the usertags table. So you will likely get the results faster each time...*AFTER* you write the C code...but I will get the first results faster by not *HAVING* to write the C code. It all depends on what his goal is, which we so far don't know. Oh...and invoking MySQL from C code is essentially the same as invoking it from PHP. The primary interface to MySQL is, indeed, C. __________________ An optimist sees the glass as half full. A pessimist sees the glass as half empty. A realist drinks it no matter how much there is.
12-28-2012, 09:55 PM   PM User | #10
Logic Ali
Senior Coder

Join Date: Sep 2010
Location: London
Posts: 1,027
Thanks: 0
Thanked 206 Times in 201 Posts
Quote:
 Originally Posted by rnd me Does the simple transposed look-up-table routine i described not solve your problem?
I get the feeling that you might be seriously over-estimating this customer.

12-28-2012, 10:31 PM   PM User | #11
Old Pedant
Supreme Master coder!

Join Date: Feb 2009
Posts: 24,955
Thanks: 75
Thanked 4,308 Times in 4,275 Posts
Quote:
 Originally Posted by Logic Ali I get the feeling that you might be seriously over-estimating this customer.
LOL! Yeah, we probably both are.

And maybe he really does have to write it all in C as a proof he can code in that language. (Though why anybody would be taking a course in C in this day and age is beyond me. *AT LEAST* he should be taking C++, if not Java or C#. Some language that is just a *tad* more modern!)
__________________
An optimist sees the glass as half full.
A pessimist sees the glass as half empty.
A realist drinks it no matter how much there is.

12-30-2012, 01:13 PM   PM User | #12
ldaneil
New to the CF scene

Join Date: Dec 2012
Location: Singapore
Posts: 6
Thanks: 1
Thanked 0 Times in 0 Posts

Quote:
 Originally Posted by Old Pedant RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags. A lot different than RdnMe's 80,000 media objects. I am not saying that a SQL approach is the only way. But if you want to solve the problem quickly--and I would assume that is true since this is for a class (you did say "professor"?)--then SQL seems to me to be the answer. It feels like a one-day (or less) project if done using SQL. But you never did describe what the RESULTS are supposed to be. Recall that I wrote So...just what is it you are supposed to be able to report? Find the 100 people who have the most hashtags in common? Or what? Anyway, as a SQL problem ir really is simple: Code: ```CREATE TABLE users ( userid INT AUTO_INCREMENT PRIMARY KEY, username VARCHAR(50), ... other user info ... ); CREATE TABLE hashtags ( tagid INT AUTO_INCREMENT PRIMARY KEY, tag VARCHAR(50) ); CREATE TABLE userTags ( userid INT, tagid INT, PRIMARY KEY (userid, tagid), CONSTRAINT FOREIGN KEY usertags_user (userid) REFERENCES users(userid), CONSTRAINT FOREIGN KEY usertags_tag (tagid) REFERENCES hashtags(tagid) ); And to find, say, the 20 people with the most hash tag matches *TO A GIVEN USER*: SELECT u2.username, COUNT(*) AS matches FROM users AS u1, users AS u2, usertags AS T1, usertags AS T2 WHERE u1.username = 'Fred Flintstone' /* Fred being the given user in this demo */ AND u1.userid = T1.userid AND T1.tagid = T2.tagid AND T1.userid <> T2.userid GROUP BY u2.username ORDER BY matches DESC LIMIT 20;```
Dear Old Pedant,

Yes, I am a University student, However, the project is not a class project, it's a research project. and My professor ask me to implement it with C code. The final goal here is to find the similarity relative value between users, the hashtag is just only one of the measurements of the similarity between users, I categorise the measurement into content similarity, link similarity and meta-data similarity.

Content similarity: hashtags and tweets

Link similarity: follower-following relationship between two users and the number of times they have retweeted, mentioned or replied to each other

meta-data similarity: location, gender and age

content similarity = hashtags similarity + tweets similaity, and use the same way to get link similarity and meta-data similarity

the total similarity = content similarity + Link similarity + meta-data similarity

the total similarity is probabaly a digit value, which is something like the weigted link I Want to get for building the similarity graph of the twitter

website. something look like this:
http://www.xatlantis.ch/examples/gra...h1_example.png http://www.xatlantis.ch/examples/gra...h1_example.png

After I get this weighted graph, I will use it to go through my fast unfold community detection algorithm to find those potential Clusters in twetter.
The focus here is to design a good program, which could fast compute the similarity (total similarity) value between users, and filter out those no related users, and finally get something like similarity graph text file. for example:

usera_id userb_id similarity vale

12 15 0.6
. . .
. . .
. . .

each line contains a pair of integers a b
it means there is an edge from a to b (means some similarity)
After I get something like above text file, I will pass the text file to the fast unfold algorithm to find the potential clusters in the website.

 12-30-2012, 05:21 PM PM User | #13 tracknut Senior Coder   Join Date: Aug 2006 Posts: 1,041 Thanks: 6 Thanked 244 Times in 243 Posts I think this will be a fun program to write, and congratulate your professor on creating the challenge! By the way, list list: Code: ```usera_id userb_id similarity vale 12 15 0.6 . . . . . . . . .``` contains, 36 trillion entries, I hope you aren't asked to print it out To add to the above discussion, I think a bit of pre-processing is in order here. Do you know how many unique hashtags there are in the dataset? Do you know how many hashtags are referenced only once? Do you know how many users have zero hashtags in common with anyone else? Those numbers will be simple to find using rnd_me's suggestion of collecting the information via hashtag rather than user, and can be used to drastically (I suspect) reduce the size of this 6M x 6M "similarity matrix" you are looking to create. Dave

 Bookmarks

 Thread Tools Rate This Thread Rate This Thread: 5 : Excellent 4 : Good 3 : Average 2 : Bad 1 : Terrible

 Posting Rules You may not post new threads You may not post replies You may not post attachments You may not edit your posts BB code is On Smilies are On [IMG] code is On HTML code is Off Forum Rules
 Forum Jump User Control Panel Private Messages Subscriptions Who's Online Search Forums Forums Home :: Client side development     JavaScript programming         DOM and JSON scripting         Ajax and Design         JavaScript frameworks         Post a JavaScript     HTML & CSS     XML     Flash & ActionScript         Adobe Flex     Graphics and Multimedia discussions     General web building         Site reviews         Building for mobile devices :: Server side development     Apache configuration     Perl/ CGI     PHP         Post a PHP snippet     MySQL         Other Databases     Ruby & Ruby On Rails     ASP     ASP.NET     Java and JSP     Other server side languages/ issues         ColdFusion         Python :: Computing & Sciences     Computer Programming     Computer/PC discussions     Geek News and Humour Web Projects and Services Marketplace     Web Projects         Small projects (quick fixes and changes)         Medium projects (new script, new features, etc)         Large Projects (new web application, complex features etc)         Unknown sized projects (request quote)         Vacant job positions         Looking for work/ for hire         Project collaboration/ partnership         Paid work offers and requests (Now CLOSED)     Career, job, and business ideas or advice     Domains, Sites, and Designs for sale         Domains for sale         Websites for sale         Design templates and graphics for sale :: Other forums     Member Offers     Forum feedback and announcements

All times are GMT +1. The time now is 10:15 PM.