  1. #1: ldaneil (New to the CF scene, Singapore)

    Help! How to find the similarity between users on Twitter?

    I am working on a data-mining project. My company has given me dummy Twitter data for 6 million customers, and I was assigned to find the similarity between any two users. Can anyone give me some ideas on how to deal with such a large community dataset? Thanks in advance.

    Problem: I use tweets and hashtag info (hashtags are the words highlighted by a user) as the two criteria for measuring the similarity between two different users. Because there are so many users, and each user may have millions of hashtags and tweets, can anyone tell me a fast way to calculate the similarity between two users? I have tried using TF-IDF to calculate the similarity between two different users, but it seems infeasible. Does anyone have an efficient algorithm or good ideas that would let me quickly find all the similarities between users?

    For example:
    user A's hashtags = (cat, bull, cow, chicken, duck)
    user B's hashtags = (cat, chicken, cloth)
    user C's hashtags = (lenovo, HP, Sony)

    Clearly, C has no relation to A, so there is no need to waste time calculating that similarity; we could filter out all the unrelated users before computing similarities. In fact, more than 90% of all users are unrelated to any particular user. How can I use hashtags to quickly find the group of potentially similar users for A? Is this a good idea, or should we just directly calculate the relative similarity between A and all other users? What would be the fastest algorithm tailored to this problem?
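    For illustration, here is a minimal Node.js sketch of that filtering idea (the sample data is hypothetical and this is not the required C implementation): build a hashtag-to-users index, keep only the users who share at least one hashtag with A, then score those candidates with Jaccard similarity.
    Code:
    // Minimal sketch: candidate filtering via an inverted index, then Jaccard scoring.
    // Sample data is hypothetical; real data would come from the 6M-user dump.
    const userTags = {
      A: ["cat", "bull", "cow", "chicken", "duck"],
      B: ["cat", "chicken", "cloth"],
      C: ["lenovo", "hp", "sony"]
    };
    
    // Build the hashtag -> [users] index (the "transposed" view).
    const tagUsers = {};
    for (const [user, tags] of Object.entries(userTags)) {
      for (const tag of tags) {
        (tagUsers[tag] = tagUsers[tag] || []).push(user);
      }
    }
    
    // Candidates for a user = everyone sharing at least one hashtag with them.
    function candidates(user) {
      const out = new Set();
      for (const tag of userTags[user]) {
        for (const other of tagUsers[tag]) {
          if (other !== user) out.add(other);
        }
      }
      return out;
    }
    
    // Jaccard similarity = |intersection| / |union| of the two hashtag sets.
    function jaccard(a, b) {
      const sa = new Set(userTags[a]);
      const sb = new Set(userTags[b]);
      let inter = 0;
      for (const t of sa) if (sb.has(t)) inter++;
      return inter / (sa.size + sb.size - inter);
    }
    
    for (const other of candidates("A")) {
      console.log("A", other, jaccard("A", other).toFixed(3));  // prints "A B 0.333"; C never appears
    }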

  • #2: Old Pedant (Supreme Master coder!)
    Since, presumably, the tweets are recorded in a database somewhere, it seems obvious to me that you should first approach this looking for a database solution. I see no reason, in any case, that JavaScript (which is what this forum is for) would ever be involved.

    As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.

  • #3: rnd me (Senior Coder, Urbana)
    JS can handle this just fine, alone in Node.js or as a map-reduce function in a NoSQL system.

    It seems to me that you simply need to transpose your data from groups of hashtags per user into groups of users per hashtag.

    You can then solve this system-wide by concatenating the user list of every hashtag mentioned by user A into one big repetitive array of users (probably IDs or something), counting how often each user appears, sorting by that count, and removing duplicates.

    Slot #0 should obviously be user A, having 100% participation in the selected pool of hashtags, but slot #1 will be the 2nd closest, slot #2 the 3rd closest, and so on until you reach the users with only one hashtag in common with user A.

    It will use roughly double the RAM of the original data, but should provide near-instant lookups to cover that up-front RAM cost.

    You could run that function for every user to find out who is close to whom system-wide (if needed).
    If your two tables stay in globals the whole time, it shouldn't take very long at all in V8 to calculate the whole list of lists.
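    As a rough illustration (hypothetical small data set, plain Node.js), the transpose / concatenate / count idea above might look like this; counting occurrences in an object is equivalent to the sort-and-remove-duplicates step described:
    Code:
    // Sketch of the transposed look-up-table idea (hypothetical data).
    const tagsByUser = {
      A: ["cat", "bull", "cow", "chicken", "duck"],
      B: ["cat", "chicken", "cloth"],
      D: ["duck", "cow"]
    };
    
    // 1. Transpose: groups of users per hashtag instead of hashtags per user.
    const usersByTag = {};
    for (const [user, tags] of Object.entries(tagsByUser)) {
      for (const tag of tags) (usersByTag[tag] = usersByTag[tag] || []).push(user);
    }
    
    // 2. Concatenate the user lists of every hashtag A mentions, count repetitions,
    //    and rank: the more often a user appears, the more hashtags shared with A.
    function rankedNeighbours(userA) {
      const counts = {};
      for (const tag of tagsByUser[userA]) {
        for (const u of usersByTag[tag]) counts[u] = (counts[u] || 0) + 1;
      }
      return Object.entries(counts).sort((x, y) => y[1] - x[1]); // slot #0 is userA itself
    }
    
    console.log(rankedNeighbours("A"));
    // [["A",5],["B",2],["D",2]] -- A first, then the users sharing the most hashtags.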


    Side note:
    I once had a complex comparison that required an SQL subquery and took about 8-9 minutes to run; it was looking for missing two-way relationships. When I instead exported the DB as JSON and ran the thing in JavaScript, after building an on-demand LUT as described above, I could calculate the same result in about 200 ms instead of almost ten minutes. That let the workers on the project (an inventory of about 80,000 media objects) run the report on demand instead of having me do it twice a day, which greatly sped up progress. I know DBs are the go-to, but they aren't panaceas, and it CAN pay to explore other options.
    Last edited by rnd me; 12-28-2012 at 08:07 AM.

  • #4: ldaneil (New to the CF scene, Singapore)

    Quote Originally Posted by Old Pedant View Post
    Since, presumably, the tweets are recorded in a database somewhere, it seems obvious to me that you should first approach this looking for a database solution. I see no reason, in any case, that JavaScript (which is what this forum is for) would ever be involved.

    As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.
    Dear Old Pedant,

    Yes, my professor asked me to use the C language for it. I posted in the JavaScript section because it is very active, so I have a better chance of meeting a programming pro who can suggest a good, feasible approach. This is really a data-mining problem about how to handle and analyse the data; the essential part is the idea, and once the idea of how to do it is clear, the code implementation should not be a problem. I hope someone can give me some hints. Thank you.

  • #5: ldaneil (New to the CF scene, Singapore)

    Quote Originally Posted by rnd me View Post
    It seems to me that you simply need to transpose your data from groups of hashtags per user into groups of users per hashtag. ...
    Dear rnd me,

    Your suggestion sounds good and useful. I am a university student, and my professor asked me to use the C language for this. I may not have stated my problem clearly.

    A hashtag is a word highlighted by a user, so it reflects the user's interests to some extent; if two users have some common hashtags, they share some hidden similarity of interest. However, on Twitter 90% of users are not active, and most of them have no hashtags at all. So instead of directly calculating the similarity between all users, I would like to first filter out the unrelated users (for a particular user, those with no common hashtags or no hashtags at all). To do this I use an array to store all users and, for each user, a linked list that chains the related user group (for example, all users who share some hashtag with user A form A's related group). The second step is to calculate the similarity within A's related group, giving a relative similarity value between A and every other user in that group. Repeating this for every user on Twitter gives a relative similarity value between any pair of users. The next step is to build a graph from those values (similarity = 0 means no link between two users) and finally to run the fast unfolding community-detection algorithm on that graph to find the clusters in the social network.

    In fact, the hashtag is only one of the measurements of similarity between users; I categorise the measurements into content similarity, link similarity and meta-data similarity.

    Content similarity: hashtags and tweets

    Link similarity: the follower/following relationship between two users and the number of times they have retweeted, mentioned or replied to each other

    Meta-data similarity: location, gender and age

    Content similarity = hashtag similarity + tweet similarity, and the same approach gives link similarity and meta-data similarity.

    Total similarity = content similarity + link similarity + meta-data similarity

    The total similarity is a single numeric value, which is the weighted link I want for building the similarity graph of the Twitter network; something like this:
    http://www.xatlantis.ch/examples/gra...h1_example.png
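    One possible sketch of combining the three components into a single edge weight (Node.js; the component measures below are simplified placeholders such as Jaccard on hashtags, mutual follows, and exact-match meta-data fields; the real project would substitute its own measures and weights):
    Code:
    // Sketch: combine content, link and meta-data similarity into one edge weight.
    // All three component calculations here are stand-ins, not the project's real measures.
    function jaccard(setA, setB) {
      let inter = 0;
      for (const x of setA) if (setB.has(x)) inter++;
      const union = setA.size + setB.size - inter;
      return union === 0 ? 0 : inter / union;
    }
    
    function totalSimilarity(a, b) {
      const content = jaccard(a.hashtags, b.hashtags);                     // content similarity
      const link    = (a.follows.has(b.id) ? 0.5 : 0) +
                      (b.follows.has(a.id) ? 0.5 : 0);                     // link similarity
      const meta    = (a.location === b.location ? 1 : 0) / 3 +
                      (a.gender === b.gender ? 1 : 0) / 3 +
                      (Math.abs(a.age - b.age) <= 5 ? 1 : 0) / 3;          // meta-data similarity
      return content + link + meta;   // total = content + link + meta-data
    }
    
    const userA = { id: 1, hashtags: new Set(["cat", "duck"]), follows: new Set([2]),
                    location: "SG", gender: "m", age: 22 };
    const userB = { id: 2, hashtags: new Set(["cat"]), follows: new Set([1]),
                    location: "SG", gender: "f", age: 24 };
    console.log(totalSimilarity(userA, userB)); // 0.5 + 1 + 0.667 = ~2.17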

  • #6: rnd me (Senior Coder, Urbana)
    Quote Originally Posted by ldaneil View Post
    Yes, my professor asked me to use the C language for it. I posted in the JavaScript section because it is very active, so I have a better chance of meeting a programming pro who can suggest a good, feasible approach.
    Heh. I don't think too many professional programmers hang out in JavaScript forums; a lot of them think JS sucks. Does the simple transposed look-up-table routine I described not solve your problem?

  • #7: Old Pedant (Supreme Master coder!)
    RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.

    That is a lot different from RndMe's 80,000 media objects.

    I am not saying that a SQL approach is the only way. But if you want to solve the problem quickly--and I would assume that is true since this is for a class (you did say "professor"?)--then SQL seems to me to be the answer. It feels like a one-day (or less) project if done using SQL.

    But you never did describe what the RESULTS are supposed to be. Recall that I wrote
    As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher (well...not harder coding, just will take a long time) job to find the highest similarities, say, among all users.
    So...just what is it you are supposed to be able to report?

    Find the 100 people who have the most hashtags in common? Or what?

    Anyway, as a SQL problem it really is simple:
    Code:
    CREATE TABLE users (
        userid INT AUTO_INCREMENT PRIMARY KEY,
        username VARCHAR(50),
        ... other user info ...
    );
    
    CREATE TABLE hashtags (
        tagid INT AUTO_INCREMENT PRIMARY KEY,
        tag VARCHAR(50)
    );
    
    CREATE TABLE userTags (
        userid INT,
        tagid INT,
        PRIMARY KEY (userid, tagid),
        CONSTRAINT FOREIGN KEY usertags_user (userid) REFERENCES users(userid),
        CONSTRAINT FOREIGN KEY usertags_tag  (tagid)  REFERENCES hashtags(tagid)
    );
    
    And to find, say, the 20 people with the most hash tag matches *TO A GIVEN USER*:
    
    SELECT u2.username, COUNT(*) AS matches
    FROM users AS u1, users AS u2, userTags AS T1, userTags AS T2
    WHERE u1.username = 'Fred Flintstone' /* Fred being the given user in this demo */
    AND u1.userid = T1.userid
    AND T1.tagid = T2.tagid
    AND T1.userid <> T2.userid
    AND T2.userid = u2.userid  /* join the matching tag rows back to the second user */
    GROUP BY u2.username
    ORDER BY matches DESC
    LIMIT 20;


  • #8: rnd me (Senior Coder, Urbana)
    Quote Originally Posted by Old Pedant View Post
    RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.
    My problem had fewer objects, but it needed a full table scan per object to calculate the field we needed.

    If he is using C, I don't think he would use SQL to do the analysis.

    Doing some back-of-the-envelope math, I think you can handle 6 million users with my routine.

    Storing each unique hashtag as a long integer (4 bytes), with an average of 19 or 20 tags per user (using one slot as a user ID):
    6,000,000 * 20 * 4 = ~480 MB

    For the LUT you need an array of all the hashtag strings. Let's say there are 100K unique hashtags at 20 bytes each:
    100,000 * 20 = 2 MB

    You then need the transposed data (users by hashtag), referring to each user as a long integer. Let's pretend each hashtag has 50 users on average:
    100,000 * 50 * 4 = ~20 MB

    That's only ~500 megabytes, no problem for an off-the-shelf laptop.
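    The same back-of-the-envelope estimate, spelled out in Node.js (all counts are the assumed averages above; the 4-byte-per-ID figure assumes typed arrays such as Int32Array rather than ordinary JS arrays):
    Code:
    // Quick check of the memory estimate. All counts are assumptions, not measurements.
    const USERS = 6e6, SLOTS_PER_USER = 20, BYTES_PER_ID = 4;   // 4 bytes = one Int32Array entry
    const UNIQUE_TAGS = 1e5, BYTES_PER_TAG = 20, USERS_PER_TAG = 50;
    
    const userTable  = USERS * SLOTS_PER_USER * BYTES_PER_ID;        // 480,000,000 bytes
    const tagStrings = UNIQUE_TAGS * BYTES_PER_TAG;                  //   2,000,000 bytes
    const transposed = UNIQUE_TAGS * USERS_PER_TAG * BYTES_PER_ID;   //  20,000,000 bytes
    
    console.log(((userTable + tagStrings + transposed) / 2 ** 20).toFixed(0) + " MB"); // ~479 MB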
    Last edited by rnd me; 12-28-2012 at 08:01 PM.

  • #9: Old Pedant (Supreme Master coder!)
    Your numbers seem reasonable.

    But those will be essentially the same numbers that the MySQL DB and query will see in the usertags table.

    So you will likely get the results faster each time...*AFTER* you write the C code...but I will get the first results faster by not *HAVING* to write the C code.

    It all depends on what his goal is, which we so far don't know.

    Oh...and invoking MySQL from C code is essentially the same as invoking it from PHP. The primary interface to MySQL is, indeed, C.

  • #10: Logic Ali (Senior Coder, London)
    Quote Originally Posted by rnd me View Post
    Does the simple transposed look-up-table routine I described not solve your problem?
    I get the feeling that you might be seriously over-estimating this customer.

  • #11: Old Pedant (Supreme Master coder!)
    Quote Originally Posted by Logic Ali View Post
    I get the feeling that you might be seriously over-estimating this customer.
    LOL! Yeah, we probably both are.

    And maybe he really does have to write it all in C as a proof he can code in that language. (Though why anybody would be taking a course in C in this day and age is beyond me. *AT LEAST* he should be taking C++, if not Java or C#. Some language that is just a *tad* more modern!)

  • #12: ldaneil (New to the CF scene, Singapore)

    Quote Originally Posted by Old Pedant View Post
    But you never did describe what the RESULTS are supposed to be. ... So...just what is it you are supposed to be able to report? Find the 100 people who have the most hashtags in common? Or what?
    Dear Old Pedant,

    Yes, I am a university student; however, this is a research project rather than a class project, and my professor asked me to implement it in C. The final goal is to find a relative similarity value between users. The hashtag is only one of the measurements of similarity; I categorise the measurements into content similarity, link similarity and meta-data similarity.

    Content similarity: hashtags and tweets

    Link similarity: follower-following relationship between two users and the number of times they have retweeted, mentioned or replied to each other

    meta-data similarity: location, gender and age

    Content similarity = hashtag similarity + tweet similarity, and the same approach gives link similarity and meta-data similarity.

    Total similarity = content similarity + link similarity + meta-data similarity

    The total similarity is a single numeric value, which is the weighted link I want for building the similarity graph of the Twitter network; something like this:
    http://www.xatlantis.ch/examples/gra...h1_example.png

    After I get this weighted graph, I will run it through the fast unfolding community-detection algorithm to find the potential clusters on Twitter.
    The focus here is to design a good program that can quickly compute the (total) similarity value between users, filter out the unrelated users, and finally output something like a similarity-graph text file. For example:

    usera_id userb_id similarity_value

    12 15 0.6
    . . .
    . . .
    . . .


    Each line contains a pair of user IDs a and b plus a weight; it means there is an edge between a and b (some similarity).
    Once I have a text file like the above, I will pass it to the fast unfolding algorithm to find the potential clusters in the network.
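    A minimal Node.js sketch of producing that edge-list text file (the file name and the sample edges are hypothetical; one line per pair, and pairs with similarity 0 are simply omitted):
    Code:
    // Sketch: write the similarity edge list as a plain text file.
    const fs = require("fs");
    
    const edges = [
      [12, 15, 0.6],    // hypothetical sample edges: [usera_id, userb_id, similarity]
      [12, 31, 0.25]
    ];
    
    const lines = edges
      .filter(([, , sim]) => sim > 0)               // similarity = 0 means no edge
      .map(([a, b, sim]) => `${a} ${b} ${sim}`)
      .join("\n");
    
    fs.writeFileSync("similarity_graph.txt", lines + "\n");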

  • #13: Senior Coder (Dave)
    I think this will be a fun program to write, and I congratulate your professor on creating the challenge!

    By the way, that list:
    Code:
    usera_id userb_id similarity_value
    
    12 15 0.6
    . . .
    . . .
    . . .
    contains 36 trillion entries; I hope you aren't asked to print it out!

    To add to the above discussion, I think a bit of pre-processing is in order here. Do you know how many unique hashtags there are in the dataset? How many hashtags are referenced only once? How many users have zero hashtags in common with anyone else? Those numbers will be simple to find using rnd me's suggestion of collecting the information by hashtag rather than by user, and I suspect they can drastically reduce the size of the 6M x 6M "similarity matrix" you are looking to create.
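    A rough Node.js sketch of that pre-processing pass (hypothetical in-memory data; a real run would stream the 6M-user dump): count the unique hashtags, the hashtags used by only one user, and the users who share no hashtag with anyone else.
    Code:
    // Pre-processing stats over a hashtags-per-user map (hypothetical data).
    function preprocessStats(tagsByUser) {
      const usersByTag = {};
      for (const [user, tags] of Object.entries(tagsByUser)) {
        for (const tag of new Set(tags)) {
          (usersByTag[tag] = usersByTag[tag] || []).push(user);
        }
      }
    
      const uniqueTags = Object.keys(usersByTag).length;
      const singletonTags = Object.values(usersByTag).filter(u => u.length === 1).length;
    
      // A user is "isolated" if none of their hashtags is used by anyone else.
      let isolatedUsers = 0;
      for (const tags of Object.values(tagsByUser)) {
        const shared = [...new Set(tags)].some(t => usersByTag[t].length > 1);
        if (!shared) isolatedUsers++;
      }
      return { uniqueTags, singletonTags, isolatedUsers };
    }
    
    console.log(preprocessStats({
      A: ["cat", "duck"], B: ["cat"], C: ["lenovo"], D: []
    }));
    // { uniqueTags: 3, singletonTags: 2, isolatedUsers: 2 }  (C and D are isolated)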

    Dave


