RndMe's comments are interesting and correct, but I question their relevance, given the *amount* of data involved here. SIX MILLION customers. And who knows how many messages and/or hash tags that will amount to. Even if each customer has as few as four hash tags, that's 24 million hash tags.
That is a lot different from RndMe's 80,000 media objects.
I am not saying that a SQL approach is the only way. But if you want to solve the problem quickly--and I would assume that is true since this is for a class (you did say "professor"?)--then SQL seems to me to be the answer. It feels like a one-day (or less) project if done using SQL.
But you never did describe what the RESULTS are supposed to be. Recall that I wrote
So...just what is it you are supposed to be able to report?
As a database problem, it's nearly trivial to find how "close" any ONE GIVEN PERSON is to all others in the DB. It's a tougher job (well...not harder to code, it will just take a long time to run) to find the highest similarities, say, among all users.
Find the 100 people who have the most hashtags in common? Or what?
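If the answer does turn out to be "the N pairs of users with the most hashtags in common", one way it could look is a self-join of the user/tag link table against itself. This is only a sketch: it uses Python's built-in sqlite3 module and an in-memory database instead of MySQL, the sample rows are made up, and the helper name top_pairs is hypothetical. The T1.userid < T2.userid condition keeps each pair from being counted twice.

```python
import sqlite3

# In-memory SQLite database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usertags (userid INT, tagid INT, PRIMARY KEY (userid, tagid))"
)

# Made-up sample data: (userid, tagid) pairs.
conn.executemany(
    "INSERT INTO usertags VALUES (?, ?)",
    [
        (1, 10), (1, 11), (1, 12),
        (2, 10), (2, 11), (2, 12),
        (3, 10), (3, 11),
        (4, 13),
    ],
)

# Self-join on tagid; userid < userid lists each pair once and skips self-matches.
TOP_PAIRS = """
SELECT T1.userid AS user_a, T2.userid AS user_b, COUNT(*) AS matches
FROM usertags AS T1, usertags AS T2
WHERE T1.tagid = T2.tagid
  AND T1.userid < T2.userid
GROUP BY T1.userid, T2.userid
ORDER BY matches DESC, user_a, user_b
LIMIT ?
"""


def top_pairs(conn, limit=100):
    """Return up to `limit` (user_a, user_b, matches) rows, best matches first."""
    return conn.execute(TOP_PAIRS, (limit,)).fetchall()


for user_a, user_b, matches in top_pairs(conn):
    print(user_a, user_b, matches)
```

With millions of users, the intermediate result of this join can get very large, which is why this version is the one that will take a long time rather than the one that is hard to write.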
Anyway, as a SQL problem it really is simple:
CREATE TABLE users (
    userid INT AUTO_INCREMENT PRIMARY KEY,
    ... other user info (username, etc.) ...
);
CREATE TABLE hashtags (
    tagid INT AUTO_INCREMENT PRIMARY KEY,
    ... the hashtag text ...
);
CREATE TABLE usertags (
    userid INT NOT NULL,
    tagid INT NOT NULL,
    PRIMARY KEY (userid, tagid),
    CONSTRAINT FOREIGN KEY usertags_user (userid) REFERENCES users(userid),
    CONSTRAINT FOREIGN KEY usertags_tag (tagid) REFERENCES hashtags(tagid)
);
And to find, say, the 20 people with the most hash tag matches *TO A GIVEN USER*:
SELECT u2.username, COUNT(*) AS matches
FROM users AS u1, users AS u2, usertags AS T1, usertags AS T2
WHERE u1.username = 'Fred Flintstone' /* Fred being the given user in this demo */
AND u1.userid = T1.userid
AND T1.tagid = T2.tagid
AND T1.userid <> T2.userid
AND T2.userid = u2.userid /* tie each matching tag row back to its owner */
GROUP BY u2.username
ORDER BY matches DESC
LIMIT 20;
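To try the "closest to a given user" query without setting up a server, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The SQLite column types, the tag column, the sample rows, and the helper name closest_to are all assumptions made for the demo. The query is the same idea as above, including the T2.userid = u2.userid condition that links each matching tag row to the user it belongs to, and a LIMIT for the "top 20" part.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userid INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE hashtags (tagid INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE usertags (
    userid INT REFERENCES users(userid),
    tagid INT REFERENCES hashtags(tagid),
    PRIMARY KEY (userid, tagid)
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO users VALUES (?, ?)", [
    (1, "Fred Flintstone"), (2, "Barney Rubble"),
    (3, "Wilma Flintstone"), (4, "Mr. Slate"),
])
conn.executemany("INSERT INTO hashtags VALUES (?, ?)", [
    (10, "#bedrock"), (11, "#bowling"), (12, "#quarry"), (13, "#boss"),
])
conn.executemany("INSERT INTO usertags VALUES (?, ?)", [
    (1, 10), (1, 11), (1, 12),   # Fred
    (2, 10), (2, 11),            # Barney shares two tags with Fred
    (3, 10),                     # Wilma shares one
    (4, 13),                     # Mr. Slate shares none
])

CLOSEST = """
SELECT u2.username, COUNT(*) AS matches
FROM users AS u1, users AS u2, usertags AS T1, usertags AS T2
WHERE u1.username = ?
  AND u1.userid = T1.userid
  AND T1.tagid = T2.tagid
  AND T1.userid <> T2.userid
  AND T2.userid = u2.userid
GROUP BY u2.username
ORDER BY matches DESC
LIMIT ?
"""


def closest_to(conn, username, limit=20):
    """Return (username, matches) rows for users sharing tags with `username`."""
    return conn.execute(CLOSEST, (username, limit)).fetchall()


for name, matches in closest_to(conn, "Fred Flintstone"):
    print(name, matches)
```

Users who share no tags with Fred simply do not show up in the result, since the join only produces rows where a tag matches.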