Hello and welcome to our community! Is this your first visit?
Register
Enjoy an ad free experience by logging in. Not a member yet? Register.
Results 1 to 5 of 5
  1. #1
    Regular Coder
    Join Date
    Sep 2010
    Posts
    331
    Thanks
    9
    Thanked 6 Times in 6 Posts

    Count occurrences of multiple varying substrings within table field?

    How can I count the occurrences of multiple varying substrings within each record of a table field? What I'm trying to do is output the top 5 (or so) words beginning with a hashtag/pound sign (#). I'm not sure how to do this. Would preg_match_all() using its matches parameter suffice, or is substr_count() what I need? Or a combination of both?

    What needs to happen: It should find any and all words beginning with a hashtag/pound sign (#) within existing records in a field called `post`.

    Each occurrence of each word beginning with the # should be counted - that is to say, every occurrence within every record in the `post` field of, for example, #this should be counted as well as every other word starting with #.

    The top 5 words with the most occurrences should be returned.
    (*I don't have any code already*)
    Coding is a challenge, get used to it
    Always remember to debug
    Try the guess & check method
    Break it down into simple steps

  • #2
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,978
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
    I'm not sure if there's even an easy way to do this in SQL itself.
    If you separate the hashtags out *before* inserting the record into the database, then you could use a many to many relationship with hashtags to post (oh but if you do NEVER use a REPLACE syntax; always use an INSERT IGNORE syntax; otherwise you'll either get an error on duplicate or delete associated entries, or set null which would result in broken relationships). Then it would be trivial select on a group by and then querying that.

    PHP wise you could use preg_match_all. Since the only necessary part to take note of is the hashtag, simply find anything that starts with it. You can then count them up from the array.
    PHP Code:
    $sPattern '~#([^\s]+)~';
    $aMatches = array();

    $s 'This is a string with #one or #two hashtags in it.  #one will show twice.';
    if (
    preg_match_all($sPattern$s$aMatches))
    {
        
    print_r(array_count_values($aMatches[1]));

    The array_count_values will index the 'value' with the number it finds. It is case sensitive.

    Now to make use of it though, you will need to select everything out of the database. You may want to look at writing a db function to do this instead, although I'm not sure what you'd need to do that.
    PHP Code:
    header('HTTP/1.1 420 Enhance Your Calm'); 

  • #3
    Regular Coder
    Join Date
    Sep 2010
    Posts
    331
    Thanks
    9
    Thanked 6 Times in 6 Posts
    Quote Originally Posted by Fou-Lu View Post
    I'm not sure if there's even an easy way to do this in SQL itself.
    If you separate the hashtags out *before* inserting the record into the database, then you could use a many to many relationship with hashtags to post (oh but if you do NEVER use a REPLACE syntax; always use an INSERT IGNORE syntax; otherwise you'll either get an error on duplicate or delete associated entries, or set null which would result in broken relationships). Then it would be trivial select on a group by and then querying that.

    PHP wise you could use preg_match_all. Since the only necessary part to take note of is the hashtag, simply find anything that starts with it. You can then count them up from the array.
    PHP Code:
    $sPattern '~#([^\s]+)~';
    $aMatches = array();

    $s 'This is a string with #one or #two hashtags in it.  #one will show twice.';
    if (
    preg_match_all($sPattern$s$aMatches))
    {
        
    print_r(array_count_values($aMatches[1]));

    The array_count_values will index the 'value' with the number it finds. It is case sensitive.

    Now to make use of it though, you will need to select everything out of the database. You may want to look at writing a db function to do this instead, although I'm not sure what you'd need to do that.
    Excuse me if I'm wrong but, doesn't the third parameter of preg_match_all() put all arguments into an array in the first place? Why declare one?

    And this is what I have gotten on my own so far:
    PHP Code:
    $getPosts "SELECT `post` FROM `posts`";
                        
    $gpQ $cnct->query($getPosts);
                        foreach(
    $gpQ->fetchAll(PDO::FETCH_ASSOC) as $fetchPost) {
                            
    $htMatches preg_match_all('/[#]+[A-Za-z0-9-_]+/'$fetchPost['post'], $match);
                            if (
    $htMatches >= 1) {
                                for (
    $i 0$i <= $htMatches$i++) {
                                    echo 
    '<p>'.$match[$i] . ' occurs ' .substr_count($match[$i][$i], $fetchPost['post']). ' time(s)</p>';
                                }
                            }
                        
    /*echo $fetchPosts['post'];
                        echo '<hr />';*/
                        

    Treating $match like a multi-d array gets rid of an initial error, but now in a sense the same thing happens but without the warning error:
    Array occurs 0 time(s)
    occurs 0 time(s)
    Array occurs 0 time(s)
    Edit: As of now, I can get the script to find all hashtags (the values in the array) and output them but the script still won't correctly count their occurrences. Where it should output the correct number of times (many of them only being 1), it instead returns 0 (i.e #whatever occurs 0 times)
    PHP Code:
    foreach($gpQ->fetchAll(PDO::FETCH_ASSOC) as $fetchPost) { 
                            
    $htMatches preg_match_all('/[#]+[A-Za-z0-9-_]+/'$fetchPost['post'], $match); 
                            if (
    $htMatches >= 1) { 
                                for (
    $i 0$i <= $htMatches$i++) { 
                                    echo 
    '<p>'.$match[0][$i] . ' occurs ' .substr_count($match[0][$i], $fetchPost['post']). ' time(s)</p><hr />'
                                } 
                            } 
                        } 
    Last edited by elitis; 06-26-2013 at 02:09 AM.
    Coding is a challenge, get used to it
    Always remember to debug
    Try the guess & check method
    Break it down into simple steps

  • #4
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,978
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
    It does, but that doesn't mean its not wise to predeclare. PHP creates far too many bad habits as is, but predeclaration is one of the worst.

    You need to capture from the pcre if you plan on using the values after. Use brackets to capture groups; group 0 will always exist though, so if you want to keep the # on it, that's fine to not group anything else.
    Don't use substr_count since that requires a new iteration of a string for a count of every matched hashtag. That's why you use array_count_values instead. You can iterate that and break it into key=>value pairs for the hashtag and the count.
    PHP Code:
    header('HTTP/1.1 420 Enhance Your Calm'); 

  • #5
    Regular Coder
    Join Date
    Sep 2010
    Posts
    331
    Thanks
    9
    Thanked 6 Times in 6 Posts
    ultimately went with something a little less complex but in the hashtags table, (likely due to the dual loops) each hashtag's `times_used` field is either updated by 2 or initially set to 2. The script doesn't seem to work without both loops, but having 2 is causing problems.
    PHP Code:
    if (preg_match_all("/[#]+[A-Za-z0-9-_]+/i"$post$hashtags)) {
            for (
    $h 0$h <= $hashtags$h++) {
                
    $htMatched $hashtags[$h];
                foreach(
    $htMatched as $key=>$htVal) {
                    
    $htCheck "SELECT COUNT(hashtag) FROM `hashtags` WHERE `hashtag` = '$htVal'";
                    
    $htQ $con->query($htCheck);
                    
    $htExistence $htQ->fetchColumn();
                    if (
    $htExistence >= 1) {
                        
    $addTU "UPDATE `hashtags` SET `times_used` = times_used+1 WHERE `hashtag` = '$htVal'";
                        
    $updateHT $con->exec($addTU);
                    } else {
                        
    $htMSQL "INSERT INTO `hashtags` (`hashtag`) VALUES (:hashtag)";
                        
    $htMQ $con->prepare($htMSQL);
                        
    $htMQ->bindParam(':hashtag'$htValPDO::PARAM_STR);
                        
    $htMQ->execute();
                    }
                }
            }
        } 
    Coding is a challenge, get used to it
    Always remember to debug
    Try the guess & check method
    Break it down into simple steps


  •  

    Posting Permissions

    • You may not post new threads
    • You may not post replies
    • You may not post attachments
    • You may not edit your posts
    •