![]() |
Distinct Count of values, but need a wildcard
If I use this query:
Code:
SELECT name,COUNT(*) as count FROM tablename GROUP BY name ORDER BY count DESC;The background is, I have a field in my table that contains URL's. There are plenty that are in the same domain...ie www.codingforums.com, however, they all contain different points of reference after the .com. More specifically, my question is I want to search for number of occurrences for the domains. They all should begin either http:// or https:// but not all would have the www. and such. Not sure it is possible but figured I would ask since Google searching didn't really give me anything useful to go off of. |
Quote:
It will count *ALL* records *PER NAME*. If you wanted a count of DISTINCT VALUES then you would need to use: Code:
SELECT name, COUNT( DISTINCT fieldname ) AS countCode:
SELECT name, COUNT( fieldname ) AS count*********** Having cleared that up... Quote:
There's no simple way build into MySQL. It's true that MySQL has regular expressions, which would seem the ideal way, but unfortunately they are limited to use with the RLIKE operator. So you could easily search for all occurrences of codingforums.com for example, but FINDING all the domains is much trickier. A lot of it will depend on how "regular" the contents of that field are. Does the field *ONLY* contain URLs? Or does it contain URLs embedded within text? e.g., Quote:
|
The field only contains URL's, no additional text...
The URL's are formed more like http://www.codingforums.com/newreply...te=1&p=1303255 rather than just www.codingforums.com |
*IF* all the fields contain *ONLY* URLs of the form
Code:
http(s)://(www.)xxxxx.yyy/anythinghttp://download.microsoft.com? That is, a subdomain?Are you worried about them? I think it would be tough to create a simple expression to look for those, as well. Ignoring subdomains, this SEEMS to work: Code:
SELECT |
How it works:
First, we concat a '/' to the end of the URL, just in case there is none there already. Then we replace 'http://' and 'https://' and 'www.' with nothing. That gets us down to something like codingforums.com/gobbledygook/ So then we use LOCATE to find the first '/' in that (which MIGHT be the one we added at the first step). And then we use SUBSTR to get all the characters up to but not including that first '/'. Make sense? |
If it's not obvious, that will collect subdomains separately from their domains.
That is, you might get something like this: Code:
microsoft.com 119 |
Quote:
|
Quote:
|
I *can* get rid of subdomains. It's just much more complex.
The first thought is to just get rid of everything in front of the first period. That simplifies things tremendously: No more need to distinguish between http: and https: BUT... What about domains with *NO* prefix? Example: What if you have just http://ebay.com/xxx?If I strip every thing before (and including) the first period we would end up with just com OOPS? |
Quote:
|
Quote:
|
| All times are GMT +1. The time now is 01:38 AM. |
Powered by vBulletin®
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.