CodingForums.com

CodingForums.com (http://www.codingforums.com/index.php)
-   MySQL (http://www.codingforums.com/forumdisplay.php?f=7)
-   -   Distinct Count of values, but need a wildcard (http://www.codingforums.com/showthread.php?t=285048)

mharrison 12-31-2012 01:32 AM

Distinct Count of values, but need a wildcard
 
If I use this query:
Code:

SELECT name,COUNT(*) as count FROM tablename GROUP BY name ORDER BY count DESC;
It will give me the count of each occurrence of a distinct value of a field. However, I am wondering if there would be a way to modify this to have a wildcard.

The background is, I have a field in my table that contains URL's. There are plenty that are in the same domain...ie www.codingforums.com, however, they all contain different points of reference after the .com. More specifically, my question is I want to search for number of occurrences for the domains. They all should begin either http:// or https:// but not all would have the www. and such.

Not sure it is possible but figured I would ask since Google searching didn't really give me anything useful to go off of.

Old Pedant 12-31-2012 06:54 PM

Quote:

Originally Posted by mharrison (Post 1303153)
If I use this query:
Code:

SELECT name,COUNT(*) as count FROM tablename GROUP BY name ORDER BY count DESC;
It will give me the count of each occurrence of a distinct value of a field.

*** NOT TRUE! ***

It will count *ALL* records *PER NAME*.

If you wanted a count of DISTINCT VALUES then you would need to use:
Code:

SELECT name, COUNT( DISTINCT fieldname ) AS count
FROM tablename GROUP BY name ORDER BY count DESC

And if you wanted a count of NON-NULL values of a given field you would do
Code:

SELECT name, COUNT( fieldname ) AS count
FROM tablename GROUP BY name ORDER BY count DESC

But using COUNT(*) simply counts *ALL* records. Period.

***********

Having cleared that up...
Quote:

The background is, I have a field in my table that contains URL's. There are plenty that are in the same domain...ie www.codingforums.com, however, they all contain different points of reference after the .com. More specifically, my question is I want to search for number of occurrences for the domains. They all should begin either http:// or https:// but not all would have the www. and such.
So, more simply stated, you want to group by domain name alone. Yes?

There's no simple way build into MySQL. It's true that MySQL has regular expressions, which would seem the ideal way, but unfortunately they are limited to use with the RLIKE operator.

So you could easily search for all occurrences of codingforums.com for example, but FINDING all the domains is much trickier.

A lot of it will depend on how "regular" the contents of that field are. Does the field *ONLY* contain URLs? Or does it contain URLs embedded within text? e.g.,
Quote:

Be sure to visit http://ebay.com for every day bargains
??

mharrison 12-31-2012 07:07 PM

The field only contains URL's, no additional text...

The URL's are formed more like
http://www.codingforums.com/newreply...te=1&p=1303255
rather than just www.codingforums.com

Old Pedant 12-31-2012 07:27 PM

*IF* all the fields contain *ONLY* URLs of the form
Code:

    http(s)://(www.)xxxxx.yyy/anything
Then I think we can do this. But what about a url such as http://download.microsoft.com? That is, a subdomain?

Are you worried about them? I think it would be tough to create a simple expression to look for those, as well.

Ignoring subdomains, this SEEMS to work:
Code:

SELECT
  SUBSTR(
      REPLACE(REPLACE(REPLACE(CONCAT(url,'/'),'http://',''),'https://',''),'www.',''),
      1,
      LOCATE('/',REPLACE(REPLACE(REPLACE(CONCAT(url,'/'),'http://',''),'https://',''),'www.',''))-1
  ) AS domain,
  COUNT(*) AS howmany
FROM urls
GROUP BY domain
ORDER BY howmany DESC

But *ONLY* if your URL field contains urls alone, no other text.

Old Pedant 12-31-2012 07:31 PM

How it works:

First, we concat a '/' to the end of the URL, just in case there is none there already.

Then we replace 'http://' and 'https://' and 'www.' with nothing.

That gets us down to something like codingforums.com/gobbledygook/

So then we use LOCATE to find the first '/' in that (which MIGHT be the one we added at the first step).

And then we use SUBSTR to get all the characters up to but not including that first '/'.

Make sense?

Old Pedant 12-31-2012 07:34 PM

If it's not obvious, that will collect subdomains separately from their domains.

That is, you might get something like this:
Code:

microsoft.com          119
download.microsoft.com  17


mharrison 12-31-2012 07:37 PM

Quote:

Originally Posted by Old Pedant (Post 1303260)
How it works:

First, we concat a '/' to the end of the URL, just in case there is none there already.

Then we replace 'http://' and 'https://' and 'www.' with nothing.

That gets us down to something like codingforums.com/gobbledygook/

So then we use LOCATE to find the first '/' in that (which MIGHT be the one we added at the first step).

And then we use SUBSTR to get all the characters up to but not including that first '/'.

Make sense?

It does make sense, however I double checked my table and it does contain subdomains as well. Perhaps if it were possible to also ignore the first part of the subdomain as we did with the www., then as long as I could display the entire field at the end, I think I would get good results. I don't see any duplicate domains that have different sudomains....it seems if I have download.microsoft.com, that is all I have, I don't have ie.microsoft.com as far as I can tell.

mharrison 12-31-2012 07:38 PM

Quote:

Originally Posted by Old Pedant (Post 1303261)
If it's not obvious, that will collect subdomains separately from their domains.

That is, you might get something like this:
Code:

microsoft.com          119
download.microsoft.com  17


Ok, got notified of this reply as I was replying. If the query you posted returns results like you have above, then that would be exactly what I would want to see.

Old Pedant 12-31-2012 08:07 PM

I *can* get rid of subdomains. It's just much more complex.

The first thought is to just get rid of everything in front of the first period.

That simplifies things tremendously: No more need to distinguish between http: and https:

BUT... What about domains with *NO* prefix?

Example: What if you have just http://ebay.com/xxx?

If I strip every thing before (and including) the first period we would end up with just com

OOPS?

Old Pedant 12-31-2012 08:07 PM

Quote:

Originally Posted by mharrison (Post 1303263)
Ok, got notified of this reply as I was replying. If the query you posted returns results like you have above, then that would be exactly what I would want to see.

?? You are saying it is okay to get those subdomains separately? In that case we are done.

mharrison 01-01-2013 06:30 AM

Quote:

Originally Posted by Old Pedant (Post 1303270)
?? You are saying it is okay to get those subdomains separately? In that case we are done.

Yes, getting the subdomains separately is fine. Thanks again for your help!


All times are GMT +1. The time now is 01:38 AM.

Powered by vBulletin®
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.