PDA

View Full Version : Normalising slows query


takin
05-08-2008, 09:02 AM
I've got a report that I've been loading into mysql to process. I've normalised the database (essential for another query), but that's made a key query extend from .09s (pre-normalisation) to 4min50s (post-normalisation). I'm sure there's a better way of writing the query or indexing tables, but explain means nothing to me.

The complexity comes from the fact that each line (call) of the original report links 2 sites (source and destination) from the same pool. I've pulled the sites out into another table and created a third (connections) to link the two properly, but I still need to know the type of systems the connection is between. I don't care too much about the direction so if one line of the result shows connections from type 1 to type 3 plus connections from type 3 to type 1, that's fine.

Here's the table structure:
Single table: id, date, mins, s_ip, s_type, d_ip, d_type

broken into
Calls: call_id, date, mins
Sites: ip, type
Connections: id, call_id, ip, direction

The query on a single table is
SELECT
YEAR(date), MONTH(date), s_type, d_type, COUNT(mins), SUM(mins)
FROM
pnraw
GROUP BY
YEAR(date), MONTH(date), s_type, d_type
Explain shows
+----+-------------+-------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+---------------------------------+
| 1 | SIMPLE | pnraw | ALL | NULL | NULL | NULL | NULL | 43864 | Using temporary; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+-------+---------------------------------+

On the normalised database, this becomes
SELECT
YEAR(date), MONTH(date), s_type, d_type, COUNT(mins), SUM(mins)
FROM
calls
LEFT JOIN
(SELECT
call_id, type AS s_type
FROM
connections
LEFT JOIN
sites
ON
system=ip
WHERE
direction=1)
AS s
ON
calls.id=s.call_id
LEFT JOIN
(SELECT
call_id, type AS d_type
FROM
connections
LEFT JOIN
sites
ON
system=ip
WHERE
direction=2)
AS d
ON
calls.id=d.call_id
GROUP BY
YEAR(date), MONTH(date), s_type, d_type;

Explain says
+----+-------------+-------------+--------+---------------+---------+---------+-----------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+---------------+---------+---------+-----------------------------+-------+---------------------------------+
| 1 | PRIMARY | calls | ALL | NULL | NULL | NULL | NULL | 42347 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 42347 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 42347 | |
| 3 | DERIVED | connections | index | NULL | call_id | 19 | NULL | 84694 | Using where; Using index |
| 3 | DERIVED | sites | eq_ref | PRIMARY | PRIMARY | 15 | pnreport.connections.system | 1 | |
| 2 | DERIVED | connections | index | NULL | call_id | 19 | NULL | 84694 | Using where; Using index |
| 2 | DERIVED | sites | eq_ref | PRIMARY | PRIMARY | 15 | pnreport.connections.system | 1 | |
+----+-------------+-------------+--------+---------------+---------+---------+-----------------------------+-------+---------------------------------+

Is there any way I can reduce the time this query takes in the normalised structure?

Stooshie
05-08-2008, 11:51 AM
Try this.

SELECT
YEAR(ca.date), MONTH(ca.date),
s.type, co.type
FROM
calls ca
LEFT OUTER JOIN
connections co
ON
co.call_id = ca.call_id
LEFT OUTER JOIN
sites s
ON
s.ip = co.ip
GROUP BY
YEAR(ca.date), MONTH(ca.date),
s.type, d.type

Sorry, I haven't had a chance to create a DB test it yet.

The explain is basically showing you the number of searches the DB is having to do and a bit more info on speeds, indexing etc.

As you can see, 7 different searches to get 4 pieces of data from what is basically 1 many-many relationship is quite a lot.

Another important column in the EXPLAIN table is the rows. The less the better. Most of the searches on your EXPLAIN are searching between 40000 and 80000 records (I assume most of the records in the DB) Indexing will help to reduce that number.

takin
05-09-2008, 02:15 AM
Stooshie,
Thanks for looking at this. It's not as simple as you suggest. There are at least 2 many-many(?) relationships. ie. Where does d.type come from in your version? Here's how it looks to me.


calls: id date mins
|
conns: |-- call_id ip dir=src
| |
sites: | |-- ip s.type
|
conns: |-- call_id ip dir=dest
|
sites: |-- ip d.type


There are 2 lines in connections for each line (call) in calls. One is the source, the other is the destination. Each has a one to one relationship with a site for which I need the type. Any site can both make and receive calls.

I hope this clears up the structure a bit more.

oesxyl
05-09-2008, 03:16 AM
I guess that this come from the fact that you normalise the original database having in mind the entities not their instances. I mean both source site and destination site are the same entity, a site, but as instance there are distinct, source site and destination site, and have a relation between them and that make the structure to become complex. Also I don't think that is a better way to normalise then yours, at least I don't see one, :)
Maybe a solution is to create specific view or views. This way only creating view will be slower or maybe not.

PS: is just a opinion, :)

regards

Stooshie
05-09-2008, 11:55 AM
My mistake about the "d.type", I meant "co.type" in the query.

It does look more complicated than I first thought. I'll have another look.

Stooshie
05-09-2008, 12:13 PM
Ah, I see what you are trying to do now! :-)

You are probably better to normalise the database with the connections table having a source ip and a destination ip, then the query should be easier. something like:

SELECT
YEAR(ca.date), MONTH(ca.date),
src.type AS s_type, dest.type AS d_type,
COUNT(mins), SUM(mins)
FROM
calls ca
LEFT OUTER JOIN
connections con
ON
con.call_id = ca.call_id
LEFT OUTER JOIN
sites src
ON
src.ip = con.source_ip
LEFT OUTER JOIN
sites dest
ON
dest.ip = con.dest_ip
GROUP BY
YEAR(ca.date), MONTH(ca.date),
src.type, dest.type

You will need the two columns source_ip and dest_ip in the connections table.

takin
05-10-2008, 01:01 AM
Stooshie,
Yes. That would probably work. Then I'll have a 1-1 relationship between calls and connections, so I might as well put them in the one table. I'll give it a go on Monday. Thanks.

oesxyl,
What do you mean by 'specific view'? Is that something like a separate table created from a query that is always kept up to date as other data changes? Is that possible?

oesxyl
05-10-2008, 02:04 AM
Stooshie,
Yes. That would probably work. Then I'll have a 1-1 relationship between calls and connections, so I might as well put them in the one table. I'll give it a go on Monday. Thanks.

oesxyl,
What do you mean by 'specific view'? Is that something like a separate table created from a query that is always kept up to date as other data changes? Is that possible?
I was thinking to associate one or more query with a view which make the query optimal but yes is possible but are some limitation you must read:

http://dev.mysql.com/doc/refman/5.0/en/create-view.html

related stuff you find and make some test.
You can, in some condition, to modify/add/delete row/fields from tables and this is reflected in view.
Also you can operate with the view and the changes are reflected into tables, but how I said are some limitation.
To be honest I don't have much experience with views, I used few times only.
In your case I was thinking that because the original table was faster but not normalised, creating a view solve the problem, :)


a better link then previous:
http://dev.mysql.com/doc/refman/5.0/en/views.html

regards

Stooshie
05-10-2008, 09:20 PM
good idea oesxyl.

If he is worried about speed though, isn't a view really just another query with a name attached that behaves like it's a table so that every time you run a query on it, it is also running another query in the background?

oesxyl
05-10-2008, 09:53 PM
good idea oesxyl.

If he is worried about speed though, isn't a view really just another query with a name attached that behaves like it's a table so that every time you run a query on it, it is also running another query in the background?
I really don't know how it work but I presume that is optimised somehow else I don't see the difference between a temporary table and a view except that second is in two steps, query to create the table and query over temporary table.
I presume that the view is create once from the 'as' query and further update when the tables changes, something what could be simulated using triggers for example.
Anyway, some tests and banchmarks could make clear if is or not what we need in a given context.

regards

GJay
05-11-2008, 10:39 AM
you probab;y want indexes on 'direction' and 'call_id' and 'system' and 'ip'

takin
05-12-2008, 10:41 AM
Stooshie, your idea puts me back to square 1. The reason I started trying to normalise was that I want to run a query showing total usage for each site and having it link to 2 fields was a nightmare to code. I never once got accurate data until I normalised.

GJay, well spotted. I seem to be missing indexes on direction and system. I'll give that a go.

oesxyl, views sound like a last resort, but I'll do some reading.

Thanks for all your help. I'm learning heaps.

Stooshie
05-12-2008, 01:43 PM
I have downloaded this document (http://dev.mysql.com/tech-resources/articles/mysql-views.pdf) from mysql.com that explains views.

I am currently making my way through it when I have the odd minutes here and there.

It seems that views are basically a query with a name attached. That shouldn't affect your query structure but it may affect the speed.

When you say my idea puts you back to square 1 do you mean my idea about views or my idea for the query further up?

takin
05-13-2008, 01:52 AM
Stooshie,
I meant your idea of using a s_ip and d_ip field in the one table. It makes queries that are independent of specific systems easier, but makes queries that summarise by system a nightmare.

GJay,
Adding indexes to those columns only shaved 6s off the total time to make it 4m44s.