View Full Version : Indexing advice needed

11-16-2011, 04:34 AM
I'm handling a large amount of traffic and need to record details of each visit. I record things such as the referring URL and the user agent. I've been recording these things as Text columns in each row, and consequently the table is becoming very large. I would like to make two new tables, one for referrer and one for user agent, and then record just ID numbers in the main table.

I'm worried about the time needed to query these two tables and fetch the ID numbers. A useragent can be 80-200 characters long, and a referring URL can be 20-200+ characters. I don't know how many rows would be in each of these two tables but I'm recording *lots* of duplicates -- in one day we might 100,000 hits from the same referrer, so it makes sense to normalize the info, in theory. I don't want there to be any lag caused by this though. If it took 1/4 second to do a query that'd be an issue. What kind of performance can I expect on indexed text fields (or VarChar i suppose) with millions of rows of this kind of data?

Bonus question: once I launch the new method of recording data I'm going to need to change the existing table. I imagine using a cron to process all the rows, starting from the oldest, and substituting in the ID numbers in places of the actual text. But keeping the columns as Text fields until all rows are processed and I can change the column types to Int. When I change the column types it's going to suck basically. I expect the table to lock for at least 2 hours for each of the two column changes? It's 53 Million rows (MyISAM btw).

Old Pedant
11-16-2011, 09:27 PM
MySQL is reasonably efficient at indexing.

But I dunno how to tell you what the performance will be without trying it.

Couldn't you copy the existing table to another machine, do all your changes, try them out, and then if it works just zap the existing db and move the fixed up one back to the original server? MyISAM tables allow you to do that by just copying the myi and myd files. I know, I used to do that all the time (with tables that had millions of records).

As for the "bonus question": An alternative approach maybe? ADD the two int fields to the existing table, as you process it, put the PK values into those int fields. When done, drop the text fields. Test that on a sample vs. your approach to see which would end up faster.

As for lookup time: Since you will be referring to the outer tables via their (assumed auto_increment?) primary keys, it shouldn't take anything approaching a quarter second.

Again, though, all that you are suggesting here could easily be done in a day, so why not just copy the data to another machine and try it? (And remember: you don't have to dump/restore to do this. If you set it up carefully you can do it by just copying the actual files.)

11-16-2011, 09:41 PM

Actually it wouldn't be referring to the new tables by their ID (primary key). Because what it'd be doing is finding the ID by finding a row with the same referring url (and again for useragent). It'd check the referrers table for 'http://google.com' for example, then use the ID number of that row for storing in the main table. If there's no match then it'd insert a new row in the referrers table.

On another forum I'm being told to just keep all of the data together in one table as I'm doing now.

Testing it out is just a bit difficult cause I'd have to download the 16GB table. That's not too terrible I suppose.

Old Pedant
11-16-2011, 10:17 PM
If doing all this would take longer than a couple of days, I might say "keep it as is" as well. But I really do think it's worth your time to give it a shot on another machine.

> Actually it wouldn't be referring to the new tables by their ID (primary key).

Why not??

I understand you that when a new hit is recorded you'd have to look up the referrer first and add it if not found (though you might be able to use ON DUPLICATE KEY...??) but why would you not then store only the ID (PK) of the found or added record into the main table?

FWIW, we were running a MyISAM based system where we were adding 1GB every 20 minutes *AND* doing some moderately hairy queries against the data also in real time. Sometimes, I had to write some gnarly Stored Procs to get decent performance in the real-time queries (created temp tables, etc., even), but it did work. And it did keep up.

Anyway, yes, I did occasionally copy a table from one machine to the other by just copying the .myi and .myd files (and the partition files, as we were using partitioned tables, but I don't see that applying here) and often did 10GB that way. Slow, but so much faster than a dump/restore.

The "trick" is to clone the table structure in the copied-to machine, so that MySQL creates the .myd and .myi files there and "knows" about them. And then you just copy the populated files on top of those and restart mysql and it happily accepts them.

Try it with a couple of small tables to get the idea of it, and then have at it!

At the worst, you'll waste a day or two. At best, you have a great answer.

11-16-2011, 10:37 PM
Oh, we would store just the ID number in the main table. But when I was talking about being concerned with lookup time I mean the time to look up google.com or w/e in the referrers table to find out what ID it is. That's actually my only concern in the whole thing I think :) Having to look up URLs and useragents to get their IDs.

I'm currently looking into shadow copies and merge tables - both recommended to me on the other forum. I've never used either, but it sounds interesting. If I could use a shadow copy while changing indexes and column types and stuff, to avoid downtime, that'd be great. And if using merge or partitioned tables would speed things up, that'd be good too. Cause we rarely use data more than 6 months old. I have a lot of reading to do lol

Old Pedant
11-17-2011, 01:40 AM
LOL! Sure a lot different than our situation.

Collecting 1GB every 20 minutes, we put 100MB into a partition and then kept only the last 100, 200, or 300 partitions (depending on whether we wanted to retain 10GB, 20GB, or 30GB). So were adding and dropping partitions every 20 minutes, too, besides importing that much data.

If 30GB represents 6 *MONTHS* of data for you, I don't think you will have any performance issues that matter.

But it will be "fun" finding out.