Working with large DBs
I'm looking to get advice on how best to manage large databases.
I have a database I'm running with two tables of over 1.5 million entries. It's made queries slow (mostly, I think, because it's a shared server), but I want to learn what I can/should do to make large tables more efficient. What are the most important rules? Should I be using MyISAM or InnoDB (MyISAM, right?)? I've created proper keys. Both tables only have 3 columns, but still...
So, what advice?
Negligible difference in speed from the table type, I'd think.
1.5 million rows isn't really that much.
What indexes do you have on the tables, and what queries do you generally run against them?
What speed do you get out of those queries now?
This is a nice benchmark that has been broken down by scaling (albeit being an older article): http://www.mysqlperformanceblog.com/...hmarks-part-1/
It should help you choose which type to use. InnoDB wins overall here, with very few points in the scaling where it drops below MyISAM. If you were to extrapolate the existing graph, it would likely show MyISAM pulling ahead for large recordsets with fulltext searches, but otherwise it indicates that InnoDB will scale better.
If your keys are correct, make sure you check your datatypes. If a query value doesn't match the column's datatype and MySQL isn't running in strict mode, it has to cast on every comparison. If the column expects a number and you give it a string, it will cast each field to compare it against your string. That becomes noticeably slower as the recordsets grow.
The issue I'm facing is mainly on inserts. What I've got is basically an online voting system: I have a registration table matching a user to a ballot. I run the student government elections for a number of colleges, one of which has over 30k students on each ballot. So over the last few years, with a few ballots a year, I've gotten to over 1.5mil rows. I had actually assumed this wasn't much, especially since I only have 3 columns: ballotID (MEDIUMINT(4)), voterID (MEDIUMINT(7)), and voted (TINYINT(1)). Obviously not much being stored.
When I went to add to the voters list this year, I noticed a change. In the past I could usually put in sets of 5k voters at once (more than that and the system timed out, a fault of the shared host I'm on), averaging about 300 voters a second; this year, I could barely process 20 voters per second. I checked with the host and the server load was no greater than in the past. When I deleted all registrants from before 2010, the insert speed shot back up to about 250 voters per second. But as the list grew again, it slowed back to a crawl.
So first, I know one big limit for me is my host. I should get at least a VPS, I think, but I'm trying to see what I can do until I can afford that.
I also make sure to properly cast my data, and as for the index on the registrants table, I use the combination of ballotID and voterID as the key (after all, a user can only be registered for a ballot once).
I have a second table for voter tags, which is basically just ballotID (MEDIUMINT(4)), voterID (MEDIUMINT(7)), tag (VARCHAR(30)). The key on that table is all 3 columns. It's at about 2.2mil records and is as slow as the registration table. Both tables have data fed into them during the process described above, and both are included in the insert speeds I mentioned.
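For anyone who wants to poke at this, here's a minimal sketch of the two tables as described (sqlite3 used as a stand-in for MySQL, so MEDIUMINT/TINYINT become plain INTEGER; the table names are my guesses):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Registration: one row per (ballot, voter) pair; composite primary key
# since a user can only be registered for a ballot once.
con.execute("""CREATE TABLE registration (
    ballotID INTEGER,
    voterID  INTEGER,
    voted    INTEGER,
    PRIMARY KEY (ballotID, voterID))""")

# Voter tags: the key spans all three columns, as described above.
con.execute("""CREATE TABLE voter_tags (
    ballotID INTEGER,
    voterID  INTEGER,
    tag      TEXT,
    PRIMARY KEY (ballotID, voterID, tag))""")
```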
How are you doing the inserts? You say you are doing 5k at once. Are your votes saved in a file or something? In that case, use LOAD DATA INFILE and load the entire thing at once.
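Even if LOAD DATA INFILE isn't available on the shared host, batching the rows through the client already beats one INSERT per voter. A rough sketch of the pattern (sqlite3 with made-up sample rows here; the same idea applies with MySQL's multi-row INSERT):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE registration (ballotID INTEGER, voterID INTEGER, voted INTEGER)")

# Stand-in for the CSV file: ballotID, voterID, voted (invented sample data).
csv_file = io.StringIO("1,1001,0\n1,1002,0\n1,1003,0\n")
rows = [(int(b), int(v), int(f)) for b, v, f in csv.reader(csv_file)]

# One prepared statement executed over the whole batch, inside one
# transaction, instead of a separate round trip per voter.
with con:
    con.executemany(
        "INSERT INTO registration (ballotID, voterID, voted) VALUES (?, ?, ?)",
        rows,
    )
```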
The method I'm using is pretty inefficient I think...
I have a CSV with usernames and tags. The script loops through it, checks whether the user already exists, and inserts them if not. It then loops through the tags and inserts those.
I couldn't really think of any other way to do it, as everything is associated with the voterID rather than the username.
Like I mentioned, I didn't really notice an issue with it until I got to ~1mil, and it sped up significantly when I removed ~250k registrants (only to slow back down).
Looping through and comparing row by row is a tremendous overhead for this, and it grows with the size of the file. To me, it sounds like you may also have a comparison type mismatch during the existence check, so verify that the datatype of the compared value matches the field's datatype.
For a batch insert like this, I'd suggest abandoning the check completely. Use INSERT IGNORE, or the ON DUPLICATE KEY syntax if the new record might differ and require an update instead. There's no need to see if a row already exists: if it does, the insert is skipped and the batch keeps going. If you are using a language like PHP, there will be an affected-rows method or value that tells you how many records were actually inserted (if that's of any value to you). The MySQL client reports the same thing: the total number of records and how many duplicates it encountered.
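A tiny sketch of the IGNORE approach plus the affected-rows count (sqlite3 here, where the spelling is INSERT OR IGNORE; in MySQL it's INSERT IGNORE, and PHP's mysqli exposes the count as $mysqli->affected_rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE registration (
    ballotID INTEGER, voterID INTEGER, voted INTEGER DEFAULT 0,
    PRIMARY KEY (ballotID, voterID))""")
con.execute("INSERT INTO registration (ballotID, voterID) VALUES (1, 1001)")  # already registered

# No existence check: the duplicate simply doesn't insert, the batch keeps going.
batch = [(1, 1001), (1, 1002), (1, 1003)]
cur = con.executemany(
    "INSERT OR IGNORE INTO registration (ballotID, voterID) VALUES (?, ?)", batch)
print(cur.rowcount)  # 2 -- only the non-duplicates count as inserted
con.commit()
```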
I checked if the user existed because I need the ID returned. When I insert, I use the last insert ID to insert the registration row, and when the user already exists, I fetch the voterID they already have. Is there some other way to do it?
And I definitely get that the loop adds a lot to the script; but given that I have to get each user's existing ID or their newly created one, how can I achieve this without a loop? Or with a more barebones loop?
In this case I don't know if there is an easy way around it. If you are using surrogate keys, then you need to check with the DB first, either by manual iteration and checking, or with an INSERT ... SELECT statement.
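One way to keep the surrogate keys without a per-row SELECT is two passes: batch-insert every username with the IGNORE trick, then pull all the IDs back in a single query. A sketch of that idea (sqlite3, with an invented voters table, so treat it as pseudocode for the MySQL version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE voters (voterID INTEGER PRIMARY KEY, username TEXT UNIQUE)")
con.execute("INSERT INTO voters (username) VALUES ('alice')")  # a pre-existing voter

usernames = ["alice", "bob", "carol"]

# Pass 1: one batch insert; usernames that already exist are skipped
# by the UNIQUE key instead of being checked one by one.
con.executemany(
    "INSERT OR IGNORE INTO voters (username) VALUES (?)",
    [(u,) for u in usernames])

# Pass 2: one SELECT maps every username (old and new) to its ID.
placeholders = ",".join("?" * len(usernames))
ids = dict(con.execute(
    f"SELECT username, voterID FROM voters WHERE username IN ({placeholders})",
    usernames))
```

From there the registration and tag rows can be built in memory from the `ids` map and batch-inserted, so the only per-row work left is a dictionary lookup.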
Something still reeks, though: given that the slowdown grows with the record count the way you're seeing, it still smells of either a datatype issue or an incorrect key setup (such as a composite key with the columns in the wrong order).
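On the composite-key order point: the index only helps when the query filters on a leftmost prefix of the key columns. A quick way to see it (sqlite3's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN; the leftmost-prefix rule is the same in both):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE registration (
    ballotID INTEGER, voterID INTEGER, voted INTEGER,
    PRIMARY KEY (ballotID, voterID))""")

# Filtering on the leading key column can use the composite index...
good = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM registration WHERE ballotID = 7").fetchone()
# ...but filtering only on the trailing column forces a full table scan.
bad = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM registration WHERE voterID = 12345").fetchone()
print(good[3])  # SEARCH ... USING INDEX ...
print(bad[3])   # SCAN ...
```

So if most queries filter by voterID alone, a key that starts with ballotID won't be used for them, and an extra index leading with voterID would be worth testing.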