find duplicate lines in multiple files


dsylebee
07-20-2010, 08:35 PM
Hi, I'm trying to figure out how this could work. I have:

Files A, B, C, and D.
Each has 1,000,000+ lines.
I need code that finds duplicate lines within or across these files and writes each match to an output file in this form:

FOUND DUPLICATE LINE >> filename1:line - filename2:line
it can also be:
FOUND DUPLICATE LINE >> filename1:line - filename1:line

Thanks in advance. I've tried a few methods, but they are very slow.
We are aiming for a procedure that takes 40 minutes at most.

C, VB, C++, C#, or VB.NET (any of these, please). VB would be best, since I could keep everything in the same application rather than having two applications.
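
A minimal sketch of one common approach to the task described above: read every file once and keep a hash table of the lines already seen. It is written in Python only for illustration; the same idea maps to a Dictionary in VB.NET or C#, or to std::unordered_map in C++. The function name `find_duplicates`, the UTF-8 encoding, and the choice to report each repeat against the first occurrence of the line are all assumptions, not something stated in the thread. It also assumes the distinct lines fit in memory.

```python
import os


def find_duplicates(paths, report_path):
    """Scan each file once and report every line whose text was seen before.

    Returns the number of duplicate lines written to report_path.
    """
    # Maps line text -> (file name, line number) of its first occurrence.
    first_seen = {}
    count = 0
    with open(report_path, "w", encoding="utf-8") as report:
        for path in paths:
            name = os.path.basename(path)
            # Encoding is an assumption; undecodable bytes are replaced.
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    text = line.rstrip("\r\n")
                    if text in first_seen:
                        prev_name, prev_number = first_seen[text]
                        report.write(
                            f"FOUND DUPLICATE LINE >> {prev_name}:{prev_number}"
                            f" - {name}:{number}\n"
                        )
                        count += 1
                    else:
                        first_seen[text] = (name, number)
    return count
```

Each line costs one dictionary lookup, so the run time grows roughly linearly with the total number of lines rather than with the number of line pairs. Note that blank lines count as duplicates of each other in this sketch; skip them before the lookup if that is not wanted.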

Spookster
07-20-2010, 11:44 PM
Ok have to ask. If you are going to be working with something that produces 1 million plus lines of data in a file then why would you not store that data in a database instead of flat files?

dsylebee
07-21-2010, 01:50 PM
Spookster wrote: "Ok have to ask. If you are going to be working with something that produces 1 million plus lines of data in a file then why would you not store that data in a database instead of flat files?"

The reason is simple: the files are generated by a cash register, so it's not really an option. I did write some code that inserts the lines into an Access database, but the database overflowed and the import stopped.

I think that's because Access databases can only store 2 GB.

I would install SQL or MySQL, but the company refuses :o

Spookster
07-21-2010, 06:52 PM
It wouldn't matter what the files were generated by. It's not that difficult to write a basic utility to parse the file and insert the data into a database. It sounds like you have already started on something like that. If one file is going to make the database larger than 2 GB, then split the file into multiple smaller files and store them in multiple databases. Once you have the database(s), it's easy to use SQL to query one or more of them for duplicate records, or to run any other kind of query they need. MySQL would be better, or, if cost is not an issue, a more suitable database system like Oracle could handle that kind of data.

The route you are going now seems like you are trying to put a band-aid on a gushing wound.
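
A rough sketch of the database route suggested above: load each line as a row, then let SQL find the repeats. SQLite is used here purely as a stand-in because it ships with Python; the thread itself only mentions Access, MySQL, and Oracle. The table name `records`, its columns, and the helper names `load_lines` and `duplicate_pairs` are hypothetical.

```python
import sqlite3


def load_lines(conn, filename, lines):
    """Insert one row per line: (file name, 1-based line number, text)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "filename TEXT, line_no INTEGER, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?)",
        ((filename, number, text) for number, text in enumerate(lines, start=1)),
    )


def duplicate_pairs(conn):
    """Pair every repeated line with the first row that has the same text."""
    # An index on the text column keeps the grouping and join from scanning
    # the whole table for every row.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_content ON records (content)")
    query = """
        SELECT o.filename, o.line_no, d.filename, d.line_no
        FROM records AS d
        JOIN (
            SELECT content, MIN(rowid) AS first_id
            FROM records
            GROUP BY content
            HAVING COUNT(*) > 1
        ) AS g ON g.content = d.content
        JOIN records AS o ON o.rowid = g.first_id
        WHERE d.rowid <> g.first_id
        ORDER BY d.rowid
    """
    return conn.execute(query).fetchall()
```

The GROUP BY ... HAVING COUNT(*) > 1 subquery is the part that identifies duplicated text; the two joins only attach file names and line numbers so the result can be written out in the requested report form.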