Better method to return number of lines in file?


sage45
03-09-2005, 11:37 PM
In C++ the method that I currently use to count the number of lines in a file is:

infile.open("whatever.txt");
char fileChar;
lineCount = 0;

infile.get(fileChar);
while (infile)
{
    while(fileChar != '\n')
    {
        infile.get(fileChar);
    }
    lineCount++;
    infile.get(fileChar);
}

What I would like to know is if there is a better method?

I am using lineCount to set up a dynamic array. I do not want to size the array from a constant, because file sizes change and, as everyone knows, why use extra memory when you don't need to? The flip side is: what if I specify a constant that is too small and the file's data runs past the end of the array?

The only problem I can really see with the code as is, is that the bigger your files are/get the longer it takes to parse through them to find the number of lines to read in.

-sage-

Unit
03-10-2005, 01:59 AM
What are you trying to do with the file?

If you are parsing it line by line, why not read one line at a time?

If you are loading the entire file into memory, you can seek to the end of file to figure out the size of the file reliably without scanning the entire file char by char.

If you are loading the entire file into memory to access line by line, you can read one line at a time with sufficient buffer and allocate memory for it after reading it. The buffer can then be reused for the next line.

If you are loading a few lines at a time, reuse the same buffer as a circular buffer.


infile.get(fileChar);
while (infile)
{
    // no need for another while since all we are doing is counting '\n's
    // This will give wrong answer if the last line does not have a '\n'
    if (fileChar == '\n')
        lineCount++;
    infile.get(fileChar);
}

sage45
03-10-2005, 03:07 AM
As I stated, I need the number of lines in the file to set up a dynamic array.

I was just seeing if there was:

a.) A more efficient method

b.) A safer more stable method

As you pointed out, if the last line does not have a '\n', then all is for naught.

As for the array, I am planning on setting the two array upper bounds to equal (numberOfLines - 1). After that it's just a matter of reading the string/char information into one array and the numeric values into the other. From there I can do all the calculations and such. ;)

-sage-

Unit
03-10-2005, 03:27 AM
I was asking for more context because problems like these can be worked around by doing the counting when you actually read the file for processing. Doing so avoids the need to scan the file twice.

If you are careful with how you read the file (the exact method depends on what you are doing with its contents), you can get away with only a little more memory overhead than is required to have the entire file in memory.

There is no magic that can figure out the number of '\n's in a file other than counting them all :)


infile.get(fileChar);
while (infile)
{
    if (fileChar == '\n')
        lineCount++;
    infile.get(fileChar);
}
// add the last line to line count if the file does not end with '\n'
if(fileChar != '\n')
    lineCount++;


An interesting problem :thumbsup:

aman
03-10-2005, 12:49 PM
You can try it with getline


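#include <fstream>
using namespace std;

int main()
{
    int linecount= 0;
    char buffer[1024];
    ifstream infile("test.txt");

    // Note: the final getline that hits end-of-file is counted too, so a file
    // ending in '\n' reports one more than its number of text lines.
    while( infile.good() )
    {
        infile.getline(buffer,sizeof(buffer));
        linecount++;
    }
    infile.close();

    return 0;
}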



Also, don't use while( infile ).
Always use while( infile.good() ) because it checks the stream state. You can always test for conditions yourself (rdstate()), but this is the easiest as it checks failbit, badbit, and eofbit for you and fails if any of those bits are set.

sage45
03-10-2005, 06:19 PM
I was asking for more context because problems like these can be worked around by doing the counting when you actually read the file for processing. Doing so avoids the need to scan the file twice.

If you are careful with how you read the file (the exact method depends on what you are doing with its contents), you can get away with only a little more memory overhead than is required to have the entire file in memory.

There is no magic that can figure out the number of '\n's in a file other than counting them all :)

An interesting problem :thumbsup:

And counting while reading through the file the first time would be a preferable method. As you stated, it takes away the need for multiple passes through the file.

I was toying around with reading a line into the array and adding a line count that way. The problem with that is that I must copy the array to a temporary array, destroy the array, create a new instance of the array from the current line count, copy the data from the temporary array and then read in the new data. This would of course loop while infile is true. To me, though, that method seems as if it has a higher margin for error or data loss/corruption. :D

Thanks for that suggestion aman. ;) getline does save on parse time.

-sage-

Unit
03-10-2005, 08:22 PM
Doesn't getline do the same thing behind the scenes? I doubt that it has better performance than scanning one letter at a time.

Also, what happens when you have a line longer than 1024 bytes? (ex: some of the texts from Project Gutenberg)

aman
03-11-2005, 12:43 AM
Doesn't getline do the same thing behind the scenes? I doubt that it has better performance than scanning one letter at a time.

Also, what happens when you have a line longer than 1024 bytes? (ex: some of the texts from Project Gutenberg)

I didn't suggest he should use it, just that he could try it. He wouldn't have to worry about any ending newline characters that way.

Of course there would be a problem if you used [1024] on a file that had lines longer than that, but you could use [1024*1024] if you wanted, that was simply an example.

sage45
03-11-2005, 01:44 AM
Well the getline method does work... It is a little more efficient (fewer steps to accomplish)... Both methods have their issues that can be tested for...

For example, using the character by character method, theoretically you will get a more exact count of the lines of valid information because in most txt files people will leave trailing empty lines... Just in case there is information on the next line (line break or page break perhaps) you can use the peek method to look at the next line until you find data, otherwise if you hit the eof, you can simply leave the file stream.

With the getline method, you will get a line count that is exact to the number of lines, but not all lines will have valid information. In this case again you could use a peek method to test for validity and add to the count if valid...

Now that I have initialized my dynamic arrays it's just a matter of writing the data from the files to the arrays...

I am testing different methods but haven't found one that works quite yet... Another thing I found quirky was that when I attempted to read from the file a second time, I could not read from it without specifying a second instance (under a different variable)... I even tried reopening the first instance to the ios::beg of the file and it wouldn't read... Any thoughts on this???

-sage-

aman
03-11-2005, 02:46 AM
You must be doing something wrong if you can't read a second time. You should be able to read the file as many times as you want. If you're trying to re-open the file, have you closed it first? And if you're just trying to read it again, you don't need to re-open it at all; simply reset the stream to the beginning of the file and you're set. Perhaps if you post the problem code we can find the problem.

It seems to me that you could do what you need to do the first time through.. use realloc as needed on your array, or even an array of basic_string objects to store the data, or a vector of strings which takes care of all the dynamic allocation for you...

Dunna
03-11-2005, 01:46 PM
You could fread() in 1024 bytes at a time (= 1 KB) until feof() reports the end of the file. There is a good example in the MSDN, I believe.
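
A sketch of how chunked reads could be used for the line count, assuming the C stdio API and a file opened in binary mode. The name countNewlinesChunked is made up here; the loop ends on fread's return value, which is 0 at end of file or on error.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: counts '\n' bytes, reading 1 KB at a time.
std::size_t countNewlinesChunked(std::FILE* fp)
{
    char chunk[1024];
    std::size_t newlines = 0;
    std::size_t got;
    while ((got = std::fread(chunk, 1, sizeof(chunk), fp)) > 0)
    {
        // only the first 'got' bytes of the buffer are valid on this pass
        for (std::size_t i = 0; i < got; ++i)
            if (chunk[i] == '\n')
                ++newlines;
    }
    return newlines;
}
```

This still looks at every byte, but it makes far fewer library calls than a character-by-character get().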

sage45
03-11-2005, 04:56 PM
Thanks Dunna, I'll check it out...

Here is a copy of the code aman... I would have sent it out last night but I was at work and didn't have my laptop...

void fileType::readIn()
{

// Declare variables
int stringColumns = 255, floatColumns = 10;
char buffer[1024], fileChar;
ifstream inFile, inFile2;

inFile.open(inputFileName.c_str());

if(!inFile)
{
cout << "Unable to open the file - " << inputFileName << ".\n"
<< "Quitting Program . . . ." << endl;
errorValue = FILE_NOT_FOUND;
}
else
{
arraySize = 0;
//inFile.get(fileChar); -- Old method
while (inFile.good())
{
inFile.getline(buffer,sizeof(buffer));
/* Part of old method
//while (fileChar !='\n')
//{
// inFile.get(fileChar);
//}
*/
arraySize++;
//inFile.get(fileChar); -- Old method
}
inFile.close();
nameArray = (string * *) malloc(sizeof(string *) * arraySize);
if (nameArray == NULL)
{
cout << "Not enough memory for row - " << arraySize << "\n" << endl;
}
else
{
for (int idx = 0; idx < arraySize; idx++)
{
nameArray[arraySize] = (string *) malloc(sizeof(string) * stringColumns);
if(nameArray[arraySize] == NULL)
{
cout << "Not enough memory for column - " << stringColumns << "\n"
<< "in row - " << arraySize << endl;
free(nameArray[arraySize]);
}
}
}

knownDataArray = (float * *) malloc(sizeof(float *) * arraySize);
if (scoreArray == NULL)
{
cout << "Not enough memory for row - " << arraySize << "\n" << endl;
}
else
{
for (int idx = 0; idx < arraySize; idx++)
{
knownDataArray[arraySize] = (float *) malloc(sizeof(float) * floatColumns);
if(knownDataArray[arraySize] == NULL)
{
cout << "Not enough memory for column - " << stringColumns << "\n"
<< "in row - " << arraySize << endl;
free(knownDataArray[arraySize]);
}
}
}

transArray = (char *) malloc(sizeof(char) * arraySize);
if(transArray == NULL)
{
cout << "Not enough memory for row - " << arraySize << "\n" << endl;
}
cout << "\n" << arraySize << "\n" << endl;
}

inFile2.open(inputFileName.c_str());
//inFile.seekg(0,ios::beg); -- Tried seeking the beginning of the original call for inFile.
// While not end of file inputFileName
while (inFile2.good())
{
for(int rowIdx = 0; rowIdx < arraySize; rowIdx++)
{
inFile2 >> lineNumber >> firstName >> lastName;
inFile2 >> dataEntry1 >> dataEntry2 >> dataEntry3;
inFile2 >> dataEntry4 >> dataEntry5 >> dataEntry6;
//nameArray[rowIdx][0] = firstName;
//nameArray[rowIdx][1] = lastName;
scoreArray[rowIdx][0] = dataEntry1;
scoreArray[rowIdx][1] = dataEntry2;
scoreArray[rowIdx][2] = dataEntry3;
scoreArray[rowIdx][3] = dataEntry4;
scoreArray[rowIdx][4] = dataEntry5;
scoreArray[rowIdx][5] = dataEntry6;
}
}
}

When I attempt to read into the array I get exception errors... I was able to accomplish a read in with something like this:

inFile2 >> lineNumber >> firstName >> lastName;
inFile2 >> dataEntry1 >> dataEntry2 >> dataEntry3;
inFile2 >> dataEntry4 >> dataEntry5 >> dataEntry6;
while (inFile2.good())
{
for(int rowIdx = 0; rowIdx < arraySize; rowIdx++)
{
//nameArray[rowIdx][0] = firstName;
//nameArray[rowIdx][1] = lastName;
scoreArray[rowIdx][0] = dataEntry1;
scoreArray[rowIdx][1] = dataEntry2;
scoreArray[rowIdx][2] = dataEntry3;
scoreArray[rowIdx][3] = dataEntry4;
scoreArray[rowIdx][4] = dataEntry5;
scoreArray[rowIdx][5] = dataEntry6;
}
}
}

But then I get the same thing written over and over again without changing to the next line of data in from the file... I also commented nameArray because it is a string type and I cannot get a string type to read into the array... So now I am looking into changing it to a char type and trying it that way... ;) Oh if you think that variables aren't being declared, they are. This is part of a class and most of my variables are in that class under the private subheading.

class fileType
{
public:
void readIn();
// PRE: None
// POST: Reads data from inputFileName

void readOut();
// PRE: None
// POST: Writes output file to outputFileName

private:
// Declare variables
int count, arraySize;
int formatOption, errorValue;
string inputFileName, outputFileName;
string firstName, lastName;
float lineNumber, dataEntry1;
float dataEntry2, dataEntry3;
float dataEntry4, dataEntry5;
float dataEntry6;
char typeOfTrans;

// Declare dynamic arrays
string **nameArray;
float **knownDataArray;
char *transArray;

};

-sage-

Unit
03-11-2005, 09:28 PM
What are the exception errors that you get? The second way seems wrong in the obvious way that you are not reading from the file in the loop at all.

sage45
03-11-2005, 10:58 PM
I'll recreate the exception and post it later...

As for the file read not being included in the loop, I was just ensuring that I could read in the data to the array, but as I stated when I include the read into the loop I start getting exceptions... :(

-sage-

aman
03-11-2005, 11:59 PM
I haven't had time to look through it carefully yet, but a few quick observations.. don't use if(infile) anywhere, use if(infile.good()) or if(infile.is_open()) when you want to test if the file was open. Also, after the first read through, eofbit is set on infile, so clear the error condition before you try and read from it again (even if you close and re-open the file).. use infile.clear() to reset the control state or you may get exceptions trying to read from the stream with a failed state set.

I'll look it over in a bit when I have some time. Do you want other suggestions for improving efficiency or general design implementation as well?

sage45
03-12-2005, 12:34 AM
I'll take anything I can get... Thanks for the help aman and unit...

-sage-

aman
03-12-2005, 12:49 AM
BTW, where are you allocating memory for scoreArray?


And do you need everything in different arrays? Can you use vectors?

Can you use one array of all data types? Like an array of a struct which contains all your different variable types?

Oh yea, is this homework?

sage45
03-12-2005, 02:03 AM
BTW, where are you allocating memory for scoreArray?

scoreArray is actually supposed to be knownDataArray, sorry forgot to change that... But even after making that change I still get the exceptions...

And do you need everything in different arrays? Can you use vectors?

In school we haven't covered vectors yet so I'm not trying to implement one of those (as of yet ;) )... I have been reading about them but I don't understand enough about them to say whether they would be beneficial or not... The reason I was thinking arrays is because I am familiar with those, at least from a VB perspective but not so much from a C++ perspective; I am learning though. The need for multiple arrays arises from the complexity of the data... Some of it will be multi-dimensional whereas others won't be... I don't want to waste memory I don't need...

Can you use one array of all data types? Like an array of a struct which contains all your different variable types?

I never considered using a struct... I guess it would work, you mean something like this:

class fileType
{
public:
void readIn();
// PRE: None
// POST: Reads data from inputFileName

void readOut();
// PRE: None
// POST: Writes output file to outputFileName

private:
// Declare variables
int count, arraySize;
int formatOption, errorValue;
string inputFileName, outputFileName;
string fileFirstName, fileLastName;
float fileLineNumber, fileDataEntry1;
float fileDataEntry2, fileDataEntry3;
float fileDataEntry4, fileDataEntry5;
float fileDataEntry6;
char typeOfTrans;

// Declare dynamic arrays
struct **informationArray {
int lineNumber;
string firstName, lastName;
float lineNumber, dataEntry1;
float dataEntry2, dataEntry3;
float dataEntry4, dataEntry5;
float dataEntry6;
};
//string **nameArray; -- Old method
//float **knownDataArray; -- Old method
char *transArray;
// Initialize struct
informationArray fileInformation;

};


void fileType::readIn()
{

// Declare variables
int stringColumns = 255, floatColumns = 10;
char buffer[1024], fileChar;
ifstream inFile;

inFile.open(inputFileName.c_str());

if(inFile.is_open())
{
cout << "Unable to open the file - " << inputFileName << ".\n"
<< "Quitting Program . . . ." << endl;
errorValue = FILE_NOT_FOUND;
}
else
{
arraySize = 0;
//inFile.get(fileChar); -- Old method
while (inFile.good())
{
inFile.getline(buffer,sizeof(buffer));
/* Part of old method
//while (fileChar !='\n')
//{
// inFile.get(fileChar);
//}
*/
arraySize++;
//inFile.get(fileChar); -- Old method
}
inFile.clear();
/*************************
Old method
nameArray = (string * *) malloc(sizeof(string *) * arraySize);
if (nameArray == NULL)
{
cout << "Not enough memory for row - " << arraySize << "\n" << endl;
}
else
{
for (int idx = 0; idx < arraySize; idx++)
{
nameArray[arraySize] = (string *) malloc(sizeof(string) * stringColumns);
if(nameArray[arraySize] == NULL)
{
cout << "Not enough memory for column - " << stringColumns << "\n"
<< "in row - " << arraySize << endl;
free(nameArray[arraySize]);
}
}
}
**********************/

fileInformation = (float * *) malloc(sizeof(float *) * arraySize);
if (scoreArray == NULL)
{
cout << "Not enough memory for row - " << arraySize << "\n" << endl;
}
else
{
for (int idx = 0; idx < arraySize; idx++)
{
fileInformation[arraySize] = (float *) malloc(sizeof(float) * floatColumns);
if(knownDataArray[arraySize] == NULL)
{
cout << "Not enough memory for column - " << stringColumns << "\n"
<< "in row - " << arraySize << endl;
free(knownDataArray[arraySize]);
}
}
}

transArray = (char *) malloc(sizeof(char) * arraySize);
if(transArray == NULL)
{
cout << "Not enough memory for row - " << arraySize << "\n" << endl;
}
cout << "\n" << arraySize << "\n" << endl;
}

// While not end of file inputFileName
while (inFile.good())
{
for(int rowIdx = 0; rowIdx < arraySize; rowIdx++)
{
inFile2 >> fileLineNumber >> fileFirstName >> fileLastName;
inFile2 >> fileDataEntry1 >> fileDataEntry2 >> fileDataEntry3;
inFile2 >> fileDataEntry4 >> fileDataEntry5 >> fileDataEntry6;
fileInformation.firstName[rowIdx][0] = firstName;
fileInformation.lastName[rowIdx][1] = lastName;
fileInformation.dataEntry1[rowIdx][0] = dataEntry1;
fileInformation.dataEntry1[rowIdx][1] = dataEntry2;
fileInformation.dataEntry1[rowIdx][2] = dataEntry3;
fileInformation.dataEntry1[rowIdx][3] = dataEntry4;
fileInformation.dataEntry1[rowIdx][4] = dataEntry5;
fileInformation.dataEntry1[rowIdx][5] = dataEntry6;
}
}
}

Oh yea, is this homework?

No, it's not homework... We haven't learned about arrays and classes yet... I am trying to teach myself because I have a project at work for setting up a utility to BIOS and DMI motherboards... The way that I am thinking about approaching it is to have a file that has information in the following format:

firstEntry,secondEntry,thirdEntry,fourthEntry,fifthEntry,sixthEntry,seventhEntry,biosfilename,dmifilename

Where:
firstEntry = Model by part number
secondEntry = Default serial number
thirdEntry = Product code
fourthEntry = Service Tag number
fifthEntry = Hardware product code
sixthEntry = Hardware MCode
seventhEntry = Compiled Software MCode
biosfilename and dmifilename are self explanatory...

The file itself would be either comma-separated or space-separated (I'm thinking of making it comma-separated, which makes it far easier to parse from the programming perspective)... Most of these entries would have to be of the string type as they are a numeric and non-numeric blend... For example the part number is usually T- or B- 99860021, where T is desktop and B is laptop (no hyphen of course... ;) )
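
One way the comma-separated variant could be split into string fields, as a sketch only. The name splitOnCommas is hypothetical, and the approach assumes fields never contain commas or quotes. An empty field in the middle of a line is kept as an empty string, but an empty field after a final comma is not reported.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one line of the file into its comma-separated fields.
std::vector<std::string> splitOnCommas(const std::string& line)
{
    std::vector<std::string> fields;
    std::istringstream row(line);
    std::string field;
    while (std::getline(row, field, ','))   // ',' is used as the delimiter here
        fields.push_back(field);
    return fields;
}
```

Each line read from the file with std::getline could be passed through this, giving fields[0] as the part number and so on.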

Now I'm not asking that you write this for me or anything like that... I just noticed a couple of problems/issues when I attempted to solve the problem myself... Perhaps I am not understanding completely how to read from a file to an array... So really I'm seeking guidance... If you see that I am doing something in an inefficient manner then I'll gladly take any advice you can offer...

-sage-

aman
03-12-2005, 04:46 AM
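OK, first off, try this... The way you are allocating memory to the arrays is wrong. Inside your loops you index with arraySize instead of the loop counter idx, so every pass writes one element past the end of the array. Multidimensional arrays are tricky, so you need to be careful.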

This will allocate memory for array[arraysize][columnsize]

knownDataArray = (float**)malloc(sizeof(float*) * arraySize);
for( int j=0; j<arraySize; j++)
{
    knownDataArray[j] = (float*)malloc(sizeof(float) * floatColumns);
}


Since you are using C++, you should use the new operator instead

knownDataArray = new float *[arraySize];
for( int i=0; i<arraySize; i++)
{
    knownDataArray[i]= new float[floatColumns];
}


That should solve your crash problems with the original program, don't forget to free/delete the memory when you are done.
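
For the cleanup side, a sketch of how the new[] allocation and its matching delete[] calls could be paired up. The helper names allocateTable and freeTable are invented for illustration, and the sketch does not try to recover if an allocation throws partway through.

```cpp
#include <cstddef>

// Hypothetical helper: allocates rows x columns floats, all set to zero.
float** allocateTable(std::size_t rows, std::size_t columns)
{
    float** table = new float*[rows];
    for (std::size_t r = 0; r < rows; ++r)
        table[r] = new float[columns]();   // () zero-initializes each row
    return table;
}

// Hypothetical helper: releases everything allocateTable handed out.
void freeTable(float** table, std::size_t rows)
{
    for (std::size_t r = 0; r < rows; ++r)
        delete[] table[r];                 // each new[] needs a matching delete[]
    delete[] table;                        // finally the array of row pointers
}
```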

sage45
03-12-2005, 10:24 PM
I actually was up till about 4 this morning and figured out the first part... After I got that licked, I tried using the new operator but I couldn't get it to initialize... :( I'll try again... :)

-sage-

aman
03-12-2005, 11:42 PM
I rewrote this a bit so you can see one way of allocating memory and saving the data in an array of structs all in one pass through the file, it eliminates a whole bunch of other stuff. Hopefully it will help.


class fileType
{
public:
    void readIn();
    // PRE: None
    // POST: Reads data from inputFileName

    void readOut();
    // PRE: None
    // POST: Writes output file to outputFileName

    void FreeData();

private:
    // Declare variables
    int count, arraySize;
    int formatOption, errorValue;

    string inputFileName, outputFileName;
    char typeOfTrans;

    // Declare dynamic arrays
    struct fileData {
        int lineNumber;
        char firstName[32], lastName[32];
        float dataEntry[6];
    } **fData;
};


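void fileType::readIn()
{
    fData=NULL;
    arraySize=0;
    ifstream infile("test.txt");

    if( !infile.is_open() )
        return;

    while( infile.good() )
    {
        // Read into a temporary first so a failed read never becomes a record
        fileData rec;
        float *entry = rec.dataEntry;

        // setw (from <iomanip>) stops >> from overflowing the 32-byte name buffers
        infile >> rec.lineNumber >>
            setw(sizeof(rec.firstName)) >> rec.firstName >>
            setw(sizeof(rec.lastName)) >> rec.lastName >>
            entry[0] >> entry[1] >> entry[2] >> entry[3] >> entry[4] >> entry[5];

        if( infile.fail() )
            break;    // end of file, or a malformed line

        fileData **grown = (fileData**)realloc(fData, (arraySize+1)*sizeof(fileData*));
        if( grown == NULL )
            break;    // out of memory; fData still holds the earlier records
        fData = grown;

        fData[arraySize] = (fileData*)malloc(sizeof(fileData));
        if( fData[arraySize] == NULL )
            break;
        *fData[arraySize] = rec;

        arraySize++;
    }
    infile.close();
}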


void fileType::FreeData()
{
    for(int i=0; i< arraySize; i++)
    {
        free(fData[i]);
    }
    free(fData);

    fData=NULL;
    arraySize=0;
}

Dunna
03-15-2005, 03:36 AM
Sorry if this isn't what you want (I haven't the time to read such a lengthy post!), but here's some code that works that I've been using (note: it's from memory so forgive the mistakes)

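FILE *fp = fopen("somefile.file","rb");
long count = 0;             // total number of bytes read so far
char arrayfordata[1024];    // 1 KB read buffer
size_t got;
if ( fp != NULL )
{
    // fread returns 0 at end of file or on error, which ends the loop
    while ( (got = fread(arrayfordata, 1, sizeof(arrayfordata), fp)) > 0 )
    {
        count += (long)got;
    }
    fclose(fp);
}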


Hope that helps!