how best to parse large amounts of text?

08-17-2005, 11:40 PM
Hello everyone, I was wondering if anyone could give me some tips on how to parse large amounts of text. I play a strategy game through email: the turn report is sent to me, I write out my orders and send them back, and so on. I normally just do this in Notepad, but I figured I could write a very basic client in JavaScript. There are clients already available, but I would like to write my own for three reasons: the challenge, I don't really like any of the available clients, and I can't install anything at work, where I do most of it on my breaks.

What I have in mind is a text box that I could paste my turn report into, hit a button, have it parsed and then display it in an easier-to-read fashion. I'm OK with creating the nicer display and everything; I'm just trying to find an easier way to parse the text. My current report is over 1500 lines long and getting longer each turn. Here is a small snippet of my report so you can see what I'm working with:

Faction Status:
Tax Regions: 4 (24)
Trade Regions: 6 (10)
Mages: 2 (2)

Errors during turn:
Dalesor Reavers (32264): MOVE: Unit has insufficient movement
points; remaining moves queued.

Events during turn:
Joss (377): Claims $100.
Mernic (1345): Claims $100.
Guards (6394): Gives 80 silver [SILV] to Fighters (6521).

That's just a small portion of the type of stuff I would be dealing with. I'm guessing it would be easiest to use indexOf() and split(), but I'm a little lost as to how to grab all the correct info. For example, under "Faction Status" there are only those three things; the only thing that would change is the numbers. The "Errors during turn" and "Events during turn" sections change constantly, so how could I make sure to grab all of the info each time, and make sure that's all I'm grabbing?

I'm sorry if this isn't making much sense; basically I just need to know the best way to parse large amounts of text. The book I have doesn't really cover it and I couldn't find anything too useful through a Google search. If anyone has any ideas I would really appreciate it.


08-18-2005, 12:16 AM
If you were to do this using JavaScript, a regular expression would be the best way. I would certainly suggest loading the text to be formatted into a textarea, rather than loading the file into memory using an ActiveX control or equivalent.
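
For a report like the snippet in the first post, here is a rough sketch of how the splitting could go with split() and a couple of regular expressions. The names parseReport and parseUnitEntry are made up for illustration. The sketch assumes that sections are separated by blank lines, that each section starts with a header line ending in a colon, and that a line with no "label: " prefix is a wrapped continuation of the entry above it. If the real report breaks any of those assumptions, the patterns would need adjusting.

```javascript
// Hypothetical helper: split a pasted turn report into named sections.
// Assumption: sections are separated by blank lines and begin with a
// header line that ends in ":" (for example "Events during turn:").
function parseReport(text) {
  var sections = {};
  var blocks = text.replace(/\r\n?/g, "\n").split(/\n\s*\n/);
  for (var i = 0; i < blocks.length; i++) {
    var lines = blocks[i].trim().split("\n");
    var header = /^([^:]+):\s*$/.exec(lines[0]);
    if (!header) continue; // not a block this sketch recognises
    var entries = [];
    for (var j = 1; j < lines.length; j++) {
      var line = lines[j].trim();
      if (line === "") continue;
      // Assumption: a new entry starts with "something: "; any other
      // line is treated as a wrapped continuation of the previous entry.
      if (/^[^:]+:\s/.test(line) || entries.length === 0) {
        entries.push(line);
      } else {
        entries[entries.length - 1] += " " + line;
      }
    }
    sections[header[1]] = entries;
  }
  return sections;
}

// Hypothetical helper: pull the unit name, number and message out of an
// entry shaped like "Joss (377): Claims $100." Returns null otherwise.
function parseUnitEntry(entry) {
  var m = /^(.+?) \((\d+)\): (.*)$/.exec(entry);
  return m ? { name: m[1], id: Number(m[2]), message: m[3] } : null;
}

// Sample text taken from the snippet in the first post.
var sample = [
  "Faction Status:",
  "Tax Regions: 4 (24)",
  "Trade Regions: 6 (10)",
  "Mages: 2 (2)",
  "",
  "Errors during turn:",
  "Dalesor Reavers (32264): MOVE: Unit has insufficient movement",
  "points; remaining moves queued.",
  "",
  "Events during turn:",
  "Joss (377): Claims $100.",
  "Mernic (1345): Claims $100.",
  "Guards (6394): Gives 80 silver [SILV] to Fighters (6521)."
].join("\n");

var report = parseReport(sample);
```

Because each section ends at the next blank line, the "Errors" and "Events" lists can grow or shrink from turn to turn without the code needing to know how many lines to expect. In the page itself, the sample string would be replaced by the value of the textarea.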

It depends on how large the files are. If they are less than 100k, then use JavaScript without a thought. For larger files (MB and above), I would recommend Perl or C, as these languages will manipulate the files more efficiently at a low level.

You can do a rough speed test by running a regular expression such as the one below on some text. Paste HTML page sources of varying sizes into the first textarea to find out how much text can be processed efficiently:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Untitled Document</title>
<script type="text/javascript">
// Strip the tags out of objIn's text and write the result to objOut.
function cleanHTML(objIn, objOut) {
  objOut.value = objIn.value.replace(/<[^>]*>/g, function tagMatch(s) {
    if(s.indexOf("<img") > -1) { // leave images alone!
      return s;
    } else if(s.indexOf("</") === 0) { // add a newline after closing tags
      return "\n";
    } else {
      return ""; // clean out all opening tags
    }
  });
}
</script>
</head>
<body>
<form name="frm" onsubmit="cleanHTML(this.txtInput, this.txtOutput);return false;">
<textarea name="txtInput" cols="100" rows="10">
<p>Hello world</p>
<h3>line 2!</h3>
Line 3!
<img src="img.gif">
</textarea>
<textarea name="txtOutput" cols="100" rows="10"></textarea>
<input type="submit" name="submit" value="submit">
</form>
</body>
</html>
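
If pasting pages by hand gets tedious, the same kind of test can be timed in script. This is only a sketch: stripTags, makeSample and timeStrip are made-up names, the tag stripping is a simplified variant of the idea in the page above, and the timings will differ a lot between browsers and machines, so treat the numbers as a rough guide rather than a benchmark.

```javascript
// Hypothetical helper: tag stripping as a plain function, so it can be
// timed without a form. Images are kept, closing tags become newlines,
// and every other tag is removed.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, function (tag) {
    if (tag.indexOf("<img") === 0) return tag;
    if (tag.indexOf("</") === 0) return "\n";
    return "";
  });
}

// Hypothetical helper: build a test string of at least `size` characters
// by repeating a small piece of HTML.
function makeSample(size) {
  var chunk = "<p>Hello world</p>\n<h3>line 2!</h3>\nLine 3!\n<img src=\"img.gif\">\n";
  var parts = [];
  for (var n = 0; n < size; n += chunk.length) {
    parts.push(chunk);
  }
  return parts.join("");
}

// Time one pass of stripTags over a generated sample of the given size.
function timeStrip(size) {
  var input = makeSample(size);
  var start = new Date().getTime();
  var output = stripTags(input);
  var elapsed = new Date().getTime() - start;
  return { inputLength: input.length, outputLength: output.length, ms: elapsed };
}

// Try roughly 10k, 100k and 1MB of text.
var sizes = [10 * 1024, 100 * 1024, 1024 * 1024];
for (var i = 0; i < sizes.length; i++) {
  var result = timeStrip(sizes[i]);
  console.log(result.inputLength + " characters: " + result.ms + " ms");
}
```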

Hope this helps


08-18-2005, 12:40 AM
Thank you for the reply, I'll play around with what you suggested and see what I can come up with.