View Full Version : perl averaging help needed..
girldrifter
07-14-2004, 04:29 PM
hope no one minds me asking for help, but I've been stumped on this program for awhile and didn't know where else to ask. I'm a novice perl user, and am trying to code a program that will open numerous files in a directory, read and average each line, repeat until all lines are averaged, and then give an output file with the final average values. The files are time trajectory files of a protein position at a specific distance. So far I have been able to read the data files, split them into their respective time and distance columns, and then average *down* the columns-- not average across the distance lines of each file.
A sample output file would look like this:
1 10.987
2 11.043
3 11.578
...These files actually have 2000 frames in them, so opening them all up at once to read each line, might be too memory hogging. I was hoping there would be a way to do a foreach file loop and just open, read and grab the line, and close. To effectively/easily do this, I also need a way for it to ignore/remove the time frame column and just work with the distance numbers.
The code I've been working on looks like:
#!/usr/bin/perl
my $outputdir = "/theoryfs/rh/home/jo/edv-npy";
for(my $i=47; $i<=48; $i+=1,$i++) {
my @outputfile = `ls $outputdir/t1-run$i.dat`;
open (TRAJECTORY, "< @outputfile") || die "Can't open file: $!\n";
print `pwd`; #just a test to be sure the files opened
@matrix = <TRAJECTORY>;
print "@matrix\n";
print $#matrix;
for($i=0; $i<$#matrix; $i++) {
@tmp = split(" ",$matrix[$i]);
print "$tmp[0]\t$tmp[1]\n";
}
$distancetotal += $tmp[1];
}
print "$distancetotal\n";
$average_value = $distancetotal / $#matrix;
print "$average_value\n";
foreach $line (@matrix){
print "$line";
}
exit;
As I said, I'm a super novice at all of this, and any help is greatly appreciated. I've been playing around with it so much I don't even really know where to go with it anymore.
MattJakel
07-14-2004, 07:50 PM
So basically you want to average each line and then average all of the lines in the file together? Could you explain a little more thoroughly?
Matt
girldrifter
07-14-2004, 08:01 PM
What I want the script to do is open a series of these input files and average all of the distance files. The files will always have the same number of data points in them, and they'll always be set up the same. Since the time step is always the same, I want to ignore that when averaging. So I just want to average across multiple time steps, so that in the end I end up with an average position for each time step over all the files.
Does that make sense or should I explain more?
MattJakel
07-14-2004, 08:06 PM
I still don't really understand... Could you post part of an example input file and say which you would like to be averaged and which would be left out?
girldrifter
07-14-2004, 09:30 PM
ok I'm going to try and explain more....
As of right now this file works, however what it's doing is reading *down* the columns and averaging all of the numbers in each column per file. What I actually need it to do is read across and average each row/line in all files. what i need to end up with is an average distance per time step over all trajectories.
breaking down the present code:
The first line designates where the files I'm averaging are located. Then the for loop opens each file. I then print the contents of the file, and the number of entries per file. The second for loop splits the input files into two separate columns [tmp[0] which is the timestep, and tmp[1] which is the distance. Then I haev the total distance to divide by as the summation of tmp[1]-- obviously wrong because i need it across, not down. But the problem is figuring out how to split it that way...
here is a sample of one of the input files....it's normally 2000 frames long so I am truncating it. The left column is the time step, and the right column is the distance. It's a protein folding trajectory, btw.
t1-run47.dat:
1 30.558140
2 30.891916
3 30.764661
4 32.328254
5 33.344843
6 33.152984
7 32.676488
8 32.519974
9 31.059809
10 31.134896
11 30.546448
12 31.759029
13 32.720393
14 32.482184
15 32.878333
16 33.913783
17 31.931779
18 32.257604
19 32.139663
20 32.322234
21 32.620813
22 33.810830
23 32.859295
24 31.469885
Now every other file is identical in this, except the second column numbers vary slightly. So I want to pick up and average the numers for each time step over all the files and average them. hopefully getting one file with one set of points-- timestep and average distance.
hope this is a bit clearer...if not let me know! i appreciate the help so much!
MattJakel
07-14-2004, 10:07 PM
Ok, that makes sense now. Thanks for bearing with me. ;) Well, what you should do is loop through the files, reading them one line at a time. So something like this:
#!/usr/bin/perl
my $outputdir = "/theoryfs/rh/home/jo/edv-npy";
@outputfile[0] = 'ls $outputdir/t1-run$1.dat'; ### or however your files are organized...
@outputfile[1] = 'ls $outputdir/t1-run$2.dat'; #if there are alot of them just use a for loop to assign them to @output file
@outputfile[2] = 'ls $outputdir/t1-run$3.dat';
open (TRAA, "< @outputfile[0]") || die "Can't open file: $!\n";
open (TRAB, "< @outputfile[1]") || die "Can't open file: $!\n";
open (TRAC, "< @outputfile[2]") || die "Can't open file: $!\n";
print 'pwd';
for ($i=1;$i<=2000;$i++) {
$filea = <TRAA>;
$fileb = <TRAB>;
$filec = <TRAC>;
@matrix = ($filea,$fileb,$filec);
print "@matrix\n";
print $#matrix;
for($i=0; $i<$#matrix; $i++) {
@tmp = split(" ",$matrix[$i]);
$distancetotal += $tmp[1];
}
$average_value = $distancetotal / $#matrix;
print "$tmp[0] $average_value\n";
}
foreach $line (@matrix){
print "$line";
}
close(TRAA);
close(TRAB);
close(TRAC);
exit;
This is completely untested and I'm not 100% sure this is what you want, but it can always be edited if necessary.
Hope this helps! :thumbsup:
Matt
girldrifter
07-15-2004, 09:36 PM
Ah, I think I've got something a little different that works now. I spent last night working on it, and for some reason it seems to be just returning zeros for the average value. If I can't figure it out, then I'm going to give your script a shot. Maybe you can see something with this one though....
#!/usr/bin/perl
my $datadir = "/theoryfs/rh/home/jo/edv-npy";
for (my $i=47; $i<=48; $i++) {
my @files = `ls $datadir/t1-run$i.dat`;
chomp(@files);
my @sums = ();
my $file;
# go through each file, adding its members to the
# previously summed files
foreach $file (@files)
{
open(CURFILE, "<$file");
my $idx = 0;
while(<CURFILE>)
{
(my $garbage, my $new_value) = split(/ /, "$_");
chomp($new_value);
# i might need to play with the split delimiter a bit..
# split(/DELIMITER/, STRING);
(my $garbage, my $new_value) = split(/ /, "$_");
chomp( $new_value );
$sums[$idx++] += $new_value;
}
$idx = 0;
}
my @averages = ();
# since $#files gives us the highest index, and not the number of
# members in @files, we need to increment prior to assignment
my $number_of_values = ++$#files;
# loop through each and every index, computing the average, and storing
# it into @averages
for (my $idx = 0; $idx <= $#sums; $idx++)
{
$averages[$idx] = $sums[$idx] / $number_of_values;
}
# print out the results, modify this as needed to print to an output file
# other than stdout if needed.
my $average;
foreach $average (@averages)
{
print("$average\n");
}
}
exit;
MattJakel
07-19-2004, 04:06 AM
Well, firstly, don't you think you ought to be dividing by 3 instead of the total number of values? Because there are three values per "row" and you want to average them, right? I have the suspicion that your values are getting devided by 2000 and that's making the numbers so small that they are simply truncated and rounded to 0. It might be something else, but try that first. :thumbsup:
Matt
girldrifter
07-22-2004, 05:23 PM
Well I think that it needs to be averaged by the total number of values-- because that's what an average value is, right? It needs to be divided by the total number of files so you can get the average value over all trajectories...
or at least that's how i see it :confused:
vBulletin® v3.8.2, Copyright ©2000-2012, Jelsoft Enterprises Ltd.