Processing File
naqvia
03-05-2007, 07:27 PM
Hi,
I am trying to parse elements from a file. Here is a piece of the file:
AT4G09510 10.0004564 47.0062289
AT4G34860 10.0004564 55.0090220
AT5G22510 10.0004564
AT1G65800 10.0004675 916.0016301
AT1G66150 10.0004675
AT1G70250 10.0004675 504.0004713
AT2G13790 10.0004675 588.0004672
AT2G13800 10.0004675 588.0004672
AT3G45410 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G18250 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G21380 10.0004675 916.0016301
AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT5G46330 10.0004675 636.0004674 NULL Receptor kinase-like protein family
AT2G04032 10.0005385 18.0008324
The real file has about 10,000 lines like these. I want to get the following lines:
AT4G18250 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G21380 10.0004675 916.0016301
AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family
Basically, if there is a run of identical family names (e.g. NULL Receptor kinase-like protein family) with a gap in it, I want to get the gapped line plus the line above it and the line below it. Is there a way I can do this dynamically? Thanks!
FishMonger
03-05-2007, 08:56 PM
Here's a starting point, but in the actual script I'd probably use Tie::File instead of slurping the data into an array.
use warnings;
use strict;
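my @data = <DATA>;
# $i is the top line of a three-line window; the last window starts at $#data-2.
for my $i (0 .. $#data - 2) {
    # Capture the part of the family name that varies.
    my ($family) = $data[$i] =~ /NULL Receptor (.*?) protein family/;
    # Print the window when the bottom line names the same family
    # and the middle line names no family at all.
    if ($family && $data[$i+1] !~ /family/ && $data[$i+2] =~ /NULL Receptor \Q$family\E protein family/) {
        print @data[$i..$i+2];
    }
}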
__DATA__
AT4G09510 10.0004564 47.0062289
AT4G34860 10.0004564 55.0090220
AT5G22510 10.0004564
AT1G65800 10.0004675 916.0016301
AT1G66150 10.0004675
AT1G70250 10.0004675 504.0004713
AT2G13790 10.0004675 588.0004672
AT2G13800 10.0004675 588.0004672
AT3G45410 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G18250 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G21380 10.0004675 916.0016301
AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT5G46330 10.0004675 636.0004674 NULL Receptor kinase-like protein family
AT2G04032 10.0005385 18.0008324
naqvia
03-05-2007, 09:04 PM
Maybe I wasn't clear. By dynamic I meant fully dynamic, because other lines that I have to parse look quite different. For example:
AT2G34190 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT3G10960 13.0015646 253.0005215
AT4G38050 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT5G25420 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT5G49990 13.0015646 Organic Solute Cotransporters Permease 1-like family
I need to take care of these lines too. There are over 50 different family descriptions and I need to account for all of them.
FishMonger
03-05-2007, 10:26 PM
That's easily accomplished by making the proper adjustment to the regex.
use warnings;
use strict;
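my @data = <DATA>;
# $i is the top line of a three-line window; the last window starts at $#data-2.
for my $i (0 .. $#data - 2) {
    # Capture everything between the ID and the word "family".
    my ($family) = $data[$i] =~ /^\S+\s(.+?) family/;
    # Print the window when the bottom line carries the same text
    # and the middle line has no family description.
    if ($family && $data[$i+1] !~ / family/ && $data[$i+2] =~ /\Q$family\E/) {
        print @data[$i..$i+2];
    }
}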
__DATA__
AT4G09510 10.0004564 47.0062289
AT4G34860 10.0004564 55.0090220
AT5G22510 10.0004564
AT1G65800 10.0004675 916.0016301
AT1G66150 10.0004675
AT1G70250 10.0004675 504.0004713
AT2G13790 10.0004675 588.0004672
AT2G13800 10.0004675 588.0004672
AT3G45410 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G18250 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G21380 10.0004675 916.0016301
AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT5G46330 10.0004675 636.0004674 NULL Receptor kinase-like protein family
AT2G04032 10.0005385 18.0008324
AT2G34190 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT3G10960 13.0015646 253.0005215
AT4G38050 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT5G25420 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT5G49990 13.0015646 Organic Solute Cotransporters Permease 1-like family
Outputs:
AT4G18250 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G21380 10.0004675 916.0016301
AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT2G34190 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT3G10960 13.0015646 253.0005215
AT4G38050 13.0015646 Organic Solute Cotransporters Permease 1-like family
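For anyone following along step by step, here is a rough sketch of the same idea. It is not the script posted above: it swaps the slurped array for a plain read loop that keeps a three-line window (no Tie::File), and it compares only the descriptive text, so the numeric columns do not have to agree. The helper names family_of and gapped_triples are made up for this illustration, and it assumes a description is whatever is left after the ID and the digits.digits columns.

```perl
use strict;
use warnings;

# Hypothetical helper: return the description that follows the ID and the
# digits.digits columns, or undef when the line has none.
sub family_of {
    my ($line) = @_;
    my @fields = split ' ', $line;
    shift @fields;                                              # drop the ID
    shift @fields while @fields && $fields[0] =~ /^\d+\.\d+$/;  # drop numeric columns
    return @fields ? "@fields" : undef;
}

# Hypothetical helper: read a filehandle through a three-line window and
# collect every window whose outer lines share a family and whose middle
# line has none.
sub gapped_triples {
    my ($fh) = @_;
    my (@window, @hits);
    while (my $line = <$fh>) {
        push @window, $line;
        shift @window if @window > 3;       # keep only the last three lines
        next if @window < 3;
        my ($above, $gap, $below) = map { family_of($_) } @window;
        push @hits, @window
            if defined $above && !defined $gap
            && defined $below && $above eq $below;
    }
    return @hits;
}

# A few lines from the thread, read through an in-memory filehandle.
my $sample = <<'END';
AT3G45410 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G18250 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G21380 10.0004675 916.0016301
AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT5G46330 10.0004675 636.0004674 NULL Receptor kinase-like protein family
AT2G04032 10.0005385 18.0008324
END
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
print gapped_triples($fh);
close $fh;
```

On this sample it prints the AT4G18250, AT4G21380 and AT5G38280 lines. With a real file, open the file by name instead of the in-memory string; only three lines are held in memory at any time.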
naqvia
03-06-2007, 12:48 AM
Neat... this is advanced. Can you go over it step by step?
naqvia
03-06-2007, 08:55 PM
anyone?
naqvia
03-06-2007, 10:37 PM
Now I am trying to get the lines that have no family description. I am using this regular expression:
if ($data[$i+1]=~/(AT\dG\d{5}(\s\d{1,}\.\d{7}\s){1,})/){
But this only gets lines like AT5G22510 10.0004564 or AT1G66150 10.0004675 and NOT AT4G21380 10.0004675 916.0016301. Why? I have the {1,} in the regex...
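One likely explanation, sketched below: each repetition of the group (\s\d{1,}\.\d{7}\s) needs a whitespace character on both sides of the number. The two numbers are separated by a single space, and the first repetition already uses it as its trailing \s, so a second repetition cannot start. The pattern is also unanchored, so it still matches the front of the line and $1 stops after the first number. The adjusted pattern (separator only in front of each number, anchored at both ends) and the helper name captured are suggestions made up for this illustration, not something posted in the thread.

```perl
use strict;
use warnings;

# The pattern from the question, and a hypothetical adjustment.
my $original = qr/(AT\dG\d{5}(\s\d{1,}\.\d{7}\s){1,})/;
my $adjusted = qr/^(AT\dG\d{5}(?:\s\d+\.\d{7})+)\s*$/;

# Return what the first capture group holds, or undef when there is no match.
sub captured {
    my ($re, $line) = @_;
    return $line =~ $re ? $1 : undef;
}

for my $line ("AT1G66150 10.0004675\n", "AT4G21380 10.0004675 916.0016301\n") {
    for my $pair ([original => $original], [adjusted => $adjusted]) {
        my ($name, $re) = @$pair;
        my $got = captured($re, $line);
        $got = '(no match)' unless defined $got;
        $got =~ s/\n/\\n/g;                 # make a captured newline visible
        print "$name: '$got'\n";
    }
}
```

Because the adjusted pattern is anchored at the end, it also rejects lines that carry a family description, which the unanchored original lets through.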
FishMonger
03-06-2007, 11:17 PM
You can simplify the regex to this:
/^\w+ \S+( \d+\.\d+)?$/
I don't have time right now, but when I can I'll post an explanation of this regex and the prior solution I posted.
naqvia
03-06-2007, 11:28 PM
It's not working; nothing is printing now. I have:
if($data[$i+1]=~/^\w+\S+(\d+.\d+)?$/){
print $data[$i+1],"\n";
}
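A quick way to see the difference, assuming the pattern above was retyped rather than pasted: the two literal spaces and the backslash before the dot are missing. Without a literal space, \w+\S+ can never get past the first space in the line, and the $ anchor then fails, so nothing matches. The sketch below counts matches for both spellings; the helper name matching_lines is made up for this illustration.

```perl
use strict;
use warnings;

my @lines = (
    "AT5G22510 10.0004564\n",
    "AT4G21380 10.0004675 916.0016301\n",
    "AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family\n",
);

my $as_typed  = qr/^\w+\S+(\d+.\d+)?$/;      # spaces and the escaped dot are missing
my $suggested = qr/^\w+ \S+( \d+\.\d+)?$/;   # the pattern as posted earlier in the thread

# Return the candidate lines that match the given pattern
# (or their count when called in scalar context).
sub matching_lines {
    my ($re, @candidates) = @_;
    return grep { $_ =~ $re } @candidates;
}

print "as typed:  ", scalar(matching_lines($as_typed,  @lines)), " line(s)\n";
print "suggested: ", scalar(matching_lines($suggested, @lines)), " line(s)\n";
print matching_lines($suggested, @lines);    # lines still end in a newline
```

Note that lines read from a filehandle keep their newline unless chomp is called, so printing them with an extra "\n" produces a blank line after each one.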
KevinADC
03-06-2007, 11:38 PM
/^\w+ \S+(\s*\d+\.\d+){1,2}$/
FishMonger
03-06-2007, 11:41 PM
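my @data = <DATA>;
for my $i (0..$#data) {
    # Earlier gap search, kept for reference with its print commented out.
    my ($family) = $data[$i] =~ /^\S+\s(.+?) family/;
    if ($family && $i+2 <= $#data && $data[$i+2] =~ /\Q$family\E/) {
        # print @data[$i..$i+2];
    }
    # A line with no family description: an ID, one number, and optionally a second number.
    if ($data[$i] =~ /^\w+ \S+( \d+\.\d+)?$/) {
        print $data[$i];
    }
}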
__DATA__
AT4G09510 10.0004564 47.0062289
AT4G34860 10.0004564 55.0090220
AT5G22510 10.0004564
AT1G65800 10.0004675 916.0016301
AT1G66150 10.0004675
AT1G70250 10.0004675 504.0004713
AT2G13790 10.0004675 588.0004672
AT2G13800 10.0004675 588.0004672
AT3G45410 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G18250 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT4G21380 10.0004675 916.0016301
AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family
AT5G46330 10.0004675 636.0004674 NULL Receptor kinase-like protein family
AT2G04032 10.0005385 18.0008324
AT2G34190 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT3G10960 13.0015646 253.0005215
AT4G38050 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT5G25420 13.0015646 Organic Solute Cotransporters Permease 1-like family
AT5G49990 13.0015646 Organic Solute Cotransporters Permease 1-like family
Outputs:
AT4G09510 10.0004564 47.0062289
AT4G34860 10.0004564 55.0090220
AT5G22510 10.0004564
AT1G65800 10.0004675 916.0016301
AT1G66150 10.0004675
AT1G70250 10.0004675 504.0004713
AT2G13790 10.0004675 588.0004672
AT2G13800 10.0004675 588.0004672
AT4G21380 10.0004675 916.0016301
AT2G04032 10.0005385 18.0008324
AT3G10960 13.0015646 253.0005215
KevinADC
03-07-2007, 01:10 AM
ah, yes, this is correct:
/^\w+ \S+( \d+\.\d+)?$/
\S+ is catching the first occurrence of the 'digits.digits' pattern
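Since the explanation of the regex was never posted, here is one reading of it, written with the /x modifier so each piece can carry a comment. It is a sketch rather than the original author's wording: the group is made non-capturing, each literal space is written as [ ] because /x ignores bare whitespace, and the helper name has_no_family is invented for the example.

```perl
use strict;
use warnings;

my $no_family = qr/
    ^ \w+                  # the ID, e.g. AT4G21380
    [ ] \S+                # one space, then the first number (any run of non-space)
    (?: [ ] \d+ \. \d+ )?  # optionally one space and a second digits.digits number
    $                      # end of line; a final newline may still follow
/x;

# True when the line is just an ID followed by one or two numbers.
sub has_no_family {
    my ($line) = @_;
    return $line =~ $no_family ? 1 : 0;
}

for my $line (
    "AT5G22510 10.0004564\n",
    "AT4G21380 10.0004675 916.0016301\n",
    "AT5G38280 10.0004675 916.0016301 NULL Receptor kinase-like protein family\n",
) {
    print has_no_family($line) ? "no family: " : "family:    ", $line;
}
```

Because \S+ already consumes the first number, the group only has to cover the optional second one, and the $ anchor is what rejects lines that go on to name a family.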