Reg Exp Problems


SamTheGonoph
07-16-2006, 10:41 PM
Hi,

I'm sorry to bring you what is probably a simple reg exp question, but I am at my wit's end, since I am a biologist primarily and not a brilliant programmer.

I am trying to extract some protein sequences from a file with this sort of syntax. I want all the ---- (gap) characters as well as the letters.

By the way, there appears to be a newline character at the end of every line...

>P1;UNKNOWN

-----------------------------------------------------AMRTPRC
GVPDLGRFQTFEGDLKWHHHNITYWIQNYSEDLPRAVIDDAFARAFALWSAVTPLTFTRV
YSRDADIVIQFGVAEHGDGYPFDGKDGLLAHAFPPGPGIQGDAHFDDDELWSLGKGVVVP
TRFGNADGAACHFPFIFEGRSYSACTTDGRSDGLPWCSTTANYDTDDRFGFC--------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------------
*
>P1;1l6jA

VLFPGDLRTNLTDRQLAEEYLYRYGYTLGPALLLLQKQLSLPETGELDSATLKAMRTPRC
GVPDLGRFQTFEGDLKWHHHNITYWIQNYSEDLPRAVIDDAFARAFALWSAVTPLTFTRV
YSRDADIVIQFGVAEHGDGYPFDGKDGLLAHAFPPGPGIQGDAHFDDDELWSLGKGVVVP
TRFGNADGAACHFPFIFEGRSYSACTTDGRSDGLPWCSTTANYDTDDRFGFCPSERLYTR
DGNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATTANYDRDKLFGFCPTRADSTVMG
GNSAGELCVFPFTFLGKEYSTCTSEGRGDGRLWCATTSNFDSDKKWGFCPDQGYSLFLVA
AHEFGHALGLDHSSVPEALMYPMYRFTEGPPLHKDDVNGIRHLYG
*


<END>

I have tried using BioPerl to extract the sequences but it misses the -------- characters which I require for part of the alignment. The code I have at the moment is not matching anything (the file name of the alignment is bestSeq.pir):

<SNIP>

open (BESTSEQ, "bestSeq.pir");

my ($sequence 1, $sequence2);

while (<BESTSEQ>)

{

if ($_=~/>\S*..(\S*?).\*>\S*..(\S*?)/s)
# Hopefully the two parentheses should pick up any non white space - i.e. the sequences

{
print "Regexp matched. Seq 1 is $1\nSeq 2 is $2\n";

$sequence1=$1;
$sequence2 =$2;
}

</SNIP>

Could anyone point out what I am doing wrong here? Thanks in advance for reading,

Ed.

KevinADC
07-17-2006, 03:26 AM
Probably a typo, but you have an error here:

my ($sequence 1, $sequence2);

There is a space between $sequence and 1.

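The problem is that you are reading the file line by line but trying to match the pattern as if the text of the file were one long string, so no single line can ever match the whole pattern. You can read the file into one long string (with $/ undefined, <BESTSEQ> returns the entire file in one read) and use a regexp to get all the matches: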

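# Slurp the whole file into one string so the pattern can span lines.
open(BESTSEQ, '<', 'bestSeq.pir') or die "Cannot open bestSeq.pir: $!";
my $text = do { local $/; <BESTSEQ> };
close(BESTSEQ);
# Capture everything between each header line and the terminating '*'.
my @sequences = $text =~ />\S+\s*([^*]+)/g;
print "$_\n" for @sequences;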
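
A minimal self-contained sketch of the same approach, run on inline sample data instead of bestSeq.pir so it can be tried as-is. The shortened sequences are made up for illustration, and the whitespace-stripping step is an assumption: it presumes you want each sequence as one unbroken string, with the - gap characters kept.

```perl
use strict;
use warnings;

# Hypothetical inline sample in the same layout as the file in the question
# (shortened sequences; real data would come from bestSeq.pir).
my $text = <<'END_PIR';
>P1;UNKNOWN

-----AMRTPRC
GVPDLG------
---
*
>P1;1l6jA

VLFPGDLRTNLT
GVPDLGRFQTFE
AHE
*
END_PIR

# Because $text holds the whole input, one pattern can span several lines:
# skip the header line, then capture everything up to the terminating '*'.
my @sequences = $text =~ />\S+\s*([^*]+)/g;

# Assumption: line breaks inside a record are not wanted, so strip all
# whitespace from each captured sequence; the '-' gap characters remain.
s/\s+//g for @sequences;

print "Seq ", $_ + 1, " is $sequences[$_]\n" for 0 .. $#sequences;
```

If you would rather keep the original line breaks, drop the s/\s+//g line; the captures then contain the newlines exactly as they appear in the file.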

SamTheGonoph
07-17-2006, 11:57 AM
Thanks very much. It worked a treat and you have managed to do what I was trying to do for most of the weekend!


Ed.