SamTheGonoph
07-16-2006, 10:41 PM
Hi,
I'm sorry to bring you what is probably a simple regexp question, but I am at my wit's end, since I am primarily a biologist and not a brilliant programmer.
I am trying to extract some protein sequences from a file with the syntax shown below. I want all the - (gap) characters as well as the amino-acid letters.
By the way, there appears to be a newline character at the end of every line...
>P1;UNKNOWN
-----------------------------------------------------AMRTPRC
GVPDLGRFQTFEGDLKWHHHNITYWIQNYSEDLPRAVIDDAFARAFALWSAVTPLTFTRV
YSRDADIVIQFGVAEHGDGYPFDGKDGLLAHAFPPGPGIQGDAHFDDDELWSLGKGVVVP
TRFGNADGAACHFPFIFEGRSYSACTTDGRSDGLPWCSTTANYDTDDRFGFC--------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------------
*
>P1;1l6jA
VLFPGDLRTNLTDRQLAEEYLYRYGYTLGPALLLLQKQLSLPETGELDSATLKAMRTPRC
GVPDLGRFQTFEGDLKWHHHNITYWIQNYSEDLPRAVIDDAFARAFALWSAVTPLTFTRV
YSRDADIVIQFGVAEHGDGYPFDGKDGLLAHAFPPGPGIQGDAHFDDDELWSLGKGVVVP
TRFGNADGAACHFPFIFEGRSYSACTTDGRSDGLPWCSTTANYDTDDRFGFCPSERLYTR
DGNADGKPCQFPFIFQGQSYSACTTDGRSDGYRWCATTANYDRDKLFGFCPTRADSTVMG
GNSAGELCVFPFTFLGKEYSTCTSEGRGDGRLWCATTSNFDSDKKWGFCPDQGYSLFLVA
AHEFGHALGLDHSSVPEALMYPMYRFTEGPPLHKDDVNGIRHLYG
*
<END>
I have tried using BioPerl to extract the sequences but it misses the -------- characters which I require for part of the alignment. The code I have at the moment is not matching anything (the file name of the alignment is bestSeq.pir):
<SNIP>
open (BESTSEQ, "bestSeq.pir");
my ($sequence1, $sequence2);
while (<BESTSEQ>)
{
    if ($_=~/>\S*..(\S*?).\*>\S*..(\S*?)/s)
    # Hopefully the two parentheses should pick up any non white space - i.e. the sequences
    {
        print "Regexp matched. Seq 1 is $1\nSeq 2 is $2\n";
        $sequence1=$1;
        $sequence2 =$2;
    }
}
</SNIP>
Could anyone point out what I am doing wrong here? Thanks in advance for reading,
Ed.
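A note on why the pattern above cannot match: while (<BESTSEQ>) puts a single line into $_ on each pass, so a pattern that has to span a header, several sequence lines and a second header never sees enough text. In addition, \S never matches a newline (the /s flag only changes what . matches), so (\S*?) cannot run across line breaks. One possible way around both problems is to skip the single large regexp and build each sequence up line by line. The following is only a minimal sketch under some assumptions: parse_pir is a hypothetical helper name (not a BioPerl function), the file is assumed to have no description line after each header (as in the sample shown), and the demo data is a shortened, made-up stand-in for bestSeq.pir.

```perl
use strict;
use warnings;

# parse_pir is a hypothetical helper, not part of BioPerl.
# It returns a list of [id, sequence] pairs and keeps '-' gap characters.
sub parse_pir {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                          # drop the newline (and any \r)
        if ($line =~ /^>[A-Za-z0-9]{2};(\S+)/) {     # header, e.g. ">P1;UNKNOWN"
            push @records, [$1, ''];
        }
        elsif (@records && $line =~ /^([A-Za-z-]+)\*?$/) {
            $records[-1][1] .= $1;                   # letters and gaps; a trailing '*' is dropped
        }
        # anything else (blank lines, a lone "*", "<END>") is ignored
    }
    return @records;
}

# Shortened, illustrative data only (not the real alignment).
my $sample = <<'PIR';
>P1;UNKNOWN
---AMRTPRC
GVPD--
*
>P1;1l6jA
VLFPGDLRTN
GVPDLG*
PIR

# Read from an in-memory filehandle so the sketch runs without a file.
open(my $fh, '<', \$sample) or die "Cannot open sample: $!";
my @records = parse_pir($fh);
close $fh;

print "$_->[0]: $_->[1]\n" for @records;
```

To try this on the real file, the in-memory open could be replaced with open(my $fh, '<', 'bestSeq.pir') or die "Cannot open bestSeq.pir: $!"; the two sequences would then be in $records[0][1] and $records[1][1]. If the file does carry a description line after each header, the sequence pattern would need adjusting so that line is not mistaken for residues.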