Help with preg_replace



Mastiff
06-20-2005, 04:44 AM
I'm new to this regular expression stuff. I'd like to use preg_replace to eliminate a known multi-line signature from the body of incoming E-mails. Say the body text is in $body, and the sig is this


---
Sig line1
Sig line2
Sig line3

If I could just get rid of that, it would be pretty good. But I also get this kind of junk a lot, since messages are being quoted:


> ---
> Sig line1
> Sig line2
> Sig line3

or


>> ---
>> Sig line1
>> Sig line2
>> Sig line3

so I thought I'd be smart and tried:


$body = preg_replace("/---.*?Sig line1.*?Sig line2.*?Sig line3/","",$body);

but this erased the entire message somehow. So I thought I'd go back to basics and tried:


$body = preg_replace("---","",$body);
$body = preg_replace("Sig line1","",$body);
$body = preg_replace("Sig line2","",$body);
$body = preg_replace("Sig line3","",$body);

but this erased everything too.

I'm kinda stumped. Why are these erasing the entire message? And what's the actual smart way to erase this signature when it can have any amount of white space and >'s between lines?

Thanks for any help.
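
A note on the second attempt, as far as it can be judged from the code shown: PCRE patterns in PHP need delimiters. With "---" the first hyphen is taken as the delimiter and the last one as an unknown modifier, and with "Sig line1" the letter S is rejected as a delimiter. In both cases preg_replace() raises a warning and returns NULL, so assigning the result back to $body leaves it empty. The sketch below uses a made-up $body, and it does not explain why the first, delimited attempt emptied the message.

```php
<?php
// Hypothetical sample body, for illustration only.
$body = "Hello list\n---\nSig line1\nSig line2\nSig line3\n";

// No delimiters: preg_replace() warns and returns NULL, so writing the
// result back into $body would wipe the message. @ hides the warning.
$brokenDashes = @preg_replace("---", "", $body);
$brokenText   = @preg_replace("Sig line1", "", $body);
var_dump($brokenDashes, $brokenText);

// Fixed text needs no regex at all.
$plain = str_replace("Sig line1\n", "", $body);

// The same removal with preg_replace(), this time with / delimiters.
$regex = preg_replace("/Sig line1\n/", "", $body);
```

For a fixed string like a single signature line, str_replace() is the simpler tool; a regex only becomes useful once the quoting characters and white space vary.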

Fou-Lu
06-20-2005, 06:17 AM
Uh, what exactly are you attempting to remove from that?
It sounds like you wanted to get rid of it all, but based on your examples this is not the case.
Also, does it always contain the words sigline in it?

Mastiff
06-20-2005, 02:34 PM
Fou-Lu wrote:
> Uh, what exactly are you attempting to remove from that?
> It sounds like you wanted to get rid of it all, but based on your examples this is not the case.
> Also, does it always contain the words sigline in it?

Oh, sorry, it's a signature in the body of an E-mail from a mailing list. The actual content of the mail is always different, but the signature is at the end of every mail. Since people hit reply on their E-mail programs, the signatures end up all over the place. I'm writing a thing to grab the messages and archive the actual content.

For example:



Sounds good Bob, I'll see you Friday.

Fred

>> Bob,
>>
>> Can we meet up some time next week?
>>
>> --
>> Sig1
>> Sig2
>
> Fred,
>
> Sure, Friday would work best for me.
>
> Bob
>
> ---
> Sig1
> Sig2

---
Sig1
Sig2


It doesn't need to be perfect, but if I could make that be something like:


Sounds good Bob, I'll see you Friday.

Fred

>> Bob,
>>
>> Can we meet up some time next week?
>>
>
> Fred,
>
> Sure, Friday would work best for me.
>
> Bob
>


That's the idea.

Thanks.

Fou-Lu
06-20-2005, 03:52 PM
Hmm, this won't be a simple task for preg techniques, though it can be done. I'll take a look at what I can do for you tonight.
Honestly, I would think there would be an easier way to do this. I mean, what we do know is that each reply generates another set of > symbols, and the signature always starts with at least one hyphen.
So it's possible to get the data, since it steps down from the > to >-- each time.
It may be easier to force something of the sort into an XML-based document and extract it from there. I'll test a few different methods and let you know what I think would be the best way to do it.

Mastiff
06-20-2005, 04:42 PM
Thanks for the help. Don't rely on the > symbols too much though. I put those in the example, since they are the most common, but some E-mail programs use other symbols, like "|", or just indents.

If I could deal with the > symbols, that would cover most cases and I'd be pretty happy, but don't rely on a totally predictable pattern every time.

I guess the ideal solution would be where it looks for the signature lines separated by any amount of white space plus any number of reply characters that could be specified (I'd start with >, and |). Beggars can't be choosers of course. ;)
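
One way that idea could be sketched: build the pattern from a list of signature lines and a string of reply characters. The function name build_sig_pattern and its parameters are made up for this example, and the sample message is shortened test data. Indented quoting is covered only as far as spaces and tabs go.

```php
<?php
// Hypothetical helper, not from the thread: builds one pattern from the
// signature lines plus the reply characters that may prefix each line.
function build_sig_pattern(array $sigLines, string $quoteChars = ">|"): string
{
    // Line prefix: any mix of reply characters, spaces and tabs.
    $prefix = "[" . preg_quote($quoteChars, "/") . " \\t]*";
    // Gap between two signature lines: trailing blanks, a line break and
    // the next line's prefix, repeated so blank quoted lines are allowed.
    $gap = "(?:[ \\t]*\\r?\\n" . $prefix . ")+";

    // Escape the signature text so it is matched literally.
    $parts = array_map(function ($line) {
        return preg_quote($line, "/");
    }, $sigLines);

    // The m flag lets ^ match at the start of every line.
    return "/^" . $prefix . implode($gap, $parts) . "[ \\t]*\\r?\\n?/m";
}

// Shortened sample message (made-up test data).
$body = "Sounds good.\n>> Can we meet?\n>> ---\n>> Sig1\n>> Sig2\n"
      . ">\n> Sure.\n> ---\n> Sig1\n> Sig2\n\n---\nSig1\nSig2\n";

$pattern = build_sig_pattern(["---", "Sig1", "Sig2"]);
$clean   = preg_replace($pattern, "", $body);
echo $clean;
```

Because nothing but reply characters and white space is allowed between the signature lines, ordinary message text cannot be swallowed by the match.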

marek_mar
06-20-2005, 11:05 PM
Simple but works:


Regex: /[>]* ?---.*?Sig line1.*?Sig line2.*?Sig line3/ims

Tested with:


some text
>> ---
>> Sig line1
>> Sig line2
>> Sig line3
more text
> ---
> Sig line1
> Sig line2
> Sig line3
yet more text
---
Sig line1
Sig line2
Sig line3

Mastiff
06-20-2005, 11:21 PM
Thanks. Would you mind explaining how it works since I'm still learning? The part I need help with is the
[>]* ? at the beginning.


/[>]* ?---.*?Sig line1.*?Sig line2.*?Sig line3/ims

marek_mar
06-20-2005, 11:23 PM
Match the > sign any number of times then a space once or 0 times...
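
To see that in isolation, here is a small sketch that tests only the start of the regex against a few line beginnings. The ^ anchor and the sample strings are added for the test and are not part of the regex above.

```php
<?php
// Only the leading part of the regex, anchored with ^ for this test.
// [>]* is the same as >*: zero or more ">" characters.
// " ?" is an optional single space.
$prefix = "/^[>]* ?---/";

$unquoted = preg_match($prefix, "---");     // no quoting at all
$single   = preg_match($prefix, "> ---");   // one level, one space
$double   = preg_match($prefix, ">>---");   // two levels, no space
$spaced   = preg_match($prefix, "> > ---"); // ">" separated by spaces

var_dump($unquoted, $single, $double, $spaced);
```

The last case does not match, because a space between two > characters is not allowed by this prefix; a set such as [> ]* would cover it.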

Mastiff
06-21-2005, 04:03 AM
Hey, I tried this out and it's not quite working for me. Here's what happened I think. The text was like this:



Some text

> more text
> ---- Original Message ----
> From: blah blah
> To: Blah blah
>
>> More text
>> More text
>> More text
>> More text
>> ---
>> Sig1
>> Sig2
>> Sig3
>
> ---
> Sig1
> Sig2
> Sig3


The output:

Some text

> more text
> -

I think it spots the "----" around "original message" and then wildcards everything until the last "Sig1". Any smart way around this?
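
The suspicion can be checked with preg_match(), which shows the exact span the pattern grabs. This is only a sketch: the message is shortened, the pattern is the earlier one with the signature text changed to Sig1 to Sig3, and the leftover text may differ from the output above.

```php
<?php
// Shortened version of the message above (made-up test data).
$body = "Some text\n"
      . "> ---- Original Message ----\n"
      . "> From: blah blah\n"
      . ">> More text\n"
      . ">> ---\n"
      . ">> Sig1\n"
      . ">> Sig2\n"
      . ">> Sig3\n";

$pattern = "/[>]* ?---.*?Sig1.*?Sig2.*?Sig3/ims";

// $m[0] holds the whole matched span.
preg_match($pattern, $body, $m);
echo $m[0], "\n";

// True if real message text was caught inside the match.
$lost = strpos($m[0], "More text") !== false;
var_dump($lost);
```

The match begins at the hyphens of the "Original Message" line, and with the s flag the .*? runs across line breaks until a Sig1 turns up, so everything in between goes with it.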

marek_mar
06-21-2005, 11:08 AM
Try this regex:


/[>]* ?---[^a-z]*?Sig1.*?Sig2.*?Sig3/mis

Mastiff
06-21-2005, 02:50 PM
This one seems to be working. Does it make sense?


$body = preg_replace("/^[> ]*---[> \n]*Sig1[> \n]*Sig2[> \n]*Sig3[ \n]*$/ims","",$body);

I think it says: start at the beginning of a line and end at the end of a line; allow any number of >'s or spaces at the start, any number of >'s, spaces or newlines between the Sig lines, and any number of spaces and newlines before stopping. This prevents any good text from getting sucked in between the Sig lines. And I ought to be able to add in |'s and anything else that comes up later.
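
A quick check of that reading against the earlier problem message, as a sketch only. It assumes a * on every gap, including the one before Sig3, since that is what the description calls for, and the message is shortened test data.

```php
<?php
// Shortened version of the earlier problem message (made-up test data).
$body = "Some text\n\n"
      . "> more text\n"
      . "> ---- Original Message ----\n"
      . "> From: blah blah\n"
      . ">\n"
      . ">> More text\n"
      . ">> ---\n>> Sig1\n>> Sig2\n>> Sig3\n"
      . ">\n"
      . "> ---\n> Sig1\n> Sig2\n> Sig3\n";

// Single quotes, so PCRE itself reads \n as a newline.
// m: ^ and $ match at line boundaries. i: ignore case.
$pattern = '/^[> ]*---[> \n]*Sig1[> \n]*Sig2[> \n]*Sig3[ \n]*$/im';

$clean = preg_replace($pattern, "", $body);
echo $clean;
```

The "Original Message" line survives because only >, spaces and newlines may sit between the hyphens and Sig1. Adding | to each bracket set should extend it to that quoting style.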