Need suggestions on parsing information



macleodjb
01-20-2012, 02:09 AM
Hi guys,

I've set up a mail pipe to retrieve lottery results by email. This is the first time I've done this, and I'd like some advice on how to proceed. Below is a sample email that I need to parse to retrieve the "game name", the "drawing date" and the "results" of the game. My question is: if all of these values change from email to email, how do I reliably locate and parse the correct sections of this mess, given that there are random spaces or characters in odd places? I'm not looking for someone to write me code, just for some pointers to follow.



<head><meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DISO=
-8859-1">

<title>Florida Lottery Winning Numbers</title>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859-=
1">
</head>
<!-- Top -->
<body bgcolor=3D"#D6E69D">
<table width=3D"500" border=3D"1" bordercolor=3D"#99CC00" align=3D"center" =
cellpadding=3D"0" cellspacing=3D"0">
<tr>
<td><table width=3D"500" border=3D"0" bgcolor=3D"#FFFFFF" align=3D"cent=
er" cellpadding=3D"0" cellspacing=3D"0">
<tr>
<td><a href=3D"http://www.flalottery.com"><img src=3D"http://www.fl=
alottery.com/exptkt/header.gif" alt=3D"Florida Lottery Winning Numbers" wid=
th=3D"500" height=3D"90" border=3D"0"></a></td>
</tr>
<!-- draw_date -->
<tr>
<td align=3D"center"><table width=3D"100%" border=3D"0" cellspacing=
=3D"0" cellpadding=3D"8">
<tr>
<td><div align=3D"center"><font face=3D"Arial" size=3D"5" col=
or=3D"#8CC43F"><strong>Thursday, January 19, 2012 Draws</strong></font></d=
iv></td>
</tr>
</table></td>
</tr>
<tr>
<td colspan=3D"2"><hr color=3D"#99CC00" size=3D"1" width=3D"9=
0%"></td>
</tr>

<!-- Play4 midday -->
<tr>
<td align=3D"center"><table width=3D"100%" border=3D"0" cellspacing=
=3D"0" cellpadding=3D"0">
<tr>
<td width=3D"40%"><div align=3D"center"><a href=3D"http://www=
.flalottery.com/inet/games-play4Main.do"><img src=3D"http://www.flalottery.=
com/exptkt/play4.gif" alt=3D"Play 4" width=3D"117" height=3D"49" vspace=3D"=
3" border=3D"0"></a></div></td>
<td width=3D"60%"><div align=3D"left"><font face=3D"Arial" si=
ze=3D"5" color=3D"#000000"><strong><font color=3D"#666666">Midday:</font> =
2 - 0 - 5 - 6<br>
</strong></font></div></td>
</tr>
</table></td>
</tr>
<tr>
<td align=3D"center"><hr color=3D"#99CC00" size=3D"1" width=3D"90%"=
></td>
</tr>

<!-- Cash3 midday-->
<tr>
<td align=3D"center"><table width=3D"100%" border=3D"0" cellspacing=
=3D"0" cellpadding=3D"0">
<tr>
<td width=3D"40%"><div align=3D"center"><a href=3D"http://www=
.flalottery.com/inet/games-cash3Main.do"><img src=3D"http://www.flalottery.=
com/exptkt/cash3.gif" alt=3D"Cash 3" width=3D"117" height=3D"46" vspace=3D"=
3" border=3D"0"></a></div></td>
<td width=3D"60%"><div align=3D"left"><font face=3D"Arial" si=
ze=3D"5" color=3D"#000000"><strong><font color=3D"#666666">Midday:</font> 6=
- 2 - 0<br>
</strong></font></div></td>
</tr>
</table></td>
</tr>
<tr>
<td align=3D"center"><hr color=3D"#99CC00" size=3D"1" width=3D"90%"=
></td>
</tr>

<!-- Bottom -->
<tr>
<td><table width=3D"100%" border=3D"0" align=3D"center" cellpadding=
=3D"5" cellspacing=3D"0">
<tr>
<td><font face=3D"Arial" size=3D"1">Please note every effort =
has been made to ensure that the enclosed information is accurate; however,=
in the event of an error, the winning numbers and prize amounts in the off=
icial record of the Florida Lottery shall be controlling.<p>
To unsubscribe from receiving Florida Lottery e-mail, please <a href=3D"htt=
p://secondchance.flalottery.com/secondchance/vip_login.do"> click here</a>,=
log in to your account and update your e-mail preferences.=20
</font></td>
</tr>
</table></td>
</tr>
<tr>
<td><a href=3D"http://www.flalottery.com"><img border=3D"0" src=3D"=
http://www.flalottery.com/exptkt/footer.gif" width=3D"500" height=3D"40" al=
t=3D"www.flalottery.com"></a></td>
</tr>
</table></td>
</tr>
</table>
</body>
</html>

Spookster
01-20-2012, 02:54 AM
As long as they are consistent in how they write this out, you can pick up on the patterns using regular expressions.

Patterns:
Gives you the date
<!-- draw_date --> followed by a bunch of junk and then <strong> date </strong>

Gives you the game names
<!-- Play4 midday -->
<!-- Cash3 midday-->

Gives you results
Midday:</font> precedes the results for each game type, and each game has its own pattern of results:
= 2 - 0 - 5 - 6
6= - 2 - 0
I'm not really sure what the = means.

The link below will get you started on parsing it with regular expressions. It also demonstrates use of the DOM, but I don't think you are going to be able to use that here. This HTML is pretty horrible.
http://www.codingforums.com/showthread.php?t=244867
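
On the stray = signs: the sample looks like quoted-printable encoded mail, where =3D stands for a literal = and a = at the very end of a line is a soft line break inserted when the line was wrapped. A minimal sketch, assuming the pipe script receives the body still encoded; $raw is a shortened stand-in for two lines of the sample above, not the real message:

```php
<?php
// Shortened stand-in for two lines of the sample email body (assumption:
// the pipe script receives the body still quoted-printable encoded).
$raw = "<font color=3D\"#666666\">Midday:</font> 6=\r\n - 2 - 0<br>";

// "=3D" becomes "=", and "=" followed by a line break is dropped,
// which rejoins lines that were wrapped in transit.
$decoded = quoted_printable_decode($raw);

echo $decoded, "\n";
```

Decoding first means the patterns can be written against ordinary HTML instead of having to allow for = and line breaks in the middle of attributes and results.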

mlseim
01-20-2012, 03:21 AM
If they offer results by email, do they also offer RSS feeds with the latest results?

I would use their RSS feed instead of the email. You could have a PHP script
automatically grab the RSS (XML file), save the data in a database, and also send an SMS
to your phone, or send you a nice, clean email with any statistical data you wish.
Your PHP script would be creating the email you get.

Find out if they offer an RSS feed or an API (which would also work well).

EDIT:
I just found their feed here:
http://www.flalottery.com/video/en/theWinningNumber.xml

That can easily be accessed and parsed by a PHP script automatically, using a CRON job.
You can then use the data however you want.
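
A minimal sketch of the feed approach, assuming a generic RSS 2.0 layout (channel, item, title, description). The actual structure of the Florida feed is not shown in this thread, so the sample XML below is made up for illustration; in a real cron script the string would come from something like file_get_contents() on the feed URL.

```php
<?php
// Hypothetical feed content: the element layout follows plain RSS 2.0,
// but the real feed may be structured differently.
$xml = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Lottery Results</title>
    <item>
      <title>Play 4 Midday</title>
      <description>2-0-5-6</description>
    </item>
    <item>
      <title>Cash 3 Midday</title>
      <description>6-2-0</description>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($xml);
if ($feed === false) {
    exit("Could not parse the feed\n");
}

// Collect game name => array of drawn numbers.
$results = [];
foreach ($feed->channel->item as $item) {
    $game = trim((string) $item->title);
    $results[$game] = array_map('intval', explode('-', (string) $item->description));
}

print_r($results);
```

From here the $results array could be written to a database or formatted into an email, as described above.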



macleodjb
01-20-2012, 01:05 PM
I guess posting the Florida email content was a bad example. I need to get this working for states that do not offer the RSS option. I am using the RSS feed in Florida, but quite a few states don't offer one.

As for parsing with regular expressions, I'm going to have to read up on them in detail because I've never really understood them well. For example, I can spot the patterns, but it's the junk in the middle that I don't need that concerns me. I'm not sure how to get rid of it. I want to feel confident that my script will pull out the data without leftover junk and without picking up the wrong data.

If you have any more suggestions please send them over. Thanks

mlseim
01-20-2012, 01:27 PM
How about using a service like some of these?
https://www.google.com/search?q=state+lottery+xml+results+united+states&btnG=Search

I realize they might have subscription costs, but the data is all
in one place, and easy to access. I believe the time and energy
you save would be worth the cost.

Parsing even 10 states with HTML parsing would be a nightmare, and if they
changed anything on their webpage (like a new layout design), you'd be
starting all over again.

How about this thought ... maybe you can tell us which states DON'T
offer the RSS feed results. It's possible that 40 states offer it, and 10 don't.
That might make it easier to swallow.



macleodjb
01-20-2012, 01:47 PM
Here's my first question about regular expressions. How can I retrieve the chunk of content between <!-- draw_date --> and the following block <!-- {whatever} -->?

I tried this, but it returns an empty result.

$date_pattern = "/<!-- draw_date -->(.*)<!--/";

Spookster
01-20-2012, 06:35 PM
That's easy



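<?php
// Test string: one real HTML comment, plus a malformed one that should not match
$subject = 'fsdjf idsfi sidh dsfh <!-- draw_date --> sdofijos.dsflsad <!- asdfsd ->';
// Lazy (.*?) captures the text inside each <!-- ... --> comment
$pattern = '/<!--(.*?)-->/';
echo htmlspecialchars($subject) . "<br>";
// $matches[0] holds the full matches, $matches[1] the captured comment text
preg_match_all($pattern, $subject, $matches);
var_dump($matches);
?>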


produces:


fsdjf idsfi sidh dsfh <!-- draw_date --> sdofijos.dsflsad <!- asdfsd ->
array
0 =>
array
0 => string '<!-- draw_date -->' (length=18)
1 =>
array
0 => string ' draw_date ' (length=11)

macleodjb
01-20-2012, 07:44 PM
I'm not sure I follow that, but it doesn't look like what I'm after. I wanted to get the content between the two comments, as in <!-- tag --> content <!-- tag -->

So in your example it would return the following:


sdofijos.dsflsad

macleodjb
01-20-2012, 08:05 PM
I just tried to use this as my regular expression to get between those points.


"/->[A-Za-z0-9-^_]+<!-/"

no luck

Spookster
01-21-2012, 12:03 AM
Quoting macleodjb:
I'm not sure I follow that, but it doesn't look like what I'm after. I wanted to get the content between the two comments, as in <!-- tag --> content <!-- tag -->

So in your example it would return the following:


sdofijos.dsflsad


No it doesn't. Did you look at the post? Did you even try it? I posted the code and the output it produces. Show me how it doesn't work.

macleodjb
01-21-2012, 12:18 AM
Here is the output from your post above.



Array
(
[0] => Array
(
[0] =>
[1] =>
[2] =>
[3] =>
[4] =>
)

[1] => Array
(
[0] => Top
[1] => draw_date
[2] => Play4 midday
[3] => Cash3 midday
[4] => Bottom
)

)


This returns what is between the start and end of each comment tag. What I am looking to do is return what is between the draw_date tag and the Play4 midday tag. That will make it easier to pull out the date as well as the results.

For example:


<!-- First Tag --> i.e. (<!-- draw_date -->)

<b>Here is the content I want to return</b>

<!-- Second Tag --> i.e. (<!-- Play4 midday -->)
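
What is being asked for here, the text between one comment marker and the next, can be sketched with a lazy match plus a lookahead. This assumes the comment markers are still present in the email and the body has already been decoded; the function name split_by_comments and the shortened $html are made up for illustration.

```php
<?php
// Hypothetical helper: returns [comment label => content that follows it,
// up to the next comment or the end of the string].
function split_by_comments(string $html): array
{
    // s modifier: "." also matches newlines, so a section may span lines.
    // (.*?) is lazy, and the lookahead stops it at the next "<!--".
    $pattern = '/<!--\s*(.*?)\s*-->(.*?)(?=<!--|\z)/s';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $sections = [];
    foreach ($matches as $m) {
        $sections[$m[1]] = trim($m[2]);
    }
    return $sections;
}

// Shortened stand-in for the decoded email body.
$html = "<!-- draw_date -->\n<strong>Thursday, January 19, 2012 Draws</strong>\n"
      . "<!-- Play4 midday -->\nMidday:</font> 2 - 0 - 5 - 6<br>\n"
      . "<!-- Cash3 midday-->\nMidday:</font> 6 - 2 - 0<br>\n";

$sections = split_by_comments($html);
echo $sections['draw_date'], "\n";
```

Keying the sections by comment label means each game's block can be looked up by name, as long as the sender keeps the labels stable and unique.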

Spookster
01-21-2012, 12:31 AM
What I posted will search the string and return all matches it finds between those tags, which is what you asked for:



How can I retrieve the chunk of content between <!-- draw_date --> and the following block <!-- {whatever} -->?

macleodjb
01-21-2012, 03:02 AM
Quoting my earlier post:
Here's my first question about regular expressions. How can I retrieve the chunk of content between <!-- draw_date --> and the following block <!-- {whatever} -->?

I tried this, but it returns an empty result.

$date_pattern = "/<!-- draw_date -->(.*)<!--/";


No, what I was looking for was the content in between. See the above. The first tag is <!-- draw_date --> and the following block <!-- {whatever} --> would, in this example, be <!-- Play4 midday -->.

In my original post I was attempting to use the full first tag and the opening delimiter of the following tag, with anything (.*) in between. My lack of knowledge of regular expressions, I guess, made this hard to understand. In my next attempt I tried using your example with just the end delimiter and the start delimiter: "-->(.*)<!--"

Sorry for the confusion.
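
A likely reason the first attempt came back empty: without the s modifier, . does not match newlines, so (.*) cannot cross the line breaks between the two comments. A minimal sketch, assuming the body has been decoded first; $html is a shortened stand-in for the sample email, and the variable names are illustrative.

```php
<?php
// Shortened stand-in for the decoded sample email.
$html = "<!-- draw_date -->\n<tr>\n<td><strong>Thursday, January 19, 2012 Draws</strong></td>\n</tr>\n"
      . "<!-- Play4 midday -->\n<td><font>Midday:</font> 2 - 0 - 5 - 6<br></td>\n"
      . "<!-- Bottom -->\n";

// Same idea as the original pattern, with two changes:
//   s   lets "." match newlines
//   .*? is lazy, so it stops at the first following "<!--", not the last
$date_pattern = '/<!-- draw_date -->(.*?)<!--/s';

$date = null;
if (preg_match($date_pattern, $html, $m)) {
    // Drop the markup, then the trailing word "Draws".
    $text = trim(strip_tags($m[1]));
    $text = preg_replace('/\s+Draws$/', '', $text);
    // "!" resets unparsed fields; l = day name, F = month name, j = day, Y = year.
    $parsed = DateTime::createFromFormat('!l, F j, Y', $text);
    if ($parsed !== false) {
        $date = $parsed->format('Y-m-d');
    }
}

// Results: the digits that follow "Midday:" inside the Play4 block.
$numbers = [];
if (preg_match('/<!-- Play4 midday -->(.*?)<!--/s', $html, $m)) {
    $line = strip_tags($m[1]);
    if (preg_match('/Midday:\s*([\d\s-]+)/', $line, $r)) {
        preg_match_all('/\d+/', $r[1], $digits);
        $numbers = array_map('intval', $digits[0]);
    }
}

echo $date, "\n";
echo implode(',', $numbers), "\n";
```

The character-class attempt fails for a related reason: the class contains no whitespace, <, > or other punctuation, so it cannot span the HTML that sits between the comments.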


