  1. #1
    New to the CF scene
    Join Date
    Mar 2013
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Generating RSS from CSV/Web-scraping match

    Hello all. First, this is my first post in this forum. Second, thanks for taking the time to read it.

    I want to learn to code mainly in order to complete one specific task. The problem is that I do not know which language I should be using or what methodology would be best. Let me outline what I need to do.

    I produce Excel-based ratings for horse racing. They are extremely accurate, being based on non-linear algorithmic derivation. To demonstrate live how accurate they are, I want to match the CSV file I produce against a web scrape (on a timed loop) and generate an up-to-date RSS feed that can be linked directly to my website.

    Are there any flaws in my idea? How easy is this to achieve, and which programming language would be best for making a start on such a project?

    Any help or general words of advice would be greatly appreciated.

    Kind regards

    Alan

  • #2
    Master Coder mlseim's Avatar
    Join Date
    Jun 2003
    Location
    Cottage Grove, Minnesota
    Posts
    9,389
    Thanks
    8
    Thanked 1,077 Times in 1,068 Posts
    It depends on where you're scraping from. Sometimes it can be almost impossible to parse the HTML. Can you give us a URL you are scraping from, and tell us what information you're trying to find? You might end up subscribing to a site that offers an API to access racing data (for a fee of some kind).

  • #3
    New to the CF scene
    Join Date
    Mar 2013
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Thanks for the reply. There are multiple webpages that would be ideal candidates for such a scrape; ultimately, any website that provides up-to-date racing results would do.

    I have a small list of potentials here (listed in order of preference due to the information they provide):

    http://www.timeform.com/free/
    http://www.racingpost.com/horses/fas...op_results=yes
    http://www.sportinglife.com/racing/fast-results

    An example of a match would be a race where the ratings predicted the correct winner. The aim would be to get the RSS feed to report the winner and the price of that selection.

    Best regards

    Alan

  • #4
    Master Coder mlseim's Avatar
    Join Date
    Jun 2003
    Location
    Cottage Grove, Minnesota
    Posts
    9,389
    Thanks
    8
    Thanked 1,077 Times in 1,068 Posts
    Let's just take one of your example links for now ...
    http://www.racingpost.com/horses2/re...ate=2013-03-26

    On "racingpost", a PHP script could query a day's result because each full report uses the same URL format (date). You can substitute a different date and get results for any day: r_date=2013-03-24
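
    For illustration, here is a minimal sketch of the date substitution, written in Python rather than the PHP suggested above. The base URL is a hypothetical placeholder (the real address is truncated in this thread); only the r_date parameter name comes from the post, and results_url_for is an invented helper name.

```python
from datetime import date
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def results_url_for(base_url: str, day: date) -> str:
    """Return base_url with its r_date query parameter set to the given day."""
    parts = urlsplit(base_url)
    # Keep every other query parameter and replace only r_date.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "r_date"
    ]
    query.append(("r_date", day.isoformat()))  # e.g. 2013-03-24
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical placeholder; substitute the real results page address.
EXAMPLE_BASE = "http://example.com/results?r_date=2013-03-26"

print(results_url_for(EXAMPLE_BASE, date(2013, 3, 24)))
# prints http://example.com/results?r_date=2013-03-24
```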

    Now, PHP would load that page's HTML into an array (one element per line) and search for horses using a loop.

    Every line that shows a horse (and its placement) contains this common string:
    "/horses/horse_home"

    That line can then be parsed to know horse name and place, because the name and place are in the same line.

    Here's an example of a line with a horse in it:
    2 <a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=832550" onclick="scorecards.send(&quot;horse_name&quot;return Html.popup(this, {width:695, height:800})" title="Full details about this HORSE">My Sapphire</a> 14/1 <br/>

    We can parse out the horse's name and the place it finished in.
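
    As a rough sketch of that parsing step (again in Python rather than PHP), a regular expression can pull the place, name, and price out of the example line. The pattern assumes every horse line has the same layout as the one quoted above, which may not hold for the live page; parse_horse_line is an invented name.

```python
import html
import re

HORSE_MARKER = "/horses/horse_home"

# Assumed layout: place number, a link whose text is the horse's name,
# then an optional fractional price such as 14/1.
HORSE_LINE = re.compile(
    r"^\s*(?P<place>\d+)\s*"
    r"<a\b[^>]*>(?P<name>[^<]+)</a>\s*"
    r"(?P<price>\d+/\d+)?"
)


def parse_horse_line(line):
    """Return place, name and price for a horse line, or None for other lines."""
    if HORSE_MARKER not in line:
        return None
    match = HORSE_LINE.search(line)
    if match is None:
        return None
    return {
        "place": int(match["place"]),
        "name": html.unescape(match["name"]).strip(),
        "price": match["price"],  # None when the price is not a simple fraction
    }


# The example line quoted in the post above.
SAMPLE_LINE = (
    '2 <a href="http://www.racingpost.com/horses/horse_home.sd?horse_id=832550" '
    'onclick="scorecards.send(&quot;horse_name&quot;return Html.popup(this, '
    '{width:695, height:800})" title="Full details about this HORSE">'
    'My Sapphire</a> 14/1 <br/>'
)

print(parse_horse_line(SAMPLE_LINE))
# prints {'place': 2, 'name': 'My Sapphire', 'price': '14/1'}
```

    Regular expressions are brittle against HTML changes, so a real script would probably want an HTML parser and some error handling.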

    If the loop encounters a line with the word "WIN" in it, it knows that line belongs to the group of horses just before it, i.e. it marks the end of that race.

    Here is one of those lines:
    <p>WIN &pound;2.40 PL &pound;1.90, &pound;6.70<br /> Ex &pound;36.90<br /> CSF &pound;23.93</p>

    Once that line is found, the next horse it finds will be a different race.
    ... and the script continues on like that until the array end is reached.
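
    The loop described above could be sketched like this in Python (not PHP). The horse names and the second WIN line in SAMPLE_PAGE are made up for the example; only the marker string and the first WIN line come from the post, and group_into_races is an invented name.

```python
import re

HORSE_MARKER = "/horses/horse_home"
# The win dividend follows "WIN &pound;" in the example line from the post.
WIN_LINE = re.compile(r"WIN\s+&pound;(?P<win>\d+(?:\.\d+)?)")


def group_into_races(lines):
    """Collect horse lines until a WIN line closes the current race."""
    races = []
    current = []
    for line in lines:
        if HORSE_MARKER in line:
            current.append(line)
            continue
        match = WIN_LINE.search(line)
        if match and current:
            races.append({"horse_lines": current, "win_dividend": float(match["win"])})
            current = []  # the next horse found belongs to a different race
    if current:
        # Horses at the end of the page with no WIN line after them.
        races.append({"horse_lines": current, "win_dividend": None})
    return races


# Made-up sample lines in the same shape as the page described above.
SAMPLE_PAGE = [
    '1 <a href="/horses/horse_home.sd?horse_id=1">Alpha</a> 2/1 <br/>',
    '2 <a href="/horses/horse_home.sd?horse_id=2">Bravo</a> 14/1 <br/>',
    "<p>WIN &pound;2.40 PL &pound;1.90, &pound;6.70<br /> Ex &pound;36.90<br /> CSF &pound;23.93</p>",
    '1 <a href="/horses/horse_home.sd?horse_id=3">Charlie</a> 5/1 <br/>',
    "<p>WIN &pound;6.00 PL &pound;2.10</p>",
]

races = group_into_races(SAMPLE_PAGE)
print([(len(race["horse_lines"]), race["win_dividend"]) for race in races])
# prints [(2, 2.4), (1, 6.0)]
```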

    I know nothing about betting on horses ... I don't know what "Ratings" means.

    You will be using PHP which is "server-side". Either your CSV file will need to be uploaded to your website before the PHP script crunches, or the PHP script can give you a CSV file to download to your Excel. I think the latter would be best because you can do elaborate calculations using Excel. Doing all of the "ratings" matching using PHP will probably require a ton of scripting and figuring out how to do the calculations.
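
    To illustrate the second option (the script hands back a CSV for Excel), here is a small Python sketch, not PHP. The column names and the sample row layout are assumptions made for the example, and results_to_csv is an invented name.

```python
import csv
import io

# Assumed column layout; adjust to whatever the spreadsheet expects.
FIELDS = ["race", "place", "name", "price"]


def results_to_csv(results):
    """Serialise a list of result dictionaries to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# One illustrative row built from the example line quoted earlier;
# the race number 1 is an assumption.
SAMPLE_RESULTS = [
    {"race": 1, "place": 2, "name": "My Sapphire", "price": "14/1"},
]

print(results_to_csv(SAMPLE_RESULTS))
```

    Excel opens the resulting text directly once it is saved with a .csv extension, so the ratings calculations can stay in the spreadsheet.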

    So, it can be done. It will take some loading, looping, parsing, etc.
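
    For the last step Alan describes (reporting races where the ratings picked the winner), a hedged Python sketch might look like the following. The CSV columns (race, horse, rating), the assumption that a higher rating is better, the shape of the winners dictionary, and all sample names are invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def predicted_winners(ratings_csv):
    """Map each race to its top-rated horse (assumes higher rating is better)."""
    best = {}
    for row in csv.DictReader(io.StringIO(ratings_csv)):
        rating = float(row["rating"])
        if row["race"] not in best or rating > best[row["race"]][1]:
            best[row["race"]] = (row["horse"], rating)
    return {race: horse for race, (horse, _) in best.items()}


def build_rss(ratings_csv, winners, title, link):
    """Build an RSS 2.0 document with one item per correctly predicted race."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = "Races where the top-rated horse won"
    for race, horse in predicted_winners(ratings_csv).items():
        actual = winners.get(race)
        # Compare names case-insensitively; real data may need more cleanup.
        if actual and actual["name"].casefold() == horse.casefold():
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = f"{race}: {horse} won at {actual['price']}"
    return ET.tostring(rss, encoding="unicode")


# Made-up ratings and scraped winners for the example.
RATINGS = "race,horse,rating\nR1,Alpha,91.5\nR1,Bravo,88.0\nR2,Charlie,77.0\n"
WINNERS = {
    "R1": {"name": "Alpha", "price": "2/1"},
    "R2": {"name": "Delta", "price": "5/2"},
}

print(build_rss(RATINGS, WINNERS, "Ratings winners", "http://example.com/"))
```

    Here only R1 produces an item, because the top-rated horse in R2 did not win.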


    A Side Note:
    Another aspect of scraping is ethics. Sites like the ones you mention probably have a service where a person can subscribe to an API to query their database for all data (for a fee). Scraping a website basically means taking all of the HTML from a page, tearing it apart, pulling out the information you need, and using it, without any regard for permission, subscriptions, copyrights, etc. I don't imagine any website owner would be very happy about people using their bandwidth to scrape information from their site. That being said, I've given you enough information to get you started, whether you attempt this yourself or hire a friend to help you.



    Last edited by mlseim; 03-26-2013 at 10:51 PM.

  • #5
    Regular Coder patryk's Avatar
    Join Date
    Oct 2012
    Location
    /dev/couch
    Posts
    398
    Thanks
    2
    Thanked 64 Times in 64 Posts
    My suggestion would be to drop the idea right now. It's probably not worth the legal consequences you might face.
    Your server's IP will be seen by the service you are taking data from, so they will have a way of telling you apart from regular users.
    At the very least, if you get caught, your hosting account can be suspended (if the site in question has a copyright statement).

    use API as mlseim said.
    Last edited by patryk; 03-27-2013 at 01:40 AM.

  • #6
    New to the CF scene
    Join Date
    Mar 2013
    Posts
    5
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by patryk View Post
    My suggestion would be to drop the idea right now. It's probably not worth the legal consequences you might face.
    Your server's IP will be seen by the service you are taking data from, so they will have a way of telling you apart from regular users.
    At the very least, if you get caught, your hosting account can be suspended (if the site in question has a copyright statement).

    use API as mlseim said.
    If I were publishing the results to a website, I could see how they might get funny about it, but I would feed the RSS into a Twitter account (I've got an RSS tweeter) that publishes the results. Furthermore, I have searched high and low for such an API, but to no avail.

