  1. #1
    Regular Coder
    Join Date
    Jul 2010
    Location
    Oregon City
    Posts
    280
    Thanks
    5
    Thanked 50 Times in 49 Posts

    Site scraping: job management

    There's a forum I want to index so I can offer its users a search utility, since it likely won't ever have one of its own.

    Writing the code isn't the hard part. What I'm not sure about is how to queue the work when I first build the index, so the script isn't executing for an hour or so in one go.

    The forum has 25 or so categories. Each category has multiple pages of about 30 threads each, and each thread has multiple pages of 10 replies each. Realistically I want to process, say, 1 category and X pages of threads at a time.

    What methods can I use to do this?

  • #2
    New Coder
    Join Date
    Jan 2010
    Posts
    29
    Thanks
    0
    Thanked 2 Times in 2 Posts
    If you have some kind of Unix account, couldn't you cap your scripts with variables (e.g. iterate this many times, quit after this period of time, etc.) and cron out jobs periodically so you're splitting the work up?

    Honestly, I have a general idea of what you're trying to do but don't know your limitations. I don't see an issue with just flat out running the queries :P
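
One way to picture the "cap it and cron it" idea is a small persistent job queue. This is only a sketch under assumptions: the `jobs` table, the function names, and the `process` callback (which would fetch one page and return any newly discovered URLs) are all hypothetical and not from the thread. SQLite is used here simply because it ships with Python. Each run works until it hits either an iteration cap or a time cap, then exits, and the next run picks up where it left off.

```python
import sqlite3
import time


def init_queue(conn):
    """Create the (hypothetical) jobs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " url TEXT PRIMARY KEY,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def enqueue(conn, urls):
    """Add URLs to the queue; URLs already known are silently skipped."""
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


def run_batch(conn, process, max_jobs=50, max_seconds=45.0):
    """Process pending jobs until either cap is reached.

    `process(url)` is supplied by the caller and must return an iterable
    of newly discovered URLs (possibly empty). Returns the number of
    jobs processed in this run.
    """
    deadline = time.monotonic() + max_seconds
    processed = 0
    while processed < max_jobs and time.monotonic() < deadline:
        row = conn.execute(
            "SELECT url FROM jobs WHERE done = 0 ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            break  # queue is empty, nothing left to do
        url = row[0]
        enqueue(conn, process(url))
        # Mark the job finished only after process() returned without error,
        # so a crashed run retries the same URL next time.
        conn.execute("UPDATE jobs SET done = 1 WHERE url = ?", (url,))
        conn.commit()
        processed += 1
    return processed
```

With something like this, a cron entry that runs the script every few minutes would chew through the forum in small batches. The caps of 50 jobs and 45 seconds are placeholders to tune, not recommendations.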
    Nerd Stuff (code, rrdtool, monitoring, etc):

    blog.anthonyhurst.com

  • #3
    Regular Coder
    Join Date
    Jul 2010
    Location
    Oregon City
    Posts
    280
    Thanks
    5
    Thanked 50 Times in 49 Posts
    Quote: Originally posted by fatecaresx13
    if you have some kind of unix account couldn't you basically cap your scripts with variables (i.e. iterate this many times, quit after this period of time, etc) and perhaps cron out jobs periodically so you're splitting the work up?

    Honestly, I have a general idea of what you're trying to do but don't know your limitations. I don't see an issue with just flat out running the queries :P
    For example, one forum category has at least 1800 threads in it, all with possibly up to 50 pages.

    Letting it run that long would take forever, and when I need to run it again to update the index it would take just as long.
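
To put numbers on that: 1800 threads at up to 50 pages each is up to 90,000 page fetches for that one category. The re-run does not have to repeat all of them, though. The sketch below assumes two things the thread does not confirm: that the thread listing shows each thread's current reply count, and that replies are only ever appended. `pages_to_fetch` is a hypothetical helper. Given how many replies of a thread are already indexed and how many it has now, it returns only the pages that contain new replies.

```python
REPLIES_PER_PAGE = 10  # from the first post: 10 replies per page


def pages_to_fetch(indexed_replies, current_replies, per_page=REPLIES_PER_PAGE):
    """Return the 1-based page numbers of a thread that hold unindexed replies.

    Assumes replies are append-only, so reply number n always sits on
    page (n - 1) // per_page + 1.
    """
    if current_replies <= indexed_replies:
        return []  # nothing new, skip the thread entirely
    # Page holding the first unindexed reply (reply number indexed_replies + 1).
    first = indexed_replies // per_page + 1
    # Page holding the newest reply.
    last = (current_replies - 1) // per_page + 1
    return list(range(first, last + 1))
```

For instance, a thread with 25 replies indexed that now has 31 would need pages 3 and 4 only, and a thread whose count has not changed would need no fetches at all.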

  • #4
    New Coder
    Join Date
    Jan 2010
    Posts
    29
    Thanks
    0
    Thanked 2 Times in 2 Posts
    Do you absolutely have to scrape? Can't you access a database or anything? I feel like there may be an API or view-only access of some kind for this kind of thing.

  • #5
    Regular Coder
    Join Date
    Jun 2010
    Posts
    278
    Thanks
    63
    Thanked 8 Times in 8 Posts
    Quote: Originally posted by fatecaresx13
    You absolutely have to scrape and can't access any database or anything? I feel like there may be an API or view only access of some kind for this kind of thing.
    Ditto - scraping pages rather than querying the DB directly seems a particularly bizarre way of doing things!

    It sounds like you're making a whole hell of a load of work for yourself when there's bound to be a much simpler solution.

