Site scraping.. job management.



Adee
01-11-2012, 11:42 PM
There's a forum I want to index so I can provide a search utility for its users, as it likely won't ever have one of its own.

Anyway, writing the code isn't the hard part. What I'm not sure about is how to queue the work when I first build the index, so the script isn't executing for an hour straight.

The forum has 25 or so categories, each with multiple pages of about 30 threads per page, and each thread has multiple pages of 10 replies per page. Realistically I want to process, say, 1 category and X pages of threads at a time.

What methods can I use to do this?

fatecaresx13
01-12-2012, 07:40 AM
If you have some kind of Unix account, couldn't you cap your scripts with variables (e.g. iterate this many times, quit after this much time, etc.) and run them periodically as cron jobs so the work gets split up?
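
To make that concrete, here is a minimal sketch of a capped, resumable batch. Everything named in it is an assumption for illustration: `run_batch`, the `process` callback (which would fetch and index one page), and the JSON state file are hypothetical, not anything from the forum software. It stops after a set number of items or a set number of seconds, whichever comes first, and saves its position so the next run picks up where this one left off.

```python
import json
import time
from pathlib import Path


def run_batch(work_items, process, state_path, max_items=50, max_seconds=60.0):
    """Process a slice of work_items, then remember where we stopped.

    Stops after max_items items or max_seconds seconds, whichever comes
    first. Returns the index of the next unprocessed item.
    """
    state_file = Path(state_path)
    # Resume from the saved cursor, or start at 0 on the very first run.
    if state_file.exists():
        cursor = json.loads(state_file.read_text())["cursor"]
    else:
        cursor = 0

    deadline = time.monotonic() + max_seconds
    done = 0
    while cursor < len(work_items) and done < max_items and time.monotonic() < deadline:
        process(work_items[cursor])  # hypothetical: fetch and index one page
        cursor += 1
        done += 1
        # Save after every item so a crash loses at most one item of progress.
        state_file.write_text(json.dumps({"cursor": cursor}))

    return cursor
```

Here `work_items` could be a list of (category, page number) pairs built up front. Saved as a script (say, a hypothetical `scrape_batch.py`), it could then be scheduled with a crontab entry such as `*/10 * * * * python3 scrape_batch.py` to run every ten minutes until the cursor reaches the end of the list.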

Honestly, I have a general idea of what you're trying to do but don't know your limitations. I don't see an issue with just flat out running the queries :P

Adee
01-12-2012, 06:22 PM
For example, one forum category has at least 1800 threads in it, each with possibly up to 50 pages.

Running through all of that in one go would take forever, and when I need to run it again to update the index it would take just as long.
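
As a rough back-of-the-envelope check on those numbers: the 1800 threads and 50 pages are the figures above, while the 1 second per request is purely an assumed figure, not something measured. `estimated_hours` is just a throwaway helper for the arithmetic.

```python
def estimated_hours(threads, pages_per_thread, seconds_per_request):
    """Upper bound on the crawl time for one category, in hours."""
    total_requests = threads * pages_per_thread
    return total_requests * seconds_per_request / 3600


# 1800 threads and up to 50 pages each are the figures from this post;
# 1 second per request is an assumption, not a measurement.
hours = estimated_hours(1800, 50, 1.0)
print(f"{1800 * 50} page requests, about {hours:.0f} hours")
```

That works out to as many as 90,000 page requests for a single category, or around 25 hours at the assumed rate.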

fatecaresx13
01-13-2012, 08:35 AM
Do you absolutely have to scrape? Can't you access a database or anything? I feel like there may be an API or view-only access of some kind for this kind of thing.

XmisterIS
01-13-2012, 09:22 AM
Ditto - scraping pages rather than querying the DB directly seems a particularly bizarre way of doing things!

It sounds like you're making a whole hell of a load of work for yourself when there's bound to be a much simpler solution.


