12-23-2007, 11:16 AM
I want to build a PHP program that can crawl the sitemaps of different websites and then store the data (such as real estate property listings) in my own database. Currently, I'm only able to store the links. Please help!
12-23-2007, 08:15 PM
It involves using PHP and a ton of scripting to parse each domain,
because each domain is different. This will be a big scripting project,
and will take you a lot of time to do. You might want to hire someone.
12-23-2007, 08:34 PM
Before you move any further, you need to check with your state's real estate commission and your local association of Realtors on the rules for IDX/ILD in that particular area.
As for the problem at hand, you can run through the remote file line by line, looking for traits of the information that you want to import. But it's not worth the time involved when you can get an FTP pull of the information every night from your local board of Realtors.
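For what it's worth, the line-by-line approach could look roughly like the sketch below. The function name scanListingLines and both regex patterns are made up for illustration; every real site will need its own patterns, and the sample lines are invented markup, not taken from any actual listing page.

```php
<?php
/**
 * Scan lines of a page for a few "traits" and return the first match
 * found for each field. The patterns are hypothetical examples only.
 */
function scanListingLines(iterable $lines): array
{
    $patterns = [
        'price'    => '/\$\s*([\d,]+)/',                  // e.g. "$325,000"
        'bedrooms' => '/(\d+)\s*(?:bed|bedroom)s?\b/i',   // e.g. "3 bedrooms"
    ];

    $found = [];
    foreach ($lines as $line) {
        foreach ($patterns as $field => $regex) {
            // Keep only the first hit per field.
            if (!isset($found[$field]) && preg_match($regex, $line, $m)) {
                $found[$field] = str_replace(',', '', $m[1]);
            }
        }
    }
    return $found;
}

// Invented sample markup standing in for a downloaded page.
$sample = [
    '<h1>Sunny bungalow</h1>',
    '<span class="price">$325,000</span>',
    '<li>3 bedrooms</li>',
];
print_r(scanListingLines($sample));
```

For a remote page, file($url) returns the page as an array of lines when allow_url_fopen is enabled, so its result could be passed straight in. Regex matching like this breaks as soon as the site changes its markup, which is part of why the nightly FTP pull is the better deal.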
12-24-2007, 06:58 PM
Even if you have to pay a monthly or yearly fee for FTP access to
a realtor database ... I think it would be worth it. That would be the
best solution to this. The information would all be in one place and
accessible in an easy way.
12-26-2007, 12:55 PM
Hi! Now I need to scrape data from one site only, but the site contains a huge amount of data.
12-26-2007, 02:01 PM
I wrote code that follows each link in the sitemap, opens the page, reads and parses its structure, and then inserts the required data into my database. But it's no good, since it takes a hell of a lot of time. Any suggestions for an easier and faster method are most welcome.
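For reference, the first step (getting the page links out of the sitemap) is roughly the sketch below. It assumes the sitemap is a standard XML sitemap with the URLs in loc elements and PHP 7 or later with the DOM extension; the function name sitemapUrls is just a placeholder, not my exact code.

```php
<?php
/**
 * Return the unique URLs listed in <loc> elements of an XML sitemap.
 * Returns an empty array if the input is empty or not well-formed XML.
 */
function sitemapUrls(string $xml): array
{
    if (trim($xml) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Suppress parser warnings; loadXML() returns false on malformed input.
    if (!@$doc->loadXML($xml)) {
        return [];
    }

    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $url = trim($loc->textContent);
        if ($url !== '') {
            $urls[] = $url;
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($urls));
}
```

Each URL that comes back is then fetched and parsed one after another, and that sequential part is where all the time goes.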
12-26-2007, 07:43 PM
OK, I have used cURL. But still, the site I want to scrape has hundreds of links, and when I open all those links and parse the HTML, it's the same problem again - the time.
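One thing that might help with the time is fetching several pages at once with PHP's curl_multi functions instead of one request after another. The following is only a rough sketch under some assumptions: PHP 7 or later with the cURL extension, a made-up function name fetchAll, an arbitrary batch size of 10, and the simplification that an empty response counts as a failure.

```php
<?php
/**
 * Fetch many URLs concurrently, one batch at a time.
 * Returns an array mapping each URL to its body, or null on failure.
 */
function fetchAll(array $urls, int $batchSize = 10): array
{
    $results = [];
    $urls = array_values(array_unique($urls));

    foreach (array_chunk($urls, max(1, $batchSize)) as $batch) {
        $mh = curl_multi_init();
        $handles = [];

        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }

        // Drive all transfers in this batch until none are still running.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                // Wait for activity on any handle instead of busy-looping.
                curl_multi_select($mh, 1.0);
            }
        } while ($running && $status === CURLM_OK);

        foreach ($handles as $url => $ch) {
            $body = curl_multi_getcontent($ch);
            // Simplification: treat a missing or empty body as a failure.
            $results[$url] = ($body === null || $body === '') ? null : $body;
            curl_multi_remove_handle($mh, $ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}
```

Calling fetchAll($links, 10) would download ten pages at a time, so the waiting overlaps instead of adding up. The parsing and database inserts still run afterwards as before, and a small batch size keeps the load on the remote server reasonable.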
12-26-2007, 10:50 PM
PHP isn't really the most efficient tool for scraping large sites (or many small sites), and you can't do anything about the time it takes to scrape them other than caching images and other large items.
12-27-2007, 05:43 AM
So what is the solution to my problem? Is there any solution?