Searching HTML links recursively for specific files on a home PC
I have no idea where to post this request for help as there are no categories here that fit.
So here goes...
I downloaded 60,000+ files from
I wanted ONLY the Supreme Court cases and used HTTrack to get them. After numerous attempts and a few weeks' work I was able to download the files, even though I knew I had too many (I had manually gone through all the pages to find out how many files there were per volume/folder). Somehow HTTrack downloaded the Court of Appeals cases too, even though they were only accessible one level higher.
However it happened, I now have 60,000+ files on my hard drive (30 GB of data on an 80 GB external USB drive bought specially for this task). To make it even tougher, the Supreme Court cases are mixed up with the Court of Appeals cases in two folders:
a folder called fol (for follow) that has all the html pages that reference the pdf files which are in a folder called doc (for document). The folders are ONLY numbered and the html files have names made up like so:
Row1.html, Row31.html, Row61.html, etc. These names are the same for all files in all folders. So it is impossible to find which files are Supreme Court files and which ones are Court of Appeals files by looking at the names of the html files.
I need to get rid of all the Court of Appeals files from this 'database' (I call it that even though the files are normal html and pdf files - the html files reference the pdf files).
There is a page which I can start from that shows three folders:
Court of Appeals
When the Archives folder is clicked it goes to a page that shows:
Court of Appeals
So I need to have some sort of program or script do these things:
Now I thought that there would be a program to do what I consider a simple task. After all, there are web spiders/crawlers, link testers, link crawlers, duplicate file deleters, web-ripping programs, etc. But nooooooooooo. There is no such program that does the above that I can find. What makes it hard is finding the right search terms to put into a search engine to attempt to find this type of program.
Does anyone know of a program that can do what I described above? Failing that, what technology would I have to look to, to do this?
Any help or suggestions would be greatly appreciated.
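As a rough illustration of the kind of script this calls for, here is a minimal Python sketch that starts from one local HTML page, follows every local link recursively, and returns the set of files it can reach (the html pages and the pdf files they reference). It assumes the downloaded pages link to each other with relative local paths, which may not match how this particular mirror was saved. The names `LinkCollector` and `collect_reachable` are made up for the example, and it only lists files; it does not delete anything.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote


class LinkCollector(HTMLParser):
    """Gathers the href value of every <a> tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_reachable(start_page):
    """Return the set of local files reachable by following links from start_page.

    Assumption: links between the downloaded pages are relative local paths.
    """
    start = os.path.abspath(start_page)
    seen = set()
    pending = [start]
    while pending:
        path = pending.pop()
        if path in seen or not os.path.isfile(path):
            continue  # already visited, or the link points at a missing file
        seen.add(path)
        if not path.lower().endswith((".html", ".htm")):
            continue  # pdf and other files are recorded but not parsed
        parser = LinkCollector()
        with open(path, encoding="utf-8", errors="replace") as handle:
            parser.feed(handle.read())
        for href in parser.hrefs:
            parts = urlparse(href)
            if parts.scheme or parts.netloc or not parts.path:
                continue  # skip external links and pure #fragment links
            target = os.path.normpath(
                os.path.join(os.path.dirname(path), unquote(parts.path))
            )
            pending.append(target)
    return seen
```

One way to use it would be to run it once from the Court of Appeals page and once from the Supreme Court page, then treat the files found only in the first set as candidates for removal, after checking the list by hand.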
This thread has also been posted here:
It has been sorted out and I consider it closed.