Searching HTML links recursively for specific files on a home PC
I have no idea where to post this request for help as there are no categories here that fit.
So here goes...
I downloaded 60,000+ files from
I wanted ONLY the Supreme Court cases and used HTTrack to get them. After numerous attempts and a few weeks' work, I was able to download the files, but I knew I had too many (I had manually gone through all the pages to count how many files there were per volume/folder). Somehow HTTrack downloaded the Court of Appeals cases too, even though they were only accessible one level higher.
However it happened, I now have 60,000+ files on my hard drive (30 GB of data on an 80 GB external USB drive purchased specially for this task). To make it even tougher, the Supreme Court cases are mixed in with the Court of Appeals cases in two folders:
a folder called fol (for follow) that holds all the html pages, and a folder called doc (for document) that holds the pdf files those pages reference. The subfolders are ONLY numbered, and the html files have names like so:
Row1.html, Row31.html, Row61.html, etc. These names are the same in every folder, so it is impossible to tell Supreme Court files from Court of Appeals files by looking at the names of the html files.
I need to get rid of all the Court of Appeals files from this 'database' (I call it that even though the files are normal html and pdf files - the html files reference the pdf files).
There is a page which I can start from that shows three folders:
Court of Appeals
When the Archives folder is clicked it goes to a page that shows:
Court of Appeals
So I need to have some sort of program or script do these things:
- Allow me to input a folder or an html file name (or any other type of file) to start from or with
- The program must follow all links in these html files (and also recursively search other sub-levels or sub-folders) to the links that reference the pdf files (which all have different file names)
- It must create a list of all these files and folders or have an option to just show the folders that contain these files (as there are multiple files in these folders - therefore just the folder would be good enough in this case)
- Besides each item shown there must be a checkbox to choose the file/folder and a 'Select All' or 'Unselect All' feature at the top of the list to do this if one chooses.
- After the selections are done there needs to be a button to begin the next process which is...
- Deleting ALL pdf files referenced in this search only (which are in subfolders of the 'doc' folder), deleting ALL the folders that contained those pdf files, deleting ALL html files that referenced the pdf files (which are in subfolders of the 'fol' folder), deleting ALL the folders that contained those html files, and deleting ALL folders and html files that lead from the original file/folder input at the start of the process. There also needs to be an option to MOVE the files/folders to another folder on the same partition, or to any other location specified by the user. After all, someone may not want to delete everything and may want to keep it, but separated out in a completely different location (I am thinking this program could be useful to more people than just me).
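For what it's worth, the link-following part of the list above could be sketched in Python with nothing but the standard library. This is only a rough sketch under assumptions, not a finished tool: it assumes the pages link to each other and to the pdf files with ordinary relative `<a href="...">` links, and the names `LinkParser` and `crawl` are made up for illustration. It has no checkboxes or GUI; it just returns the html pages and pdf files reachable from a starting page.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, root):
    """Follow local links from `start`, never leaving `root`.

    Returns (html_files, pdf_files) as sets of absolute paths.
    Assumption: links are relative hrefs pointing at files on disk.
    """
    root = Path(root).resolve()
    html_files, pdf_files = set(), set()
    stack = [Path(start).resolve()]
    while stack:
        page = stack.pop()
        if page in html_files or not page.is_file():
            continue
        html_files.add(page)
        parser = LinkParser()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for href in parser.links:
            parts = urlparse(href)
            if parts.scheme or parts.netloc or not parts.path:
                continue  # skip http:, mailto:, and pure #anchor links
            target = (page.parent / unquote(parts.path)).resolve()
            try:
                target.relative_to(root)  # ignore anything outside root
            except ValueError:
                continue
            suffix = target.suffix.lower()
            if suffix == ".pdf" and target.is_file():
                pdf_files.add(target)
            elif suffix in (".html", ".htm"):
                stack.append(target)
    return html_files, pdf_files
```

Calling `crawl` with a starting page and the top-level download folder, then taking `sorted({p.parent for p in pdf_files})`, would give the "just show the folders" view. Printing that list and checking it by eye before touching anything seems safer than deleting straight away.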
I would have to use the program in two stages. First to delete the Court of Appeals files in the Archives directory by inputting the folder number that takes you to this page, then by doing the same in the directory showing all three options and choosing Court of Appeals (and inputting the folder number that takes you to this page).
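Each of those two stages would end with the same delete-or-move step, so that part could be sketched once and reused. Again this is only a sketch under assumptions: `remove_or_move` is a made-up name, it expects a list of file paths that all live under one top-level folder (`root`), and it defaults to a dry run that changes nothing so the list of planned actions can be reviewed first.

```python
import shutil
from pathlib import Path


def remove_or_move(files, root, move_to=None, dry_run=True):
    """Delete `files`, or move them under `move_to` keeping their layout.

    Folders left empty afterwards are removed, but never `root` itself.
    With dry_run=True nothing is changed; the planned actions are returned.
    """
    root = Path(root).resolve()
    actions = []
    parents = set()
    for f in sorted(Path(p).resolve() for p in files):
        rel = f.relative_to(root)  # raises ValueError if f is outside root
        if move_to is not None:
            dest = Path(move_to) / rel
            actions.append(("move", f, dest))
            if not dry_run:
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.move(str(f), str(dest))
        else:
            actions.append(("delete", f, None))
            if not dry_run:
                f.unlink()
        parents.add(f.parent)
    if not dry_run:
        # Clean up emptied folders, deepest first, walking up towards root.
        for folder in sorted(parents, key=lambda p: len(p.parts), reverse=True):
            while folder != root and folder.is_dir() and not any(folder.iterdir()):
                folder.rmdir()
                folder = folder.parent
    return actions
```

Running it first with the default `dry_run=True`, and preferring `move_to` over outright deletion, keeps a mistake recoverable: a moved file can be moved back, a deleted one cannot.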
Now I thought that there would be a program to do what I consider a simple task. After all, there are web spiders/crawlers, link testers, link crawlers, duplicate file deleters, web-ripping programs, etc. But nooooooooooo. I cannot find any program that does the above. Part of what makes it hard is working out the right search terms to put into a search engine to find this type of program.
Does anyone know of a program that can do what I described above? Failing that, what technology should I look at to do this myself?
Any help or suggestions would be greatly appreciated.