Is there a script that can intelligently and recursively search for certain files?
I have no idea where to post this request for help (I also posted it in Computer Programming), as there are no categories here that seem to fit.
So here goes...
I downloaded 60,000+ files from
I wanted ONLY the Supreme Court cases and used HTTrack to do this. After numerous attempts and a few weeks' work, I was able to download the files, even though I knew I had too many (I had manually gone through all the pages to find out how many files there were per volume/folder). Somehow HTTrack downloaded the Court of Appeals cases too, even though they were only accessible one level higher.
However it happened, I now have 60,000+ files on my hard drive (30GB of data downloaded onto an 80GB hard drive, an external USB drive specially purchased for this task). To make it even tougher, the Supreme Court cases are mixed in amongst the Court of Appeals cases in two folders:
a folder called fol (for follow) that holds all the HTML pages, which reference the PDF files in a folder called doc (for document). The folders are ONLY numbered, and the HTML files have names like:
Row1.html, Row31.html, Row61.html, etc. These names are the same for all files in all folders, so it is impossible to tell which files are Supreme Court cases and which are Court of Appeals cases just by looking at the names of the HTML files.
I need to get rid of all the Court of Appeals files from this 'database' (I call it that even though the files are ordinary HTML and PDF files; the HTML files reference the PDF files).
There is a page which I can start from that shows three folders:
Court of Appeals
When the Archives folder is clicked it goes to a page that shows:
Court of Appeals
So I need to have some sort of program or script do these things:
Now I thought that there would be a program or script to do what I consider a simple task. After all, there are web spiders/crawlers, link testers, link crawlers, duplicate file deleters, web-ripping programs, etc. But nooooooooooo. There is no such program or script that does the above that I can find. What makes it hard is finding the right search terms to put into a search engine to attempt to find this type of program.
Does anyone know of a program or PHP script that can do what I described above? Failing that, what technology would I have to look to, to do this?
Any help or suggestions would be greatly appreciated.
Are you trying to check these files by their content (the code)? That seems like it would be a lengthy process, and scripting it could be unreliable if the files aren't all generated the same way. For instance, you would want to look for specific keywords, but one page might be missing a keyword, or another might contain those keywords for a different reason (external links, general webpage keywords, or something else).
If this is what you want, I can tell you from the start that it is possible to do in PHP. However, if you don't have a PHP server installed, or don't know the language, I would look around for a better language to use (someone else may suggest one that works better; I'm not too sure).
The main functions I would look into are scandir, is_dir, file_get_contents, stripos (or strpos, depending on whether you want it case-sensitive) and foreach (to loop through each file/directory).
Since I don't know the exact file storing structure or the exact way you are going to search for these, I can't really post code that would help you.
Here would be a sample of a file crawler (possibly, not really testing).
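A minimal sketch of such a crawler, assuming the numbered-folder layout described above (the directory name 'fol' is taken from the earlier post and is only illustrative):

```php
<?php
// Recursively collect every file path under $dir, descending into
// the numbered subfolders. Untested against the real download tree.
function listFiles($dir) {
    $files = [];
    foreach (scandir($dir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        $path = $dir . DIRECTORY_SEPARATOR . $entry;
        if (is_dir($path)) {
            // Recurse into subfolders and merge their results.
            $files = array_merge($files, listFiles($path));
        } else {
            $files[] = $path;
        }
    }
    return $files;
}

// Example usage (guarded, since 'fol' may not exist here):
if (is_dir('fol')) {
    foreach (listFiles('fol') as $file) {
        echo $file, "\n";
    }
}
```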
As for the file checking, you could do something like this, after the list function above works.
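Something along these lines, using file_get_contents and stripos. Note that "Supreme Court" here is only a guessed marker phrase; the real pages would need to be inspected to confirm a keyword that reliably distinguishes the two courts:

```php
<?php
// Case-insensitively check whether a page mentions a given phrase.
// "Supreme Court" below is an assumed keyword, not a verified one.
function mentionsPhrase($path, $phrase) {
    $contents = file_get_contents($path);
    return $contents !== false && stripos($contents, $phrase) !== false;
}

// Example: split a file list into keep/remove buckets by keyword.
$keep = [];
$remove = [];
foreach (['fol/1/Row1.html'] as $file) { // e.g. output of the crawler above
    if (is_file($file) && mentionsPhrase($file, 'Supreme Court')) {
        $keep[] = $file;
    } else {
        $remove[] = $file;
    }
}
```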
Thanks for the thoughtful response.
I have spent the last 2 hours crafting a response with screenshots etc., but because it took so long I was timed out and my post was lost. It is 3:32am Sunday morning now, so I will continue on Sunday proper and try to recreate the post I just lost.
[Note to the MODS: I really do wish you would extend the time people can stay logged in while 'not doing anything', because your site doesn't seem able to tell when someone is typing, which is really silly. Even a simple script can do that.]
I think I may have had that problem once or twice before. I just hit back on my browser and my text was still there in the text box, so I copied it and pasted it back after refreshing the page or re-logging in (can't remember exactly which). Make sure you tick the 'remember me' box when you log in, if you didn't.
As for a script to check whether someone is typing, that would consume a lot more resources on the web server, from sending the notifications to and from all the users. It would be possible, and not a whole lot more bandwidth, but I don't know whether this site is limited on that or not. I'll assume it's unlimited, but it still doesn't seem necessary, as it would increase page loading times slightly; you wouldn't notice it much, but it would add up.
Can't do much about that, I'm afraid. The time could be ramped up, but it isn't really necessary overall. What I wish it would do is tell you that it will carry your post over, which it doesn't: if it logs you out whilst typing, simply log back in and it will continue to post for you (you won't lose what you were typing). My overall advice is to use the full 'post reply' page instead of the quick-post box at the bottom, and to periodically use the 'preview post' option. That way, if you hit back, you're guaranteed to get at least your last preview. :D
BTW, you can also use recursive directory iterators for your actual original question. I don't quite understand what ruleset you need in order to determine which case is which, but you can combine filters with the iterators so that you only return certain things (such as filenames matching a pattern, or technically even specific contents within a file, although I haven't done that with a recursive iterator before). 60K+ files is a lot. While PHP can handle it, if you actually need to read inside each file, I'd suggest a more robust language like C or C++ instead; a scripting language like PHP will just be too slow to go through 30GB of data in a reasonable period of time.
I don't know if it's an option for you, but if you were to use Linux for this, 90% of the job would already be done for you:
you could simply use 'find', 'grep' and a few lines of shell script ;)
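A rough sketch of that approach. The 'fol' directory name comes from the thread, but "Supreme Court" is only a guessed marker phrase; verify it against real pages before actually deleting anything:

```sh
#!/bin/sh
# List every HTML page under fol/ that does NOT mention the marker
# phrase -- these would be the Court of Appeals candidates to remove.
# grep -L prints the names of files with no match; -i ignores case.
find fol -name 'Row*.html' \
  -exec grep -L -i 'Supreme Court' {} +
```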
Thank you very much for all the comments.
I was able to get the problem sorted out in the end.
First I used 'Advanced Find and Replace 7' to strip out all references to external domains and urls not in the directories I wanted. Then I used HTTrack to re-index the downloaded website that was on my hard drive. I did this twice to capture two different sets of Supreme Court case data - one set was from 1857 to 2008 and the other set was from 2009 - 2013.
I then combined the 'fol' and 'doc' folders from both sets and changed the code in the index page (that HTTrack creates) to reflect that. It took another day or so but the whole job has now been done.
I now have everything the way I want it AND I have the original database too! So that's neat. Mind you, the total size of these two databases is around 54GB!!!
Sorry I didn't re-post the screenshots I took, but I was so disheartened by losing the post that I couldn't face doing it from scratch again and just got on with trying to figure it all out by myself.
By the way, a weird message came up when I tried to post the original post. When I got logged out on clicking submit, the login page came up, but when I tried logging back in it said something about not being able to log me in because I was logged into a post!?? I can't remember the exact wording, but it told me to press the back button, which I did, and I tried logging in again, but it didn't work. I knew from the first failed login that I had lost the post, so I gave up after that.
By the way, I was logged out again when writing post #7.
It is really pissing me off that the time given for writing a post is so short. This does need to be looked into, as the post took about 5 minutes or less to write.