  1. #1
    Regular Coder
    Join Date
    Sep 2010
    Location
    London, UK
    Posts
    167
    Thanks
    28
    Thanked 0 Times in 0 Posts

    Is there a script that can intelligently and recursively search for certain files?

    I have no idea where to post this request for help (I also posted it in Computer Programming), as there are no categories here that seem to fit.

    So here goes...

    I downloaded 60,000+ files from

    http://opinions.aoc.arkansas.gov/Web...startid=101893

    I wanted ONLY the Supreme Court cases and used HTTrack to do this. After numerous attempts and a few weeks' work, I was able to download the files, even though I knew I had too many (I had manually gone through all the pages to find out how many files there were per volume/folder). Somehow HTTrack downloaded the Court of Appeals cases too, even though they were only accessible one level higher.

    However it happened, I now have 60,000+ files on my hard drive (30GB of data downloaded onto an 80GB hard drive, an external USB drive purchased specially for this task). To make it even tougher, the Supreme Court cases are mixed up amongst the Court of Appeals cases in two folders:

    a folder called fol (for follow) holds all the html pages that reference the pdf files, which are in a folder called doc (for document). The sub-folders are ONLY numbered, and the html files have names like Row1.html, Row31.html, Row61.html, etc. These same names are used in every folder, so it is impossible to tell which html files belong to the Supreme Court and which to the Court of Appeals just by looking at the file names.

    I need to get rid of all the Court of Appeals files from this 'database' (I call it that even though the files are normal html and pdf files - the html files reference the pdf files).

    There is a page which I can start from that shows three folders:

    Archives
    Court of Appeals
    Supreme Court

    When the Archives folder is clicked it goes to a page that shows:
    Court of Appeals
    Supreme Court

    So I need to have some sort of program or script do these things:

    1. Allow me to input a folder or an html file name (or any other type of file) to start from or with
    2. The program must follow all links in these html files (and also recursively search other sub-levels or sub-folders) down to the links that reference the pdf files (which all have different file names) (a rough sketch of this step is shown after this list)
    3. It must create a list of all these files and folders or have an option to just show the folders that contain these files (as there are multiple files in these folders - therefore just the folder would be good enough in this case)
    4. Beside each item shown there must be a checkbox to select the file/folder, plus a 'Select All' / 'Unselect All' feature at the top of the list.
    5. After the selections are done there needs to be a button to begin the next process which is...
    6. Deleting ALL pdf files referenced in this search only (which are in the folders under the 'doc' folder), ALL the folders that contained those pdf files, ALL html files that referenced the pdf files (which are in the folders under the 'fol' folder), ALL the folders that contained those html files, and ALL folders and html files leading back up to the file/folder that was input at the start of the process. There also needs to be an option to MOVE the files/folders to another folder on the same partition, or to any other location the user specifies. After all, someone may not want to delete everything (I am thinking this program could be useful to more people than just me) and may want to keep it instead, just separated out in a completely different location.
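
    To illustrate the link-following in step 2, here is a rough, untested PHP sketch. The file and folder names are only examples, and it assumes the links are plain href attributes pointing at local files:
    PHP Code:
    function collect_pdfs($htmlfile, &$seen = array())
    {
        $pdfs = array();
        if(isset($seen[$htmlfile]))
            return $pdfs; // already visited this page
        $seen[$htmlfile] = true;

        $content = file_get_contents($htmlfile);
        if($content === false)
            return $pdfs; // unreadable page, skip it

        // pull out every href="..." value on the page
        preg_match_all('/href="([^"]+)"/i', $content, $matches);
        foreach($matches[1] as $link)
        {
            $path = dirname($htmlfile).'/'.$link; // resolve relative to this page
            if(!file_exists($path))
                continue; // external or dead link
            if(preg_match('/\.pdf$/i', $path))
                $pdfs[] = $path; // a referenced pdf
            elseif(preg_match('/\.html?$/i', $path))
                $pdfs = array_merge($pdfs, collect_pdfs($path, $seen)); // follow linked pages
        }
        return $pdfs;
    }
    Calling something like collect_pdfs('fol/12345/Row1.html') (that path is made up) would return the pdf paths reachable from that page, which could then be listed with checkboxes before anything is deleted or moved.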

    I would have to use the program or script in two stages: first to delete the Court of Appeals files in the Archives directory (by inputting the folder number that takes you to that page), then to do the same in the directory showing all three options, choosing Court of Appeals (and again inputting the folder number that takes you to that page).

    Now I thought that there would be a program or script to do what I consider a simple task. After all, there are web spiders/crawlers, link testers, link crawlers, duplicate file deleters, web-ripping programs, etc. But nooooooooooo. There is no such program or script that does the above that I can find. What makes it hard is finding the right search terms to put into a search engine to attempt to find this type of program.

    Does anyone know of a program or php script that can do what I described above? Failing that, what technology would I have to look to, to do this?

    Any help or suggestions would be greatly appreciated.
    Last edited by judgedredd; 04-07-2013 at 01:16 AM.

  • #2
    Regular Coder
    Join Date
    Sep 2011
    Posts
    213
    Thanks
    15
    Thanked 5 Times in 5 Posts
    Are you trying to identify these files by their content (the code)? That seems like it would be a lengthy process, and scripting it could be unreliable if the files aren't all generated the same way. For instance, you would want to look for specific keywords, but one page might be missing a keyword, or another page might contain those keywords for some other reason (external links, general webpage keywords, and so on).

    If this is what you want, it is certainly possible to do in PHP. However, if you don't have a PHP server installed, or don't even know the language, I would look around for a better-suited language (someone else may suggest one that works better; I'm not too sure).

    The main functions I would look into are scandir, is_dir, file_get_contents, stripos (or strpos, depending on whether you want it case-sensitive), and foreach (to loop through each file/directory).

    Since I don't know the exact file storing structure or the exact way you are going to search for these, I can't really post code that would help you.


    Here is a rough sample of a file crawler (not really tested).
    PHP Code:
    $files = scanpath(getcwd());

    function scanpath($directory)
    {
        $filelist = array();
        $files = scandir($directory);
        foreach($files as $name)
        {
            if($name === '.' || $name === '..')
                continue; // skip the current- and parent-directory entries
            if(is_dir($directory.'/'.$name))
                // recurse into sub-folders and merge their results
                $filelist = array_merge($filelist, scanpath($directory.'/'.$name));
            else
                $filelist[] = $directory.'/'.$name;
        }
        return $filelist;
    }

    I haven't tested the code, but it would be something like that; it will probably need some editing.


    As for the file checking, you could do something like this, after the list function above works.
    PHP Code:
    foreach($files as $file)
    {
        $content = file_get_contents($file);
        $keywords = array(); // List your words here
        foreach($keywords as $word)
        {
            if(stripos($content, $word) === false)
                continue 2; // Doesn't contain a keyword, skip this file
        }
        // Do whatever here for files that passed the check
    }

    Again, I haven't tested this, nor do I know for sure what you need, but if you have a decent knowledge of PHP it should definitely get you going toward what you need.
    Last edited by Dubz; 04-07-2013 at 02:10 AM.

  • #3
    Regular Coder
    Join Date
    Sep 2010
    Location
    London, UK
    Posts
    167
    Thanks
    28
    Thanked 0 Times in 0 Posts
    Hi Dubz,

    Thanks for the thoughtful response.

    I have spent the last 2 hours crafting a response with screen shots etc but because it took so long I was timed out and my post was lost. It is 3.32am Sunday morning now so I will continue Sunday proper and try to recreate my post that I just lost.

    [Note to the MODS: I really do wish that you would extend the time for people to be logged in and 'not be doing anything' because your site doesn't seem to be able to tell when someone is typing, which is really silly. Even a simple script can do that.]

  • #4
    Regular Coder
    Join Date
    Sep 2011
    Posts
    213
    Thanks
    15
    Thanked 5 Times in 5 Posts
    I think I may have had that problem once or twice before. I just hit back on my browser and my text was still there in the text box, so I copied it and pasted it back in after refreshing the page or logging back in (can't remember exactly which). Make sure you check the remember me box when you log in, if you didn't.

    As for a script to check whether someone is typing, that would consume more resources on the web server (sending the notifications) as well as for all the users. It would be possible, and not a whole lot more bandwidth, but I don't know whether this site is limited on that or not. I'm going to assume it's unlimited, but it still doesn't seem necessary, as it would slow page loading slightly; you wouldn't notice it much on any one page, but it would add up.

  • #5
    God Emperor Fou-Lu's Avatar
    Join Date
    Sep 2002
    Location
    Saskatoon, Saskatchewan
    Posts
    16,979
    Thanks
    4
    Thanked 2,659 Times in 2,628 Posts
    Can't do much about that I'm afraid. The time can be ramped up, but it isn't really necessary overall. What I wish it would do is tell you that it will carry your post over, which it doesn't. So if it logs you out whilst typing, simply log in again and it will continue the post for you (you won't lose what you were typing). My overall advice is to use the full post reply page instead of the quick post box at the bottom, and to periodically use the preview post option. That way you can hit back and it's guaranteed to be at least the last preview you did.

    BTW, you can use recursive directory iterators as well for your actual original question. I don't quite understand what ruleset you require in order to determine which is which, but you can combine filters with the iterators so that you only display certain things (such as titles matching a pattern, or technically even specific contents within a file although I haven't done that with a recursive iterator in the past). 60K+ files is a lot. While PHP can handle it, if you need to actually read anything within the file, I'd suggest a more robust language like C or C++ be used instead. Scripting languages like PHP will just be too slow to go through 30GB of data in a reasonable period of time.
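
    A minimal sketch of that iterator approach (untested here; the 'Supreme Court' keyword is just a placeholder for whatever actually distinguishes the pages, and 'fol' is only an example starting folder):
    PHP Code:
    // walk every file beneath a starting folder using the SPL iterators
    $start = 'fol'; // example starting folder
    $dirs  = new RecursiveDirectoryIterator($start, FilesystemIterator::SKIP_DOTS);
    $all   = new RecursiveIteratorIterator($dirs);

    $matches = array();
    foreach($all as $fileinfo)
    {
        if(strtolower($fileinfo->getExtension()) !== 'html')
            continue; // only interested in the html pages
        $content = file_get_contents($fileinfo->getPathname());
        if($content !== false && stripos($content, 'Supreme Court') !== false)
            $matches[] = $fileinfo->getPathname(); // page contains the keyword
    }
    print_r($matches);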
    PHP Code:
    header('HTTP/1.1 420 Enhance Your Calm'); 

  • #6
    Regular Coder patryk's Avatar
    Join Date
    Oct 2012
    Location
    /dev/couch
    Posts
    398
    Thanks
    2
    Thanked 64 Times in 64 Posts
    I don't know if that's an option for you, but if you were to use Linux for this, you would have 90% of the job already done for you.

    you could simply use 'find', 'grep' and a few lines of shell script

    -------------------------------------------------------------------------------
    "Real Programmers can write assembly code in any language" - Larry Wall

  • #7
    Regular Coder
    Join Date
    Sep 2010
    Location
    London, UK
    Posts
    167
    Thanks
    28
    Thanked 0 Times in 0 Posts
    Hi Everyone,

    Thank you very much for all the comments.

    I was able to get the problem sorted out in the end.

    First I used 'Advanced Find and Replace 7' to strip out all references to external domains and URLs outside the directories I wanted. Then I used HTTrack to re-index the downloaded website on my hard drive. I did this twice, to capture two different sets of Supreme Court case data: one set from 1857 to 2008 and the other from 2009 to 2013.
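
    For anyone doing the same without that tool, a rough, untested PHP sketch of the stripping step might look like this (the glob pattern and the replacement are only examples to adapt):
    PHP Code:
    // blank out absolute http(s) links so only the local fol/ and doc/ references remain
    foreach(glob('fol/*/*.html') as $page)
    {
        $html = file_get_contents($page);
        if($html === false)
            continue; // unreadable page, leave it alone
        $html = preg_replace('/href="https?:\/\/[^"]*"/i', 'href="#"', $html);
        file_put_contents($page, $html);
    }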

    I then combined the 'fol' and 'doc' folders from both sets and changed the code in the index page (that HTTrack creates) to reflect that. It took another day or so but the whole job has now been done.

    I now have everything the way I want it AND I have the original database too! So that's neat. Mind you, the total size of these two databases is around 54GB!!!

    Sorry I didn't re-post the screenshots I took, but I was so disheartened by losing the post that I couldn't face doing it from scratch again and just got on with trying to figure it all out by myself.

    By the way, a weird message came up when I tried to submit the original post. When I got logged out on clicking submit, it brought up the login page, but when I tried logging back in it said something about not being able to log me in because I was logged into a post!!?? I can't remember exactly what the words were. It told me to press the back button, which I did, and I tried logging in again, but it didn't work. I knew from that first failed login that I had lost the post, so I gave up after that.

  • #8
    Regular Coder
    Join Date
    Sep 2010
    Location
    London, UK
    Posts
    167
    Thanks
    28
    Thanked 0 Times in 0 Posts
    By the way, I was logged out again when writing post #7.

    It is really pissing me off that the time given for writing a post is so short. This does need to be looked into as it took about 5 mins or less to write.
    Last edited by judgedredd; 04-09-2013 at 12:50 AM.

  • #9
    Regular Coder
    Join Date
    Sep 2011
    Posts
    213
    Thanks
    15
    Thanked 5 Times in 5 Posts
    Quote Originally Posted by judgedredd View Post
    By the way, I was logged out again when writing the post below.

    It is really pissing me off that the time given for writing a post is so short. This does need to be looked into as it took about 5 mins or less to write.
    This sounds like a bigger problem than it first seemed. I have stayed logged in to this site for weeks and never got logged out, unless I clear my history, cookies, cache, etc. to keep things from clogging up and free more memory. Make sure you have cookies enabled and all that, and that you don't have any programs that clean them automatically (such as CCleaner).

