  1. #1
    New to the CF scene
    Join Date
    May 2016
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Remove duplicate lines in text file

    Hi everyone.
    Please help me resolve this issue.
    I have a large text file and I have to remove duplicate lines.
    The problem is that sometimes the duplication spans multiple lines.
    For example, the first two lines are unique and the pair appears three times in a row.
    I have to remove these repeated groups of lines, keeping one copy.
    I must delete duplicates only if they appear one after another.
    Once some other line appears, the check has to start again:
    the number of unique lines that form the repeated group has to be worked out afresh.

    Test data:
    Test line 1
    Test line 2
    Test line 1
    Test line 2
    Test line 1
    Test line 2

    Some other lines
    ............

    Test line 1
    Test line 2
    Test line 3

    Some other lines
    ............

    Test line 1
    Test line 2
    Test line 3
    Test line 1
    Test line 2
    Test line 3
    Test line 1
    Test line 2
    Test line 3

    Some other lines
    ............

    Test line 1
    Test line 1
    Test line 1

    Some other lines
    ............

    Test line 1
    Test line 2
    Test line 3
    Test line 4
    Test line 1
    Test line 2
    Test line 3
    Test line 4

    Some other lines
    ............

    Expected result:

    Test line 1
    Test line 2

    Some other lines
    ............

    Test line 1
    Test line 2
    Test line 3

    Some other lines
    ............

    Test line 1
    Test line 2
    Test line 3

    Some other lines
    ............

    Test line 1

    Some other lines
    ............

    Test line 1
    Test line 2
    Test line 3
    Test line 4

    Some other lines
    ............

  2. #2
    New Coder
    Join Date
    Nov 2016
    Location
    South Africa, Cape Town
    Posts
    49
    Thanks
    1
    Thanked 1 Time in 1 Post
    You could use:

    Code:
    seen = set()  # every distinct line written so far
    
    with open("old_filename.txt", "r") as old_file, \
            open("new_filename.txt", "w") as updated_file:
        for line in old_file:
            # Note: this drops a line that occurred anywhere earlier
            # in the file, not only lines repeated one after another.
            if line not in seen:
                updated_file.write(line)
                seen.add(line)
    Not tested but should work

  3. #3
    New Coder
    Join Date
    Nov 2016
    Location
    South Africa, Cape Town
    Posts
    49
    Thanks
    1
    Thanked 1 Time in 1 Post
    Please delete my previous post. I was not thinking clearly. That code is garbage. This will do the trick --

    Code:
    prev = None  # the line read just before the current one
    
    with open("new.txt", "r") as oldfile, open("updated.txt", "w") as updatedfile:
        for x in oldfile:
            x = x.rstrip("\n")
            if x == prev:
                # Same as the line directly before it: skip it.
                print("Duplicate found: {x}".format(x=x))
            else:
                updatedfile.write(x + "\n")
            prev = x
    # Note: this only catches a single line repeated back to back;
    # repeated groups of several lines are written unchanged.
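
For the multi-line case described in the first post (a group of lines that repeats directly after itself), here is a minimal sketch rather than a tested solution. It assumes the whole file fits in memory and that a repeated group is never longer than `max_block` lines. The names `collapse_repeated_blocks`, `dedupe_file` and `max_block` are made up for this example. At each position it tries the shortest group first, and when the next lines are an exact copy of the group it steps over the copies so that only one remains.

```python
def collapse_repeated_blocks(lines, max_block=50):
    """Keep one copy of any block of lines that repeats back to back.

    max_block is a hypothetical cap on how many lines one block may span.
    """
    out = []
    i = 0
    n = len(lines)
    while i < n:
        # Try the shortest block first: 1 line, then 2, and so on.
        for k in range(1, max_block + 1):
            if i + 2 * k > n:
                break  # not enough lines left for a second copy
            block = lines[i:i + k]
            if lines[i + k:i + 2 * k] == block:
                # Step over copies until the last one starts at position i.
                while lines[i + k:i + 2 * k] == block:
                    i += k
                break
        out.append(lines[i])
        i += 1
    return out


def dedupe_file(src, dst, max_block=50):
    """Hypothetical wrapper: read src, write the collapsed lines to dst."""
    with open(src, "r") as f:
        lines = [line.rstrip("\n") for line in f]
    with open(dst, "w") as f:
        for line in collapse_repeated_blocks(lines, max_block):
            f.write(line + "\n")
```

Usage would be something like `dedupe_file("new.txt", "updated.txt")`. Two caveats: this is a single pass, so a pattern such as A, A, B, A, A, B comes out as A, B, A, B and not A, B. Also, if the "Some other lines" filler in the test data is literally identical each time, a filler-plus-group sequence that repeats would itself be treated as a repeated block, so the sketch assumes the filler stands for lines that really differ.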


 
