How to remove duplicate lines inside a text file?

A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (the duplicates are useless in my case, as the file is a CSV-like data table).

What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result, each line is to be unique: if there were 100 equal lines (usually the duplicates are spread across the file and won’t be neighbours), only one of them is to be left.

I have written a program in Scala (consider it Java if you don’t know Scala) to implement this. But maybe there are C-written native tools able to do this faster?

UPDATE: the awk '!seen[$0]++' filename solution seemed to work just fine for me as long as the files were near 2 GiB or smaller, but now that I need to clean up an 8 GiB file it doesn’t work any more. It seems to take forever on a Mac with 4 GiB RAM, and a 64-bit Windows 7 PC with 4 GiB RAM and 6 GiB swap just runs out of memory. And I don’t feel enthusiastic about trying it on Linux with 4 GiB RAM given this experience.

Asked By: Ivan


Assuming you can afford to keep the de-duplicated file in memory (if your data is indeed duplicated by a factor of 100, that should be about 20 MiB plus overhead), you can do this very easily with Perl.

$ perl -ne 'print unless $dup{$_}++;' input_file > output_file

This preserves the order too.

You could extract the number of occurrences of each line from the %dup hash if you so wished, as an added free bonus.
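
For instance, a minimal sketch (the END block and the count-then-line output format are just one illustrative choice) that prints each distinct line prefixed by its number of occurrences:

$ perl -ne '$dup{$_}++; END { print "$dup{$_}\t$_" for sort keys %dup }' input_file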

If you prefer awk, this should do it too (same logic as the perl version, same ordering, same data gathered in the dup variable):

$ awk '{if (++dup[$0] == 1) print $0;}' input_file > output_file
Answered By: Mat

An awk solution seen on #bash (Freenode):

awk '!seen[$0]++' filename

If you want to edit the file in-place, you can use the following command (provided that you use a GNU awk version that implements this extension):

awk -i inplace '!seen[$0]++' filename
Answered By: enzotib

There’s a simple (which is not to say obvious) method using standard utilities which doesn’t require much memory except to run sort, which in most implementations has specific optimizations for huge files (a good external sort algorithm). An advantage of this method is that it only loops over all the lines inside special-purpose utilities, never inside interpreted languages.

<input nl -b a -s : |           # number the lines
sort -t : -k 2 -u |             # sort and uniquify ignoring the line numbers
sort -t : -k 1n |               # sort according to the line numbers
cut -d : -f 2- >output          # remove the line numbers

If all lines begin with a non-whitespace character, you can dispense with some of the options:

<input nl | sort -k 2 -u | sort -k 1n | cut -f 2- >output

For a large amount of duplication, a method that only requires storing a single copy of each line in memory will perform better. With some interpretation overhead, there’s a very concise awk script for that (already posted by enzotib):

<input awk '!seen[$0]++'

Less concisely: !seen[$0] {print} {seen[$0] += 1}, i.e. print the current line if it hasn’t been seen yet, then increment the seen counter for this line (uninitialized variables or array elements have the numerical value 0).
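
Written out as a full command (a sketch equivalent to the one-liner above, with the two actions annotated):

<input awk '
  !seen[$0] { print }    # print the line only if it has not been seen yet
  { seen[$0] += 1 }      # then count this occurrence
'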

For long lines, you can save memory by keeping only a non-spoofable checksum (e.g. a cryptographic digest) of each line. For example, using SHA-1, you only need 20 bytes plus a constant overhead per line. But computing digests is rather slow; this method will only win if you have a fast CPU (especially one with a hardware accelerator to compute the digests), not a lot of memory relative to the size of the file, and sufficiently long lines. No basic utility lets you compute a checksum for each line; you’d have to bear the interpretation overhead of Perl/Python/Ruby/… or write a dedicated compiled program. For example, in Perl with MD5 digests:

<input perl -MDigest::MD5 -ne '$seen{Digest::MD5::md5($_)}++ or print' >output

A Python one-liner (note that this sorts the output, so the original order is not preserved):

python3 -c "import sys; print(''.join(sorted(set(sys.stdin.readlines()))), end='')" < InputFile
Answered By: Rahul Patil

With bash 4, a pure-bash solution that takes advantage of associative arrays can be used. Here is an example:

unset llist; declare -A llist;
while read -r line; do
if [[ ${llist[$line]} ]]; then
  continue
else 
  printf '%s\n' "$line"
  llist[$line]="x"
fi
done < file.txt
Answered By: iruvar
sort -u big-csv-file.csv > duplicates-removed.csv

Note that the output file will be sorted.

Answered By: Vladislavs Dovgalecs
gawk -i inplace '!a[$0]++' SOME_FILE [SOME_OTHER_FILES...]

This command filters out repeated lines while preserving their order and saves the files right in place.

It accomplishes this task by keeping a cache of all unique lines and printing each one just once.

The exact algorithm can be broken down to this:

  1. Store the current line in variable $0
  2. Check if associative array a has key $0 and if not, create the key and initialize its value to 0
  3. If the key’s value is 0 (i.e. the line has not been seen before), print the current line
  4. Increment the value of the key by 1
  5. Fetch the next line and go to step 1, repeating until EOF is reached

or as pseudocode:

while read $line
do
    $0 := $line
    if not a.has_key($0) :
        a[$0] := 0
    if a[$0] == 0 :
        print($line)
    a[$0] := a[$0] + 1
done

NOTE: The command requires GNU awk (gawk) version 4.1, released in 2013, or newer
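
If you are unsure which version you have, gawk can report it itself:

gawk --version | head -n 1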

Answered By: rindeal

You can use uniq http://www.computerhope.com/unix/uuniq.htm

uniq reports or filters out repeated lines in a file, but it only detects duplicates that are adjacent, so the input usually needs to be sorted first.
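
For example (a sketch; because uniq only compares adjacent lines, the file is sorted first, which means the original order is not preserved):

sort input_file | uniq > output_file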

Answered By: Mahmoud Zalt

None of the answers here worked for me on my Mac, so I wrote a simple Python script. I am ignoring leading/trailing whitespace, I don’t care about memory consumption, and the script does not preserve the original line order.

import sys

inputfile = sys.argv[1]
outputfile = sys.argv[2]

# Read all lines of the input file
with open(inputfile) as f:
    content = f.readlines()

# Strip leading/trailing whitespace (including the trailing newline)
content = [x.strip() for x in content]

# Deduplicate via a set (this does not preserve the original order)
my_list = list(set(content))

# Write each unique line back out, one per line
with open(outputfile, 'w') as output:
    for item in my_list:
        output.write("%s\n" % item)

Save the above to unique.py and run like this:

python unique.py inputfile.txt outputfile.txt
Answered By: Freddie

SOLUTION WITHOUT MAINTAINING THE ORIGINAL SEQUENCE ORDER

I did it with the following command.

sort duplicates.txt | uniq > noDuplicates.txt

The sort command sorts the lines alphabetically, and the uniq command removes the duplicates.

NOTE: The reason we sort the lines first is that uniq does not detect duplicate lines unless they are adjacent.

Answered By: Caglayan DOKME

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %dup; .put unless %dup{$_}++;'  input_file > output_file

OR:

~$ raku -e '.put for lines.unique;'  input_file > output_file

[Note: compare the first answer here with the excellent Perl answer by @Mat.]

https://docs.raku.org
https://raku.org

Answered By: jubilatious1

Note that sort can write its output to any one of the files that it was given as input:

LC_ALL=C sort -u -o input input

That’s fine, as sort needs to have read all its input before it can start outputting anything (before it can tell which line sorts first, which could very well be the last line of the input).

sort will (intelligently) use temporary files so as to avoid loading the whole input in memory. You’ll need enough space in $TMPDIR (or /tmp if that variable is not set). Some sort implementations can compress the temp files (like with --compress-program=lzop with GNU sort) which can help if you’re short on disk space or have slow disks.
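
For example, a sketch assuming GNU sort with lzop installed; the choice of /var/tmp as the temporary directory is just illustrative:

TMPDIR=/var/tmp LC_ALL=C sort -u --compress-program=lzop -o input input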

With LC_ALL=C, the sort order is by byte value, which should speed things up and also guarantees a total and deterministic order (which you don’t always get otherwise, especially on GNU systems).

Answered By: Stéphane Chazelas