In some scenarios, files contain many duplicate lines, which makes the content hard to read. In Linux, the uniq command detects repeated adjacent lines, reports or removes them, and writes the filtered data to standard output or to a file.
The uniq command is a Linux command-line utility that identifies adjacent duplicate lines in an input file and prints the unique lines to standard output or writes them to an output file.
Most importantly, the uniq command can locate duplicate lines only if they are adjacent. So, the input file content usually needs to be sorted beforehand; the sort command can be used for this, and its output can be piped to uniq.
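For instance, a file with non-adjacent duplicates can be deduplicated by piping sort into uniq. A minimal sketch, using a hypothetical fruits.txt created for illustration:

```shell
# fruits.txt: a small sample file with non-adjacent duplicate lines
printf 'banana\napple\nbanana\napple\ncherry\n' > fruits.txt

# uniq alone misses the duplicates because they are not adjacent
uniq fruits.txt

# sorting first makes the duplicates adjacent, so uniq can collapse them
sort fruits.txt | uniq
# apple
# banana
# cherry
```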
uniq [OPTION]... [INPUT [OUTPUT]]
uniq command Options
Useful options of the uniq command:
|Option|Description|
|---|---|
|-u, --unique|Print only the lines that are never repeated in the input.|
|-d, --repeated|Print only the duplicated lines, one line per repeated group.|
|-c, --count|Prefix each line with the number of times it occurs in the input.|
|-D, --all-repeated|Print all the duplicated lines (every copy) and omit unique lines.|
|-z, --zero-terminated|Terminate each output line with a NUL (zero byte) instead of the default newline.|
|-f N, --skip-fields=N|Skip the first N fields when comparing lines for uniqueness.|
|-s N, --skip-chars=N|Skip the first N characters when comparing lines for uniqueness.|
|-i, --ignore-case|By default, comparisons are case sensitive; this option makes them case insensitive.|
|-w N, --check-chars=N|Compare at most the first N characters of each line. The opposite of -s N, which skips the first N characters.|
uniq options with examples
In the following examples, we will be using a text file called sample.txt with the below content.
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
No mercy
How are you..
How are you..
How are you.
Super Cars are the future.
-c option
The -c option prefixes each output line with the number of times that line occurs consecutively in the input file.
uniq -c sample.txt
As shown in the output, the count of each duplicated line group appears as a number before the line.
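Assuming a GNU coreutils environment and recreating sample.txt, the run can be sketched as follows (the counts are right-aligned with leading padding that may vary between implementations; it is trimmed in the comments below):

```shell
# recreate sample.txt from the article
cat > sample.txt <<'EOF'
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
No mercy
How are you..
How are you..
How are you.
Super Cars are the future.
EOF

uniq -c sample.txt
# 4 Remote working is the new Trend.
# 1 No mercy
# 2 How are you..
# 1 How are you.
# 1 Super Cars are the future.
```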
-d option
The -d option prints only the repeated lines; non-repeated lines are discarded.
uniq -d sample.txt
As expected, the following unique (non-repeated) lines have been omitted from the output. The command prints one line per repeated group, not every duplicate:
No mercy
How are you.
Super Cars are the future.
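A runnable sketch, recreating sample.txt first:

```shell
cat > sample.txt <<'EOF'
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
No mercy
How are you..
How are you..
How are you.
Super Cars are the future.
EOF

# print one line per repeated group
uniq -d sample.txt
# Remote working is the new Trend.
# How are you..
```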
-D option
The -D option prints all the duplicated lines from the input file. Unlike -d, it does not collapse each repeated group into a single line. Non-repeated lines are omitted as well.
uniq -D sample.txt
As expected, the command printed every copy of the duplicated lines in the input file.
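Note that -D (--all-repeated) is a GNU extension, so the sketch below assumes GNU coreutils:

```shell
cat > sample.txt <<'EOF'
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
No mercy
How are you..
How are you..
How are you.
Super Cars are the future.
EOF

# print every copy of each repeated line
uniq -D sample.txt
# Remote working is the new Trend.
# Remote working is the new Trend.
# Remote working is the new Trend.
# Remote working is the new Trend.
# How are you..
# How are you..
```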
-u option
The -u option displays all the non-repeated lines in the given text file. In short, it prints only the unique lines.
uniq -u sample.txt
Upon executing the above command, uniq prints only the unique lines.
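With sample.txt recreated, the expected result can be sketched as:

```shell
cat > sample.txt <<'EOF'
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
Remote working is the new Trend.
No mercy
How are you..
How are you..
How are you.
Super Cars are the future.
EOF

# print only the lines that are never repeated
uniq -u sample.txt
# No mercy
# How are you.
# Super Cars are the future.
```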
-f N option
With the -f option, you can ignore a given number of fields at the start of each line. A field is a run of characters delimited by whitespace.
When uniq checks lines for uniqueness, it skips the given number of fields and then prints one line per repeated group, along with all the unique lines. Let's use the following sample1.txt file as the input.
#1 Remote working is the new Trend.
#2 Remote working is the new Trend.
#3 Remote working is the new Trend.
#4 Remote working is the new Trend.
#5 No mercy
#6 How are you..
#7 How are you..
#8 How are you.
#9 Super Cars are the future.
In the above content, the first field of each line is a #number label. So let's ignore that first field when uniq compares the lines for uniqueness.
uniq -f 1 sample1.txt
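A runnable sketch; since uniq keeps the first line of each repeated group, the surviving #number labels come from the first line of each run:

```shell
cat > sample1.txt <<'EOF'
#1 Remote working is the new Trend.
#2 Remote working is the new Trend.
#3 Remote working is the new Trend.
#4 Remote working is the new Trend.
#5 No mercy
#6 How are you..
#7 How are you..
#8 How are you.
#9 Super Cars are the future.
EOF

# ignore the first whitespace-delimited field when comparing lines
uniq -f 1 sample1.txt
# #1 Remote working is the new Trend.
# #5 No mercy
# #6 How are you..
# #8 How are you.
# #9 Super Cars are the future.
```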
-s N option
This option is similar to the -f option, except that -s skips a given number of characters (rather than fields) from the start of each line when checking for duplicates.
Let's use the sample2.txt file with the following content.
AbEHOW ARE YOU?
*^#HOW ARE YOU?
089HOW ARE YOU?
$#@NO MERCY
pppNO MERCY
111NO MERCY
uniq -s 3 sample2.txt
As expected, the line 'HOW ARE YOU?' has been identified as duplicated: once the first 3 characters are skipped, all the 'HOW ARE YOU?' lines become identical. Similarly, the lines containing the 'NO MERCY' phrase are identified as a repeated group.
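The full run can be sketched as follows; again, the first line of each group is the one that survives:

```shell
cat > sample2.txt <<'EOF'
AbEHOW ARE YOU?
*^#HOW ARE YOU?
089HOW ARE YOU?
$#@NO MERCY
pppNO MERCY
111NO MERCY
EOF

# skip the first 3 characters of each line when comparing
uniq -s 3 sample2.txt
# AbEHOW ARE YOU?
# $#@NO MERCY
```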
-w N option
The -w option can be used to compare only a given number (N) of leading characters when checking lines for uniqueness. The output contains one line per repeated group, plus the unique lines.
The following input in the sample3.txt file will be used in this example.
$#1This is one line
$#1This is another line
Unique line
$$$New type line
$$$New type line to check
uniq -w 3 sample3.txt
In the above example, uniq compares only the first three characters of each line. So the first two lines are treated as duplicates, and the last two lines are treated as identical as well. The non-repeated line is printed too.
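A runnable sketch of the -w example:

```shell
cat > sample3.txt <<'EOF'
$#1This is one line
$#1This is another line
Unique line
$$$New type line
$$$New type line to check
EOF

# compare only the first 3 characters of each line
uniq -w 3 sample3.txt
# $#1This is one line
# Unique line
# $$$New type line
```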
-i option
The -i option ignores the case of the content in the input file. Lines that are duplicated once case is ignored are removed, and the unique lines are printed as shown in the following output.
We use the sample4.txt file which holds the following lines.
THIS IS CAPS LINE.
this is caps line.
this IS Caps LINE.
unique line.
uniq -i sample4.txt
When case is ignored, the phrase 'THIS IS CAPS LINE.' is duplicated in two more lines, so those repeated lines are removed from the output.
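A runnable sketch of the case-insensitive comparison:

```shell
cat > sample4.txt <<'EOF'
THIS IS CAPS LINE.
this is caps line.
this IS Caps LINE.
unique line.
EOF

# compare lines case-insensitively; the first line of the group survives
uniq -i sample4.txt
# THIS IS CAPS LINE.
# unique line.
```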
-z option
Usually, the uniq command produces newline-terminated output. This can be altered with the -z option: when specified, each output line is terminated with a NUL (zero byte) instead. The following syntax is used.
uniq -z input_file
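Because NUL bytes are invisible in a terminal, a quick way to inspect the result is to translate them back to newlines with tr. The sketch below assumes GNU uniq, where -z also treats the input as NUL-delimited:

```shell
# feed NUL-terminated lines to uniq -z, then make the output visible
printf 'apple\0apple\0banana\0' | uniq -z | tr '\0' '\n'
# apple
# banana
```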
To conclude, the uniq command in Linux detects matching adjacent lines in a given input file and filters out the identical lines as per your requirement. Several options, such as -c, -u, and -i, are available to shape the final output. As discussed, uniq needs matching lines to be adjacent, so unsorted input is usually piped through sort first. Overall, the uniq command is very useful when dealing with lengthy content that contains many repeated lines.