Remove duplicates by adding numerical suffix

Question

How do I append a numerical suffix to lines to remove duplicates?

Pseudo code:

if currLine.startsWith("tag:")
  x = numFutureLinesMatching(currLine)
  if (x > 0)
    currLine = currLine + "-" + zeroPad(x, 2)

Input file

tag:20230901-FAT
val:1034
tag:20230901-FAT
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Desired output

tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Notes:

The final duplicate must remain unchanged.
The earlier duplicates can have any suffix to be unique, so I chose a countdown.
Awk seems to be a good choice, but any common scripting language will work.

Asked By: Steven

Answer 1

Here we go, exactly as required:

awk '
    NR==FNR{
        if (/^tag:/) {
            a[$1]++
        }
        next
    }
    {
        c=--a[$1]
        if (c>0) {
            printf "%s-%.2d\n", $1, c
        } else {
            print
        }
    }
' file file

With explanations:

awk '
    # first block: runs only while reading the first copy of the file
    NR==FNR{                           # true only for the first file argument
        if (/^tag:/)                   # if the line starts with "tag:"
            a[$1]++                    # count occurrences, keyed by column 1
        next                           # skip the second block for this line
    }
    # second block: runs while reading the second copy of the file
    {
        c=--a[$1]                      # decrement the count; for tag lines, c = duplicates still to come
        if (c>0) {                     # a later duplicate exists
            printf "%s-%.2d\n", $1, c  # %s = string, %.2d = integer zero-padded to 2 digits
        } else {
            print                      # otherwise print the line unchanged
        }
    }
' file file

Output

tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500
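
The two-pass version reads the file twice, so it needs a regular file name. When the input comes from a pipe, a single pass that buffers the lines in memory may be an option. The following is only a minimal sketch under that assumption (the whole input must fit in memory). The function name suffix_dups and the array names are made up for illustration, and it keys on the whole line ($0) rather than on $1.

```shell
# Hypothetical helper: reads stdin, so it also works at the end of a pipeline.
suffix_dups() {
    awk '
        { line[NR] = $0 }                 # buffer every line in memory
        /^tag:/ { remaining[$0]++ }       # count occurrences of each tag line
        END {
            for (i = 1; i <= NR; i++) {
                cur = line[i]
                # decrement first: the result is the number of later duplicates
                if (cur ~ /^tag:/ && --remaining[cur] > 0)
                    printf "%s-%02d\n", cur, remaining[cur]
                else
                    print cur             # last occurrence or non-tag line
            }
        }
    '
}
```

It could be called as suffix_dups < file.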

Answered By: Gilles Quénot

Answer 2

awk can take arbitrary array indices – even a whole record ("line").

Match the lines that start with tag: and count each distinct line. The suffix is the count minus one, because the first occurrence is left unchanged:

awk '$0 ~ /^tag:/ { n[$0]++?$0=sprintf("%s-%02d",$0,n[$0]-1):1 }  1'

To make it a countdown, use tac twice:

tac infile | 
awk '$0 ~ /^tag:/ { n[$0]++?$0=sprintf("%s-%02d",$0,n[$0]-1):1 }  1' |
tac
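
For readers who find the ternary hard to follow, the same count-up logic can be spelled out with an explicit if. This is only a sketch of the idea, not a replacement for the one-liner; the function name count_up and the array name seen are hypothetical.

```shell
# Hypothetical helper: count-up suffixes, first occurrence left unchanged.
count_up() {
    awk '
        /^tag:/ {
            seen[$0]++                    # times this exact line has appeared so far
            if (seen[$0] > 1)             # second and later occurrences get a suffix
                $0 = sprintf("%s-%02d", $0, seen[$0] - 1)
        }
        { print }                         # print every line, modified or not
    '
}
```

For the countdown, it can sit between the two tac calls in the same way: tac infile | count_up | tac.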

Answered By: FelixJN

Answer 3

With perl:

#!/usr/bin/perl

use strict; use warnings;
use feature qw/say/;

my (%h, $c);
while (<>) {
    chomp;
    if (/^tag:/) {
        $c = sprintf "%.2d", ++$h{$_};
        if ($c>1) {
            say $_ . "-" . $c;
        } else {
            say;
        }
    } else {
        say $_;
    }
}

Usage:

./script file

Output:

tag:20230901-FAT
val:1034
tag:20230901-FAT-02
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX-02
val:1000
tag:20230901-FAT-03
val:1500

Answered By: Gilles Quénot

Answer 4

Using Raku (formerly known as Perl 6)

~$ raku -ne 'BEGIN my %hash; put /^tag:/ && %hash{$_}++ ?? $_ ~ sprintf("-%02d", %hash{$_}-1) !! $_;'   file

Above is the Raku version of an excellent awk answer posted by @EdMorton in a comment.

Start by calling Raku at the command line with the -ne flags, which run the code once per input line without autoprinting. A BEGIN block declares %hash before any input is read. The put statement then runs for every line. If the line matches /^tag:/ (that is, it starts with tag:), it is looked up as a key in %hash and its count is post-incremented with ++.

This && expression is the test part of Raku's "Test ?? True !! False" ternary operator. Because ++ is a post-increment, the test is true only when the same tag: line has been seen before. If true, the line $_ is output with its count minus one appended as a zero-padded suffix (the count is read back with %hash{$_}). If false, the line is output unchanged.

Sample Input:

tag:20230901-FAT
val:1034
tag:20230901-FAT
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500

Sample Output:

tag:20230901-FAT
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX-01
val:1000
tag:20230901-FAT-02
val:1500

The code above implements a count-up suffix, leaving the earliest tag: lines unchanged. To implement a count-down suffix that leaves the final tag: lines unchanged, use tac twice as shown in the accepted answer by @FelixJN. Below is the same approach on macOS, which provides tail -r instead of tac:

~$ tail -r  Steve_suffix.txt | raku -ne 'BEGIN my %hash; put /^tag:/ && %hash{$_}++ ?? $_ ~ sprintf("-%02d", %hash{$_}-1) !! $_;' | tail -r
tag:20230901-FAT-02
val:1034
tag:20230901-FAT-01
val:1500
tag:20230901-LAX-01
val:8934
tag:20230901-SMF
val:2954
tag:20230901-LAX
val:1000
tag:20230901-FAT
val:1500
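
Since tac is not installed everywhere and GNU tail does not support -r, a small wrapper can pick whichever is available. This is a minimal sketch, assuming a POSIX shell; reverse_lines is a hypothetical name, and the awk fallback buffers all input in memory.

```shell
# Hypothetical helper: reverse the lines of stdin with whichever tool exists.
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac
    elif tail -r </dev/null >/dev/null 2>&1; then
        tail -r                           # BSD/macOS tail accepts -r
    else
        # Fallback: store the lines in awk and print them last to first
        awk '{ line[NR] = $0 } END { for (i = NR; i >= 1; i--) print line[i] }'
    fi
}
```

It would then stand in for both tail -r calls in the pipeline above.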

https://unix.stackexchange.com/a/114043
https://docs.raku.org/language/operators#infix_??_!!
https://raku.org

Answered By: jubilatious1