How to de-duplicate blocks (timestamp + command) from bash history?

I’m working with a bash_history file containing blocks with the following format: #unixtimestamp\ncommand\n

Here’s a sample of the bash_history file:

#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308689
cpio -v -t -F init.cpio
#1713308690
ls
#1713308691
ls

My goal is to de-duplicate blocks entirely, meaning both the timestamp and the associated commands. I’ve attempted using awk, but this approach processes lines individually, not considering them as part of a block.

I’ve heard that using ignoredups prevents duplicates, but it won’t help in this case (unless you retype the exact command) because the duplicate commands are already in the file.

I’d appreciate suggestions on a more effective way to achieve this de-duplication.

EDIT:
As suggested by Ed Morton in the comments, here’s the expected output:

#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308690
ls

As a workaround, I added delete functionality to this program,
but I’m still open to other approaches that use existing commands.

Asked By: Yuki San

||

Using Perl, you may do:

perl -ge '
    @u = map {
        $c = $_;
        $c =~ s/^#[0-9]{10}\n//;

        exists($d{$c}) ?
            () :
            ($d{$c} = 1 && $_)
        ;
    } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

    print join("", @u);
' in

The big assumption is that ^#[0-9]{10}\n will always positively identify the start of an entry in the file.

The command is a bit dense, but the logic behind it is:

  • Read "in";
  • Split it into records, using ^#[0-9]{10}\n as a record separator, without consuming the separator (<> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg);
  • Process all records;
  • For each record, remove the separator (the line including the timestamp); if the remainder (the command) is in a list of already processed commands, ignore it, otherwise store the command exclusive of the separator in the list of already processed commands and store the command inclusive of the separator in an array;
  • Join the elements of the array on an empty string, printing the resulting string.

Breakdown of the regex:

  • ^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z): matches a line starting with a # character, followed by 10 digits and a newline; it then lazily matches anything (including newlines) until a new occurrence of ^#[0-9]{10}\n or the end of the string (\z) is found. The zero-length look-ahead assertion (?=...) keeps that next occurrence out of the current match, so the next match can capture it. The s modifier lets . match newlines, m lets ^ and $ match after and before a newline, and g captures every occurrence of the pattern in the string.

It works well on your sample input; I’ve also tested it with empty commands (a timestamp following another timestamp).

If duplicate entries are found, the first entry will be kept and later ones will be discarded.

% cat in
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308689
cpio -v -t -F init.cpio
#1713308690
ls
#1713308691
ls
% perl -ge '
        my @u = map {
                $c = $_;
                $c =~ s/^#[0-9]{10}\n//;

                exists($d{$c}) ? () : ($d{$c} = 1 && $_);
        } <> =~ /^#[0-9]{10}\n.*?(?=^#[0-9]{10}\n|\z)/smg;

        print join("", @u);
' in
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308690
ls
Answered By: kos

You didn’t show your attempt in awk, but the following awk program prints entries in the sense of

#[number]
[command consisting of one or more lines]

where the command is unique. The program is:

# dedup.awk
/^#[[:digit:]]+$/ {
    if (length(body) > 0)
    {
        if (!bodies[body])
        {
            bodies[body] = 1
            print header body
        }
        body = ""
    }
    header = $0
    next
}

{
    body = body "\n" $0
}

END {
    # Flush the last entry: no later timestamp line triggers its output.
    if (length(body) > 0 && !bodies[body])
        print header body
}

The output:

$ cat test.file
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308689
cpio -v -t -F init.cpio
#1713308690
ls
#1713308691
ls
$ awk -f dedup.awk test.file
#1713308636
cat > ./initramfs/init << "EOF"
#!/bin/sh
/bin/sh
EOF
#1713308642
file initramfs/init
#1713308686
cpio -v -t -F init.cpio
#1713308690
ls

Note that if "commands" such as

$ #12345678

are saved in the file, the above awk program will just skip them, e.g.:

$ cat test.file.2
#1234512312
#1231231233
#1231231231
cd
#1237192388
ls
#1231231231
cd
$ awk -f dedup.awk test.file.2
#1231231231
cd
#1237192388
ls

The program can be adjusted to accommodate cases like this one, but it requires more precise specification of the problem. For example, how to deal with:

#1234512312
#1231231233
#1231231231
#1231231233
#1231231231
#1231231233
#1231231231
cd
#1237192388
ls
#1231231231
cd
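
One possible reading, offered purely as an assumption rather than a specification: every command is preceded by exactly one timestamp line, so a timestamp-like line that directly follows a timestamp line is itself the command (a comment typed at the prompt). A minimal sketch under that assumption (the function name dedup_alternating and the variable names are hypothetical):

```shell
# dedup_alternating is a hypothetical name; the policy it implements is an
# assumption about the history format, not something the question specifies.
dedup_alternating() {
    awk '
        # Print the pending entry unless its command text was seen before.
        function emit() {
            if (have_body && !seen[body]++) {
                if (header != "") print header
                printf "%s", body
            }
            have_body = 0
            body = ""
        }
        # A timestamp-like line starts a new entry only when the current
        # entry already has a command (or when nothing has been read yet).
        /^#[0-9]+$/ && (have_body || header == "") {
            emit()
            header = $0
            next
        }
        # Anything else, including a timestamp-like line that directly
        # follows a timestamp line, is part of the command text.
        {
            body = body $0 "\n"
            have_body = 1
        }
        # Flush the final entry.
        END { emit() }
    ' "$@"
}
```

With this policy the example above keeps #1234512312 followed by #1231231233 as one entry, then the first cd and the ls entries. A multi-line command that contains a timestamp-like line somewhere in its middle would still be split there, so the ambiguity is reduced rather than removed.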

Edit: Thanks to @G-Man Says ‘Reinstate Monica’ for the optimization suggestion.

Answered By: Vilinkameni