Rewriting log parsing script in Perl or Python (get rid of awk)

I need accomplish a task of filtering activity of bots in the log file.

Solution should only show records meeting following criteria

  • user logged in, user changed password, user logged off within same second.
  • those actions (log in, change password, log off) happened one after another with no other entries in between.

Input data example

[a lot of data]
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:42 +0200|178.57.66.225|faaaaaa11111| - |user logged in| -
Mon, 22 Aug 2016 13:15:40 +0200|178.57.66.215|terdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged in| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed password| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed profile| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -
Mon, 22 Aug 2016 13:20:42 +0200|178.57.67.225|faaaa0a11111| - |user logged in| -
[a lot of data]

I’ve written the code below in order to complete the task

awk 'BEGIN { FS=" " } { c[$5]++; l[$5,c[$5]]=$0 } END { for (i in c) { if (c[i] == 3) for (j = 1 ; j <= c[i]; j++) print l[i,j] } }' $1

Usage:

./parse_log.sh logfile.log

Output:

Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -

But I want to know what alternative written in Perl or Python (with minimum usage of external libs) would look?

Asked By: Twissell

||

The solution itself is written in Python 3.

#!/usr/bin/env python3

import sys
import re
from collections import defaultdict


column_delimiter = sys.argv[1]
column = int(sys.argv[2]) - 1

records = defaultdict(list)

with open(sys.argv[3]) as inputfile:
  for lines in inputfile:
    line = lines.rstrip('n')
    row_record = line.split(column_delimiter)
    records[row_record[column]].append(line)

for timestamps in records.values():
    if len(timestamps) == 3:
        for i in range(len(timestamps)):
          if (re.search('logged in|changed password|logged off', timestamps[i])):
                  print(timestamps[i])

Usage: parse_log.py ' ' 5 logfile.log

The Python code is much easier to read and understand what it does.

Answered By: Twissell

This is not an answer, but it is too big for a comment and requires formatting, so to address your comment that "The Python code is much easier to read and to understand what it does.", FYI an AWK script with sensible variable names that does what I think your Python script does would look a lot like your python script but briefer since, for manipulating text, awk already does for you all the common stuff you have to write code to do in python:

awk -v column=5 '
    { records[$column] = records[$column] $0 ORS }
    END {
        for ( timestamp in records ) {
            if ( gsub(ORS,"&",records[timestamp]) > 2 ) {
                printf "%s", records[timestamp]
            }
        }
    }
' logfile.log

but reading the whole file into memory before processing it is a very inefficient approach to this problem. You should just do the test and print every time the time changes:

awk -v column=5 '
    $column != prev {
        prt()
        records = ""
        prev = $column
    }
    { records = records $0 ORS }
    END { prt() }

    function prt() {
        if ( gsub(ORS,"&",records) > 2 ) {
            printf "%s", records
        }
    }
' logfile.log
Answered By: Ed Morton

In Perl this can be written as a one-liner, but it feels a bit cluttered:

perl -MTime::Piece -F'|' -ae '$epoch=Time::Piece->strptime($F[0], "%a, %d %b %Y %H:%M:%S %z")->epoch; $diff=$epochlast2 - $epoch; $last =~ /user changed password/ && $last2 =~ /user logged in/ && $_ =~ /user logged off/ && $diff==0 && print $last2, $last, $_; $epochlast2=$epochlast; $epochlast=$epoch; $last2=$last; $last=$_' <<< "$data"
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -

alternatively, as a script:

use warnings;
use Time::Piece;

my $epochlast2=0;
my $epochlast=0;
my $last2="";
my $last="";

while($line = <STDIN>){
    $date=(split(/|/, $line))[0];
    $epoch=Time::Piece->strptime($date, "%a, %d %b %Y %H:%M:%S %z")->epoch; 
    $diff=$epochlast2 - $epoch;
    if ($last =~ /user changed password/ && $last2 =~ /user logged in/ && $line =~ /user logged off/ && $diff==0) {
        print $last2, $last, $line; 
    }
    $epochlast2=$epochlast; 
    $epochlast=$epoch; 
    $last2=$last; 
    $last=$line
}
Answered By: r_31415

Here you have the python version of my previous script (reading from stdin)

import sys
from datetime import datetime

last2 = ""
last = ""
datelast2 = None
datelast = None

for line in sys.stdin:
    line = line.rstrip()
    elems = line.split("|")
    date = datetime.strptime(elems[0], "%a, %d %b %Y %H:%M:%S %z")
    if datelast2 is not None:
        diff = datelast2 - date
    if (
        "user changed password" in last
        and "user logged in" in last2
        and "user logged off" in line
        and diff.seconds == 0
    ):
        print("{}n{}n{}".format(last2, last, line))
    last2 = last
    last = line
    datelast2 = datelast
    datelast = date
Answered By: r_31415

With any awk:

#!/usr/bin/awk -f
BEGIN { FS = "[|]" }

prvHour == $1 && prvUsr == $3 {
    if ($(NF-1) == "user logged in"        || 
        $(NF-1) == "user changed password" || 
        $(NF-1) == "user logged off" )
         actions[++actionCnt] = $0
    else actionCnt = 0
}

prvHour != $1 && prvUsr != $3 {
    if (prvHour && actionCnt == 3)
        for (i = 1; i <= actionCnt ; i++)
            print actions[i]

    prvHour = $1; prvUsr = $3
    actionCnt = 0 ; actions[++actionCnt] = $0
}

END {
    if (actionCnt == 3)
        for (i = 1; i <= actionCnt; i++)
            print actions[i]
}

With Perl without using external libraries :

/bin/perl -e '
    while (1) {
        $uli = $uli // <>;
        $ucp = <> if $uli =~ /^([^|]*)[|][^|]*[|]([^|]*)[|] - [|]user logged in[|] -$/;
        last if tell() < 0 ; 
        if (!defined $ucp) { $uli = undef ; next; }
        $ulo = <> if $ucp =~ /^(Q$1E)[|][^|]*[|]($2)[|] - [|]user changed password[|] -$/;
        last if tell() < 0;
        if (!defined $ulo) { $uli = $ucp ; $ucp = undef ; next ; }
        if ($ulo !~ /^Q$1E[|][^|]*[|]$2[|] - [|]user logged off[|] -$/) {
            $uli = $ulo ; $ucp = $ulo = undef ; next ;
        }
        print "$uli$ucp$ulo";
        $uli = $ucp = $ulo = undef;
    }
' sample

With python3 usgin just sys to read the file and call exit with meaning values.

#!/bin/python3
import sys

try:
    fullLine = actions = []
    prvHour = prvUsr = None
    chk_act = lambda x: x == "user logged in" or 
                 x == "user changed password" or 
                 x == "user logged off"


    with open(sys.argv[1]) as logFile:
        for line in logFile:
            hour, _, user, _, action, _ = line.split('|')
            if prvHour == hour and prvUsr == user:
                fullLine.append(line.strip())
                actions.append(action.strip())

            elif prvUsr != user and prvHour != hour:
                if len(actions) == 3 and all(map(chk_act, actions)):
                    print("n".join(fullLine))

                prvUsr = user
                prvHour = hour
                actions = []
                actions.append(action.strip())
                fullLine = []
                fullLine.append(line.strip())


except IndexError:
    print("usage {} logfile".format(sys.argv[0]))
    sys.exit(1)

except (FileNotFoundError, PermissionError):
    print("{} not found or permission permission denied", sys.argv[1])
    sys.exit(1)

With any sed:

#!/bin/sed -nf
N;/^([^|]*)|[^|]*|([^|]*)| - |user logged in| -n1|[^|]*|2| - |user changed password| -$/{
    N;/n([^|]*)|[^|]*|([^|]*)| - |user changed password| -n1|[^|]*|2| - |user logged off| -$/{
        p;b
    }
    s/.*n/n/g;D
}
D

All solutions avoid to store the whole data in memory

Answered By: DanieleGrassini

Using Raku (formerly known as Perl_6)

~$ raku -MDateTime::Parse -e 'my @a; my @b; for lines() {   
         @a.push($_.split(/ "|" | s "-" s? /, :skip-empty));  
         @b.push(DateTime::Parse.new(@a.[*-1][0]).posix)}; my %c.=append([Z=>] @b, @a>>.[1..*-1]);  
         for %c.sort -> $i {$i.raku.put if $i.values.contains("user logged in" & "user changed password" & "user logged off")  
         && ($i.value.elems == 9)}'  file

Sample Output:

"1471864539" => $["178.57.66.225", "fxsciaqulmlk", "user logged in", "178.57.66.225", "fxsciaqulmlk", "user changed password", "178.57.66.225", "fxsciaqulmlk", "user logged off"]
"1471864559" => $["178.57.66.205", "erdsfsdfsdf", "user logged in", "178.57.66.205", "erdsfsdfsdf", "user changed password", "178.57.66.205", "erdsfsdfsdf", "user logged off"]
"1471864759" => $["178.56.66.225", "fxsciaqulmla", "user logged in", "178.56.66.225", "fxsciaqulmla", "user changed password", "178.56.66.225", "fxsciaqulmla", "user logged off"]

If you’re looking for a scripting language to use at the command line, you might want to consider Raku. The code is relatively short, and includes a few niceties–including taking advantage of Raku’s built-in hashing capabilities.

Reading the code above, lines are read-in and each line is split on | pipes and/or s "-" s? whitespace-dash. This returns four elements per line (zero-index = 0..3). Each line is pushed onto the @a array. Then the first column of each line is parsed using DateTime::Parse.new() to return posix seconds, which is pushed onto the @b array.

From these two arrays a hash %c is created using Raku’s [Z=>] Zip-reduction metaoperator. This gives a %c hash with posix seconds as key, and zero-indexed columns 1,2,3 as value. As the elements are appended to the %c hash, their values append to the appropriate posix key. Finally in the output, for each posix-key element, values are checked to make sure they contain the exact three strings requested, and also checked that exactly 9 elements exist (three rows of three columns).


Sample Input:

Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:42 +0200|178.57.66.225|faaaaaa11111| - |user logged in| -
Mon, 22 Aug 2016 13:15:40 +0200|178.57.66.215|terdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged in| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed password| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed profile| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -
Mon, 22 Aug 2016 13:20:42 +0200|178.57.67.225|faaaa0a11111| - |user logged in| -

[Thanks to @bduggan and @sergot for updating Raku’s DateTime::Parse module so quickly!]

https://andrewshitov.com/2020/06/06/some-tips-for-working-with-hashes-in-raku/
https://github.com/sergot/datetime-parse
https://raku.org

Answered By: jubilatious1

Here is a Perl solution:

 1  #!/usr/bin/env perl
 2  use strict;
 3  use warnings;
 4  my ($p2, $p1, $x);
 5  while (<>) {
 6      print($p2, $p1, $_) if
 7          /(.+)user logged off/
 8          && do { $x = $1 }
 9          && $p1 =~ /Q$xEuser changed password/
10          && $p2 =~ /Q$xEuser logged in/;
11      $p2 = $p1;
12      $p1 = $_;
13  }

Line 1 lets the shell know this is a Perl script [1].

Line 2 "The strict pragma disables certain Perl expressions that could behave unexpectedly or are difficult to debug, turning them into errors. " [2]

Line 3 "The warnings pragma gives control over which warnings are enabled in which parts of a Perl program." [3]

Line 4 declares local variables [4, 5]:

  • $p2 is the second previous line of input.

  • $p1 is the previous line of input.

  • $x is for saving the initial portion of the current line of input.

Lines 5-13 form a while compound statement [6].

Line 5 the while loop expression uses the null filehandle <> (diamond operator) [7], allowing the solution to be used as a Unix filter [8]. For each line of input, while will assign the current line of input to the default input general variable $_ [9] and evaluate BLOCK.

Lines 6-10 form an if compound statement using statement modifier syntax [10].

Lines 7-10 form the EXPR portion of the if statement, which is composed of three regular expressions [11], one do BLOCK function [12], and the C-style logical "and" operator && [13].

Line 7 attempts to match the current line of input against the regular expression /(.+)user logged off/. If successful, the initial portion of the current line of input is captured into the global variable $1 [15].

If line 7 was true, line 8 saves $1 to the local variable $x. (Subsequent regular expressions may clobber the value of $1.) The do BLOCK evaluates to the the value of the last statement of BLOCK, which will be a non-empty string, which Perl considers true.

If line 8 was true, line 9 attempts to match the previous line of input against the regular expression /Q$xEuser changed password/. $x must be escaped within the regular expression using the delimiters Q and E, so that its value is treated as a string.

If line 9 was true, line 10 attempts to match the second previous line of input using the regular expression /Q$xEuser logged in/.

If line 10 is true, line 6 prints the second previous line of input, the previous line of input, and the current line of input.

Lines 11-12 update the variables for second previous line of input and previous line of input.

The solution produces output in the same order as the input:

$ perl solution.pl logfile.log
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -

References:

[1] https://en.wikipedia.org/wiki/Shebang_(Unix)

[2] https://perldoc.perl.org/strict

[3] https://perldoc.perl.org/warnings

[4] https://perldoc.perl.org/perldata

[5] https://perldoc.perl.org/functions/my

[6] https://perldoc.perl.org/perlsyn#Compound-Statements

[7] https://perldoc.perl.org/perlop#I%2FO-Operators

[8] https://en.wikipedia.org/wiki/Filter_(software)#Unix

[9] https://perldoc.perl.org/perlvar#General-Variables

[10] https://perldoc.perl.org/perlsyn#Statement-Modifiers

[12] https://perldoc.perl.org/perlre

[13] https://perldoc.perl.org/functions/do

[14] https://perldoc.perl.org/perlop#C-style-Logical-And

[15] https://perldoc.perl.org/variables/$%3Cdigits%3E%20($1,%20$2,%20…)

Answered By: David Christensen
#!/usr/bin/python
timesecond=[]
fil=open('/home/praveen/file.txt','r')
for i in fil:
    if i.strip().split(' ')[4] not in timesecond:
        timesecond.append(i.strip().split(' ')[4])


fincheck=open('/home/praveen/file.txt','r')
fincheck_final=fincheck.readlines()

finalcapt=[]
for test2 in timesecond:
    for test3 in range(0,len(fincheck_final),1):
        if test2.strip() in fincheck_final[test3].strip().split(' ')[4]:
            bj="{0} {1}".format(test3,test2)
            finalcapt.append(bj)
            break


for bgy in finalcapt:
    bgy_fin=bgy.strip().split(' ')
    wordcheck=bgy_fin[1]
    exsichec=int(bgy_fin[0])
#    print fincheck_final[exsichec].strip().split(' ')
    try:
        if wordcheck in fincheck_final[exsichec] and wordcheck in fincheck_final[exsichec].strip().split(' ')[4] and "user logged in" in fincheck_final[exsichec]:
            if wordcheck in fincheck_final[exsichec+1] and wordcheck in fincheck_final[exsichec+1].strip().split(' ')[4] and "user changed password" in fincheck_final[exsichec+1]:
                if wordcheck in fincheck_final[exsichec+2] and wordcheck in fincheck_final[exsichec+2].strip().split(' ')[4] and "user logged off" in fincheck_final[exsichec+2]:
    
                    print fincheck_final[exsichec],fincheck_final[exsichec+1],fincheck_final[exsichec+2]
     
    except:
        pass

output

Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -

Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -

Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off|
Answered By: Praveen Kumar BS
Categories: Answers Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.