Rewriting log parsing script in Perl or Python (get rid of awk)
I need accomplish a task of filtering activity of bots in the log file.
Solution should only show records meeting following criteria
- user logged in, user changed password, user logged off within same second.
- those actions (log in, change password, log off) happened one after another with no other entries in between.
Input data example
[a lot of data]
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:42 +0200|178.57.66.225|faaaaaa11111| - |user logged in| -
Mon, 22 Aug 2016 13:15:40 +0200|178.57.66.215|terdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged in| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed password| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed profile| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -
Mon, 22 Aug 2016 13:20:42 +0200|178.57.67.225|faaaa0a11111| - |user logged in| -
[a lot of data]
I’ve written the code below in order to complete the task
awk 'BEGIN { FS=" " } { c[$5]++; l[$5,c[$5]]=$0 } END { for (i in c) { if (c[i] == 3) for (j = 1 ; j <= c[i]; j++) print l[i,j] } }' $1
Usage:
./parse_log.sh logfile.log
Output:
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -
But I want to know what alternative written in Perl or Python (with minimum usage of external libs) would look?
The solution itself is written in Python 3.
#!/usr/bin/env python3
import sys
import re
from collections import defaultdict
column_delimiter = sys.argv[1]
column = int(sys.argv[2]) - 1
records = defaultdict(list)
with open(sys.argv[3]) as inputfile:
for lines in inputfile:
line = lines.rstrip('n')
row_record = line.split(column_delimiter)
records[row_record[column]].append(line)
for timestamps in records.values():
if len(timestamps) == 3:
for i in range(len(timestamps)):
if (re.search('logged in|changed password|logged off', timestamps[i])):
print(timestamps[i])
Usage: parse_log.py ' ' 5 logfile.log
The Python code is much easier to read and understand what it does.
This is not an answer, but it is too big for a comment and requires formatting, so to address your comment that "The Python code is much easier to read and to understand what it does.", FYI an AWK script with sensible variable names that does what I think your Python script does would look a lot like your python script but briefer since, for manipulating text, awk already does for you all the common stuff you have to write code to do in python:
awk -v column=5 '
{ records[$column] = records[$column] $0 ORS }
END {
for ( timestamp in records ) {
if ( gsub(ORS,"&",records[timestamp]) > 2 ) {
printf "%s", records[timestamp]
}
}
}
' logfile.log
but reading the whole file into memory before processing it is a very inefficient approach to this problem. You should just do the test and print every time the time changes:
awk -v column=5 '
$column != prev {
prt()
records = ""
prev = $column
}
{ records = records $0 ORS }
END { prt() }
function prt() {
if ( gsub(ORS,"&",records) > 2 ) {
printf "%s", records
}
}
' logfile.log
In Perl this can be written as a one-liner, but it feels a bit cluttered:
perl -MTime::Piece -F'|' -ae '$epoch=Time::Piece->strptime($F[0], "%a, %d %b %Y %H:%M:%S %z")->epoch; $diff=$epochlast2 - $epoch; $last =~ /user changed password/ && $last2 =~ /user logged in/ && $_ =~ /user logged off/ && $diff==0 && print $last2, $last, $_; $epochlast2=$epochlast; $epochlast=$epoch; $last2=$last; $last=$_' <<< "$data"
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -
alternatively, as a script:
use warnings;
use Time::Piece;
my $epochlast2=0;
my $epochlast=0;
my $last2="";
my $last="";
while($line = <STDIN>){
$date=(split(/|/, $line))[0];
$epoch=Time::Piece->strptime($date, "%a, %d %b %Y %H:%M:%S %z")->epoch;
$diff=$epochlast2 - $epoch;
if ($last =~ /user changed password/ && $last2 =~ /user logged in/ && $line =~ /user logged off/ && $diff==0) {
print $last2, $last, $line;
}
$epochlast2=$epochlast;
$epochlast=$epoch;
$last2=$last;
$last=$line
}
Here you have the python version of my previous script (reading from stdin)
import sys
from datetime import datetime
last2 = ""
last = ""
datelast2 = None
datelast = None
for line in sys.stdin:
line = line.rstrip()
elems = line.split("|")
date = datetime.strptime(elems[0], "%a, %d %b %Y %H:%M:%S %z")
if datelast2 is not None:
diff = datelast2 - date
if (
"user changed password" in last
and "user logged in" in last2
and "user logged off" in line
and diff.seconds == 0
):
print("{}n{}n{}".format(last2, last, line))
last2 = last
last = line
datelast2 = datelast
datelast = date
With any awk
:
#!/usr/bin/awk -f
BEGIN { FS = "[|]" }
prvHour == $1 && prvUsr == $3 {
if ($(NF-1) == "user logged in" ||
$(NF-1) == "user changed password" ||
$(NF-1) == "user logged off" )
actions[++actionCnt] = $0
else actionCnt = 0
}
prvHour != $1 && prvUsr != $3 {
if (prvHour && actionCnt == 3)
for (i = 1; i <= actionCnt ; i++)
print actions[i]
prvHour = $1; prvUsr = $3
actionCnt = 0 ; actions[++actionCnt] = $0
}
END {
if (actionCnt == 3)
for (i = 1; i <= actionCnt; i++)
print actions[i]
}
With Perl
without using external libraries :
/bin/perl -e '
while (1) {
$uli = $uli // <>;
$ucp = <> if $uli =~ /^([^|]*)[|][^|]*[|]([^|]*)[|] - [|]user logged in[|] -$/;
last if tell() < 0 ;
if (!defined $ucp) { $uli = undef ; next; }
$ulo = <> if $ucp =~ /^(Q$1E)[|][^|]*[|]($2)[|] - [|]user changed password[|] -$/;
last if tell() < 0;
if (!defined $ulo) { $uli = $ucp ; $ucp = undef ; next ; }
if ($ulo !~ /^Q$1E[|][^|]*[|]$2[|] - [|]user logged off[|] -$/) {
$uli = $ulo ; $ucp = $ulo = undef ; next ;
}
print "$uli$ucp$ulo";
$uli = $ucp = $ulo = undef;
}
' sample
With python3
usgin just sys
to read the file and call exit
with meaning values.
#!/bin/python3
import sys
try:
fullLine = actions = []
prvHour = prvUsr = None
chk_act = lambda x: x == "user logged in" or
x == "user changed password" or
x == "user logged off"
with open(sys.argv[1]) as logFile:
for line in logFile:
hour, _, user, _, action, _ = line.split('|')
if prvHour == hour and prvUsr == user:
fullLine.append(line.strip())
actions.append(action.strip())
elif prvUsr != user and prvHour != hour:
if len(actions) == 3 and all(map(chk_act, actions)):
print("n".join(fullLine))
prvUsr = user
prvHour = hour
actions = []
actions.append(action.strip())
fullLine = []
fullLine.append(line.strip())
except IndexError:
print("usage {} logfile".format(sys.argv[0]))
sys.exit(1)
except (FileNotFoundError, PermissionError):
print("{} not found or permission permission denied", sys.argv[1])
sys.exit(1)
With any sed
:
#!/bin/sed -nf
N;/^([^|]*)|[^|]*|([^|]*)| - |user logged in| -n1|[^|]*|2| - |user changed password| -$/{
N;/n([^|]*)|[^|]*|([^|]*)| - |user changed password| -n1|[^|]*|2| - |user logged off| -$/{
p;b
}
s/.*n/n/g;D
}
D
All solutions avoid to store the whole data in memory
Using Raku (formerly known as Perl_6)
~$ raku -MDateTime::Parse -e 'my @a; my @b; for lines() {
@a.push($_.split(/ "|" | s "-" s? /, :skip-empty));
@b.push(DateTime::Parse.new(@a.[*-1][0]).posix)}; my %c.=append([Z=>] @b, @a>>.[1..*-1]);
for %c.sort -> $i {$i.raku.put if $i.values.contains("user logged in" & "user changed password" & "user logged off")
&& ($i.value.elems == 9)}' file
Sample Output:
"1471864539" => $["178.57.66.225", "fxsciaqulmlk", "user logged in", "178.57.66.225", "fxsciaqulmlk", "user changed password", "178.57.66.225", "fxsciaqulmlk", "user logged off"]
"1471864559" => $["178.57.66.205", "erdsfsdfsdf", "user logged in", "178.57.66.205", "erdsfsdfsdf", "user changed password", "178.57.66.205", "erdsfsdfsdf", "user logged off"]
"1471864759" => $["178.56.66.225", "fxsciaqulmla", "user logged in", "178.56.66.225", "fxsciaqulmla", "user changed password", "178.56.66.225", "fxsciaqulmla", "user logged off"]
If you’re looking for a scripting language to use at the command line, you might want to consider Raku. The code is relatively short, and includes a few niceties–including taking advantage of Raku’s built-in hashing capabilities.
Reading the code above, lines
are read-in and each line is split
on |
pipes and/or s "-" s?
whitespace-dash. This returns four elements per line (zero-index = 0..3). Each line is push
ed onto the @a
array. Then the first column of each line is parsed using DateTime::Parse.new()
to return posix
seconds, which is pushed onto the @b
array.
From these two arrays a hash %c
is created using Raku’s [Z=>]
Zip-reduction metaoperator. This gives a %c
hash with posix seconds as key
, and zero-indexed columns 1,2,3 as value
. As the elements are append
ed to the %c
hash, their values append to the appropriate posix
key. Finally in the output
, for each posix-key element, values are checked to make sure they contain
the exact three strings requested, and also checked that exactly 9 elements exist (three rows of three columns).
Sample Input:
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:42 +0200|178.57.66.225|faaaaaa11111| - |user logged in| -
Mon, 22 Aug 2016 13:15:40 +0200|178.57.66.215|terdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:49 +0200|178.57.66.215|terdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged in| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed password| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user changed profile| -
Mon, 22 Aug 2016 13:17:50 +0200|178.57.66.205|abcbbabab| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -
Mon, 22 Aug 2016 13:20:42 +0200|178.57.67.225|faaaa0a11111| - |user logged in| -
[Thanks to @bduggan and @sergot for updating Raku’s DateTime::Parse
module so quickly!]
https://andrewshitov.com/2020/06/06/some-tips-for-working-with-hashes-in-raku/
https://github.com/sergot/datetime-parse
https://raku.org
Here is a Perl solution:
1 #!/usr/bin/env perl
2 use strict;
3 use warnings;
4 my ($p2, $p1, $x);
5 while (<>) {
6 print($p2, $p1, $_) if
7 /(.+)user logged off/
8 && do { $x = $1 }
9 && $p1 =~ /Q$xEuser changed password/
10 && $p2 =~ /Q$xEuser logged in/;
11 $p2 = $p1;
12 $p1 = $_;
13 }
Line 1 lets the shell know this is a Perl script [1].
Line 2 "The strict pragma disables certain Perl expressions that could behave unexpectedly or are difficult to debug, turning them into errors. " [2]
Line 3 "The warnings pragma gives control over which warnings are enabled in which parts of a Perl program." [3]
Line 4 declares local variables [4, 5]:
-
$p2
is the second previous line of input. -
$p1
is the previous line of input. -
$x
is for saving the initial portion of the current line of input.
Lines 5-13 form a while
compound statement [6].
Line 5 the while
loop expression uses the null filehandle <>
(diamond operator) [7], allowing the solution to be used as a Unix filter [8]. For each line of input, while
will assign the current line of input to the default input general variable $_
[9] and evaluate BLOCK
.
Lines 6-10 form an if
compound statement using statement modifier syntax [10].
Lines 7-10 form the EXPR
portion of the if
statement, which is composed of three regular expressions [11], one do BLOCK
function [12], and the C-style logical "and" operator &&
[13].
Line 7 attempts to match the current line of input against the regular expression /(.+)user logged off/
. If successful, the initial portion of the current line of input is captured into the global variable $1
[15].
If line 7 was true, line 8 saves $1
to the local variable $x
. (Subsequent regular expressions may clobber the value of $1
.) The do BLOCK
evaluates to the the value of the last statement of BLOCK
, which will be a non-empty string, which Perl considers true.
If line 8 was true, line 9 attempts to match the previous line of input against the regular expression /Q$xEuser changed password/
. $x
must be escaped within the regular expression using the delimiters Q
and E
, so that its value is treated as a string.
If line 9 was true, line 10 attempts to match the second previous line of input using the regular expression /Q$xEuser logged in/
.
If line 10 is true, line 6 prints the second previous line of input, the previous line of input, and the current line of input.
Lines 11-12 update the variables for second previous line of input and previous line of input.
The solution produces output in the same order as the input:
$ perl solution.pl logfile.log
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off| -
References:
[1] https://en.wikipedia.org/wiki/Shebang_(Unix)
[2] https://perldoc.perl.org/strict
[3] https://perldoc.perl.org/warnings
[4] https://perldoc.perl.org/perldata
[5] https://perldoc.perl.org/functions/my
[6] https://perldoc.perl.org/perlsyn#Compound-Statements
[7] https://perldoc.perl.org/perlop#I%2FO-Operators
[8] https://en.wikipedia.org/wiki/Filter_(software)#Unix
[9] https://perldoc.perl.org/perlvar#General-Variables
[10] https://perldoc.perl.org/perlsyn#Statement-Modifiers
[12] https://perldoc.perl.org/perlre
[13] https://perldoc.perl.org/functions/do
[14] https://perldoc.perl.org/perlop#C-style-Logical-And
[15] https://perldoc.perl.org/variables/$%3Cdigits%3E%20($1,%20$2,%20…)
#!/usr/bin/python
timesecond=[]
fil=open('/home/praveen/file.txt','r')
for i in fil:
if i.strip().split(' ')[4] not in timesecond:
timesecond.append(i.strip().split(' ')[4])
fincheck=open('/home/praveen/file.txt','r')
fincheck_final=fincheck.readlines()
finalcapt=[]
for test2 in timesecond:
for test3 in range(0,len(fincheck_final),1):
if test2.strip() in fincheck_final[test3].strip().split(' ')[4]:
bj="{0} {1}".format(test3,test2)
finalcapt.append(bj)
break
for bgy in finalcapt:
bgy_fin=bgy.strip().split(' ')
wordcheck=bgy_fin[1]
exsichec=int(bgy_fin[0])
# print fincheck_final[exsichec].strip().split(' ')
try:
if wordcheck in fincheck_final[exsichec] and wordcheck in fincheck_final[exsichec].strip().split(' ')[4] and "user logged in" in fincheck_final[exsichec]:
if wordcheck in fincheck_final[exsichec+1] and wordcheck in fincheck_final[exsichec+1].strip().split(' ')[4] and "user changed password" in fincheck_final[exsichec+1]:
if wordcheck in fincheck_final[exsichec+2] and wordcheck in fincheck_final[exsichec+2].strip().split(' ')[4] and "user logged off" in fincheck_final[exsichec+2]:
print fincheck_final[exsichec],fincheck_final[exsichec+1],fincheck_final[exsichec+2]
except:
pass
output
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged in| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user changed password| -
Mon, 22 Aug 2016 13:15:39 +0200|178.57.66.225|fxsciaqulmlk| - |user logged off| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged in| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user changed password| -
Mon, 22 Aug 2016 13:15:59 +0200|178.57.66.205|erdsfsdfsdf| - |user logged off| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged in| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user changed password| -
Mon, 22 Aug 2016 13:19:19 +0200|178.56.66.225|fxsciaqulmla| - |user logged off|