Sort and count the number of occurrences of lines
I have an Apache
log file, access.log
. How do I count the number of times each line occurs in that file? For example, the output of cut -f 7 -d ' ' | cut -d '?' -f 1 | tr '[:upper:]' '[:lower:]'
is
a.php
b.php
a.php
c.php
d.php
b.php
a.php
the result that I want is:
3 a.php
2 b.php
1 d.php # order doesn't matter
1 c.php
| sort | uniq -c
As stated in the comments, piping the output into sort
arranges it into alphabetical/numerical order.
This is required because uniq
only matches on consecutive repeated lines, e.g.
a
b
a
If you use uniq
on this text file, it will return the following:
a
b
a
This is because the two a
lines are separated by the b
and are therefore not consecutive. However, if you first sort the data into alphabetical order, like
a
a
b
then uniq
will collapse the repeated lines. The -c
option of uniq
counts the number of duplicates and produces output in the form:
2 a
1 b
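Putting that together on the question's sample data gives the counts asked for (a minimal sketch; the printf line just recreates the sample list in place of the real access.log pipeline):

```shell
# Recreate the sample lines, then sort them so duplicates are
# adjacent, and let uniq -c count each run of identical lines.
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php | sort | uniq -c
```

The output lists each distinct line once, prefixed by its count: 3 a.php, 2 b.php, 1 c.php, 1 d.php.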
You can use an associative array in awk and then, optionally, sort:
$ awk ' { tot[$0]++ } END { for (i in tot) print tot[i],i } ' access.log | sort
output:
1 c.php
1 d.php
2 b.php
3 a.php
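You can try the same awk one-liner without a log file by feeding it the question's sample lines on stdin (a sketch; awk reads standard input when no file is given):

```shell
# tot[$0]++ counts each distinct whole line; the END block prints
# "count line" pairs, which sort then orders lexically.
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
  | awk '{ tot[$0]++ } END { for (i in tot) print tot[i], i }' \
  | sort
```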
[your command] | sort | uniq -c | sort -nr
The accepted answer is almost complete; you might want to add an extra sort -nr
at the end to sort the results with the lines that occur most often first.
uniq options:
-c, --count
prefix lines by the number of occurrences
sort options:
-n, --numeric-sort
compare according to string numerical value
-r, --reverse
reverse the result of comparisons
In the particular case where the lines you are sorting are numbers, you need to use sort -gr
instead of sort -nr
; see the comments.
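On the question's sample data the full pipeline looks like this (a sketch; the printf stands in for the real log pipeline, and the relative order of the count-1 ties may vary):

```shell
# sort | uniq -c counts the duplicates; the final sort -nr orders
# numerically by count, largest first.
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
  | sort | uniq -c | sort -nr
```

The most frequent line, 3 a.php, comes out first.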
There is only one occurrence of d.php
, so you’ll get nice output like this:
wolf@linux:~$ cat file | sort | uniq -c
3 a.php
2 b.php
1 c.php
1 d.php
wolf@linux:~$
What happens when there are four occurrences of d.php
?
wolf@linux:~$ cat file | sort | uniq -c
3 a.php
2 b.php
1 c.php
4 d.php
wolf@linux:~$
If you want to sort the output by the number of occurrences, pipe the stdout to sort
again.
wolf@linux:~$ cat file | sort | uniq -c | sort
1 c.php
2 b.php
3 a.php
4 d.php
wolf@linux:~$
Use -r
to reverse the order:
wolf@linux:~$ cat file | sort | uniq -c | sort -r
4 d.php
3 a.php
2 b.php
1 c.php
wolf@linux:~$
Hope these examples help.
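One caveat worth noting: a plain sort -r compares the count column as text, which misorders counts once they differ in digit count; sort -nr compares them as numbers. A small sketch with made-up counts (real uniq -c output right-aligns the counts in a fixed width, which masks this for small files, but -n is the robust choice):

```shell
# Lexical reverse sort puts "9 ..." before "10 ..." because the
# character '9' compares greater than '1'.
printf '%s\n' '10 x.php' '9 y.php' | sort -r
# Numeric reverse sort orders by the count's value: 10 before 9.
printf '%s\n' '10 x.php' '9 y.php' | sort -nr
```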
You can use the clickhouse-local tool to work with a file as if it were a SQL table, with a single column in this case:
clickhouse-local --query
"select data, count() from file('access.log', TSV, 'data String') group by data order by count(*) desc limit 10"
My brief experiment shows it’s about 50 times faster than
cat access.log | sort | uniq -c | sort -nr | head -n 10