Sort and count the number of occurrences of lines

I have an Apache log file, access.log. How can I count the number of occurrences of each line in that file? For example, the result of cut -f 7 -d ' ' | cut -d '?' -f 1 | tr '[:upper:]' '[:lower:]' is

a.php
b.php
a.php
c.php
d.php
b.php
a.php

The result that I want is:

3 a.php
2 b.php
1 d.php # order doesn't matter
1 c.php

Asked By: Kokizzu

| sort | uniq -c

As stated in the comments.

Piping the output into sort organises the output into alphabetical/numerical order.

This is a requirement because uniq only matches repeated lines that are adjacent. For example, take:

a
b
a

If you use uniq on this text file, it will return the following:

a
b
a

This is because the two a's are separated by the b, so they are not consecutive lines. However, if you first sort the data into alphabetical order, like this:

a
a
b

uniq will then collapse the repeated lines. The -c option of uniq counts how many times each line occurs and prefixes each output line with that count:

2 a
1 b
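
Applied to the question, the same idea can be sketched as below. The three log lines are made up for illustration (an assumed Apache-style format in which the seventh space-separated field is the request path); only the pipeline itself comes from the question and this answer.

```shell
# Hypothetical access.log lines, invented for this sketch.
log='127.0.0.1 - - [10/Oct/2020:13:55:36 +0000] "GET /a.php?id=1 HTTP/1.1" 200 512
127.0.0.1 - - [10/Oct/2020:13:55:37 +0000] "GET /B.php HTTP/1.1" 200 512
127.0.0.1 - - [10/Oct/2020:13:55:38 +0000] "GET /A.php?id=2 HTTP/1.1" 200 512'

counts=$(printf '%s\n' "$log" |
    cut -f 7 -d ' ' |                # keep the request path (field 7)
    cut -d '?' -f 1 |                # drop the query string
    tr '[:upper:]' '[:lower:]' |     # fold case so /A.php and /a.php match
    sort |                           # make identical lines adjacent
    uniq -c)                         # count each run of identical lines

printf '%s\n' "$counts"
```

The width of the padding in front of each count depends on the uniq implementation, so scripts that parse this output should not rely on a fixed column.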


Answered By: visudo

You can use an associative array in awk and then, optionally, sort:

$ awk ' { tot[$0]++ } END { for (i in tot) print tot[i],i } ' access.log | sort

output:

1 c.php
1 d.php
2 b.php
3 a.php
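
One thing to watch for: awk prints the counts without padding, so a plain sort compares them as text and a count of 10 lands before a count of 2. A minimal sketch of the difference, using made-up input (ten x lines and two y lines) and the same awk program as above:

```shell
# Made-up input: "x" ten times, then "y" twice.
sample=$(awk 'BEGIN { for (i = 0; i < 10; i++) print "x"; print "y"; print "y" }')

# Count each distinct line with an associative array.
counts=$(printf '%s\n' "$sample" |
    awk '{ tot[$0]++ } END { for (k in tot) print tot[k], k }')

lexical=$(printf '%s\n' "$counts" | sort)      # text order: "10 x" before "2 y"
numeric=$(printf '%s\n' "$counts" | sort -n)   # numeric order: "2 y" before "10 x"

printf 'sort:\n%s\nsort -n:\n%s\n' "$lexical" "$numeric"
```

With single-digit counts, as in the sample output here, both orderings coincide, which is why the problem is easy to miss.
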
Answered By: Laurence R. Ugalde

[your command] | sort | uniq -c | sort -nr

The accepted answer is almost complete. You might want to add an extra sort -nr at the end, so that the lines that occur most often come first.

uniq options:

-c, --count
       prefix lines by the number of occurrences

sort options:

-n, --numeric-sort
       compare according to string numerical value
-r, --reverse
       reverse the result of comparisons

In the particular case where the lines you are sorting are numbers, you need to use sort -gr instead of sort -nr (see comment).
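
As a small sketch, the whole pipeline can be wrapped in a shell function that prints only the most frequent lines. The name top_lines is hypothetical, not a standard tool, and the input is the sample data from the question.

```shell
# top_lines N: read lines on stdin, print the N most frequent with their counts.
# (top_lines is a hypothetical helper name chosen for this example.)
top_lines() {
    sort | uniq -c | sort -nr | head -n "$1"
}

result=$(printf 'a.php\nb.php\na.php\nc.php\nd.php\nb.php\na.php\n' | top_lines 2)
printf '%s\n' "$result"
```

For the sample data this keeps a.php (3 occurrences) and b.php (2 occurrences).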

Answered By: Eduard Florinescu

There is only one d.php line in the sample, so the output happens to look nicely ordered:

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      1 d.php
wolf@linux:~$

What happens when there are 4 d.php lines?

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      4 d.php
wolf@linux:~$ 

If you want to sort the output by the number of occurrences, you can pipe it through sort again.

wolf@linux:~$ cat file | sort | uniq -c | sort
      1 c.php
      2 b.php
      3 a.php
      4 d.php
wolf@linux:~$ 

Use -r to reverse the order:

wolf@linux:~$ cat file | sort | uniq -c | sort -r
      4 d.php
      3 a.php
      2 b.php
      1 c.php
wolf@linux:~$ 
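
If you also care about the order of lines that have the same count, sort can take more than one key. A minimal sketch, assuming you want the highest counts first and ties listed alphabetically (the input is the original sample with three a.php lines):

```shell
ordered=$(printf 'a.php\nb.php\na.php\nc.php\nd.php\nb.php\na.php\n' |
    sort | uniq -c |
    sort -k1,1nr -k2,2)   # key 1: count, numeric, descending; key 2: name, ascending

printf '%s\n' "$ordered"
```

Here c.php and d.php both occur once, and the second key puts c.php ahead of d.php.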

Hope this example helps

Answered By: Wolf

You can use the clickhouse-local tool to query a file as if it were a SQL table, with a single column in this case:

clickhouse-local --query \
"select data, count() from file('access.log', TSV, 'data String') group by data order by count(*) desc limit 10"

My brief experiment shows it’s about 50 times faster than

cat access.log | sort | uniq -c | sort -nr | head -n 10

Answered By: Alexey Kupershtokh