How to run awk twice across the same file

I have a log file with a user_id column and another column with contest results.
I would like to:

  1. find all the user_ids of the users who won
  2. given those user_ids, return ALL the log entries for those users

Example:

log.csv

id,user_id,winner,page_id
1,user_1,win,1
2,user_1,,10
3,user_2,,1
4,user_2,,2
5,user_4,win,10
6,user_5,,2
7,user_5,win,3

Given a log file like this, I’m currently doing this as a two-step process:

Step 1: Collect the unique user_ids from the rows marked 'win'

BEGIN { FS = "," }

# Match the winner column exactly (a bare /win/ would also match
# "winner" in the header line)
$3 == "win" {
    # To make unique, store user_ids in an array
    n[$2] = 0
}

# Print the unique array keys
END{
    for (x in n)
        print x
}

This yields:

user_1
user_4
user_5

I save this output in the file output.txt
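With the script above saved as, say, step1.awk (the filename is arbitrary), that is:

awk -f step1.awk log.csv > output.txt    # step1.awk: arbitrary name for the script above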

Then, I pass that file and the original log file into another awk file:

BEGIN { FS = "," }

NR == FNR {
    n[$1] = 0   # Store each winning user_id as an array key
    next        # Skip the block below while reading the first file
}
{
    if ($2 in n) {
        print $0
    }
}
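With that second script saved as, say, step2.awk, the invocation is:

awk -f step2.awk output.txt log.csv    # step2.awk: arbitrary name; output.txt must come first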

This returns the correct output (all rows for each of the user_id’s that won):

1,user_1,win,1
2,user_1,,10
5,user_4,win,10
6,user_5,,2
7,user_5,win,3

Is there a more elegant way to do this? With a single awk file?

Asked By: zbinsd


I would use two arrays:

awk -F, '{a[$0]=$2;if($3=="win")b[$2]++}END{for(i in a){if(b[a[i]])print i}}' log.csv
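
For readability, here is the same one-liner expanded into script form (a sketch of the logic above, not a separate solution):

awk -F, '
    { a[$0] = $2 }              # remember every full line, keyed to its user_id
    $3 == "win" { b[$2]++ }     # count wins per user_id
    END {
        for (i in a)            # note: "for (i in a)" order is unspecified,
            if (b[a[i]])        # so lines may print in a different order
                print i
    }
' log.csv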
Answered By: jimmij

As far as I can tell, grep is quicker than awk. So if you have GNU grep with the Perl extension (-P) you can try

fgrep -f <(grep -oP "[^,]*(?=,win)" log.csv) log.csv

Without the Perl extension you'll have to pipe the grep output through cut

fgrep -f <(grep win log.csv | cut -d, -f2) log.csv

Or use sed (it seems to be a little quicker than the grep|cut above)

fgrep -f <(sed -n '/win/s/^[^,]*,\([^,]*\).*/\1/p' log.csv) log.csv
Answered By: Costas

Rather late for this, but for posterity, I’d like to point out that you can do this:

awk '
    {
        # do first pass stuff
    }

    END {
        while ((getline < FILENAME) > 0)
        {
            # do second pass stuff
        }
        close(FILENAME)
    }
' file
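
Applied to this question, that pattern might look like the following sketch (the END block re-reads log.csv line by line; the line is split manually because getline into a variable does not set the fields):

awk -F, '
    $3 == "win" { won[$2] }                      # first pass: collect winning user_ids
    END {
        while ((getline line < FILENAME) > 0) {  # second pass: re-read the same file
            split(line, f, ",")
            if (f[2] in won)
                print line
        }
        close(FILENAME)
    }
' log.csv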

If you want to do more passes, you can close(FILENAME) after the first while loop, then do a second one.

Answered By: Paul Sinclair

Using csvkit, one may do

$ csvsql --query 'SELECT b.* FROM log AS a JOIN log AS b USING (user_id) WHERE a.winner="win"' log.csv
id,user_id,winner,page_id
1,user_1,win,1
2,user_1,,10
5,user_4,win,10
6,user_5,,2
7,user_5,win,3

But I’m not so sure that will be very fast on a huge input file. By default, an SQLite database is created, populated and queried in the background.

You may however do this in the shell directly too:

$ join -t ',' -1 2 -2 2 log.csv log.csv | awk -v FS=',' -v OFS=',' '$3 == "win" { print $5,$1,$6,$7 }'
1,user_1,win,1
2,user_1,,10
5,user_4,win,10
6,user_5,,2
7,user_5,win,3

Again, I’m not sure how this scales to huge files. Also, the input file has to be sorted on column 2 (which the example data is).
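
If the input were not sorted on column 2, you could sort both copies on the fly, e.g.:

join -t ',' -1 2 -2 2 <(sort -t ',' -k 2,2 log.csv) <(sort -t ',' -k 2,2 log.csv) | awk -v FS=',' -v OFS=',' '$3 == "win" { print $5,$1,$6,$7 }'

(Sorting also moves the header line, but the awk filter drops it anyway.)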

Answered By: Kusalananda

This is a complete single-file GNU awk solution. You can run it as:

awk -f singlestep.awk log.csv

BEGIN {
    FS = ","
    # You cannot use FILENAME here, since in the BEGIN section no file
    # is being processed yet and the FILENAME variable is empty,
    # so you need to use ARGV instead
    while ((getline < ARGV[1]) > 0)
    {
        if ($3 == "win") {
            # To make unique, store user_ids in an array
            n[$2] = 0
        }
    }
    close(ARGV[1])
}
{
    if ($2 in n)
    {
        print $0
    }
}
Answered By: Wim Stockman

Is there a more elegant way to do this?

Yes, of course there is. Just run Awk twice across the same file (as you said in your title).

awk -F, '$3=="win"{won[$2]} FNR==NR{next} $2 in won' log.csv log.csv
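
Spelled out, that one-liner reads (the same program, just reformatted with comments):

awk -F, '
    $3 == "win" { won[$2] }   # runs on both passes: record winning user_ids
    FNR == NR   { next }      # first pass: stop here, collect only
    $2 in won                 # second pass: default action prints matching rows
' log.csv log.csv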
Answered By: Wildcard