awk on date conditions

I am trying to delete rows from person.csv (below) for people who were born within the last 1 year, keeping everyone else:

Dataset1:

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"

So, the expected output would look like this (the last row is deleted because that person was born less than one year ago):

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"

I tried to use awk for this, something like:

awk -F , '{print $5 ....}' person.csv > output.csv

But I couldn’t figure out how to compare each row’s date with (today minus 1 year).

Dataset2: Sometimes a double quote can appear inside a double-quoted field, e.g. (line 1, field 4):

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley (aka "dud")","03 Oct 2023","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","03 Dec 2022","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"

I am open to ‘sed’ too if that can do this.
Any help please, thanks!

Asked By: zoomer

Using GNU awk, GNU date, and the 1 year condition:

awk -v epoch1y=$(date -d '1 year ago' +%s) '
    BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"}
    NR>1{
        cmd="date -d " $5 " +%s"
        epoch=( (cmd | getline line) > 0 ? line : "N/A")
        close(cmd)
        if (epoch > epoch1y) next
    }
    1
' person.csv

Output

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"

Explanations

The code should be clear, concise and quite readable.

The non-obvious part is the use of the FPAT special variable (splitting fields by content rather than by delimiter), to be able to handle more cases than a simple -F,:

Note: Some programs export CSV data that contains embedded newlines between the double quotes. gawk provides no way to deal with this. Even though a formal specification for CSV data exists, there isn’t much more to be done; the FPAT mechanism provides an elegant solution for the majority of cases, and the gawk developers are satisfied with that.


The cmd | getline part is an awk trick to run a shell command and read its output into the variable named after getline.


The 1 at the end is a condition that is always true; with no action given, awk’s default action applies, which is to print the current line.


Doc about close() call: https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html


epoch is a Unix timestamp: the number of seconds since 1 Jan 1970 (UTC).
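A minimal standalone sketch of that cmd | getline idiom (assuming GNU date is available):

```shell
# Run an external command from within awk and capture its first
# line of output via a pipe to getline.
awk 'BEGIN{
    cmd = "date -d \"06 Dec 1945\" +%s"   # command to run
    if ((cmd | getline epoch) > 0)        # read one line of its stdout
        print epoch                       # negative: before the 1970 epoch
    close(cmd)                            # close the pipe so it can be reused
}'
```

The exact number printed depends on the local timezone, but any date before 1970 yields a negative epoch value.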

Answered By: Gilles Quénot

Assumptions:

  • all columns/fields are wrapped in double quotes
  • double quotes do not show up as part of the data (otherwise we’ll need something other than a basic -F'"' as a field separator)
  • OP’s (OS) date supports the -d argument (eg, if ‘today’ is 16 Sep 2023 then on OP’s system date -d '-1 year' '+%Y%m%d' will generate 20220916)
  • since OP has mentioned the cutoff date could be anything (eg, -1 year, -7 days, etc) we’ll use the (OS) date to generate the cutoff date in YYYYMMDD format (otherwise we need to add a bit more code to awk to be able to handle various conditions like ‘-1 year’, ‘-7 days’, etc)

One awk approach:

cutoff=$(date -d '-1 year' '+%Y%m%d')                             # change '-1 year' to the desired condition;
                                                                  # alternatively: manually set to the desired date (in YYYYMMDD format)

awk -v cutoff="${cutoff}" -F'"' '                                 # set awk variable "cutoff" to the value of the OS variable of the same name
                                                                  # field delimiter is double quotes; this means data fields are even-numbered (eg, 5th field is the 10th "-delimited field)
BEGIN { mlist="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR>1  { split($10,a,/[[:space:]]+/)                               # split 5th data field on spaces; a[1]=day a[2]=month a[3]=year
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )           # convert 3-letter month to 2-digit month
        if ( a[3] m a[1] > cutoff) next                           # if new date is greater than the cutoff then skip to the next line of input
      } 
1                                                                 # print the current line
' person.csv

This generates:

"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
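The three-letter-month conversion above relies on each month name occupying 3 characters in mlist, so a month’s 1-based position p maps to its number as (p+2)/3; a standalone sketch:

```shell
# Map "Jan".."Dec" to 01..12: "Jan" starts at index 1 -> (1+2)/3 = 1,
# "Dec" starts at index 34 -> (34+2)/3 = 12.
awk 'BEGIN{
    mlist = "JanFebMarAprMayJunJulAugSepOctNovDec"
    split("Jan Mar Dec", t, " ")
    for (i = 1; i <= 3; i++)
        printf "%s -> %02d\n", t[i], (index(mlist, t[i]) + 2) / 3
}'
```

This prints Jan -> 01, Mar -> 03 and Dec -> 12.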

Performance perspective …

This answer requires a single OS call to date and requires 1 file descriptor open/close (2 if you redirect the output to another file).

Gilles’ answer requires an OS call to date for each line of input and requires the costly overhead of opening/closing a file descriptor for each date call.

Test run:

100K line file          # per comment from OP
GNU awk 5.1.0
GNU date 8.32
Ubuntu 20.04
i7-1260P

This answer:

real    0m0.198s        <<< 546 times faster
user    0m0.198s
sys     0m0.000s

Gilles’ answer:

real    1m48.229s       <<<
user    1m30.598s
sys     0m23.999s

Output from both runs were saved to files; a diff of the two output files showed no differences (ie, both answers generate the same result set).


In this case OP has stated that all fields are wrapped in double quotes.

In the case where some fields may not be wrapped in double quotes we can make use of GNU awk's 'FPAT' and still only perform the single call to date, eg:

cutoff=$(date -d '-1 year' '+%Y%m%d')

awk -v cutoff="${cutoff}" '
BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"
        mlist="JanFebMarAprMayJunJulAugSepOctNovDec"
      }
NR>1  { f5=$5
        gsub(/"/,"",f5)                                           # strip double quotes from 5th data field
        split(f5,a,/[[:space:]]+/)                                # change from 10th field to 5th field
        m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )
        if ( a[3] m a[1] > cutoff) next
      }
    1
' person.csv

Using the same test criteria as above, this answer’s run time:

real    0m0.861s        <<<
user    0m0.850s
sys     0m0.009s

Parsing the input based on FPAT (as opposed to -F'"') increases run time by ~4x but it’s still considerably faster than 108 seconds.

Answered By: markp-fuso

Since awk does not know about the quoting rules of CSV, it would be better to use a CSV-aware tool for performing this task.

Using the CSV-processing tools from csvkit:

$ csvsql -I --query "SELECT * FROM file WHERE \`Date of birth\` <= date('now', '-1 year')" file
Index,User Id,First Name,Last Name,Date of birth,Job Title
1,9E39Bfc4fdcc44e,"new, Diamond",Dudley,06 Dec 1945,Photographer
3,32C079F2Bad7e6F,Ethan,Hanson,08 Mar 2014,Actuary

This delegates the date parsing and any date computation to a database backend (SQLite).

Note that the output only quotes the fields that need quoting. If you want quoting of all fields, pass the result through csvformat -U1 (another csvkit tool).

EDIT: Note that the improperly quoted field "Dudley (aka "dud")" in the updated question will be transformed into "Dudley (aka dud"")""" by the csvkit tools (not shown here).


Without the csvkit tools and only using SQLite: Load the data directly from the CSV file and query it. SQLite contains a CSV-aware reader and writer, so we can rest assured that the data will be appropriately quoted.

rm -f data.db
sqlite3 data.db \
    '.headers on' \
    '.mode csv' \
    '.import file mytable' \
    "SELECT * FROM mytable WHERE \`Date of birth\` <= date('now', '-1 year')"

On my system (fanless "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz"), this takes about 0.75s for 100k records, of which 0.3s are used for building the SQLite database. For 1M records, this takes just under 7 seconds, of which 4 are used for building the database. For 10M lines, the operation takes about 1 minute 15 seconds, of which 45 seconds was used to build the database.

EDIT: Note that the improperly quoted field "Dudley (aka "dud")" in the updated question will be changed to the properly quoted field "Dudley (aka ""dud"")" by the SQLite CSV reader/writer (not shown here).

Answered By: Kusalananda

IFS=$'\n'
lyear_old_sec=$(date -d "1 year ago" +%s)
echo '"Index","User Id","First Name","Last Name","Date of birth","Job Title"'
for i in $(awk 'NR>1' sup.txt)
do
    dat_de=$(echo "$i" | awk -F "," '{print $(NF-1)}' | sed 's/"//g')
    date_de_second=$(date -d "$dat_de" +%s)
    if [[ $date_de_second -lt $lyear_old_sec ]]
    then
        echo "$i"
    fi
done
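The $(NF-1) extraction works because the date is always the second-to-last quote-wrapped field, regardless of how many embedded commas appear in earlier fields; a quick check using the sample row from the question:

```shell
# Even with an embedded comma in the "First Name" field, the date
# column still lands at $(NF-1) when splitting naively on commas.
echo '"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"' |
awk -F "," '{print $(NF-1)}' | sed 's/"//g'
```

This prints 06 Dec 1945 (it would break if the last field itself contained a comma).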
Answered By: Praveen Kumar BS

Using Miller (mlr):

$ mlr --csv filter 'strptime(${Date of birth}, "%d %b %Y") < '$(date -d '-1 year' '+%s') file
Index,User Id,First Name,Last Name,Date of birth,Job Title
1,9E39Bfc4fdcc44e,"new, Diamond",Dudley,06 Dec 1945,Photographer
3,32C079F2Bad7e6F,Ethan,Hanson,08 Mar 2014,Actuary

It uses Miller’s strptime() function.

If we also want to change the date format in the output, the following command may be used:

$ mlr --csv filter 'strptime(${Date of birth}, "%d %b %Y") < strptime("'$(date -d '-1 year' '+%F')'","%F"); ${Date of birth} = strftime(strptime(${Date of birth}, "%d %b %Y"), "%F")' file
Index,User Id,First Name,Last Name,Date of birth,Job Title
1,9E39Bfc4fdcc44e,"new, Diamond",Dudley,1945-12-06,Photographer
3,32C079F2Bad7e6F,Ethan,Hanson,2014-03-08,Actuary

Note that the csvsql command changes the date format if used without the -I (--no-inference) option.

Answered By: Prabhjot Singh