awk on date conditions
I am trying to delete rows from a person.csv (below), keeping only the people who were not born in the last year:
Dataset1:
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"
So the expected output would look like this (the last row is deleted because that person was born less than one year ago):
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
I tried to use awk for this, something like:
awk -F , '{print $5 ....}' person.csv > output.csv
But I couldn't figure out how to compare each row's date with (today minus 1 year).
Dataset2: Sometimes a field may contain double quotes inside the double-quoted value, e.g. (line 1, field 4):
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley (aka "dud")","03 Oct 2023","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","03 Dec 2022","Actuary"
"2","aaaaaaa, bbbbbb","Grace","Huerta","21 Jan 2023","Visual merchandiser"
I am open to sed too, if that can do this.
Any help please, thanks!
Using GNU awk and GNU date, with the 1-year condition:
awk -v epoch1y=$(date -d '1 year ago' +%s) '
BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"}
NR>1{
cmd="date -d " $5 " +%s"
epoch=( (cmd | getline line) > 0 ? line : "N/A")
close(cmd)
if (epoch > epoch1y) next
}
1
' person.csv
Output
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
Explanations
The code should be clear, concise, and quite readable. The non-obvious part is the use of the FPAT special variable (splitting fields by content), to be able to handle more cases than a simple -F,:
Some programs export CSV data that contains embedded newlines between the double quotes. gawk provides no way to deal with this. Even though a formal specification for CSV data exists, there isn’t much more to be done; the FPAT mechanism provides an elegant solution for the majority of cases, and the gawk developers are satisfied with that.
- The cmd | getline line part is an awk trick to run a shell command and read its output into the variable that follows getline.
- The 1 at the end is a condition that is always true; with no action given, awk's default on a true condition is to print the current line.
- Doc about the close() call: https://www.gnu.org/software/gawk/manual/html_node/Close-Files-And-Pipes.html
- epoch is a Unix timestamp: the number of seconds since 1 Jan 1970.
Assumptions:
- all columns/fields are wrapped in double quotes
- double quotes do not show up as part of the data (otherwise we'll need something other than a basic -F'"' as the field separator)
- OP's (OS) date supports the -d argument (eg, if 'today' is 16 Sep 2023 then on OP's system date -d '-1 year' '+%Y%m%d' will generate 20220916)
- since OP has mentioned the cutoff date could be anything (eg, -1 year, -7 days, etc) we'll use the (OS) date to generate the cutoff date in YYYYMMDD format (otherwise we'd need to add a bit more code to awk to handle the various conditions like '-1 year', '-7 days', etc)
One awk approach:
cutoff=$(date -d '-1 year' '+%Y%m%d') # change '-1 year' to the desired condition;
# alternatively: manually set to the desired date (in YYYYMMDD format)
awk -v cutoff="${cutoff}" -F'"' ' # set awk variable "cutoff" to the value of the OS variable of the same name
# field delimiter is double quotes; this means data fields are even-numbered (eg, 5th field is the 10th "-delimited field)
BEGIN { mlist="JanFebMarAprMayJunJulAugSepOctNovDec" }
NR>1 { split($10,a,/[[:space:]]+/) # split 5th data field on spaces; a[1]=day a[2]=month a[3]=year
m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) ) # convert 3-letter month to 2-digit month
if ( a[3] m a[1] > cutoff) next # if new date is greater than the cutoff then skip to the next line of input
}
1 # print the current line
' person.csv
This generates:
"Index","User Id","First Name","Last Name","Date of birth","Job Title"
"1","9E39Bfc4fdcc44e","new, Diamond","Dudley","06 Dec 1945","Photographer"
"3","32C079F2Bad7e6F","Ethan","Hanson","08 Mar 2014","Actuary"
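The month-name conversion works because every abbreviation occupies 3 characters of mlist, so index() returns 1, 4, 7, ..., and (index+2)/3 maps that to 1..12. A standalone check of the conversion (the sample month names are arbitrary):

```shell
# map 3-letter month abbreviations to 2-digit month numbers
awk 'BEGIN {
    mlist = "JanFebMarAprMayJunJulAugSepOctNovDec"
    split("Jan Mar Dec", mons, " ")        # a few sample months to convert
    for (i = 1; i <= 3; i++)
        printf "%s -> %02d\n", mons[i], (index(mlist, mons[i]) + 2) / 3
}'
# Jan -> 01
# Mar -> 03
# Dec -> 12
```

Building the comparison key as a[3] m a[1] (YYYYMMDD) is what allows a plain string/numeric comparison against the cutoff, with no per-line date call.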
Performance perspective …
This answer requires a single OS call to date
and requires 1 file descriptor open/close (2 if you redirect the output to another file).
Gilles’ answer requires an OS call to date
for each line of input and requires the costly overhead of opening/closing a file descriptor for each date
call.
Test run:
100K line file # per comment from OP
GNU awk 5.1.0
GNU date 8.32
Ubuntu 20.04
i7-1260P
This answer:
real 0m0.198s <<< 546 times faster
user 0m0.198s
sys 0m0.000s
Gilles’ answer:
real 1m48.229s <<<
user 1m30.598s
sys 0m23.999s
Output from both runs was saved to files; a diff of the two output files showed no differences (ie, both answers generate the same result set).
In this case OP has stated that all fields are wrapped in double quotes. In cases where some fields may not be wrapped in double quotes, we can make use of GNU awk's FPAT and still perform only a single call to date, eg:
cutoff=$(date -d '-1 year' '+%Y%m%d')
awk -v cutoff="${cutoff}" '
BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"
mlist="JanFebMarAprMayJunJulAugSepOctNovDec"
}
NR>1 { f5=$5
gsub(/"/,"",f5) # strip double quotes from 5th data field
split(f5,a,/[[:space:]]+/) # change from 10th field to 5th field
m=sprintf("%02d", ( (index(mlist,a[2])+2) /3) )
if ( a[3] m a[1] > cutoff) next
}
1
' person.csv
Using the same test criteria as above, this answer’s run time:
real 0m0.861s <<<
user 0m0.850s
sys 0m0.009s
Parsing the input based on FPAT (as opposed to -F'"') increases the run time by ~4x, but it's still considerably faster than 108 seconds.
Since awk does not know about the quoting rules of CSV, it is better to use a CSV-aware tool for this task.
Using the CSV-processing tools from csvkit:
$ csvsql -I --query "SELECT * FROM file WHERE \`Date of birth\` <= date('now', '-1 year')" file
Index,User Id,First Name,Last Name,Date of birth,Job Title
1,9E39Bfc4fdcc44e,"new, Diamond",Dudley,06 Dec 1945,Photographer
3,32C079F2Bad7e6F,Ethan,Hanson,08 Mar 2014,Actuary
This delegates the date parsing and any date computation to a database backend (SQLite).
Note that the output only quotes the fields that need quoting. If you want all fields quoted, pass the result through csvformat -U1 (another csvkit tool).
EDIT: Note that the improperly quoted field "Dudley (aka "dud")" in the updated question will be transformed into "Dudley (aka dud"")""" by the csvkit tools (not shown here).
Without the csvkit tools and only using SQLite: Load the data directly from the CSV file and query it. SQLite contains a CSV-aware reader and writer, so we can rest assured that the data will be appropriately quoted.
rm -f data.db
sqlite3 data.db \
  '.headers on' \
  '.mode csv' \
  '.import file mytable' \
  "SELECT * FROM mytable WHERE \`Date of birth\` <= date('now', '-1 year')"
On my system (fanless "Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz"), this takes about 0.75s for 100k records, of which 0.3s are used for building the SQLite database. For 1M records, this takes just under 7 seconds, of which 4 are used for building the database. For 10M lines, the operation takes about 1 minute 15 seconds, of which 45 seconds was used to build the database.
EDIT: Note that the improperly quoted field "Dudley (aka "dud")" in the updated question will be changed to the properly quoted field "Dudley (aka ""dud"")" by the SQLite CSV reader/writer (not shown here).
IFS=$'\n'
lyear_old_sec=$(date -d "1 year ago" +%s)
echo '"Index","User Id","First Name","Last Name","Date of birth","Job Title"'
for i in $(awk 'NR>1' person.csv)
do
    dat_de=$(echo "$i" | awk -F "," '{print $(NF-1)}' | sed 's/"//g')
    date_de_second=$(date -d "$dat_de" +%s)
    if [[ $date_de_second -lt $lyear_old_sec ]]
    then
        echo "$i"
    fi
done
Using Miller (mlr):
$ mlr --csv filter 'strptime(${Date of birth}, "%d %b %Y") < '$(date -d '-1 year' '+%s') file
Index,User Id,First Name,Last Name,Date of birth,Job Title
1,9E39Bfc4fdcc44e,"new, Diamond",Dudley,06 Dec 1945,Photographer
3,32C079F2Bad7e6F,Ethan,Hanson,08 Mar 2014,Actuary
It uses Miller's strptime() function to parse the Date of birth field into epoch seconds.
If we want to change the date format in the output as well, the following command may be used:
$ mlr --csv filter 'strptime(${Date of birth}, "%d %b %Y") < strptime("'$(date -d '-1 year' '+%F')'","%F"); ${Date of birth} = strftime(strptime(${Date of birth}, "%d %b %Y"), "%F")' file
Index,User Id,First Name,Last Name,Date of birth,Job Title
1,9E39Bfc4fdcc44e,"new, Diamond",Dudley,1945-12-06,Photographer
3,32C079F2Bad7e6F,Ethan,Hanson,2014-03-08,Actuary
The csvsql command changes the date format if used without the -I or --no-inference option.