Delete a whole word from a CSV file that is not part of another word using SED
I’ve searched for an answer to this and have come close, but not close enough.
We receive a CSV file that contains the text "NULL" which needs to be replaced by nothing. Example:
- Input
12345,George,MCNULLMAN,NULL,green,NULL
- Replacement should result in:
12345,George,MCNULLMAN,,green,
I tried this as a test, but obviously the word boundaries don’t account for commas.
echo "MCNULLMAN,HELLO,NULL,NULL" | sed 's|bNULL/b||g'
Using sed 's|NULL||g'
worked great for a while until we received a person’s name that contained NULL
in the middle of it. Any suggestions?
When working with tabulated data, I would recommend using awk
:
awk 'BEGIN{FS=OFS=","}{for (i=1;i<=NF;i++) if ($i=="NULL") $i=""}1' input.csv
This would set the input and output field separators to ,
. It then iterates over all fields of the line, and if they are exactly equal to NULL
, replace them with the empty string. The 1
instructs awk
to print the line including all modifications (if any).
If it needs to be sed
, I would recommend to hard-encode the field separators (while allowing the preceding comma to be the start-of-line instead, and the trailing one the end-of-line):
sed -E 's/(^|,)(NULL)(,|$)/13/g' input.csv
This makes use of capture groups to record the actual value of the preceding and trailing field separator (which may be the comma or start-of-line/end-of-line depending on which field is modified), and replaces the entire "preceding-separator+field+trailing-separator" combination by just "preceding-separator+trailing-separator".
Please note that this only works for "simple CSV" files where it is guaranteed that the NULL
will not be quoted.
Using Miller (mlr
) to empty each field that is the exact string NULL
in the header-less CSV input file:
$ cat file.csv
12345,George,MCNULLMAN,NULL,green,NULL
$ mlr --csv -N put 'for (k,v in $*) { v == "NULL" { $[k] = "" } }' file.csv
12345,George,MCNULLMAN,,green,
This will also work with CSV files that contain complex quoting:
$ cat file.csv
12345,"George
NULL,MacGregor",MCNULLMAN,"NULL,NULL","NULL",green,"""NULL"""
$ mlr --csv -N put 'for (k,v in $*) { v == "NULL" { $[k] = "" } }' file.csv
12345,"George
NULL,MacGregor",MCNULLMAN,"NULL,NULL",,green,"""NULL"""
In the example record above, there is only a single field that contains the string NULL
only, and that’s the 4th field (before green
). It additionally has a set of unnecessary quotes. (The last filed is "NULL"
, including the literal quotes, so it is not emptied. The first NULL
on the second line is part of the 1st field, which contains a literal newline character. Likewise, the field whose value is NULL,NULL
is not touched.)
Using awk
:
awk '{sub(/^NULL,/, ",");
gsub(/,NULL,/, ",,");
sub(/,NULL$/, ",")}1' file
Using csvsql
:
file.csv as simple CSV file.
12345,George,MCNULLMAN,NULL,green,NULL
$ csvsql -H -I --query 'select * from file' file.csv | csvformat -K 1
12345,George,MCNULLMAN,,green,
file.csv with complex quoting.
12345,"George
NULL,MacGregor",MCNULLMAN,"NULL,NULL","NULL",green,"""NULL"""
$ csvsql -H -I --query 'select * from file' file.csv | csvformat -K 1
12345,"George
NULL,MacGregor",MCNULLMAN,"NULL,NULL",,green,"""NULL"""
-
-H
for--no-header-row
. -
-I
for--no-inference
.
Without this option the command changes12345
to12345.0
. -
-K n
for--skip-lines
. Skips first n lines.
This is used because csvsql -H
command adds a header row. - K 1
removes that.
Using Raku (formerly known as Perl_6)
~$ raku -ne '.split(",").map(*.subst: :global, /^NULL$/ ).join(",").put;' file
OR
~$ raku -ne '.split(",")>>.subst( :global, /^NULL$/ ).join(",").put;' file
Above is an answer coded in Raku, a member of the Perl family of programming languages. The file is read linewise using the -ne
non-autoprinting linewise flags. These flags instruct Raku to run the code that follows the flags over each line.
The input data (i.e. line) loads into Raku’s $_
topic variable. From here the text is $_.split
on commas (note above: you can drop the leading $_
and just write .split
).
Next the resultant elements are map
ped over to apply a subst
itution to each. Here every match to /^NULL$/
an element that starts-and-ends with NULL
and nothing else in between is replaced with nothing (the :global
adverb/parameter is superfluous but may prove useful in other situations).
Finally the modified elements are join
ed together on commas and output
.
Sample Input:
12345,George,MCNULLMAN,NULL,green,NULL
12345,George,MCNULLMAN,NULL,green,nail
NULL,George,MCNULLMAN,NULL,green,neal
Sample Output:
12345,George,MCNULLMAN,,green,
12345,George,MCNULLMAN,,green,nail
,George,MCNULLMAN,,green,neal