How to remove embedded newlines from CSV fields

I am struggling to process a file. Normally there should be one line per host definition, but sometimes someone has split fields across several lines.
Here is an example:

"host1","host1","linux
server",""
"host2","host2","linux server",""

Now I would like to find a way (preferably in bash) to fix this into:

"host1","host1","linux server",""
"host2","host2","linux server",""

Each field should be surrounded by double quotes; if this is not the case, it means that a \n was inserted, which I then want to remove in order to always have 4 fields on each line.

Note that I may have a description split into several lines like:

"host1","host1","linux
server
centos",""
"host2","host2","linux server",""

I tried several awk approaches, e.g.

awk 'BEGIN {ORS=""; RS="\"\n\""; FS="\",\""; OFS="\",\""} {if (NF == 3) print "\"" $1 "\"," $2 "\"," $3 "\"\n"; else printf "%s", $0} END {print ""}' /tmp/ngr4

but I did not succeed and I begin to reach my limit with this powerful tool.

Asked By: Nicolas Greder


Some modern awk implementations actually ship a fully-fledged CSV parser (GoAWK, I think?), but yours probably doesn’t. GNU awk has a --csv flag starting from version 5.3. You would need a very recent Linux distro to get GNU awk 5.3 (for example, Ubuntu 24.04 doesn’t ship it, and Fedora 40 will start shipping it once released; Fedora 40 isn’t even a beta release yet).

So, the easiest solution would probably be not to use awk to parse your CSV input. Usually, I’d recommend mlr for such tasks, but if I remember correctly, it doesn’t come with support for multiline values.

So, Python would be a natural choice.

#!/usr/bin/env python3
import csv
from sys import argv

with (
    open(argv[1], "r", encoding="utf-8") as infile,
    open(argv[2], "w", encoding="utf-8") as outfile,
):
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in reader:
        writer.writerow([value.replace("\n", " ") for value in row])

You can make this executable (chmod a+x thisscript.py) and run it as /path/to/thisscript.py input.csv output.csv.

Answered By: Marcus Müller

Using Miller (mlr), a CSV-aware multi-purpose processing utility for various structured document formats, to clean up the whitespace of all fields:

$ cat file
"host1","host1","linux
server",""
"host2","host2","linux server",""
$ mlr --csv -N clean-whitespace file
host1,host1,linux server,
host2,host2,linux server,

This reads the data in file as header-less CSV records and applies the clean-whitespace operation to each. The clean-whitespace operation trims flanking whitespace from each field’s value and combines consecutive whitespace characters into single spaces.
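If mlr isn’t available, the same trim-and-collapse behavior can be approximated with Python’s csv module. A minimal sketch, with the sample data inlined rather than read from a file:

```python
import csv
import io

# Inlined copy of the sample file, with an embedded newline in field 3.
raw = '"host1","host1","linux\nserver",""\n"host2","host2","linux server",""\n'

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
for row in csv.reader(io.StringIO(raw)):
    # str.split() with no argument splits on any whitespace run, so
    # joining with single spaces both trims and collapses each field.
    writer.writerow(" ".join(field.split()) for field in row)

print(out.getvalue(), end="")
```

Like mlr’s default output, the csv writer here only quotes fields that actually require it.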

To instead only replace newlines with spaces, you may iterate over the fields with a short put expression:

$ mlr --csv -N put 'for (k,v in $*) { $[k] = gssub(v, "\n", " ") }' file
host1,host1,linux server,
host2,host2,linux server,

The gssub() function acts like gsub() in Awk, but does not treat its query argument like a regular expression (Miller also has gsub()).
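Python draws a similar distinction, for comparison: str.replace() substitutes literally like gssub(), while re.sub() treats its pattern as a regular expression like gsub(). A small sketch:

```python
import re

value = "linux\nserver (x86.64)"

# Literal replacement, like Miller's gssub(): no regex interpretation.
print(value.replace("\n", " "))          # linux server (x86.64)

# Regex replacement, like gsub(): "." is a metacharacter and must be
# escaped, or it would match every character in the string.
print(re.sub(r"\.", "_", value))
```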

If you feel you need to have the fields quoted even though it’s not strictly needed (Miller adds quotes automatically if a field’s value requires it), then use mlr with its --quote-all option:

$ mlr --csv -N --quote-all clean-whitespace file
"host1","host1","linux server",""
"host2","host2","linux server",""
$ mlr --csv -N --quote-all put 'for (k,v in $*) { $[k] = gssub(v, "\n", " ") }' file
"host1","host1","linux server",""
"host2","host2","linux server",""
Answered By: Kusalananda

The last thing you want to do is try to do this in bash. See Why is using a shell loop to process text considered bad practice?.

Now, if what you want can be expressed as "remove any newline characters unless they come right after a " character", you could do something like this:

perl -pe 's/(?<!")\n/ /g' file

The (?<!")\n matches any newline character that is NOT preceded by a ". So given an example input like this:

$ cat file
"host0","host0","linux
server",""
"host1","host1","linux
server
centos",""
"host2","host2","linux server",""

The command above gives:

$ perl -pe 's/(?<!")\n/ /g' file
"host0","host0","linux server",""
"host1","host1","linux server centos",""
"host2","host2","linux server",""
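The same negative lookbehind carries over unchanged to Python’s re module, if Perl isn’t at hand; a sketch with the sample data inlined:

```python
import re

text = (
    '"host0","host0","linux\nserver",""\n'
    '"host1","host1","linux\nserver\ncentos",""\n'
    '"host2","host2","linux server",""\n'
)

# Replace any newline that is NOT immediately preceded by a double quote.
fixed = re.sub(r'(?<!")\n', " ", text)
print(fixed, end="")
```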

But, really, mlr is the best approach.

Answered By: terdon

We assume that despite many unwanted line breaks all double quotes match and all field separators are present. In that case you could use this command:

$ sed '/^"/! s/^/ /' infile | tr -d '\n' | sed '-e s/"/"\n/'{8..1000..8}
"host1","host1","linux server centos",""
"host2","host2","linux server",""

where:

$ cat infile
"host1","host1","linux
server
centos",""
"host2","host2","linux server",""

This should work regardless of where unwanted line breaks may occur.

The number 1000 inside the braces is an arbitrarily large number; it must be bigger than the total number of characters in the input file.

If you suspect your input lines may contain leading/trailing spaces, then remove them first, for example with awk 'NF{$1=$1}1' infile
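For comparison, the same two-step idea (pad continuation lines, join everything, then cut after every 8th double quote) can be sketched in Python, with the sample data inlined:

```python
text = '"host1","host1","linux\nserver\ncentos",""\n"host2","host2","linux server",""\n'

# Step 1: prepend a space to continuation lines (those not starting
# with a double quote), then join all lines together.
joined = "".join(
    line if line.startswith('"') else " " + line
    for line in text.splitlines()
)

# Step 2: re-insert a newline after every 8th quote (4 fields x 2 quotes).
out, quotes = [], 0
for ch in joined:
    out.append(ch)
    if ch == '"':
        quotes += 1
        if quotes % 8 == 0:
            out.append("\n")
result = "".join(out)
print(result, end="")
```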

Answered By: user9101329

Using Raku (formerly known as Perl_6)

Inspired by @terdon’s excellent Perl answer:

~$ raku -ne '/ <!after " > $/ ?? print "$_ " !! put $_;'  file

Here are answers written in Raku (a.k.a. Perl6). Raku has a new Unicode-aware regex engine, which attempts to clean up some well-known idioms. So (for example), the "Y not-after X" negative-lookbehind idiom becomes <!after X > Y in Raku, wherein <?after … > denotes a positive lookbehind and <!after … > denotes a negative lookbehind.

Because Raku regularizes line-terminator handling (auto-chomping the \n newline is now the default), we simply detect the pattern using Raku’s ternary operator Test ?? True !! False, and then output using print (which does not add a \n newline terminator) or put (which adds a \n newline terminator at the end of the text).


Sample Input:

"host0","host0","linux
server",""
"host1","host1","linux
server
centos",""
"host2","host2","linux server",""

Sample Output:

"host0","host0","linux server",""
"host1","host1","linux server centos",""
"host2","host2","linux server",""


Other Raku solutions

Using Raku’s Text::CSV module:

Using an appropriate CSV-parser (module) from the Raku ecosystem is fairly easy (see: https://raku.land/?q=CSV ). These check for RFC 4180 compliance and can give you standardized CSV output, as well as a host of customizations.

Below, Raku’s Text::CSV module parses the OP’s input just fine and, after the \n newlines are removed, will by default double-quote only those columns that contain internal whitespace (first answer). The second answer reads the file linewise and produces the same output as the first:

Reading entire file into memory, default csv() output:

~$ raku -MText::CSV -e 'my @a = csv(in => "/path/to/file", sep => ",");
                        @a = @a>>.map( *.trans: "\n" => " ");
                        csv(in => @a, out => $*OUT, sep => ",");'
host0,host0,"linux server",
host1,host1,"linux server centos",
host2,host2,"linux server",

Reading linewise, then "manually"-quote output:

~$ raku -MText::CSV -e 'my $fh = "/path/to/file";  my $io = open $fh, :r, :!chomp; 
                        my $csv = Text::CSV.new;  my @data;
                        while $csv.getline($io) -> $row {
                            @data.push: $row.map: *.trans: "\n" => " "; };
                        put $_.join(",") for @data>>.map({ / s / ?? (q["] ~ $_ ~ q["]) !! $_ });'
host0,host0,"linux server",
host1,host1,"linux server centos",
host2,host2,"linux server",

https://docs.raku.org/language/operators#infix_??_!!
https://github.com/Tux/CSV/blob/master/doc/Text-CSV.md#embedded-newlines
https://raku.org

Answered By: jubilatious1

Using any awk:

$ awk 'BEGIN{RS=ORS="\""} !(NR%2){gsub(/\n/," ")} 1' file.csv
"host1","host1","linux server",""
"host2","host2","linux server",""

If your CSV is actually more complex than you showed in some way that makes the above not work for you (though I can’t think of any way it could be right now), then see whats-the-most-robust-way-to-efficiently-parse-csv-using-awk if you want to use awk for this.
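The trick works because RS="\"" splits the input into chunks that alternate between outside-quotes (odd record numbers) and inside-quotes (even record numbers), and only the inside chunks get their newlines replaced. The same idea in Python, as a sketch:

```python
text = '"host1","host1","linux\nserver",""\n"host2","host2","linux server",""\n'

# Splitting on '"' yields alternating chunks: even indices lie outside
# quotes, odd indices inside (the equivalent of awk's !(NR%2) test).
parts = text.split('"')
for i in range(1, len(parts), 2):
    parts[i] = parts[i].replace("\n", " ")
fixed = '"'.join(parts)
print(fixed, end="")
```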

Answered By: Ed Morton

Using TXR:

$ txr fix.txr data
"host1","host1","linuxserver",""
"host2","host2","linux server",""
"host1","host1","linuxservercentos",""
"host2","host2","linux server",""

Code in fix.txr:

@(repeat)
@  (freeform "")
"@a","@b","@c","@d"
@  (do (put-line `"@a","@b","@c","@d"`))
@(end)

The freeform directive causes the following horizontal pattern to be interpreted against a fictitious version of the input in which lines are combined into one logical line. The string argument "" overrides the default separator, which is "\n" (newline).


Answered By: Kaz