Converting CSV to TSV
I have a number of large CSV files and would like them in TSV (tab separated format). The complication is that there are commas in the fields of the CSV file, eg:
A,,C,"D,E,F","G",I,"K,L,M",Z
Expected output:
A C D,E,F G I K,L,M Z
(where whitespace in between are ‘hard’ tabs)
I have Perl, Python, and coreutils installed on this server.
One option might be perl’s Text::CSV module e.g.
perl -MText::CSV -lne 'BEGIN { $csv = Text::CSV->new() }
print join "t", $csv->fields() if $csv->parse($_)
' somefile
to demonstrate
echo 'A,,C,"D,E,F","G",I,"K,L,M",Z' |
perl -MText::CSV -lne 'BEGIN { $csv = Text::CSV->new() }
print join "t", $csv->fields() if $csv->parse($_)
'
A C D,E,F G I K,L,M Z
Python
Add to file named csv2tab
, and make it executable
touch csv2tab && chmod u+x csv2tab
Add to it
#!/usr/bin/env python
import csv, sys
csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))
Test runs
$ echo 'A,,C,"D,E,F","G",I,"K,L,M",Z' | ./csv2tab
A C D,E,F G I K,L,M Z
$ ./csv2tab < data.csv > data.tsv && head data.tsv
1A C D,E,F G I K,L,M Z
2A C D,E,F G I K,L,M Z
3A C D,E,F G I K,L,M Z
Perl
perl -lne '
my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/;
print join "t", map { s/(?<!\)"//gr =~ s/\"/"/gr } split $re;
'
Awk
awk -v Q=" -v FPAT="([^,]*)|("[^"]+")" -v OFS="t" '{
for (i=1; i<=NF; ++i)
if ( substr($i, 1, 1) == Q )
$i = substr($i, 2, length($i) - 2)
print $1, $2, $3, $4, $5, $6, $7, $8
}'
Result:
A C D,E,F G I K,L,M Z
For fun, sed
.
sed -E 's/("([^"]*)")?,/2t/g' file
If your sed
doesn’t support -E
, try with -r
. If your sed
doesn’t support t
for a literal tab, try putting a literal tab (in many shells, ctrl–v tab) or in Bash, use a $'...'
C-style string (in which case the backslash in 2
needs to be doubled). If you want to keep the quotes, use 1
instead of 2
(in which case the inner pair of parentheses is useless, and can be removed).
If your sed
doesn’t support either -E
or -r
, try
sed 's/("([^"]*)")?,/2t/g' file
again possibly with the tweaks suggested above if t
is not supported.
For additional fun, here’s the same thing with the Bash "here-string" syntax, just to demonstrate what it looks like. Notice how the literal backslashes we want sed
to receive are now doubled:
sed $'s/\("\([^"]*\)"\)\?,/\2t/g' file
This makes no attempt to handle escaped double quotes inside double quotes; some CSV dialects support this by doubling the quoted double quote (sic).
Using csvkit
(Python), for example:
$ csvformat -T in.csv > out.txt
Does streaming, with correct CSV and TSV quoting and escaping
It’s in apt and other package managers
The thermonuclear flyswatter solution must be using libreoffice. While https://ask.libreoffice.org/en/question/19042/is-is-possible-to-convert-comma-separated-value-csv-to-tab-separated-value-tsv-via-headless-mode/ suggests this is not possible but it is wrong (or just outdated?) and the following command works on my 5.3.:
loffice "-env:UserInstallation=file:///tmp/LibO_Conversion" --convert-to csv:"Text - txt - csv (StarCalc)":9,34,UTF8 --headless --outdir some/path --infilter='csv:44,34,UTF8' *.csv
the env
argument could be skipped but this way the documents won’t appear in your recent document.
Vim
Just for fun, regex substitutions can be performed in Vim. Here’s a potential four line solution, adapted from: https://stackoverflow.com/questions/33332871/remove-all-commas-between-quotes-with-a-vim-regex
- Commas between quotes are first changed to underscores (or other absent character),
- All other commas are replaced with tabs,
- Underscores inside quotes are restored to commas,
-
Quotation marks are removed.
:%s/".{-}"/=substitute(submatch(0), ',', '_' , 'g')/g :%s/,/t/g :%s/_/,/g :%s/"//g
To script the solution somewhat, the four lines above (sans leading colon) can be saved to a file, e.g. to_tsv.vim
. Open each CSV for editing with Vim and source
the to_tsv.vim
script on the Vim command line (adapted from https://stackoverflow.com/questions/3374179/run-vim-script-from-vim-commandline/8806874#8806874):
:source /path/to/vim/filename/to_tsv.vim
I authored an open-source CSV to TSV converter that handles the transformations described. It’s quite fast, may be worth a look if there’s an on-going need to convert large CSV files. Tool is part of eBay’s TSV utilities toolkit (csv2tsv documentation here). Default options suffice for the input described:
$ csv2tsv file.csv > file.tsv
A consideration when converting CSV to TSV is handling of field and record delimiters (comma and newline) in the data. CSV uses an escape syntax. If the goal is to use the output with Unix tools like cut
, awk
, etc., the output needs to be free of escapes. Most solutions listed here produce CSV style escapes when delimiters are in the data. csv2tsv
is differentiated from other solutions in that it produces TSV without escapes. See the documentation for details.
To see what a particular solution does, convert a CSV containing commas, tabs, quotes, and newlines in the data. For example:
$ echo $'Line,Field1,Field2n1,"Comma: |,|","Quote: |""|"n"2","TAB: |t|","Newline: |n|"' | <conversion-script-or-command>
Solutions generating escapes will put double quotes around the fields containing quotes, newlines, or tabs.
The following is simply a correction to the answer from @tripleee
so that it strips any quotes from the final field just as it does to all the other fields.
To show what’s being corrected, below is a tripleee‘s answer, plus a slight modification to the OP’s example data with added quotes around the final ‘Z‘ field.
echo 'A,,C,"D,E,F","G",I,"K,L,M","Z"' | sed -r -e 's/("([^"]*)")?,/2t/g'
A C D,E,F G I K,L,M "Z"
You can see that ‘Z‘ is left with quotes around it. This is different to how the inner fields are handled. For example, the ‘G‘ does not have quotes on it.
The following command uses a second substitution to clean the final column:
echo 'A,,C,"D,E,F","G",I,"K,L,M","Z"' | sed -r -e 's/("([^"]*)")?,/2t/g'
-e 's/t"([^"]*)"$/t1/'
A C D,E,F G I K,L,M Z
If you have, or can install, the csvtool
utility:
csvtool -t COMMA -u TAB cat in.csv > out.ctv
Note that for some reason csvtool
doesn’t have a man page, but csvtool --help
will print a couple hundred lines of documentation.
Using mlr
is almost succinct, but disabling headers requires long options:
mlr --c2t --implicit-csv-header --headerless-csv-output cat file.csv
Output:
A C D,E,F G I K,L,M Z
Here is the example of converting CSV into TSV using jq
utility:
$ jq -rn '@tsv "(["A","","C","D,E,F","G","I","K,L,M","Z"])"'
A C D,E,F G I K,L,M Z
or:
$ echo '["A","","C","D,E,F","G","I","K,L,M","Z"]' | jq -r @tsv
A C D,E,F G I K,L,M Z
However the CSV format needs to be well formatted, so each string needs to be quoted.
Source: Simple TSV output format.
With perl
, assuming the csv fields have no embedded "
or newlines or tabs:
perl -pe 's{"(.*?)"|,}{$1//"t"}ge'
sed
solution (with requirements): One really should also handle literal double-quotes, which are represented by a pair of double-quotes within a CSV field.
First change all such double-quote literals to a stand-in flag; then delete all field delimiters (solitary double-quotes), and convert field separators (commas) to tabs; then change the stand-in flags to double-quotes.
Here I’m using v
(vertical tab) as the stand-in flag for embedded double-quotes:
sed -r 's/""/v/g; s/("([^"]+)")?,/2t/g; s/v/"/g'
This solution relies on your CSV data not containing vertical-tab characters, and on empty CSV fields not being quoted.
After gem install csv
:
$ time </tmp/a ruby -rcsv -e'puts CSV.parse($<).map{|x|x*"t"}'>/dev/null
0.307
$ time ruby -rcsv -e'CSV.foreach("/tmp/a"){|x|puts x*"t"}'>/dev/null
0.367
$ time </tmp/a ruby -rcsv -ne'puts CSV.parse($_)*"t"'>/dev/null
2.456
Parsing context sensitive formats is indeed a constant annoyance esp. when csv quoting is inconsistent and even irregular. Here is some straightforward awk that anyone ought to be able to use/revise. It’s not as clever as terse awk, but perhaps a good building block that is easily tweaked. Yes, I know it could be shortened, but I like reusable awk to have a bit of slack.
{
mark[++nmark] = 0
thislength = length($0)
for (i=1; i<=thislength; i++) {
thisc = substr($0,i,1)
if (thisc == "\") { i++; continue } # skip x for any x
if (thisc == """) qflag = 1-qflag
if (thisc == "," && !qflag) mark[++nmark] = i
}
mark[++nmark] = thislength+1
for (i=1; i<nmark; i++) {
thiso = outfield[++noutfield] = substr($0, mark[i]+1, mark[i+1]-mark[i]-1)
if (thiso ~ /^".*"$/) {
sub(/"$/,"",outfield[noutfield])
sub(/^"/,"",outfield[noutfield])
}
}
for (i=1; i<noutfield; i++) printf "%s", outfield[i]"t"
print outfield[noutfield]
}