Converting CSV to TSV

I have a number of large CSV files and would like them in TSV (tab separated format). The complication is that there are commas in the fields of the CSV file, eg:

 A,,C,"D,E,F","G",I,"K,L,M",Z

Expected output:

 A      C   D,E,F   G   I   K,L,M   Z

(where whitespace in between are ‘hard’ tabs)

I have Perl, Python, and coreutils installed on this server.

Asked By: user14755

||

One option might be perl’s Text::CSV module e.g.

perl -MText::CSV -lne 'BEGIN { $csv = Text::CSV->new() }
  print join "t", $csv->fields() if $csv->parse($_)
' somefile

to demonstrate

echo 'A,,C,"D,E,F","G",I,"K,L,M",Z' |
  perl -MText::CSV -lne 'BEGIN { $csv = Text::CSV->new() }
  print join "t", $csv->fields() if $csv->parse($_)
'
A       C   D,E,F   G   I   K,L,M   Z
Answered By: steeldriver

Python

Add to file named csv2tab, and make it executable

touch csv2tab && chmod u+x csv2tab

Add to it

#!/usr/bin/env python
import csv, sys
csv.writer(sys.stdout, dialect='excel-tab').writerows(csv.reader(sys.stdin))

Test runs

$ echo 'A,,C,"D,E,F","G",I,"K,L,M",Z' | ./csv2tab                     
A       C   D,E,F   G   I   K,L,M   Z

$ ./csv2tab < data.csv > data.tsv && head data.tsv                                                   
1A      C   D,E,F   G   I   K,L,M   Z
2A      C   D,E,F   G   I   K,L,M   Z
3A      C   D,E,F   G   I   K,L,M   Z
Answered By: OneCricketeer

Perl

perl -lne '
   my $re = qr/,(?=(?:[^"]*"[^"]*")*(?![^"]*"))/;
   print join "t", map { s/(?<!\)"//gr =~ s/\"/"/gr } split $re;
'

Awk

awk -v Q=" -v FPAT="([^,]*)|("[^"]+")" -v OFS="t" '{
   for (i=1; i<=NF; ++i)
      if ( substr($i, 1, 1) == Q )
         $i = substr($i, 2, length($i) - 2)
   print $1, $2, $3, $4, $5, $6, $7, $8
}'

Result:

A               C       D,E,F   G       I       K,L,M   Z
Answered By: user218374

For fun, sed.

sed -E 's/("([^"]*)")?,/2t/g' file

If your sed doesn’t support -E, try with -r. If your sed doesn’t support t for a literal tab, try putting a literal tab (in many shells, ctrlv tab) or in Bash, use a $'...' C-style string (in which case the backslash in 2 needs to be doubled). If you want to keep the quotes, use 1 instead of 2 (in which case the inner pair of parentheses is useless, and can be removed).

If your sed doesn’t support either -E or -r, try

sed 's/("([^"]*)")?,/2t/g' file

again possibly with the tweaks suggested above if t is not supported.

For additional fun, here’s the same thing with the Bash "here-string" syntax, just to demonstrate what it looks like. Notice how the literal backslashes we want sed to receive are now doubled:

sed $'s/\("\([^"]*\)"\)\?,/\2t/g' file

This makes no attempt to handle escaped double quotes inside double quotes; some CSV dialects support this by doubling the quoted double quote (sic).

Answered By: tripleee

Using csvkit (Python), for example:

$ csvformat -T in.csv > out.txt

Does streaming, with correct CSV and TSV quoting and escaping

It’s in apt and other package managers

Answered By: Neil McGuigan

The thermonuclear flyswatter solution must be using libreoffice. While https://ask.libreoffice.org/en/question/19042/is-is-possible-to-convert-comma-separated-value-csv-to-tab-separated-value-tsv-via-headless-mode/ suggests this is not possible but it is wrong (or just outdated?) and the following command works on my 5.3.:

loffice "-env:UserInstallation=file:///tmp/LibO_Conversion" --convert-to csv:"Text - txt - csv (StarCalc)":9,34,UTF8 --headless --outdir some/path --infilter='csv:44,34,UTF8' *.csv

the env argument could be skipped but this way the documents won’t appear in your recent document.

Answered By: chx

Vim

Just for fun, regex substitutions can be performed in Vim. Here’s a potential four line solution, adapted from: https://stackoverflow.com/questions/33332871/remove-all-commas-between-quotes-with-a-vim-regex

  1. Commas between quotes are first changed to underscores (or other absent character),
  2. All other commas are replaced with tabs,
  3. Underscores inside quotes are restored to commas,
  4. Quotation marks are removed.

    :%s/".{-}"/=substitute(submatch(0), ',', '_' , 'g')/g
    :%s/,/t/g
    :%s/_/,/g
    :%s/"//g
    

To script the solution somewhat, the four lines above (sans leading colon) can be saved to a file, e.g. to_tsv.vim. Open each CSV for editing with Vim and source the to_tsv.vim script on the Vim command line (adapted from https://stackoverflow.com/questions/3374179/run-vim-script-from-vim-commandline/8806874#8806874):

    :source /path/to/vim/filename/to_tsv.vim
Answered By: jubilatious1

I authored an open-source CSV to TSV converter that handles the transformations described. It’s quite fast, may be worth a look if there’s an on-going need to convert large CSV files. Tool is part of eBay’s TSV utilities toolkit (csv2tsv documentation here). Default options suffice for the input described:

$ csv2tsv file.csv > file.tsv

A consideration when converting CSV to TSV is handling of field and record delimiters (comma and newline) in the data. CSV uses an escape syntax. If the goal is to use the output with Unix tools like cut, awk, etc., the output needs to be free of escapes. Most solutions listed here produce CSV style escapes when delimiters are in the data. csv2tsv is differentiated from other solutions in that it produces TSV without escapes. See the documentation for details.

To see what a particular solution does, convert a CSV containing commas, tabs, quotes, and newlines in the data. For example:

$ echo $'Line,Field1,Field2n1,"Comma: |,|","Quote: |""|"n"2","TAB: |t|","Newline: |n|"' | <conversion-script-or-command>

Solutions generating escapes will put double quotes around the fields containing quotes, newlines, or tabs.

Answered By: JonDeg

The following is simply a correction to the answer from @tripleee
so that it strips any quotes from the final field just as it does to all the other fields.

To show what’s being corrected, below is a tripleee‘s answer, plus a slight modification to the OP’s example data with added quotes around the final ‘Z‘ field.

echo 'A,,C,"D,E,F","G",I,"K,L,M","Z"' |  sed -r -e 's/("([^"]*)")?,/2t/g'
A       C   D,E,F   G   I   K,L,M   "Z"

You can see that ‘Z‘ is left with quotes around it. This is different to how the inner fields are handled. For example, the ‘G‘ does not have quotes on it.

The following command uses a second substitution to clean the final column:

echo 'A,,C,"D,E,F","G",I,"K,L,M","Z"' |  sed -r -e 's/("([^"]*)")?,/2t/g' 
                                                -e 's/t"([^"]*)"$/t1/'
A       C   D,E,F   G   I   K,L,M   Z
Answered By: Fonnae

If you have, or can install, the csvtool utility:

csvtool -t COMMA -u TAB cat in.csv > out.ctv

Note that for some reason csvtool doesn’t have a man page, but csvtool --help will print a couple hundred lines of documentation.

Answered By: Keith Thompson

Using mlr is almost succinct, but disabling headers requires long options:

mlr --c2t --implicit-csv-header --headerless-csv-output cat file.csv 

Output:

A       C   D,E,F   G   I   K,L,M   Z
Answered By: agc

Here is the example of converting CSV into TSV using jq utility:

$ jq -rn '@tsv "(["A","","C","D,E,F","G","I","K,L,M","Z"])"'
A       C   D,E,F   G   I   K,L,M   Z

or:

$ echo '["A","","C","D,E,F","G","I","K,L,M","Z"]' | jq -r @tsv
A       C   D,E,F   G   I   K,L,M   Z

However the CSV format needs to be well formatted, so each string needs to be quoted.

Source: Simple TSV output format.

Answered By: kenorb

With perl, assuming the csv fields have no embedded " or newlines or tabs:

perl -pe 's{"(.*?)"|,}{$1//"t"}ge'
Answered By: Stéphane Chazelas

sed solution (with requirements): One really should also handle literal double-quotes, which are represented by a pair of double-quotes within a CSV field.

First change all such double-quote literals to a stand-in flag; then delete all field delimiters (solitary double-quotes), and convert field separators (commas) to tabs; then change the stand-in flags to double-quotes.

Here I’m using v (vertical tab) as the stand-in flag for embedded double-quotes:

sed -r 's/""/v/g; s/("([^"]+)")?,/2t/g; s/v/"/g'

This solution relies on your CSV data not containing vertical-tab characters, and on empty CSV fields not being quoted.

Answered By: RashaMatt

After gem install csv:

$ time </tmp/a ruby -rcsv -e'puts CSV.parse($<).map{|x|x*"t"}'>/dev/null
0.307
$ time ruby -rcsv -e'CSV.foreach("/tmp/a"){|x|puts x*"t"}'>/dev/null
0.367
$ time </tmp/a ruby -rcsv -ne'puts CSV.parse($_)*"t"'>/dev/null
2.456
Answered By: nisetama

Parsing context sensitive formats is indeed a constant annoyance esp. when csv quoting is inconsistent and even irregular. Here is some straightforward awk that anyone ought to be able to use/revise. It’s not as clever as terse awk, but perhaps a good building block that is easily tweaked. Yes, I know it could be shortened, but I like reusable awk to have a bit of slack.

{
  mark[++nmark] = 0
  thislength = length($0)
  for (i=1; i<=thislength; i++) {
    thisc = substr($0,i,1)
    if (thisc == "\") { i++; continue } # skip x for any x
    if (thisc == """) qflag = 1-qflag
    if (thisc == "," && !qflag) mark[++nmark] = i
  }
  mark[++nmark] = thislength+1
  for (i=1; i<nmark; i++) {
    thiso = outfield[++noutfield] = substr($0, mark[i]+1, mark[i+1]-mark[i]-1)
    if (thiso ~ /^".*"$/) {
      sub(/"$/,"",outfield[noutfield])
      sub(/^"/,"",outfield[noutfield])
    }
  }
  for (i=1; i<noutfield; i++) printf "%s", outfield[i]"t"
  print outfield[noutfield]
}
Answered By: user12283974
Categories: Answers Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.