Retrieve the 1st and 5th columns of a tab-separated file, converting the spaces in the 5th to tabs

I have a TSV file with tab-separated columns. I want to keep the 1st column and the 5th column, which holds space-separated values, convert the space separation to tab separation, and save the result as a new file.

Attempt:

cut -d"\t" -f"4" input.tsv
awk -v OFS="\t" '$1=$1' input.tsv > output.tsv

Input:

Composite_Element_REF   Gene_Symbol     Chromosome      Genomic_Coordinate      TCGA-KL-8323-01A-21D-2312-05 TCGA-KL-8324-01A-11D-2312-05 TCGA-KL-8325-01A-11D-2312-05 
cg00000027      RBL0    14      53468110        0.0545368833399913 0.635089208882213 0.0581022991274144
cg00000028      RBL1    15      53468111        0.0545366588241415 0.635089205024173 0.0581085373336217
cg00000029      RBL2    16      53468112        0.0545366588040571 0.635089205078394 0.0581085373332275

Expected output:

Composite_Element_REF   TCGA-KL-8323-01A-21D-2312-05    TCGA-KL-8324-01A-11D-2312-05    TCGA-KL-8325-01A-11D-2312-05
cg00000027     0.0545368833399913    0.635089208882213    0.0581022991274144
cg00000028    0.0545366588241415    0.635089205024173    0.0581085373336217
cg00000029    0.0545366588040571    0.635089205078394    0.0581085373332275
Asked By: Anon

||

If what you want is the 1st and 5th tab-delimited fields from some input file input.tsv, with the embedded spaces in those fields changed into tabs, you can do that with cut and tr like so:

cut -f 1,5 input.tsv | tr ' ' '\t' >output.tsv

This first extracts the 1st and 5th fields from the input, then changes all space characters to tabs and writes the result to a file. Note that if the 1st field is supposed to remain unchanged, this assumes it contains no spaces.

Since tab is the default delimiter for cut, we don’t need to use the -d option.
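To illustrate on a small made-up sample (piped in here rather than read from a file), the pipeline does exactly that:

```shell
# Made-up two-line sample: tab-separated fields, spaces inside field 5.
printf 'h1\th2\th3\th4\ta b c\nr1\tx\ty\tz\t0.1 0.2 0.3\n' |
  cut -f 1,5 |   # keep the 1st and 5th tab-delimited fields
  tr ' ' '\t'    # turn the embedded spaces into tabs
```

This prints h1, a, b, c on one tab-separated line and r1, 0.1, 0.2, 0.3 on the next.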

Alternatively, with awk:

awk 'BEGIN { OFS=FS="\t" }
    {
        nf = split($5, a, " ")
        $0 = $1

        for (i = 1; i <= nf; ++i) $(NF+1) = a[i]

        print
    }' input.tsv >output.tsv

This splits the 5th tab-delimited field on spaces and then replaces the whole record with just the 1st field. The split-up values are appended after the first field in the loop, before the resulting record is printed.
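The same split-and-rebuild can be sketched on inline sample data (the values here are made up for illustration):

```shell
# Split field 5 on spaces, shrink the record to field 1, then
# append the pieces as new fields, joined by the tab OFS.
printf 'h1\th2\th3\th4\ta b c\nr1\tx\ty\tz\t0.1 0.2 0.3\n' |
  awk 'BEGIN { OFS = FS = "\t" }
       {
           nf = split($5, a, " ")   # break field 5 on spaces
           $0 = $1                  # reduce the record to field 1
           for (i = 1; i <= nf; ++i) $(NF+1) = a[i]   # append the pieces
           print
       }'
```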

With Miller (mlr), we can read the input as header-less TSV (header-less, so that the header line is simply treated as data), cut out the fields we’re interested in, and recreate each record just as we did with awk, only shorter:

mlr --tsv -N put '$* = {1:$1, 2:splita($5, " ")}' input.tsv >output.tsv

The output from each of these commands would be

Composite_Element_REF   TCGA-KL-8323-01A-21D-2312-05    TCGA-KL-8324-01A-11D-2312-05    TCGA-KL-8325-01A-11D-2312-05
cg00000027      0.0545368833399913      0.635089208882213       0.0581022991274144
cg00000028      0.0545366588241415      0.635089205024173       0.0581085373336217
cg00000029      0.0545366588040571      0.635089205078394       0.0581085373332275

Note that I’m assuming that you want the header for the initial field, and that the lone -05 on the second line of the input is a typo and should be attached to the end of the final field of the header line.

Answered By: Kusalananda

Using Raku (formerly known as Perl_6)

~$ raku -ne 'put join "\t", .words.[0,4..*];'  file

Above is an answer coded in Raku, a member of the Perl family of programming languages. Here I make the same assumption as @Kusalananda: that "the lone -05 on the second line of the input is a typo and should be attached to the end of the final field of the header line."

Briefly, the file is read linewise with the -ne non-autoprinting flags. Raku’s .words routine takes the $_ topic (the input line) and breaks it on whitespace into elements. The square brackets [0,4..*] select the desired zero-indexed elements. These are joined back together with "\t" tab characters, giving the desired output.

Sample Input:

Composite_Element_REF   Gene_Symbol     Chromosome      Genomic_Coordinate      TCGA-KL-8323-01A-21D-2312-05 TCGA-KL-8324-01A-11D-2312-05 TCGA-KL-8325-01A-11D-2312-05 
cg00000027      RBL0    14      53468110        0.0545368833399913 0.635089208882213 0.0581022991274144
cg00000028      RBL1    15      53468111        0.0545366588241415 0.635089205024173 0.0581085373336217
cg00000029      RBL2    16      53468112        0.0545366588040571 0.635089205078394 0.0581085373332275

Sample Output:

Composite_Element_REF   TCGA-KL-8323-01A-21D-2312-05    TCGA-KL-8324-01A-11D-2312-05    TCGA-KL-8325-01A-11D-2312-05
cg00000027  0.0545368833399913  0.635089208882213   0.0581022991274144
cg00000028  0.0545366588241415  0.635089205024173   0.0581085373336217
cg00000029  0.0545366588040571  0.635089205078394   0.0581085373332275

It’s fairly easy to deal with "short" lines; just add a conditional:

~$ raku -ne 'put join "\t", .words.[0,4..*] if .words.elems == 7;'  file

Finally, if you truly have a file with two different separators, you can handle them individually and consecutively:

~$ raku -ne '.split("\t", 5).[0,4].split(" ").join("\t").put;'   file

The final two solutions give the same "Sample Output" as above.

https://raku.org

Answered By: jubilatious1

Using awk:

$ awk -F '\t' '{gsub(OFS,FS,$5); print $1 FS $5}' file
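Since this compact answer comes without a write-up: with -F '\t' the field separator is a tab while OFS stays at its default single space, so gsub(OFS, FS, $5) rewrites every space inside the 5th field as a tab. Demonstrated on made-up inline data:

```shell
# gsub(OFS, FS, $5): OFS is the default single space, FS is "\t",
# so every space inside field 5 becomes a tab; the 1st field and the
# rewritten 5th field are then printed with a tab (FS) between them.
printf 'r1\tx\ty\tz\t0.1 0.2 0.3\n' |
  awk -F '\t' '{gsub(OFS, FS, $5); print $1 FS $5}'
```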
Answered By: Prabhjot Singh

FWIW, in reality I’d use @PrabhjotSingh’s solution, but for some alternatives that may provide useful information on manipulating fields in general, read on…

To get the output you show from the input you show all you need is:

$ awk -v OFS='\t' '{print $1, $5, $6, $7}' file
Composite_Element_REF   TCGA-KL-8323-01A-21D-2312-05    TCGA-KL-8324-01A-11D-2312-05    TCGA-KL-8325-01A-11D-2312-05
cg00000027      0.0545368833399913      0.635089208882213       0.0581022991274144
cg00000028      0.0545366588241415      0.635089205024173       0.0581085373336217
cg00000029      0.0545366588040571      0.635089205078394       0.0581085373332275

or, since I see you attempted using $1=$1 in your code, you could do:

$ awk -v OFS='\t' '{$2=$3=$4=""; $0=$0; $1=$1} 1' file
Composite_Element_REF   TCGA-KL-8323-01A-21D-2312-05    TCGA-KL-8324-01A-11D-2312-05    TCGA-KL-8325-01A-11D-2312-05
cg00000027      0.0545368833399913      0.635089208882213       0.0581022991274144
cg00000028      0.0545366588241415      0.635089205024173       0.0581085373336217
cg00000029      0.0545366588040571      0.635089205078394       0.0581085373332275

where:

  • $2=$3=$4="" sets those fields to null and reconstructs $0. Those fields still exist but are empty, and every run of blanks and tabs between fields (which matches the default FS we are using) is replaced by a single tab (the OFS we are using), so the blanks in the original 5th column become tabs as it gets separated into 3 fields, $5, $6, and $7.
  • $0=$0 re-splits $0 into fields but doesn’t change the white space between them, so those 3 empty fields are now gone while the multiple tabs between $1 and the original 5th column (now $2, $3, and $4) still exist.
  • $1=$1 sets $1 to itself, which again reconstructs $0 from its fields; now each run of contiguous tabs (which again matches the default FS we are using) is replaced by a single tab (the OFS we are using), resulting in the final output of 4 tab-separated fields.

You can see those different stages below, note the changes in spacing between fields and the change in the value of NF printed at the start of each line:

$ awk -v OFS='\t' '{print NF ": " $0}' < <(printf 'a\tb\tc\td\te f g\n')
7: a    b       c       d       e f g

$ awk -v OFS='\t' '{$2=$3=$4=""; print NF ": " $0}' < <(printf 'a\tb\tc\td\te f g\n')
7: a                            e       f       g

$ awk -v OFS='\t' '{$2=$3=$4=""; $0=$0; print NF ": " $0}' < <(printf 'a\tb\tc\td\te f g\n')
4: a                            e       f       g

$ awk -v OFS='\t' '{$2=$3=$4=""; $0=$0; $1=$1; print NF ": " $0}' < <(printf 'a\tb\tc\td\te f g\n')
4: a    e       f       g

Alternatively, in some awks, e.g. GNU awk, you could do either of these:

awk -v OFS='\t' '{$2=$5; $3=$6; $4=$7; NF=4} 1' file
awk -v OFS='\t' '{for (i=2; i<=4; i++) $i=$(i+3); NF=i-1} 1' file

where we’re shifting the values of $5 and later left by 3 positions and then setting NF to 4 to discard the original $5 onward. Note, though, that assigning to NF is undefined behavior per POSIX, so different awks may honor it, ignore it, or do something else entirely.
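A quick sanity check of the first variant on inline data (made up here), in an awk such as GNU awk or mawk that honors the NF assignment:

```shell
# Shift $5..$7 into $2..$4, then truncate the record to 4 fields.
# Relies on the awk rebuilding $0 when NF is assigned (GNU awk and
# mawk both do; POSIX leaves this undefined).
printf 'a\tb\tc\td\te f g\n' |
  awk -v OFS='\t' '{$2=$5; $3=$6; $4=$7; NF=4} 1'
```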

Answered By: Ed Morton