How to convert all text to 1, except "0"s and the first two fields, in a CSV file?

I have several large .csv files that I'd like to convert to a binary (1 and 0) format, in which every cell containing text becomes 1 and every 0 remains 0, except for the first two fields, which should stay as they are.

head Test.csv
Iss1,1,0,0,Hsapiens-I34,0,0,0,Mmusculus-H01,0,0
Iss1,11,0,Scerevisiae-U09,Hsapiens-I05,0,0,0,0,0,0
Iss1,21,0,0,Hsapiens-I05,0,0,0,Hsapiens-I31,0,0
Iss1,31,0,0,Mmusculus-H13,0,0,0,0,0,Hsapiens-I31
Iss1,41,0,Scerevisiae-U09,0,0,0,0,0,0,Hsapiens-I21
Iss1,51,0,0,0,0,0,0,Scerevisiae-U25,0,Hsapiens-I21
Iss1,61,0,0,Hsapiens-I34,0,0,0,Mmusculus-H13,0,0

The expected outcome would be:
Iss1,1,0,0,1,0,0,0,1,0,0
Iss1,11,0,1,1,0,0,0,0,0,0
Iss1,21,0,0,1,0,0,0,1,0,0
Iss1,31,0,0,1,0,0,0,0,0,1
Iss1,41,0,1,0,0,0,0,0,0,1
Iss1,51,0,0,0,0,0,0,1,0,1
Iss1,61,0,0,1,0,0,0,1,0,0

In other words, every text cell in the file is converted to 1.

It would be greatly appreciated if someone could give me some suggestions on how to achieve this.

Thanks

With awk you could do:

awk 'BEGIN {FS=OFS=","} {for (i=3;i<=NF;i++) {$i==0?1:$i=1}} 1' test.csv
  • BEGIN {FS=OFS=","} – set the input and output field separators to a comma
  • for (i=3;i<=NF;i++) – loop over fields 3 through the last field, NF
  • $i==0?1:$i=1 – if field i is 0, do nothing (return 1), else set field i to 1
  • 1 – awk interprets the lone 1 as true and defaults to printing the record

As suggested in the comments by @EdMorton, $i=($i!=0) is a shorter alternative to $i==0?1:$i=1:

  • $i!=0 tests whether field i is not 0; awk returns 1 for true and 0 for false, and the assignment overwrites the field with that value
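As a quick sanity check, the shorter variant can be run against one sample line from the question (using printf here instead of the real file):

```shell
# Feed one sample line to awk; from field 3 onward, non-zero fields become 1.
printf 'Iss1,1,0,0,Hsapiens-I34,0,0,0,Mmusculus-H01,0,0\n' |
  awk 'BEGIN {FS=OFS=","} {for (i=3;i<=NF;i++) $i=($i!=0)} 1'
# prints: Iss1,1,0,0,1,0,0,0,1,0,0
```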
Answered By: FelixJN

With sed, you could do

sed 's/,[[:alpha:]][^,]*/,1/g'

This replaces any field that starts with an alphabetic character with a 1. Matching the comma before the field prevents the first field from being replaced.

The second field is always a number, so it will not be replaced either. If you want the script to also cover cases where the second field starts with a letter, you could "escape" it temporarily:

sed 's/,/,-/;s/,[[:alpha:]][^,]*/,1/g;s/,-/,/'

This is obviously less elegant, but it works.
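To sketch the escaping trick, here is a made-up line whose second field starts with a letter; the temporary - marker keeps that field from being rewritten:

```shell
# The second field A11 starts with a letter but is protected by the "-" marker,
# which is inserted after the first comma and removed again at the end.
printf 'Iss1,A11,0,Hsapiens-I05,0\n' |
  sed 's/,/,-/;s/,[[:alpha:]][^,]*/,1/g;s/,-/,/'
# prints: Iss1,A11,0,1,0
```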

Using perl:

$ perl -F, -lne 'my @out = map { /^0$/ ? 0 : 1 } splice @F,2;
                 unshift @out, @F;
                 print join(",",@out)' Test.csv 
Iss1,1,0,0,1,0,0,0,1,0,0
Iss1,11,0,1,1,0,0,0,0,0,0
Iss1,21,0,0,1,0,0,0,1,0,0
Iss1,31,0,0,1,0,0,0,0,0,1
Iss1,41,0,1,0,0,0,0,0,0,1
Iss1,51,0,0,0,0,0,0,1,0,1
Iss1,61,0,0,1,0,0,0,1,0,0
Zed227,28897871,0,0,1,0,0,0,1,0,0
Zed227,28897881,0,1,1,0,0,0,0,0,0
Zed227,28897891,0,0,1,0,0,0,1,0,0
Zed227,28897901,0,0,0,0,0,0,0,0,1
lad1,1,0,1,0,0,0,0,0,0,1
lad1,11,0,1,0,0,0,0,1,0,0
lad1,21,0,0,1,0,0,0,0,0,0

How it works:

Perl command-line options:

  • The -F, option tells perl to use a comma as the field delimiter. -F also triggers auto-splitting of each input line, with the fields going into an array called @F – this is similar to how awk auto-splits fields into $1, $2, $3, etc.
  • -l tells perl to automatically handle line-endings, e.g. removing newlines from the input and adding them back to output from print.
  • -n makes perl run similarly to sed -n – i.e. read and process each line, but only print what it is explicitly told to.
  • -e tells perl that the next argument is the script to run.

The script:

  • perl’s splice() removes a portion of an array and returns that portion to the caller, so splice @F,2 removes and returns all but the first two elements of the @F array. splice can do a lot more than that, but that’s all I’m using it for here. See perldoc -f splice for details.

  • perl’s map function applies an expression to each element of an array (list). In this case, the list is the elements returned by the splice function. The expression used by map here returns 0 if an element matches the regex /^0$/ or a 1 if it doesn’t match. map returns an array, which is assigned to array variable @out. See perldoc -f map for details.

    BTW, I could have used a string equality comparison (eq) with the ternary operator (i.e. $_ eq "0" ? 0 : 1) instead of a regex. A string comparison would be faster than a simple regex like this, but not noticeably so unless your .csv file were huge (many thousands of lines). Even so, it would be better to use eq; the only reason I used /^0$/ is that it was the first thing that came to mind.

    A numeric comparison (==, i.e. $_ == 0 ? 0 : 1) wouldn’t have worked for your needs because strings that don’t start with numbers (ignoring any leading whitespace) will evaluate as 0 and you need them to become 1s.

  • unshift does the opposite of shift – it adds elements to the start of the array. In this case, it adds what’s left of the @F array (i.e. the first two elements that weren’t removed by the splice) to the beginning of the @out array. See perldoc -f unshift.

  • Finally, the @out array is joined by commas and printed. For details on the join function used, see perldoc -f join.


This can be shortened to just one statement:

perl -F, -lne 'print join ",", @F[0..1], map { /^0$/ ? 0 : 1 } splice @F,2' Test.csv

There’s no need for @out as a temp variable, and no need for unshift either.

It works exactly the same, but is harder to understand, especially for someone unfamiliar with perl: you have to read it more or less back to front to tell what input each function is getting.

Answered By: cas

Using Raku (formerly known as Perl_6)

~$ raku -ne 'my @a = .split(",");   
             @a[2..*] = do for @a[2..*] { $_ ~~ 0.Int ?? 0 !! 1 };  
             @a.join(",").put ;'   file

OR:

~$ raku -ne 'my @a = .split(",");   
             @a[2..*] .= map: { $_ ~~ 0.Int ?? 0 !! 1 };   
             @a.join(",").put ;'   file

OR:

~$ raku -ne 'my @a = .split(",");   
             @a[2..*] .= map: { +( $_ !~~ 0.Int ) };   
             @a.join(",").put;'   file

Raku is a programming language in the Perl-family of programming languages. It features built-in, high-level support for Concurrency, Asynchrony, and Parallelism (CAP).

Above, the first two code examples bear remarkable similarity to the excellent Perl answers posted by @cas. Of note, Raku has an asymmetric ~~ "smartmatching" operator, which is syntactic sugar for Raku’s .ACCEPTS() method (i.e. "Does the RHS accept the LHS?"). Many "Type"-related issues can be resolved using Raku’s ~~ "smartmatching" operator.

Raku also has a new format for the built-in ternary operator: (Test) ?? True !! False. Some people find this ternary operator easier to read. Alternatively, the third example uses the fact that ~~ returns a Boolean, allowing the resultant True/False values to be coerced to 0/1 via +(…) or (…).Int.

[In the code examples above, the RHS .Int is actually superfluous, and smartmatching works quite well without any additional coercions. However you can perform LHS/RHS coercion ( via .Str, .Int, .Bool, etc.) if you have the need for a specific comparison not addressed by any of the defaults].

Sample Input:

Iss1,1,0,0,Hsapiens-I34,0,0,0,Mmusculus-H01,0,0
Iss1,11,0,Scerevisiae-U09,Hsapiens-I05,0,0,0,0,0,0
Iss1,21,0,0,Hsapiens-I05,0,0,0,Hsapiens-I31,0,0
Iss1,31,0,0,Mmusculus-H13,0,0,0,0,0,Hsapiens-I31
Iss1,41,0,Scerevisiae-U09,0,0,0,0,0,0,Hsapiens-I21
Iss1,51,0,0,0,0,0,0,Scerevisiae-U25,0,Hsapiens-I21
Iss1,61,0,0,Hsapiens-I34,0,0,0,Mmusculus-H13,0,0

Sample Output:

Iss1,1,0,0,1,0,0,0,1,0,0
Iss1,11,0,1,1,0,0,0,0,0,0
Iss1,21,0,0,1,0,0,0,1,0,0
Iss1,31,0,0,1,0,0,0,0,0,1
Iss1,41,0,1,0,0,0,0,0,0,1
Iss1,51,0,0,0,0,0,0,1,0,1
Iss1,61,0,0,1,0,0,0,1,0,0

Note: the above code will convert blank values, 00, 0x0, -0, and 0.0 all to 0. In particular, for blank (empty) values you should verify that all columns are filled, using the code raku -ne '.split(",", :skip-empty).elems.say;' with and without the :skip-empty parameter to detect differences. For Perl fans, see the Raku Docs section ‘0’ is True.


Take a look at the Perl answers if you want a different treatment of blank values, 00, 0x0, -0, and 0.0. The Perl answer by @StéphaneChazelas includes an explanation of how these values (and blanks) are treated. In my hands, the Perl answer by @cas changes blank values, 00, 0x0, -0, and 0.0 all to 1 (a different treatment of blank values than @StéphaneChazelas's Perl answer). So you have your choice (two languages, three treatments)!


https://docs.raku.org/language/operators#infix_??_!!
https://docs.raku.org/routine/ACCEPTS
https://docs.raku.org/routine/~~
https://raku.org

Answered By: jubilatious1

Another perl approach, modifying @F in place:

$ perl -F, -le 'map {$_ &&= 1} @F[2..$#F]; print join ",", @F' test.csv
Iss1,1,0,0,1,0,0,0,1,0,0
Iss1,11,0,1,1,0,0,0,0,0,0
Iss1,21,0,0,1,0,0,0,1,0,0
Iss1,31,0,0,1,0,0,0,0,0,1
Iss1,41,0,1,0,0,0,0,0,0,1
Iss1,51,0,0,0,0,0,0,1,0,1
Iss1,61,0,0,1,0,0,0,1,0,0

Elements are changed to 1 if they are considered true, that is, if they are neither 0 nor the empty string. Other representations of zero such as 00, 0x0, -0, 0.0 or the word zero are treated as true and changed to 1.

Answered By: Stéphane Chazelas