How can aggregate lines in CSV based on value of one column
My input is the following sorted CSV file (using :
as the field delimiter instead of the usual comma):
version:device
1.0.0:dev1
1.0.0:dev2
1.2.3:dev3
1.3.4:dev4
1.3.4:dev5
I want to aggregate it so that each version is in 1 line:
version:devices
1.0.0:dev1,dev2
1.2.3:dev3
1.3.4:dev4,dev5
Something like this should do what you want.
awk -F':' 'OFS=""; NR == 1; NR > 1 {x=$1;$1="";a[x]=a[x]$0","}END{for(x in a)print x":"a[x]}' file
try
awk -F: '$1 == before { printf ",%s",$2 ; }
$1 != before { printf "%s%s",nl,$0 ; before=$1 ; nl="n" }
END {printf nl ; }'
where
-F:
tell awk to use:
as separatorprintf
do not print newline, unless tol ("n"
)- this can be onelined (e.g.
awk -F: '....' file
), I split line for readability.
this use two variables to hold new line (nl
) and previous $1
‘s value.
miller is a pretty amazing tool:
mlr --csv --fs : group-by version then nest --ivar , -f device file.csv
version:device
1.0.0:dev1,dev2
1.2.3:dev3
1.3.4:dev4,dev5
awk -F: -v OFS=: '
NR == 1{print;next}
!v {v=$1;d=$2;next}
v == $1 {d = d","$2}
v != $1 {print v,d;v=$1;d=$2}
END{print v,d}
' file
Using GNU datamash
to group by the first colon-separated field, collapsing the second field:
$ datamash -t : groupby 1 collapse 2 <file
version:device
1.0.0:dev1,dev2
1.2.3:dev3
1.3.4:dev4,dev5
$ cat tst.awk
BEGIN { FS=OFS=":" }
$1 != prev {
if ( NR == 1 ) {
devs = $2 "s"
}
else {
print prev, devs
devs = $2
}
prev = $1
next
}
{ devs = devs "," $2 }
END { print prev, devs }
$ awk -f tst.awk file
version:devices
1.0.0:dev1,dev2
1.2.3:dev3
1.3.4:dev4,dev5
Yet another solution with csvkit:
csvsql -d ':' --query
'SELECT version, GROUP_CONCAT(device) as devices FROM file GROUP BY version' file.txt
| csvformat -D ':'
version:devices
1.0.0:dev1,dev2
1.2.3:dev3
1.3.4:dev4,dev5