cut with 2 character delimiter
I wanted to use cut with a 2-character delimiter to process a file with many lines like this:
1F3C6..1F3CA
1F3CF..1F3D3
1F3E0..1F3F0
But cut only allows a single character.
Instead of cut -d'..'
I’m trying awk -F'..' "{echo $1}"
but it’s not working.
My script:
wget -O output.txt http://www.unicode.org/Public/emoji/6.0/emoji-data.txt
sed -i '/^#/ d' output.txt # Remove comments
cat output.txt | cut -d' ' -f1 | while read line ;
do echo $line | awk -F'..' "{echo $1}"
done
Sample test script that works for me:
#!/bin/sh
raw="1F3C6..1F3CA
1F3CF..1F3D3
1F3E0..1F3F0"
for r in $raw
do
f1=`echo "${r}" | cut -d'.' -f1`
f2=`echo "${r}" | cut -d'.' -f2`
f3=`echo "${r}" | cut -d'.' -f3`
echo "field 1:[${f1}] field 2:[${f2}] field 3:[${f3}]"
done
exit
And the output is:
field 1:[1F3C6] field 2:[] field 3:[1F3CA]
field 1:[1F3CF] field 2:[] field 3:[1F3D3]
field 1:[1F3E0] field 2:[] field 3:[1F3F0]
Edit
After reading Stéphane Chazelas' comment and the linked Q&A, I rewrote the above to remove the loop.
I could not work out a way to remove the loop and keep the parts as variables (for example, $f1, $f2 and $f3 in my original answer) that could be passed around. I still don't know what output the original question required.
First, still using cut:
#!/bin/sh
raw="1F3C6..1F3CA
1F3CF..1F3D3
1F3E0..1F3F0"
printf '%s\n' "${raw}" | cut -d'.' -f1,3
Which will output:
1F3C6.1F3CA
1F3CF.1F3D3
1F3E0.1F3F0
You could replace the displayed . with any string using the --output-delimiter=STRING option (a GNU cut extension).
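As a rough sketch of that option, assuming GNU cut is the cut on your PATH (--output-delimiter is not POSIX); join_range is a made-up helper name used only for illustration:

```shell
#!/bin/sh
# Assumes GNU cut: --output-delimiter is a GNU extension, not POSIX.
# join_range is a hypothetical helper name used only for this sketch.
join_range() {
    # Fields 1 and 3 hold the two code points; field 2 is the empty
    # string between the two dots.
    cut -d'.' -f1,3 --output-delimiter="$1"
}

printf '%s\n' '1F3C6..1F3CA' '1F3CF..1F3D3' | join_range ' to '
```

The first argument to join_range is whatever string should appear between the two code points.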
Next, with sed instead of cut, in order to give more control over the output:
#!/bin/sh
raw="1F3C6..1F3CA
1F3CF..1F3D3
1F3E0..1F3F0"
printf '%s\n' "${raw}" | sed 's/^\(.*\)\.\.\(.*\)$/field 1 [\1] field 2 [\2]/'
And this will render:
field 1 [1F3C6] field 2 [1F3CA]
field 1 [1F3CF] field 2 [1F3D3]
field 1 [1F3E0] field 2 [1F3F0]
You could use IFS to split each line discarding the field between the two dots:
#!/bin/sh
while IFS=. read -r a _ b
do
echo "field one=[$a] field two=[$b]"
done < "file"
Execute:
$ ./script
field one=[1F3C6] field two=[1F3CA]
field one=[1F3CF] field two=[1F3D3]
field one=[1F3E0] field two=[1F3F0]
Assuming that file is:
$ cat file
1F3C6..1F3CA
1F3CF..1F3D3
1F3E0..1F3F0
awk's field separator is treated as a regexp as long as it's more than one character. .. as a regexp means any 2 characters. You'd need to escape each . either with [.] or with \.
awk -F'[.][.]' ...
awk -F'\\.\\.' ...
(the backslash itself also needs to be escaped (with some awks like gawk at least) for the \n/\b expansion that the argument to -F undergoes).
In your case:
awk -F' +|[.][.]' '/^[^#]/{print $1}' < output.txt
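To check the escaped separator quickly against the sample lines from the question, something like the following should do; split_range is a hypothetical helper name, not part of awk or of your script:

```shell
#!/bin/sh
# split_range is a hypothetical helper: it prints both ends of each
# "START..END" range read from standard input.
split_range() {
    # [.][.] matches two literal dots; a bare '..' would match any
    # two characters and split in the wrong places.
    awk -F'[.][.]' '{ print $1, $2 }'
}

printf '%s\n' '1F3C6..1F3CA' '1F3CF..1F3D3' | split_range
```

Each output line holds the two code points separated by a single space, which is awk's default output field separator.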
In any case, avoid shell loops to process text; note that read is not meant to be used like that, that echo should not be used for arbitrary data, and remember to quote your variables.
You can also use rev to reverse your string:
cat data.txt
1F3C6..1F3CA
1F3CF..1F3D3
1F3E0..1F3F0
First column:
cat data.txt | cut -d. -f1
1F3C6
1F3CF
1F3E0
Last column:
cat data.txt | rev | cut -d. -f1 | rev
1F3CA
1F3D3
1F3F0
Unfortunately, this method works only for input like yours and does not generalize to other kinds of input data.
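One case where splitting on a single dot goes wrong, sketched under the assumption that the fields themselves may contain a dot (the 1.5..2.5 input and both helper names are made up for illustration):

```shell
#!/bin/sh
# Hypothetical input: each field contains a single dot of its own.
line='1.5..2.5'

# Splitting on one dot cuts the first field short.
first_with_cut() { printf '%s\n' "$1" | cut -d. -f1; }

# Splitting on the literal two-dot sequence keeps the field whole.
first_with_awk() { printf '%s\n' "$1" | awk -F'[.][.]' '{ print $1 }'; }

first_with_cut "$line"   # prints 1
first_with_awk "$line"   # prints 1.5
```

For data like that, a real two-character separator (as with awk -F) is needed rather than a trick built on a one-character cut.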
I've created a patch that adds a new -m command-line option to cut, which works in field mode and treats multiple consecutive delimiters as a single delimiter. This basically solves the OP's question in a rather efficient way. I also submitted this patch upstream a couple of days ago, and let's hope that it will be merged into the coreutils project.
There are some further thoughts about adding even more whitespace-related features to cut, and feedback on all of that would be great. I'm willing to implement more patches for cut and submit them upstream, which would make this utility more versatile and more usable in various real-world scenarios.