Trying to sort on two fields, second then first
I am trying to sort on multiple columns. The results are not as expected.
Here’s my data (people.txt):
Simon Strange 62
Pete Brown 37
Mark Brown 46
Stefan Heinz 52
Tony Bedford 50
John Strange 51
Fred Bloggs 22
James Bedford 21
Emily Bedford 18
Ana Villamor 44
Alice Villamor 50
Francis Chepstow 56
The following works correctly:
bash-3.2$ sort -k2 -k3 <people.txt
Emily Bedford 18
James Bedford 21
Tony Bedford 50
Fred Bloggs 22
Pete Brown 37
Mark Brown 46
Francis Chepstow 56
Stefan Heinz 52
John Strange 51
Simon Strange 62
Ana Villamor 44
Alice Villamor 50
But, the following does not work as expected:
bash-3.2$ sort -k2 -k1 <people.txt
Emily Bedford 18
James Bedford 21
Tony Bedford 50
Fred Bloggs 22
Pete Brown 37
Mark Brown 46
Francis Chepstow 56
Stefan Heinz 52
John Strange 51
Simon Strange 62
Ana Villamor 44
Alice Villamor 50
I was trying to sort by surname and then by first name, but you will see the Villamors are not in the correct order. I was hoping to sort by surname, and then when surnames matched, to sort by first name.
It seems there is something about how this should work I don’t understand. I could do this another way of course (using awk), but I want to understand sort.
I am using the standard Bash shell on Mac OS X.
With GNU sort you do it like this (not sure about macOS):
sort -k2,2 -k1 <people.txt
Update according to comment. Quoted from man sort:
-k, --key=KEYDEF
sort via a key; KEYDEF gives location and type
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where
F is a field number and C a character position in the field; both are
origin 1, and the stop position defaults to the line's end.
A key specification like -k2 means to take all the fields from 2 to the end of the line into account. So Villamor 44 ends up before Villamor 50. Since these two are not equal, the first comparison in sort -k2 -k1 is enough to discriminate these two lines, and the second sort key -k1 is not invoked. If the two Villamors had had the same age, -k1 would have caused them to be sorted by first name.
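To see this for yourself, here is a minimal sketch using just the two Villamor lines. The second run uses made-up data (Ana's age changed to 50) so that the -k2 keys tie, and LC_ALL=C is my own addition to keep the comparison byte-wise and independent of your locale.

```shell
# Ages differ: the -k2 keys ("Villamor 44" vs "Villamor 50") already decide
# the order, so -k1 is never consulted and Ana stays ahead of Alice.
ages_differ=$(printf '%s\n' 'Ana Villamor 44' 'Alice Villamor 50' | LC_ALL=C sort -k2 -k1)
printf '%s\n' "$ages_differ"

# Hypothetical data with equal ages: the -k2 keys now tie, so -k1 breaks
# the tie and Alice sorts ahead of Ana.
ages_equal=$(printf '%s\n' 'Ana Villamor 50' 'Alice Villamor 50' | LC_ALL=C sort -k2 -k1)
printf '%s\n' "$ages_equal"
```

The first run prints Ana before Alice, the second prints Alice before Ana.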
To sort by a single column, use -k2,2 as the key specification. This means to use the fields from #2 to #2, i.e. only the second field.
sort -k2 -k3 <people.txt is redundant: it’s equivalent to sort -k2 <people.txt. To sort by last names, then first names, then age, run the following command:
sort -k2,2 -k1,1 <people.txt
or equivalently sort -k2,2 -k1 <people.txt since there are only these three fields and the separators are the same. In fact, you will get the same effect from sort -k2,2 <people.txt, because sort uses the whole line as a last resort when all the keys in a subset of lines are identical.
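A quick way to observe the last-resort comparison is to switch it off and compare, as in the sketch below. It relies on the -s (stable) option, which GNU sort and the BSD sort on recent macOS both accept but which is not part of POSIX, so treat it as an assumption about your system. LC_ALL=C is again my addition for a predictable byte-wise order.

```shell
# Both lines have the same -k2,2 key, so sort falls back to comparing the
# whole lines, which puts Alice ahead of Ana.
with_fallback=$(printf '%s\n' 'Ana Villamor 44' 'Alice Villamor 50' | LC_ALL=C sort -k2,2)
printf '%s\n' "$with_fallback"

# -s disables the last-resort comparison: lines with equal keys keep their
# input order, so Ana stays first.
without_fallback=$(printf '%s\n' 'Ana Villamor 44' 'Alice Villamor 50' | LC_ALL=C sort -s -k2,2)
printf '%s\n' "$without_fallback"
```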
Also note that the default field separator is the transition between a non-blank and a blank, so the keys will include the leading blanks (in your example, for the line Emily Bedford 18, the first key will be "Emily", but the second key " Bedford"). Add the -b option to strip those blanks:
sort -b -k2,2 -k1,1
It can also be done on a per-key basis by adding the b flag at the end of the key start specification:
sort -k2b,2 -k1,1 <people.txt
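The leading blanks only matter when the amount of whitespace between fields varies, which is not the case in people.txt. The sketch below therefore uses a hypothetical variant of two lines from the file, with two spaces after Tony, and LC_ALL=C (my addition) so that a space reliably compares lower than a letter.

```shell
# Without b, the second key is "  Bedford" for Tony and " Bedford" for Emily.
# The extra blank sorts first, so Tony lands ahead of Emily.
keep_blanks=$(printf '%s\n' 'Tony  Bedford 50' 'Emily Bedford 18' | LC_ALL=C sort -k2,2 -k1,1)
printf '%s\n' "$keep_blanks"

# With b on the second key, both keys are "Bedford", the tie falls through
# to -k1,1, and Emily sorts ahead of Tony.
skip_blanks=$(printf '%s\n' 'Tony  Bedford 50' 'Emily Bedford 18' | LC_ALL=C sort -k2b,2 -k1,1)
printf '%s\n' "$skip_blanks"
```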
One thing to bear in mind: as soon as you add one such flag to a key specification, the global flags (like -n, -r…) no longer apply to that key, so it’s better to avoid mixing per-key flags and global flags.
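Here is a small sketch of that pitfall, as I understand the behaviour of GNU sort (I have not checked every other implementation). The ages 10 and 9 are made up so that numeric and lexical order disagree, and LC_ALL=C is my addition.

```shell
# The key has no flags of its own, so the global -n applies: 9 sorts before 10.
global_n=$(printf '%s\n' 'Tony Bedford 10' 'Emily Bedford 9' | LC_ALL=C sort -n -k3,3)
printf '%s\n' "$global_n"

# The key carries its own b flag, so the global -n is not inherited and the
# comparison is lexical: "10" sorts before "9".
per_key_b=$(printf '%s\n' 'Tony Bedford 10' 'Emily Bedford 9' | LC_ALL=C sort -n -k3b,3)
printf '%s\n' "$per_key_b"

# Repeating n on the key itself restores the numeric comparison.
per_key_bn=$(printf '%s\n' 'Tony Bedford 10' 'Emily Bedford 9' | LC_ALL=C sort -k3bn,3)
printf '%s\n' "$per_key_bn"
```

Putting every flag a key needs on the key itself, as in the last command, sidesteps the question of what gets inherited.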
You can do this:
$ sort -k2,2 -k1,1 people.txt
Emily Bedford 18
James Bedford 21
Tony Bedford 50
Fred Bloggs 22
Mark Brown 46
Pete Brown 37
Francis Chepstow 56
Stefan Heinz 52
John Strange 51
Simon Strange 62
Alice Villamor 50
Ana Villamor 44
So first, with -k2,2, you are sorting by last name. Then, with -k1,1, you are sorting by first name.