How to sort every 20-line block of a 1000-line file and save only the line with the highest value from each block to another file?

I have a file with 1000 lines of text. For each block of 20 lines, I want to sort by the 4th column and write the line with the highest value to another file. Can anybody help me do this with awk or sed?

Here is an example of the input data:

   1      1.1350  1092.42    0.0000
   2      1.4645   846.58    0.0008
   3      1.4760   840.01    0.0000
   4      1.6586   747.52    0.0006
   5      1.6651   744.60    0.0000
   6      1.7750   698.51    0.0043
   7      1.9216   645.20    0.0062
   8      2.1708   571.14    0.0000
   9      2.1839   567.71    0.0023
  10      2.2582   549.04    0.0000
  11      2.2878   541.93    1.1090
  12      2.3653   524.17    0.0000
  13      2.3712   522.88    0.0852
  14      2.3928   518.15    0.0442
  15      2.5468   486.82    0.0000
  16      2.6504   467.79    0.0000
  17      2.6909   460.75    0.0001
  18      2.7270   454.65    0.0000
  19      2.7367   453.04    0.0004
  20      2.7996   442.87    0.0000
   1      1.4962   828.64    0.0034
   2      1.6848   735.91    0.0001
   3      1.6974   730.45    0.0005
   4      1.7378   713.47    0.0002
   5      1.7385   713.18    0.0007
   6      1.8086   685.51    0.0060
   7      2.0433   606.78    0.0102
   8      2.0607   601.65    0.0032 
   9      2.0970   591.24    0.0045 
  10      2.1033   589.48    0.0184 
  11      2.2396   553.61    0.0203 
  12      2.2850   542.61    1.1579 
  13      2.3262   532.99    0.0022 
  14      2.6288   471.64    0.0039 
  15      2.6464   468.51    0.0051 
  16      2.7435   451.92    0.0001 
  17      2.7492   450.98    0.0002 
  18      2.8945   428.34    0.0010 
  19      2.9344   422.52    0.0001 
  20      2.9447   421.04    0.0007 

Expected output:

11      2.2878   541.93    1.1090 
12      2.2850   542.61    1.1579 

Each 20-line interval has exactly one (unique) highest value.

Asked By: Sanjukta

||

Via awk:

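awk '
NR%20==1 {max=$4; line=$0}   # first line of a block: reset the maximum
$4>max   {max=$4; line=$0}   # remember the line with the largest 4th column
NR%20==0 {print line}        # last line of a block: print the winner
' file > outfile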
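
A possible variant of the same idea, as a sketch only: the block size is passed in as a variable `n`, a trailing partial block is also reported, and the result is redirected to a second file. The sample data, the temporary directory and the names `input.txt` and `maxima.txt` are made up for illustration.

```shell
# Hypothetical file names; a 6-line sample stands in for the real 1000-line file.
dir=$(mktemp -d)
printf '%s\n' \
  '1 1.1 10.0 0.0000' \
  '2 1.2 11.0 0.5000' \
  '3 1.3 12.0 0.0100' \
  '1 2.1 20.0 0.0300' \
  '2 2.2 21.0 0.0020' \
  '3 2.3 22.0 0.9000' > "$dir/input.txt"

# n is the block size: 20 in the question, 3 for this small sample.
awk -v n=3 '
  (NR - 1) % n == 0 { max = $4; line = $0 }   # first line of a block resets the maximum
  $4 + 0 > max + 0  { max = $4; line = $0 }   # force a numeric comparison of column 4
  NR % n == 0       { print line }            # last line of a block: emit the winner
  END               { if (NR % n) print line } # emit a trailing partial block, if any
' "$dir/input.txt" > "$dir/maxima.txt"

cat "$dir/maxima.txt"
```

For the sample above this prints the lines ending in 0.5000 and 0.9000, one per block. For the file in the question, `-v n=20` would be the matching setting.
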
Answered By: FelixJN

With GNU sort and GNU split, you can do

split -l 20 file.txt --filter "sort -nk 4|tail -n 1"

The file is split into chunks of 20 lines, and the --filter option pipes each chunk through the given commands: the chunk is sorted numerically by the 4th key, and tail extracts only the last line (the one with the highest value).
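
Since the question asks for the result in another file, here is a small sketch of how the output might be collected. It assumes GNU split (for `--filter`), uses a made-up 4-line sample with chunks of 2 lines, and the file names are placeholders. The key is written as `-k4,4n` to restrict it to the 4th field only, and `LC_ALL=C` is set on the assumption that the decimal separator is a dot.

```shell
# Assumes GNU split; input.txt and maxima.txt are placeholder names.
dir=$(mktemp -d)
printf '%s\n' \
  '1 1.1 10.0 0.0000' \
  '2 1.2 11.0 0.5000' \
  '1 2.1 20.0 0.0300' \
  '2 2.2 21.0 0.0020' > "$dir/input.txt"

# Chunks of 2 lines here (20 in the question). Each chunk is piped through the
# filter, whose output goes to split's standard output, so a single redirection
# collects every per-chunk maximum in one file.
LC_ALL=C split -l 2 --filter='sort -k4,4n | tail -n 1' "$dir/input.txt" > "$dir/maxima.txt"

cat "$dir/maxima.txt"
```

With `--filter`, split does not create the usual `xaa`, `xab`, ... pieces, so nothing needs to be cleaned up afterwards.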

Answered By: Philippos

Using the DSU (Decorate/Sort/Undecorate) idiom with any awk+sort+cut:

$ awk -v OFS='\t' '(NR==1) || ($1<p){b++} {p=$1; print b, $0}' file |
    sort -k5,5rn | awk '!seen[$1]++' | sort -k1,1n | cut -f2-
  11      2.2878   541.93    1.1090
  12      2.2850   542.61    1.1579

See https://stackoverflow.com/questions/71691113/how-to-sort-data-based-on-the-value-of-a-column-for-part-multiple-lines-of-a-f/71694367#71694367 for more info on DSU.
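
To make the three stages easier to follow, here is a sketch that runs the same pipeline on a made-up 4-line sample (two blocks of two lines) and prints the intermediate decorated form. The variable names `sample`, `decorated` and `result` are only for illustration, and `LC_ALL=C` is an added assumption to keep the numeric sort locale-independent.

```shell
# Tiny made-up sample: two blocks of two lines each (column 1 restarts at 1).
sample='1 1.1 10.0 0.0000
2 1.2 11.0 0.5000
1 2.1 20.0 0.0300
2 2.2 21.0 0.0020'

# Decorate: prefix every line with a block number b and a tab.
# b is incremented whenever column 1 drops, which marks the start of a new block.
decorated=$(printf '%s\n' "$sample" |
  awk -v OFS='\t' '(NR==1) || ($1<p){b++} {p=$1; print b, $0}')
printf '%s\n' "$decorated"

# Sort by the original 4th column (now field 5) in descending order, keep the
# first line seen per block, restore block order, then undecorate with cut.
result=$(printf '%s\n' "$decorated" |
  LC_ALL=C sort -k5,5rn | awk '!seen[$1]++' | LC_ALL=C sort -k1,1n | cut -f2-)
printf '%s\n' "$result"
```

Note that this approach detects block boundaries from the restart of the first column rather than from a fixed count of 20 lines, so it relies on column 1 decreasing at the start of each block.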

As mentioned in the comments by @StéphaneChazelas, if you have GNU sort you can abbreviate the above a little to:

awk -v OFS='\t' '(NR==1) || ($1<p){b++} {p=$1; print b, $0}' file |
    sort -k5,5rn | sort -suk1,1n | cut -f2-
Answered By: Ed Morton