cat line X to line Y on a huge file

Say I have a huge text file (>2GB) and I just want to cat the lines X to Y (e.g. 57890000 to 57890010).

From what I understand I can do this by piping head into tail or vice versa, i.e.

head -A /path/to/file | tail -B

or alternatively

tail -C /path/to/file | head -D

where A, B, C and D can be computed from X, Y and the number of lines in the file.
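For example, with the numbers above and writing N for the total line count of the file, one valid choice would be A = Y, B = Y - X + 1, C = N - X + 1 and D = Y - X + 1, i.e.

head -n 57890010 /path/to/file | tail -n 11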

But there are two problems with this approach:

  1. You have to compute A, B, C and D.
  2. The commands could pipe many more lines to each other than I am interested in reading (e.g. if I am reading just a few lines in the middle of a huge file).

Is there a way to have the shell output just the lines I want, while providing only X and Y?

The most orthodox way (but not the fastest, as Gilles notes in another answer) would be to use sed.

In your case:

X=57890000
Y=57890010
sed -n -e "$X,$Y p" -e "$Y q" filename

The -n option suppresses the default output, so only lines explicitly requested are printed to stdout.

The p after the line-number range tells sed to print the lines in that range. The q in the second part of the script saves some time by quitting as soon as the last requested line has been printed, instead of reading the remainder of the file.
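
As a quick illustration on a tiny input (seq 10 just prints the numbers 1 through 10, one per line):

seq 10 | sed -n -e '3,5 p' -e '5 q'

This prints lines 3, 4 and 5, then quits without reading lines 6 to 10.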

Answered By: Paweł Rumian

The head | tail approach is one of the best and most “idiomatic” ways to do this:

X=57890000
Y=57890010
< infile.txt head -n "$Y" | tail -n +"$X"

As pointed out by Gilles in the comments, a faster way is

< infile.txt tail -n +"$X" | head -n "$((Y - X + 1))"

This is faster because the first X − 1 lines don't have to travel through the pipe, as they do in the head | tail approach.
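
If you do this often, the recipe is easy to wrap in a small shell function (a sketch; the name lines and the inclusive X..Y convention are my own):

# Print lines $1 through $2 (inclusive) of file $3.
lines() {
    local x=$1 y=$2 file=$3
    tail -n +"$x" "$file" | head -n "$(( y - x + 1 ))"
}

lines 57890000 57890010 infile.txt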

Your question as phrased is a bit misleading and probably explains some of your unfounded misgivings towards this approach.

  • You say you have to calculate A, B, C and D, but as you can see, the line count of the file is not needed, and at most one calculation is necessary, which the shell can do for you anyway.

  • You worry that piping will read more lines than necessary. In fact it will not: tail | head is about as efficient as you can get in terms of file I/O. Consider the minimum amount of work required: to find the Xth line of a file, the only general way is to read every byte and count newlines until you have seen X of them, because there is no way to divine the file offset of the Xth line. Once you reach the Xth line, you also have to read all the lines up to the Yth in order to print them. Thus no approach can get away with reading fewer than Y lines. Now, head -n "$Y" reads no more than Y lines (rounded up to the nearest buffer boundary, but buffers, used correctly, improve performance, so that overhead is nothing to worry about), and tail never reads more than head does. We have therefore shown that head | tail reads the fewest lines possible (again, plus some negligible buffering that we are ignoring). The only efficiency advantage of a pipeless single-tool approach is that it runs fewer processes, and thus has less overhead.

Answered By: jw013

I suggest the sed solution, but for the sake of completeness,

awk 'NR >= 57890000 && NR <= 57890010' /path/to/file

To cut out after the last line:

awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file
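
If the bounds live in shell variables, passing them in with -v avoids splicing them into the program text (a sketch; X and Y are assumed to hold the line numbers):

X=57890000 Y=57890010
awk -v x="$X" -v y="$Y" 'NR < x { next } { print } NR == y { exit }' /path/to/file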

Speed test (here on macOS, YMMV on other systems):

  • 100,000,000-line file generated by seq 100000000 > test.in
  • Reading lines 50,000,000-50,000,010
  • Tests in no particular order
  • real time as reported by bash's builtin time
 4.373  4.418  4.395    tail -n+50000000 test.in | head -n10
 5.210  5.179  6.181    sed -n '50000000,50000010p;50000010q' test.in
 5.525  5.475  5.488    head -n50000010 test.in | tail -n10
 8.497  8.352  8.438    sed -n '50000000,50000010p' test.in
22.826 23.154 23.195    tail -n50000001 test.in | head -n10
25.694 25.908 27.638    ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574    awk 'NR<50000000{next}1;NR==50000010{exit}' test.in
51.359 50.919 51.127    awk 'NR >= 50000000 && NR <= 50000010' test.in

These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.

*: Except between the first two, sed -n p;q and head|tail, which seem to be essentially the same.

Answered By: Kevin

If you want lines X to Y inclusive (starting the numbering at 1), use

tail -n "+$X" /path/to/file | head -n "$((Y-X+1))"

tail will read and discard the first X-1 lines (there’s no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won’t have read more than a buffer size’s worth (typically a few kilobytes) of lines from the input file.
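
You can watch this early-exit behaviour in isolation with an infinite producer (a sketch):

yes | head -n 3

head prints three lines and exits; yes is then killed by SIGPIPE on its next write, so the pipeline returns immediately instead of running forever.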

Alternatively, as gorkypl suggested, use sed:

sed -n -e "$X,$Y p" -e "$Y q" /path/to/file

The sed solution is significantly slower though (at least for GNU and BusyBox utilities; sed might be more competitive if you extract a large part of the file on an OS where piping is slow and sed is fast). Here are quick benchmarks under Linux: the data was generated by seq 100000000 >/tmp/a, the environment is Linux/amd64, /tmp is tmpfs, and the machine was otherwise idle and not swapping.

real  user  sys    command
 0.47  0.32  0.12  </tmp/a tail -n +50000001 | head -n 10 #GNU
 0.86  0.64  0.21  </tmp/a tail -n +50000001 | head -n 10 #BusyBox
 3.57  3.41  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #GNU
11.91 11.68  0.14  sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #BusyBox
 1.04  0.60  0.46  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #GNU
 7.12  6.58  0.55  </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #BusyBox
 9.95  9.54  0.28  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #GNU
23.76 23.13  0.31  sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #BusyBox

If you know the byte range you want to work with, you can extract it faster by skipping directly to the start position. But for lines, you have to read from the beginning and count newlines. To extract blocks from x inclusive to y exclusive starting at 0, with a block size of b:

dd bs="$b" skip="$x" count="$((y-x))" </path/to/file
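
For example, with a block size of 512 bytes, extracting blocks 100 (inclusive) to 200 (exclusive), i.e. bytes 51200 to 102399 (a sketch; status=none is a GNU dd option that merely silences the transfer summary):

dd bs=512 skip=100 count=100 status=none < /path/to/file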

Answered By: Gilles

I do this often enough that I wrote this script for it. I don't need to find the line numbers myself; the script does it all.

#!/bin/bash

# $1: start time
# $2: end time
# $3: log file to read
# $4: output file

# e.g. log_slice.sh 18:33 19:40 /var/log/my.log /var/log/myslice.log

if [[ $# -ne 4 ]] ; then
    echo 'usage: log_slice.sh <start time> <end time> <log file> <output file>'
    echo
    exit 1
fi

if [ ! -f "$3" ] ; then
    echo "'$3' doesn't seem to exist."
    echo 'exiting.'
    exit 1
fi

# line number of the first occurrence of the start time
sline=$(grep -n " ${1}" "$3" | head -n 1 | cut -d: -f1)
# line number of the first occurrence of the end time
eline=$(grep -n " ${2}" "$3" | head -n 1 | cut -d: -f1)

linediff="$((eline - sline))"

tail -n "+${sline}" "$3" | head -n "$linediff" > "$4"
Answered By: Doolan

If we know the range to select, from the first line lStart to the last line lEnd, we can calculate:

lCount="$((lEnd-lStart+1))"

If we also know the total number of lines, lAll, we can calculate the distance to the end of the file:

toEnd="$((lAll-lStart+1))"

Then we will know both:

"how far from the start"            ($lStart) and
"how far from the end of the file"  ($toEnd).

Choosing the smaller of the two as tailnumber, like this:

tailnumber="$toEnd"; (( toEnd > lStart )) && tailnumber="+$lStart"

This allows us to use the consistently fastest command:

tail -n "${tailnumber}" "$thefile" | head -n "$lCount"

Please note the additional plus ("+") sign when $lStart is selected.

The only caveat is that we need the total line count, and finding it may take some additional time.
It is usually obtained with:

lAll="$(wc -l < "$thefile")"
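
Putting the pieces together (a sketch; nearest_slice is a hypothetical name, the variables are the ones defined above):

nearest_slice() {
    local lStart=$1 lEnd=$2 thefile=$3
    local lCount=$(( lEnd - lStart + 1 ))
    local lAll toEnd tailnumber
    lAll=$(wc -l < "$thefile")                     # total line count
    toEnd=$(( lAll - lStart + 1 ))                 # distance from the end
    tailnumber=$toEnd                              # default: count from the end
    (( toEnd > lStart )) && tailnumber="+$lStart"  # start is closer: count from the start
    tail -n "$tailnumber" "$thefile" | head -n "$lCount"
}

nearest_slice 99999000 99999010 test.in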

Some times measured are:

lStart |500| lEnd |500| lCount |1|
real   user   sys    frac
 0.002  0.000  0.000    0.00 | command == tail -n"+500" test.in | head -n1
 0.002  0.000  0.000    0.00 | command == tail -n+500 test.in | head -n1
 3.230  2.520  0.700   99.68 | command == tail -n99999501 test.in | head -n1
 0.001  0.000  0.000    0.00 | command == head -n500 test.in | tail -n1
 0.001  0.000  0.000    0.00 | command == sed -n -e "500,500p;500q" test.in
 0.002  0.000  0.000    0.00 | command == awk 'NR<'500'{next}1;NR=='500'{exit}' test.in


lStart |50000000| lEnd |50000010| lCount |11|
real   user   sys    frac
 0.977  0.644  0.328   99.50 | command == tail -n"+50000000" test.in | head -n11
 1.069  0.756  0.308   99.58 | command == tail -n+50000000 test.in | head -n11
 1.823  1.512  0.308   99.85 | command == tail -n50000001 test.in | head -n11
 1.950  2.396  1.284  188.77 | command == head -n50000010 test.in | tail -n11
 5.477  5.116  0.348   99.76 | command == sed -n -e "50000000,50000010p;50000010q" test.in
10.124  9.669  0.448   99.92 | command == awk 'NR<'50000000'{next}1;NR=='50000010'{exit}' test.in


lStart |99999000| lEnd |99999010| lCount |11|
real   user   sys    frac
 0.001  0.000  0.000    0.00 | command == tail -n"1001" test.in | head -n11
 1.960  1.292  0.660   99.61 | command == tail -n+99999000 test.in | head -n11
 0.001  0.000  0.000    0.00 | command == tail -n1001 test.in | head -n11
 4.043  4.704  2.704  183.25 | command == head -n99999010 test.in | tail -n11
10.346  9.641  0.692   99.88 | command == sed -n -e "99999000,99999010p;99999010q" test.in
21.653 20.873  0.744   99.83 | command == awk 'NR<'99999000'{next}1;NR=='99999010'{exit}' test.in

Note that times change drastically depending on whether the selected lines are near the start or near the end of the file. A command that appears to work nicely at one end of the file may be extremely slow at the other end.

Answered By: user79743

If you pipe the data with cat, you will want to use tail first and then head:

cat file.name | tail -n +3 | head -n -1
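
Note that head -n -1 (print everything except the last line) is a GNU extension; on systems without it, sed '$d' drops the last line portably (a sketch):

cat file.name | tail -n +3 | sed '$d'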
Answered By: Lerie

Even the fastest tail + head combo is only about 1.3% faster than awk:

__='147654389'   # extracting rows 147,654,389 - 147,654,399

( time ( pvE0 < "$_____" |
    mawk2 -v __="$__" 'BEGIN {_=(__=+__)+10} NR<__ {next} _<NR {exit} _' ))

  in0: 7.17GiB 0:00:05 [1.33GiB/s] [1.33GiB/s] [=====> ] 94%
  ( pvE 0.1 in0 < "$_____" | mawk2 -v __=$__ ; )
  4.65s user 1.79s system 118% cpu 5.424 total

  02de381a4ea9c6d101c1935ae75cf565  stdin

  in0: 7.17GiB 0:00:05 [1.34GiB/s] [1.34GiB/s] [=====> ] 94%
  ( pvE 0.1 in0 < "$_____" | gtail -n"+$__" | ghead -n11; )
  2.50s user 3.96s system 120% cpu 5.355 total

  02de381a4ea9c6d101c1935ae75cf565  stdin

Ironically, GNU tail is actually slower when it uses its own I/O mechanism to read the file than when the file comes in through the pipe.

Answered By: RARE Kpop Manifesto