multiline grep search into separate files per occurrence

I have a file as following:

example.txt

    -1
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000
    -1
 
#ffafsda
    -1
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23
    -1
    
    asfasd
    
    afsdasdf

It consists of blocks, each starting and ending with a line entirely matching ^ {4}-1$.
I need to split the file into multiple files, one per block.

What I figured out right now is this multiline regex that extracts these blocks:

grep -Pzo '(?s)((?m:^)\s{4}-1(?m:$).*?(?m:^)\s{4}-1(?m:$))' example.txt

output:

    -1
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000
    -1    -1
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23
    -1

You can see that the second match is printed directly after the first (no visible newline or separator) – I’m failing to separate these occurrences into files.

required output is following:

file1:

    -1
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000
    -1

file2

    -1
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23
    -1

Any help appreciated.

Asked By: Honza S.


Another method to get the whole blocks instead of grep

First, I suggest using sed to extract the whole blocks:

sed -n '/^ \{4\}-1$/,/^ \{4\}-1$/p' example.txt
    -1
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000
    -1
    -1
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23
    -1

Splitting blocks to different files

Then you can use csplit command to split the file according to the pattern.

NAME

csplit – split a file into sections determined by context lines

SYNOPSIS

csplit [OPTION]… FILE PATTERN…

DESCRIPTION

Output pieces of FILE separated by PATTERN(s) to files ‘xx00’, ‘xx01’, …, and output byte counts of each piece to standard output.

Example

$ sed -n '/^ \{4\}-1$/,/^ \{4\}-1$/p' example.txt | csplit - -f example --suppress-matched -z '/^ \{4\}-1$/' '{*}'
331
292

Explanation:

  • csplit - – will read from standard input
  • -f example – sets the prefix of the output files to "example" (instead of the default "xx"). Each prefix is followed by a two-digit number starting from 00.
  • --suppress-matched – suppress the lines matching the pattern (/^ \{4\}-1$/).
    • It’s needed because csplit splits by a pattern (you can’t tell it the first and the last line of a section, only a pattern), so after each "closing" delimiter it would start a new file containing just that delimiter line. Suppressing the matched lines, together with the next flag, avoids this:
  • -z – remove empty output files
  • '/^ \{4\}-1$/' – the pattern that indicates where to split the file.
  • '{*}' – repeat the previous pattern as many times as possible

It will output the size of each file it creates.

Result: 2 files with the required blocks, but without the pattern.

$ cat example00
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000

$ cat example01
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23

If you want to restore the separator lines (    -1 as the first and last line of each file), you can use the following command:

sed -i '1s/^/    -1\n/; $s/$/\n    -1/' example[0-9][0-9]
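As a quick sanity check of that restoration step, here is a sketch using GNU sed -i on a throwaway file (demo00 is just an illustrative name):

```shell
# simulate one of csplit's output files, which lacks the "    -1" delimiter lines
printf '    15\nbody line\n' > demo00

# prepend "    -1" before the first line and append it after the last line (GNU sed)
sed -i '1s/^/    -1\n/; $s/$/\n    -1/' demo00

head -n 1 demo00    # "    -1"
tail -n 1 demo00    # "    -1"
```

Note that \n in the replacement text is a GNU sed extension; a strictly POSIX sed would need a backslash-newline instead.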

Further explanation about --suppress-matched and -z flags

To explain the need for --suppress-matched, I’ll show you what happens without it:

$ sed -n '/^ \{4\}-1$/,/^ \{4\}-1$/p' example.txt | csplit -f example -z - '/^ \{4\}-1$/' '{*}'
338
7
299
7

It created 4 files. Notice that example01 and example03 only include the pattern.

$ cat example00
    -1
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000

$ cat example01
    -1

$ cat example02
    -1
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23

$ cat example03
    -1

When using --suppress-matched, the lines with -1 are suppressed, so example01 and example03 would be empty and thus (because of -z) would not be created.

Answered By: aviro

With -z (a non-standard GNU extension), grep works on NUL-delimited records; it is not a multiline grep¹, so:

  • the matching is done on each NUL-delimited record independently, or on the whole input if it is not NUL-delimited (the ability to work with non-delimited records is another GNU extension)
  • with -o (another non-standard GNU extension) each match is output NUL-delimited

So the records in your output are separated (actually, delimited), which you can see if you pass the output through sed -n l for instance:

$ grep -Pzo '(?s)((?m:^)\s{4}-1(?m:$).*?(?m:^)\s{4}-1(?m:$))' example.txt | sed -n l
    -1$
    15$
         1         0         0        11 -1.0000E+001  1.0000E+001 -1
.0000E+001$
         2         0         0        11  1.0000E+001  1.0000E+001 -1
.0000E+001$
...$
        29         0         0        11  1.0000E+001  2.0000E+001  1
.0000E+001$
        30         0         0        11  5.0000E+000  5.0000E+000  5
.0000E+000$
    -1\000    -1$
    780$
         1       116         1         2         1         1         
7        20$
         1        11         2        15         4        18         
3        12$
        13        16        22        19         5        24         
9        29$
         8        27         6        23$
    -1\000$

See the \000s that delimit each match.

Here you could simplify your matching with:

grep -Pzo '(?sm)(^\s{4}-1$).*?(?1)' example.txt
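Since those matches come out NUL-delimited, the shell itself can split them into files; a sketch in bash (the output-N.txt names are just illustrative), reading each NUL-terminated match with read -d '':

```shell
# split grep's NUL-delimited matches into one file per block (bash)
n=0
grep -Pzo '(?sm)(^\s{4}-1$).*?(?1)' example.txt |
  while IFS= read -r -d '' block; do
    n=$(( n + 1 ))
    printf '%s\n' "$block" > "output-$n.txt"   # re-add the final newline
  done
```

read -d '' and the arithmetic expansion are bash features, and the while loop runs in a subshell, so $n is not visible after the pipeline; the files, of course, persist.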

But rather than using grep with that -P (for Perl, also a non-standard GNU extension), you could use the real thing, which has several advantages:

  • be more portable as perl is present on more systems than GNU grep is (and perl-like regexp support is not always enabled at build time in GNU grep)
  • perl has -0 to work with NUL-delimited records, but that’s not what you want here. You want a slurp mode which in perl is with -0777
  • perl can write the output to separate files by itself:
perl -l -0777 -ne '
  while (/(^\s{4}-1$).*?(?1)/msg) {
    open OUT, ">", "output-" . ++$n . ".txt" or die;
    print OUT $&
  }' example.txt

Or rather than slurping the file as a whole and use regexps, read it line by line:

perl -ne '
  if (/^\s{4}-1$/) {
    if ($inside = 1 - $inside) {
      open OUT, ">", "output-" . ++$n . ".txt" or die;
    } else {
      print OUT; next
    }
  }
  print OUT if $inside' example.txt

(though it would give a different outcome if the     -1 lines were not all paired).


¹ for that, see pcre2grep -M (formerly pcregrep -M), pcre2grep being an example application shipped with PCRE2 which GNU grep uses (can use) for its -P option.

Answered By: Stéphane Chazelas

If it were me:

gawk '/^\s{4}-1$/ { X=X+1 } { print $0 >> ( "outfile" X ) }' <inputfile
Answered By: symcbean

You can use GNU awk, which allows regular expressions to be used as the record separator, the thing that defines "lines". Here, we can set it to "\n    -1\n": a newline, 4 spaces, -1 and a newline. Then, since this appears at the start and end of the sections you want, we essentially want every second "line", so we can print when the record number modulo 2 is 0:

gawk '
  BEGIN{
    RS="\n    -1\n";
    ORS=RS
  }
  NR % 2 == 0 { print RS $0 > "outfile." ++c }' file

Running the above on your example produced two files with the following contents:

$ ls
file  outfile.1  outfile.2
$ cat outfile.1

    -1
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000
    -1
$ cat outfile.2

    -1
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23
    -1

This does have the unfortunate side-effect of adding a blank line to the beginning of each file. If that is a problem, you could just print the -1 explicitly instead:

gawk '
  BEGIN{
    RS="\n    -1\n";
  }
  NR % 2 == 0 { printf "    -1\n%s\n    -1\n", $0 > "outfile." ++c }' file
Answered By: terdon

Using any awk:

$ cat tst.awk
/^    -1$/ {
    if ( inBlock ) {
        print > out; close(out)
    }
    else {
        out = FILENAME "_" (++cnt)
    }
    inBlock = !inBlock
}
inBlock { print > out }

$ awk -f tst.awk example.txt

$ head example.txt_*
==> example.txt_1 <==
    -1
    15
         1         0         0        11 -1.0000E+001  1.0000E+001 -1.0000E+001
         2         0         0        11  1.0000E+001  1.0000E+001 -1.0000E+001
...
        29         0         0        11  1.0000E+001  2.0000E+001  1.0000E+001
        30         0         0        11  5.0000E+000  5.0000E+000  5.0000E+000
    -1

==> example.txt_2 <==
    -1
    780
         1       116         1         2         1         1         7        20
         1        11         2        15         4        18         3        12
        13        16        22        19         5        24         9        29
         8        27         6        23
    -1
Answered By: Ed Morton