How to catch all lines of a repeating pattern and do some actions with the subresults

I am looking for a way to capture, in a text with a repeating pattern, the variable number of lines between each occurrence of the pattern, and then perform an action on each group in bash.

Example text:

Total:
text1
text2
Total:
text3
Total:
Text1
Text4
Text5

What I am aiming to do is basically a for loop over the matches with Total: and then do an action with it, which is always going to be the first section of the follow-up subtext.

Something like in high level language:
for (cat filename = every "Total:" do <something> end

Now the interesting part for me is basically how to organize that for loop?

In the <something> part I want to do some jq and awk.

Based on the example text, the results would be these three matches:

1.
Total:
text1
text2

2.
Total:
text3

3.
Total:
Text1
Text4
Text5

Hope the last description describes it.

What could be the right tool for catching this? Would that be rather a combination of for and grep or for and awk?

I would like to use no more than GNU tools. So no perl or other external tools.

Thanks a lot.

Asked By: André Letterer

||

There’s no Right Tool©, but many appropriate ones, certainly including awk (but not the shell). The classic approach is to have a variable that changes value when you find your string. For example, say you wanted to concatenate each section together:

$ awk '
{ 
 if($0 == "Total:"){
   c++
 } 
 else{
   lines[c] = lines[c] ? lines[c]","$0 : $0
 }
}
END{
  for (c in lines){
    printf "Text for total %d:\n%s\n",c,lines[c]
  }
}' file 
Text for total 1:
text1,text2
Text for total 2:
text3
Text for total 3:
Text1,Text4,Text5

Or, if you just want to separate them, you can set the record separator to Total: and do something like (with GNU awk):

$ gawk -v RS="Total:" 'NR>1{ print "Section "(NR-1),$0}' file
Section 1 
text1
text2

Section 2 
text3

Section 3 
Text1
Text4
Text5

(even better, use RS="(^|\n)Total:\n" as in Ed Morton’s answer)

It really all depends on what you want to do. Awk is a programming language, you are only really limited by your imagination*.

*Assuming, that is, the main objective of the program is parsing text. You won’t have much fun trying to implement a 3D shooter in awk, although I won’t be surprised if some crazy masochist industrious awk programmer has done this.

Answered By: terdon

I find the question to be a bit broad, but as a very generic answer, in Perl you could match the action(s) based on a pattern and then do something with them.

perl -wne '
  chomp; 
  if (/^(Total:)$/) { 
    $Last_Action = $1; 
    next 
  }; 
  print "Applying ${Last_Action} on line ${.}: ${_}\n"
' <test.input

The print "Applying ${Last_Action} on line ${.}: ${_}\n" is the part you want to change to change how the script will respond to different actions. You could have e.g. an if statement that will do different things according to the last action that was matched. You would have to add more patterns to /^(Total:)$/ in order to catch more actions.

You don’t disclose exactly what to do with those lines, so in this case I’m just printing the line number and the action that would be applied to it, followed by the line contents, but you can do whatever you want with them.

perl -wne 'chomp; if (/^(Total:)$/) { $Last_Action = $1; next }; print "Applying ${Last_Action} on line ${.}: ${_}\n"' <test.input
Applying Total: on line 2: text1
Applying Total: on line 3: text2
Applying Total: on line 5: text3
Applying Total: on line 7: Text1
Applying Total: on line 8: Text4
Applying Total: on line 9: Text5
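
Since the question asks to avoid Perl, the same "remember the last header, then act on each following line" idea can be sketched in awk. This is only an illustration: the second header Average:, the function name dispatch, and the two output labels are hypothetical placeholders for whatever actions you actually need.

```shell
# Dispatch on the most recently seen header line.
# "Average:" is a made-up second header, used only to show the branching.
dispatch() {
    awk '
        /^(Total|Average):$/ { action = $0; next }   # remember the last header
        action == "Total:"   { print "total-handler: " $0; next }
        action == "Average:" { print "average-handler: " $0 }
    ' "$@"
}

# Usage with a small made-up input:
printf '%s\n' 'Total:' a b 'Average:' c | dispatch
```

For that input the sketch prints total-handler: a, total-handler: b and average-handler: c, one per line.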
Answered By: kos

The question is kind of open ended without a specific output required for a specific input. There is a language for extracting data using multi-line patterns across text documents: TXR.

Let’s assume that you have repetitions in your data like text4 on purpose:

Total:
text1
text2
 random
  junk
Total:
text3
 more
  random
 junk
Total:
text7
no
match
  here
Total:
text1
text4
text5

Say we wanted to look for that pattern where there is a two-line Total: section, and then somewhere later a one-line one, and then a third three-line one in which the first line matches the first line of the first one:

$ txr match.txr data
t1: text1
t2: text2
t3: text3
t4: text4
t5: text5

Where match.txr is:

Total:
@text1
@text2
@(skip)
Total:
@text3
@(skip)
Total:
@text1
@text4
@text5
@(output)
t1: @text1
t2: @text2
t3: @text3
t4: @text4
t5: @text5
@(end)

There are many ways to do things, depending on what the requirements are. We could simply iterate on the sections headed by Total: and whatnot.

$ txr  tabulate.txr data
Total: text1,text2, random,  junk
Total: text3, more,  random, junk
Total: text7,no,match,  here
Total: text1,text4,text5

where tabulate.txr is:

@(collect)
Total:
@   (collect)
@line
@   (until)
Total:
@   (end)
@(end)
@(output)
@  (repeat)
Total: @{line ","}
@  (end)
@(end)
Answered By: Kaz

Using GNU awk for multi-char RS, RT, and use of NUL (\0) to split the file into NUL-separated multi-line records:

while IFS= read -r -d '' rec; do
    printf '=====\n%s\n=====\n' "$rec"
done < <(
        awk -v rs='Total:' -v ORS='\0' '
            BEGIN { RS = "(^|\n)((" rs "\n)|$)" }
            NR>1 { print rs "\n" $0 }
        ' file
    )

Using any awk and use of Form-Feed (\f) (or any other character you know can’t be in the input) to split the file into FF-separated multi-line records:

sep=$'\f'    # or whatever non-NUL character you prefer
while IFS= read -r -d "$sep" rec; do
     printf '=====\n%s\n=====\n' "$rec"
done < <(
        awk -v rs='Total:' -v ORS="$sep" '
            $0 == rs { if (NR>1) print rec; rec=$0; next }
            { rec = rec RS $0 }
            END { if (NR>1) print rec }
        ' file
    )

Both will output:

=====
Total:
text1
text2
=====
=====
Total:
text3
=====
=====
Total:
Text1
Text4
Text5
=====

Replace the printf with whatever command you want to run on each multi-line record.
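
As one hedged illustration of such a replacement, the sketch below defines a hypothetical process_record function (the name and the report format are assumptions, not part of the original answer). It uses awk to pick out the first line after Total: and to count the lines that follow the header. Inside either loop you would call process_record "$rec" in place of the printf.

```shell
# Hypothetical per-record handler: takes one multi-line record as its first
# argument, e.g. the value of "$rec" inside either read loop.
process_record() {
    printf '%s\n' "$1" | awk '
        NR == 2 { first = $0 }            # line 1 is the "Total:" header
        END { printf "first=%s lines=%d\n", first, NR - 1 }
    '
}

# Stand-alone usage with a record shaped like the ones the loops produce:
rec=$(printf 'Total:\ntext1\ntext2')
process_record "$rec"
```

For the record above this prints first=text1 lines=2. A jq call or any other command could take the place of the awk script in the same way.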

Explanations:

You could do this using GNU awk for multi-char RS, RT, and use of NUL (\0) to split the file into NUL-separated records and then a bash read loop to process them one at a time however you like:

while IFS= read -r -d '' rec; do
    printf '=====\n%s\n=====\n' "$rec"
done < <(
        awk -v rs='Total:' -v ORS='\0' '
            BEGIN { RS = "(^|\n)((" rs "\n)|$)" }
            NR>1 { print rs "\n" $0 }
        ' file
    )

The above uses awk to do what it is designed to do, i.e. manipulate text, and the shell to do one of the things it is designed to do, i.e. sequence calls to tools. You COULD do it all inside the call to awk using system() to call other tools on each block of text but then you’re using awk to do what a shell is designed to do, i.e. sequence calls to tools, and so the resulting code would be even harder to write robustly and slower (due to spawning a subshell per block of input) than calling those tools directly from shell as I’m doing above.

The awk script is looking for records separated by Total: on a line of its own, so we need to set RS to include the \n before and after Total:, otherwise it’d match anywhere on a line, and we need to include the ^ as a possibility before Total: so it also matches at the start of the input. At the end of the file the last record ends with just a \n on its own, so we need to add that possibility (\n$) to the RS too. Remember – despite what is often said, $ does not mean end of line in a regexp, it means end of string/buffer, so in an RS the $ will only match at the end of the input file, just like ^ only matches at the start of the input file, not at the start of each line.

If you’re not sure what any of that means, just add some tracing print statements to dump RT and $0 values for each record, e.g.:

$ awk -v rs='Total:' -v ORS='' '
        BEGIN { RS = "(^|\n)((" rs "\n)|$)" }
        NR>1 {
            printf "NR=<%d>, $0=<%s>, RT=<%s>\n-----\n", NR, $0, RT
            #print rs "\n" $0
        }
    ' file
NR=<2>, $0=<text1
text2>, RT=<
Total:
>
-----
NR=<3>, $0=<text3>, RT=<
Total:
>
-----
NR=<4>, $0=<Text1
Text4
Text5>, RT=<
>
-----

The record numbers start at 2 because the first record is the empty string before the first line of the file: that first line contains the record separator, Total:\n, so by definition there must be some record that ends with that string, even if it’s empty.

If your awk doesn’t support multi-char RS and/or printing NUL chars then with any awk you could construct the record 1 line at a time and choose some other character that you know (hope!) can’t appear in your input, e.g. some control-char like \r for Carriage Return or \f for Form Feed, for the ORS and then change your bash read loop to use that as the delimiter (the -d ... argument), e.g.:

sep=$'\f'    # or whatever character you prefer
while IFS= read -r -d "$sep" rec; do
     printf '=====\n%s\n=====\n' "$rec"
done < <(
        awk -v rs='Total:' -v ORS="$sep" '
            $0 == rs { if (NR>1) print rec; rec=$0; next }
            { rec = rec RS $0 }
            END { if (NR>1) print rec }
        ' file
    )

The check for NR>1 in the END section is so we don’t print a blank line given an empty input file but instead just don’t output anything for that case.

Answered By: Ed Morton