A set of paragraphs of 4 lines to manage with AWK

I have a file composed of several paragraphs (more than 2000) of 4 lines.
For each paragraph, I need to match the content between brackets like the example below.

So for each paragraph,

  • the first two lines are the inputs and stay unchanged.
  • for the third line, the current content between the brackets is replaced by the content between the second line brackets.
  • for the fourth line, the current content between the brackets is replaced by the content between the first line brackets.

I hope it’s clear enough.

–Inputs–

A1 [A3 A4 A5] A2
B1 [B3 B4 B5] B2
C1 [C3 C4] C2
D1 [D3 D4] D2

E1 [E3 E4 E5] E2
F1 [F3 F4 F5] F2
G1 [G3 G4] G2
H1 [H3 H4] H2

–Outputs–

A1 [A3 A4 A5] A2
B1 [B3 B4 B5] B2
C1 [B3 B4 B5] C2
D1 [A3 A4 A5] D2

E1 [E3 E4 E5] E2
F1 [F3 F4 F5] F2
G1 [F3 F4 F5] G2
H1 [E3 E4 E5] H2

Do you have a solution? I guess it can be done with awk and gsub, but I don’t know how.

Asked By: titof poule

||

GNU awk, assuming that there are no regex-special characters between the brackets:

$ gawk -vRS= '
  BEGIN{OFS=FS="\n"}
  match($1,/\[[^]]*\]/,x) && match($2,/\[[^]]*\]/,y) {
    sub(/\[[^]]*\]/,y[0],$3);
    sub(/\[[^]]*\]/,x[0],$4);
    printf "%s%s", $0, RT
  }
  ' file
A1 [A3 A4 A5] A2
B1 [B3 B4 B5] B2
C1 [B3 B4 B5] C2
D1 [A3 A4 A5] D2

E1 [E3 E4 E5] E2
F1 [F3 F4 F5] F2
G1 [F3 F4 F5] G2
H1 [E3 E4 E5] H2

The same is essentially do-able in non-GNU awk except you will need to use substr($1,RSTART,RLENGTH) etc. to obtain the replacements, and you won’t be able to use RT to restore the original input record separators:

awk '
  BEGIN{RS=""; ORS="\n\n"; OFS=FS="\n"}
  match($1,/\[[^]]*\]/) {x = substr($1,RSTART,RLENGTH)}
  match($2,/\[[^]]*\]/) {y = substr($2,RSTART,RLENGTH)}
  {
    sub(/\[[^]]*\]/,y,$3);
    sub(/\[[^]]*\]/,x,$4);
    print
  }
  ' file
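
Both variants rely on awk's paragraph mode. As a minimal sketch of what that mode does (the function name and the sample data are made up for illustration), setting RS to the empty string makes each blank-line-separated block one record, and FS="\n" makes each line of the block one field:

```shell
# Hypothetical demo: two paragraphs, of 2 and 3 lines.
# RS="" switches awk to paragraph mode; FS="\n" makes every line a field.
paragraph_demo() {
  printf 'A1\nB1\n\nE1\nF1\nG1\n' |
    awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " NF " lines, first=" $1 }'
}
paragraph_demo
```

This should report one record per paragraph, with NF equal to the number of lines in it, which is why $3 and $4 in the answers above are the third and fourth lines of a paragraph.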
Answered By: steeldriver

awk -F'[][]' -vOFS= '++i==1 {a=$2} i==2 {b=$2} i==3 {$2="[" b "]"} i==4 {$2="[" a "]"} !NF {i=0} 1' input.txt

With square brackets as the field separators, your replacement sources/targets are in $2.

We increment i on each line, and reset it to zero between paragraphs. The value of i (1 through 4) tells us what to do with $2.
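
To see the field splitting in isolation, here is a small sketch (the function name is hypothetical, and the angle brackets are only there to make the spaces visible). The bracket expression [][] matches either ] or [, so the text between the brackets ends up in $2:

```shell
# Hypothetical demo: split one sample line on "[" and "]".
field_split_demo() {
  printf '%s\n' 'A1 [A3 A4 A5] A2' |
    awk -F'[][]' '{ printf "1=<%s> 2=<%s> 3=<%s>\n", $1, $2, $3 }'
}
field_split_demo
```

Note that $1 keeps its trailing space and $3 its leading space, which is why an empty OFS is enough to rebuild the line with the original spacing.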

Answered By: Oh My Goodness

$ cat tst.awk
match($0,/\[.*\]/) {
    idx = (NR - 1) % 5 + 1
    sect[idx] = substr($0,RSTART,RLENGTH)
    if ( idx == 3 ) {
        $0 = $1 OFS sect[2] OFS $NF
    }
    else if ( idx == 4 ) {
        $0 = $1 OFS sect[1] OFS $NF
    }
}
{ print }

$ awk -f tst.awk file
A1 [A3 A4 A5] A2
B1 [B3 B4 B5] B2
C1 [B3 B4 B5] C2
D1 [A3 A4 A5] D2

E1 [E3 E4 E5] E2
F1 [F3 F4 F5] F2
G1 [F3 F4 F5] G2
H1 [E3 E4 E5] H2

The above does string replacement so it’ll work even if the sections inside brackets contain regexp metachars or backreferences.
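
As a rough illustration of that difference (the function names and the replacement text [A&B] are invented for this sketch), compare sub(), which treats & in the replacement as "the matched text", with plain string concatenation:

```shell
# Hypothetical demo: the replacement text contains "&".
with_sub() {
  printf '%s\n' 'C1 [C3 C4] C2' |
    awk '{ r = "[A&B]"; sub(/\[[^]]*\]/, r); print }'
}
with_concat() {
  printf '%s\n' 'C1 [C3 C4] C2' |
    awk '{ r = "[A&B]"; $0 = $1 OFS r OFS $NF; print }'
}
with_sub      # "&" is expanded to the matched text
with_concat   # "&" is kept literally
```

The first call should print C1 [A[C3 C4]B] C2 and the second C1 [A&B] C2. The concatenation variant assumes, like the script above, that the text before and after the brackets is a single whitespace-separated field.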

Answered By: Ed Morton

And also with GNU awk for the 3rd argument to match() and using a second array indexed with NR and the default settings for RS and FS:

Updated:

awk  '
{
    match($0, /([^[]*)(\[.*\])([^]]*)/,a)
    b[NR]=a[2]
    if (NR==3){print a[1], b[NR-1],a[3];next}
    if (NR==4){print a[1], b[NR-3],a[3];next}
    else {print a[1], a[2], a[3]}
    if ($0 == "") {NR=0}
}' file
A1  [A3 A4 A5]  A2
B1  [B3 B4 B5]  B2
C1  [B3 B4 B5]  C2
D1  [A3 A4 A5]  D2

E1  [E3 E4 E5]  E2
F1  [F3 F4 F5]  F2
G1  [F3 F4 F5]  G2
H1  [E3 E4 E5]  H2

Answered By: Carlos Pascual

With GNU sed:

sed -n -E '
    1~5 { p; s/.*(\[.*\]).*/\1/;h };
    2~5 { p; s/.*(\[.*\]).*/\1/;
          N; s/\n//; s/^(\[.*?\])([^[]*)\[.*\]/\2\1/;p;x;
          N; s/\n//; s/^(\[.*?\])([^[]*)\[.*\]/\2\1/;p;
}; 5~5p'  infile

TL;DR

1~5 { ... }: applies to every 5th line, starting from the first line;
2~5 { ... }: applies to every 5th line, starting from the second line; and
5~5 p: applies to every 5th line, starting from the fifth line.
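
A quick way to check what a first~step address selects is to run it over numbered lines. This is only a sketch with a made-up function name, and it assumes GNU sed, since first~step is a GNU extension:

```shell
# Hypothetical demo: feed the numbers 1..10, one per line, and keep
# only the lines selected by the address 2~5 (GNU sed only).
every_fifth_from_second() {
  printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | sed -n '2~5p'
}
every_fifth_from_second
```

With GNU sed this should print 2 and 7, that is, the second line of each 5-line block (4 paragraph lines plus the empty separator line).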

Breaking each command down:

  • 1~5 { p; s/.*(\[.*\]).*/\1/;h }:

    • the p command prints the entire line that matched the 1~5 address, so the first line of the first paragraph goes to the output unchanged; the output is now:

      A1 [A3 A4 A5] A2
      
    • with s/.*(\[.*\]).*/\1/, we keep only the [ ... ] part of that line and remove everything else from the pattern space; then

    • with the h command we copy that result into the hold space, so the hold space now contains [A3 A4 A5].

  • 2~5 { p; s/.*(\[.*\]).*/\1/;:

    • the p command: almost the same as above, but for every 5th line starting from the second line, so it prints the second line; the output is now:

      A1 [A3 A4 A5] A2
      B1 [B3 B4 B5] B2
      
    • with s/.*(\[.*\]).*/\1/, we again keep only the [ ... ] part, this time from the second line, and remove everything else; the pattern space now contains [B3 B4 B5] (remember that the hold space is unchanged and still holds [A3 A4 A5])

    • in N; s/\n//; s/^(\[.*?\])([^[]*)\[.*\]/\2\1/; p; x;

      • N reads the next line (now the 3rd line) and appends it to the pattern space with an embedded newline in between, so the pattern space becomes:

        [B3 B4 B5]
        C1 [C3 C4] C2
        
      • with s/\n//; we first delete that embedded newline; the pattern space now holds:

        [B3 B4 B5]C1 [C3 C4] C2
        
      • in s/^(\[.*?\])([^[]*)\[.*\]/\2\1/; p; x;:

      • ^(\[.*?\]) captures the [B3 B4 B5] part at the beginning of the line as back-reference \1

      • ([^[]*) captures the C1 part as back-reference \2

      • \[.*\] matches the [C3 C4] part, which is not captured and is therefore dropped from the line

      • the replacement \2\1 keeps only those two captures, in swapped order, so the pattern space is now:

        C1 [B3 B4 B5] C2
        
      • next command is p, OK, print it; now output is:

        A1 [A3 A4 A5] A2
        B1 [B3 B4 B5] B2
        C1 [B3 B4 B5] C2
        
      • now the pattern space is C1 [B3 B4 B5] C2 and the hold space is still [A3 A4 A5]; and

      • with the next command, x, we exchange the pattern space with the hold space; the pattern space is now [A3 A4 A5], and the hold space holds the old line, which we no longer need.

    • in N; s/\n//; s/^(\[.*?\])([^[]*)\[.*\]/\2\1/; p;:

      • N reads the next line (now the 4th line) and appends it to the pattern space with an embedded newline in between, so the pattern space becomes:

        [A3 A4 A5]
        D1 [D3 D4] D2
        
      • with s/\n//; we first delete that embedded newline; the pattern space now holds:

        [A3 A4 A5]D1 [D3 D4] D2
        
      • in s/^(\[.*?\])([^[]*)\[.*\]/\2\1/; p;:

      • ^(\[.*?\]) captures the [A3 A4 A5] part at the beginning of the line as back-reference \1

      • ([^[]*) captures the D1 part as back-reference \2

      • \[.*\] matches the [D3 D4] part, which is not captured and is therefore dropped from the line

      • the replacement \2\1 keeps only those two captures, in swapped order, so the pattern space is now:

        D1 [A3 A4 A5] D2  
        
      • next command is p, OK, print it; now output is:

        A1 [A3 A4 A5] A2
        B1 [B3 B4 B5] B2
        C1 [B3 B4 B5] C2
        D1 [A3 A4 A5] D2
        
  • with 5~5p we print every 5th line starting from line 5, that is, the empty line between paragraphs.
    The first paragraph is now processed, and sed repeats the same steps until all lines have been read and processed.

Answered By: αғsнιη

perl -00pe 's/(.*)(\[.*\])(.*\n)
              (.*)(\[.*\])(.*\n)
              (.*)(\[.*\])(.*\n)
              (.*)(\[.*\])(.*\n)
             /$1$2$3$4$5$6$7$5$9$10$2$12/x' input1

perl -00pe — for each paragraph.

Each line of the RE matches an input paragraph line and separates it in the relevant parts.
In the substitution group we
just have to reorder the parts.

Sorry for the obfuscation…

Answered By: JJoao