Command or script to delete all text in between two flags (html tags), in all files in a directory?

I have a directory full of html files that all have certain tags that I want deleted. For instance, in all files I have <p class="message"> </p> that I want gone, but the text inside the tags is different in each file.

For cases where the text in each file is the same I have been using

find . -type f -name '*.html' -exec sed -i'' -e 's/existing/replacement/g' {} +

to replace the old text with the new. However in the example above, the different text in between the tags means that won’t work.

Is there a similar command or tool available that would allow me to delete or replace everything between two specified strings?

Asked By: tei


Personally, I hate getting the ole’ "look at this" as an answer. However, in this case, a different thread explains this exact procedure well.


Answered By: BradB82

Using Raku (formerly known as Perl_6)

~$ raku -e 'my regex L { "<p class=\"message\">" };
            my regex R { "</p>" };
            my $dest-dir = "/path/to/destination/dir/";
            for dir() -> $file {
              with $file.slurp { / <L> .*? <R> /
                ?? my $new-file = .subst( :g, / <L> <(.*?)> <R> / )
                !! next;
              spurt("$dest-dir" ~ "$file".IO, $new-file) };
            };'

Raku is a member of the Perl family of programming languages. Briefly, the L and R regexes are each declared and given a pattern. A $dest-dir scalar is declared and assigned a string. The current directory listing from dir() is iterated through with for, and each $file (an IO::Path object) is analyzed/modified in the following block.

Within the outer block, the $file is slurped (read all at once). In an inner block this text is immediately tested for the presence of the regexes, with the .*? "any-character-zero-or-more-times, taken-frugally" regex in between. Note that the L and R regexes must be called with angle brackets here, i.e. <L> and <R>, because they are within the / ... / matcher.

Within the inner block Raku's ternary operator Test ?? True !! False is used. If the three regexes in series are found, the substitution runs with the central atom .*? wrapped in <( )> capture markers, indicating that the outer matches are to be kept out of the match, so only the .*? text is deleted (subst with no replacement argument). A $new-file is created with these inner characters deleted. If the regex is not found, next skips ahead to the next file. Otherwise the newly created $new-file is written out (spurted) to the destination directory under the original $file name.

Sample Input (in original dir/file):

first line
<p class="message"> foo </p>
<p class="message"> bar </p>
<p class="message">
</p>

last line

Sample Output (written to new dir/file)

first line
<p class="message"></p>
<p class="message"></p>
<p class="message"></p>

last line

Above, the "Sample Output" shows three instances of the inner text of the specified html tags being deleted, even when the opening and closing tags are on different lines. To substitute new (literal string) text instead, change the first segment of code below into the second:

.subst( :g, / <L> <(.*?)> <R> / )

.subst( :g, / <L> <(.*?)> <R> /, "new-text" )

Answered By: jubilatious1

HTML tags can usually span several lines, or there could be more than one per line, so you could use perl's slurp mode (-0777), where the full contents of each file are processed as a whole, and *?, the non-greedy version of *, to match as little as possible between the opening and closing tags.

The -i option of sed is also non-standard; the implementations that support it copied it from perl, though with variations when not using a backup suffix (-i vs -i '').

find . -name '*.html' -type f -exec perl -0777 -pi -e '
  s{<p class="message">.*?</p>}{ }gs' {} +
Answered By: Stéphane Chazelas