How do I use grep, awk, or sed to get a substring of a line up until a string literal?
I am trying to process a text file and omit a certain string literal if it occurs at the end of the line. E.g.:
Source:
ABC 123
DEF, characters I don't want
GHI, these characters are ok
Desired Output:
ABC 123
DEF
GHI, these characters are ok
If I do grep -v ", characters I don't want$", it omits that entire line.
I can’t just print an awk column, since I want to keep the ", these characters are ok" substring.
I can’t use cut to split on a delimiter, because the delimiter needs to be multiple characters (, characters I don't want).
With Python it would be super simple with something like: string.split(", characters I don't want", 1)[0]
(Tangentially, I’m wondering in which use cases it really is preferable to use grep, awk, or sed in complicated situations like this vs Python when Python is so much more readable and maintainable.)
Most obvious here would be to use sed:
<source sed "s/, characters I don't want\$//"
That substitutes (s) that string with nothing when it is found at the end of the line ($, which we escape as \$ for the shell, to be future-proof in case $/ ever means something to the shell).
To also remove whatever follows that string, if anything, replace the \$ with .*, though we’d need to change the locale to C to guarantee that .* matches everything up to the end of the line even if that’s not valid text in the user’s locale:
<source LC_ALL=C sed "s/, characters I don't want.*//"
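As a quick check of the two sed commands above, here is a minimal sketch that feeds them the sample lines from the question. The function names strip_at_eol and strip_to_eol are made up for illustration, and the input comes from a pipe rather than from a file called source.

```shell
# Hypothetical helper names; the sed expressions are the ones shown above.
strip_at_eol() {
  # Remove the literal only when it ends the line.
  sed "s/, characters I don't want\$//"
}

strip_to_eol() {
  # Remove the literal and whatever follows it on the line.
  LC_ALL=C sed "s/, characters I don't want.*//"
}

# Sample input from the question.
printf '%s\n' 'ABC 123' "DEF, characters I don't want" 'GHI, these characters are ok' |
  strip_at_eol
```

Wrapping each command in a function only serves to make it reusable in a pipeline; the regexes themselves are unchanged.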
With GNU grep or compatible, when built with perl-like regexp support, that could be:
<source LC_ALL=C grep -Po "^.*?(?=(, characters I don't want)?$)"
Or to remove everything if any after that string as well:
<source LC_ALL=C grep -Po "^.*?(?=, characters I don't want|$)"
Or with pcregrep (when perl-like regex support is enabled in GNU grep, that’s actually via libpcre, which comes with pcregrep as an example application, though pcregrep has features beyond those of GNU grep):
<source pcregrep -o1 "^(.*?)(, characters I don't want)?$"
Or to remove everything if any after that string as well:
<source pcregrep -o1 "^(.*?)(, characters I don't want|$)"
If the text to remove may contain anything, including / or regex operators (but not newline characters, which wouldn’t make sense, nor NUL characters, which can’t be passed in command arguments or environment variables), and is stored in a shell variable, you do not want to use sed "s/$string\$//", as that would make it a command injection vulnerability.
With the perl-grep ones, you can use:
string='/.*^$'
<source LC_ALL=C grep -Po "^.*?(?=(\Q$string\E)?$)"
<source pcregrep -o1 "^(.*?)(\Q$string\E)?$"
Or to remove everything if any after that string as well:
<source LC_ALL=C grep -Po "^.*?(?=\Q$string\E|$)"
<source pcregrep -o1 "^(.*?)(\Q$string\E|$)"
That still chokes on values of $string that contain \E, though not with as dramatic consequences as with sed.
Or you could use perl directly, which has a sed mode with its -p option, has mechanisms to pass arbitrary strings (here using -s for crude option passing, but you could also use @ARGV directly (the equivalent of python’s sys.argv) or environment variables (mapped to the %ENV associative array)), and can quote strings inside regexps with \Q...\E (here, \E in $string is not a problem):
<source perl -spe 's/\Q$string\E$//' -- -string="$string"
Or to remove everything if any after that string as well:
<source perl -spe 's/\Q$string\E.*$//' -- -string="$string"
By default, perl treats input as bytes, not as text encoded in the user’s locale charset, so we don’t need to change the locale there.
Note that, contrary to sed, the line delimiter is included by default in the pattern space ($_ in perl, on which s/// acts by default), and perl’s $ regex operator matches either at the end of the subject or before a line delimiter at the end of the subject, so it copes with both delimited and undelimited lines.
Using any awk:
$ awk 'n=index($0 RS,", characters I don\047t want" RS){$0=substr($0,1,n-1)} 1' file
ABC 123
DEF
GHI, these characters are ok
That’s doing a literal string comparison, so it’d work even if the string you’re trying to match contained regexp metachars. For example, using this input:
$ cat file2
ABC 123
DEF, .*, .*
GHI, .* ok
We get the expected output:
$ awk 'n=index($0 RS,", .*" RS){$0=substr($0,1,n-1)} 1' file2
ABC 123
DEF, .*
GHI, .* ok
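If the string lives in a shell variable, the same literal comparison could be wrapped up as below. This is only a sketch: the function name strip_literal_awk and the environment variable name STRIP are made up, and ENVIRON is used instead of -v because awk expands backslash escapes in -v values.

```shell
# Hypothetical wrapper: $1 is the literal text to strip from the end of each line.
strip_literal_awk() {
  STRIP=$1 awk '
    BEGIN { s = ENVIRON["STRIP"] }
    # Appending RS to both sides only lets the literal match at end of line.
    n = index($0 RS, s RS) { $0 = substr($0, 1, n - 1) }
    1
  '
}

# Same input as file2 above.
printf '%s\n' 'ABC 123' 'DEF, .*, .*' 'GHI, .* ok' | strip_literal_awk ', .*'
```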
If you didn’t care about regexp metachars you could just do:
$ awk '{sub(/, characters I don\047t want$/,"")} 1' file
ABC 123
DEF
GHI, these characters are ok
but then you’d get unexpected output from:
$ awk '{sub(/, .*$/,"")} 1' file2
ABC 123
DEF
GHI
and you’d have to escape the metachars to make them literal to get the expected output:
$ awk '{sub(/, \.\*$/,"")} 1' file2
ABC 123
DEF, .*
GHI, .* ok
which is getting kludgy given all you really wanted was a literal string comparison.
See http://awk.freeshell.org/PrintASingleQuote for why I’m using \047 for the single quote.