How do I use grep, awk, or sed to get a substring of a line up until a string literal?

Question

I am trying to process a text file and omit a certain string literal if it occurs at the end of the line. E.g.:

Source:

ABC 123
DEF, characters I don't want
GHI, these characters are ok

Desired Output:

ABC 123
DEF
GHI, these characters are ok

If I do grep -v ", characters I don't want$", it omits that entire line.

I can’t do a simple awk column since I want the , these characters are ok substring

I can’t use cut to split on a delimiter because the delimiter needs to be multiple characters (, characters I don't want).

With Python it would be super simple with something like: string.split(", characters I don't want", 1)[0]

(Tangentially, I’m wondering in which use cases it really is preferable to use grep, awk, or sed in complicated situations like this vs Python when Python is so much more readable and maintainable.)

Asked By: Evan Harmon

Answer 1

Most obvious here would be to use sed:

<source sed "s/, characters I don't want\$//"

That removes the string when it is found at the end of the line ($, which we write as \$ for the shell, to be future-proof in case $/ ever means something to the shell).

To also remove whatever follows that string, if anything, replace the $ with .*, though we’d need to change the locale to C to guarantee .* matches everything up to the end of the line even if that’s not valid text in the user’s locale:

<source LC_ALL=C sed "s/, characters I don't want.*//"
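To see the difference between the two sed variants, here is a small sketch run against the question's sample. The fourth input line and the variable names (input, end_only, to_eol) are assumptions added for illustration, not part of the original data.

```shell
# Sample input from the question plus one extra line (an assumption added
# here) in which more text follows the unwanted string.
input=$(mktemp)
cat >"$input" <<'EOF'
ABC 123
DEF, characters I don't want
GHI, these characters are ok
JKL, characters I don't want, and more
EOF

# Variant 1: remove the string only when it ends the line.
end_only=$(sed "s/, characters I don't want\$//" <"$input")

# Variant 2: remove the string and everything after it.
to_eol=$(LC_ALL=C sed "s/, characters I don't want.*//" <"$input")

printf '%s\n' "--- end of line only ---" "$end_only"
printf '%s\n' "--- through end of line ---" "$to_eol"
rm -f "$input"
```

With the first variant the JKL line is left untouched because the string is not at the end of it; the second variant cuts that line down to JKL.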

With GNU grep or compatible, when built with perl-like regexp support, that could be:

<source LC_ALL=C grep -Po "^.*?(?=(, characters I don't want)?\$)"

Or to remove everything if any after that string as well:

<source LC_ALL=C grep -Po "^.*?(?=, characters I don't want|\$)"

Or with pcregrep (when perl-like regex support is enabled in GNU grep, that’s actually via libpcre which comes with pcregrep as an example application though has features beyond those of GNU grep):

<source pcregrep -o1 "^(.*?)(, characters I don't want)?\$"

Or to remove everything if any after that string as well:

<source pcregrep -o1 "^(.*?)(, characters I don't want|\$)"

If the text to remove may contain anything, including / or regex operators (but not newline characters, which wouldn’t make sense, nor NUL characters, which can’t be passed in command arguments or environment variables), and is stored in a shell variable, you do not want to use ~~sed "s/$string\$//"~~ as that would make it a command injection vulnerability.

With the perl-grep ones, you can use:

string='/.*^$'
<source LC_ALL=C grep -Po "^.*?(?=(\Q$string\E)?\$)"
<source pcregrep -o1 "^(.*?)(\Q$string\E)?\$"

Or to remove everything if any after that string as well:

<source LC_ALL=C grep -Po "^.*?(?=\Q$string\E|\$)"
<source pcregrep -o1 "^(.*?)(\Q$string\E|\$)"

That still chokes on $strings that contain \E, though not with as dramatic consequences as with sed.

Or you could use perl directly, which has a sed mode with its -p option, has mechanisms to pass arbitrary strings (here using -s for crude option passing, but you could also use @ARGV directly (the equivalent of python’s sys.argv) or environment variables (mapped to the %ENV associative array)), and can \Quote strings inside regexps (here with \E in $string not being a problem):

<source perl -spe 's/\Q$string\E$//' -- -string="$string"

Or to remove everything if any after that string as well:

<source perl -spe 's/\Q$string\E.*$//' -- -string="$string"

perl treats input as bytes not as if encoded in the user’s locale charset by default, so we don’t need to change the locale there.

Note that contrary to sed, the line delimiter is included in the pattern space ($_ in perl on which s/// acts by default) by default and its $ regex operator matches either at the end of the subject or before a line delimiter at the end of the subject so is able to cope with both delimited and undelimited lines.

Answered By: Stéphane Chazelas

Answer 2

Using any awk:

$ awk 'n=index($0 RS,", characters I don\047t want" RS){$0=substr($0,1,n-1)} 1' file
ABC 123
DEF
GHI, these characters are ok

That’s doing a literal string comparison so it’d work even if the string you’re trying to match with contained regexp metachars, for example using this input:

$ cat file2
ABC 123
DEF, .*, .*
GHI, .* ok

We get the expected output:

$ awk 'n=index($0 RS,", .*" RS){$0=substr($0,1,n-1)} 1' file2
ABC 123
DEF, .*
GHI, .* ok

If you didn’t care about regexp metachars you could just do:

$ awk '{sub(/, characters I don\047t want$/,"")} 1' file
ABC 123
DEF
GHI, these characters are ok

but then you’d get unexpected output from:

$ awk '{sub(/, .*$/,"")} 1' file2
ABC 123
DEF
GHI

and you’d have to escape the metachars to make them literal to get the expected output:

$ awk '{sub(/, \.\*$/,"")} 1' file2
ABC 123
DEF, .*
GHI, .* ok

which is getting kludgy given all you really wanted was a literal string comparison.

See http://awk.freeshell.org/PrintASingleQuote for why I’m using \047 instead of '.
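If the string to strip lives in a shell variable, the index() approach can be wrapped roughly as below. This is a sketch, not the only way to do it: the function name strip_literal_suffix and the environment variable SUFFIX are made up for illustration. The string is passed through ENVIRON rather than -v because -v would interpret backslash escape sequences in it.

```shell
# strip_literal_suffix SUFFIX
# Hypothetical helper: reads lines on stdin and removes SUFFIX when it ends
# the line, comparing it as a literal string rather than as a regexp.
strip_literal_suffix() {
    SUFFIX=$1 awk '
        BEGIN { suffix = ENVIRON["SUFFIX"] }
        {
            # Appending RS to both strings anchors the literal match at the
            # end of the line, since RS cannot occur inside $0.
            n = index($0 RS, suffix RS)
            if (n) $0 = substr($0, 1, n - 1)
            print
        }'
}
```

Usage would be along the lines of `strip_literal_suffix ", characters I don't want" < file`, which avoids the \047 workaround as well since the quote never appears inside the awk script.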

As for why to use awk instead of python – awk is a mandatory POSIX tool and so is guaranteed to exist on all POSIX-compliant Unix installations while python is not, and it usually takes much less code to manipulate text with awk than it does with python. I suspect we will have to agree to disagree on which is more easily readable and maintainable.

Answered By: Ed Morton

Answer 3

Filtering out the contents at the end of a line when those contents are known in advance can be fairly easy in Bash and shells that support similar variable expansion functions. For example:

#!/usr/bin/env bash
line='DEF, characters I do not want'
echo "${line%, characters I do not want}"

will print:

DEF

The syntax ${var%string} returns the contents of $var with the string after the % removed from the end of the contents. In this example the string to be removed is ", characters I do not want". If that string isn’t at the end, the full contents of $line are returned. There are variations for removing a string from the start of the variable, as well as a substitution that can replace a string in the middle of the contents or delete it.

I admit in the above example to changing don't -> do not in order to avoid complications from using single quotes when assigning the string to the $line variable.
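To apply the same expansion to every line of a file, a loop along these lines might do. It is a sketch under a few assumptions: the function name strip_suffix_lines is made up for illustration, and the suffix is kept in a variable, which also sidesteps the single-quote problem. Note that the text after % is treated as a shell pattern, so the variable is quoted inside the expansion to have it taken literally.

```shell
#!/usr/bin/env bash
# strip_suffix_lines SUFFIX
# Hypothetical helper: reads lines on stdin and removes SUFFIX when it ends
# the line, using only shell built-ins.
strip_suffix_lines() {
    local suffix="$1"
    local line
    # IFS= and -r keep leading whitespace and backslashes intact; the ||
    # test also handles a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        # Quoting "$suffix" makes characters such as * and ? literal
        # instead of glob operators.
        printf '%s\n' "${line%"$suffix"}"
    done
}
```

It would be called as `strip_suffix_lines ", characters I don't want" < file`. A read loop like this tends to be much slower than sed or awk on large inputs, so it is probably best kept for small files or single strings.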

The advantage of this approach is that your script doesn’t need to invoke an external command to perform the simple filtering. But is it a replacement for the power of python? Probably not, but there may be other factors that push you toward using a shell script rather than python for this task.

Answered By: Sotto Voce