extract string before numbers and after underscore

Question

The original string is like this:

str-str001-002_01
str-str005-006_05

I would like to keep the string before the numbers and the part after the underscore, so the result would look like this:

str-str_01
str-str_05

I remember sed could separate pattern into groups like this:

 sed -n 's/\(^.*\)\([0-9]-[0-9]\)\(.*$\)/\1\3/p'

but it prints:

str-str0002_01

Then I remembered that [0-9] matches only one digit, so I tried it with a + sign or a * sign.
That gives an empty result.

ps: by using

echo 'str-str001-002_01' | sed -n 's/\(^.*\)\([0-9]-[0-9]\)\(.*$\)/\2/p'

I can see that it matches 1-0.

Then I tried it with:

echo 'str-str001-002_01' | sed -n 's/\(^.*\)\([0-9]\+-[0-9]\+\)\(.*$\)/\2/p'

It leaves out the first two digits and only matches

1-002

So how do I make it match 001-002?

Asked By: Tiina

Answer 1

This provides the required output:

sed -nE 's/^([^0-9]*).*_([^_]+)$/\1_\2/p'

Output from your example

str-str_01
str-str_05

Explanation

sed -nE 's/…/…/p' – Use EREs, don’t print lines unless they match
^ – anchor to the start of line
([^0-9]*) – match as many non-digit characters as possible (including none)
.*_ – match as much as possible (including nothing), followed by "_"
([^_]+) – match as long a pattern as possible (at least one character) that isn’t an underscore
$ – anchor to the end of line
\1_\2 – replace the entire line with the first (…) match, "_", and second (…) match
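
To see what each group captures, here is a small sketch that prints the two captures separately. The bracketed labels and the second input line (which has no underscore) are only illustrative and not part of the question:

```shell
# Illustrative input: the second line has no underscore, so it cannot match
# and, because of -n, is not printed at all.
out=$(printf '%s\n' 'str-str001-002_01' 'no-underscore-here' |
  sed -nE 's/^([^0-9]*).*_([^_]+)$/group1=[\1] group2=[\2]/p')
printf '%s\n' "$out"
# prints: group1=[str-str] group2=[01]
```

The digits and the dash in the middle are swallowed by the uncaptured .*_ part, which is why they disappear from the result.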

The reason your attempts didn’t work as you expected is because * (and +) is greedy – it will consume as many characters as possible that match the preceding atom. So for an ERE of (.*)([0-9]+) applied to something like abc123, the .* will consume abc12, leaving [0-9]+ to match just 3. You’d need a "not digit" to constrain the first match: ([^0-9]*)([0-9]+) to get abc and 123.
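
The abc123 case above can be checked directly. This is just a sketch; the square brackets in the replacement are only there to make the split point visible:

```shell
# Greedy .* takes as much as it can and leaves a single digit for [0-9]+
greedy=$(printf '%s\n' 'abc123' | sed -E 's/^(.*)([0-9]+)$/[\1] [\2]/')
# Restricting the first group to non-digits moves the split to where the digits start
constrained=$(printf '%s\n' 'abc123' | sed -E 's/^([^0-9]*)([0-9]+)$/[\1] [\2]/')
printf '%s\n' "$greedy" "$constrained"
# prints:
# [abc12] [3]
# [abc] [123]
```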

Answered By: Chris Davies

Answer 2

$ cat file
str-str001-002_01
str-str005-006_05

$ sed 's/[0-9]\{3\}-[0-9]\{3\}//' file
str-str_01
str-str_05

The substitution command here is matching and removing NNN-NNN where NNN is a run of three digits.

To match one or more digits, use 1, (with the trailing comma) in place of 3 inside the braces:

$ sed 's/[0-9]\{1,\}-[0-9]\{1,\}//' file
str-str_01
str-str_05

This corresponds to using + in an extended regular expression. The regular expressions used by sed by default are "basic" regular expressions, in which + matches a literal plus character. Most sed implementations also support extended expressions with -E:

$ sed -E 's/[0-9]+-[0-9]+//' file
str-str_01
str-str_05
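
A quick way to see the difference in how + is treated, using made-up test strings (a+b and aab are not from the question):

```shell
# In a basic regular expression, + is an ordinary character
bre_literal=$(printf '%s\n' 'a+b' | sed 's/a+b/X/')
bre_nomatch=$(printf '%s\n' 'aab' | sed 's/a+b/X/')
# With -E, + means "one or more of the preceding atom"
ere_repeat=$(printf '%s\n' 'aab' | sed -E 's/a+b/X/')
printf '%s\n' "$bre_literal" "$bre_nomatch" "$ere_repeat"
# prints:
# X
# aab
# X
```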

Using *, as in [0-9]*-[0-9]*, would not work as that would match the dash in str-str (which has zero digits surrounding it).
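
As a sketch of that failure mode, compare the * variant with the "one or more" variant on the first sample line:

```shell
line='str-str001-002_01'
# * allows zero digits, so the leftmost match is the bare dash in str-str
star=$(printf '%s\n' "$line" | sed 's/[0-9]*-[0-9]*//')
# \{1,\} requires at least one digit on each side of the dash
plus=$(printf '%s\n' "$line" | sed 's/[0-9]\{1,\}-[0-9]\{1,\}//')
printf '%s\n' "$star" "$plus"
# prints:
# strstr001-002_01
# str-str_01
```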

If you feel like you really have to match the whole line and capture the bits that you want to keep, then you can do this too. The following command captures the initial non-digits and the final bit, including the underscore:

$ sed 's/\([^0-9]*\).*\(_.*\)/\1\2/' file
str-str_01
str-str_05

This, however, is IMHO a bit difficult to decipher and makes assumptions about the start and end of the string that you never mentioned in the question. The start can't, for example, contain digits before the digits that you want to remove, and the end of the string will be cut at the last underscore, not necessarily after the digits that you want to remove if there are several underscores in that part of the string.

You could always add further to this expression to ensure that only the NNN-NNN bit is not captured, but that would make the expression even harder to understand.
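
To illustrate those two assumptions, here is a sketch with two hypothetical inputs that are not from the question (one with a digit early in the string, one with an extra underscore at the end). The function names capture_version and delete_version are just labels for this example:

```shell
# The whole-line capture approach and the targeted deletion, side by side
capture_version() { sed 's/\([^0-9]*\).*\(_.*\)/\1\2/'; }
delete_version()  { sed 's/[0-9]\{1,\}-[0-9]\{1,\}//'; }

# Hypothetical inputs, assumed for illustration only
digit_prefix='v2-str001-002_01'
extra_underscore='str-str001-002_01_b'

c1=$(printf '%s\n' "$digit_prefix" | capture_version)      # v_01
d1=$(printf '%s\n' "$digit_prefix" | delete_version)       # v2-str_01
c2=$(printf '%s\n' "$extra_underscore" | capture_version)  # str-str_b
d2=$(printf '%s\n' "$extra_underscore" | delete_version)   # str-str_01_b
printf '%s\n' "$c1" "$d1" "$c2" "$d2"
```

With inputs like these, the capture version loses text that the targeted deletion keeps.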

Answered By: Kusalananda