Extract string before numbers and after underscore
The original string is like this:
str-str001-002_01
str-str005-006_05
I would like to extract the string before the numbers and the part after the underscore, so the result would look like this:
str-str_01
str-str_05
I remember that sed can split a pattern into groups, like this:
sed -n 's/\(^.*\)\([0-9]-[0-9]\)\(.*$\)/\1\3/p'
but it prints:
str-str0002_01
Then I remembered that [0-9] matches only one digit, so I tried it with a + sign or a * sign, but that gives an empty result.
PS: by using
echo 'str-str001-002_01' | sed -n 's/\(^.*\)\([0-9]-[0-9]\)\(.*$\)/\2/p'
I can see that it matches 1-0.
Then I tried it with:
echo 'str-str001-002_01' | sed -n 's/\(^.*\)\([0-9]\+-[0-9]\+\)\(.*$\)/\2/p'
It skips the first two digits and only matches 1-002. How can I make it match 001-002?
This provides the required output:
sed -nE 's/^([^0-9]*).*_([^_]+)$/\1_\2/p'
Output from your example
str-str_01
str-str_05
Explanation
sed -nE 's/…/…/p' – use EREs, and don’t print lines unless they match
^ – anchor to the start of line
([^0-9]*) – match as long a run of non-digit characters as possible (possibly empty)
.*_ – match as much as possible (including nothing), followed by "_"
([^_]+) – match as long a pattern as possible (at least one character) that isn’t an underscore
$ – anchor to the end of line
\1_\2 – replace the entire line with the first (…) match, "_", and the second (…) match
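To see the effect of combining -n with the p flag, here is a small sketch that feeds the command (written with the backreferences \1 and \2) one line in the expected format and one made-up line without an underscore. The second input line is an assumption for illustration only; it is not part of the original data.

```shell
# Two sample lines: one in the expected format, one with no underscore at all.
result=$(printf '%s\n' 'str-str001-002_01' 'no underscore here' |
  sed -nE 's/^([^0-9]*).*_([^_]+)$/\1_\2/p')
# Only the line where the substitution succeeded is printed.
printf '%s\n' "$result"
```

This should print only str-str_01, since the line without an underscore never matches and -n suppresses it.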
The reason your attempts didn’t work as you expected is that * (and +) are greedy: each will consume as many characters as possible that match the preceding atom. So for an ERE of (.*)([0-9]+) applied to something like abc123, the .* will consume abc12, leaving [0-9]+ to match just 3. You’d need a "not digit" to constrain the first match: ([^0-9]*)([0-9]+) to get abc and 123.
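The difference is easy to check by wrapping each captured group in square brackets. This is only an illustrative sketch; the variable names greedy and bounded are made up here.

```shell
# Greedy .* takes everything but the last digit, so the second group gets only "3".
greedy=$(printf '%s\n' 'abc123' | sed -E 's/(.*)([0-9]+)/[\1][\2]/')
# Restricting the first group to non-digits leaves the whole digit run for the second.
bounded=$(printf '%s\n' 'abc123' | sed -E 's/([^0-9]*)([0-9]+)/[\1][\2]/')
printf '%s\n' "$greedy" "$bounded"
```

The first line of output should be [abc12][3] and the second [abc][123].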
$ cat file
str-str001-002_01
str-str005-006_05
$ sed 's/[0-9]\{3\}-[0-9]\{3\}//' file
str-str_01
str-str_05
The substitution command here is matching and removing NNN-NNN, where NNN is a run of three digits.
To match at least one digit, use \{1,\} in place of \{3\}:
$ sed 's/[0-9]\{1,\}-[0-9]\{1,\}//' file
str-str_01
str-str_05
This corresponds to using + in an extended regular expression. The regular expressions used by sed by default are "basic" regular expressions, in which + would match a literal plus character. Most sed implementations also support extended expressions with -E:
$ sed -E 's/[0-9]+-[0-9]+//' file
str-str_01
str-str_05
Using *, as in [0-9]*-[0-9]*, would not work as that would match the dash in str-str (which has zero digits surrounding it).
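A quick side-by-side sketch of the two quantifiers on one of the sample lines shows the problem. The variable names star and plus are only for illustration.

```shell
# With *, each digit run may be empty, so the first match is the bare "-" in "str-str".
star=$(printf '%s\n' 'str-str001-002_01' | sed 's/[0-9]*-[0-9]*//')
# With +, at least one digit is required on each side of the dash.
plus=$(printf '%s\n' 'str-str001-002_01' | sed -E 's/[0-9]+-[0-9]+//')
printf '%s\n' "$star" "$plus"
```

The first result should be strstr001-002_01 (the wrong dash was removed) and the second str-str_01.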
If you feel like you really have to match the whole line and capture the bits that you want to keep, then you can do this too. The following command captures the initial non-digits and the final bit, including the underscore:
$ sed 's/\([^0-9]*\).*\(_.*\)/\1\2/' file
str-str_01
str-str_05
This, however, is IMHO a bit difficult to decipher and makes assumptions about the start and end of the string that you never mentioned in the question. The start can’t, for example, contain digits before the digits that you want to remove, and the end of the string will be cut at the last underscore, not necessarily after the digits that you want to remove, if there are several underscores in that part of the string.
You could always add further to this expression to ensure that only the NNN-NNN bit is not captured, but that would make the expression even harder to understand.
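As a rough sketch of what such a stricter expression might look like, the following compares the whole-line capture with a variant that leaves only the digits-dash-digits run outside the groups. The input with a second underscore is hypothetical, and the stricter pattern still assumes that the string starts with non-digits and that the part to drop is immediately followed by an underscore.

```shell
# Hypothetical input with a second underscore after the part to keep.
input='str-str001-002_01_x'
# Whole-line capture: the tail is cut at the last underscore.
loose=$(printf '%s\n' "$input" | sed 's/\([^0-9]*\).*\(_.*\)/\1\2/')
# Stricter sketch: only the digits-dash-digits run is left outside the groups.
strict=$(printf '%s\n' "$input" | sed -E 's/^([^0-9]*)[0-9]+-[0-9]+(_.*)$/\1\2/')
printf '%s\n' "$loose" "$strict"
```

With this input, the first command should give str-str_x, while the stricter one should give str-str_01_x.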