Why does [0-9]* match where there aren't any digits?

The command is:

echo "abc 123" | sed "s/[0-9]*/h/g"

and I'm getting this output:

hahbhch h

How am I getting this output?

The output I expected is abc h

which I get with this command:

echo "abc 123" | sed "s/[0-9][0-9]*/h/g"

Can someone explain this?

Asked By: sidharthanup


The * means zero or more matches, and sed takes the first match it finds while scanning from left to right. If you run that command without the g flag (so sed stops after the first replacement), the output is habc 123. At the very start of the line the next character, a, is not a digit, but [0-9]* is still satisfied by zero digits. It therefore matches the empty string at the beginning of the line, sed replaces that with h, and it stops there.

With the global (g) flag, sed keeps trying to match through the rest of the string. Because * matches the empty string wherever there is no digit to match, sed places an h in front of every non-digit character. The run 123 is matched as a whole and replaced by a single h.
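A quick way to check this is to run the substitution with and without the g flag and compare. This is only a small sketch: the variable names are made up for illustration, and the outputs in the comments are what GNU sed prints (other sed implementations may treat empty matches slightly differently).

```shell
# Without g, sed replaces only the first match: the empty string at the start of the line.
first_only=$(echo "abc 123" | sed "s/[0-9]*/h/")
echo "$first_only"     # habc 123

# With g, every match is replaced: an empty match before each non-digit, then the run 123.
all_matches=$(echo "abc 123" | sed "s/[0-9]*/h/g")
echo "$all_matches"    # hahbhch h
```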

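Note that your second attempt is equivalent to sed -E "s/[0-9]+/h/g". Here + means one or more matches, so it never matches the empty string when there is no digit to replace. The -E flag turns on extended regular expressions; without it, sed treats + as a literal plus sign (GNU sed also accepts \+ in basic mode).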
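To see how the "one or more" spellings compare, here is a short sketch that runs the same input through each of them. The variable names are hypothetical. It assumes a sed that supports -E, which both GNU and BSD sed do.

```shell
input="abc 123"

# Basic regular expressions (the default): spell out "one or more" by hand.
bre=$(echo "$input" | sed "s/[0-9][0-9]*/h/g")

# Extended regular expressions: -E makes + mean "one or more".
ere=$(echo "$input" | sed -E "s/[0-9]+/h/g")

# Without -E, a bare + is a literal plus sign, so nothing in the input matches.
literal_plus=$(echo "$input" | sed "s/[0-9]+/h/g")

echo "$bre"            # abc h
echo "$ere"            # abc h
echo "$literal_plus"   # abc 123
```

The first two forms give the same result. The third leaves the line untouched because the input contains no digit followed by a plus sign.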

Answered By: Kira

The answer has to do with the way regular expressions are handled in sed. Regular expressions (REs) can get very complex, and there is a tradeoff between the power of what you can do with them and the complexity of the syntax. Different programming languages have made different choices about how much power, and therefore complexity, they want to support. Sed is very powerful and therefore a bit more complex than you might expect.

To get to the answer: the * matches a sequence of zero or more instances of the previous token. In your case the previous token is [0-9], which means any digit. Sed is noticing that there is a zero-length string of digits at every position where no digit is found, such as before a, b, c and the space. This seems rather counterintuitive until you get used to it.

You noticed one common way of fixing the problem, which is to use /[0-9][0-9]*/, interpreted as a digit followed by zero or more digits. Another solution is to replace * with +, which matches a sequence of one or more of the previous token. In sed, + only has that meaning in extended regular expressions, which the -E option turns on; without it, + is a literal plus sign. So the full command is:

echo "abc 123" | sed -E "s/[0-9]+/h/g"
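If the zero-length matches are hard to picture, one option is to make them visible by wrapping each match in angle brackets instead of replacing it. This is just an illustrative sketch; the variable name is made up, and the output in the comment is what GNU sed prints.

```shell
# & in the replacement stands for whatever the pattern matched,
# so <> marks an empty match and <123> marks the run of digits.
marked=$(echo "abc 123" | sed "s/[0-9]*/<&>/g")
echo "$marked"    # <>a<>b<>c<> <123>
```

Each empty pair of brackets is a place where [0-9]* matched zero digits, which is exactly where the unexpected h characters came from.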

You can read about the sed command in the online manual (search for "man sed") or, if the manual pages are installed on your system, by running the command "man sed".

Answered By: David Little