Any non-whitespace regular expression

I'm trying to match a string against a regular expression inside an if statement in bash. Code below:

var='big'
if [[ $var =~ ^b\S+[a-z]$ ]]; then
  echo "$var"
else
  echo 'none'
fi

The match should be a string that starts with 'b', followed by one or more non-whitespace characters, and ending in a letter a-z. I can match the start and end of the string, but \S is not matching the non-whitespace characters. Thanks in advance for the help.

Asked By: Fxbaez

[[ $var =~ ^b[^[:space:]]+[abcdefghijklmnopqrstuvwxyz]$ ]]

What [a-z] matches depends on the locale and generally is not (only) one of abcdefghijklmnopqrstuvwxyz.

Perl's \S (any character other than horizontal or vertical whitespace), now also recognised by some other regexp engines, is written [^[:space:]] in POSIX and bash EREs.
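
As a rough illustration of the pattern above, here is a minimal sketch that wraps the test in a function. The name is_b_word and the sample inputs are made up for this example; the regex is kept in a variable so that bash hands it to the regexp library unchanged.

```bash
# Hypothetical helper: succeeds if $1 is 'b', then one or more
# non-whitespace characters, then one of the 26 ASCII lowercase letters.
is_b_word() {
  local re='^b[^[:space:]]+[abcdefghijklmnopqrstuvwxyz]$'
  [[ $1 =~ $re ]]
}

# Sample inputs (assumptions for illustration only).
for s in big 'b g' bi big1; do
  if is_b_word "$s"; then echo "$s: match"; else echo "$s: none"; fi
done
```

Only big matches here: 'b g' contains a space, bi is too short to hold a middle character, and big1 does not end in a letter.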

bash uses the system's regexp library to match those regular expressions, but even on systems (like recent GNU ones) where the regexps have a \S operator, that won't work because in:

[[ x =~ \S ]]

bash calls regcomp("S") and with:

[[ x =~ '\S' ]]

bash calls regcomp("\\S") (two backslashes).

However, with bash-3.1 or if you turn bash-3.1 compatibility on with shopt -s compat31, then:

[[ x =~ '\S' ]]

will work (will match a non-whitespace character) on systems where EREs support \S.

$ bash -c "[[ x =~ '\S' ]]" || echo no
no
$ bash -O compat31 -c "[[ x =~ '\S' ]]" && echo yes
yes

Another option would be to put the regexp in a variable:

$ a='\S' bash -c '[[ x =~ $a ]]' && echo yes
yes

Again, that only works on systems that support that perl-like \S in their regexps.

The POSIX equivalent to that bash-specific code would be:

if expr " $var" : \
        ' b[^[:space:]]\{1,\}[abcdefghijklmnopqrstuvwxyz]$' \
   > /dev/null; then
  printf '%s\n' "$var"
else
  echo none
fi

Or:

case $var in
  ([!b]* | *[!abcdefghijklmnopqrstuvwxyz] | *[[:space:]]* | "" | ? | ??)
    echo none;;
  (*) printf '%s\n' "$var"
esac
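
To try the case-based version on several inputs, it can be wrapped in a function. This is only a sketch: the name check_word and the sample inputs are assumptions, while the patterns are the ones from the case statement above.

```bash
# Hypothetical wrapper around the case statement above.
# Prints the argument if it is acceptable, otherwise prints "none".
check_word() {
  case $1 in
    # Reject: not starting with b, not ending in a lowercase ASCII letter,
    # containing whitespace, or shorter than three characters.
    ([!b]* | *[!abcdefghijklmnopqrstuvwxyz] | *[[:space:]]* | "" | ? | ??)
      echo none;;
    (*) printf '%s\n' "$1";;
  esac
}

check_word big      # prints: big
check_word 'b g'    # prints: none
check_word bi       # prints: none
check_word xig      # prints: none
```

Because it relies only on case patterns, this sketch does not depend on the regexp library or on bash-specific syntax.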
Answered By: Stéphane Chazelas

On non-GNU systems, the following explains why \S fails:

\S is part of PCRE (Perl Compatible Regular Expressions). It is not part of the BRE (Basic Regular Expressions) or the ERE (Extended Regular Expressions) used in shells.

The bash operator =~ inside the double-bracket test [[ uses ERE.

The only characters with special meaning in ERE (as opposed to any normal character) are .[\()*+?{|^$. \S is not among them, so you need to construct the regex from more basic elements:

regex='^b[^[:space:]]+[a-z]$'

Where the bracket expression [^[:space:]] is the equivalent of the PCRE expression \S:

The default \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32).

The test would be:

var='big'            regex='^b[^[:space:]]+[a-z]$'

[[ $var =~ $regex ]] && echo "$var" || echo 'none'

However, the code above will also match bißß, for example, because the range [a-z] can include characters other than abcdefghijklmnopqrstuvwxyz when the selected locale is a Unicode one.
To avoid this issue, use:

var='bißß'            regex='^b[^[:space:]]+[a-z]$'

( LC_ALL=C;
  [[ $var =~ $regex ]] && echo "$var" || echo 'none'
)

Please be aware that this restricts only the last character to the list abcdefghijklmnopqrstuvwxyz; many other characters can still match in the middle, e.g. bég.

Also, this use of LC_ALL=C affects the other bracket expression: [[:space:]] will match only the whitespace characters of the C locale.

To solve all the issues, we need to keep each regex separate:

reg1=[[:space:]]   reg2='^b.*[a-z]$'           out=none

if                 [[ $var =~ $reg1 ]]  ; then out=none
elif   ( LC_ALL=C; [[ $var =~ $reg2 ]] ); then out="$var"
fi
printf '%6.8s\t|' "$out"

Which reads as:

  • If the input (var) has no spaces (in the present locale) then
  • check that it starts with a b and ends in a-z (in the C locale).

Note that both tests use positive ranges (as opposed to a negated range). The reason is that negating a handful of characters leaves a huge number of possible matches. Unicode 8.0 has 120,737 characters already assigned. If a range negates 17 characters, it accepts the other 120,720 characters, which include many non-printable control characters.

It would be a good idea to limit the characters allowed in the middle (yes, they will not be spaces, but they may be anything else).
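
One possible way to act on that suggestion is to replace the open-ended middle with an explicit allow-list. The sketch below is only an illustration: the name check_strict and the allowed set (ASCII letters, digits, underscore, dot and hyphen) are assumptions, not something the answer prescribes.

```bash
# Hypothetical helper: prints $1 if it is 'b', then one or more characters
# from an explicit allow-list, then a lowercase ASCII letter; else "none".
# The allow-list [[:alnum:]_.-] is an assumption; adjust it to your data.
check_strict() {
  local re='^b[[:alnum:]_.-]+[a-z]$'
  # Match in the C locale so that [[:alnum:]] and [a-z] cover ASCII only.
  if ( LC_ALL=C; [[ $1 =~ $re ]] ); then
    printf '%s\n' "$1"
  else
    echo none
  fi
}

check_strict big      # prints: big
check_strict b-1.x    # prints: b-1.x
check_strict 'b g'    # prints: none
check_strict bég      # prints: none
```

Since every allowed character is listed positively, whitespace is excluded without a separate test.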

Answered By: user79743

Summary

# match any non-whitespace char--works in bash and `grep` too
[^\r\n\t\f\v ]

Details

Matching \S (any non-whitespace character) apparently doesn't work in regular expressions in bash or grep or similar. So, instead of using this to match one or more occurrences of any non-whitespace character:

# INSTEAD OF THESE (which do NOT work in bash or `grep`)

# match one or more non-whitespace chars
\S+
# or (same thing)
[\S]+

…use this:

How to match all non-whitespace characters in bash and grep

# match one or more non-whitespace chars (DOES work in bash and `grep`!)
[^\r\n\t\f\v ]+

I learned this from https://regex101.com/. Click here: https://regex101.com/r/kM041K/1, and on the right-hand side of the screen, under the "EXPLANATION" section, you will see:

\S matches any non-whitespace character (equivalent to [^\r\n\t\f\v ])

So, if in doubt about any regular expression, go to that website and see what it says.
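
It may be worth checking the result on your own system as well. regex101 describes PCRE-style engines, whereas POSIX specifies that a backslash inside a bracket expression is an ordinary character. With bash's =~ or grep -E, [^\r\n\t\f\v ] may therefore exclude the literal characters \, r, n, t, f, v and space rather than the control characters. The sketch below (the helper name probe is made up) compares that pattern with the POSIX class [^[:space:]]:

```bash
# probe REGEX STRING: report whether STRING matches REGEX under bash's =~.
probe() {
  if [[ $2 =~ $1 ]]; then echo match; else echo 'no match'; fi
}

re_posix='[^[:space:]]'       # POSIX character class, negated
re_escapes='[^\r\n\t\f\v ]'   # PCRE-style escapes inside a bracket expression

probe "$re_posix" r           # prints: match
probe "$re_posix" $'\t'       # prints: no match
probe "$re_escapes" r         # compare this result with the first line
probe "$re_escapes" $'\t'     # compare this result with the second line
```

If the last two lines disagree with the first two, the regexp library is treating the backslashes literally, and [^[:space:]] is the safer spelling.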

Answered By: Gabriel Staples