Split string by delimiter and get N-th element
I have a string:
one_two_three_four_five
I need to save the value two in a variable A and the value four in a variable B, both taken from the above string.
I am using ksh.
Use cut with _ as the field delimiter and select the desired fields:
A="$(cut -d'_' -f2 <<<'one_two_three_four_five')"
B="$(cut -d'_' -f4 <<<'one_two_three_four_five')"
You can also use echo and a pipe instead of a here-string:
A="$(echo 'one_two_three_four_five' | cut -d'_' -f2)"
B="$(echo 'one_two_three_four_five' | cut -d'_' -f4)"
Example:
$ s='one_two_three_four_five'
$ A="$(cut -d'_' -f2 <<<"$s")"
$ echo "$A"
two
$ B="$(cut -d'_' -f4 <<<"$s")"
$ echo "$B"
four
Beware that if $s contains newline characters, this returns a multiline string containing the 2nd/4th field of each line of $s, not the 2nd/4th field of $s as a whole.
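To make that caveat concrete, here is a small sketch. The two-line value of s is made up for illustration; cut handles each input line separately, so two lines in means two fields out.

```shell
# Hypothetical two-line value, for illustration only
s='one_two_three
four_five_six'

# cut processes each input line on its own
per_line=$(printf '%s\n' "$s" | cut -d_ -f2)
printf '%s\n' "$per_line"
```

This prints two and five on separate lines, rather than a single second field.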
Using only POSIX sh constructs, you can use parameter substitution constructs to parse one delimiter at a time. Note that this code assumes that there are at least the requisite number of fields; otherwise the last field is repeated.
string='one_two_three_four_five'
remainder="$string"
first="${remainder%%_*}"; remainder="${remainder#*_}"
second="${remainder%%_*}"; remainder="${remainder#*_}"
third="${remainder%%_*}"; remainder="${remainder#*_}"
fourth="${remainder%%_*}"; remainder="${remainder#*_}"
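The same idea can be wrapped in a loop so the field number is a parameter. This is only a sketch: nth_field is a name invented here, it assumes the field number is a positive integer, and it fails (instead of repeating the last field) when the string has too few fields. POSIX sh has no local variables, so rest, delim and n are globals unless the function runs in a command substitution, as below.

```shell
# nth_field STRING DELIM N: print the N-th (1-based) DELIM-separated field.
# Hypothetical helper; returns 1 if STRING has fewer than N fields.
nth_field() {
    rest=$1 delim=$2 n=$3
    while [ "$n" -gt 1 ]; do
        case $rest in
            *"$delim"*) rest=${rest#*"$delim"} ;;  # drop the leading field
            *) return 1 ;;                         # ran out of fields
        esac
        n=$((n - 1))
    done
    printf '%s\n' "${rest%%"$delim"*}"             # keep only the first field left
}

A=$(nth_field 'one_two_three_four_five' _ 2)
B=$(nth_field 'one_two_three_four_five' _ 4)
printf '%s\n' "$A $B"
```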
Alternatively, you can use an unquoted parameter substitution with wildcard expansion disabled and IFS set to the delimiter character (this only works if the delimiter is a single non-whitespace character or if any whitespace sequence is a delimiter).
string='one_two_three_four_five'
set -f; IFS='_'
set -- $string
second=$2; fourth=$4
set +f; unset IFS
This clobbers the positional parameters. If you do this in a function, only the function’s positional parameters are affected.
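A minimal sketch of the function variant, assuming you are happy for the results to land in the global variables second and fourth. The name pick_second_fourth and the bookkeeping variables are made up for this example. It also restores whatever IFS and noglob state the caller had, rather than unconditionally running set +f and unset IFS.

```shell
# pick_second_fourth is a hypothetical helper name.
# It sets the global variables "second" and "fourth" from its first argument.
pick_second_fourth() {
    # Remember the caller's IFS (set or unset) and noglob state
    if [ "${IFS+set}" = set ]; then saved_ifs=$IFS; ifs_was_set=1; else ifs_was_set=; fi
    case $- in *f*) had_noglob=1 ;; *) had_noglob= ;; esac

    set -f; IFS=_
    set -- $1            # replaces only this function's positional parameters
    second=$2 fourth=$4

    # Restore the previous state
    [ -n "$had_noglob" ] || set +f
    if [ -n "$ifs_was_set" ]; then IFS=$saved_ifs; else unset IFS; fi
}

set -- keep these args
pick_second_fourth 'one_two_three_four_five'
caller_argc=$#           # still 3: the caller's parameters were not touched
printf '%s\n' "$second $fourth ($caller_argc caller arguments left untouched)"
```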
Yet another approach for strings that don’t contain newline characters is to use the read builtin.
IFS=_ read -r first second third fourth trail <<'EOF'
one_two_three_four_five
EOF
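The extra trail variable is there because read puts everything that is left over, delimiters included, into the last variable. A quick sketch with a made-up six-field input shows the effect:

```shell
# Hypothetical input with one more field than there are named variables
IFS=_ read -r first second third fourth trail <<'EOF'
one_two_three_four_five_six
EOF
printf '%s\n' "fourth=$fourth" "trail=$trail"
```

Here fourth is four and trail is five_six. Without trail, the leftover text would end up appended to fourth.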
Wanted to see an awk answer, so here’s one:
A=$(awk -F_ '{print $2}' <<< 'one_two_three_four_five')
B=$(awk -F_ '{print $4}' <<< 'one_two_three_four_five')
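If the field number itself lives in a shell variable, it can be handed to awk with -v. A small sketch, using printf and a pipe so it also runs in shells without <<<; the variable names s, n and field are arbitrary:

```shell
s='one_two_three_four_five'
n=4   # 1-based field number; hypothetical variable name

# -v n=... defines an awk variable, and $n then selects that field
field=$(printf '%s\n' "$s" | awk -F_ -v n="$n" '{ print $n }')
printf '%s\n' "$field"
```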
Is a python solution allowed?
# python3 -c "import sys; print(sys.argv[1].split('_')[1])" one_two_three_four_five
two
# python3 -c "import sys; print(sys.argv[1].split('_')[3])" one_two_three_four_five
four
Here string
The simplest way (for shells with <<<) is:
IFS='_' read -r a second a fourth a <<<"$string"
This uses a temporary variable $a instead of $_ because one shell complains.
In a full script:
string='one_two_three_four_five'
IFS='_' read -r a second a fourth a <<<"$string"
echo "$second $fourth"
- No IFS changing
- No issues with set -f (pathname expansion)
- No changes to the positional parameters ($@).
Heredoc
For a solution portable to all shells (yes, all POSIX included) without changing IFS or set -f, use the (a bit more complex) heredoc equivalent:
string='one_two_three_four_five'
IFS='_' read -r a second a fourth a <<_EOF_
$string
_EOF_
echo "$second $fourth"
Understand that these solutions (both the here-doc and the use of <<<) will remove all trailing newlines.
They are designed for "one-liner" variable content.
Solutions for multi-line content are possible but need more complex constructs.
Bash 4.4+
A very simple solution is possible in bash version 4.4 and later:
readarray -d _ -t arr <<<"$string"
echo "array ${arr[1]} ${arr[3]}" # array numbers are zero based.
Beware that a newline character is added at the end of the last element, and an empty $string is split into one element containing a newline character.
readarray -t -d _ arr < <(printf %s "$string")
This would create an empty array for an empty $string, but beware that a trailing empty field, as in string=foo_, would not result in an empty trailing element.
readarray -t -d _ arr < <(printf %s_ "$string")
This would preserve all elements and split an empty string into one empty element.
readarray -t -d _ arr < <(printf %s "${string+${string}_}")
This would split an empty string into one empty element, but would give an empty list if $string was unset.
There is no equivalent for POSIX shells, as many POSIX shells do not have arrays.
Arrays
For shells that have arrays, it may be as simple as (tested working in attsh, lksh, mksh, ksh, and bash, but not zsh):
set -f; IFS=_; arr=($string)
But with a lot of additional plumbing to keep and reset variables and options:
string='one_* *_three_four_five'
case $- in
*f*) noglobset=true; ;;
*) noglobset=false;;
esac
oldIFS="$IFS"
set -f; IFS=_; arr=($string)
if $noglobset; then set -f; else set +f; fi
IFS=$oldIFS
echo "two=${arr[1]} four=${arr[3]}"
In zsh, arrays start at 1, and no split+glob is performed by default upon parameter expansions. So some changes are needed to get this working in zsh:
IFS=_; arr=( $=string )
echo "two=${arr[2]} four=${arr[4]}"
Where $=string requests word splitting explicitly (globbing is still not done, so it doesn’t need to be disabled globally). Also note that while foo_ would be split into foo only in ksh/bash/yash, it’s split into foo and the empty string in zsh.
With zsh you could split the string (on _) into an array:
non_empty_elements=(${(s:_:)string})
all_elements=("${(@s:_:)string}")
and then access each/any element via array index:
print -r -- ${all_elements[4]}
Keep in mind that in zsh (like most other shells, but unlike ksh/bash) array indices start at 1.
Or directly in one expansion:
print -r -- "${${(@s:_:)string}[4]}"
Or using an anonymous function for the elements to be available in its $1, $2…:
(){print -r -- $4} "${(@s:_:)string}"
Another awk example; simpler to understand.
A=$(echo one_two_three_four_five | awk -F_ '{print $1}')
B=$(echo one_two_three_four_five | awk -F_ '{print $2}')
C=$(echo one_two_three_four_five | awk -F_ '{print $3}')
... and so on...
Can be used with variables also.
Suppose:
this_str="one_two_three_four_five"
Then the following works:
A=$(printf '%s\n' "${this_str}" | awk -F_ '{print $1}')
B=$(printf '%s\n' "${this_str}" | awk -F_ '{print $2}')
C=$(printf '%s\n' "${this_str}" | awk -F_ '{print $3}')
... and so on...
That assumes ${this_str} doesn’t contain newline characters, or it would return the first field of each line of the contents of the variable instead of the first field of the contents of the variable as a whole.
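If the value may contain newlines, one option is to let awk split the whole string itself instead of feeding it line by line. The following is a sketch only, assuming a POSIX awk that provides ENVIRON; STR is an arbitrary environment variable name picked for this example, and the two-line value is made up. The string is passed through the environment rather than -v because -v interprets backslash escapes.

```shell
# Hypothetical value containing a newline
this_str='one_two
three_four'

# Split the whole value on "_" inside awk; f[2] is the second field overall
second=$(STR=$this_str awk 'BEGIN { split(ENVIRON["STR"], f, "_"); print f[2] }')
printf '%s\n' "$second"
```

The second field here is the text between the first and second underscore, which spans the line break (two, a newline, then three).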
With due respect to everyone who has posted excellent answers, I wonder if we are over-engineering this problem. Here are three simple lines that just answer the question asked, without generalizing:
str="one_two_three_four_five"
<– create a string
A=$(echo "$str" | awk -F_ '{print $2}')
<– tell awk to use _ as the delimiter and assign the second field to A
B=$(echo "$str" | awk -F_ '{print $4}')
<– tell awk to use _ as the delimiter and assign the fourth field to B
You can then use the variables as usual. Here is an example:
$ echo "The value of A is: $A; the value of B is: $B"
The value of A is: two; the value of B is: four
$
Using Raku (formerly known as Perl_6)
A=$(raku -e 'print $*IN.split("_")[1];' <<< 'one_two_three_four_five')
B=$(raku -e 'print $*IN.split("_")[3];' <<< 'one_two_three_four_five')
This answer complements the awk answer by @Paul_Evans. You can place print at the right end of the method chain, if you find that more readable. Also, if you have an issue with quoting, then the .split("_") call can be replaced by .split(q[_]).
Putting these two options together:
A=$(raku -e '$*IN.split(q[_])[1].print;' <<< 'one_two_three_four_five')
B=$(raku -e '$*IN.split(q[_])[3].print;' <<< 'one_two_three_four_five')
Finally, a word about indexing. After splitting, you can take the first element with head, or the first 2 elements with head(2). If you want to take elements from the right end, use tail in a similar manner. The way to numerically index from the right end in Raku is to use the * "whatever-star" idiom. So the last (zero-indexed) element is [*-1], the second-to-last is [*-2], etc.
~$ raku -e 'print $*IN.split(q[_])[*-4];' <<< 'one_two_three_four_five'
two
~$ raku -e 'print $*IN.split(q[_])[*-2];' <<< 'one_two_three_four_five'
four