awk/sed: find the indexes of the first and the last capital letter in a string
I have several kinds of strings like this:
the example string1
--AbbbAnde---
the example string2
abksjiRNNBBKUGFLYFYLF
the example string3
-ankNUGUYUBUIGCafrg--
the example string4
BNKJUGFVULNK-Kew---
PS: no string has fewer than two capital letters.
I want to find the indexes of the first and the last capital letter in strings like the examples above, using awk, sed, or other bash programs, because I have thousands of files and Python would be time-consuming.
The index of the first capital letter should be counted from the start of the string (left to right), while the index of the last capital letter should be counted from the end of the string (right to left).
For example:
For example string1, the first capital letter is A, and its index is 3 counting from left to right (start to end). The last capital letter is A, and its index is 7 counting from the end to the beginning.
For example string2, the first capital letter is R, and its index is 7 counting from left to right (start to end). The last capital letter is F, and its index is 1 counting from the end to the beginning.
For example string3, the first capital letter is N, and its index is 5 counting from left to right (start to end). The last capital letter is C, and its index is 7 counting from the end to the beginning.
For example string4, the first capital letter is B, and its index is 1 counting from left to right (start to end). The last capital letter is K, and its index is 6 counting from the end to the beginning.
Thanks for your help.
It is easy to get the length of the leading or trailing part with AWK. Add 1 to get the index as shown in the question.
echo '--AbbbAnde---
abksjiRNNBBKUGFLYFYLF
-ankNUGUYUBUIGCafrg--
BNKJUGFVULNK-Kew---
foobarbaz' | awk '{
printf("string %s\n", $0);
head=tail=$0;
sub(/[A-Z].*$/,"",head);
sub(/^.*[A-Z]/,"",tail);
printf("head <%s> %d\n", head, length(head)+1);
printf("tail <%s> %d\n", tail, length(tail)+1);
}'
output:
string --AbbbAnde---
head <--> 3
tail <nde---> 7
string abksjiRNNBBKUGFLYFYLF
head <abksji> 7
tail <> 1
string -ankNUGUYUBUIGCafrg--
head <-ank> 5
tail <afrg--> 7
string BNKJUGFVULNK-Kew---
head <> 1
tail <ew---> 6
string foobarbaz
head <foobarbaz> 10
tail <foobarbaz> 10
You might need to extend the script to handle input that does not contain an uppercase letter, as the foobarbaz line shows. (The question does not say what result you would expect in this case.)
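As a minimal sketch of such an extension, the return value of sub() can be used to detect lines without a capital letter. The function name cap_index and the "NA" placeholder are assumptions made for this example; the question does not specify either.

```shell
# cap_index is a hypothetical helper name; "NA" is an arbitrary placeholder.
cap_index() {
  LC_ALL=C awk '{
    head = tail = $0
    # sub() returns the number of substitutions made, so 0 means
    # the line contains no capital letter at all.
    if (sub(/[A-Z].*$/, "", head) == 0) { print "NA", "NA"; next }
    sub(/^.*[A-Z]/, "", tail)
    print length(head) + 1, length(tail) + 1
  }'
}

printf '%s\n' '--AbbbAnde---' 'foobarbaz' | cap_index
```

This prints 3 7 for the first line and NA NA for the second.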
awk '
{
start = match($0, /[A-Z]/)
end = match($0, /[A-Z][^A-Z]*$/)
print (start ? start : "NaN"), (end ? length() - end + 1 : "NaN")
}' infile
$ awk '
match($0,/[[:upper:]](.*[[:upper:]])?/) {
print $0, RSTART, length()-(RSTART+RLENGTH-2)
}
' file
xyzAb 4 2
--AbbbAnde--- 3 7
abksjiRNNBBKUGFLYFYLF 7 1
-ankNUGUYUBUIGCafrg-- 5 7
BNKJUGFVULNK-Kew--- 1 6
The above was run on this input:
$ cat file
xyzAb
--AbbbAnde---
abksjiRNNBBKUGFLYFYLF
-ankNUGUYUBUIGCafrg--
BNKJUGFVULNK-Kew---
{
pad = s = match($0, /[A-Z]/);
clone = substr($0, RSTART + 1)
while (match(clone, /[A-Z]/)) {
clone = substr(clone, RSTART + 1)
pad += RSTART
}
print $0, ": ", s, length - --pad
}
Save in a file (e.g. find_index.awk) with #!/bin/awk -f as the first line (shebang) and run as ./find_index.awk yourfile
If perl is fine:
perl -lne '
/[A-Z]/; $s = $+[0];
reverse =~ //;
print "$_ : $s - $+[0]"
' sample
Using Raku (formerly known as Perl_6)
Sample Input:
--AbbbAnde---
abksjiRNNBBKUGFLYFYLF
-ankNUGUYUBUIGCafrg--
BNKJUGFVULNK-Kew---
nouppercaseletters
oneUppercaseletter
Skips lines with 0 or 1 uppercase letters:
~$ raku -ne 'if m/ <:Lu> .+ <:Lu> / { say $/.from+1 ~qb[\t]~ $_.chars - $/.to+1 };' file
3 7
7 1
5 7
1 6
Skips lines with 0 uppercase letters:
~$ raku -ne 'if m/ <:Lu> .+ <:Lu> / || m/ <:Lu> / { say $/.from+1 ~qb[\t]~ $_.chars - $/.to+1 };' file
3 7
7 1
5 7
1 6
4 15
Handles all sample lines (above), inserting tabs if 0 uppercase letters:
~$ raku -ne 'if m/ <:Lu> .+ <:Lu> / || m/ <:Lu> / { say $/.from+1 ~qb[\t]~ $_.chars - $/.to+1} else {qb[\t\t].say };' file
3 7
7 1
5 7
1 6
4 15
An advantage of the code above is that Raku handles uppercase letters according to Unicode’s definition: <:Lu> is its Unicode "Uppercase-letter" class.
Match variable $/ (alternatively $<>) is used to pull out the start $/.from and end $/.to of the match. The character distance from the right-end of the line is computed as $_.chars - $/.to, where $_.chars represents the number of characters in the line. Backslash-escaped tabs can be written qb[\t] in Raku’s quoting sub-language, to increase code portability (reduces doublequotes). String concatenation in Raku is accomplished with ~ (tilde).
See links below for more information.
https://docs.raku.org/language/regexes#Unicode_properties
https://docs.raku.org/language/quoting
https://raku.org
Another perl approach:
perl -Mopen=locale -lne '
print 1+length$`, " ", 1+length$'\'' if /\p{Lu}(.*\p{Lu})?/'
- \p{Lu} matches an uppercase letter (such as ABCÀÁÂÃÄÅАБВГᏢᏣᏤᏥ ⰗⰘⰙⱠⱢⱣⱤⱧⱩ…)
- $` contains the part before the match, $' the part after the match
- with -n (sed -n mode), perl processes the input one line at a time with $_ containing the line.
- with -l, the line delimiter is trimmed from the lines on input and added back on output (like sed does).
- with -Mopen=locale, the input is decoded into text as per the locale’s encoding and encoded back on output. Without it, the input would be interpreted in the ISO8859-1 (aka latin1) charset.
In any case, note that awk, sed and perl are not bash programs, nor do they have anything to do with any shell, other than that, like any command, they can be invoked by any shell including bash, or by anything else that is not a shell for that matter.
POSIX awk with field separator as an uppercase regex.
LC_ALL=C awk -F '[A-Z]' '
NF>2{
print length("x"$1), length("x"$NF)
}' file
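To see why this works, it can help to print what the split actually produces. The sketch below (show_fields is a made-up name for illustration) shows NF, the first field and the last field for a line with two capitals and a line with only one:

```shell
# show_fields is a hypothetical helper used only for this demonstration.
show_fields() {
  LC_ALL=C awk -F '[A-Z]' '{
    # Every capital letter is a separator, so $1 is the text before the
    # first capital and $NF is the text after the last one.
    printf "NF=%d first=<%s> last=<%s>\n", NF, $1, $NF
  }'
}

printf '%s\n' '--AbbbAnde---' 'oneUppercaseletter' | show_fields
```

This prints NF=3 first=<--> last=<nde---> and then NF=2 first=<one> last=<ppercaseletter>. Prepending "x" makes each length one larger, which turns the lengths into the wanted indexes, and the NF>2 condition only holds when the line has at least two capitals.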
Perl has the index and rindex builtins, which return the zero-based position of the first and the last occurrence of a substring, respectively. Since index does not take a regex, we first transliterate every uppercase letter to A.
perl -lne '1 < tr/A-Z/A/ and
print 1+index($_,"A"), $",
length()-rindex($_,"A");
' file
GNU sed with extended regex mode (-E)
LC_ALL=C sed -E 'h;
s/[A-Z].*/./
:a
s/./a/g;tb
:b
s/^a/c/
s/([b-j])a/\u\1/
y/BCDEFGHIJ/cdefghijk/
s/ka/ab/
tb
y/bcdefghijk/0123456789/
G;P
/\n$/d
z;x
s/.*[A-Z]/./
ba
' file | paste -d" " - -
Output:
3 7
7 1
5 7
1 6
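The loop between :b and tb is the part that turns a run of characters into a decimal number. Here is a stripped-down sketch of just that counter, assuming GNU sed (\u in the replacement is a GNU extension); count_chars is a name invented for this example:

```shell
# count_chars is a hypothetical helper: prints the length of each input line.
count_chars() {
  LC_ALL=C sed -E '
    # One "a" per character (unary); this tb mainly resets the substitution flag.
    s/./a/g;tb
    :b
    # The letters b..k stand for the digits 0..9. A leading mark becomes 1.
    s/^a/c/
    # A digit 0..8 followed by a mark: uppercase it, then map it to the next digit.
    s/([b-j])a/\u\1/
    y/BCDEFGHIJ/cdefghijk/
    # A 9 followed by a mark: write 0 and carry the mark to the left.
    s/ka/ab/
    tb
    # Nothing left to count: translate the letter digits into real digits.
    y/bcdefghijk/0123456789/
  '
}

printf '%s\n' 'abc' 'twelve chars' | count_chars
```

This prints 3 and 12. In the full script the text up to (or after) the capital is first replaced by a single dot plus the remaining characters, which is what adds the 1 to the count.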