Remove all characters from string, except specific multibyte range

How would one use sed or another tool to remove all characters except a specific multibyte range?

For instance, the range \xd8\xa0 - \xd9\x8a covers Arabic characters encoded in UTF-8. This was my naive, hopeful attempt at using sed:

$ echo "Peace be upon you. السلام عليكم. שלום עליכם. Paz sobre vosotros." | sed -n -e 's|[^\xd8\xa0-\xd9\x8a]||p'
sed: -e expression #1, char 25: Invalid range end

Had sed understood multibyte characters, that should have left us with the string:

السلام عليكم

I have considered using iconv to convert to some Arabic-specific encoding, running sed, then converting back. However, the work that I’m doing is non-trivial (for example, I might want to preserve both Arabic and Hebrew), so I’d like to keep everything in UTF-8 the entire time if possible.

If sed cannot work with multibyte characters, what should I use? I’ve searched for both sed and perl solutions with no luck.

Note that I’ve tried to add a UTF-8 tag to the question, but unix.SE changes that to a Unicode tag. There is no Unicode here, these are encoded strings stored on disk. I think that there should be a separate UTF-8 tag.

Asked By: dotancohen


You can use Unicode properties in Perl:

echo "Peace be upon you. السلام عليكم. שלום עליכם. Paz sobre vosotros." \
| perl -CIO -pe 's/\P{Arabic}//g'
  • -CIO tells Perl that input and output are UTF-8 encoded.
  • \P{...} is a negative property match; \p{...} would be a positive one.
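
If the exact range from the question is wanted rather than a whole Unicode property, it helps to work on decoded code points instead of raw bytes. The following is only a minimal sketch in Python (not one of the tools named in the question); the helper name keep_range and its extra parameter are hypothetical. It decodes the two byte sequences from the question to find the code point bounds, then keeps only characters inside that range:

```python
# Decode the two UTF-8 byte sequences from the question into code points.
lo = ord(b"\xd8\xa0".decode("utf-8"))  # U+0620
hi = ord(b"\xd9\x8a".decode("utf-8"))  # U+064A


def keep_range(text, lo, hi, extra=""):
    """Keep characters whose code point is in [lo, hi], plus any listed in `extra`.

    Hypothetical helper for illustration; `extra` lets spaces survive the filter.
    """
    return "".join(ch for ch in text if lo <= ord(ch) <= hi or ch in extra)


if __name__ == "__main__":
    sample = "Peace be upon you. السلام عليكم. שלום עליכם. Paz sobre vosotros."
    print(f"U+{lo:04X}..U+{hi:04X}")
    # Spaces are kept via `extra`, then the leftover outer ones are trimmed.
    print(keep_range(sample, lo, hi, extra=" ").strip())
```

Without extra=" " the spaces between words are removed as well, just as they are by a substitution that deletes every non-Arabic character.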
Answered By: choroba

Using Raku (formerly known as Perl_6)

Sample Input (OP’s text saved as a one-line file):

~$ cat peace.txt
Peace be upon you. السلام عليكم. שלום עליכם. Paz sobre vosotros.

Extract Arabic script:

~$ raku -ne 'put m:g/ <:Script<Arabic>>+ /;' peace.txt
السلام عليكم

Extract Hebrew script:

~$ raku -ne 'put m:g/ <:Script<Hebrew>>+ /;' peace.txt
שלום עליכם

Extract Latin script:

~$ raku -ne 'put m:g/ <:Script<Latin>>+ /;' peace.txt
Peace be upon you Paz sobre vosotros

Raku provides high-level support for Unicode, and its regex engine has grown to be a capable "grammar parser". Note that the code above will print a blank line for any input line with no match. If you only want matching elements/lines, you can simply make the output conditional on the match, using the $/ (or $<>) match variable:

~$ cat peace2.txt #(remove terminal period from each phrase)
Peace be upon you
السلام عليكم
שלום עליכם
Paz sobre vosotros

~$ raku -ne 'put m:g/ <:Script<Latin>>+ /;' peace2.txt
Peace be upon you

Paz sobre vosotros

~$ raku -ne 'put $/ if m:g/ <:Script<Latin>>+ /;' peace2.txt
Peace be upon you
Paz sobre vosotros

Finally, you can search for the negation of a Unicode property by interposing a ! character. You can also add to (or even subtract from) the desired character class(es) with + and/or -. [Note the additional regex atom :Z (the Unicode Separator category, which includes spaces) below. In the interest of completeness, the code below shows both the + and ! forms, although a - might work in practice]:

~$ raku -ne 'put $/ if m:g/ <+:!Script<Latin> + :!Z >+ /' peace2.txt
السلام عليكم
שלום עליכם
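
Combining several character classes also covers the question's wish to keep both Arabic and Hebrew. As a rough sketch in Python, assuming the Hebrew (U+0590 to U+05FF) and Arabic (U+0600 to U+06FF) Unicode blocks are an acceptable stand-in for the Script property (they are not identical: blocks are contiguous code point ranges, while scripts are assigned per character), and using a hypothetical helper name keep_hebrew_and_arabic:

```python
import re

# Assumption: block ranges approximate the scripts.
# Hebrew block: U+0590-U+05FF, Arabic block: U+0600-U+06FF.
# Whitespace (\s) is kept so that words stay separated.
NOT_WANTED = re.compile(r"[^\u0590-\u05FF\u0600-\u06FF\s]")


def keep_hebrew_and_arabic(text):
    """Delete everything outside the two blocks, then collapse leftover whitespace."""
    stripped = NOT_WANTED.sub("", text)
    return " ".join(stripped.split())


if __name__ == "__main__":
    sample = "Peace be upon you. السلام عليكم. שלום עליכם. Paz sobre vosotros."
    print(keep_hebrew_and_arabic(sample))
```

Adding another script is a matter of appending its range to the negated character class. For a true Script match in Python, a third-party regex library with Unicode property support would be needed, since the standard re module does not offer one.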

Answered By: jubilatious1