How can I identify a strange character?

I am trying to identify a strange character I have found in a file I am working with:

$ cat file
�
$ od file
0000000 005353
0000002
$ od -c file
0000000 353  \n
0000002
$ od -x file
0000000 0aeb
0000002

The file is using ISO-8859 encoding and can’t be converted to UTF-8:

$ iconv -f ISO-8859 -t UTF-8 file
iconv: conversion from `ISO-8859' is not supported
Try `iconv --help' or `iconv --usage' for more information.
$ iconv  -t UTF-8 file
iconv: illegal input sequence at position 0
$ file file
file: ISO-8859 text

My main question is: how can I interpret the output of od here? I am trying to use this page which lets me translate between different character representations, but it tells me that 005353 as a “Hex code point” is a character that doesn’t seem right, and that 0aeb as a “Hex code point” is a character which, again, seems wrong.

So, how can I use any of the three options (353, 005353 or 0aeb) to find out what character they are supposed to represent?

And yes, I did try Unicode tools, but it doesn’t seem to be a valid UTF-8 character either:

$ uniprops $(cat file)
U+FFFD ‹�› N{REPLACEMENT CHARACTER}
    pS p{So}
    All Any Assigned Common Zyyy So S Gr_Base Grapheme_Base Graph X_POSIX_Graph
       GrBase Other_Symbol Print X_POSIX_Print Symbol Specials Unicode

If I understand the description of the Unicode U+FFFD character correctly, it isn’t a real character at all but a placeholder for a corrupted character, which makes sense since the file isn’t actually UTF-8 encoded.

Asked By: terdon


Note that od is short for octal dump, so 005353 is the two bytes interpreted as a single octal word, od -x shows the same word as 0aeb in hexadecimal, and the actual contents of your file are the two bytes eb and 0a in hexadecimal, in that order.

So neither 005353 nor 0aeb can simply be interpreted as a “hex code point”.

0a is a line feed (LF), and what eb means depends on your encoding. file is just guessing the encoding; it could be anything. Without further information about where the file came from etc., it will be difficult to find out.
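
To make the grouping concrete, here is a small illustration (assuming a little-endian machine and GNU od; the exact spacing of the output may differ):

$ printf '\353\n' | od -An -t x1    # the raw bytes, one at a time
 eb 0a
$ printf '\353\n' | od -An -t o2    # the same two bytes read as one 16-bit word
 005353
$ printf '%o\n' 0x0aeb              # 0x0aeb written in octal is indeed 5353
5353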

Answered By: dirkt

Your file contains two bytes, EB and 0A in hex. It’s likely that the file is using a character set with one byte per character, such as ISO-8859-1; in that character set, EB is ë:

$ printf "\353\n" | iconv -f ISO-8859-1
ë

Other candidates would be δ in code page 437, or Ù in code page 850.

od -x’s output is confusing in this case because of endianness; a better option is -t x1 which uses single bytes:

$ printf "\353\n" | od -t x1
0000000 eb 0a
0000002

od -x maps to od -t x2 which reads two bytes at a time, and on little-endian systems outputs the bytes in reverse order.
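
With GNU od you can see the effect of the byte order directly, since it lets you pick the endianness explicitly (the --endian option is GNU-specific; on a little-endian machine the default matches the first line):

$ printf '\353\n' | od -An -t x2
 0aeb
$ printf '\353\n' | od -An --endian=big -t x2
 eb0a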

When you come across a file like this, which isn’t valid UTF-8 (or makes no sense when interpreted as a UTF-8 file), there’s no fool-proof way to automatically determine its encoding (and character set). Context can help: if it’s a file produced on a Western PC in the last couple of decades, there’s a fair chance it’s encoded in ISO-8859-1, -15 (the Euro variant), or Windows-1252; if it’s older than that, CP-437 and CP-850 are likely candidates. Files from Eastern European systems, or Russian systems, or Asian systems, would use different character sets that I don’t know much about. Then there’s EBCDIC… iconv -l will list all the character sets that iconv knows about, and you can proceed by trial and error from there.
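
As a small illustration of that trial-and-error approach, a loop along these lines (a sketch; the charset names are the ones GNU iconv accepts, and the alignment is only for readability) shows what the byte EB becomes in a few of the candidates mentioned above:

$ for cs in ISO-8859-1 ISO-8859-15 WINDOWS-1252 CP437 CP850; do
>     printf '%-13s %s\n' "$cs" "$(printf '\353' | iconv -f "$cs" -t UTF-8)"
> done
ISO-8859-1    ë
ISO-8859-15   ë
WINDOWS-1252  ë
CP437         δ
CP850         Ù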

(At one point I knew most of CP-437 and ATASCII off by heart, them were the days.)

Answered By: Stephen Kitt

It is impossible to guess the charset of a text file with 100% accuracy.

Tools like chardet, Firefox, or file -i will try to use heuristics when there is no explicit charset information defined (e.g. if an HTML file contains a meta charset=… in its head, things are easier); these heuristics are not so bad if the text is big enough.

In the following, I demonstrate charset-detection with chardet (pip install chardet / apt-get install python-chardet if necessary).

$ echo "in Noël" | iconv -f utf8 -t latin1  | chardet
<stdin>: windows-1252 with confidence 0.73

Once we have a good charset candidate, we can use iconv, recode or similar to convert the file to our “active” charset (in my case UTF-8) and see whether the guess was correct…

iconv -f windows-1252  -t utf-8 file
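
The two steps can also be chained; a rough sketch (it assumes chardet’s one-line output format shown above, and that the guess is good enough to hand straight to iconv):

$ enc=$(chardet < file | awk '{print $2}')
$ iconv -f "$enc" -t UTF-8 file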

Some charsets (like iso-8859-3 and iso-8859-1) have many characters in common; sometimes it is not easy to tell whether we have found the perfect charset…

So it is very important to have metadata associated with the relevant text (e.g. XML).

Answered By: JJoao

#!/bin/bash
#
# Search a file for a known (part of a) string (e.g. Begrüßung)
# by trying every encoding that iconv knows about.
#
[[ $# -ne 2 ]] && echo "Usage: encoding-finder.sh FILE fUnKy_CHAR_FOR_SURE_IN_FILE" && exit 1
FILE=$1
PATTERN=$2
for enc in $(iconv -l | sed 's/..$//')
do
    # Convert with this candidate encoding; if the pattern shows up, report the encoding.
    iconv -f "$enc" -t UTF-8 "$FILE" 2>/dev/null | grep -m 1 "$PATTERN" && echo "$enc"
done

If I get a file which contains, for instance, a mangled version of the word Begrüßung, I can infer that Begrüßung might be meant. So I convert it with all known encodings and check whether one of them converts it properly.

Usually, there are multiple encodings which seem to fit.

For longer files, you might cut a snippet instead of converting hundreds of pages.
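For example (a sketch with a hypothetical file name; the byte count is arbitrary):

$ head -c 4096 bigfile > snippet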

So I would call it

encoding-finder.sh FILE Begrüßung

and the script tests, by converting the file with each of the known encodings, which of them produce “Begrüßung”.

To find such characters, less usually helps, since funky characters tend to stand out. From the context, the right word to search for can usually be inferred. But we don’t want to check with a hex editor which byte this is, and then visit endless tables of encodings, to find our offender. 🙂

Answered By: user unknown