Get consistent encoding for all files in a directory

I have a directory containing lots of csv files from various vendors with two different encodings:

  • ASCII Text / UTF-8
  • UCS2 / UTF-16 little endian

I’d like to use grep, awk, sed and other utilities on these datafiles using conventional syntax.

Re-encoding these files from UTF-16 to UTF-8 does not lose any useful data. All the csv files contain only ASCII data, so it’s beyond me why some vendors supply them as little-endian UTF-16, some of the time.
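For instance, ASCII text encoded as UTF-16LE is just the same bytes with a NUL byte after each one (a quick check; any sample string works):

```shell
# "a,b\n" encoded as UTF-16LE: each ASCII byte followed by a 0x00 byte
printf 'a,b\n' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
# bytes: 61 00 2c 00 62 00 0a 00
```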

I’ve written a short script that parses the output of the file utility, but I think it’s probably quite fragile.

There must be better ways of managing files with multiple encodings. Are there any programs or utilities that can assist with this sort of problem?

I’m using Debian Stable.

for f in ./*.csv
do
  if  [[ $(file "$f") == *"UTF-16"* ]]
  then
    iconv -f UTF-16 -t UTF-8 "$f" > "$f"-new
    mv "$f"-new "$f"
  fi
done
Asked By: jon


I’d refine your script to:

set -o noclobber   # refuse to overwrite an existing file with >
for f in ./*.csv
do
  if [ "$(file -b --mime-encoding "$f")" = utf-16le ]; then
    iconv -f UTF-16 -t UTF-8 "$f" > "$f"-new &&
      mv "$f"-new "$f"   # only replace the original if iconv succeeded
  fi
done
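As a quick sanity check of the detection and conversion steps, you can fabricate a UTF-16LE file by hand (sample.csv is just an illustrative name; the exact strings file prints can vary by version):

```shell
# Fabricate a UTF-16LE CSV: 0xFF 0xFE BOM, then "x,y\n" in UTF-16LE
printf '\377\376x\0,\0y\0\n\0' > sample.csv
file -b --mime-encoding sample.csv   # typically reports: utf-16le

# Convert, only replacing the original if iconv succeeds;
# -f UTF-16 reads the BOM to pick the byte order and drops it
iconv -f UTF-16 -t UTF-8 sample.csv > sample.csv-new &&
  mv sample.csv-new sample.csv
file -b --mime-encoding sample.csv   # typically reports: us-ascii
```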
Answered By: Stéphane Chazelas