How can I remove the BOM from a UTF-8 file?

I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any linux command-line tools to remove the BOM from the file?

$ file test.xml
test.xml:  XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
Asked By: m13r

||

It is possible to remove the BOM from a file with the tail command:

tail -c +4 withBOM.txt > withoutBOM.txt

Be aware that this unconditionally removes the first 3 bytes of the file (-c +N makes the output start at byte number N, so the first N-1 bytes are dropped). Make sure the file really starts with a BOM before running tail.
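One way to make that check explicit is to look at the first three bytes before calling tail. The following is only a sketch: the function name strip_bom_if_present is made up for illustration, and it assumes a head that supports -c (GNU and BSD both do).

```shell
# Hypothetical helper (not from the answer above): copy SRC to DST,
# dropping the first three bytes only when they are the UTF-8 BOM EF BB BF.
strip_bom_if_present() {
    src=$1
    dst=$2
    # Dump the first three bytes as hex, e.g. "efbbbf".
    first3=$(head -c 3 "$src" | od -An -tx1 | tr -d ' \n')
    if [ "$first3" = "efbbbf" ]; then
        tail -c +4 "$src" > "$dst"
    else
        cp "$src" "$dst"
    fi
}
```

Usage would be strip_bom_if_present withBOM.txt withoutBOM.txt; a file without a BOM is copied unchanged.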

Answered By: m13r

A BOM doesn’t make sense in UTF-8. Those are generally added by mistake by bogus software on Microsoft OSes.

dos2unix will remove it and also take care of other idiosyncrasies of Windows text files.

dos2unix test.xml
Answered By: Stéphane Chazelas

If you’re not sure if the file contains a UTF-8 BOM, then this (assuming the GNU implementation of sed) will remove the BOM if it exists, or make no changes if it doesn’t.

sed '1s/^\xEF\xBB\xBF//' < orig.txt > new.txt

You can also overwrite the existing file with the -i option:

sed -i '1s/^\xEF\xBB\xBF//' orig.txt

If you are using the BSD version of sed (e.g. on macOS), it does not understand \x escapes, so you need to have bash expand them with $'...' quoting:

sed $'1s/\xef\xbb\xbf//' < orig.txt > new.txt
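If neither GNU sed escapes nor bash quoting can be relied on, the three BOM bytes can be built with printf and octal escapes, which POSIX printf supports. This is a sketch under that assumption; the sample file and its contents are invented for the demonstration.

```shell
# Build the three BOM bytes (EF BB BF) using octal escapes.
bom=$(printf '\357\273\277')

# Made-up sample input: a BOM followed by one line of XML.
sample=$(mktemp)
printf '\357\273\277<?xml version="1.0"?>\n' > "$sample"

# LC_ALL=C makes sed treat the input as plain bytes.
# The BOM is removed only if it opens the first line.
LC_ALL=C sed "1s/^${bom}//" "$sample" > "$sample.new"
```

The double quotes around the sed script are what let the shell substitute the raw bytes into the pattern.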
Answered By: CSM

You can use

LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- filename

to remove the byte order mark from the beginning of the file, if it has any, as well as convert any CR LF newlines to LF only. The LANG=C LC_ALL=C tells the shell you want the command to run in the default C locale (also known as the default POSIX locale), where the three bytes forming the Byte Order Mark are treated as bytes. The -i option to sed means in-place. If you use -i.old, then sed saves the original file as filename.old, and the new file (with the modifications, if any) as filename.
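To see what the command does, it can be tried on a throwaway file. The sketch below assumes GNU sed (for the \x escapes and -i with a suffix); the sample content is made up.

```shell
# Made-up sample: a UTF-8 BOM followed by two CRLF-terminated lines.
sample=$(mktemp)
printf '\357\273\277first\r\nsecond\r\n' > "$sample"

# Same substitutions as above, keeping the original as "$sample.old".
LANG=C LC_ALL=C sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i.old -- "$sample"

# Compare the raw bytes of the backup and the cleaned file.
od -c "$sample.old"
od -c "$sample"
```

The backup still shows the 357 273 277 bytes and the \r characters, while the cleaned file contains only the two LF-terminated lines.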


I personally like to have this as ~/bin/ms-fix; for example, as

#!/bin/dash
export LANG=C LC_ALL=C
if [ $# -gt 0 ]; then
    for FILE in "$@" ; do
        sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//' -i -- "$FILE" || exit 1
    done
else
    exec sed -e 's/\r$// ; 1 s/^\xef\xbb\xbf//'
fi

so that if I need to apply this to say all C source files and headers (my old code from the MS-DOS era, for example!), I just run

find . -name '*.[CHch]' -print0 | xargs -r0 ~/bin/ms-fix

or, if I just want to look at such a file, without modifying it, I can run

~/bin/ms-fix < filename | less

and not see the ugly <U+FEFF> in my UTF-8 terminal.

Answered By: Nominal Animal

Using VIM

  1. Open file in VIM:

     vi text.xml
    
  2. Remove BOM encoding:

     :set nobomb
    
  3. Save the file and quit:

     :x
    

For a non-interactive solution, try the following command line:

vi -c ":set nobomb" -c ":wq" text.xml

That should remove the BOM, save the file and quit, all from the command line.

Answered By: Joshua Pinter

Recently I found this tiny command-line tool which adds or removes the BOM on arbitrary UTF-8 encoded files: UTF BOM Utils (new link at GitHub)

One small drawback: only the plain C++ source code is available for download. Binaries are not provided on that page, so you have to create the makefile (with CMake, for example) and compile it yourself.

Answered By: Wernfried Domscheit

I use a vim one-liner on the regular for this:

vim --clean -c 'se nobomb|wq' filename

vim --clean -c 'bufdo se nobomb|wqa' filename1 filename2 ...
Answered By: Robyn Murdock

I have a slightly different problem, and am putting this here for someone who, like me, ends up here with data full of ZERO WIDTH NO-BREAK SPACE characters (which are known as Byte Order Mark when they are the first character of the file).

I got this data by copying out of a Grafana query metrics field, and it had multiple (17) \xef\xbb\xbf sequences (which show up in vim as rate<feff>(<feff>node<feff>{<feff>job<feff>) in a single line with only 81 actual characters.

I modified Nominal Animal’s code just slightly:

LANG=C LC_ALL=C sed -e 's/\xef\xbb\xbf//g'

And the :set nobomb thing in vim only removes the very first one in the file.

I also tried this:

LANG=C vim b

Then vim doesn’t show them, but they are still there (even after a write…)
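Since vim can hide these characters, it may help to count them before stripping them with the global sed substitution from this answer. This is a sketch assuming GNU grep and GNU sed; the sample line is made up to resemble the Grafana data.

```shell
# The three bytes of U+FEFF in UTF-8, built with octal escapes.
bom=$(printf '\357\273\277')

# Made-up sample: one line with five embedded sequences.
sample=$(mktemp)
printf 'rate%s(%snode%s{%sjob%s)\n' "$bom" "$bom" "$bom" "$bom" "$bom" > "$sample"

# Count the occurrences: -o prints each match on its own line,
# -a forces grep to treat the file as text.
count=$(LC_ALL=C grep -a -o "$bom" "$sample" | wc -l | tr -d ' ')
echo "sequences found: $count"

# Remove all of them, not only the one at the start of the file.
LANG=C LC_ALL=C sed -e 's/\xef\xbb\xbf//g' "$sample" > "$sample.clean"
```

Running the same grep against "$sample.clean" afterwards should report no matches.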

Answered By: Wayne Walker

I know it’s been a while, but since I had a slightly different issue, I’m posting so others may benefit.

My text file was randomly haunted by feff characters. Luckily for me, they only appeared at the start of lines, and the set of allowed characters in my file is limited to alphanumerics.

The command below, run in vim, removes the first character of every line if it is not alphanumeric. Use it with caution, as your set of allowed characters might differ.

:%s/^[^a-zA-Z0-9]//g
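When other non-alphanumeric characters may legitimately start a line, a narrower variant is to match only the BOM bytes at the start of each line. This is a sketch outside vim, assuming GNU sed; the sample file is made up.

```shell
# Made-up sample: two lines starting with EF BB BF, and one line
# starting with '#' that must be left alone.
sample=$(mktemp)
printf '\357\273\277abc\n\357\273\277def\n#keep\n' > "$sample"

# No line address, so the substitution is tried on every line,
# but only the exact three-byte sequence is removed.
LC_ALL=C sed -e 's/^\xef\xbb\xbf//' "$sample" > "$sample.clean"
```

Unlike the vim pattern, this leaves the leading '#' on the third line in place.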
Answered By: Smirk

I had the same question and ended up writing a dedicated utility bom(1) for this. It’s available here.

Here’s the man page:

NAME
     bom -- Decode Unicode byte order mark

SYNOPSIS
     bom --strip [--expect types] [--lenient] [--prefer32] [--utf8] [file]
     bom --detect [--expect types] [--prefer32] [file]
     bom --print type
     bom --list
     bom --help
     bom --version

DESCRIPTION
     bom decodes, verifies, reports, and/or strips the byte order mark (BOM) at the
     start of the specified file, if any.

     When no file is specified, or when file is -, read standard input.

OPTIONS
     -d, --detect
             Report the detected BOM type to standard output and then exit.

             See SUPPORTED BOM TYPES for possible values.

     -e, --expect types
             Expect to find one of the specified BOM types, otherwise exit with an
             error.

             Multiple types may be specified, separated by commas.

             Specifying NONE is acceptable and matches when the file has no (sup-
             ported) BOM.

     -h, --help
             Output command line usage help.

     -l, --lenient
             Silently ignore any illegal byte sequences encountered when converting
             the remainder of the file to UTF-8.

             Without this flag, bom will exit immediately with an error if an ille-
             gal byte sequence is encountered.

             This flag has no effect unless the --utf8 flag is given.

     --list  List the supported BOM types and exit.

     -p, --print type
             Output the byte sequence corresponding to the type byte order mark.

     --prefer32
             Used to disambiguate the byte sequence FF FE 00 00, which can be
             either a UTF-32LE BOM or a UTF-16LE BOM followed by a NUL character.

             Without this flag, UTF-16LE is assumed; with this flag, UTF-32LE is
             assumed.

     -s, --strip
             Strip the BOM, if any, from the beginning of the file and output the
             remainder of the file.

     -u, --utf8
             Convert the remainder of the file to UTF-8, assuming the character
             encoding implied by the detected BOM.

             For files with no (supported) BOM, this flag has no effect and the
             remainder of the file is copied unmodified.

             For files with a UTF-8 BOM, the identity transformation is still
             applied, so (for example) illegal byte sequences will be detected.

     -v, --version
             Output program version and exit.

SUPPORTED BOM TYPES
     The supported BOM types are:

     NONE    No supported BOM was detected.

     UTF-7   A UTF-7 BOM was detected.

     UTF-8   A UTF-8 BOM was detected.

     UTF-16BE
             A UTF-16 (Big Endian) BOM was detected.

     UTF-16LE
             A UTF-16 (Little Endian) BOM was detected.

     UTF-32BE
             A UTF-32 (Big Endian) BOM was detected.

     UTF-32LE
             A UTF-32 (Little Endian) BOM was detected.

     GB18030
             A GB18030 (Chinese National Standard) BOM was detected.

EXAMPLES
     To tell what kind of byte order mark a file has:

           $ bom --detect

     To normalize files with byte order marks into UTF-8, and pass other files
     through unchanged:

           $ bom --strip --utf8

     Same as previous example, but discard illegal byte sequences instead of gener-
     ating an error:

           $ bom --strip --utf8 --lenient

     To verify a properly encoded UTF-8 or UTF-16 file with a byte-order-mark and
     output it as UTF-8:

           $ bom --strip --utf8 --expect UTF-8,UTF-16LE,UTF-16BE

     To just remove any byte order mark and get on with your life:

           $ bom --strip file

RETURN VALUES
     bom exits with one of the following values:

     0       Success.

     1       A general error occurred.

     2       The --expect flag was given but the detected BOM did not match.

     3       An illegal byte sequence was detected (and --lenient was not speci-
             fied).

SEE ALSO
     iconv(1)

     bom: Decode Unicode byte order mark, https://github.com/archiecobbs/bom.
Answered By: Archie

The answer posted by Smirk was a great hint about how to do this on a VERY OLD UNIX system that has ancient versions of vim, ex, iconv, piconv, etc. I did not want to treat only alphanumerics as non-BOM characters, so these patterns assume that two or three leading non-printable-ASCII characters on the first line are the BOM to remove. I also wanted a non-interactive method.

An excommands file was created as follows:

" UTF-8 Byte-Order-Mark (BOM) characters
1,1g/^[^ -~][^ -~][^ -~][ -~]/s/^...//
" UTF-16LE, UTF-16 (Big Endian) BOM
" ex happens to strip unwanted NULs
1,1g/^[^ -~][^ -~][ -~]/s/^..//

To remove the BOM characters:

ex - file-w-BOM <excommands

To use interactively, just enter as a colon command in vim. For example:

:1,1g/^[^ -~][^ -~][^ -~][ -~]/s/^...//

NOTE: For some reason, the ex on my VERY OLD UNIX system just happened to remove the unwanted NUL bytes from UTF-16LE files in a way that didn’t garble data that all cleanly corresponded with ASCII characters. This was fortunate since both iconv and piconv on the VERY OLD UNIX system were also unable to properly re-encode UTF-16LE as something else.

CAVEAT: The above is sure to BREAK files that contain multi-byte characters that do not map to plain ASCII, so the solution must only be used with this in mind.

Answered By: kbulgrien