
Tuesday, 28 September 2010

Encoding a Unicode Character to UTF-8

Introduction

I'm living in a UTF-8 world. Everything is UTF-8, from the operating systems I use (Solaris, Linux and Mac OS X) to my terminals. Even the file systems I use must be able to support the UTF-8 names I give to my files.

UTF-8 is a flexible and well supported encoding for Unicode, available out of the box on all the operating systems I use. UTF-8 allows me not to worry (ever) about the characters I use in file names or in the files I write. Being a native Italian living in Spain, it's essential for me to have an encoding that supports all of the languages I use in my daily work.

UTF-8 is the perfect solution: it's backwards compatible with ASCII, so I need no conversion when writing ASCII files, and it can encode every character in the Unicode character set. No need to worry if one day I had to write some Japanese file name. The only "drawback" it may have is a little space overhead compared to some other encodings (such as UTF-16), but in this particular case, UTF-8 beats them all.

In such a UTF-8 world, there's no need to worry: just use the tools you have and everything will be fine.

Well, almost. Sometimes, as described in that post, I need to know exactly how UTF-8 encodes a specific Unicode code point. Some other times, since I'm a sed and grep addict, it's really handy to know what to look for. Some days ago, for example, I had to look for file names that contained specific code points to correct an idiosyncrasy between Subversion and Mac OS X (I still cannot believe it...). In such cases, a UNIX terminal and its utilities are really your best friends. Since the encoding process is very easy, I'm quickly describing it here in case you need it.
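
If you need to hunt for a specific character, knowing its UTF-8 byte sequence is usually enough. Here is a minimal sketch, assuming bash and GNU grep; the code point U+00E8 ("è"), which encodes to the bytes 0xC3 0xA8, is just an example of mine:

# List file names containing the byte sequence 0xC3 0xA8 (U+00E8, "è").
ls | grep $'\xC3\xA8'

# The same idea works on file contents.
grep -rl $'\xC3\xA8' .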

The UTF-8 Encoding

The UTF-8 encoding is a variable width encoding in which each Unicode code point is encoded as a sequence of 1 to 4 octets. Each sequence is composed of a heading byte and, possibly, trailing bytes. Since the encoding is a variable width one, you need a way to tell where a sequence starts and where it ends. That information is stored in the heading byte.

Heading Byte

The heading byte can take one of these forms:
  • 0xxxxxxx: Single byte.
  • 110xxxxx: Head of a sequence of two bytes.
  • 1110xxxx: Head of a sequence of three bytes.
  • 11110xxx: Head of a sequence of four bytes.

You may have noticed that the number of leading 1s in a heading byte tells how long the sequence is. If the heading byte starts with 0, the sequence is a single byte.

The 0xxxxxxx form ensures that ASCII characters in the [0, 127] range are encoded with exactly the same byte, thus guaranteeing backwards compatibility with ASCII.

Trailing Bytes

Trailing bytes in a multibyte UTF-8 sequence always have the following form:
  • 10xxxxxx
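
As a quick illustration of both forms (a sketch of mine, assuming a UTF-8 locale and the xxd utility), you can dump a two-byte character in binary and check its bits:

printf 'è' | xxd -b
# first byte:  11000011 (matches 110xxxxx, the head of a two-byte sequence)
# second byte: 10101000 (matches 10xxxxxx, a trailing byte)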

Encoding

The 0s and 1s in the heading and trailing byte forms described in the previous sections are called control bits. In the UTF-8 encoding, the concatenation of the non-control bits is the scalar value of the encoded Unicode code point. Because of the structure of the aforementioned forms, the number of bytes required to encode a specific Unicode code point with UTF-8 depends on the Unicode range where it falls (the ranges are given in both hexadecimal and binary form):
  • [U+0000, U+007F] [00000000, 01111111]: one byte.
  • [U+0080, U+07FF] [10000000, 111 11111111]: two bytes.
  • [U+0800, U+FFFF] [1000 00000000, 11111111 11111111]: three bytes.
  • [U+10000, U+10FFFF] [1 00000000 00000000, 10000 11111111 11111111]: four bytes.

One Byte Range

In the one byte range, the meaningful bits of the Unicode code points are in the [0, 6] range. Since the single byte form has a 1-bit overhead (the leading 0), a single byte is sufficient:

[0xxxxxxx, 0xxxxxxx]
[00000000, 01111111]

Two Bytes Range

In the two bytes range, the meaningful bits of the Unicode code points are in the [0, 10] range. Since the head of a sequence of two bytes has a 3-bit overhead and a trailing byte has a 2-bit overhead, there's room for exactly 11 bits.

[     fff ffssssss,      fff ffssssss]
[00000000 10000000, 00000111 11111111]

(* f are bits from the first byte of the encoded sequence)
(** s are bits from the second byte of the encoded sequence)

The encoded sequence is:

[110fffff 10ssssss]
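
As a quick check of this rule (a sketch using bash arithmetic), the 11-bit value of U+00E8 ("è") splits into fffff = 00011 and ssssss = 101000, and the two bytes can be rebuilt on the command line:

# 0xC0 is the 110xxxxx prefix, 0x80 the 10xxxxxx prefix, 0x3F the low-six-bits mask.
printf '%02X %02X\n' $(( 0xC0 | (0xE8 >> 6) )) $(( 0x80 | (0xE8 & 0x3F) ))
# prints: C3 A8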

Three Bytes Range

In the three bytes range, the meaningful bits of the Unicode code points are in the [0, 15] range. Since the head of a sequence of three bytes has a 4-bit overhead and a trailing byte has a 2-bit overhead, there's room for exactly 16 bits.

[ffffssss sstttttt, ffffssss sstttttt]
[00001000 00000000, 11111111 11111111]

(* f are bits from the first byte of the sequence)
(** s are bits from the second byte of the sequence)
(*** t are bits from the third byte of the sequence)

The encoded sequence is:

[1110ffff 10ssssss 10tttttt]

Four Bytes Range

In the four bytes range, the meaningful bits of the Unicode code points are in the [0, 20] range. Since the head of a sequence of four bytes has a 5-bit overhead and a trailing byte has a 2-bit overhead, there's room for exactly 21 bits.

[   fffss sssstttt tthhhhhh,    fffss sssstttt tthhhhhh]
[00000001 00000000 00000000, 00010000 11111111 11111111]

(* f are bits from the first byte of the sequence)
(** s are bits from the second byte of the sequence)
(*** t are bits from the third byte of the sequence)
(**** h are bits from the fourth byte of the sequence)

The encoded sequence is:

[11110fff 10ssssss 10tttttt 10hhhhhh]
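
The whole procedure fits in a few lines of bash. The following function is a sketch of mine (utf8_encode is not a standard tool): it takes a code point in plain hexadecimal and prints its UTF-8 byte sequence according to the four ranges above.

#!/bin/bash
# Print the UTF-8 encoding of a Unicode code point given in plain hex (e.g. E8, FEFF).
function utf8_encode() {
  local cp=$((16#$1))

  if   (( cp <= 0x7F )); then                    # one byte range
    printf '%02X\n' $cp
  elif (( cp <= 0x7FF )); then                   # two bytes range
    printf '%02X %02X\n' \
      $(( 0xC0 | (cp >> 6) )) \
      $(( 0x80 | (cp & 0x3F) ))
  elif (( cp <= 0xFFFF )); then                  # three bytes range
    printf '%02X %02X %02X\n' \
      $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else                                           # four bytes range
    printf '%02X %02X %02X %02X\n' \
      $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_encode E8       # prints: C3 A8
utf8_encode 10FFFF   # prints: F4 8F BF BF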

Conclusion

You now see how easy it is to convert a Unicode code point to its representation in the UTF-8 encoding. The UTF-8 encoding of the Byte Order Mark (BOM) character, for example, whose code point is U+FEFF, can easily be computed as explained above.

The U+FEFF code point falls in the three byte range and its binary representation is:

[F    E    F    F   ]
[1111 1110 1111 1111]

According to the aforementioned rules, the conversion gives:

[E   F    B   B    B   F   ]
[11101111 10111011 10111111]


that corresponds to the \xEF\xBB\xBF expression used with sed in our previous example.
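
A quick sanity check from the shell (a sketch, assuming bash 4.2 or later for the \u escape, a UTF-8 locale and the xxd utility):

printf '\uFEFF' | xxd -p
# prints: efbbbf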

A Shell Script to Find and Remove the BOM Marker

Introduction

Have you ever seen strange characters, typically rendered as ï»¿, while dumping the contents of some of your text files?

If you have, you have found a BOM marker! The BOM marker is a Unicode character with code point U+FEFF that specifies the endianness of a Unicode text stream.

Since Unicode characters can be encoded as multibyte sequences with a specific endianness, and since different architectures may adopt distinct endianness conventions, it's fundamental to signal the receiver about the endianness of the data stream being sent. Dealing with the BOM, then, is part of the game.

If you want to know more about when to use the BOM you can start by reading this official Unicode FAQ.

UTF-8

UTF-8 is one of the most widely used Unicode character encodings in software and protocols that have to deal with textual data streams. UTF-8 represents each Unicode character with a sequence of 1 to 4 octets. Each octet contains control bits that are used to identify the beginning and the length of an octet sequence. The Unicode code point is simply the concatenation of the non-control bits in the sequence. One of the advantages of UTF-8 is that it retains backwards compatibility with ASCII in the [0, 127] range, since such characters are represented with the same octet in both encodings.

If you feel curious about how the UTF-8 encoding works, I've written an introductory post about it.

Common Problems

Because of its design, the UTF-8 encoding has no endianness issues, and using the BOM with this encoding is discouraged by the Unicode standard. Unfortunately some common utilities, notably Microsoft Notepad, keep adding a BOM to your UTF-8 files, thus breaking those applications that aren't prepared to deal with it.

Some programs could, for example, display spurious characters at the beginning of your file, typically rendered as ï»¿ when the bytes are interpreted as ISO-8859-1 or windows-1252.

A more serious problem is that a BOM will break a UNIX shell script by interfering with the shebang (#!).
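
A minimal way to see this for yourself (a sketch; the exact error message depends on your shell):

# Create a script whose first bytes are the UTF-8 BOM.
printf '\xEF\xBB\xBF#!/bin/bash\necho hello\n' > bom-test.sh
chmod +x bom-test.sh
./bom-test.sh
# Since the file no longer starts with the literal "#!", the kernel does not
# honor the shebang; you typically get an error such as
# "./bom-test.sh: line 1: #!/bin/bash: command not found".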

A Shell Script to Check for BOMs and Remove Them

The Byte Order Mark (BOM) is a Unicode character with code point U+FEFF. Its UTF-8 representation is the following sequence of 3 octets:

1110 1111 1011 1011 1011 1111
E    F    B    B    B    F

The quickest way I know of to process a text file and perform this operation is sed. The following syntax will instruct sed to remove the BOM from the first line of its input file:

sed '1 s/\xEF\xBB\xBF//' < input > output
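
A quick way to verify the result (a sketch, assuming GNU sed for the \xNN escapes and the xxd utility):

printf '\xEF\xBB\xBFhello\n' > input
sed '1 s/\xEF\xBB\xBF//' < input > output
xxd input     # efbb bf68 656c 6c6f 0a  ->  ...hello.
xxd output    # 6865 6c6c 6f0a          ->  hello.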

A Warning for Solaris Users

I haven't found a way (yet) to correctly perform this operation with the sed implementations bundled with Solaris 10, neither /usr/bin/sed nor /usr/xpg4/bin/sed. If you're a Solaris user, please consider installing GNU sed to use the following script.

The quickest way to install sed and a lot of fancy Solaris packages is using Blastwave or OpenCSW. I've also written a post about loopback-mounting Blastwave/OpenCSW installation directory in Solaris Zones to simplify Blastwave/OpenCSW software administration.

A Suggestion for Windows Users

If you want to execute this script in a Windows environment, you can install Cygwin. The base install, with bash and the core utilities, will be sufficient for this script to work in your Cygwin environment.

Source

This is the source code of a skeleton implementation of a bash shell script that removes the BOM from its input files. The script supports recursive scanning of directories to "clean" an entire file system tree and a flag (-x) to avoid descending into file systems mounted elsewhere. The script uses temporary files while doing the conversion, and the original file will be overwritten only if the -d option is not specified.

#!/bin/bash

set -o nounset
set -o errexit

DELETE_ORIG=true
DELETE_FLAG=""
RECURSIVE=false
PROCESSING_FILES=false
SED_EXEC=sed
TMP_CMD="mktemp"
TMP_OPTS="--tmpdir="
XDEV=""

if [ $(uname) == "SunOS" ] ; then
  TMP_OPTS="-p "
  
  if [ -x /usr/gnu/bin/sed ] ; then
    echo "Using GNU sed..."
    SED_EXEC=/usr/gnu/bin/sed
  fi
  
fi

function usage() {
  echo "bom-remove [-dr] [-s sed-name] files..."
  echo ""
  echo "  -d    Do not overwrite original files and do not remove temp files."
  echo "  -r    Scan subdirectories."
  echo "  -s    Specify an alternate sed implementation."
  echo "  -x    Don't descend directories in other file systems."
}

function checkExecutable() {
  if ( ! which "$1" > /dev/null 2>&1 ); then
    echo "Cannot find executable:" $1
    exit 4
  fi
}

function parseArgs() {
  while getopts "dfrs:x" flag
  do
    case $flag in
      r) RECURSIVE=true ;;
      f) PROCESSING_FILES=true ;; # internal flag: set when find re-invokes the script on plain files
      s) SED_EXEC=$OPTARG ;;
      d) DELETE_ORIG=false ; DELETE_FLAG="-d" ;;
      x) XDEV="-xdev" ;;
      *) echo "Unknown parameter." ; usage ; exit 2 ;;
    esac
  done

  shift $(($OPTIND - 1))

  FILES="$@"
  if [ ! -n "$FILES" ] ; then
    echo "No files specified. Exiting."
    exit 2
  fi

  if [ $RECURSIVE == true ]  && [ $PROCESSING_FILES == true ] ; then
    echo "Cannot use -r and -f at the same time."
    usage
    exit 1
  fi

  checkExecutable $SED_EXEC
  checkExecutable $TMP_CMD
}

function processFile() {
  TEMPFILENAME=$($TMP_CMD $TMP_OPTS$(dirname "$1"))
  echo "Processing $1 using temp file $TEMPFILENAME"

  cat "$1" | $SED_EXEC '1 s/\xEF\xBB\xBF//' > "$TEMPFILENAME"

  if [ $DELETE_ORIG == true ] ; then
    if [ ! -w "$1" ] ; then
      echo "$1 is not writable. Leaving tempfile."
    else
      echo "Removing temp file..."
      mv "$TEMPFILENAME" "$1"
    fi
  fi
}

function doJob() {
  # When -f is set, the script was re-invoked (e.g. by find) on a list of plain files.
  if [ $PROCESSING_FILES == true ] ; then
    for i in "${FILES[@]}" ; do
      processFile "$i"
    done
  else
    # processing every file
    for i in "${FILES[@]}" ; do
      # checking if the file or directory exists
      if [ ! -e "$i" ] ; then echo "File not found: $i. Skipping..." ; continue ; fi
      
      # if a parameter is a directory, process it recursively if RECURSIVE is set
      if [ -d "$i" ] ; then
        if [ $RECURSIVE == true ] ; then
          find "$i" $XDEV -type f -exec "$0" $DELETE_FLAG -f "{}" +
        else
          echo "$i is a directory. Skipping..."
        fi
      else
        processFile "$i"
      fi
    done
  fi
}

parseArgs "$@"
doJob

Examples

Assuming the script is in your $PATH and it's called bom-remove, you can "clean" a bunch of files invoking it this way:

$ bom-remove file-to-clean ...

If you want to clean the files in an entire directory, you can use the following syntax:

$ bom-remove -r dir-to-clean

If your sed installation is not in your $PATH or you have to use an alternate version, you can invoke the script with the following syntax:

$ bom-remove -s path/to/sed file-to-clean

If you want to clean a directory in which other file systems might be mounted, you can use the -x option so that the script does not descend them:

$ bom-remove -xr dir-to-clean

Next Steps

The most effective way to fight the BOM is avoiding spreading it. Microsoft Notepad, if there's anybody out there using it, isn't the best tool to edit your UTF-8 files so, please, avoid it.

However, should your file system be affected by the BOM disease, I hope this script will be a good starting point to build a BOM-cleaning solution for your site.

Enjoy!