Tuesday, 28 September 2010

A Shell Script to Find and Remove the BOM Marker

Introduction

Have you ever seen these characters while dumping the contents of some of your text files?

ï»¿

If you have, you found a BOM! The BOM (Byte Order Mark) is the Unicode character with code point U+FEFF, used to signal the endianness of a Unicode text stream.

Since Unicode characters can be encoded as multibyte sequences with a specific endianness, and since different architectures may adopt different endianness, it's fundamental to tell the receiver the endianness of the data stream being sent. Dealing with the BOM, then, is part of the game.

If you want to know more about when to use the BOM, you can start by reading this official Unicode FAQ.

UTF-8

UTF-8 is one of the most widely used Unicode character encodings in software and protocols that have to deal with textual data streams. UTF-8 represents each Unicode character with a sequence of 1 to 4 octets. Each octet contains control bits that are used to identify the beginning and the length of an octet sequence. The Unicode code point is simply the concatenation of the non-control bits in the sequence. One of the advantages of UTF-8 is that it retains backwards compatibility with ASCII in the ASCII [0-127] range, since such characters are represented with the same octet in both encodings.
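To make the bit layout concrete, here is a minimal sketch that computes the UTF-8 octets of a code point with plain shell arithmetic. The function name utf8_octets is made up for this illustration and no validation is performed; it only shows how the control bits and the code point bits are combined.

```shell
# utf8_octets is a hypothetical helper name used only for this illustration.
# It prints, in hex, the UTF-8 octets that encode the code point given as $1.
# No validation is done: surrogates and values above 0x10FFFF are not rejected.
utf8_octets() {
  cp=$(( $1 ))
  if [ "$cp" -lt 128 ]; then            # 0xxxxxxx
    printf '%02X\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 110xxxxx 10xxxxxx
    printf '%02X %02X\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then        # 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else                                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02X %02X %02X %02X\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_octets 0x41     # ASCII 'A': a single octet, 41
utf8_octets 0xFEFF   # the BOM: EF BB BF
```

The ASCII case shows the backwards compatibility mentioned above: the code point and its single octet are the same value.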

If you're curious about how the UTF-8 encoding works, I've written an introductory post about it.

Common Problems

Because of its design, the UTF-8 encoding is not endianness-sensitive, and the Unicode standard neither requires nor recommends using the BOM with this encoding. Unfortunately some common utilities, notably Microsoft Notepad, keep on adding a BOM to your UTF-8 files, thus breaking those applications that aren't prepared to deal with it.

Some programs could, for example, display the following characters at the beginning of your file:

ï»¿

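A more serious problem is that a BOM will break a UNIX shell script by interfering with the shebang (#!): the kernel recognizes the shebang only when #! are the very first two bytes of the file, and a BOM pushes them to the fourth and fifth.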

A Shell Script to Check for BOMs and Remove Them

The Byte Order Mark (BOM) is a Unicode character with code point U+FEFF. Its UTF-8 representation is the following sequence of 3 octets:

1110 1111 1011 1011 1011 1111
E    F    B    B    B    F
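Knowing the three octets, checking whether a file starts with a BOM is just a matter of looking at its first three bytes. The following is a minimal sketch, not part of the script presented later: has_bom is a made-up helper name, and the sample files are created only for the demonstration.

```shell
# has_bom is a hypothetical helper name: it succeeds when the file given as $1
# starts with the octets EF BB BF.
has_bom() {
  # od -N 3 dumps only the first three bytes in hex; tr strips od's spacing.
  [ "$(od -A n -t x1 -N 3 "$1" | tr -d ' \n')" = "efbbbf" ]
}

# Two sample files, one with a BOM and one without (names are made up).
dir=$(mktemp -d)
printf '\357\273\277hello\n' > "$dir/with-bom.txt"
printf 'hello\n' > "$dir/without-bom.txt"

for f in "$dir/with-bom.txt" "$dir/without-bom.txt" ; do
  if has_bom "$f" ; then
    echo "$(basename "$f"): BOM found"
  else
    echo "$(basename "$f"): no BOM"
  fi
done
rm -rf "$dir"
```

The octal escapes passed to printf (\357 \273 \277) are simply EF BB BF written in base 8.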

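The quickest way I know of to strip the BOM from a text file is sed. The following syntax instructs sed to remove the BOM from the beginning of the first line of its input file (the \xHH escapes are a GNU sed extension, and the ^ anchor keeps sed from touching a U+FEFF that appears later in the line):

sed '1 s/^\xEF\xBB\xBF//' < input > output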
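To see the command at work, the following sketch builds a small sample script that starts with a BOM, runs sed on it and dumps the first three octets before and after. GNU sed is assumed, and the temporary directory and file names are made up for the illustration.

```shell
# Sample file: a BOM followed by a tiny shell script (hypothetical content).
tmp=$(mktemp -d)
printf '\357\273\277#!/bin/sh\necho hi\n' > "$tmp/input"

# Strip the BOM from the first line (GNU sed assumed for the \xHH escapes).
sed '1 s/^\xEF\xBB\xBF//' < "$tmp/input" > "$tmp/output"

# Before: the three BOM octets (ef bb bf).
od -A n -t x1 -N 3 "$tmp/input"
# After: the file starts with "#!/" again (23 21 2f).
od -A n -t x1 -N 3 "$tmp/output"

# Remove "$tmp" when you are done with it.
```

Once the BOM is gone, #! is back at the very beginning of the file, where the shebang is expected.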

A Warning for Solaris Users

I haven't (yet) found a way to perform this operation correctly with the sed implementations bundled with Solaris 10, either /usr/bin/sed or /usr/xpg4/bin/sed. If you're a Solaris user, please consider installing GNU sed to use the following script.

The quickest way to install GNU sed and a lot of fancy Solaris packages is using Blastwave or OpenCSW. I've also written a post about loopback-mounting the Blastwave/OpenCSW installation directory in Solaris Zones to simplify Blastwave/OpenCSW software administration.
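If installing GNU sed is not an option, the BOM can also be stripped without sed at all. What follows is only a sketch under a few assumptions: strip_bom is a made-up helper name, it relies on od -N and tail -c +N (both specified by POSIX), and it has not been verified against the tools bundled with Solaris 10, so check it on your platform first.

```shell
# strip_bom is a hypothetical helper name: it writes the file given as $1 to
# standard output, skipping a leading UTF-8 BOM if there is one.
strip_bom() {
  if [ "$(od -A n -t x1 -N 3 "$1" | tr -d ' \n')" = "efbbbf" ] ; then
    # Start the output at the 4th byte, that is, right after the 3 BOM octets.
    tail -c +4 "$1"
  else
    # No BOM: copy the file unchanged.
    cat "$1"
  fi
}

# Example run on a made-up sample file.
sample=$(mktemp)
printf '\357\273\277hello\n' > "$sample"
strip_bom "$sample" | od -A n -t x1    # dumps the octets of "hello\n" only
rm -f "$sample"
```

Since the check looks at raw bytes, it does not depend on how a given sed implementation treats escapes or multibyte characters.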

A Suggestion for Windows Users

If you want to execute this script in a Windows environment, you can install Cygwin. The base install with bash and the core utilities will be sufficient for this script to work in your Cygwin environment.

Source

This is the source code of a skeleton implementation of a bash shell script that removes the BOM from its input files. The script supports recursive scanning of directories to "clean" an entire file system tree and a flag (-x) to avoid descending into file systems mounted elsewhere. The script uses temporary files while doing the conversion, and the original file is overwritten only if the -d option is not specified.

#!/bin/bash

set -o nounset
set -o errexit

DELETE_ORIG=true
DELETE_FLAG=""
RECURSIVE=false
PROCESSING_FILES=false
SED_EXEC=sed
TMP_CMD="mktemp"
XDEV=""
FILES=()

if [ "$(uname)" == "SunOS" ] ; then
  # Prefer GNU sed on Solaris, if available.
  if [ -x /usr/gnu/bin/sed ] ; then
    echo "Using GNU sed..."
    SED_EXEC=/usr/gnu/bin/sed
  fi
fi

function usage() {
  echo "bom-remove [-drx] [-s sed-name] files..."
  echo ""
  echo "  -d    Do not overwrite original files and do not remove temp files."
  echo "  -r    Scan subdirectories."
  echo "  -s    Specify an alternate sed implementation."
  echo "  -x    Don't descend directories in other file systems."
}

function checkExecutable() {
  if ! command -v "$1" > /dev/null 2>&1 ; then
    echo "Cannot find executable: $1"
    exit 4
  fi
}

function parseArgs() {
  # -f is an internal flag: the script passes it to itself when it is
  # re-invoked by find during a recursive scan.
  while getopts "dfrs:x" flag
  do
    case $flag in
      r) RECURSIVE=true ;;
      f) PROCESSING_FILES=true ;;
      s) SED_EXEC=$OPTARG ;;
      d) DELETE_ORIG=false ; DELETE_FLAG="-d" ;;
      x) XDEV="-xdev" ;;
      *) echo "Unknown parameter." ; usage ; exit 2 ;;
    esac
  done

  shift $((OPTIND - 1))

  if [ $# -eq 0 ] ; then
    echo "No files specified. Exiting."
    exit 2
  fi

  # Keep the file names in an array so that names containing spaces survive.
  FILES=("$@")

  if [ $RECURSIVE == true ] && [ $PROCESSING_FILES == true ] ; then
    echo "Cannot use -r and -f at the same time."
    usage
    exit 1
  fi

  checkExecutable "$SED_EXEC"
  checkExecutable "$TMP_CMD"
}

function processFile() {
  # Create the temp file next to the original.
  # -p DIR is accepted by both GNU and Solaris mktemp.
  TEMPFILENAME=$("$TMP_CMD" -p "$(dirname "$1")")
  echo "Processing $1 using temp file $TEMPFILENAME"

  # Remove the BOM only if it is at the very beginning of the first line.
  "$SED_EXEC" '1 s/^\xEF\xBB\xBF//' "$1" > "$TEMPFILENAME"

  if [ $DELETE_ORIG == true ] ; then
    if [ ! -w "$1" ] ; then
      echo "$1 is not writable. Leaving tempfile."
    else
      # Copy the contents back instead of using mv, so that the original
      # file keeps its permissions and ownership (mktemp creates 0600 files).
      cat "$TEMPFILENAME" > "$1"
      echo "Removing temp file..."
      rm -f "$TEMPFILENAME"
    fi
  fi
}

function doJob() {
  # -f means the script was re-invoked by find with a list of regular files.
  if [ $PROCESSING_FILES == true ] ; then
    for i in "${FILES[@]}" ; do
      processFile "$i"
    done
  else
    # processing every file
    for i in "${FILES[@]}" ; do
      # checking if file or directory exist
      if [ ! -e "$i" ] ; then echo "File not found: $i. Skipping..." ; continue ; fi

      # if a parameter is a directory, process it recursively if RECURSIVE is set
      if [ -d "$i" ] ; then
        if [ $RECURSIVE == true ] ; then
          # Forward -d and -s so the child invocations behave like this one.
          find "$i" $XDEV -type f -exec "$0" $DELETE_FLAG -s "$SED_EXEC" -f "{}" +
        else
          echo "$i is a directory. Skipping..."
        fi
      else
        processFile "$i"
      fi
    done
  fi
}

parseArgs "$@"
doJob

Examples

Assuming the script is in your $PATH and is called bom-remove, you can "clean" a bunch of files by invoking it this way:

$ bom-remove file-to-clean ...

If you want to clean the files in an entire directory, you can use the following syntax:

$ bom-remove -r dir-to-clean

If your sed installation is not in your $PATH or you have to use an alternate version, you can invoke the script with the following syntax:

$ bom-remove -s path/to/sed file-to-clean

If you want to clean a directory in which other file systems might be mounted, you can use the -x option so that the script does not descend into them:

$ bom-remove -xr dir-to-clean

Next Steps

The most effective way to fight the BOM is to avoid spreading it. Microsoft Notepad, if there's anybody out there using it, isn't the best tool to edit your UTF-8 files so, please, avoid it.

However, should your file system be affected by the BOM disease, I hope this script will be a good starting point to build a BOM-cleaning solution for your site.

Enjoy!





