
Detecting non-ASCII characters in a text file

Our internal coding standard for C++ source files dictates that 7-bit US-ASCII should be used for file encoding.

This decision is based on the fact that the current C++ standard (2003) limits the characters that can be used in variable and type identifiers to ASCII letters, digits and the underscore. Although some compilers and the new (2011) C++ standard allow most Unicode code points in identifiers (basically whatever can be called a “letter” in the various scripts), the “same-glyph, different Unicode code-point syndrome” described here is a strong argument against using them.
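To make the syndrome concrete, here is a minimal sketch that is not part of the original tool. It assumes UTF-8 encoding, and the names latinA, greekA, cyrillicA and looksAlikeDemo are made up for illustration. The strings are spelled as byte escapes so that the example itself stays 7-bit clean.

```cpp
#include <cassert>
#include <string>

// Three capital letters that most fonts draw identically, written as
// explicit UTF-8 bytes (assumption: the source would be UTF-8 encoded).
const std::string latinA    = "A";         // U+0041 LATIN CAPITAL LETTER A
const std::string greekA    = "\xCE\x91";  // U+0391 GREEK CAPITAL LETTER ALPHA
const std::string cyrillicA = "\xD0\x90";  // U+0410 CYRILLIC CAPITAL LETTER A

// Usage: three names that look like "Area" on screen are three different
// byte sequences, so they would be three different identifiers.
void looksAlikeDemo() {
  const std::string latinName    = latinA + "rea";
  const std::string greekName    = greekA + "rea";
  const std::string cyrillicName = cyrillicA + "rea";
  assert(latinName != greekName);
  assert(latinName != cyrillicName);
  assert(greekName != cyrillicName);
  // Only the Latin spelling is pure ASCII: it is one byte shorter.
  assert(latinName.size() == 4 && greekName.size() == 5);
}
```

A reader cannot tell these names apart in an editor, which is exactly why restricting identifiers to ASCII is the safe choice.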

One could still allow non-ASCII characters in string constants and in comments, and this is tolerated by most modern compilers. But the decision was to be quite conservative in the current standard; in the future, as C++ 2011 is fully implemented, we might revise it.

The trouble is that non-ASCII characters sometimes sneak in, for example the euro sign €, the degree symbol ° and the en dash –, which looks so similar to the minus sign -.

Long story short, we needed a utility to detect non-ASCII characters in a collection of text (source) files. This utility is called checkAscii, and its C++ source code is:

/*
   @file checkAscii.cc
   @brief Detect non-ASCII characters in a text file
   @author (C) Copyright 2012 Paolo Greppi libpf.com
   @date 20120525
   @version 0.1

   no warranties whatsoever
   distribute freely and free of charge citing this:
   


  Detecting non-ASCII characters in a text file

*/

#include <iostream>
#include <fstream>
#include <cstdio>

int main(int argc, char *argv[]) {
  std::istream *in = NULL;
  std::ifstream inf;
  if (argc == 1) {
    in = &std::cin;
    std::cout << "Now checking stdin" << std::endl;
  } else if (argc == 2) {
    inf.open(argv[1]);
    if(!inf) {
      std::cerr << "Error opening input file !" << std::endl;
      return -1;
    }
    in = &inf;
    std::cout << "Now checking file " << argv[1] << std::endl;
  } else {
    std::cerr << "Only 0 or 1 argument !" << std::endl;
    return -1;
  }

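  // get() returns an int: storing it in a char would make byte 0xFF
  // indistinguishable from EOF and would stop the loop at a NUL byte
  int c;
  int line(1), column(0), count(0);
  while ((c = in->get()) != EOF) {
    if (c == '\n') {
      ++line;
      column = 0;
      continue;
    }
    ++column; // 1-based byte offset within the current line
    if (c > 0x7F) { // 8th bit set: outside 7-bit US-ASCII
      std::cout << "line: " << line << " column: " << column << " nonascii " << static_cast<char>(c) << std::endl;
      ++count;
    }
  }

  if (argc > 1) {
    inf.close();
  }
  // POSIX exit statuses are 8 bits wide: clamp so that 256 hits are not reported as 0
  return count > 255 ? 255 : count;
}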

Usage is as follows:

cat mySourceFile.cc | checkAscii

or:

checkAscii mySourceFile.cc

If non-ASCII characters are found, it prints something like this and returns their number as the exit status (keep in mind that a POSIX shell only sees the low 8 bits of an exit status):

Now checking file mySourceFile.cc
line: 53 column: 26 nonascii °
line: 54 column: 27 nonascii €

or will print this (and return 0) if only ASCII characters are found:

Now checking file mySourceFile.cc

We use it on large sets of files using bash and xargs as follows:

ls -1 include/*.h | xargs -d '\n' -n 1 checkAscii
ls -1 src/*.cc | xargs -d '\n' -n 1 checkAscii
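As an alternative to running one process per file, the check could accept several paths at once. The sketch below is an assumption about how that might look, not a feature of checkAscii as published: countNonAscii, countBadFiles and countBadFilesDemo are hypothetical helpers, and the .tmp file names exist only for the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes >= 0x80 in one file,
// or -1 if the file cannot be opened.
long countNonAscii(const std::string &path) {
  std::ifstream in(path.c_str(), std::ios::binary);
  if (!in) return -1;
  long count = 0;
  for (int c = in.get(); c != std::ifstream::traits_type::eof(); c = in.get()) {
    if (c > 0x7F) ++count;
  }
  return count;
}

// Hypothetical helper: how many of the given files are unreadable
// or contain at least one non-ASCII byte.
int countBadFiles(const std::vector<std::string> &paths) {
  int bad = 0;
  for (std::size_t i = 0; i < paths.size(); ++i) {
    if (countNonAscii(paths[i]) != 0) ++bad;
  }
  return bad;
}

// Usage: one clean file and one file containing a UTF-8 degree sign.
void countBadFilesDemo() {
  { std::ofstream f("checkAscii_clean.tmp"); f << "int x = 0;\n"; }
  { std::ofstream f("checkAscii_dirty.tmp"); f << "// 25 \xC2\xB0" "C\n"; }
  std::vector<std::string> paths;
  paths.push_back("checkAscii_clean.tmp");
  paths.push_back("checkAscii_dirty.tmp");
  assert(countNonAscii(paths[0]) == 0);
  assert(countNonAscii(paths[1]) == 2);
  assert(countBadFiles(paths) == 1);
  std::remove(paths[0].c_str());
  std::remove(paths[1].c_str());
}
```

With a main() that forwards argv[1] through argv[argc - 1] to countBadFiles, the shell could expand include/*.h and src/*.cc directly, without ls and xargs.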

Enjoy!