Detecting non-ASCII characters in a text file

Our internal coding standard for C++ source files dictates that 7-bit US-ASCII should be used for file encoding.

This decision is based on the fact that the current C++ standard (2003) limits the characters that can be written directly in variable and type identifiers to ASCII letters, digits and the underscore. Although some compilers and the new (2011) C++ standard allow most Unicode code points in identifiers (basically whatever can be called a “letter” in the various scripts), the “same-glyph, different Unicode code point” syndrome, where two characters look identical on screen but have different code points, advises against that.
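
To make the same-glyph problem concrete, here is a minimal sketch that compares the Latin small letter a (U+0061) with the Cyrillic small letter a (U+0430), both written out as UTF-8 bytes. The names latinA, cyrillicA and sameIdentifier are hypothetical and only serve the illustration.

```cpp
#include <string>

// Latin small letter a, U+0061: a single byte in UTF-8.
const std::string latinA = "\x61";
// Cyrillic small letter a, U+0430: two bytes in UTF-8,
// usually drawn with the same glyph as the Latin letter.
const std::string cyrillicA = "\xD0\xB0";

// Hypothetical helper: two identifiers are the same name
// only if their byte sequences match exactly.
bool sameIdentifier(const std::string &lhs, const std::string &rhs) {
  return lhs == rhs;
}
```

Calling sameIdentifier(latinA, cyrillicA) yields false, so two identifiers that look the same in an editor would name two different entities.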

One could still allow non-ASCII characters in string constants and in comments, and this is tolerated by most modern compilers. But we decided to be quite conservative in the current version of our coding standard; in the future, once C++ 2011 is fully implemented, we might revise it.

The trouble is that sometimes non-ASCII characters sneak in, for example the euro sign (€), the degree symbol (°) and the typographic dash, which looks so similar to the ASCII minus sign (-).

Long story short, we needed a utility to detect non-ASCII characters in a collection of text (source) files. This utility is called checkAscii, and the C++ source code is:

/*
   @file checkAscii.cc
   @brief Detect non-ASCII characters in a text file
   @author (C) Copyright 2012 Paolo Greppi libpf.com
   @date 20120525
   @version 0.1

   no warranties whatsoever
   distribute freely and free of charge citing this:
   
Detecting non-ASCII characters in a text file
*/
#include <iostream>
#include <fstream>
#include <cstdio>

int main(int argc, char *argv[]) {
  std::istream *in = NULL;
  std::ifstream inf;
  if (argc == 1) {
    in = &std::cin;
    std::cout << "Now checking stdin" << std::endl;
  } else if (argc == 2) {
    inf.open(argv[1]);
    if (!inf) {
      std::cerr << "Error opening input file !" << std::endl;
      return -1;
    }
    in = &inf;
    std::cout << "Now checking file " << argv[1] << std::endl;
  } else {
    std::cerr << "Only 0 or 1 argument !" << std::endl;
    return -1;
  }

  const int bit8 = (1 << 7); // the 8th bit is set in every non-ASCII byte
  int c; // int rather than char, so that EOF differs from byte 0xFF and NUL bytes do not stop the loop
  int line(1), column(0), count(0); // both line and column are reported 1-based
  while ((c = in->get()) != EOF) {
    if (c == '\n') {
      ++line;
      column = 0;
      continue;
    }
    ++column; // counts bytes, not characters
    if ((c & bit8) == bit8) {
      std::cout << "line: " << line << " column: " << column
                << " nonascii " << static_cast<char>(c) << std::endl;
      count++;
    }
  }
  if (argc > 1) {
    inf.close();
  }
  return count; // exit status: number of non-ASCII bytes found
}

Usage is as follows:

cat mySourceFile.cc | checkAscii

or:

checkAscii mySourceFile.cc

If non-ASCII characters are found, it returns their number as its exit status (on POSIX systems only the low 8 bits of that value reach the caller) and prints this:

Now checking file mySourceFile.cc
line: 53 column: 26 nonascii °
line: 54 column: 27 nonascii €

or will print this (and return 0) if only ASCII characters are found:

Now checking file mySourceFile.cc
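
Note that the check works byte by byte. If a file is encoded in UTF-8, a single non-ASCII character occupies several bytes (two for °, three for €), each with the 8th bit set, so a byte-wise count is larger than the number of offending characters. As a sketch of one way to count characters instead, assuming the input is valid UTF-8, the hypothetical function below counts only the first byte of each multi-byte sequence and skips continuation bytes, which have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts non-ASCII characters (not bytes)
// in text that is ASSUMED to be valid UTF-8.
int countNonAsciiUtf8(const std::string &text) {
  int count = 0;
  for (std::size_t i = 0; i < text.size(); ++i) {
    const unsigned char b = static_cast<unsigned char>(text[i]);
    const bool highBit = (b & 0x80) != 0;         // non-ASCII byte
    const bool continuation = (b & 0xC0) == 0x80; // 10xxxxxx
    if (highBit && !continuation) {
      ++count; // first byte of a multi-byte sequence
    }
  }
  return count;
}
```

For files in a single-byte encoding such as ISO-8859-15 this distinction does not arise, since every character is exactly one byte.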

We use it on large sets of files using bash and xargs as follows:

ls -1 include/*.h | xargs -d '\n' -n 1 checkAscii
ls -1 src/*.cc | xargs -d '\n' -n 1 checkAscii

Enjoy !
