
For a full list of BASHing data blog posts see the index page.


Encoding detection smackdown

In A Data Cleaner's Cookbook I've been recommending the file command for detecting whether or not a file is in UTF-8 encoding, mainly because file also reports line endings. I had some doubts, though, so I decided to run a couple of simple tests on file and some other command-line encoding detectors.


Preparation. The starting file I used in both cases is a modified, plain-text version of Tolstoy's War and Peace as downloaded from the Project Gutenberg website. The file is called "wap" and is a single line in UTF-8 encoding with about 3,200,000 characters.

For the first test I "salted" my UTF-8 file "wap" with a byte that is not valid UTF-8 on its own, hex 80 (the euro sign in Windows-1252 encoding). With this "salt" I replaced the one character at position 1, or position 10, or position 100, and so on up to position 1,000,000, naming each file after the "salt" character's nominal position: "wap1", "wap10", etc. For the second test I put the UTF-8 byte order mark (hex ef bb bf) at the start of "wap" and of each "salted" file, naming the results "wapBOM", "wap1BOM", "wap10BOM", etc.
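A minimal sketch of how salted and BOM-prefixed files like these could be built with standard tools. It is not necessarily how the files for this post were made. The short sample text stands in for the real "wap" file, and the helper name salt and the temporary directory are made up for illustration. The helper replaces a byte, not a character, so it assumes the character at that position is single-byte ASCII.

```shell
# Hypothetical stand-in for "wap": a short, single-line UTF-8 file.
workdir=$(mktemp -d)
printf 'Well, Prince, so Genoa and Lucca are now just family estates.' > "$workdir/wap"

# salt FILE POS: write FILE to stdout with the byte at 1-based position POS
# replaced by hex 80 (octal 200). Assumes that byte is a single-byte character.
salt() {
    if [ "$2" -gt 1 ]; then
        head -c "$(($2 - 1))" "$1"      # everything before the salted position
    fi
    printf '\200'                       # the "salt": hex 80
    tail -c "+$(($2 + 1))" "$1"         # everything after the salted position
}

for pos in 1 10; do
    salt "$workdir/wap" "$pos" > "$workdir/wap$pos"
done

# Prepend the UTF-8 byte order mark (hex ef bb bf, octal 357 273 277).
for f in "$workdir/wap" "$workdir/wap1" "$workdir/wap10"; do
    { printf '\357\273\277'; cat "$f"; } > "${f}BOM"
done
```

Octal escapes are used in printf because they work in every POSIX shell, while hex escapes such as \x80 do not.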

The five programs I tested (in alphabetical order) were enca, file, iconv, isutf8 and uchardet.


Round 1. file, iconv and isutf8 all found something wrong with the salted files. Note that iconv and isutf8 return nothing (exit status 0) if the file is valid UTF-8:

[Screenshot: encoding1]
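Because iconv signals the result through its exit status, the check is easy to script. The following is only a sketch under assumptions: the exact commands used in the test are in the screenshot and are not reproduced here, and the helper name is_valid_utf8 and the two tiny sample files are invented for illustration. Converting from UTF-8 to UTF-8 and discarding the output leaves just the exit status.

```shell
# Hypothetical helper: succeeds (exit status 0) only if FILE decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

tmp=$(mktemp -d)
printf 'na\303\257ve text\n' > "$tmp/good"    # "naive" with i-diaeresis, valid UTF-8 (c3 af)
printf 'na\200ve text\n' > "$tmp/salted"      # hex 80 is not valid UTF-8 on its own

for f in "$tmp/good" "$tmp/salted"; do
    if is_valid_utf8 "$f"; then
        echo "$(basename "$f"): valid UTF-8"
    else
        echo "$(basename "$f"): not valid UTF-8"
    fi
done
```

Dropping the redirection of standard error would let iconv's own message about the offending input show through.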

enca and uchardet both failed. enca thought the salted files "wap100K" and "wap1M" were OK UTF-8, while uchardet got all the salted files wrong:

[Screenshot: encoding2]

Round 2. None of file, iconv and isutf8 was fooled by the initial byte order mark:

[Screenshot: encoding3]
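Whether a file starts with a byte order mark can be checked independently of any encoding detector by looking at its first three bytes. This is a small sketch, not part of the original tests. The helper name has_utf8_bom and the sample files are assumptions made for illustration.

```shell
# Hypothetical helper: succeeds if FILE begins with the UTF-8 BOM (hex ef bb bf).
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

bomdir=$(mktemp -d)
printf 'plain text\n' > "$bomdir/wap"
printf '\357\273\277plain text\n' > "$bomdir/wapBOM"   # octal for ef bb bf

for f in "$bomdir/wap" "$bomdir/wapBOM"; do
    if has_utf8_bom "$f"; then
        echo "$(basename "$f"): starts with a UTF-8 BOM"
    else
        echo "$(basename "$f"): no BOM"
    fi
done
```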

No change with enca, and uchardet took my word for it (my BOM) and said "wap1BOM" was also UTF-8:

[Screenshot: encoding4]

Conclusion. file and isutf8 are both good choices for detecting whether or not a file contains only UTF-8 characters. I'll continue using file because I like the output messages, as in this screenshot from the Cookbook:

[Screenshot: encoding5]

Last update: 2020-09-23
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License