Spotting spaces, and AWK's view of emptiness


Spaces. A new function in my data auditing toolkit is "spacevis". Its job is to highlight plain whitespaces and mark each one with a dot for easy counting:

spacevis() { sed 's|\x20|\x1b[103m\xc2\xb7\x1b[0m|g'; }

"spacevis" is just a global replacement using sed with "g".
 
The character being replaced is a plain whitespace, which I've represented in the command with its hexadecimal code "20".
 
The replacement is the "middle dot" character "c2 b7".
 
The highlighting is done by turning on a yellow background color with an ANSI escape (\x1b[103m) for the middle dot, then turning it off (\x1b[0m). "1b" is the hexadecimal code for the escape character (ESC).
 
For other background colors and their codes, see this handy website or the colors section on this Wikipedia page.

spacevis1
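If you'd like to switch highlight colors without editing the function, the background code could be passed in as an argument. This is only a sketch: the name "spacevis_bg" and its default value are hypothetical, and like "spacevis" it assumes GNU sed, which understands the \xHH escapes.

```shell
# Hypothetical variant of "spacevis": the ANSI background code is the
# first argument (default 103, the yellow used above). Assumes GNU sed.
spacevis_bg() {
    local bg="${1:-103}"
    # Double quotes let the shell expand ${bg}; the \x escapes are
    # passed through untouched for sed to interpret.
    sed "s|\x20|\x1b[${bg}m\xc2\xb7\x1b[0m|g"
}

# Mark spaces with a bright cyan background (code 106) instead of yellow
printf 'one  two \n' | spacevis_bg 106
```

Called with no argument, "spacevis_bg" should behave like the original "spacevis".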

"spacevis" is especially useful when doing a tally of a field, because it reveals and enumerates sometimes unneeded spaces:

spacevis2
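A tally of this kind might be put together as in the sketch below. The sample data are made up for illustration, and applying "spacevis" before sort and uniq -c is just one possible ordering; I've chosen it here so that the padding uniq puts in front of each count doesn't get dotted as well.

```shell
spacevis() { sed 's|\x20|\x1b[103m\xc2\xb7\x1b[0m|g'; }

# Made-up, tab-separated sample with stray spaces in the second field
printf '1\tAlice\n2\tAlice \n3\tBob\n4\t Bob\n5\tBob\n' |
    cut -f2 | spacevis | sort | uniq -c
```

The tally should show four distinct values rather than two, with the leading and trailing spaces marked.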

Note that if you pipe the output of "spacevis" to less for paging, you'll get ugliness:

spacevis3

The trick to getting less to respect ANSI color escapes is to use its -R option: less -R.
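To avoid typing less -R every time, the paging could be folded into a wrapper. The sketch below is an assumption of mine rather than part of the toolkit: "spacevis_page" is a hypothetical name, and the function only calls less when its output is going to a terminal, so it can still be used in the middle of a pipeline.

```shell
spacevis() { sed 's|\x20|\x1b[103m\xc2\xb7\x1b[0m|g'; }

# Hypothetical wrapper: page through "less -R" only when the output
# is a terminal; otherwise pass the marked-up text straight on.
spacevis_page() {
    if [ -t 1 ]; then
        spacevis | less -R
    else
        spacevis
    fi
}

printf 'a  b\n' | spacevis_page
```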

In a BASHing data post in 2019 I used BASH string expansion to force sed to understand ANSI color escapes beginning with the ESC character, represented as "033". I didn't know at the time that GNU sed has its own octal escape, written as a backslash and a lowercase "o" followed by the octal number (\o033 for ESC), so string expansion isn't necessary. See screenshot below and this page in the GNU sed manual.

spacevis4
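As a text version of the same idea, here is a sketch of "spacevis" with ESC written in octal instead of hexadecimal. The name "spacevis_octal" is hypothetical, and the \o escape is a GNU sed extension, so this isn't expected to work with other seds.

```shell
# Original function, ESC written in hexadecimal (\x1b)
spacevis() { sed 's|\x20|\x1b[103m\xc2\xb7\x1b[0m|g'; }

# Hypothetical twin, ESC written in octal (\o033); GNU sed only
spacevis_octal() { sed 's|\x20|\o033[103m\xc2\xb7\o033[0m|g'; }

# Dump the bytes from each function for comparison
printf 'a b\n' | spacevis | od -c
printf 'a b\n' | spacevis_octal | od -c
```

The two od dumps should be identical, since \x1b and \o033 are the same byte.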

AWK and emptiness. Last month I posted about a function that tallies up pairs of empty/non-empty data items. I used AWK (GNU AWK 4, gawk) to determine whether a field entry was empty or not, but the test I used didn't distinguish between empty and zero.

I've updated that post, and here I take a closer look at what AWK is doing when you use $N or !$N as a condition. For this exercise I'll print a numbered, tab-separated list as follows:

line 1: the non-zero character "a"
line 2: an empty string
line 3: the null character, hex 00
line 4: the space character, hex 20
line 5: a zero
line 6: the string "00"
 
printf "a\n\n\x00\n\x20\n0\n00\n" | nl -ba -w1

Then I'll pass the list to AWK, getting AWK to print the line (the default action) if the second field meets the condition $2 or !$2:

test1
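For readers who can't see the screenshot, here is a sketch of a run of this kind. It makes two assumptions: the field separator is a tab (with AWK's default separator a lone space in the second field would be treated as a delimiter, not as content), and the null character line is left out because handling of NUL bytes may differ between AWK implementations. The list is therefore five lines long, not six.

```shell
# Five test values numbered by nl: "a", empty string, one space, "0", "00".
# The NUL line from the list above is deliberately omitted in this sketch.
sample() { printf 'a\n\n \n0\n00\n' | nl -ba -w1; }

echo 'Second field is true ($2):'
sample | awk -F'\t' '$2'

echo 'Second field is false (!$2):'
sample | awk -F'\t' '!$2'
```

The first command prints lines 1 ("a") and 3 (the space). The second prints lines 2 (empty), 4 ("0") and 5 ("00"), so zeros land with the empty string.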

OK, pretty clear result. I thought that the AWK shorthand "$N" meant "field $N exists and is non-empty", but it actually means "field $N exists, is non-empty and is non-zero". Apologies to any readers who adopted my "fldpair" function from that August post before I revised it! Here's a non-shorthand test to see if a field's entry is just an empty string:

test2
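One way to write such a test is an explicit comparison with the empty string, as in the sketch below. It isn't necessarily the command in the screenshot, and it makes the same assumptions as before: tab as the field separator, and a five-line sample without the null character.

```shell
# Five test values numbered by nl: "a", empty string, one space, "0", "00"
sample() { printf 'a\n\n \n0\n00\n' | nl -ba -w1; }

echo 'Second field is an empty string:'
sample | awk -F'\t' '$2 == ""'

echo 'Second field is not an empty string:'
sample | awk -F'\t' '$2 != ""'
```

Because the comparison is against a string constant, AWK compares as strings, so "0" and "00" count as non-empty. Only line 2 passes the first test, and lines 1, 3, 4 and 5 pass the second.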

Last update: 2020-09-09
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License