banner

For a full list of BASHing data blog posts see the index page.     RSS


A quick repair job on a dislocated table

The tab-separated data table I was auditing had 5463 records with 21 fields each, but something was seriously wrong. Here's a hugely simplified version of that table, called "dislocated":

Fld1Fld2Fld3Fld4Fld5Fld6Fld7
001cccLLL111Qqq, 888mmmhhh
002gggPPP777Rrr444sss
003vvvAAA333Eee, 666dddhhh
004oooEEE222Iii, 999uuuhhh
005hhhJJJ888Vvv, 111aaahhh
006bbbCCC444Sss222kkk
007iiiRRR666Bbb555ddd
008rrrNNN333Jjj, 888iiihhh
009sssHHH555Www333xxx
010aaaTTT999Qqq, 111uuuhhh

Somehow the comma[space] in field 5 (above) had been converted to a field separator in certain records. This displaced all the subsequent fields in those records one field to the right, and the last field (always containing "hhh") had been trimmed off.

To continue the audit I needed to fix the real-world table. My quick, Band-Aid® solution was an AWK command (shown here for "dislocated", not for the real-world table):

awk 'BEGIN {FS=OFS="\t"} NR>1 && $5 !~ /,/ \
{print $1,$2,$3,$4,$5", "$6,$7,"hhh"; next} 1' dislocated

dislocated1

The BEGIN statement tells AWK that the input field separator (FS) and the output field separator (OFS) are both the tab character.
 
The "1" at the end of the command tells AWK to print all lines. It's an AWK shorthand for the usual pattern {action} statement. The pattern in this case is the number "1", which is always true (1 always equals 1) and the action is the default AWK action, which is to print the line.
 
The line pattern to be treated specially is that field 5 doesn't match a comma ($5 !~ /,/). Since this is also true in the header line (NR is 1), which I don't want to treat specially, the full pattern for special action is NR>1 && $5 !~ /,/.
 
The first special action is to print the line so that fields 5 and 6 are separated by comma[space] rather than by a tab, and with "hhh" (the standard entry) added as the last tab-separated field. This prints the fields in their correct form and order.
 
The second special action is "next", telling AWK to go straight to the next line without printing the dislocated original line.


Last update: 2020-07-15
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License