
For a full list of BASHing data blog posts, see the index page.


Reformatting a list, cleverly

I sometimes lurk on Stack Overflow's AWK forum, where most of the questions are about reformatting a text file. Sometimes a question gets unconventional answers; I described a really surprising solution in a May 2018 BASHing data post.

A January 2019 question on the forum was in the "Hmm, that's interesting" category. The OP had this list of items:

"abc"
4
21
22
25
"standard"
1
"test"
4
5
10
11
12

and wanted to turn it into a series of horizontal lists:

"abc" 4 21 22 25
"standard" 1
"test" 4 5 10 11 12

Two of the Stack Overflow responses relied on AWK and both are a little bit cryptic, so I'll explain them here in detail.


Method 1. This elegant solution was proposed by AWK guru Ed Morton:

awk '{printf "%s%s", (/^"/ ? ors : OFS), $0; ors=ORS} END {print ""}' file

awk1
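To try the command without the original data file, the OP's list can be rebuilt first. This is only a sketch: the temporary file created with mktemp and the shell variable names "file" and "result" are my own choices, not part of the Stack Overflow answer.

```shell
# Recreate the OP's list in a temporary file (the file name is arbitrary)
file=$(mktemp)
printf '%s\n' '"abc"' 4 21 22 25 '"standard"' 1 '"test"' 4 5 10 11 12 > "$file"

# Ed Morton's command, unchanged, with its output captured in a variable
result=$(awk '{printf "%s%s", (/^"/ ? ors : OFS), $0; ors=ORS} END {print ""}' "$file")

# Show the three horizontal lists, then clean up
printf '%s\n' "$result"
rm -f "$file"
```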

The first thing to note about Morton's AWK command is that two of its parts are only there to keep the result tidy. The END statement's print "" just adds a newline at the end of the output (see screenshot below with the END statement left out), and the use of the variable "ors" instead of ORS prevents AWK from printing a newline before it prints the first line (see screenshot again; I'll explain in a moment).

awk2
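The effect of the two tidying parts can also be checked without a screenshot. The sketch below uses a shortened, four-line version of the list and two helper variables ("last_byte" and "first_line") that I've made up for the illustration.

```shell
# Without the END rule, the output stops right after the last item:
# its final byte is the "1", not a newline
last_byte=$(printf '%s\n' '"abc"' 4 '"standard"' 1 |
    awk '{printf "%s%s", (/^"/ ? ors : OFS), $0; ors=ORS}' | tail -c 1)

# With ORS used directly instead of "ors", a newline is printed before
# the very first item, so the first output line is empty
first_line=$(printf '%s\n' '"abc"' 4 '"standard"' 1 |
    awk '{printf "%s%s", (/^"/ ? ORS : OFS), $0} END {print ""}' | head -n 1)

printf 'last byte without END: [%s]\n' "$last_byte"
printf 'first line with ORS:   [%s]\n' "$first_line"
```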

The first part of the command is the printf instruction. It tells AWK to print 2 strings (%s%s) but no newline after those strings (that would be "%s%s\n").
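A small comparison may make the difference between print and printf clearer. The two-line input and the "x" prefix here are invented for the demonstration.

```shell
# print appends ORS (a newline) after each record:
# two input lines give two output lines
lines_print=$(printf 'a\nb\n' | awk '{print "x" $0}' | wc -l | tr -d ' ')

# printf with "%s%s" adds nothing after the two strings:
# the same input comes out as one unbroken string
joined=$(printf 'a\nb\n' | awk '{printf "%s%s", "x", $0}')

printf '%s\n%s\n' "$lines_print" "$joined"
```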

The first string to print is the result of a test which AWK carries out on each line. The test is written with the ternary (conditional) operator found in many programming languages. If the line matches the regular expression /^"/, which means "begins with a quote", then AWK prints the variable "ors" followed by the whole line ($0). Since "ors" hasn't been defined yet when the first line is read, AWK treats it as an empty string and just prints the whole first line, namely "abc".
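Both ideas in that paragraph can be tested in isolation. In this sketch the "Q" and "n" flags and the square brackets are just markers I've chosen to make the results visible.

```shell
# A variable that has never been set prints as an empty string,
# so the brackets end up with nothing between them
unset_demo=$(awk 'BEGIN {print "[" ors "]"}')

# The ternary returns "Q" for a line starting with a quote, "n" otherwise
flags=$(printf '%s\n' '"abc"' 4 21 |
    awk '{printf "%s", (/^"/ ? "Q" : "n")} END {print ""}')

printf '%s\n%s\n' "$unset_demo" "$flags"
```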

The second part of the AWK command defines "ors" as the output record separator, ORS, a built-in AWK variable which by default is a newline. This means the next time AWK encounters a line beginning with a quote, it will print a newline first, then the line beginning with a quote.

If the line doesn't begin with a quote, AWK prints the output field separator, OFS, another built-in AWK variable, which by default is a single space. When AWK processes the second line of the file ("4"), it prints a space and a 4 after the "abc" printed after processing the first line. There's no newline between the "abc" and the "space + 4". The first newline (ORS) to be printed in the output comes when AWK encounters the "standard" line, because that line begins with a quote, and the "standard" line follows the newline.
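To see which separator goes in front of each line, the same test can be run in a "trace" version that names the separator instead of printing it. This is a sketch on a shortened list; the variables "sep" and "name" are hypothetical additions, not part of Morton's command.

```shell
trace=$(printf '%s\n' '"abc"' 4 '"standard"' 1 | awk '
{
    sep = (/^"/ ? ors : OFS)          # the same choice the one-liner makes
    if (sep == "")       name = "nothing"
    else if (sep == OFS) name = "OFS"
    else                 name = "ORS"
    print "line " NR ": " name " then " $0
    ors = ORS                         # from now on, "ors" holds a newline
}')
printf '%s\n' "$trace"
```

Only the first line is preceded by nothing; every later quoted line is preceded by ORS and every unquoted line by OFS.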

Note that whitespace separation was specified by the question-asker, but another separator, such as a comma, could be used instead of OFS:

awk3
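One way to do this (a sketch, and not necessarily the only approach) is to put a literal comma where OFS appears in the command:

```shell
# Same command as Method 1, with "," in place of OFS
csv=$(printf '%s\n' '"abc"' 4 21 22 25 '"standard"' 1 '"test"' 4 5 10 11 12 |
    awk '{printf "%s%s", (/^"/ ? ors : ","), $0; ors=ORS} END {print ""}')
printf '%s\n' "$csv"
```

Leaving OFS in the command and setting it on the command line with -v OFS="," should give the same result.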
 

Method 2. The second solution comes from SO contributor dawg:

awk '/^"/ {if (s) print s; s=$0; next} {s=s OFS $0} END {print s}' file

awk4

As usual, it helps in understanding this command to work it through the file line by line. The first line begins with a quote, so AWK goes to the first action in curly brackets, namely if (s) print s; s=$0; next. The variable "s" hasn't been defined yet, so there's nothing to print. AWK goes to the next command in the first action, which is to set "s" equal to the whole first line; "s" now contains "abc". The third command in the first action is "next", which sends AWK to the second line in the file, bypassing the second action in curly brackets.

The second line in the file doesn't begin with a quote, so AWK moves to the second action in curly brackets. Here it redefines "s" to be the current value of "s" followed by the OFS (a single space by default) followed by the whole second line. "s" now contains "abc" 4. This action is repeated with lines 3, 4 and 5, none of which begins with a quote.
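The way "s" grows can be watched by printing it after every line. The extra print statements in this sketch are my addition for illustration, and only the first three lines of the list are used.

```shell
growth=$(printf '%s\n' '"abc"' 4 21 | awk '
/^"/ { s = $0; print NR ": " s; next }   # quoted line: start a new "s"
     { s = s OFS $0; print NR ": " s }   # other lines: append to "s"
')
printf '%s\n' "$growth"
```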

When AWK reaches line 6 ("standard") it finds a starting quote, so it does the first action. "s" exists, it contains "abc" 4 21 22 25 and that string gets printed. AWK then resets "s" to "standard" and moves with "next" to line 7, and so on.

After AWK has processed the last line in the file, "s" contains "test" 4 5 10 11 12, but neither of the two main actions will print it. The command to print "test" 4 5 10 11 12 comes in the END statement, which runs once all the lines have been read, after which AWK exits.
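The whole walk-through can be summarised by spreading the command over several lines with comments. The program is dawg's; the layout, the comments and the second run without END are a sketch of mine.

```shell
list=$(printf '%s\n' '"abc"' 4 21 22 25 '"standard"' 1 '"test"' 4 5 10 11 12)

full=$(printf '%s\n' "$list" | awk '
/^"/ {                 # a line starting with a quote begins a new list
    if (s) print s     # print the previous list, if there is one
    s = $0             # start the new list with the quoted item
    next               # skip the second action for this line
}
{ s = s OFS $0 }       # any other line is appended to the current list
END { print s }        # the final list is still in "s", so print it here
')

# The same program without END: the last list is never printed
no_end=$(printf '%s\n' "$list" |
    awk '/^"/ {if (s) print s; s=$0; next} {s=s OFS $0}')

printf '%s\n---\n%s\n' "$full" "$no_end"
```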

And once again, the separator in the horizontal lists can be something other than a space:

awk5
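Because this command uses OFS to join the items, one possible way to change the separator (a sketch, not necessarily the only approach) is to set OFS on the command line with AWK's -v option:

```shell
# Same command as Method 2, with OFS set to a comma
csv2=$(printf '%s\n' '"abc"' 4 21 22 25 '"standard"' 1 '"test"' 4 5 10 11 12 |
    awk -v OFS="," '/^"/ {if (s) print s; s=$0; next} {s=s OFS $0} END {print s}')
printf '%s\n' "$csv2"
```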
 

Last update: 2019-01-20
The blog posts on this website are licensed under a
Creative Commons Attribution-NonCommercial 4.0 International License