Showing headlines posted by Bob_Mesibov
JSON Lines: record-style JSON
The JSON Lines data format bridges the gap between standard JSON and table-style data. Each JSON object is on a separate line, allowing line-by-line processing with shell tools.
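As a rough illustration of line-by-line processing (not code from the post itself), here is a minimal sketch using only grep and wc. The sample records and the function names jsonl_sample, count_records and match_name are hypothetical, and the grep pattern assumes the exact `"name": "value"` spacing shown; a real JSON parser would be more robust.

```shell
# Hypothetical sample data: one JSON object per line (JSON Lines).
jsonl_sample() {
    printf '%s\n' \
        '{"id": 1, "name": "alpha"}' \
        '{"id": 2, "name": "beta"}' \
        '{"id": 3, "name": "alpha"}'
}

# Each line is one record, so counting lines counts records.
# tr strips the padding that some wc implementations add.
count_records() {
    wc -l | tr -d ' '
}

# Keep whole records whose "name" value equals the first argument.
# This is a fixed-string text match, not a JSON parse.
match_name() {
    grep -F "\"name\": \"$1\""
}

jsonl_sample | count_records                      # prints 3
jsonl_sample | match_name alpha | count_records   # prints 2
```

Because every record sits on its own line, the filtered output is itself valid JSON Lines and can be piped into further tools.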
Hunting gremlins
In the UTF-8 files I audit, the only invisible characters I expect to see... er... not see... are whitespace, horizontal tab and linefeed. All others I call "gremlins". They include carriage return, no-break space, soft hyphen and another 62 control characters. Gremlins are a nuisance. One gremlin causes a shell to hang. Less evil gremlins lurk inside apparently OK strings and cause the strings to be processed weirdly. This post explains a new script that locates and visualises gremlins in tab-separated tables.
Getting around a subshell problem
I was running some commands within a BASH subshell and they worked fine, except when they didn't. The workaround avoids what seems to be a buffering issue.
Build your own character class inventories
This post explains how to use AWK printf formatting to inventory the POSIX character classes in your system, and how to "see" some invisible characters.
Msot popele can undreatnsd tihs setennce
Text can be garbled and ungarbled with shell scripts based on BASH string functions.
Emphasising text in the terminal
Working with plain text data in a terminal means that you're not distracted by formatting. The characters are all the same colour and font weight, and there's no highlighting. Sometimes, however, a bit of emphasis in a particular string would be welcome.
Another surprising AWK trick
If you tell AWK that you're doing arithmetic operations on strings beginning with numbers, it ignores the non-number parts of the string. On the other hand, if you happen to be going to St Ives and meet 7 wives, 7 sacks, 7 cats...
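The coercion described here can be seen with a one-line AWK command. This is a minimal sketch rather than the post's own example; the function name leading_number is hypothetical.

```shell
# Print the number AWK finds at the start of each input line.
# Adding 0 forces a string-to-number conversion, which keeps the
# leading numeric part and ignores whatever follows it.
leading_number() {
    awk '{ print $0 + 0 }'
}

echo "7 wives" | leading_number   # prints 7
echo "12abc" | leading_number     # prints 12
echo "abc" | leading_number       # prints 0 (no leading number)
```

A string with no leading number silently becomes 0, which is worth remembering before summing a column that might contain text.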
Data validation on entry with YAD
The best way to clean a data table? Clean the data before it gets entered. Build an error-free lookup list and enforce its use.
Python and shell tools
I'm not a pythonista, and what little I know about Python for data work amounts to a few published recipes. Out of curiosity, I sometimes compare those recipes with the GNU/Linux tools I use every day. Here are three such comparisons.
Embedded newlines without a clue
Sometimes I see tables with embedded newlines and no special marker for "end-of-record". Worse yet, the program that allowed embedded newlines in data items (for example, a spreadsheet) didn't escape the embedded newline characters when exporting as text, or didn't double-quote all the data items so I can tell where a data item begins and ends. How to fix?
Topping and tailing, and the slowness of GNU sort
I noticed that with big tables, both my field-tallying function and its top-and-tailing variant ran pretty slowly. A simple diagnostic showed that the rate-limiting step was GNU sorting.
Introducing the replo
A typo is a text error in typing or typesetting, usually caused by a human pressing the wrong key on a keyboard. What I call a replo is a text error caused by a computer replacing one or more characters with a question mark, mojibake, a replacement character or some other unwelcome substitute.
VisiData: a table explorer for the terminal
VisiData lets you quickly open, explore, filter and analyse tables of data, without leaving the terminal. It's fast, easy on memory and format-agnostic, but be prepared to learn a lot of VisiData keyboard commands.
How to guess the field separator in a table
So I download the table "blahblahblah.csv" for data auditing. Muttering a quick prayer to MIME, the goddess of file formats, I open the table. Yes! Despite the ".csv" filename suffix, the table is actually tab-separated, not comma-separated. Once again I've escaped the dreaded Curse of the CSV Monster. But did I have to actually open the file to check the format?
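One possible way to check without opening the file, offered as a sketch and not necessarily the method the post describes, is to count tabs and commas in the header line. The function name guess_separator and the "more tabs than commas" rule are my assumptions; the rule will misfire on a tab-separated header whose field names contain many commas.

```shell
# Guess the field separator from the first line on standard input.
# gsub() returns the number of replacements made, so replacing each
# character with itself ("&") simply counts occurrences.
guess_separator() {
    head -n 1 | awk '{
        tabs = gsub(/\t/, "&")
        commas = gsub(/,/, "&")
        if (tabs == 0 && commas == 0) print "unknown"
        else if (tabs >= commas) print "tab"
        else print "comma"
    }'
}

printf 'id\tname\tdate\n1\tx\ty\n' | guess_separator   # prints tab
printf 'id,name,date\n' | guess_separator              # prints comma
```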
A command-line
Use this super-fast but little-known Linux utility while you're watching the long-running TV game show.
A muggle's guide to AWK arrays: 4
GNU AWK version 4 has an easy and flexible way to sort array outputs by index string or value string.
Add leading zeroes that aren't really leading
A leading zero can be a useful addition to a number, and there are several ways to add one or more leading zeroes on the command line. The addition is a little less straightforward if the leading zero sits inside a non-numeric string.
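To make the distinction concrete, here is a minimal sketch of both cases. The function names pad_number and pad_suffix, the width of 3 and the sample strings are all hypothetical, and pad_suffix only handles digits at the very end of a string.

```shell
# Plain number: printf pads to 3 digits with leading zeroes.
# Note: the shell printf reads an argument like 08 as invalid octal,
# so this expects input without existing leading zeroes.
pad_number() {
    printf '%03d\n' "$1"
}

# Number at the end of a non-numeric string: split the line into
# prefix and trailing digits, then pad only the digits.
pad_suffix() {
    awk '{
        if (match($0, /[0-9]+$/)) {
            prefix = substr($0, 1, RSTART - 1)
            digits = substr($0, RSTART)
            printf "%s%03d\n", prefix, digits
        } else {
            print
        }
    }'
}

pad_number 7                  # prints 007
echo "plot7" | pad_suffix     # prints plot007
echo "plot12" | pad_suffix    # prints plot012
```

Padded this way, strings such as plot007 and plot012 sort in the same order alphabetically and numerically.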
Getting data from an Enphase Envoy S
The Envoy S solar system monitor runs Linux and contains a Web server. Two simple requests to that server will return real-time and summary data for the system as a whole and for individual microinverters.
A GUI to re-order fields in a table
Re-ordering fields in a table with lots of fields isn't simple on the command line (or in a spreadsheet!). The job can be made easier with a GUI dialog, as demonstrated in this post.
A muggle's guide to AWK arrays: 3
An AWK array can be something like a lookup table, held in memory. You can use that lookup table for data operations on another file, and in this post I demonstrate reformatting and table joining.
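The lookup-table idea can be sketched with the common two-file AWK idiom. This is an illustration under my own assumptions, not the post's code: the function name join_lookup, the tab-separated layout with the key in field 1, and the "NA" placeholder are all hypothetical.

```shell
# Usage: join_lookup LOOKUP_FILE DATA_FILE
# Both files are tab-separated with the key in field 1.
# While reading the first file (NR == FNR), store field 2 in an
# array indexed by the key. While reading the second file, append
# the stored value to each line, or "NA" if the key is not found.
join_lookup() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { name[$1] = $2; next }
        { print $0, (($1 in name) ? name[$1] : "NA") }
    ' "$1" "$2"
}

# Example (file names are placeholders):
#   join_lookup codes.tsv records.tsv
```

The NR == FNR test is only true for the first file as long as that file is not empty. The whole lookup table is held in memory, so the data file can be streamed without sorting either file first.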