LXer: Showing headlines posted by Bob

Mojibake bonanza

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Dec 17, 2020 8:08 AM EDT)
Story Type: Tutorial

More mojibake detective work: reconstructing the encoding histories of gibberished words from a museum database.

Comparing strings more clearly

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Dec 9, 2020 9:55 AM EDT)
Story Type: Tutorial

One field has a name, the other has the name plus other info. Three methods are demonstrated for detecting when the name isn't the same in the two fields, and two of these methods make the differences a lot more obvious.

Re-format blah,YYYYMMDD,blah as blah,YYYY,MM,DD,blah

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Dec 2, 2020 1:31 PM EDT)
Story Type: Tutorial

Two sed and 5 AWK methods for doing this re-formatting job are explained here, including a surprisingly simple trick that requires GNU AWK 4.2 or later.

How to stack columns

BASHing data (Posted by Bob_Mesibov on Nov 24, 2020 8:21 PM EDT)
Story Type: Tutorial; Groups: Linux

How to use the cut command to convert a "columnated" table into a straight up-and-down one.

Check the day of year, given a date

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Nov 18, 2020 4:16 AM EDT)
Story Type: Tutorial

The ugliest function in my data-auditing toolkit is "chkday". It checks to see whether an ISO 8601 date (YYYY-MM-DD) and its day number are in agreement, where both are present in a data table. I tried two other ways to get the same result but the code is still ugly.

Updating a file from a lookup table

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Nov 12, 2020 10:36 AM EDT)
Story Type: Tutorial

This post explains how to replace, append and insert values from a lookup table held in an AWK array. It's easier than it sounds.

How to keep an eye on field numbers

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Nov 4, 2020 10:21 PM EDT)
Story Type: Tutorial

This week: I have a simple shell function, "fields", that generates a numbered list of the fields in a data table. With those field numbers I can then use AWK and the cut command efficiently on the command line. As an alternative I use a "fieldlist" function that puts the numbered list on my desktop like a Post-it note. Last week: A short rant about Python, R and UNIX.

How to use flags in AWK (revisited)

BASHing data (Posted by Bob_Mesibov on Oct 20, 2020 6:16 PM EDT)
Story Type: Tutorial; Groups: Linux

Flags in AWK are variables which are set to either true or false. They're handy for defining ranges over which AWK can act, as explained in this post.

The myth of equinoctial gales

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Oct 14, 2020 3:59 PM EDT)
Story Type: Tutorial

A belief in the coincidence of strong winds and the equinoxes is handed down from generation to generation and reinforced by selective memory. It's not a belief supported by real-world wind data, worked up here on the command line.

Building a data table from a sentence

BASHing data (Posted by Bob_Mesibov on Oct 6, 2020 5:24 PM EDT)
Story Type: Tutorial; Groups: Linux

Punctuation is sometimes used as a kind of data compression tool. This post details the expansion of a particularly condensed data structure into a simple table using command line tools.

Encoding detection smackdown

BASHing data (Posted by Bob_Mesibov on Sep 22, 2020 11:48 PM EDT)
Story Type: Tutorial; Groups: Linux

enca vs file vs iconv vs isutf8 vs uchardet? The file and isutf8 commands are best for detecting whether or not a file contains only UTF-8 characters.

Finding one-to-many entries in a data table

BASHing data (Posted by Bob_Mesibov on Sep 15, 2020 10:09 PM EDT)
Story Type: Tutorial; Groups: Linux

The command described in this post answers this question: Are there records in your data table with non-blank entries in field 1 that have multiple corresponding entries in field 2, instead of just an expected one entry?

Spotting spaces, and AWK's view of emptiness

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Sep 9, 2020 5:36 PM EDT)
Story Type: Tutorial

Presenting a sed command that highlights plain whitespaces and marks each one with a dot for easy counting, plus a re-examination of an AWK shorthand showing that "non-empty" is different from "non-empty and non-zero".

Checking DIY primary/foreign key relationships

BASHing data (Posted by Bob_Mesibov on Sep 2, 2020 2:19 AM EDT)
Story Type: Tutorial; Groups: Linux

In some of the data tables I audit, the primary and foreign keys have been built manually, often in Excel spreadsheets. Checking for the resulting stuff-ups can be done on the command line.

What's wrong with these records?

BASHing data (Posted by Bob_Mesibov on Aug 26, 2020 8:04 AM EDT)
Story Type: Tutorial; Groups: Linux

When my wife added a tab-delimited text table as a layer in her GIS program, nine of her points got rejected. This was odd because the XY values in those records (UTM eastings and northings) were no different in format from all the other XY pairs in the table, and were well within the numerical range of the others as well. So why the nine rejections? I tried some tinkering on the command line to find out.

How to do a both/neither/one/other tally

BASHing data (Posted by Bob_Mesibov on Aug 18, 2020 9:57 PM EDT)
Story Type: Tutorial; Groups: Linux

A routine check I do on data tables is to see if paired fields are either both filled or both empty in each record. By "paired fields" I mean, for example, a latitude field and a longitude field. It doesn't make much sense to have a latitude without a longitude, or vice versa! The command described here will tally up the "both filled", "both empty", "only the first filled" and "only the second filled" entries.

Sharing data and metadata together

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Jul 29, 2020 7:41 PM EDT)
Story Type: Tutorial

Metadata is data about data. For a data table, the metadata might include information about what the table is, who compiled it, what its sources are, what its field names mean and what its abbreviations stand for. The table's metadata normally stand outside the rows and columns of the table itself, so how can you keep a table and its metadata together, but still make the table data available for re-use?

A grizzle about captive data

BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Jul 22, 2020 2:59 AM EDT)
Story Type: Tutorial

A co-worker gave me some data for checking. The client had sent it in a RAR file. Inside the RAR was a Microsoft Access database. Inside the Access database was a single Access table, and inside the Access table was the data. I don't know why the client did that matryoshka-style data packing, but I know that many people don't understand that software isn't data. In this case, the raw data was just plain text (letters and numbers, punctuation, spaces, tabs and newlines), but the data was imprisoned in Microsoft Access, which is mostly secret gobbledygook.

A quick repair job on a dislocated table

BASHing data (Posted by Bob_Mesibov on Jul 14, 2020 11:12 PM EDT)
Story Type: Tutorial; Groups: Linux

Something was seriously wrong with the data table I was checking. In many but not all records, a couple of characters had been converted to a field separator. This displaced all the subsequent fields in those records one field to the right, and the last field had been trimmed off. My quick, Band-Aid® solution was an AWK command.

Extra commas in a CSV

BASHing data (Posted by Bob_Mesibov on Jul 7, 2020 8:11 PM EDT)
Story Type: Tutorial; Groups: Linux

Every once in a while I'm given a haunted CSV to audit. It looks like it contains the ghosts of entries that were once in right-hand-side columns and bottom-most rows, but were later deleted, leaving long strings of commas.

Linux News
The world is talking about GNU/Linux and Free/Open Source Software

Login

Today's Big Story

LXer Features

Have something to say?

Latest Discussions

Site Menu

Other News

Showing headlines posted by Bob_Mesibov