Showing headlines posted by Bob_Mesibov
« Previous ( 1 2 3 4 5 6 7 8 9 ... 10 ) Next »Mojibake bonanza
More mojibake detective work: reconstructing the encoding histories of gibberished words from a museum database.
Comparing strings more clearly
One field has a name, the other has the name plus other info. Three methods are demonstrated for detecting when the name isn't the same in the two fields, and two of these methods make the differences a lot more obvious.
Re-format blah,YYYYMMDD,blah as blah,YYYY,MM,DD,blah
Two sed and 5 AWK methods for doing this re-formatting job are explained here, including a surprisingly simple trick that requires GNU AWK 4.2 or later.
How to stack columns
How to use the cut command to convert a "columnated" table into a straight up-and-down one.
Check the day of year, given a date
The ugliest function in my data-auditing toolkit is "chkday". It checks to see whether an ISO 8601 date (YYYY-MM-DD) and its day number are in agreement, where both are present in a data table. I tried two other ways to get the same result but the code is still ugly.
Updating a file from a lookup table
This post explains how to replace, append and insert values from a lookup table held in an AWK array. It's easier than it sounds.
How to keep an eye on field numbers
This week: I have a simple shell function, "fields", that generates a numbered list of the fields in a data table. With those field numbers I can then use AWK and the cut command efficiently on the command line. As an alternative I use a "fieldlist" function that puts the numbered list on my desktop like a Post-it note. Last week: A short rant about Python, R and UNIX.
How to use flags in AWK (revisited)
Flags in AWK are variables which are set to either true or false. They're handy for defining ranges over which AWK can act, as explained in this post.
The myth of equinoctial gales
A belief in the coincidence of strong winds and the equinoxes is handed down from generation to generation and reinforced by selective memory. It's not a belief supported by real-world wind data, worked up here on the command line.
Building a data table from a sentence
Punctuation is sometimes used as a kind of data compression tool. This post details the expansion of a particularly condensed data structure into a simple table using command line tools.
Encoding detection smackdown
enca vs file vs iconv vs isutf8 vs uchardet? The file and isutf8 commands are best for detecting whether or not a file contains only UTF-8 characters.
Finding one-to-many entries in a data table
The command described in this post answers this question: Are there records in your data table with non-blank entries in field 1 that have multiple corresponding entries in field 2, instead of just an expected one entry?
Spotting spaces, and AWK's view of emptiness
Presenting a sed command that highlights plain whitespaces and marks each one with a dot for easy counting, plus a re-examination of an AWK shorthand showing that "non-empty" is different from "non-empty and non-zero".
Checking DIY primary/foreign key relationships
In some of the data tables I audit, the primary and foreign keys have been built manually, often in Excel spreadsheets. Checking for the resulting stuff-ups can be done on the command line.
What's wrong with these records?
When my wife added a tab-delimited text table as a layer in her GIS program, nine of her points got rejected. This was odd because the XY values in those records (UTM eastings and northings) were no different in format from all the other XY pairs in the table, and were well within the numerical range of the others as well. So why the nine rejections? I tried some tinkering on the command line to find out.
How to do a both/neither/one/other tally
A routine check I do on data tables is to see if paired fields are either both filled or both empty in each record. By "paired fields" I mean, for example, a latitude field and a longitude field. It doesn't make much sense to have a latitude without a longitude, or vice versa! The command described here will tally up the "both filled", "both empty", "only the first filled" and "only the second filled" entries.
Sharing data and metadata together
Metadata is data about data. For a data table, the metadata might include information about what the table is, who compiled it, what its sources are, what its field names mean and what its abbreviations stand for. The table's metadata normally stand outside the rows and columns of the table itself, so how can you keep a table and its metadata together, but still make the table data available for re-use?
A grizzle about captive data
A co-worker gave me some data for checking. The client had sent it in a RAR file. Inside the RAR was a Microsoft Access database. Inside the Access database was a single Access table, and inside the Access table was the data. I don't know why the client did that matryoshka-style data packing, but I know that many people don't understand that software isn't data. In this case, the raw data was just plain text (letters and numbers, punctuation, spaces, tabs and newlines), but the data was imprisoned in Microsoft Access, which is mostly secret gobbledygook.
A quick repair job on a dislocated table
Something was seriously wrong with the data table I was checking. In many but not all records, a couple of characters had been converted to a field separator. This displaced all the subsequent fields in those records one field to the right, and the last field had been trimmed off. My quick, Band-Aid® solution was an AWK command.
Extra commas in a CSV
Every once in a while I'm given a haunted CSV to audit. It looks like it contains the ghosts of entries that were once in right-hand-side columns and bottom-most rows, but were later deleted, leaving long strings of commas.