Showing headlines posted by Bob_Mesibov
People are the best data cleaners
This is the 200th and last "BASHing data" blog post. I've enjoyed writing this blog over the past four years and I hope command-line users have found it helpful. If I could condense "A Data Cleaner's Cookbook" and the blog into a single piece of advice for data workers, it's this: learn AWK.
Mapping with gnuplot, part 3
In the last post I explained how I built a base map with a lat/lon grid using gnuplot. Here I describe how the base map was used to generate GIF animations.
Mapping with gnuplot, part 2
The first part of this series appeared back in 2018. It describes how you can use gnuplot to build a simplified GIS. I wasn't intending to do any more with this idea, but I was tinkering with animated map GIFs and realised that the gnuplot approach had some advantages.
gron the JSON flattener
gron is a self-contained Go executable you can download from GitHub. In the UNIX tradition, gron does one thing well: it flattens JSON into a structure that's easily processed by shell tools, line by line.
How to flatten ("unpivot") a data table
Flattening a rectangular data table means converting the table into a list of its component triplets (row, column, value). The command-line flattening shown here can also be used to flatten within a spreadsheet.
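As a rough illustration of the idea (not necessarily the method used in the post), here is a minimal AWK sketch. It assumes a tab-separated table whose first line holds column names and whose first field on each line is the row name. The function name "flatten" is made up for this example.

```shell
# flatten: print one (row, column, value) triplet per line.
# Assumptions: tab-separated input, header line first,
# row names in field 1. Reads files given as arguments, or stdin.
flatten() {
    awk -F'\t' -v OFS='\t' '
        # Header line: remember each column name by field number
        NR == 1 { for (i = 2; i <= NF; i++) col[i] = $i; next }
        # Data lines: emit row name, column name, value
        { for (i = 2; i <= NF; i++) print $1, col[i], $i }
    ' "$@"
}

printf 'id\tx\ty\nr1\t1\t2\nr2\t3\t4\n' | flatten
```

The sample table has two rows and two value columns, so the result is four triplets: r1/x/1, r1/y/2, r2/x/3 and r2/y/4.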
Search for (exact) strings; report line, column and context
GNU grep is a great utility but it can only report a search target's line number. Because field location is often important in my data work, I wrote a function that searches for an exact string and returns the string's line and field location (field number and field name) plus the data item containing the string, with the string coloured red.
Auto-incrementing version letters
I have several scripts that auto-increment a number in a data table. That's not hard to do on the command line. But what if I wanted to have version letters at the end of mixed letter-number codes, like 101a, 101b, 101c etc? And also wanted to increment version numbers after a letter cycle, like 101c, 101d, 102a, 102b...? As is usually the case with command-line operations, there's more than one way to do it. This post looks at a couple of solutions to this particular problem, namely incrementing (single) version letters and version numbers on a fixed letter cycle.
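One possible way to do this, offered only as a sketch and not as either of the post's solutions, is shown below. It assumes each code is a number followed by exactly one letter, and that the letter cycle is a fixed string (here "abcd" by default). The function name "next_code" is hypothetical.

```shell
# next_code CODE [CYCLE]: print the code that follows CODE.
# Assumptions: CODE is digits plus one trailing letter, and that
# letter occurs in CYCLE (default "abcd"). A letter not in CYCLE
# is treated as if the cycle were starting over at its first letter.
next_code() {
    printf '%s\n' "$1" | awk -v cycle="${2:-abcd}" '{
        num = substr($0, 1, length($0) - 1)        # numeric part
        pos = index(cycle, substr($0, length($0))) # place in cycle
        if (pos == length(cycle))
            print (num + 1) substr(cycle, 1, 1)    # roll over
        else
            print num substr(cycle, pos + 1, 1)    # next letter
    }'
}

next_code 101c
next_code 101d
```

With the default cycle, 101c becomes 101d and 101d rolls over to 102a. Note that the rollover does arithmetic on the number, so any leading zeroes would be lost.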
Online shopping and a one2many tweak
Online commerce sites often show a selection of items below the one you're after, with a caption something like "People who bought this item also bought..." It's a marketing ploy, the aim being to encourage you to buy something else while you're visiting the site. It also suggests an interesting question: what product combinations are bought most and least frequently by individual shoppers? Here's a command-line approach to this question.
DNA-style frameshift cryptography
The 3-character codons in a DNA sequence can be read in any one of three ways, depending on where the triplet-reading starts. Imagine doing that with 8-character bytes.
Apple + Microsoft = character confusion
A Mac-using client wanted to save a Microsoft Word .docx as a plain text document. The .docx was stored in an iCloud folder. Downloading and opening the file in Word for Mac, the client chose "Save without formatting (.txt)". What could go wrong?
How to use patsplit (GNU AWK)
The patsplit function was introduced in version 4.0 of GNU AWK. It's a string splitter, and it allows you to dissect a string more flexibly than you can with AWK's substr function.
Scripting a temperature notifier
This desktop notifier reports minimum overnight temperature and the latest temperature as well - just what I need to know for my early morning walk. Also: the third and final version of "A Data Cleaner's Cookbook" is now online.
A dog-cat-horse-turtle problem
Sometimes text-processing problems have so many possible command-line solutions, it's hard to decide which is best. This 2021 problem is a good example.
Tidy tables for data processing
I've seen some very pretty data tables in spreadsheets, on webpages and in word-processed documents. There were lots of colours and careful attention had been paid to font, font size and font emphasis. Of course, all that colour and data decoration is for human eyes. If the same tables were to be processed digitally, the processing program wouldn't care what the table looks like. It just wants the data to be tidy and workable. Here's what that means.
Are you 10000 days old yet?
Suppose you were born on 22 December 1995. Have you already had your 10000-day birthday, or is that still in the future? Here are three command-line ways to find out.
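For illustration, here is one way it might be done with epoch seconds (a sketch, and not necessarily one of the three ways in the post). It assumes GNU date and works in UTC to sidestep daylight-saving shifts. The function name "days_after" is made up for this example.

```shell
# days_after DATE N: print the date N days after DATE (YYYY-MM-DD).
# Assumes GNU date; -u keeps every day exactly 86400 seconds long.
days_after() {
    start_epoch=$(date -u -d "$1" +%s) || return 1
    date -u -d "@$((start_epoch + $2 * 86400))" +%F
}

days_after 1995-12-22 10000
```

For a birth date of 22 December 1995 this prints 2023-05-09, which can then be compared with today's date from "date +%F".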
Building an ODT on the command line
This post explains a shell script that creates a formatted ODT without involving a GUI word processor. The script uses LibreOffice Writer to convert an HTML file to an ODT on the command line. The HTML doesn't have to be up-to-date, either, or use CSS for styling.
Making a transect into a point and circle
The Well-Known Text (WKT) format can be used to describe a straight-line transect from one point to another point. Another way to describe a straight-line transect is with its midway point plus the radius of a circle which includes the whole of the transect. This point-radius description can be built with a single AWK command, as described in this post.
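A simplified sketch of the point-radius idea follows. It is not the post's command. It assumes a two-point WKT LINESTRING in projected (planar) coordinates, so the midpoint is a plain average and the radius is half the straight-line length. Latitude/longitude input would need a geodesic distance instead. The function name "transect_to_circle" is hypothetical.

```shell
# transect_to_circle: read two-point WKT linestrings on stdin and
# print "mid_x mid_y radius" for each.
# Assumption: planar coordinates, e.g. eastings/northings in metres.
transect_to_circle() {
    awk -F'[(), ]+' '{
        # Fields: $1 = "LINESTRING", then x1 y1 x2 y2
        dx = $4 - $2
        dy = $5 - $3
        printf "%.1f %.1f %.1f\n", ($2 + $4) / 2, ($3 + $5) / 2, sqrt(dx * dx + dy * dy) / 2
    }'
}

printf 'LINESTRING (0 0, 6 8)\n' | transect_to_circle
```

The sample transect runs from (0, 0) to (6, 8), which is 10 units long, so the output is a midpoint of 3.0 4.0 and a radius of 5.0.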
Detecting truncations: another sometimes successful method
I occasionally find truncated strings in a data audit. These usually appear when a data item N characters long has been entered in a free-text field with a character limit smaller than N. Detecting truncations programmatically is difficult and not always successful. This post describes the latest of four methods I've tried.
Gremlin detection bigly improved and a NUL problem avoided
"Gremlin" is my name for an invisible character other than a plain whitespace, a linefeed or a horizontal tab. Gremlins can cause errors in data processing and can also make it harder to detect duplicate records in a data table. The newest version of a gremlin detector script (for UTF-8-encoded plain text files) is demonstrated in this blog post, with notes on the sometimes difficult NUL byte.
Combinations from 2 lists: speed trials
This post was inspired by a recently published scientific paper describing how Python was used to build a list of a million scientific names. Each name was composed of parts taken from a list, and combinations of those parts were generated. The result was something like a Cartesian product, about which I've blogged before. This time I was interested in performance: how does the time required to get a result vary with the number of combinations to be built?