Showing headlines posted by Bob_Mesibov


Dog and cat data

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Mar 30, 2019 5:31 PM EDT)
  • Story Type: Tutorial
The Australian government's open data portal has a surprisingly large amount of data on dogs and cats. In this post I look at five of the datasets with command-line tools.

How to choose special characters, revisited

There's no euro symbol on my keyboard, but I can enter that character in any document or in my terminal with Ctrl + Shift + u + 20ac. I can do the same with "umlaut a" (00e4) and "cedilla c" (00e7) and the degree symbol (00b0) and... Wait! Who am I kidding? There's no way I can remember all those Unicode code points. For this reason I wrote a script for quick and easy retrieval of my most-often-used special characters from a GUI.

The trouble with Windows CRLF

Windows line endings are a pain in the ... terminal. They muck up the operations of AWK, comm, diff, echo, grep, join, paste, read, rev, sed and tr.
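
One common remedy, sketched here as an illustration rather than as the method from the post, is to delete the carriage returns before any other tool sees the data. The function name strip_cr is hypothetical. Note that tr removes every carriage return, not only those at line ends.

```shell
# strip_cr: delete all carriage-return characters from standard input.
# (Hypothetical helper name; it removes every CR, not just trailing ones.)
strip_cr() {
    tr -d '\r'
}

# Two CRLF-terminated lines come out with plain LF endings.
printf 'alpha\r\nbeta\r\n' | strip_cr
```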

Data with bulges

Data analysis sometimes turns up unexpected "bulges" in the value of data items. Forensic auditors are trained to look for such things in business and banking accounts, because a bulge might be evidence of fraud or embezzlement. In the following three examples, bulges appear for more innocent reasons.

Two special data validations

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Mar 2, 2019 6:18 PM EDT)
  • Story Type: Tutorial
All data validations are special cases. You can always identify data "of the wrong sort" that you want to exclude from data processing, but how do you define "right" and "wrong"? It depends! This post explains two "special" validations.

Data from dingbats: copying down

Copying down is easy in a spreadsheet, but it's also possible on the command line. In this post, copying down is used to repair a messy table.
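
As a rough sketch of what copying down can look like in AWK (assuming a tab-separated table whose first field is left blank when it repeats the value above; the function name fill_down is hypothetical and this is not necessarily the command used in the post):

```shell
# fill_down: in a tab-separated table, replace an empty first field
# with the most recent non-empty value seen in that field.
fill_down() {
    awk 'BEGIN { FS = OFS = "\t" }
         {
             if ($1 == "") $1 = last   # blank: copy the value down
             else last = $1            # non-blank: remember it
             print
         }'
}

printf 'fruit\tapple\n\tpear\nveg\tleek\n\tkale\n' | fill_down
```

The second and fourth records come out with "fruit" and "veg" filled in.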

Fancy numbering of records

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Feb 17, 2019 3:52 PM EDT)
  • Story Type: Tutorial
With the "nl" and "uuidgen" commands and AWK, you can number a list of records any way you like on the command line.
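
A minimal AWK sketch of one numbering style: a caller-supplied prefix followed by a zero-padded record number. The function name number_records and the "REC-" prefix are made up for illustration; the post itself may do this differently, for example with nl's own formatting options.

```shell
# number_records PREFIX: prepend PREFIX plus a 3-digit, zero-padded
# record number and a tab to every line of standard input.
number_records() {
    awk -v p="$1" '{ printf "%s%03d\t%s\n", p, NR, $0 }'
}

printf 'ant\nbee\nwasp\n' | number_records 'REC-'
```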

Getting data out of Excel safely

Excel is perfectly OK for what it does, and millions of people happily use Excel every day. But when Excel data get exported for use in various other applications, sometimes Bad Things Happen.

Comparing fields across two tables

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Feb 2, 2019 5:43 PM EDT)
  • Story Type: Tutorial
It's a shell-user's axiom: if you find yourself typing certain commands again and again, script them. This script saves me time when checking if the contents of a field have changed when data are moved from one data table to another.
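
The script from the post is not reproduced here. As a hedged sketch of the general idea only, the following assumes two tab-separated tables that share a unique record ID in field 1, and reports the records whose chosen field differs. The function name compare_field and the sample data are hypothetical.

```shell
# compare_field OLD NEW N: for IDs (field 1) present in both tab-separated
# files, print ID, old value and new value wherever field N differs.
compare_field() {
    awk -v f="$3" 'BEGIN { FS = "\t" }
        NR == FNR { old[$1] = $f; next }          # first file: store values
        ($1 in old) && (old[$1] "") != ($f "") {  # second file: compare as strings
            print $1 "\t" old[$1] "\t" $f
        }' "$1" "$2"
}

# Demonstration with throwaway files.
demo=$(mktemp -d)
printf 'id1\tred\nid2\tblue\n'  > "$demo/old.tsv"
printf 'id1\tred\nid2\tgreen\n' > "$demo/new.tsv"
compare_field "$demo/old.tsv" "$demo/new.tsv" 2
rm -r "$demo"
```

The sketch assumes the first file is not empty, since the NR == FNR idiom relies on that.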

Reformatting a list, cleverly

A recent Stack Overflow problem was solved with ingenious commands from two AWK experts. In this post I explain the solutions in detail.

Parsing scientific names

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Jan 20, 2019 11:59 AM EDT)
  • Story Type: Tutorial
Scientific names like "Hoplatessara luxuriosa (Silvestri, 1895)" are much harder to parse than personal names, but "gnparser" can do the job on the command line.

Horizontal sorting within a field

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Jan 13, 2019 11:57 AM EDT)
  • Story Type: Tutorial
There are two different ways to sort a field "horizontally" on the command line, but neither of them is simple.
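
To give a feel for the problem, here is a small sketch that sorts the comma-separated items inside a single value by turning them into lines, sorting, and rejoining. It is only one possible approach and may not match either method in the post; the function name sort_items is hypothetical, and the sort order depends on the current locale.

```shell
# sort_items VALUE: sort the comma-separated items within VALUE.
sort_items() {
    printf '%s\n' "$1" |
        tr ',' '\n' |      # one item per line
        sort |             # ordinary "vertical" sort
        paste -s -d ',' -  # rejoin the lines with commas
}

sort_items 'pear,apple,fig'
```

Applying this to one field of every record in a table would need a loop or an AWK wrapper around it, which is part of why the task is not simple.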

Drugs on the command line

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Jan 5, 2019 5:37 PM EDT)
  • Story Type: Tutorial
A publicly available dataset on registered drugs from the US Food and Drug Administration is a low-quality mess.

Changing the month format: a fairly general solution

The same month can have different but perfectly valid formats, like September, Sep, 9, 09 and ix. Conversions between formats are easier with a simple table of equivalents.
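
The table-of-equivalents idea can be sketched as a lookup: find the row containing the given month in any format, then print the column holding the wanted format. The column layout, the function names month_table and convert_month, and the lowercase Roman numerals are assumptions made for this illustration.

```shell
# month_table: one row per month; columns are full name, abbreviation,
# number, zero-padded number, lowercase Roman numeral.
month_table() {
    cat <<'EOF'
January Jan 1 01 i
February Feb 2 02 ii
March Mar 3 03 iii
April Apr 4 04 iv
May May 5 05 v
June Jun 6 06 vi
July Jul 7 07 vii
August Aug 8 08 viii
September Sep 9 09 ix
October Oct 10 10 x
November Nov 11 11 xi
December Dec 12 12 xii
EOF
}

# convert_month VALUE COLUMN: print VALUE's equivalent from column COLUMN.
convert_month() {
    month_table | awk -v m="$1" -v col="$2" '{
        for (i = 1; i <= NF; i++)
            if (($i "") == (m "")) { print $col; exit }  # string comparison
    }'
}

convert_month Sep 1   # full name
convert_month ix 4    # zero-padded number
```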

Has the rainfall pattern in my hometown changed?

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Dec 22, 2018 4:41 PM EDT)
  • Story Type: Tutorial
From 1916 to 2015 there were only minor ups and downs in the number and intensity of rainfall events. Interesting swings in event length might explain why older locals say "The rain's changed".

How many fruits in 5 apples, 3 oranges, 1 pear and 17 lemons?

On the command line, you can do sums like this either by looking just at the numbers, or by ignoring the parts that aren't numbers — and those aren't quite the same thing.
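
Two hedged sketches of how such a sum might be done. The first blanks out everything that is not a digit and adds up what remains; the second lets AWK add every whitespace-separated field, with fields that do not start with a number counting as zero. Whether this is exactly the distinction drawn in the post is an assumption, and both function names are hypothetical.

```shell
# sum_digit_runs TEXT: replace every non-digit with a space, then add
# up the remaining runs of digits.
sum_digit_runs() {
    printf '%s\n' "$1" | tr -c '0-9\n' ' ' |
        awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

# sum_by_fields TEXT: add up every field as AWK sees it; a field with
# no leading number contributes 0.
sum_by_fields() {
    printf '%s\n' "$1" |
        awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

sum_digit_runs '5 apples, 3 oranges, 1 pear and 17 lemons'
sum_by_fields  '5 apples, 3 oranges, 1 pear and 17 lemons'
```

Both print 26 for the headline example. They diverge on text like "item7 and 3kg": the first counts the 7 inside "item7" and gives 10, while the second only sees the leading 3 of "3kg" and gives 3.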

Putting information into a table from the table's filename

How to pull information out of a table's filename and add it to each record in the table. A typical case: taking the date from a date-stamped filename and adding it to records that lack a date, so that the files can be combined for a time-series study.
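
A minimal sketch of the task, assuming tab-separated files whose names contain a date in YYYY-MM-DD form (the naming scheme, sample data and function name add_file_date are hypothetical, and the post may use a different method):

```shell
# add_file_date FILE...: append the YYYY-MM-DD date found in each file's
# name as a new last field on every record of that file.
add_file_date() {
    awk 'BEGIN { OFS = "\t" }
         FNR == 1 {
             # Work on the basename only, so directory names are ignored.
             n = split(FILENAME, parts, "/")
             d = ""
             if (match(parts[n], /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/))
                 d = substr(parts[n], RSTART, RLENGTH)
         }
         { print $0, d }' "$@"
}

# Demonstration with a throwaway file.
demo=$(mktemp -d)
printf 'site1\t12\nsite2\t7\n' > "$demo/counts_2018-12-06.tsv"
add_file_date "$demo/counts_2018-12-06.tsv"
rm -r "$demo"
```

Because the date is re-read at the first record of each file, several date-stamped files can be passed at once and the output concatenated.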

Finding changepoints in a list, revisited

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Dec 6, 2018 12:14 PM EDT)
  • Story Type: Tutorial
Use a simple AWK command to locate the places in a list where the value of a data item suddenly changes.
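
One way such a command could look, offered as an illustrative sketch and not as the command from the post: remember the previous line and report the line number whenever the current line differs from it. The function name changepoints is hypothetical.

```shell
# changepoints: print "LINE: old -> new" wherever a line of standard
# input differs from the line before it.
changepoints() {
    awk 'NR > 1 && $0 != prev { print NR ": " prev " -> " $0 }
         { prev = $0 }'
}

printf 'A\nA\nB\nB\nB\nC\n' | changepoints
```

For the sample list this reports changes at lines 3 and 6.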

Unwrap your fasta

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Dec 1, 2018 6:58 AM EDT)
  • Story Type: Tutorial
FASTA is a plain-text file format for DNA sequences, but the sequences are often wrapped to a fixed line length. This post explains 3 Linux command-line methods for joining the sequence lines end-to-end.
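
A sketch of one AWK approach, which may or may not be among the three in the post: print each header line as it arrives, accumulate the sequence lines that follow it, and print the accumulated sequence when the next header or the end of input is reached. The function name unwrap_fasta is hypothetical.

```shell
# unwrap_fasta: join the wrapped sequence lines under each ">" header
# into a single line.
unwrap_fasta() {
    awk '/^>/ { if (seq != "") print seq   # flush the previous sequence
                seq = ""
                print                      # print the header itself
                next }
         { seq = seq $0 }                  # append a sequence line
         END { if (seq != "") print seq }' # flush the last sequence
}

printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\nAA\n' | unwrap_fasta
```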

Avoiding senior moments with command-line functions

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Nov 12, 2018 4:00 PM EDT)
  • Story Type: Tutorial
The trick is to make the documentation available on the CLI. Also, how to get a "yes" or "no" answer from grep.
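
For the grep part, a minimal sketch: grep's -q option suppresses output and sets the exit status, which a shell test can turn into "yes" or "no". The function name has_match is hypothetical, and the post may phrase this differently.

```shell
# has_match PATTERN: print "yes" if any line of standard input matches
# PATTERN, otherwise "no".
has_match() {
    if grep -q -- "$1"; then
        echo yes
    else
        echo no
    fi
}

printf 'alpha\nbeta\n' | has_match 'beta'
printf 'alpha\nbeta\n' | has_match 'gamma'
```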
