Showing headlines posted by Bob_Mesibov


How to watermark a UTF-8 plain text file

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Nov 24, 2021 6:10 AM EDT)
  • Story Type: Tutorial
Watermarking plain text isn't easy. Plain text files don't have headers (or magic numbers), and although you can insert invisible control characters, those characters may get revealed by text editors and word processors. There's a sneaky way to do it, though.

What's wrong with my footprintWKT?

WKT is a simple text format for describing some geometric shapes, like those used in GIS. The format is so simple that it's hard to get it wrong, yet the Global Biodiversity Information Facility has lately been declaring "invalid" thousands of apparently valid WKT strings. Why?

A quick cross-file comparison with AWK

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Nov 10, 2021 9:36 AM EDT)
  • Story Type: Tutorial
I really like AWK. It allows me to do simple, effective, ad hoc processing of data files, as this post will demonstrate. If AWK was a football club I'd be an ardent supporter: "Carn the mighty AWK!"

On visual contrast and QR codes

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Nov 3, 2021 6:27 AM EDT)
  • Story Type: Tutorial
Webpage text is easier to read if there's good contrast, and boosting contrast can also make blurry QR codes readable.

Duplicate records differing only in unique identifiers

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Oct 27, 2021 11:47 AM EDT)
  • Story Type: Tutorial
There's a big data table with lots of fields and lots of records. Each record has one or more unique identifier field entries. How to check for records that are exactly the same, apart from those unique identifiers? I've been tinkering with this problem for years, and I described a fairly clumsy method in a 2020 blog post. Here I present a much-improved command-line solution.
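One way to approach this kind of check is sketched below. It is not necessarily the method from the post: it assumes a tab-separated table with a single identifier column, and the function name `dups_ignoring_id` is made up for illustration.

```shell
# dups_ignoring_id [IDCOL] < table.tsv
# Hypothetical helper: prints groups of records that are identical
# apart from the entry in column IDCOL (default 1).
# Assumes tab-separated input and one identifier column.
dups_ignoring_id() {
    awk -v idcol="${1:-1}" 'BEGIN { FS = OFS = "\t" }
    {
        line = $0                         # keep the untouched record
        $idcol = ""                       # blank the identifier; $0 is rebuilt
        key = $0                          # record minus its identifier
        n[key]++
        recs[key] = recs[key] line "\n"   # collect full records per key
    }
    END {
        # each group is followed by a blank line; group order is arbitrary
        for (k in n) if (n[k] > 1) printf "%s\n", recs[k]
    }'
}
```

The whole table is held in memory, so this sketch suits files that fit comfortably in RAM. With more than one identifier field, each of those fields would need to be blanked before the key is built.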

Some regex tests with grep, sed and AWK

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Oct 20, 2021 2:09 AM EDT)
  • Story Type: Tutorial
In my data work I regularly do searching and filtering with GNU grep (version 3.3), GNU sed (4.7) and GNU AWK (4.2.1). I don't know if they all use the same regex engine, but I've noticed differences in regex speed between these three programs. This post documents some of the differences.

TSV to CSV on the CLI (if you really have to)

Regular visitors to this blog will know that I don't like the CSV format. It's awful. In my humble opinion, data workers should aim to use invisible tabs (TSV) or visible pipes (PSV) as field separators in delimited text tables. Sometimes, though, data workers are required to convert a perfectly good TSV or PSV to a CSV. What to do?
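A minimal sketch of one possible conversion follows. It assumes that no TSV field contains an embedded tab or newline, and it quotes every field, which is only one of several CSV conventions. The function name `tsv2csv` is hypothetical and this is not necessarily the approach taken in the post.

```shell
# tsv2csv < table.tsv > table.csv
# Hypothetical helper: wraps every field in double quotes and doubles
# any double quotes already inside a field.
# Assumes fields contain no embedded tabs or newlines.
tsv2csv() {
    awk 'BEGIN { FS = "\t"; OFS = "," }
    {
        for (i = 1; i <= NF; i++) {
            gsub(/"/, "\"\"", $i)    # escape embedded quotes by doubling them
            $i = "\"" $i "\""        # quote the field so commas inside are safe
        }
        print                        # $0 was rebuilt with commas as separators
    }'
}
```

Quoting every field is wasteful but avoids having to decide which fields need protection from embedded commas.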

How to do replacements based on multiple field values

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Oct 6, 2021 6:11 AM EDT)
  • Story Type: Tutorial
In a previous post I explained how to normalise entries in a field based on the entry in another field. The same command-line method can be used to repair entries based on entries in several other fields in the same record. An example will make it a lot easier to see what this is all about and why this method is so useful.

How to find mixed Latin+Cyrillic words

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Sep 29, 2021 11:59 AM EDT)
  • Story Type: Tutorial
Latin, Greek and Cyrillic "A" all look alike but are encoded differently, and mixed-script words can cause data processing problems. This post demonstrates a function that finds these mixed words in a data table and colors the letters according to the script they come from.

An AWK histogram with scaling

It's not hard to build a frequency-of-occurrence histogram with AWK, but scaling the histogram bars is a little bit trickier. By "scaling" I mean setting the longest bar to a defined character length in the terminal, and adjusting the lengths of all the shorter bars proportionally.
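The scaling described above can be sketched as follows. This is an illustration under stated assumptions rather than the command from the post: input is one value per line, bars are drawn with "#", and the function name `scaled_hist` is made up.

```shell
# scaled_hist [MAXLEN] < values.txt
# Hypothetical helper: counts how often each line occurs and prints
# value<TAB>count<TAB>bar, with the longest bar MAXLEN characters long
# (default 40) and shorter bars in proportion.
scaled_hist() {
    awk -v maxlen="${1:-40}" '
        { count[$0]++ }                          # tally each distinct line
        END {
            for (v in count) if (count[v] > top) top = count[v]
            if (top == 0) exit                   # no input, nothing to draw
            for (v in count) {
                # proportional bar length, rounded to the nearest character
                len = int(count[v] * maxlen / top + 0.5)
                bar = ""
                for (i = 0; i < len; i++) bar = bar "#"
                printf "%s\t%d\t%s\n", v, count[v], bar
            }
        }' | LC_ALL=C sort                       # stable, byte-wise ordering
}
```

For example, with counts of 4, 2 and 1 and a maximum length of 8, the bars come out 8, 4 and 2 characters long. Very rare values can round down to an empty bar.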

Show Unicode code points for UTF-8 characters

Like the title says, I wanted to show the Unicode code points (formatted "uxxxx") for a set of UTF-8 characters. There are programs that do just that in a number of programming languages, but here's how to do it with shell tools.

Put an editable command at the next prompt

This post explains two simple ways to send an editable command to a prompt without executing the command.

Yet another gremlin: the zero-width space

A zero-width space (ZWSP) is a formatting aid for text displayed by browsers and word processors. In other processing contexts it's a damned nuisance, but ZWSPs can be located and zapped with command-line tools.
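As a rough sketch of locating and zapping, assuming UTF-8 input in which the ZWSP (U+200B) is stored as the three bytes E2 80 8B: the function names `find_zwsp` and `strip_zwsp` are hypothetical, and the commands in the post itself may differ.

```shell
# U+200B as UTF-8 bytes, written with octal escapes for portability
zwsp=$(printf '\342\200\213')

# find_zwsp < file
# Hypothetical helper: prints the numbers of the lines containing a ZWSP.
find_zwsp() {
    awk -v z="$zwsp" 'index($0, z) { print NR }'
}

# strip_zwsp < file > cleaned
# Hypothetical helper: deletes every ZWSP and leaves everything else alone.
strip_zwsp() {
    sed "s/$zwsp//g"
}
```

Running `find_zwsp` first shows where the gremlins are before `strip_zwsp` removes them.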

zbarimg and blurry QR codes

This is a tinkering post about zbarimg and its ability to read QR code images. Because there are so many possible ways to produce an image of a QR code, I decided to start with a readable image and progressively degrade it, to see what happened.

CSV viewers for CSV haters

There are quite a few programs that let us CLI people view a CSV file as a simple table. I doubt if there are any CSV parsers that work perfectly with all variations in the wild of the horrible, awful CSV format, but the two parsers shown here are usually reliable.

Two data formatting tweaks

Nearly all the data files I audit are tab-separated plain text (TSV). While tabs are wonderful field separators, they're invisible, and sometimes it helps to see where one field ends and another begins. This post describes a couple of the methods I use to make TSVs more eye-friendly.

Visualising data as a PGM image

An ASCII PGM file ("Portable Gray Map") is a simple text file that encodes a grayscale image. It's so simple that I was tempted to build a PGM from a 3-parameter dataset, with "x" for image width, "y" for image height and various gray colors for "z" categories. The resulting image wasn't impressive, but the method might be useful for other datasets.
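An ASCII PGM file starts with the magic number "P2", then the width and height, then the maximum gray value, then the pixel values row by row. The sketch below builds one from whitespace-separated "x y z" lines. It assumes x and y are 1-based pixel coordinates and z is already a gray value between 0 and the maximum, whereas the post maps "z" categories to grays. The function name `xyz2pgm` is made up.

```shell
# xyz2pgm [MAXVAL] < points.txt > image.pgm
# Hypothetical helper: turns "x y z" lines into an ASCII (P2) PGM.
# Assumes 1-based integer x and y, and z as a gray value 0..MAXVAL
# (default 255). Pixels with no data are written as 0 (black).
xyz2pgm() {
    awk -v maxval="${1:-255}" '
        {
            px[$1, $2] = $3              # remember the gray value at (x, y)
            if ($1 > w) w = $1           # image width  = largest x
            if ($2 > h) h = $2           # image height = largest y
        }
        END {
            print "P2"                   # magic number for ASCII PGM
            print w + 0, h + 0
            print maxval
            for (y = 1; y <= h; y++) {
                row = ""
                for (x = 1; x <= w; x++) {
                    v = ((x, y) in px) ? px[x, y] : 0
                    row = row (x > 1 ? " " : "") v
                }
                print row                # one image row per output line
            }
        }'
}
```

The PGM format recommends keeping lines in the ASCII variant short, so a wide image may need its rows wrapped before strict readers will accept it.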

Revisiting a command-line translator

  • BASHing data; By Bob Mesibov (Posted by Bob_Mesibov on Jul 27, 2021 10:32 PM EDT)
  • Story Type: Tutorial
The translate-shell program works on the command line and translates any common language into any other, with the help of online translation engines. This post explains how to embed translate-shell in a handy GUI, and also how to colorize the translation.

The data worker's guide to psiphiorrhea

In the same way that someone who talks far too much is exhibiting logorrhea, or excessive word-iness, someone who uses far too many digits in their numbers is exhibiting psiphiorrhea, or excessive digit-iness. No amount of explaining about significant figures or measurement error will convince the psiphiorrhea sufferer that their numbers are absurd.

What is +ACY- doing in the data?

Almost all the data files I audit are in UTF-8, but the files have often started out in other encodings. This can lead to some hilarious mojibake and loads of fun for me as I try to reverse the encoding conversion failures. Last week a file appeared with mojibake I'd never seen before.
