Regular expressions: Pulling it all together

It's time to refine your understanding of building regular expressions with grep and sed.

Posted: October 17, 2019 by David Both (Sudoer alumni)

Regex: Pulling it all together — "Crossfit Bootcamp Fitness Models" by ThoroughlyReviewed is licensed under CC BY 2.0

In Introducing regular expressions, I introduced the concept and basics, and then in Getting started with regular expressions: An example, we walked through an example that cleans up lists of names and email addresses so they are consistent and parseable. After our dive into Regex and grep: Data flow and building blocks, where we got into more detail about regular expressions, it’s now time to explore ways in which we can shorten and simplify the command-line program from the first example. We’ll focus here on grep and sed.

Example: Simplifying the mailing list program

First, let’s look back at our first example, where we built the following command-line interface (CLI) program:

cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

You might find the regular expressions easier to read at this point, but this program can be simplified.

cat and grep

Let’s start by focusing on the beginning of the command, which involves cat and grep:

cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$"

We can combine the two grep statements, which originally look like this:

| grep -v Team | grep -v "^\s*$"

Tip: When the STDOUT from grep is not piped through another utility, and when using a terminal emulator that supports color, the regex matches are highlighted in the output data stream.

The revised command is:

grep -vE "Team|^\s*$"

Here, we’ve added the -E option, which specifies extended regex. According to the grep man page:

"In GNU grep there is no difference in available functionality between basic and extended syntaxes."

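That statement is about functionality, not syntax. Our new combined expression fails without the -E option because basic syntax treats a bare | as a literal character and spells alternation as \| instead. Run the following to see the results: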

[student@studentvm1 testing]$ cat Experiment_6-1.txt | grep -vE "Team|^\s*$"

Then try it without the -E option: the Team lines and the blank lines are no longer filtered out.
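The comparison can be sketched on a small sample. The file contents below are an invented stand-in for Experiment_6-1.txt, and the variable names are only illustrative. The sketch assumes GNU grep, since \s and \| are GNU extensions.

```shell
# Hypothetical stand-in for Experiment_6-1.txt; the entries are invented.
sample=$(mktemp)
printf '%s\n' \
  'Team 1 Apr 3' \
  'Leader Virginia Jones vjones88@example.com' \
  '   ' \
  'Frank Brown FBrown398@example.com' \
  '' > "$sample"

# Original approach: two grep processes, one pattern each.
two_greps=$(grep -v Team "$sample" | grep -v "^\s*$")

# Extended syntax (-E): a bare | means alternation.
extended=$(grep -vE "Team|^\s*$" "$sample")

# Basic syntax (no -E): GNU grep spells alternation as \| instead.
basic=$(grep -v "Team\|^\s*$" "$sample")

# Basic syntax with a bare |: the | is a literal character, so no line
# matches and -v passes every line through, unwanted ones included.
literal_bar=$(grep -v "Team|^\s*$" "$sample")

printf '%s\n' "$extended"
rm -f "$sample"
```

On this sample, the first three commands should produce the same two lines, while the last one returns the input unfiltered.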

Now, let’s look at cat. The grep tool can also read data from a file, so we can eliminate the cat command entirely:

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt
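As a quick check that dropping cat does not change the result, the following sketch compares the two forms, plus input redirection as a third option. The sample file is an invented stand-in, not the real Experiment_6-1.txt.

```shell
# Hypothetical sample data; the entries are invented.
sample=$(mktemp)
printf '%s\n' \
  'Team 1 Apr 3' \
  'Frank Brown FBrown398@example.com' \
  '' > "$sample"

# Data arrives through a pipe from cat.
via_cat=$(cat "$sample" | grep -vE "Team|^\s*$")

# grep opens the file itself when it is given as an argument.
via_arg=$(grep -vE "Team|^\s*$" "$sample")

# The shell can also connect the file to grep's STDIN directly.
via_stdin=$(grep -vE "Team|^\s*$" < "$sample")

printf '%s\n' "$via_arg"
rm -f "$sample"
```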

This change and the previous one together leave us with the following, somewhat simplified CLI program:

grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'

This command is shorter and should execute faster, because a single grep process now parses the data stream once instead of two processes parsing it in turn.

Note: Again, it is important to realize that this solution is not the only one. There are different methods in Bash for producing the same output, and there are other languages like Python and Perl that can also be used. And, of course, there are always LibreOffice Writer macros. But, I can always count on Bash as part of any Linux distribution. I can perform these tasks using Bash programs on any Linux computer, even one without a GUI desktop, or one that has a GUI desktop but does not have LibreOffice installed.

sed

We can also simplify the sed command. The sed utility not only allows searching for text that matches a regex pattern, it can also modify, delete, or replace the matched text. I use sed at the command line and in Bash shell scripts as a fast and easy way to locate text and alter it. The name sed stands for stream editor because it operates on data streams in the same manner as other tools that can transform a data stream. Most of those changes involve selecting specific lines from the data stream and passing them on to another transformer program.

Note: Many people call tools like grep filter programs, because they filter unwanted lines out of the data stream. I prefer the term transformers, because tools like sed and awk do more than just filter. They can test content for various string combinations and alter the matching content in many different ways. Tools like sort, head, tail, uniq, fmt, and more all transform the data stream in some way.

We have already seen sed in action, but now, with an understanding of regular expressions, we can better analyze and understand our earlier usage. It is possible to combine four of the five expressions used in the sed command into a single expression. The sed command now has two expressions instead of five:

sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"

This format makes it a bit difficult to understand the more complex expression. Note that no matter how many expressions a single sed command contains, the data stream is only parsed once to match all of the expressions.
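To confirm that the single expression does the same work as the four it replaces, the two forms can be run side by side. The input lines here are invented for illustration.

```shell
# Invented input lines containing the characters the expressions remove.
input='leader Marie Lyons (mlyons@example.com)
[Alice Summers] asummers@example.com'

# Four separate expressions, as in the original command.
separate=$(printf '%s\n' "$input" |
  sed -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g")

# One set covering the same four characters.
combined=$(printf '%s\n' "$input" | sed -e "s/[]()\[]//g")

printf '%s\n' "$combined"
```

Both variables should hold the same text, with every square bracket and parenthesis removed.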

Let’s examine the revised expression more closely:

-e "s/[]()\[]//g"

Inside a set (a bracket expression), sed treats a ] that appears immediately after the opening [ as a literal character rather than as the end of the set. So, in the code above, the first [ opens the set, the ] that follows it is a literal member, and the last ] closes the set. The ( and ) characters between them are ordinary members as well.

We also need to match [ as a literal character in order to remove it from the data stream. Most metacharacters lose their special meaning inside a set, so a bare [ would already be literal there. The backslash (\) in \[ is therefore not acting as an escape: it is simply one more literal member of the set, which is harmless as long as the data contains no backslashes.
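One way to see exactly which characters the set contains is to feed it a line holding each candidate. This is a small sketch with an invented test string. Single quotes are used so that the shell leaves the backslashes alone.

```shell
# The combined expression removes [, ], (, ) and also the backslash,
# because a backslash inside a set is just another literal member.
result=$(printf '%s\n' 'a[b]c(d)e\f' | sed -e 's/[]()\[]//g')
printf '%s\n' "$result"         # abcdef

# The same set without the backslash still removes all four brackets
# but leaves the backslash in place.
no_backslash=$(printf '%s\n' 'a[b]c(d)e\f' | sed -e 's/[]()[]//g')
printf '%s\n' "$no_backslash"   # abcde\f
```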

Let’s plug this new version into the CLI script and test it:

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"

I know what you are asking: "Why not place the \[ after the [ that opens the set, and before the ] character?" Try it as I did:

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[\[]()]//g"

I think that should work, but it does not. Little unexpected results like this make it clear that we must test each regex carefully to ensure that it actually does what we intend.

After some experimentation of my own, I discovered that the position of the ] is what matters: to be treated as a literal, it must be the first character after the [ that opens the set. In the failing version, the set ends at the first ] it reaches, so the regex becomes the set [\[] followed by the literal characters ()], a sequence that never occurs in the data. This behavior is noted in the grep man page, which I probably should have read first. However, I find that experimentation reinforces the things I read, and I usually discover more interesting things than what I was looking for.
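The failing expression can be probed directly to see what it really matches. The input strings below are invented, and the sketch assumes sed’s default basic syntax, in which unescaped ( and ) are literal characters.

```shell
# The set closes at the first ], so the regex is the set [\[] followed
# by the literal characters ( ) and ]. Ordinary data never contains
# that sequence, so the line comes back untouched.
unchanged=$(printf '%s\n' '[Alice Summers] (asummers@example.com)' |
  sed -e 's/[\[]()]//g')
printf '%s\n' "$unchanged"

# Only the exact four-character sequence [()] matches that regex.
matched=$(printf '%s\n' 'x[()]y' | sed -e 's/[\[]()]//g')
printf '%s\n' "$matched"        # xy
```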

Adding the last component, the awk statement, our optimized program looks like this and the results are exactly what we want:

[student@studentvm1 testing]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g" | awk '{print $1" "$2" <"$3">"}'
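For readers without the original data file, here is a self-contained sketch of the whole pipeline. The sample entries are invented and only imitate the kinds of inconsistencies described in this series. The real Experiment_6-1.txt may differ.

```shell
# Hypothetical stand-in for Experiment_6-1.txt; the entries are invented.
sample=$(mktemp)
printf '%s\n' \
  'Team 1 Apr 3' \
  'Leader Virginia Jones vjones88@example.com' \
  'Frank Brown (FBrown398@example.com)' \
  '   ' \
  'Team 2 March 14' \
  'leader Marie Lyons [mlyons@example.com]' \
  '[Alice Summers] asummers@example.com' > "$sample"

# Drop Team and blank lines, strip the unwanted text, then format.
cleaned=$(grep -vE "Team|^\s*$" "$sample" |
  sed -e "s/[Ll]eader//" -e "s/[]()\[]//g" |
  awk '{print $1" "$2" <"$3">"}')

printf '%s\n' "$cleaned"
rm -f "$sample"
```

Each surviving line should come out in the form First Last <address>, for example Virginia Jones <vjones88@example.com>.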

Other tools that implement regular expressions

Many Linux tools implement regular expressions. Most of those implementations are very similar to those of awk, grep, and sed, so it should be easy to learn the differences. Although we have not looked in detail at awk, it is a powerful text-processing language that also implements regexes.

Most of the more advanced text editors use regexes. Vim, gVim, Kate, and GNU Emacs are no exceptions. The less utility implements regexes, as does LibreOffice Writer’s search and replace facility.

Programming languages like Perl, awk, and Python also contain implementations of regexes, which makes them well suited to writing tools for text manipulation.

Resources

I have found some excellent resources for learning about regular expressions. There are more than I have listed here, but these are the ones I have found to be particularly useful:

The grep man page has a good reference but is not appropriate for learning about regular expressions.
The O’Reilly book, Mastering Regular Expressions, by Jeffrey E. F. Friedl, is a good tutorial and reference for regular expressions. I recommend it for anyone who is or wants to be a Linux sysadmin because you will use regular expressions.
The O’Reilly book sed & awk: UNIX Power Tools, by Arnold Robbins and Dale Dougherty, is another good one. It covers both of these powerful tools and it also has an excellent discussion of regular expressions.

There are also some good websites that can help you learn about regular expressions and that provide interesting and useful cookbook-style regex examples. Some of them ask for money in return for using them. Jason Baker, my Technical Reviewer for Volumes 1 and 2 of my Using and Administering Linux course, suggests regexcrossword.com as a good learning tool.

Summary

This series has provided a brief introduction to the complex world of regular expressions. We have explored the regex implementations in the grep and sed utilities in just enough depth to give you an idea of some of the amazing things that can be accomplished with regexes. We have also looked at several other Linux tools and programming languages that implement regexes.

But make no mistake! We have only scratched the surface of these tools, and of regular expressions. There is much more to learn, and as you can see, there are some excellent resources for doing so.

Note: This article is a slightly modified version of Chapter 6 from Volume 2 of my Linux self-study course, "Using and Administering Linux: Zero to SysAdmin," due out from Apress in late 2019.

