In this demonstration we are exploring some command line tools that let us use regular expressions. In particular we will look at find, grep, and sed.
Requirements
These examples use the Bash shell (command line) as part of the Ubuntu operating system. You will find Bash already installed (or easy to install) on most Linux and Mac operating systems. At the time of writing, Windows uses Powershell, but recent versions also have support for Bash.
Irregular Expressions
In practice, regular expressions are often referred to as reg, regex, regexp, or similar abbreviations instead of the full "regular expressions." We'll do the same here. Before getting started, we should also talk about basic versus extended regular expressions.
Unfortunately theory and practice don't always align. The various software packages that allow the use of regexps sometimes implement them differently. For the most part they're similar enough, but it's a good idea to look up the documentation or find a cheat sheet before using them. In the Bash examples presented here, basic and extended expressions are the same except for how they use the backslash (\) character.
The main issue is this: let's say you want to use the regex "The seal watch(ed|s) observantly." The problem is: what do we mean by the period at the end of the sentence? Is it just a regular period? Or is it a regex saying we want to match any character? This is where the backslash comes in.
It's common practice in many coding tools to add a backslash before a character to indicate it should have a meaning different from its usual one. The problem now becomes: do we add backslashes to the special regex characters?
The seal watch\(ed\|s\) observantly.
or to the normal characters?
The seal watch(ed|s) observantly\.
For historical reasons, basic expressions take the former approach, while extended expressions take the latter.
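To make the difference concrete, here is a small runnable comparison using GNU grep (the file path is just a placeholder; note that alternation with \| in basic regexps is a GNU extension):

```shell
# Placeholder sample file holding the sentence from above.
printf 'The seal watched observantly.\n' > /tmp/seal.txt

# Basic regexps (default grep): the grouping and alternation
# operators must be backslashed to act as operators.
grep 'watch\(ed\|s\)' /tmp/seal.txt

# Extended regexps (grep -E): the same operators are written bare.
grep -E 'watch(ed|s)' /tmp/seal.txt
```

Both commands print the same matching line; only the notation differs.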
Navigating a Code Library
In the world of coding there's a good chance at some point you'll want to familiarize yourself with someone else's code library. It's best when there's already documentation to explain how the code is organized, but this isn't always the case. Sometimes you have to create a map for yourself, and regular expressions (and command line tools) can be a great help with that.
For our first two command line tool examples I use the source code from my C++ nik library, since I own the rights and so there are no copyright issues in using it here. You aren't expected to know C++, of course; the idea is to show you how to orient yourself when exploring unfamiliar code landscapes.
For the past few months I have been writing this (and other) curriculum modules. Before that I was working on the nik library, but it's been long enough, and my focus has shifted to other things, that I've forgotten exactly how I organized it at the time. What's more, since I left it unfinished, I hadn't yet written documentation for it.
In any case, whether it's an unfamiliar library or one you've been away from for a while, a good way to orient or reorient yourself is by ls-ing (listing) the file and folder contents of the main directory:
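The original listing isn't reproduced here, but the idea can be sketched with a throwaway mock directory (the names below are hypothetical stand-ins, not the nik library's real layout):

```shell
# Build a tiny mock library so the listing can be run anywhere.
mkdir -p /tmp/nik-demo/include /tmp/nik-demo/source
touch /tmp/nik-demo/README /tmp/nik-demo/include/power.hpp

# List the file and folder contents of the main directory.
cd /tmp/nik-demo
ls
```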
The subdirectories (folders) are in blue. So far so good, this gives us an initial feel for how big or complex the library might be, but more detailed information would be better.
Find
The find command line tool lets us explore the filesystem to find files and directories. Regexps aren't necessary to use this tool; we can ask to find all files in every folder and subfolder starting at our current location:
Here, the (.) represents the current directory, and adding the option -type f means "search, but only report the files":
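Against a throwaway demo tree (the real library is of course much larger), the two forms of the command look like this:

```shell
# Mock files for demonstration purposes.
mkdir -p /tmp/find-demo/sub
touch /tmp/find-demo/a.cpp /tmp/find-demo/sub/b.hpp

cd /tmp/find-demo
find .            # reports files AND directories under the current dir (.)
find . -type f    # restricts the report to regular files only
```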
That's a lot of files! Maybe instead of displaying all the files, we can just get a count:
285 files? That certainly is a lot of files! What I did here was pipe (|) the output of our find command (which is just lines of text) into the wc command, which counts words. Adding the -l option changes it to counting lines instead, which is what we're interested in.
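Here is the same pipeline run against a small mock tree (three files instead of 285):

```shell
# Throwaway demo tree with exactly three files.
mkdir -p /tmp/count-demo/sub
touch /tmp/count-demo/a.cpp /tmp/count-demo/b.cpp /tmp/count-demo/sub/c.hpp

# find prints one path per line; wc -l counts those lines.
cd /tmp/count-demo
find . -type f | wc -l
```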
Note: as you may have noticed, a challenging aspect of coding is that many of the same symbols and characters (such as '|') are used by different tools to mean different things. Unfortunately, there's no way to fix this in a way that would satisfy everyone.
As a quick reminder: if you want to see what options a command line tool has, you can use the man tool...
...to bring up its documentation:
Searching for a Function
Back to our example: I would like to review how I implemented a function called power(…) but have forgotten exactly which file it's a part of. As there are 285 files, that's too many to review manually.
So let's use regular expressions to narrow the field:
Here I used the find tool restricting our search to files with the -type f option, but I've also added the -regextype egrep option to use "grep style" extended expressions. I then used the expression:
-regex ".*power.*"
This expression finds and returns files (searching their path names as well) which have zero or more any characters ".*", followed by the word "power", followed again by zero or more any characters ".*".
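Assuming GNU find (which is what provides the -regextype option), a self-contained version of this search looks something like the following; the file names are mock stand-ins:

```shell
# Mock files: one with "power" in its name, one without.
mkdir -p /tmp/regex-demo/math
touch /tmp/regex-demo/math/power.hpp /tmp/regex-demo/math/sum.hpp

cd /tmp/regex-demo
# -regex matches against each file's WHOLE path, hence ".*" on both sides.
# (-regextype is placed first so it applies to the -regex test that follows.)
find . -regextype egrep -type f -regex ".*power.*"
```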
Command line tools are expressive enough that they often give you more than one way to do the same thing. If you don't like adding all those options to the find tool each time, an alternative is:
Here we still use find, but return the list of all files (as we did initially), then pipe this into grep, which also uses regular expressions. We don't have to be so thorough in our regex, because grep simply looks for a match within a line of text and returns the whole line. It also highlights matches in red, which is an additional incentive to use this alternative style of search, as it's easier to read.
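The piped alternative, run against the same kind of mock tree:

```shell
# Mock files: one with "power" in its name, one without.
mkdir -p /tmp/pipe-demo/math
touch /tmp/pipe-demo/math/power.hpp /tmp/pipe-demo/math/sum.hpp

cd /tmp/pipe-demo
# grep matches anywhere within each line, so no ".*" padding is needed,
# and matches are highlighted when output goes to a color terminal.
find . -type f | grep power
```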
Note: this time I didn't quote the regex as in previous examples. In Bash if there's no spaces in your expression or variables (Bash is a programming language), you're not required to use quotes, though it's best practice to do so.
Grep
The name grep is an acronym for "globally search for a regular expression and print." It lets us search individual text files, or recursively all the files in a folder, and returns lines that match.
Our search returned 15 files which is far better than 285, but we still may be able to reduce this further. Instead of searching for files with the word "power" in the file name, let's look inside the files themselves for this word. As it is also a function, which might have arguments, I'll narrow our results further by including its parentheses grammar:
Here we are using extended expressions, meaning that if we want to match the literal parenthesis characters we need to backslash them. As for the grep options: -E means "use extended expressions", -r means "search all files recursively" (this way we don't need to specify which file we want to search), and -e means "what follows is our regular expression." We could have written the options as:
grep -E -r -e power
But grep allows us to combine them as above. As for the results:
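The original output isn't reproduced here, but the shape of the search can be sketched against a mock source file (the C++ declarations are invented for illustration):

```shell
# Mock header with two function declarations.
mkdir -p /tmp/grep-demo
printf 'int power(int b, int e);\nint copower(int b, int e);\n' \
    > /tmp/grep-demo/math.hpp

cd /tmp/grep-demo
# In extended regexps '(' is special, so backslash it to match it literally.
# With -r and no file operand, GNU grep searches the current directory.
grep -Ere "power\("
```

Note that this matches both "power(" and "copower(", which is exactly the over-matching problem discussed next.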
There are a lot of matches. This search wasn't as helpful as I had hoped, but that's okay; sometimes that's how it goes. Don't let a failed attempt stop you; even experienced regex users run into this problem. In any case, we can either abandon this approach and search the initial 15 files found by find, or refine the regex and try again. Here I opt for the latter:
I refined the search by ignoring the "copower" matches and by including a space before the word "power." An alternative to filtering out the "copower" matches is to exclude the "co" before the "power" more directly:
If you begin a character class ([...]) with a '^', it negates the class. By writing "[^o]" we're saying match any character except the letter 'o' before the word "power." Either way, both of these return much better results than our attempt with find. What's more, they show the files where the matches occur, as well as the text surrounding the match, which is also helpful.
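Both refinements can be tried on a mock file (the declarations are invented; note that "[^o]power" requires some character before "power", so it would miss a match at the very start of a line):

```shell
# Mock header with one wanted and one unwanted match.
mkdir -p /tmp/refine-demo
printf 'int power(int b, int e);\nint copower(int b, int e);\n' \
    > /tmp/refine-demo/math.hpp

cd /tmp/refine-demo
# Refinement 1: require a space before the word "power".
grep -Ere " power\("

# Refinement 2: require any single character EXCEPT 'o' before "power".
grep -Ere "[^o]power\("
```

Each refinement now matches only the "power(" declaration, skipping "copower(".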
This concludes the main examples of this demonstration. The most common applications that use regular expressions generally require better knowledge of programming languages or database interfaces. Because of this, I chose to explore my nik library as our example, but it wasn't an artificial use of regexps: in my own work I often look up code this way. It's easy, fast, and accessible.
Finally, even though there are many graphical tools that let you use regular expressions, one nice advantage of using them on the command line is that the tools there were written decades ago under strong hardware constraints. This means the underlying algorithms tend to be fast and efficient. Don't take these tools for granted just because they're old; there are many situations where they perform better than modern ones.
Sed
So far we've only looked at searching and navigating our text files; what about editing them?
This is another way in which regular expressions show their power. For example, sometimes I like to look at Inuit traditional stories. One of my favorites is Eskimo Folk-Tales, collected by Knud Rasmussen as he journeyed across the Arctic 100 years ago. This is a good example because the copyright has expired, so these stories are now in the public domain and free to use. Just remember to be respectful, as these stories have a lot of power in them.
If you download the plain text file and view it in your browser, you'll see it is just one big text file. It has the table of contents from the original book, but in this file there are no distinct pages so it has very little navigational structure.
Let's say I wanted to edit this book, and split each story into its own text file. How would I do this? I could do it by hand, going into the file, highlighting each story, copying and pasting and saving it, but this is exactly the sort of tedious task computers are much better at. If a computer is going to navigate this file, we need to figure out what markers we can use to separate the stories.
With modern advances in machine learning we could find a way to do this by using the text of the stories themselves, but we're not ready for that. Besides, it's a good skill to learn hybrid approaches: sometimes this sort of project is best done as a person-machine collaboration. If convenient markers don't exist, make them. The first step is to add our own markers.
I chose the following pattern for our markers:
<[[:digit:]]+>
which is similar to html (or xml) tags: I marked the start of a story with a left angle bracket '<' followed by a number (one or more digits) followed by a right angle bracket '>'. To mark the end of the same story I used the same pattern but added a forward slash '/' before the number. You can see this markup by viewing the file in the assets folder called Eskimo-Folk-Tales-indexed.txt.
Note: before adding such markers, it's best to check to make sure the chosen pattern doesn't already exist in the file as it would confuse the automation process later on.
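Such a check might look like the following (the file name is a stand-in; the pattern "</?[0-9]+>" covers both the opening and closing marker forms):

```shell
# Tiny stand-in for the book's text.
printf 'ONCE upon a time there was a raven.\nIt flew away.\n' \
    > /tmp/book-check.txt

# -q suppresses output; grep's exit status alone tells us if it matched.
if grep -Eq '</?[0-9]+>' /tmp/book-check.txt; then
    echo "pattern already present -- choose different markers"
else
    echo "pattern not found -- safe to use as markers"
fi
```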
Note: I did this markup editing with the vim text editor, which also lets you use regular expressions. Although vim has a steep learning curve, it is worth learning since it helps to automate many text editing processes.
On to the machine component of this project: using the sed command line tool, I searched for each marked-up story and wrote it to its own file:
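The original script isn't reproduced here; the following is a minimal sketch (not necessarily the author's approach) of how a Bash loop around sed could perform the split, demonstrated on a tiny two-story stand-in:

```shell
# Tiny two-story stand-in for the marked-up book.
cat > /tmp/tales.txt <<'EOF'
<1>
The Raven
It was dark.
</1>
<2>
The Fox
It was cold.
</2>
EOF

cd /tmp
# For each opening tag <N>: print the lines from <N> through </N>,
# drop the tag lines themselves, and save to the story's own file.
for n in $(grep -oE '^<[0-9]+>$' tales.txt | tr -d '<>'); do
    sed -n "/^<$n>$/,/^<\/$n>$/p" tales.txt | sed '1d;$d' > "story-$n.txt"
done
```

After running it, story-1.txt and story-2.txt each hold one story's text without the marker lines.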
To be fair, if you're unfamiliar with using Bash or sed I'm not expecting you to understand this script, as there's a lot going on here.
The point is to show you it can be done, and requires only a few lines of code. When you run this script, it takes only a second or two to create the separate files. Not only is this faster than doing it manually, it also saves your wrists from repetitive strain. Not everything in life or work should be automated, but if you know when to apply it, then this sort of automation can be very helpful.
Note: my intention is not to scare you away from programming with this example. Most languages aren't so complicated. Truthfully, I'm using two entirely separate languages (Bash, sed) and intermixing them, which is a strange thing to do in general. Mostly, I just wanted to show you some of the possibilities.
Note: if you decide to develop this style of automated editing as a skill, I recommend as a best practice to make a backup or a copy of your files and/or directories first, before you run your scripts. If there's a bug in your code, you could accidentally delete important parts of your project. If you make such a mistake but have a backup, you can always restore the original and try again (making another copy of course). Always edit the copy, not the original.
Challenge
The 52 individual text file stories mentioned above are included in the asset folder. In one of them I used sed to substitute all occurrences of the word "dog" with "qimmiq." In another, I substituted all occurrences with "cat." Can you use grep to quickly find which ones?
I've also inserted a website link into one of the stories. It doesn't make sense as part of the story; it's just there for you to find. I didn't include the https part, to make it more challenging to locate. Can you find it with regexps? I believe you can! Think about the predictable patterns that website URLs have.