find Command with regex [Including Examples]

Last updated: July 24, 2023 | Zachary Blackstein

1. Introduction

In Linux-based systems, the go-to command to locate and search for files is the find command. We often use it for finding files based on matching their filenames using the shell's wildcards (or glob expressions).

Here's the thing: these wildcards can be effective, but they fall short in many cases where we want to define more advanced and complex patterns. That's when we bring out the big guns – regular expressions (or regex for short) to describe our patterns in much finer detail.

In this tutorial, we'll briefly discuss what regular expressions are, learn how to apply them with find, and then walk through some practical examples where regex outshines the standard wildcards.

2. Understanding Regular Expressions

2.1. Regex And Globbing Patterns Aren’t The Same

Regular expressions (regex) are a pattern-matching notation, just like glob expressions, but they're more expressive. You may think of regex as a way of describing complex patterns without constructing lengthy Linux commands. Most text-processing commands support regex right out of the box, and so does the find command.

To use regular expressions effectively, you'll need to get a grasp of how they work.

2.2. Regex Characters

Regular expressions are characters or text fragments with special meaning. By combining these into a string, you can specify a template to match and filter text. One thing to consider is that there isn’t just one set of these regex characters, meaning they can be interpreted differently based on what type of regex you're using.

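Luckily, the core metacharacters are largely consistent across types, and we can summarize them in the following table. Keep in mind that some of them need a backslash in certain syntaxes: alternation, for example, is written | in POSIX extended syntax but \| in the Emacs syntax that find uses by default.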

Element: .
Meaning: Matches any single character, much like ? in shell globs.
Usage: "x.z" matches any 3-character string that starts with x and ends with z. This means "xxz" or "xyz", but not "xz" or "xyyz".

Element: *
Meaning: Matches the preceding character zero or more times.
Usage: "z*" matches "z", "zz", "zzz", and also the empty string, but not "zzxz" or a string containing any other character.

Element: \
Meaning: Works as an escape character so that the special character after it is taken literally.
Usage: "\." matches a literal dot. Likewise, "\\" matches a single literal backslash.

Element: ^
Meaning: Anchors the pattern to the beginning of the string or line.
Usage: "^z" matches any string starting with "z", like "zaaiy" or "zebra", but not "azerty". This also applies to lines of text, where "^Linux" matches any line starting with "Linux".

Element: $
Meaning: Anchors the pattern to the end of the string or line.
Usage: "o$" matches any string ending with "o", like "zoro". Likewise, "key$" matches any line ending with "key".

Element: |
Meaning: Performs a logical OR between two strings or two sub-patterns.
Usage: "linux|windows" matches either "linux" or "windows".

Element: [ ]
Meaning: Matches any single character from the set or range listed inside the square brackets, acting as a logical OR over those characters.
Usage: "k[ea1]y" matches only "key", "kay", and "k1y". We can also specify a range rather than a set. For instance, [1-9] matches any digit from 1 to 9, and [a-z] matches any lowercase letter from a to z.

Element: [^ ]
Meaning: Matches any single character that is NOT listed inside the square brackets. Think of the caret as a logical NOT that negates the set or range.
Usage: "[^1-3]" matches any single character except 1, 2, and 3.
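
To see a few of these elements in action outside of find, here is a minimal sketch that filters a made-up word list with grep -E (POSIX extended syntax, so | needs no backslash). The word list and variable names are assumptions for illustration only.

```shell
# Candidate strings, one per line (made up for illustration).
words='xyz
xz
xyyz
key
kay
k1y
koy
linux
windows
zebra
azerty'

# "." stands for exactly one character; ^ and $ anchor the whole line.
dot_matches=$(printf '%s\n' "$words" | grep -E '^x.z$')

# [ea1] accepts one character out of the listed set.
set_matches=$(printf '%s\n' "$words" | grep -E '^k[ea1]y$')

# | is alternation between two whole words.
alt_matches=$(printf '%s\n' "$words" | grep -E '^(linux|windows)$')

# ^z keeps only the lines that start with "z".
caret_matches=$(printf '%s\n' "$words" | grep -E '^z')

printf '%s\n' "$dot_matches" "$set_matches" "$alt_matches" "$caret_matches"
```

Anchoring with ^ and $ makes grep behave like find -regex, which always has to match the entire string.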

Regular expressions are a big topic, and there is a lot left to be said. However, we don't want this to be an in-depth article about regex. So, it's a good idea to check out more information about it if you're interested.

3. Using find with Regular Expressions

3.1. Syntax and Relevant Options

There's nothing tricky about defining a find command that applies regular expressions. You simply forget about the usual -name test that uses globbing, and opt for either -regex or -iregex (for case-insensitive matching).

Your find -regex command should look like this:

find [path] -regex [expression]

As usual, [path] is where you want to start the search. [expression] is the pattern to match against each file's path, and it should be enclosed in quotation marks (preferably single quotes) so that the shell passes it to find unchanged.

By default, the find command interprets -regex expressions using the Emacs syntax. However, you can explicitly select another regex type with the -regextype option, which must appear before the -regex or -iregex test it should affect.
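
As a quick illustration of the difference between the two tests, the sketch below assumes GNU find and uses two made-up file names in a throwaway directory. The leading \./ in the pattern accounts for the path prefix, which the next section explains.

```shell
# Throwaway directory with hypothetical file names.
dir=$(mktemp -d)
touch "$dir/Report" "$dir/summary"

# -regex is case-sensitive: [a-z]* only accepts the all-lowercase name.
sensitive=$(cd "$dir" && find . -type f -regex '\./[a-z]*' | LC_ALL=C sort)

# -iregex ignores case, so the capitalized name matches as well.
insensitive=$(cd "$dir" && find . -type f -iregex '\./[a-z]*' | LC_ALL=C sort)

printf 'regex:\n%s\niregex:\n%s\n' "$sensitive" "$insensitive"

rm -rf "$dir"
```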

3.2. Proof of Concept

Before you jump right into crafting your regex commands, keep in mind that find matches your pattern against the whole path of the file, as find prints it, rather than just its filename (basename), and the pattern has to match that entire path. This means that most of the patterns you define should account for the starting point you pass to find.

For instance, if you want to locate files in the working directory, the only thing you may consider is the ./ shorthand representation for the current path.

Let's print all filenames composed of lowercase letters:

$ find . -regex '\./[a-z]*'
./smara
./terminal

We may split the -regex expression above into two parts for better understanding: "\./" and "[a-z]*". The first and crucial one matches the leading "./" of the file path, and you should always include it (at least when matching a file's basename in the current directory). The second part matches any lowercase letter from "a" to "z", zero or more times, thanks to the * count modifier.
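
The whole-path behavior is easy to check for yourself. Here is a minimal sketch, assuming GNU find and a throwaway directory whose layout (one file at the top, one in a hypothetical sub directory) is made up for illustration:

```shell
# Throwaway directory: one file at the top level, one inside "sub".
dir=$(mktemp -d)
mkdir "$dir/sub"
touch "$dir/smara" "$dir/sub/terminal"

# A bare name never matches: the regex is tested against "./smara", not "smara".
bare=$(cd "$dir" && find . -type f -regex 'smara')

# Anchoring on "./" matches only files directly in the starting directory.
top=$(cd "$dir" && find . -type f -regex '\./[a-z]*')

# Allowing any leading directories with ".*/" matches at every depth.
any_depth=$(cd "$dir" && find . -type f -regex '.*/[a-z]*' | LC_ALL=C sort)

printf 'bare: [%s]\ntop: [%s]\nany depth:\n%s\n' "$bare" "$top" "$any_depth"

rm -rf "$dir"
```

The ".*/" prefix is a handy replacement for "\./" whenever the files you want may sit in subdirectories.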

Next, how about we print only filenames with digits? As before, let's consider the current directory and the path separator (\./), and then define the pattern to match against the filenames:

$ find . -regex '\./[0-9]*'
./11910
./3301

Again, the above regular expression first accounts for searching within the current directory, hence the "\./". Then it matches only filenames that consist entirely of digits between 0 and 9.

3.3. Using Other Regex Types

You may want to use regex constructs that find can't understand by default. Often, you'll find yourself copy-pasting pre-existing regular expressions. In this case, it's easier to change the regex type in find than to waste time translating what you have into the Emacs regex style.

In the previous example, you can also define digits from 0 to 9 with [[:digit:]], which is a character class defined by POSIX.

Let's check if this will work:

$ find . -regex '\./[[:digit:]]*'

The above invocation yields nothing because [[:digit:]] simply isn't supported by the Emacs syntax that find assumes here.

To make it work, simply put the -regextype option, with any POSIX-style regex type, before the -regex test:

$ find . -regextype posix-basic -regex '\./[[:digit:]]*'
./11910
./3301

Now our command knows exactly what we mean by "[[:digit:]]", because we've explicitly told find that we're dealing with posix-basic regex.
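
To double-check that the character class and the plain range select the same files, here is a small sketch. It assumes GNU find and recreates the four file names from the examples above in a throwaway directory.

```shell
# Throwaway directory with the file names used in the examples above.
dir=$(mktemp -d)
touch "$dir/smara" "$dir/terminal" "$dir/11910" "$dir/3301"

# Plain bracket range, interpreted with find's default syntax.
by_range=$(cd "$dir" && find . -type f -regex '\./[0-9]*' | LC_ALL=C sort)

# POSIX character class; -regextype comes before the -regex it should affect.
by_class=$(cd "$dir" && find . -type f -regextype posix-basic -regex '\./[[:digit:]]*' | LC_ALL=C sort)

printf 'range:\n%s\nclass:\n%s\n' "$by_range" "$by_class"

rm -rf "$dir"
```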

4. Practical Examples

After seeing some toy examples of the find command with regex, it's time to see some practical ones. However, rather than discussing simple use cases to match filenames, it's better to show you how and when regex is the optimal option over the standard shell globs.

4.1. Finding Files With Different Extensions

A common criterion to search for files is their extensions. You'll probably find yourself wanting to locate files with one of two (or more) extensions, say .zip and .rar. You can do this the "old way" with shell globs, like this:

$ find . -type f -name "*.zip" -o -type f -name "*.rar"

While this command yields correct results, it's somewhat tricky to write, even for two extensions. Now imagine if we had more.

You'll quickly come to realize that using regex is much simpler after the following example:

$ find . -type f -regex '.*\.\(zip\|rar\)'
./bouhannana.zip
./linuxsysop.rar

Since we focus on the extensions of the files and not the actual filename, we skipped matching the current directory shorthand (\./). With that out of the way, let's break this down step by step and see how find grabbed the files above:

  • -type f restricts the search to only files.
  • ".*" matches any character zero or more times, which gives us, for example, "./bouhannana"
  • "\." simply matches a dot character, so now we have "./bouhannana."
  • "\(zip\|rar\)" matches either "zip" or "rar", so we get "./bouhannana.zip"
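
The glob and regex versions can be compared side by side. The sketch below assumes GNU find; the archive/old directory and the extra file names are made up for illustration.

```shell
# Throwaway directory with archives at two depths plus an unrelated file.
dir=$(mktemp -d)
mkdir -p "$dir/archive/old"
touch "$dir/bouhannana.zip" "$dir/linuxsysop.rar" "$dir/notes.txt" \
      "$dir/archive/old/backup.zip"

# Glob version: escaped parentheses group the -name tests so -type f covers both.
by_glob=$(cd "$dir" && find . -type f \( -name '*.zip' -o -name '*.rar' \) | LC_ALL=C sort)

# Regex version in the default Emacs syntax: \( \| \) need backslashes.
by_regex=$(cd "$dir" && find . -type f -regex '.*\.\(zip\|rar\)' | LC_ALL=C sort)

printf 'glob:\n%s\nregex:\n%s\n' "$by_glob" "$by_regex"

rm -rf "$dir"
```

Because the pattern starts with ".*", it also picks up archives in subdirectories, just like the -name tests do.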

Note that when you add a sub-pattern, you should enclose it within parentheses. However, since find defaults to Emacs-style regex, you have to escape the parentheses and even the alternation operator "|".

You can avoid all of those escape characters by choosing a POSIX-compliant regex type, say "posix-extended":

$ find . -type f -regextype posix-extended -regex '.*\.(zip|rar)'

This command uses the exact same pattern, minus the backslashes before the parentheses and the "|", which the POSIX extended syntax doesn't require.
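
The extended syntax pays off as the list of extensions grows. As a sketch, assuming GNU find and a few made-up file names (the .7z and .tar.gz extensions are additions for illustration, not part of the example above):

```shell
# Throwaway directory with several archive types and one plain text file.
dir=$(mktemp -d)
touch "$dir/a.zip" "$dir/b.rar" "$dir/c.7z" "$dir/d.tar.gz" "$dir/e.txt"

# One alternation covers every extension; only the literal dots are escaped.
archives=$(cd "$dir" && find . -type f -regextype posix-extended \
  -regex '.*\.(zip|rar|7z|tar\.gz)' | LC_ALL=C sort)

printf '%s\n' "$archives"

rm -rf "$dir"
```

Adding another extension is a matter of appending one more |alternative, rather than another -o -name pair.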

4.2. Negate The Match and Exclude Files With Certain Patterns

One important thing to be aware of when using find with regex is how to do a negative search. For instance, instead of searching for files that end with .zip or .rar, we do the opposite and capture all files that do not have these extensions.

The good part here is you can simply leverage the built-in find operator -not or ! without fiddling with the regex expression:

$ find . -type f -regextype posix-extended ! -regex '.*\.(zip|rar)'

This command looks exactly like the previous one, except for the negation operator "!" before the -regex test. This slight edit alters the command to: Find files that do NOT end with zip or rar.

Note that ! negates only the test that immediately follows it. So when you want to use it to negate a regex pattern, place it directly before the -regex test.
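
Here is a minimal sketch of the negated search, assuming GNU find and made-up file names. It also runs the -not spelling to show that both forms select the same files.

```shell
# Throwaway directory: two archives and two other files.
dir=$(mktemp -d)
touch "$dir/a.zip" "$dir/b.rar" "$dir/c.txt" "$dir/d.md"

# Positive match, for reference.
archives=$(cd "$dir" && find . -type f -regextype posix-extended \
  -regex '.*\.(zip|rar)' | LC_ALL=C sort)

# Negated with "!": everything that is a file but not an archive.
with_bang=$(cd "$dir" && find . -type f -regextype posix-extended \
  ! -regex '.*\.(zip|rar)' | LC_ALL=C sort)

# Same search with -not instead of "!".
with_not=$(cd "$dir" && find . -type f -regextype posix-extended \
  -not -regex '.*\.(zip|rar)' | LC_ALL=C sort)

printf 'archives:\n%s\nothers:\n%s\n' "$archives" "$with_bang"

rm -rf "$dir"
```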

4.3. Finding Files with a Specific Number of Characters

Suppose you have a directory with a lot of files, and you want to catch only the ones whose names are 5 to 9 letters long.

If you try to achieve this with shell globbing, it'll only be a time-consuming and painful task, especially when there's a more efficient alternative. Let's spare ourselves the hassle and check how this can be simple with regex:

$ find . -type f -regextype posix-extended -iregex '\./[a-z]{5,9}'

The new thing above is the {5,9} expression, which is a count modifier just like the asterisk character (*). The only difference is that {5,9} is specific: it matches the preceding element, here the [a-z] set, 5 to 9 times.

Moreover, we took advantage of the -iregex test instead of -regex to avoid spelling out the uppercase range in the pattern. If we used -regex, we'd have to write the character set as [a-zA-Z] (with no comma, since a comma inside the brackets would be treated as a literal character to match).
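
The sketch below, assuming GNU find and a handful of made-up file names, compares the -iregex form with the explicit [a-zA-Z] set, and shows what happens when a comma slips into the brackets.

```shell
# Throwaway directory with names of various lengths and contents.
dir=$(mktemp -d)
touch "$dir/abc" "$dir/Hello" "$dir/terminal" "$dir/abcdefghij" \
      "$dir/12345" "$dir/a,b,c"

# Case-insensitive test with a lowercase-only set.
with_iregex=$(cd "$dir" && find . -type f -regextype posix-extended \
  -iregex '\./[a-z]{5,9}' | LC_ALL=C sort)

# Case-sensitive test with both ranges spelled out.
with_regex=$(cd "$dir" && find . -type f -regextype posix-extended \
  -regex '\./[a-zA-Z]{5,9}' | LC_ALL=C sort)

# Pitfall: a comma inside the brackets is a literal character, so "a,b,c" matches too.
with_comma=$(cd "$dir" && find . -type f -regextype posix-extended \
  -regex '\./[a-z,A-Z]{5,9}' | LC_ALL=C sort)

printf 'iregex:\n%s\nwith comma:\n%s\n' "$with_iregex" "$with_comma"

rm -rf "$dir"
```

Names that are too short (abc), too long (abcdefghij), or not made of letters (12345) are left out in every case.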
