Grep and Regex: A Deep Dive into Text Searching

Last updated: July 31, 2023 | Nitish Singh

grep is a text-processing tool. It stands for Global Expression Print. As a Linux user, you can use grep to search files and directories. You can use it to match patterns and print output to the terminal.

However, grep with regex (regular expression) can change how you approach pattern searching and matching. That’s because regex is an advanced output-filtering tool that lets you input a sequence of characters for matching.

Regex also works with other Linux commands, such as awk and sed commands, making it an ideal search choice.

In this guide, we’ll look closer at grep and regex. We’ll use both of them to do advanced text searching and filtering.

Pre-requisite to follow the tutorial

To follow the tutorial, you’ll need to have the following:

grep syntax

The syntax of grep is as below.

grep -options [regex] [file]

Also, you’ll need to know about three regex syntax options. These include:

  • Basic Regular Expressions (BRE)
  • Extended Regular Expressions (ERE)
  • Pearl Compatible Regular Expressions (PCRE)

Notes:

  • When you use grep, it uses BRE as default.
  • When working with basic regular expressions, the metacharacters such as + , | , ? , { , ( , and ) are interpreted as they are. So, if you want to use the characters, you’ll need to use the backslash (\) as an escape character.

Grep and Regex in Action

Now that we have a complete grip on grep, it’s time to see grep and regex in action. In this section, we’ll learn about them with examples.

Let’s get started.

1. Do a basic grep word search, doing literal matches

A word is also a regex. So, let’s do a basic grep search with it. 

The syntax to do so is:

grep [regex] [filename]
$ grep fruit fruits.txt
The grep command searched for the word 'fruit' in the 'fruits.txt' file

Let’s take inside the .bashrc file with grep.

$ grep if .bashrc
The grep command searched for the occurrences of the word 'if' in the .bashrc file

And, if you want to show the current users by including lines that contain the “bash” word in the /etc/passwd file, run the following grep command.

$ grep bash /etc/passwd
The grep command searched for the word 'bash' in the '/etc/passwd' file

These examples show the most basic use of the grep command. Here, it searches for literal characters or sets of character.

2. Anchor Matching

Anchor matching is a valuable technique. Here, you use anchors (meta-characters) to refine your search. These anchors include:

  • ^ caret sign: The caret (^) sign looks for an empty string that is present at the beginning of a line.
  • $ dollar sign: The dollar ($) sign does the opposite of the caret symbol and looks for an empty string at the end of a line. 

Let’s see each of them in action to understand their use-case better.

You can use the ^ sign to search for lines that start with a specific string. So, if you want to search for lines that start with berries, you’ll need to write the command below.

$ grep ‘^Moving’’ fruits.txt
grep anchor matching searched for lines in the 'fruits.txt' file that start with the word 'Moving'

Note: It is case-sensitive. So, if you use the ‘^moving’ regex to search, it would return nothing or something else, depending on your text file.

Similarly, you can use the $ symbol to print out lines that end with your preferred string value.

So, if you want to return a line that ends the word ”digestion.” (yes, there is a dot at the end of the string), you’ll need to run the following command.

$ grep ‘digestion.$’ fruits.txt
grep anchor matching example command searched for lines in the 'fruits.txt' file that end with the word 'digestion'

You can combine both to find sentences that start and end with the same word.

$ grep ‘^Moving$’ fruits.txt

3. Using Character Classes and bracket expressions

Character Classes in regex are a set of characters. It is represented within square brackets. The LC_TYPE locale determines the character class interpretation. These character classes include:

  • [:alnum:] -  This character class consists of alphanumeric characters.
  • [:alpha:] - This character class comprises alphabetic characters.
  • [:blank:] - This character consists of blank characters, usually tab and space.
  • [:cntrl:] - This character class consists of control characters.
  • [:digit:] - It consists of digits 0 to 9
  • [:graph:] - This character class comprises graphical characters. It includes [:alnum:] and [:punct:] - (punctuation characters).
  • [:lower:] - It consists of lowercase letters.
  • [:upper:] - This character class consists of upper-case letters.
  • [:print:] - This class comprises printable characters, including [:alnum:], [:punct:], and space.
  • [:xdigit:] - It comprises hexadecimal digits. For example, (0 1 2 3 4 5 6 7 8 9  A  B C D E F a b c d e f).

Let’s see an example to understand it.

Let’s search for all characters that start with upper-case. 

$ grep [[:upper:]] fruits.txt
Using regex Character Classes grep command searched for uppercase letters in the 'fruits.txt' file

Now, let’s search for digits within the file.

$ grep [[:digit:]] fruits.txt
Using regex character Classes grep command searched for digits in the 'fruits.txt' file

If you see the screenshot above, we also have alphanumeric characters; let’s search for them with the [[:alpha:]] class. It’ll also list alpha characters in different colors.

$ grep [[:alpha:]] fruits.txt
Using regex character classes the grep command results of the search for alphabetic characters in the 'fruits.txt' file

We also have bracket expressions. These are used to match characters or a group of characters.

So, if you want to search for lines that contain two words, accept and accent, you’ll need to run the following command.

$ grep ‘acce[np]t’ fruits.txt

4. Grouping

Grouping in regex is a way to group multiple patterns in one. Here, we’ll need to use parentheses() metacharacter so, if you want to create a group such as (mon) that includes the letters “m,”o,”n” -- treating multiple characters as a single unit. 

You can use the group by escaping it in a basic or extended regular expression.

Let’s see an example to understand it.

$ grep 'script\(ing\)\?' linux.txt
Screenshot of terminal displaying the output of the 'grep' command searching for the regex pattern 'script(ing)?' in the 'linux.txt' file

Let’s see other examples in action.

Matching repeated patterns

You can use grouping to match repeated patterns, for example.

$ echo ‘aaa bbb ccc abc cba’ | grep '\(a\+\)'

This will return any instances that contain one or more “a” character occurrences.

grep regex Matching repeated patterns identifies and highlights consecutive occurrences of the letter 'a' in the text

Capturing groups

If you want to capture groups rather than a character instance, you’ll need to use a different regex approach.

$ echo 'abc123 ab123c abc12345' | grep "\(abc\)\(123\)"

As you can see, it searches for grouped characters. 

prints the string "abc123 ab123c abc12345" and then pipes it into grep with the regular expression grouping

Alternatives within a group

Using grouping in regex, you can also search for alternatives. This way, you can search for multiple groups of characters throughout the text.

$ echo "apple orange banana" | grep "\(apple\|orange\)"
grep regex multiple groups of characters throughout the text.

This command matches and prints "apple" or "orange" from the input. The output will be "apple" and "orange."

Non-capturing groups

Lastly, you can use non-capturing groups with ?:

You can put the ?: within the parentheses, creating a non-capturing group.

$ echo "hello" | grep "\(?:he\)llo"

It still groups the pattern but doesn't capture it as a separate group. So in this example, it matches and prints "hello."

5. PCRE

PCRE stands for Perl Compatible Regular Expression. It is the most-compatible library written in C. It implements the regular expression engine with PERL programming capabilities.

You can use PCRE to do much more than basic regular expressions. For example, you can use ‘\d’ as a short form of ‘[0-9]’ for digits finding.

Let’s go through the examples below.

Matching an email address

You can use PCRE to match your email address. This is a tricky regex and grep usage, but it gets the job done. So, if you’re writing a script or an app that needs to identify the email address, you can use the following approach.

$ echo "Contact me at [email protected]" | grep -P "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
grep PCRE Matching an email address

This command matches and prints the email address in the given input. It uses a PCRE pattern to match the standard email address format.

Matching a date in YYYY-MM-DD format

Similarly, you can use the PCRE pattern to match a date in YYYY-MM-DD format.

$ echo "The event is scheduled for 2023-05-20" | grep -P "\b\d{4}-\d{2}-\d{2}\b"
grep PCRE Matching a date in YYYY-MM-DD format

This command matches and prints the date in "YYYY-MM-DD" format from the given input. The PCRE pattern matches four digits, followed by a hyphen, two digits, another hyphen, and two more digits.

Using positive lookahead

You can also use the PCRE pattern to look ahead for a word or string and only print if it is found.

$ echo "I love coding and programming" | grep -P "coding(?=\s+and)"

This command matches and prints the word "coding" only if followed by the phrase "and" with one or more whitespace characters in between. The positive lookahead assertion (?=\s+and) ensures that the match is made only if the lookahead condition is met.

6. Extended Regex

Extended regex removes the need to add escape characters in the patterns. This means these are not interpreted as literal characters. 

By default, grep defaults to basic regex. To use extended regex, you must use the -E flag.

$ grep -E 'ap+les' fruits.txt
Screenshot of terminal displaying the output of the 'grep' command using the extended regex pattern 'ap+les' on the file 'fruits.txt'

Searches for lines in the file fruits.txt that contain the pattern 'ap+les', where the letter 'a' is followed by one or more occurrences of the letter 'p', and then the letters 'les'

Similarly, you can use the ^ and $ signs without escaping them.

$ grep ‘^Moving$’ fruits.txt

7. Alternation

Alternation is the simple means to define alternative matches. You’ll need to use the escaped pipe character (\|) to do so.

For example, you can search for the words “warning” and “error” in log files by running the following command.

$ sudo grep -l ‘warning\|error’ /var/log/*.log

Here the alternation operator (|) in the regular expression 'warning\|error' allows for alternative matches, enabling the command to search for either 'warning' or 'error' in the log files.

8. Quantifiers

Next, we have Quantifiers. These refer to metacharacters that specify the number of match appearances. Some of these quantifiers include:

QuantifierDescription
* (asterisk)Zero or more matches
? (question mark)Zero or one match
+ (plus sign)One or   more matches
{n}n matches
{n, }n or more matches
{, p}Up to p matches
{n, p}From n to p matches

Let’s use the + symbol for one or more matches.

$ grep -E 'ap+les' fruits.txt
Command output of grep -E 'ap+les' fruits.txt searches for lines in fruits.txt that contain the pattern 'ap+les' and displays the matching lines

The -E option enables extended regular expressions, allowing the use of the + quantifier to match one or more occurrences of the preceding character or group.

Another example would be to use the * asterisk symbol for zero or more matches.

$ grep a*les fruits.txt

That’s it. We have come to the end of our grep and regex tutorial.

SHARE

Comments

Please add comments below to provide the author your ideas, appreciation and feedback.

Leave a Reply

Leave a Comment