Join Command in Linux with Examples

Last updated: April 15, 2024 | Emmanuel Oyibo

The join command in Linux is a powerful tool for merging the content of two text files based on a common field. Imagine you have two files containing related data that you want to combine. That’s where the join command comes in.

Think of it as a way to create a new file that combines matching lines from your original file. The join command is often used for managing information like customer lists, inventory data, or system logs.

To use the join command, your files need to have a common field. For example, both files might have an “ID” column or an “email” address column. The join command uses this common field as reference to match up lines.

In this tutorial, you’ll learn how to use the join command for both basic and complex operations.

Basic Usage

Below is the general syntax of the join command:

join [OPTIONS] file1.txt file2.txt

Here’s a breakdown of each item in the syntax:

  • join: this is the core command to merge data from two files based on a common field
  • [OPTIONS]: this represents various options that adjust the behavior of the join command. Let’s take a look at some of the common ones:
    • -a 1 or -a 2: includes unpairable lines from file 1 or file 2 respectively in the output
    • -1 field: specifies the join field in the first file (e.g., -1 2 for the second field)
    • -2 field: specifies the join field in the second file
    • -i: makes the join operation case-sensitive
    • -t char: uses “char” as the field delimiter/separator
    • -o format: specifies the format of the output lines
  • file1.txt and  file2.txt: the names of the two files you want to join

Example

Let’s say we have two text files –– file1.txt and file2.txt, each containing data separated by a common delimiter, such as a space or tab. Let’s see the content of both files:

$ cat file1.txt

1 Alice
2 Bob
3 Charlie
$ cat file2.txt

2 Engineer
3 Doctor
4 Artist

Let’s now use the join command without options or flags:

$ join file1.txt file2.txt

Let’s examine a breakdown of the command:

  • join: this is the command used to join lines of two files on a common field
  • file1.txt and file2.txt: these are the input files to be joined. The command assumes that both files are sorted on the fields to be joined
  • By default, join merges the lines from both files based on the first field (assuming it’s sorted)

Output:

2 Bob Engineer
3 Charlie Doctor
3 Charlie Doctor

Here’s what happened in the output:

  • 2 Bob Engineer: this line represents where the field is common in file1.txt and file2.txt (in this case, the number 2). It combines the respective lines from both files
  • 3 Charlie Doctor: represents where the field is common in both files (number 3 in this case)

Printing Unpairable Lines

By default, the join command only displays lines that have matching fields in both input files. However, you can use the -a option to include lines that don’t have a match in the other file.

Syntax and Explanation:

  • -a 1 includes unpairable lines from the first file
  • -a 2 includes unpairable lines from the second file

Example

Let’s say we want to print the unpairable files in our files file1.txt and file2.txt  above:

1. Unpairable lines from file1.txt

$ join -a 1 file1.txt file2.txt

1 Alice
2 Bob Engineer
3 Charlie Doctor

Notice that 1 Alice is included since it doesn’t have a match in file2.txt.

2. Unpairable lines from file2.txt

$ join -a 2 file1.txt file2.txt

2 Bob Engineer
3 Charlie Doctor
4 Artist

Here, 4 Artist is included as it exists only in file2.txt.

From the above examples, it’s essential to note that the -a option lets you find unique data to each file. This is useful for comparing and highlighting differences.

Specifying Custom Join Fields

Normally, the join command uses the first field of each file to find matches. However, the -1 and -2 options let you control exactly which fields are used for comparison:

  • -1 FIELD_NUMBER specifies the join field for the first file (file1.txt)
  • -2 FIELD_NUMBER specifies the join field for the second file (file2.txt)

Example

Let’s assume you have two space-delimited files file3.txt and file4.txt:

$ cat file3.txt

1 Alice
2 Bob
3 Charlie
$ cat file4.txt
Alice 20
Bob 25
Charlie 30

If you want to join based on the names (which are different in columns), you’d use:

$ join -1 2 -2 1 file3.txt file4.txt

Alice 1 20
Bob 2 25
Charlie 3 30

Here the breakdown of the command:

  • -1 2: join file3.txt using its second field (the name column)
  • -2 1: join file4.txt using its first field (also the name column)

This offers you precise control over how your data is merged. It’s helpful in situations where the matching information isn’t located in the first column of both files.

Making the join Command Case-Insensitive

The join command in Linux is case-sensitive by default. This means "[email protected]" and "[email protected]" would not be considered a match. For scenarios where you need to ignore case differences, use the -i option.

Example

Let’s create two new files file5.txt and file6.txt with the content below:

$ cat file5.txt

1 Alice
2 Bob
3 Charlie
$ cat file6.txt

2 Engineer
3 doctor
4 Artist

Note the inconsistent capitalization of “doctor” in file6.txt. To combine the data, use:

$ join -i file5.txt file6.txt
2 Bob Engineer
3 Charlie doctor
  • The -i option tells the join command to ignore case when comparing fields
  • Despite the capitalization difference, the lines with the key “3” are successfully matched and joined

You may wonder why this is important. Real-world data often has inconsistent capitalization, especially names or email addresses entered by users. The -i option helps you overcome these inconsistencies and merge your data accurately.

Understanding Sorted Input and the --nocheck-order Option

The join command in Linux delivers better when your input files are sorted on a common field. This allows you to use fast algorithms to find matches. 

But what if your files aren’t sorted, or you’re confident that they are even without a check? That’s where --nocheck-order comes in. Let’s now take a look at how it works:

$ join --nocheck-order file1 file2
  • --nocheck-order: tells the join command: "Don't worry about whether the files are sorted. Just try to merge the data as best you can."

When should you use --nocheck-order?

  • Unsorted Files: If your data isn't sorted, and you don't want to take the time to sort it, --nocheck-order lets you proceed
  • Performance: In rare cases, if you know your files are already sorted, this option can slightly speed up the join operation by skipping the initial sorting check

Example

Let’s create two files file7.txt and file8.txt with the content below

$ cat file7.txt

3 Charlie
1 Alice
2 Bob
$ cat file8.txt
2 Engineer
3 Doctor
1 Artist

Without --nocheck-order, the join command will produce the error message below due to the unsorted data.

$ join file7.txt file8.txt

join: file7.txt:2: is not sorted: 1 Alice
join: file8.txt:3: is not sorted: 1 Artist
3 Charlie Doctor
1 Alice Artist
join: input is not in sorted order

But, by using the --nocheck-order option, you'll get the output:

$ join --nocheck-order file7.txt file8.txt

3 Charlie Doctor
1 Alice Artist
2 Bob Engineer

However, it’s important to note that:

  • You should use --nocheck-order with care. If your files are truly unsorted, you might get unexpected or incorrect output

Sorting files beforehand (using the sort command) is generally the safest way to ensure reliable results with the join command

Unlocking More Power with the join Command: Advanced Scenarios

The join command isn’t only for simple file merges. Let’s find out how you can use it to gain precise control over outputs and even combine multiple files in a single operation.

Example 1: Combining Customer Data with Orders

Imagine you work with a system where customer information and their orders are stored in separate files. To analyze this data effectively, you’ll need to bring it together. Here’s how the join command, along with some data preparation, accomplishes this:

Step 1: Prepare the Files

$ cat customers.txt

CustomerID FirstName LastName Email
1 Alice Smith [email protected]
2 Bob Johnson [email protected]
3 Charlie Williams [email protected]
$ cat orders.txt

1 2024-03-12 50.99
3 2024-03-18 125.00
2 2024-03-05 80.50

The join command generally expects your files to be sorted by the field you want to join on (in this case, CustomerID. We use the sort command to achieve this:

$ sort -t ' ' -k1,1 customers.txt > customers_sorted.txt
$ sort -k1,1 orders.txt > orders_sorted.txt
  • -t ‘ ‘: specifies that the files are space-separated
  • -k1,1: tells sort to focus on the first column for sorting

Step 2: Execute the join Command

Now we have customers_sorted.txt and orders_sorted.txt ready to be combined:

$ join -t ' ' -1 1 -2 1 customers_sorted.txt orders_sorted.txt
  • -t ' ': specifies a space as the delimiter since both files are space-separated
  • -1 1: tells join to use the first column of customers_sorted.txt (CustomerID) as the join field
  • -2 1: tells join to use the first column of orders_sorted.txt (also the CustomerID) as the join field

The above command produces the following:

1 Alice Smith [email protected] 2024-03-12 50.99
2 Bob Johnson [email protected] 2024-03-05 80.50
3 Charlie Williams [email protected] 2024-03-18 125.00

Each line now contains customer information followed by their corresponding order details.

Example 2: Customizing Output with the -o Option 

The -o option lets you control the merged output format. Let’s say you only want to display customer names and the total amount they’ve spent. Here’s how to modify the join  command:

$ join -t ' ' -1 1 -2 1 -o 1.2,1.3,2.3 customers_sorted.txt orders_sorted.txt

Let’s now breakdown what the options in the above command mean:

  • -o 1.2,1.3,2.3:
    • 1.2: FirstName from 'customers_sorted.txt'
    • 1.3: LastName from 'customers_sorted.txt'
    • 2.3: Order Amount from 'orders_sorted.txt'

Advanced Usage: Joining Multiple Files

Let’s assume we want to analyze website traffic logs alongside user location data, showcasing some more powerful features of the join command. We have the following files:

$ cat web_traffic.txt

IP_Address 2024-03-20 /products/gadget-x
IP_Address 2024-03-19 /about
IP_Address 2024-03-21 /products/super-tool
$ cat ip_locations.csv

IP_Address,Region
<IP_Address>,East
<IP_Address>,West
<IP_Address>,North

The goal is to generate a report showing the region from which most website sections were accessed. This could help target marketing or understand user demographics.

Solution

1. Preprocess web_traffic.txt  to extract only the IP addresses:

$ cut -d ' ' -f1 web_traffic.txt > ip_addresses.txt
  • cut -d ' ' -f1: cuts the first field (IP address) using a space delimiter

2. Sort both files by IP address:

$ sort -k1,1 ip_addresses.txt > ip_addresses_sorted.txt
$ sort -t, -k1,1 ip_locations.csv > ip_locations_sorted.csv

3. join with Calculations: We'll join and then use awk to count region occurrences:

$ join -t, -1 1 -2 1 ip_addresses_sorted.txt ip_locations_sorted.csv | awk -F',' '{print $2}' | sort | uniq -c
  • join Part: joins on the IP address.
  • awk -F',' '{print $2}': extracts the region (second field in the joined output)
  • sort | uniq -c: Sorts regions and counts unique occurrences

Output:

 2 East 
 1 North
 1 West

Let’s examine the breakdown of the above command and output:

  • Preprocessing: cut isolates the relevant data before the join
  • Sorting: as always, sorting is crucial for join to work efficiently
  • Beyond Basic join: we pipe the join output to awk and other commands for further processing. This demonstrates the power of combining join with Linux's text manipulation tools
SHARE

Comments

Please add comments below to provide the author your ideas, appreciation and feedback.

Leave a Reply

Leave a Comment