The join command in Linux is a powerful tool for merging the content of two text files based on a common field. Imagine you have two files containing related data that you want to combine. That’s where the join command comes in.
Think of it as a way to create a new file that combines matching lines from your original files. The join command is often used for managing information like customer lists, inventory data, or system logs.
To use the join command, your files need to share a common field. For example, both files might have an “ID” column or an “email” column. The join command uses this common field as a reference to match up lines.
In this tutorial, you’ll learn how to use the join command for both basic and complex operations.
Basic Usage
Below is the general syntax of the join command:
join [OPTIONS] file1.txt file2.txt
Here’s a breakdown of each item in the syntax:
- join: this is the core command to merge data from two files based on a common field
- [OPTIONS]: this represents various options that adjust the behavior of the join command. Let’s take a look at some of the common ones:
- -a 1 or -a 2: includes unpairable lines from file 1 or file 2 respectively in the output
- -1 field: specifies the join field in the first file (e.g., -1 2 for the second field)
- -2 field: specifies the join field in the second file
- -i: makes the join operation case-insensitive (ignores case when comparing join fields)
- -t char: uses “char” as the field delimiter/separator
- -o format: specifies the format of the output lines
- file1.txt and file2.txt: the names of the two files you want to join
Example
Let’s say we have two text files, file1.txt and file2.txt, each containing data separated by a common delimiter, such as a space or tab. Let’s see the content of both files:
$ cat file1.txt
1 Alice
2 Bob
3 Charlie
$ cat file2.txt
2 Engineer
3 Doctor
4 Artist
Let’s now use the join command without options or flags:
$ join file1.txt file2.txt
Let’s examine a breakdown of the command:
- join: this is the command used to join lines of two files on a common field
- file1.txt and file2.txt: these are the input files to be joined. The command assumes that both files are sorted on the fields to be joined
- By default, join merges the lines from both files based on the first field (assuming both files are sorted on it)
Output:
2 Bob Engineer
3 Charlie Doctor
Here’s what happened in the output:
- 2 Bob Engineer: this line appears because the join field (the number 2) is common to file1.txt and file2.txt. It combines the matching lines from both files
- 3 Charlie Doctor: appears because the join field (number 3) is common to both files
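If you want to follow along, the two sample files can be created and joined in one short script:

```shell
# Create the sample files shown above
printf '1 Alice\n2 Bob\n3 Charlie\n' > file1.txt
printf '2 Engineer\n3 Doctor\n4 Artist\n' > file2.txt

# Join on the first (default) field of each file;
# both files are already sorted on it
join file1.txt file2.txt
```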
Printing Unpairable Lines
By default, the join command only displays lines that have matching fields in both input files. However, you can use the -a option to include lines that don’t have a match in the other file.
Syntax and Explanation:
- -a 1 includes unpairable lines from the first file
- -a 2 includes unpairable lines from the second file
Example
Let’s say we want to print the unpairable lines in our files file1.txt and file2.txt above:
1. Unpairable lines from file1.txt
$ join -a 1 file1.txt file2.txt
1 Alice
2 Bob Engineer
3 Charlie Doctor
Notice that 1 Alice is included since it doesn’t have a match in file2.txt.
2. Unpairable lines from file2.txt
$ join -a 2 file1.txt file2.txt
2 Bob Engineer
3 Charlie Doctor
4 Artist
Here, 4 Artist is included as it exists only in file2.txt.
As the examples above show, the -a option lets you surface the data unique to each file. This is useful for comparing files and highlighting differences.
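The two flags can also be combined. Passing both -a 1 and -a 2 keeps unpairable lines from each side, which behaves like a full outer join:

```shell
# Recreate the sample files from the example above
printf '1 Alice\n2 Bob\n3 Charlie\n' > file1.txt
printf '2 Engineer\n3 Doctor\n4 Artist\n' > file2.txt

# Keep unpairable lines from BOTH files
join -a 1 -a 2 file1.txt file2.txt
```

This prints every line from both files: 1 Alice and 4 Artist appear unpaired, while keys 2 and 3 are merged.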
Specifying Custom Join Fields
Normally, the join command uses the first field of each file to find matches. However, the -1 and -2 options let you control exactly which fields are used for comparison:
- -1 FIELD_NUMBER specifies the join field for the first file (file1.txt)
- -2 FIELD_NUMBER specifies the join field for the second file (file2.txt)
Example
Let’s assume you have two space-delimited files file3.txt and file4.txt:
$ cat file3.txt
1 Alice
2 Bob
3 Charlie
$ cat file4.txt
Alice 20
Bob 25
Charlie 30
If you want to join based on the names (which sit in different columns in the two files), you’d use:
$ join -1 2 -2 1 file3.txt file4.txt
Alice 1 20
Bob 2 25
Charlie 3 30
Here’s a breakdown of the command:
- -1 2: join file3.txt using its second field (the name column)
- -2 1: join file4.txt using its first field (also the name column)
This offers you precise control over how your data is merged. It’s helpful in situations where the matching information isn’t located in the first column of both files.
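To reproduce this yourself, the files and the custom-field join can be run as one short script (same filenames as above):

```shell
# Recreate the sample files
printf '1 Alice\n2 Bob\n3 Charlie\n' > file3.txt
printf 'Alice 20\nBob 25\nCharlie 30\n' > file4.txt

# Join on field 2 of file3.txt and field 1 of file4.txt;
# note that the join field (the name) is printed first
join -1 2 -2 1 file3.txt file4.txt
```

Note that each file must be sorted on its own join field, which both already are here.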
Making the join Command Case-Insensitive
The join command in Linux is case-sensitive by default. This means "Alice@example.com" and "alice@example.com" would not be considered a match. For scenarios where you need to ignore case differences, use the -i option.
Example
Let’s create two new files file5.txt and file6.txt, this time using names as the join field:
$ cat file5.txt
Alice 1
Bob 2
Charlie 3
$ cat file6.txt
alice Engineer
bob Doctor
charlie Artist
Note the lowercase names in file6.txt. A plain join would find no matches, since “Alice” and “alice” are treated as different keys. To combine the data anyway, use:
$ join -i file5.txt file6.txt
Alice 1 Engineer
Bob 2 Doctor
Charlie 3 Artist
- The -i option tells the join command to ignore case when comparing join fields
- Despite the capitalization differences, the names are matched and the lines joined
You may wonder why this is important. Real-world data often has inconsistent capitalization, especially names or email addresses entered by users. The -i option helps you overcome these inconsistencies and merge your data accurately.
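Here is a minimal, self-contained demonstration using hypothetical name-keyed files (names.txt and roles.txt are illustrative filenames, not from the examples above):

```shell
# Hypothetical files keyed on a name column with inconsistent case
printf 'Alice 1\nBob 2\nCharlie 3\n' > names.txt
printf 'alice Engineer\nbob Doctor\ncharlie Artist\n' > roles.txt

# Without -i, "Alice" and "alice" never match, so this prints nothing
join names.txt roles.txt

# With -i, case is ignored and all three lines pair up
join -i names.txt roles.txt
```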
Understanding Sorted Input and the --nocheck-order Option
The join command in Linux works best when your input files are sorted on the common field. Sorted input lets join use a fast, single-pass merge algorithm to find matches.
But what if your files aren’t sorted, or you want join to stop warning you about their order? That’s where --nocheck-order comes in. Let’s now take a look at how it works:
$ join --nocheck-order file1 file2
- --nocheck-order: tells the join command: "Don't worry about whether the files are sorted. Just try to merge the data as best you can."
When should you use --nocheck-order?
- Unsorted Files: If your data isn't sorted, and you don't want to take the time to sort it, --nocheck-order lets you proceed
- Performance: if you know your files are already sorted, this option can marginally speed up the join by skipping the ordering check
Example
Let’s create two files file7.txt and file8.txt with the content below:
$ cat file7.txt
3 Charlie
1 Alice
2 Bob
$ cat file8.txt
2 Engineer
3 Doctor
1 Artist
Without --nocheck-order, the join command will produce the error message below due to the unsorted data.
$ join file7.txt file8.txt
join: file7.txt:2: is not sorted: 1 Alice
join: file8.txt:3: is not sorted: 1 Artist
3 Charlie Doctor
1 Alice Artist
join: input is not in sorted order
But with the --nocheck-order option, the warnings disappear and you get only the joined lines:
$ join --nocheck-order file7.txt file8.txt
3 Charlie Doctor
1 Alice Artist
Notice that 2 Bob Engineer never appears. Because the files are unsorted, join’s single-pass merge has already consumed the key-2 line of file8.txt by the time it reads 2 Bob from file7.txt, so that pair is silently dropped.
However, it’s important to note that:
- You should use --nocheck-order with care. If your files are truly unsorted, matches can be silently lost and the output may be incomplete or incorrect
- Sorting files beforehand (using the sort command) is generally the safest way to ensure reliable results with the join command
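The safe pattern looks like this: sort each file on the join field first, then join the sorted copies (the *_sorted filenames are arbitrary):

```shell
# Recreate the unsorted files from the example
printf '3 Charlie\n1 Alice\n2 Bob\n' > file7.txt
printf '2 Engineer\n3 Doctor\n1 Artist\n' > file8.txt

# Sort each file on its first field, then join the sorted copies
sort -k1,1 file7.txt > file7_sorted.txt
sort -k1,1 file8.txt > file8_sorted.txt
join file7_sorted.txt file8_sorted.txt
```

In bash you can do the same inline with process substitution: join <(sort file7.txt) <(sort file8.txt). Either way, all three keys now pair up, including 2 Bob Engineer.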
Unlocking More Power with the join Command: Advanced Scenarios
The join command isn’t only for simple file merges. Let’s find out how you can use it to gain precise control over outputs and even combine multiple files in a single operation.
Example 1: Combining Customer Data with Orders
Imagine you work with a system where customer information and their orders are stored in separate files. To analyze this data effectively, you’ll need to bring it together. Here’s how the join command, along with some data preparation, accomplishes this:
Step 1: Prepare the Files
$ cat customers.txt
CustomerID FirstName LastName Email
1 Alice Smith alice@example.com
2 Bob Johnson bob@example.com
3 Charlie Williams charlie@example.com
$ cat orders.txt
1 2024-03-12 50.99
3 2024-03-18 125.00
2 2024-03-05 80.50
The join command generally expects your files to be sorted by the field you want to join on (in this case, CustomerID). We use the sort command to achieve this:
$ sort -t ' ' -k1,1 customers.txt > customers_sorted.txt
$ sort -k1,1 orders.txt > orders_sorted.txt
- -t ' ': specifies that the files are space-separated
- -k1,1: tells sort to focus on the first column for sorting
Step 2: Execute the join Command
Now we have customers_sorted.txt and orders_sorted.txt ready to be combined:
$ join -t ' ' -1 1 -2 1 customers_sorted.txt orders_sorted.txt
- -t ' ': specifies a space as the delimiter since both files are space-separated
- -1 1: tells join to use the first column of customers_sorted.txt (CustomerID) as the join field
- -2 1: tells join to use the first column of orders_sorted.txt (also the CustomerID) as the join field
The above command produces the following:
1 Alice Smith alice@example.com 2024-03-12 50.99
2 Bob Johnson bob@example.com 2024-03-05 80.50
3 Charlie Williams charlie@example.com 2024-03-18 125.00
(The header line sorts to the end of customers_sorted.txt and, having no matching CustomerID in orders_sorted.txt, drops out of the joined output.)
Each line now contains customer information followed by their corresponding order details.
Example 2: Customizing Output with the -o Option
The -o option lets you control the merged output format. Let’s say you only want to display customer names and the order amount. Here’s how to modify the join command:
$ join -t ' ' -1 1 -2 1 -o 1.2,1.3,2.3 customers_sorted.txt orders_sorted.txt
Let’s now break down what the options in the above command mean:
- -o 1.2,1.3,2.3:
- 1.2: FirstName from 'customers_sorted.txt'
- 1.3: LastName from 'customers_sorted.txt'
- 2.3: Order Amount from 'orders_sorted.txt'
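Here’s a self-contained sketch of the -o example. The email addresses are placeholders and the header row is omitted for brevity:

```shell
# Minimal sorted stand-ins for the two files (placeholder emails)
printf '1 Alice Smith alice@example.com\n2 Bob Johnson bob@example.com\n3 Charlie Williams charlie@example.com\n' > customers_sorted.txt
printf '1 2024-03-12 50.99\n2 2024-03-05 80.50\n3 2024-03-18 125.00\n' > orders_sorted.txt

# Print only FirstName (1.2), LastName (1.3), and order amount (2.3)
join -t ' ' -1 1 -2 1 -o 1.2,1.3,2.3 customers_sorted.txt orders_sorted.txt
```

Each output line now contains just the name and the amount, e.g. Alice Smith 50.99.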
Advanced Usage: Joining Multiple Files
Let’s assume we want to analyze website traffic logs alongside user location data, showcasing some more powerful features of the join command. We have the following files:
$ cat web_traffic.txt
192.0.2.1 2024-03-20 /products/gadget-x
192.0.2.2 2024-03-19 /about
192.0.2.1 2024-03-21 /products/super-tool
192.0.2.3 2024-03-22 /contact
$ cat ip_locations.csv
IP_Address,Region
192.0.2.1,East
192.0.2.2,West
192.0.2.3,North
The goal is to generate a report showing how many visits came from each region. This could help target marketing or understand user demographics.
Solution
1. Preprocess web_traffic.txt to extract only the IP addresses:
$ cut -d ' ' -f1 web_traffic.txt > ip_addresses.txt
- cut -d ' ' -f1: cuts the first field (IP address) using a space delimiter
2. Sort both files by IP address:
$ sort -k1,1 ip_addresses.txt > ip_addresses_sorted.txt
$ sort -t, -k1,1 ip_locations.csv > ip_locations_sorted.csv
3. Join and count: we’ll join the two files, then use awk, sort, and uniq to count region occurrences:
$ join -t, -1 1 -2 1 ip_addresses_sorted.txt ip_locations_sorted.csv | awk -F',' '{print $2}' | sort | uniq -c
- join part: joins the files on the IP address
- awk -F',' '{print $2}': extracts the region (the second field in the joined output)
- sort | uniq -c: sorts the regions and counts occurrences of each
Output:
2 East
1 North
1 West
Let’s examine the breakdown of the above command and output:
- Preprocessing: cut isolates the relevant data before the join
- Sorting: as always, sorting is crucial for join to produce correct and complete results
- Beyond Basic join: we pipe the join output to awk and other commands for further processing. This demonstrates the power of combining join with Linux's text manipulation tools
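The whole pipeline can be tried end to end with placeholder documentation IPs (192.0.2.x, which stand in for real visitor addresses):

```shell
# Use byte-wise collation so sort and join agree on ordering
export LC_ALL=C

# Placeholder traffic log; 192.0.2.1 (East) appears twice
printf '192.0.2.1 2024-03-20 /products/gadget-x\n192.0.2.2 2024-03-19 /about\n192.0.2.1 2024-03-21 /products/super-tool\n192.0.2.3 2024-03-22 /contact\n' > web_traffic.txt
printf 'IP_Address,Region\n192.0.2.1,East\n192.0.2.2,West\n192.0.2.3,North\n' > ip_locations.csv

# Extract and sort the IPs, sort the location table, then join and count
cut -d ' ' -f1 web_traffic.txt | sort > ip_addresses_sorted.txt
sort -t, -k1,1 ip_locations.csv > ip_locations_sorted.csv
join -t, -1 1 -2 1 ip_addresses_sorted.txt ip_locations_sorted.csv \
  | awk -F',' '{print $2}' | sort | uniq -c
```

The duplicated IP yields a count of 2 for East, matching the report above; the CSV header line never joins because no traffic line carries the key IP_Address.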