Using Wget to Recursively Download Directories

Written by: Bobbin Zachariah   |   Last updated: July 25, 2023

Introduction

In this guide, we learn how to use the wget command for recursive downloads in Linux, focusing mainly on its recursive capabilities and how it handles directory structures.

Wget Recursive Downloading: Basics

A recursive wget download retrieves the contents of a specified URL, including all of its subdirectories. wget fetches the specified URL, parses it for links, and then repeats the process on each linked page until it has either downloaded the entire site or reached the maximum recursion depth. By default, the depth is 5.

Use the -r or --recursive option with wget to download recursively.

Example:

wget --recursive https://jsonplaceholder.typicode.com/guide/

This command starts downloading from the root page of https://jsonplaceholder.typicode.com/guide/, then downloads each page linked from there, each page linked from those pages, and so on, recursively.

Remember that some websites block mass downloads or web scraping. Because wget honors robots.txt, anything excluded there won't be downloaded. You can use the -e robots=off option to tell wget to ignore the robots.txt file.
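
For example, to repeat the download above while ignoring any robots.txt exclusions:

wget --recursive -e robots=off https://jsonplaceholder.typicode.com/guide/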

If you don't want wget to ascend to the parent directory, use the -np or --no-parent option. This instructs wget not to climb to the parent directory when it encounters references like ../ in href links.

Example:

wget --recursive --no-parent https://jsonplaceholder.typicode.com/guide/

The parent directory jsonplaceholder.typicode.com/ is ignored and only the /guide/ directory is parsed.

There are two options in wget that are particularly useful when you want to view a downloaded webpage the same way it appears online: --page-requisites and --convert-links. The --page-requisites option tells wget to download all the files that are necessary to properly display a given HTML page, such as inlined images, sounds, and referenced stylesheets. The --convert-links option converts links in the downloaded HTML files to point to the local copies.
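
For example, to mirror the guide for offline viewing, with all page assets fetched and links rewritten to the local copies:

wget --recursive --no-parent --page-requisites --convert-links https://jsonplaceholder.typicode.com/guide/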

Limiting Recursion Depth

As mentioned in the previous section, the default recursion depth of wget is 5. You may wish to limit the depth of recursion to prevent downloading more data than necessary. Use the -l (or --level) option followed by the maximum desired depth to set a limit on the recursion.

Example:

wget --recursive --level=3 https://jsonplaceholder.typicode.com/posts/1/comments

In some cases, we don't know the exact depth level and want to download the whole website. In that case, we can use -l inf for infinite recursion.

Example:

wget --recursive -l inf https://jsonplaceholder.typicode.com/

This command will start from the given URL and download the entire site, regardless of how many levels deep the site goes.

Note: Specifying a level of 0 is equivalent to the infinite option.
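
So the following command is equivalent to the -l inf example above:

wget --recursive --level=0 https://jsonplaceholder.typicode.com/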

Controlling Directory Structure

When you use wget to download files, by default it creates a directory named after the hostname of the URL in the current directory and downloads the files into it. For example, if you use wget to download files from https://random-data-api.com/api/v2/appliances, by default wget creates a directory named random-data-api.com in the current folder. We can use the -nH or --no-host-directories option to prevent creating such a directory.

Example:

wget --recursive --no-host-directories https://random-data-api.com/api/v2/appliances

This command will not create the random-data-api.com directory; instead, it creates the api directory directly in your current directory, so the file lands at api/v2/appliances rather than random-data-api.com/api/v2/appliances.

We can use the --cut-dirs option in conjunction with --no-host-directories to ignore a specified number of directory components.

Example:

wget --recursive --no-host-directories --cut-dirs=1 https://random-data-api.com/api/v2/appliances

This creates only the v2 directory in the current directory and places the downloaded file there, because --cut-dirs=1 tells wget to ignore the first directory component, api.
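
Raising the count strips more components. For instance, --cut-dirs=2 drops both api and v2, so the appliances file would land directly in the current directory:

wget --recursive --no-host-directories --cut-dirs=2 https://random-data-api.com/api/v2/appliances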

Managing File Types in Recursive Downloading

You can tell wget to download only specific file types using the -A or --accept option, which takes a comma-separated list of file extensions that wget should download.

Example:

wget --recursive --no-host-directories -A "svg,html" https://jsonplaceholder.typicode.com

This command downloads only .svg and .html files.

In the same way, you can tell wget to reject specific file types using the -R or --reject option. Example:

wget --recursive --no-host-directories -R "index.html*,*.svg" https://jsonplaceholder.typicode.com

This command tells wget to skip files whose names start with index.html, along with all .svg files, and download everything else.

Wget Options in Conjunction with Recursion

The following table shows some useful wget options you can use with the -r or --recursive option.

Option                        Description
-e robots=off                 Ignore directives in the robots.txt file.
-np, --no-parent              Do not ascend to the parent directory when downloading recursively.
--page-requisites             Download all resources needed to display a page: stylesheets, images, sounds, scripts, etc.
--convert-links               Convert links in downloaded files to point to local copies for offline viewing.
-nH, --no-host-directories    Do not create a top-level directory named after the host.
--cut-dirs=number             Ignore the given number of directory components when building local paths.
-A, --accept                  Download only the specified file types.
-R, --reject                  Skip the specified file types.
-H, --span-hosts              Follow links to foreign hosts during recursive retrieval.
--domains=list                Follow links only within the given comma-separated list of domains.
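
The last two options work well together: -H allows wget to leave the starting host, while --domains keeps it from wandering across the entire web. For example, the following command follows links to other hosts, but only hosts under typicode.com:

wget --recursive --span-hosts --domains=typicode.com https://jsonplaceholder.typicode.com/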

About The Author

Bobbin Zachariah

Bobbin Zachariah is an experienced Linux engineer who has supported infrastructure for many companies. He specializes in shell scripting, AWS Cloud, JavaScript, and Node.js. He holds a Master's degree in computer science, is a Red Hat Certified Engineer (RHCE), and writes for Red Hat Enable Sysadmin.
