Introduction
In this guide, we learn how to use the wget command for recursive downloads in Linux, focusing on its recursive capabilities and directory structure handling.
Wget Recursive Downloading: Basics
Wget's recursive mode downloads the contents of a specified URL, including all of its subdirectories. It works by fetching the specified URL, parsing it for links, and repeating this process until it has either downloaded the entire site or reached the maximum recursion depth. By default, the depth is 5.
Use the -r or --recursive option with wget for recursive downloads.
Example:
wget --recursive https://jsonplaceholder.typicode.com/guide/
This command starts downloading from the root page of https://jsonplaceholder.typicode.com/guide/, then downloads each page linked from there, each page linked from those pages, and so on, recursively.
Remember that some websites block mass downloads or web scraping. Since wget honors robots.txt, anything excluded there won't be downloaded. You can use the -e robots=off option to tell wget to ignore the robots.txt file.
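For instance, to run a recursive download that ignores robots.txt exclusions (do this only where the site's terms of use permit it):

```shell
# Recursive download that bypasses robots.txt exclusions.
# -e executes a .wgetrc-style command before the download starts.
wget -e robots=off --recursive https://jsonplaceholder.typicode.com/guide/
```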
If you don't want wget to ascend to the parent directory, use the -np or --no-parent option. This instructs wget not to ascend to the parent directory when it hits references like ../ in href links.
Example:
wget --recursive --no-parent https://jsonplaceholder.typicode.com/guide/
The parent directory jsonplaceholder.typicode.com/ is ignored and only the /guide directory is parsed.
There are two options in wget that are particularly useful when you want to view a downloaded webpage the same way it appears online: --page-requisites and --convert-links. The --page-requisites option tells wget to download all the files that are necessary to properly display a given HTML page (such as inline images, sounds, and referenced stylesheets), while the --convert-links option converts links in the downloaded HTML files to point to local files.
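Combining both options with a recursive download gives a browsable offline copy. For example, using the same URL as the earlier examples:

```shell
# Download the /guide section with everything needed to render it
# (images, CSS, etc.) and rewrite links for offline browsing.
wget --recursive --no-parent --page-requisites --convert-links \
     https://jsonplaceholder.typicode.com/guide/
```

After the download finishes, open the local copy of index.html in a browser; its links now point to the downloaded files instead of the live site.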
Limiting Recursion Depth
As mentioned in the previous section, the default recursion depth of wget is 5. You may wish to limit the depth of recursion to prevent downloading more data than necessary. Use the -l option (or --level) followed by the maximum desired depth to set a limit on the recursion.
Example:
wget --recursive --level=3 https://jsonplaceholder.typicode.com/posts/1/comments
In some cases we don't know the exact depth and want to download the whole website; in that case, we can use -l inf for infinite recursion.
Example:
wget --recursive -l inf https://jsonplaceholder.typicode.com/
This command will start from the given URL and download the entire site, regardless of how many levels deep the site goes.
Note: Specifying a level of 0 is equivalent to the infinite option.
Controlling Directory Structure
When you use wget to download a file, by default it creates a directory in the current directory named after the hostname of the URL and downloads files into it. For example, if you use wget to download files from https://random-data-api.com/api/v2/appliances, wget will create a directory named random-data-api.com in the current folder. We can use the -nH or --no-host-directories option to prevent creating such a directory.
Example:
wget --recursive --no-host-directories https://random-data-api.com/api/v2/appliances
This command will not create the random-data-api.com directory, and instead it will directly create the api directory in your current directory and download the file into that.
We can use the --cut-dirs option in conjunction with --no-host-directories to ignore a specified number of directory components.
Example:
wget --recursive --no-host-directories --cut-dirs=1 https://random-data-api.com/api/v2/appliances
This would only create the v2 directory in the current directory and place the downloaded file there, because --cut-dirs=1 tells wget to ignore the first directory component, 'api'.
Managing File Types in Recursive Downloading
You can tell wget to download only specific file types using the -A or --accept option, which takes a comma-separated list of file types (extensions) that wget should download.
Example:
wget --recursive --no-host-directories -A "svg,html" https://jsonplaceholder.typicode.com
This command downloads only svg and html files.
In the same way, you can tell wget to reject specific file types using the -R or --reject option. Example:
wget --recursive --no-host-directories -R "index.html*,*.svg" https://jsonplaceholder.typicode.com
This command tells wget to skip files starting with index.html and all .svg files, and download the rest.
Wget Options in Conjunction with Recursion
The following table shows some useful wget options you can use with the -r or --recursive option.
| Option | Description |
|---|---|
| -e robots=off | Ignore directives in the robots.txt file. |
| -np or --no-parent | Do not ascend to the parent directory when retrieving files. |
| --page-requisites | Download all resources needed to display a page, including stylesheets, images, sounds, and scripts. |
| --convert-links | Convert links in downloaded files to point to local files for offline viewing. |
| -nH or --no-host-directories | Do not create a top-level directory named after the host. |
| --cut-dirs=number | Ignore the specified number of directory components when building the local directory tree. |
| -A or --accept | Download only the specified file types. |
| -R or --reject | Do not download the specified file types. |
| -H or --span-hosts | Allow wget to follow links to foreign hosts. |
| --domains=list | Restrict link-following to the specified comma-separated list of domains. |
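As a sketch combining several of the options above (all flags are as documented; adjust the depth and URL to your needs):

```shell
# Fetch an offline-browsable copy of the /guide section only:
# limit recursion depth, stay below the parent directory, skip the
# host-named top-level directory, and rewrite links for local viewing.
wget --recursive --level=2 --no-parent \
     --page-requisites --convert-links \
     --no-host-directories \
     https://jsonplaceholder.typicode.com/guide/
```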