How to scrape web pages from the command line using htmlq

Web scraping is the process of analyzing the structure of HTML pages and programmatically extracting data from them. In the past we saw how to scrape the web using the Python programming language and the “Beautiful Soup” library; in this tutorial, instead, we see how to perform the same kind of operations using a command line tool written in Rust: htmlq.

In this tutorial you will learn:

  • How to install cargo and htmlq
  • How to add the ~/.cargo/bin directory to PATH
  • How to scrape a page with curl and htmlq
  • How to extract a specific tag
  • How to get the value of a specific tag attribute
  • How to add base URLs to links
  • How to use CSS selectors
  • How to get text between tags

Software requirements and conventions used

Category        Requirements, Conventions or Software Version Used
System          Distribution-independent
Software        curl, cargo, htmlq
Other           None
Conventions     # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
                $ – requires given Linux commands to be executed as a regular non-privileged user

Installation

Htmlq is an application written in Rust, a general-purpose programming language syntactically similar to C++. Cargo is the Rust package manager: it is basically what pip is for Python. In this tutorial we will use Cargo to install the htmlq tool, therefore the first thing we have to do is to install Cargo itself on our system.
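
Before proceeding, we can quickly check whether Cargo is already available by asking it to print its version; if the command below outputs a version string, Cargo is already installed and the distribution-specific steps can be skipped:

$ cargo --version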

Installing cargo

The “cargo” package is available in the repositories of all the most commonly used Linux distributions. To install Cargo on Fedora, for example, we simply use the dnf package manager:

$ sudo dnf install cargo



On Debian and Debian-based distributions, instead, a modern way to perform the installation is to use the apt wrapper, which is designed to provide a more user-friendly interface to commands like apt-get and apt-cache. The command we need to run is the following:

$ sudo apt install cargo

If Arch Linux is our favorite Linux distribution, all we have to do is install the rust package: Cargo is part of it. To accomplish the task, we can use the pacman package manager:

$ sudo pacman -Sy rust

Installing htmlq

Once Cargo is installed, we can use it to install the htmlq tool. We don’t need administrative privileges to perform the operation, since we will install the software only for our user. To install htmlq we run:

$ cargo install htmlq

Binaries installed with cargo are placed in the ~/.cargo/bin directory; therefore, to be able to invoke the tool from the command line without having to specify its full path each time, we need to add the directory to our PATH. In our ~/.bash_profile or ~/.profile file, we add the following line:

export PATH="${PATH}:${HOME}/.cargo/bin"

To make the modification effective we need to log out and log back in, or, as a temporary solution, just re-source the file:

$ source ~/.bash_profile
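
To verify that the htmlq binary is now reachable from our shell, we can ask it to print its version:

$ htmlq --version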



At this point we should be able to invoke htmlq from our terminal. Let’s see some examples of its usage.

Htmlq usage examples

The most common way to use htmlq is to pass it the output of another very commonly used application: curl. For those of you who don’t know it, curl is a tool used to transfer data from or to a server. When run on a web page URL, it returns that page’s source to standard output; all we have to do is to pipe it to htmlq. Let’s see some examples.
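
Since htmlq reads from standard input, by the way, we don’t need to download a page again for every query we want to try: we can save the source locally once (the page.html file name below is arbitrary) and feed it to htmlq via redirection:

$ curl --silent https://www.nytimes.com -o page.html
$ htmlq a < page.html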

Extracting a specific tag

Suppose we want to extract all the links contained in the homepage of “The New York Times” website. We know that in HTML links are created using the a tag, therefore the command we would run is the following:

$ curl --silent https://www.nytimes.com | htmlq a

In the example above, we invoked curl with the --silent option: this is to avoid the application showing the page download progress or other messages we don’t need in this case. With the | pipe operator we used the output produced by curl as the htmlq input, passing the name of the tag we are searching for as argument. Here is the (truncated) result of the command:

[...]
<a class="css-1wjnrbv" href="/section/world">World</a>
<a class="css-1wjnrbv" href="/section/us">U.S.</a>
<a class="css-1wjnrbv" href="/section/politics">Politics</a>
<a class="css-1wjnrbv" href="/section/nyregion">N.Y.</a>
<a class="css-1wjnrbv" href="/section/business">Business</a>
<a class="css-1wjnrbv" href="/section/opinion">Opinion</a>
<a class="css-1wjnrbv" href="/section/technology">Tech</a>
<a class="css-1wjnrbv" href="/section/science">Science</a>
<a class="css-1wjnrbv" href="/section/health">Health</a>
<a class="css-1wjnrbv" href="/section/sports">Sports</a>
<a class="css-1wjnrbv" href="/section/arts">Arts</a>
<a class="css-1wjnrbv" href="/section/books">Books</a>
<a class="css-1wjnrbv" href="/section/style">Style</a>
<a class="css-1wjnrbv" href="/section/food">Food</a>
<a class="css-1wjnrbv" href="/section/travel">Travel</a>
<a class="css-1wjnrbv" href="/section/magazine">Magazine</a>
<a class="css-1wjnrbv" href="/section/t-magazine">T Magazine</a>
<a class="css-1wjnrbv" href="/section/realestate">Real Estate</a>
[...]

We truncated the output above for convenience; as we can see, however, the entire <a> tags were returned. What if we want to obtain only the value of one of the tag attributes? In such cases we can simply invoke htmlq with the --attribute option, and pass the attribute we want to retrieve the value of as argument. Suppose, for example, we only want to get the value of the href attribute, which is the actual URL of the page the link points to. Here is what we would run:

$ curl --silent https://www.nytimes.com | htmlq a --attribute href

Here is the result we would obtain:

[...]
/section/world
/section/us
/section/politics
/section/nyregion
/section/business
/section/opinion
/section/technology
/section/science
/section/health
/section/sports
/section/arts
/section/books
/section/style
/section/food
/section/travel
/section/magazine
/section/t-magazine
/section/realestate
[...]
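
Since htmlq writes plain text to standard output, its results compose nicely with the usual Unix tools. For example, to remove duplicate entries from the list above, we could simply append sort -u to the pipeline:

$ curl --silent https://www.nytimes.com | htmlq a --attribute href | sort -u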

Obtaining complete link URLs

As you can see, links are returned as they appear in the page. What is missing from them is the “base” URL, which in this case is https://www.nytimes.com. Is there a way to add it on the fly? The answer is yes. What we have to do is to use the -b (short for --base) option of htmlq, and pass the base URL we want to use as argument:

$ curl --silent https://www.nytimes.com | htmlq a --attribute href -b https://www.nytimes.com

The command above would return the following:

[...]
https://www.nytimes.com/section/world
https://www.nytimes.com/section/us
https://www.nytimes.com/section/politics
https://www.nytimes.com/section/nyregion
https://www.nytimes.com/section/business
https://www.nytimes.com/section/opinion
https://www.nytimes.com/section/technology
https://www.nytimes.com/section/science
https://www.nytimes.com/section/health
https://www.nytimes.com/section/sports
https://www.nytimes.com/section/arts
https://www.nytimes.com/section/books
https://www.nytimes.com/section/style
https://www.nytimes.com/section/food
https://www.nytimes.com/section/travel
https://www.nytimes.com/section/magazine
https://www.nytimes.com/section/t-magazine
https://www.nytimes.com/section/realestate
[...]
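
Again, nothing prevents us from post-processing the result with standard utilities. Purely as an illustration, here we keep only the URLs which point to a section of the site:

$ curl --silent https://www.nytimes.com | htmlq a --attribute href -b https://www.nytimes.com | grep '/section/'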

Obtaining the text between tags

What if we want to extract the text contained between specific tags? Say, for example, we want to get only the text used for the links existing in the page? All we have to do is to use the -t (--text) option of htmlq:

$ curl --silent https://www.nytimes.com | htmlq a --text



Here is the output returned by the command above:

[...]
World
U.S.
Politics
N.Y.
Business
Opinion
Tech
Science
Health
Sports
Arts
Books
Style
Food
Travel
Magazine
T Magazine
Real Estate
[...]
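
The --text option is not limited to links: combined with any selector, it strips the markup and returns only the enclosed text. As a further example, this is a quick way to grab the title of a page:

$ curl --silent https://www.nytimes.com | htmlq title --text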

Using CSS selectors

When using htmlq, we are not limited to simply passing the name of the tag we want to retrieve as argument: we can use more complex CSS selectors. Here is an example. Of all the links existing in the page we used in the examples above, suppose we want to retrieve only those with the css-jq1cx6 class. We would run:

$ curl --silent https://www.nytimes.com | htmlq a.css-jq1cx6

Similarly, to filter all the tags where the data-testid attribute exists and has the “footer-link” value, we would run the command below. Notice that the selector is quoted: without the quotes, the shell would try to expand the square brackets as a glob pattern and would strip the inner double quotes before htmlq ever sees them:

$ curl --silent https://www.nytimes.com | htmlq 'a[data-testid="footer-link"]'
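
More elaborate selectors, such as descendant combinators, are supported as well. As a sketch (assuming the page wraps its bottom navigation links in a footer element), the following would restrict the query to links found inside the footer:

$ curl --silent https://www.nytimes.com | htmlq 'footer a' --attribute href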

Conclusions

In this tutorial we learned how to use the htmlq application to perform the scraping of web pages from the command line. The tool is written in Rust, so we saw how to install it using the “Cargo” package manager, and how to add the default directory Cargo uses to store binaries to our PATH. We learned how to retrieve specific tags from a page, how to get the value of a specific tag attribute, how to pass a base URL to be added to partial links, how to use CSS selectors, and, finally, how to retrieve text enclosed between tags.


