HTTP is the protocol the World Wide Web is built upon, which is why being able to interact with it programmatically is essential: scraping a web page, communicating with a service API, or even simply downloading a file are all tasks based on this interaction. Python makes such operations very easy: some useful functions are already provided in the standard library, and for more complex tasks it’s possible (and even recommended) to use the external requests module. In this first article of the series we will focus on the built-in modules. We will use python3 and mostly work inside the python interactive shell: the needed libraries will be imported only once to avoid repetition.
In this tutorial you will learn:
- How to perform HTTP requests with python3 and the urllib.request library
- How to work with server responses
- How to download a file using the urlopen or the urlretrieve functions
Software Requirements and Conventions Used
Category | Requirements, Conventions or Software Version Used
---|---
System | OS-independent
Software | Python3
Other | None
Conventions | # – requires given linux commands to be executed with root privileges either directly as a root user or by use of the sudo command; $ – requires given linux commands to be executed as a regular non-privileged user
Performing requests with the standard library
Let’s start with a very easy GET request. The GET HTTP verb is used to retrieve data from a resource. When performing this type of request, it is possible to specify some parameters in the form of variables: those variables, expressed as key-value pairs, form a query string which is “appended” to the URL of the resource. A GET request should always be idempotent (this means that the result of the request should be the same no matter how many times it is performed) and should never be used to change a state. Performing GET requests with python is really easy. For the sake of this tutorial we will take advantage of the open NASA API call which lets us retrieve the so-called “picture of the day”:
>>> from urllib.request import urlopen
>>> with urlopen("https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY") as response:
... response_content = response.read()
The first thing we did was to import the urlopen function from the urllib.request library: this function returns an http.client.HTTPResponse object which has some very useful methods. We used the function inside a with statement because the HTTPResponse object supports the context-management protocol: resources are closed immediately after the “with” statement is executed, even if an exception is raised.
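For completeness, here is what the same operation looks like without a with statement: the response should then be closed explicitly, and a try/finally block gives the same guarantee even if reading fails. This is just a sketch of the equivalent pattern; a tiny data: URL (which urlopen also supports) is used here so the snippet runs without network access, but the pattern is identical for http:// and https:// URLs:

```python
from urllib.request import urlopen

# The same operation without "with": close the response explicitly.
# A data: URL is used so this snippet needs no network access.
response = urlopen("data:text/plain,hello")
try:
    response_content = response.read()
finally:
    # try/finally ensures the resource is released even if read() raises
    response.close()

print(response_content)  # b'hello'
```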
The read method we used in the example above returns the body of the response object as a bytes object, and optionally takes an argument representing the number of bytes to read (we will see later how this is important in some cases, especially when downloading big files). If this argument is omitted, the body of the response is read in its entirety.
At this point we have the body of the response as a bytes object, referenced by the response_content variable. We may want to transform it into something else. To turn it into a string, for example, we use the decode method, providing the encoding type as an argument:
>>> response_content.decode('utf-8')
In the example above we used the utf-8 encoding. The API call we used in the example, however, returns a response in JSON format; therefore, we want to process it with the help of the json module:
>>> import json
>>> json_response = json.loads(response_content)
The json.loads function deserializes a string, a bytes or a bytearray instance containing a JSON document into a python object. The result of calling the function, in this case, is a dictionary:
>>> from pprint import pprint
>>> pprint(json_response)
{'date': '2019-04-14',
'explanation': 'Sit back and watch two black holes merge. Inspired by the '
'first direct detection of gravitational waves in 2015, this '
'simulation video plays in slow motion but would take about '
'one third of a second if run in real time. Set on a cosmic '
'stage the black holes are posed in front of stars, gas, and '
'dust. Their extreme gravity lenses the light from behind them '
'into Einstein rings as they spiral closer and finally merge '
'into one. The otherwise invisible gravitational waves '
'generated as the massive objects rapidly coalesce cause the '
'visible image to ripple and slosh both inside and outside the '
'Einstein rings even after the black holes have merged. Dubbed '
'GW150914, the gravitational waves detected by LIGO are '
'consistent with the merger of 36 and 31 solar mass black '
'holes at a distance of 1.3 billion light-years. The final, '
'single black hole has 63 times the mass of the Sun, with the '
'remaining 3 solar masses converted into energy in '
'gravitational waves. Since then the LIGO and VIRGO '
'gravitational wave observatories have reported several more '
'detections of merging massive systems, while last week the '
'Event Horizon Telescope reported the first horizon-scale '
'image of a black hole.',
'media_type': 'video',
'service_version': 'v1',
'title': 'Simulation: Two Black Holes Merge',
'url': 'https://www.youtube.com/embed/I_88S8DWbcU?rel=0'}
As an alternative we could also use the json.load function (notice the missing trailing “s”). This function accepts a file-like object as argument, which means that we can use it directly on the HTTPResponse object:
>>> with urlopen("https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY") as response:
... json_response = json.load(response)
Reading the response headers
Another very useful method of the HTTPResponse object is getheaders. This method returns the headers of the response as a list of tuples, where each tuple contains a header name and its corresponding value:
>>> pprint(response.getheaders())
[('Server', 'openresty'),
 ('Date', 'Sun, 14 Apr 2019 10:08:48 GMT'),
 ('Content-Type', 'application/json'),
 ('Content-Length', '1370'),
 ('Connection', 'close'),
 ('Vary', 'Accept-Encoding'),
 ('X-RateLimit-Limit', '40'),
 ('X-RateLimit-Remaining', '37'),
 ('Via', '1.1 vegur, http/1.1 api-umbrella (ApacheTrafficServer [cMsSf ])'),
 ('Age', '1'),
 ('X-Cache', 'MISS'),
 ('Access-Control-Allow-Origin', '*'),
 ('Strict-Transport-Security', 'max-age=31536000; preload')]
You can notice, among others, the Content-Type header which, as we said above, is application/json. If we want to retrieve only a specific header we can use the getheader method instead, passing the name of the header as an argument:
>>> response.getheader('Content-type')
'application/json'
Getting the status of the response
Getting the status code and reason phrase returned by the server after an HTTP request is also very easy: all we have to do is access the status and reason properties of the HTTPResponse object:
>>> response.status
200
>>> response.reason
'OK'
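Note that for unsuccessful status codes urlopen does not return normally: it raises a urllib.error.HTTPError exception, which carries the same code and reason information. The snippet below is a minimal sketch of handling such a case, using httpbin.org’s status endpoint (already used later in this article) purely as an illustration:

```python
from urllib.request import urlopen
from urllib.error import HTTPError

# urlopen raises HTTPError for 4xx/5xx responses instead of
# returning an HTTPResponse object with that status.
try:
    with urlopen("https://httpbin.org/status/404") as response:
        body = response.read()
except HTTPError as e:
    # HTTPError exposes the same information as a normal response
    print(e.code, e.reason)
```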
Including variables in the GET request
The URL of the request we sent above contained only one variable: api_key, whose value was "DEMO_KEY". If we want to pass multiple variables, instead of attaching them to the URL manually, we can provide them and their associated values as key-value pairs of a python dictionary (or as a sequence of two-element tuples); this dictionary will be passed to the urllib.parse.urlencode function, which will build and return the query string. The API call we used above allows us to specify an optional “date” variable, to retrieve the picture associated with a specific day. Here is how we could proceed:
>>> from urllib.parse import urlencode
>>> query_params = {
...     "api_key": "DEMO_KEY",
...     "date": "2019-04-11"
... }
>>> query_string = urlencode(query_params)
>>> query_string
'api_key=DEMO_KEY&date=2019-04-11'
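Note that urlencode also takes care of percent-encoding characters that are not allowed in a URL, so values containing spaces or reserved symbols can be passed as-is. The query below is just an illustrative example:

```python
from urllib.parse import urlencode

# Spaces become "+" and reserved characters like "&" are percent-encoded
print(urlencode({"q": "black holes & mergers"}))  # q=black+holes+%26+mergers
```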
First we defined each variable and its corresponding value as key-value pairs of a dictionary, then we passed said dictionary as an argument to the urlencode function, which returned a formatted query string. Now, when sending the request, all we have to do is attach it to the URL:
>>> url = "?".join(["https://api.nasa.gov/planetary/apod", query_string])
If we send the request using the URL above, we obtain a different response and a different image:
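The round trip itself works exactly as before; a minimal sketch, rebuilding the query string so the snippet is self-contained:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Build the query string and attach it to the base URL
query_string = urlencode({"api_key": "DEMO_KEY", "date": "2019-04-11"})
url = "?".join(["https://api.nasa.gov/planetary/apod", query_string])

# Send the GET request and deserialize the JSON response
with urlopen(url) as response:
    json_response = json.load(response)
```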
{'date': '2019-04-11',
 'explanation': 'What does a black hole look like? To find out, radio '
                'telescopes from around the Earth coordinated observations '
                'of black holes with the largest known event horizons on '
                'the sky. Alone, black holes are just black, but these '
                'monster attractors are known to be surrounded by glowing '
                'gas. The first image was released yesterday and resolved '
                'the area around the black hole at the center of galaxy '
                'M87 on a scale below that expected for its event horizon. '
                'Pictured, the dark central region is not the event '
                "horizon, but rather the black hole's shadow -- the "
                'central region of emitting gas darkened by the central '
                "black hole's gravity. The size and shape of the shadow "
                'is determined by bright gas near the event horizon, by '
                'strong gravitational lensing deflections, and by the '
                "black hole's spin. In resolving this black hole's "
                'shadow, the Event Horizon Telescope (EHT) bolstered '
                "evidence that Einstein's gravity works even in extreme "
                'regions, and gave clear evidence that M87 has a central '
                'spinning black hole of about 6 billion solar masses. The '
                'EHT is not done -- future observations will be geared '
                'toward even higher resolution, better tracking of '
                'variability, and exploring the immediate vicinity of the '
                'black hole in the center of our Milky Way Galaxy.',
 'hdurl': 'https://apod.nasa.gov/apod/image/1904/M87bh_EHT_2629.jpg',
 'media_type': 'image',
 'service_version': 'v1',
 'title': 'First Horizon-Scale Image of a Black Hole',
 'url': 'https://apod.nasa.gov/apod/image/1904/M87bh_EHT_960.jpg'}
In case you didn’t notice, the returned image URL points to the recently unveiled first picture of a black hole.
Sending a POST request
Sending a POST request, with variables ‘contained’ inside the request body, using the standard library, requires additional steps. First of all, as we did before, we construct the POST data in the form of a dictionary:
>>> data = {
... "variable1": "value1",
... "variable2": "value2"
...}
After we constructed our dictionary, we want to use the urlencode function as we did before, and additionally encode the resulting string in ascii:
>>> post_data = urlencode(data).encode('ascii')
Finally, we can send our request, passing the data as the second argument of the urlopen function. In this case we will use https://httpbin.org/post as the destination URL (httpbin.org is a request & response service):
>>> with urlopen("https://httpbin.org/post", post_data) as response:
... json_response = json.load(response)
>>> pprint(json_response)
{'args': {},
'data': '',
'files': {},
'form': {'variable1': 'value1', 'variable2': 'value2'},
'headers': {'Accept-Encoding': 'identity',
'Content-Length': '33',
'Content-Type': 'application/x-www-form-urlencoded',
'Host': 'httpbin.org',
'User-Agent': 'Python-urllib/3.7'},
'json': None,
'origin': 'xx.xx.xx.xx, xx.xx.xx.xx',
'url': 'https://httpbin.org/post'}
The request was successful, and the server returned a JSON response which includes information about the request we made. As you can see, the variables we passed in the body of the request are reported as the value of the 'form' key in the response body. Reading the value of the headers key, we can also see that the content type of the request was application/x-www-form-urlencoded and the user agent was 'Python-urllib/3.7'.
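As a side note, the query-string format used for the body can be decoded back into a dictionary with the urllib.parse.parse_qs function, which is handy when inspecting what was actually sent:

```python
from urllib.parse import parse_qs, urlencode

post_body = urlencode({"variable1": "value1", "variable2": "value2"})

# parse_qs maps each variable to a *list* of values, since a key
# may legitimately appear multiple times in a query string
print(parse_qs(post_body))  # {'variable1': ['value1'], 'variable2': ['value2']}
```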
Sending JSON data in the request
What if we want to send a JSON representation of data with our request? First we define the structure of the data, then we convert it to JSON:
>>> person = {
... "firstname": "Luke",
... "lastname": "Skywalker",
... "title": "Jedi Knight"
... }
We also want to use a dictionary to define custom headers. In this case, for example, we want to specify that our request content is application/json:
>>> custom_headers = {
... "Content-Type": "application/json"
...}
Finally, instead of sending the request directly, we create a Request object and pass, in order, the destination URL, the request data and the request headers as arguments of its constructor:
>>> from urllib.request import Request
>>> req = Request(
... "https://httpbin.org/post",
... json.dumps(person).encode('ascii'),
... custom_headers
...)
One important thing to notice is that we used the json.dumps function, passing the dictionary containing the data we want included in the request as its argument: this function is used to serialize an object into a JSON formatted string, which we then encoded to bytes using the encode method.
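The serialization step can be verified on its own: the encoded payload round-trips back to the original dictionary. A quick sanity check:

```python
import json

person = {"firstname": "Luke", "lastname": "Skywalker", "title": "Jedi Knight"}

# Serialize to a JSON string, then encode to the bytes urlopen expects
body = json.dumps(person).encode('ascii')

# Decoding and deserializing gives back the original dictionary
assert json.loads(body.decode('ascii')) == person
```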
At this point we can send our Request, passing it as the first argument of the urlopen function:
>>> with urlopen(req) as response:
... json_response = json.load(response)
Let’s check the content of the response:
{'args': {},
'data': '{"firstname": "Luke", "lastname": "Skywalker", "title": "Jedi '
'Knight"}',
'files': {},
'form': {},
'headers': {'Accept-Encoding': 'identity',
'Content-Length': '70',
'Content-Type': 'application/json',
'Host': 'httpbin.org',
'User-Agent': 'Python-urllib/3.7'},
'json': {'firstname': 'Luke', 'lastname': 'Skywalker', 'title': 'Jedi Knight'},
'origin': 'xx.xx.xx.xx, xx.xx.xx.xx',
'url': 'https://httpbin.org/post'}
This time we can see that the dictionary associated with the “form” key in the response body is empty, and the one associated with the “json” key represents the data we sent as JSON. As you can observe, even the custom header we sent has been received correctly.
Sending a request with an HTTP verb other than GET or POST
When interacting with APIs we may need to use HTTP verbs other than just GET or POST. To accomplish this we must use the last parameter of the Request class constructor, method, and specify the verb we want to use. The default verb is GET if the data parameter is None, otherwise POST is used. Suppose we want to send a PUT request:
>>> req = Request(
... "https://httpbin.org/put",
... json.dumps(person).encode('ascii'),
... custom_headers,
... method='PUT'
...)
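The verb actually associated with a Request object can be inspected via its get_method method; this also confirms the GET/POST default behavior described above. No request is sent here, so the httpbin URLs are only placeholders:

```python
from urllib.request import Request

# The method defaults to GET when no data is attached...
assert Request("https://httpbin.org/get").get_method() == "GET"

# ...to POST when a body is present...
assert Request("https://httpbin.org/post", b"payload").get_method() == "POST"

# ...and can be forced explicitly with the "method" parameter
assert Request("https://httpbin.org/put", b"payload", method="PUT").get_method() == "PUT"
```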
Downloading a file
Another very common operation we may want to perform is to download some kind of file from the web. Using the standard library there are two ways to do it: using the urlopen function, reading the response in chunks (especially if the file to download is big) and writing them to a local file “manually”, or using the urlretrieve function which, as stated in the official documentation, is considered part of a legacy interface and might become deprecated in the future. Let’s see an example of both strategies.
Downloading a file using urlopen
Say we want to download the tarball containing the latest version of the Linux kernel source code. Using the first method we mentioned above, we write:
>>> latest_kernel_tarball = "https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.7.tar.xz"
>>> with urlopen(latest_kernel_tarball) as response:
... with open('latest-kernel.tar.xz', 'wb') as tarball:
... while True:
... chunk = response.read(16384)
... if chunk:
... tarball.write(chunk)
... else:
... break
In the example above we used both the urlopen function and the open function inside with statements, therefore relying on the context-management protocol to ensure that resources are cleaned up immediately after the block of code where they are used is executed. Inside the while loop, at each iteration, the chunk variable references the bytes read from the response (at most 16384 in this case, i.e. 16 Kibibytes). If chunk is not empty, we write its content to the file object (“tarball”); if it is empty, it means that we consumed the entire content of the response body, therefore we break out of the loop.
A more concise solution involves the use of the shutil module and its copyfileobj function, which copies data from a file-like object (in this case “response”) to another file-like object (in this case “tarball”). The buffer size can be specified via the third argument of the function, which by default is set to 16384 bytes:
>>> import shutil
>>> with urlopen(latest_kernel_tarball) as response:
... with open('latest-kernel.tar.xz', 'wb') as tarball:
... shutil.copyfileobj(response, tarball)
Downloading a file using the urlretrieve function
The alternative and even more concise method to download a file using the standard library is the use of the urllib.request.urlretrieve function. The function takes four arguments, but only the first two interest us now: the first is mandatory, and is the URL of the resource to download; the second is the name used to store the resource locally. If it is not given, the resource will be stored as a temporary file in /tmp. The code becomes:
>>> from urllib.request import urlretrieve
>>> urlretrieve("https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.7.tar.xz", "latest-kernel.tar.xz")
('latest-kernel.tar.xz', <http.client.HTTPMessage object at 0x7f414a4c9358>)
Very simple, isn’t it? The function returns a tuple containing the name used to store the file (this is useful when the resource is stored as a temporary file and the name is randomly generated) and the HTTPMessage object which holds the headers of the HTTP response.
Conclusions
In this first part of the series of articles dedicated to python and HTTP requests, we saw how to send various types of requests using only standard library functions, and how to work with responses. If you have doubts or want to explore things more in depth, please consult the official urllib.request documentation. The next part of the series will focus on the Python requests library.