How to create and manipulate tar archives using Python

On Linux and other Unix-like operating systems, tar is one of the most widely used archiving utilities; it lets us create archives, often called “tarballs”, that we can use for source code distribution or backup purposes. In this tutorial we will see how to read, create and modify tar archives with Python, using the tarfile module.

In this tutorial you will learn:

  • The modes in which a tar archive can be opened using the tarfile module
  • What the TarInfo and TarFile classes are and what they represent
  • How to list the content of a tar archive
  • How to extract the content of a tar archive
  • How to add files to a tar archive

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions
Category     Requirements, Conventions or Software Version Used
System       Distribution-independent
Software     Python 3
Other        Basic knowledge of Python 3 and object-oriented programming
Conventions  # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
             $ – requires given Linux commands to be executed as a regular non-privileged user

Basic usage

The tarfile module is included in the Python standard library, so we don’t need to install it separately; to use it, we just need to import it. The recommended way to access a tarball with this module is the open function; in its most basic usage, we provide as the first and second arguments:

  • The name of the tarball we want to access
  • The mode in which it should be opened

The “mode” used to open a tar archive depends on the action we want to perform and on the type of compression (if any) in use. Let’s see them together.

Opening an archive in read-only mode

If we want to examine or extract the content of a tar archive, we can use one of the following modes to open it read-only:

Mode     Meaning
‘r’      Read-only mode – the compression type is detected and handled automatically
‘r:’     Read-only mode without compression
‘r:gz’   Read-only mode – gzip compression explicitly specified
‘r:bz2’  Read-only mode – bzip2 compression explicitly specified
‘r:xz’   Read-only mode – lzma compression explicitly specified

In most cases, where the compression method can be easily detected, the recommended mode is ‘r’.
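As a minimal sketch of the automatic detection, the snippet below builds a small gzip-compressed archive in a temporary directory and then reopens it with the plain ‘r’ mode. The file names and the temporary directory are illustrative assumptions, not something the module requires.

```python
import os
import tarfile
import tempfile

# Work in a throwaway directory so no real files are touched.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file1.txt")
with open(source, "wb") as f:
    f.write(b"hello\n")

# Create a gzip-compressed archive to read back.
archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(source, arcname="file1.txt")

# Plain 'r' lets tarfile work out the compression on its own.
with tarfile.open(archive_path, "r") as archive:
    names = archive.getnames()

print(names)
```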

Opening an archive to append files

If we want to append files to an existing archive we can use the ‘a’ mode. It’s important to notice that appending is possible only if the archive is not compressed: there is no ‘a:gz’, ‘a:bz2’ or ‘a:xz’ mode, and trying to use one raises a ValueError exception. Opening an already compressed file with the plain ‘a’ mode fails as well, with a tarfile.ReadError. If we reference a non-existing archive, it will be created on the fly.
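The append behavior can be sketched as follows. The example assumes two hypothetical files created in a temporary directory; it appends them to an uncompressed archive in two separate steps and then checks that a compressed append mode is rejected.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
paths = []
for name in ("file1.txt", "file2.txt"):
    path = os.path.join(workdir, name)
    with open(path, "wb") as f:
        f.write(name.encode() + b"\n")
    paths.append(path)

archive_path = os.path.join(workdir, "archive.tar")

# 'a' creates the archive if it does not exist yet...
with tarfile.open(archive_path, "a") as archive:
    archive.add(paths[0], arcname="file1.txt")

# ...and appends to it when it is already there.
with tarfile.open(archive_path, "a") as archive:
    archive.add(paths[1], arcname="file2.txt")

with tarfile.open(archive_path, "r") as archive:
    names = archive.getnames()

# Combining 'a' with a compression suffix is not supported.
try:
    tarfile.open(os.path.join(workdir, "archive.tar.gz"), "a:gz")
    append_error = None
except ValueError as exc:
    append_error = exc

print(names, type(append_error).__name__)
```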

Opening an archive for writing

If we want to explicitly create a new archive and open it for writing, we can use one of the following modes:

Mode     Meaning
‘w’      Open the archive for writing – use no compression
‘w:gz’   Open the archive for writing – use gzip compression
‘w:bz2’  Open the archive for writing – use bzip2 compression
‘w:xz’   Open the archive for writing – use lzma compression

If an existing archive file is opened for writing, it is truncated, so all its content is discarded. To avoid such situations, we may want to open the archive exclusively, as described in the next section.
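To make the truncation concrete, here is a small sketch, again using hypothetical files in a temporary directory: the same archive is opened for writing twice, and only the content written the second time survives.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "file1.txt")
second = os.path.join(workdir, "file2.txt")
for path in (first, second):
    with open(path, "wb") as f:
        f.write(b"data\n")

archive_path = os.path.join(workdir, "backup.tar.gz")

# First write: the archive contains file1.txt.
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(first, arcname="file1.txt")

# Second write: the existing archive is truncated before file2.txt is added.
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(second, arcname="file2.txt")

with tarfile.open(archive_path, "r") as archive:
    names = archive.getnames()

print(names)
```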

Create an archive only if it doesn’t exist

When we want to be sure an existing file is not overwritten when creating an archive, we must open it exclusively. If we use the ‘x’ mode and a file with the same name as the one we specified for the archive already exists, a FileExistsError will be raised. The compression methods can be specified as follows:

Mode     Meaning
‘x’      Create the archive without compression only if it doesn’t exist
‘x:gz’   Create the archive with gzip compression only if it doesn’t exist
‘x:bz2’  Create the archive with bzip2 compression only if it doesn’t exist
‘x:xz’   Create the archive with lzma compression only if it doesn’t exist
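A minimal sketch of exclusive creation, assuming a hypothetical archive name inside a temporary directory: the first call creates the archive, while the second one finds the file already in place and fails.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file1.txt")
with open(source, "wb") as f:
    f.write(b"hello\n")

archive_path = os.path.join(workdir, "archive.tar.gz")

# The archive does not exist yet, so exclusive creation succeeds.
with tarfile.open(archive_path, "x:gz") as archive:
    archive.add(source, arcname="file1.txt")

# A second attempt refuses to touch the existing file.
try:
    with tarfile.open(archive_path, "x:gz"):
        pass
    already_exists = False
except FileExistsError:
    already_exists = True

print(already_exists)
```

Catching FileExistsError like this lets a backup script decide whether to pick a new name or stop, instead of silently discarding an older archive.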

Working with archives

The tarfile module provides two classes to interact with tar archives and their contents: TarFile and TarInfo. The former represents a tar archive in its entirety and can be used as a context manager with the Python with statement; the latter represents an archive member and contains various information about it. As a first step, we will focus on some of the most often used methods of the TarFile class: we can use them to perform common operations on tar archives.

Retrieving a list of the archive members

To retrieve a list of the archive members we can use the getmembers method of a TarFile object. This method returns a list of TarInfo objects, one for each archive member. Here is an example of its usage with a dummy compressed archive containing two files:

>>> import tarfile
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.getmembers()
...
[<TarInfo 'file1.txt' at 0x7f58dab50d00>, <TarInfo 'file2.txt' at 0x7f58dab50ac0>]

As we will see later, we can access some of the attributes of an archived file, such as its ownership and modification time, via the attributes and methods of the corresponding TarInfo object.

Displaying the content of a tar archive

If all we want to do is display the content of a tar archive, we can open it in read mode and use the list method of the TarFile class.

>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.list()
...
?rw-r--r-- egdoc/egdoc          0 2020-05-16 15:45:45 file1.txt
?rw-r--r-- egdoc/egdoc          0 2020-05-16 15:45:45 file2.txt

As you can see, the list of the files contained in the archive is displayed as output. The list method accepts a positional parameter, verbose, which is True by default. If we change its value to False, only the file names will be reported in the output, with no additional information.

The method also accepts an optional named parameter, members. If used, the argument provided must be a subset of the list of TarInfo objects as returned by the getmembers method. Only information about the specified files will be displayed if this parameter is used and a correct value is provided.
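Both parameters can be combined, as in the sketch below. It assumes a small archive built on the spot from two hypothetical files; since list prints to standard output, the example captures that output with contextlib.redirect_stdout only to make it easy to inspect.

```python
import contextlib
import io
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    for name in ("file1.txt", "file2.txt"):
        path = os.path.join(workdir, name)
        with open(path, "wb") as f:
            f.write(b"")
        archive.add(path, arcname=name)

with tarfile.open(archive_path, "r") as archive:
    # Pick a subset of the TarInfo objects returned by getmembers().
    subset = [m for m in archive.getmembers() if m.name == "file2.txt"]
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # verbose=False prints names only; members limits the listing.
        archive.list(verbose=False, members=subset)

listing = buffer.getvalue()
print(listing)
```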

Extracting all members from the tar archive

Another very common operation we may want to perform on a tar archive is to extract all its content. To do so we can use the extractall method of the corresponding TarFile object. Here is what we would write:

>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.extractall()

The first parameter accepted by the method is path: it is used to specify where the members of the archive should be extracted. The default value is '.', so the members are extracted into the current working directory.

The second parameter, members, can be used to specify a subset of members to extract from the archive, and, as in the case of the list method, it should be a subset of the list returned by the getmembers method.

The extractall method also has a named parameter, numeric_owner. It is False by default: if we change it to True, the numeric uid and gid will be used to set the ownership of the extracted files instead of user and group names.
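Here is a short sketch of the path and members parameters used together. The archive, the file names and the "extracted" destination directory are assumptions made for the example.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    for name in ("file1.txt", "file2.txt"):
        path = os.path.join(workdir, name)
        with open(path, "wb") as f:
            f.write(b"hello\n")
        archive.add(path, arcname=name)

# Destination directory for the extracted members.
target = os.path.join(workdir, "extracted")
os.makedirs(target)

with tarfile.open(archive_path, "r") as archive:
    # Only the members in this subset are extracted.
    wanted = [m for m in archive.getmembers() if m.name == "file1.txt"]
    archive.extractall(path=target, members=wanted)

extracted = sorted(os.listdir(target))
print(extracted)
```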

Extracting only one member from the archive

What if we want to extract only a single file from the archive? In that case we want to use the extract method and reference the file that should be extracted by its name (or as a TarInfo object). For example, to extract only the file1.txt file from the tarball, we would run:

>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     archive.extract('file1.txt')

Easy, isn’t it? The file is extracted into the current working directory by default, but a different location can be specified using the second parameter accepted by the method: path.

Normally, the attributes the file has inside the archive are applied when it is extracted to the filesystem; to avoid this behavior we can set the third parameter of the method, set_attrs, to False.

The method also accepts the numeric_owner parameter: its usage is the same as we saw in the context of the extractall method.
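The following sketch puts these parameters together. It assumes a hypothetical file whose modification time is deliberately set to an old date before archiving, so that we can observe that set_attrs=False leaves the extracted copy with a fresh timestamp. It also passes the member as a TarInfo object rather than by name.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file1.txt")
with open(source, "wb") as f:
    f.write(b"hello\nworld\n")

# Give the source file an old modification time (2000-01-01 UTC)
# so we can tell whether it is restored on extraction.
old_mtime = 946684800
os.utime(source, (old_mtime, old_mtime))

archive_path = os.path.join(workdir, "archive.tar.gz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(source, arcname="file1.txt")

target = os.path.join(workdir, "extracted")
os.makedirs(target)

with tarfile.open(archive_path, "r") as archive:
    # The member can be passed as a TarInfo object instead of a name.
    member = archive.getmember("file1.txt")
    archive.extract(member, path=target, set_attrs=False)

extracted_file = os.path.join(target, "file1.txt")
# With set_attrs=False the archived modification time is not applied.
restored = int(os.path.getmtime(extracted_file)) == old_mtime
print(os.path.isfile(extracted_file), restored)
```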

Extracting an archive member as a file-like object

We saw how, by using the extractall and extract methods, we can extract one or multiple tar archive members to the filesystem. The TarFile class provides another extraction method: extractfile. When this method is used, the specified file is not extracted to the filesystem; instead, a read-only file-like object representing it is returned:

>>> with tarfile.open('archive.tar.gz', 'r') as archive:
...     fileobj = archive.extractfile('file1.txt')
...     fileobj.writable()
...     fileobj.read()
...
False
b'hello\nworld\n'
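One detail worth keeping in mind is that extractfile returns None for members that are neither regular files nor links, such as directories. The sketch below assumes a hypothetical "docs" directory containing a single text file, and also shows how the returned bytes can be decoded to text.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
docs_dir = os.path.join(workdir, "docs")
os.makedirs(docs_dir)
with open(os.path.join(docs_dir, "notes.txt"), "wb") as f:
    f.write(b"hello\nworld\n")

archive_path = os.path.join(workdir, "archive.tar")
with tarfile.open(archive_path, "w") as archive:
    archive.add(docs_dir, arcname="docs")

with tarfile.open(archive_path, "r") as archive:
    # A directory has no data to read, so None is returned.
    dir_result = archive.extractfile("docs")
    # A regular file yields a binary file-like object.
    with archive.extractfile("docs/notes.txt") as fileobj:
        text = fileobj.read().decode("utf-8")

print(dir_result, repr(text))
```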

Adding files to an archive

So far we have seen how to obtain information about an archive and its members, and the different methods we can use to extract its content; now it’s time to see how we can add new members.

The easiest way to add a file to an archive is the add method. We reference the file to be included in the archive by name, which is the first parameter accepted by the method. The file will be archived with its original name, unless we specify an alternative one using the second positional parameter: arcname. Suppose we want to add file1.txt to a new archive, but we want to store it as archived_file1.txt; we would write:

>>> with tarfile.open('new_archive.tar', 'w') as archive:
...     archive.add('file1.txt', 'archived_file1.txt')
...     archive.list()
...
-rw-r--r-- egdoc/egdoc         12 2020-05-16 17:49:44 archived_file1.txt

In the example above, we created a new uncompressed archive using the ‘w’ mode and added file1.txt as archived_file1.txt, as you can see from the output of list().

Directories can be archived in the same way: by default they are added recursively, that is, together with their content. This behavior can be changed by setting the third positional parameter accepted by the add method, recursive, to False.
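The difference between the two behaviors can be sketched like this, assuming a hypothetical "docs" directory with one file inside; the directory is archived twice, once recursively and once on its own.

```python
import os
import tarfile
import tempfile

workdir = tempfile.mkdtemp()
docs_dir = os.path.join(workdir, "docs")
os.makedirs(docs_dir)
with open(os.path.join(docs_dir, "notes.txt"), "wb") as f:
    f.write(b"hello\n")

full_path = os.path.join(workdir, "full.tar")
flat_path = os.path.join(workdir, "flat.tar")

# Default: the directory is added together with its content.
with tarfile.open(full_path, "w") as archive:
    archive.add(docs_dir, arcname="docs")

# recursive=False: only the directory entry itself is stored.
with tarfile.open(flat_path, "w") as archive:
    archive.add(docs_dir, arcname="docs", recursive=False)

with tarfile.open(full_path, "r") as archive:
    recursive_names = archive.getnames()
with tarfile.open(flat_path, "r") as archive:
    flat_names = archive.getnames()

print(recursive_names, flat_names)
```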

What if we want to apply a filter, so that only specified files are included in the archive? For this purpose we can use the optional filter named parameter. The value passed to this parameter must be a function that takes a TarInfo object as argument and returns said object if it must be included in the archive or None if it must be excluded. Let’s see an example. Suppose we have three files in our current working directory: file1.txt, file2.txt and file1.md. We want to add only the files with the .txt extension to the archive; here is what we could write:

>>> import os
>>> import tarfile
>>> with tarfile.open('new_archive.tar', 'w') as archive:
...     for i in os.listdir():
...         archive.add(i, filter=lambda x: x if x.name.endswith('.txt') else None)
...     archive.list()
...
-rw-r--r-- egdoc/egdoc          0 2020-05-16 18:26:20 file2.txt
-rw-r--r-- egdoc/egdoc          0 2020-05-16 18:22:13 file1.txt

In the example above we used the os.listdir function to obtain a list of the files contained in the current working directory. Iterating over said list, we used the add method to add each file to the archive. We passed a function as the argument of the filter parameter, in this case an anonymous one, a lambda. The function takes the TarInfo object as argument (x) and returns it if its name (name is one of the attributes of the TarInfo object) ends with “.txt”. Otherwise, the function returns None, so the file is not archived.
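The same idea works when a whole directory is added recursively, with one caveat: if the filter returns None for a directory, its content is skipped too. The sketch below uses a named function instead of a lambda; only_txt and the "project" directory are hypothetical names chosen for the example.

```python
import os
import tarfile
import tempfile


def only_txt(tarinfo):
    """Keep directories and .txt files; exclude everything else."""
    # Directories must be let through, otherwise add() does not descend into them.
    if tarinfo.isdir() or tarinfo.name.endswith(".txt"):
        return tarinfo
    return None


workdir = tempfile.mkdtemp()
project_dir = os.path.join(workdir, "project")
os.makedirs(project_dir)
for name in ("file1.txt", "file2.txt", "file1.md"):
    with open(os.path.join(project_dir, name), "wb") as f:
        f.write(b"")

archive_path = os.path.join(workdir, "new_archive.tar")
with tarfile.open(archive_path, "w") as archive:
    # The filter is applied to the directory and to each file inside it.
    archive.add(project_dir, arcname="project", filter=only_txt)

with tarfile.open(archive_path, "r") as archive:
    names = archive.getnames()

print(names)
```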

The TarInfo object

We already learned that a TarInfo object represents a tar archive member: it stores the attributes of the referenced file and provides some methods which can help us identify the file type. The TarInfo object doesn’t contain the actual file data. Some of the attributes of the TarInfo object are:

  • name (name of the file)
  • size (file size)
  • mtime (file modification time)
  • uid (the user id of the file owner)
  • gid (the id of the file group)
  • uname (the user name of the file owner)
  • gname (the name of the file group)

The object also has some very useful methods; here are some of them:

  • isfile() – Returns True if the file is a regular file, False otherwise
  • isdir() – Returns True if the file is a directory, False otherwise
  • issym() – Returns True if the file is a symbolic link, False otherwise
  • isblk() – Returns True if the file is a block device, False otherwise

Conclusions

In this tutorial we learned the basic usage of the tarfile Python module, and we saw how we can use it to work with tar archives. We saw the various operating modes, what the TarFile and TarInfo classes represent, and some of the most used methods to list the content of an archive, to add new files or to extract them. For more in-depth knowledge of the tarfile module, please take a look at the module’s official documentation.


