On Linux and other Unix-like operating systems, tar is undoubtedly one of the most used archiving utilities; it let us create archives, often called “tarballs”, we can use for source code distribution or backup purposes. In this tutorial we will see how to read, create and modify tar archives with python, using the tarfile
module.
In this tutorial you will learn:
- The modes in which a tar archive can be opened using the tarfile module
- What are the TarInfo and TarFile classes and what they represent
- How to list the content of a tar archive
- How to extract the content of a tar archive
- How to add files to a tar archive
Software requirements and conventions used
Category | Requirements, Conventions or Software Version Used |
---|---|
System | Distribution-independent |
Software | Python3 |
Other | Basic knowledge of python3 and object oriented programming |
Conventions | # – requires given linux commands to be executed with root privileges either directly as a root user or by use of sudo command$ – requires given linux commands to be executed as a regular non-privileged user |
Basic usage
The tarfile module is included in the python standard library, so we don’t need to install it separately; to use it, we just need to “import” it. The recommended way to access a tarball using this module is by the open
function; in its most basic usage, we must provide, as the first and second arguments:
- The name of the tarball we want to access
- The mode in which it should be opened
The “mode” used to open a tar archive depends on the action we want to perform and on the type of compression (if any) in use. Let’s see them together.
Opening an archive in read-only mode
If we want to examine or extract the content of a tar archive, we can use one of the following modes, to open it read-only:
Mode | Meaning |
---|---|
‘r’ | Read only mode – the compression type will be automatically handled |
‘r:’ | Read-only mode without compression |
‘r:gz’ | Read-only mode – zip compression explicitly specified |
‘r:bz2’ | Read-only mode – bzip compression explicitly specified |
‘r:xz’ | Read-only mode – lzma compression explicitly specified |
In most of the cases, where the compression method can be easily detected, the recommended mode to use is ‘r’.
Opening an archive to append files
If we want to append files to an existing archive we can use the ‘a’ mode. It’s important to notice that it’s possible to append to an archive only if it is not compressed; if we attempt to open a compressed archive with this mode, a ValueError
exception will be raised. If we reference a non-existing archive it will be created on the fly.
Opening an archive for writing
If we want to explicitly create a new archive and open it for writing, we can use one of the following modes:
Mode | Meaning |
---|---|
‘w’ | Open the archive for writing – use no compression |
‘w:gz’ | Open the archive for writing – use gzip compression |
‘w:bz’ | Open the archive for writing – use bzip2 compression |
‘w:xz’ | Open the archive for writing – use lzma compression |
If an existing archive file is opened for writing, it is truncated, so all its content is discarded. To avoid such situations, we may want to open the archive exclusively, as described in the next section.
Create an archive only if it doesn’t exist
When we want to be sure an existing file is not overridden when creating an archive, we must open it exclusively. If we use the ‘x’ mode and a file with the same name of the one we specified for the archive already exists, a FileExistsError
will be raised. The compression methods can be specified as follows:
Mode | Meaning |
---|---|
‘x’ | Create the archive without compression if doesn’t exist |
‘x:gz’ | Create the archive with gzip compression only if it doesn’t exist |
‘x:bz2’ | Create the archive with bzip2 compression only if it doesn’t exist |
‘x:xz’ | Create the archive with lzma compression only if it doesn’t exist |
Working with archives
There are two classes provided by the tarfile
module that are used to interact with tar archives and their contents, and are, respectively: TarFile
and TarInfo
. The former is used to represent a tar archive in its entirety and can be used as a context manager with the Python with
statement, the latter is used to represent an archive member, and contains various information about it. As a first step, we will focus on some of the most often used methods of the TarFile
class: we can use them to perform common operations on tar archives.
Retrieving a list of the archive members
To retrieve a list of the archive members we can use the getmembers
method of a TarFile
object. This method returns a list of TarInfo
objects, one for each archive member. Here is an example of its usage with a dummy compressed archive containing two files:
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
... archive.getmembers()
...
[<TarInfo 'file1.txt' at 0x7f58dab50d00>, <TarInfo 'file2.txt' at 0x7f58dab50ac0>]
As we will see later, we can access some of the attributes of an archived file, as its ownership and modification time, via the corresponding TarInfo
object properties and methods.
Displaying the content of a tar archive
If all we want to do is to display the content of a tar archive, we can open it in read mode and use the list
method of the Tarfile
class.
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
... archive.list()
...
?rw-r--r-- egdoc/egdoc 0 2020-05-16 15:45:45 file1.txt
?rw-r--r-- egdoc/egdoc 0 2020-05-16 15:45:45 file2.txt
As you can see the list of the files contained in the archive is displayed as output. The list
method accepts a positional parameter, verbose which is True
by default. If we change its value to False
, only the file names will be reported in the output, with no additional information.
The method also accepts an optional named parameter, members. If used, the argument provided must be a subset of the list of TarInfo
objects as returned by the getmembers
method. Only information about the specified files will be displayed if this parameter is used and a correct value is provided.
Extracting all members from the tar archive
Another very common operation we may want to perform on a tar archive is to extract all its content. To perform such operation we can use the extractall
method of the corresponding TarFile
object. Here is what we would write:
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
... archive.extractall()
The first parameter accepted by the method is path: it used to specify where the members of the archive should be extracted. The default value is '.'
, so the members are extracted in the current working directory.
The second parameter, members, can be used to specify a subset of members to extract from the archive, and, as in the case of the list
method, it should be a subset of the list returned by the getmembers
method.
The extractall
method has also a named parameter, numeric_owner. It is False
by default: if we change it to True
, numeric uid and gid will be used to set the ownership of the extracted files instead of user and group names.
Extracting only one member from the archive
What if we want to extract only a single file from the archive? In that case we want to use the extract
method and reference the file that should be extracted by its name (or as a TarFile
object). For example, to extract only the file1.txt
file from the tarball, we would run:
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
... archive.extract('file1.txt')
Easy, isn’t it? The file is extracted on the current working directory by default, but a different position can be specified using the second parameter accepted by the method: path.
Normally the attributes the file has inside the archive are set when it is extracted on the filesystem; to avoid this behavior we can set the third parameter of the function, set_attrs, to False
.
The method accepts also the numeric_owner parameter: the usage its the same we saw in the context of the extractall
method.
Extracting an archive member as a file-like object
We saw how, by using the extractall
and extract
methods we can extract one or multiple tar archive members to the filesystem. The tarfile
module provides another extraction method: extractfile
. When this method is used, the specified file is not extracted to the filesystem; instead, a read-only file-like object representing it is returned:
>>> with tarfile.open('archive.tar.gz', 'r') as archive:
... fileobj = archive.extractfile('file1.txt')
... fileobj.writable()
... fileobj.read()
...
False
b'hello\nworld\n'
Adding files to an archive
Until now we saw how to obtain information about an archive and its members, and the different methods we can use to extract its content; now it’s time to see how we can add new members.
The easiest way we can use to add a file to an archive is by using the add
method. We reference the file to be included in the archive by name, which is the first parameter accepted by the method. The file will be archived with its original name, unless we specify an alternative one using the second positional parameter: arcname. Suppose we want to add the file1.txt
to a new archive, but we want to store it as archived_file1.txt
; we would write:
>>> with tarfile.open('new_archive.tar.gz', 'w') as archive:
... archive.add('file1.txt', 'archived_file1.txt')
... archive.list()
...
-rw-r--r-- egdoc/egdoc 12 2020-05-16 17:49:44 archived_file1.txt
In the example above, we created a new uncompressed archive using the ‘w’ mode and added the file1.txt
as archive_file1.txt
, as you can see by the output of list()
.
Directories can be archived in the same way: by default the are added recursively, so together with their content. This behavior can be changed by setting the third positional parameter accepted by the add
method, recursive, to False
.
What if we want to apply a filter, so that only specified files are included in the archive? For this purpose we can use the optional filter named parameter. The value passed to this parameter must be a function that takes a TarInfo
object as argument and returns said object if it must be included in the archive or None
if it must be excluded. Let’s see an example. Suppose we have three files in our current working directory: file1.txt
, file2.txt
and file1.md
. We want to add only the files with the .txt
extension to the archive; here is what we could write:
>>> import os
>>> import tarfile
>>> with tarfile.open('new_archive.tar.gz', 'w') as archive:
... for i in os.listdir():
... archive.add(i, filter=lambda x: x if x.name.endswith('.txt') else None)
... archive.list()
...
-rw-r--r-- egdoc/egdoc 0 2020-05-16 18:26:20 file2.txt
-rw-r--r-- egdoc/egdoc 0 2020-05-16 18:22:13 file1.txt
In the example above we used the os.listdir
method to obtain a list of the files contained in the current working directory. Iterating over said list, we used the add
method to add each file to the archive. We passed a function as the argument of the filter parameter, in this case an anonymous one, a lambda. The function takes the tarfile object as argument (x) and returns it if its name (name is one of the properties of the TarInfo
object) ends with “.txt”. If it’s not the case, the function returns None
so the file is not archived.
The TarInfo object
We already learned that the TarInfo
objects represents a tar archive member: it stores the attributes of the referenced file and provides some methods which can help us identify the file type itself. The TarInfo
object doesn’t contain the actual file data. Some of the attributes of the TarInfo
object are:
- name (name of the file)
- size (file size)
- mtime (file modification time)
- uid (the user id of the file owner)
- gid (the id of the file group)
- uname (the user name of the file owner)
- gname (the name of the file group)
The object has also some very useful methods, here are some of them:
- isfile() – Returns True if the file is a regular file, False otherwise
- isdir() – Returns True if the file is a directory, False otherwise
- issym() – Returns True if the file is a symbolic link, False otherwise
- isblk() – Returns True if the file is a block device, False otherwise
Conclusions
In this tutorial we learned the basic usage of the tarfile
Python module, and we saw how we can use it to work with tar archives. We saw the various operating modes, what the TarFile
and TarInfo
classes represent, and some of the most used methods to list the content of an archive, to add new files or to extract them. For a more in depth knowledge of the tarfile
module please take a look at the module official documentation