Package software and data with self-compressed scripts

Self-compressed scripts are a quick, reliable way to distribute software or data to users without a package manager, elevated privileges, or other limitations.

Posted: November 23, 2021 by Jose Vicente Nunez (Sudoer)

Sealing a package with tape (Photo by Ketut Subiyanto from Pexels)

Sometimes, you need a quick and reliable way to distribute data or software to users without using a package manager (for example, the end user may not have root access to install an application).

This issue could be tackled by using containers and Podman or Docker, but what if they're not available on the destination system? What if one of the requirements is for the application to work on a bare-metal environment?

You could use Python with pip (and you probably know that you can package non-Python artifacts, too), but then you may be faced with some installation limitations (a virtual environment, or the --user option), not to mention that you need boilerplate code to package your Python code.

So is everything lost? Fear not! In this article, I demonstrate a very small but effective technique to write a self-extracting script that doesn't require elevated privileges.

Set up your data

Damiaan Zwietering has a cool Git repository about coronavirus with data and visualizations (Jupyter notebooks and Excel spreadsheets), but no installer. Suppose you want to give this to your users, but they don't have access to Git. You can create a self-extracting installer for your users.

In real life, you'd already have data that you want to distribute. To have some sample data to work with, first clone this repository into your home directory:

$ git clone https://gitlab.com/dzwietering/corona.git

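There's now a lot of data and a not-so-shallow directory structure, but you can pack it into a single compressed .tar.gz archive with the git archive command, which infers the format from the output file name. The commands below assume $tempdir points to a scratch directory, created here with mktemp:

$ tempdir=$(mktemp --directory)
$ cd $HOME/corona
$ git archive --verbose --output $tempdir/corona.tar.gz HEAD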

For the sake of this example, this tarball is the file you want to share with your users.

The self-extracting script's structure

The self-extracting script is split into the following sections:

Code that helps users extract the data (the "payload")
An anchor line separating the script from the data to be extracted
The payload itself, which the script locates by the anchor's position

Bash, as it turns out, is pretty good at defining a script this way.
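
The three-part layout can be sketched with a toy script whose payload is a single line of plain text rather than an archive. The file name demo.sh, the marker __PAYLOAD__, and the temporary directory are made up for this illustration; it uses short options so it also runs under a plain POSIX shell.

```shell
# Toy illustration only: demo.sh and __PAYLOAD__ are hypothetical names.
workdir=$(mktemp -d)

# Part 1 (code) and part 2 (anchor). The quoted 'EOF' keeps $0 and $(...) literal.
cat > "$workdir/demo.sh" <<'EOF'
#!/bin/sh
# The anchored pattern matches only a line made up of the marker alone,
# so this grep command does not match itself.
start=$(grep -n '^__PAYLOAD__$' "$0" | cut -d : -f 1)
# Print everything after the anchor line.
tail -n +"$((start + 1))" "$0"
exit 0
__PAYLOAD__
EOF

# Part 3 (payload): anything appended after the anchor.
echo "hello from the payload" >> "$workdir/demo.sh"

result=$(sh "$workdir/demo.sh")
echo "$result"
```

The full script later in this article reaches the same result differently: it lets grep match both the command and the anchor, then keeps the last match with tail -1.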

Create the payload

Here is an idea: Say the data you need to distribute is a directory with many scripts and also data. You want to keep your permissions and structure intact, and you want the user to just "unpack" this into their home directory.

This sounds like a job for the tar command. But, for the sake of argument, say your users don't know how to use tar, or they want special options when installing the tarball file (like extracting only a specific file).

Another issue is that your .tar archive is a binary file. If you want to send it by email, you have to encode it properly with Uuencode or Base64 so that it can be transmitted safely.

What to do? Don't throw away the .tar file yet. Instead, prepare it so you can append it to a Bash script (which you'll write shortly):

$ base64 $tempdir/corona.tar.gz > $tempdir/corona_payload
$ file $tempdir/corona_payload
/tmp/tmp.8QNdzdKEkG/corona_payload: ASCII text
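
If you want reassurance that the encoding is lossless, a round trip is easy to check. The sketch below uses a small throwaway file (sample.bin, a hypothetical name) instead of the real tarball, and assumes your base64 accepts the short -d option for decoding.

```shell
# Hypothetical sample file; substitute your own archive.
workdir=$(mktemp -d)
printf 'binary\000data\001here\n' > "$workdir/sample.bin"

# Encode to text, then decode back to binary.
base64 "$workdir/sample.bin" > "$workdir/sample.b64"
base64 -d "$workdir/sample.b64" > "$workdir/sample.decoded"

# cmp exits with 0 only when both files are byte-for-byte identical.
if cmp -s "$workdir/sample.bin" "$workdir/sample.decoded"; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "round trip: $roundtrip"
```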

Extract data from a .tar file

You can either dump all of the contents into a new directory:

$ newbase=$HOME/coronadata
$ test ! -d $newbase && /bin/mkdir --parents --verbose $newbase
$ tar --directory $newbase \
--file $tempdir/corona.tar.gz --extract --gzip --verbose

Or you can extract just part of it, such as the measures, experiment, and test directories:

$ newbase=$HOME/coronadata
$ test ! -d $newbase && /bin/mkdir --parents --verbose $newbase
$ tar --directory $newbase --file $tempdir/corona.tar.gz \
--extract --gzip --verbose measures experiment test
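
To know which names you can pass for a partial extraction, it helps to list the archive first. This sketch builds a tiny stand-in archive (the directory names mirror the ones above, but the files are invented) and then extracts only one directory.

```shell
# Stand-in archive with invented files; the real one comes from git archive.
workdir=$(mktemp -d)
mkdir -p "$workdir/src/measures" "$workdir/src/experiment" "$workdir/out"
echo "a" > "$workdir/src/measures/m.txt"
echo "b" > "$workdir/src/experiment/e.txt"
tar -C "$workdir/src" -czf "$workdir/sample.tar.gz" measures experiment

# List the members without extracting anything.
tar -tzf "$workdir/sample.tar.gz"

# Extract only the measures directory into out/.
tar -C "$workdir/out" -xzf "$workdir/sample.tar.gz" measures
ls "$workdir/out"
```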

For this exercise, extract the whole thing into a user-chosen directory under a base directory (like $HOME). With COVIDUSERDIR standing for the name the user picks, the result is:

$HOME/$COVIDUSERDIR

Anatomy of the self-extracting script

Below is the code of my self-extracting script. You can save the script in your Git repository and reuse it for other deployments. Things to notice:

SCRIPT_END is the line number where the payload starts inside the script
It sanitizes user input
Once the script knows the position of the payload, it extracts it from itself ($0), decodes it back to binary, and then unpacks it

#!/usr/bin/env bash
# Author: Jose Vicente Nunez
SCRIPT_END=$(/bin/grep --max-count 2 --line-number ___END_OF_SHELL_SCRIPT___ "$0"| /bin/cut --field 1 --delimiter :| /bin/tail -1)|| exit 100
((SCRIPT_END+=1))
basedir=
while test -z "$basedir"; do
    read -r -p "Where do you want to extract the COVID-19 data, relative to $HOME? (example: mydata -> $HOME/mydata. Press CTRL-C to abort):" basedir
done
:<<DOC
Sanitize the user input. This is quite restrictive; how strict it should be depends on the real application's requirements.
DOC
CLEAN=${basedir//_/}
CLEAN=${CLEAN// /_}
CLEAN=${CLEAN//[^a-zA-Z0-9_]/}
if [ ! -d "$HOME/$CLEAN" ]; then
    echo "[INFO]: Will try to create the directory $HOME/$CLEAN"
    if ! /bin/mkdir --parents --verbose "$HOME/$CLEAN"; then
        echo "[ERROR]: Failed to create $HOME/$CLEAN"
        exit 100
    fi
fi

/bin/tail --lines +"$SCRIPT_END" "$0"| /bin/base64 -d| /bin/tar --file - --extract --gzip --directory "$HOME/$CLEAN"

exit 0
# Here's the end of the script followed by the embedded file
___END_OF_SHELL_SCRIPT___
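
To see what the three parameter expansions do to awkward input, you can wrap them in a function and feed it a few strings. The function name sanitize and the sample inputs are only for illustration; the expansions themselves are the ones used in the script above, and they need Bash.

```shell
#!/usr/bin/env bash
# sanitize is a hypothetical helper wrapping the script's three expansions.
sanitize() {
    local clean=${1//_/}            # 1. drop existing underscores
    clean=${clean// /_}             # 2. turn spaces into underscores
    clean=${clean//[^a-zA-Z0-9_]/}  # 3. drop anything that is not alphanumeric or "_"
    printf '%s\n' "$clean"
}

sanitize 'my data'          # prints my_data
sanitize '../../etc'        # prints etc
sanitize 'a_b; rm -rf ~'    # prints ab_rm_rf_
sanitize '..'               # prints an empty line
```

The last call shows an edge case: input made up only of rejected characters sanitizes to an empty string, in which case the script above would extract straight into $HOME. If that matters for your deployment, consider testing the cleaned value rather than the raw answer.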

So how do you add the payload to the script? Just put together the two pieces with a little bit of cat glue:

$ cat covid_extract.sh \
$tempdir/corona_payload > covid_final_installer.sh

Make it executable:

$ chmod u+x covid_final_installer.sh
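
Before shipping an installer built this way, it is worth checking that the marker ended up on a line of its own and that the payload still decodes. The sketch below does that on a miniature installer; stub.sh, final.sh, the __END__ marker, and the one-file payload are stand-ins invented for this example, and the stub takes its target directory as an argument instead of prompting.

```shell
# Miniature stand-in for the real installer; all names here are hypothetical.
workdir=$(mktemp -d)
mkdir "$workdir/data" "$workdir/target"
echo "sample" > "$workdir/data/readme.txt"

# The stub extracts into the directory given as its first argument.
cat > "$workdir/stub.sh" <<'EOF'
#!/bin/sh
start=$(grep -n '^__END__$' "$0" | cut -d : -f 1)
tail -n +"$((start + 1))" "$0" | base64 -d | tar -xzf - -C "$1"
exit 0
__END__
EOF

# Glue the stub and the encoded payload together with cat.
tar -C "$workdir/data" -czf - readme.txt | base64 > "$workdir/payload"
cat "$workdir/stub.sh" "$workdir/payload" > "$workdir/final.sh"

# Check 1: the marker occupies exactly one whole line.
marker_lines=$(grep -c '^__END__$' "$workdir/final.sh")
echo "marker lines: $marker_lines"

# Check 2: the installer unpacks what was packed.
sh "$workdir/final.sh" "$workdir/target"
cat "$workdir/target/readme.txt"
```

If the stub file did not end with a newline, the first payload line would be glued onto the marker line and the first check would report 0.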

The resulting installer is the extraction script followed by the encoded payload, so it's considerably larger than the script alone.

Run the installer

Does it work? Test it out for yourself:

$ ./covid_final_installer.sh
Where do you want to extract the COVID-19 data, relative to /home/josevnz? (example: mydata -> /home/josevnz/mydata. Press CTRL-C to abort):COVIDDATA
[INFO]: Will try to create the directory /home/josevnz/COVIDDATA
/bin/mkdir: created directory '/home/josevnz/COVIDDATA'

$ tree /home/josevnz/COVIDDATA
/home/josevnz/COVIDDATA
├── acaps_covid19_government_measures_dataset_0.xlsx
├── acaps_covid19_government_measures_dataset.xlsx
├── COVID-19-geographic-disbtribution-worldwide.xlsx
├── EUCDC.ipynb
├── experiment
...
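
Because the script reads the target directory with read, you can also answer the prompt from standard input, which is handy for unattended runs. A minimal sketch, using a stripped-down stand-in (ask.sh, a hypothetical name) that contains only the prompt loop:

```shell
workdir=$(mktemp -d)
cat > "$workdir/ask.sh" <<'EOF'
#!/usr/bin/env bash
basedir=
while test -z "$basedir"; do
    # read returns non-zero at end of input, so stop instead of looping forever.
    read -r -p "Target directory: " basedir || break
done
echo "would extract to: $basedir"
EOF

# Pipe the answer in; Bash shows the -p prompt only when input is a terminal.
answer=$(echo COVIDDATA | bash "$workdir/ask.sh")
echo "$answer"
```

The `|| break` guard is an addition in this sketch. The loop in the full script has no such guard, so it would keep spinning if standard input ran out before a non-empty answer arrived.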

Self-extracting installers are useful

I find self-extracting installers useful for many reasons.

First, you can make them as complicated or simple as you want them to be. The most complex part is dictating where the script should extract the payload.

And it's useful to know this technique because malware installers can use it, too. Now you're more prepared to spot code like this in a script. Just as importantly, you now know how to prevent shell injection misuse by validating user input in your own self-extracting scripts.

There are good tools out there to automate this. Give them a try (but check their code first).