How to install Spark on RHEL 8

Apache Spark is a distributed computing system. It consists of a master and one or more slaves, where the master distributes the work among the slaves, making it possible to use many computers to work on a single task. This makes Spark a powerful tool for tasks that need large computations to complete, but which can be split into smaller chunks of steps that can be pushed to the slaves to work on. Once our cluster is up and running, we can write programs to run on it in Python, Java, and Scala.

In this tutorial we will work on a single machine running Red Hat Enterprise Linux 8, and will install the Spark master and slave on the same machine. Keep in mind, though, that the steps describing the slave setup can be applied to any number of computers, creating a real cluster that can process heavy workloads. We’ll also add the necessary systemd unit files for management, and run a simple example shipped with the distributed package against the cluster to ensure our system is operational.

In this tutorial you will learn:

  • How to install Spark master and slave
  • How to add systemd unit files
  • How to verify successful master-slave connection
  • How to run a simple example job on the cluster

Spark shell with pyspark.

Software Requirements and Conventions Used

Category       Requirements, Conventions or Software Version Used
System         Red Hat Enterprise Linux 8
Software       Apache Spark 2.4.0
Other          Privileged access to your Linux system as root or via the sudo command.
Conventions    # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
               $ – requires given Linux commands to be executed as a regular non-privileged user

How to install Spark on Red Hat 8 step by step instructions

Apache Spark runs on the JVM (Java Virtual Machine), so a working Java 8 installation is required for the applications to run. Aside from that, multiple shells ship within the package; one of them is pyspark, a Python-based shell. To work with it, you’ll also need Python 2 installed and set up.
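
If Java 8 and Python 2 are not yet present, they can usually be installed from the standard RHEL 8 repositories. The package names below are the usual AppStream names and may differ with your subscription and repository setup; the alternatives call simply makes the unversioned python command point to Python 2, which the pyspark launcher expects by default:

    # dnf install java-1.8.0-openjdk python2
    # alternatives --set python /usr/bin/python2   # assumes installing python2 registered the unversioned "python" alternative
    # java -version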

  1. To get the URL of Spark’s latest package, we need to visit the Spark downloads site. We need to choose the mirror closest to our location, and copy the URL provided by the download site. This also means that your URL may be different from the below example. We’ll install the package under /opt/, so we enter the directory as root:
    # cd /opt

    And feed the acquired URL to wget to get the package:

    # wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
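
    Optionally, we can verify the integrity of the archive. Apache publishes a checksum file next to each release on the main distribution site rather than on the mirrors; the URL below follows that pattern and is an assumption, so adjust it if the layout differs:

    # wget https://www.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz.sha512
    # sha512sum spark-2.4.0-bin-hadoop2.7.tgz

    The printed hash should match the one in the downloaded .sha512 file.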


  2. We’ll unpack the tarball:
    # tar -xvf spark-2.4.0-bin-hadoop2.7.tgz
  3. And create a symlink to make our paths easier to remember in the next steps:
    # ln -s /opt/spark-2.4.0-bin-hadoop2.7 /opt/spark
  4. We create a non-privileged user that will run both applications, master and slave:
    # useradd spark

    And set it as owner of the whole /opt/spark directory, recursively:

    # chown -R spark:spark /opt/spark*
  5. We create a systemd unit file /etc/systemd/system/spark-master.service for the master service with the following content:
    [Unit]
    Description=Apache Spark Master
    After=network.target
    
    [Service]
    Type=forking
    User=spark
    Group=spark
    ExecStart=/opt/spark/sbin/start-master.sh
    ExecStop=/opt/spark/sbin/stop-master.sh
    
    [Install]
    WantedBy=multi-user.target

    And also one for the slave service that will be /etc/systemd/system/spark-slave.service with the below contents:

    [Unit]
    Description=Apache Spark Slave
    After=network.target
    
    [Service]
    Type=forking
    User=spark
    Group=spark
    ExecStart=/opt/spark/sbin/start-slave.sh spark://rhel8lab.linuxconfig.org:7077
    ExecStop=/opt/spark/sbin/stop-slave.sh
    
    [Install]
    WantedBy=multi-user.target

    Note the Spark URL in the ExecStart line. It is constructed as spark://<hostname-or-ip-address-of-the-master>:7077; in this case the lab machine that will run the master has the hostname rhel8lab.linuxconfig.org, so your master’s name will be different. Every slave must be able to resolve this hostname and reach the master on the specified port, which is 7077 by default; a quick way to check both is shown below.
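
    To check hostname resolution and port reachability from a slave machine, standard tools are enough. The hostname below is this example’s lab machine and must be replaced with your own; the port test can only succeed once the master from step 7 is actually running:

    $ getent hosts rhel8lab.linuxconfig.org
    $ timeout 2 bash -c '</dev/tcp/rhel8lab.linuxconfig.org/7077' && echo "master port reachable"   # uses bash's built-in /dev/tcp pseudo-device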

  6. With the service files in place, we need to ask systemd to re-read them:
    # systemctl daemon-reload
  7. We can start our Spark master with systemd:
    # systemctl start spark-master.service
  8. To verify our master is running and functional, we can use systemd status:
    # systemctl status spark-master.service
      spark-master.service - Apache Spark Master
       Loaded: loaded (/etc/systemd/system/spark-master.service; disabled; vendor preset: disabled)
       Active: active (running) since Fri 2019-01-11 16:30:03 CET; 53min ago
      Process: 3308 ExecStop=/opt/spark/sbin/stop-master.sh (code=exited, status=0/SUCCESS)
      Process: 3339 ExecStart=/opt/spark/sbin/start-master.sh (code=exited, status=0/SUCCESS)
     Main PID: 3359 (java)
        Tasks: 27 (limit: 12544)
       Memory: 219.3M
       CGroup: /system.slice/spark-master.service
                 3359 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181.b13-9.el8.x86_64/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host [...]
    
    Jan 11 16:30:00 rhel8lab.linuxconfig.org systemd[1]: Starting Apache Spark Master...
    Jan 11 16:30:00 rhel8lab.linuxconfig.org start-master.sh[3339]: starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-1[...]


    The last line also indicates the main logfile of the master, which is in the logs directory under the Spark base directory, /opt/spark in our case. By looking into this file, we should see a line at the end similar to the example below:

    2019-01-11 14:45:28 INFO  Master:54 - I have been elected leader! New state: ALIVE

    We should also find a line that tells us where the Master interface is listening:

    2019-01-11 16:30:03 INFO  Utils:54 - Successfully started service 'MasterUI' on port 8080
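
    We can also confirm from the command line that the master is listening on both the service port and the web UI port; ss ships with the iproute package on RHEL 8:

    # ss -tlnp | grep -E '7077|8080'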

    If we point a browser to the host machine’s port 8080, we should see the status page of the master, with no workers attached at the moment.

    Spark master status page with no workers attached.

    Note the URL line on the Spark master’s status page. This is the same URL that every slave’s unit file, like the one we created in step 5, needs to use.
    If we receive a “connection refused” error message in the browser, we probably need to open the port on the firewall:

    # firewall-cmd --zone=public --add-port=8080/tcp --permanent
    success
    # firewall-cmd --reload
    success
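
    In a real multi-machine cluster, the master’s service port and the workers’ web UI port must be reachable from the other hosts and from our browser as well. The ports below are Spark’s defaults (7077 for the master, 8081 for the worker web UI) and can be opened the same way:

    # firewall-cmd --zone=public --add-port=7077/tcp --permanent
    # firewall-cmd --zone=public --add-port=8081/tcp --permanent
    # firewall-cmd --reload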
  9. Our master is running, so we’ll attach a slave to it. We start the slave service:
    # systemctl start spark-slave.service
  10. We can verify that our slave is running with systemd:
    # systemctl status spark-slave.service
      spark-slave.service - Apache Spark Slave
       Loaded: loaded (/etc/systemd/system/spark-slave.service; disabled; vendor preset: disabled)
       Active: active (running) since Fri 2019-01-11 16:31:41 CET; 1h 3min ago
      Process: 3515 ExecStop=/opt/spark/sbin/stop-slave.sh (code=exited, status=0/SUCCESS)
      Process: 3537 ExecStart=/opt/spark/sbin/start-slave.sh spark://rhel8lab.linuxconfig.org:7077 (code=exited, status=0/SUCCESS)
     Main PID: 3554 (java)
        Tasks: 26 (limit: 12544)
       Memory: 176.1M
       CGroup: /system.slice/spark-slave.service
                 3554 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181.b13-9.el8.x86_64/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker [...]
    
    Jan 11 16:31:39 rhel8lab.linuxconfig.org systemd[1]: Starting Apache Spark Slave...
    Jan 11 16:31:39 rhel8lab.linuxconfig.org start-slave.sh[3537]: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-spar[...]

    This output also provides the path to the logfile of the slave (or worker), which will be in the same directory, with “worker” in its name. By checking this file, we should see something similar to the below output:

    2019-01-11 14:52:23 INFO  Worker:54 - Connecting to master rhel8lab.linuxconfig.org:7077...
    2019-01-11 14:52:23 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62059f4a{/metrics/json,null,AVAILABLE,@Spark}
    2019-01-11 14:52:23 INFO  TransportClientFactory:267 - Successfully created connection to rhel8lab.linuxconfig.org/10.0.2.15:7077 after 58 ms (0 ms spent in bootstraps)
    2019-01-11 14:52:24 INFO  Worker:54 - Successfully registered with master spark://rhel8lab.linuxconfig.org:7077

    This indicates that the worker is successfully connected to the master. In this same logfile we’ll find a line that tells us the URL the worker is listening on:

    2019-01-11 14:52:23 INFO  WorkerWebUI:54 - Bound WorkerWebUI to 0.0.0.0, and started at http://rhel8lab.linuxconfig.org:8081

    We can point our browser to the worker’s status page, where its master is listed.

    Spark worker status page, connected to master.


    In the master’s logfile, a confirming line should appear:

    2019-01-11 14:52:24 INFO  Master:54 - Registering worker 10.0.2.15:40815 with 2 cores, 1024.0 MB RAM

    If we reload the master’s status page now, the worker should appear there as well, with a link to its status page.

    Spark master status page with one worker attached.

    All of these sources confirm that our cluster is attached and ready to work.
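
    Now that the master and slave are confirmed to be working together, we can optionally enable both services so they come up automatically after a reboot, like any other systemd unit:

    # systemctl enable spark-master.service spark-slave.service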

  11. To run a simple task on the cluster, we execute one of the examples shipped with the package we downloaded. Consider the following simple text file /opt/spark/test.file:
    line1 word1 word2 word3
    line2 word1
    line3 word1 word2 word3 word4
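
    One way to create this file, for example as the spark user, is with a single printf; any text editor works just as well:

    $ printf 'line1 word1 word2 word3\nline2 word1\nline3 word1 word2 word3 word4\n' > /opt/spark/test.file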

    We will execute the wordcount.py example on it, which will count the occurrence of every word in the file. We can use the spark user; no root privileges are needed.

    $ /opt/spark/bin/spark-submit /opt/spark/examples/src/main/python/wordcount.py /opt/spark/test.file
    2019-01-11 15:56:57 INFO  SparkContext:54 - Submitted application: PythonWordCount
    2019-01-11 15:56:57 INFO  SecurityManager:54 - Changing view acls to: spark
    2019-01-11 15:56:57 INFO  SecurityManager:54 - Changing modify acls to: spark
    [...]

    As the task executes, a long output is provided. Close to the end of the output, the result is shown; the cluster has calculated the needed information:

    2019-01-11 15:57:05 INFO  DAGScheduler:54 - Job 0 finished: collect at /opt/spark/examples/src/main/python/wordcount.py:40, took 1.619928 s
    line3: 1
    line2: 1
    line1: 1
    word4: 1
    word1: 3
    word3: 2
    word2: 2
    [...]

    With this we have seen Apache Spark in action. Additional slave nodes can be installed and attached to scale the computing power of our cluster.
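
    For interactive experiments, the pyspark shell mentioned earlier can be pointed at the same cluster with the --master switch; the same switch works for spark-submit when a job should explicitly target the standalone master instead of the default local mode. The master URL below is this lab’s and will differ on your system, and the word count one-liner is only an illustrative sketch using the RDD API:

    $ /opt/spark/bin/pyspark --master spark://rhel8lab.linuxconfig.org:7077
    >>> lines = sc.textFile("/opt/spark/test.file")   # sc is created automatically by the pyspark shell
    >>> lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()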