Apache Spark is a distributed computing system. It consists of a master and one or more slaves, where the master distributes work among the slaves, giving us the ability to use many computers to work on one task. One could guess that this is indeed a powerful tool for tasks that need large computations to complete, but which can be split into smaller chunks of steps that can be pushed to the slaves to work on. Once our cluster is up and running, we can write programs to run on it in Python, Java, and Scala.
In this tutorial we will work on a single machine running Red Hat Enterprise Linux 8, and will install the Spark master and slave on the same machine, but keep in mind that the steps describing the slave setup can be applied to any number of computers, thus creating a real cluster that can process heavy workloads. We’ll also add the necessary systemd unit files for management, and run a simple example shipped with the distribution package against the cluster to ensure our system is operational.
In this tutorial you will learn:
- How to install Spark master and slave
- How to add systemd unit files
- How to verify successful master-slave connection
- How to run a simple example job on the cluster
Software Requirements and Conventions Used
Category | Requirements, Conventions or Software Version Used
---|---
System | Red Hat Enterprise Linux 8
Software | Apache Spark 2.4.0
Other | Privileged access to your Linux system as root or via the sudo command.
Conventions | # – requires given linux commands to be executed with root privileges either directly as a root user or by use of sudo command; $ – requires given linux commands to be executed as a regular non-privileged user
How to install Spark on Red Hat 8 step by step instructions
Apache Spark runs on the JVM (Java Virtual Machine), so a working Java 8 installation is required for the applications to run. Aside from that, multiple shells ship within the package; one of them is pyspark, a Python-based shell. To work with it, you’ll also need Python 2 installed and set up.
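Before moving on, it may be worth confirming that the prerequisites are on the PATH. Below is a minimal sketch; the binary names checked ("java", "python2") are assumptions for a typical RHEL 8 setup and may need adjusting on your system:

```python
# Verify that the binaries Spark depends on are available on the PATH.
# The names checked below ("java", "python2") are assumptions for a
# typical RHEL 8 setup; adjust them to match your system.
from shutil import which

def missing_prereqs(binaries):
    """Return the subset of the given binary names not found on the PATH."""
    return [name for name in binaries if which(name) is None]

missing = missing_prereqs(["java", "python2"])
if missing:
    print("Missing prerequisites: " + ", ".join(missing))
else:
    print("All prerequisites found")
```

If anything is reported missing, install it first (on RHEL 8, Java 8 is typically provided by the java-1.8.0-openjdk package).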
- To get the URL of Spark’s latest package, we need to visit the Spark downloads site. We need to choose the mirror closest to our location, and copy the URL provided by the download site. This also means that your URL may differ from the example below. We’ll install the package under /opt/, so we enter the directory as root:

# cd /opt

And feed the acquired URL to wget to get the package:

# wget https://www-eu.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
- We’ll unpack the tarball:
# tar -xvf spark-2.4.0-bin-hadoop2.7.tgz
- And create a symlink to make our paths easier to remember in the next steps:
# ln -s /opt/spark-2.4.0-bin-hadoop2.7 /opt/spark
- We create a non-privileged user that will run both applications, master and slave:
# useradd spark
And set it as owner of the whole /opt/spark directory, recursively:

# chown -R spark:spark /opt/spark*
- We create a systemd unit file /etc/systemd/system/spark-master.service for the master service with the following content:

[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target

And also one for the slave service, /etc/systemd/system/spark-slave.service, with the below contents:

[Unit]
Description=Apache Spark Slave
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-slave.sh spark://rhel8lab.linuxconfig.org:7077
ExecStop=/opt/spark/sbin/stop-slave.sh

[Install]
WantedBy=multi-user.target

Note the highlighted Spark URL. It is constructed as spark://<hostname-or-ip-address-of-the-master>:7077; in this case the lab machine that will run the master has the hostname rhel8lab.linuxconfig.org. Your master’s name will be different. Every slave must be able to resolve this hostname and reach the master on the specified port, which is port 7077 by default.

- With the service files in place, we need to ask systemd to re-read them:

# systemctl daemon-reload
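Before starting the services, each slave machine can sanity-check that the master’s hostname resolves and that the port is reachable. A minimal sketch in Python; the hostname shown is this tutorial’s example lab machine, so substitute your own master’s name:

```python
# Check that a host:port pair (for example, the Spark master on port 7077)
# resolves and accepts TCP connections.
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# Example (hostname from this tutorial's lab setup):
# can_reach("rhel8lab.linuxconfig.org", 7077)
```

A False result usually points at DNS, routing, or firewall problems that are easier to fix before the slave service starts failing to register.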
- We can start our Spark master with systemd:

# systemctl start spark-master.service
- To verify our master is running and functional, we can use systemd status:
# systemctl status spark-master.service
spark-master.service - Apache Spark Master
   Loaded: loaded (/etc/systemd/system/spark-master.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-01-11 16:30:03 CET; 53min ago
  Process: 3308 ExecStop=/opt/spark/sbin/stop-master.sh (code=exited, status=0/SUCCESS)
  Process: 3339 ExecStart=/opt/spark/sbin/start-master.sh (code=exited, status=0/SUCCESS)
 Main PID: 3359 (java)
    Tasks: 27 (limit: 12544)
   Memory: 219.3M
   CGroup: /system.slice/spark-master.service
           3359 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181.b13-9.el8.x86_64/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host [...]

Jan 11 16:30:00 rhel8lab.linuxconfig.org systemd[1]: Starting Apache Spark Master...
Jan 11 16:30:00 rhel8lab.linuxconfig.org start-master.sh[3339]: starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-1[...]
The last line also indicates the main logfile of the master, which is in the logs directory under the Spark base directory, /opt/spark in our case. By looking into this file, we should see a line at the end similar to the below example:

2019-01-11 14:45:28 INFO Master:54 - I have been elected leader! New state: ALIVE
We should also find a line that tells us where the Master interface is listening:
2019-01-11 16:30:03 INFO Utils:54 - Successfully started service 'MasterUI' on port 8080
If we point a browser to the host machine’s port 8080, we should see the status page of the master, with no workers attached at the moment.

Note the URL line on the Spark master’s status page. This is the same URL we need to use for every slave’s unit file we created in step 5.

If we receive a “connection refused” error message in the browser, we probably need to open the port on the firewall:

# firewall-cmd --zone=public --add-port=8080/tcp --permanent
success
# firewall-cmd --reload
success
- Our master is running, so we’ll attach a slave to it. We start the slave service:
# systemctl start spark-slave.service
- We can verify that our slave is running with systemd:
# systemctl status spark-slave.service
spark-slave.service - Apache Spark Slave
   Loaded: loaded (/etc/systemd/system/spark-slave.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-01-11 16:31:41 CET; 1h 3min ago
  Process: 3515 ExecStop=/opt/spark/sbin/stop-slave.sh (code=exited, status=0/SUCCESS)
  Process: 3537 ExecStart=/opt/spark/sbin/start-slave.sh spark://rhel8lab.linuxconfig.org:7077 (code=exited, status=0/SUCCESS)
 Main PID: 3554 (java)
    Tasks: 26 (limit: 12544)
   Memory: 176.1M
   CGroup: /system.slice/spark-slave.service
           3554 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.181.b13-9.el8.x86_64/jre/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker [...]

Jan 11 16:31:39 rhel8lab.linuxconfig.org systemd[1]: Starting Apache Spark Slave...
Jan 11 16:31:39 rhel8lab.linuxconfig.org start-slave.sh[3537]: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-spar[...]
This output also provides the path to the logfile of the slave (or worker), which will be in the same directory, with “worker” in its name. By checking this file, we should see something similar to the below output:
2019-01-11 14:52:23 INFO Worker:54 - Connecting to master rhel8lab.linuxconfig.org:7077...
2019-01-11 14:52:23 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62059f4a{/metrics/json,null,AVAILABLE,@Spark}
2019-01-11 14:52:23 INFO TransportClientFactory:267 - Successfully created connection to rhel8lab.linuxconfig.org/10.0.2.15:7077 after 58 ms (0 ms spent in bootstraps)
2019-01-11 14:52:24 INFO Worker:54 - Successfully registered with master spark://rhel8lab.linuxconfig.org:7077
This indicates that the worker is successfully connected to the master. In this same logfile we’ll find a line that tells us the URL the worker is listening on:
2019-01-11 14:52:23 INFO WorkerWebUI:54 - Bound WorkerWebUI to 0.0.0.0, and started at http://rhel8lab.linuxconfig.org:8081
We can point our browser to the worker’s status page, where its master is listed.

In the master’s logfile, a confirming line should appear:
2019-01-11 14:52:24 INFO Master:54 - Registering worker 10.0.2.15:40815 with 2 cores, 1024.0 MB RAM
If we reload the master’s status page now, the worker should appear there as well, with a link to its status page.

All of these sources confirm that our worker is attached to the master and the cluster is ready to work.
- To run a simple task on the cluster, we execute one of the examples shipped with the package we downloaded. Consider the following simple textfile /opt/spark/test.file:

line1 word1 word2 word3
line2 word1
line3 word1 word2 word3 word4
We will execute the wordcount.py example on it, which will count the occurrence of every word in the file. We can use the spark user; no root privileges are needed.

$ /opt/spark/bin/spark-submit /opt/spark/examples/src/main/python/wordcount.py /opt/spark/test.file
2019-01-11 15:56:57 INFO SparkContext:54 - Submitted application: PythonWordCount
2019-01-11 15:56:57 INFO SecurityManager:54 - Changing view acls to: spark
2019-01-11 15:56:57 INFO SecurityManager:54 - Changing modify acls to: spark
[...]
As the task executes, a long output is provided. Close to the end of the output, the result is shown; the cluster has calculated the needed information:
2019-01-11 15:57:05 INFO DAGScheduler:54 - Job 0 finished: collect at /opt/spark/examples/src/main/python/wordcount.py:40, took 1.619928 s
line3: 1
line2: 1
line1: 1
word4: 1
word1: 3
word3: 2
word2: 2
[...]
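For reference, the computation that wordcount.py distributes across the workers can be sketched in plain local Python. This sketch only mirrors the result; the actual example splits each line into words and reduces by key on an RDD, which is what lets Spark spread the work over the cluster:

```python
# A local sketch of what the wordcount.py example computes:
# split the text into whitespace-separated words and count occurrences.
from collections import Counter

def word_count(text):
    """Return a dict mapping each whitespace-separated word to its count."""
    return dict(Counter(text.split()))

sample = (
    "line1 word1 word2 word3\n"
    "line2 word1\n"
    "line3 word1 word2 word3 word4\n"
)
for word, count in sorted(word_count(sample).items()):
    print(f"{word}: {count}")
```

Running this on the contents of test.file yields the same counts the cluster reported above (word1: 3, word2: 2, and so on); Spark’s value is that the same logic scales to inputs far too large for one machine.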
With this we have seen Apache Spark in action. Additional slave nodes can be installed and attached to scale the computing power of our cluster.