
3 ways to optimize Ansible Automation Platform for scale and performance

Try these settings to optimize performance with Ansible Automation Platform on a massive scale.
Light trails on highway at night

Photo by Pixabay from Pexels

The new features in Ansible Automation Platform 2 bring exciting changes that support running Ansible on a large scale, including separating the control and execution planes. This ensures that job execution does not impact the web user interface (UI) or user experience. In addition, it allows for scaling each plane independently.

Container groups have been part of the Ansible automation controller (Red Hat's enterprise version of Ansible) in tech preview for the last few releases. With Ansible Automation Platform 2, container groups are now out of tech preview and fully supported. They are similar to instance groups but allow job execution to occur in an OpenShift namespace. This allows you to use different namespaces for different types of workloads, such as one for CPU-intensive tasks and one for memory-intensive tasks. You can also dedicate namespaces to different groups to protect one group's automation from impacting another group.

Finally, execution environments are replacing Python virtual environments, which previously had to be synchronized across all Ansible automation controller nodes. An execution environment is a container holding all the pieces necessary to run a playbook. You can create execution environments for specific types of automation. For example, an organization could have an execution environment for network automation, one for Windows automation, and one for Linux automation.

At AnsibleFest 2021, where we presented Automation at Large: Managing Ansible automation controller on a massive scale, Red Hat announced that Ansible automation controller is now a component of Ansible Automation Platform 2. Our talk, which we summarize in this article and a follow-up piece, focuses on Ansible Automation Platform 2 with Ansible automation controller 4.0.

The trouble with scaling up too quickly

We often see this story: An IT department makes a small investment in Ansible to perform simple tasks. Initially, only system administrators use Ansible, so a small Ansible automation controller instance is sufficient to run any needed automation. However, not too long after, other teams ask to use the Ansible automation controller to experiment with more complex automation. Then many different teams and users within the company ask to use the Ansible automation controller to run playbooks against thousands of endpoints.

Predictably, the Ansible automation controller instance starts experiencing performance issues, ranging from slow job runs to a completely unresponsive UI. The root cause is clear: Usage grew too quickly without a plan for the underlying architecture or for how the automation platform would be used.

[ Learn how to use IT automation to transform network, infrastructure, security, DevOps, and other IT services in the free eBook The automated enterprise: Unify people and processes. ]

With new capabilities and features constantly improving the usefulness of the Ansible automation controller, enterprises are using it for increasingly more use cases. Automation controllers must scale up to keep up with the increased usage demands, but this expansion presents new challenges and considerations for administrators and architects.

This article looks at the unique challenges of running an automation controller on a large scale, as well as techniques to combat potential problems as Ansible automation controller handles larger loads.

Large-scale performance issues: A real-world example

To get an idea of what can happen when you don't consider performance, take a look at a real-world example:

A company establishes Ansible Automation Platform as the backbone of its global IT automation deployment. The architecture consists of a half-dozen Ansible automation controller clusters, each with up to 10 instances running in containers on top of OpenShift. Some of these clusters support hundreds of users running thousands of jobs a day against tens of thousands of hosts. Eventually, some of the clusters start experiencing performance issues, including the following symptoms:

  • An unresponsive web UI
  • Jobs queuing up and stuck in a pending state
  • Hanging jobs that never complete
  • Random 500 and 404 errors appearing in the UI and API

These clusters became almost unusable at times, from an unresponsive web UI to an inability to run jobs and other random errors. This affected hundreds of users who couldn't run automation consistently. After weeks of investigation, the team discovered multiple issues and solutions.

Types of workloads

You can understand the stresses different workloads place on Ansible automation controllers by looking at how the Ansible playbooks are constructed. When a job within the Ansible automation controller gets launched, the selected controller executes the playbook against a set of managed nodes. The playbook defines the tasks that must execute to accomplish the playbook's goal. Most of the time, these tasks execute on the managed node. Occasionally, however, a task executes on the controller even when it appears to run on the managed node, and many types of tasks can consume resources on the controller.

Tasks can use CPU and memory on the controller node even when the task is not "delegated to" the controller. These types of tasks may not be of concern when executing jobs against a small number of managed nodes, but as the number of managed nodes and the number of jobs increase, you need to pay more attention and understand the needs of these tasks.

CPU-intensive tasks

First, there are tasks that use CPU resources on the controller. One example occurs when filter plugins are used in the task. Filter plugins always execute on the controller. An example of this type of task is:

- name: Generate data checksum
  set_fact:
    chksum: "{{ host_secret | password_hash('sha512') }}"

This task looks pretty harmless—it just computes a hash. But suppose this task is running against hundreds or thousands of managed nodes simultaneously. In that case, the task might consume all of the CPUs on the controller, leaving no CPU resources available for other processes.
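When the computed value happens to be identical for every host, one hedged mitigation is to evaluate the filter a single time with run_once rather than once per managed node (the shared_secret variable here is hypothetical, and other hosts would read the result through hostvars):

```yaml
# Hedged sketch: if the hashed value does not vary per host, run_once
# runs the filter plugin once on the controller instead of once per
# managed node (shared_secret is a hypothetical variable)
- name: Generate shared checksum once
  set_fact:
    shared_chksum: "{{ shared_secret | password_hash('sha512') }}"
  run_once: true
```

This only helps for host-independent values; a per-host secret, as in the example above, still requires one filter evaluation per host.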

Memory-intensive tasks

Next, there are memory-intensive tasks, like this:

- name: Get data
  postgresql_query: 
    db: "{{ db_name }}"
    login_host: "{{ login_host }}"
    login_user: "{{ login_user }}"
    login_password: "{{ login_password }}" 
    query: SELECT * FROM atable
  register: alldata

This task reads data from an external database. The task (and query) runs on the managed node and stores the results in the alldata variable. All the output gets sent back to the controller and consumes memory on the controller. The memory on the controller doesn't get released until the playbook's execution completes. This can cause a large amount of memory to be consumed on the controller and may cause the controller to start swapping processes out of memory or not have memory available to fulfill allocation requests.
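One hedged way to reduce that memory pressure, keeping the same (hypothetical) connection variables as the example above, is to return only the column the playbook actually needs instead of SELECT *:

```yaml
# Hedged sketch: selecting only the required column keeps the registered
# result, and therefore controller memory, much smaller than SELECT *
- name: Get only the required field
  postgresql_query:
    db: "{{ db_name }}"
    login_host: "{{ login_host }}"
    login_user: "{{ login_user }}"
    login_password: "{{ login_password }}"
    query: SELECT fieldname FROM atable
  register: fielddata
```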

CPU- and memory-intensive tasks

There are also tasks that consume both CPU and memory resources on the controller. Take this task, for example:

- name: Extract more data
  set_fact:
    real_data: "{{ alldata.query_result | json_query('[*].fieldname') }}"

This task uses the json_query filter plugin to transform the data returned from the database query and then stores the results in a new variable called real_data. This task consumes the CPU on the controller because it uses a filter plugin. It also consumes additional memory because of storing the results in a new variable. Like the above two examples, this task could consume large amounts of CPU and memory when run against hundreds or thousands of managed nodes.

The template module is another example of a task that can consume large amounts of CPU and memory on the controller. In addition, it consumes network bandwidth and could potentially saturate the controller's network connection:

- name: Template a file to /etc/file.conf
  ansible.builtin.template: 
    src: /mytemplates/foo.j2
    dest: /etc/file.conf 
    owner: bin 
    group: wheel 
    mode: '0644'

Here, the template module reads the entire source file into memory on the controller and then applies the templating rules. The rules and variables may use plugin filters that execute on the controller. After the template gets rendered, it gets copied across the network to the managed node. If the resulting file is large (multiple megabytes) and the task is running against many managed nodes simultaneously, it uses a large amount of bandwidth.

[ Learn more about automation by registering for the Ansible Basics: Automation Technical Overview course. ]

By themselves, each of these tasks is fairly harmless, but each may introduce performance issues when running at scale. An administrator must look at how to manage jobs to minimize the impacts of running large workloads.

Concurrent jobs

Running a large number of concurrent jobs or a few long-running jobs can drastically affect the Ansible automation controller's performance. Most, or all, of the controller's resources may get consumed, leading to various issues. It is essential to consider how the job templates and playbooks configure forks, slicing, and other settings.

Using forks

In Ansible, you achieve playbook execution parallelism by utilizing forks. A fork equates to one process on the controller node and relates to both the available CPU and memory. The number of forks set on a job causes the Ansible automation controller to run each task in a playbook against that number of hosts simultaneously. This is especially useful when a job runs against a large number of hosts, allowing you to run against a significant number of them simultaneously and generally reducing the overall run time.

For example, if you run against 100 hosts without specifying the number of forks, it defaults to five forks, and the Ansible playbook runs against five hosts at a time. But if you set the forks to 100, then the Ansible playbook runs against all 100 hosts simultaneously, significantly reducing the run time. By utilizing forks, you can make playbooks run a lot faster without changing the code.

Even though forks can improve the efficiency of a job run, they can also overwork the cluster if they're misused. Each instance within the cluster has a certain number of available forks based on its underlying CPU and RAM. You can find the available forks in the automation controller UI by looking at the specific instance on the Instance Groups page. If a job uses most or all of the available forks, it prevents any other jobs from running on that instance until it completes. For long-running jobs, over-allocating forks can result in other jobs getting stuck in a pending state for hours. Therefore, it is important to be aware of the number of forks set and avoid over-allocating long-running or recurring jobs.
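The available forks shown per instance come from a capacity calculation. Here is a rough Python sketch based on the capacity algorithm published for AWX (the upstream project); the constants and the way the CPU and memory bounds are blended are assumptions here and can change between releases:

```python
# Rough sketch of the AWX/automation controller capacity calculation;
# forks_per_cpu and mem_per_fork_mb mirror AWX's documented defaults
# but are assumptions here, not guarantees.

def instance_capacity(cpu_count: int, mem_mb: int,
                      forks_per_cpu: int = 4,
                      mem_per_fork_mb: int = 100) -> int:
    cpu_capacity = cpu_count * forks_per_cpu
    # roughly 2 GiB is reserved for system overhead before counting forks
    mem_capacity = (mem_mb - 2048) // mem_per_fork_mb
    # a capacity_adjustment slider blends the two bounds; this sketch
    # simply takes the larger one
    return max(cpu_capacity, mem_capacity)

# A 4-CPU, 8 GiB instance: CPU bound 16 forks, memory bound 61 forks
print(instance_capacity(4, 8192))  # -> 61
```

The point of the sketch is that a long-running job requesting most of an instance's capacity in forks blocks everything else scheduled on that instance.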

Using slices

While forks allow more hosts to run concurrently, slicing allows jobs to be run across multiple Ansible automation controller instances simultaneously. This results in a distributed load that helps prevent a single instance from being overworked and running out of capacity.

Slicing is set in the job template in the automation controller's UI, under the job slicing field. The Ansible automation controller breaks the job template into multiple job templates within a workflow when you set a number greater than one. Each new job template runs simultaneously on a different node against a unique subset of the hosts in the inventory. This technique is excellent for long-running jobs that run against many hosts and use up a significant number of resources. Slicing can reduce the amount of stress on your nodes and provide a significant performance boost.

Slicing automatically breaks up job templates for you, but doing it manually can also lead to performance gains. By creating smaller, more concise playbooks with their own templates, you can run each in a workflow, with some running simultaneously. This allows jobs to run faster and gives more flexibility with running certain tasks in parallel. Failing to break up large playbooks can result in long-running jobs that can hog the cluster's resources.

Wrap up

These suggestions relate to the Ansible automation controller UI, but you can make changes at the configuration level. The ansible.cfg file contains various settings for Ansible, and you can leverage these settings to optimize performance. For example, you can set the default number of forks here rather than in the UI. To help determine which tasks are more demanding or causing a bottleneck, you can enable the profile_tasks callback, giving you information about how long each task takes to complete. Enabling pipelining improves performance by reducing the number of SSH connections Ansible opens, while you can tailor the various timeout options to specific needs.
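As a hedged sketch, those ansible.cfg tweaks might look like the following; the values are illustrative rather than recommendations, and on recent Ansible releases the profile_tasks callback ships in the ansible.posix collection (ansible.posix.profile_tasks):

```ini
# Illustrative ansible.cfg tuning; values are examples, not recommendations
[defaults]
forks = 50
# per-task timing to spot bottlenecks; newer releases may need the
# collection-qualified name ansible.posix.profile_tasks
callbacks_enabled = profile_tasks

[ssh_connection]
# reuse SSH connections instead of opening a new one per task
pipelining = True
```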

It is important to be aware of any ansible.cfg files in the projects since they can have settings that contradict what is configured in the UI, especially relating to forks. Settings in the UI always take precedence over those in the ansible.cfg.

This article presents a few considerations for handling different types of workloads and avoiding performance issues while running multiple tasks. A follow-up article will explain how to more easily scale Ansible using a containerized controller and monitoring to keep an eye on Ansible automation controller's health and performance and explore why using Ansible Automation Platform could improve job execution, further enhancing your enterprise's ability to achieve automation at large.

You can also watch our presentation after logging into the AnsibleFest website.


AnsibleFest is a free, Red Hat-sponsored technology conference, a virtual event that brings the entire global automation community together. Visit the AnsibleFest website for on-demand access to the demos, keynotes, labs, and technical sessions that you may have missed.


Brant Evans

Brant is a Senior Cloud Architect with Red Hat Consulting. He is an RHCA with more than 25 years of system engineering and automation experience. Outside of working and playing with technology, Brant enjoys spending time with his family and lately has been playing with model trains.

Nicholas Awtry

Nick Awtry is a Senior Consultant at Red Hat. He focuses on working with customers to design and implement Red Hat Ansible Automation solutions. In addition to working with technology, Nick enjoys hiking, cooking, and traveling.
