Debugging and benchmarking can be heavy topics. Frankly, there are a lot of options when it comes to analyzing playbook performance. My goal will be to make this as simple as possible.
Playbook Debugging and Connection Logging
Ansible allows you to enable debug logging at a number of levels, most importantly at the playbook run level and at the network connection level. Having that information will prove invaluable in your future troubleshooting efforts.
From the Ansible CLI, you can set these as environment variables before running a playbook:
export ANSIBLE_DEBUG=true
export ANSIBLE_LOG_PATH=/tmp/ansible-debug.log
ansible-playbook … -v
- `-v` will show increased logging around individual plays/tasks
- `-vvv` will show Ansible playbook execution run logs
- `-vvvv` will show SSH and TCP connection logging
And from within AWX/Tower, you can set the logging/debugging verbosity on each playbook job template.
If standard debug options aren’t enough…if you want to truly see everything that’s happening with Ansible, say no more! You can keep remote log files and enable persistent log messages. And with great power comes great responsibility.
export ANSIBLE_KEEP_REMOTE_FILES=true
export ANSIBLE_PERSISTENT_LOG_MESSAGES=True
Both of these options are decidedly insecure. They will give you everything that Ansible does and sees, the deepest view of exactly what's happening on a remote device, which also means variables and logs are written out unencrypted and readable. So do be careful!
On the network and non-OS side, I often enable ansible_persistent_log_messages to see netconf responses, system calls, and other such things from my network inventories.
2021-03-01 20:35:23,127 p=26577 u=ec2-user n=ansible | jsonrpc response: {"jsonrpc": "2.0", "id": "9c4d684c-d252-4b6d-b624-dff71b20e0d3", "result": [["vvvv", "loaded netconf plugin default from path /home/ec2-user/ansible/ansible_venv/lib/python3.7/site-packages/ansible/plugins/netconf/default.py for network_os default"], ["log", "network_os is set to default"], ["warning", "Persistent connection logging is enabled for lab1-idf1-acc-sw01. This will log ALL interactions to /home/ec2-user/ansible/ansible_debug.log and WILL NOT redact sensitive configuration like passwords. USE WITH CAUTION!"]]}
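As the warning in that log entry notes, this option applies per connection, so you don't have to export it globally. A minimal sketch of scoping it to one inventory group instead, assuming a group_vars file for the cisco-ios group used later in this post:

```yaml
# group_vars/cisco-ios.yml
# Enables persistent connection logging for hosts in this group only.
# Remember: this logs all interactions unredacted, including passwords.
ansible_persistent_log_messages: true
```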
And when it comes to a traditional OS, setting ansible_keep_remote_files will allow you to see the equivalent level of process events, system calls, and whatnot that have happened on the remote system.
Ansible and Network Debugging:
https://docs.ansible.com/ansible/latest/dev_guide/debugging.html
https://docs.ansible.com/ansible/latest/network/user_guide/network_debug_troubleshooting.html
Scale and Performance Testing
Now let’s talk benchmarking! We can easily get ourselves deep into the weeds here. So let’s start simple.
Simply noting Ansible CLI or Tower job run times is the best place to start. Beyond that, we can look at the timing of individual tasks. To aid us in the process, there are a number of plugins that can be enabled in Ansible.
Ansible is built around many types of plugins, one of which is callback plugins. Callback plugins allow you to add some very interesting capabilities to Ansible, such as making your computer read the playbook aloud as it runs. Ansible ships with a number of callback plugins that are ready to use out of the box; you simply have to enable them by adding a comma-separated list of plugin names to callback_whitelist in your ansible.cfg.
The particular callback plugin that will help with performance tuning our playbooks is called profile_tasks. It prints out a detailed breakdown of task execution times, sorted from longest to shortest, as well as a running timer during play execution. Speaking of, timer is another useful callback plugin that prints total execution time, similar to time but with more friendly output.
Ultimately, let’s start with these. Edit your ansible.cfg file to enable these callback plugins:
callback_whitelist = profile_tasks, timer
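For context, that line belongs under the [defaults] section of ansible.cfg:

```ini
[defaults]
# Comma-separated list of callback plugins to enable.
callback_whitelist = profile_tasks, timer
```

Note that on ansible-core 2.11 and later, this setting was renamed to callbacks_enabled, with callback_whitelist kept as a deprecated alias.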
With the profile_tasks and timer callback plugins enabled, run your playbook again and you’ll see more output. For example, here’s a profile of fact collection tasks on a single Cisco inventory host:
ansible-playbook facts.yml --ask-vault-pass -e "survey_hosts=cisco-ios"

ansible_facts : collect output from ios device ------------ 1.94s
ansible_facts : include cisco-ios tasks ------------------- 0.50s
ansible_facts : set config_lines fact --------------------- 0.26s
ansible_facts : set version fact -------------------------- 0.07s
ansible_facts : set management interface name fact -------- 0.07s
ansible_facts : set model number -------------------------- 0.07s
ansible_facts : set config fact --------------------------- 0.07s
And a profile of `change password` on a host:
ansible-playbook config_localpw.yml -e "survey_hosts=cisco-ios"

config_localpw : Update line passwords --------------------- 4.66s
ansible_facts : collect output from ios device ------------- 5.06s
ansible_facts : include cisco-ios tasks -------------------- 0.51s
config_localpw : Update line passwords --------------------- 0.34s
config_localpw : Update enable and username config lines --- 0.33s
config_localpw : debug ------------------------------------- 0.23s
config_localpw : Update enable and username config lines --- 0.22s
config_localpw : Update terminal server username doorbell -- 0.22s
config_localpw : Update line passwords --------------------- 0.20s
config_localpw : Update terminal server username doorbell -- 0.20s
config_localpw : Update terminal server username doorbell -- 0.19s
config_localpw : debug ------------------------------------- 0.19s
config_localpw : Identify if it has a modem ---------------- 0.14s
config_localpw : set_fact - Modem slot 2 ------------------- 0.11s
config_localpw : set_fact - Modem slot 1 ------------------- 0.10s
config_localpw : Update terminal server username doorbell -- 0.10s
config_localpw : set_fact - Modem slot 3 ------------------- 0.10s
config_localpw : Update line passwords --------------------- 0.10s
config_localpw : Update enable and username config lines --- 0.09s
config_localpw : Update enable and username config lines --- 0.09s
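Since benchmarks are only useful if you track them over time, it helps that profile_tasks output is easy to parse programmatically. A minimal sketch (the sample lines are copied from the run above; the regex assumes the default dashed-line format):

```python
import re

# Sample profile_tasks lines, copied from the change-password run above.
PROFILE = """\
config_localpw : Update line passwords --------------------- 4.66s
ansible_facts : collect output from ios device ------------- 5.06s
ansible_facts : include cisco-ios tasks -------------------- 0.51s
"""

# Matches "<task name> ----- <seconds>s" as printed by profile_tasks.
TASK_RE = re.compile(r"^(?P<task>.+?) -+ (?P<secs>[\d.]+)s$")

def parse_timings(text):
    """Return a list of (task_name, seconds) tuples from profile_tasks output."""
    timings = []
    for line in text.splitlines():
        match = TASK_RE.match(line.strip())
        if match:
            timings.append((match.group("task"), float(match.group("secs"))))
    return timings

timings = parse_timings(PROFILE)
total = sum(seconds for _, seconds in timings)
slowest = max(timings, key=lambda item: item[1])
print(f"total: {total:.2f}s, slowest: {slowest[0]} ({slowest[1]:.2f}s)")
```

Archiving that total per run, per playbook, and per inventory size gives you a simple baseline to compare against as your roles change.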
Automation will be unique to every organization, and it’s important to regularly track performance benchmarks as your roles evolve. Beyond the obvious benefit of being able to accurately estimate your automation run times, you can determine where improvements can be made while proactively monitoring for faulty code/logic that will inevitably slip through peer reviews.
Profiling tasks/roles:
https://docs.ansible.com/ansible/latest/plugins/callback/profile_tasks.html
https://docs.ansible.com/ansible/latest/plugins/callback/profile_roles.html
All plugins:
https://docs.ansible.com/ansible/latest/plugins/callback.html
Ansible CLI Process Monitoring
I use dstat when benchmarking Ansible CLI performance, and it’s the first place I start when I need to troubleshoot it.
dstat is an extremely versatile, all-in-one replacement for vmstat, iostat, and ifstat. It will gather CPU, process, memory, I/O, network, and other statistics, and I use it to log/monitor any and all system stats related to Ansible/Tower.
dstat -tcmslrnpyi --fs --socket --unix --top-oom

- `-t` timestamp
- `-c` total CPU usage
- `-m` memory usage
- `-s` swap space
- `-l` load average
- `-r` disk I/O
- `-n` network I/O
- `-p` processes
- `-y` linux system stats
- `-i` interrupts
- `--fs` filesystem open files and inodes
- `--socket` network sockets
- `--unix` unix sockets
- `--top-oom` watch for OOM process
Note: In RHEL8+, dstat is supplied by the pcp-system-tools package.
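dstat can also log its samples to CSV with --output, which makes it straightforward to post-process a capture taken during a playbook run. A minimal sketch of that post-processing, with made-up sample rows standing in for a real capture:

```python
import csv
import io
import statistics

# Illustrative excerpt of a dstat CSV capture (e.g. dstat -tc --output stats.csv).
# Real files begin with a few metadata rows before the column header row;
# these sample values are invented for demonstration.
SAMPLE = """\
"Dstat CSV output"
"Host:","tower01"
"epoch","total cpu usage",,,,,
"epoch","usr","sys","idl","wai","hiq","siq"
1614631000.0,12.5,3.1,83.9,0.5,0.0,0.0
1614631001.0,45.0,8.0,46.0,1.0,0.0,0.0
1614631002.0,38.5,6.9,54.1,0.5,0.0,0.0
"""

def average_cpu_busy(csv_text):
    """Average user+system CPU percentage across all samples in a capture."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Find the row that names the per-column fields (it contains "usr").
    header_index = next(i for i, row in enumerate(rows) if "usr" in row)
    header = rows[header_index]
    usr_col, sys_col = header.index("usr"), header.index("sys")
    samples = [
        float(row[usr_col]) + float(row[sys_col])
        for row in rows[header_index + 1:]
        if len(row) > sys_col
    ]
    return statistics.mean(samples)

print(f"average CPU busy: {average_cpu_busy(SAMPLE):.1f}%")
```

Capturing one CSV per benchmark run, alongside the profile_tasks timings, lets you correlate slow tasks with control-node resource pressure.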