Scaling Salt Clients
Salt Minion Onboarding Rate
The rate at which SUSE Manager can on-board minions (accept Salt keys) is limited and depends on hardware resources. On-boarding minions at a faster rate than SUSE Manager is configured for will build up a backlog of unprocessed keys slowing the process and potentially exhausting resources. It is recommended to limit the acceptance key rate pro-grammatically. A safe starting point would be to on-board a minion every 15 seconds, which can be implemented via the following command:
for k in $(salt-key -l un|grep -v Unaccepted); do salt-key -y -a $k; sleep 15; done
Minions Running with Unaccepted Salt Keys
Minions which have not been on-boarded, (minions running with unaccepted Salt keys) consume resources, in particular inbound network bandwidth for ~2.5 Kb/s per minion. 1000 idle minions will consume around ~2.5 Mb/s, and this number will drop to almost 0 once on-boarding has been completed. Limit non-onboarded systems for optimal performance.
Salt’s official documentation suggests the maximum number of opened files should be set to at least 2 × the minion count.
Current default is 16384, which is sufficient for ~8000 minions.
For larger installations, this number may be increased by editing the following line in /usr/lib/systemd/system/salt-master.service
:
LimitNOFILE=16384
Salt Timeouts
Background Information
Salt features two timeout parameters called timeout
and gather_job_timeout
that are relevant during the execution of Salt commands and jobs—it does not matter whether they are triggered using the command line interface or API.
These two parameters are explained in the following article.
This is a normal workflow when all minions are well reachable:
-
A salt command or job is executed:
salt '*' test.ping
-
Salt master publishes the job with the targeted minions into the Salt PUB channel.
-
Minions take that job and start working on it.
-
Salt master is looking at the Salt RET channel to gather responses from the minions.
-
If Salt master gets all responses from targeted minions, then everything is completed and Salt master will return a response containing all the minion responses.
If some of the minions are down during this process, the workflow continues as follows:
-
If
timeout
is reached before getting all expected responses from the minions, then Salt master would trigger an aditional job (a Saltfind_job
job) targeting only pending minions to check whether the job is already running on the minion. -
Now
gather_job_timeout
is evaluated. A new counter is now triggered. -
If this new
find_job
job responses that the original job is actually running on the minion, then Salt master will wait for that minion’s response. -
In case of reaching
gather_job_timeout
without having any response from the minion (neither for the initialtest.ping
nor for thefind_job
job), Salt master will return with only the gathered responses from the responding minions.
By default, SUSE Manager globally sets timeout
and gather_job_timeout
to 120 seconds.
So, in the worst case, a Salt call targeting unreachable minions will end up with 240 seconds of waiting until getting a response.
A Presence Ping Mechanism for Unreachable Salt Minions
In order to prevent waiting until timeouts are reached when some minions are down, SUSE introduced a so-called "presence mechanism" for Salt minions.
This presence mechanism checks for unreachable Salt minions when SUSE Manager is performing synchronous calls to these minions, and it excludes unreachable minions from that call. Synchronous calls are going to be displaced in favor of asynchronous calls but currently still being used during some workflows.
The presence mechanism triggers a Salt test.ping
with a custom and fixed short Salt timeout values.
Default Salt values for the presence ping are: timeout
= 4
and gather_job_timeout = 1
.
This way, we can quickly detect which targeted minions are unreachable, and then exclude them from the synchronous call.
Overriding Salt Presence Timeout Values
SUSE Manager administrators can increase or decrease default presence ping timeout values by removing the comment markers (\#
) and setting the desired values for salt_presence_ping_timeout
and salt_presence_ping_gather_job_timeout
options in /etc/rhn/rhn.conf
:
# SUSE Manager presence timeouts for Salt minions # salt_presence_ping_timeout = 4 # salt_presence_ping_gather_job_timeout = 1
Salt SSH Minions (SSH Push)
Salt SSH minions are slightly different that regular minions (zeromq). Salt SSH minions do not use Salt PUB/RET channels but a wrapper Salt command inside of an SSH call.
Salt timeout
and gather_job_timeout
are not playing a role here.
SUSE Manager defines a timeout for SSH connections in /etc/rhn/rhn.conf
:
# salt_ssh_connect_timeout = 180
The presence ping mechanism is also working with SSH minions.
In this case, SUSE Manager will use salt_presence_ping_timeout
to override the default timeout value for SSH connections.