Tuning Large Scale Deployments
SUSE Manager is designed by default to work on small and medium scale installations. For installations with more than 1000 clients per SUSE Manager Server, adequate hardware sizing and parameter tuning must be performed.
The instructions in this section can have severe and catastrophic performance impacts when improperly used. In some cases, they can cause SUSE Manager to completely cease functioning. Always test changes before implementing them in a production environment. During implementation, take care when changing parameters. Monitor performance before and after each change, and revert any steps that do not produce the expected result.
We strongly recommend that you contact SUSE Consulting for assistance with tuning. SUSE will not provide support for catastrophic failure when these advanced parameters are modified without consultation.
Tuning is not required on installations of fewer than 1000 clients. Do not perform these instructions on small or medium scale installations.
1. The Tuning Process
Any SUSE Manager installation is subject to a number of design and infrastructure constraints that, for the purposes of tuning, we call environmental variables. Environmental variables can include the total number of clients, the number of different operating systems under management, and the number of software channels.
Environmental variables influence, either directly or indirectly, the value of most configuration parameters. During the tuning process, the configuration parameters are manipulated to improve system performance.
Before you begin tuning, you will need to estimate the best setting for each environmental variable, and adjust the configuration parameters to suit.
To help you with the estimation process, we have provided you with a dependency graph. Locate the environmental variables on the dependency graph to determine how they will influence other variables and parameters.
Environmental variables are represented by graph nodes in a rectangle at the top of the dependency graph. Each node is connected to the relevant parameters that might need tuning. Consult the relevant sections in this document for more information about recommended values.
Tuning one parameter might require tuning other parameters, or changing hardware, or the infrastructure. When you change a parameter, follow the arrows from that node on the graph to determine what other parameters might need adjustment. Continue through each parameter until you have visited all nodes on the graph.
- 3D boxes are hardware design variables or constraints
- Oval-shaped boxes are software or system design variables or constraints
- Rectangle-shaped boxes are configurable parameters, color-coded by configuration file:
  - Red: Apache httpd configuration files
  - Blue: Salt configuration files
  - Brown: Tomcat configuration files
  - Grey: PostgreSQL configuration files
  - Purple: /etc/rhn/rhn.conf
- Dashed connecting lines indicate a variable or constraint that might require a change to another parameter
- Solid connecting lines indicate that changing a configuration parameter requires checking another one to prevent issues
After the initial tuning has been completed, you will need to consider tuning again in these cases:
- If your tuning inputs change significantly
- If special conditions arise that require a certain parameter to be changed. For example, if specific warnings appear in a log file.
- If performance is not satisfactory
To re-tune your installation, you will need to use the dependency graph again. Start from the node where significant change has happened.
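Walking the graph from a changed node can be sketched as a breadth-first traversal. The example below is a minimal illustration only: it encodes just the "After changing" links that are stated in the parameter tables of this document, not the full dependency graph, and the function name `parameters_to_review` is hypothetical.

```python
from collections import deque

# "After changing" links taken from the parameter tables in this document.
# This is an incomplete subset of the real dependency graph.
FOLLOW_UPS = {
    "java.taskomatic_channel_repodata_workers": ["taskomatic.java.maxmemory"],
    "org.quartz.threadPool.threadCount": ["hibernate.c3p0.max_size", "thread_pool"],
    "hibernate.c3p0.max_size": ["max_connections"],
    "thread_pool": ["worker_threads"],
}


def parameters_to_review(changed):
    """Return every parameter reachable from `changed`, in breadth-first order."""
    seen = {changed}
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for follow_up in FOLLOW_UPS.get(node, []):
            if follow_up not in seen:
                seen.add(follow_up)
                order.append(follow_up)
                queue.append(follow_up)
    return order


print(parameters_to_review("org.quartz.threadPool.threadCount"))
# ['hibernate.c3p0.max_size', 'thread_pool', 'max_connections', 'worker_threads']
```

The `seen` set ensures each node is visited only once, which matches the instruction to continue until all connected nodes have been checked.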
2. Environmental Variables
This section contains information about environmental variables (inputs to the tuning process).
- Network Bandwidth
-
A measure of the typically available egress bandwidth from the SUSE Manager Server host to the clients or SUSE Manager Proxy hosts. This should take into account network hardware and topology as well as possible capacity limits on switches, routers, and other network equipment between the server and clients.
- Channel count
-
The number of expected channels to manage. Includes any vendor-provided, third-party, and cloned or staged channels.
- Client count
-
The total number of actual or expected clients. It is important to tune any parameters in advance of a client count increase, whenever possible.
- OS mix
-
The number of distinct operating system versions that managed clients have installed. This is ordered by family (SUSE Linux Enterprise, openSUSE, Red Hat Enterprise Linux, or Ubuntu based). Storage and computing requirements are different in each case.
- User count
-
The expected maximum number of concurrent users interacting with the Web UI, plus the number of programs simultaneously using the XMLRPC API. Includes spacecmd, spacewalk-clone-by-date, and similar.
3. Parameters
This section contains information about the available parameters.
3.1. MaxClients
Description |
The maximum number of HTTP requests served simultaneously by Apache httpd. Proxies, Web UI, and XMLRPC API clients each consume one. Requests exceeding the parameter will be queued and might result in timeouts. |
Tune when |
User count and proxy count increase significantly and this line appears in |
Value default |
150 |
Value recommendation |
150-500 |
Location |
|
Example |
|
After changing |
Immediately change |
Notes |
This parameter was renamed to |
More information |
https://httpd.apache.org/docs/2.4/en/mod/mpm_common.html#maxrequestworkers |
3.2. ServerLimit
Description |
The number of Apache httpd processes serving HTTP requests simultaneously.
The number must equal |
Tune when |
|
Value default |
150 |
Value recommendation |
The same value as |
Location |
|
Example |
|
More information |
https://httpd.apache.org/docs/2.4/en/mod/mpm_common.html#serverlimit |
3.3. maxThreads
Description |
The number of Tomcat threads dedicated to serving HTTP requests |
Tune when |
|
Value default |
150 |
Value recommendation |
The same value as |
Location |
|
Example |
|
More information |
3.4. connectionTimeout
Description |
The number of milliseconds before a non-responding AJP connection is forcibly closed. |
Tune when |
Client count increases significantly and |
Value default |
900000 |
Value recommendation |
20000-3600000 |
Location |
|
Example |
|
More information |
3.5. keepAliveTimeout
Description |
The number of milliseconds without data exchange from the JVM before a non-responding AJP connection is forcibly closed. |
Tune when |
Client count increases significantly and |
Value default |
300000 |
Value recommendation |
20000-600000 |
Location |
|
Example |
|
More information |
3.6. Tomcat’s -Xmx
Description |
The maximum amount of memory Tomcat can use |
Tune when |
|
Value default |
1 GiB |
Value recommendation |
4-8 GiB |
Location |
|
Example |
|
After changing |
Check memory usage |
More information |
https://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html |
3.7. java.message_queue_thread_pool_size
Description |
The maximum number of threads in Tomcat dedicated to asynchronous operations |
Tune when |
Client count increases significantly |
Value default |
5 |
Value recommendation |
50-150 |
Location |
|
Example |
|
After changing |
Check |
Notes |
Incoming Salt events are handled in a separate thread pool, see |
More information |
|
3.8. java.salt_batch_size
Description |
The maximum number of minions concurrently executing a scheduled action. |
Tune when |
Client count reaches several thousand and actions are not executed quickly enough. |
Value default |
200 |
Value recommendation |
200-500 |
Location |
|
Example |
|
After changing |
Check memory usage. Monitor memory usage closely before and after the change. |
More information |
3.9. java.salt_event_thread_pool_size
Description |
The maximum number of threads in Tomcat dedicated to handling of incoming Salt events. |
Tune when |
The number of queued Salt events grows. Typically, this can happen during onboarding of a large number of minions with a higher value of
|
Value default |
8 |
Value recommendation |
20-100 |
Location |
|
Example |
|
After changing |
Check the length of the Salt event queue.
Check |
More information |
|
3.10. java.salt_presence_ping_timeout
Description |
Before any action is executed on a client, a presence ping is executed to make sure the client is reachable.
This parameter sets the amount of time before a second command ( |
Tune when |
Client count increases significantly, or some clients are responding correctly but too slowly, and SUSE Manager excludes them from calls.
This line appears in |
Value default |
4 seconds |
Value recommendation |
4-400 seconds |
Location |
|
Example |
|
After changing |
Large |
More information |
3.11. java.salt_presence_ping_gather_job_timeout
| Description | Before any action is executed on a client, a presence ping is executed to make sure the client is reachable.
After java.salt_presence_ping_timeout seconds have elapsed without a response, a second command (find_job) is sent to the client for a final check.
This parameter sets the number of seconds after the second command after which the client is definitely considered offline.
Having many clients typically means some will respond faster than others, so this timeout could be raised to accommodate the slower ones.
| Tune when | Client count increases significantly, or some clients are responding correctly but too slowly, and SUSE Manager excludes them from calls.
This line appears in /var/log/rhn/rhn_web_ui.log: "Got no result for <COMMAND> on minion <MINION_ID> (minion did not respond in time)"
| Value default | 1 second
| Value recommendation | 1-100 seconds
| Location | /etc/rhn/rhn.conf
| Example | java.salt_presence_ping_gather_job_timeout = 10
| More information | Salt Timeouts
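Because the two presence timeouts run one after the other, their sum gives a rough upper bound on how long SUSE Manager waits before treating a client as offline. The following is a minimal sketch of that arithmetic, assuming no other delays are involved; the function name is hypothetical.

```python
def seconds_until_offline(ping_timeout, gather_job_timeout):
    """Approximate wait before a silent client is considered offline.

    ping_timeout:       java.salt_presence_ping_timeout, in seconds
    gather_job_timeout: java.salt_presence_ping_gather_job_timeout, in seconds
    """
    return ping_timeout + gather_job_timeout


# Defaults listed in this document: 4 seconds and 1 second.
print(seconds_until_offline(4, 1))   # 5
# With the gather job timeout raised to the example value of 10.
print(seconds_until_offline(4, 10))  # 14
```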
3.12. java.taskomatic_channel_repodata_workers
| Description | Whenever content is changed in a software channel, its metadata needs to be recomputed before clients can use it.
Channel-altering operations include the addition of a patch, the removal of a package, or a repository synchronization run.
This parameter specifies the maximum number of Taskomatic threads that SUSE Manager will use to recompute the channel metadata.
Channel metadata computation is both CPU-bound and memory-heavy, so raising this parameter and operating on many channels simultaneously could cause Taskomatic to consume significant resources, but channels will be available to clients sooner.
| Tune when | Channel count increases significantly (more than 50), or more concurrent operations on channels are expected.
| Value default | 2
| Value recommendation | 2-10
| Location | /etc/rhn/rhn.conf
| Example | java.taskomatic_channel_repodata_workers = 4
| After changing | Check taskomatic.java.maxmemory for adjustment, as every new thread will consume memory
| More information | man rhn.conf
3.13. taskomatic.java.maxmemory
| Description | The maximum amount of memory Taskomatic can use.
Generation of metadata, especially for some OSs, can be memory-intensive, so this parameter might need raising depending on the managed OS mix.
| Tune when | java.taskomatic_channel_repodata_workers increases, OSs are added to SUSE Manager (particularly Red Hat Enterprise Linux or Ubuntu), or OutOfMemoryException errors appear in /var/log/rhn/rhn_taskomatic_daemon.log.
| Value default | 4096 MiB
| Value recommendation | 4096-16384 MiB
| Location | /etc/rhn/rhn.conf
| Example | taskomatic.java.maxmemory = 8192
| After changing | Check memory usage.
| More information | man rhn.conf
3.14. org.quartz.threadPool.threadCount
| Description | The number of Taskomatic worker threads.
Increasing this value allows Taskomatic to serve more clients in parallel.
| Tune when | Client count increases significantly
| Value default | 20
| Value recommendation | 20-200
| Location | /etc/rhn/rhn.conf
| Example | org.quartz.threadPool.threadCount = 100
| After changing | Check hibernate.c3p0.max_size and thread_pool for adjustment
| More information | http://www.quartz-scheduler.org/documentation/2.4.0-SNAPSHOT/configuration.html
3.15. org.quartz.scheduler.idleWaitTime
| Description | Cycle time for Taskomatic.
Decreasing this value lowers the latency of Taskomatic.
| Tune when | Client count is in the thousands.
| Value default | 5000 ms
| Value recommendation | 1000-5000 ms
| Location | /etc/rhn/rhn.conf
| Example | org.quartz.scheduler.idleWaitTime = 1000
| More information | http://www.quartz-scheduler.org/documentation/2.4.0-SNAPSHOT/configuration.html
3.16. MinionActionExecutor.parallel_threads
| Description | Number of Taskomatic threads dedicated to sending commands to Salt clients as a result of actions being executed.
| Tune when | Client count is in the thousands.
| Value default | 1
| Value recommendation | 1-10
| Location | /etc/rhn/rhn.conf
| Example | taskomatic.com.redhat.rhn.taskomatic.task.MinionActionExecutor.parallel_threads = 10
3.17. SSHMinionActionExecutor.parallel_threads
| Description | Number of Taskomatic threads dedicated to sending commands to Salt SSH clients as a result of actions being executed.
| Tune when | Client count is in the hundreds.
| Value default | 20
| Value recommendation | 20-100
| Location | /etc/rhn/rhn.conf
| Example | taskomatic.com.redhat.rhn.taskomatic.task.SSHMinionActionExecutor.parallel_threads = 40
3.18. hibernate.c3p0.max_size
| Description | Maximum number of PostgreSQL connections simultaneously available to both Tomcat and Taskomatic.
If any of those components requires more concurrent connections, their requests will be queued.
| Tune when | java.message_queue_thread_pool_size or maxThreads increase significantly, or when org.quartz.threadPool.threadCount has changed significantly.
Each thread consumes one connection in Taskomatic and Tomcat; having more threads than connections might result in starvation.
| Value default | 20
| Value recommendation | 100 to 200, higher than the maximum of java.message_queue_thread_pool_size + maxThreads and org.quartz.threadPool.threadCount
| Location | /etc/rhn/rhn.conf
| Example | hibernate.c3p0.max_size = 100
| After changing | Check max_connections for adjustment.
| More information | https://www.mchange.com/projects/c3p0/#maxPoolSize
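The lower bound in the recommendation above can be computed from the three thread settings. This is a sketch of one reading of that rule (strictly higher than the larger of the two thread totals); the function name is hypothetical, and the result still needs to be weighed against the 100 to 200 range.

```python
def c3p0_max_size_lower_bound(message_queue_threads, max_threads, quartz_threads):
    """Smallest hibernate.c3p0.max_size that exceeds both thread totals.

    message_queue_threads: java.message_queue_thread_pool_size (Tomcat)
    max_threads:           maxThreads (Tomcat)
    quartz_threads:        org.quartz.threadPool.threadCount (Taskomatic)
    """
    tomcat_total = message_queue_threads + max_threads
    return max(tomcat_total, quartz_threads) + 1


# Defaults from this document: 5, 150, and 20.
print(c3p0_max_size_lower_bound(5, 150, 20))  # 156
```

If the computed bound exceeds 200, that is a sign that the thread settings and the connection pool need to be reviewed together rather than raised independently.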
3.19. rhn-search.java.maxmemory
| Description | The maximum amount of memory that the rhn-search service can use.
| Tune when | Client count increases significantly, and OutOfMemoryException errors appear in journalctl -u rhn-search.
| Value default | 512 MiB
| Value recommendation | 512-4096 MiB
| Location | /etc/rhn/rhn.conf
| Example | rhn-search.java.maxmemory = 4096
| After changing | Check memory usage.
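Several of the parameters above live in /etc/rhn/rhn.conf and have a recommended numeric range. The sketch below shows one way to flag values that fall outside those ranges. It assumes simple `key = value` lines with `#` comments, uses only the ranges listed in this document, and works on a hypothetical sample rather than a real file; the helper name `out_of_range` is also hypothetical.

```python
# Recommended ranges as listed in the parameter tables of this document.
RECOMMENDED = {
    "java.taskomatic_channel_repodata_workers": (2, 10),
    "taskomatic.java.maxmemory": (4096, 16384),
    "org.quartz.threadPool.threadCount": (20, 200),
    "org.quartz.scheduler.idleWaitTime": (1000, 5000),
    "hibernate.c3p0.max_size": (100, 200),
    "rhn-search.java.maxmemory": (512, 4096),
}


def out_of_range(conf_text):
    """Return {key: value} for known keys whose value is outside its range."""
    findings = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in RECOMMENDED:
            low, high = RECOMMENDED[key]
            if not low <= int(value) <= high:
                findings[key] = int(value)
    return findings


# Hypothetical sample configuration.
SAMPLE_CONF = """
org.quartz.threadPool.threadCount = 100
taskomatic.java.maxmemory = 2048
hibernate.c3p0.max_size = 100
"""
print(out_of_range(SAMPLE_CONF))  # {'taskomatic.java.maxmemory': 2048}
```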
3.20. shared_buffers
| Description | The amount of memory reserved for PostgreSQL shared buffers, which contain caches of database tables and index data.
| Tune when | RAM changes
| Value default | 25% of total RAM
| Value recommendation | 25-40% of total RAM
| Location | /var/lib/pgsql/data/postgresql.conf
| Example | shared_buffers = 8192MB
| After changing | Check memory usage.
| More information | https://www.postgresql.org/docs/10/runtime-config-resource.html#GUC-SHARED-BUFFERS
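As a quick illustration of the 25-40% recommendation, the following sketch converts total RAM into a candidate range for shared_buffers. The function name is hypothetical and the result is only a starting point.

```python
def shared_buffers_range_mb(total_ram_mb):
    """Return (low, high) for shared_buffers as 25% and 40% of total RAM."""
    return total_ram_mb * 25 // 100, total_ram_mb * 40 // 100


# A server with 32 GiB of RAM.
print(shared_buffers_range_mb(32768))  # (8192, 13107)
```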
3.21. max_connections
| Description | Maximum number of PostgreSQL connections available to applications.
More connections allow for more concurrent threads/workers in various components (in particular Tomcat and Taskomatic), which generally improves performance.
However, each connection consumes resources, in particular work_mem megabytes per sort operation per connection.
| Tune when | hibernate.c3p0.max_size changes significantly, as that parameter determines the maximum number of connections available to Tomcat and Taskomatic
| Value default | 400
| Value recommendation | 2 * hibernate.c3p0.max_size + 50, if less than 1000
| Location | /var/lib/pgsql/data/postgresql.conf
| Example | max_connections = 250
| After changing | Check memory usage.
Monitor memory usage closely before and after the change.
| More information | https://www.postgresql.org/docs/10/runtime-config-connection.html#GUC-MAX-CONNECTIONS
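The recommended value follows directly from hibernate.c3p0.max_size. The sketch below applies the formula and refuses to return a value once the result reaches 1000, which is one interpretation of the "if less than 1000" condition; the function name is hypothetical.

```python
def recommended_max_connections(c3p0_max_size):
    """Apply 2 * hibernate.c3p0.max_size + 50, valid only below 1000."""
    value = 2 * c3p0_max_size + 50
    if value >= 1000:
        raise ValueError(
            f"formula yields {value}; the recommendation only covers values below 1000"
        )
    return value


# hibernate.c3p0.max_size = 100 gives the value used in the example row.
print(recommended_max_connections(100))  # 250
```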
3.22. work_mem
| Description | The amount of memory allocated by PostgreSQL every time a connection needs to do a sort or hash operation.
Every connection (as specified by max_connections) might make use of an amount of memory equal to a multiple of work_mem.
| Tune when | Database operations are slow because of excessive temporary file disk I/O.
To test if that is happening, add log_temp_files = 5120 to /var/lib/pgsql/data/postgresql.conf, restart PostgreSQL, and monitor the PostgreSQL log files.
If you see lines containing LOG: temporary file: try raising this parameter's value to help reduce disk I/O and speed up database operations.
| Value recommendation | 2-20 MB
| Location | /var/lib/pgsql/data/postgresql.conf
| Example | work_mem = 10MB
| After changing | Check whether the SUSE Manager Server might need additional RAM.
| More information | https://www.postgresql.org/docs/10/runtime-config-resource.html#GUC-WORK-MEM
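Once log_temp_files is enabled, the temporary file messages can be counted with a short script. This is a minimal sketch: the sample lines are hypothetical, the timestamp prefix depends on your log_line_prefix setting, and the function name is an illustration only.

```python
import re

# Message PostgreSQL writes for each temporary file when log_temp_files is set.
TEMP_FILE = re.compile(r'temporary file: path "([^"]+)", size (\d+)')


def summarize_temp_files(log_lines):
    """Return (count, total_bytes) of temporary files reported in the log."""
    sizes = []
    for line in log_lines:
        match = TEMP_FILE.search(line)
        if match:
            sizes.append(int(match.group(2)))
    return len(sizes), sum(sizes)


# Hypothetical log excerpt.
SAMPLE_LOG = [
    '2024-01-01 10:00:00 UTC LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4242.0", size 10485760',
    '2024-01-01 10:00:01 UTC LOG:  statement: SELECT 1',
    '2024-01-01 10:00:05 UTC LOG:  temporary file: path "base/pgsql_tmp/pgsql_tmp4243.1", size 6291456',
]
print(summarize_temp_files(SAMPLE_LOG))  # (2, 16777216)
```

A count that keeps growing during normal operation suggests that sorts are spilling to disk and that a higher work_mem is worth testing.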
3.23. effective_cache_size
| Description | Estimation of the total memory available to PostgreSQL for caching.
It is the explicitly reserved memory (shared_buffers) plus any memory used by the kernel as cache/buffer.
| Tune when | Hardware RAM or memory usage increase significantly
| Value recommendation | Start with 75% of total RAM.
For finer settings, use shared_buffers + free memory + buffer/cache memory.
Free and buffer/cache can be determined via the free -m command (free and buff/cache in the output respectively)
| Location | /var/lib/pgsql/data/postgresql.conf
| Example | effective_cache_size = 24GB
| After changing | Check memory usage
| Notes | This is an estimation for the query planner, not an allocation.
| More information | https://www.postgresql.org/docs/10/runtime-config-query.html#GUC-EFFECTIVE-CACHE-SIZE
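The finer-grained estimate can be derived from the output of free -m. The sketch below parses a captured copy of that output and adds the free and buff/cache columns to shared_buffers. The sample output and the function name are hypothetical, and the parser assumes the column layout that includes a buff/cache column.

```python
def effective_cache_size_mb(free_output, shared_buffers_mb):
    """Return shared_buffers + free + buff/cache, all in MiB."""
    lines = free_output.strip().splitlines()
    header = lines[0].split()
    mem_line = next(line for line in lines if line.startswith("Mem:"))
    values = [int(field) for field in mem_line.split()[1:]]
    row = dict(zip(header, values))
    return shared_buffers_mb + row["free"] + row["buff/cache"]


# Hypothetical `free -m` output.
SAMPLE_FREE = """\
              total        used        free      shared  buff/cache   available
Mem:          64298       10240       20000        1024       34058       52000
Swap:          2047           0        2047
"""
print(effective_cache_size_mb(SAMPLE_FREE, 8192))  # 62250
```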
3.24. thread_pool
| Description | The number of worker threads serving Salt API HTTP requests.
A higher number can improve parallelism of SUSE Manager Server-initiated Salt operations, but will consume more memory.
| Tune when | java.message_queue_thread_pool_size or org.quartz.threadPool.threadCount are changed.
Starvation can occur when there are more Tomcat or Taskomatic threads making simultaneous Salt API calls than there are Salt API worker threads.
| Value default | 100
| Value recommendation | 100-500, but should be higher than the sum of java.message_queue_thread_pool_size and org.quartz.threadPool.threadCount
| Location | /etc/salt/master.d/susemanager.conf, in the rest_cherrypy section.
| Example | thread_pool: 100
| After changing | Check worker_threads for adjustment.
| More information | https://docs.saltstack.com/en/latest/ref/netapi/all/salt.netapi.rest_cherrypy.html#performance-tuning
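To avoid the starvation described above, thread_pool needs to stay above the combined number of Tomcat and Taskomatic threads that can call the Salt API. The following sketch computes the smallest value that satisfies both that rule and the lower end of the recommended range; the function name is hypothetical.

```python
def min_thread_pool(message_queue_threads, quartz_threads):
    """Smallest thread_pool that is at least 100 and exceeds the thread sum.

    message_queue_threads: java.message_queue_thread_pool_size
    quartz_threads:        org.quartz.threadPool.threadCount
    """
    return max(100, message_queue_threads + quartz_threads + 1)


print(min_thread_pool(5, 20))    # 100 (defaults: the range minimum applies)
print(min_thread_pool(50, 100))  # 151
```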
3.25. worker_threads
| Description | The number of salt-master worker threads that process commands and replies from minions and the Salt API.
Increasing this value, assuming sufficient resources are available, allows Salt to process more data in parallel from minions without timing out, but will consume significantly more RAM (typically about 70 MiB per thread).
| Tune when | Client count increases significantly, thread_pool increases significantly, or SaltReqTimeoutError or Message timed out errors appear in /var/log/salt/master.
| Value default | 8
| Value recommendation | 8-200
| Location | /etc/salt/master.d/tuning.conf
| Example | worker_threads: 50
| After changing | Check memory usage.
Monitor memory usage closely before and after the change.
| More information | https://docs.saltstack.com/en/latest/ref/configuration/master.html#worker-threads
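The per-thread figure quoted above gives a rough way to budget RAM before raising worker_threads. This sketch simply multiplies the thread count by 70 MiB; actual usage varies, so treat the result as an estimate to verify by monitoring. The function name is hypothetical.

```python
MIB_PER_WORKER = 70  # typical figure quoted in this document; real usage varies


def salt_master_worker_memory_mib(worker_threads):
    """Estimate RAM consumed by salt-master worker threads, in MiB."""
    return worker_threads * MIB_PER_WORKER


print(salt_master_worker_memory_mib(8))   # 560 (default)
print(salt_master_worker_memory_mib(50))  # 3500 (example value)
```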
3.26. pub_hwm
| Description | The maximum number of outstanding messages sent by salt-master.
If more than this number of messages need to be sent concurrently, communication with clients slows down, potentially resulting in timeout errors during load peaks.
| Tune when | Client count increases significantly and "Salt request timed out. The master is not responding." errors appear when pinging minions during a load peak.
| Value default | 1000
| Value recommendation | 10000-100000
| Location | /etc/salt/master.d/tuning.conf
| Example | pub_hwm: 10000
| More information | https://docs.saltstack.com/en/latest/ref/configuration/master.html#pub-hwm, https://zeromq.org/socket-api/#high-water-mark
3.27. zmq_backlog
| Description | The maximum number of allowed client connections that have started but not concluded the opening process.
If more than this number of clients connect in a very short time frame, connections are dropped and clients experience a delay reconnecting.
| Tune when | Client count increases significantly and, when very many clients reconnect in a short time frame, TCP connections to the salt-master process get dropped by the kernel.
| Value default | 1000
| Value recommendation | 1000-5000
| Location | /etc/salt/master.d/tuning.conf
| Example | zmq_backlog: 2000
| More information | https://docs.saltstack.com/en/latest/ref/configuration/master.html#zmq-backlog, http://api.zeromq.org/3-0:zmq-getsockopt (ZMQ_BACKLOG)
3.28. vm.swappiness
| Description | How aggressively the kernel moves unused data from memory to the swap partition.
Setting a lower value typically reduces swap usage and results in better performance, especially when RAM is abundant.
| Tune when | RAM increases, or swap is used when RAM is sufficient.
| Value default | 60
| Value recommendation | 1-60. For 128 GB of RAM, 10 is expected to give good results.
| Location | /etc/sysctl.conf
| Example | vm.swappiness = 20
| More information | https://documentation.suse.com/sles/15-SP3/html/SLES-all/cha-tuning-memory.html#cha-tuning-memory-vm
Adjusting some of the parameters listed in this section can result in a higher amount of RAM being used by various components. It is important that the amount of hardware RAM is adequate after any significant change.
To determine how RAM is being used, you will need to check each process that consumes it.
Operating system::
Stop all SUSE Manager services and inspect the output of
It is important that the SUSE Manager Server has sufficient RAM to accommodate all of these processes, especially OS buffers and caches, to have reasonable PostgreSQL performance. We recommend you keep several gigabytes available at all times, and add more as the database size on disk increases.
Whenever the expected amount of memory available for OS buffers and caches changes, update the
To get a live breakdown of the memory used by services on the SUSE Manager Server, use this command:
----
pidstat -p ALL -r --human 1 60 | tee pidstat-memory.log
----
This command will save a copy of displayed data in the