Redundant server failures

Summary

Intermittent failures in a three-server redundancy (triad) configuration.

Question

We're having problems with the stability of our redundant servers. The problem has been intermittent for years. We see no specific error, and no other deterministic data is available. We're currently collecting:

Number of TCP open requests per second to the lmgrd daemon
Service time, in microseconds, of every request to the lmgrd daemon (from accept on the port to shutdown)
Socket receive backlog (set to 500 by lmgrd), socket drops, and socket overflows on the port, sampled every 10 seconds
File descriptor counts, sampled every 10 seconds for each of the three servers (lmgrd and two vendor daemons)
The only correlation we have found is that the socket receive backlog grows and maxes out, and the socket overflow counter usually increments at the same time as the failures to maintain quorum on the redundant servers. An increase in the socket backlog also leads to partial failures to verify a host.
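The file-descriptor sampling described above can be sketched minimally as follows. This is a Linux-only illustration (it reads /proc/<pid>/fd); the function names are ours, and in practice you would pass the PIDs of lmgrd and both vendor daemons rather than the demo's own PID:

```python
import os
import time

def open_fd_count(pid: int) -> int:
    """Count open file descriptors for a process by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def sample(pids, interval_s=10, samples=1):
    """Take `samples` snapshots of fd counts for each pid, `interval_s` seconds apart."""
    results = []
    for i in range(samples):
        results.append({pid: open_fd_count(pid) for pid in pids})
        if i < samples - 1:
            time.sleep(interval_s)
    return results

if __name__ == "__main__":
    # Demo: sample our own process; in practice, pass the lmgrd and vendor daemon PIDs.
    print(sample([os.getpid()], samples=1))
```

A rising fd count on lmgrd alongside a saturated receive backlog is consistent with the overload picture described in the answer below.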

Any ideas as to why this may be happening?

Answer

This is one of the many symptoms that can occur when license servers are loaded beyond their capacity. Symptoms range from clients timing out to, as in this case, quorum being lost in a triad.

This is not something engineering can practically engage with unless a deterministic error behavior is reproduced, or an enhancement request designed to mitigate the high-load behavior is raised (which again requires determinism).

The kinds of things that you could experiment with today to alleviate lost-quorum are:
1. Reduce licenses served per triad.
2. Increase hardware resources on the primary and secondary nodes (high CPU clock speed matters more than the number of cores or threads).
3. Consider upgrading the vendor daemon to the latest version - there may be some benefit from our recent select->poll changes (such as FNP-17708: "Remove the OS select() call from the FNP code-base for all non-Windows platforms" implemented with FNP 11.15.1).
4. Set LM_SERVER_HIGHEST_FD (see the documentation) to a value in the low hundreds, down from the default of 1024. This lowers the rate of client connections per second that lmgrd will accept, which may help mitigate lost-quorum situations.
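Item 4 is typically applied via the environment of the lmgrd process. A minimal sketch, assuming a Linux startup script; the paths and the value 128 are illustrative placeholders, not recommendations for your specific load:

```shell
# Cap the highest file descriptor lmgrd will use, which lowers the rate of
# accepted client connections. 128 is an illustrative "low hundreds" value;
# tune it against your own measurements (default is 1024).
export LM_SERVER_HIGHEST_FD=128

# Placeholder paths: adjust to your installation.
/opt/flexnet/lmgrd -c /opt/flexnet/licenses/server.lic -l /var/log/lmgrd.log
```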

In general, the approach to mitigating lost quorum is to control client load. Enhancement FNP-18732: "Have a graceful and predictable error just before VD hits its producer-tested limits" is under consideration to help mitigate this type of situation.
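One generic way to control client load (not a FlexNet feature, just a common pattern) is to add jittered exponential backoff to license-checkout retries, so that bursts of clients do not all hit lmgrd's accept queue at once. The `checkout` callable below is a hypothetical placeholder for whatever checkout call your client makes:

```python
import random
import time

def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))

def checkout_with_retry(checkout, max_attempts: int = 5):
    """Call `checkout` (a hypothetical license-checkout callable), retrying with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return checkout()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(backoff_delay(attempt))
```

The random jitter is the important part: without it, clients that failed together retry together, reproducing the same backlog spike on each retry wave.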
Last update: Nov 13, 2018 11:38 PM