Redundant server failures
Summary
Three-server redundancy (triad) failures.
Question
We're having problems with the stability of redundant servers. The problem has been intermittent for years. There is no specific error we're seeing, and no other deterministic data is available. We're currently collecting:
- Number of TCP open requests per second to the lmgrd daemon
- Service time in microseconds (from accept on the port to shutdown) of every request to the lmgrd daemon
- The socket receive backlog (set to 500 by lmgrd), socket drops, and socket overflows on the port, measured every 10 seconds
- File descriptor counts, sampled every 10 seconds for each of the three servers (lmgrd and two vendor daemons)
The only thing that correlates is the socket receive backlog: it increases or maxes out, and we usually see the socket overflow counter increment when the redundant servers fail to maintain quorum. The increase in socket backlog also leads to partial failures to verify a host.
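For readers who want to collect similar data, the following is a minimal sketch of sampling the listen-queue overflow and drop counters. It assumes a Linux host, where the kernel exposes `ListenOverflows` and `ListenDrops` in the `TcpExt` section of `/proc/net/netstat`. These counters are system-wide rather than per-port, so they only approximate what is described above. The function names (`parse_netstat`, `listen_counters`, `delta`, `watch`) are illustrative and not part of any FlexNet Publisher tooling.

```python
import time


def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {section: {counter: value}}.

    The file is laid out as pairs of lines: a header line of counter names
    followed by a line of values, both prefixed with the section name.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    sections = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        name, _, keys = header.partition(":")
        value_name, _, vals = values.partition(":")
        if name != value_name:
            continue  # skip malformed pairs instead of guessing
        sections[name] = dict(zip(keys.split(), map(int, vals.split())))
    return sections


def listen_counters(text):
    """Return only the two listen-queue counters, defaulting to 0 if absent."""
    tcp = parse_netstat(text).get("TcpExt", {})
    return {key: tcp.get(key, 0) for key in ("ListenOverflows", "ListenDrops")}


def delta(previous, current):
    """Per-interval increase of each counter between two samples."""
    return {key: current[key] - previous[key] for key in current}


def watch(samples=6, interval=10.0, path="/proc/net/netstat"):
    """Print counter increases every `interval` seconds, `samples` times."""
    with open(path) as handle:
        previous = listen_counters(handle.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as handle:
            current = listen_counters(handle.read())
        print(time.strftime("%H:%M:%S"), delta(previous, current))
        previous = current
```

Calling `watch()` on a license server host prints one line per 10-second interval. A non-zero `ListenOverflows` delta in the same interval as a lost-quorum event would match the correlation described above.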
Any ideas as to why this may be happening?
Answer
This is one of the many symptoms that can occur when license servers are overloaded beyond their capacity. Symptoms can range from clients timing out to, as in this case, quorum being lost in a triad. This is not something engineering can practically engage with unless a deterministic error behavior is reproduced, or an enhancement request designed to mitigate the high-load behavior is raised (which again requires determinism).
The kinds of things that you could experiment with today to alleviate lost-quorum are:
1. Reduce licenses served per triad.
2. Increase hardware resources on primary and secondary nodes (high CPU count more important than number of cores or threads).
3. Consider upgrading the vendor daemon to the latest version - there may be some benefit from our recent select->poll changes (such as FNP-17708: "Remove the OS select() call from the FNP code-base for all non-Windows platforms" implemented with FNP 11.15.1).
4. Set LM_SERVER_HIGHEST_FD (see docs) to a value in the low hundreds, down from the default of 1024. This has the effect of lowering the rate of client connections per second accepted by lmgrd, which may help mitigate lost-quorum situations.
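As an illustration of item 4, here is a minimal sketch of starting lmgrd with a lowered LM_SERVER_HIGHEST_FD. It assumes the setting is read from the environment of the lmgrd process; confirm this against the documentation for your FlexNet Publisher version. The value 200, the helper names, and the paths passed by the caller are all hypothetical. The `-c` (license file) and `-l` (debug log) options are the usual lmgrd command-line options.

```python
import os
import subprocess


def lmgrd_environment(highest_fd, base_env=None):
    """Return a copy of the environment with LM_SERVER_HIGHEST_FD set.

    Assumption: lmgrd picks this setting up from its process environment.
    """
    if highest_fd <= 0:
        raise ValueError("highest_fd must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["LM_SERVER_HIGHEST_FD"] = str(highest_fd)
    return env


def start_lmgrd(lmgrd_path, license_file, debug_log, highest_fd=200):
    """Launch lmgrd with the lowered limit; 200 is an illustrative value."""
    command = [lmgrd_path, "-c", license_file, "-l", debug_log]
    return subprocess.Popen(command, env=lmgrd_environment(highest_fd))
```

Since the aim is to throttle accepted connections on the triad, the same value would presumably be applied on each node and then adjusted while watching for lost-quorum events.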
In general, an approach to mitigate lost-quorum is to control client load. Enhancement FNP-18732: "Have a graceful and predictable error just before VD hits its producer-tested limits" is under consideration to help mitigate this type of situation.