Additional license consumed on client reconnection post heartbeat interval

Description

When a client app loses its network connection and then regains it, the original socket remains open on the license server. On reconnecting, the app consumes an additional license. When the app later exits cleanly, only one license is checked in, and the other remains checked out and unavailable to other users.

Replication Scenario

- Set up a client/server system (on separate machines) and check out a license

- Using lmstat on the server, confirm that a license is checked out

- Unplug the network cable on the client

- Wait at least the period of one heartbeat (normally two minutes)

- Plug the cable back in and note (from the vendor log) that a second license is checked out.

- Confirm using lmstat that two licenses are checked out

- Exit the client app and confirm that only one license is checked in.

- Confirm using lmstat that one license is left checked out, even 1 hour later.

Root Cause

While the network is disconnected, the client's heartbeat to the server fails with a network error, so the client drops its connection to the daemon. When the network is back, the client creates a new connection and resends the checkout request for the feature. Because this is a different connection, the server performs an additional checkout for the feature. The client never reclaims this extra license because it does not know about it, so when the client exits, the license lingers forever.

If the client had checked out n licenses before the network disconnect, all n licenses remain held on the server.
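
The behaviour described above can be illustrated with a small toy model. This is only a sketch, assuming a server that tracks checkouts per connection rather than per client. The class ToyLicenseServer and its methods are hypothetical names made up for illustration and are not part of FlexNet Publisher.

```python
class ToyLicenseServer:
    """Toy model (not FlexNet code): checkouts are tracked per connection."""

    def __init__(self, total):
        self.total = total
        self.held = {}  # connection id -> number of licenses held

    def in_use(self):
        return sum(self.held.values())

    def checkout(self, conn_id, count=1):
        if self.in_use() + count > self.total:
            raise RuntimeError("Licensed number of users already reached")
        self.held[conn_id] = self.held.get(conn_id, 0) + count

    def close(self, conn_id):
        # Licenses are returned only when the server sees this connection close.
        self.held.pop(conn_id, None)


server = ToyLicenseServer(total=2)
server.checkout("conn-1")   # initial checkout
# Network drops: the client abandons conn-1, but the server never sees a close.
server.checkout("conn-2")   # client reconnects on a new connection
in_use_after_reconnect = server.in_use()
server.close("conn-2")      # clean exit releases only the new connection
in_use_after_exit = server.in_use()
print(in_use_after_reconnect, in_use_after_exit)
```

In this model the license tied to the abandoned connection stays checked out after the client exits, which matches the last two replication steps.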

Workaround

>> The first workaround reduces the LM_A_TCP_TIMEOUT value. This value is set by the client and is the time the server waits before deciding that the client is disconnected and checking its licenses back in. We suggest the following formula for calculating the timeout from the heartbeat settings:

LM_A_TCP_TIMEOUT = (LM_A_CHECK_INTERVAL x 2) + (LM_A_RETRY_COUNT x LM_A_RETRY_INTERVAL) + one-minute buffer

As an example, setting:

1.) LM_A_CHECK_INTERVAL to 30 seconds,

2.) LM_A_RETRY_COUNT to 2 and

3.) LM_A_RETRY_INTERVAL to 30 seconds

would result in an LM_A_TCP_TIMEOUT of (30 x 2) + (2 x 30) + 60 = 180 seconds, i.e. 3 minutes.

Since the default LM_A_TCP_TIMEOUT is 2 hours, this significantly reduces the probability of the license server holding back licenses. For that to occur, the client would have to reconnect within 3 minutes.
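
The suggested formula can be written as a small helper to try out different heartbeat settings. This is a minimal sketch: the function name tcp_timeout_seconds and its parameter names are illustrative assumptions, and applying the resulting value in the client is not shown.

```python
def tcp_timeout_seconds(check_interval, retry_count, retry_interval, buffer=60):
    """Suggested LM_A_TCP_TIMEOUT in seconds (hypothetical helper).

    check_interval  -> LM_A_CHECK_INTERVAL, in seconds
    retry_count     -> LM_A_RETRY_COUNT
    retry_interval  -> LM_A_RETRY_INTERVAL, in seconds
    buffer          -> the one-minute buffer from the formula, in seconds
    """
    return 2 * check_interval + retry_count * retry_interval + buffer


timeout = tcp_timeout_seconds(check_interval=30, retry_count=2, retry_interval=30)
print(timeout, "seconds")  # 180 seconds, i.e. 3 minutes
```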

The disadvantages of this workaround are:

1. It does not completely solve the issue (but does drastically reduce its occurrence)
2. Client updates are needed

A consequence of the workaround is that clients reconnecting after 3 minutes (using the above example) would have to check out licenses again, even if they managed to reconnect on the same socket.

>> A second workaround is to edit the server OS TCP properties:

Edit or create the KeepAliveTime, KeepAliveInterval and TcpMaxDataRetransmissions registry values under HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters (refer to http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html for the equivalents on Linux).

For example, if KeepAliveTime is set to 600 seconds, KeepAliveInterval to 60 seconds and TcpMaxDataRetransmissions to 3, the server waits 600 seconds and then sends a keepalive probe to the client every 60 seconds, up to three times. After that, the server considers the connection to be broken. Note that KeepAliveTime and KeepAliveInterval are entered in the registry in milliseconds (600000 and 60000 here).

The disadvantage of the second workaround is that it configures TCP properties for all processes running on the server. This should be acceptable if the license server is the only production process running on the server, for example if it is isolated in a VM.
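
As a rough guide to choosing these values, the time the server needs to detect a dead connection in the keepalive example above can be estimated as the idle time plus one interval per unanswered probe. The sketch below is only an approximation of the TCP stack's behaviour, and the helper names are hypothetical.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Approximate time until an idle, broken connection is declared dead."""
    return idle + interval * probes


def to_registry_ms(seconds):
    """Windows stores KeepAliveTime and KeepAliveInterval in milliseconds."""
    return seconds * 1000


detection = keepalive_detection_seconds(idle=600, interval=60, probes=3)
print(detection, "seconds")                      # 780 seconds, i.e. 13 minutes
print(to_registry_ms(600), to_registry_ms(60))   # 600000 60000
```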

Version Fix

The above issue has been fixed in FNP 11.18.3.0 (2021 R4) under FNP-18904, so you can run a quick test with our latest version.

As part of this fix, we introduced a vendor variable, 'ls_server_override_client_tcp_timeout', to override the LM_A_TCP_TIMEOUT value on the server side. Broken client connections are cleared at the server end after the 'ls_server_override_client_tcp_timeout' timeout period, and their licenses are checked back in. We also enable TCP keepalive, so that the TCP stack clears broken connections if the application fails to close them after a certain period.

For the fix to work, you need to set this variable in lsvendor.c (for example, ls_server_override_client_tcp_timeout=300) and rebuild the server.

alnicol · ‎Aug 24, 2022

FYI it is possible to replicate this issue even when using 11.18.3.1 just by using the replication steps in this article, which suggests that this issue is not in fact fixed after all. Flexera have yet to confirm but it is easy to test yourself if needed

Workaround 2 is also truncated and cannot be read

jyadav · ‎Aug 25, 2022

Hi @alnicol ,

As part of this fix, we introduced a vendor variable, 'ls_server_override_client_tcp_timeout', to override the LM_A_TCP_TIMEOUT value on the server side. Broken client connections are cleared at the server end after the 'ls_server_override_client_tcp_timeout' timeout period, and their licenses are checked back in. We also enable TCP keepalive, so that the TCP stack clears broken connections if the application fails to close them after a certain period.

For the fix to work, you need to set this variable in lsvendor.c (for example, ls_server_override_client_tcp_timeout=300) and rebuild the server.

alnicol · ‎Aug 25, 2022

Hi @jyadav ,

Thank you, hopefully that is the info we need to resolve the issue.

Perhaps this article could be amended to make that clearer for others in the future

alnicol · ‎Sep 20, 2022

@jyadav

I think the fix described here may well prevent the duplicate checked-out licenses from being held indefinitely, as the article describes, but it seemingly does not actually prevent the duplicate checkout in the first place, which can cause significant issues in itself.

Based on the information provided and testing on our end, the fix applied in 11.18.3.0 may work if the duration of the disconnection is more than 300 seconds (300 seems to be the minimum value allowable for 'ls_server_override_client_tcp_timeout')

But I think if the client reconnects within the 300 seconds then an additional license is still going to be consumed - which is problematic if there are no more licenses available, and FlexNet causes the application to close with a "Lost license, cannot re-connect: Licensed number of users already reached" error

Logically, if the fix is only applied if the timeout is reached, then there will be no difference if the duration is less than the minimum timeout and the issue could still occur. Our testing also confirms that the license error can seemingly still occur for these short timeouts

Is there any way to eliminate this duplicate checkout problem completely? Ideally, the server should not consume another license even temporarily, as otherwise it's possible to hit this error. I guess the server would need to do something like detect that the checkout request occurred due to a re-connection, and if so kill the old connection (freeing up the old license) before checking out a new license

I appreciate that the above suggestion would need further software changes, but seems to me the current fix doesn't handle this scenario

Additionally - is there any exit handler that is called when the application exits due to a "Lost license, cannot re-connect: Licensed number of users already reached" error?

If such a handler existed, it may at least be possible to save the application data before exiting - something that doesn't currently occur and thus data can be lost, however I can't see reference to any such handler.
This could be a useful addition that would not address the root cause of this issue, but at least allow it to be less damaging

Thanks

PS I raised a support case for this issue and have been met with silence, hence why we're asking for support via this post.