BUG: (P0) External Filter does not attempt to reconnect with server

jcheng · Jun 21, 2005

Description:

The external filter does not seem to reconnect to the external filter server once the connection is broken. We tested this in cases where our server is 1) shut down and restarted properly, or 2) unreachable due to network failure. In either case the external filter was not able to re-establish the connection with the server, short of restarting lmgrd. The vendor daemon simply logs:

"14:24:03 (plattst2) Vendor daemon could not connect to External Filter server jcheng on port 9582 (-17,80005)"

without really trying to reconnect to the server. This is a serious reliability issue (P0) for us, because it means that if our daemon goes down or is restarted for any reason, the external filter functionality will simply stop working.

Thanks,
-Johnson

davidz · Jun 21, 2005

We do have code to reconnect. We are currently looking at which system error codes trigger the reconnect attempt.

jcheng · Jun 23, 2005

Hi,

Just want to add some more information to this bug:

If the vendor daemons are started *prior* to our external filter server running and listening on the port, then when the first checkout comes, the vendor daemons will connect to the server but will have some trouble handshaking with it (the daemon seems to send a header and then nothing, causing our read to time out).

I suspect this is related to the re-connect scenario as well.

Thanks,
-Johnson

jcheng · Jun 24, 2005

Hi guys,

I did a quick test for this bug, and I found that

1) The problem I added to this report yesterday is now fixed. Even if I start my external filter server after the vendor daemons were started, a checkout/checkin will still establish the connection properly.

2) The re-connect functionality is still not there. Here's how I tested:

- Start my external filter server
- Start new lmgrd + vendor daemons (I can see they connect with my server)
- Do some checkout/checkins (I can see them in my server)
- Shut down my external filter server, which closes all the connections with the vendor daemons
- Now do a checkout -> the vendor daemons don't detect the disconnect yet
- Do a checkin -> now the vendor daemons do detect the disconnect (I can see in the log)
- Now I start my external filter server again
- Do another checkout -> from now on the vendor daemons will NOT reconnect to my server, and will keep on logging the message as before:

"12:11:13 (plattst1) Vendor daemon could not connect to External Filter server jcheng on port 9582 (-17,80005)"

Thanks,
-Johnson

lnielsen · Jun 24, 2005

Hi,

I ran Johnson's test sequence. The problem exists on Linux, which I will get right on. Windows, however, works fine.

Lars

jcheng · Jun 27, 2005

Hi,

I have tested the new build with the reconnect bug fix, and now the vendor daemon does detect the broken connection and reconnects with my server.

One thing I observed was that, if I shut down and restart my external filter server, the first checkout/in afterwards does not attempt to reconnect and is missed. The next checkout/in does detect the broken connection and does attempt the reconnection, and all goes well. Is there a way to fix this behaviour so that even the first checkout/in after the broken connection is detected?

(I am thinking of the scenario where the user reconfigures our daemon and needs to shut down and immediately restart it between checkout/ins. It would be good if they didn't lose an extra transaction because of that...)
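
One way this could be handled, sketched under assumptions: if every event gets a reply from the filter server, the sender can treat a missing reply (end-of-file or a socket error) as a broken connection, reconnect once, and resend the same event. The Python sketch below uses a made-up newline-delimited protocol with an `OK` reply. The class `FilterClient`, its methods, and the wire format are all hypothetical and do not reflect the real External Filter protocol or the vendor daemon's code.

```python
import socket


class FilterClient:
    """Hypothetical client: one line per event, one line per reply."""

    def __init__(self, host, port, timeout=2.0):
        self.addr = (host, port)
        self.timeout = timeout  # connect and read timeout, in seconds
        self.sock = None

    def _connect(self):
        self.sock = socket.create_connection(self.addr, timeout=self.timeout)

    def _close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None

    def _exchange(self, event):
        """Send one event and return the reply line, or None on failure."""
        try:
            if self.sock is None:
                self._connect()
            self.sock.sendall(event + b"\n")
            # Assumption: the short reply arrives in a single read.
            reply = self.sock.recv(64)
        except OSError:
            # Covers connection errors and read timeouts alike.
            reply = b""
        if not reply:
            # Empty read means the peer closed; drop the dead socket.
            self._close()
            return None
        return reply.strip()

    def send_event(self, event):
        """Send an event; on a broken connection, reconnect once and resend."""
        reply = self._exchange(event)
        if reply is None:
            # Assumption: the server tolerates a duplicate if the first
            # attempt actually got through but its reply was lost.
            reply = self._exchange(event)
        return reply
```

With this shape, the first checkout after a server restart costs one failed exchange but is not lost: the empty read exposes the dead connection and the event is resent on a fresh one. It relies on the server acknowledging each event, which is an assumption here.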

Thanks,
-Johnson

jcheng · Jun 27, 2005

Hi,

I have re-tested and confirmed the scenario, and also checked the library version I am building against. Here is a summary of what happens:

- start lmgrd + VDs
- start my server
- checkout (VD connects to my server correctly, and the checkout is sent to my server)
- checkin (sent to my server)
- shut down my server
- restart my server
- checkout (license checked out, but NO TRAFFIC observed over the wire on my server side)
- checkin (license checked in, now the vendor daemon reconnects with my server)
- from now on everything works

My testing platform is Linux.

Thanks,
-Johnson

davidz · Jun 28, 2005

Lars, other engineers, and I see this behavior, but we have not yet determined why our call to write() succeeds even though the License Scheduler side of the connection has been cut. We originally thought it had to do with a timeout value; that is, that it might take a while for our side of the connection to be notified of the cut. But after waiting some 15 minutes, we saw the same behavior. We have researched all the options of the write() call and the socket setup calls, and searched for this behavior on Google, but have not yet found the answer.

Therefore, let's step back and examine the consequences if this behavior is not changed:

The filter server (License Scheduler) can go down for any reason (reconfiguration, failure, or normal shutdown) and the network can fail for many reasons.

We agreed that the license server cannot wait too long for a response from the filter server, so we put in a read timeout parameter.

Therefore, while the license server cannot connect to its filter server (for any of the reasons above), N filter events will time out and will never be resent to the filter server.

The current behavior will result in N+1 filter events never being sent to the filter server.

In our tests, not filtering one event seems significant, but in real deployed environments I suspect that not filtering the N+1st event will probably not be significant.

Platform's thoughts?
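
A possible explanation for the write() behaviour described above, offered as a general TCP note rather than a confirmed diagnosis of the vendor daemon: on a TCP socket, write() only copies data into the local kernel send buffer. When the peer closes its end, the local socket receives a FIN, and the first write() after that still succeeds. Typically only a later write() fails, once the peer has answered the unexpected data with a reset. The Python sketch below reproduces this on the loopback interface and shows one way to notice the closed connection before writing, by peeking for end-of-file. The names `peer_has_closed` and `demo` are illustrative and are not part of FLEXnet or License Scheduler.

```python
import select
import socket


def peer_has_closed(sock, wait=0.0):
    """Return True if the peer has closed (or reset) the connection.

    Waits up to `wait` seconds for the socket to become readable, then
    peeks one byte without consuming it. An empty result means end-of-file.
    """
    readable, _, _ = select.select([sock], [], [], wait)
    if not readable:
        return False  # nothing pending, so no close has been seen yet
    try:
        return sock.recv(1, socket.MSG_PEEK) == b""
    except ConnectionError:
        return True  # reset or aborted: the connection is unusable too


def demo():
    """Show that a send still succeeds after the peer has closed."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    server_side.close()  # the "filter server" goes away
    listener.close()

    closed_seen = peer_has_closed(client, wait=2.0)
    sent = client.send(b"checkout\n")  # accepted by the local kernel anyway
    client.close()
    return closed_seen, sent
```

Calling `demo()` returns whether the close was detected and how many bytes the send reported as written. A check like this only catches an orderly close or a reset. A network failure in which no FIN or reset ever arrives would still need a read timeout or TCP keepalive to detect.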
