Mailing-List: contact cygwin-developers-help AT sourceware DOT cygnus DOT com; run by ezmlm
Sender: cygwin-developers-owner AT sources DOT redhat DOT com
Delivered-To: mailing list cygwin-developers AT sources DOT redhat DOT com
Message-Id: <200111160258.fAG2wVm27159@barbelith.montana.com>
Content-Type: text/plain; charset="iso-8859-1"
From: robert bowman
To: cygwin-developers AT cygwin DOT com
Subject: Re: TCP connections can occasionally fail because of a winsock bug
Date: Thu, 15 Nov 2001 20:00:18 -0700
X-Mailer: KMail [version 1.3.1]
References: <20011115212156 DOT 5563 DOT qmail AT lizard DOT curl DOT com>
In-Reply-To: <20011115212156.5563.qmail@lizard.curl.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

On Thursday 15 November 2001 14:21, you wrote:
> I've dug deeply enough into this to determine that I believe the
> problem is caused by a bug in winsock.  I can get the problem to
> manifest itself completely independently from Cygwin.  See the full
> description in the attached program, which one of my coworkers with an
> MSDN subscription is going to forward to Microsoft to see what they
> have to say about it.

For what it's worth, we recently encountered this problem in the ONC RPC library. The original Sun code, and every revision I've been able to find, binds a local port even for the TCP protocol. The same thing happens: the bind does not fail, and the failure occurs on the connect. We depend on RPC heavily, and would see delays on startup when the initial clnt_create failed repeatedly. The RPC code tries to use a pool of local ports, and will increment and retry if the bind fails, but here the bind does not fail.

This is not a cygwin issue; we are using the MKS/DataFocus NutCracker toolkit. DataFocus provided the ported ONC RPC code but does not support it. We have been tinkering with it in-house. The bind can be eliminated for some improvement, in this case. There are other issues we are dealing with.
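To make the retry behaviour concrete, here is a rough sketch of a "pool of local ports" bind loop of the kind described above. It is not the Sun code: the helper name bind_from_pool, its parameters, and the choice to raise on other errors are all assumptions made for illustration. The point it shows is that the loop only advances when bind itself reports "address in use", so if bind succeeds and the collision only shows up at connect, the retry never kicks in.

```python
import errno

# Error codes treated as "address in use". WSAEADDRINUSE only exists on
# Windows builds of Python, hence the getattr fallback.
ADDR_IN_USE = {errno.EADDRINUSE,
               getattr(errno, "WSAEADDRINUSE", errno.EADDRINUSE)}


def bind_from_pool(sock, start_port, end_port, first_port):
    """Hypothetical helper: bind sock to a free port in [start_port, end_port].

    Tries first_port first, then increments, wrapping around at end_port.
    Moves on to the next port only when bind reports "address in use";
    any other error stops the iteration and is raised to the caller.
    Returns the port that was bound.
    """
    n_ports = end_port - start_port + 1
    for i in range(n_ports):
        port = start_port + (first_port - start_port + i) % n_ports
        try:
            sock.bind(("", port))  # "" means any local address
            return port
        except OSError as exc:
            if exc.errno not in ADDR_IN_USE:
                raise
    raise OSError(errno.EADDRINUSE, "no free port in the pool")
```

With a real TCP socket this would be called right before connect, for example bind_from_pool(sock, 600, 1023, 600), where the port numbers are again only placeholders.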
I've forwarded a couple of the emails to another programmer at work who is also working on NT/2000 socket issues.

Interestingly enough, on Linux, the bind also fails unless the process has root privileges. However, the code only iterates on EADDRINUSE and the return value is not checked, so the connect succeeds.

I also wrote a native test case with the WSA calls and got the same results. I did note that the OS expires the port eventually, but it takes 5 to 20 minutes.

I believe the root of the problem is that both the remote host address and the local port are used to determine whether the connection is unique. bind would fail if anything other than ANY_ADDR is used, so at the time of the bind it isn't known whether the combination is unique. Only when the host address is known, in connect, will the combination fail.

Our problem was exacerbated by the fact that several apps are typically started at the same time on one station, and they are all trying to make RPC connections to the server machine. The ONC RPC algorithm uses the pid to calculate which port to try first; with several clients starting and making several connections, there would be groups of used ports. If a connection timed out and the next attempt moved into a cluster of ports being used by another app, the clnt_create would fail many times before it finally iterated into fresh territory.
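The clustering effect can be illustrated with a small model. This is only a sketch under stated assumptions: bind is taken to always succeed, a connect is taken to fail whenever the (local port, remote address) pair is still held by the OS, and the pool bounds (600 to 1023) and the pid-modulo starting point are made-up values, not taken from the RPC source. The function names are hypothetical as well.

```python
# Assumed pool bounds, for illustration only.
START_PORT = 600
END_PORT = 1023
N_PORTS = END_PORT - START_PORT + 1


def first_port(pid):
    """Pick the first port to try from the pid (assumed scheme)."""
    return START_PORT + pid % N_PORTS


def attempts_until_connect(pid, remote, busy):
    """Count connect attempts until an unused (local_port, remote) pair is found.

    busy is a set of (local_port, remote) pairs that are in use or not yet
    expired by the OS. bind is modeled as always succeeding, so a collision
    is only noticed at connect time. Returns (attempts, local_port).
    """
    port = first_port(pid)
    for attempt in range(1, N_PORTS + 1):
        if (port, remote) not in busy:
            return attempt, port
        # Collision at connect: move to the next port, wrapping around.
        port = port + 1 if port < END_PORT else START_PORT
    raise RuntimeError("every local port in the pool collides")


# Another app holds a cluster of 20 ports towards the same server.
server = ("server", 111)
cluster = {(p, server) for p in range(700, 720)}
print(attempts_until_connect(100, server, cluster))
```

In this model a client whose first port lands at the start of the cluster needs one attempt per port in the cluster before it gets through, while the same local ports remain usable towards a different remote address.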