Mailing-List: contact cygwin-developers-help AT sourceware DOT cygnus DOT com; run by ezmlm
Sender: cygwin-developers-owner AT sources DOT redhat DOT com
Delivered-To: mailing list cygwin-developers AT sources DOT redhat DOT com
Date: Fri, 16 Nov 2001 10:36:57 +0100
From: Corinna Vinschen
To: cygwin-developers AT cygwin DOT com
Subject: Re: TCP connections can occasionally fail because of a winsock bug
Message-ID: <20011116103657.H27452@cygbert.vinschen.de>
Reply-To: cygwin-developers AT cygwin DOT com
Mail-Followup-To: cygwin-developers AT cygwin DOT com
References: <20011115212156 DOT 5563 DOT qmail AT lizard DOT curl DOT com> <200111160258 DOT fAG2wVm27159 AT barbelith DOT montana DOT com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.2.5i
In-Reply-To: <200111160258.fAG2wVm27159@barbelith.montana.com>; from bowman@montana.com on Thu, Nov 15, 2001 at 08:00:18PM -0700

On Thu, Nov 15, 2001 at 08:00:18PM -0700, robert bowman wrote:
> On Thursday 15 November 2001 14:21, you wrote:
> > I've dug deeply enough into this to determine that I believe the
> > problem is caused by a bug in winsock.  I can get the problem to
> > manifest itself completely independently from Cygwin.  See the full
> > description in the attached program, which one of my coworkers with an
> > MSDN subscription is going to forward to Microsoft to see what they
> > have to say about it.
>
> For what it's worth, we recently encountered this problem in the ONC RPC
> library. The original Sun code, and any revision I've been able to find,
> binds a local port even on the TCP protocol. The same thing happens, with the
> bind not failing, and the failure occurring on the connect.
>
> We depend on RPC heavily, and would see delays on startup when the initial
> clnt_create would fail repeatedly.
> The RPC attempts to use a pool of local
> ports, and will increment and retry if the bind fails -- but it doesn't.
>
> This is not a cygwin issue; we are using the MKS/DataFocus NutCracker
> toolkit. DataFocus provided the ported ONC RPC code but does not support it.
> We have been tinkering with it in-house. The bind can be eliminated for some
> improvement, in this case.
>
> There are other issues we are dealing with. I've forwarded a couple of the
> emails to another programmer at work who is also working on NT/2000 socket
> issues.
>
> Interestingly enough, on Linux, the bind also fails unless the process has
> root privileges. However, the code only iterates on EADDRINUSE and the return
> is not checked, so the connect succeeds.
>
> I, also, wrote a native testcase with the WSA calls and got the same results.
> I did note that the OS expires the port eventually, but it takes 5 to 20
> minutes.
>
> I believe the root of the problem is that both the remote host address and
> local port are used to determine if the connection is unique. bind would fail
> if anything other than ANY_ADDR is used, so at the time of the bind it isn't
> known if the combination is unique. Only when the host address is known in
> connect, will the combination fail.
>
> Our problem was exacerbated by the fact several apps are typically started at
> the same time on one station, and they are all trying to make RPC connections
> to the server machine. The ONC RPC algo uses the pid to calculate which port
> to try first; with several clients starting and making several connections,
> there would be groups of used ports; if a connection timed out, and the next
> attempt moved into a cluster of ports being used by another app, the
> clnt_create would fail many times, before it finally iterated into fresh
> territory.

Thanks for that interesting description.

There's that SO_REUSEADDR call to setsockopt().  I wonder if that
could be a help.  It's considered somewhat dangerous, though.
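
To make the failure mode described above concrete, here is a minimal sketch of a port-pool loop that retries when either bind() or connect() fails with an address-in-use error, instead of checking only the bind. It is written against POSIX sockets rather than Winsock, and the helper name connect_from_port_pool, its parameters, and the reuse_addr flag are hypothetical illustrations, not part of the ONC RPC code. Whether SO_REUSEADDR actually helps with the Winsock behaviour discussed here is an open question; the flag is only there so the two variants can be compared.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not from ONC RPC): try each local port in
 * [first_port, first_port + count) until both bind() and connect() succeed.
 * If reuse_addr is nonzero, SO_REUSEADDR is set before the bind.
 * Returns a connected descriptor, or -1 if no port in the pool worked. */
int connect_from_port_pool(const struct sockaddr_in *server,
                           unsigned short first_port, int count,
                           int reuse_addr)
{
    for (int i = 0; i < count; i++) {
        /* A socket whose connect() failed is not reused; start fresh. */
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        if (reuse_addr) {
            int on = 1;
            if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof on) < 0) {
                close(fd);
                return -1;
            }
        }

        struct sockaddr_in local;
        memset(&local, 0, sizeof local);
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port = htons((unsigned short)(first_port + i));

        /* The local port can look free at bind() time and still clash
         * once the remote address is known, so both calls are checked. */
        if (bind(fd, (const struct sockaddr *)&local, sizeof local) == 0 &&
            connect(fd, (const struct sockaddr *)server, sizeof *server) == 0)
            return fd;

        int err = errno;   /* save before close() can overwrite it */
        close(fd);
        if (err != EADDRINUSE && err != EADDRNOTAVAIL)
            return -1;     /* unrelated error: give up instead of looping */
    }
    return -1;
}
```

A caller would fill in the server address and pass a pool such as (first_port, 16). As noted in the quoted message, dropping the explicit bind and letting the stack choose the local port avoids the pool logic altogether when a specific source port is not required.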

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Developer                                mailto:cygwin AT cygwin DOT com
Red Hat, Inc.