From: newsham AT lava DOT net (Tim Newsham)
Subject: signal safety and cygwin
To: cygwin-developers AT sourceware DOT cygnus DOT com
Date: Tue, 2 Mar 1999 14:43:21 -1000 (HST)

Hi,

I'm experiencing a problem in which processes get into an unkillable state, which I believe is related to signals and the signal safety (or lack thereof) of parts of cygwin.

NOTE: I am using an older version of cygwin (based on cygwin beta 19.1, with additional patches). I am also using a beta of win2000, although I have seen similar problems in the past on nt4.0, which I was able to work around.

Here is the particular problem I am seeing:

I have a process which talks over a socket to another process. This process has an alarm handler which siglongjmp's back to the start of execution and writes back an error on the socket in the event of a timeout. Sometimes, when this process is performing a select on some sockets and is interrupted, the code returns to the start of execution and attempts to perform its write, but the write hangs completely. Any further attempts to kill the process from cygwin fail, although I can kill the process fine from the task manager.

Performing stack traces on the hung process (using windbg, ugh, I would kill for the ability to attach to a process w/ gdb), I see that the process is indeed hung in the sendto that cygwin does when I perform the write() on the socket. In addition, I am able to see that just before the SIGALRM arrived, I was indeed in cygwin's select function, waiting on the socket select thread to die (select_done).
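For reference, the pattern described above can be sketched roughly as follows. This is a simplified illustration in plain POSIX C, not the actual program: the names wait_or_report_timeout, on_alarm and restart_point and the "ERR timeout" message are made up, and ordinary file descriptors stand in for the sockets.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical names throughout; this only mirrors the shape of the problem. */
static sigjmp_buf restart_point;

/* SIGALRM handler: abandon whatever was running and jump back to the start. */
static void on_alarm(int signo)
{
    (void)signo;
    siglongjmp(restart_point, 1);
}

/* Wait until in_fd is readable, for at most timeout_secs seconds.
 * Returns 0 if in_fd became readable, 1 if the alarm fired and an error
 * line was written to out_fd, -1 on any other failure. */
int wait_or_report_timeout(int in_fd, int out_fd, unsigned timeout_secs)
{
    static const char err[] = "ERR timeout\n";
    struct sigaction sa;
    fd_set rfds;
    int n;

    assert(in_fd >= 0 && in_fd < FD_SETSIZE);   /* FD_SET precondition */

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) < 0)
        return -1;

    /* "Start of execution": the handler lands here with a nonzero value.
     * The second argument (1) makes siglongjmp restore the signal mask. */
    if (sigsetjmp(restart_point, 1) != 0) {
        /* select() was abandoned midway; this is the write that hangs
         * in the scenario described above. */
        if (write(out_fd, err, sizeof err - 1) < 0)
            return -1;
        return 1;
    }

    alarm(timeout_secs);
    FD_ZERO(&rfds);
    FD_SET(in_fd, &rfds);
    n = select(in_fd + 1, &rfds, NULL, NULL, NULL);
    alarm(0);                                   /* cancel the timeout */
    return n > 0 ? 0 : -1;
}
```

The point of the sketch is that the handler leaves select() by jumping out of it, so any cleanup select() would have done on its way out never runs before the write() is attempted.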
My guess is that the signal is coming in, select is noting that a signal came in and its WaitForMultipleObjects call is completing, it is shutting down the select thread, and then it returns into the handler on the stack, which then proceeds to write to a socket. At this point we have the select thread doing a winsock select on one socket handle (depending on the race between the two threads) while the other thread is doing a sendto on another socket handle. For whatever reason, winsock isn't happy and the sendto blocks indefinitely.

Looking at the bigger picture, there are other problems: select went and did a lot of busy work, and wasn't allowed to clean up after itself. Shouldn't the entire select function be protected against signal handler dispatch (although it definitely should still allow signal delivery to break the timed wait)?

It scares me that there are many other places in cygwin where this sort of problem can show its face, since the code generally allows interruption unless explicitly disallowed. Maybe a better policy would be to disallow interruption unless explicitly allowed, as in many unix kernels?

If this specific problem has been fixed in a newer version, I apologize for wasting time. I am not up to date w/ cygwin at the moment. If anyone sees flaws in my analysis and can offer a better one, please do.

Tim N.
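To illustrate the "disallow interruption unless explicitly allowed" policy raised above, here is a minimal sketch using standard POSIX calls (sigprocmask and pselect). It is an application-level analogy only, not a description of cygwin's internals or a proposed patch, and it assumes a platform that provides pselect. The names guarded_wait, note_alarm and alarm_fired are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t alarm_fired;

/* The handler only records the event; it never jumps out of the wait. */
static void note_alarm(int signo)
{
    (void)signo;
    alarm_fired = 1;
}

/* Wait until fd is readable, for at most timeout_secs seconds.
 * Returns 0 if readable, 1 on timeout, -1 on any other failure. */
int guarded_wait(int fd, unsigned timeout_secs)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, wait_mask;
    fd_set rfds;
    int n, saved_errno;

    assert(fd >= 0 && fd < FD_SETSIZE);         /* FD_SET precondition */

    /* Default policy: SIGALRM handler dispatch is deferred. */
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &old_mask) < 0)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) < 0) {
        sigprocmask(SIG_SETMASK, &old_mask, NULL);
        return -1;
    }

    alarm_fired = 0;
    alarm(timeout_secs);

    /* Explicitly allow SIGALRM only for the duration of the wait itself;
     * pselect swaps the mask in and out atomically around the wait. */
    wait_mask = old_mask;
    sigdelset(&wait_mask, SIGALRM);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    n = pselect(fd + 1, &rfds, NULL, NULL, NULL, &wait_mask);
    saved_errno = errno;

    /* Cleanup always runs, with SIGALRM blocked again. */
    alarm(0);
    sa.sa_handler = SIG_IGN;                    /* discard a late, pending alarm */
    sigaction(SIGALRM, &sa, NULL);
    sigaction(SIGALRM, &old_sa, NULL);
    sigprocmask(SIG_SETMASK, &old_mask, NULL);

    if (n > 0)
        return 0;
    if (n < 0 && saved_errno == EINTR && alarm_fired)
        return 1;
    errno = saved_errno;
    return -1;
}
```

Here the signal can still break the wait, but the handler cannot run anywhere else in the function, so the cleanup after the wait is never skipped. The caller then reacts to the timeout return value instead of relying on siglongjmp.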