Mailing-List: contact cygwin-developers-help AT sourceware DOT cygnus DOT com; run by ezmlm
Sender: cygwin-developers-owner AT sourceware DOT cygnus DOT com
Delivered-To: mailing list cygwin-developers AT sourceware DOT cygnus DOT com
Message-Id: <m10Hzkw-0010wTC@malasada.lava.net>
From: newsham AT lava DOT net (Tim Newsham)
Subject: signal safety and cygwin
To: cygwin-developers AT sourceware DOT cygnus DOT com
Date: Tue, 2 Mar 1999 14:43:21 -1000 (HST)
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit


Hi,

    I'm experiencing a problem in which processes get into an unkillable
state, which I believe is related to signals and the signal safety (or
lack thereof) of parts of cygwin.  NOTE: I am using an older version of
cygwin (based off of cygwin beta 19.1, with additional patches).  I am
also using a beta of win2000, although I have seen similar problems
in the past in nt4.0 which I was able to work around.

Here is the particular problem I am seeing:  I have a process which talks
over a socket to another process.  This process has an alarm handler
which siglongjmp's back to the start of execution and writes back an
error on the socket in the event of a timeout.  Sometimes when this
process is blocked in a select on some sockets and is interrupted,
the code returns to the start of execution and attempts to perform
its write, but the write hangs completely.  Any further attempts to
kill the process from cygwin fail, although I can kill the process
fine from the task manager.

Performing stack traces on the hung process (using windbg, ugh, I
would kill for the ability to attach to a process w/ gdb) I see that
the process is indeed hung in the sendto that cygwin does when I
perform the write() on the socket.  In addition, I can see that
before the SIGALRM arrived, I was indeed in cygwin's select function,
waiting on the socket select thread to die (select_done).

My guess is that the signal is coming in, select is noting that a
signal came in and its WaitForMultipleObjects is completing, it
is shutting down the select thread, and then it returns into the
handler on the stack, which then proceeds to write to a socket.  At
this point we have the select thread doing a winsock select on one
socket handle (depending on the race between the two threads) while
the other thread is doing a sendto on another socket handle.  For
whatever reason, winsock isn't happy and the sendto blocks indefinitely.

Looking at the bigger picture, there are other problems:  select
went and did a lot of busy work, and wasn't allowed to clean up
after itself.  Shouldn't the entire select function be protected
against signal handler dispatch (although definitely it still should
allow signal delivery to break the timed wait)?  It scares me that
there are many other places in cygwin where this sort of problem
can show its face, since the code generally allows for interruption
unless explicitly disallowed.  Maybe a better policy would be to
disallow interruption unless explicitly allowed, like in many unix
kernels?

If this specific problem has been fixed in a newer version, I apologize
for wasting time.  I am not up to date w/ cygwin at the moment.  If
anyone sees flaws in my analysis and can offer a better one, please
do.

                                              Tim N.