From: "Robert Collins"
To: "Jason Tishler"
Cc:
Subject: Re: fix cond_race... was RE: src/winsup/cygwin ChangeLog thread.cc thread.h ...
Date: Mon, 22 Oct 2001 11:54:39 +1000

rethreaded to cygdev...

----- Original Message -----
From: "Jason Tishler"

> Rob,
>
> On Sun, Oct 07, 2001 at 10:24:30PM +1000, Robert Collins wrote:
> > From: "Jason Tishler"
> > > Unfortunately, Python's test_threadedtempfile regression test still
> > > hangs (IIRC) in the same place. See attached for details.
> >
> > I'm going to have to think about this one - unless your system is
> > massively overloaded during the test - such that the spinloop around
> > line 482 is able to get 10 timeslices without the waiting thread
> > getting 1?!? - there should be no way to tickle this.
> >
> > I'd like you to add a system_printf at line 483, something like
> > system_printf ("repulsing event at count 5\n"); (oh, and put it with
> > the PulseEvent in {}). If that fires then we know that the detection
> > code is OK. If so, can you try bumping the spin count up and making
> > the PulseEvent fire if spins mod 5 == 0?
>
> With the attached patch applied to thread.cc version 1.52, Python's
> test_threadedtempfile regression test still hangs in the same place.
> Did I alter the code as you intended above, or did I misunderstand?
>
> When I run test_threadedtempfile, I get the following output:
>
> 0 [main] python 2024 pthread_cond::Signal: repulsing event at count 995
> 382380484 [unknown (0x520)] python 2024 pthread_cond::Signal: repulsing
> event at count 0
> ..
>
> So the repulse event is occurring, but I don't think that it is having
> any effect.

Which means that a) there are waiting threads listed, AND b) none of them
have called WaitForSingleObject yet.

> Is there anything else that you would like me to try?

Yes. I'll write more in ~ 1 hr. The basic game plan is that we need to
figure out why the other thread is not getting to call WFSO. The fix I
put in place is meant to follow this logic:

Any thread altering the state of a condition object grabs an access
mutex to ensure atomic alterations. The access mutex must never be held
across blocking system calls.

Threads that want to wait on the cond variable atomically increment a
waiting-thread counter, for both performance and race-fixing reasons.
They then release the mutex and call WFSO on the cond event object.
These three operations are not bound into an atomic unit, so to prevent
lost signals, upon wake-up, threads atomically decrement the
waiting-thread counter _before_ grabbing the cond variable access mutex.
So the woken thread will block until the signaller releases the access
mutex.
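To make that ordering concrete, here is a minimal sketch of the wait side
in Win32 terms. It is an illustration only - the class and member names
(cond_sketch, access, win32_obj, waiting) are assumptions for this
sketch, not the actual thread.cc declarations:

    #include <windows.h>

    /* Illustrative stand-in for the cygwin pthread_cond class.  */
    struct cond_sketch
    {
      HANDLE access;          /* access mutex guarding the cond's state */
      HANDLE win32_obj;       /* auto-reset event the waiters block on */
      volatile LONG waiting;  /* threads between "decided to wait" and "woken" */

      cond_sketch ()
        : access (CreateMutex (NULL, FALSE, NULL)),
          win32_obj (CreateEvent (NULL, FALSE, FALSE, NULL)),
          waiting (0)
      {
      }

      void Wait (HANDLE user_mutex)
      {
        /* 1. Register our intent to wait under the access mutex, so
           the counter only changes while the cond's state is locked.  */
        WaitForSingleObject (access, INFINITE);
        InterlockedIncrement (&waiting);
        ReleaseMutex (access);

        /* 2. Release the caller's mutex and block on the event.  Steps
           1, 2 and this WFSO are NOT one atomic unit; this is the
           window in which a signal could be lost.  */
        ReleaseMutex (user_mutex);
        WaitForSingleObject (win32_obj, INFINITE);

        /* 3. Decrement BEFORE taking the access mutex.  The signaller
           holds that mutex while it re-pulses, so it can observe the
           decrement and know its signal was consumed; we block here
           until the signaller lets go.  */
        InterlockedDecrement (&waiting);
        WaitForSingleObject (access, INFINITE);
        ReleaseMutex (access);

        /* 4. Re-acquire the caller's mutex before returning, per POSIX.  */
        WaitForSingleObject (user_mutex, INFINITE);
      }
    };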
This decrement-before-lock ordering allows detection of lost signals by
the signaller: if the waiting-thread count does not decrement, then
either you have a crashed thread (unlikely, as this is completely within
our code) or the waiting thread has not had enough timeslices to call
WFSO yet. So the signaller gives up the CPU and tries again... and
again... The count of 5 between tries is an attempt to prevent releasing
multiple threads because we're not waiting long enough!

There is a second potential race in this: multiple waiters entering and
altering the waiting-thread count. That is solved by the cond access
mutex, which is kept locked by the signaller. (A sketch of this
signal-side loop follows after the sign-off.)

So the problem you have, Jason, is simply that none of the waiting
threads have called WFSO AND they are not being given enough CPU time to
do so. There are several reasons this could happen:

1) The waiting-thread count is wrong, and there are actually no threads
waiting when Signal occurs. Then the Wait will always fail.

2) There is a synchronisation issue with entry to the cond access mutex
between the waiter and the signaller.

3) (And this is the nasty one.) The signaller is at a higher priority
level than the waiter. This will result (if I recall my terminology
correctly) in an inverted priority situation, which NT does not handle.
(This is why hard real-time folk still shun NT kernels.)

For 1), gdb is your friend. For 2), system_printfs and I are your
friends. For 3), try temporarily dropping the priority of the signaller
before it signals and restoring it before exiting the access mutex
(pthread_setschedparam should do). Of course, if Python doesn't set
thread priorities, then 3) is unlikely.

Rob
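For illustration, the signal-side loop described above, with the
priority workaround from 3) folded in, might look roughly like this. As
with the wait-side sketch, the names are assumptions and the real
thread.cc logic may differ:

    #include <windows.h>

    /* Companion to the wait-side sketch; same illustrative names.  */
    struct cond_sketch
    {
      HANDLE access;          /* access mutex, created as in the wait sketch */
      HANDLE win32_obj;       /* auto-reset event */
      volatile LONG waiting;

      void Signal ()
      {
        /* Hold the access mutex for the whole retry loop.  This also
           closes the second race: no new waiter can bump the counter
           while we are watching it.  */
        WaitForSingleObject (access, INFINITE);

        LONG before = waiting;
        if (before == 0)
          {
            /* Nobody is waiting; per POSIX, signalling is a no-op.  */
            ReleaseMutex (access);
            return;
          }

        /* Workaround for cause 3): drop our priority so a lower-priority
           waiter can be scheduled and reach its WFSO.  SetThreadPriority
           is the Win32-level call; pthread_setschedparam would be the
           pthread-level equivalent.  */
        int old_prio = GetThreadPriority (GetCurrentThread ());
        SetThreadPriority (GetCurrentThread (), THREAD_PRIORITY_LOWEST);

        PulseEvent (win32_obj);

        /* Lost-signal detection: until a waiter decrements the counter,
           it had not reached its WFSO when we pulsed.  Yield our
           timeslice, and re-pulse only every 5th spin so a slow waiter
           gets several chances before we risk waking more than one
           thread.  */
        int spins = 0;
        while (waiting == before)
          {
            Sleep (0);                /* give up the CPU */
            if (++spins % 5 == 0)
              PulseEvent (win32_obj); /* the "repulsing event" in the logs */
          }

        SetThreadPriority (GetCurrentThread (), old_prio);
        ReleaseMutex (access);
      }
    };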