Mailing-List: contact cygwin-developers-help AT sourceware DOT cygnus DOT com; run by ezmlm List-Subscribe: List-Archive: List-Post: List-Help: , Sender: cygwin-developers-owner AT sources DOT redhat DOT com Delivered-To: mailing list cygwin-developers AT sources DOT redhat DOT com Date: 14 Sep 2001 11:56:45 -0400 Message-ID: <20010914155645.3005.qmail@lizard.curl.com> From: Jonathan Kamens To: cygwin-developers AT cygwin DOT com Subject: Successfully attached to hung "make -j2" process! Cgf was right -- things have progressed sufficiently since 1.1.8 that when the "make -j2" hang we've been seeing forever manifested itself again with a cygwin1.dll built with --enable-debugging out of the current CVS repository, I was able to successfully attach to it. Here's some background information about the problem, followed by what I'm seeing in the current instance of it.... We run our automated builds with "make -j2" on dual-processor machines. On a very regular basis, they hang and can't be killed. When they hang, we see two or more Make processes in a parent-child chain, separated by sh processes (we call Make recursively, so this is understandable), and one or more of them is in state "I" (according to ps). Until now, I've been unable to attach to the hung processes to find out what's going on. I just duplicated this problem running "make -j2" on a Cygwin build tree. I now have two hung Make processes, one of which is in state "I". I can attach the debugger to both of them. Both of them are hung inside read(). I'm going to take a little aside to explain how the Make "job server" works, in case you don't know, because I'm pretty certain, from what I see in these stack traces, that the problem is related to the job server. When Make is run with "-j" and the job server is enabled, it creates a pipe which is shared by all Make processes -- the master Make process run from the command line and all the children it starts up. It writes into the pipe characters, representing tokens for the jobs that are allowed to run in parallel at any given time. When any Make process running as part of the build decides that it has a command that it wants to execute, it reads a token from the read end of the pipe. When the job is finished, it writes the token back to the write end of the pipe so that another Make process can read it and start a job. Both hung Make processes appear to be hung trying to read a token from the job server pipe. This is the child Make process, the one which ps shows in state "I": (gdb) thread apply all where Thread 4 (thread 297.0xe1): #0 0x77f762e9 in ?? () #1 0x77f04ede in ?? () Thread 3 (thread 297.0xec): #0 0x77f682cb in ?? () #1 0x77f1cd76 in ?? () #2 0x610566f9 in wait_subproc () at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1240 #3 0x61004e9a in thread_stub (arg=0x6109c9f8) at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97 #4 0x77f04ede in ?? () Thread 2 (thread 297.0xf0): #0 0x77f682cb in ?? () #1 0x77f1cd76 in ?? () #2 0x610563a5 in wait_sig () at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1126 #3 0x61004e9a in thread_stub (arg=0x6109c9f8) at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97 #4 0x77f04ede in ?? () Thread 1 (thread 297.0xbf): #0 0x77f67f07 in ?? () #1 0x6101158f in fhandler_base::raw_read (this=0x614f3ffc, ptr=0x24ad432, ulen=1) at /u/jik/cygwin-cvs/src/winsup/cygwin/fhandler.cc:219 #2 0x61011e34 in fhandler_base::read (this=0x614f3ffc, in_ptr=0x24ad432, in_len=1) at /u/jik/cygwin-cvs/src/winsup/cygwin/fhandler.cc:488 #3 0x6105c004 in _read (fd=5, ptr=0x24ad432, len=1) at /u/jik/cygwin-cvs/src/winsup/cygwin/syscalls.cc:316 #4 0x61085a6e in read (fd=5, buf=0x24ad432, cnt=1) at /u/jik/cygwin-cvs/src/newlib/libc/syscalls/sysread.c:15 #5 0x6100ff74 in sigdelayed () at /u/jik/cygwin-cvs/src/winsup/cygwin/exceptions.cc:1229 #6 0x402a14 in execute_file_commands (file=0xa01ec78) at /u/jik/systems-cvs-root/cygwin/make/commands.c:362 #7 0x421abc in remake_file (file=0xa01ec78) at /u/jik/systems-cvs-root/cygwin/make/remake.c:1008 #8 0x42096b in update_file_1 (file=0xa01ec78, depth=4) at /u/jik/systems-cvs-root/cygwin/make/remake.c:659 #9 0x41ea78 in update_file (file=0xa01ec78, depth=4) at /u/jik/systems-cvs-root/cygwin/make/remake.c:309 #10 0x420f9d in check_dep (file=0xa01ec78, depth=4, this_mtime=1, must_make_ptr=0x24ad67c) at /u/jik/systems-cvs-root/cygwin/make/remake.c:813 #11 0x41fb13 in update_file_1 (file=0xa01b0f0, depth=3) at /u/jik/systems-cvs-root/cygwin/make/remake.c:450 #12 0x41ea78 in update_file (file=0xa01b0f0, depth=2) at /u/jik/systems-cvs-root/cygwin/make/remake.c:309 #13 0x420f9d in check_dep (file=0xa01b0f0, depth=2, this_mtime=1, must_make_ptr=0x24ad7bc) at /u/jik/systems-cvs-root/cygwin/make/remake.c:813 #14 0x41fb13 in update_file_1 (file=0xa013ba8, depth=1) at /u/jik/systems-cvs-root/cygwin/make/remake.c:450 #15 0x41ea78 in update_file (file=0xa013ba8, depth=0) at /u/jik/systems-cvs-root/cygwin/make/remake.c:309 #16 0x41e55d in update_goal_chain (goals=0xa013c08, makefiles=0) at /u/jik/systems-cvs-root/cygwin/make/remake.c:156 #17 0x41560c in main (argc=73, argv=0x614f6df4, envp=0xa010008) at /u/jik/systems-cvs-root/cygwin/make/main.c:1998 #18 0x61004294 in dll_crt0_1 () at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:770 #19 0x61004565 in _dll_crt0 () at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:837 #20 0x610045a4 in dll_crt0 (uptr=0x0) at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:849 #21 0x42e7d3 in cygwin_crt0 () #0 0x77f762e9 in ?? () Line 362 in Make's commands.c is this: new_job (file); I assume that new_job is hung inside the read call in this snippet of code: /* Read a token. As long as there's no token available we'll block. If we get a SIGCHLD we'll return with EINTR. If one happened before we got here we'll return immediately with EBADF because the signal handler closes the dup'd file descriptor. */ if (read (job_rfd, &token, 1) == 1) { DB (DB_JOBS, (_("Obtained token for child 0x%08lx (%s).\n"), (unsigned long int) c, c->file->name)); break; } I'm unsure of how to confirm this. Since new_job doesn't show up in the stack trace above (why not?), I can't enter the new_job frame and ask for the address of "token" to confirm that it matches the address in the call to read(). There is something odd here -- if it's trying to read from a pipe, why is it calling into fhandle_base instead of fhandler_pipe? Here's the parent Make process, which does *not* show "I" in its ps output, and which *is* calling into fhandler_pipe: (gdb) thread apply all where Thread 4 (thread 251.0xaa): #0 0x77f762e9 in ?? () #1 0x77f04ede in ?? () Thread 3 (thread 251.0xff): #0 0x77f682cb in ?? () #1 0x77f1cd76 in ?? () #2 0x610566f9 in wait_subproc () at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1240 #3 0x61004e9a in thread_stub (arg=0x6109c9f8) at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97 #4 0x77f04ede in ?? () Thread 2 (thread 251.0x94): #0 0x77f682cb in ?? () #1 0x77f1cd76 in ?? () #2 0x610563a5 in wait_sig () at /u/jik/cygwin-cvs/src/winsup/cygwin/sigproc.cc:1126 #3 0x61004e9a in thread_stub (arg=0x6109c9f8) at /u/jik/cygwin-cvs/src/winsup/cygwin/debug.cc:97 #4 0x77f04ede in ?? () Thread 1 (thread 251.0x12d): #0 0x77f67a5b in ?? () #1 0x6104db43 in peek_pipe (s=0x24aef64, ignra=0) at /u/jik/cygwin-cvs/src/winsup/cygwin/select.cc:458 #2 0x6104de44 in fhandler_pipe::ready_for_read (this=0x614f1be4, fd=5, howlong=4294967295, ignra=0) at /u/jik/cygwin-cvs/src/winsup/cygwin/select.cc:500 #3 0x6105bf90 in _read (fd=5, ptr=0x24af052, len=1) at /u/jik/cygwin-cvs/src/winsup/cygwin/syscalls.cc:300 #4 0x61085a6e in read (fd=5, buf=0x24af052, cnt=1) at /u/jik/cygwin-cvs/src/newlib/libc/syscalls/sysread.c:15 #5 0x6100ff74 in sigdelayed () at /u/jik/cygwin-cvs/src/winsup/cygwin/exceptions.cc:1229 #6 0x402a14 in execute_file_commands (file=0xa02c448) at /u/jik/systems-cvs-root/cygwin/make/commands.c:362 #7 0x421abc in remake_file (file=0xa02c448) at /u/jik/systems-cvs-root/cygwin/make/remake.c:1008 #8 0x42096b in update_file_1 (file=0xa02c448, depth=4) at /u/jik/systems-cvs-root/cygwin/make/remake.c:659 #9 0x41ea78 in update_file (file=0xa02c448, depth=4) at /u/jik/systems-cvs-root/cygwin/make/remake.c:309 #10 0x420f9d in check_dep (file=0xa02c448, depth=4, this_mtime=1, must_make_ptr=0x24af29c) at /u/jik/systems-cvs-root/cygwin/make/remake.c:813 #11 0x41fb13 in update_file_1 (file=0xa01e898, depth=3) at /u/jik/systems-cvs-root/cygwin/make/remake.c:450 #12 0x41ea78 in update_file (file=0xa01e898, depth=2) at /u/jik/systems-cvs-root/cygwin/make/remake.c:309 #13 0x420f9d in check_dep (file=0xa01e898, depth=2, this_mtime=1, must_make_ptr=0x24af3dc) at /u/jik/systems-cvs-root/cygwin/make/remake.c:813 #14 0x41fb13 in update_file_1 (file=0xa019080, depth=1) at /u/jik/systems-cvs-root/cygwin/make/remake.c:450 #15 0x41ea78 in update_file (file=0xa019080, depth=0) at /u/jik/systems-cvs-root/cygwin/make/remake.c:309 #16 0x41e55d in update_goal_chain (goals=0xa027ca8, makefiles=0) at /u/jik/systems-cvs-root/cygwin/make/remake.c:156 #17 0x41560c in main (argc=2, argv=0x614f621c, envp=0xa010008) at /u/jik/systems-cvs-root/cygwin/make/main.c:1998 #18 0x61004294 in dll_crt0_1 () at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:770 #19 0x61004565 in _dll_crt0 () at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:837 #20 0x610045a4 in dll_crt0 (uptr=0x0) at /u/jik/cygwin-cvs/src/winsup/cygwin/dcrt0.cc:849 #21 0x42e7d3 in cygwin_crt0 () From what I've observed, I have some theories about what might be going wrong, but I'm unsure of how to proceed at investigating them: 1) Somehow Cygwin is getting confused about a file descriptor and forgets that it's a pipe, so it can't read a token from it. 2) A token is getting "lost" somewhere inside the Cygwin or Windows pipe implementation. 3) If a process writes data into a pipe and then exits before the character is read out of the pipe, is the data lost? If so, then this could explain how tokens are getting lost. I'm probably not going to have any more time to play with this today. and time will be scarce next week too. However, absent any suggestions from anyone on the list, when I have time, I'll test whether (3) is a problem by writing some programs which write data to and from pipes and seeing what happens when one end exits before the other end reads. (2) will be slightly harder to test -- I'll have to write programs which do all kinds of forks and execs and writing to and from pipe file descriptors and see if any of the data written to the pipe gets lost. I'm unsure of how to investigate (1) further. Thoughts? jik