Date: Sun, 2 Aug 1998 15:53:18 +0300 (IDT)
From: Eli Zaretskii <eliz AT is DOT elta DOT co DOT il>
To: "Gurunandan R. Bhat" <grbhat AT unigoa DOT ernet DOT in>
cc: djgpp AT delorie DOT com
Subject: Re: fsdb crashes post emacs
In-Reply-To: <Pine.LNX.3.91.980729114705.739A-100000@aditya.unigoa.ernet.in>
Message-ID: <Pine.SUN.3.91.980802155243.11599B-100000@is>
MIME-Version: 1.0

On Wed, 29 Jul 1998, Gurunandan R. Bhat wrote:

> 	I have a problem in running fsdb immediately after exiting from
> emacs (19.34 latest binaries from simtel) and *only* after exiting from
> emacs. I have a plain DOS 6.22 machine (comspec=4dos.com). fsdb myprog.exe
> crashes even before the fsdb screen comes up with the following traceback: 
> 
> -----------------------------------------------------------
> Cannot continue from exception, exiting due to signal 0123
> General Protection Fault at eip=00012360
> eax=0000007c ebx=00000004 ecx=fffffffc edx=06a2fcef esi=00000048 edi=00069270
> ebp=0006920c esp=00069204 program=D:\DJGPP\BIN\FSDB.EXE
> cs: sel=00a7  base=10000000  limit=0009ffff
> ds: sel=00af  base=10000000  limit=0009ffff
> es: sel=00af  base=10000000  limit=0009ffff
> fs: sel=00cf  base=00016300  limit=000040ff
> gs: sel=00bf  base=00000000  limit=ffffffff
> ss: sel=00af  base=10000000  limit=0009ffff
>  
> Call frame traceback EIPs:
>   0x00012360   0x12360
>   0x00011e4d   0x11e4d
>   0x00007569   0x7569
>   0x00008126   0x8126
>   0x0000a348   0xa348
>   0x0000186f   0x186f
>   0x00011862   0x11862

This is a longish message, since I decided to use this crash as a case
study to explain how to approach such problems.

Here's the short summary: these crashes seem to be caused either by a
bug in the library function `malloc' or by a bug inside FSDB itself.
More debugging, which can only be done on a machine where the crashes
actually happen, is required to find the actual cause; specific
recommendations on how to continue debugging are at the end of this
message.

The rest of this message is a detailed explanation of how to debug
such problems.  Many people think this is some kind of black art
which is unavailable to the uninitiated.  I will try to demystify
this by demonstrating that there's no black art here: it only takes a
couple of tools, and a determined person who isn't afraid of mundane
work.  Hopefully, this will be useful to somebody in the future.

First, honor is due to Gurunandan R. Bhat for posting a complete and
accurate report about the crash.  Using the posted info, it was much
easier to investigate the problem.

We begin with the crash dump.  It's a bit unusual, so I will first
explain the non-standard parts.

> exiting due to signal 0123

0123 in hex is 291 decimal.  Looking inside the header <signal.h>, we
will find this line:

     #define SIGSEGV	291

So this program apparently died because of SIGSEGV.
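
If you want to double-check that number on your own system, a
throwaway program will do (just a verification sketch of mine,
nothing from FSDB):

  #include <stdio.h>
  #include <signal.h>

  int main (void)
  {
    /* DJGPP defines SIGSEGV as 291, which is 0x123.  */
    printf ("SIGSEGV = %d (hex %x)\n", SIGSEGV, SIGSEGV);
    return 0;
  }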

> Cannot continue from exception, exiting due to signal 0123

This message comes from the DJGPP signal-handling code.  It means that
the program installed a handler for SIGSEGV, the handler was called
when disaster struck, and it (the handler) attempted to return.
This isn't allowed (since returning to the locus of the exception
would just trigger the same exception again), so DJGPP aborts the
program.  This is also the reason for 0123 being printed instead of
the usual SIGSEGV: the message is printed by code that is different
from the usual one, since most programs don't install exception
handlers which return.
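
To see this mechanism in action, here's a minimal program (an
illustrative sketch, not anything from FSDB) whose handler returns
into the faulting instruction; under DJGPP it should die with the
same "Cannot continue from exception" message:

  #include <signal.h>

  static void handler (int sig)
  {
    /* Returning from here is what triggers the "Cannot continue
       from exception" message.  */
  }

  int main (void)
  {
    signal (SIGSEGV, handler);
    *(volatile int *) 0xf0000000 = 1;	/* write far beyond the segment limit */
    return 0;
  }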

Why did the program install a handler for SIGSEGV?  Because FSDB, the
program that crashed, is a debugger.  DJGPP debuggers typically want
to catch exceptions in the debugged program, so they install handlers
for these exceptions.  This crash happens *after* FSDB already
installed its handlers, but *before* it got an opportunity to run the
debuggee.  So it ends up catching its own exception.

At this point, I would usually run the offending program and try to
locate the bug.  But in this case, I couldn't reproduce the problem
on my machine.  So we are left with the posted information, and with
its help we must find the exact place and cause of the crash.

The plan is to look at the machine code, find the instruction that
triggered the crash, and then correlate it with the sources of the
program and try to guess what might have been the reason(s) for this
calamity.

> General Protection Fault at eip=00012360

So we know that it crashed at EIP = 0x12360.  But how do we peek at
that address to display machine instructions nearby?  A debugger such
as GDB will help.  Here:

  C:\> gdb c:/djgpp/bin/fsdb.exe
  [GDB's usual blurb omitted]
  (gdb) x/i 0x12360
  0x12360:	movl (%edx),%eax

The GDB command `x' displays the contents of a memory address, and `/i'
tells it to display it as an instruction.

This instruction is a pointer dereference: a 32-bit value whose
address is in EDX is loaded into EAX.  An instruction as simple as
that can only crash if EDX is zero, i.e. if it is a NULL pointer
dereference, or if EDX is garbled.  Now, what was the register dump
saying?

> edx=06a2fcef

Aha!  This is indeed garbled: it is way beyond the limit of the DS
segment:

> ds: sel=00af  base=10000000  limit=0009ffff

So somebody dereferenced a garbled pointer.  But where in the source
is that?  That's where the other, symified traceback will help us (did
I already say kudos to Gurunandan R. Bhat for providing all that?):

>   0x00013748    _malloc+192

It seems that the crash is inside `malloc'.  But since the EIP here is
quite different from the original 0x00012360, let's make sure that the
original crash was also inside `malloc' (otherwise we will be off on
a wild-goose chase).  The easiest way to do that is to disassemble a
program (any program) compiled with -g.  Observe:

  C:\> gcc -g -o hello hello.c
  C:\> gdb hello
  (gdb) break malloc
  Breakpoint 1 at 0x288d
  (gdb) disassemble 0x288d
  Dump of assembler code for function malloc:
  0x2884 <malloc>:	pushl %ebp
  0x2885 <malloc+1>:	movl  %esp,%ebp
  ....
  End of assembler dump
  (gdb)

As an aside: why did I use the `break' command?  Because
`disassemble' needs an address inside the function, and `break'
prints the address of the function's first executable source line.
And how did I know that `malloc' will be present even in the trivial
"Hello, world" program?  Well, that's an insider's knowledge of the
startup code: it calls `malloc' a few times, and so every DJGPP
program has `malloc' linked in.  But even if we needed to disassemble
another function, we could just arrange for the test program to call
it.
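
For the record, hello.c is nothing special; the canonical test
program is all we need here:

  #include <stdio.h>

  int main (void)
  {
    printf ("Hello, world\n");
    return 0;
  }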

Back to our whodunit.  So `malloc' starts at 0x2884 in this program.
Our crash point is at offset 192 (decimal) from the beginning, i.e.
at 0x2884 + 192 = 0x2944.  Let's look there:

  .....
  0x2940 <malloc+188>:	xorl  %eax,%eax
  0x2942 <malloc+190>:	jmp   0x2956 <malloc+210>
  0x2944 <malloc+192>:	movl  (%edx),%eax
  .....

Bullseye!  Here's our friend "movl (%edx),%eax" again.  For still more
certainty, we'll look in the vicinity of this instruction inside FSDB.
This is somewhat difficult without a disassembled listing like the one
above: the `disassemble' command doesn't work on stripped programs
which lack debugging symbols, so we need to use the `x' command and
compute the address for it ourselves, which requires knowing the
length of each instruction.  But armed with the listing above, it's
easy to see the size of every instruction: it is the difference
between its address and the address of the next instruction.  For
example, the "xorl" at <malloc+188> is 2 bytes long, since the next
instruction is at <malloc+190>.  So we can now do this:

  C:\> gdb c:/djgpp/bin/fsdb.exe
  (gdb) x/i 0x12360
  0x12360:	movl (%edx),%eax

Okay, already been there.  Now let's display the previous instruction,
whose address is 2 bytes before (see the disassembled `malloc' above):

  (gdb) x/i 0x1235e
  0x1235e:	jmp  0x12372

and the one before that:

  (gdb) x/i 0x1235c
  0x1235c:	xorl %eax,%eax
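
By the way, once you know where the earliest instruction starts, GDB
can dump all three at one stroke, using a repeat count before the `i'
format:

  (gdb) x/3i 0x1235c
  0x1235c:	xorl %eax,%eax
  0x1235e:	jmp  0x12372
  0x12360:	movl (%edx),%eax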

These are exactly the instructions from the disassembled listing, give
or take the address where the `jmp' goes, which is different because
absolute addresses differ between programs.  We can now be sure that
the crash indeed happened inside `malloc'.  (Actually, I looked at a
few more instructions: better safe than sorry.)

On to the last part of the plan: we now want to find out where inside
`malloc' it crashed, and why.  This requires correlating the source
of `malloc' with its disassembled listing.  One way to do that is to
compile the source of `malloc' with the `-S' option and look at the
produced assembly.  However, a better way is to walk through the
assembly listing with the source at hand, and correlate them by
``decompiling'' the assembly in your head, since "gcc -S" might
produce a different listing due to different optimization switches (to
know what switches were used to compile the library, you would need
to wade through the library build procedure, which isn't easy).
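
If you do want to try the `-S' route, the invocation is simply
something like

  C:\> gcc -O2 -S malloc.c

where the -O2 is merely a guess at the optimization level the library
was built with; guessing wrong is precisely the problem.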

Anyway, I will save you some trouble.  Here's the offending code from
the `malloc' source, with the crash spot marked:

  /*
   * If nothing in hash bucket right now,
   * request more memory from the system.
   */
  if ((op = nextf[bucket]) == NULL) {
    morecore(bucket);
    if ((op = nextf[bucket]) == NULL)
      return (NULL);
  }
  /* remove from linked list */
  nextf[bucket] = op->ov_next;  <<<<<< it crashes here
  op->ov_magic = MAGIC;
  op->ov_index = bucket;

The crashed instruction, "movl (%edx),%eax", is the translation of
"op->ov_next".  GCC translates it into a single pointer dereference
because the `ov_next' member is the first member of the structure:

  union overhead {
    union overhead *ov_next;	/* when free */
    struct {
      unsigned char	ovu_magic; /* magic number */
      unsigned char	ovu_index; /* bucket # */
    } ovu;
  };

Since the offset of the first member is always zero, GCC doesn't need
to add any offset to the structure's address before dereferencing it.
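
You can convince yourself of that with another throwaway test (again
a sketch of mine, not code from FSDB or the library):

  #include <stdio.h>
  #include <stddef.h>

  union overhead {
    union overhead *ov_next;
    struct {
      unsigned char ovu_magic;
      unsigned char ovu_index;
    } ovu;
  };

  int main (void)
  {
    /* Prints 0: the first member always lives at offset zero.  */
    printf ("%lu\n", (unsigned long) offsetof (union overhead, ov_next));
    return 0;
  }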

The garbled pointer in EDX is therefore the variable `op', and its
value is computed a few lines above:

  if ((op = nextf[bucket]) == NULL) {

This line translates into the following 3 instructions:

  0x2924 <malloc+160>:	movl  0xa140(,%ebx,4),%edx
  0x292b <malloc+167>:	testl %edx,%edx
  0x292d <malloc+169>:	jne   0x2944 <malloc+192>

The first of these instructions uses the index in EBX (which
apparently holds the value of `bucket') to access the array nextf[],
whose address happens to be 0xa140; the `,%ebx,4' part scales the
index by 4, the size of a pointer.  Maybe the index is itself
garbled?  Back to the register dump:

> ebx=00000004

Nope, `bucket' has a reasonable value of 4.  To be sure, let's look
at the FSDB sources and see whether 4 is okay for this call to
`malloc'.  Here's the relevant part of the stack dump:

>  0x00013748   _malloc+192
>  0x00013229   _xmalloc+21
>  0x000080a5   _initdisplay+133, line 2147 of fullscr.c

Line 2147 of fullscr.c says this:

  code_pane_pos = xmalloc ((toplines + 2) * sizeof (word32));

where `toplines' is computed like this (inside `init_screen', lines
308 and 321 of screen.c):

  rows = ScreenRows ();
  toplines = (rows / 2) + 4;

Assuming the standard 25-line display, `toplines' is 16 and `xmalloc'
will be called with the argument (16 + 2) * 4 = 72.  I will spare you
the details, but I actually went through the `malloc' code to verify
that, when called with an argument of 72, it ends up with `bucket'
equal to 4.  So there's nothing wrong here.
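
For the curious, here is the gist of that verification (a simplified
sketch of what a BSD-style `malloc' does, not the verbatim library
code; I'm assuming an 8-byte overhead header and 8-byte blocks in
bucket 0):

  #include <stddef.h>

  static int
  size_to_bucket (size_t nbytes)
  {
    size_t amt = 8;		/* block size of bucket 0 */
    int bucket = 0;

    nbytes += 8;		/* room for the overhead header */
    while (nbytes > amt)
      {
	amt <<= 1;
	bucket++;
      }
    return bucket;
  }

With nbytes = 72 this returns 4, matching the value we saw in EBX.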

There is only one alternative left: somebody garbles nextf[4].  It
can be either some bug in `malloc' whereby a garbled value is put into
nextf[4], and `malloc' later tries to treat that like a valid pointer;
or FSDB itself garbles `malloc's data structures, either in one of the
previous calls to `free' or `realloc', or in some other code which
inadvertently runs amok.

To find out which is the actual cause, we need access to a machine
where this problem actually happens.  We need to run FSDB either under
a debugger, or with debugging print-outs, and find out who clobbers
the nextf[] array, and why.  Since debugging a program which is itself
a debugger is tricky, I would suggest adding calls to `cprintf' inside
`malloc' and friends, in the places where they access nextf[], to
find out what code clobbers it.  The reason for this bug should be
pretty easy to find once we know that.
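
Something along these lines (the exact placement is hypothetical and
depends on the library sources you have):

  #include <conio.h>

  /* Just before the "nextf[bucket] = op->ov_next;" line: */
  cprintf ("malloc: bucket=%d  nextf[bucket]=%lx\r\n",
	   bucket, (unsigned long) nextf[bucket]);

`cprintf' writes directly to the screen, so it is safe to call from
inside `malloc' itself, where `printf' might recurse into the
allocator.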

And this concludes today's lesson.  We haven't found the villain yet,
but we've gotten pretty close, which is good enough for a debugging
session which couldn't even reproduce the problem.
