[an error occurred while processing this directive]
Node:<a name="Crash%20dump">Crash dump</a>,
Next:<a rel=next href="faq12_3.html">Debug graphics</a>,
Previous:<a rel=previous href="faq12_1.html">How to debug</a>,
Up:<a rel=up href="faq12.html">Debugging</a>
<br>

<h2>12.2 How to begin debugging using the crash dump info</h2>

<p><em><strong>Q</strong>: My program crashed with SIGSEGV, but I'm unsure how to begin
debugging it<small>...</small>.</em>

<br><p>
<p><em><strong>Q</strong>: Can you help me figure out all those funny numbers printed when
my program crashes?</em>

<br><p>
<p><strong>A</strong>: Debugging should always begin with examining the message printed
when the program crashes.  That message includes crucial information
which usually proves invaluable during debugging.  So the first thing
you should do is carefully save the entire message.  On plain DOS, use
the &lt;PrintScreen&gt; key to get a hard copy of the message.  On
Windows, use the clipboard to copy the message to a text editor or the
Notepad, and save it to a file.  If you can easily reproduce the crash,
try running the program after redirecting the standard error stream,
where the crash dump is printed, to a file, e.g. like this:

<pre>  redir -e crash.txt myprog [arguments to the program go here]
</pre>

<p>(here I used the <code>redir</code> program supplied with DJGPP; the <code>-e</code>
switch tells it to redirect the standard error stream to the named
file). 
Redirecting the standard error stream to a file has an
additional advantage of printing the entire call frame traceback, even
if it is very long, whereas when writing to the screen, the DJGPP exit
code limits the number of printed stack frames so that the crash message
won't scroll off the screen.

<p>After you've saved the crash message, look at the name of the crashed
program, usually printed on the 4th line.  Knowing which program crashed
is important when one program calls another, like if you run a program
from <small>RHIDE</small>.  Without this step, you might erroneously try to debug
the wrong program.  (If the program name is garbled, or if
<code>&lt;??UNKNOWN??&gt;</code> is printed in its stead, it means the program
crashed inside the startup code.)

<p>The next step in the debugging is to find out where in the code did the
program crash.  The <code>SYMIFY</code> program will help you translate the
call frame traceback, which is the last portion of the crash message,
into a list of function names, source files and line numbers which
describe the sequence of function calls that led to the crash.  The
top-most line in the call frame traceback is the place where the program
crashed, the one below it is the place that called the function which
crashed, etc.  The last line will usually be in the startup code, in a
function called <code>__crt1_startup</code>, but if the screen is too small to
print the entire traceback without scrolling, the traceback will be
truncated before it gets to the startup.  See <a href="faq9_3.html">how to use <code>SYMIFY</code></a>, for more details about the
call frame traceback and <code>SYMIFY</code> usage.

<p>If you compiled your program without the <code>-g</code> switch, or if you
stripped the debugging symbols (e.g., using the <code>-s</code> linker
switch), running <code>SYMIFY</code> will just repeat the addresses instead of
translating them to function names and source file info.  You will have
to rebuild the program with <code>-g</code> and without <code>-s</code>, before you
continue.

<p>Next, you need to get an idea about the cause of the crash.  To this
end, look at the first two lines of the crash message.  There you will
find a description of the type of the crash, like this:

<pre> Exiting due to signal SIGSEGV
 Page Fault at eip=00008e89, error=0004
</pre>

<p>(the actual text in your case will be different).  The following table
lists common causes for each type of crash:

<dl>
<dt><code>Page Fault</code>
<dd>This usually means the program tried to access some data via a NULL or
an uninitialized pointer.  A NULL pointer is a pointer which holds an
address that is zero; it can come from a failed call to <code>malloc</code>
(did your code check for that?).  An uninitialized pointer holds some
random garbage value; it can come from a missing call to <code>malloc</code>.

<p>The error code (<code>error=0004</code> in the example above) will usually be
either 4 or 6.  The former means that the program tried to read (take
a value from) the invalid address, the latter means the program tried to
write (change the stored value) there.

<p>Sometimes, you might see a somehwat different format of a <code>Page
Fault</code> message:

<pre> Page Fault cr2=10000000 at eip e75; flags=6
 eax=00000030 ebx=00000000 ecx=0000000c edx=00000000
 esi=0001a44a edi=00000000 ebp=00000000 esp=00002672
 cs=18 ds=38 es=af fs=0 gs=0 ss=20 error=0002
</pre>

<p>This message comes from CWSDPMI, which could happen when some crucial
data structure in the low-level library code becomes trashed.  The value
in <code>cr2</code> is the address which caused the <code>Page Fault</code>
exception.  In the example above, this address is 0x10000000, and since
this is exactly the base address of the DJGPP program under CWSDPMI, it
means the program dereferenced a NULL pointer.

<p>If the message says <code>Page Fault in RMCB</code>, then it usually means
that the program installed an interrupt handler or a real-mode callback
(a.k.a. RMCB), but failed to lock all the memory accessed by the
handler or functions it calls.  See <a href="faq18_11.html">installing hardware interrupt handlers</a>, for more about this.  It also might mean that a program
failed to unhook some interrupt before it exited.

<br><dt><code>General Protection Fault</code>
<dd>This can be caused by a variety of reasons:

<ul>
<li>use of an uninitialized or a garbled pointer (beyond the limit of the DS
segment printed at the time of crash);

<li>attempt to access memory using an invalid selector, e.g., a selector
whose privilege level is zero from a DJGPP program which runs at
privilege level 3.  (The lower 2 bits of a slector are its privilege
level.)

<p>In this case, the GPF will be accompanied by an error code, like this:

<pre> General Protection Fault at eip=000020bc, error=0104
</pre>

<p>The error code is actually the selector that the program tried to use,
in this case 104h; its low 2 bits are 00, so this is a ring-0 selector.

</p><li>overwriting the stack, e.g. by writing (assigning values) to array
elements beyond the array limits, or due to incorrect argument list
passed to a function, like passing a pointer to an <code>int</code> to a
function that expects a pointer to an <code>double</code>, or passing buffers
to a library function without sufficient space to hold the results;

<li>stack overflow. 
</ul>

<p>Overwriting the stack frame can usually be detected by looking at the
values of the <small>EBP</small> and <small>ESP</small> registers, printed right below the
first two lines.  Normally, <small>ESP</small> is <em>slightly smaller</em> than
<small>EBP</small>, smaller than the limit of the <small>SS</small> segment, and usually
<em>larger</em> than <small>EIP</small><a rel=footnote href="faq25.html"><sup>21</sup></a>; anything else is a clear sign of a stack being overrun or
overwritten.  In particular, if <small>ESP</small> is valid, but <small>EBP</small> is not,
it usually means that the stack was overwritten.  In some cases,
<small>EBP</small>'s value might look like a chunk of text, like 0x33313331 (the
string <code>1313</code>, after swapping the bytes due to the fact that x86 is
a little-endian machine).

<p>How do you know whether the values of <small>ESP</small> and <small>EBP</small> are valid? 
To help you, DJGPP v2.02 and later prints the valid limits of the
application stack, like this:

<pre>App stack: [000afb50..0002fb50]  Exceptn stack: [0002fa2c..0002daec]
</pre>

<p>(The second range of values, for the "Exceptn stack", shows the
8KB-long stack used by the library for processing hardware exceptions,
because the normal application stack might be invalid when an exception
happens.)

<p>Another tell-tale sign of an overrun stack frame is that the symified
traceback points to a line where the function returns, or to its closing
brace.  That's because, when a program overruns the stack, the return
address saved there gets overwritten by a random value, and the program
crashes when the offending function tries to return to an invalid
address.

<p>Suspect a stack overflow if the <small>EBP</small> and <small>ESP</small> values are close
to one another, but both very low (the stack grows <em>downwards</em>) and
outside the valid stack limits printed below the registers' dump, or
if the call frame traceback includes many levels, which is a sign of a
deep recursion.

<p>Another sign of a stack overflow is when the traceback points to some
internal library structure, like <code>__djgpp_exception_table</code>, or if
the <small>SS</small> selector is marked as <code>invalid</code> in the crash message.

<p>Stubediting the program to enlarge its stack size might solve problems
with stack overflow (but <strong>not</strong> when the stack is being
overwritten as described above).  See <a href="faq15_9.html">changing stack size</a>, for a description of how
to enlarge the stack.  If you use large automatic arrays, an alternative
to stubediting is to make the array dimensions smaller, or make the
array global, or allocate it at run time using <code>malloc</code>.

<p>Note that, unlike in the cases, described above, where the stack was
overwritten, stack overflow usually manifests itself by <strong>both</strong>
<small>ESP</small> and <small>EBP</small> being invalid (outside the valid limits printed by
the crashed program).

<br><dt><code>Stack Fault</code>
<dd>Usually means a stack overflow, but can also happen if your code
overruns the stack frame (see above).

<br><dt><code>Floating Point exception</code>
<dt><code>Coprocessor overrun</code>
<dt><code>Overflow</code>
<dt><code>Division by Zero</code>
<dd>These (and some additional) messages, printed when the program crashes
due to signal <code>SIGFPE</code>, mean some error in floating-point
computations, like division by zero or overflow.  Sometimes such errors
happen when an <code>int</code> is passed to a function that expects a
<code>float</code> or a <code>double</code>.

<br><dt><code>Cannot continue from exception, exiting due to signal 0123</code>
<br><dt><code>Cannot continue from exception, exiting due to signal SIGSEGV</code>
<dd>This message is printed if your program installed a handler for a fatal
signal such as <code>SIGSEGV</code> (0123 in hex is the numeric code of
<code>SIGSEGV</code>; see the header <code>signal.h</code> for the other codes), and
that handler attempted to return.  This is not allowed, since returning
to the locus of the exception will just trigger the same exception again
and again, so the DJGPP signal-handling machinery aborts the program
after printing this message.

<p>If you indeed wanted <code>SIGSEGV</code> to be generated in that case, the
way to solve such problems is to modify your signal handler so that it
calls either <code>exit</code> or <code>longjmp</code>.  If <code>SIGSEGV</code> should
not have been triggered, debug this as described below.

<br><dt><code>Invalid TSS in RMCB</code>
<dd>This usually means that a program failed to uninstall its interrupt
handler or RMCB when it exited.  If you are using DJGPP v2.0, one case
where this happens is when a nested program exits by calling
<code>abort</code>: v2.0 had a bug in its library whereby calling <code>abort</code>
would bypass the cleanup code that restored the keyboard interrupt
hooked by the DJGPP startup code; v2.01 solves this bug.

<p>Using the <code>itimer</code> facility in v2.01 programs can also cause such
crashes if the program exits abnormally, or doesn't disable the timer
before it exits. The exit code in DJGPP v2.02 and later makes sure the
original timer interrupt is always restored.

<br><dt><code>Double Fault</code>
<dd>If this message appears when you run your program under CWSDPR0 and
press the Interrupt key (<kbd>Ctrl-&lt;C&gt;</kbd> or <kbd>Ctrl-&lt;BREAK&gt;</kbd>)
or the QUIT key (<kbd>Ctrl-&lt;\&gt;</kbd>), then this is expected behavior
(the <code>SIGINT</code> generation works by invalidating the <small>DS/SS</small>
selector, but since CWSDPR0 doesn't switch stacks on exceptions, there's
no place to put the exception frame for the exception this triggers, so
the program double faults and bails out).  Otherwise, treat this as
<code>Page Fault</code>.

<br><dt><code>Control-Break Pressed</code>
<dt><code>Control-C Pressed</code>
<dt><code>INTR key Pressed</code>
<dt><code>QUIT key Pressed</code>
<dd>These are not real crashes, but are listed here for completeness.  They
are printed when <kbd>Ctrl-&lt;BREAK&gt;</kbd> or the Interrupt key (by
default, <kbd>Ctrl-&lt;C&gt;</kbd>) is pressed, which by default abort the
program due to signal <code>SIGINT</code>.  The QUIT key (by default,
<kbd>Ctrl-&lt;\&gt;</kbd>) generates the <code>SIGQUIT</code> signal which by
default is ignored, but some programs set it to abort the program as
well. 
</dl>

<p>If you are lucky, and the crash happened inside your function (as
opposed to some library function), then the above info and the symified
call frame traceback should almost immediately suggest where's the bug. 
You need to analyze the source line corresponding to the top-most
<small>EIP</small> in the call frame traceback, and look for the variable(s) that
could provide one of the reasons listed above.  If you cannot figure it
out by looking at the source code, run the program under a debugger
until it gets to the point of the crash, then examine the variables
involved in the crashed computation, to find those which trigger the
problem.  Finally, use the debugger to find out how did those variables
come to get those buggy values.

<p>People which are less lucky have their programs crash inside library
functions for which <code>SYMIFY</code> will only print their names, since the
libraries are usually compiled without the <code>-g</code> switch.  You have
several possible ways to debug these cases:

<ul>
<li>Begin with the last call frame that <code>SYMIFY</code> succeeded to convert
to a pointer to a line number in a source file.  This line should be a
call to some function in some library you used to link your program. 
Re-read the docs for that function and examine all the arguments you are
passing to it under a debugger, looking for variables that could cause
the particular type of crash you have on your hands, as described above.

<li>Link your program against a version of the library that includes the
debug info.  For example, you can get the sources of the library and
recompile it with the <code>-g</code> compiler switch.  After re-linking the
program, cause it to crash and run <code>SYMIFY</code> to get a full
description of the place where it dies.

<li>A variation of the previous technique is to paste into your program the
source of the library function(s) whose names you see in the symified
traceback, and recompile your program with <code>-g</code> switch.  Then run
your program again, and when it crashes, <code>SYMIFY</code> should be able to
find the line number info for the entire traceback.

<li>If you cannot get hold of the sources for the library, you could still
use assembly-level commands of the debugger to find out the reason for
the crash.  Here's how:

<ul>
<li>Load the program into a debugger.

<li>Display the instruction at <small>EIP</small> whose value is printed on the second
line of the crash message.  For example, with GDB, use the <kbd>x/i
eip-value</kbd> command.

<li>Find out which registers are relevant to the crash.  For example, if the
instruction dereferences a pointer, the register that holds that pointer
is one possible candidate for scrutiny.

<li>Look at the values for the relevant registers as printed in the crash
message, and find the register(s) which hold abnormal values.  Common
cases include:

<ol type=a start=1>
<li>A pointer whose value is below 4096 (1000 hex), or above the limit of
the <small>DS</small> segment.  (The limit is printed as part of the crash
message.)

<li>An index of an array element or an offset into a struct whose value is
negative, or beyond the last array element, or more than the offset of
the last struct member.

<li>A linear address of a buffer in conventional memory whose value is more
than 10FFFFh, or an offset into the transfer buffer which is larger than
the transfer buffer size (16K by default).
</ol>

<li>Once you've found the register whose value is abnormal, find out what
variables in your program caused that abnormal value, e.g. by stepping
through the machine code from the point where your code called the
library function. 
</ul>
</ul>

<p><hr>
[an error occurred while processing this directive]