Bug 653311 - Switch jprof from hand-rolled stackwalk code to glibc's backtrace() to work with modern x86 Linuxes, improve jprof output, update README - r=jim_nance (per bz)
author Randell Jesup <rjesup@wgate.com>
Sun, 15 May 2011 05:47:48 -0400
changeset 69590 9968ed6b629a1b2c937e0cb1053a56c91f96c2d2
parent 69589 0a1e7ec7e2684d21d19a017663c373b0518076c4
child 69591 f717485edc5188c4991a1ba010a7ad4c3e3ee83e
push id 20035
push user rjesup@wgate.com
push date Tue, 17 May 2011 04:10:07 +0000
reviewers jim_nance
bugs 653311
milestone 6.0a1
tools/jprof/README.html
tools/jprof/leaky.cpp
tools/jprof/leaky.h
tools/jprof/stub/Makefile.in
tools/jprof/stub/libmalloc.cpp
tools/jprof/stub/libmalloc.h
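
The central change in this patch is that CrawlStack() now calls glibc's backtrace() instead of walking frame pointers by hand. The following is a minimal sketch of that technique, not the patch's own code: the helper name captureStack, its parameters, and the 128-entry scratch buffer are hypothetical and chosen only for illustration. It captures return addresses into a fixed array, drops a caller-chosen number of leading frames (the patch similarly drops the profiler's own frames), and bounds the copy by the destination capacity.

```cpp
#include <execinfo.h>  // backtrace(): glibc-specific
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of jprof: copy up to outCap return addresses
// of the calling thread into out[], skipping the first `skip` entries
// reported by backtrace(). Returns the number of entries written.
std::size_t captureStack(void** out, std::size_t outCap, std::size_t skip)
{
    assert(out != nullptr || outCap == 0);

    void* raw[128];  // fixed scratch buffer, so no heap allocation here
    int cnt = backtrace(raw, static_cast<int>(sizeof(raw) / sizeof(raw[0])));
    if (cnt < 0) {
        cnt = 0;
    }

    std::size_t n = 0;
    for (std::size_t i = skip;
         i < static_cast<std::size_t>(cnt) && n < outCap;
         ++i) {
        out[n++] = raw[i];  // bounded by the caller's capacity
    }
    return n;
}
```

A caller would typically pass a skip of at least 1 to drop the helper's own frame, for example `void* pcs[32]; std::size_t n = captureStack(pcs, 32, 1);`. As the README change in this patch notes, the profiled code may need to be built with `-fno-omit-frame-pointer` for the captured stacks to be useful.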
--- a/tools/jprof/README.html
+++ b/tools/jprof/README.html
@@ -1,17 +1,18 @@
 <html>
 <head><title>The Jprof Profiler</title></head>
 
 <body bgcolor="#FFFFFF" text="#000000"
       link="#0000EE" vlink="#551A8B" alink="#FF0000">
 <center>
 <h1>The Jprof Profiler</h1>
 <font size="-1">
-<a href="mailto:jim_nance%yahoo.com">jim_nance@yahoo.com</a>
+<a href="mailto:jim_nance%yahoo.com">jim_nance@yahoo.com</a><p>
+Recent (4/2011) updates by Randell Jesup (see bugzilla for contact info)
 </font>
 <hr>
 
 <a href="#introduction">Introduction</a> | <a href="#operation">Operation</a> |
 <a href="#setup">Setup</a> | <a href="#usage">Usage</a> |
 <a href="#interpretation">Interpretation</a>
 
 </center>
@@ -41,38 +42,53 @@ default pull.  To do this do:
   cvs co mozilla/tools/jprof
 </pre>
 
 <p>Next, configure your mozilla with jprof support by adding
 <code>--enable-jprof</code> to your configure options (eg adding
 <code>ac_add_options --enable-jprof</code> to your <code>.mozconfig</code>) and
 making sure that you do <strong>not</strong> have the
 <code>--enable-strip</code> configure option set -- jprof needs symbols to
-operate.</p>
+operate.  On many architectures with GCC, you'll need to add
+<code>--enable-optimize="-O3 -fno-omit-frame-pointer"</code> or the
+equivalent to ensure frame pointer generation in the compiler you're using.</p>
 
 <p>Finally, build mozilla with your new configuration.  Now you can run jprof.</p>
 
 <h3><a name="usage">Usage</a></h3>
-
+Options:
+<ul>
+  <li><b>-s depth</b> : Limit depth looked at from captured stack
+      frames</li>
+  <li><b>-v</b> : Output some information about the symbols, memory map, etc.</li>
+  <li><b>-t</b> : Group output according to thread.  Requires external
+      LD_PRELOAD library to help force sampling of spawned threads; jprof
+       normally captures the main thread only.  See <a
+       href="http://sam.zoy.org/writings/programming/gprof.html">gprof-helper</a>;
+       it may need adaptation for jprof.</li>
+  <li><b>-e exclusion</b> : Allows excluding specific stack frames</li>
+  <li><b>-i inclusion</b> : Allows including specific stack frames</li>
+</ul>
 The behavior of jprof is determined by the value of the JPROF_FLAGS environment
 variable.  This environment variable can be composed of several substrings
 which have the following meanings:
 <ul>
     <li> <b>JP_START</b> : Install the signal handler, and start sending the
     timer signals.
     
     <li> <b>JP_DEFER</b> : Install the signal handler, but don't start sending
     the timer signals.  The user must start the signals by sending the first
     one (with <code>kill -PROF</code>, or with <code>kill -ALRM</code> if
     JP_REALTIME is used, or with <code>kill -POLL</code> (also known as <code>kill -IO</code>) if JP_RTC_HZ is used).
 
     <li> <b>JP_FIRST=x</b> : Wait x seconds before starting the timer
 
     <li> <b>JP_PERIOD=y</b> : Set timer to interrupt every y seconds.  Only
-    values of y strictly greater than 0.001 are supported.
+    values of y greater than or equal to 0.001 are supported.  Default is
+    0.050 (50ms).
     
     <li> <b>JP_REALTIME</b> : Do the profiling in intervals of real time rather
     than intervals of time used by the mozilla process (and the kernel
     when doing work for mozilla).  This could probably lead to weird
     results (you'll see whatever runs when mozilla is waiting for events),
     but is needed to see time spent in the X server.
 
     <li> <b>JP_RTC_HZ=freq</b> : This option, only available on Linux if the
@@ -98,20 +114,20 @@ being profiled
 <h4>Examples of JPROF_FLAGS usage</h4>
 <ul>
 
   <li>To make the timer start firing 3 seconds after the program is started and
   fire every 25 milliseconds of program time use:
     <pre>
         setenv JPROF_FLAGS "JP_START JP_FIRST=3 JP_PERIOD=0.025" </pre>
 
-  <li>To make the timer start on your signal and fire every 1.5 milliseconds of
+  <li>To make the timer start on your signal and fire every 1 millisecond of
   program time use:  
     <pre>
-        setenv JPROF_FLAGS "JP_DEFER JP_PERIOD=0.0015" </pre>
+        setenv JPROF_FLAGS "JP_DEFER JP_PERIOD=0.001" </pre>
 
   <li>To make the timer start on your signal and fire every 10 milliseconds of
   wall-clock time use:  
     <pre>
         setenv JPROF_FLAGS "JP_DEFER JP_PERIOD=0.010 JP_REALTIME" </pre>
 
   <li>To make the timer start on your signal and fire at 8192 Hz in wall-clock
   time use:
@@ -184,87 +200,92 @@ hierarchical profile, which is described
 
 <h4><a name="hier">Hierarchical output</a></h4>
 
 The hierarchical output is divided up into sections, with each section
 corresponding to one function.  A typical section looks something like
 this:
 
 <blockquote><pre>
-             <A href="#29355">141300 PL_ProcessPendingEvents</A>
-             <A href="#29372">   927 PL_ProcessEventsBeforeID</A>
- 29358   0 <a name=29358>  142227</a> <b>PL_HandleEvent</b>
-             <A href="#28546"> 92394 nsInputStreamReadyEvent::EventHandler(PLEvent*)</A>
-             <A href="#41572"> 49181 HandlePLEvent(ReflowEvent*)</A>
-             <A href="#29537">   481 handleTimerEvent(TimerEventType*)</A>
-             <A href="#34494">   158 nsTransportStatusEvent::HandleEvent(PLEvent*)</A>
-             <A href="#29359">     9 PL_DestroyEvent</A>
-
-             <A href="#20319">     4 __restore_rt</A>
+ index  Count         Hits      Function Name
+                      <A href="#72871">     545 (46.4%) nsBlockFrame::ReflowInlineFrames(nsBlockReflowState&, nsLineList_iterator, int*)</A>
+                      <A href="#72873">     100 (8.5%)  nsBlockFrame::ReflowDirtyLines(nsBlockReflowState&)</A>
+ 72870      4 (0.3%)  <a name=72870>     645 (54.9%)</a> <b>nsBlockFrame::DoReflowInlineFrames(nsBlockReflowState&, nsLineLayout&, nsLineList_iterator, nsFlowAreaRect&, int&, nsFloatManager::SavedState*, int*, LineReflowStatus*, int)</b>
+                      <A href="#72821">     545 (46.4%) nsBlockFrame::ReflowInlineFrame(nsBlockReflowState&, nsLineLayout&, nsLineList_iterator, nsIFrame*, LineReflowStatus*)</A>
+                      <A href="#72853">      83 (7.1%)  nsBlockFrame::PlaceLine(nsBlockReflowState&, nsLineLayout&, nsLineList_iterator, nsFloatManager::SavedState*, nsRect&, int&, int*)</A>
+                      <A href="#74150">       9 (0.8%)  nsLineLayout::BeginLineReflow(int, int, int, int, int, int)</A>
+                      <A href="#74897">       1 (0.1%)  nsTextFrame::GetType() const</A>
+                      <A href="#74131">       1 (0.1%)  nsLineLayout::RelativePositionFrames(nsOverflowAreas&)</A>
+                      <A href="#58320">       1 (0.1%)  __i686.get_pc_thunk.bx</A>
+                      <A href="#53077">       1 (0.1%)  PL_ArenaAllocate</A>
 </pre></blockquote>
 
 The information this block tells us is:
 
 <ul>
-<li>There were 0 profiler hits <em>in</em> <code>PL_HandleEvent</code>
-<li>There were 142227 profiler hits <em>under</em> <code>PL_HandleEvent</code>.  Of these:
+<li>There were 4 profiler hits <em>in</em> <code>nsBlockFrame::DoReflowInlineFrames</code>
+<li>There were 645 profiler hits <em>in or under</em> <code>nsBlockFrame::DoReflowInlineFrames</code>.  Of these:
 <ul>
-  <li>92394 were in or under <code>nsInputStreamReadyEvent::EventHandler</code>
-  <li>49181 were in or under <code>HandlePLEvent(ReflowEvent*)</code>
-  <li>481 were in or under <code>handleTimerEvent</code>
-  <li>158 were in or under <code>nsTransportStatusEvent::HandleEvent</code>
-  <li>9 were in or under <code>PL_DestroyEvent</code>
-  <li>4 were in or under <code>__restore_rt</code>
+  <li>545 were in or under <code>nsBlockFrame::ReflowInlineFrame</code>
+  <li>83 were in or under <code>nsBlockFrame::PlaceLine</code>
+  <li>9 were in or under <code>nsLineLayout::BeginLineReflow</code>
+  <li>1 was in or under <code>nsTextFrame::GetType</code>
+  <li>1 was in or under <code>nsLineLayout::RelativePositionFrames</code>
+  <li>1 was in or under <code>__i686.get_pc_thunk.bx</code>
+  <li>1 was in or under <code>PL_ArenaAllocate</code>
 </ul>
-<li>Of these 142227 calls into <code>PL_HandleEvent</code>:
+<li>Of these 645 calls into <code>nsBlockFrame::DoReflowInlineFrames</code>:
 <ul>
-  <li>141300 came from <code>PL_ProcessPendingEvents</code>
-  <li>927 came from <code>PL_ProcessEventsBeforeID</code>
+  <li>545 came from <code>nsBlockFrame::ReflowInlineFrames</code>
+  <li>100 came from <code>nsBlockFrame::ReflowDirtyLines</code>
 </ul>
 </ul>
 
 
 The rest of this section explains how to read this information off from the jprof output.
 
-<p>This block corresponds to the function <code>PL_HandleEvent</code>, which is
+<p>This block corresponds to the function <code>nsBlockFrame::DoReflowInlineFrames</code>, which is
 therefore bolded and not a link.  The name of this function is preceded by
-three numbers which have the following meaning.  The number on the left (29358)
-is the index number, and is not important.  The center number (0) is the number
-of times this function was interrupted by the timer.  The last number (142227)
-is the number of times this function was in the call stack when the timer went
+five numbers which have the following meaning.  The number on the left (72870)
+is the index number, and is not important.  The next number (4) and the
+percentage following (0.3%) are the number
+of times this function was interrupted by the timer and the percentage of
+the total hits that is.  The last number pair ("645 (54.9%)")
+are the number of times this function was in the call stack when the timer went
 off.  That is, the timer went off while we were in code that was ultimately
-called from <code>PL_HandleEvent</code>.
+called from <code>nsBlockFrame::DoReflowInlineFrames</code>.
 <p>For our example we can see that our function was in the call stack for
-142227 interrupt ticks, but we were never the function that was running when
-the interrupt arrived.
+645 interrupt ticks, but we were only the function that was running when
+the interrupt arrived 4 times.
 <P>
-The functions listed above the line for <code>PL_HandleEvent</code> are its
+The functions listed above the line for <code>nsBlockFrame::DoReflowInlineFrames</code> are its
 callers.  The numbers to the left of these function names are the numbers of
 times these functions were in the call stack as callers of
-<code>PL_HandleEvent</code>.  In our example, we were called 927 times by
-<code>PL_ProcessEventsBeforeID</code> and 141300 times by
-<code>PL_ProcessPendingEvents</code>.
+<code>nsBlockFrame::DoReflowInlineFrames</code>.  In our example, we were called 545 times by
+<code>nsBlockFrame::ReflowInlineFrames</code> and 100 times by
+<code>nsBlockFrame::ReflowDirtyLines</code>.
 <P>
-The functions listed below the line for <code>PL_HandleEvent</code> are its
+The functions listed below the line for <code>nsBlockFrame::DoReflowInlineFrames</code> are its
 callees.  The numbers to the left of the function names are the numbers of
-times these functions were in the callstack as callees of <code>PL_HandleEvent</code>. In our example, of the 142227 profiler hits under <code>PL_HandleEvent</code> 92394 were under <code>nsInputStreamReadyEvent::EventHandler</code>, 49181 were under <code>HandlePLEvent(ReflowEvent*)</code>, and so forth.
+times these functions were in the callstack as callees of
+<code>nsBlockFrame::DoReflowInlineFrames</code> and the corresponding percentages. In our example, of the 645 profiler hits under <code>nsBlockFrame::DoReflowInlineFrames</code> 545 were under <code>nsBlockFrame::ReflowInlineFrame</code>, 83 were under <code>nsBlockFrame::PlaceLine</code>, and so forth.<p>
+
+<b>NOTE:</b> If there are loops of execution or recursion, the numbers will
+not add up and percentages can exceed 100%.  If a function directly calls
+itself "(self)" will be appended to the line, but indirect recursion will
+not be marked.
 
 <h3>Bugs</h3>
-Jprof has only been tested under Red Hat Linux 6.0, 6.1, and 6.2.  It does
-not work under 6.0, though it is possible hack up the source code and make
-it work there.  The way I determine the stack trace from inside the
-signal handler is tightly bound to the version of glibc that is running.
-If you know of a more portable way to get this information please let
-me know.
-
-<h3>Update</h3>
+The current build of Jprof has only been tested under Ubuntu 8.04 LTS, but
+should work under any fairly modern Linux distribution using GCC/GLIBC.
+Please update this document with any known compatibilities/incompatibilities.
+<p>
+If you get an error:<p><code>Inconsistency detected by ld.so: dl-open.c: 260: dl_open_worker: Assertion `_dl_debug_initialize (0, args->nsid)->r_state == RT_CONSISTENT' failed!
+</code><p>that means you've hit a timing hole in the version of glibc you're
+running.  See <a
+href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=4578">Redhat bug 4578</a>.
+<!-- <h3>Update</h3>
 <ul>
-  <li>Ben Bucksch reports that installing the Red Hat 6.1 glibc rpms on a Red Hat
-6.0 system allows jprof to work, and does not seem to break anything except
-gdm (the Gnome login program), and that can be fixed by installing the RH 6.1
-gdb rpm.</li>
-  <li>David Baron reports that jprof works under RedHat 6.0 if one uncomments
-the <code>#define JPROF_PTHREAD_HACK</code> near the beginning of
-<code>libmalloc.cpp</code>.</li>
 </ul>
+-->
 
 </body>
 </html>
--- a/tools/jprof/leaky.cpp
+++ b/tools/jprof/leaky.cpp
@@ -1,8 +1,9 @@
+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
 /* ***** BEGIN LICENSE BLOCK *****
  * Version: MPL 1.1/GPL 2.0/LGPL 2.1
  *
  * The contents of this file are subject to the Mozilla Public License Version
  * 1.1 (the "License"); you may not use this file except in compliance with
  * the License. You may obtain a copy of the License at
  * http://www.mozilla.org/MPL/
  *
@@ -123,16 +124,17 @@ htmlify(const char *in)
 leaky::leaky()
 {
   applicationName = NULL;
   logFile = NULL;
   progFile = NULL;
 
   quiet = TRUE;
   showAddress = FALSE;
+  showThreads = FALSE;
   stackDepth = 100000;
 
   mappedLogFile = -1;
   firstLogEntry = lastLogEntry = 0;
 
   sfd = -1;
   externalSymbols = 0;
   usefulSymbols = 0;
@@ -144,52 +146,54 @@ leaky::leaky()
 }
 
 leaky::~leaky()
 {
 }
 
 void leaky::usageError()
 {
-  fprintf(stderr, "Usage: %s prog log\n", (char*) applicationName);
+  fprintf(stderr, "Usage: %s [-v][-t] [-e exclude] [-i include] [-s stackdepth] prog log\n", (char*) applicationName);
+  fprintf(stderr, "\t-v: verbose\n\t-t: split threads\n");
   exit(-1);
 }
 
 void leaky::initialize(int argc, char** argv)
 {
   applicationName = argv[0];
   applicationName = strrchr(applicationName, '/');
   if (!applicationName) {
     applicationName = argv[0];
   } else {
     applicationName++;
   }
 
   int arg;
   int errflg = 0;
-  while ((arg = getopt(argc, argv, "adEe:gh:i:r:Rs:tqx")) != -1) {
+  while ((arg = getopt(argc, argv, "adEe:gh:i:r:Rs:tqvx")) != -1) {
     switch (arg) {
       case '?':
+      default:
 	errflg++;
 	break;
       case 'a':
 	break;
-      case 'A':
+      case 'A': // not implemented
 	showAddress = TRUE;
 	break;
       case 'd':
 	break;
       case 'R':
 	break;
       case 'e':
 	exclusions.add(optarg);
 	break;
       case 'g':
 	break;
-      case 'r':
+      case 'r': // not implemented
 	roots.add(optarg);
 	if (!includes.IsEmpty()) {
 	  errflg++;
 	}
 	break;
       case 'i':
 	includes.add(optarg);
 	if (!roots.IsEmpty()) {
@@ -202,17 +206,22 @@ void leaky::initialize(int argc, char** 
 	stackDepth = atoi(optarg);
 	if (stackDepth < 2) {
 	  stackDepth = 2;
 	}
 	break;
       case 'x':
 	break;
       case 'q':
-	quiet = TRUE;
+        break;
+      case 'v':
+        quiet = !quiet;
+        break;
+      case 't':
+        showThreads = TRUE;
 	break;
     }
   }
   if (errflg || ((argc - optind) < 2)) {
     usageError();
   }
   progFile = argv[optind++];
   logFile = argv[optind];
@@ -260,31 +269,89 @@ void leaky::LoadMap()
     lme->next = loadMap;
     loadMap = lme;
   }
   close(fd);
 }
 
 void leaky::open()
 {
+  int threadArray[100]; // should auto-expand
+  int last_thread = -1;
+  int numThreads=0;
+
   LoadMap();
 
   setupSymbols(progFile);
 
   // open up the log file
   mappedLogFile = ::open(logFile, O_RDONLY);
   if (mappedLogFile < 0) {
     perror("open");
     exit(-1);
   }
   off_t size;
   firstLogEntry = (malloc_log_entry*) mapFile(mappedLogFile, PROT_READ, &size);
   lastLogEntry = (malloc_log_entry*)((char*)firstLogEntry + size);
 
-  analyze();
+  fprintf(stdout,"<html><head><title>Jprof Profile Report</title></head><body>\n");
+  fprintf(stdout,"<h1><center>Jprof Profile Report</center></h1>\n");
+
+  if (showThreads)
+  {
+    // Find all the threads captured
+
+    // pthread/linux docs say the signal can be delivered to any thread in
+    // the process.  In practice, it appears in Linux that it's always
+    // delivered to the thread that called setitimer(), and each thread can
+    // have a separate itimer.  There's a support library for gprof that
+    // overlays pthread_create() to set timers in any threads you spawn.
+
+    // This loop walks through all the call stacks we recorded
+    for (malloc_log_entry* lep=firstLogEntry;
+         lep < lastLogEntry;
+         lep = reinterpret_cast<malloc_log_entry*>(&lep->pcs[lep->numpcs])) {
+      if (lep->thread != last_thread)
+      {
+        int i;
+        for (i=0; i<numThreads; i++)
+        {
+          if (lep->thread == threadArray[i])
+            break;
+        }
+        if (i == numThreads &&
+            i < (int) (sizeof(threadArray)/sizeof(threadArray[0])))
+        {
+          threadArray[i] = lep->thread;
+          numThreads++;
+          fprintf(stderr,"new thread %d\n",lep->thread);
+        }
+      }
+    }
+    fprintf(stderr,"Num threads %d\n",numThreads);
+
+    fprintf(stdout,"<hr>Threads:<p><pre>\n");
+    for (int i=0; i<numThreads; i++)
+    {
+      fprintf(stdout,"   <a href=\"#thread_%d\">%d</a><p>\n",
+              threadArray[i],threadArray[i]);
+    }
+    fprintf(stdout,"</pre><hr>");
+
+    for (int i=0; i<numThreads; i++)
+    {
+      analyze(threadArray[i]);
+    }
+  }
+  else
+  {
+    analyze(0);
+  }
+
+  fprintf(stdout,"</pre></body></html>\n");
 
   exit(0);
 }
 
 //----------------------------------------------------------------------
 
 
 static int symbolOrder(void const* a, void const* b)
@@ -423,24 +490,29 @@ void leaky::displayStackTrace(FILE* out,
 
 void leaky::dumpEntryToLog(malloc_log_entry* lep)
 {
   printf("%ld\t", lep->delTime);
   printf(" --> ");
   displayStackTrace(stdout, lep);
 }
 
-void leaky::generateReportHTML(FILE *fp, int *countArray, int count)
+void leaky::generateReportHTML(FILE *fp, int *countArray, int count, int thread)
 {
-  fprintf(fp,"<html><head><title>Jprof Profile Report</title></head><body>\n");
-  fprintf(fp,"<h1><center>Jprof Profile Report</center></h1>\n");
   fprintf(fp,"<center>");
-  fprintf(fp,"<A href=#flat>flat</A><b> | </b><A href=#hier>hierarchical</A>");
+  if (showThreads)
+  {
+    fprintf(fp,"<hr><A NAME=thread_%d><b>Thread: %d</b></A><p>",
+            thread,thread);
+  }
+  fprintf(fp,"<A href=#flat_%d>flat</A><b> | </b><A href=#hier_%d>hierarchical</A>",
+          thread,thread);
   fprintf(fp,"</center><P><P><P>\n");
 
+  int totalTimerHits = count;
   int *rankingTable = new int[usefulSymbols];
 
   for(int cnt=usefulSymbols; --cnt>=0; rankingTable[cnt]=cnt);
 
   // Drat.  I would use ::qsort() but I would need a global variable and my
   // intro-pascal professor threatened to flunk anyone who used globals.
   // She damaged me for life :-) (That was 1986.  See how much influence
   // she had.  I don't remember her name but I always feel guilty about globals)
@@ -459,33 +531,39 @@ void leaky::generateReportHTML(FILE *fp,
     }
   }
 
   // Ok, We are sorted now.  Let's go through the table until we get to
   // functions that were never called.  Right now we don't do much inside
   // this loop.  Later we can get callers and callees into it like gprof
   // does
   fprintf(fp,
-  "<h2><A NAME=hier></A><center><a href=\"http://lxr.mozilla.org/mozilla/source/tools/jprof/README.html#hier\">Hierarchical Profile</a></center></h2><hr>\n");
+	  "<h2><A NAME=hier_%d></A><center><a href=\"http://lxr.mozilla.org/mozilla/source/tools/jprof/README.html#hier\">Hierarchical Profile</a></center></h2><hr>\n",
+          thread);
   fprintf(fp, "<pre>\n");
-  fprintf(fp, "%5s %5s    %4s %s\n",
-  "index", "Count", "Hits", "Function Name");
+  fprintf(fp, "%6s %6s         %4s      %s\n",
+          "index", "Count", "Hits", "Function Name");
 
   for(i=0; i<usefulSymbols && countArray[rankingTable[i]]>0; i++) {
     Symbol *sp=&externalSymbols[rankingTable[i]];
     
-    sp->cntP.printReport(fp, this);
+    sp->cntP.printReport(fp, this, rankingTable[i], totalTimerHits);
 
     char *symname = htmlify(sp->name);
-    fprintf(fp, "%6d %3d <a name=%d>%8d</a> <b>%s</b>\n", rankingTable[i],
-            sp->timerHit, rankingTable[i], countArray[rankingTable[i]],
+    fprintf(fp, "%6d %6d (%3.1f%%)%s <a name=%d>%8d (%3.1f%%)</a>%s <b>%s</b>\n", 
+            rankingTable[i],
+            sp->timerHit, (sp->timerHit*1000/totalTimerHits)/10.0,
+            (sp->timerHit*1000/totalTimerHits)/10.0 >= 10.0 ? "" : " ",
+            rankingTable[i], countArray[rankingTable[i]],
+            (countArray[rankingTable[i]]*1000/totalTimerHits)/10.0,
+            (countArray[rankingTable[i]]*1000/totalTimerHits)/10.0 >= 10.0 ? "" : " ",
             symname);
     delete [] symname;
 
-    sp->cntC.printReport(fp, this);
+    sp->cntC.printReport(fp, this, rankingTable[i], totalTimerHits);
 
     fprintf(fp, "<hr>\n");
   }
   fprintf(fp,"</pre>\n");
 
   // OK, Now we want to print the flat profile.  To do this we resort on
   // the hit count.
 
@@ -503,45 +581,50 @@ void leaky::generateReportHTML(FILE *fp,
       }
     }
   }
 
   // Pre-count up total counter hits, to get a percentage.
   // I wanted the total before walking the list, if this
   // double-pass over externalSymbols gets slow we can
   // do single-pass and print this out after the loop finishes.
-  int totalTimerHits = 0;
+  totalTimerHits = 0;
   for(i=0;
     i<usefulSymbols && externalSymbols[rankingTable[i]].timerHit>0; i++) {
     Symbol *sp=&externalSymbols[rankingTable[i]];
     totalTimerHits += sp->timerHit;
   }
+  if (totalTimerHits == 0)
+    totalTimerHits = 1;
 
-  fprintf(fp,"<h2><A NAME=flat></A><center><a href=\"http://lxr.mozilla.org/mozilla/source/tools/jprof/README.html#flat\">Flat Profile</a></center></h2><br>\n");
+  if (totalTimerHits != count)
+    fprintf(stderr,"Hit count mismatch: count=%d; totalTimerHits=%d\n",
+            count,totalTimerHits);
+
+  fprintf(fp,"<h2><A NAME=flat_%d></A><center><a href=\"http://lxr.mozilla.org/mozilla/source/tools/jprof/README.html#flat\">Flat Profile</a></center></h2><br>\n",
+          thread);
   fprintf(fp, "<pre>\n");
 
   fprintf(fp, "Total hit count: %d\n", totalTimerHits);
   fprintf(fp, "Count %%Total  Function Name\n");
   // Now loop for as long as we have timer hits
   for(i=0;
     i<usefulSymbols && externalSymbols[rankingTable[i]].timerHit>0; i++) {
 
     Symbol *sp=&externalSymbols[rankingTable[i]];
     
     char *symname = htmlify(sp->name);
     fprintf(fp, "<a href=\"#%d\">%3d   %-2.1f     %s</a>\n",
             rankingTable[i], sp->timerHit,
             ((float)sp->timerHit/(float)totalTimerHits)*100.0, symname);
     delete [] symname;
   }
-
-  fprintf(fp,"</pre></body></html>\n");
 }
 
-void leaky::analyze()
+void leaky::analyze(int thread)
 {
   int *countArray = new int[usefulSymbols];
   int *flagArray  = new int[usefulSymbols];
 
   //Zero our function call counter
   memset(countArray, 0, sizeof(countArray[0])*usefulSymbols);
 
   // The flag array is used to prevent counting symbols multiple times
@@ -553,27 +636,30 @@ void leaky::analyze()
   memset(flagArray, -1, sizeof(flagArray[0])*usefulSymbols);
 
   // This loop walks through all the call stacks we recorded
   stacks = 0;
   for(malloc_log_entry* lep=firstLogEntry; 
     lep < lastLogEntry;
     lep = reinterpret_cast<malloc_log_entry*>(&lep->pcs[lep->numpcs])) {
 
-    if (excluded(lep) || !included(lep))
+    if ((thread != 0 && lep->thread != thread) ||
+        excluded(lep) || !included(lep))
+    {
       continue;
+    }
 
     ++stacks; // How many stack frames did we collect
 
     // This loop walks through every symbol in the call stack.  By walking it
     // backwards we know who called the function when we get there.
     u_int n = (lep->numpcs < stackDepth) ? lep->numpcs : stackDepth;
     char** pcp = &lep->pcs[n-1];
     int idx=-1, parrentIdx=-1;  // Init idx incase n==0
-    for(int i=n-1; i>=0; --i, --pcp, parrentIdx=idx) {
+    for (int i=n-1; i>=0; --i, --pcp) {
       idx = findSymbolIndex(reinterpret_cast<u_long>(*pcp));
 
       if(idx>=0) {
 	// Skip over bogus __restore_rt frames that realtime profiling
 	// can introduce.
 	if (i > 0 && !strcmp(externalSymbols[idx].name, "__restore_rt")) {
 	  --pcp;
 	  --i;
@@ -588,41 +674,48 @@ void leaky::analyze()
 	  ++countArray[idx];
 	}
 
 	// We know who we are and we know who our parrent is.  Count this
 	if(parrentIdx>=0) {
 	  externalSymbols[parrentIdx].regChild(idx);
 	  externalSymbols[idx].regParrent(parrentIdx);
 	}
+        // inside if() so an unknown in the middle of a stack won't break
+        // the link!
+        parrentIdx=idx;
       }
     }
 
     // idx should be the function that we were in when we received the signal.
     if(idx>=0) {
       ++externalSymbols[idx].timerHit;
     }
   }
 
-  generateReportHTML(stdout, countArray, stacks);
+  generateReportHTML(stdout, countArray, stacks, thread);
 }
 
-void FunctionCount::printReport(FILE *fp, leaky *lk)
+void FunctionCount::printReport(FILE *fp, leaky *lk, int parent, int total)
 {
-    const char *fmt = "             <A href=\"#%d\">%6d %s</A>\n";
+    const char *fmt = "                      <A href=\"#%d\">%8d (%3.1f%%)%s %s</A>%s\n";
 
     int nmax, tmax=((~0U)>>1);
     
     do {
 	nmax=0;
 	for(int j=getSize(); --j>=0;) {
 	    int cnt = getCount(j);
 	    if(cnt==tmax) {
 		int idx = getIndex(j);
 		char *symname = htmlify(lk->indexToName(idx));
-		fprintf(fp, fmt, idx, getCount(j), symname);
+                fprintf(fp, fmt, idx, getCount(j),
+                        getCount(j)*100.0/total,
+                        getCount(j)*100.0/total >= 10.0 ? "" : " ",
+                        symname,
+                        parent == idx ? " (self)" : "");
 		delete [] symname;
 	    } else if(cnt<tmax && cnt>nmax) {
 	        nmax=cnt;
 	    }
 	}
     } while((tmax=nmax)>0);
 }
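
The two report paths in leaky.cpp above compute their percentages differently: generateReportHTML() uses integer arithmetic that truncates to tenths of a percent, while FunctionCount::printReport() divides in floating point and lets printf round. A minimal sketch of the two forms follows. The helper names are hypothetical, and the total of 1174 used in the usage note is an assumption: it is consistent with the README sample output, but the patch does not state it.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helpers, named for illustration only.

// Form used for a function's own row: integer tenths of a percent,
// truncated by integer division before the final divide by 10.0.
std::string truncatedPercent(int hits, int total)
{
    assert(total > 0);
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%3.1f", (hits * 1000 / total) / 10.0);
    return buf;
}

// Form used for caller/callee rows: floating-point percentage,
// rounded to one decimal place by printf itself.
std::string roundedPercent(int hits, int total)
{
    assert(total > 0);
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%3.1f", hits * 100.0 / total);
    return buf;
}
```

With an assumed total of 1174 hits, `truncatedPercent(645, 1174)` yields "54.9" and `roundedPercent(545, 1174)` yields "46.4", matching the README sample. The two forms can disagree in the last digit: 9 hits come out as "0.7" truncated and "0.8" rounded.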
--- a/tools/jprof/leaky.h
+++ b/tools/jprof/leaky.h
@@ -47,17 +47,17 @@
 typedef unsigned int u_int;
 
 struct Symbol;
 struct leaky;
 
 class FunctionCount : public IntCount
 {
 public:
-    void printReport(FILE *fp, leaky *lk);
+  void printReport(FILE *fp, leaky *lk, int parent, int total);
 };
 
 struct Symbol {
   char* name;
   u_long address;
   int    timerHit;
   FunctionCount cntP, cntC;
 
@@ -85,16 +85,17 @@ struct leaky {
   void open();
 
   char*  applicationName;
   char*  logFile;
   char*  progFile;
 
   int   quiet;
   int   showAddress;
+  int   showThreads;
   u_int  stackDepth;
 
   int   mappedLogFile;
   malloc_log_entry* firstLogEntry;
   malloc_log_entry* lastLogEntry;
 
   int    stacks;
 
@@ -110,17 +111,17 @@ struct leaky {
 
   StrSet roots;
   StrSet includes;
 
   void usageError();
 
   void LoadMap();
 
-  void analyze();
+  void analyze(int thread);
 
   void dumpEntryToLog(malloc_log_entry* lep);
 
   void insertAddress(u_long address, malloc_log_entry* lep);
   void removeAddress(u_long address, malloc_log_entry* lep);
 
   void displayStackTrace(FILE* out, malloc_log_entry* lep);
 
@@ -128,13 +129,13 @@ struct leaky {
   void ReadSharedLibrarySymbols();
   void setupSymbols(const char* fileName);
   Symbol* findSymbol(u_long address);
   bool excluded(malloc_log_entry* lep);
   bool included(malloc_log_entry* lep);
   const char* indexToName(int idx) {return externalSymbols[idx].name;}
 
   private:
-  void generateReportHTML(FILE *fp, int *countArray, int count);
+  void generateReportHTML(FILE *fp, int *countArray, int count, int thread);
   int  findSymbolIndex(u_long address);
 };
 
 #endif /* __leaky_h_ */
--- a/tools/jprof/stub/Makefile.in
+++ b/tools/jprof/stub/Makefile.in
@@ -43,16 +43,19 @@ VPATH		= @srcdir@
 
 include $(DEPTH)/config/autoconf.mk
 
 MODULE		= jprof
 EXPORTS		=
 LIBRARY_NAME	= jprof
 EXPORT_LIBRARY	= 1
 
+# override optimization
+MOZ_OPTIMIZE_FLAGS = -fno-omit-frame-pointer
+
 CPPSRCS		= \
 		libmalloc.cpp \
 		$(NULL)
 
 EXPORTS         = \
                 jprof.h \
                 $(NULL)
 
--- a/tools/jprof/stub/libmalloc.cpp
+++ b/tools/jprof/stub/libmalloc.cpp
@@ -18,16 +18,17 @@
  * The Initial Developer of the Original Code is Netscape Communications Corp.
  * Portions created by the Initial Developer are Copyright (C) 1998
  * the Initial Developer. All Rights Reserved.
  *
  * Contributor(s):
  *   Jim Nance
  *   L. David Baron - JP_REALTIME, JPROF_PTHREAD_HACK, and SIGUSR1 handling
  *   Mike Shaver - JP_RTC_HZ support
+ *   Randell Jesup - glibc backtrace() support
  *
  * Alternatively, the contents of this file may be used under the terms of
  * either the GNU General Public License Version 2 or later (the "GPL"), or
  * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
  * in which case the provisions of the GPL or the LGPL are applicable instead
  * of those above. If you wish to allow use of your version of this file only
  * under the terms of either the GPL or the LGPL, and not to allow others to
  * use your version of this file under the terms of the MPL, indicate your
@@ -56,73 +57,107 @@
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <signal.h>
 #include <sys/time.h>
 #include <sys/types.h>
 #include <sys/ioctl.h>
 #include <sys/stat.h>
+#include <sys/syscall.h>
 #include <ucontext.h>
+#include <execinfo.h>
 
 #include "libmalloc.h"
 #include "jprof.h"
 #include <string.h>
 #include <errno.h>
 #include <dlfcn.h>
 
 
 #ifdef NTO
 #include <sys/link.h>
 extern r_debug _r_debug;
 #else
 #include <link.h>
 #endif
 
+#define USE_GLIBC_BACKTRACE 1
+// To debug, use #define JPROF_STATIC
+#define JPROF_STATIC //static
+
 static int gLogFD = -1;
 static pthread_t main_thread;
 
 static void startSignalCounter(unsigned long millisec);
 static int enableRTCSignals(bool enable);
 
 
 //----------------------------------------------------------------------
 
 #if defined(i386) || defined(_i386) || defined(__x86_64__)
-static void CrawlStack(malloc_log_entry* me,
-                       void* stack_top, void* top_instr_ptr)
+JPROF_STATIC void CrawlStack(malloc_log_entry* me,
+                             void* stack_top, void* top_instr_ptr)
 {
+#if USE_GLIBC_BACKTRACE
+    // This probably works on more than x86!  But we need a way to get the
+    // top instruction pointer, which is kind of arch-specific
+    void *array[500];
+    int cnt, i;
+    u_long numpcs = 0;
+    bool tracing = false;
+
+    // This is from glibc.  A more generic version might use
+    // libunwind and/or CaptureStackBackTrace() on Windows
+    cnt = backtrace(&array[0],sizeof(array)/sizeof(array[0]));
+
+    // Frames 0-2 are CrawlStack, JprofLog and StackHook; frame 3 is the
+    // signal-delivery frame, so overwrite it with the interrupted PC
+    array[3] = top_instr_ptr;
+    for (i = 3; i < cnt && numpcs < MAX_STACK_CRAWL; i++)
+    {
+        me->pcs[numpcs++] = (char *) array[i];
+    }
+    me->numpcs = numpcs;
+
+#else
+  // original code - this breaks on many platforms
   void **bp;
 #if defined(__i386)
   __asm__( "movl %%ebp, %0" : "=g"(bp));
 #elif defined(__x86_64__)
   __asm__( "movq %%rbp, %0" : "=g"(bp));
 #else
   // It would be nice if this worked uniformly, but at least on i386 and
   // x86_64, it stopped working with gcc 4.1, because it points to the
   // end of the saved registers instead of the start.
   bp = __builtin_frame_address(0);
 #endif
   u_long numpcs = 0;
+  bool tracing = false;
 
   me->pcs[numpcs++] = (char*) top_instr_ptr;
 
   while (numpcs < MAX_STACK_CRAWL) {
     void** nextbp = (void**) *bp++;
     void* pc = *bp;
     if (nextbp < bp) {
       break;
     }
-    if (bp > stack_top) {
+    if (tracing) {
       // Skip the signal handling.
       me->pcs[numpcs++] = (char*) pc;
     }
+    else if (pc == top_instr_ptr) {
+      tracing = true;
+    }
     bp = nextbp;
   }
   me->numpcs = numpcs;
+#endif
 }
 #endif
 
 //----------------------------------------------------------------------
 
 static int rtcHz;
 static int rtcFD = -1;
 
@@ -164,23 +199,24 @@ static void DumpAddressMap()
 static void EndProfilingHook(int signum)
 {
     DumpAddressMap();
     puts("Jprof: profiling paused.");
 }
 
 //----------------------------------------------------------------------
 
-static void
-Log(u_long aTime, void* stack_top, void* top_instr_ptr)
+JPROF_STATIC void
+JprofLog(u_long aTime, void* stack_top, void* top_instr_ptr)
 {
   // Static is simply to make debugging tollerable
   static malloc_log_entry me;
 
   me.delTime = aTime;
+  me.thread = syscall(SYS_gettid); // raw syscall: older glibc has no gettid()
 
   CrawlStack(&me, stack_top, top_instr_ptr);
 
 #ifndef NTO
   write(gLogFD, &me, offsetof(malloc_log_entry, pcs) + me.numpcs*sizeof(char*));
 #else
   printf("Neutrino is missing the pcs member of malloc_log_entry!! \n");
 #endif
@@ -276,17 +312,17 @@ static int enableRTCSignals(bool enable)
         }            
         return 0;
     }
 
     return 1;
 }
 #endif
 
-static void StackHook(
+JPROF_STATIC void StackHook(
 int signum,
 siginfo_t *info,
 void *ucontext)
 {
     static struct timeval tFirst;
     static int first=1;
     size_t millisec = 0;
 
@@ -320,19 +356,19 @@ void *ucontext)
             double usec = 1e6*(tNow.tv_sec - tFirst.tv_sec);
             usec += (tNow.tv_usec - tFirst.tv_usec);
             millisec = static_cast<size_t>(usec*1e-3);
         }
     }
 
     gregset_t &gregs = ((ucontext_t*)ucontext)->uc_mcontext.gregs;
 #ifdef __x86_64__
-    Log(millisec, (void*)gregs[REG_RSP], (void*)gregs[REG_RIP]);
+    JprofLog(millisec, (void*)gregs[REG_RSP], (void*)gregs[REG_RIP]);
 #else
-    Log(millisec, (void*)gregs[REG_ESP], (void*)gregs[REG_EIP]);
+    JprofLog(millisec, (void*)gregs[REG_ESP], (void*)gregs[REG_EIP]);
 #endif
 
     if (!rtcHz)
         startSignalCounter(timerMiliSec);
 }
 
 NS_EXPORT_(void) setupProfilingStuff(void)
 {
@@ -366,31 +402,36 @@ NS_EXPORT_(void) setupProfilingStuff(voi
 		startTimer = 0;
 	    }
 	    if(strstr(tst, "JP_START")) doNotStart = 0;
 	    if(strstr(tst, "JP_REALTIME")) realTime = 1;
 	    if(strstr(tst, "JP_APPEND")) append = O_APPEND;
 
 	    char *delay = strstr(tst,"JP_PERIOD=");
 	    if(delay) {
-	        double tmp = strtod(delay+10, NULL);
-		if(tmp>1e-3) {
+                double tmp = strtod(delay+strlen("JP_PERIOD="), NULL);
+                if (tmp>=1e-3) {
 		    timerMiliSec = static_cast<unsigned long>(1000 * tmp);
-		}
+                } else {
+                    fprintf(stderr,
+                            "JP_PERIOD of %g less than 0.001 (1ms), using 1ms\n",
+                            tmp);
+                    timerMiliSec = 1;
+                }
 	    }
 
 	    char *first = strstr(tst, "JP_FIRST=");
 	    if(first) {
-	        firstDelay = atol(first+9);
+                firstDelay = atol(first+strlen("JP_FIRST="));
 	    }
 
             char *rtc = strstr(tst, "JP_RTC_HZ=");
             if (rtc) {
 #if defined(linux)
-                rtcHz = atol(rtc+10);
+                rtcHz = atol(rtc+strlen("JP_RTC_HZ="));
                 timerMiliSec = 0; /* This makes JP_FIRST work right. */
                 realTime = 1; /* It's the _R_TC and all.  ;) */
 
 #define IS_POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
 
                 if (!IS_POWER_OF_TWO(rtcHz) || rtcHz < 2) {
                     fprintf(stderr, "JP_RTC_HZ must be power of two and >= 2, "
                             "but %d was provided; using default of 2048\n",
@@ -415,16 +456,18 @@ NS_EXPORT_(void) setupProfilingStuff(voi
 		} else {
 		    struct sigaction action;
 		    sigset_t mset;
 
 		    // Dump out the address map when we terminate
 		    atexit(DumpAddressMap);
 
 		    main_thread = pthread_self();
+                    //fprintf(stderr,"jprof: main_thread = %u\n",
+                    //        (unsigned int)main_thread);
 
 		    sigemptyset(&mset);
 		    action.sa_handler = NULL;
 		    action.sa_sigaction = StackHook;
 		    action.sa_mask  = mset;
 		    action.sa_flags = SA_RESTART | SA_SIGINFO;
 #if defined(linux)
                     if (rtcHz) {
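Editorial note on the CrawlStack change above: the glibc path captures up to 500 frames and copies them into me->pcs, which holds only MAX_STACK_CRAWL entries, so the copy needs its own bound. The following is a minimal standalone sketch of that pattern, not the jprof code itself. CaptureStack and kMaxFrames are hypothetical names introduced for illustration, and the sketch assumes a glibc system where <execinfo.h> provides backtrace().

```cpp
#include <execinfo.h>   // backtrace() (glibc)
#include <cassert>
#include <cstddef>

// Hypothetical scratch-buffer size; jprof itself uses a 500-entry array.
static const int kMaxFrames = 64;

// Hypothetical helper: copy return addresses of the current call stack
// into out[], dropping the `skip` innermost frames and never writing
// more than outCapacity entries. Returns the number of entries stored.
size_t CaptureStack(void** out, size_t outCapacity, int skip)
{
    assert(out != nullptr && skip >= 0);

    void* frames[kMaxFrames];
    int count = backtrace(frames, kMaxFrames);

    size_t n = 0;
    // Both limits matter: `count` bounds the source, `outCapacity`
    // bounds the destination.
    for (int i = skip; i < count && n < outCapacity; i++) {
        out[n++] = frames[i];
    }
    return n;
}
```

A caller inside a signal handler would pass a skip count covering its own helper frames, as the patch does with the fixed index 3, and then substitute the interrupted program counter for the first stored entry.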
--- a/tools/jprof/stub/libmalloc.h
+++ b/tools/jprof/stub/libmalloc.h
@@ -47,16 +47,17 @@ extern "C" {
 
 typedef unsigned long u_long;
 
 // Format of a malloc log entry. This is what's written out to the
 // "malloc-log" file.
 struct malloc_log_entry {
   u_long delTime;
   u_long numpcs;
+  int thread;
   char* pcs[MAX_STACK_CRAWL];
 };
 
 // type's
 #define malloc_log_stack   7
 
 // Format of a malloc map entry; after this struct is nameLen+1 bytes of
 // name data.
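Editorial note on the malloc_log_entry change above: JprofLog writes only the used prefix of each entry, offsetof(malloc_log_entry, pcs) plus numpcs pointers, so inserting the thread field before pcs changes the on-disk record size and any reader of the log has to use the same layout. A minimal sketch of that size calculation follows. LogEntry, RecordSize and the value 200 for kMaxStackCrawl are assumptions made for illustration; the real MAX_STACK_CRAWL is defined elsewhere in libmalloc.h.

```cpp
#include <cassert>
#include <cstddef>

// Assumed stand-in for MAX_STACK_CRAWL; the real value is not shown here.
const size_t kMaxStackCrawl = 200;

// Mirrors the field order of malloc_log_entry after this patch.
struct LogEntry {
    unsigned long delTime;
    unsigned long numpcs;
    int thread;
    char* pcs[kMaxStackCrawl];
};

// Number of bytes one record occupies in the log: the fixed header
// (including any padding before pcs) plus the pcs entries actually used.
size_t RecordSize(const LogEntry& e)
{
    assert(e.numpcs <= kMaxStackCrawl);
    return offsetof(LogEntry, pcs) + e.numpcs * sizeof(char*);
}
```

Using offsetof rather than summing field sizes keeps the calculation correct on 64-bit builds, where the compiler may pad the int before the pointer array.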