Understanding Process Crashes and Backtracing in Linux

We know that all software developers write buggy software, and so do I. Bugs are inevitable as software becomes more and more complex. The chance that we have to debug our own (or other developers') software keeps growing, at least until AI-based debuggers are used in many companies. For now, let us learn the basics of how an application crashes and how to debug it using backtracing.

Application Crash

In the infotainment system projects I have been involved in, the most critical issues were application crashes. In Linux, when an application crashes, the kernel sends a signal to the application process and, if core dumps are enabled, writes a core file (coredump). In most situations we have to enable coredump creation ourselves, for example by running ulimit -c unlimited in the Linux console before starting the application.

So let us start with a simple C++ application:

#include <stdio.h>

void function_2() {
  char *a = 0;
  char b = *a;
  printf("Get char: %c \n", b);
  
}

void function_1() {
  printf("function 1 calling function 2\n");
  function_2();
}

int main(int argc, char **argv) {
  function_1();
  return 0;
}

and the CMake file CMakeLists.txt contains:

cmake_minimum_required (VERSION 3.5)
add_definitions(-std=c++11)

# including the source codes
set(sources
   src/blog_example.cpp)

# create the binary and name it example
add_executable(example ${sources})

If we run the application, it will crash with a segmentation fault:

$ ./example 
function 1 calling function 2
Segmentation fault (core dumped)

If we have not run ulimit and the result of ulimit -c is 0, we will not see a core file. After executing ulimit -c unlimited and running the application again, we will see a core file in our current directory.
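The same limit can also be raised from inside a process. The following is a minimal sketch, assuming a Linux system with the POSIX getrlimit()/setrlimit() calls; the helper name enable_core_dumps() is hypothetical and not part of the example application. It lifts the soft core-file limit to the hard limit, which matches ulimit -c unlimited only when the hard limit itself is unlimited.

```cpp
#include <cassert>         // assert, for checking the result
#include <cstdio>          // std::perror
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE

// Raises the soft core-file size limit of the calling process to its hard
// limit. Returns true on success.
// Hypothetical helper, not part of the original example application.
bool enable_core_dumps() {
  struct rlimit limit;
  if (getrlimit(RLIMIT_CORE, &limit) != 0) {
    std::perror("getrlimit");
    return false;
  }
  // An unprivileged process may raise its soft limit up to the hard limit.
  limit.rlim_cur = limit.rlim_max;
  if (setrlimit(RLIMIT_CORE, &limit) != 0) {
    std::perror("setrlimit");
    return false;
  }
  return true;
}
```

Calling enable_core_dumps() at the start of main() would affect this process and the children it starts afterwards, while ulimit in the shell affects everything started from that shell.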

On desktop Ubuntu 16.04, the OS forwards the coredump to the apport application. We can check this by running the following command:

$ cat /proc/sys/kernel/core_pattern

which returns |/usr/share/apport/apport %p %s %c %P. The leading | means that the kernel pipes the coredump to that program instead of writing a file.

We can change the kernel's naming scheme so that the core file name contains the executable name (and, by putting a path in front of the pattern, set the output directory) by executing this command as root:

echo "%e.core" > /proc/sys/kernel/core_pattern
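To check from a program which of the two behaviours is active, we can read the same file. This is only a sketch: the function names are hypothetical, and the only rule it relies on is that a pattern starting with | is piped to a helper program, while anything else is treated as a file name template.

```cpp
#include <cassert>  // assert, for checking the result
#include <fstream>  // std::ifstream
#include <string>   // std::string, std::getline

// Returns true when the pattern tells the kernel to pipe core dumps to a
// helper program (it starts with '|'), false for a file name template.
// Hypothetical helper name.
bool is_piped_core_pattern(const std::string &pattern) {
  return !pattern.empty() && pattern[0] == '|';
}

// Reads the current pattern from procfs. Returns an empty string when the
// file cannot be read, for example on a system without /proc.
std::string read_core_pattern() {
  std::ifstream file("/proc/sys/kernel/core_pattern");
  std::string pattern;
  std::getline(file, pattern);
  return pattern;
}
```

With the apport pattern shown earlier, is_piped_core_pattern(read_core_pattern()) returns true; after writing %e.core it returns false.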

In an embedded Linux environment without apport, if systemd is used as the init system, we can edit the systemd configuration file /etc/systemd/system.conf, uncomment the line with DefaultLimitCORE, and set it to infinity or to a fixed value in bytes:

DefaultLimitCORE=infinity

It is essential to know the Linux signals in this context, and the Linux manual page about signals is recommended reading. The most common signal for an application crash is SIGSEGV. These other signals also trigger a coredump by default: SIGQUIT, SIGILL, SIGABRT, SIGFPE, SIGBUS, SIGSYS, SIGTRAP, SIGXCPU, SIGXFSZ.
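How a parent process sees such a termination can be sketched with fork() and waitpid(). The struct and function names below are hypothetical, and the signal is raised with raise() instead of a real fault. WCOREDUMP is available on Linux but is not part of POSIX, and whether it reports a core dump depends on the limits discussed above.

```cpp
#include <cassert>     // assert, for checking the result
#include <signal.h>    // signal, raise, sigprocmask, SIGSEGV
#include <sys/wait.h>  // waitpid, WIFSIGNALED, WTERMSIG, WCOREDUMP
#include <unistd.h>    // fork, _exit

// Hypothetical result type for this sketch.
struct CrashReport {
  bool killed_by_signal;  // true if the child was terminated by a signal
  int signal_number;      // the signal that terminated it
  bool core_dumped;       // true if the kernel reports a core dump
};

// Forks a child that raises signum with the default action and reports how
// the child ended.
CrashReport crash_child_with(int signum) {
  CrashReport report = {false, 0, false};
  pid_t child = fork();
  if (child == 0) {
    // Child: restore the default action, unblock the signal, raise it.
    signal(signum, SIG_DFL);
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signum);
    sigprocmask(SIG_UNBLOCK, &set, nullptr);
    raise(signum);
    _exit(0);  // only reached if the signal did not terminate the child
  }
  int status = 0;
  if (child > 0 && waitpid(child, &status, 0) == child &&
      WIFSIGNALED(status)) {
    report.killed_by_signal = true;
    report.signal_number = WTERMSIG(status);
    report.core_dumped = WCOREDUMP(status) != 0;
  }
  return report;
}
```

For SIGSEGV the report names the signal, and core_dumped follows the core limit; for a signal such as SIGTERM, whose default action is plain termination, core_dumped stays false.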

Backtracing in Linux

Backtracing means listing the function calls backward from the crash point. If a running process or thread has not crashed, backtracing returns the list of function calls that are currently active in that process or thread.
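The non-crash case can be sketched with glibc's backtrace() and backtrace_symbols_fd(). The function names print_active_calls(), inner_call() and outer_call() are made up for this illustration, and noinline is a GCC/Clang attribute used here only to keep the frames from being merged. Readable function names in the output usually require linking with -rdynamic.

```cpp
#include <cassert>     // assert, for checking the result
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc)
#include <unistd.h>    // STDOUT_FILENO

// Prints the functions that are currently active in the calling thread and
// returns how many frames were captured. Hypothetical helper name.
__attribute__((noinline)) int print_active_calls() {
  void *frames[32];
  int count = backtrace(frames, 32);
  // Writes one line per frame directly to the file descriptor.
  backtrace_symbols_fd(frames, count, STDOUT_FILENO);
  return count;
}

// Two nested callers, so that the printed list has some depth.
__attribute__((noinline)) int inner_call() { return print_active_calls(); }
__attribute__((noinline)) int outer_call() { return inner_call(); }
```

Calling outer_call() from main() prints the innermost frame first and main() near the end, which is the backward order described above.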

Using the GNU Debugger (gdb)

A coredump (core file) can be analyzed with gdb. The following command loads a core file together with the binary that produced it; typing bt at the gdb prompt then prints the full backtrace:

$ gdb <app_binary_path> <core_path>

and it produces output like this:

Reading symbols from example...(no debugging symbols found)...done.
[New LWP 12658]
Core was generated by './example'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000040057a in function_2() ()

As we can see, gdb reports that the binary doesn't contain debugging symbols. The output shows the function in which the crash happened, but not the file name or line number. In order to get a detailed backtrace, we have to build the application with -g as an additional g++ or gcc parameter. We can modify the CMake file to:

add_definitions(-std=c++11 -g)

After compiling the application with debugging symbols enabled, the result of the gdb command is the following:

Reading symbols from example...done.

warning: exec file is newer than core file.
[New LWP 12658]
Core was generated by './example'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000040057a in function_2 () at /usr/xxx/src/blog_example.cpp:5
5     char b = *a;

Programmatic Backtracing with a Signal Handler

In order to do the backtracing after a crash signal such as SIGSEGV, we need to handle the signal by replacing the default signal action with our own signal handler. A very good, comprehensive treatment of backtracing can be found in Eli's article. Basically, as developers, we use either the glibc functions (backtrace and backtrace_symbols) or libunwind. He describes how to get the backtrace in the normal case, in which the process does not crash. Moreover, he writes that he strongly prefers libunwind, because it is the most modern, widespread, flexible, and portable solution.

From my perspective, the big difference between those libraries shows up when we want to acquire the backtrace after a crash signal. We can redefine the signal handler and print the backtrace to stdout (the console). However, we need to be aware of the restrictions on signal handler implementations. Based on The Linux Programming Interface [1], a signal handler should follow one of these two designs:

1. The signal handler sets a global flag and exits. For a crash signal such as SIGSEGV, if the handler returns without terminating the application, the faulting instruction is executed again and the process ends up in an endless loop.
2. The signal handler performs a cleanup if any heap memory was allocated before. Afterwards, it either terminates the process or uses a nonlocal goto to unwind the stack and return to a predetermined location in the main application. See the goto example in [2].
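Both designs can be sketched in a few lines. This is an illustration under assumptions, not the example from [2]: all names are hypothetical, the signals are delivered with raise() instead of a real fault, and the nonlocal goto uses sigsetjmp()/siglongjmp() so that the signal mask is restored after leaving the handler.

```cpp
#include <cassert>   // assert, for checking the result
#include <setjmp.h>  // sigjmp_buf, sigsetjmp, siglongjmp
#include <signal.h>  // sigaction, raise, sig_atomic_t

// Design 1: the handler only records that the signal arrived.
static volatile sig_atomic_t got_signal = 0;
static void flag_handler(int) { got_signal = 1; }

// Design 2: the handler jumps back to a predetermined recovery point.
static sigjmp_buf recovery_point;
static void jump_handler(int) { siglongjmp(recovery_point, 1); }

// Installs handler for signum; returns true on success.
static bool install(int signum, void (*handler)(int)) {
  struct sigaction sa = {};
  sa.sa_handler = handler;
  sigemptyset(&sa.sa_mask);
  return sigaction(signum, &sa, nullptr) == 0;
}

// Returns true if SIGUSR1 was noticed through the global flag.
bool demo_flag() {
  got_signal = 0;
  if (!install(SIGUSR1, flag_handler)) return false;
  raise(SIGUSR1);  // the handler runs before raise() returns
  return got_signal == 1;
}

// Returns true if control came back through the nonlocal goto.
bool demo_nonlocal_goto() {
  if (!install(SIGSEGV, jump_handler)) return false;
  if (sigsetjmp(recovery_point, 1) == 0) {
    raise(SIGSEGV);  // stands in for a real fault
    return false;    // not reached: the handler jumps away
  }
  return true;       // reached via siglongjmp() from the handler
}
```

The flag design suits signals after which the program can simply continue, while the jump design is the one that lets a handler leave a crash signal without returning to the faulting instruction.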

Below is example code to be added to our example application, which uses glibc's backtrace() and backtrace_symbols():

#include <execinfo.h>    // backtrace, backtrace_symbols
#include <stdlib.h>      // free, exit
#include <unistd.h>      // _exit
#include <signal.h>      // sigaction, sigemptyset, signal numbers

void glibc_backtrace (void)
{
  void *array[10];
  size_t size;
  char **strings;
  size_t i;
  //call the glibc backtrace()
  size = backtrace (array, 10);
  //obtain the backtrace symbols
  strings = backtrace_symbols (array, size);
  if (strings == NULL)   // backtrace_symbols() could not allocate memory
    return;

  printf ("Obtained %zu stack frames.\n", size);
  for (i = 0; i < size; i++)
     printf ("%s\n", strings[i]);
  //free the memory allocation
  free (strings);
}

static void sig_handler(int signum)
{
    printf("%s: signal %d\n", __FUNCTION__, signum);
    switch(signum ) {
    case SIGSEGV:
        glibc_backtrace();
        break;
    default:
        break;
    }
    _exit(1);
}

void redefine_signal()
{
    // Redefining a new signal handler for SIGSEGV
    struct sigaction sa;
    sa.sa_handler = sig_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGSEGV, &sa, NULL) == -1)
    {
      exit(1);
    }
}

int main(int argc, char **argv) {
  redefine_signal();
  function_1();
  return 0;
}

Note that we should use _exit() to terminate the application after a crash in the signal handler. The exit() function is unsafe because it runs exit handlers and flushes stdio buffers before calling _exit(). If we have dynamic memory allocations (e.g. malloc or new), we have to clean up the heap first before calling _exit(). If a custom signal handler from a different library is also installed, we need to chain the signal handlers and avoid using _exit().

The above code should not be used in production software, since backtrace_symbols() tries to allocate heap memory, and the heap may already be corrupt. Furthermore, the code in signal handlers has to be reentrant [3] and async-signal-safe [4]. For comparison, the code below shows how we can use the libunwind approach, replacing glibc_backtrace() in sig_handler() with unwind():

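#define UNW_LOCAL_ONLY   // local unwinding only; must come before libunwind.h
#include <libunwind.h>   // unw_getcontext, unw_init_local, unw_step, ...
#include <cxxabi.h>      // abi::__cxa_demangle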

void unwind() {
  unw_cursor_t cursor;
  unw_context_t context;

  // Initialize cursor to current frame for local unwinding.
  unw_getcontext(&context);
  unw_init_local(&cursor, &context);

  // Unwind frames one by one, going up the frame stack.
  while (unw_step(&cursor) > 0) {
    unw_word_t offset, pc;
    unw_get_reg(&cursor, UNW_REG_IP, &pc);
    if (pc == 0) {
      break;
    }
    printf("ip 0x%lx:", pc);

    char sym[256];
    if (unw_get_proc_name(&cursor, sym, sizeof(sym), &offset) == 0) {
      printf(" (%s+0x%lx) ", sym, offset);
      
      char* demangle_ptr = sym;
      int ret_state;
      char* demangled = abi::__cxa_demangle(sym, nullptr, nullptr, &ret_state);
      if (ret_state == 0) {
        demangle_ptr = demangled;
      }
      printf(" -> %s+0x%lx\n", demangle_ptr, offset);
      free(demangled);

    } else {
      printf(" Error: unable to obtain symbol name for this frame\n");
    }
  }
}

static void sig_handler(int signum)
{
    printf("%s: signal %d\n", __FUNCTION__, signum);
    switch(signum ) {
    case SIGSEGV:
        unwind();
        break;
    default:
        break;
    }
    _exit(1);
}
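Both handlers above call printf(), which is not on the async-signal-safe list in [4]. A hedged sketch of how the signal number could be reported with write() only is shown below; format_decimal() and report_signal() are hypothetical helpers, not functions from glibc or libunwind.

```cpp
#include <cassert>   // assert, for checking the result
#include <cstddef>   // size_t
#include <unistd.h>  // write, STDERR_FILENO

// Writes the decimal digits of value into buf (without a terminating NUL)
// and returns the number of characters, or 0 if buf is too small.
// Uses no library calls, so it can run inside a signal handler.
size_t format_decimal(unsigned long value, char *buf, size_t capacity) {
  char reversed[24];  // enough for a 64-bit value
  size_t count = 0;
  do {
    reversed[count++] = static_cast<char>('0' + value % 10);
    value /= 10;
  } while (value != 0);
  if (count > capacity) return 0;
  for (size_t i = 0; i < count; ++i) buf[i] = reversed[count - 1 - i];
  return count;
}

// Reports the signal number on stderr using only write(), which is listed
// as async-signal-safe.
void report_signal(int signum) {
  static const char prefix[] = "caught signal ";
  char digits[24];
  size_t count = format_decimal(static_cast<unsigned long>(signum), digits,
                                sizeof(digits));
  // Results are ignored on purpose: there is no safe recovery at this point.
  ssize_t ignored = write(STDERR_FILENO, prefix, sizeof(prefix) - 1);
  ignored = write(STDERR_FILENO, digits, count);
  ignored = write(STDERR_FILENO, "\n", 1);
  (void)ignored;
}
```

In sig_handler(), report_signal(signum) could take the place of the printf() call; the frame output inside the backtrace functions would need the same treatment.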

For further information, I recommend reading The Linux Programming Interface and the discussions in [5] and [6].

The function char * __cxa_demangle (const char *mangled_name, char *output_buffer, size_t *length, int *status) allocates memory with malloc if a null pointer is passed as output_buffer, as in the example above [7]. In order to prevent this allocation inside the handler, we have to allocate a large enough buffer before calling unwind() and free it afterwards. We cannot pass a pointer to a stack buffer, because if the passed output_buffer is not long enough, it is expanded using realloc [8].
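The buffer contract described above could look like the following sketch. DemangleBuffer, make_demangle_buffer() and demangle_into() are hypothetical names. The sketch only illustrates how a malloc-allocated buffer and its length are passed in and taken back; whether the runtime still allocates internally while demangling depends on the C++ library implementation and is not something this example can guarantee.

```cpp
#include <cassert>   // assert, for checking the result
#include <cstddef>   // size_t
#include <cstdlib>   // std::malloc, std::free
#include <cstring>   // std::strcmp
#include <cxxabi.h>  // abi::__cxa_demangle

// A buffer that must come from malloc, because __cxa_demangle may realloc it.
struct DemangleBuffer {
  char *data;
  size_t size;
};

// Allocates the buffer up front, before any signal handler runs.
DemangleBuffer make_demangle_buffer(size_t size) {
  DemangleBuffer buffer = {static_cast<char *>(std::malloc(size)), size};
  if (buffer.data == nullptr) buffer.size = 0;
  return buffer;
}

// Demangles into the preallocated buffer. Returns the readable name, or the
// input itself when it is not a valid mangled name.
const char *demangle_into(DemangleBuffer &buffer, const char *mangled) {
  int status = 0;
  size_t length = buffer.size;
  char *result =
      abi::__cxa_demangle(mangled, buffer.data, &length, &status);
  if (status != 0 || result == nullptr) return mangled;
  // The buffer may have been moved and enlarged by realloc.
  buffer.data = result;
  buffer.size = length;
  return result;
}
```

The buffer would be created once, for example in redefine_signal(), reused for every frame in unwind(), and released with std::free(buffer.data) when it is no longer needed.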

Conclusion

In this article, we learned about application crashes, signal handlers, and how to use a backtrace to analyze an application crash. There are two common approaches to perform the backtracing and to obtain the symbol names:

glibc's backtrace() and backtrace_symbols()
libunwind functions:
1. unw_getcontext(),
2. unw_init_local(),
3. unw_step(),
4. unw_get_reg(),
5. unw_get_proc_name(),
6. and abi::__cxa_demangle()

The main advantage of using unw_get_proc_name() and abi::__cxa_demangle() is that we can provide the buffers ourselves, instead of letting the library allocate the result strings inside a signal handler.

References

[4]: http://man7.org/linux/man-pages/man7/signal-safety.7.html

[7]: https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.3/a01696.html

[8]: https://stackoverflow.com/questions/45022504/abi-cxa-demangle-why-buffer-needs-to-be-malloc-ed

[9]: http://stackoverflow.com/a/5945911

[10]: http://free-electrons.com/pub/video/2008/ols/ols2008-gilad-ben-yossef-fault-handlers.ogg

[11]: https://github.com/gby/libcrash