In <email@example.com>, Borden Rhodes wrote:
>1) Is there a way to apply debugging symbols retroactively to a dump? A few
>times I've had Linux crash on me and spit out a debugging dump. I do my
>best to install debugging symbols for all 1400 packages I have on my system
>(when I can find them) but this requires a huge amount of hard disk space
>and, invariably, the odd dump is missing symbols. Recreating the crash
>isn't always possible. Is there (or could someone invent) a way to save a
>dump without the symbols, download the symbol tables and then regenerate
>the dump with the symbols so it's useful to developers?
Yes, sometimes it is possible. Ubuntu has a process that does it
automatically, and it mostly gets it right.
Modern versions of "strip" et al. allow you to save the debugging information
to a separate file that contains only debugging information. gdb (et al.) can
then use that debug-info-only file to decorate an existing backtrace.
This is actually how a lot of distributions produce their separate -dbg or
-DEBUG packages.
However, these debug-info-only files only match the *same exact build*
of the real .so. Taking a random backtrace, determining which build it came
from, and finding the appropriate -dbg packages is a bit difficult.
Also, tools like prelink, which modify existing .so files, result in the
debug-info-only .so no longer matching. This might also happen with some types
of hardening that reduce the impact of heap/stack overflow/underflow attacks.
Compounding this problem is the large number of programs being written partly
in "scripting" languages, or otherwise in non-C/C++ languages, where the path
from a symbol in an ELF file to the problematic code is not as direct.
In short, it can be done in some cases, and there are programmers working on
making backtraces from Joe Sixpack or Jane Boxwine more useful. It does seem
like more people could be working on this, but it is not very "sexy" work.
Most programmers would rather spend their time improving the user experience
when things are working; IME, that is where the user spends most of their
time.
>2) I find that the logs contain lots of facts but not a whole lot of useful
>information (if any) when something goes wrong. I've had KDE go
>black-screen on me, for example, and force a hard reboot but there's no
>mention whatsoever (that I can find) in xorg.log, kdm.log, messages, syslog
>or dmesg. Windows seems to be fairly good at making its last breath a stop
>error before it dies which means when I get back into the system (or when
>I'm looking at a client's computer days after) I can find that stop error,
>look it up and figure out what went wrong. Are Linux's logs designed for
>troubleshooting or only for monitoring? Are proper troubleshooting logs
>kept somewhere else or in a special file? Is there a guide on how to read
>Linux's logs so I can make sense out of them like I can Windows' logs?
In the case of a kernel crash, the last breath of the system is unfortunately
not writing to dmesg/syslog and sync()ing the disks. Depending on the nature
of the crash, there are some good reasons not to do this, though. (E.g., in
the case of a panic(), the kernel developer is basically indicating that the
kernel image has been compromised -- doing FS operations with a compromised
kernel might cause [more] data loss.)
I think that logs in general are... dropping in quality. They seem to be less
focused on failed "sanity" checks, misconfiguration warnings, and
I-was-here-before-I-called-exit() messages. They seem more filled with
I-didn't-comment-this-out-before-our-release debugging messages for random
developers. This is not true of kernel logs for the most part; I find them
informative, but it is rarely my kernel that causes me problems.
I speak as someone who has been working as a developer in some capacity for 8
years. Take that for what you will.
>3) Linux needs better troubleshooting and recovery systems. The answer I
>usually get when I get an unexplained error is to run the program inside a
>dbg or with valgrind. I'm not convinced that this is a practical way to
>troubleshoot serious problems (like kernel panics) and it requires a
>certain amount of foresight that a problem will occur. According to this
>logic, the only way that someone can produce useful reports and feedback
>(or even get a clue as to what happened) on the day-to-day crashes and bugs
>is to start Linux and all of its sub process inside valgrind and/or gdb.
>This is obviously not an intended use of these programs.
If we don't know how to reproduce the problem, we can't fix it. If we do know
how to reproduce the problem, the foresight needed to use gdb/valgrind is not
too much more. They shouldn't be your first tools, but they are necessary.
I've also had gdb/valgrind mask errors, which is truly unfortunate. Still, if
you know a way to make it crash every time EXCEPT when in gdb/valgrind, that
tells me something as a developer.
NB: I've never had gdb/valgrind help with kernel errors, since they generally
live in user space.
Being able to reproduce the error is the *most important* step. IME, there
are very few problems that can't be fixed/worked-around in 8 man-hours once
you can reproduce the problem in under 15 minutes.
Also, if you have an unreproducible problem, I'm gonna blame the hardware or
cosmic radiation, not the code.
>1) Logs need to have useful information.
>When I look at a client's Windows
>box days after they report something going wrong, the logs tell me at what
>time the problem happened, which process failed and what error it threw just
>before it blew. I can look those error codes up and (usually) fix the
>problem within an hour.
As a less homogeneous environment, there's no ultimate table of error codes to
consult.
>When something dies on Linux, the log entry
>(assuming it even makes one) only tells me how many seconds into that
>particular boot the problem occurred. I've never been able to go back a few
>days later and find the log entries related to a particular crash - maybe
>because they've been purged.
I've still got logs from 2009 on my currently running desktop. They *have*
been archived, but they are still available. You should check your logrotate
settings to make sure your logs are being handled the way you'd like.
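Rotated logs usually just sit gzipped next to the live file, and zgrep
searches both in one pass. A toy layout (the paths and messages here are made
up) to illustrate:

```shell
# simulate a rotated-log layout: one live file, one gzipped archive
mkdir -p demo/log
echo "Jan 10 host kernel: old crash" | gzip > demo/log/syslog.2.gz
echo "Mar 05 host kernel: new crash" > demo/log/syslog

# zgrep transparently decompresses .gz files, so one command covers both
zgrep -h "crash" demo/log/syslog demo/log/syslog.2.gz
```

On a real system the equivalent would be something like
`zgrep "segfault" /var/log/syslog*`.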
>I know that the Linux tradition is to identify
>processes only by ID but surely there must be a way that it can print a
>file or package name or anything more useful than memory addresses and
>registers so at least I know where to start pointing fingers.
The kernel doesn't know about packages. It does know about files, but once
the process is running, it doesn't identify the file using a pathname. As the
process is dying, it is difficult to extract accurate information,
particularly if it has already "eaten" its own memory image.
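While a process is still alive, though, /proc can recover the backing file
from the pid. A quick illustration (assumes a Linux system with /proc
mounted):

```shell
# start a throwaway background process; the shell gives us its pid
sleep 30 &
pid=$!

# /proc/<pid>/exe is a symlink back to the executable the kernel loaded
readlink "/proc/$pid/exe"   # e.g. /usr/bin/sleep

kill "$pid"
```

Once the process is gone, that symlink is gone with it, which is part of why
post-mortem identification is harder.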
>people have told me that it's pointless trying to debug a dump in the logs.
> What's the point of dumping it in the first place if nobody can read it?
It is a place to start, but it's not a very good one. A kdump or corefile is
usually much better. A backtrace tells you a set of functions to look at for
obvious errors; a kdump or corefile allows you to inspect local variables and
determine exactly which of your assumptions was violated.
>2) I wish error logs had simple codes or messages (which have documentation)
>like Windows Stop errors so I can look them up and figure out why something
>died. Often times I try to Google the whole error message and either get
>directed to source code or totally irrelevant postings (since it seems that
>many messages are reused for all kinds of problems). For example,
>'segfault' gets thrown so much that it only tells you that the program
>crashed - something I already know.
segfault is a very specific type of crash: a process attempted to access a
memory address that was either not mapped or was mapped without the required
permissions. (Trying to move the IP to a page that is mapped NOEXEC, trying
to write to a read-only mmap(), or even a simple dereference of a NULL
pointer.)
Unfortunately, it is the most common type of hard crash. It can be caused by a
multitude of programming errors. If your program is not segfaulting, it can
likely recover in some meaningful way, or at least write a log message and
cleanly exit. If it is segfaulting, there is relatively little you can do; a
signal handler in C isn't allowed to call all of the library functions, and
returning from the SIGSEGV handler causes the program to terminate or
immediately get the signal again, so you can't just set a flag.
Error codes and fixed error messages are established after the main body of
code is written, so they can be standardized throughout the body of the code
and documented. However, with release early, release often being the mantra
of many projects, that level of "freeze" never happens. New error messages
and conditions are added all the time, and (more often than not) old error
messages and conditions go away when recovery code is added.
>xorg.conf files (which are depreciated
It's not deprecated. xorg.conf is *the* correct place to configure your
Xorg. However, one of the goals of Xorg is to have enough auto-detection and
dynamic re-configuration that an empty (or missing) xorg.conf is enough for
most users.
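That also means a partial xorg.conf can override just one detected setting and
leave the rest to auto-detection. A minimal sketch (the identifier is
arbitrary, and "modesetting" is only an example driver):

```
Section "Device"
    Identifier "Card0"
    Driver     "modesetting"   # override only the driver; everything else stays auto-detected
EndSection
```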
>there one log that only deals with hardware status and changes, another one
>that only deals with network status and firewall logging, another one which
>only deals with dumps and crashes and so on?
There are a fixed number of "syslog" facilities, but they were designed in the
days of AT&T UNIX, so not all of them are entirely relevant. It seems like
Linux could probably add some more, but portable programs wouldn't be able to
use them. Plus, a lot of programs don't log via syslog() anymore anyway.
Anyway, it could be a lot better, I agree. I seem to remember that Debian and
most upstream projects do accept volunteers.
Boyd Stephen Smith Jr. ,= ,-_-. =.
firstname.lastname@example.org ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy `-'(. .)`-'