Handle the NT_PRSTATUS lost for the "bt" command
----- Original Message -----
> The purpose of this patch is to work out "bt" command for the diskdump > which NT_PRSTATUS note could not be saved by IPI lost. > I think IPI is possibly lost by panic under the serious crashed condition. > > I noticed that "bt" failed in my ppc environment > when the NT_PRSTATUS notes are lost on some CPUs while IPI delivery. > Then, I made CPU map for prstatus in diskdump more correctable > by checking a validation of crash_notes field. > > I've tested this problem by patching kernel like, kernel/kexec.c > void crash_save_cpu(struct pt_regs *regs, int cpu) > { > + if (current->pid == 0) > + /* this cpu was idle; nothing to capture */ > + return; > > It looks terrible and impractical test case but actually I met this code > in my using distro's kernel. I couldn't reproduce actual IPI lost case, > then fortunately, use this as a example of the causes if IPI could not be > delivered to other CPUs. > > => Taking diskdump by sysrq+c and makedumpfile. > > crash> help -D | grep notes > num_prstatus_notes: 1 > notes_buf: 10ba91a8 > notes[0]: 10ba91a8 > crash> help -k | grep cpus > cpus: 8 > cpus_override: (null) > crash> bt > PID: 1001 TASK: ea62b000 CPU: 2 COMMAND: "bash" > Segmentation fault > > Since seven idle cpus did not save NT_PRSTATUS note, > crash could not handle CPU#2's note where is located as CPU#0's. > > With this patch, crash get to work out with correct CPU map to > prstatus. > > WARNING: catch lost crash_notes at cpu#0 > WARNING: catch lost crash_notes at cpu#1 > WARNING: catch lost crash_notes at cpu#3 > WARNING: catch lost crash_notes at cpu#4 > WARNING: catch lost crash_notes at cpu#5 > WARNING: catch lost crash_notes at cpu#6 > WARNING: catch lost crash_notes at cpu#7 > crash.fix> help -D | grep notes > num_prstatus_notes: 1 > notes_buf: 107a3378 > notes[2]: 107a3378 > crash.fix> help -k | grep cpus > cpus: 8 > cpus_override: (null) > crash.fix> bt > PID: 1001 TASK: ea62b000 CPU: 2 COMMAND: "bash" > > R0: 00000001 R1: eb793e60 R2: ea62b000 R3: 00000063 > R4: 00000000 R5: ffffffff R6: c043ba2c R7: 00000000 > R8: 00008000 R9: 00000000 R10: 00000000 R11: eb793e70 > R12: 28242444 R13: 100b8448 R14: 100b07b8 R15: 100b0894 > R16: 00000000 R17: 00000000 R18: 00000000 R19: 1006d270 > R20: 00000000 R21: 100f0430 R22: 00000000 R23: 00000001 > R24: c08f1ac8 R25: 00029002 R26: c08f1bac R27: c08d0000 > R28: 00000000 R29: c09ada48 R30: 00000063 R31: eb793e60 > NIP: c0423378 MSR: 00021002 OR3: c09ada48 CTR: c0423344 > LR: c0423d8c XER: 00000000 CCR: 28242444 MQ: 00008000 > DAR: 00000000 DSISR: 00800000 Syscall Result: eb793e60 > NIP [00000000c0423378] sysrq_handle_crash > LR [00000000c0423d8c] __handle_sysrq > > #0 [eb793e60] sysrq_handle_crash at c0423378 > : snip > > Thanks, > Toshi Toshi, I don't want to add any new initialization-time code -- especially if it's related to the NT_PRSTATUS notes -- that can abort a crash session unnecessarily. In your new crash_was_lost_crash_note() function, there are these two FAULT_ON_ERROR readmem() calls: readmem(symbol_value("crash_notes"), KVADDR, &crash_notes_ptr, sizeof(ulong), "crash_notes", FAULT_ON_ERROR); and readmem(crash_notes_ptr, KVADDR, buf, SIZE(note_buf), "cpu crash_notes", FAULT_ON_ERROR); Although they are highly unlikely to fail, can you please make both of them RETURN_ON_ERROR, and if the readmem() fails, have it bail out and return FALSE? And then, if necessary, make any adjustments to map_cpus_to_prstatus_kdump_cmprs() to handle that remote possibility. You should be able to test it with your patched kernel. Also, I don't understand the wording of this error message at the end of crash_was_lost_crash_note(): error(WARNING, "catch lost crash_notes at cpu#%d ", cpu); Can you re-word that? The notes were not "lost", but rather were "not saved" by the crashing system. Lastly, in __diskdump_memory_dump(), you just skip the "lost" notes sections: for (i = 0, j = 0; j < dd->num_prstatus_notes; i++) { if (dd->nt_prstatus_percpu[i] == NULL) continue; fprintf(fp, " notes[%d]: %lx ", i, (ulong)dd->nt_prstatus_percpu[i]); j++; } Can you make it more obvious, say, by displaying something like: notes[6]: (not saved) Thanks, Dave -- Crash-utility mailing list Crash-utility@redhat.com https://www.redhat.com/mailman/listinfo/crash-utility |
Handle the NT_PRSTATUS lost for the "bt" command
Hi Dave,
Thanks for your comments. (2012/06/19 1:01), Dave Anderson wrote: > I don't want to add any new initialization-time code -- especially if > it's related to the NT_PRSTATUS notes -- that can abort a crash session > unnecessarily. In your new crash_was_lost_crash_note() function, there > are these two FAULT_ON_ERROR readmem() calls: > > readmem(symbol_value("crash_notes"), KVADDR,&crash_notes_ptr, > sizeof(ulong), "crash_notes", FAULT_ON_ERROR); > and > > readmem(crash_notes_ptr, KVADDR, buf, SIZE(note_buf), > "cpu crash_notes", FAULT_ON_ERROR); > > Although they are highly unlikely to fail, can you please make > both of them RETURN_ON_ERROR, and if the readmem() fails, have > it bail out and return FALSE? I see, I'll make consideration about initialization-time rule from now, to keep crash session, should use readmem() with QUIET|RETURN_ON_ERROR. > And then, if necessary, make any > adjustments to map_cpus_to_prstatus_kdump_cmprs() to handle that > remote possibility. You should be able to test it with your > patched kernel. I'm not able to test for unexpected, deliberate readmem() failure because my core file contains "crash_notes" memory field... Although I've probably misunderstood your advice, I had better add more debugging messages into map_cpus_to_prstatus_kdump_cmprs() to handle conditions where imply "not saved NT_PRSTATUS" or "nt_prstatus_percpu[] is remapped". > Also, I don't understand the wording of this error message > at the end of crash_was_lost_crash_note(): > > error(WARNING, "catch lost crash_notes at cpu#%d ", cpu); > > Can you re-word that? The notes were not "lost", but rather were > "not saved" by the crashing system. I re-word all "lost" keywords, so too function name with "saved". And also make this warning valid only when CRASHDEBUG() is enable because we can check it later by using "help -D". > Lastly, in __diskdump_memory_dump(), you just skip the "lost" > notes sections: > > for (i = 0, j = 0; j< dd->num_prstatus_notes; i++) { > if (dd->nt_prstatus_percpu[i] == NULL) > continue; > fprintf(fp, " notes[%d]: %lx ", > i, (ulong)dd->nt_prstatus_percpu[i]); > j++; > } > > Can you make it more obvious, say, by displaying something like: > > notes[6]: (not saved) Looks better, thanks for giving good hint. [updates by attached patch] - display messages only if CRASHDEBUG() >= 1 crash -d 1 => display about NT_PRSTATUS mapping messages as below. ---------------- WARNING: cpu#0: not saved crash_notes WARNING: cpu#1: not saved crash_notes crash: NT_PRSTATUS: cpu#2 = 107a34a8 (remapped from cpu#0) WARNING: cpu#3: not saved crash_notes WARNING: cpu#4: not saved crash_notes WARNING: cpu#5: not saved crash_notes WARNING: cpu#6: not saved crash_notes WARNING: cpu#7: not saved crash_notes ---------------- - help message is enhanced crash> help -D | grep notes num_prstatus_notes: 1 notes_buf: 107a34a8 notes[0]: (not saved) notes[1]: (not saved) notes[2]: 107a34a8 notes[3]: (not saved) notes[4]: (not saved) notes[5]: (not saved) notes[6]: (not saved) notes[7]: (not saved) Thanks, Toshi > Thanks, > Dave > > > -- > Crash-utility mailing list > Crash-utility@redhat.com > https://www.redhat.com/mailman/listinfo/crash-utility > -- Crash-utility mailing list Crash-utility@redhat.com https://www.redhat.com/mailman/listinfo/crash-utility |
Handle the NT_PRSTATUS lost for the "bt" command
----- Original Message -----
> Hi Dave, > > Thanks for your comments. > > (2012/06/19 1:01), Dave Anderson wrote: > > > I don't want to add any new initialization-time code -- especially if > > it's related to the NT_PRSTATUS notes -- that can abort a crash session > > unnecessarily. In your new crash_was_lost_crash_note() function, there > > are these two FAULT_ON_ERROR readmem() calls: > > > > readmem(symbol_value("crash_notes"), KVADDR,&crash_notes_ptr, > > sizeof(ulong), "crash_notes", FAULT_ON_ERROR); > > and > > > > readmem(crash_notes_ptr, KVADDR, buf, SIZE(note_buf), > > "cpu crash_notes", FAULT_ON_ERROR); > > > > Although they are highly unlikely to fail, can you please make > > both of them RETURN_ON_ERROR, and if the readmem() fails, have > > it bail out and return FALSE? > > I see, I'll make consideration about initialization-time rule from now, > to keep crash session, should use readmem() with QUIET|RETURN_ON_ERROR. > > > And then, if necessary, make any adjustments to map_cpus_to_prstatus_kdump_cmprs() > > to handle that remote possibility. You should be able to test it with your > > patched kernel. > > I'm not able to test for unexpected, deliberate readmem() failure > because my core file contains "crash_notes" memory field... > > Although I've probably misunderstood your advice, > I had better add more debugging messages into > map_cpus_to_prstatus_kdump_cmprs() to handle conditions > where imply "not saved NT_PRSTATUS" or "nt_prstatus_percpu[] is > remapped". > > > Also, I don't understand the wording of this error message > > at the end of crash_was_lost_crash_note(): > > > > error(WARNING, "catch lost crash_notes at cpu#%d ", cpu); > > > > Can you re-word that? The notes were not "lost", but rather were > > "not saved" by the crashing system. > > I re-word all "lost" keywords, so too function name with "saved". > And also make this warning valid only when CRASHDEBUG() is enable > because we can check it later by using "help -D". > > > Lastly, in __diskdump_memory_dump(), you just skip the "lost" > > notes sections: > > > > for (i = 0, j = 0; j< dd->num_prstatus_notes; i++) { > > if (dd->nt_prstatus_percpu[i] == NULL) > > continue; > > fprintf(fp, " notes[%d]: %lx ", > > i, (ulong)dd->nt_prstatus_percpu[i]); > > j++; > > } > > > > Can you make it more obvious, say, by displaying something like: > > > > notes[6]: (not saved) > > Looks better, thanks for giving good hint. > > [updates by attached patch] > > - display messages only if CRASHDEBUG() >= 1 > crash -d 1 > => display about NT_PRSTATUS mapping messages as below. > ---------------- > WARNING: cpu#0: not saved crash_notes > WARNING: cpu#1: not saved crash_notes > crash: NT_PRSTATUS: cpu#2 = 107a34a8 (remapped from cpu#0) > WARNING: cpu#3: not saved crash_notes > WARNING: cpu#4: not saved crash_notes > WARNING: cpu#5: not saved crash_notes > WARNING: cpu#6: not saved crash_notes > WARNING: cpu#7: not saved crash_notes > ---------------- > > - help message is enhanced > crash> help -D | grep notes > num_prstatus_notes: 1 > notes_buf: 107a34a8 > notes[0]: (not saved) > notes[1]: (not saved) > notes[2]: 107a34a8 > notes[3]: (not saved) > notes[4]: (not saved) > notes[5]: (not saved) > notes[6]: (not saved) > notes[7]: (not saved) > > Thanks, > Toshi OK, now I'm getting confused... I note that currently __diskdump_memory_dump() does this: for (i = 0; i < dd->num_prstatus_notes; i++) { fprintf(fp, " notes[%d]: %lx ", i, (ulong)dd->nt_prstatus_percpu[i]); } But your patch does this: nrcpus = kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS; for (i = 0; i < nrcpus; i++) { if (dd->nt_prstatus_percpu[i] == NULL) { fprintf(fp, " notes[%d]: (not saved) ", i); continue; } fprintf(fp, " notes[%d]: %lx ", i, (ulong)dd->nt_prstatus_percpu[i]); } Now surely we don't want to dump "NR_CPUS" notes pointers, unless there actually are that many cpus in the system. For example, in RHEL6 sets CONFIG_NR_CPUS to 4096 for x86_64, but rarely do we see a dumpfile with that many cpus. So I would prefer that the for loop continue to be bounded by "dd->num_prstatus_notes". However, given that process_elf32_notes() and process_elf64_notes() do a cursory test of each note by checking for the NT_PRSTATUS type, I'm presuming that in your dumpfile, dd->num_prstatus_notes must be equal to 1, correct? void process_elf64_notes(void *note_buf, unsigned long size_note) { Elf64_Nhdr *nt; size_t index, len = 0; int num = 0; for (index = 0; index < size_note; index += len) { nt = note_buf + index; if(nt->n_type == NT_PRSTATUS) { dd->nt_prstatus_percpu[num] = nt; num++; } len = sizeof(Elf64_Nhdr); len = roundup(len + nt->n_namesz, 4); len = roundup(len + nt->n_descsz, 4); } if (num > 0) { pc->flags2 |= ELF_NOTES; dd->num_prstatus_notes = num; } return; } But then map_cpus_to_prstatus_kdump_cmprs() remaps the notes without modifying the dd->num_prstatus_notes counter. So then there's this: void * diskdump_get_prstatus_percpu(int cpu) { if ((cpu < 0) || (cpu >= dd->num_prstatus_notes)) return NULL; return dd->nt_prstatus_percpu[cpu]; } So in a case such as your example, where cpu 2 was the only cpu that saved its notes, the function above would incorrectly return NULL when called with diskdump_get_prstatus_percpu(2). What am I missing? Dave -- Crash-utility mailing list Crash-utility@redhat.com https://www.redhat.com/mailman/listinfo/crash-utility |
Handle the NT_PRSTATUS lost for the "bt" command
----- Original Message -----
> > OK, now I'm getting confused... > The more I look at this patch, the more confused I get... During initialization, the ELF notes contained in the dumpfile file header are scanned, and if an NT_PRSTATUS note is seen, a pointer to its location in the dumpfile is saved in dd->nt_prstatus_percpu[num] and the "num" of valid notes is kept in dd->num_prstatus_notes. If the dd->num_prstatus_notes is equal to the online cpu count, then it is presumed that there is a one-to-one relationship, where the cpu number can be used as the index into the dd->nt_prstatus_percpu[num] array. If the number of notes is not equal to the number of online cpus, then the "mapping" function is called, where if a cpu is found to be offline, then its (incorrectly) associated entry in the dd->nt_prstatus_percpu[num] array is "pushed up" to the next higher entry. But the dd->num_prstatus_notes does not seem to get incremented to reflect that move, so then it's seems like diskdump_get_prstatus_percpu() can possibly return NULL when there actually is a relevant NT_PRSTATUS note. That seems to be a bug (?), but it's not particularly important, because for x86 and x86_64, the data in the NT_PRSTATUS notes is only used if the starting point for backtraces if the PC/SP pair cannot be determined otherwise, which is the case virtually all of the time. So the registers found in the NT_PRSTATUS notes are pretty much useless... Now, to complicate matters, your patch does not look at the NT_PRSTATUS notes in the dumpfile header, but instead looks at the base kernel's original notes, and verifies their contents, and correlates what's found there against what was found in the dumpfile? So I don't understand what you are attempting to do -- what is the difference between the notes that are copied into the dumpfile vs. what you are looking at in the base kernel? I'm also wondering what would happen in your case if there were a combination of "lost" notes *and* offline cpus? How would that work? So at this point I really don't want to add this patch at all because it touches common code, and I don't want to risk breaking the other arches. Nobody has ever reported any "lost" cpus so far, probably because the kdump facility uses non-maskable NMI's to shutdown the non-panicking cpus. This is such a highly-unlikely corner case, that it does not even seem worth addressing for fear of breaking something else. I didn't look at the reasoning behind why you ran into a segmentation violation, but since the PPC code path would be: ... back_trace() get_diskdump_regs() get_diskdump_regs_ppc() perhaps you can rework your patch so that it is segregated to PPC only? Dave -- Crash-utility mailing list Crash-utility@redhat.com https://www.redhat.com/mailman/listinfo/crash-utility |
Handle the NT_PRSTATUS lost for the "bt" command
(2012/06/20 1:20), Dave Anderson wrote:
> OK, now I'm getting confused... > > I note that currently __diskdump_memory_dump() does this: > > for (i = 0; i< dd->num_prstatus_notes; i++) { > fprintf(fp, " notes[%d]: %lx ", > i, (ulong)dd->nt_prstatus_percpu[i]); > } > > But your patch does this: > > nrcpus = kt->kernel_NR_CPUS ? kt->kernel_NR_CPUS : NR_CPUS; > > for (i = 0; i< nrcpus; i++) { > if (dd->nt_prstatus_percpu[i] == NULL) { > fprintf(fp, > " notes[%d]: (not saved) ", > i); > continue; > } > fprintf(fp, " notes[%d]: %lx ", > i, (ulong)dd->nt_prstatus_percpu[i]); > } > > Now surely we don't want to dump "NR_CPUS" notes pointers, > unless there actually are that many cpus in the system. > For example, in RHEL6 sets CONFIG_NR_CPUS to 4096 for x86_64, > but rarely do we see a dumpfile with that many cpus. > So I would prefer that the for loop continue to be bounded > by "dd->num_prstatus_notes". I also bothered about condition whether loop should break at dd->nt_prstatus_percpu or not, even though remaining online cpu's notes can't display as "not saved". However, I choised nrcpus way since my environment could break with online cpu nums, not NR_CPUS. Now, I check kernel_NR_CPUS validation. if (STREQ(ln, "CONFIG_NR_CPUS")) { kt->kernel_NR_CPUS = atoi(val); I'm aware if ikconfig is not valid, break condition of loop is always NR_CPUS without doubt and display gets noisy like as your counsel. > However, given that process_elf32_notes() and process_elf64_notes() > do a cursory test of each note by checking for the NT_PRSTATUS > type, I'm presuming that in your dumpfile, dd->num_prstatus_notes > must be equal to 1, correct? Yes, dd->num_prstatus_notes in my dumpfile is 1. > void > process_elf64_notes(void *note_buf, unsigned long size_note) > { > Elf64_Nhdr *nt; > size_t index, len = 0; > int num = 0; > > for (index = 0; index< size_note; index += len) { > nt = note_buf + index; > > if(nt->n_type == NT_PRSTATUS) { > dd->nt_prstatus_percpu[num] = nt; > num++; > } > len = sizeof(Elf64_Nhdr); > len = roundup(len + nt->n_namesz, 4); > len = roundup(len + nt->n_descsz, 4); > } > > if (num> 0) { > pc->flags2 |= ELF_NOTES; > dd->num_prstatus_notes = num; > } > return; > } > > But then map_cpus_to_prstatus_kdump_cmprs() remaps the > notes without modifying the dd->num_prstatus_notes counter. > > So then there's this: > > void * > diskdump_get_prstatus_percpu(int cpu) > { > if ((cpu< 0) || (cpu>= dd->num_prstatus_notes)) > return NULL; > > return dd->nt_prstatus_percpu[cpu]; > } > > So in a case such as your example, where cpu 2 was the only cpu that saved > its notes, the function above would incorrectly return NULL when called > with diskdump_get_prstatus_percpu(2). > > What am I missing? No, you are entirely right and I could not condier about x86 parts. I'm getting the feeling that there might be more affects for any other arches. I reconsider different approach to fix NT_PRSTATUS up. Thanks, Toshi > Dave > > -- > Crash-utility mailing list > Crash-utility@redhat.com > https://www.redhat.com/mailman/listinfo/crash-utility > -- Crash-utility mailing list Crash-utility@redhat.com https://www.redhat.com/mailman/listinfo/crash-utility |
Handle the NT_PRSTATUS lost for the "bt" command
(2012/06/20 3:25), Dave Anderson wrote:
> > > ----- Original Message ----- > >> >> OK, now I'm getting confused... >> > > The more I look at this patch, the more confused I get... > > During initialization, the ELF notes contained in the > dumpfile file header are scanned, and if an NT_PRSTATUS > note is seen, a pointer to its location in the dumpfile > is saved in dd->nt_prstatus_percpu[num] and the "num" > of valid notes is kept in dd->num_prstatus_notes. > > If the dd->num_prstatus_notes is equal to the online cpu > count, then it is presumed that there is a one-to-one > relationship, where the cpu number can be used as the > index into the dd->nt_prstatus_percpu[num] array. > > If the number of notes is not equal to the number of online > cpus, then the "mapping" function is called, where if a > cpu is found to be offline, then its (incorrectly) associated > entry in the dd->nt_prstatus_percpu[num] array is "pushed up" > to the next higher entry. But the dd->num_prstatus_notes > does not seem to get incremented to reflect that move, so > then it's seems like diskdump_get_prstatus_percpu() can > possibly return NULL when there actually is a relevant > NT_PRSTATUS note. A dd->num_prstatus_notes is equal to number of NT_PRSTATUS notes in dump file and unless nt_prstatus_percpu[] is continuous from 0, diskdump_get_prstatus_percpu() can possibly return NULL. > That seems to be a bug (?), but it's not particularly important, > because for x86 and x86_64, the data in the NT_PRSTATUS notes is > only used if the starting point for backtraces if the PC/SP pair > cannot be determined otherwise, which is the case virtually all of > the time. So the registers found in the NT_PRSTATUS notes are > pretty much useless... I seem to be minor bug, if some of cpus are offline while panic, can't backtrace the same number of cpus from tail because of returning NULL. I think current task's PC/SP can not obtain from thread_struct corectly, then NT_PRSTATUS notes need to be stored for them instead. But anyway, panic under offline is very unlikely condition as same as the NMI failed. > Now, to complicate matters, your patch does not look at the > NT_PRSTATUS notes in the dumpfile header, but instead looks > at the base kernel's original notes, and verifies their > contents, and correlates what's found there against what was > found in the dumpfile? So I don't understand what you are > attempting to do -- what is the difference between the notes > that are copied into the dumpfile vs. what you are looking at > in the base kernel? At first, my patch is trying to find out tainted note area in base kernel. Such notes can not be found out from dumpfile, however, found out other ones are set continuously into nt_prstatus_percpu[from 0 to found out notes]. I just want to repaire nt_prstatus_percpu[#index] corresponding to cpu#index. [example] cpu#0 cpu#1 cpu#2 cpu#3 saved tainted saved tainted [by looking at base kernel] => nt_prstatus_percpu[0] = cpu#0's note, nt_prstatus_percpu[1] = cpu#3's note I want to repaire array like, => nt_prstatus_percpu[0] = cpu#0's note, nt_prstatus_percpu[3] = cpu#3's note Since ELF note format dose not contain information which cpu saved me, I think there is no way to identify mapping between cpu and nt_prstatus_percpu[] from ELF note when both are not equal to. That's why I try to resolve from base kernel area. > I'm also wondering what would happen in your case if there > were a combination of "lost" notes *and* offline cpus? How > would that work? I think offline cpus never ack IPI, thus lost notes everytime(not concerned). I only assume the case that cpu was online but could not ack IPI caused by any reasons in interrupts off. > So at this point I really don't want to add this patch > at all because it touches common code, and I don't want to > risk breaking the other arches. Nobody has ever reported > any "lost" cpus so far, probably because the kdump facility > uses non-maskable NMI's to shutdown the non-panicking cpus. > This is such a highly-unlikely corner case, that it does > not even seem worth addressing for fear of breaking something > else. I can agree, x86 architecture own NMI which is most reliable way to deliver interrupts to other cpus while powerpc may not own NMI. I am going to rework patch so that it can be effective only in ppc arch. > I didn't look at the reasoning behind why you ran into a > segmentation violation, but since the PPC code path would be: > > ... > back_trace() > get_diskdump_regs() > get_diskdump_regs_ppc() get_diskdump_regs_ppc() is touching dd->nt_prstatus_percpu[bt->tc->processor] when "bt" for panic_task. --------- if (KDUMP_CMPRS_VALID() && (bt->task == tt->panic_task || (is_task_active(bt->task) && (dd->num_prstatus_notes > 1)))) { note = (Elf32_Nhdr*) dd->nt_prstatus_percpu[bt->tc->processor]; if (!note) error(FATAL, "cannot determine NT_PRSTATUS ELF note " "for %s task: %lx ", (bt->task == tt->panic_task) ? "panic" : "active", bt->task); len = sizeof(Elf32_Nhdr); len = roundup(len + note->n_namesz, 4); bt->machdep = (void *)((char *)note + len + MEMBER_OFFSET("elf_prstatus", "pr_reg")); } machdep->get_stack_frame(bt, eip, esp); --------- In my dumpfile case, bt->tc->processor is 2 and its note is contained at dd->nt_prstatus_percpu[0]. note = (Elf32_Nhdr*) dd->nt_prstatus_percpu[2]; * Here is no note instance which "bt" expects. A dd->nt_prstatus_percpu[] is allocated by malloc() and not initialied with 0. When note->n_namesz is executed, it is cause of segmentation violation. This malloc() is moved to calloc() in my first patch. Other problem which is also appeared in get_netdump_regs_ppc(). A dd->num_prstatus_notes is 1 in my dumpfile although there were 8 online cpus. If next condition "is_task_active(bt->task)" is required by "bt", get_diskdump_regs_ppc() unfortunately do machdep->get_stack_frame() for active tasks and surely failed. "cannot determine NT_PRSTATUS ELF note ..." should be proper. > perhaps you can rework your patch so that it is segregated > to PPC only? Yes I'm going to rework, and thanks for your reviews. I'm modifying difficult parts this time, very helpful. I'll do more tests before sending reworked PPC only patch. I also have to study about kexectools or makedumpfile more about how NT_PRSTATUS or others are treated, haven't understood enough. Toshi > Dave > > -- > Crash-utility mailing list > Crash-utility@redhat.com > https://www.redhat.com/mailman/listinfo/crash-utility > -- Crash-utility mailing list Crash-utility@redhat.com https://www.redhat.com/mailman/listinfo/crash-utility |
Handle the NT_PRSTATUS lost for the "bt" command
----- Original Message -----
> > That seems to be a bug (?), but it's not particularly important, > > because for x86 and x86_64, the data in the NT_PRSTATUS notes is > > only used if the starting point for backtraces if the PC/SP pair > > cannot be determined otherwise, which is the case virtually all of > > the time. So the registers found in the NT_PRSTATUS notes are > > pretty much useless... > > I seem to be minor bug, if some of cpus are offline while panic, > can't backtrace the same number of cpus from tail because of returning NULL. > I think current task's PC/SP can not obtain from thread_struct corectly, > then NT_PRSTATUS notes need to be stored for them instead. For the active tasks, the PC/SP starting points can typically be found by looking at the stack contents. The NT_PRSTATUS notes are a relatively recent addition to compressed kdumps, i.e., have only existed since header version 4. Only in very rare circumstances do the x86 and x86_64 arches need to utilize the registers in the compressed kdump NT_PRSTATUS notes. That being said, I will fix the invalid value that is stored in dd->num_prstatus_notes so that diskdump_get_prstatus_percpu() will work as expected if there are offline cpus. That function is currently used only by x86 and x86_64, so there should be no issues with the other arches. > Yes I'm going to rework, and thanks for your reviews. > I'm modifying difficult parts this time, very helpful. > > I'll do more tests before sending reworked PPC only patch. > I also have to study about kexectools or makedumpfile more about > how NT_PRSTATUS or others are treated, haven't understood enough. OK thanks, Dave -- Crash-utility mailing list Crash-utility@redhat.com https://www.redhat.com/mailman/listinfo/crash-utility |
| All times are GMT. The time now is 04:22 PM. |
VBulletin, Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.