Deactivating the module, and so falling back to softdog, would help. Since we have no debug symbols for the kernel (I did not find any package for this...), I could not use kdump to capture the panic. They replaced the PSP with another acronym for G7s and above) supports Emulex cards. That's Emulex rather than the HP-branded variety.
The only issue now is that one of my HP servers throws an NMI and panics as soon as it boots into the OS. intel_idle+0xe7/0x160 [ 5493.663438] [
Thank you!!! Rafael David Tinoco (inaddy) wrote on 2015-04-07: #11 Doing verification right now... After replacing the shell the issue still persisted. This happens at random, but mostly when we use live migration.
If you blacklist the watchdog module, the server does not panic but resets immediately. liu, March 23rd, 2015 at 10:01 pm: Hi, my error code is 0x00000002, This will help the support colleagues figure out what went wrong. An Unrecoverable System Error (NMI) has occurred (Service Information: 0x7fbce8f6, 0x00000000). Since then I have monitored the hardware from Onboard Administrator and nothing looks strange.
We have updated the drivers and firmware of the system board and replaced the system board and riser board, yet we still get the same failures after some days. It seems like if corosync wants to use them, which is why it would open /dev/watchdog, then there's either a corosync bug or something in the configuration that isn't right. If not, some conditions that would normally result in a graceful shutdown (typically overheating) could progress to the point where a forced reboot would be considered necessary. https://community.hpe.com/t5/ProLiant-Servers-ML-DL-SL/DL380p-Gen8-with-uncorrectabl-PCI-express-error/td-p/5995669 In one lab we have HP ProLiant servers with massive kernel panics in the hpwdt.ko module.
This is not the ANSWER to the reported bug, just a clarification of what the kernel team decided to do well before this case. Doesn't sound quite like the same issue. Systems are crashing with the following panic message: [ 5492.146364] Kernel panic - not syncing: An NMI occurred. Of course we shall recommend the HW watchdog interface for 2-node cluster setups, for example, when we can't rely on quorum policies and fencing mechanisms are not available (like external
The server keeps crashing, always with the same error message: Critical PCI Bus 03/13/2013 17:12 03/13/2013 17:12 1 Uncorrectable PCI Express Error (Embedded device, Bus 0, Device 2, Function 2, Error status https://bugs.launchpad.net/bugs/1432837 We have a Ceph cluster with 3 hosts and 3 monitors up and running in this lab, and everything seems to be quite good. sched_clock+0x9/0x10 [ 5493.224869] [
It is recommended that servers that require the ability to respond to an iLO-triggered NMI be updated to SLES 11 SP2 or later. https://www.suse.com/company/press/2012/2/suse-linux-enterprise-11-service-pack-2-released.html or http://www.novell.com/support/kb/doc.php?id=7012368 Additional resources on related information can be obtained from: Release Running the PIC (programmable interrupt controller) in XAPIC mode might not be compatible with firmware if the CPU supports X2APIC because of one of the only features that differs XAPIC from If the OS locks up hard, watchdog timers (if configured) would eventually trigger an NMI. This is why I suspected the USB drives. We already had a system board replacement, and after that the frequency went up.
Thank you! It looks like some users were using /dev/watchdog (from the hpwdt module) by accident and/or without even knowing (for example, running corosync and using the watchdog). Check: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837 for My issue is resolved on the older kernels. #15 adamb, Nov 11, 2015 Hello everybody! Thank you Rafael Tinoco
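To see whether something like corosync is holding /dev/watchdog open without the admin knowing, one option is to walk the per-process fd links under /proc. The following is only a rough sketch: the function name watchdog_holders and its parameters are made up for illustration, it only matches the exact path /dev/watchdog (not /dev/watchdog0), and it normally needs root to read other processes' fd directories.

```python
import os


def watchdog_holders(proc_root="/proc", device="/dev/watchdog"):
    """Return a sorted list of (pid, command name) that have `device` open.

    Hypothetical helper: it inspects the <proc_root>/<pid>/fd symlinks and
    silently skips processes whose fd directory cannot be read.
    """
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target == device:
                try:
                    with open(os.path.join(proc_root, entry, "comm")) as fh:
                        name = fh.read().strip()
                except OSError:
                    name = "?"
                holders.append((int(entry), name))
                break  # one hit per process is enough
    return sorted(holders)


if __name__ == "__main__":
    for pid, name in watchdog_holders():
        print(pid, name)
```

If this prints a process you did not expect, that is the one arming the hardware watchdog.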
SubDevice: pci 0x3245 "Smart Array P410i"
Revision: 0x01
Driver: "cciss"
Driver Modules: "cciss"
Driver Info #0:
Driver Status: cciss is active
Driver Activation Cmd: "modprobe cciss"
Driver Info #1:
Driver Status:
So we can conclude that this is related to the kernel bug with hpwdt, since echo "A" > /dev/watchdog will produce the kernel panic with hpwdt.ko loaded. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'.
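Before trying the echo test, it helps to confirm which watchdog driver is actually loaded, hpwdt or softdog. A minimal sketch, assuming the usual /proc/modules layout where the module name is the first whitespace-separated field on each line; the helper name loaded_watchdog_modules is just for illustration.

```python
WATCHDOG_MODULES = ("hpwdt", "softdog")


def loaded_watchdog_modules(proc_modules_text, names=WATCHDOG_MODULES):
    """Return the set of module names from `names` found in /proc/modules text."""
    loaded = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # First field of each /proc/modules line is the module name.
        if fields and fields[0] in names:
            loaded.add(fields[0])
    return loaded


if __name__ == "__main__":
    try:
        with open("/proc/modules") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not on Linux, or /proc not mounted
    print(sorted(loaded_watchdog_modules(text)))
```

If hpwdt shows up here, writing to /dev/watchdog goes through the HP driver discussed above, so do not run the echo test on a box you cannot afford to lose.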
ILO: "76 Critical System Error 03/12/2015 12:42 03/12/2015 12:07 2 An Unrecoverable System Error (NMI) has occurred (System error code 0x0000002B, 0x00000000)" Examples: PID: 0 TASK: ffffffff81c1a480 CPU: 0 COMMAND: "swapper/0" #0 [ffff88085fc05c88] machine_kexec at We have provided the following cmdline to be used: " intel_idle.max_cstate=0 ".
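After adding intel_idle.max_cstate=0 to the boot loader, it is worth checking that the running kernel really got it. A small sketch, assuming you feed it the contents of /proc/cmdline; kernel_param is a made-up helper name, and if a parameter is repeated this sketch simply returns the last occurrence.

```python
def kernel_param(cmdline, name):
    """Look up `name` in a kernel command line string.

    Returns the value for name=value, "" for a bare flag, None if absent.
    """
    result = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            result = value if sep else ""
    return result


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            cmdline = fh.read()
    except OSError:
        cmdline = ""  # not on Linux, or /proc not mounted
    print(kernel_param(cmdline, "intel_idle.max_cstate"))
```

A result of None means the parameter did not make it onto the command line of the booted kernel.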
09-22-2014, 04:48 AM #7 kaito.7 I believe you should be able to boot directly into hardware diagnostics via the Intelligent Provisioning Utility accessed via the iLO on a Gen 8 server (or would that be within
Interpreting (decoding) NMI sources from IML log messages Apr.25, 2009 in BladeSystem, Operations, ProLiant If you are using the HP health drivers for ProLiant servers (or at least the hp-wdt