ovirt-srv15 kernel panic due to KSM
Description
A short time after ovirt-srv15 (one of the Power8 hypervisors) was updated to CentOS 7.3, it went offline. Opening this case to troubleshoot. It was hosting 16 ppc64le Jenkins slaves. We still have slaves running on ovirt-srv16, so this is not critical at the moment.
Activity

Former user April 21, 2017 at 1:58 PM
Closing for now: we changed the KSM settings to work around the issue, and we have another, similar x86_64 ticket with a better reproducer.

Former user March 28, 2017 at 1:14 PM
Just to check whether it is the same issue, I've re-enabled KSM with merging across NUMA nodes to see if the system still crashes.
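For reference, this is toggled through KSM's sysfs interface; a minimal sketch of re-enabling cross-node merging (the exact values used aren't recorded in this ticket, so treat the specifics as assumptions):
echo 2 > /sys/kernel/mm/ksm/run                 # stop ksmd and unmerge all pages;
                                                # merge_across_nodes can only be
                                                # changed while nothing is merged
echo 1 > /sys/kernel/mm/ksm/merge_across_nodes  # allow merging across NUMA nodes
echo 1 > /sys/kernel/mm/ksm/run                 # start scanning again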

Former user January 6, 2017 at 2:50 PM
As this seems to be caused by KSM, I've disabled it on the respective cluster and booted the VMs back up. While working on this I also noticed that ovirt-srv15 had 2607 duplicate network interfaces in Foreman, so I deleted them all with an SQL command.
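The cleanup was along these lines (a sketch only; nics and hosts are Foreman's table names, but the duplicate criterion and the exact query are assumptions, not the command actually run):
su - postgres -c 'psql foreman' <<'EOF'
-- keep the oldest row per (host_id, mac) for ovirt-srv15, drop the rest
DELETE FROM nics
 WHERE host_id IN (SELECT id FROM hosts WHERE name LIKE 'ovirt-srv15%')
   AND id NOT IN (SELECT MIN(id) FROM nics GROUP BY host_id, mac);
EOF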

Former user January 6, 2017 at 2:29 PM
I added a "rescue" PXE entry to boot the box into a CentOS env,
from where I was able to successfully fix the FS by mounting it
to replay the log. Then checked it using xfs_repair to see that
there are no more errors. After this the bootloader detected the
OS successfully.
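In commands, the recovery was roughly this (a sketch from the rescue environment; /mnt as the mount point is an assumption, the device matches the fdisk output in the comment below):
mount /dev/sda2 /mnt      # mounting lets the kernel replay the dirty XFS log
umount /mnt
xfs_repair -n /dev/sda2   # no-modify check: reports any remaining errors without fixing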
A core dump was created, so this was caused by a kernel crash:
[ 2368.545615] Unable to handle kernel paging request for data at address 0x00000000
[ 2368.545622] Faulting instruction address: 0xc0000000002dcc10
[ 2368.545626] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2368.545628] SMP NR_CPUS=2048 NUMA PowerNV
[ 2368.545632] Modules linked in: vhost_net vhost macvtap macvlan ebt_arp ebtable_nat tun ebtable_filter ebtables ip6table_filter ip6_tables scsi_transport_iscsi xt_physdev br_netfilter ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport kvm_hv kvm xt_conntrack nf_conntrack iptable_filter softdog ext4 mbcache jbd2 ses enclosure scsi_transport_sas sg shpchp rtc_opal powernv_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c dm_service_time sd_mod sr_mod cdrom lpfc dm_multipath ipr libata crc_t10dif tg3 crct10dif_generic scsi_transport_fc ptp scsi_tgt pps_core crct10dif_common dm_mirror dm_region_hash dm_log dm_mod 8021q garp mrp bridge stp llc bonding
[ 2368.545673] CPU: 56 PID: 423 Comm: ksmd Not tainted 3.10.0-514.2.2.el7.ppc64le #1
[ 2368.545676] task: c000000fe580d3e0 ti: c000000fe58a8000 task.ti: c000000fe58a8000
[ 2368.545679] NIP: c0000000002dcc10 LR: c0000000002dcbf8 CTR: 0000000000000000
[ 2368.545682] REGS: c000000fe58ab920 TRAP: 0300 Not tainted (3.10.0-514.2.2.el7.ppc64le)
[ 2368.545684] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28002042 XER: 20000000
[ 2368.545692] CFAR: c000000000009368 DAR: 0000000000000000 DSISR: 42000000 SOFTE: 1
GPR00: c0000000002dcbf8 c000000fe58abba0 c0000000011a7c00 f000000002e5e908
GPR04: f000000002e5e908 0000000000000000 0000000000000000 0000000000000000
GPR08: c000000004553190 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000000002200 c000000007b5f800 c000000d88c51300 f00000000612ba98
GPR16: c000001cce95ef40 c000000fd9f0b008 6db6db6db6db6db7 c000000fe58a8000
GPR20: c000000d882ec280 1000000000000000 c000001bc3550000 c000000fe58a8000
GPR24: 0000000000000000 c000000d88efb488 fffffffffffff000 c00000000155b2f0
GPR28: c0000000010da360 f000000000000000 0000000000000001 0000000000000000
[ 2368.545726] NIP [c0000000002dcc10] ksm_do_scan+0xfb0/0x1c80
[ 2368.545729] LR [c0000000002dcbf8] ksm_do_scan+0xf98/0x1c80
[ 2368.545731] Call Trace:
[ 2368.545734] [c000000fe58abba0] [c0000000002dcbf8] ksm_do_scan+0xf98/0x1c80 (unreliable)
[ 2368.545738] [c000000fe58abce0] [c0000000002dda10] ksm_scan_thread+0x130/0x330
[ 2368.545741] [c000000fe58abd80] [c0000000001146ec] kthread+0xec/0x100
[ 2368.545745] [c000000fe58abe30] [c00000000000a47c] ret_from_kernel_thread+0x5c/0xe0
[ 2368.545748] Instruction dump:
[ 2368.545749] 60000000 4bfffa80 79240764 4bfffcec e8610020 4bf8c6f5 60000000 7faea040
[ 2368.545754] 419e012c e9340008 e9540010 2fa90000 <f92a0000> 419e0008 f9490008 e93b0000
[ 2368.545762] ---[ end trace 6f3f1790cfe9a4ea ]---
[ 2368.547152]
[ 2368.547157] Sending IPI to other CPUs
[ 2368.548161] IPI complete
I will log this as a bug, as we're likely to hit this situation again. The same applies to the XFS corruption issue: we may have to report it to IBM or test newer firmware for XFS log-format support.

Former user January 6, 2017 at 10:59 AM
pb-discover.log starts like this:
— pb-discover —
Detected platform type: powerpc
Running command:
exe: nvram
argv: 'nvram' '--print-config' '--partition' 'common'
configuration:
autoboot: enabled, 10 sec
boot priority order:
network: 2
disk: 1
language:
SKIP: sda: no ID_FS_TYPE property
SKIP: sda1: no ID_FS_TYPE property
mounting device /dev/sda2 read-only
couldn't mount device /dev/sda2: mount failed: Input/output error
mounting device /dev/sda3 read-only
couldn't mount device /dev/sda3: mount failed: No such device
SKIP: sdb: no ID_FS_TYPE property
mounting device /dev/sdb1 read-only
couldn't mount device /dev/sdb1: mount failed: No such device
SKIP: sdc: no ID_FS_TYPE property
...
We should be booting from sda:
fdisk -l /dev/sda
Disk /dev/sda: 283.7 GB, 283794997248 bytes
128 heads, 32 sectors/track, 135324 cylinders
Units = cylinders of 4096 * 512 = 2097152 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 3 4096 41 PPC PReP Boot
Partition 1 does not end on cylinder boundary
/dev/sda2 3 253 512000 83 Linux
Partition 2 does not end on cylinder boundary
/dev/sda3 253 135324 276626432 8e Linux LVM
Partition 3 does not end on cylinder boundary
And here's dmesg:
[ 38.414724] XFS (sda2): Mounting Filesystem
[ 38.492500] XFS (sda2): Starting recovery (logdev: internal)
[ 38.492851] XFS (sda2): dirty log written in incompatible format - can't recover
[ 38.492856] XFS (sda2): log mount/recovery failed: error 5
[ 38.492894] XFS (sda2): log mount failed
To me this looks like the XFS log on the boot partition is dirty, but petitboot's kernel version can't recover it and thus can't generate a boot entry.
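If this recurs, it's worth comparing the on-disk log format with what petitboot's kernel supports (a sketch run from the installed OS; xfs_info is part of xfsprogs, and sda2 being mounted at /boot is an assumption):
xfs_info /boot | grep 'log.*version'
# the version= value on the log line is the on-disk log format; an older
# petitboot kernel may be unable to replay a dirty log written in a newer format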
Details
Assignee: Former user (Deactivated)
Reporter: Former user (Deactivated)
Priority: Medium