ovirt-srv07 crashed due to KSM
Description
Activity
Former user March 28, 2017 at 12:55 PM
Fixed by setting KSM to merge pages across NUMA nodes. The upstream bug is still open for the crash in the other mode (no merging across nodes). Closing for now.
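For reference, a minimal sketch of how the setting could be switched from a script. It assumes the standard KSM sysfs knobs `run` and `merge_across_nodes` under `/sys/kernel/mm/ksm`; the helper names are made up for illustration and this is not necessarily how the change was applied on the host. The kernel only accepts a new `merge_across_nodes` value while no pages are shared, so the sketch unmerges everything first (`run` = 2) and then re-enables KSM.

```python
from pathlib import Path

# Standard KSM sysfs directory; passed as a parameter so the helpers can be
# pointed at a scratch directory for a dry run.
KSM_DIR = Path("/sys/kernel/mm/ksm")


def read_knob(name, ksm_dir=KSM_DIR):
    """Return the current value of a KSM knob as a stripped string."""
    return (Path(ksm_dir) / name).read_text().strip()


def write_knob(name, value, ksm_dir=KSM_DIR):
    """Write a value to a KSM knob (needs root on a real host)."""
    (Path(ksm_dir) / name).write_text(f"{value}\n")


def set_merge_across_nodes(value, ksm_dir=KSM_DIR):
    """Hypothetical helper: switch merge_across_nodes to 0 or 1.

    Returns True if the knob was changed, False if it already had the value.
    """
    if value not in (0, 1):
        raise ValueError("merge_across_nodes must be 0 or 1")
    if read_knob("merge_across_nodes", ksm_dir) == str(value):
        return False
    write_knob("run", 2, ksm_dir)                    # stop ksmd and unmerge all pages
    write_knob("merge_across_nodes", value, ksm_dir)
    write_knob("run", 1, ksm_dir)                    # start merging again
    return True
```

Calling `set_merge_across_nodes(1)` as root would apply the change on a live host. The value does not persist across reboots, and a tuning service that manages KSM on the host may alter the `run` state afterwards.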
Former user March 17, 2017 at 8:49 AM
A faulty DIMM was replaced, after which I updated the system and started the VMs again (with KSM disabled), yet the box crashed again:
[21213.043816] ------------[ cut here ]------------
[21213.048972] kernel BUG at mm/ksm.c:611!
[21213.053250] invalid opcode: 0000 [#1] SMP
[21213.057836] Modules linked in: vhost_net vhost macvtap macvlan ebt_arp ebtable_nat tun ebtable_filter ebtables ip6table_filter ip6_tables scsi_transport_iscsi xt_physdev br_netfilter ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter intel_powerclamp coretemp kvm_intel kvm irqbypass dm_service_time crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_devintf acpi_power_meter sg acpi_pad iTCO_wdt iTCO_vendor_support wmi dcdbas mei_me pcspkr sb_edac ipmi_si ipmi_msghandler lpc_ich mei shpchp edac_core dm_multipath dm_mod nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul
[21213.137165] sysimgblt crct10dif_common fb_sys_fops crc32c_intel ttm ahci drm libahci libata i2c_core tg3 megaraid_sas ptp pps_core fjes 8021q garp mrp bridge stp llc bonding
[21213.153283] CPU: 10 PID: 186 Comm: ksmd Not tainted 3.10.0-514.10.2.el7.x86_64 #1
[21213.161632] Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.1.3 11/20/2013
[21213.169980] task: ffff88203fa2edd0 ti: ffff880ffde30000 task.ti: ffff880ffde30000
[21213.178327] RIP: 0010:[<ffffffff811d525c>] [<ffffffff811d525c>] remove_node_from_stable_tree+0x11c/0x120
[21213.189009] RSP: 0018:ffff880ffde33d20 EFLAGS: 00010282
[21213.194933] RAX: 0000000081a3b0f8 RBX: ffffea0000000180 RCX: 0000000000000001
[21213.202893] RDX: 0000000081a3b0f8 RSI: 0000000000000000 RDI: ffff8809c5710d08
[21213.210852] RBP: ffff880ffde33d30 R08: 0000000000000065 R09: 00000000ffffffff
[21213.218811] R10: ffff880687fd0000 R11: 0000000000000000 R12: ffff8809c5710d08
[21213.226770] R13: 0000000000000006 R14: ffff8809c5710d0b R15: ffff8809c5710d08
[21213.234729] FS: 0000000000000000(0000) GS:ffff880fff940000(0000) knlGS:0000000000000000
[21213.243755] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21213.250166] CR2: 00007fc6f9d48270 CR3: 00000000019ba000 CR4: 00000000001427e0
[21213.258125] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[21213.266085] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[21213.274045] Stack:
[21213.276285] ffffea0000000180 ffffea0000000000 ffff880ffde33d70 ffffffff811d55a8
[21213.284573] 00ff8801697a0000 ffff8809c5710d08 ffff8809c5710d08 ffff880ffde33e28
[21213.292862] ffff881c606de0c0 ffff881fdabc2180 ffff880ffde33dc0 ffffffff811d5f2f
[21213.301150] Call Trace:
[21213.303880] [<ffffffff811d55a8>] get_ksm_page+0x98/0x120
[21213.309903] [<ffffffff811d5f2f>] __stable_node_chain+0x3f/0x250
[21213.316603] [<ffffffff811d6cce>] ksm_do_scan+0x46e/0x11e0
[21213.322721] [<ffffffff811d7acf>] ksm_scan_thread+0x8f/0x240
[21213.329037] [<ffffffff810b17d0>] ? wake_up_atomic_t+0x30/0x30
[21213.335543] [<ffffffff811d7a40>] ? ksm_do_scan+0x11e0/0x11e0
[21213.341956] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[21213.347398] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[21213.354685] [<ffffffff81696a58>] ret_from_fork+0x58/0x90
[21213.360708] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[21213.367990] Code: 83 2d d8 b3 d7 00 01 49 89 44 24 08 66 b8 00 02 49 89 44 24 10 eb ac 0f 1f 84 00 00 00 00 00 49 8d 7c 24 18 e8 06 e0 15 00 eb 98 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 4c 8b 67
[21213.389642] RIP [<ffffffff811d525c>] remove_node_from_stable_tree+0x11c/0x120
[21213.397709] RSP <ffff880ffde33d20>
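When comparing crashes across hosts it can help to reduce an oops like the one above to the BUG location and the reliable call-trace frames. A rough sketch, assuming the log format shown here (`kernel BUG at file:line!` followed by a `Call Trace:` section); `summarize_oops` is a hypothetical helper, not an existing tool:

```python
import re

# "kernel BUG at mm/ksm.c:611!" -> ("mm/ksm.c", "611")
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
# "[ts] [<addr>] func+0x98/0x120", optionally with "? " before the function
FRAME_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_oops(log_text):
    """Return the BUG location and the reliable call-trace function names."""
    bug_at = None
    trace = []
    in_trace = False
    for line in log_text.splitlines():
        m = BUG_RE.search(line)
        if m and bug_at is None:
            bug_at = (m.group(1), int(m.group(2)))
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.match(line.strip())
            if frame is None:
                in_trace = False               # first non-frame line ends the trace
            elif frame.group(1) is None:
                trace.append(frame.group(2))   # skip "?" frames, they are unreliable
    return {"bug_at": bug_at, "trace": trace}
```

Frames marked with `?` are left out on purpose, since the kernel flags them as possibly stale stack contents.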
Former user March 3, 2017 at 2:54 PM (edited)
This was reproduced on a CentOS kernel, so the bug was escalated to the CentOS bug tracker: https://bugs.centos.org/view.php?id=12908
This may be a race condition between KSM and THP, both of which scan memory and merge pages. I'm now running stress tests with KSM enabled and THP disabled to see if I can still trigger a panic.
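For anyone repeating the test, a small sketch of checking and disabling THP at runtime. It assumes the usual sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, which lists the available modes with the active one in square brackets; the function names are illustrative only.

```python
import re
from pathlib import Path

# Usual THP control file; its content looks like "[always] madvise never".
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Extract the active mode, i.e. the bracketed word, from the file content."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active THP mode found in {text!r}")
    return match.group(1)


def current_thp_mode(path=THP_ENABLED):
    """Read the control file and return the active THP mode."""
    return parse_thp_mode(Path(path).read_text())


def disable_thp(path=THP_ENABLED):
    """Switch THP off until the next reboot (needs root on a real host)."""
    Path(path).write_text("never\n")
```

Checking `current_thp_mode()` before and after `disable_thp()` confirms the stress test really runs with THP off. The setting is lost on reboot unless it is also made persistent, for example on the kernel command line.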
Yaniv Kaul March 2, 2017 at 7:21 PM
Please open a BZ on ksm.
Former user March 1, 2017 at 1:01 PM
Disabled KSM on ovirt-srv06 and ovirt-srv07 while we troubleshoot this. I will open a bug against CentOS and hope it gets some attention.
Today, during a build queue spike, ovirt-srv07 crashed. I enabled KSM on this host yesterday, and this is the likely cause. A vmcore was generated; I will investigate.