monitoring alerts for gerrit on 19.01.2017

Description

Several nagios alerts were sent on 19.01.2017 around 21:27:33 UTC for various services on the Gerrit server. The services have since recovered. This is a ticket to identify the root cause.

Activity


Former user March 3, 2017 at 3:05 PM

Root cause located: a large number of concurrent git requests. If this happens more often we may need to add even more memory to the instance. A possible preventive check is sketched below.
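To catch this earlier next time, a nagios-style check along the following lines could alert on the number of concurrent git processes before the OOM killer steps in. This is only a sketch: the threshold is an illustrative assumption, not a value derived from this incident.

#!/usr/bin/env python3
# Sketch of a nagios-style check: alert when too many concurrent git
# processes are running. THRESHOLD is an assumed value.
import os
import sys

THRESHOLD = 150  # assumed alert threshold
GIT_NAMES = {"git", "git-upload-pack", "git-daemon"}

count = 0
for pid in os.listdir("/proc"):
    if not pid.isdigit():
        continue
    try:
        with open("/proc/%s/comm" % pid) as f:
            if f.read().strip() in GIT_NAMES:
                count += 1
    except OSError:
        continue  # process exited while we were scanning

if count >= THRESHOLD:
    print("CRITICAL: %d concurrent git processes" % count)
    sys.exit(2)  # nagios CRITICAL exit code
print("OK: %d concurrent git processes" % count)
sys.exit(0)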

Former user January 20, 2017 at 10:02 AM

Top process names when the first OOM happened:

180 git
94 git-upload-pack
90 git-daemon
21 httpd
10 postmaster
6 mingetty
3 udevd
3 sendmail
2 crond
1 xinetd
1 sshd
...

And for the last OOM:

174 git
95 git-upload-pack
91 git-daemon
21 httpd
11 postmaster
6 mingetty
4 sendmail
3 udevd
2 crond
1 xinetd
1 sshd
...
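For the record, counts like the ones above can be reproduced from the process table that the kernel dumps on each OOM. A minimal sketch, assuming the dump lives in /var/log/messages and the rows have the format shown in the next comment (a real script would split the log per OOM event; this one counts all dumped rows at once):

import re
from collections import Counter

# Process-table rows in the OOM dump look like:
# [24949]   500 24949  3678120   426287   4  -17  -1000 java
ROW = re.compile(
    r"\[\s*\d+\]\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+-?\d+\s+-?\d+\s+(\S+)")

counts = Counter()
with open("/var/log/messages") as log:  # assumed log location
    for line in log:
        m = ROW.search(line)
        if m:
            counts[m.group(1)] += 1

for name, n in counts.most_common():
    print(n, name)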

Memory consumption of the Java process (gerrit) during the first OOM:

[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[24949] 500 24949 3678120 426287 4 -17 -1000 java

And the last one:

[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[24949] 500 24949 3678138 807366 4 -17 -1000 java

The RSS of the Java process roughly doubled during this period. Combined with 180 git processes (50000 pages each) and 95 git-upload-pack processes (30000 pages each), this consumed most of the virtual memory available to the host. Luckily the OOM killer did not kill Java or Postgres.
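Back-of-the-envelope arithmetic for the numbers quoted above, assuming 4 KiB pages:

PAGE = 4096  # 4 KiB pages assumed

# Java RSS from the two OOM dumps above
rss_first = 426287 * PAGE
rss_last = 807366 * PAGE
print("java RSS: %.1f GiB -> %.1f GiB (%.1fx)"
      % (rss_first / 2**30, rss_last / 2**30, rss_last / rss_first))
# java RSS: 1.6 GiB -> 3.1 GiB (1.9x)

# Virtual memory footprint of the git processes
print("git: %.0f GiB" % (180 * 50000 * PAGE / 2**30))             # ~34 GiB
print("git-upload-pack: %.0f GiB" % (95 * 30000 * PAGE / 2**30))  # ~11 GiB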

Former user January 20, 2017 at 9:30 AM

According to the nagios logs, the issue lasted from Thu, 19 Jan 2017 21:16:22 GMT to Thu, 19 Jan 2017 21:38:32 GMT. Most service checks timed out, while ping continued to work and the server did not reboot.

Looking at the server itself, I see that multiple OOM conditions were reported yesterday (server time is EST, i.e. 5 hours behind GMT):
Jan 19 16:15:40 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:15:53 gerrit kernel: postmaster invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:15:53 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:16:43 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:17:05 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:17:57 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:10 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:13 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:17 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:21 gerrit kernel: git invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:24 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:35 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:52 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:55 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:59 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:33:03 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:33:34 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:33:38 gerrit kernel: git invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

It looks like the server was indeed under heavy load and this was not a false positive: it ran out of memory, and the OOM killer terminated a dozen processes, which likely affected jobs cloning repositories.
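As a sanity check, converting the first and last kernel timestamps above from EST to GMT shows that they line up with the nagios outage window:

from datetime import datetime, timedelta

EST_TO_GMT = timedelta(hours=5)  # EST is 5 hours behind GMT

first_oom = datetime(2017, 1, 19, 16, 15, 40) + EST_TO_GMT
last_oom = datetime(2017, 1, 19, 16, 33, 38) + EST_TO_GMT

print(first_oom)  # 2017-01-19 21:15:40 GMT, just before the first alert at 21:16:22
print(last_oom)   # 2017-01-19 21:33:38 GMT, inside the window ending at 21:38:32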

Done

Details

Created January 20, 2017 at 9:17 AM
Updated April 2, 2017 at 12:51 PM
Resolved March 3, 2017 at 3:05 PM