Monitoring alerts for Gerrit on 19.01.2017
Description
Several nagios alerts were sent on 19.01.2017 around 21:27:33 UTC related to various services on the Gerrit server. They have since recovered. This is a ticket to identify the root cause.

Activity

Former user March 3, 2017 at 3:05 PM
Root cause located: a large number of concurrent git requests. If this happens more often, we may need to add even more memory to the instance.

Former user January 20, 2017 at 10:02 AM
Top process names when the first OOM happened (a sketch for reproducing these counts from the kernel log follows the two lists):
180 git
94 git-upload-pack
90 git-daemon
21 httpd
10 postmaster
6 mingetty
3 udevd
3 sendmail
2 crond
1 xinetd
1 sshd
...
And for the last OOM:
174 git
95 git-upload-pack
91 git-daemon
21 httpd
11 postmaster
6 mingetty
4 sendmail
3 udevd
2 crond
1 xinetd
1 sshd
...
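As a rough illustration, counts like the above can be pulled from the OOM killer task dumps that the kernel writes to the syslog. The sketch below is a minimal example, not the exact command used here; the /var/log/messages path and the "kernel: [ pid] ... name" line format are assumptions based on a stock CentOS/RHEL 6 syslog setup.

#!/usr/bin/env python
# Count process names listed in the OOM killer task dumps in the kernel log.
# Note: this aggregates every dump in the file; slice the log by timestamp
# first if you want the counts for a single OOM event.
import re
from collections import Counter

TASK_LINE = re.compile(r'kernel: \[\s*\d+\]\s+.*\s(\S+)$')  # last field = process name

counts = Counter()
with open('/var/log/messages') as log:      # assumed log location
    for line in log:
        m = TASK_LINE.search(line.rstrip('\n'))
        if m:
            counts[m.group(1)] += 1

for name, n in counts.most_common():
    print('%5d %s' % (n, name))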
Memory consumption of the Java process (gerrit) during the first OOM:
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[24949] 500 24949 3678120 426287 4 -17 -1000 java
And the last one:
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[24949] 500 24949 3678138 807366 4 -17 -1000 java
The RSS of the Gerrit Java process nearly doubled during this period, from ~426k to ~807k pages (roughly 1.6 GiB to 3.1 GiB with 4 KiB pages). Combined with 180 git processes (about 50,000 pages each) and 95 git-upload-pack processes (about 30,000 each), this consumed most of the virtual memory available to the host. Luckily the OOM killer did not kill Java or Postgres.
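For context, the arithmetic behind these figures, assuming the standard 4 KiB page size (the rss and total_vm columns in the dump are in pages):

PAGE = 4096                                   # bytes; check with `getconf PAGESIZE`

def pages_to_gib(pages):
    return pages * PAGE / float(1 << 30)

first_rss, last_rss = 426287, 807366          # java rss at the first and last OOM
print('first OOM: %.2f GiB' % pages_to_gib(first_rss))    # ~1.63 GiB
print('last  OOM: %.2f GiB' % pages_to_gib(last_rss))     # ~3.08 GiB
print('growth:    %.1fx' % (last_rss / float(first_rss))) # ~1.9x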

Former user January 20, 2017 at 9:30 AM
According to the nagios logs, the issue lasted from Thu, 19 Jan 2017 21:16:22 GMT to Thu, 19 Jan 2017 21:38:32 GMT. Most services timed out, while ping continued to work and the server did not reboot.
Looking at the server itself, I see that multiple OOM conditions were reported yesterday (server time is EST, i.e. 5 hours behind GMT, so add 5 h to these timestamps to match the GMT times above):
Jan 19 16:15:40 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:15:53 gerrit kernel: postmaster invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:15:53 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:16:43 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:17:05 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:17:57 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:10 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:13 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:17 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:21 gerrit kernel: git invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:24 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:32:35 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:52 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:55 gerrit kernel: java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=-17, oom_score_adj=-1000
Jan 19 16:32:59 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:33:03 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:33:34 gerrit kernel: git invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Jan 19 16:33:38 gerrit kernel: git invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Looks like the server was indeed under heavy load and this was not a false positive: it ran out of memory, and the OOM killer terminated over a dozen processes, which likely affected jobs cloning repositories.
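For completeness, a minimal sketch of how the list above can be summarized (invocations per process, plus the affected window shifted to GMT); the log path, line format, and year are assumptions rather than values taken from the host:

import re
from collections import Counter
from datetime import datetime, timedelta

# Match lines like "Jan 19 16:15:40 gerrit kernel: java invoked oom-killer: ..."
PATTERN = re.compile(r'^(\w{3} +\d+ [\d:]+) \S+ kernel: (\S+) invoked oom-killer')

counts, stamps = Counter(), []
with open('/var/log/messages') as log:        # assumed log location
    for line in log:
        m = PATTERN.match(line)
        if m:
            ts = datetime.strptime('2017 ' + m.group(1), '%Y %b %d %H:%M:%S')
            stamps.append(ts + timedelta(hours=5))   # EST -> GMT
            counts[m.group(2)] += 1

for name, n in counts.most_common():
    print('%3d %s' % (n, name))
if stamps:
    print('window (GMT): %s .. %s' % (min(stamps), max(stamps)))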
Details
Assignee: Former user (Deactivated)
Reporter: Former user (Deactivated)
Priority: Medium