Document procedure for infra upgrades

Description

On Sun, Jan 21, 2018 at 1:01 PM, Barak Korren <bkorren@redhat.com> wrote:

>
>
> On 21 January 2018 at 12:50, Eyal Edri <eedri@redhat.com> wrote:
>
>>
>>
>> On Sun, Jan 21, 2018 at 12:47 PM, Barak Korren <bkorren@redhat.com>
>> wrote:
>>
>>>
>>>
>>> On 21 January 2018 at 12:39, Eyal Edri <eedri@redhat.com> wrote:
>>>
>>>> There is another issue, which is currently failing the whole CQ, and
>>>> it's related to the new IBRS CPU model.
>>>> It looks like all of the Lago slaves were upgraded to a new libvirt and
>>>> kernel on Friday, while we still don't have a fix in lago-ost-plugin for
>>>> that.
>>>>
>>>> I think there was a misunderstanding about what to upgrade, and it
>>>> might have been understood that only the BIOS upgrade breaks it, not
>>>> the kernel one.
>>>>
>>>> In any case, we're currently fixing the issue, either by downgrading
>>>> the relevant packages on the Lago slaves or by adding the mapping for
>>>> the new CPU types in OST.
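
A quick way to check whether a given slave's libvirt already exposes the
new models is sketched below (the virsh call and XML layout are standard
libvirt; the '-IBRS' suffix check is an assumption about how the new model
names appear):

    # Sketch: list the '-IBRS' CPU models the host's libvirt exposes, to
    # see whether the Lago/OST CPU map needs entries for them.
    import subprocess
    import xml.etree.ElementTree as ET

    # 'virsh domcapabilities' prints the domain-capabilities XML for the
    # default emulator, including the usable custom CPU models.
    caps = subprocess.check_output(['virsh', 'domcapabilities'],
                                   universal_newlines=True)
    root = ET.fromstring(caps)
    for model in root.findall("./cpu/mode[@name='custom']/model"):
        if model.text and model.text.endswith('-IBRS'):
            print('%s (usable=%s)' % (model.text, model.get('usable')))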
>>>>
>>>> For the future, I suggest a few updates to how we do maintenance work
>>>> on Jenkins slaves (VMs or bare metal):
>>>>
>>>> 1. Let's avoid doing an upgrade close to a weekend (i.e., not on
>>>> Thu-Sun), so the whole team can be around to help if needed or if
>>>> something unexpected happens.
>>>> 2. When we have a system-wide upgrade scheduled, like all BM slaves or
>>>> all VMs for a specific OS, let's adopt a gradual upgrade with a window
>>>> of a few days in between.
>>>> E.g., if we need to upgrade all the Lago slaves, let's upgrade 1-2,
>>>> wait to see that nothing breaks, and continue only after we verify OST
>>>> runs (either by watching the CQ or by running it manually); a minimal
>>>> sketch of this follows after this message.
>>>>
>>>>
>>>> Thoughts?
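
A minimal sketch of that gradual rollout, assuming hypothetical helpers
for the upgrade itself and for the OST verification (the slave names,
counts, and soak window below are illustrative, not our real tooling):

    # Sketch of the canary-style rollout from point 2. upgrade_slave() and
    # ost_still_green() are hypothetical placeholders for whatever actually
    # drives the upgrade (e.g. Ansible) and the verification (watching the
    # CQ, or a manual OST run).
    import time

    CANARY_COUNT = 2              # upgrade only 1-2 slaves first
    SOAK_SECONDS = 2 * 24 * 3600  # the "few days window" before the rest

    def upgrade_slave(name):
        """Placeholder: run the actual package/kernel upgrade on `name`."""
        print('upgrading %s' % name)

    def ost_still_green(slaves):
        """Placeholder: verify OST still passes on the upgraded slaves."""
        print('verifying OST on %s' % ', '.join(slaves))
        return True

    def rolling_upgrade(slaves):
        canaries, rest = slaves[:CANARY_COUNT], slaves[CANARY_COUNT:]
        for name in canaries:
            upgrade_slave(name)
        if not ost_still_green(canaries):
            raise RuntimeError('canaries broke OST; halting the rollout')
        time.sleep(SOAK_SECONDS)  # let the canaries soak before the rest
        for name in rest:
            upgrade_slave(name)

    # e.g. rolling_upgrade(['lago-slave-%02d' % i for i in range(1, 9)])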
>>>>
>>>>
>>> We have a staging system - we should be using it for staging....
>>>
>>
>> Do we have OST tests or a manual job available there?
>>
>
> We can add them easily, or simply run Lago manually when needed.
>
>
>> In any case, this doesn't contradict what I suggested: even if you test
>> on staging, there could be differences from the production system, so we
>> should take care when we upgrade regardless.
>>
>
> Yes, but at least we'd know we had green-lighted the new configuration -
> I'm sure in this case we could have found at least some of the issues on
> staging (like the fc27 issues, for example) and could have avoided
> expensive production failures.
>
>> Another point when scheduling an upgrade is to talk to the infra owner
>> or the CI team and check whether we currently have a large queue in the
>> CQ or known failures; if so, it might be best to wait a bit until it's
>> cleared.
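
One way to make that check cheap is sketched below, against Jenkins'
standard /queue/api/json endpoint (the base URL and the threshold are
illustrative assumptions):

    # Sketch: ask Jenkins how deep its build queue is before scheduling
    # maintenance. The endpoint is Jenkins' standard JSON API; the base
    # URL and the threshold below are assumptions for illustration.
    import json
    import urllib.request

    JENKINS_URL = 'https://jenkins.ovirt.org'  # assumed base URL
    MAX_QUEUE = 10                             # assumed comfort threshold

    with urllib.request.urlopen(JENKINS_URL + '/queue/api/json') as resp:
        queue = json.load(resp)

    depth = len(queue.get('items', []))
    if depth > MAX_QUEUE:
        print('queue depth %d > %d - consider delaying the upgrade'
              % (depth, MAX_QUEUE))
    else:
        print('queue depth %d - OK to schedule maintenance' % depth)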
>>
>>
>
>

Adding infra-support so we can gather this info and prepare a
maintenance/upgrade checklist to add to the oVirt infra docs.
Let's continue the discussion and suggestions on that ticket.
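
A possible starting point for that checklist, distilled from the points
above:

1. Schedule upgrades early in the week (not Thu-Sun) so the team is around.
2. Green-light the new configuration on the staging system first.
3. Check with the infra owner / CI team that the CQ queue is small and has
no known failures before starting.
4. Upgrade 1-2 canary slaves, verify OST (via the CQ or a manual run), and
roll out to the rest only after a few days' soak.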

> –
> Barak Korren
> RHV DevOps team , RHCE, RHCi
> Red Hat EMEA
> redhat.com | TRIED. TESTED. TRUSTED. | redhat.com/trusted
>

Eyal Edri
MANAGER
RHV DevOps
EMEA VIRTUALIZATION R&D
Red Hat EMEA <https://www.redhat.com/>
<https://red.ht/sig> TRIED. TESTED. TRUSTED. <https://redhat.com/trusted>
phone: +972-9-7692018
irc: eedri (on #tlv #rhev-dev #rhev-integ)

Activity

Eyal Edri December 25, 2018 at 1:46 PM

We've added some steps to the ovirt-infra docs about upgrades, and they continue to change as we upgrade the infra, so closing for now.

Former user January 21, 2018 at 11:38 AM

As of now, tester 5029 has passed with 175 patches.
So for now, we know that there are no unknown regressions in the master repo.

Fixed

Details

Created January 21, 2018 at 11:11 AM
Updated August 29, 2019 at 2:12 PM
Resolved December 25, 2018 at 1:46 PM
