Simplify oVirt storage configuration

Description

Today's outage was a clear reminder that our current storage configuration does not serve us well. We hardly know how to debug it, it does not seem to be resilient to the very issues it was supposed to protect against, and it introduces potential failure scenarios of its own.

I suggest we implement a new storage layout that meets the following criteria:

  1. Ultimate simplicity at the lower level of the stack. More specifically:

    1. The storage servers should be simple NFS or iSCSI servers. No DRBD and no exotic filesystems.

    2. Only simple storage will be presented to oVirt for use as storage domains.

  2. Separation of resources between critical services - the "Jenkins" master, for example, should not share resources with the "resources" server or anything else. The separation should hold true down to the physical spindle level.

  3. Duplication of services and use of local storage where possible - this is a longer-term effort, but we have some low-hanging fruit here, like Artifactory, where simple DNS/LB-based failover between two identical hosts would probably suffice.

  4. Complexity only where needed, and up the stack. For example, the storage for Jenkins can simply be mirrored at the VM level, with failover to a backup VM.
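To illustrate criterion 1.1, a "simple NFS server" could be as small as this; a minimal sketch, assuming a stock CentOS host with nfs-utils installed - the export path and client subnet are hypothetical, not taken from our actual hosts:

```shell
# Minimal plain NFS export per criterion 1.1: no DRBD, no clustering,
# no exotic filesystem underneath. Path and subnet are illustrative.
mkdir -p /exports/jenkins
echo '/exports/jenkins 192.0.2.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra                       # (re)export everything in /etc/exports
systemctl enable --now nfs-server  # start NFS and enable it on boot
```

Anything more elaborate than a file like this on the storage server itself would arguably violate the "ultimate simplicity" criterion.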


Activity


Former user August 30, 2017 at 11:25 PM

Tried moving the Jenkins disks, only to find out that shareable disks can't be moved. We will have to shut down the VMs that have this flag set on their disks (Jenkins and Resources) to unset it, and then perform the storage migrations.

Former user August 30, 2017 at 8:55 AM

Status update:

  • 10Gig card installed

  • iSCSI configured for data domains

  • NFS configured for ISO/HostedEngine/OpenShift domains

  • storage VLAN configured on production hosts

  • three new storage domains created:

    • jenkins-iscsi - Jenkins VM (1.8TB)

    • data-iscsi - PROD VMs (3.6TB)

    • data2-iscsi - non-PROD VMs (2.7TB)

  • data migration in progress, 3.5TB of disks moved, 3.2TB to move.
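For reference, an iSCSI data domain like jenkins-iscsi could be exported roughly like this with targetcli; a sketch only - the backing device and IQN are made-up placeholders, and ACL/portal/CHAP setup is omitted:

```shell
# Hypothetical targetcli sketch for a domain like jenkins-iscsi.
# Device path and IQN are assumptions; ACLs and CHAP omitted for brevity.
targetcli /backstores/block create jenkins /dev/sdb
targetcli /iscsi create iqn.2017-08.org.ovirt:storage02.jenkins
targetcli /iscsi/iqn.2017-08.org.ovirt:storage02.jenkins/tpg1/luns \
  create /backstores/block/jenkins
targetcli saveconfig   # persist the configuration across reboots
```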

So far, migration is done using Live Storage Migration, i.e. the VM is not shut down for it. We tested this first on non-critical VMs and it performs quite well: the VM has no downtime and still boots fine after a subsequent shutdown.

A couple of our PROD OpenShift VMs failed to migrate properly. They have two disks each; one disk failed to move, and now a broken snapshot is in the way. This happened when both disks were selected for moving at once - when disks are moved one by one, no snapshot seems to be created for the VM.

Former user August 2, 2017 at 1:46 PM

Faulty hardware was successfully replaced on ovirt-storage02 and I provisioned the following RAID configuration:

Physical size   RAID     Logical size   Name      Use
2x900G          RAID1    0.9T           centos    OS plus NFS shares
4x900G          RAID10   1.8T           jenkins   Jenkins
6x900G          RAID50   3.6T           prod-1    prod systems tier 1, like resources
4x900G          RAID5    2.7T           prod-2    prod systems tier 2
So we get 9TB of usable space on this machine - less than the 11TB we had before, but now we can split load between spindles.
Now waiting for networking to be reconfigured to make use of the separated storage networks, and then I will provision an OS.
I will also log a ticket to install the 10Gig card into the server while it's still not in production.
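The 9TB figure can be sanity-checked from the table above; a quick sketch assuming standard RAID overhead (mirrors keep half the raw capacity, RAID5 loses one disk per array, RAID50 one disk per RAID5 leg):

```shell
# Usable capacity of the four arrays, in TB (0.9T per disk).
awk 'BEGIN {
  d = 0.9                 # per-disk size, TB
  raid1  = 2*d/2          # centos:  mirror keeps half
  raid10 = 4*d/2          # jenkins: striped mirrors keep half
  raid50 = (6-2)*d        # prod-1:  two 3-disk RAID5 legs, one parity disk each
  raid5  = (4-1)*d        # prod-2:  one parity disk
  printf "%.1f\n", raid1 + raid10 + raid50 + raid5   # prints 9.0
}'
```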

Former user June 29, 2017 at 2:27 PM

The current setup failed twice this year, due to the very failures it was supposed to protect against.

I completely agree with Barak's points. Focusing on service redundancy is a more effective approach than complicating our storage setup.
My proposal is to use iSCSI for improved performance, due to lower overhead and hypervisor-level IO caching. We also need to break up the RAID50 that we have and use several RAID10 volumes to separate workloads better.

Eyal Edri June 29, 2017 at 12:05 PM

Another option is to dig into how our DRBD and Pacemaker are set up and configured.
The setup is most likely 5 years old and not optimal; it might have gained new features and could function better if we upgrade it and document it well.

Barak Korren June 29, 2017 at 8:05 AM

Let's start the discussion here.

Done

Details


Created June 29, 2017 at 8:05 AM
Updated September 2, 2018 at 3:50 PM
Resolved August 23, 2018 at 3:05 PM
