Simplify oVirt storage configuration

Description

Today's outage was a clear reminder that our current storage configuration does not serve us well. We hardly know how to debug it, it does not seem to be resilient to the very issues it was supposed to protect against, and it introduces potential failure scenarios of its own.

I suggest we implement a new storage layout that meets the following criteria:

  1. Ultimate simplicity at the lower level of the stack. More specifically:

    1. The storage servers should be simple NFS or iSCSI servers. No DRBD and no exotic filesystems.

    2. Only simple storage will be presented to oVirt for use as storage domains.

  2. Separation of resources between critical services - the "Jenkins" master, for example, should not share resources with the "resources" server or anything else. The separation should hold true down to the physical spindle level.

  3. Duplication of services and use of local storage where possible - this is a longer-term effort, but we have some low-hanging fruit here, like Artifactory, where simple DNS/LB-based failover between two identical hosts would probably suffice.

  4. Complexity only where needed, and up the stack. For example, the storage for Jenkins can simply be mirrored at the VM level, with failover to a backup VM.
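To illustrate criterion 1.1, a "simple NFS server" could be as small as this; a minimal sketch, assuming a stock CentOS host with nfs-utils installed - the export path and client subnet are hypothetical, not taken from our actual hosts:

```shell
# Minimal plain NFS export per criterion 1.1: no DRBD, no clustering,
# no exotic filesystem underneath. Path and subnet are illustrative.
mkdir -p /exports/jenkins
echo '/exports/jenkins 192.0.2.0/24(rw,sync,no_root_squash)' >> /etc/exports
exportfs -ra                       # (re)export everything in /etc/exports
systemctl enable --now nfs-server  # start NFS and enable it on boot
```

Anything more elaborate than a file like this on the storage server itself would arguably violate the "ultimate simplicity" criterion.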


Activity


Former user August 30, 2017 at 11:25 PM

Tried moving the Jenkins disks, only to find out that shareable disks can't be moved. We will have to shut down the VMs that have this flag set on their disks (Jenkins and Resources) to unset it, and then perform the storage migrations.

Former user August 30, 2017 at 8:55 AM

Status update:

  • 10Gig card installed

  • iSCSI configured for data domains

  • NFS configured for ISO/HostedEngine/OpenShift domains

  • storage VLAN configured on production hosts

  • three new storage domains created:

    • jenkins-iscsi - Jenkins VM (1.8TB)

    • data-iscsi - PROD VMs (3.6TB)

    • data2-iscsi - non-PROD VMs (2.7TB)

  • data migration in progress, 3.5TB of disks moved, 3.2TB to move.
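For reference, an iSCSI data domain like jenkins-iscsi could be exported roughly like this with targetcli; a sketch only - the backing device and IQN are made-up placeholders, and ACL/portal/CHAP setup is omitted:

```shell
# Hypothetical targetcli sketch for a domain like jenkins-iscsi.
# Device path and IQN are assumptions; ACLs and CHAP omitted for brevity.
targetcli /backstores/block create jenkins /dev/sdb
targetcli /iscsi create iqn.2017-08.org.ovirt:storage02.jenkins
targetcli /iscsi/iqn.2017-08.org.ovirt:storage02.jenkins/tpg1/luns \
  create /backstores/block/jenkins
targetcli saveconfig   # persist the configuration across reboots
```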

So far, migration is done using Live Storage Migration, i.e. the VM is not shut down for it. We tested this first on non-critical VMs and it performs quite well: the VM has no downtime and still boots fine after a subsequent shutdown.

A couple of our PROD OpenShift VMs failed to migrate properly. They have two disks each; one disk failed to move, and now a broken snapshot is in the way. This happened when both disks were selected for moving at once - when disks are moved one by one, no snapshot seems to be created for the VM.

Former user August 2, 2017 at 1:46 PM

Faulty hardware was successfully replaced on ovirt-storage02 and I provisioned the following RAID configuration:

Physical size   RAID     Logical size   Name      Use
2x900G          RAID1    0.9T           centos    OS plus NFS shares
4x900G          RAID10   1.8T           jenkins   Jenkins
6x900G          RAID50   3.6T           prod-1    prod systems tier 1, like resources
4x900G          RAID5    2.7T           prod-2    prod systems tier 2
So we get 9TB of usable space on this machine - less than the 11TB we had before, but now we can split load between spindles.
Now waiting for networking to be reconfigured to make use of the separated storage networks, and then I will provision an OS.
I will also log a ticket to install the 10Gig card into the server while it's still not in production.
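The 9TB figure can be sanity-checked from the table above; a quick sketch assuming standard RAID overhead (mirrors keep half the raw capacity, RAID5 loses one disk per array, RAID50 one disk per RAID5 leg):

```shell
# Usable capacity of the four arrays, in TB (0.9T per disk).
awk 'BEGIN {
  d = 0.9                 # per-disk size, TB
  raid1  = 2*d/2          # centos:  mirror keeps half
  raid10 = 4*d/2          # jenkins: striped mirrors keep half
  raid50 = (6-2)*d        # prod-1:  two 3-disk RAID5 legs, one parity disk each
  raid5  = (4-1)*d        # prod-2:  one parity disk
  printf "%.1f\n", raid1 + raid10 + raid50 + raid5   # prints 9.0
}'
```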

Former user June 29, 2017 at 2:27 PM

The current setup failed twice this year, due to the very failures it was supposed to protect against.

I completely agree with Barak's points. Focusing on service redundancy is a more effective approach than complicating our storage setup.
My proposal is to use iSCSI for improved performance, due to lower overhead and hypervisor-level IO caching. We also need to break up the RAID50 that we have and use several RAID10 volumes to separate workloads better.

Eyal Edri June 29, 2017 at 12:05 PM

Another option is to dig into how our DRBD and Pacemaker are set up and configured.
The setup is most likely 5 years old and not optimal; it might have gained new features and could function better if we upgrade it and document it well.

Barak Korren June 29, 2017 at 8:05 AM

Let's start the discussion here.

Done

Details


Created June 29, 2017 at 8:05 AM
Updated September 2, 2018 at 3:50 PM
Resolved August 23, 2018 at 3:05 PM
