Status: Done
Assignee: Former user (Deactivated)
Reporter: Barak Korren (Deactivated)
Blocked By: pending upgrade of HE to 4.2.3 and verifying no open bugs on HE
Components:
Priority: Highest
Created: June 29, 2017 at 8:05 AM
Updated: September 2, 2018 at 3:50 PM
Resolved: August 23, 2018 at 3:05 PM
Today's outage was a clear reminder that our current storage configuration does not serve us well. We hardly know how to debug it, it does not appear to be resistant to the very issues it was supposed to protect against, and it introduces potential failure scenarios of its own.
I suggest we implement a new storage layout that meets the following criteria:
Ultimate simplicity at the lower levels of the stack. More specifically:
The storage servers should be simple NFS or iSCSI servers. No DRBD and no exotic file systems.
Only simple storage will be presented to oVirt for use as storage domains.
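To illustrate just how little configuration such a plain NFS server needs, here is a sketch of a single /etc/exports line; the path and subnet are hypothetical, and the all_squash/36:36 options reflect oVirt's expectation that storage-domain exports be owned by vdsm:kvm:

```
# /etc/exports -- hypothetical path and subnet, shown only to
# illustrate how small the configuration surface of a plain NFS
# server is compared to a DRBD-backed stack.
/exports/jenkins-domain  192.0.2.0/24(rw,sync,no_subtree_check,all_squash,anonuid=36,anongid=36)
```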
Separation of resources between critical services - the Jenkins master, for example, should not share resources with the "resources" server or anything else. The separation should hold true down to the physical-spindle level.
Duplication of services and use of local storage where possible - this is a longer-term effort, but we have some low-hanging fruit here, such as Artifactory, where simple DNS/LB-based fail-over between two identical hosts would probably suffice.
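As a sketch of what LB-based fail-over for Artifactory could look like, the following HAProxy fragment routes traffic to a primary host and only falls back to an identical standby when the primary's health check fails. The hostnames and port are assumptions, not real infrastructure names; the /api/system/ping path is Artifactory's standard health endpoint:

```
# Hypothetical HAProxy configuration for Artifactory fail-over.
# Hostnames and port are placeholders.
frontend artifactory_in
    bind *:80
    default_backend artifactory_pool

backend artifactory_pool
    option httpchk GET /api/system/ping
    server artifactory1 artifactory1.example.org:8081 check
    # "backup" means this host only receives traffic when the
    # primary fails its health check.
    server artifactory2 artifactory2.example.org:8081 check backup
```

A DNS-based variant (a low-TTL record repointed by a health-check script) would achieve the same effect without a load balancer in the path.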
Complexity only where needed, and up the stack. For example, the storage for Jenkins can simply be mirrored at the VM level, with fail-over to a backup VM.