Revisit build artifact storage and retention
Description
We need to revisit how we store and manage build artifacts in our environment.
We need to do this to reach several goals:
- Stop having to frequently deal with running out of space on the Jenkins server
- Stop having to frequently deal with running out of space on the Resources server
- Make Jenkins load faster
- Make publishing of artifacts faster (it can take up to 20 minutes to publish to 'tested' at the moment)
- Make it possible to find artifacts without knowing the exact details of the job that created them. We would like to be able to find artifacts by at least:
  - Knowing the build URL in Jenkins
  - Knowing the STDCI stage/project/branch/distro/arch/git-hash combination
  - Asking for the latest artifact for a given STDCI stage/project/branch/distro/arch combination (a sketch follows below)
We need to achieve the above without significantly harming the UX we provide. For example, users should still be able to find artifacts by navigating from the links posted to Gerrit/GitHub to the Jenkins job result pages.
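As one illustration of what such addressing could look like, here is a hypothetical on-disk layout (every name in it is made up for this sketch, not a decided scheme):

# Hypothetical: one directory per STDCI stage/project/branch/distro/arch/git-hash combination
ls artifacts/<project>/<branch>/<stage>/<distro>/<arch>/<git-hash>/
# Hypothetical: the "latest" query answered by a per-combination symlink
ln -sfn <git-hash> artifacts/<project>/<branch>/<stage>/<distro>/<arch>/latest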
Activity
Barak Korren June 12, 2018 at 1:29 PM
Converted this ticket into an epic to track all of the artifact-retention-related activity.
Barak Korren April 25, 2018 at 8:44 AM (Edited)
I see, it means we'll need to actually delete data rather than retaining it...
Former user April 25, 2018 at 8:40 AM (Edited)
Here are the test results on master/tested, with a 160 GB disk serving as the backend for a 480 GB VDO volume (a logical size of triple the physical size, per the official recommendations):
vdostats --human-readable
Device                    Size      Used  Available  Use%  Space saving%
/dev/mapper/vdo1        160.0G    107.9G      52.1G   67%            21%

df -h
Filesystem        Size  Used  Avail  Use%  Mounted on
...
/dev/mapper/vdo1  480G  132G   349G   28%  /tmp/vdo
The "saving" value is at 21% with 108GB of the backend volume consumed when storing 132GB of data. Moreover, this is mostly saved on ISOs. While copying the "rpm" directory the "saving" field was around 3-4% only.
After deleting the iso directory and running fstrim, VDO block usage dropped to 92 GB while disk usage was around 103 GB:
vdostats --human-readable
Device                    Size      Used  Available  Use%  Space saving%
/dev/mapper/vdo1        160.0G     92.4G      67.6G   57%            14%

df -h
Filesystem        Size  Used  Avail  Use%  Mounted on
...
/dev/mapper/vdo1  480G  103G   378G   22%  /tmp/vdo
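The reclamation step described above boils down to something like this (the exact location of the iso directory inside the volume is an assumption):

rm -rf /tmp/vdo/iso   # assumed path of the deleted ISO tree
fstrim -v /tmp/vdo    # report freed filesystem blocks to VDO so it can reclaim them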
Memory usage on this minimal system, which was running nothing but VDO, was around 520 MB for this volume.
In my opinion, having ~15% savings isn't worth the increased risk of corruption and increased memory usage.
Former user April 24, 2018 at 2:49 PM
I will run a test on RHEL 7.5 with master/tested, which is a few hundred gigabytes, to confirm how it performs.
Barak Korren April 24, 2018 at 4:20 AM
"Most of our storage space is taken up by RPMs, which are compressed archives that don't de-duplicate efficiently."
Do you have some data to back up this claim? If you have the same data compressed by the same algorithm in different files, it should in theory de-duplicate very well, as long as the de-duplication is done at the block level rather than the file level.
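A quick way to check the block-level behaviour on the test volume above (sample.rpm is a placeholder for any compressed artifact):

cp sample.rpm /tmp/vdo/copy1.rpm   # sample.rpm is a placeholder file
cp sample.rpm /tmp/vdo/copy2.rpm
sync                               # flush the page cache so the writes reach VDO
vdostats --human-readable          # the identical copy should dedup, so "Used" should barely grow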