Background
Up until recently the ci/cd process has been effective although there is a significant amount of duplication in definitions which increases maintenance effort to ensure all instances remain in sync. There is often the requirement that adding a new gem will often require the ci/cd to be extended to support the new gem which is it at risk of being forgotten. Further the current setup is not efficient due to it performing a significant number of builds/test of gems which have not changed. This significantly increases the duration of a run for instance #2174 took nearly 17mins due to the grpc gem taking over 16 minutes as well as Sinatra taking over 15mins. For the above example, it should've taken a couple of minutes. Sticking with the above example, it is fair to say the gem is well tested on Ubuntu however lacks any testing on Windows or macos. If we were to add these additional os to the matrix we would see the the number of ci checks increase from 50 to 122 with that number growing by 3 each time a new gem is added. It is important to note we also 4 copies of the test workflow just with different gems to reduce the checks however adding more workflows is not scalable.
Another important thing to note is currently we are not even checking if the examples actually build/run which should be addressed.
So the big thing identified which can we do to improve the situation is to move from static defining which gems are to be tested to dynamic determination based on the files changed and then use the merge queue to perform broad regression testing. This results in a significantly reduced number of checks within pr's while knowing that before it is merged to main everything is tested in the merge queue. Should an error occur, the merge is aborted.
Building blocks
Now that we know what we can do to improve the situation it is a matter of identifying the building blocks that are required to get us there.
Gem determination ✅
The process of enabling dynamic gem determination was initiated with #2054 which introduced labels on pr's which corresponds to what has changed. As part of this work it was observed a gap being the previous solution relied on 2 files (docker-compose.yml & a Workflow) which could not be tied to the corresponding gems. The first was addressed with #2203 & second proposed in #2312. With both of those pr's merged we can easily see which gems have been impacted the change.
Generalize workflows ✅
Looking at the workflows there was a significant amount of small customisations blocking consolidation ie services, env variables etc. This has been progressively addressed via #2163, #2203, #2280 & #2282. We are now at a point where all the workflows are identical.
Reduce duplication of environment variables 🏗️
As part of the review of #2282 it was mentioned that:
Would it be useful to set these options in unit tests and docker as well so they match?
As such during the course of #2312 it emerged that the example docker services also often also needing the same env variables hence are at risk of being out of sync. This has been tackled via #2314 by leveraging dotenv. The idea here is if the env variable is only needed within the ruby process ie connection address of mysql this is defined in the .env.test file which is loaded by rake or the ruby file. If the variable is only needed for executing the Workflow definition ie is jruby supported then it goes in github-ci.yml. It was originally expected that a single file would be used but the 2 file approach is less maintenance and a variable won't need to be specified in both hence doesn't lead to duplication.
Consolidated dynamic build workflows 🏗️
Currently the build stage is controlled by variables on a step in the testing Workflow, the intention is to split the build out which enables us to easily control which os & runtime is used for the build with introducing conditional logic needing to be maintained. It also makes the result of each build step more visible. This is useful for if we decide to enforce yard coverage as it is a dedicated step. It also means we can easily hook up the build of the example docker containers if present as dedicated steps. This split is in #2197 and only possible due to the prior work of gem determination. Note this build Workflow has a verification job which can be used as a required check and will fail if the build of any gem produced a result other than pass. This ensures that a pr can not be merged if the build of a gem is failing. If no gems were found everything is built using the root Rakefile which until now wasn't tested.
Consolidated testing workflows 📜
This replicates the build process by dynamically running tests only for the gems changed however this matrix has an additional dimension for the OS enabling seamless testing of the gem across the 3 os. This would also have the same behaviour of build ie default of testing everything and a verification step usable as a required check. Note the instrumentation-base gem does complicate things but is solvable by also building everything if base has changed. See open question below.
End State
| Workflow |
Job |
Step |
| Build |
{{GemName}} |
Gem, Yard & Example Docker |
| Test |
{{GemName}} x {{os}} |
Ruby 3.3, Ruby 3.4, Ruby 4, Jruby, Truffleruby |
Based on the above if our changes are contained to a single excluding base we end up with 1 build check & 1 test check per os. For 2 gems then it is multipled by 2 etc.
In scenarios where base is changed we end up with the equivalent of 2 gems being changed.
This is a significant reduction in checks while in fact increasing coverage of the gems being changed and maintaining regression testing via merge queue.
Open Question's
- Should the base instrumentation actually be part of the sdk? For me moving it to sdk makes sense and could be used to drive adoption.
Background
Up until recently the ci/cd process has been effective although there is a significant amount of duplication in definitions which increases maintenance effort to ensure all instances remain in sync. There is often the requirement that adding a new gem will often require the ci/cd to be extended to support the new gem which is it at risk of being forgotten. Further the current setup is not efficient due to it performing a significant number of builds/test of gems which have not changed. This significantly increases the duration of a run for instance #2174 took nearly 17mins due to the grpc gem taking over 16 minutes as well as Sinatra taking over 15mins. For the above example, it should've taken a couple of minutes. Sticking with the above example, it is fair to say the gem is well tested on Ubuntu however lacks any testing on Windows or macos. If we were to add these additional os to the matrix we would see the the number of ci checks increase from 50 to 122 with that number growing by 3 each time a new gem is added. It is important to note we also 4 copies of the test workflow just with different gems to reduce the checks however adding more workflows is not scalable.
Another important thing to note is currently we are not even checking if the examples actually build/run which should be addressed.
So the big thing identified which can we do to improve the situation is to move from static defining which gems are to be tested to dynamic determination based on the files changed and then use the merge queue to perform broad regression testing. This results in a significantly reduced number of checks within pr's while knowing that before it is merged to main everything is tested in the merge queue. Should an error occur, the merge is aborted.
Building blocks
Now that we know what we can do to improve the situation it is a matter of identifying the building blocks that are required to get us there.
Gem determination ✅
The process of enabling dynamic gem determination was initiated with #2054 which introduced labels on pr's which corresponds to what has changed. As part of this work it was observed a gap being the previous solution relied on 2 files (docker-compose.yml & a Workflow) which could not be tied to the corresponding gems. The first was addressed with #2203 & second proposed in #2312. With both of those pr's merged we can easily see which gems have been impacted the change.
Generalize workflows ✅
Looking at the workflows there was a significant amount of small customisations blocking consolidation ie services, env variables etc. This has been progressively addressed via #2163, #2203, #2280 & #2282. We are now at a point where all the workflows are identical.
Reduce duplication of environment variables 🏗️
As part of the review of #2282 it was mentioned that:
As such during the course of #2312 it emerged that the example docker services also often also needing the same env variables hence are at risk of being out of sync. This has been tackled via #2314 by leveraging dotenv. The idea here is if the env variable is only needed within the ruby process ie connection address of mysql this is defined in the .env.test file which is loaded by rake or the ruby file. If the variable is only needed for executing the Workflow definition ie is jruby supported then it goes in github-ci.yml. It was originally expected that a single file would be used but the 2 file approach is less maintenance and a variable won't need to be specified in both hence doesn't lead to duplication.
Consolidated dynamic build workflows 🏗️
Currently the build stage is controlled by variables on a step in the testing Workflow, the intention is to split the build out which enables us to easily control which os & runtime is used for the build with introducing conditional logic needing to be maintained. It also makes the result of each build step more visible. This is useful for if we decide to enforce yard coverage as it is a dedicated step. It also means we can easily hook up the build of the example docker containers if present as dedicated steps. This split is in #2197 and only possible due to the prior work of gem determination. Note this build Workflow has a verification job which can be used as a required check and will fail if the build of any gem produced a result other than pass. This ensures that a pr can not be merged if the build of a gem is failing. If no gems were found everything is built using the root Rakefile which until now wasn't tested.
Consolidated testing workflows 📜
This replicates the build process by dynamically running tests only for the gems changed however this matrix has an additional dimension for the OS enabling seamless testing of the gem across the 3 os. This would also have the same behaviour of build ie default of testing everything and a verification step usable as a required check. Note the instrumentation-base gem does complicate things but is solvable by also building everything if base has changed. See open question below.
End State
Based on the above if our changes are contained to a single excluding base we end up with 1 build check & 1 test check per os. For 2 gems then it is multipled by 2 etc.
In scenarios where base is changed we end up with the equivalent of 2 gems being changed.
This is a significant reduction in checks while in fact increasing coverage of the gems being changed and maintaining regression testing via merge queue.
Open Question's