At Digital Turbine, we sometimes find ourselves needing to offload tasks to specialized hardware. This allows the tasks to run more efficiently and reduces the load on our regular executors, such as those running our CI/CD operations, SDK Metrics ingestion and our aggregations.
Many different approaches and tools are available to address such needs; some are very basic to operate, while others have a steep learning curve.
We explored the various possibilities, but most of them proved to be overkill for such a simple use case. For example, we could have created or used a Kubernetes cluster, a simple CronJob and an Allocator for AWS. Still, such a setup would require hands-on Kubernetes knowledge and a deep understanding of how to maintain such an operation - knowledge which most mobile engineers do not possess.
In addition, we explored the idea of reusing an existing Airflow operation, which our peer teams make use of - but that proved impossible due to the technicalities of mixing permissions across the groups; the same issue existed with AWS Batch.
It's worth noting that both approaches described above, as well as other 3rd-party solutions, would have required a budget; since this was a proof of concept, a simpler solution was called for.
Another important consideration was the accessibility of the solution: it shouldn't introduce too much complexity or too many new technologies. This steered the answer in a slightly different direction, towards running the workloads as if they were regular Jenkins jobs, something most software engineers are already familiar with.
After reviewing our options, we opted for a simpler solution that lets us operate with our existing technology stack and knowledge, while seamlessly running our workloads from anywhere we want.
To suit the above requirements, we wrote a script that runs our SDK Metrics ingestion workloads on AWS compute instances without requiring a single machine to be active 24/7. We wanted to maximize our cost/compute efficiency and use spot instances, as there are definite "dead spots" in our ingestion process.
Our programming language for this utility is go-lang, as it has mature SDKs for both AWS and Docker and both are extremely fast to bootstrap.
In this blog post, we'll walk through the steps required to re-create the script and follow its inner workings.
For the script to run successfully, you will need a pre-existing AMI that has a Docker daemon installed, running, and exposing the Docker API port (2375) to the relevant calling machine(s). The following public, non-Digital Turbine GitHub gist by styblope has some information about how to make the Docker daemon listen on TCP in addition to the UNIX socket, but to sum up the requirements:
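In short, the daemon must listen on TCP as well as on its default local socket. A rough sketch of that configuration (the exact file locations and flags below are an approximation; follow the gist and your distro's conventions):

```
# /etc/docker/daemon.json - keep the local socket and add a TCP listener on 2375
{
  "hosts": ["fd://", "tcp://0.0.0.0:2375"]
}

# /etc/systemd/system/docker.service.d/override.conf - drop the -H flag from the
# unit file so it doesn't conflict with the "hosts" entry above
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd

# apply the changes
sudo systemctl daemon-reload
sudo systemctl restart docker.service
```

Keep in mind that port 2375 is unauthenticated and unencrypted, so the instance's security group should only allow the relevant calling machine(s) to reach it.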
Also, this post assumes the following dependencies have been provided to your go-lang project:
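The snippets that follow assume the AWS SDK for Go v2 and the official Docker Engine SDK; fetching them looks roughly like this (the exact versions are up to you):

```
go get github.com/aws/aws-sdk-go-v2/config
go get github.com/aws/aws-sdk-go-v2/service/ec2
go get github.com/aws/aws-sdk-go-v2/service/ecr
go get github.com/docker/docker/client
```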
Before diving into the code, set out below is a short overview of the process we are about to see in action:
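1. Request and launch a new EC2 spot instance from the pre-built, Docker-enabled AMI.
2. Wait until the Docker daemon on the instance accepts connections.
3. Authenticate against ECR and pull the workload image onto the instance.
4. Create and start the container, piping its logs back to our own process.
5. Read the container's exit code.
6. Terminate the EC2 instance.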
So as a start, we need to create and run an EC2 instance, preferably a spot instance. Fortunately, that's a pretty straightforward task, and we only need a few details to invoke the API and receive the details of the newly created EC2 instance.
Here is how the request object looks in go-lang:
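A minimal sketch, assuming the AWS SDK for Go v2; the AMI ID, instance type, subnet and security group below are placeholders you would swap for your own values:

```go
import (
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// buildRunInput describes a single spot instance based on our Docker-enabled AMI.
func buildRunInput() *ec2.RunInstancesInput {
	return &ec2.RunInstancesInput{
		ImageId:          aws.String("ami-0123456789abcdef0"), // placeholder: the Docker-enabled AMI
		InstanceType:     types.InstanceTypeC5Large,           // placeholder: whatever fits the workload
		MinCount:         aws.Int32(1),
		MaxCount:         aws.Int32(1),
		SubnetId:         aws.String("subnet-0123456789abcdef0"),
		SecurityGroupIds: []string{"sg-0123456789abcdef0"}, // must allow TCP 2375 from the calling machine
		InstanceMarketOptions: &types.InstanceMarketOptionsRequest{
			MarketType: types.MarketTypeSpot,
			SpotOptions: &types.SpotMarketOptions{
				SpotInstanceType:             types.SpotInstanceTypeOneTime,
				InstanceInterruptionBehavior: types.InstanceInterruptionBehaviorTerminate,
			},
		},
	}
}
```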
Using this object to actually perform the instance request:
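Again a sketch (the function names are ours to pick), loading a standard AWS config from the environment and calling RunInstances:

```go
import (
	"context"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// launchSpotInstance loads the default AWS config and asks EC2 to launch the
// spot instance described by buildRunInput above.
func launchSpotInstance(ctx context.Context) (*ec2.Client, types.Instance, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return nil, types.Instance{}, err
	}
	ec2Client := ec2.NewFromConfig(cfg)

	out, err := ec2Client.RunInstances(ctx, buildRunInput())
	if err != nil {
		return nil, types.Instance{}, err
	}

	// We requested exactly one instance, so the reservation holds exactly one.
	return ec2Client, out.Instances[0], nil
}
```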
If successful, the result of this call is quite useful, as it includes details about the newly created and running instance; these will help us connect to it:
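For example, a sketch that pulls out the instance ID and IP and waits for the instance to reach the running state (whether you use the private or public IP depends on your network setup):

```go
import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// instanceDetails extracts the fields we care about from the new instance and
// waits (up to 5 minutes) until EC2 reports it as running.
func instanceDetails(ctx context.Context, ec2Client *ec2.Client, inst types.Instance) (id, ip string, err error) {
	id = aws.ToString(inst.InstanceId)
	ip = aws.ToString(inst.PrivateIpAddress)

	waiter := ec2.NewInstanceRunningWaiter(ec2Client)
	err = waiter.Wait(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{id},
	}, 5*time.Minute)

	return id, ip, err
}
```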
Now that we have an instance running, we can attempt to connect to it; we can hijack Docker SDK's "Ping" API to check if the instance has the docker daemon running and accepting connections:
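A sketch of what that can look like, with the 5-minute/5-second budget described below:

```go
import (
	"context"
	"fmt"
	"time"

	"github.com/docker/docker/client"
)

// dockerClientFor points a Docker client at the daemon exposed on the new
// instance and waits, up to 5 minutes in 5-second intervals, for it to answer Ping.
func dockerClientFor(ctx context.Context, ip string) (*client.Client, error) {
	cli, err := client.NewClientWithOpts(
		client.WithHost(fmt.Sprintf("tcp://%s:2375", ip)),
		client.WithAPIVersionNegotiation(),
	)
	if err != nil {
		return nil, err
	}

	deadline := time.Now().Add(5 * time.Minute)
	for {
		ping, err := cli.Ping(ctx)
		if err == nil {
			fmt.Printf("docker daemon is up, API version %s\n", ping.APIVersion)
			return cli, nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("docker daemon did not come up in time: %w", err)
		}
		time.Sleep(5 * time.Second)
	}
}
```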
After creating a new Docker client object, the code above simply keeps attempting to connect to the Docker daemon on the instance for a maximum of 5 minutes, at 5-second intervals.
After a successful connection, the ping result will have the API version the docker daemon in the EC2 instance is using.
Now that we have the machine running and we have a newly created docker client ready to go, we can start the work of actually using the docker services on the EC2 instance we just created.
This starts with pulling a docker image to the instance itself.
But before we do that, a heads-up: if you would like to use ECR as your image repository, you must provide credentials to the Docker service running on that instance.
To do that, it's best to call the "GetAuthorizationToken" API.
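A sketch using the ECR client from the AWS SDK for Go v2:

```go
import (
	"context"
	"errors"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ecr"
)

// ecrAuthToken fetches the registry token; it decodes to an "AWS:<password>" pair.
func ecrAuthToken(ctx context.Context) (string, error) {
	cfg, err := config.LoadDefaultConfig(ctx) // or reuse the config loaded earlier
	if err != nil {
		return "", err
	}

	out, err := ecr.NewFromConfig(cfg).GetAuthorizationToken(ctx, &ecr.GetAuthorizationTokenInput{})
	if err != nil {
		return "", err
	}
	if len(out.AuthorizationData) == 0 {
		return "", errors.New("no authorization data returned from ECR")
	}
	return aws.ToString(out.AuthorizationData[0].AuthorizationToken), nil
}
```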
To use that token with the Docker API, you must manipulate the string a bit:
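Something along these lines: decode the token, split out the username and password, and re-encode them as the base64url JSON blob the Docker SDK expects in its `RegistryAuth` fields:

```go
import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// registryAuthFor converts the ECR token into a Docker registry auth string.
func registryAuthFor(ecrToken string) (string, error) {
	// The ECR token is base64("AWS:<password>").
	raw, err := base64.StdEncoding.DecodeString(ecrToken)
	if err != nil {
		return "", err
	}
	parts := strings.SplitN(string(raw), ":", 2)
	if len(parts) != 2 {
		return "", fmt.Errorf("unexpected ECR token format")
	}

	authJSON, err := json.Marshal(map[string]string{
		"username": parts[0],
		"password": parts[1],
	})
	if err != nil {
		return "", err
	}
	return base64.URLEncoding.EncodeToString(authJSON), nil
}
```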
So here, we can finally pull our image:
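This and the following snippets are sketches that assume they live inside one function which receives `ctx`, the Docker client `cli` from before, the full ECR image reference `imageRef`, and the `registryAuth` string, and returns an error. Note that newer Docker SDK versions moved the options type from `types.ImagePullOptions` to `image.PullOptions`:

```go
// imports: "github.com/docker/docker/api/types"
reader, err := cli.ImagePull(ctx, imageRef, types.ImagePullOptions{
	RegistryAuth: registryAuth,
})
if err != nil {
	return err
}
defer reader.Close()
```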
If we want to pipe the output of the image pull operation:
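Continuing the sketch; the pull output is a stream of JSON progress messages, and copying it verbatim to stdout is enough for a Jenkins job log:

```go
// imports: "io", "os"
if _, err := io.Copy(os.Stdout, reader); err != nil {
	return err
}
```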
Now we can continue to deal with the container, beginning with creating it:
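A sketch; the command, environment variables and container name are placeholders:

```go
// imports: "github.com/docker/docker/api/types/container"
resp, err := cli.ContainerCreate(ctx,
	&container.Config{
		Image: imageRef,
		Cmd:   []string{"/app/ingest"},         // placeholder: the workload's entry command
		Env:   []string{"SOME_ENV=some-value"}, // placeholder: whatever the workload needs
	},
	&container.HostConfig{},
	nil, // networking config: the defaults are fine here
	nil, // platform: let the daemon decide
	"sdk-metrics-ingestion", // placeholder container name
)
if err != nil {
	return err
}
```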
If everything has gone smoothly so far (otherwise, check the errors returned by the AWS/Docker calls), we can now start the container:
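Continuing the sketch:

```go
if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
	return err
}
```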
Now let's pipe the container's log output from Docker back to our own process:
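A sketch; since the container runs without a TTY, the log stream is multiplexed, and `stdcopy` splits it back into stdout and stderr so the output reads like a regular local run:

```go
// imports: "github.com/docker/docker/pkg/stdcopy"
logs, err := cli.ContainerLogs(ctx, resp.ID, types.ContainerLogsOptions{
	ShowStdout: true,
	ShowStderr: true,
	Follow:     true,
})
if err != nil {
	return err
}
defer logs.Close()

if _, err := stdcopy.StdCopy(os.Stdout, os.Stderr, logs); err != nil {
	return err
}
```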
If we want to stop the container, we can use the stop container API:
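For example (older Docker SDK versions take a `*time.Duration` here instead of `container.StopOptions`):

```go
// Give the workload 30 seconds to shut down gracefully before Docker kills it.
timeoutSeconds := 30
if err := cli.ContainerStop(ctx, resp.ID, container.StopOptions{Timeout: &timeoutSeconds}); err != nil {
	return err
}
```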
Finally, to retrieve the exit code of the container, we can use the Inspect API:
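Continuing the sketch:

```go
inspect, err := cli.ContainerInspect(ctx, resp.ID)
if err != nil {
	return err
}

// Propagating this as our own exit code lets the wrapping Jenkins job
// succeed or fail together with the workload.
exitCode := inspect.State.ExitCode
fmt.Printf("container exited with code %d\n", exitCode)
```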
Now, to be good citizens, we have to terminate the EC2 instance we used; this might also save a few bucks in case we grabbed an expensive machine:
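A sketch, assuming `ec2Client` and the `instanceID` from the earlier snippets:

```go
// imports: "github.com/aws/aws-sdk-go-v2/service/ec2"
_, err = ec2Client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
	InstanceIds: []string{instanceID},
})
if err != nil {
	return err
}
```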
In the age of Kubernetes, this might look a bit like a step in the other direction. However, considering that the teams managing the project are all mobile developers, a more straightforward, hands-on approach is what nailed it.
This post has described how we can quickly harness the AWS and Docker APIs and SDKs to create small remote execution units and run them as if they were running locally. We fine-tuned this to run on Jenkins, producing simple job output as if Jenkins had run the code itself.