Creating dev environments in 20 seconds with some Fabio and Mongo Atlas

August 01, 2021

Nir Cohen, VP Engineering, Strigo

In a previous post, we wrote about our way of providing full-featured, isolated development environments in just a few minutes. The gist was that we provided a CloudFormation template to provision an environment via Travis on commit, and then ran an Ansible playbook to provision the instance. ALB host-based routing was our way of providing an endpoint for those environments.

That was.. nice, but flawed. There’s no reason to go into detail, since the solution was specific to our system. Suffice it to say, we wanted to achieve even faster, more dynamic provisioning of development environments, scheduled by a single provisioning mechanism, to reduce complexity.

Since the workflow itself is what makes our new solution flexible, fast and very, very simple to maintain, we will focus on it, though some examples are provided.

What enabled us to improve were several changes we made to our infrastructure.

  1. We moved from our own, self-managed Mongo cluster to Mongo Atlas.
  2. We implemented full service-discovery (Consul), container scheduling (Nomad) and dynamic load-balancing (Fabio).
  3. We wrote a well-structured CLI-based workflow to aid us.
  4. We started building our container images on CircleCI and pushing them to ECR.

The Workflow

The Build Process

We build our container images on CircleCI, and push them directly to ECR. We have a semi-generic .circleci/config.yml for all services:

version: 2
jobs:
  build:
    docker:
    - image: circleci/python:latest
    environment:
    - AWS_DEFAULT_REGION: us-something-1
    - SERVICE_NAME: my-service
    - SERVICE_REPO: XXXX.dkr.ecr.eu-west-1.amazonaws.com/my-service
    steps:
    - checkout
    - setup_remote_docker
    - run: docker build -t ${SERVICE_NAME} .
    - deploy:
        command: |
          sudo pip install awscli
          `aws ecr get-login --no-include-email`
          # Extract the proper docker tag (branch or git tag)
          DOCKER_TAG="tag-missing"
          if [ -z "$CIRCLE_TAG" ]; then
            echo "Taking tag from CIRCLE_BRANCH: ${CIRCLE_BRANCH}"
            DOCKER_TAG="${CIRCLE_BRANCH}"
          else
            echo "Taking tag from CIRCLE_TAG: ${CIRCLE_TAG}"
            DOCKER_TAG="${CIRCLE_TAG}"
          fi

          DOCKER_FULL_TAG=${SERVICE_REPO}:ref-${DOCKER_TAG}
          echo Tagging image ${DOCKER_FULL_TAG}
          docker tag ${SERVICE_NAME} ${DOCKER_FULL_TAG}
          docker push ${DOCKER_FULL_TAG}
...

Every push creates a Docker image tag (or updates it). If a branch is used, a tag with the name of the branch is created, so branches can also be deployed if they’re provided as SERVICE_VERSION parameters. (The ref- prefix isn’t that interesting, except that it’s required to apply retention policies on ECR images.)
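For context, that retention is just an ECR lifecycle policy keyed on the tag prefix. Here is a minimal sketch of applying one with boto3; the repository name and the 50-image limit are made-up values, not our actual policy:

# Sketch: expire old "ref-*" images from an ECR repository.
# Repository name, region and limits are hypothetical.
import json
import boto3

ecr = boto3.client("ecr", region_name="eu-west-1")

policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "Keep only the 50 most recent ref-* images",
            "selection": {
                "tagStatus": "tagged",
                "tagPrefixList": ["ref-"],
                "countType": "imageCountMoreThan",
                "countNumber": 50,
            },
            "action": {"type": "expire"},
        }
    ]
}

ecr.put_lifecycle_policy(
    repositoryName="my-service",
    lifecyclePolicyText=json.dumps(policy),
)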

The deployment-description workflow we now use is as follows:


Deploying a new environment

  1. We run strigo dev deploy ENV_NAME SERVICE1_NAME=SERVICE1_VERSION ... (defaulting to master for all services if versions are not provided — which is how we create our master dev env).
  2. The CLI then ssh-copies a consolidated, templated job to a random Nomad server, and runs the job. The ENV_NAME is then used both for the job’s name and to create a database in a Mongo Atlas dev-specific cluster under the same name. (We have a simple ssh-tunneling context manager implementation, and thought about using it to deploy the job directly via Nomad’s API, which is something we might do in the future. This would allow us to make blocking CLI calls to the Nomad API to also verify that the job succeeded.) A rough sketch of steps 1-3 appears right after this list.
  3. The ENV_NAME is also used as a variable in the job template so that the different applications can use it. Versions are also propagated as variables into the jobs, and correspond with container tags kept in ECR.
  4. The job runs, and the different containers are then spread across a pre-created application-server cluster. Since utilization is very low, we can run many jobs in a relatively small cluster (~10–20 concurrent envs on ~4 medium instances.)
  5. Nomad registers the services in Consul, and adds a tag that Fabio can later read. Nomad will add a consul tag for a service which answers to the required endpoint, e.g. urlprefix-ENV_NAME-SERVICE_NAME.dev-env.strigo.io
  6. Since all services in Consul now contain that tag, Fabio will read them, and start load-balancing traffic between them.
  7. The ALB, instead of load-balancing between application servers, can now load-balance between Fabio servers, based on the host header. The only thing that changes in the endpoint URLs is the ENV_NAME.
  8. The env is then available at ~https://ENV_NAME.dev-env.strigo.io.
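To make steps 1-3 a bit more concrete, here is a rough Python sketch of that part of the flow. The file names, template variables and the use of plain scp/ssh are assumptions for illustration, not our exact implementation:

# Sketch of "strigo dev deploy": render the consolidated job template and run it
# on a random Nomad server. All names and paths here are hypothetical.
import random
import subprocess
from jinja2 import Environment, FileSystemLoader

NOMAD_SERVERS = ["172.16.39.129", "172.16.39.130"]  # discovered elsewhere


def deploy(env_name, version_overrides):
    # Default every service to master unless a version was passed on the CLI.
    versions = {"SERVICE1_VERSION": "master", "SERVICE2_VERSION": "master"}
    versions.update(version_overrides)

    # Render dev-env.nomad with ENV_NAME and the per-service versions; the
    # included service templates use the versions to pick their ref-* ECR tags.
    jinja = Environment(loader=FileSystemLoader("templates"))
    job = jinja.get_template("dev-env.nomad").render(ENV_NAME=env_name, **versions)

    rendered = f"/tmp/{env_name}.nomad"
    with open(rendered, "w") as f:
        f.write(job)

    # ssh-copy the rendered job to a random Nomad server and run it.
    server = random.choice(NOMAD_SERVERS)
    subprocess.run(["scp", rendered, f"{server}:{rendered}"], check=True)
    subprocess.run(["ssh", server, f"nomad run {rendered}"], check=True)


deploy("feat1", {"SERVICE1_VERSION": "my-branch"})

The Fabio urlprefix tag from step 5 would live in the per-service stanzas that the job includes, so nothing in the CLI itself needs to know about Fabio or Consul.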

Our dev-env.nomad job template looks something like this:

job "{{ ENV_NAME }}" {
  datacenters = ["dc1"]
  type = "service"{% include 'dev-SERVICE1_NAME.nomad' %}
{% include 'dev-SERVICE2_NAME.nomad' %}
...}

Each service group is included in the template, so it’s very easy to add a new service. Ideally, we’d include our production service jobs, but.. you know, reality. Also, if for some reason you want to run a partial environment, all you have to do is remove the relevant include, and that service will not be deployed. This allows us to create dynamic deployment logic and lets the CLI abstract the way services are chosen for specific environments. So, for instance, strigo dev deploy ENV_NAME --core can deploy only core services as part of the env, etc.
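A hedged sketch of how that selection might look on the CLI side; the service groupings, flag and file names are illustrative only:

# Sketch: decide which per-service include files the top-level job template
# should pull in. Groupings and file names are hypothetical.
CORE_SERVICES = ["SERVICE1_NAME", "SERVICE2_NAME"]
EXTRA_SERVICES = ["SERVICE3_NAME", "SERVICE4_NAME"]


def service_includes(core_only=False):
    services = CORE_SERVICES if core_only else CORE_SERVICES + EXTRA_SERVICES
    return [f"dev-{service}.nomad" for service in services]


# "strigo dev deploy ENV_NAME --core" would then simply render fewer includes
# into dev-env.nomad before shipping it to Nomad.
print(service_includes(core_only=True))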

Listing Environments

Since environments are Nomad jobs, we can easily list them by returning the output of nomad status on our Nomad dev cluster.

$ strigo dev list
Choosing nomad server...
Using 172.16.39.129
Listing environments...
Retrieving job status...
run: nomad status
out: ID        Type     Status   Submit Date
out: feat1     service  running  06/17/18 14:34:43 UTC
out: master    service  running  06/18/18 11:34:05 UTC
out: feat2     service  running  06/17/18 20:34:55 UTC
out: hotfix    service  running  06/03/18 17:53:33 UTC
...

Updating Environments

All we need to do to update an environment is to run strigo dev deploy ENV_NAME ... again. Database creation is idempotent, so the env will keep using the database created when we first created the environment.

Tip: We add an environment variable called DUMMY_VAR to each Nomad service job, with a current-timestamp value generated by the CLI and passed into it. This is done so that Nomad identifies that there’s a change to the job, re-downloads the image and redeploys it.
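On the CLI side this can be as small as stamping the current time into the template variables before rendering the job; something along these lines (variable names assumed):

# Sketch: inject a fresh timestamp so Nomad always sees a diff in the job spec.
import time

template_vars = {"ENV_NAME": "feat1"}                # plus service versions, etc.
template_vars["DUMMY_VAR"] = str(int(time.time()))   # changes on every deploy

Each service job’s env stanza can then simply reference {{ DUMMY_VAR }}.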

Destroying Environments

Since a development environment is, in its entirety, a Nomad job and a database, all we do is run strigo dev stop ENV_NAME. This runs a nomad stop ENV_NAME and drops the database from Mongo Atlas, obliterating the environment completely! Fun.
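A minimal sketch of what that teardown can look like, assuming plain ssh for the Nomad side and pymongo against the Atlas dev cluster; the server address and connection string are placeholders:

# Sketch of "strigo dev stop": stop the Nomad job and drop the matching database.
import subprocess
from pymongo import MongoClient

ATLAS_URI = "mongodb+srv://user:pass@dev-cluster.example.mongodb.net"  # placeholder


def stop(env_name, nomad_server="172.16.39.129"):
    # Stop the Nomad job named after the environment.
    subprocess.run(["ssh", nomad_server, f"nomad stop {env_name}"], check=True)

    # Drop the environment's database from the Atlas dev cluster.
    MongoClient(ATLAS_URI).drop_database(env_name)


stop("feat1")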

Working with a DBaaS

Working with Mongo Atlas provides us with the ability to duplicate our main dev database using a simple API call. Since Mongo Atlas also provides a great cluster monitoring interface (built into its web interface), we can monitor how our cluster behaves in dev quite easily (obviously, dev != prod, but at least we can monitor how relative changes to our code change the database’s behavior on some level.)

Win

This setup is a ~replica of our production environment. Grafana and Prometheus, Telegraf, Redash, etc. all run in the dev environment and are as easily upgradable as any other service in the system. While it takes time to set this entire thing up, the benefits we reap are immeasurable. Running any development environment takes more or less 20s (also thanks to the fact that downloading even large Docker images from ECR to an EC2 instance in the same region takes very little time; you can also optimize with Alpine images to reduce image size by a large margin, but please, only do that if you know what you’re doing).

Providing variables via a CLI allows us to be as dynamic as we need in templating our job, so even specific service versions can be passed when we wish to create an environment containing any set of services. We can also pass a container instance count via the CLI for specific services, so running a very-large-cluster test for a specific service (or a set of services) requires little to zero work.

An additional helper in scaling and managing these envs (not directly related to this flow) is the fact that all that’s required to add more instances to our Nomad app-server cluster is changing an instance_count variable in Terraform and tf apply-ing. More on that in a future post :)

If you have any feedback on this workflow, we’d appreciate comments. Additionally, it would be great to hear about the workflows you use to manage your development environments, for the benefit of all.
