Infrastructure Upgrade: Surprises & Lessons Learnt

Flexera

TL;DR: For some time we have had a stable, working infrastructure supporting our services. That said, it has been in place for almost a year now, and we felt it was time to refresh and revisit it. We gained a lot of benefit from upgrading the various components we use, such as Terraform, Helm, and EKS, to name a few. However, once we got the ball rolling, we ran into some surprises that turned into valuable lessons, including how we managed to keep reusing our persistence layer (a Kinesis Stream) on the newly revamped infrastructure. This gave us the best of both worlds: retaining the data that was previously persisted, while also enjoying the benefits of upgrading to a newer version.

(8 min read)


As part of the Modernization program, our team was tasked with establishing a data-ingestion service. Everything has been working smoothly, and data flows through nicely in both our test and production environments. That said, the infrastructure we built is almost a year old, and we wanted to take the opportunity to refresh it and upgrade to the most recent stable versions.

 

WHY

The first question here: if it ain't broke, why upgrade? It can be a challenging question to answer. First and foremost, there are many benefits to be gained from upgrading our existing infrastructure in terms of performance, efficiency, maintainability, and other areas. For instance, Terraform v0.12 brings improved error messages, first-class expressions, rich value types, and more. Upgrading to Helm 3 rewards us with a cleaner, simpler, Tiller-free environment that improves cluster security, plus distributed repositories and Helm Hub, improved Helm tests, JSON schema validation, and better command-line syntax. Last but not least, staying up to date is good security practice.

 

WHAT WAS UPGRADED

Initially, we were only looking at upgrading to Terraform 0.12 and Helm 3. However, since we were already performing surgery on our infrastructure, we decided to take the chance to also upgrade the following components:

Component                Old Version                          New Version
Prometheus Operator      5.0.3 (chart)                        8.12.10 (chart)
Prometheus Adapter       1.2.0 (chart), v0.5.0 (image-tag)    2.2.0 (chart), v0.6.0 (image-tag)
Prometheus Pushgateway   0.4.0 (chart), v1.1.2 (image-tag)    1.4.0 (chart), v1.2.0 (image-tag)
AWS ALB Ingress          0.1.10 (chart), v1.1.2 (image-tag)   0.1.14 (chart), v1.1.6 (image-tag)
External DNS             2.0.2 (chart)                        2.20.4 (chart)
Metrics Server           2.8.2 (chart)                        2.10.2 (chart)
Vault Secret Webhook     0.4.3 (chart)                        1.0.1 (chart)

 

SURPRISES & LESSONS LEARNT

  • We can only upgrade EKS by one minor version at a time; for example, when trying to upgrade from 1.1 to 1.3, we need to upgrade to 1.2 first before continuing to 1.3.
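This stepwise path can be sketched in shell. The version numbers below are illustrative, and the actual upgrade call (left as a comment) assumes a hypothetical cluster name:

```shell
#!/usr/bin/env bash
# Sketch: walk EKS up one minor version at a time (versions are illustrative).
current=14   # minor part of the current version, e.g. 1.14
target=16    # minor part of the target version, e.g. 1.16

path=""
for ((m = current + 1; m <= target; m++)); do
  path="${path:+${path} }1.${m}"
  # Each step would really be something like (cluster name is a placeholder):
  # aws eks update-cluster-version --name my-cluster --kubernetes-version "1.${m}"
done
echo "$path"
```

Each intermediate version must finish upgrading (including the node groups) before the next step begins.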
  • When cleaning up existing Terraform, we need to remove both the state file in S3 and its state-lock entry in DynamoDB.
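A minimal sketch of that cleanup is below. The bucket, key, and table names are assumptions, and the actual aws calls are left as comments; the S3 backend keeps an md5 digest item in the lock table keyed "<bucket>/<key>-md5":

```shell
#!/usr/bin/env bash
# Sketch: clean up a Terraform S3 state file and its DynamoDB lock-table
# entry. Bucket, key, and table names below are assumptions.
BUCKET="my-tf-state-bucket"
KEY="trs-ingest/terraform.tfstate"
LOCK_TABLE="my-tf-lock-table"

# Build the key of the digest item the S3 backend stores ("<bucket>/<key>-md5")
LOCK_KEY_JSON=$(jq -cn --arg id "${BUCKET}/${KEY}-md5" '{LockID: {S: $id}}')
echo "$LOCK_KEY_JSON"

# The actual cleanup calls (not executed in this sketch):
# aws s3 rm "s3://${BUCKET}/${KEY}"
# aws dynamodb delete-item --table-name "$LOCK_TABLE" --key "$LOCK_KEY_JSON"
```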
  • With the new Terraform 0.12:
    • The 0.12upgrade command is really helpful for converting our old Terraform syntax to the new one. Nevertheless, there are still a bunch of missing pieces that need to be addressed manually.
    • The string interpolation syntax has changed for the better and is now clearer:

      v0.11: "${(local.is_create_acm ? 1 : 0)}"
      v0.12: local.is_create_acm ? 1 : 0

    • The required_version declaration has moved from main.tf to its own file, versions.tf.
    • The map syntax has changed and become cleaner:

      v0.11:
      common_tags = "${
        map(
          "Infra_ID", "${var.infra_id}",
          "Owner", "${var.owner}",
          "Repo", "flexera/trs-ingest",
          "Workspace", "${terraform.workspace}",
          "Environment", "${terraform.workspace}",
          "ManagedBy", "Terraform"
        )
      }"

      v0.12:
      common_tags = {
        "Infra_ID"    = var.infra_id
        "Owner"       = var.owner
        "Repo"        = "flexera/trs-ingest"
        "Workspace"   = terraform.workspace
        "Environment" = terraform.workspace
        "ManagedBy"   = "Terraform"
      }
    • The output JSON changes slightly, hence when using jq to process it, the filter needs a bit of modification:

      v0.11:
      WORKER_POOLS=`terraform output -json worker_pools | jq '.value' -c | sed 's/"/\\\"/g' | sed 's/{/\\\{/g' | sed 's/}/\\\}/g' | sed 's/,/\\\,/g'`

      v0.12:
      WORKER_POOLS=`terraform output -json worker_pools | jq '' -c | sed 's/"/\\\"/g' | sed 's/{/\\\{/g' | sed 's/}/\\\}/g' | sed 's/,/\\\,/g'`
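The difference can be illustrated with simulated outputs. The JSON shapes below are simplified assumptions of what terraform output -json worker_pools returns in each version:

```shell
#!/usr/bin/env bash
# Simulated (and simplified) outputs of `terraform output -json worker_pools`:
V11_OUT='{"sensitive":false,"type":"list","value":[{"name":"pool-a"}]}'  # v0.11 wraps the value
V12_OUT='[{"name":"pool-a"}]'                                            # v0.12 prints it bare

# v0.11: the data sits under .value
POOLS_V11=$(echo "$V11_OUT" | jq -c '.value')
# v0.12: the whole document is the value, so the filter is just identity
POOLS_V12=$(echo "$V12_OUT" | jq -c '.')

echo "$POOLS_V11"
echo "$POOLS_V12"
```

Both invocations yield the same list; only the jq filter changes.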
    • There seems to be an issue with the ClusterIP assignment when Helm checks its state. Hence we need to check first: if it exists, we reuse it.
      if (kubectl get services $(DIFF_SERVICE_RELEASE_NAME)-rethinkdb-proxy -o jsonpath="{.spec.clusterIP}") > /dev/null; then \
        DIFF_RETHINKDB_PROXY_CLUSTERIP=$$(kubectl get services $(DIFF_SERVICE_RELEASE_NAME)-rethinkdb-proxy -o jsonpath="{.spec.clusterIP}"); \
      else \
        DIFF_RETHINKDB_PROXY_CLUSTERIP=""; \
      fi && \
      helm upgrade $(DIFF_SERVICE_RELEASE_NAME) $(DIFF_SERVICE_CHART_PATH) \
      --set crs.rethinkdb.proxy.service.clusterIP=$${DIFF_RETHINKDB_PROXY_CLUSTERIP} \
      . . .
      . . .
  • With the new Helm 3:
    • The 2to3 plugin is really handy for migrating existing releases from Helm 2 to Helm 3. Here is a sample Makefile target to do so:
      migrate-helm2-to-helm3:
        helm3 2to3 move config && \
        helmCharts=$$(helm list -aq --output json | jq '') && \
        for helmChart in $${helmCharts}; do \
          helmChart=$${helmChart%\"};helmChart=$${helmChart#\"};helmChart=$${helmChart%\",}; \
          if [ "$$helmChart" != "[" ] && [ "$$helmChart" != "]" ]; then \
            helm3 2to3 convert $$helmChart; \
          fi; \
        done && \
        helm3 2to3 cleanup
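The quote-and-bracket stripping in the loop above works, but jq can emit bare names directly. A sketch with a simulated helm list payload (the release names and JSON shape are assumptions):

```shell
#!/usr/bin/env bash
# Simulated `helm list -aq --output json` payload (shape is an assumption)
HELM_LIST_JSON='[{"name":"ingest-api"},{"name":"prometheus-operator"}]'

# `jq -r '.[].name'` prints one bare release name per line, so the loop
# needs no manual stripping of quotes, commas, or brackets
releases=$(echo "$HELM_LIST_JSON" | jq -r '.[].name')
for release in $releases; do
  echo "would run: helm3 2to3 convert $release"
done
```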
    • It no longer auto-creates the namespace, hence we need to create it manually:
      if kubectl get namespace monitoring > /dev/null ; \
      then echo "Namespace 'monitoring' exists"; \
      else \
      kubectl create namespace monitoring; \
      echo "Namespace 'monitoring' has been created"; \
      fi
    • It no longer has a local repo, so if Helm 2 was installed before, we need to remove it manually when upgrading:
      helm repo remove local
    • It does not add the stable repo by default, hence we need to add it manually:
      helm repo add stable https://kubernetes-charts.storage.googleapis.com
    • The timeout value now needs a time-unit suffix:

      Helm 2: --timeout 20
      Helm 3: --timeout 20s
    • The delete command now purges by default, so there is no need to pass --purge anymore.
    • Old charts will still work fine, but we upgraded our charts to apiVersion v2 where possible.
  • The new Vault requires us to label the namespace so that it does not crash:
    if kubectl get namespace vswh > /dev/null ; \
    then echo "Namespace 'vswh' exists"; \
    else \
    kubectl create namespace vswh; \
    kubectl label ns vswh name="vswh"; \
    echo "Namespace 'vswh' has been created and label is set"; \
    fi
  • The FlexAuth component requires the cluster private key to be stored in Vault. Luckily, we still had our clusters' private keys, so we could store them again in the new cluster.
    Otherwise, extra effort would have been required from our team, and from other teams depending on our service, to update to the new key pairs.
  • The new Prometheus Operator helm chart handles its custom resource definitions (CRDs) differently, hence we need to explicitly set createCustomResource when initializing the chart:
    prometheusOperator:
      enabled: true
      createCustomResource: false
  • Terraform's import feature was a savior, helping us preserve the existing Kinesis Stream that acts as our persistence layer.
    • Terraform doesn't support exempting particular resources when doing a destroy.
    • Hence, we use the following workaround:
      • Prior to running terraform destroy, we remove the Kinesis Stream from the Terraform state.
      • Then we run terraform destroy.
      • After that, prior to running terraform apply for the new infrastructure, we run terraform import against the existing Kinesis Stream:
        • terraform import my-kinesis-definition-on-terraform my-kinesis-stream-name
      • This results in Terraform not recreating the Kinesis Stream.
      • Finally, we run terraform apply as usual.
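The workaround above can be sketched end to end. The resource address and stream name are placeholders, and the terraform commands are echoed rather than executed in this sketch:

```shell
#!/usr/bin/env bash
# Sketch of the destroy-but-keep-Kinesis workaround; the resource address
# and stream name below are placeholders.
TF_ADDR="aws_kinesis_stream.ingest"
STREAM="my-kinesis-stream-name"

steps=(
  "terraform state rm ${TF_ADDR}"          # 1. forget the stream so destroy skips it
  "terraform destroy"                      # 2. tear down everything else
  "terraform import ${TF_ADDR} ${STREAM}"  # 3. adopt the surviving stream into the new state
  "terraform apply"                        # 4. apply as usual; the stream is not recreated
)
for step in "${steps[@]}"; do
  echo "$step"
done
```

Because the stream is removed from state before the destroy and imported back before the apply, its data survives the rebuild untouched.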

 

FURTHER THOUGHTS

  • In line with our fully-automated CI/CD goal, we would like to explore more avenues to automate the following remaining manual steps:
    • Project creation on Errbit and propagating the new Errbit key to all relevant pods.
    • Vault secret-key assignment upon Vault & FlexAuth deployment.
    • Configuring the Transit Gateway attachment.
  • Using Terraform's import capability, we have a lot more room to apply the same trick to other resources we manage on AWS, e.g. ElastiCache, DynamoDB, S3, RDS, etc. Having said that, one remaining puzzle we haven't explored is re-hooking an existing EFS with the EFS provisioner & Persistent Volume, so that newly-created pods can reuse the existing backing persistence.