You may have heard of DevOps and may even practice it, but you may not have heard about FinOps. At least I had not until a little over a year ago, when my manager asked if I would like to be the initial DevOps engineer on the FinOps team he was starting. What I quickly learned is that FinOps is a framework for managing the variable nature of cloud computing costs in a company, in much the same way that DevOps is a framework for managing the deployment and operation of your software.

Our FinOps team currently consists of my manager, who focuses on the analyst side of things such as forecasting, budgeting, and working with other teams on those areas or when issues arise, while I focus on building solutions to help us and other teams measure and save on cloud costs. One of the first things we set out to do when we got started as a team was to take better advantage of Flexera One Automation (previously Policies) and the Cost Policies it provides. We wanted a solution that would give us visibility into the waste costs and potential savings the policies identify, as well as the savings realized when the policies take automatic actions to eliminate waste resources, but we also wanted the teams that actually own those resources to be properly notified and fully in control of how the policies interact with them.

The FinOps Policies System

Having come into Flexera through the RightScale acquisition, I had previously worked on the team that built the Policies engine both before and after the acquisition. I did a lot of the initial work on the Policy template language parser and have an in-depth understanding of the Policy execution engine and its capabilities. With this knowledge, I set out to build our FinOps solution for Policies using Policies themselves. I named it the FinOps Policies system, which is now a little funny since Flexera One Policies are now called Flexera One Automation instead. I had the initial version of the FinOps Policies system working late last year with a few early adopter teams, and I have implemented more functionality and onboarded more teams since then.

The system is made up of four main components which I will describe in more detail further on:

  1. an internal repository on GitHub with Policy/Automation settings which teams use to opt in to the system and customize how things are applied to their cloud accounts based on their operational needs and comfort level with the system
  2. a Terraform module which provides an automated way to create the Flexera One Credentials and corresponding cloud credentials necessary for the Flexera One Automation system to discover and act on resources in each team’s cloud accounts
  3. a Flexera One Automation (Policy) template which ties the system together by evaluating whether changes need to be made to the currently running applied policies according to the latest settings, checking if there are any configuration or run time issues and reporting on them, and writing metrics to a metrics store
  4. a Grafana setup with several dashboards backed by a metrics store which provides insight into team adoption, engagement, and system health

Settings

The first component of the system is the Git repository with the Policy/Automation settings. For the settings, I decided to use YAML since it is both machine readable and more forgiving than JSON while still easily translated into it (which is necessary since the Automation engine does not currently understand YAML but it does understand JSON).

The default settings for each of the Automation templates that should be applied in each team’s cloud accounts are specified in a top level settings.yaml file:

policies:
  - name: AWS Unused Volumes
    url: https://github.com/flexera/policy_templates/blob/master/cost/aws/unused_volumes/aws_delete_unused_volumes.pt
    cloud_vendor: aws
    team_parameters:
      param_email: notification.emails
    default_parameters:
      param_allowed_regions:
        - ap-southeast-1
        - ap-southeast-2
        - eu-central-1
        - eu-west-1
        - us-east-1
        - us-east-2
        - us-west-1
        - us-west-2
      param_exclude_tags: [keep=true]
      param_automatic_action: []
    default_credentials:
      auth_aws: FinOps_Policy_AWS_Unused_Volumes_STS
    optional_parameter_sets:
      automatic_actions:
        param_automatic_action: [Delete Volumes]
      only_eu_regions:
        param_allowed_regions:
          - eu-central-1
          - eu-west-1
  ...

With the information provided for each template in settings.yaml, my template has almost everything it needs to apply them to a team’s account. In the example of the AWS Unused Volumes template above, the team_parameters map specifies that the param_email parameter should be set to the value of the team’s notification.emails setting when the template is applied in their cloud accounts. There are also some optional_parameter_sets specified which allow teams to easily add common customizations to their own settings. The two common customizations shown here are automatic_actions, which enables the template to automatically delete waste resources when they are discovered, and only_eu_regions, which is helpful for the AWS accounts used for services in our Flexera One EU zone where we disable API access to other regions via an AWS SCP in order to ensure we comply with European Union requirements and regulations.

Individual team settings files are added in a teams/ subdirectory:

name: FinOps
notification:
  emails:
    - Policy Notifications - Team FinOps <abcdef12.FLEXERA.onmicrosoft.com@amer.teams.ms>
  microsoft_teams_webhook_urls:
    - https://flexera.webhook.office.com/webhookb2/abcdef12-3456-7890-abcd-ef1234567890@abcdef12-3456-7890-abcd-ef1234567890/IncomingWebhook/abcdef1234567890abcdef1234567890/abcdef12-3456-7890-abcd-ef1234567890
project_policy_settings:
  1234567890: # finops-team-sandbox
    - name: AWS Unused Volumes
      enable_optional_parameter_sets: [automatic_actions]

The name and notification settings are all that is required for a team to opt in to the system. As you saw above, the notification.emails array is used when applying templates in the team’s cloud accounts. The notification.microsoft_teams_webhook_urls are used when the template needs to notify about any configuration or run time issues. You may have guessed that our email address actually goes to a Microsoft Teams channel as well; it is our primary mode of communication within Flexera (I have obfuscated the email address and URL so you cannot actually start spamming our notification channel). This example only shows a single customization for a single AWS account, where we want to enable automatic deletion of resources in our sandbox account, but the settings support full customization of the Automation apply parameters both across all of a team’s cloud accounts and for individual cloud accounts.
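
For example, a team could override individual Automation parameters for just one of its accounts. The following is a hypothetical sketch of what that might look like; the override_parameters key is an assumed name for illustration since the actual key is not shown in this post:

project_policy_settings:
  1234567890: # finops-team-sandbox
    - name: AWS Unused Volumes
      enable_optional_parameter_sets: [automatic_actions]
      # Hypothetical key name for per-account parameter overrides
      override_parameters:
        param_exclude_tags: [keep=true, do-not-delete=true]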

To validate that the YAML settings files will actually work and that the parameters specified match any constraints in the actual Automation templates, I added automated tests to the repository. Since Flexera has standardized on Go for writing our microservices, I wrote most of these tests in Go as well so engineers from any team should be familiar enough to debug on their own. When someone makes a pull request to the repository, GitHub Actions run these validations, and we also use a CODEOWNERS file to ensure teams are involved when their settings change. When a pull request is merged to the main branch, more GitHub Actions run which create a GitHub Release with the YAML settings files merged together and translated to JSON as an artifact (I used the gojq command line tool to achieve this). The latest GitHub Release represents the current desired state of the settings for the system.
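
As a rough sketch of what one of these Go validations might look like (the struct, field names, and YAML library here are assumptions for illustration rather than the actual test code), a test can walk the teams/ subdirectory and check that each file parses and contains the required opt-in fields:

package settings_test

import (
	"os"
	"path/filepath"
	"testing"

	"gopkg.in/yaml.v3"
)

// teamSettings mirrors the fields shown in the team settings example above;
// the struct and its tags are assumptions for illustration.
type teamSettings struct {
	Name         string `yaml:"name"`
	Notification struct {
		Emails                    []string `yaml:"emails"`
		MicrosoftTeamsWebhookURLs []string `yaml:"microsoft_teams_webhook_urls"`
	} `yaml:"notification"`
}

// TestTeamSettings checks that every file in teams/ is valid YAML and contains
// the fields required to opt in to the system.
func TestTeamSettings(t *testing.T) {
	paths, err := filepath.Glob("teams/*.yaml")
	if err != nil {
		t.Fatal(err)
	}
	for _, path := range paths {
		data, err := os.ReadFile(path)
		if err != nil {
			t.Fatalf("%s: %v", path, err)
		}
		var settings teamSettings
		if err := yaml.Unmarshal(data, &settings); err != nil {
			t.Errorf("%s: invalid YAML: %v", path, err)
			continue
		}
		if settings.Name == "" || len(settings.Notification.Emails) == 0 {
			t.Errorf("%s: name and notification.emails are required to opt in", path)
		}
	}
}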

Terraform Module

In order for the Flexera One Automation engine to make calls to external APIs, such as those of cloud providers, to find and possibly delete waste resources, it needs to use Credentials, which represent the method of authentication to an API. For example, on AWS you would create an IAM role with an assume role policy that trusts the Flexera One platform, and then enter the ARN of the role to create a Flexera One Credential.

Setting up these roles and the permissions they require for the Automation templates to work gets very repetitive fairly quickly, given that teams have multiple accounts for each of their projects (staging, production in both US and EU zones, etc.), so I needed to come up with an automated solution. Fortunately, our teams were already using Terraform to manage infrastructure as code as part of our DevOps practice, and it already had support for setting up roles and permissions for the cloud providers we use. So I put together a Terraform module that allows teams to set up all of the necessary roles and permissions by importing it into the Terraform they are already using to manage their cloud accounts.

However, there was still one problem: teams would still need to take outputs from the Terraform module, like the role ARNs on AWS, and use them to create Flexera One Credentials manually. I didn’t want a partially automated solution with an error-prone data entry step as part of the first impression for teams onboarding to the system, so I set out to solve the problem.

Flexera One is an API-first platform, so all of the CRUD operations on Credentials you can perform in the user interface are actually performed through the Flexera One API. We write our microservices in Go using the Goa framework, which has the nice side effect of generating an API client library as Go packages for each API we design. Coincidentally, Terraform is also written in Go and has a Terraform Plugin SDK that allows you to extend its functionality by creating a Custom Provider. Combining the Goa client library for the Flexera One Credentials API and the Terraform Plugin SDK, I was able to put together the flexeracredentials Terraform provider plugin in about a week.
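
To give a feel for how little glue this takes, here is a hypothetical sketch of how the flexeracredentials_aws_sts resource might be declared with the Terraform Plugin SDK. The schema mirrors the attributes used in the module below, but the function names are illustrative, and the real provider’s CRUD functions call the Goa-generated Flexera One Credentials client rather than the stubs shown here:

package provider

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// resourceAWSSTS sketches the flexeracredentials_aws_sts resource; the real
// create/read/update/delete functions use the Goa-generated Flexera One
// Credentials API client instead of these illustrative stubs.
func resourceAWSSTS() *schema.Resource {
	return &schema.Resource{
		CreateContext: resourceAWSSTSCreate,
		ReadContext:   resourceAWSSTSRead,
		UpdateContext: resourceAWSSTSUpdate,
		DeleteContext: resourceAWSSTSDelete,

		Schema: map[string]*schema.Schema{
			"identifier":  {Type: schema.TypeString, Required: true, ForceNew: true},
			"name":        {Type: schema.TypeString, Required: true},
			"description": {Type: schema.TypeString, Optional: true},
			"role_arn":    {Type: schema.TypeString, Required: true},
			"tags": {
				Type:     schema.TypeMap,
				Optional: true,
				Elem:     &schema.Schema{Type: schema.TypeString},
			},
		},
	}
}

func resourceAWSSTSCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Create the Credential via the Flexera One API, then record its identifier.
	d.SetId(d.Get("identifier").(string))
	return nil
}

func resourceAWSSTSRead(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Fetch the Credential from the Flexera One API and refresh the state here.
	return nil
}

func resourceAWSSTSUpdate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Update the Credential via the Flexera One API.
	return resourceAWSSTSRead(ctx, d, meta)
}

func resourceAWSSTSDelete(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	// Delete the Credential via the Flexera One API.
	return nil
}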

Now that I had the Terraform provider plugin, the Terraform module could both create the necessary roles and permissions on the cloud provider side and create the Flexera One Credentials to use them. Here is an example of how the module sets up an IAM role with the permissions needed for the AWS Unused Volumes Automation template and hooks it up to a Flexera One Credential:

data "aws_iam_policy_document" "aws_unused_volumes_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = [local.flexera_trust_aws_account_arn]
    }

    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [data.flexeracredentials_project.current.org_id]
    }
  }
}

data "aws_iam_policy_document" "aws_unused_volumes_read_write" {
  statement {
    actions = [
      "ec2:DescribeRegions",
      "ec2:DescribeVolumes",
      "ec2:CreateTags",
      "ec2:CreateSnapshot",
      "ec2:DescribeSnapshots",
      "ec2:DeleteVolume",
    ]
    resources = ["*"]
  }

  statement {
    actions   = ["cloudwatch:GetMetricStatistics"]
    resources = ["*"]

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["true"]
    }
  }
}

resource "aws_iam_role" "aws_unused_volumes" {
  name               = "FinOpsPoliciesAWSUnusedVolumes"
  description        = "The role for the AWS Unused Volumes Policy Template which may be applied according to the FinOps Policy Settings"
  assume_role_policy = data.aws_iam_policy_document.aws_unused_volumes_assume_role.json
}

resource "aws_iam_role_policy" "aws_unused_volumes_read_write" {
  name   = "FinOpsPoliciesAWSUnusedVolumesReadWrite"
  policy = data.aws_iam_policy_document.aws_unused_volumes_read_write.json
  role   = aws_iam_role.aws_unused_volumes.id
}

resource "flexeracredentials_aws_sts" "aws_unused_volumes" {
  identifier  = lookup(var.credentials_identifier_overrides, "aws_unused_volumes", "FinOps_Policy_AWS_Unused_Volumes_STS")
  name        = lookup(var.credentials_name_overrides, "aws_unused_volumes", "FinOps Policy AWS Unused Volumes STS")
  description = "The AWS IAM role for the AWS Unused Volumes Policy Template which may be applied according to the FinOps Policy Settings"
  role_arn    = aws_iam_role.aws_unused_volumes.arn

  tags = {
    provider = "aws"
  }
}
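
A team would then consume the module from the Terraform it already uses to manage a given cloud account, roughly like the following sketch (the module source address is hypothetical, and I am assuming the override maps default to empty so they only need entries when a team wants non-default Credential identifiers or names):

module "finops_policies" {
  # Hypothetical source address; the real module lives in an internal repository
  source = "git::https://github.example.com/flexera/finops-policies-terraform.git"

  # Optional overrides referenced by the resources above
  credentials_identifier_overrides = {}
  credentials_name_overrides       = {}
}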

Automation Template

Of course, there needs to be something that ties the settings, Flexera One Credentials, and the other components together in order to actually have a system, and that is the “FinOps Policies” Automation Template (previously the “FinOps Policies” Policy Template). The Automation/Policy Template Language is a declarative language with some imperative languages mixed in for good measure. Its execution model collects data sources, performs checks to determine whether incidents need to be raised, and performs actions in response to changes in incident state.

The data sources in an Automation template are usually either definitions for retrieving API data or JavaScript scripts that manipulate and transform data (a rough sketch of both follows the list below). These are some of the things the “FinOps Policies” template does with data sources:

  • It uses the GitHub REST API to fetch the latest settings release artifact from the internal repository as well as data about all the cloud accounts and which teams own them. That other data comes from another Git repository that our Cloud Enablement team was already maintaining.
  • It uses the Flexera One API to get the current status of Credentials, applied Automation templates, and their incidents.
  • It uses the InfluxDB API to determine when it last wrote metrics so it can narrow down the list of Automation incidents above to only those that are still relevant.
  • It uses multiple JavaScript script data sources to weave all the other data sources together and prepare the data to be meaningful for the checks and incident actions.
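
To give a feel for the shape of these data sources, here is a rough sketch of a GitHub request data source feeding a JavaScript script data source in the template language. The repository path, credential, field names, and script contents are illustrative rather than copied from the actual template, and ds_applied_policies stands in for another data source that fetches the currently applied policies:

datasource "ds_settings_release" do
  request do
    auth $auth_github
    host "api.github.com"
    path "/repos/ORG/REPO/releases/latest" # placeholder path for the internal settings repository
    header "Accept", "application/vnd.github.v3+json"
  end
  result do
    encoding "json"
    field "tag_name", jmes_path(response, "tag_name")
    field "assets", jmes_path(response, "assets")
  end
end

datasource "ds_desired_state" do
  run_script $js_desired_state, $ds_settings_release, $ds_applied_policies
end

script "js_desired_state", type: "javascript" do
  parameters "release", "applied"
  result "desired"
  code <<-EOS
  // Combine the latest settings release with the currently applied policies to
  // work out which applies need to be created, updated, or deleted
  var desired = { release: release, applied: applied };
EOS
end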

One place where I had to look around for a solution that fit was how and where to store the metrics. Ultimately, I decided to use InfluxDB Cloud since it has a push model that makes it compatible with how Automation templates work, it is managed and elastic so my team doesn’t need to deal with extra maintenance, and its Flux query language provides a rich tool for exploring and manipulating time series data.

Something you need to keep in mind while writing Automation templates is that the Cloud Workflow Language execution system is somewhat of a legacy component and does not have the same performance characteristics as other parts of the Automation system, so it is often good to do some of the processing you might have done in the action phase up front in a JavaScript data source instead. An example of where I did this is the construction of the InfluxDB line protocol body that will need to be posted later to write metrics. The system has a number of metrics to keep track of, so I wrote a JavaScript function which uses the Underscore library to build up the body:

// Appends one InfluxDB line protocol entry to metrics.influxdb_lines;
// `metrics` and `timestamp` are defined earlier in the data source.
var add_influxdb_line = function (measurement, tags, fields, types) {
  metrics.influxdb_lines += measurement;

  // Tags are sorted by key and special characters are escaped per the line protocol
  if (!_.isEmpty(tags)) {
    metrics.influxdb_lines += ',' + _.map(_.sortBy(_.map(tags, function (value, key) {
      switch (typeof (value)) {
      case 'string':
        return { key: key, value: value.replace(/[,= ]/g, '\\$&') };
      default:
        return { key: key, value: value };
      }
    }), 'key'), function (tag) {
      return tag.key + '=' + tag.value;
    }).join(',');
  }

  // Fields are also sorted by key; strings are quoted and escaped, and numbers are
  // suffixed with 'i' or 'u' when an integer or unsigned type is specified for them
  metrics.influxdb_lines += ' ' + _.map(_.sortBy(_.map(fields, function (value, key) {
    switch (typeof (value)) {
    case 'string':
      return { key: key, value: '"' + value.replace(/["\\]/g, '\\$&') + '"' };
    case 'number':
      if (!_.isEmpty(types)) {
        var type = types[key];

        switch (type) {
        case 'i':
          return { key: key, value: (value >= 0 ? Math.floor(value) : Math.ceil(value)) + 'i' };
        case 'u':
          return { key: key, value: Math.floor(Math.abs(value)) + 'u' };
        }
      }
      // fallthrough
    default:
      return { key: key, value: value };
    }
  }), 'key'), function (field) {
    return field.key + '=' + field.value;
  }).join(',') + ' ' + timestamp + '\n';
};
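
For instance, assuming metrics.influxdb_lines starts out empty and timestamp holds the nanosecond timestamp for the run, a call like the following (with illustrative measurement, tag, and field names) appends a single line in the InfluxDB line protocol:

// Appends a line like:
// applied_policies,team=FinOps applied=12i,errors=0i 1623971828000000000
add_influxdb_line(
  'applied_policies',
  { team: 'FinOps' },
  { applied: 12, errors: 0 },
  { applied: 'i', errors: 'i' }
);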

After the data sources are evaluated, the Automation template performs checks to determine whether it needs to raise or resolve incidents. These are some of the checks the “FinOps Policies” template does:

  • It checks if there are any errors which would prevent applying Automation templates for any of the cloud accounts belonging to teams that have opted in via settings. These errors are usually Credentials that have not been set up yet, but they can also be missing metadata in the list of cloud accounts from the metadata GitHub repository. Either way the team that owns the cloud account or accounts in question should probably be notified about the errors so they can get them fixed.
  • It checks if there are any applied Automation templates that need to be created, updated, or deleted in the cloud accounts. These changes can come from several sources, including a new team opting in via the settings, an existing team updating their settings, a new version of one of the Automation templates becoming available, or a new cloud account belonging to one of the teams being created. If an incident is raised for this check, it triggers an action but does not notify since that might drive us crazy.
  • It checks if there are any applied Automation templates or associated Incidents with errors.
  • It checks if there are any metrics to write. This always raises an incident since there are always metrics to write and again does not notify.

Each check in an Automation template can have any number of actions, which can be either Cloud Workflow actions or email actions. For the applied Automation template CRUD and metrics-to-write incidents, there are Cloud Workflow actions that make the required Flexera One or InfluxDB API calls to perform their actions. The others send their notification emails to a Microsoft Teams channel the FinOps team monitors, but each team’s settings also include a notification.microsoft_teams_webhook_urls list since the Automation template does not have a way to send only the relevant subset of the incident details to the affected team’s email address.

I solved this by using Cloud Workflow to determine which teams need to be notified about particular items and group them together into a Microsoft Teams notification Card that it can post to the webhook URL, since Cloud Workflow does not currently have any email functionality. The other Cloud Workflow actions can also run into errors, so I used the same webhook functionality in all of them to notify the FinOps team about those as well. While building this functionality, I noticed that only 10 Card “sections” actually show up in Teams when more are posted, and the notification would not show up at all if there were many more. To work around this limitation, I added functionality to truncate the Card “sections” down to 9 if there were more than 10 and then add a tenth section which indicates more details can be found in the Automation template incident in Flexera One. Here are the Cloud Workflow definitions that implement the Microsoft Teams notification and error reporting functionality:

define report_errors($errors, $description, $rs_project_name, $rs_project_id, $rs_org_id, $notification_webhook_urls) do
  if size($errors) > 0
    call construct_card(
      $rs_project_name + ' (Project ID: ' + $rs_project_id + '): ' + size($errors) + ' Error(s) ' + $description,
      {sections: $errors}
    ) retrieve $card
    call truncate_card_sections($card, $rs_project_id, $rs_org_id) retrieve $card

    foreach $notification_webhook_url in $notification_webhook_urls do
      call notify_webhook($notification_webhook_url, $card)
    end
  end
end

define construct_card($title, $card) return $card do
  $card['@type'] = 'MessageCard'
  $card['@context'] = 'https://schema.org/extensions'
  $card['summary'] = $title
  $card['themeColor'] = '85bb65' # dollar bill
  $card['title'] = $title
end

define truncate_card_sections($card, $rs_project_id, $rs_org_id) return $card do
  $sections = $card['sections']

  # Microsoft Teams Notification Webhook posts only display up to 10 sections
  if size($sections) > 10
    $card['sections'] = $sections[..8] + [{
      title: 'Notification Truncated',
      text: 'There may be more issues than visible in this notification. More information should be available under [Policy Incidents](https://app.flexera.com/orgs/' + $rs_org_id + '/policy/projects/' + $rs_project_id + '/incidents).'
    }]
  end
end

define notify_webhook($url, $body) do
  $response = http_post(url: $url, body: $body)
  call check_response($response, 'notify webhook')
end

define notify_webhook_and_append_if_errors($url, $body, $errors) return $errors do
  $response = http_post(url: $url, body: $body)
  call check_response_and_append_if_error($response, 'notify webhook', $errors) retrieve $errors
end

define check_response($response, $request_description) do
  if $response['code'] > 299 || $response['code'] < 200
    raise 'Unexpected status code from ' + $request_description + ' request: ' + $response['code'] + ' body: ' + to_s($response['body'])
  end
end

define check_response_and_append_if_error($response, $request_description, $errors) return $errors do
  if $response['code'] > 299 || $response['code'] < 200
    $errors << {
      title: 'Unexpected status code from ' + $request_description + ' request',
      facts: [
        {name: 'Code', value: '`' + $response['code'] + '`'},
        {name: 'Body', value: '`' + to_s($response['body']) + '`'}
      ]
    }
  end
end

Grafana Metrics

One similarity between the practices of DevOps and FinOps is the importance of visibility into metrics. Within Flexera, all of our DevOps teams are already using Grafana to visualize the metrics coming out of their microservices and we already have a process in place for authenticating to Grafana through our identity provider. Grafana also supports a wide range of backends including InfluxDB, so it was an easy choice for visualizing all of the metrics the “FinOps Policies” system will be writing.
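
As a rough illustration of what backs one of these panels, a Grafana panel might run a Flux query along these lines against InfluxDB; the bucket, measurement, field, and tag names here are hypothetical rather than the system’s actual schema:

from(bucket: "finops-policies")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "savings_events" and r._field == "estimated_savings")
  |> group(columns: ["team"])
  |> aggregateWindow(every: 1mo, fn: sum)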

When building Grafana dashboards for the system, I focused on a few types of users:

  1. FinOps team members: my manager and I care about how the system is working, how well it is working for teams, which teams we should reach out to and start onboarding next, and which teams might need some help.
  2. Managers and executives: they care about hitting their budgets and knowing which of their teams might need to prioritize cleaning up cloud waste costs.
  3. Individual team members: engineers care about how their own team’s infrastructure is doing with regard to costs.

Here is what a few of the panels from our “FinOps Policies” Grafana dashboards look like:

[Screenshots of several “FinOps Policies” Grafana dashboard panels]

You may have noticed we have metrics for both the estimated potential savings from triggered incidents and the estimated savings from resolved incidents. These are the amounts of money that would continue to be spent on the cloud resources the Automation templates have discovered if they are not eliminated, or that is no longer spent once they are eliminated.

We are able to get this information because all of the Cost Policies (Automation templates) the “FinOps Policies” Automation template applies include resource-level estimated cost information in a standardized format within the incident data available through the Flexera One API. Some of the Automation templates, such as AWS Unused Volumes and AWS Old Snapshots, use the Flexera One Cloud Cost Optimization (formerly Optima) API to calculate these estimates based on recent data from our cloud bills, while others use cloud pricing APIs since that data is not easily available otherwise (this is something I worked with the team that owns the Cost Automation templates to implement). For example, the AWS Unused IP Addresses Automation template uses the AWS Price List API to find the hourly cost of unallocated Elastic IP addresses to calculate its estimates.

For the estimated savings from resolved incidents, the Automation template takes the cost estimate from the incident at the time it is resolved and records it as a savings event. While there are some caveats to this method, such as missing cleanup that occurs outside of resolving an incident, it at least provides a conservative estimate of the realized savings from using the Automation templates to find and eliminate cloud waste.

Conclusion

The “FinOps Policies” system is now running with a few Cost Automation templates and has a number of engineering teams within Flexera using it. However, there are still more engineering teams that have not started using it and more Cost Automation templates that would make sense for us to apply (or we may even write some new ones), so this is still an ongoing project and I imagine I will keep improving it. This parallels the FinOps Lifecycle, where a company continuously moves through the phases of a journey in order to constantly improve.

Also, I’m finding some of the things I’ve learned in building the system are applicable to other projects. For example, while working on building metrics for Unit Economics, where we were pulling metrics from a different system, it made sense to use InfluxDB to aggregate those metrics so we could use the Flux query language to look at them at a monthly or quarterly scale.

Finally, by using the Flexera One platform to solve the problems we need to solve on the FinOps team, we are not only proving it out and finding bugs, but also building what could be considered prototypes of potential features that might make sense to add to Flexera One in the future so we can provide the same benefits to our customers as well.