You may have heard of DevOps and may even practice it, but you may not have heard about FinOps. At least I had not until a little over a year ago when my manager asked if I would like to be the initial DevOps engineer on the FinOps team he was starting. What I quickly learned is FinOps is a framework for dealing with the variable nature of cloud computing costs in a company in a similar way to how DevOps is a framework for managing the deployment and operation of your software.
Our FinOps team currently consists of my manager who focuses on more of the Analyst side of things such as forecasting, budgeting, and working with other teams on those things or when issues arise, while I focus on building solutions to help us and other teams measure and save on cloud costs. One of the first things we set out to do when we got started as a team was to take better advantage of Flexera One Automation (previously Policies) and the available Cost Policies it provides. We wanted the solution to give us visibility into what waste costs and potential savings the policies identify as well as what savings are realized when the policies take automatic actions to eliminate waste resources, but we also wanted to have the teams that actually owned those resources properly notified and fully in control of how the policies interacted with them.
Having come into Flexera through the RightScale acquisition, I had previously worked on the team that built the Policies engine both before and after the acquisition. I did a lot of the initial work on the Policy template language parser and have an in-depth understanding of the Policy execution engine and its capabilities. With this knowledge, I set out to build our FinOps solution for Policies using Policies themselves. I named it the FinOps Policies system, which is a little funny now that Flexera One Policies are called Flexera One Automation instead. I had the initial FinOps Policies system working late last year with a few early adopter teams, and I have implemented more functionality and onboarded more teams since then.
The system is made up of four main components which I will describe in more detail further on:
The first component of the system is the Git repository with the Policy/Automation settings. For the settings, I decided to use YAML since it is both machine readable and more forgiving than JSON while still easily translated into it (which is necessary since the Automation engine does not currently understand YAML but it does understand JSON).
The default settings for each of the Automation templates that should be applied in each team’s cloud accounts are specified in a top level settings.yaml file:
With the information provided for each template in settings.yaml, my template has almost everything it needs to apply them to a team’s account. In the example of the AWS Unused Volumes template above, the team_parameters map specifies that the param_email parameter for a team should be set to the value of the team’s notification.emails setting when it is applied in their cloud accounts. There are also some optional_parameter_sets specified which allow teams to easily add common customizations to their own settings. The two common customizations shown here are automatic_actions, which enables the template to automatically delete waste resources when they are discovered, and only_eu_regions, which is helpful for the AWS accounts used for services in our Flexera One EU zone, where we disable API access to other regions via an AWS SCP in order to ensure we comply with European Union requirements and regulations.
Individual team settings files are added in a teams/ subdirectory:
name: FinOps
notification:
  emails:
    - Policy Notifications - Team FinOps <abcdef12.FLEXERA.onmicrosoft.com@amer.teams.ms>
  microsoft_teams_webhook_urls:
    - https://flexera.webhook.office.com/webhookb2/abcdef12-3456-7890-abcd-ef1234567890@abcdef12-3456-7890-abcd-ef1234567890/IncomingWebhook/abcdef1234567890abcdef1234567890/abcdef12-3456-7890-abcd-ef1234567890
project_policy_settings:
  1234567890: # finops-team-sandbox
    - name: AWS Unused Volumes
      enable_optional_parameter_sets: [automatic_actions]
The name and notification settings are all that is required for a team to opt in to the system. As you saw above, the notification.emails array is used when applying templates in the team’s cloud accounts. The notification.microsoft_teams_webhook_urls are used when the template needs to notify the team about any configuration or run time issues. You may have guessed our email address actually goes to a Microsoft Teams channel as well; Teams is our primary mode of communication within Flexera (I have obfuscated the email address and URL so you cannot actually start spamming our notification channel). This example only shows a single customization for a single AWS account, where we want to enable automatic deletion of resources in our sandbox account, but the settings support full customization of the Automation apply parameters both for all of a team’s cloud accounts and for individual cloud accounts.
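To make the mapping concrete, here is a minimal sketch of how a team_parameters entry and an enabled optional parameter set might be resolved into the parameters for one project. It is not the actual implementation: the shape of the settings.yaml entry is inferred from the description above, the parameter names inside optional_parameter_sets (param_automatic_action, param_allowed_regions) and their values are illustrative assumptions, and the email address is a placeholder.

```javascript
// Hypothetical defaults for one template, shaped after the description of settings.yaml.
// The parameter names and values inside optional_parameter_sets are assumptions.
const templateDefaults = {
  name: "AWS Unused Volumes",
  team_parameters: { param_email: "notification.emails" },
  optional_parameter_sets: {
    automatic_actions: { param_automatic_action: ["Delete Volumes"] },
    only_eu_regions: { param_allowed_regions: ["eu-central-1", "eu-west-1"] },
  },
};

// Team settings mirroring the teams/ example (placeholder email address).
const teamSettings = {
  name: "FinOps",
  notification: { emails: ["finops-notifications@example.com"] },
  project_policy_settings: {
    "1234567890": [
      { name: "AWS Unused Volumes", enable_optional_parameter_sets: ["automatic_actions"] },
    ],
  },
};

// Look up a dotted path such as "notification.emails" in an object.
function getPath(obj, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), obj);
}

// Build the parameters to use when applying a template in one project.
function resolveParameters(defaults, team, projectId) {
  const params = {};
  // Copy team-level values into the parameters named by team_parameters.
  for (const [param, path] of Object.entries(defaults.team_parameters || {})) {
    params[param] = getPath(team, path);
  }
  // Layer on any optional parameter sets the team enabled for this project.
  const overrides = (team.project_policy_settings || {})[projectId] || [];
  const override = overrides.find((o) => o.name === defaults.name);
  const enabled = (override && override.enable_optional_parameter_sets) || [];
  for (const setName of enabled) {
    const set = (defaults.optional_parameter_sets || {})[setName];
    if (!set) {
      throw new Error("Unknown optional parameter set: " + setName);
    }
    Object.assign(params, set);
  }
  return params;
}

console.log(resolveParameters(templateDefaults, teamSettings, "1234567890"));
```

Rejecting an unknown set name up front mirrors the kind of validation the repository tests perform, so a typo in a team file fails fast instead of silently applying defaults.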
To validate that the YAML settings files will actually work and that the parameters specified match any constraints in the actual Automation templates, I added automated tests to the repository. Since Flexera has standardized on Go for writing our microservices, I wrote most of these tests in Go as well, so engineers from any team should be familiar enough with them to debug on their own. When someone makes a pull request to the repository, GitHub Actions run these validations, and we also use a CODEOWNERS file to ensure teams are involved when their settings change. When a pull request is merged to the main branch, more GitHub Actions run and create a GitHub Release with the YAML settings files merged together and translated to JSON as an artifact (I used the gojq command line tool to achieve this). The latest GitHub Release represents the current desired state of the settings for the system.
In order for the Flexera One Automation engine to make calls to external APIs, such as those of cloud providers in order to find and possibly delete waste resources, it needs to use Credentials which represent the method of authentication to the API. For example, on AWS you would create an IAM role which has an assume role policy which trusts the Flexera One platform and then you would enter the ARN of the role to create a Flexera One Credential.
Setting up these roles and the associated permissions that they require in order for the Automation templates to work is going to get very repetitive fairly quickly given teams will have multiple accounts for each of their projects (staging, production in both US and EU zones, etc.), so I needed to come up with an automated solution. Fortunately, our teams were already using Terraform to manage infrastructure as code as part of our DevOps practice and it already had support for the roles and permissions setup for the cloud providers we use. So I put together a Terraform module that allows teams to set up all of the necessary roles and permissions by importing it into the Terraform they are already using to manage their cloud accounts.
However, there was still one problem: teams would still need to take outputs from the Terraform module like the role ARNs on AWS and use them to create Flexera One Credentials manually. I didn’t want to only have a partially automated solution with an error prone data entry step as part of the first impression for teams onboarding to the system, so I set out to solve the problem.
Flexera One is an API-first platform, so all of the CRUD operations on Credentials you can perform in the user interface are actually performed through the Flexera One API. We write the microservices in Go using the Goa framework, which has the nice side effect of generating an API client library as Go packages for each API we design. Coincidentally, Terraform is also written in Go and has a Terraform Plugin SDK that allows you to extend its functionality by creating a Custom Provider. Combining the Goa client library for the Flexera One Credentials API and the Terraform Plugin SDK, I was able to put together the flexeracredentials Terraform provider plugin in about a week.
Now that I had the Terraform provider plugin, the Terraform module could both create the necessary roles and permissions on the cloud provider side and create the Flexera One Credentials to use them. Here is an example of how the module sets up an IAM role with the permissions needed for the AWS Unused Volumes Automation template and hooks it up to a Flexera One Credential:
data "aws_iam_policy_document" "aws_unused_volumes_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "AWS"
      identifiers = [local.flexera_trust_aws_account_arn]
    }
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = [data.flexeracredentials_project.current.org_id]
    }
  }
}

data "aws_iam_policy_document" "aws_unused_volumes_read_write" {
  statement {
    actions = [
      "ec2:DescribeRegions",
      "ec2:DescribeVolumes",
      "ec2:CreateTags",
      "ec2:CreateSnapshot",
      "ec2:DescribeSnapshots",
      "ec2:DeleteVolume",
    ]
    resources = ["*"]
  }
  statement {
    actions   = ["cloudwatch:GetMetricStatistics"]
    resources = ["*"]
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["true"]
    }
  }
}

resource "aws_iam_role" "aws_unused_volumes" {
  name               = "FinOpsPoliciesAWSUnusedVolumes"
  description        = "The role for the AWS Unused Volumes Policy Template which may be applied according to the FinOps Policy Settings"
  assume_role_policy = data.aws_iam_policy_document.aws_unused_volumes_assume_role.json
}

resource "aws_iam_role_policy" "aws_unused_volumes_read_write" {
  name   = "FinOpsPoliciesAWSUnusedVolumesReadWrite"
  policy = data.aws_iam_policy_document.aws_unused_volumes_read_write.json
  role   = aws_iam_role.aws_unused_volumes.id
}

resource "flexeracredentials_aws_sts" "aws_unused_volumes" {
  identifier  = lookup(var.credentials_identifier_overrides, "aws_unused_volumes", "FinOps_Policy_AWS_Unused_Volumes_STS")
  name        = lookup(var.credentials_name_overrides, "aws_unused_volumes", "FinOps Policy AWS Unused Volumes STS")
  description = "The AWS IAM role for the AWS Unused Volumes Policy Template which may be applied according to the FinOps Policy Settings"
  role_arn    = aws_iam_role.aws_unused_volumes.arn
  tags = {
    provider = "aws"
  }
}
Of course, there needs to be something that ties the settings, Flexera Credentials, and possibly some other components together in order to actually have a system and that is the “FinOps Policies” Automation Template (previously the “FinOps Policies” Policy Template). The Automation/Policy Template Language is a declarative language with some imperative languages mixed in for good measure with an execution model where data sources are collected, a check is performed to determine if an incident needs to be raised, and actions are performed in response to changes in incident state.
The data sources in an Automation template are usually either definitions for retrieving API data or executing JavaScript to manipulate and transform data. These are some of the things the “FinOps Policies” template does with data sources:
One place where I had to look around for a solution that fit was how and where to store the metrics. Ultimately, I decided to use InfluxDB Cloud: it has a push model that makes it compatible with how Automation templates work, it is managed and elastic so my team doesn’t need to deal with extra maintenance, and it provides a rich tool for exploring and manipulating its time series data in its Flux query language.
Something you need to keep in mind while writing Automation templates is the Cloud Workflow Language execution system is somewhat of a legacy component and does not have the same performance characteristics as other parts of the Automation system, so it is often good to do some of the processing you might have done in the action phase up front in a JavaScript data source instead. An example where I did this is the construction of the InfluxDB line protocol body that will need to be posted later to write metrics. The system has a number of metrics to keep track of so I wrote a JavaScript function which uses the Underscore library to build up the body:
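As a rough illustration of the idea (not the function from the template itself), the sketch below builds an InfluxDB line protocol body from a list of points in plain JavaScript, without Underscore. The measurement, tag, and field names are hypothetical, and it assumes timestamps are given in seconds and that the write request asks for second precision.

```javascript
// Escape characters that are special in tag keys, tag values, and field keys.
function escapeKey(value) {
  return String(value).replace(/([,= ])/g, "\\$1");
}

// Measurement names only need commas and spaces escaped.
function escapeMeasurement(value) {
  return String(value).replace(/([, ])/g, "\\$1");
}

// Numbers and booleans are written bare (numbers are stored as floats);
// everything else becomes a double-quoted string field.
function formatFieldValue(value) {
  if (typeof value === "number" || typeof value === "boolean") {
    return String(value);
  }
  return '"' + String(value).replace(/(["\\])/g, "\\$1") + '"';
}

// Render one point as "measurement,tag=value field=value timestamp".
function toLine(point) {
  const tagSet = Object.keys(point.tags || {})
    .sort() // sorted tag keys are recommended for write performance
    .map((key) => "," + escapeKey(key) + "=" + escapeKey(point.tags[key]))
    .join("");
  const fieldSet = Object.keys(point.fields)
    .map((key) => escapeKey(key) + "=" + formatFieldValue(point.fields[key]))
    .join(",");
  return escapeMeasurement(point.measurement) + tagSet + " " + fieldSet + " " + point.timestamp;
}

// Join all points into a single newline-separated request body.
function buildLineProtocolBody(points) {
  return points.map(toLine).join("\n");
}

// Hypothetical metric names, for illustration only.
const exampleBody = buildLineProtocolBody([
  {
    measurement: "estimated savings",
    tags: { template: "AWS Unused Volumes", team: "FinOps" },
    fields: { amount: 12.5, currency: "USD" },
    timestamp: 1700000000,
  },
]);
console.log(exampleBody);
```

Building the whole body in the data source phase means the later Cloud Workflow action only has to post one prepared string.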
After the data sources are evaluated the Automation performs checks to determine if it needs to raise or resolve incidents. These are some of the checks the “FinOps Policies” template does:
Each check in an Automation template can have any number of actions which can either be Cloud Workflow actions or email actions. For the applied Automation template CRUD and metrics to write incidents, there are Cloud Workflow actions that make the required Flexera One or InfluxDB API calls to perform their actions. The others send their notification emails to a Microsoft Teams channel the FinOps team monitors, but each team’s settings also include a notification.microsoft_teams_webhook_urls list, since the Automation template does not have a way to send a subset of the incident details specifically to the affected team’s email address.
I solved this by using Cloud Workflow to determine which teams need to be notified about particular items and grouping them together into a Microsoft Teams notification Card it can post to the webhook URL, since Cloud Workflow does not currently have any email functionality. The other Cloud Workflow actions can also run into errors, so I used the webhook functionality in all of them to notify the FinOps team about those as well. While building this functionality I noticed that only 10 Card “sections” actually show up in Teams when more are posted, and that the notification would not show up at all if there were many more. To work around this limitation, I added functionality to truncate the Card “sections” down to 9 if there were more than 10 and then add a tenth section which indicates more details can be found in the Automation template incident in Flexera One. Here are the Cloud Workflow Definitions that implement the Microsoft Teams notification and error reporting functionality:
define report_errors($errors, $description, $rs_project_name, $rs_project_id, $rs_org_id, $notification_webhook_urls) do
  if size($errors) > 0
    call construct_card(
      $rs_project_name + ' (Project ID: ' + $rs_project_id + '): ' + size($errors) + ' Error(s) ' + $description,
      {sections: $errors}
    ) retrieve $card
    call truncate_card_sections($card, $rs_project_id, $rs_org_id) retrieve $card
    foreach $notification_webhook_url in $notification_webhook_urls do
      call notify_webhook($notification_webhook_url, $card)
    end
  end
end

define construct_card($title, $card) return $card do
  $card['@type'] = 'MessageCard'
  $card['@context'] = 'https://schema.org/extensions'
  $card['summary'] = $title
  $card['themeColor'] = '85bb65' # dollar bill
  $card['title'] = $title
end

define truncate_card_sections($card, $rs_project_id, $rs_org_id) return $card do
  $sections = $card['sections']
  # Microsoft Teams Notification Webhook posts only display up to 10 sections
  if size($sections) > 10
    $card['sections'] = $sections[..8] + [{
      title: 'Notification Truncated',
      text: 'There may be more issues than visible in this notification. More information should be available under [Policy Incidents](https://app.flexera.com/orgs/' + $rs_org_id + '/policy/projects/' + $rs_project_id + '/incidents).'
    }]
  end
end

define notify_webhook($url, $body) do
  $response = http_post(url: $url, body: $body)
  call check_response($response, 'notify webhook')
end

define notify_webhook_and_append_if_errors($url, $body, $errors) return $errors do
  $response = http_post(url: $url, body: $body)
  call check_response_and_append_if_error($response, 'notify webhook', $errors) retrieve $errors
end

define check_response($response, $request_description) do
  if $response['code'] > 299 || $response['code'] < 200
    raise 'Unexpected status code from ' + $request_description + ' request: ' + $response['code'] + ' body: ' + to_s($response['body'])
  end
end

define check_response_and_append_if_error($response, $request_description, $errors) return $errors do
  if $response['code'] > 299 || $response['code'] < 200
    $errors << {
      title: 'Unexpected status code from ' + $request_description + ' request',
      facts: [
        {name: 'Code', value: '`' + $response['code'] + '`'},
        {name: 'Body', value: '`' + to_s($response['body']) + '`'}
      ]
    }
  end
end
One similarity between the practices of DevOps and FinOps is the importance of visibility into metrics. Within Flexera, all of our DevOps teams are already using Grafana to visualize the metrics coming out of their microservices and we already have a process in place for authenticating to Grafana through our identity provider. Grafana also supports a wide range of backends including InfluxDB, so it was an easy choice for visualizing all of the metrics the “FinOps Policies” system will be writing.
When building Grafana dashboards for the system, I focused on a few types of users:
Here are what a few of the panels from our “FinOps Policies” Grafana dashboards look like:
You may have noticed we have metrics for estimated potential savings from triggered incidents and estimated savings from resolved incidents. These are the amounts of money that would be spent on the cloud resources the Automation templates have discovered if they are not eliminated, or not spent if they are eliminated.
We are able to get this information because all of the Cost Policies (Automation templates) the “FinOps Policies” Automation template applies include resource level estimated cost information in a standardized format within the incident data available through the Flexera One API. Some of the Automation templates, such as AWS Unused Volumes and AWS Old Snapshots, use the Flexera One Cloud Cost Optimization (formerly Optima) API to calculate these estimates based on recent data from our cloud bills, while others use cloud pricing APIs since that data is not easily available otherwise (this is something I worked with the team that owns the Cost Automation templates to implement). For example, the AWS Unused IP Addresses Automation template uses the AWS Price List API to find the hourly cost for unallocated Elastic IP addresses to calculate its estimates.
For the estimated savings from resolved incidents, the Automation template takes the cost estimate from the incident at the time it is resolved and records it as a savings event. While there are some caveats to this method such as it missing cleanup that may occur outside of resolving an incident, it at least provides a conservative estimate of the realized savings from using the Automation templates to find and eliminate cloud waste.
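A minimal sketch of that bookkeeping, under stated assumptions, might look like the following. The incident shape (team, template, currency, and a resources list with an estimated_cost per resource) and the measurement name are hypothetical stand-ins for the standardized incident data, not the real Flexera One API fields.

```javascript
// Turn a resolved incident into a single "savings event" data point.
// All field names here are illustrative assumptions, not the real incident schema.
function savingsEventFromResolvedIncident(incident, resolvedAtSeconds) {
  // Sum the per-resource estimates captured at the moment of resolution;
  // resources without a usable estimate contribute nothing.
  const amount = incident.resources.reduce(
    (sum, resource) => sum + (Number(resource.estimated_cost) || 0),
    0
  );
  return {
    measurement: "estimated_savings",
    tags: { team: incident.team, template: incident.template },
    fields: { amount: amount, currency: incident.currency },
    timestamp: resolvedAtSeconds,
  };
}

// Example: two unused volumes were cleaned up, so the incident resolved.
const resolvedIncident = {
  team: "FinOps",
  template: "AWS Unused Volumes",
  currency: "USD",
  resources: [
    { id: "vol-1", estimated_cost: 8 },
    { id: "vol-2", estimated_cost: 4.5 },
    { id: "vol-3" }, // no estimate available
  ],
};
const savingsEvent = savingsEventFromResolvedIncident(resolvedIncident, 1700000000);
console.log(savingsEvent);
```

Because only incidents that actually resolve produce an event, cleanup that happens outside an incident is never counted, which is why the resulting number stays on the conservative side.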
The “FinOps Policies” system is now running with a few Cost Automation templates and has a number of engineering teams within Flexera using it. However, there are still more engineering teams that have not started using it and more Cost Automation templates that would make sense for us to apply (or we may even write some new ones), so this is still an ongoing project and I imagine I will need to keep improving it. This parallels the FinOps Lifecycle, where a company continuously moves through the phases of a journey in order to constantly improve.
Also, I’m finding some of the things I’ve learned in building the system are applicable to other projects. For example, while working on building metrics for Unit Economics where we were pulling metrics from a different system, it made sense to use InfluxDB to aggregate those metrics so we could use the Flux query language to look at them at a monthly or quarterly scale.
Finally, by using the Flexera One platform to solve the problems we need to on the FinOps team, we are not only proving it out and finding bugs, but we are also building what could be considered prototypes of potential features that might make sense to add to Flexera One in the future so we can provide the same benefits to our customers as well.