Managing AWS Lambda Versions with AWS Step Functions: A comprehensive guide

Each AWS account has a regional default storage limit of 75 GB for Lambda functions and Lambda Layers (ZIP archives). This sounds like a lot, and it can even be increased on request, but many functions combined with frequent deployments can lead to storage pollution quite quickly. It therefore makes sense to clean up old deployment packages regularly. This blog post introduces an AWS Step Functions state machine designed to automate the identification and deletion of older Lambda function versions.

Starting point

AWS creates a new version for each Lambda function deployment and keeps them all available – just in case an older one is needed in the future. Every deployment package is stored and counts toward the overall size limit of 75 GB. Even though this amount of storage is quite large, it is advisable (and recommended by AWS) to regularly clean up older versions which are no longer needed.

As of now, AWS does not provide a direct solution to manage these versions, prompting developers to create custom solutions. One example of a custom solution is the use of a Lambda function like the Lambda Janitor by Yan Cui. Our idea for an implementation was a state machine – an automated and visual solution for the Lambda version management.

This state machine handles all Lambda functions (optionally targeting specific ones based on predefined criteria) and removes old versions based on a threshold that determines how many versions should be kept.
A visual representation of the final workflow, along with a brief explanation of the most important steps, is given below:

Retrieving and filtering the function names

The workflow begins by listing the names of all existing Lambda functions within the AWS region. The “lambda:ListFunctions” API call returns up to 50 items per call, plus a marker token when more results are available. To account for this, the paginator pattern is applied after the “Iterate Functions” Map state.
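The paginator pattern described above can be sketched in plain Python. This is not the actual state machine definition – just an illustration of the marker-based loop it models, assuming a boto3-style client that exposes `list_functions(Marker=...)`:

```python
def list_all_function_names(client):
    """Page through lambda:ListFunctions until no NextMarker is returned.

    In the state machine, this loop is modeled by routing back into the
    listing step while the response still carries a marker token.
    """
    names, marker = [], None
    while True:
        kwargs = {"Marker": marker} if marker else {}
        page = client.list_functions(**kwargs)
        names.extend(fn["FunctionName"] for fn in page["Functions"])
        marker = page.get("NextMarker")  # absent on the last page
        if not marker:
            return names
```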

In this first step, we transform the result with a ResultSelector to extract only the function names. By doing that, we get an output like this:

[Screenshot: state output containing only the function names]

To iterate through the array of Lambda function names, we use a Map state, which allows us to set a concurrency limit of 5. This limit helps control the number of simultaneous requests and prevents API throttling.

In our case, we only want to clean up versions of Lambda functions which belong to our project. Others, which are managed by the platform team, need to stay untouched.

The Lambda functions from our project share a specific name prefix, which makes it easy for us to filter all functions in the region. To do that, we added a “Choice” step that determines whether the state machine should continue the version cleanup or skip the current Lambda function. The prefix itself can be passed in via the state machine invocation event.
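The Choice condition boils down to a simple prefix check. A minimal equivalent in Python (the function name is illustrative, not part of the workflow):

```python
def should_clean(function_name, prefix):
    """Equivalent of the Choice state: process only functions whose
    name starts with the project prefix from the invocation event."""
    return function_name.startswith(prefix)
```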

[Screenshot: Choice state checking the function name against the prefix]

Getting all the available versions

In the following step, all available versions are retrieved using the “lambda:ListVersionsByFunction” API call. The outcome of this step – a list of all versions together with the name of the function currently being processed – is shown below:

[Screenshot: list of versions and the name of the current function]

Keeping the function name in this step by adding it to the ResultPath is mandatory, because we will need it again a few steps later when deleting the versions.

The “lambda:ListVersionsByFunction” API call also returns up to 50 items and a marker token to retrieve more results. To handle this pagination, the paginator pattern would have been required again. However, we decided not to implement it in order to keep the workflow simpler.

Deployments in our environment are infrequent enough that we would not reach more than 50 versions per Lambda function before the workflow runs again. In addition, the workflow, which is triggered by an Amazon EventBridge scheduler, could simply run more often to avoid this situation – for instance, every day instead of every week.

Removing the $LATEST version and getting the version count

The Lambda API returns the versions with the latest one (called $LATEST) first, followed by all other versions in ascending order – from oldest to newest. We decided not to rely on this implicit order, but to explicitly exclude the $LATEST version from all further processing steps and to sort the remaining version numbers – just in case AWS changes the response format.

Unfortunately, AWS Step Functions only provides a limited set of array and JSON processing functions – so-called intrinsic functions. A Lambda function is therefore required to perform the necessary work. Hopefully, AWS will add more power to the intrinsic function palette so that these kinds of simple helper functions are no longer necessary in the future.

The outcome of this step consists of the Lambda function name, the array of versions to be processed, and their count.
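A minimal sketch of such a helper Lambda might look as follows. The event and field names (`versions`, `function_name`, `version_count`) are assumptions for illustration, not the exact payload of the original workflow:

```python
def handler(event, context=None):
    """Hypothetical helper Lambda: drop $LATEST and sort the remaining
    version numbers ascending, so the oldest version sits at index 0."""
    numeric = sorted(
        (v for v in event["versions"] if v != "$LATEST"),
        key=int,  # numeric sort, so "10" comes after "9"
    )
    return {
        "function_name": event["function_name"],
        "versions": numeric,
        "version_count": len(numeric),
    }
```

Sorting with `key=int` matters: version numbers arrive as strings, and a plain lexicographic sort would place "10" before "9".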

Checking the number of available versions and removing outdated ones

In the next step, the state machine compares the number of available versions against a threshold. This is done with a “Choice” state. In our case, we always want to keep the three most recent versions plus $LATEST, so “version_count” is compared with 3:

If the number of available versions exceeds the threshold, the oldest version (first position in the array) is deleted using the “lambda:DeleteFunction” API call in the “Delete oldest version” step.

After that, the payload needs to be modified: the deleted version number must be removed and the version count adjusted. A “Pass” state is used to remove the oldest version (the first position in the versions array) and to decrease the version counter.

[Screenshot: Pass state removing the oldest version and decrementing the counter]

The execution keeps checking and deleting old versions until their number reaches the threshold value (three in this example).

[Screenshot: workflow loop that checks the version count and deletes old versions]


One remaining thing to consider: Lambda aliases

A Lambda alias references one or two function versions, which cannot be deleted while the alias exists. It is possible to retrieve all aliases of a function via the “lambda:ListAliases” API call. One way to take aliases into account would be to collect all referenced versions and remove them from the versions array before the deletion step. This, however, requires some custom code in a Lambda function.

Another option – which we have implemented – is to define an error handler for the “Delete oldest version” step. It catches the “Lambda.ResourceConflictException”, which is thrown when trying to delete a version that cannot be deleted, for instance because of an alias reference. This error handling makes sure that the state machine does not fail on alias references or other unforeseen problems during deletion.
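The behaviour of the Catch handler can be sketched in Python. This assumes a boto3-style Lambda client whose `delete_function` call raises `ResourceConflictException` for alias-pinned versions; the function name and return shape are illustrative:

```python
def delete_old_versions(client, function_name, versions):
    """Delete the given versions, skipping any that are pinned by an
    alias (ResourceConflictException) instead of failing the run."""
    deleted, skipped = [], []
    for version in versions:
        try:
            client.delete_function(FunctionName=function_name, Qualifier=version)
            deleted.append(version)
        except client.exceptions.ResourceConflictException:
            skipped.append(version)  # referenced by an alias - keep it
    return deleted, skipped
```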

[Screenshot: error handler configuration on the delete step]

The state machine finishes once every Lambda function selected for processing retains no more than its most recent versions – all others have been deleted. Used regularly, this mechanism helps prevent deployment package pollution.

Conclusion

In this guide on managing AWS Lambda versions with AWS Step Functions, I’ve shown how developers can leverage automation to streamline the management of Lambda function versions. By automating these processes, teams can focus more on development rather than maintenance, maintaining operational efficiency and resource optimization. Overall, this solution provides a practical, visual tool for managing Lambda functions effectively, helping users navigate the complexities of cloud operations with greater ease.

AWS documentation and service quotas are your friends – do not miss them!

Who loves reading docs? Probably not that many developers and engineers. Coding and developing are so much more exciting than spending hours reading tons of documentation. However, I was recently taught again that this is one of the big misconceptions and fallacies – and probably not only for me.

AWS provides extensive documentation for each service, which contains not just a general overview but, in most cases, deep knowledge of the details and specifics of a service. Most service documentation consists of hundreds of pages, including many examples and code snippets which are often helpful – especially regarding IAM policies. It is not always easy to find the relevant pieces for a specific edge case, and sometimes they are missing, but overall AWS has done a great job documenting their landscape.

Service quotas, which exist for every AWS service, are another part that should not be missed, whether you are starting to work with a new AWS service or beginning to use one more extensively. Many headaches and hours lost debugging an issue could be avoided by taking these quotas into account right from the start. Unfortunately, this lesson is easy to forget, as the following example shows.

In a recent project, AWS DataSync was used to move about 40 million files from an Amazon EFS share to an S3 bucket. The whole sync process was to be repeated from time to time after the initial sync to pick up new and updated files. AWS DataSync supports this scenario by applying an incremental approach after the first run.

One DataSync location was created for EFS and another one for S3, and both were glued together by a DataSync task which configures, among other things, the sync properties. The initial run of this task went fine: all files were synced after about 9 hours.

Some days later, a first incremental sync was started to reflect the changes that had happened on EFS since the first run. The task went into the preparation phase, but broke after about 30 minutes with a strange error message:

[Screenshot: DataSync task failure with the error “Cannot allocate memory”]

“Cannot allocate memory” – what are you trying to tell me? No memory setting was configured in the DataSync task definition, as no agent was involved. The first hit on Google shed some light on this problem by pointing me to the AWS DataSync documentation,

[Screenshot: documentation entry explaining the “Cannot allocate memory” error]

which contains a link to the DataSync task quotas:

[Screenshot: DataSync task quotas table]


Apparently, 40 million files are far too many for one task, as only 25 million are supported when transferring files between AWS storage services. A request to AWS Support confirmed that the problem was related to the large number of files. I have no idea why the initial run made it through while the follow-up one failed. Splitting the task into several smaller ones solved the issue, so the incremental run could finally succeed as well. Nevertheless, some hours were lost – even though we learned something new.

Lessons learned – again:

  • Embrace the docs – even though they are really extensive!
  • Take the service quotas into account before starting to work, and while working, with an AWS service. They will become relevant one day – possibly sooner rather than later!
  • AWS technical support really likes to help and is quite competent. Do not hesitate to contact them (if you have a support plan available).
A cross account cost overview dashboard powered by Lambda, Step Functions, S3 and Quicksight

Keeping an eye on cloud spending in AWS, or any other cloud service provider, is one of the most important parts of every project team’s work. This can be tedious when following the AWS recommendation and best practice of splitting workloads across more than one AWS account. Consolidated billing – a feature of AWS Organizations – would help in this case, but access to the main/billing account is very often not granted to project teams and departments, as they would gain access to cost details of the whole organization.

In this article, a solution is presented which automatically collects billing information from various accounts and presents it in a concise format in one or more Amazon QuickSight dashboards. This application can run in its own AWS account, so that access can be limited at the account level and only dedicated people can be authorized to see the costs.