The project relies exclusively on services and features available in the LocalStack Community Edition.
Because Amazon Elastic Container Registry (ECR) and Lambda Layers are available only in the Pro version,
I opted for simplicity and ease of use: the solution avoids external libraries, which keeps the deployment of the Lambda functions straightforward.
Before you begin, make sure you have the following tools installed on your machine:
To install the project dependencies, execute the following command. Once the script completes, you'll be ready to start working on your Python project with all dependencies in place:
source scripts/poetry_init.sh
To run LocalStack and set up the infrastructure that supports the application, run the following command. The script performs the following operations:
- Start LocalStack.
- Provision the infrastructure using Terraform.
- Upload data to S3.
- Trigger the Lambda function to extract data from the Docker endpoint.
- Show the stored data in the DynamoDB table.
source scripts/localstack_init.sh
A CloudWatch event rule has been configured to invoke the "dockerhub-to-s3" Lambda function at regular five-minute intervals.
The "dockerhub-to-s3" Lambda function, triggered by CloudWatch, retrieves data from the Docker endpoint and stores it in an S3 bucket.
The data extracted from Docker includes a JSON attribute, "last_updated", which serves as the key for partitioning and organizing the stored data within the S3 bucket. For example:
- raw/year=YYYY/month=MM/day=DD/YYYY_MM_DD_HHMMSS.json
- raw/year=2023/month=09/day=15/2023_09_15_150500.json
- raw/year=2023/month=09/day=15/2023_09_15_154316.json
- raw/year=2023/month=09/day=15/2023_09_15_200000.json
- raw/year=2023/month=10/day=30/2023_10_30_154316.json
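The mapping from "last_updated" to an object key can be sketched as follows. This is a minimal illustration, not necessarily the code used in the Lambda function: the helper name build_raw_key is hypothetical, and the timestamp format is assumed to match the "last_updated" values shown elsewhere in this README (for example 2023-09-15T15:00:00.087021Z).

```python
from datetime import datetime


def build_raw_key(last_updated: str) -> str:
    """Map a `last_updated` timestamp to a partitioned S3 object key.

    Hypothetical helper: assumes timestamps such as
    "2023-09-15T15:43:16.087021Z" (UTC, with fractional seconds).
    """
    ts = datetime.strptime(last_updated, "%Y-%m-%dT%H:%M:%S.%fZ")
    # Hive-style partition folders, then a sortable file name.
    return (
        f"raw/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{ts:%Y_%m_%d_%H%M%S}.json"
    )


print(build_raw_key("2023-09-15T15:43:16.087021Z"))
# raw/year=2023/month=09/day=15/2023_09_15_154316.json
```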
The decision to store data on S3 is centered around simplifying organizational-level data access. Once data is in the data lake, it can easily be shared with other services such as Redshift, Athena, DynamoDB, and more. S3 event notifications additionally make it possible to build event-based workflows.
The partitioning structure makes data retrieval more efficient, because a reader can restrict itself to the prefixes that match its query filters (for example a single year/month) instead of scanning every file in the data lake. For services like Athena, which charge by the amount of data scanned, this also reduces cost.
Objects with the "raw/" prefix are deleted after 365 days, a practice enforced through lifecycle rules. Alternatively, the files could be transitioned to a more cost-effective storage class.
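For illustration only, the same expiration rule expressed as the request structure accepted by the boto3 S3 client could look like the sketch below. In this project the rule is provisioned with Terraform, so the function name and the rule ID here are hypothetical.

```python
def build_lifecycle_configuration(prefix: str = "raw/", expiration_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration that expires objects under a prefix.

    Hypothetical helper; the rule ID is an arbitrary illustrative name.
    """
    return {
        "Rules": [
            {
                "ID": "expire-raw-objects",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expiration_days},
            }
        ]
    }


config = build_lifecycle_configuration()
print(config["Rules"][0]["Expiration"])
```

The resulting dictionary would be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration on a boto3 S3 client. To move objects to a cheaper storage class instead of deleting them, the rule would carry a "Transitions" entry rather than "Expiration".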
When the Lambda function uploads a file to S3, an S3 event notification is triggered. The notification is configured to invoke the "s3-to-dynamodb" Lambda function, establishing an automated workflow in response to file uploads on S3.
The "s3-to-dynamodb" Lambda function reads the file that triggered the event and saves its content to DynamoDB.
Within this Lambda function, aggregations are computed for the star_count and pull_count metrics on a yearly/monthly basis. This process is facilitated by employing a partitioning structure (e.g., year/month/day/*) and adhering to a naming convention for the files (e.g., 'YYYY_MM_DD_HHMMSS.json'). These design choices enable the implementation of various strategies for sorting and retrieving files.
For the sake of simplicity, in this particular scenario, the file name serves as the search criterion. For each combination of year/month, only the most recent file is considered when generating the metrics aggregations.
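The selection of the most recent file per year/month can be sketched as below. This is an illustration under the naming convention described above, not necessarily the function's actual code; latest_key_per_month is a hypothetical name. Because the file names are zero-padded timestamps, plain string comparison orders them chronologically.

```python
def latest_key_per_month(keys):
    """Return {"YYYY_MM": key} holding the most recent key for each year/month."""
    latest = {}
    for key in keys:
        filename = key.rsplit("/", 1)[-1]          # e.g. "2023_09_15_154316.json"
        year, month = filename.split("_")[:2]
        period = f"{year}_{month}"
        current = latest.get(period)
        # Zero-padded names sort lexicographically in chronological order.
        if current is None or filename > current.rsplit("/", 1)[-1]:
            latest[period] = key
    return latest


keys = [
    "raw/year=2023/month=09/day=15/2023_09_15_154316.json",
    "raw/year=2023/month=09/day=15/2023_09_15_200000.json",
    "raw/year=2023/month=10/day=30/2023_10_30_154316.json",
]
print(latest_key_per_month(keys))
```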
The DynamoDB table is designed with a single-table approach, exemplified by the following PutItem operation:
table.put_item(Item={
    "PK": partition_key,
    "SK": sort_key,
    "star_count": star_count,
    "pull_count": pull_count,
    "last_updated": last_updated,
    "aggregated_metrics_by_year_month": aggregated_metrics_by_year_month,
    "TTL": ttl_epoch
})
The partition key is structured as:
partition_key = f"#user#{user}#name#{name}#namespace#{namespace}"
partition_key = "#user#localstack#name#localstack#namespace#localstack"
The sort key is structured as:
sort_key = f"#timestamp#{last_updated}"
sort_key = "#timestamp#2023-09-15T15:00:00.087021Z"
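This design supports straightforward queries on the sort key: items can be retrieved for a specific timeframe (a year, a year/month, or a year/month/day, down to minutes and seconds) using key conditions such as "begins_with" and "between". Note that "contains" is not available in key conditions; it can only be used in filter expressions, which are applied after the items have been read.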
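As a sketch of such a timeframe query, the parameters for a prefix match on the sort key could be built as follows. The helper name and the table name are hypothetical; the dictionary uses the low-level DynamoDB Query request shape and would be passed to the query method of a boto3 DynamoDB client.

```python
def build_period_query(table_name: str, partition_key: str, period_prefix: str) -> dict:
    """Build Query parameters selecting items whose sort key starts with a period.

    `period_prefix` is a timestamp prefix such as "2023", "2023-09" or "2023-09-15".
    """
    return {
        "TableName": table_name,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :sk_prefix)",
        "ExpressionAttributeValues": {
            ":pk": {"S": partition_key},
            ":sk_prefix": {"S": f"#timestamp#{period_prefix}"},
        },
    }


params = build_period_query(
    "example-table",  # hypothetical table name
    "#user#localstack#name#localstack#namespace#localstack",
    "2023-09",
)
print(params["ExpressionAttributeValues"][":sk_prefix"])
```

Since the prefix is part of the key condition, only the matching items are read, which is what makes this layout cheaper than filtering after the fact.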
Aggregated metrics for each year/month are stored in the "aggregated_metrics_by_year_month" attribute.
Here is an example of how aggregated metrics information might be stored in the aggregated_metrics_by_year_month attribute in DynamoDB:
"aggregated_metrics_by_year_month": {
"M": {
"2023_10": {
"L": [
{
"M": {
"star_count": {
"N": "230"
},
"pull_count": {
"N": "185000000"
}
}
}
]
},
"2023_11": {
"L": [
{
"M": {
"star_count": {
"N": "277"
},
"pull_count": {
"N": "188599872"
}
}
}
]
},
"2023_09": {
"L": [
{
"M": {
"star_count": {
"N": "201"
},
"pull_count": {
"N": "180000050"
}
}
}
]
}
}
}
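On the Python side, the value behind this attribute could be assembled roughly as in the sketch below. It assumes the metrics from the most recent file of each period are already available in a dictionary; build_aggregated_metrics and its input shape are hypothetical. When an item is written through the boto3 Table resource, nested dicts, lists and ints are serialized to the M, L and N types shown above.

```python
def build_aggregated_metrics(latest_by_period: dict) -> dict:
    """Shape per-period metrics as {"YYYY_MM": [{"star_count": ..., "pull_count": ...}]}."""
    return {
        period: [
            {
                "star_count": metrics["star_count"],
                "pull_count": metrics["pull_count"],
            }
        ]
        for period, metrics in sorted(latest_by_period.items())
    }


# Values taken from the example attribute above.
latest_by_period = {
    "2023_10": {"star_count": 230, "pull_count": 185000000},
    "2023_09": {"star_count": 201, "pull_count": 180000050},
}
print(build_aggregated_metrics(latest_by_period))
```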
With every insertion into the DynamoDB table, a dedicated marker item is also written to keep track of the most recent entry. Thanks to this marker, queries seeking the most recent data can identify the relevant entry directly, without scanning through historical records.
The item is structured as follows:
{
    "SK": {
        "S": "#latest_version"
    },
    "last_updated": {
        "S": "2023-11-18T22:01:11.395251Z"
    },
    "PK": {
        "S": "#user#localstack#name#localstack#namespace#localstack"
    },
    "TTL": {
        "N": "1705513319"
    }
}
The partition key is structured as:
partition_key = f"#user#{user}#name#{name}#namespace#{namespace}"
partition_key = "#user#localstack#name#localstack#namespace#localstack"
The sort key is structured as:
sort_key = "#latest_version"
The item in the DynamoDB table includes a TTL attribute, which stands for Time to Live. This attribute is configured to specify the expiration time of the item. In this particular case, every time an item is inserted into the table, the TTL is set with a predefined expiration of 60 days.
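The expiration value can be computed as in the minimal sketch below. DynamoDB expects the TTL attribute to be a number holding a Unix epoch timestamp in seconds; the helper name compute_ttl_epoch is hypothetical.

```python
import time

TTL_DAYS = 60  # expiration window described above


def compute_ttl_epoch(now=None) -> int:
    """Return the epoch second at which an item written `now` should expire."""
    current = time.time() if now is None else now
    return int(current) + TTL_DAYS * 24 * 60 * 60


# An item written at epoch second 1700329319 expires 5,184,000 seconds later.
print(compute_ttl_epoch(1700329319))
# 1705513319
```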
- Optimize CloudWatch Event Rule Interval: Explore the optimal time interval for the CloudWatch Event Rule. Invoking the Lambda function every 5 minutes may be excessive.
- Refine Data Partitioning Strategy: Analyze how the data is expected to be used and queried (S3, DynamoDB). Identifying the optimal partition key and strategy can significantly enhance data retrieval efficiency.
- Improve S3 Event Consumption: Consider introducing an SNS topic for S3 event notifications rather than directly invoking the Lambda function. This enables the SNS topic to serve multiple subscribers in the future. In conjunction with the SNS topic, place an SQS queue in front of the Lambda function. If needed, the queue can reduce the number of invocations by delivering events in batches.
- Refine Lifecycle Rules and TTL: Appropriately configure lifecycle rules on the S3 bucket and TTL in the DynamoDB table.
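To make the S3 event consumption idea above more concrete, the sketch below shows how a Lambda function fed by an SQS queue subscribed to an SNS topic might unwrap a batch of messages. It is not part of the current implementation: the function name is hypothetical, and it assumes raw message delivery is disabled on the subscription, so each SQS body is an SNS envelope whose "Message" field holds the S3 event as a JSON string.

```python
import json
from urllib.parse import unquote_plus


def s3_keys_from_sqs_event(event: dict) -> list:
    """Collect S3 object keys from a batch of SQS records wrapping SNS messages."""
    keys = []
    for sqs_record in event.get("Records", []):
        sns_envelope = json.loads(sqs_record["body"])    # SNS envelope
        s3_event = json.loads(sns_envelope["Message"])   # original S3 event
        # Messages without "Records" (such as S3 test events) are skipped.
        for s3_record in s3_event.get("Records", []):
            keys.append(unquote_plus(s3_record["s3"]["object"]["key"]))
    return keys


# Build a one-message batch for demonstration.
s3_event = {"Records": [{"s3": {"object": {"key": "raw/year%3D2023/month%3D09/day%3D15/2023_09_15_154316.json"}}}]}
batch = {"Records": [{"body": json.dumps({"Message": json.dumps(s3_event)})}]}
print(s3_keys_from_sqs_event(batch))
```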
Tests are automatically executed through the pre-commit hook when pushing to the remote branch:
- repo: local
  hooks:
    - id: tests
      name: tests
      entry: poetry run pytest -s
      language: python
      types: [python]
      pass_filenames: false
      stages: [push]
Alternatively, tests can be run manually using the command:
poetry run pytest
- Python | Programming language
- Poetry | Dependency management and packaging
- Pre-Commit | Pre-commit task automation
- Bash | Scripting
- Docker | Containerization and Deployment
- AWS | Cloud Provider
- LocalStack | Cloud Service Emulator
- Terraform | IaC
Made with ❤️ by Vittorio Polverino