
📜 Table of Contents

  • About
  • Getting Started
  • Architecture
  • Test
  • Built Using
  • Authors

🤓 About

The project relies exclusively on services available in the LocalStack Community Edition.

Since Amazon Elastic Container Registry (ECR) and Lambda Layers are only accessible in the Pro version, I opted for simplicity and ease of use: the solution avoids external libraries, which makes the Lambda function deployment process smoother.


🏁 Getting Started

Before you begin, make sure you have the following tools installed on your machine:

  • Docker
  • Python and Poetry
  • Terraform
  • LocalStack

To install the project dependencies, execute the following command. Once the script completes, all dependencies will be in place and you'll be ready to start working on the project:

source scripts/poetry_init.sh 

To run LocalStack and set up the infrastructure that supports the application, run the command shown after the list below. The script takes care of the following operations:

  • Start LocalStack.
  • Provision the infrastructure using Terraform.
  • Upload data to S3.
  • Trigger the Lambda function to extract data from the Docker Hub endpoint.
  • Show the stored data in the DynamoDB table.

source scripts/localstack_init.sh


🏦 Architecture

(Architecture diagram: architecture.png)

1. CloudWatch Event

A CloudWatch event rule has been configured to invoke the "dockerhub-to-s3" Lambda function at regular five-minute intervals.
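
This rule is provisioned through Terraform; purely for illustration, the equivalent boto3 calls might look roughly like this (the rule name and function ARN are placeholders, using LocalStack's default account ID):

import boto3

events = boto3.client("events")

# Fire every five minutes.
events.put_rule(
    Name="dockerhub-to-s3-schedule",       # placeholder rule name
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)

# Point the rule at the Lambda function.
events.put_targets(
    Rule="dockerhub-to-s3-schedule",
    Targets=[{
        "Id": "dockerhub-to-s3",
        "Arn": "arn:aws:lambda:us-east-1:000000000000:function:dockerhub-to-s3",  # placeholder ARN
    }],
)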


2. Lambda Function (dockerhub-to-s3)

The "dockerhub-to-s3" Lambda function, triggered by CloudWatch, is responsible for retrieving data from the Docker Hub endpoint and storing it in an S3 bucket.

The data extracted from Docker Hub includes a JSON attribute, "last_updated", which serves as the key for partitioning and organizing the stored data within the S3 bucket (see the sketch after the examples below). For example:

  • raw/year=YYYY/month=MM/day=DD/YYYY_MM_DD_HHMMSS.json
  • raw/year=2023/month=09/day=15/2023_09_15_150550.json
  • raw/year=2023/month=09/day=15/2023_09_15_154316.json
  • raw/year=2023/month=09/day=15/2023_09_15_200000.json
  • raw/year=2023/month=10/day=30/2023_10_30_154316.json
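
A minimal sketch of the key-building logic, assuming "last_updated" arrives as an ISO-8601 timestamp (the function name is illustrative):

from datetime import datetime

def build_s3_key(last_updated: str) -> str:
    # e.g. last_updated = "2023-09-15T15:43:16.087021Z"
    ts = datetime.strptime(last_updated, "%Y-%m-%dT%H:%M:%S.%fZ")
    return (
        f"raw/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
        f"{ts:%Y_%m_%d_%H%M%S}.json"
    )

print(build_s3_key("2023-09-15T15:43:16.087021Z"))
# raw/year=2023/month=09/day=15/2023_09_15_154316.json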

3. S3

The decision to store data on S3 is centered around simplifying organization-wide data access. Once data is in the data lake, it becomes effortlessly shareable with other services such as Redshift, Athena, DynamoDB, and more. Leveraging S3 event notifications further enables the creation of event-based workflows.

3.1 Partitioning

Adopting a partitioning structure significantly enhances data retrieval efficiency and also reduces costs for services like Athena: query filters applied to the partition keys avoid unnecessary full scans of the files stored in the data lake.

3.2 Lifecycle Rules

Objects with the "raw/" prefix are deleted after 365 days, a practice enforced through lifecycle rules. Alternatively, the files could be transitioned to a more cost-effective storage class.
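
The rule is applied through Terraform; as a sketch, the equivalent boto3 call could look like this (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# Expire objects under the "raw/" prefix after 365 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="dockerhub-data-lake",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "expire-raw-after-365-days",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Expiration": {"Days": 365},
        }]
    },
)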


4. S3 Event Notifications

When the Lambda function uploads a file to S3, an S3 event notification is triggered. The notification is configured to invoke the "s3-to-dynamodb" Lambda function, establishing an automated workflow in response to file uploads on S3.
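
The notification itself is configured via Terraform; a hedged boto3 equivalent might look like this (the bucket name and function ARN are placeholders):

import boto3

s3 = boto3.client("s3")

# Invoke "s3-to-dynamodb" whenever an object is created under the "raw/" prefix.
s3.put_bucket_notification_configuration(
    Bucket="dockerhub-data-lake",  # placeholder bucket name
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:000000000000:function:s3-to-dynamodb",  # placeholder
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [
                {"Name": "prefix", "Value": "raw/"},
            ]}},
        }]
    },
)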


5. Lambda Function (s3-to-dynamodb)

The "s3-to-dynamodb" Lambda function reads the file that triggered the event and saves its content to DynamoDB.

Within this Lambda function, aggregations of the star_count and pull_count metrics are computed on a yearly/monthly basis. The partitioning structure (e.g., year/month/day/*) and the file naming convention (e.g., "YYYY_MM_DD_HHMMSS.json") enable various strategies for sorting and retrieving files.

For the sake of simplicity, the file name serves as the search criterion in this scenario: for each year/month combination, only the most recent file is considered when generating the metric aggregations.
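
A minimal sketch of that selection logic, assuming the partitioning structure and naming convention above (the bucket name and helper are illustrative):

import boto3

s3 = boto3.client("s3")

def latest_key_for_month(bucket: str, year: str, month: str):
    """Return the most recent object key for a given year/month partition.

    File names follow "YYYY_MM_DD_HHMMSS.json", so lexicographic order
    matches chronological order and max() picks the newest file.
    """
    prefix = f"raw/year={year}/month={month}/"
    paginator = s3.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    return max(keys) if keys else None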


6. DynamoDB

6.1 Single Table Design

The DynamoDB table is designed with a single-table approach, exemplified by the following PutItem operation:

import boto3

# Placeholder table name; the actual table is provisioned via Terraform.
table = boto3.resource("dynamodb").Table("dockerhub-metrics")

table.put_item(Item={
    "PK": partition_key,
    "SK": sort_key,
    "star_count": star_count,
    "pull_count": pull_count,
    "last_updated": last_updated,
    "aggregated_metrics_by_year_month": aggregated_metrics_by_year_month,
    "TTL": ttl_epoch
})

The partition key is structured as:

partition_key = f"#user#{user}#name#{name}#namespace#{namespace}"
partition_key = "#user#localstack#name#localstack#namespace#localstack"

The sort key is structured as:

sort_key = f"#timestamp#{last_updated}"
sort_key = "#timestamp#2023-09-15T15:00:00.087021Z"

This design facilitates straightforward queries on the sort key, enabling retrieval of items for specific timeframes, such as a year, a year/month, or a year/month/day down to minutes and seconds, using key conditions like "begins_with" and "between".
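
For instance, fetching every entry recorded in September 2023 might look like this (the table name is a placeholder):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("dockerhub-metrics")  # placeholder table name

response = table.query(
    KeyConditionExpression=(
        Key("PK").eq("#user#localstack#name#localstack#namespace#localstack")
        & Key("SK").begins_with("#timestamp#2023-09")
    )
)
items = response["Items"]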


6.2 Aggregations

Aggregated metrics for each year/month are stored in the "aggregated_metrics_by_year_month" attribute.

Here is an example of how aggregated metrics information might be stored in the aggregated_metrics_by_year_month attribute in DynamoDB:

"aggregated_metrics_by_year_month": {
    "M": {
        "2023_10": {
            "L": [
                {
                    "M": {
                        "star_count": {
                            "N": "230"
                        },
                        "pull_count": {
                            "N": "185000000"
                        }
                    }
                }
            ]
        },
        "2023_11": {
            "L": [
                {
                    "M": {
                        "star_count": {
                            "N": "277"
                        },
                        "pull_count": {
                            "N": "188599872"
                        }
                    }
                }
            ]
        },
        "2023_09": {
            "L": [
                {
                    "M": {
                        "star_count": {
                            "N": "201"
                        },
                        "pull_count": {
                            "N": "180000050"
                        }
                    }
                }
            ]
        }
    }
}
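
When writing through the boto3 Table resource, the attribute can be passed as a plain Python structure and boto3 marshals it into the typed M/L/N representation shown above. A sketch of the equivalent plain-Python value:

aggregated_metrics_by_year_month = {
    "2023_09": [{"star_count": 201, "pull_count": 180000050}],
    "2023_10": [{"star_count": 230, "pull_count": 185000000}],
    "2023_11": [{"star_count": 277, "pull_count": 188599872}],
}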

6.3 Marker Item

With every insertion into the DynamoDB table, a dedicated marker item is written to keep track of the most recent entry. Thanks to this marker item, queries seeking the most recent data can efficiently identify the relevant entry without scanning through historical records.

The item is structured as follows:

{
    "SK": {
        "S": "#latest_version"
    },
    "last_updated": {
        "S": "2023-11-18T22:01:11.395251Z"
    },
    "PK": {
        "S": "#user#localstack#name#localstack#namespace#localstack"
    },
    "TTL": {
        "N": "1705513319"
    }
}

The partition key is structured as:

partition_key = f"#user#{user}#name#{name}#namespace#{namespace}"
partition_key = "#user#localstack#name#localstack#namespace#localstack"

The sort key is structured as:

sort_key = "#latest_version"

6.4 Time to Live

Every item in the DynamoDB table includes a TTL (Time to Live) attribute that specifies when the item expires. In this case, every time an item is inserted into the table, the TTL is set to a predefined expiration of 60 days.
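
A minimal sketch of how such an expiration timestamp might be computed (the helper name is illustrative; the 60-day window comes from the text above):

from datetime import datetime, timedelta, timezone

def ttl_epoch(days: int = 60) -> int:
    # DynamoDB TTL expects a Unix epoch in seconds.
    return int((datetime.now(timezone.utc) + timedelta(days=days)).timestamp())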


7. Improvements

  • Optimize CloudWatch Event Rule Interval: Explore the optimal time interval for the CloudWatch Event Rule. Invoking the Lambda function every 5 minutes may be excessive.
  • Refine Data Partitioning Strategy: Anticipate how the data is expected to be used and queried (S3, DynamoDB); identifying the optimal partition key and strategy can significantly enhance data retrieval efficiency.
  • Improve S3 Event Consumption: Consider introducing an SNS topic for S3 event notifications rather than invoking the Lambda function directly, so the topic can serve multiple subscribers in the future. In conjunction with the SNS topic, place an SQS queue in front of the Lambda function; if needed, the queue can reduce the number of invocations by allowing events to be processed in batches.
  • Refine Lifecycle Rules and TTL: Appropriately configure lifecycle rules on the S3 bucket and TTL in the DynamoDB table.

🐛 Test

Tests are automatically executed through the pre-commit hook when pushing to the remote branch:

  - repo: local
    hooks:
        - id: tests
          name: tests
          entry: poetry run pytest -s
          language: python
          "types": [ python ]
          pass_filenames: false
          stages: [ push ]

Alternatively, tests can be run manually using the following command:

poetry run pytest

⛏️ Built Using

  • Python & Poetry
  • Terraform
  • LocalStack
  • pytest & pre-commit

✏️ Authors

Made with ❤️ by Vittorio Polverino