BigData Pipeline

Project Overview

BigData Pipeline is a local testing environment designed for experimenting with various storage solutions, query engines, schedulers, and ETL/ELT tools. The project includes:

  • Storage Solutions: RDB, HDFS, Columnar Storage
  • Query Engines: Trino
  • Schedulers: Airflow
  • ETL/ELT Tools: DBT

Pipeline Components

Pipeline Component | Version | Description                   | Port
MySQL              | 8.36+   | Relational Database           | 3306
Hadoop             | 3.3.6+  | Distributed Storage           | namenode: 9870, datanode: 9864
Trino              | 438+    | Distributed Query Engine      | 8080
Hive               | 3.1.3   | DFS Query Solution            | hiveserver2 (thrift): 10002
Kudu               | 2.3+    | Columnar Distributed Database | master: 7051, tserver: 7050
Airflow            | 2.7+    | Scheduler                     | 8888
DBT                | 1.7.1   | Analytics Framework           | -
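
If the stack is already up and the ports above are published to localhost (an assumption that depends on your docker-compose.yml), you can probe the web endpoints from the host, for example:

    # Probe the web UIs listed above; assumes the ports are mapped to localhost.
    curl -s -o /dev/null -w "namenode: %{http_code}\n" http://localhost:9870
    curl -s -o /dev/null -w "trino:    %{http_code}\n" http://localhost:8080
    curl -s -o /dev/null -w "airflow:  %{http_code}\n" http://localhost:8888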

Connection Info

Pipeline Component | User                 | Password | Database
MySQL              | root                 | root     | default
MySQL              | airflow              | airflow  | airflow_db
Trino              | any string (no auth) | -        | -
Hive               | hive                 | hive     | default
Airflow            | airflow              | airflow  | -

You can create databases, schemas, and tables with these accounts.
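
For example, assuming the MySQL port is published to localhost and the container name hive-server-2 matches your setup (both are assumptions), the accounts above can be exercised like this:

    # MySQL from the host, using the airflow account from the table above
    # (requires a local mysql client; alternatively run it inside the MySQL container).
    mysql -h 127.0.0.1 -P 3306 -u airflow -pairflow airflow_db -e "SHOW DATABASES;"

    # Hive via beeline inside the container, if beeline is available in the image;
    # the container name and port follow the tables above and are otherwise assumptions.
    docker exec -it hive-server-2 beeline -u "jdbc:hive2://localhost:10002/default" -n hive -p hive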

Execution

The Apache components are installed manually on an Ubuntu image and downloaded from Apache mirror servers (CDN) to speed up installation. Installation time still depends on your network environment, so a stable connection is recommended.

  • MySQL: The docker-compose file sets platform: linux/amd64 for Apple Silicon Macs. If you are running on Windows, comment out this line.
  • Trino: For Trino Web UI/JDBC connections (e.g., DBeaver), any string can be used as the user and there is no password. Make sure the user in dbt-trino's profiles.yml matches the one you use (a CLI sketch follows this list).
  • DBT: DBT runs inside Airflow via airflow-dbt. For local use, create a virtual environment. (Build improvements with poetry are planned.)
  • Kudu & Hadoop: For local environments with limited resources, the replica count for kudu-tserver and hadoop-datanode is set to 1. Kudu is storage only, so a separate query engine (e.g., Impala, Trino) is required to run queries against it.
  • Hue: If you need Hue, uncomment its section in docker-compose.yml.
  • Airflow: Airflow is configured with the Celery Executor. airflow-trigger is restricted due to resource constraints.
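
As a quick check of the no-auth setup described above, the Trino CLI accepts any user name. A minimal sketch, assuming the Trino container is named trino:

    # Any value for --user is accepted because authentication is disabled;
    # the container name "trino" is an assumption, check `docker ps` for yours.
    docker exec -it trino trino --server localhost:8080 --user any_string --execute "SHOW CATALOGS"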

Getting Started

To get started with BigData Pipeline, follow these steps:

1. Start the Containers

  • 1-1. If you want to specify the required profile and bring up containers using the CLI:

    COMPOSE_PROFILES=trino,kudu,hive,dbt,airflow docker-compose -f docker-compose.yml up --build -d --remove-orphans
  • 1-2. If you want to bring up all containers at once:

    make up

2. Manage Containers

  • 2-1. If you want to stop running containers:

    make down
  • 2-2. If you want to remove running containers while deleting Docker images, volumes, and network resources:

    make delete.all
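
The Makefile itself is not reproduced in this README; as a rough sketch, the targets presumably wrap docker-compose commands along these lines (the exact flags are assumptions, so check the repository's Makefile):

    # Approximate docker-compose equivalents of the make targets (assumed, not verified):
    docker-compose -f docker-compose.yml up --build -d --remove-orphans              # ~ make up
    docker-compose -f docker-compose.yml down                                        # ~ make down
    docker-compose -f docker-compose.yml down --rmi all --volumes --remove-orphans   # ~ make delete.all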

Checking if It's Running Properly

  • Hive Metastore Initialization: Check that an initialization file has been created in the ./mnt/schematool-check folder.

  • Container Start Success: Confirm that all containers start successfully when running Docker Compose.

  • Web UI Access: If you can’t access the web UI for a specific platform after container startup, you may need to rebuild the containers.

    hadoop namenode
    hive-server-2
    kudu-master
    trino
    airflow
  • Trino JDBC Connection: If you see three catalogs (hive, kudu, mysql) after connecting via JDBC (jdbc:trino://localhost:8080) in DBeaver, it is working correctly.

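From the command line, the first two checks above can be scripted roughly as follows (the path and compose file name follow this README and are otherwise assumptions):

    # Hive Metastore initialization marker
    ls ./mnt/schematool-check

    # Container status for the services listed above
    docker-compose -f docker-compose.yml ps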

Trino Test Code

Test scripts are located in the init-sql/trino directory (a sketch of running them from the command line follows the list below).

  • test_code_1.sql: Tests schema and table creation, data insertion, and selection in the Hive catalog.

  • test_code_2.sql: Tests UNION queries across tables from heterogeneous databases (Hive, Kudu).

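A minimal sketch of running these scripts without DBeaver, assuming the Trino container is named trino and that the Trino CLI reads SQL from stdin in batch mode:

    # Pipe the test scripts from the host into the Trino CLI inside the container
    # (container name and the use of stdin batch mode are assumptions).
    docker exec -i trino trino --server localhost:8080 --user test < init-sql/trino/test_code_1.sql
    docker exec -i trino trino --server localhost:8080 --user test < init-sql/trino/test_code_2.sql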

Next Challenge

  • Enhance static analysis tools and build systems for clean code (black, ruff, isort, mypy, poetry).
  • Improve CI automation for static analysis (pre-commit).
  • Simulate ETL/ELT with DBT-Airflow integration.

Contributing

Feel free to fork the repository, make changes, and submit pull requests. Contributions are welcome!

License

This project is licensed under the MIT License. See the LICENSE file for details.

Author

Pirate-Emperor

Twitter Discord LinkedIn

Reddit Medium

Thank you for visiting the BigData Pipeline project!


For more details, please refer to the GitHub repository.
