BigData Pipeline is a local testing environment designed for experimenting with various storage solutions, query engines, schedulers, and ETL/ELT tools. The project includes:
- Storage Solutions: RDB, HDFS, Columnar Storage
- Query Engines: Trino
- Schedulers: Airflow
- ETL/ELT Tools: DBT
| Pipeline Component | Version | Description | Port |
| --- | --- | --- | --- |
| MySQL | 8.36+ | Relational Database | 3306 |
| Hadoop | 3.3.6+ | Distributed Storage | namenode: 9870, datanode: 9864 |
| Trino | 438+ | Distributed Query Engine | 8080 |
| Hive | 3.1.3 | DFS Query Solution | hiveserver2 (Thrift): 10002 |
| Kudu | 2.3+ | Columnar Distributed Database | master: 7051, tserver: 7050 |
| Airflow | 2.7+ | Scheduler | 8888 |
| DBT | 1.7.1 | Analytics Framework | - |
| Pipeline Component | User | Password | Database |
| --- | --- | --- | --- |
| MySQL | root | root | default |
| MySQL | airflow | airflow | airflow_db |
| Trino | any (all users allowed) | - | - |
| Hive | hive | hive | default |
| Airflow | airflow | airflow | - |
You can create databases, schemas, and tables with these accounts.
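As a minimal sketch, assuming a Trino connection with any user (no password) and the `hive` catalog from the table above, creating and querying a table could look like the following; the schema, table, and column names are purely illustrative and not part of this repository:

```sql
-- Illustrative only: schema/table names are invented for this example.
CREATE SCHEMA IF NOT EXISTS hive.demo;

CREATE TABLE IF NOT EXISTS hive.demo.events (
    event_id   BIGINT,
    event_name VARCHAR
);

INSERT INTO hive.demo.events VALUES (1, 'signup'), (2, 'login');
SELECT * FROM hive.demo.events;
```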
The Apache open-source components are installed manually on an Ubuntu image and are downloaded from Apache mirror servers (CDN) to speed up installation. Installation time depends on your network environment, so a stable connection is recommended.
- MySQL: The docker-compose file sets `platform: linux/amd64` for Apple Silicon. If running on Windows, comment out this line.
- Trino: For Trino's Web UI/JDBC connections (e.g., DBeaver), any string can be used as the user and there is no password. Ensure that the user in dbt-trino's `profiles.yml` matches this.
- DBT: DBT runs within Airflow via `airflow-dbt`. For local use, create a virtual environment. (Future plans include improving the build with Poetry.)
- Kudu & Hadoop: For local environments with limited resources, the replica count for `kudu-tserver` and `hadoop-datanode` has been set to 1. Kudu is a storage-only database and requires a separate engine (e.g., Impala, Trino) to execute queries; a sketch of querying Kudu through Trino follows these notes.
- Hue: If Hue is needed, uncomment its section in `docker-compose.yml`.
- Airflow: Airflow is configured with the Celery Executor. `airflow-trigger` is restricted due to resource constraints.
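Because Kudu only stores data, tables have to be created and queried through a query engine. Below is a hedged sketch using Trino's Kudu connector, assuming the catalog is named `kudu`; the table and column names are illustrative and not part of this repo:

```sql
-- Illustrative only: assumes the Trino Kudu catalog is named "kudu".
CREATE TABLE IF NOT EXISTS kudu.default.metrics (
    metric_id    BIGINT WITH (primary_key = true),  -- Kudu tables require a primary key
    metric_name  VARCHAR,
    metric_value DOUBLE
) WITH (
    partition_by_hash_columns = ARRAY['metric_id'],
    partition_by_hash_buckets = 2,
    number_of_replicas = 1  -- matches the single kudu-tserver in this compose setup
);

INSERT INTO kudu.default.metrics VALUES (1, 'cpu_usage', 0.42);
SELECT * FROM kudu.default.metrics;
```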
To get started with BigData Pipeline, follow these steps:
- 1-1. If you want to specify the required profiles and bring up containers using the CLI:
  `COMPOSE_PROFILES=trino,kudu,hive,dbt,airflow docker-compose -f docker-compose.yml up --build -d --remove-orphans`
- 1-2. If you want to bring up all containers at once:
  `make up`
- 2-1. If you want to stop the running containers:
  `make down`
- 2-2. If you want to remove the running containers along with Docker images, volumes, and network resources:
  `make delete.all`
- Hive Metastore Initialization: Check for an initialized file in the `./mnt/schematool-check` folder.
- Container Start Success: Look for the following image when running Docker Compose:
- Web UI Access: If you can't access the web UI for a specific platform after container startup, you may need to rebuild the containers.
- Trino JDBC Connection: If you see three catalogs (hive, kudu, mysql) after connecting via JDBC (`jdbc:trino://localhost:8080`) in DBeaver, it is working correctly; see the query below.
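A quick way to check this from DBeaver's SQL editor (or the Trino CLI) is to list the catalogs; the expected entries are shown as a comment and assume all profiles were started:

```sql
SHOW CATALOGS;
-- Expected to include, alongside Trino built-ins such as system:
--   hive
--   kudu
--   mysql
```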
The test SQL scripts are located in the `init-sql/trino` directory.
- `test_code_1.sql`: Tests schema and table creation, data insertion, and selection in the Hive catalog.
- `test_code_2.sql`: Tests union queries between heterogeneous DB tables (Hive, Kudu). A rough sketch of both scripts follows below.
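The scripts in `init-sql/trino` are the authoritative versions; the following is only a hedged sketch of the kinds of statements they exercise, with all schema, table, and column names invented for illustration:

```sql
-- test_code_1.sql style: schema/table creation, insert, and select in the Hive catalog.
CREATE SCHEMA IF NOT EXISTS hive.test_schema;
CREATE TABLE IF NOT EXISTS hive.test_schema.users (id BIGINT, name VARCHAR);
INSERT INTO hive.test_schema.users VALUES (1, 'alice'), (2, 'bob');
SELECT * FROM hive.test_schema.users;

-- test_code_2.sql style: a union across heterogeneous catalogs (Hive and Kudu).
SELECT id, name FROM hive.test_schema.users
UNION ALL
SELECT user_id AS id, user_name AS name FROM kudu.default.kudu_users;
```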
- Enhance static analysis tools and build systems for clean code (black, ruff, isort, mypy, poetry).
- Improve CI automation for static analysis (pre-commit).
- Simulate ETL/ELT with DBT-Airflow integration.
Feel free to fork the repository, make changes, and submit pull requests. Contributions are welcome!
This project is licensed under the MIT License. See the LICENSE file for details.
Pirate-Emperor
- GitHub: Pirate-Emperor
- Reddit: PirateKingRahul
- Twitter: PirateKingRahul
- Discord: PirateKingRahul
- LinkedIn: PirateKingRahul
- Skype: Join Skype
- Medium: PirateKingRahul
Thank you for visiting the BigData Pipeline project!
For more details, please refer to the GitHub repository.