This project aims to predict passenger transportation outcomes on the Spaceship Titanic using machine learning techniques. It includes exploratory data analysis (EDA), statistical inference, machine learning model development, and deployment of a prediction service.
- Setup
- Project Structure
- Exploratory Data Analysis
- Statistical Inference
- Machine Learning Models
- Model Deployment
- UI Guide
- UI Screenshot
- Improvements and Future Work
- Contributors
- License
- Python 3.12+
- Docker
- Clone the repository:

  ```bash
  git clone https://github.com/vytautas-bunevicius/kaggle-spaceship-titanic.git
  cd kaggle-spaceship-titanic
  ```
- Create a virtual environment and install dependencies:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  pip install -r requirements.txt
  ```
- Build the Docker image:

  ```bash
  docker build -t spaceship-titanic-predictor .
  ```
- Run the Docker container:

  ```bash
  docker run -p 8080:8080 spaceship-titanic-predictor
  ```
- Access the application at `http://localhost:8080`.
- `notebooks/`: Jupyter notebook containing EDA, statistical analysis, and model development
- `src/`: Source code for the prediction service
- `data/`: Dataset files
- `models/`: Saved machine learning models
- `templates/`: HTML templates for the web interface
- `Dockerfile`: Instructions for building the Docker image
- `requirements.txt`: Python dependencies
Our EDA process included:
- Statistical summaries of passenger data
- Visualization of key features and their relationships
- Anomaly detection in numerical features
- Correlation analysis between variables
Key findings and visualizations can be found in the `spaceship_titanic_analysis.ipynb` notebook.
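A minimal sketch of these EDA steps, assuming the standard Kaggle column names (`Age`, `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck`, `Transported`) and that the training data lives at `data/train.csv`:

```python
import pandas as pd

# Load the training data (path assumed; adjust to your local layout).
df = pd.read_csv("data/train.csv")

# Statistical summaries of the numerical passenger features.
print(df.describe())

# Simple IQR-based anomaly check on the spending features.
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
q1, q3 = df[spend_cols].quantile(0.25), df[spend_cols].quantile(0.75)
iqr = q3 - q1
outliers = (df[spend_cols] < q1 - 1.5 * iqr) | (df[spend_cols] > q3 + 1.5 * iqr)
print(outliers.sum())

# Correlation between the numerical variables and the target.
numeric = df[["Age"] + spend_cols].assign(Transported=df["Transported"].astype(int))
print(numeric.corr()["Transported"].sort_values(ascending=False))
```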
We conducted statistical inference to understand the relationships between various features and the likelihood of transportation. This included:
- Defining the target population (all passengers on the Spaceship Titanic)
- Formulating hypotheses about factors influencing transportation
- Constructing confidence intervals
- Conducting t-tests and chi-square tests
Detailed analysis and results are available in the `spaceship_titanic_analysis.ipynb` notebook.
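As an illustration of the kinds of tests used (not the exact hypotheses from the notebook), a sketch with `scipy.stats`, again assuming the Kaggle column names and file layout:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("data/train.csv")  # path assumed

# Chi-square test of independence: CryoSleep status vs. transportation outcome.
contingency = pd.crosstab(df["CryoSleep"], df["Transported"])
chi2, p_chi, dof, _ = stats.chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_chi:.4g}")

# Welch's t-test: does mean Age differ between transported and non-transported passengers?
age_t = df.loc[df["Transported"], "Age"].dropna()
age_nt = df.loc[~df["Transported"], "Age"].dropna()
t_stat, p_t = stats.ttest_ind(age_t, age_nt, equal_var=False)
print(f"t={t_stat:.2f}, p={p_t:.4g}")

# 95% confidence interval for the overall transportation rate (normal approximation).
rate = df["Transported"].mean()
se = (rate * (1 - rate) / len(df)) ** 0.5
print(f"95% CI: [{rate - 1.96 * se:.3f}, {rate + 1.96 * se:.3f}]")
```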
We experimented with several machine learning models, including:
- Logistic Regression
- Random Forest
- XGBoost
- Stacked Ensemble (using H2O AutoML)
Hyperparameter tuning was performed using Optuna, and model ensembling was done using H2O's AutoML capabilities. The final model achieved a Kaggle score above 0.79.
Model development and evaluation can be found in the `spaceship_titanic_analysis.ipynb` notebook.
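A condensed sketch of the modeling loop, using a deliberately simplified numeric feature set for illustration; the notebook's actual feature engineering and the H2O AutoML ensembling step are omitted:

```python
import optuna
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

df = pd.read_csv("data/train.csv")  # path assumed
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
X = df[["Age"] + spend_cols].fillna(0)  # simplified features for illustration
y = df["Transported"].astype(int)

# Baseline models: logistic regression and random forest.
for name, model in [
    ("LogisticRegression", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("RandomForest", RandomForestClassifier(n_estimators=200, random_state=42)),
]:
    score = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(name, round(score, 4))

# Optuna hyperparameter search for XGBoost.
def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    model = XGBClassifier(**params, eval_metric="logloss")
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```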
The best-performing model (Stacked Ensemble) was deployed as a Flask web application, containerized using Docker for easy deployment and scalability.
The web interface allows users to input passenger information and receive a prediction on whether the passenger will be transported.
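A minimal sketch of what such a Flask service can look like. The real app serves the H2O stacked ensemble behind an HTML form; here a pickled scikit-learn-style model and a JSON endpoint stand in, and the file name and route are assumptions:

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("models/model.pkl")  # hypothetical file name

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload keyed by the feature names used during training.
    features = request.get_json()
    frame = pd.DataFrame([features])
    proba = float(model.predict_proba(frame)[0, 1])
    return jsonify({"transported": proba >= 0.5, "probability": proba})

if __name__ == "__main__":
    # Same port the Docker container exposes.
    app.run(host="0.0.0.0", port=8080)
```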
- Fill in the passenger details in the form:
  - Home Planet
  - CryoSleep status
  - Destination
  - Age
  - VIP status
  - Expenses (Room Service, Food Court, Shopping Mall, Spa, VR Deck)
  - Cabin
- Click the "Predict" button.
- The prediction result will be displayed, including:
  - Transportation outcome (Transported or Not Transported)
  - Probability of transportation
  - Interpretation of the probability (e.g., "There is a high chance that you will be transported.")
- A visual probability bar indicates the likelihood of transportation.
- The interface also displays a feature importance chart to help understand which factors most influence the prediction.
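For scripted checks outside the browser, a prediction can presumably also be requested programmatically. The route, field names, and response shape below match the hypothetical sketch above, not a documented API:

```python
import requests

# Example payload using the Kaggle feature names; values are arbitrary.
payload = {
    "HomePlanet": "Europa",
    "CryoSleep": False,
    "Destination": "TRAPPIST-1e",
    "Age": 34,
    "VIP": False,
    "RoomService": 0,
    "FoodCourt": 120,
    "ShoppingMall": 0,
    "Spa": 5,
    "VRDeck": 0,
    "Cabin": "B/0/P",
}
response = requests.post("http://localhost:8080/predict", json=payload)
print(response.json())  # expected keys per the sketch: "transported", "probability"
```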
Caption: The Spaceship Titanic Predictor web interface, showing the input form and prediction results
- Collect more data to improve model accuracy
- Experiment with deep learning models
- Implement real-time model updating
- Optimize the model for faster prediction times
- Implement user feedback mechanism to continually improve the model
- Explore additional feature engineering techniques
- Conduct more in-depth analysis of feature interactions
- Implement A/B testing for different model versions