Xinyao0118/README.md

Welcome to Xinyao's GitHub 😊

Hi, I'm Xinyao, a proactive Gemini with an ISFP personality type. I'm intensely curious about fresh, challenging endeavors. I love keeping up with the latest technological advancements, and I pride myself on my ability to act on and reproduce them. My planning style pairs short-term goals for the next three days with long-term visions spanning five years; I embrace the way plans evolve as they progress, and I enjoy using my creativity to steer them in new directions.

Lately, I've been engrossed in Kaggle competitions. I've participated in three so far: ICR, LLM, and CAFA. Concurrently, I'm studying top solutions from past competitions, such as Ubiquant prediction and AI Games, to hone my predictive techniques.

In 2020, I graduated from Columbia University with a Master's in Biostatistics. Pinned on my homepage are the projects I undertook during my MS, many of which relate to healthcare. My primary focus at Columbia was on theory and decision-making, encompassing fundamental knowledge in computational statistics like data mining, optimization, and more. My thesis centered on causal inference, an intriguing direction in statistics that extends well into the realm of Machine Learning.

I'll keep updating my GitHub with new projects. Stay tuned!

πŸ”­ My Kaggle Competition Adventure

Built and evaluated a question-answering system on top of a pre-trained BERT model for answering science questions, achieving a MAP of 0.6.
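For context on that score: the LLM Science Exam competition ranks three candidate answers per question and scores them with MAP@3, which credits 1, 1/2, or 1/3 depending on where the correct answer lands. A minimal sketch of the metric (the exact cutoff of 3 is the competition's convention):

```python
def map_at_3(predictions, answer):
    """Score one question: 1 if the correct answer is ranked first,
    1/2 if second, 1/3 if third, 0 otherwise."""
    for rank, pred in enumerate(predictions[:3], start=1):
        if pred == answer:
            return 1.0 / rank
    return 0.0

def mean_map_at_3(all_predictions, answers):
    """MAP@3 over a set of questions: the mean of per-question scores."""
    scores = [map_at_3(p, a) for p, a in zip(all_predictions, answers)]
    return sum(scores) / len(scores)
```

So a MAP of 0.6 roughly means the model often ranks the correct answer first or second.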

*(pipeline diagram)*

Initially, I used LightGBM with Optuna for hyperparameter tuning, running 100 trials. After analyzing the feature-importance plot, I trimmed the feature list and ensembled five LightGBM models. Cross-validation showed an impressive AUC of 0.99, yet the private leaderboard (LB) score fell short of expectations (0.22).

Recognizing this, I delved into diagnosing the issue and improving the model. I discovered:

  • The task was disease detection, making recall a higher priority than AUC.
  • The training data was small (under 700 rows), raising the risk of overfitting with LightGBM or neural networks.
  • The data was imbalanced, with fewer than 200 positive instances, only 17% of the total.

In light of these insights, I initiated several improvements:

  • Applied KNN Imputer for missing value treatment.
  • Utilized Synthetic Minority Over-sampling Technique (SMOTE) to balance the positive and negative instances.
  • Shifted my ensemble strategy from averaging 5 LightGBM models to a voting system encompassing logistic regression, random forest, and SVM models.
  • Changed the cross-validation metric from AUC to recall.
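The four improvements above can be sketched end to end. This is a toy illustration rather than the competition code: the data is synthetic, and plain minority-class duplication stands in for SMOTE, which lives in the third-party imbalanced-learn package as `imblearn.over_sampling.SMOTE`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy imbalanced data with missing values (the real set: <700 rows, ~17% positive).
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.83, 0.17], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # inject 5% missingness

# Step 1: KNN imputation for missing values.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X)

# Step 2: oversample the minority class until the classes balance
# (duplication here; SMOTE would synthesize new minority points instead).
pos = np.flatnonzero(y == 1)
extra = rng.choice(pos, size=len(y) - 2 * len(pos))
X_bal = np.vstack([X_imp, X_imp[extra]])
y_bal = np.concatenate([y, y[extra]])

# Step 3: soft-voting ensemble of logistic regression, random forest, and SVM.
vote = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
    ],
    voting="soft",
)

# Step 4: cross-validate on recall instead of AUC.
recall = cross_val_score(vote, X_bal, y_bal, cv=5, scoring="recall").mean()
print(f"mean CV recall: {recall:.3f}")
```

One caveat worth noting: oversampling before splitting, as this sketch does, leaks duplicated minority rows into the validation folds; in practice the oversampling step belongs inside each CV fold (imbalanced-learn's `Pipeline` handles that).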

These updates improved the public leaderboard score from 0.64 to 0.53 (lower is better under this competition's loss metric).

Current standing: top 35% (386 / 1120)

πŸ› οΈ Technologies & Tools

Programming:

Python, SQL, R, Spark, Git, Bazel, Airflow, AWS, MLflow, Databricks, CI/CD, Snowflake, Docker, Jupyter, PyTorch, TensorFlow, MySQL, MongoDB

Statistics & Data Mining:

A/B Testing, ANOVA, LLMs, NLP, Deep Learning, Hyperparameter Tuning (Optuna), Supervised Learning (LightGBM), Unsupervised Learning, Data Mining (quantitative prediction)

Industries I've Worked In:

Tech, Advertising (audience prediction), E-commerce (funds-flow forecasting, fraud detection), Healthcare (cancer detection, medical text classification, insurance beneficiary risk adjustment)

🌱 My Journey So Far

VideoAmp, CA

As a Machine Learning Engineer, I led the design and development of personification systems, optimized data warehousing, implemented viewership prediction models, and facilitated extensive feature engineering.

Acumen, CA

As a Data Engineer, I optimized data pipelines for large datasets, automated manual tasks, integrated validation processes, and investigated data anomalies.

⚑ A Glimpse into My Projects

πŸ’‘A Glance into My Articles

Alternative link if you don't have access to Medium

This article is inspired by a post written by a Databricks engineer. It is aimed at company engineers who use the Databricks ecosystem but are unclear about why they chose it or its advantages. With this piece, we hope to demystify the underlying concepts and benefits of Databricks, specifically in comparison to Data Warehouses and Data Lakes.

Alternative link if you don't have access to Medium

A Large Language Model (LLM) refers to a type of artificial intelligence model designed to understand and generate human-like text. These models are trained on vast amounts of text data and utilize deep learning techniques, typically based on neural networks, to generate coherent and contextually relevant responses to textual prompts.
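That "trained on text, generates the next token" idea can be shown in miniature with a toy bigram model: count which word follows which in the training text, then greedily extend a prompt. Real LLMs replace these counts with billions of neural-network parameters, but the train-then-generate loop is the same in spirit:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count word-to-next-word transitions: a toy stand-in for the
    next-token statistics a neural LLM learns at vastly larger scale."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, n_words=5):
    """Greedily extend a prompt with the most frequent continuation,
    stopping when the last word was never seen with a follower."""
    out = [start]
    for _ in range(n_words):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)
```

For example, `generate(train_bigram("the cat sat on the mat and the cat ran"), "the", 2)` continues the prompt with the most frequently observed next words.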

πŸ“« How to reach me

😎Fun Fact

Welcome our fluffy friend --> 🐱 Severus 🐱

Severus is my wordless companion: a two-year-old male Ragdoll. He loves running around the house after pooping.

Another friend of mine --> 🎻 Violin 🎻

I was a second violinist in Columbia University Irving Medical Center Symphony Orchestra.

At one rehearsal the conductor was replaced, and the old conductor became my stand partner, sitting right next to me.

That's when he finally discovered I was the one playing out of tune.

Pinned Repositories

  1. Google--American-Sign-Language-Fingerspelling-Recognition: train fast and accurate American Sign Language fingerspelling recognition models

  2. LLM_Science_Exam: Kaggle LLM Science Exam, using LLMs to answer difficult science questions (Jupyter Notebook)

  3. ICR---Identifying-Age-Related-Conditions: use machine learning to detect conditions from measurements of anonymous characteristics

  4. Breast_Cancer_Diagnosis: compare the performance of the full logistic-LASSO, a Newton-Raphson model, and the optimal logistic-LASSO with coordinate-wise updates (HTML)

  5. Human_Disease_Prediction: [XGBoost][Data Mining] improve the bootstrap process and visualize the comparison of new and existing methods (HTML)

  6. IPO-Analysis: [Python][SVM][Random Forest][AdaBoost][Tableau] 2018 IPO prices in China & the U.S. (HTML)