GitHub - rahulmanuwas/genpact_tutorial: Python machine learning introductory tutorial for Genpact folks

Tutorial: Machine Learning with Text in scikit-learn

Credit for this tutorial goes to Kevin Markham. Most of the code is influenced by Kevin's tutorial, found here. Also check out his Data School website here.

Description

Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn.

Objectives

By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation.

Abstract

It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on...
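
As a quick preview of the "fit" versus "transform" question, here is a minimal sketch using scikit-learn's `CountVectorizer`. The three training messages and the new message are invented for illustration and are not part of the tutorial's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for this illustration.
train_docs = ["call you tonight", "call me a cab", "please call me please"]

vect = CountVectorizer()

# "fit" learns the vocabulary from the training documents.
vect.fit(train_docs)
vocab = sorted(vect.vocabulary_)
# vocab -> ['cab', 'call', 'me', 'please', 'tonight', 'you']
# ("a" is dropped because the default tokenizer ignores one-character tokens)

# "transform" uses the fitted vocabulary to build a document-term matrix:
# one row per document, one column per vocabulary term.
train_dtm = vect.transform(train_docs)

# New text is transformed with the SAME vocabulary; unseen words are ignored.
new_dtm = vect.transform(["please don't call me"])

print(vocab)
print(train_dtm.toarray())
print(new_dtm.toarray())
```

The result of `transform` is a SciPy sparse matrix, since most documents contain only a small fraction of the vocabulary and most cells are therefore zero.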

In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
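
The workflow described above (vectorize, build a model, evaluate it) might look roughly like the sketch below. The eight messages and their labels are made up for illustration, and Multinomial Naive Bayes is used only as one reasonable choice of model, not necessarily the one the notebook settles on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled messages (1 = spam, 0 = ham), invented for this sketch.
texts = [
    "win a free prize now",
    "claim your free cash today",
    "free entry to win cash",
    "urgent prize waiting claim now",
    "are we still meeting for lunch",
    "call me when you get home",
    "see you at the office tomorrow",
    "thanks for the notes from class",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split BEFORE vectorizing so the vocabulary is learned from training data only.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn vocabulary and build the matrix
X_test_dtm = vect.transform(X_test)        # reuse the training vocabulary

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

y_pred = nb.predict(X_test_dtm)
acc = accuracy_score(y_test, y_pred)
print("test accuracy:", acc)
```

With so few examples the accuracy value itself means very little; the point is the order of the steps, in particular that the test set is only ever passed to `transform`, never to `fit`.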

Detailed Outline

Model building in scikit-learn (refresher)
Representing text as numerical data
Reading a text-based dataset into pandas
Vectorizing our dataset
Building and evaluating a model
Comparing models
Examining a model for further insight
Practicing this workflow on another dataset
Tuning the vectorizer (discussion)
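
To give a flavor of the final item in the outline, the sketch below shows how a few `CountVectorizer` parameters change the number of features. The three sentences are invented for illustration, and the parameters shown (`stop_words`, `ngram_range`, `min_df`) are only some of the options worth trying.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for this illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]


def vocab_size(**params):
    """Fit a CountVectorizer with the given parameters and count its features."""
    return len(CountVectorizer(**params).fit(docs).vocabulary_)


default_size = vocab_size()                       # single words only
no_stop_size = vocab_size(stop_words="english")   # drop common English words
bigram_size = vocab_size(ngram_range=(1, 2))      # single words plus word pairs
min_df_size = vocab_size(min_df=2)                # keep terms found in 2+ documents

print("default:", default_size)
print("without stop words:", no_stop_size)
print("with bigrams:", bigram_size)
print("min_df=2:", min_df_size)
```

Removing stop words and raising `min_df` shrink the feature set, while adding n-grams grows it. Whether any of these helps a given model is an empirical question best answered by evaluating on held-out data.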

Recommended Resources

Text classification:

Read Paul Graham's classic post, A Plan for Spam, for an overview of a basic text classification system using a Bayesian approach. (He also wrote a follow-up post about how he improved his spam filter.)
Coursera's Natural Language Processing (NLP) course has video lectures on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the slides used in all of the videos.)
Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
How to Read the Mind of a Supreme Court Justice discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining how it works, and the Python code is available on GitHub.)
Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
In this PyData video (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data.

Naive Bayes and logistic regression:

Read this brief Quora post on airport security for an intuitive explanation of how Naive Bayes classification works.
For a longer introduction to Naive Bayes, read Sebastian Raschka's article on Naive Bayes and Text Classification. Wikipedia also has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
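
For a hands-on comparison of the two models, here is a small sketch that fits both to the same document-term matrix and then looks inside the Naive Bayes model. The six messages are invented for illustration, and the "spamminess" score (a difference of per-class log probabilities) is just one simple way to examine the model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled messages (1 = spam, 0 = ham), invented for this sketch.
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "claim your free ticket",
    "are we meeting for lunch",
    "call me when you are home",
    "see you at the meeting",
]
labels = [1, 1, 1, 0, 0, 0]

vect = CountVectorizer()
X = vect.fit_transform(texts)

# Both models accept the same sparse document-term matrix.
nb = MultinomialNB().fit(X, labels)
logreg = LogisticRegression().fit(X, labels)

new = vect.transform(["free prize now"])
nb_prob = nb.predict_proba(new)      # columns follow nb.classes_ (here [0, 1])
lr_prob = logreg.predict_proba(new)
print("Naive Bayes P(ham), P(spam):", nb_prob[0])
print("Logistic regression P(ham), P(spam):", lr_prob[0])

# Examine the Naive Bayes model: which token is most associated with spam?
tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)  # column order
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
top_spam_token = tokens[log_ratio.argmax()]
print("most spam-like token:", top_spam_token)
```

Both models expose `predict_proba`, but the numbers are not directly comparable: Naive Bayes probabilities tend to be pushed toward 0 or 1, while logistic regression probabilities are usually better calibrated.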

scikit-learn:

The scikit-learn user guide includes an excellent section on text feature extraction that includes many details not covered in today's tutorial.
The user guide also describes the performance trade-offs involved when choosing between sparse and dense input data representations.
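
To make the sparse versus dense distinction concrete, the sketch below builds a document-term matrix and compares its sparse and dense forms. The three sentences are invented for illustration. On a corpus this small the sparse format may not save any memory; the savings show up with realistic vocabularies, where nearly every cell is zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for this illustration.
docs = [
    "numeric data is easy to work with",
    "most human knowledge is unstructured text",
    "vectorizers turn text into numeric features",
]

dtm = CountVectorizer().fit_transform(docs)  # SciPy sparse matrix (CSR format)
dense = dtm.toarray()                        # NumPy array with explicit zeros

n_cells = dtm.shape[0] * dtm.shape[1]
density = dtm.nnz / n_cells                  # fraction of cells that are nonzero

# A CSR matrix stores only the nonzero values plus two index arrays.
sparse_bytes = dtm.data.nbytes + dtm.indices.nbytes + dtm.indptr.nbytes
dense_bytes = dense.nbytes

print("shape:", dtm.shape)
print("stored nonzero values:", dtm.nnz, "of", n_cells, "cells")
print("density:", round(density, 3))
print("sparse bytes:", sparse_bytes, "dense bytes:", dense_bytes)
```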
