Comprehensive Data Science With Python

Duration: 5 days

Industry: Information Technology

About this course

This data science training course teaches engineers, data scientists, statisticians, and other quantitative professionals the Python skills they need to analyze and chart data.

What is Python?

Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components. Python's simple, easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

For more information, please check P2L's website.

Who can benefit?

Engineers
Data Scientists
Statisticians

This is what you'll learn

Understand the differences among Python's basic data types
Know when to use the different Python collections
Implement Python functions
Understand control flow constructs in Python
Handle errors via exception handling constructs
Be able to quantitatively define an answerable, actionable question
Import both structured and unstructured data into Python
Parse unstructured data into structured formats
Understand the differences between NumPy arrays and pandas DataFrames
Understand where Python fits in the Python/Hadoop/Spark ecosystem
Simulate data through random number generation
Understand missing-data mechanisms and their analytic implications
Explore and clean data
Create compelling graphics to reveal analytic results
Reshape and merge data to prepare for advanced analytics
Test for group differences using inferential statistics
Implement linear regression from a frequentist perspective
Understand non-linear terms, confounding, and interaction in linear regression
Extend to logistic regression to model binary outcomes
Understand the difference between machine learning and frequentist approaches to statistics
Implement classification and regression models using machine learning
Score new datasets, evaluate model fit, and quantify variable importance

Course Outline

Software Requirements

Anaconda Python 3.5 or later
Spyder IDE (included with Anaconda)

Data Science with Python Programming Training Outline

Base Python Introduction

History and current use
Installing the Software
Python Distributions
String Literals and numeric objects
Collections (lists, tuples, dicts)
Datetime classes in Python
Memory Management in Python
Control Flow
Functions
Exception Handling
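As a rough illustration of how the base Python topics above fit together, the following sketch combines collections, control flow, a function, and exception handling. The function name `parse_reading` and the sample values are hypothetical and are not taken from the course materials.

```python
from datetime import date


def parse_reading(raw):
    """Convert a raw string to a float, or return None if it is not numeric."""
    try:
        return float(raw)
    except ValueError:
        # float() raises ValueError for text such as "n/a" or an empty string
        return None


raw_readings = ["3.2", "n/a", "4.8", "", "5.0"]  # list: ordered and mutable
site_info = ("site_a", "celsius")                # tuple: ordered and immutable
summary = {}                                     # dict: key/value lookup

values = []
for raw in raw_readings:
    value = parse_reading(raw)
    if value is None:
        continue  # skip entries that could not be parsed
    values.append(value)

summary["site"] = site_info[0]
summary["n_valid"] = len(values)
summary["mean"] = sum(values) / len(values)
summary["run_date"] = date(2020, 1, 15)  # illustrative date only
```

Returning None from the helper keeps the error handling in one place, so the loop only has to decide whether to skip a value.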

Defining Actionable, Analytic Questions

Defining the quantitative construct used to draw inferences about the question
Identifying the data needed to support the constructs
Identifying limitations to the data and analytic approach
Constructing sensitivity analyses

Bringing Data In

Structured Data
Structured Text Files
Excel workbooks
SQL databases
Working with Unstructured Text Data
Reading Unstructured Text
Introduction to Natural Language Processing with Python

NumPy: Matrix Language

Introduction to the ndarray
NumPy operations
Broadcasting
Missing data in NumPy (masked array)
NumPy Structured arrays
Random number generation
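Three of the NumPy topics above (random number generation, broadcasting, and masked arrays) can be sketched briefly as follows. The array sizes and the sentinel value -999 are assumptions chosen for illustration, and `np.random.default_rng` assumes a reasonably recent NumPy release (1.17 or later).

```python
import numpy as np

# Random number generation: a seeded generator gives reproducible draws
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=10.0, scale=2.0, size=(5, 3))  # 5 rows, 3 columns

# Broadcasting: the column means have shape (3,) and are stretched
# across all 5 rows when subtracted from the (5, 3) array
centered = data - data.mean(axis=0)

# Masked array: treat the sentinel value -999 as missing so that it is
# excluded from calculations
raw = np.array([1.0, -999.0, 3.0, 5.0])
masked = np.ma.masked_equal(raw, -999.0)
masked_mean = masked.mean()  # mean of the three unmasked values
```

After centering, each column of `centered` has a mean of approximately zero, which is a quick way to confirm that the broadcast ran along the intended axis.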

Data Preparation with Pandas

Filtering
Creating and deleting variables
Discretization of Continuous Data
Scaling and standardizing data
Identifying Duplicates
Dummy Coding
Combining Datasets
Transposing Data
Long to wide and back
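The "long to wide and back" reshaping listed above might look like the sketch below, which uses `DataFrame.pivot` and `DataFrame.melt`. The column names and values are made up for illustration.

```python
import pandas as pd

# Long format: one row per subject per visit
long_df = pd.DataFrame({
    "subject": ["a", "a", "b", "b"],
    "visit": ["v1", "v2", "v1", "v2"],
    "score": [10, 12, 9, 15],
})

# Long to wide: one row per subject, one column per visit
wide_df = long_df.pivot(index="subject", columns="visit", values="score")

# Wide back to long: melt the visit columns into rows again
back_to_long = wide_df.reset_index().melt(
    id_vars="subject", var_name="visit", value_name="score"
)
```

Note that `pivot` requires each subject/visit pair to be unique. When duplicates are possible, `pivot_table` with an aggregation function is the usual alternative.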

Exploratory Data Analysis with Pandas

Univariate Statistical Summaries and Detecting Outliers
Multivariate Statistical Summaries and Outlier Detection
Group-wise calculations using Pandas
Pivot Tables

Exploring Data Graphically

Histogram
Box-and-whiskers plot
Scatter plots
Forest Plots
Group-by plotting

Advanced Graphing with Matplotlib, Pandas, and Seaborn

Python, Hadoop and Spark

Introduction to the differences among Python, Hadoop, and Spark
Importing data from Spark and Hadoop to Python
Parallel execution leveraging Spark or Hadoop

Missing Data

Exploring and understanding patterns in missing data
Missing at Random
Missing Not at Random
Missing Completely at Random
Data imputation methods

Traditional Inferential Statistics

Comparing Groups
P-Values, summary statistics, sufficient statistics, inferential targets
T-Tests (equal and unequal variances)
ANOVA
Chi-Square Tests
Correlation
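To make the unequal-variance t-test listed above concrete, here is a minimal sketch that computes the Welch t statistic and its approximate degrees of freedom using only the standard library. The helper name `welch_t` and the sample data are hypothetical. In practice a statistics library such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) would normally be used, and it also returns a p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return the Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (sample variance over n)
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


group_a = [5.1, 4.9, 5.6, 5.8, 5.0]  # illustrative measurements
group_b = [4.2, 4.0, 4.5, 4.8, 4.1]
t_stat, dof = welch_t(group_a, group_b)
```

Unlike the equal-variance test, the degrees of freedom here are estimated from the data and are generally not a whole number.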

Frequentist Approaches to Multivariate Statistics

Linear Regression
Multivariate linear regression
Capturing Non-linear Relationships
Comparing Model Fits
Scoring new data
Poisson Regression Extension
Logistic regression
Logistic Regression Example
Classification Metrics

Machine Learning Approaches to Multivariate Statistics

Machine Learning Theory
Data pre-processing
Missing Data
Dummy Coding
Standardization
Training/Test data
Supervised Versus Unsupervised Learning
Unsupervised Learning: Clustering
Clustering Algorithms
Evaluating Cluster Performance
Dimensionality Reduction
A-priori
Principal Components Analysis
Penalized Regression

Supervised Learning: Regression

Linear Regression
Penalized Linear Regression
Stochastic Gradient Descent
Scoring New Data Sets
Cross Validation
Bias-Variance Tradeoff
Feature Importance

Supervised Learning: Classification

Logistic Regression
LASSO
Random Forest
Ensemble Methods
Feature Importance
Scoring New Data Sets
Cross Validation

Conclusion

Prerequisite Skills

All attendees should have prior programming experience and an understanding of basic statistics.

Schedule (iMVP)

Please contact P2L to schedule the dates for this course.
