Quick starter guide for Data Science and Machine Learning (ML) using R, Python and PySpark


Data Science and Machine Learning are among today's most popular buzzwords, and many people who want to start a career in this field are unsure where to begin. This tutorial gives a basic overview of Machine Learning and a kick start to your Data Science journey.

Data Science is a vast area, and there is something new to learn every day. There is no right or wrong way to start machine learning, but any data science project or use case goes through a few common steps. I thought of listing each stage step by step for beginners who are just starting their data science career, or for anyone who wishes to learn machine learning but is not sure where to start.

In this series of short articles we will try to cover all the stages that any data science / machine learning use case goes through. The examples will use different open-source technologies and libraries that can be used for Machine Learning: Python, R and PySpark.

For each modeling method we will try to cover an overview, the algorithm in detail, and its usage and applications, and then implement a use case in different programming languages, so that we get comfortable with the libraries and languages commonly used for machine learning.

We will also try to cover most of the model evaluation methods and the interpretation of results.

Prerequisites:

  • Basic understanding of statistics.
  • A little knowledge of programming languages like R and Python will help you follow along.

System Requirements:

We will be using Windows to run our examples, and the following tools need to be installed before we can begin.

  1. RStudio
  2. Python 3.7
  3. Jupyter
  4. PySpark

Link - Installation guide for installing R, Python, PySpark and Jupyter on Windows 10
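
Once the tools are installed, a quick check from Python can confirm that the interpreter is recent enough and that the packages can be imported. This is only a minimal sketch: the helper name check_environment is made up for this example, and the module names "pyspark" and "notebook" are assumptions about how PySpark and Jupyter were installed (RStudio is a separate application and is not checked here).

```python
import importlib.util
import sys


def check_environment(modules=("pyspark", "notebook")):
    """Report whether Python is 3.7 or newer and whether each module can be found.

    The default module names are assumptions: "pyspark" for PySpark and
    "notebook" for the classic Jupyter Notebook package.
    """
    report = {"python": sys.version_info >= (3, 7)}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


for tool, ok in check_environment().items():
    print(f"{tool}: {'found' if ok else 'missing'}")
```

Anything reported as missing points back to the installation guide above.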

Machine Learning:

Below are the topics we will cover going forward. We will try to keep this page updated with new links and topics as this Machine Learning series grows.

  • Data Acquisition
  • Data Cleaning
    • Null Values
    • Categorical Data
    • Numerical Data
  • Features
    • Feature Engineering
    • Feature Selection
    • Dimensionality Reduction
  • Modeling Techniques (Commonly used algorithms)
    • Classification
      • Naive Bayes
      • Logistic Regression
      • Decision Tree Classifier
      • Random Forest Classifier
      • SVM - Support Vector Machine
      • GBT - Gradient Boosted Tree Classifier
    • Regression
      • Linear Regression
      • Decision Tree Regressor
      • Random Forest Regressor
      • GBT - Gradient Boosted Tree Regressor
    • Clustering
      • KMeans
      • GMM - Gaussian Mixture Model
    • Recommendation
      • Collaborative Filtering
      • FP Growth
    • Neural Network
    • Time Series
  • Model Evaluation
  • Model Deployment
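
To make the stages in this list a little more concrete, here is a toy walk-through in plain Python, with no libraries needed. It is only a sketch under assumptions: the records, the column names (hours, level, passed) and the nearest-centroid classifier are all made up for illustration, and real projects would normally use libraries such as those covered later in the series.

```python
from statistics import mean

# Stage 1: Data Acquisition. Hypothetical in-memory records stand in
# for a real source such as a file or a database.
raw_rows = [
    {"hours": 1.0, "level": "junior", "passed": 0},
    {"hours": 2.0, "level": "junior", "passed": 0},
    {"hours": None, "level": "junior", "passed": 0},
    {"hours": 8.0, "level": "senior", "passed": 1},
    {"hours": 9.0, "level": "senior", "passed": 1},
    {"hours": None, "level": "senior", "passed": 1},
    {"hours": 1.5, "level": "junior", "passed": 0},
    {"hours": 8.5, "level": "senior", "passed": 1},
]


def clean(rows, fill_value):
    """Stages 2 and 3: fill null values, encode the category, build features."""
    samples = []
    for row in rows:
        hours = fill_value if row["hours"] is None else row["hours"]
        is_senior = 1.0 if row["level"] == "senior" else 0.0
        samples.append(([hours, is_senior], row["passed"]))
    return samples


def fit_centroids(samples):
    """Stage 4: 'train' by averaging the feature vectors of each class."""
    centroids = {}
    for label in {y for _, y in samples}:
        members = [x for x, y in samples if y == label]
        centroids[label] = [mean(column) for column in zip(*members)]
    return centroids


def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, samples):
    """Stage 5: Model Evaluation as the fraction of correct predictions."""
    hits = sum(predict(centroids, x) == y for x, y in samples)
    return hits / len(samples)


# Hold out the last two rows for testing; the fill value comes from
# the training rows only, so nothing from the test rows leaks in.
train_rows, test_rows = raw_rows[:6], raw_rows[6:]
fill = mean(r["hours"] for r in train_rows if r["hours"] is not None)
train, test = clean(train_rows, fill), clean(test_rows, fill)

model = fit_centroids(train)
print("test accuracy:", accuracy(model, test))
```

Each function maps to one bullet in the list; Model Deployment is left out because it depends on where the model will run.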

 

Let me know in the comments if there is any new topic, algorithm or method that I should consider writing about. Happy Learning :)

 

 

 

