As a first step in machine learning, you need some basic tools to get started. In our examples we will use open-source tools like R, Python, and PySpark. This article gives brief instructions on how to set up your working environment and install R, Python, PySpark, and Jupyter on Windows 10.
Installing R and Python on Windows is quite straightforward; we just need to follow the installation wizard. In any case, below are detailed steps you can follow to install R, RStudio, Anaconda (for Python & Jupyter), and PySpark.
Installing R and RStudio on Windows 10
- Download R from http://cran.cnr.berkeley.edu/.
- Click on Download R for Windows. Click on base. Click on Download R 3.5.3 for Windows (or a newer version that appears).
- Install R. Leave all default settings in the installation options.
- Download RStudio Desktop ("Open Source License") for Windows from here (it will be called something like RStudio 1.1.463 - Windows Vista/7/8/10; download any newer version that is available).
- Click on the installer and choose default installation options.
- Click to launch RStudio.
- R & RStudio are now installed.
Installing Python & Jupyter on Windows 10
We will use Anaconda for Python, as it comes with other tools like Spyder, Jupyter, and lots of other preconfigured packages. If you have limited storage or RAM, we recommend installing plain Python and Jupyter instead.
- Download Anaconda Python 3.x from here
- Double-click the installer to launch it, then click Next.
- Click "I Agree"
- Select an install for “Just Me” unless you’re installing for all users (which requires Windows Administrator privileges) and click Next.
- Select a destination folder to install Anaconda and click the Next button.
- Choose whether to add Anaconda to your PATH environment variable. We recommend not adding Anaconda to the PATH environment variable, since this can interfere with other software. Instead, use Anaconda software by opening Anaconda Navigator or the Anaconda Prompt from the Start Menu.
- Choose whether to register Anaconda as your default Python. Unless you plan on installing and running multiple versions of Anaconda, or multiple versions of Python, accept the default and leave this box checked.
- Click the Install button. If you want to watch the packages Anaconda is installing, click Show Details.
- Click the Next button.
- After a successful installation you will see the “Thanks for installing Anaconda” dialog box.
Detailed instruction to install Anaconda can be found here - Install Anaconda on Windows
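Once Anaconda is installed, a quick sanity check from the Anaconda Prompt confirms that a Python 3.x interpreter is what launches (a small illustrative snippet, not part of the installer itself):

```python
import sys

# Anaconda ships Python 3.x, so the interpreter's major version should be 3.
major, minor = sys.version_info[:2]
print("Running Python {}.{} from {}".format(major, minor, sys.executable))
assert major == 3, "expected the Anaconda Python 3.x interpreter"
```

If the assertion fails, another (older) Python earlier on your PATH is shadowing the Anaconda install.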
Installing PySpark on Windows 10
- Download and install Java 7+
- We will need GNU on Windows (GOW) installed before we move on to installing PySpark.
- Download and install GOW from here. Basically, GOW allows you to use Linux commands on Windows. For this install we will need curl, gzip, and tar, which GOW provides.
- Visit the Apache Spark website http://spark.apache.org/downloads.html and download the Spark binaries for Windows.
- Click - Download Spark: spark-2.4.5-bin-hadoop2.7.tgz
- Once the download is complete, move the file to the location where you wish to install Spark:
- mkdir D:\opt\spark
- cd D:\opt\spark
- mv <download location of spark-2.4.5-bin-hadoop2.7.tgz> D:\opt\spark
- Unzip the file.
- gzip -d spark-2.4.5-bin-hadoop2.7.tgz
- tar xvf spark-2.4.5-bin-hadoop2.7.tgz
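If you would rather not install GOW just for gzip and tar, Python's standard-library `tarfile` module can extract the `.tgz` archive in one step. The sketch below demonstrates the idea on a tiny throwaway archive; for the real install, point `archive_path` at `spark-2.4.5-bin-hadoop2.7.tgz` inside `D:\opt\spark` (the helper name `extract_tgz` is mine, for illustration):

```python
import io
import os
import tarfile
import tempfile

def extract_tgz(archive_path, dest):
    """Extract a gzipped tarball (what 'gzip -d' plus 'tar xvf' do in one step)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)

# Demo on a tiny generated archive so the helper can be exercised anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.tgz")
    with tarfile.open(demo, "w:gz") as tar:
        data = b"hello spark"
        info = tarfile.TarInfo("demo/README.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    extract_tgz(demo, tmp)
    extracted = os.path.exists(os.path.join(tmp, "demo", "README.txt"))
    print(extracted)
```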
- Download winutils.exe from here
- Move the winutils.exe file into the spark-2.4.5-bin-hadoop2.7\bin folder.
- Now we need to configure environment variables:
setx SPARK_HOME D:\opt\spark\spark-2.4.5-bin-hadoop2.7
setx HADOOP_HOME D:\opt\spark\spark-2.4.5-bin-hadoop2.7
setx PYSPARK_DRIVER_PYTHON ipython
setx PYSPARK_DRIVER_PYTHON_OPTS notebook
Add D:\opt\spark\spark-2.4.5-bin-hadoop2.7\bin to your path.
- To add these paths manually: right-click My Computer, select Properties, then select Advanced system settings from the left pane, and click "Environment Variables" in the System Properties dialog.
- Use the variable names and values from the setx commands above to create new system variables, and add D:\opt\spark\spark-2.4.5-bin-hadoop2.7\bin to your Path variable.
- Save and close all windows, and reopen cmd prompt.
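From the fresh prompt, you can check that the new variables are actually visible to Python before launching Spark (a small illustrative check; the helper `missing_vars` is mine, not part of Spark):

```python
import os

# The variables the setx commands above should have created.
EXPECTED = ("SPARK_HOME", "HADOOP_HOME",
            "PYSPARK_DRIVER_PYTHON", "PYSPARK_DRIVER_PYTHON_OPTS")

def missing_vars(environ):
    """Return the expected variable names that are not set in `environ`."""
    return [name for name in EXPECTED if name not in environ]

# In a fresh cmd prompt after setx, this should print an empty list:
print(missing_vars(os.environ))
# Against an empty environment, every variable is reported missing:
print(missing_vars({}))
```

If the first list is non-empty, the prompt was opened before `setx` ran, or the variables were created in the wrong scope.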
- Run: pyspark --master local
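As a quick smoke test once the shell (or the Jupyter notebook it opens) comes up, you can run a tiny parallel computation. This is a sketch assuming the spark-2.4.5 install above; it falls back to plain Python if pyspark is not importable in the current interpreter:

```python
# Minimal smoke test: square the numbers 0-4 on a local Spark master.
try:
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("install-check")
             .getOrCreate())
    squares = spark.sparkContext.parallelize(range(5)).map(lambda x: x * x).collect()
    spark.stop()
except ImportError:
    # pyspark is not on this interpreter's path; same logic in plain Python.
    squares = [x * x for x in range(5)]

print(squares)  # [0, 1, 4, 9, 16]
```

Seeing the list of squares confirms that Spark can start a local executor and run jobs end to end.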
We have successfully configured our system for R and Python, and to run Spark & PySpark.
Let me know in the comments if you face any issues.