# **CIS 419/519**
## Primer - Introduction to Python (and useful packages) and Colab
---

### Before starting, you must click on the "Copy To Drive" option in the top bar, or go to File --> Save a Copy to Drive. <ins>This is the master notebook, so you will not be able to save your changes without copying it!</ins> Once you have made the copy, make sure you are working in that version of the notebook so that your work is saved.

In this notebook, we cover the following:

*   Basics of Python
*   Interactive notebooks (Colab and Jupyter)
*   Useful packages like `random`, `numpy` and `sklearn`, along with exercises





# **About Python**
Python is a programming language that is very popular among the machine learning research community.
It is a higher-level programming language than Java, and has an extensive collection of libraries that can be easily installed.
We require that you use Python for this class, so it is important to be familiar with it.

### Python vs Java
- Java requires a separate compile step, but Python does not: you run the Python file directly
- Java is statically typed (a variable's type cannot change), whereas Python is dynamically typed: you do not declare types, and a variable can be rebound to a value of another type
```
# This works in python
x = 3
x = "python"
```
- Instantiating a class in Python does not use `new`
```
# Assume there is a class called `Person`
p = Person('Emily', 25)
```
- Python does not use curly braces `{}` to delimit blocks, as Java does. Instead, it uses indentation to determine the scope of methods, for loops, etc.
```
for i in range(10):
    print(i * 2)      # Indent by 4 spaces
    print(i * 4)
print('Done')       # No indent, runs after for loop
```
- Functions cannot be overloaded, instead use optional arguments
```
def f(x, y=0.0):
    ...  # Function body goes here
f(8)
f(8, 2)
```
- Boolean values are capitalized: `True` and `False` (not `true` and `false`)
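
As a minimal sketch of the optional-argument point above, one Python function can stand in for two Java overloads. The function name `area` is hypothetical and chosen only for illustration.

```python
def area(width, height=None):
    """Return the area of a rectangle, or of a square if only one side is given."""
    if height is None:      # Called as area(3): treat the shape as a square
        height = width
    return width * height

print(area(3))                  # 9
print(area(3, 4))               # 12
print(area(height=2, width=5))  # 10
```

Using `None` as the default (rather than a number) lets the function tell "argument omitted" apart from any real value.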

### Python vs Matlab
- Python uses square brackets `[]` for indexing arrays and matrices instead of `()` in Matlab
- Python is 0-indexed, so the first item in a list is at index 0, not 1
- Unlike Matlab, Python doesn't come with a full development environment by default
- The `numpy` library in Python implements much of the same matrix functionality that Matlab has
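
The indexing differences above can be sketched side by side. The array `A` is a made-up example, and the Matlab equivalents appear only as comments.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

first = A[0, 0]        # Matlab: A(1, 1)
last_row = A[-1, :]    # Matlab: A(end, :)
block = A[0:2, 1:3]    # Matlab: A(1:2, 2:3); Python slices exclude the stop index

print(first)      # 1
print(last_row)   # [4 5 6]
print(block)      # rows [2 3] and [5 6]
```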

### Development Environments
- In this class we will primarily use Google Colaboratory, which is one of many ways you can write Python code. Colab is an easy way to write interactive Python without having to install anything on your machine.
- Other options include:
  - Jupyter Notebooks: http://jupyter.org/install, and https://www.anaconda.com/download/ for the popular Anaconda distribution
  - Text editors: Atom, Sublime
  - PyCharm: https://www.jetbrains.com/pycharm/

While most of what we will do in class can be done in Colab, there may be advanced use cases where it is useful to have one of these other options installed on your machine. So while we don't want you getting bogged down in installations right off the bat, it's a good idea to keep these in mind (we will go over Jupyter installation shortly).



# **Introduction to Google Colaboratory**


<h1>What is Colaboratory?</h1>

Colaboratory, or 'Colab' for short, allows you to write and execute Python in your browser, with 
- Zero configuration required
- Free access to GPUs
- Easy sharing

Whether you're a <strong>student</strong>, a <strong>data scientist</strong> or an <strong>AI researcher</strong>, Colab can make your work easier. Watch <a href="https://www.youtube.com/watch?v=inN8seMm7UI">Introduction to Colab</a> to find out more, or just get started below!

## <strong>Getting started</strong>

The document that you are reading is not a static web page, but an interactive environment called a <strong>Colab notebook</strong> that lets you write and execute code.

For example, here is a <strong>code cell</strong> with a short Python script that computes a value, stores it in a variable and prints the result.  Note that you can import many common packages, including some packages that would otherwise have to be pip-installed, directly into Colab.

In [None]:
import numpy as np

xs = range(10)
std_val = np.std(xs)
std_val

To execute the code in the above cell, select it with a click and then either press the play button to the left of the code, or use the keyboard shortcut 'Command/Ctrl+Enter'. To edit the code, just click the cell and start editing.

Variables that you define in one cell can later be used in other cells:

In [None]:
seconds_in_a_day = 86400

In [None]:
seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

Colab notebooks allow you to combine <strong>executable code</strong> and <strong>rich text</strong> in a single document, along with <strong>images</strong>, <strong>HTML</strong>, <strong>LaTeX</strong> and more. When you create your own Colab notebooks, they are stored in your Google Drive account. You can easily share your Colab notebooks with co-workers or friends, allowing them to comment on your notebooks or even edit them. To find out more, see <a href="/notebooks/basic_features_overview.ipynb">Overview of Colab</a>. To create a new Colab notebook you can use the File menu above, or use the following link: <a href="http://colab.research.google.com#create=true">Create a new Colab notebook</a>.

Colab notebooks are Jupyter notebooks that are hosted by Colab. To find out more about the Jupyter project, see <a href="https://www.jupyter.org">jupyter.org</a>.

## Data science

With Colab you can harness the full power of popular Python libraries to analyse and visualise data. The code cell below uses <strong>numpy</strong> to generate some random data, and uses <strong>matplotlib</strong> to visualise it. To edit the code, just click the cell and start editing.

In [None]:
import numpy as np
from matplotlib import pyplot as plt

ys = 200 + np.random.randn(100)
x = [x for x in range(len(ys))]

plt.plot(x, ys, '-')
plt.fill_between(x, ys, 195, where=(ys > 195), facecolor='g', alpha=0.6)

plt.title("Sample Visualization")
plt.show()

You can import your own data into Colab notebooks from your Google Drive account, including from spreadsheets, as well as from GitHub and many other sources. To find out more about importing data, and how Colab can be used for data science, see the links below under <a href="#working-with-data">Working with data</a>.

## Machine learning

With Colab you can import an image dataset, train an image classifier on it, and evaluate the model, all in just <a href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb">a few lines of code</a>. Colab notebooks execute code on Google's cloud servers, meaning you can leverage the power of Google hardware, including <a href="#using-accelerated-hardware">GPUs and TPUs</a>, regardless of the power of your machine. All you need is a browser.

Colab is used extensively in the machine learning community with applications including:
- Getting started with TensorFlow
- Developing and training neural networks
- Experimenting with TPUs
- Disseminating AI research
- Creating tutorials

To see sample Colab notebooks that demonstrate machine learning applications, see the <a href="#machine-learning-examples">machine learning examples</a> below.

# **Jupyter (Optional)**

### Python/Jupyter Installation Instructions
First, make sure you have Python 3 installed on your local machine (3.9.5 was the latest stable version when this notebook was written).
We recommend managing your python installation with Anaconda.
The Anaconda website has instructions for how to install python based on your operating system:
- https://www.anaconda.com/download/

Second, make sure that you have Jupyter notebooks installed.
It may have already come with your Anaconda installation.
If not, the Jupyter website has installation instructions:
- http://jupyter.org/install

If you have never used python or Jupyter notebooks before, we recommend bookmarking these resources:
- Python: https://developers.google.com/edu/python/
- Jupyter: https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook

Be aware that the Google tutorial uses Python 2, so some of its example code may not work directly in Python 3:
- Print statements now require parentheses: `print x` becomes `print(x)`
- Reading text files no longer needs a special Unicode-aware open: `open('file.txt', 'r')` returns text, and accepts an `encoding` argument if you need one
- The division operator `/` no longer does integer division: `5 / 2` is `2.5` (use `//` for integer division)
- The `xrange` function is gone; `range` now behaves the way `xrange` did

Other differences can be found on the web: https://www.geeksforgeeks.org/important-differences-between-python-2-x-and-python-3-x-with-examples/
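
A quick sketch of the division and `range` differences listed above, as they behave in Python 3:

```python
print(5 / 2)     # 2.5  (true division)
print(5 // 2)    # 2    (floor division, what `/` did for ints in Python 2)
print(-5 // 2)   # -3   (floors toward negative infinity)

r = range(3)     # A lazy sequence, like Python 2's xrange
print(list(r))   # [0, 1, 2]
```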

## Jupyter Notebooks
When you start a Jupyter Notebook, you are starting a server on your local machine that hosts the notebooks.
```
> jupyter notebook
```
Once the server has started, open up a web browser and go to http://localhost:8888.
You should see a file tree, which you can use to navigate to where you want to save your Jupyter Notebook.

Jupyter Notebooks allow you to develop Python code interactively.
Each cell has a type: either Markdown (https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) or Python code.
You can write as much code in a cell as you need.
To execute a cell, press Shift+Enter.
Anything displayed by your program will be written below the cell.
Variables defined in an executed cell stay in memory and can be used later (similar to the Matlab workspace).
It is as if program execution were paused after each cell.
Code in one cell can use anything defined in a cell _which has already been executed_.
If you have already run a cell and then edit its code, you must rerun the cell for your changes to take effect.

If you close the Jupyter Notebook, you will need to rerun all of the cells from scratch so that the variables and functions are defined again. **Make sure you save before you close the notebook!**

### Submitting assignments
Just like in Colab, your Jupyter Notebook can be downloaded as a notebook under File -> Download as -> Notebook (.ipynb), as well as a variety of other file types, such as a straight Python (.py) file.

# **Coding in Python**

### Basics

In [None]:
# Assigning variables
x = 4.2
y = 8
s = 'hello'

# The types of variables can change without a problem
y = 'abc'

# To display variables, use the print function
print(y)

# The "null" type is called `None`
z = None
print(z)

print(z is None)  # True; `is` is the idiomatic way to compare against None

# Casting
s = '140'
print(int(s))
s = '140.243'
print(float(s))
x = 123
print(str(x))

### String Manipulation

In [None]:
s = "ABCdef"
s = 'ABCdef'

# Length
print(len(s))  # 6

# Standard Indexing: Notice hard brackets and 0-indexing
print(s[0])      # 'A'
print(s[1])      # 'B'
print(s[5])      # 'f'
# print(s[100])  # Error

# Range Indexing
print(s[1:4])    # 'BCd'
print(s[2:])     # 'Cdef'
print(s[:2])     # 'AB'
print(s[1:100])  # 'BCdef'

# Negative Indexing
print(s[-1])    # 'f'
print(s[-2])    # 'e'
print(s[:-1])   # 'ABCde'
print(s[:-3])   # 'ABC'
print(s[::-1])  # 'fedCBA'

# Concatenation
print(s + 'xyz')     # 'ABCdefxyz'
# print(s + 123)     # Error
print(s + str(123))  # 'ABCdef123'

# Splitting
print(s.split('C'))  # ['AB', 'def']

### Lists

In [None]:
a = [0, 4, 2, 8, 9]  # Create a list with initial items
b = []  # Create an empty list

# Length
print(len(a))  # 5

# The same indexing operations on strings work for lists
print(a[1])    # 4
print(a[2:4])  # [2, 8]
print(a[-2])   # 8

# Sort the list
print(sorted(a))                # [0, 2, 4, 8, 9] Creates a copy, does not modify the original list
print(sorted(a, reverse=True))  # [9, 8, 4, 2, 0] Also a copy

# Add items to the end of the list
a.append(10)  # Returns nothing
print(a)      # [0, 4, 2, 8, 9, 10] 
a += [1, 12]  # Returns nothing
print(a)      # [0, 4, 2, 8, 9, 10, 1, 12]

# List items don't have to be the same type in Python
a.append('machine learning')
print(a)            # [0, 4, 2, 8, 9, 10, 1, 12, 'machine learning']
# print(sorted(a))  # Causes a TypeError: ints and strings cannot be compared

# Searching
print(8 in a)    # True
print(100 in a)  # False

## Dictionaries
https://developers.google.com/edu/python/dict-files

In [None]:
d = {}       # Create an empty dictionary, which is a key-value map
d['a'] = 97  # Add an item to the dictionary
print(d)     # {'a': 97}
print(d['a'])

d['c'] = 99
d[0] = [0, 1, 2]  # Keys and values don't need to be the same type
print(d)          # {'a': 97, 'c': 99, 0: [0, 1, 2]}

# Retrieve the keys and values of the dictionary.
# These are view objects, not lists. Call list(d.keys()) to convert one to a list
keys = d.keys()      # A view of the keys, in insertion order (guaranteed since Python 3.7)
values = d.values()  # A view of the values, in the same order
print(keys)          # dict_keys(['a', 'c', 0])
print(values)        # dict_values([97, 99, [0, 1, 2]])

# Check to see if items are in the dictionary (checks keys)
print('a' in d)  # True
print('x' in d)  # False

# Remove an item from the dictionary
del d['a']    # Returns nothing, modifies in place
# del d['y']  # Error (KeyError): 'y' is not a key
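
Since looking up or deleting a missing key raises an error, it can help to see the non-raising alternatives. This is a small sketch using the built-in `dict.get` and `dict.pop` methods on a made-up dictionary.

```python
d = {'a': 97, 'c': 99}

print(d.get('a'))      # 97
print(d.get('y'))      # None, no KeyError
print(d.get('y', 0))   # 0, an explicit default

removed = d.pop('y', None)   # Like `del d['y']`, but returns the default instead of raising
print(removed)               # None
print(d)                     # {'a': 97, 'c': 99}
```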

## If Statements
Covered in the strings section: https://developers.google.com/edu/python/strings

In [None]:
# Python doesn't require parentheses around the condition
a, b = 7, 10

# equality
if a == 7:
    print('yes')
else:
    print('no')
    
# not equal
if a != 8:
    print('neq')
    
# and
if a > 5 and b <= 10:
    print('1')
elif a > 10:
    print('2')
else:
    print('3')

# or
if a < 10 or b > 100:
    print('yes')
    
# Strings also use '=='
s = 'abc'
if s == 'abc':
    print('here')

## For Loops
Covered under lists: https://developers.google.com/edu/python/lists

In [None]:
a = [0, 8, 3, 5, 1]

range(5)     # Dynamically generates 0, 1, 2, 3, 4
range(1, 5)  # 1, 2, 3, 4

# "Standard" for loop from Java
for i in range(len(a)):
    print(a[i])

# Iterate over each element in the list
for x in a:
    print(x)
    
# Iterating through dictionaries
d = {0: 'a', 1: 'b', 2: 'c'}
for key, value in d.items():
    print(str(key) + ' -> ' + value)
    
    
# Loop over items in a list and get the index at the same time
for index, item in enumerate(a):
    print(index, item)

## Methods

In [None]:
# No return type specification necessary
def add(x, y):
    return x + y

def concat(list_1, list_2):
    list_1 += list_2

print(add(3, 4))        # 7
# print(add(4, 8, 10))  # Error

# Arguments can be passed by name
print(add(y=8, x=2))  # 10

a1 = [0, 1]
a2 = [3, 5]
concat(a1, a2)  # Returns nothing
print(a1)       # [0, 1, 3, 5]

# Functions can return more than one value (they are packed into a tuple)
def two(a):
    return a + 1, a + 2

a3, a4 = two(1)
print(a3)  # 2
print(a4)  # 3

t = two(1)   # Without unpacking, `t` is the tuple (2, 3)
print(t[0])  # 2
print(t[1])  # 3
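
The `concat` function above changes the caller's list because `+=` mutates a list in place. The following sketch contrasts that with rebinding the parameter; both function names are hypothetical.

```python
def concat_in_place(list_1, list_2):
    list_1 += list_2           # Mutates the caller's list

def concat_rebind(list_1, list_2):
    list_1 = list_1 + list_2   # Builds a new list; the caller's list is untouched
    return list_1

a = [0, 1]
concat_in_place(a, [3, 5])
print(a)   # [0, 1, 3, 5]

b = [0, 1]
c = concat_rebind(b, [3, 5])
print(b)   # [0, 1]
print(c)   # [0, 1, 3, 5]
```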

## List and Dictionary Comprehension

In [None]:
# Sometimes code can be simplified using list/dictionary comprehensions
def add_one(x):
  return x + 1

# The standard way to apply a function to every element and save in y
x = [0, 1, 2, 3]
y = []
for i in range(len(x)):
  y.append(add_one(x[i]))

# This is equivalent to the above
y = [add_one(x_i) for x_i in x]

# Add conditionals
y2 = [add_one(x_i) for x_i in x if x_i > 1]

# Dictionary comprehensions can be used to create dictionaries easily.
# This creates a mapping from x_i -> x_i * 4
d = {x_i : x_i * 4 for x_i in x}
print(d)
print(d[2])  # 8

## Files
https://developers.google.com/edu/python/dict-files

In order to use the files in this section, we will need to upload them since Google Colab does not persist files unless they are linked to Google Drive. The following cell will allow you to upload the files from Canvas. The files are located in Canvas, under Files/PDF Primer and Datasets/Datasets for Python Primer/ . Upload all the files in that folder to this Colab session to use them in the following cells.

In [None]:
from google.colab import files

uploaded = files.upload()

In [None]:
# Reading from a file
# The uploaded file 'instructions.txt' sits in the notebook's working directory.
# You can use this for loop template to iteratively read a file line by line. Generally, you will
# include some parsing logic to extract data from each line
lines = []
with open('instructions.txt', 'r') as f:   # This is the line which opens the file. `f` is the file handle
                                    # The `with open` will automatically close the file after you are done reading.
                                    # 'r' means you want to read from the file
    for line in f:
        line = line.strip()         # `strip()` removes any  whitespace from the beginning and end of the string
                                    # By default, each line will have '\n' at the end. 
        upper = line.upper()        # Here is where you would add your custom parsing logic
        lines.append(upper)

print(lines)  # ['A', 'B', 'C', 'D', 'E', 'F', 'G']

# Writing to a file
# You can open and close the file yourself...
out = open('out.txt', 'w')
out.write('hi')
out.close()

# ...but `with open` is safer because it closes the file for you.
# This will write each of the capital letters to a line in an output file
with open('output.txt', 'w') as out:  # 'w' means you want to write to the file. This will overwrite the file if it exists
    for item in lines:
        out.write(str(item) + '\n')   # the `write` method only accepts strings and you have to
                                      # manually take care of the '\n' yourself
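
To check the read and write templates above without the Canvas files, here is a self-contained round trip. The file name `roundtrip.txt` and the scratch directory are assumptions made for illustration.

```python
import os
import tempfile

items = ['A', 'B', 'C']

# Write one item per line into a scratch directory
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.txt')
with open(path, 'w') as out:
    for item in items:
        out.write(item + '\n')

# Read the lines back, stripping the trailing newline from each
with open(path, 'r') as f:
    read_back = [line.strip() for line in f]

print(read_back)   # ['A', 'B', 'C']
```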

### Classes

In [None]:
class Person(object):
    # This is the constructor. All instance methods of a class
    # must accept `self` as the first parameter.
    # `self` is a reference to the current instance; you use it to access
    # the data members and methods of that instance,
    # similar to `this` in Java
    def __init__(self, name, age):
        # This will call the super class's constructor; Optional if you inherit from `object`
        super().__init__()
        
        # Assign data members
        self.name = name
        self.age = age
        
    def is_adult(self):
        return self.age >= 18
    
    def is_child(self):
        # Other instance methods are called through `self`.
        return not self.is_adult()
    
    def increment_age(self, amount):
        self.age += amount
        
    # Test to see if two `Person` objects are equal
    def __eq__(self, other):
        return self.name == other.name and self.age == other.age

In [None]:
p = Person('Bob', 19)

# Data members can be accessed directly
print(p.name)  # 'Bob'
print(p.age)   # 19

# Even though all of the methods require a `self` argument, you ignore that argument
# when calling the method
print(p.is_adult())  # True 

p.increment_age(1)  # Returns nothing
print(p.age)        # 20

p2 = Person('Mary', 30)
print(p == p2)  # False

p3 = Person('Mary', 30)
print(p2 == p3)  # True
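
To see why `Person` defines `__eq__`, compare it with a class that does not. `PlainPerson` is a hypothetical class written only for this sketch.

```python
class PlainPerson:
    # Same data as Person, but no __eq__ method
    def __init__(self, name, age):
        self.name = name
        self.age = age

m1 = PlainPerson('Mary', 30)
m2 = PlainPerson('Mary', 30)

print(m1 == m2)   # False: without __eq__, == only checks whether both are the same object
print(m1 == m1)   # True
print(m1 is m2)   # False
```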

# **Useful Packages**
There are many packages that come with your python installation as well as many which can be downloaded easily for free.
Importing a package requires the `import` command.
Most packages can be installed using the `pip` command (which is run on the command line, not within a python environment), that is included with the python installation.

```
> pip install numpy
> pip install scikit-learn
```


### **random**
The `random` package provides tools for generating random numbers.

In [None]:
import random

random.seed(4)

print(random.randint(1, 5))  # A random integer in [1, 5] (both endpoints included)

print(random.random())  # A random float in [0.0, 1.0)

# Randomly shuffle a list
a = [0, 1, 2, 3, 4, 5]
random.shuffle(a)  # Shuffles in place - does not return a copy
print(a)
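
The point of `random.seed` is reproducibility. This short sketch shows that resetting the same seed replays the same sequence of draws.

```python
import random

random.seed(4)
first_run = [random.randint(1, 5) for _ in range(3)]

random.seed(4)   # Reset to the same seed
second_run = [random.randint(1, 5) for _ in range(3)]

print(first_run == second_run)               # True
print(all(1 <= n <= 5 for n in first_run))   # True: draws stay within [1, 5]
```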

### **numpy**
`numpy` is a very useful matrix library for Python.
It is used very frequently in machine learning, so it is good to be familiar with it.
If your program crashes with an import error when you import numpy, you need to install it with
```
> pip install numpy
```

In [None]:
# Imports can be aliased to simplify code
import numpy as np  # The numpy library is now referenced with np

# Create a vector of length 5 of all 0s
v = np.zeros(5)
print(v)

print(v.shape)     # Returns a tuple of the dimensions: (5,)
print(v.shape[0])  # Returns the size of the 0th dimension, 5

# You can assign specific entries by their index into the vector
v[0] = 1
v[1] = 4

# You can assign ranges of values
v[2:5] = [5, 3, 6]
print(v)

# Create a random vector of values between 0 and 1
np.random.seed(1)
v1 = np.random.rand(5)
print(v1)

# Compute the dot product
print(np.dot(v, v1))

# Element-wise multiplication
print(v * v1)
print(np.multiply(v, v1))  # equivalent

# Multiply the whole vector by a scalar, add a constant
print(v * 5)
print(v + 3)
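
The dot product and the element-wise product are related: the dot product is the sum of the element-wise products. Here is a small check with made-up vectors.

```python
import numpy as np

v = np.array([1.0, 4.0, 5.0])
w = np.array([2.0, 0.5, 1.0])

elementwise = v * w    # [2. 2. 5.]
dot = np.dot(v, w)     # 2 + 2 + 5

print(elementwise)                          # [2. 2. 5.]
print(dot)                                  # 9.0
print(np.isclose(dot, elementwise.sum()))   # True
```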

In [None]:
# Matrices work much the same way (a vector is just an array with 1 dimension)
# `np.ones` creates an array of all 1s with the given shape
t = (3, 4)
X = np.ones(t)
print(X)
print(X.shape)     # (3, 4)
print(X.shape[0])  # 3
print(X.shape[1])  # 4

# Assigning specific entries requires 2 indices
# The first is the row index, the second is the column index
X[1, 3] = 8
print(X)

# The `:` selects the entire row or column
row1 = X[1, :]
print(row1)
print(X[:, 2])

# It can also be used to assign values
X[1, :] = [1, 2, 3, 4]
print(X)

# Matrix multiplication
Y = np.random.rand(4, 5)
print(np.matmul(X, Y))  # A matrix of size (3, 5)

# Transpose
print(X.transpose())
print(X.T)  # equivalent

# You can create a matrix from a list (of lists)
Z = np.asarray(
    [
        [0, 1, 2], 
        [3, 4, 5]
    ])
print(Z)

In [None]:
# Useful ways to initialize matrices
A = np.zeros((5, 3))  # A 5x3 matrix of all 0s
print(A)

A = np.ones((5, 3))  # A 5x3 matrix of all 1s
print(A)

A = np.eye(5)  # The identity matrix of size 5x5
print(A)

A = np.random.rand(5, 3)  # A random matrix of size 5x3 with numbers between 0 and 1
print(A)

A = np.random.randint(1, 4, (5, 3))  # A random 5x3 matrix of integers in [1, 4), i.e. 1, 2 or 3
print(A)
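
Shapes and value ranges are easy to get wrong, so it is worth checking them directly. This sketch verifies that the shape tuple is (rows, columns) and that the upper bound of `np.random.randint` is exclusive.

```python
import numpy as np

np.random.seed(0)
A = np.random.randint(1, 4, (5, 3))   # low=1 is inclusive, high=4 is exclusive

print(A.shape)        # (5, 3)
print(A.min() >= 1)   # True
print(A.max() <= 3)   # True: 4 is never drawn

B = np.zeros((5, 3))
print(B.shape)        # (5, 3): 5 rows, 3 columns
```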


### More information on `numpy`
To learn more about the methods and functions in `numpy`, visit the `numpy` User Guide: https://numpy.org/doc/stable/user/

### **sklearn**
`sklearn` is a useful machine learning library. It can be installed with
```
> pip install scikit-learn
```

In [None]:
# An example learning problem. You may use this as a template for homework
from sklearn.linear_model import SGDClassifier

# This is the feature matrix. There are 7 rows (7 training examples) with
# 4 columns (4 features).
X = np.asarray(
[
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1]
])

# These are the binary class labels for all 7 instances
y = np.asarray([0, 0, 1, 1, 0, 0, 0])

# Create the classifier
clf = SGDClassifier(loss="hinge", penalty="l2")

# Train the classifier on our labeled data
clf.fit(X, y)

# Use our trained classifier to predict labels for new input data
X_test = np.asarray(
[
    [1, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1]
])
y_pred = clf.predict(X_test)

# Let's assume we know the actual labels of `X_test` are this
y_test = np.asarray([0, 1, 1])

# We can compute the total number of correct classifications:
num_correct = sum(y_pred == y_test)
total = y_pred.shape[0]
print('Accuracy: ' + str(num_correct / total * 100))
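
The accuracy computation above can also be written as the mean of a Boolean array. The predictions and labels below are hypothetical values chosen so the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical predictions and true labels, for illustration only
y_pred = np.asarray([0, 1, 0, 1])
y_test = np.asarray([0, 1, 1, 1])

matches = (y_pred == y_test)   # Boolean array: [True, True, False, True]
accuracy = np.mean(matches)    # True counts as 1, False as 0

print(accuracy * 100)          # 75.0
```

`sklearn.metrics.accuracy_score(y_test, y_pred)` returns the same fraction if you prefer a library call.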

A key feature of sklearn is its ability to split datasets. For supervised machine learning, we need to split our data into a training set on which we fit our models, a validation set on which we compare different iterations of our models (which we can use to tune and improve the model), and a test set on which we evaluate the final model. Sklearn lets us split the data easily using the `train_test_split` function. The function is well documented, so we encourage you to read the documentation for the specifics, but we will run an example here.

In [None]:
# import train_test_split
from sklearn.model_selection import train_test_split

# Let's practice using a random array with ints ranging from 0-3
X = np.random.randint(4, size= (6,5))

# Print the feature matrix so we can compare it with the splits below
print(f'X:\n {X}')

# We'll also say we have a corresponding array of class labels. 
# Often, this will be in the same dataset, and you will have to split it into its own array
y = np.random.randint(2, size = (6, 1))
print(f'y:\n {y}')

# Now we can split the data using train_test_split. Note that the class labels stay with their corresponding rows.
# We are setting a test size of .3, meaning about 30% of our data will be used as a test set (with 6 rows, that rounds up to 2), and the rest for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3)
print(f'X_train:\n {X_train}')
print(f'y_train:\n {y_train}')
print(f'X_test:\n {X_test}')
print(f'y_test:\n {y_test}')

Sklearn can also split data in ways other than purely at random, such as stratified sampling, which tries to retain the same class proportions in each split. We encourage you to look through the documentation to learn about other sampling methods!
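
As a minimal sketch of a stratified split, `train_test_split` accepts a `stratify` argument. The data below is hypothetical, and `random_state=0` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 examples, 6 of class 0 and 4 of class 1
X = np.arange(20).reshape(10, 2)
y = np.asarray([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Each half keeps the 60/40 class ratio of the full dataset
print(np.bincount(y_train))   # [3 2]
print(np.bincount(y_test))    # [3 2]
```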

### More information on `sklearn`
To learn more about the methods and functions supported in `sklearn`, visit the `sklearn` user guide: https://scikit-learn.org/stable/user_guide.html

### **pandas**

In [None]:
import pandas as pd
import numpy as np

# !pip install jupyter_contrib_nbextensions

#### Series & DataFrames

In [None]:
sports = pd.Series(['football', 'basketball', 'volleyball', 'tennis'])

population = pd.Series({'Germany': 81.3, 'Belgium': 11.3, 'France': 64.3, 
                        'United Kingdom': 64.9, 'Netherlands': 16.9})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [None]:
sports

In [None]:
population

In [None]:
countries

In [None]:
type(population)

In [None]:
sports.index

In [None]:
population.index

In [None]:
population['Belgium']

In [None]:
population.values

In [None]:
population/100

In [None]:
type(population.values)

Accessing dataframe variables using the '.' operator

In [None]:
type(countries.area)

In [None]:
countries.area.values

In [None]:
type(countries.capital.values)

#### Basic Methods

In [None]:
countries.columns

In [None]:
countries.dtypes

In [None]:
countries.head(3)

In [None]:
countries.describe()

In [None]:
countries.values

In [None]:
countries.info()

In [None]:
countries.capital.value_counts()

In [None]:
population

In [None]:
population.reset_index()

In [None]:
type(population.reset_index())

In [None]:
countries.capital.value_counts().reset_index()

#### Selecting and Filtering Data

<div class="alert alert-warning">
<b>ATTENTION!</b>: <br><br>

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex than in numpy. <br><br> We now have to distinguish between:

 <ul>
  <li>selection by <b>label</b></li>
  <li>selection by <b>position</b></li>
</ul>
</div>

In [None]:
df = pd.read_csv("train.csv")

In [None]:
df.head()

##### `df[]` provides some convenience shortcuts

Selecting a single column

In [None]:
df['Pclass']  # Can also use df.Pclass

Selecting multiple columns

In [None]:
df[  ['Pclass','Sex']   ]

Keep in mind that when we select more than one column, the output is a DataFrame and not a Series. Hence the difference in formatting between the two outputs above.





With a slice instead of a column name, the same `[]` syntax selects rows by position

In [None]:
df[3:5]

#### Systematic indexing with `loc` and `iloc`

When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
    
* `loc`: selection by label
* `iloc`: selection by position

These methods index the different dimensions of the frame:

* `df.loc[row_indexer, column_indexer]`
* `df.iloc[row_indexer, column_indexer]`

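**With the default integer index (0, 1, 2, ...), row labels and positions coincide, so `loc` and `iloc` pick the same row for a single label. Slices still differ: `loc[5:7]` includes row 7, while `iloc[5:7]` stops before it.**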

In [None]:
df.loc[4,'Fare']

In [None]:
df.loc[df.Sex=='female']

In [None]:
df.loc[df.Sex=='female','Fare']

In [None]:
df.loc[df.Sex=='female',['Fare','Name','Sex']]

In [None]:
df.loc[df.Sex=='female'][['Fare','Name','Sex']]

iloc is based on the position of the elements

In [None]:
df.iloc[4]

In [None]:
df.iloc[5:7]

In [None]:
df.iloc[5:7]['Fare']

In [None]:
df.iloc[[1,2,3,8]]

In [None]:
df.loc[[1,2,3,8]]

In [None]:
population

In [None]:
population.iloc[0]

In [None]:
population.loc['Germany']
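
The difference between label-based and position-based selection is easiest to see when the labels are integers that do not match the positions. The Series below are hypothetical examples.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

print(s.loc[0])    # 40: the row labelled 0
print(s.iloc[0])   # 10: the first row by position

# Label slices include the stop label; position slices exclude the stop
t = pd.Series([10, 20, 30, 40])
print(len(t.loc[1:2]))    # 2 (labels 1 and 2)
print(len(t.iloc[1:2]))   # 1 (position 1 only)
```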

The different indexing methods can also be used to assign data:

In [None]:
df2 = df.copy()

df2.loc[0,'Fare'] = -100.0

In [None]:
df2.head()

#### Creating New Variables

In [None]:
countries['newVar'] = [1,2,3,4,5]                   #Basic assignment
countries

In [None]:
countries['newVar'] = countries.population * 2  + countries.area**0.5   #Using existing columns
countries

##### Using apply

`apply` is a powerful method that runs a function on every element of a column (or on every row or column of a dataframe), which makes it useful for many data manipulation tasks

In [None]:
countries['capital_upper'] = countries['capital'].apply(lambda x : x.upper())
countries

In [None]:
def ageBucket(x):
    if x<18:
        return "A. <18"
    elif x<25:
        return "B. 18-25"
    elif x<45:
        return "C. 25-45"
    else:
        return "D. >45"   # Note: a missing age (NaN) fails every comparison above and also lands here
        

Apply can be used on a single column (Series object)

In [None]:
df['AgeBucket'] = df['Age'].apply(ageBucket)  # Pass the function itself; no lambda wrapper is needed
df.head()

It can also be used on an entire dataframe

In [None]:
df['AgeBucket2'] = df.apply(lambda x : ageBucket(x['Age']),axis=1)
df.head()

Other related methods that you can look into: `map` and `applymap` (recent pandas versions rename `DataFrame.applymap` to `DataFrame.map`)

#### Groupby Operations

##### Some 'theory': the groupby operation (split-apply-combine)

The "group by" concept: we want to **apply the same function on subsets of your dataframe, based on some key to split the dataframe in subsets**

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure

<img src="https://github.com/CIS-519/primer-dev/blob/master/pandas-tutorial-master/img/splitApplyCombine.png?raw=1">

Similar to SQL `GROUP BY`
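
The three steps above can be sketched by hand and compared with `groupby`. The dataframe `df_demo` and its columns are hypothetical, chosen so the sums are easy to check.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                        'value': [1, 2, 3, 4, 5]})

# Split, apply and combine by hand
manual = {}
for key in df_demo['key'].unique():          # split
    group = df_demo[df_demo['key'] == key]
    manual[key] = group['value'].sum()       # apply
manual = pd.Series(manual)                   # combine

# The same result in one line
grouped = df_demo.groupby('key')['value'].sum()

print(int(manual['a']), int(manual['b']))     # 9 6
print(int(grouped['a']), int(grouped['b']))   # 9 6
```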

In [None]:
df.groupby('Sex')

In [None]:
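df.groupby("Sex").mean(numeric_only=True)  # Average only the numeric columns; recent pandas versions raise an error on text columns otherwise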

In [None]:
def getRange(x):
    
    minVal = np.min(x.Fare)
    maxVal = np.max(x.Fare)
    
    return maxVal - minVal


df.groupby('Pclass').apply(lambda x : getRange(x))

Grouping on multiple columns

In [None]:
df.groupby(['Sex','Pclass']).mean(numeric_only=True)  # Average only the numeric columns

In [None]:
df.groupby(['Sex','Pclass'])['Age'].mean()

In [None]:
df.groupby('Sex').agg({'PassengerId':'min', 'Age':'max','Fare':'sum'})

#### Merge Operations

Merging with pandas works pretty much the same as joining in SQL. There are four main merge methods:
1. Left
2. Right
3. Inner 
4. Outer

Basic syntax: `pd.merge(left_dataframe, right_dataframe, left_on="some_column", right_on="some_column", how="left|right|inner|outer")`

In [None]:
population = pd.DataFrame({'country': ['Germany', 'Belgium', 'France', 
                        'United Kingdom', 'United States'],'population': [81.3, 11.3, 64.3, 64.9, 65.9]})

countries = pd.DataFrame({'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']})

In [None]:
population

In [None]:
countries

In a Left Merge we keep every row from the LEFT dataframe and add the columns from the RIGHT dataframe wherever
the country matches. Left-side countries with no match get NaN in the right-side columns.

In [None]:
pd.merge(left=population, right=countries, on="country", how="left")

In a Right Merge we keep every row from the RIGHT dataframe and add the columns from the LEFT dataframe wherever
the country matches. Right-side countries with no match get NaN in the left-side columns.

In [None]:
pd.merge(left=population, right=countries, on="country", how="right")

With an Inner Merge, we chop up both dataframes and only glue the stuff that matches. If a country isn't in both 
dataframes, we don't keep it and we don't add NaN's. If no type of join is mentioned, then inner join is the 
default join. 

In [None]:
pd.merge(left=population, right=countries,on ='country')

In [None]:
pd.merge(left=population, right=countries,on ='country', how = "inner")

With an Outer Merge, we chop up both dataframes and keep everything from both sides. Then we toss in NaN's to fill
any blanks.

In [None]:
pd.merge(left=population, right=countries,on ='country', how = "outer")
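
To compare the four merge methods side by side, the sketch below counts the rows each one produces on two hypothetical frames (`left` and `right`, with made-up keys A to D). It also uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical frames: keys A, B, C on the left and B, C, D on the right.
left = pd.DataFrame({"key": ["A", "B", "C"], "left_value": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "right_value": [20, 30, 40]})

# Row count produced by each merge method.
sizes = {how: len(pd.merge(left, right, on="key", how=how))
         for how in ["left", "right", "inner", "outer"]}
print(sizes)  # {'left': 3, 'right': 3, 'inner': 2, 'outer': 4}

# indicator=True adds a `_merge` column: left_only, right_only or both.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer)
```

Only keys B and C exist on both sides, so the inner merge has two rows, while the outer merge keeps all four keys.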

#### Reading Files

The following code will not run unless the correct files have been uploaded (in the Python Coding section above). In case they are not, run the next cell to upload them again. The files are on Canvas, under Files/PDF Primer and Dataset/Datasets for Python Primer/

In [None]:
from google.colab import files

uploaded = files.upload()

In [None]:
sales_data = pd.read_csv('blooth_sales_data.csv')

In [None]:
sales_data.head(5)

In [None]:
# header = 0 means the first line of the file holds the column names. This is also what happens by default when header is not given.
sales_data2 = pd.read_csv('blooth_sales_data.csv', header = 0)

In [None]:
sales_data2.head(5)

In [None]:
sales_data3 = pd.read_csv('blooth_sales_data.csv', header = None)
sales_data3.head(5)
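
The effect of `header` is easiest to see on a file small enough to read at a glance. This sketch reads a hypothetical two-row CSV from an in-memory string (via `io.StringIO`) instead of the course dataset:

```python
import io
import pandas as pd

# Hypothetical CSV content: one header line followed by two data rows.
csv_text = "name,units\nAda,3\nGrace,5\n"

# header=0: the first line supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None: every line is data and columns are numbered 0, 1, ...
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.columns.tolist())      # ['name', 'units']
print(no_header.columns.tolist())        # [0, 1]
print(len(with_header), len(no_header))  # 2 3
```

With `header=None` the header line becomes an ordinary first row, which is why the row count goes from 2 to 3.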

In [None]:
sales_data = pd.read_csv('blooth_sales_data.csv', usecols=['name', 'birthday'])
sales_data.head(5)

In [None]:
sales_data = pd.read_csv('blooth_sales_data.csv', header= None, skiprows=2)
sales_data.columns= ['name', 'birthday','customer','orderdate','product','units','unitprice']
sales_data.head(2)

In [None]:
# By default, ambiguous dates are parsed in the US format: MM/DD/YYYY
sales_data = pd.read_csv('blooth_sales_data.csv',parse_dates=['birthday', 'orderdate'])
sales_data.head(2)                     

In [None]:
# To parse dates in the more common international format (DD/MM/YYYY), add 'dayfirst=True'
sales_data = pd.read_csv('blooth_sales_data.csv',parse_dates=['birthday', 'orderdate'], dayfirst=True)
sales_data.head(2) 
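
The difference only shows up on ambiguous dates, where both numbers could be a month. A minimal sketch using `pd.to_datetime` on a single made-up date string:

```python
import pandas as pd

# "03/04/2020" could mean March 4th or April 3rd.
ambiguous = "03/04/2020"

month_first = pd.to_datetime(ambiguous)               # default: MM/DD/YYYY
day_first = pd.to_datetime(ambiguous, dayfirst=True)  # DD/MM/YYYY

print(month_first.date())  # 2020-03-04
print(day_first.date())    # 2020-04-03
```

The same `dayfirst` argument is what `pd.read_csv` forwards to its date parser when `parse_dates` is set.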

In [None]:
sales_data.dtypes

In [None]:
sales_data['modified_orderdate'] = sales_data['orderdate'].apply(lambda x: "%d/%d/%d" % (x.day, x.month, x.year))
sales_data.head(4)

In [None]:
sales_data.dtypes

In [None]:
sales_data['Hour'] = sales_data['orderdate'].apply(lambda x: "%d" % (x.hour))
sales_data.head(4)

In [None]:
# The strings were built as day/month/year, so state that format explicitly
sales_data["modified_orderdate"] = pd.to_datetime(sales_data["modified_orderdate"], format="%d/%m/%Y")
sales_data.head(4)
sales_data.dtypes

In [None]:
sales_data['birth_month'] = sales_data['birthday'].dt.month
sales_data.head(4)
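
Once a column has a datetime dtype, the `.dt` accessor exposes its components for the whole column at once, which can replace the row-by-row `apply` calls used earlier. A sketch on a hypothetical `orders` frame with two made-up timestamps:

```python
import pandas as pd

# Hypothetical order timestamps; not taken from the course dataset.
orders = pd.DataFrame({
    "orderdate": pd.to_datetime(["2020-01-05 14:30:00",
                                 "2020-11-23 09:05:00"]),
})

# Vectorized component access through the .dt accessor.
orders["hour"] = orders["orderdate"].dt.hour
orders["month"] = orders["orderdate"].dt.month

# strftime formats every timestamp as a string (zero-padded day and month).
orders["day_month_year"] = orders["orderdate"].dt.strftime("%d/%m/%Y")

print(orders)
```

Note that `strftime("%d/%m/%Y")` yields `05/01/2020`, whereas the `"%d/%d/%d" % (...)` pattern used above would yield `5/1/2020` without zero padding.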

In [None]:
sales_data_json = pd.read_json('blooth_sales_data.json')
sales_data_json.head(5)

#### Missing Data
How should missing data (NaN's) be handled? The two most commonly used methods are `fillna` and `dropna`.

In [None]:
missing_df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns=['one', 'two', 'three'])
missing_df['four'] = 'bar'
missing_df['five'] = missing_df['one'] > 0
missing_df.loc[['a','c','h'],['one','four']] = np.nan
missing_df

In [None]:
# fillna returns a copy in which NA/NaN values are replaced by the value passed to it.
missing_df.fillna(0)

In [None]:
missing_df['one'].fillna('missing')
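
`fillna` also accepts a dictionary, so each column can get its own replacement value. A minimal sketch on a hypothetical `scores` frame (column names and values are made up), filling the numeric column with its mean and the text column with a placeholder:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
scores = pd.DataFrame({
    "math": [90.0, np.nan, 70.0],
    "grade": ["A", None, "C"],
})

# Map each column name to the value that should replace its NaN's.
filled = scores.fillna({"math": scores["math"].mean(), "grade": "unknown"})

print(filled)
print(scores["math"].isna().sum())  # 1: the original frame is unchanged
```

The mean of the two known `math` values is 80.0, so that is what fills the gap. `fillna` returns a new DataFrame, so assign the result if you want to keep it.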

`dropna` drops the rows or columns that contain NA/NaN values.

The `axis` argument determines whether rows or columns containing missing values are removed:
- `axis=0`: Drop rows which contain missing values. (default)
- `axis=1`: Drop columns which contain missing values.

The `how` argument determines whether a row or column is removed when it has at least one NA or only when all of its values are NA:
- `how='any'`: If any NA values are present, drop that row or column. (default)
- `how='all'`: If all values are NA, drop that row or column.
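
The four combinations of `axis` and `how` can be checked on a small deterministic table. This sketch uses a hypothetical `frame` with fixed values, so the resulting shapes can be verified by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely NaN and column c is entirely NaN.
frame = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [np.nan, np.nan, np.nan],
})

rows_any = frame.dropna(axis=0, how="any")  # every row has a NaN, none survive
rows_all = frame.dropna(axis=0, how="all")  # only row 1 is dropped
cols_any = frame.dropna(axis=1, how="any")  # every column has a NaN, none survive
cols_all = frame.dropna(axis=1, how="all")  # only column c is dropped

print(rows_any.shape, rows_all.shape)  # (0, 3) (2, 3)
print(cols_any.shape, cols_all.shape)  # (3, 0) (3, 2)
```

With `how='any'` a single NaN is enough to remove a row or column, which is why it empties this particular table along both axes.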

In [None]:
missing_df.dropna(axis=0)

In [None]:
missing_df.dropna(axis=1)

In [None]:
missing_df['six'] = np.nan
missing_df

In [None]:
missing_df.dropna(axis=1, how = 'all')

In [None]:
# Drop a row only if it has a missing value in one of the listed columns
missing_df.dropna(subset = ['one', 'two', 'four'])

In [None]:
df.head()

### More information on `pandas`
To know more about the methods and functions in `pandas`, visit the `pandas` User Guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

# Exercises

## Titanic

In [None]:
df.head()

Calculate the number of passengers with Pclass = 3

Compute the percentage of passengers that survived

How many passengers are children below the age of 18?

What's the ratio of male to female passengers?

Between the two genders, what's the ratio of passengers that survived?

Create a new variable which has 0 for male and 1 for female. Name this variable **LabelEncode_Sex**

Create a variable that takes the value of 1 when Pclass is 1 and 0 otherwise. Create similar variables for when Pclass has a value of 2 and 3.

Name these variables **OHE_PClass1, OHE_PClass2, OHE_PClass3** respectively 

Calculate the mean fare for all samples with an odd index

Create a new variable which stores the last name of passengers

Calculate the number of unique families (based on last names)

Create a variable that indicates the **size of the family** for each passenger. *Family size is the number of passengers with the same family name*

#### Fare by Cabin Index

All cabin numbers begin with a letter. We hypothesize that this first letter actually has a significance. So create a new variable that stores the first letter of the cabin variable. Call this **CabinIndex**.

NOTE : The cabin variable has missing values. Also check for the data type of the Cabin variable.

Once you have created the CabinIndex variable, calculate the mean value of fare for different levels of CabinIndex

## Sales Data

For sales_data, create a variable named mean_units which is the average of all units when the birth month lies between Feb and August.

Create a new column in sales_data titled 'order_minutes' and for each row, store the minutes from orderdate 

For sales_data dataframe, create a dataframe called 'sd_df' to store only those rows where product is 'Harry Potter book'

For sales_data, find the data of people who were born before 1980

For sales_data, find the average unitprice for products that were ordered in first week of a month

For sales_data, display the number of units sold for each product

Create a new column in sales_data and store orderdate in the format mm/dd/yyyy

## Iris Dataset

In [None]:
## Loading the dataset

from sklearn.datasets import load_iris
data = load_iris()

In [None]:
data.data

In [None]:
data.target

In [None]:
data.feature_names

In [None]:
data.target_names

### Exercises

Put together all the components of the data variable into a Pandas DataFrame. *This means putting together the feature and target variables, and adding their names as column names*

In [None]:
df = pd.DataFrame(data.data, columns = data.feature_names)

df['target'] = data.target

df.head()

Find number of observations in the dataset which belong to class setosa and have a petal length > 3

Find the maximum and minimum values of each of the features.

Find the range of values for each of the features

For each of the target classes, find the mean value of each of the independent variables. The mean values should be represented in a table.

**Do not** use for loops. This should be doable in a single line of code

## More resources

### Primary resources
Resources that are particularly useful are:

*   [Overview of Colaboratory](/notebooks/basic_features_overview.ipynb)
*   [Guide to markdown](/notebooks/markdown_guide.ipynb)
*   [Importing libraries and installing dependencies](/notebooks/snippets/importing_libraries.ipynb)
*   [Saving and loading notebooks in GitHub](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)
*   [Loading data: Drive, Sheets and Google Cloud Storage](/notebooks/io.ipynb) 

### Other resources
- [Interactive forms](/notebooks/forms.ipynb)
- [Interactive widgets](/notebooks/widgets.ipynb)
- [TensorFlow 2 in Colab](/notebooks/tensorflow_version.ipynb)
- [Charts: visualising data](/notebooks/charts.ipynb)
- [Getting started with BigQuery](/notebooks/bigquery.ipynb)

### Machine learning crash course
These are a few of the notebooks from Google's online machine learning course. See the <a href="https://developers.google.com/machine-learning/crash-course/">full course website</a> for more.
- [Intro to Pandas](/notebooks/mlcc/intro_to_pandas.ipynb)
- [TensorFlow concepts](/notebooks/mlcc/tensorflow_programming_concepts.ipynb)
- [First steps with TensorFlow](/notebooks/mlcc/first_steps_with_tensor_flow.ipynb)
- [Intro to neural nets](/notebooks/mlcc/intro_to_neural_nets.ipynb)
- [Intro to sparse data and embeddings](/notebooks/mlcc/intro_to_sparse_data_and_embeddings.ipynb)

<a name="using-accelerated-hardware"></a>
### Using accelerated hardware
- [TensorFlow with GPUs](/notebooks/gpu.ipynb)
- [TensorFlow with TPUs](/notebooks/tpu.ipynb)

<a name="machine-learning-examples"></a>

## Machine learning examples

To see end-to-end examples of the interactive machine-learning analyses that Colaboratory makes possible, take a look at these tutorials using models from <a href="https://tfhub.dev">TensorFlow Hub</a>.

A few featured examples:

- <a href="https://tensorflow.org/hub/tutorials/tf2_image_retraining">Retraining an Image Classifier</a>: Build a Keras model on top of a pre-trained image classifier to distinguish flowers.
- <a href="https://tensorflow.org/hub/tutorials/tf2_text_classification">Text Classification</a>: Classify IMDB film reviews as either <em>positive</em> or <em>negative</em>.
- <a href="https://tensorflow.org/hub/tutorials/tf2_arbitrary_image_stylization">Style Transfer</a>: Use deep learning to transfer style between images.
- <a href="https://tensorflow.org/hub/tutorials/retrieval_with_tf_hub_universal_encoder_qa">Multilingual Universal Sentence Encoder Q&amp;A</a>: Use a machine-learning model to answer questions from the SQuAD dataset.
- <a href="https://tensorflow.org/hub/tutorials/tweening_conv3d">Video Interpolation</a>: Predict what happened in a video between the first and the last frame.
