Pandas for CIS 1100


This is an abridged version of the documentation for pandas that should contain most of the functions you will need to use for the course. We will leave out certain information (e.g., additional keyword arguments) that aren’t relevant to the course. For more complete and accurate information, refer to the official documentation.

A couple notes while reading …

  • Throughout this guide, we will refer to pandas as pd, a generic DataFrame as df, and a generic Series as s. This is important, as certain functions can only be called at certain levels (library vs. DataFrame vs. Series).
  • Keyword arguments are optional and will have default values that we show.
  • Where relevant, we will provide data types for parameters.
  • Remember that most functions will return a copy of the DataFrame/Series rather than modify the existing one.

Cheat sheet


Reading in data


Function Description
pd.read_csv(filepath, sep=',', index_col=0) Reads in a csv file and turns it into a DataFrame.

Viewing data


Function Description
df.col Accesses the column named col in df as a Series.
df['col'] Accesses the column named col in df as a Series.
df[['col1', 'col2', ...]] Accesses a list of columns (here, ['col1', 'col2', ...]) as a DataFrame.
df.iloc[i] Accesses row i in df as a DataFrame.
df[start:stop:step] Accesses the rows start (inclusive) to stop (exclusive) using step size step.
df.head(n=5) Shows the first n rows.
df.tail(n=5) Shows the last n rows.

Modifying data


Function Description
df.rename(columns=None) Renames columns of df with a given dictionary mapping old names to new names.
df.drop(columns=None) Deletes (drops) the specified columns.
df.dropna() Deletes (drops) all rows that have at least one missing value.
df.fillna(value=0) Replaces all missing values with value.

Describing data


Function Description
df.shape A tuple of the dimensions of df of the form (num_rows, num_cols).
df.info() Prints out summary information about a DataFrame, including the number of columns, non-null values, and datatypes.
df.dtypes A Series containing the data type of each column.
df.columns A Series containing all column labels.
df.describe() Generates a DataFrame with descriptive statistics (count, mean, standard deviation, median, quartiles, etc.).
df.empty A bool that indicates whether an entire DataFrame (or Series) is empty.
df.count() Counts the number of non-null elements in each column.
df.value_counts(normalize=False) Counts the number of each value that appears in each column.
df.nunique() Counts the number of unique values in each column.
df.isna() Checks if each cell in a DataFrame is null and returns a boolean index.
df.min(axis=0) Finds the minimum of each row or column.
df.max(axis=0) Finds the maximum of each row or column.
df.std(axis=0) Finds the standard deviation of each row or column.
df.mean(axis=0) Finds the mean/average of each row or column.
df.median(axis=0) Finds the median of each row or column.
df.sum(axis=0) Finds the sum of each row or column.

Strings (and other sequences)


Function Description
s.str.len() Gets the length of each element (this works on lists and other sequences too).
s.str.upper() Makes a Series uppercase.
s.str.lower() Makes a Series lowercase.
s.str.get(i) Extracts the ith index of each element.
s.str.strip() Removes whitespace at the beginning and end of every element.
s.str.split(sep=None) Divides str into a list divided by sep.
s.str.contains(pat) Checks if each element contains a given string pat.
s.str.count(pat) Counts the number of times each element contains a given string pat.
s.str.replace(pat, repl, regex=False) Replaces all instances of pat with repl.
df.explode(column) For list-like columns, separates each value of each list into its own row, copying other elements.

Dates and times


Function Description
pd.to_datetime(arg, format=None) Converts arg into a pandas datetime object, which makes it easier to work with dates and times.

Plotting


Function Description
df.plot(x=None, y=None, kind='line', title=None, xlabel=None, ylabel=None) Plots the given DataFrame or Series.

Details


Reading in data


pd.read_csv(filepath, sep=',', index_col=0)

Reads in a csv file and turns it into a DataFrame.

Parameters:

  • filepath (str): File path to csv (make sure to include folders!)
  • index_col (False, str, or int): Specifies the index with the column name or index. We primarily use index_col=False to not include an index at all.

Returns: New DataFrame from csv.


Viewing data


df.col

Accesses the column named col in df as a Series.

Note: col must be a proper name in Python (i.e., you should be able to name a variable col—meaning no spaces, can’t start with a number, etc.).


df['col']

Accesses the column named col in df as a Series.

Note: Unlike the df.col syntax, col here just needs to be a valid string.


df[['col1', 'col2', ...]]

Accesses a list of columns (here, ['col1', 'col2', ...]) as a DataFrame.


df.iloc[i]

Accesses row i in df as a DataFrame.

Note: df[i] does not do the same thing.


df[start:stop:step]

Accesses the rows start (inclusive) to stop (exclusive) using step size step. This behaves the same way as ordinary sequence slicing (e.g., step is optional).


df.head(n=5)

Shows the first n rows.

Parameters:

  • n (int): Number of rows to show

Returns: New DataFrame with specified rows.


df.tail(n=5)

Shows the last n rows.

Parameters:

  • n (int): Number of rows to show

Returns: New DataFrame with specified rows.


Modifying data


df.rename(columns=None)

Renames columns of df with a given dictionary mapping old names to new names.

Parameters:

  • columns (Dict[str, str]): A dictionary mapping old names (usually strings) to new names.

Returns: New DataFrame with renamed columns.

Notes:

  • By default, keys in columns that are not actual columns in df will be ignored.
  • Although columns is a keyword argument, for our purposes it is required.

df.drop(columns=None)

Deletes (drops) the specified columns.

Parameters:

  • columns (List[str]): A list of column names to drop.

Returns: New DataFrame without dropped columns.

Notes:

  • By default, if you try to drop a nonexistent column, it will throw an error.
  • Although columns is a keyword argument, for our purposes it is required.

df.dropna()

Deletes (drops) all rows that have at least one missing value.

Returns: New DataFrame without dropped rows.


df.fillna(value=0)

Replaces all missing values with value.

Parameters:

  • value: Value to replace missing values with. We will primarily have this be a scalar (str, int, etc.) but advanced uses can assign different values to different columns, etc.

Returns: New DataFrame with filled values.


Describing data


df.shape

A tuple of the dimensions of df of the form (num_rows, num_cols).


df.info()

Prints out summary information about a DataFrame, including the number of columns, non-null values, and datatypes.

Returns: None


df.dtypes

A Series containing the data type of each column.


df.columns

A Series containing all column labels.


df.describe()

Generates a DataFrame with descriptive statistics (count, mean, standard deviation, median, quartiles, etc.).

Returns: A Series containing descriptive statistics for each column.

Note: Also works with Series.


df.empty

A bool that indicates whether an entire DataFrame (or Series) is empty.


df.count()

Counts the number of non-null elements in each column.

Returns: A Series with the non-null element counts for each column.


df.value_counts(normalize=False)

Counts the number of each value that appears in each column.

Parameters:

  • normalize (bool): Whether to use raw counts (an integer number) or the proportion of values (a value from 0 to 1).

Returns: A Series containing the value counts for each column.

Notes:

  • By default the method sorts values in descending order.
  • For our purposes, it’s easiest if you only call this on a Series, as calling it on a DataFrame may result in multi-indexing, which we do not cover.

df.nunique()

Counts the number of unique values in each column.

Returns: A Series with the number of unique values for each column.


df.isna()

Checks if each cell in a DataFrame is null and returns a boolean index.

Returns: A DataFrame of the same shape as df, where each value is True if the corresponding value in df is NaN and False otherwise.


df.min(axis=0)

Finds the minimum of each row or column.

Parameters:

  • axis (0 or 1): The axis to find the minimum along. 0 means finding the minimum of each column. 1 means finding the minimum of each row.

Returns: A scalar (single number) if applied on a Series or a Series with the minimum for each row/column if applied on a DataFrame.


df.max(axis=0)

Finds the maximum of each row or column.

Parameters:

  • axis (0 or 1): The axis to find the maximum along. 0 means finding the maximum of each column. 1 means finding the maximum of each row.

Returns: A scalar (single number) if applied on a Series or a Series with the maximum for each row/column if applied on a DataFrame.


df.std(axis=0)

Finds the standard deviation of each row or column.

Parameters:

  • axis (0 or 1): The axis to find the standard deviation along. 0 means finding the standard deviation of each column. 1 means finding the standard deviation of each row.

Returns: A scalar (single number) if applied on a Series or a Series with the standard deviation for each row/column if applied on a DataFrame.


df.mean(axis=0)

Finds the mean/average of each row or column.

Parameters:

  • axis (0 or 1): The axis to find the mean along. 0 means finding the mean of each column. 1 means finding the mean of each row.

Returns: A scalar (single number) if applied on a Series or a Series with the mean for each row/column if applied on a DataFrame.


df.median(axis=0)

Finds the median of each row or column.

Parameters:

  • axis (0 or 1): The axis to find the median along. 0 means finding the median of each column. 1 means finding the median of each row.

Returns: A scalar (single number) if applied on a Series or a Series with the median for each row/column if applied on a DataFrame.


df.sum(axis=0)

Finds the sum of each row or column.

Parameters:

  • axis (0 or 1): The axis to find the sum along. 0 means finding the sum of each column. 1 means finding the sum of each row.

Returns: A scalar (single number) if applied on a Series or a Series with the sum for each row/column if applied on a DataFrame.


Strings (and other sequences)


s.str.len()

Gets the length of each element (this works on lists and other sequences too).

Returns: A Series containing the length of each element.


s.str.upper()

Makes a Series uppercase.

Returns: A new Series with every element uppercase.

Note: Analogous to str.upper()


s.str.lower()

Makes a Series lowercase.

Returns: A new Series with every element lowercase.

Note: Analogous to str.lower()


s.str.strip()

Removes whitespace at the beginning and end of every element.

Returns: A new Series with every element stripped.

Note: Analogous to str.strip()


s.str.split(sep)

Divides str into a list divided by sep.

Parameters:

  • sep (str): The separator string to divide each element using. If this is not provided (i.e., None) the default is to split by whitespace. Note that there are also options to make this a regex.

Returns: A new Series of lists containing the divided strs.

Note: Analogous to str.split(sep)


s.str.get(i)

Extracts the ith index of each element.

Parameters:

  • i (int): The index to extract.

Returns: A new Series with just the ith element of each original sequence.

Note: Analogous to str[i] (but importantly not slicing).


s.str.contains(pat)

Checks if each element contains a given string pat.

Parameters:

  • pat (str): String to check membership of. Note that this can be a “regex,” which you don’t need to know about but may appear occasionally in the course.

Returns: A Series of bools that indicate whether each element contains pat.


s.str.count(pat)

Counts the number of times each element contains a given string pat.

Parameters:

  • pat (str): String to count. Note that this can be a “regex,” which you don’t need to know about but may appear occasionally in the course.

Returns: A Series of counts that pat appears in each element.


s.str.replace(pat, repl, regex=False)

Replaces all instances of pat with repl.

Parameters:

  • pat (str): Pattern to find.
  • repl (str): String to replace pat with.
  • regex (bool): Determines whether pat and repl can be regex expressions. Unless we explicitly tell you otherwise, you can leave this alone.

Returns: A new Series with replaced values.


df.explode(column)

For list-like columns, separates each value of each list into its own row, copying other elements.

Parameters:

  • column (str or List[str]): Either the column name to “explode” or a list of column names to explode. For the most part, stick to just a single column.

Returns: A DataFrame with exploded rows.

Note: If applied to a Series, you don’t need to specify any parameters.


Dates and times


pd.to_datetime(arg, format=None)

Converts arg into a pandas datetime object, which makes it easier to work with dates and times.

Parameters:

  • arg (int, str, Series, DataFrame, etc.): The thing to convert into a datetime object. While arg can be a wide range of types, we will primarily use it as a Series.
  • format (str): How you want to format the datetime object. For the most part, you should use 'mixed', which tries to automatically guess how the dates/times are formatted in the original data. There are more advanced options available.

Returns: The same type as arg (e.g., a Series), with dates converted into datetime objects.


Plotting


df.plot(x=None, y=None, kind='line', title=None, xlabel=None, ylabel=None)

Plots the given DataFrame or Series.

Parameters:

  • x (label like int or str): The variable to use for the $x$-axis. By default, this is the index.
  • y (label or list of labels): The variable(s) to use for the $y$-axis. By default, this is all numerical columns.
  • kind (str): Type of plot. A full list of options is provided below.
  • title (str): Title for the plot.
  • xlabel (str): Label for the $x$-axis.
  • ylabel (str): Label for the $y$-axis.

Returns: The “plot” object (don’t worry about the specifics).

Note: There are many more optional parameters that you can use to modify the plot. Refer to the official documentation for this.

Types of plots:

  • 'line' : line plot (default)
  • 'bar' : vertical bar plot
  • 'barh' : horizontal bar plot
  • 'hist' : histogram
  • 'box' : boxplot
  • 'kde' : Kernel Density Estimation plot
  • 'density' : same as ‘kde’
  • 'area' : area plot
  • 'pie' : pie plot
  • 'scatter' : scatter plot (DataFrame only)
  • 'hexbin' : hexbin plot (DataFrame only)