Pandas for CIS 1100
This is an abridged version of the documentation for pandas
that should contain most of the functions you will need to use for the course. We will leave out certain information (e.g., additional keyword arguments) that aren’t relevant to the course. For more complete and accurate information, refer to the official documentation.
A couple notes while reading …
- Throughout this guide, we will refer to
pandas
aspd
, a genericDataFrame
asdf
, and a genericSeries
ass
. This is important, as certain functions can only be called at certain levels (library vs.DataFrame
vs.Series
). - Keyword arguments are optional and will have default values that we show.
- Where relevant, we will provide data types for parameters.
- Remember that most functions will return a copy of the
DataFrame
/Series
rather than modify the existing one.
Cheat sheet
Reading in data
Function | Description |
---|---|
pd.read_csv(filepath, sep=',', index_col=0) |
Reads in a csv file and turns it into a DataFrame . |
Viewing data
Function | Description |
---|---|
df.col |
Accesses the column named col in df as a Series . |
df['col'] |
Accesses the column named col in df as a Series . |
df[['col1', 'col2', ...]] |
Accesses a list of columns (here, ['col1', 'col2', ...] ) as a DataFrame . |
df.iloc[i] |
Accesses row i in df as a DataFrame . |
df[start:stop:step] |
Accesses the rows start (inclusive) to stop (exclusive) using step size step . |
df.head(n=5) |
Shows the first n rows. |
df.tail(n=5) |
Shows the last n rows. |
Modifying data
Function | Description |
---|---|
df.rename(columns=None) |
Renames columns of df with a given dictionary mapping old names to new names. |
df.drop(columns=None) |
Deletes (drops) the specified columns. |
df.dropna() |
Deletes (drops) all rows that have at least one missing value. |
df.fillna(value=0) |
Replaces all missing values with value . |
Describing data
Function | Description |
---|---|
df.shape |
A tuple of the dimensions of df of the form (num_rows, num_cols) . |
df.info() |
Prints out summary information about a DataFrame , including the number of columns, non-null values, and datatypes. |
df.dtypes |
A Series containing the data type of each column. |
df.columns |
A Series containing all column labels. |
df.describe() |
Generates a DataFrame with descriptive statistics (count, mean, standard deviation, median, quartiles, etc.). |
df.empty |
A bool that indicates whether an entire DataFrame (or Series ) is empty. |
df.count() |
Counts the number of non-null elements in each column. |
df.value_counts(normalize=False) |
Counts the number of each value that appears in each column. |
df.nunique() |
Counts the number of unique values in each column. |
df.isna() |
Checks if each cell in a DataFrame is null and returns a boolean index. |
df.min(axis=0) |
Finds the minimum of each row or column. |
df.max(axis=0) |
Finds the maximum of each row or column. |
df.std(axis=0) |
Finds the standard deviation of each row or column. |
df.mean(axis=0) |
Finds the mean/average of each row or column. |
df.median(axis=0) |
Finds the median of each row or column. |
df.sum(axis=0) |
Finds the sum of each row or column. |
Strings (and other sequences)
Function | Description |
---|---|
s.str.len() |
Gets the length of each element (this works on lists and other sequences too). |
s.str.upper() |
Makes a Series uppercase. |
s.str.lower() |
Makes a Series lowercase. |
s.str.get(i) |
Extracts the i th index of each element. |
s.str.strip() |
Removes whitespace at the beginning and end of every element. |
s.str.split(sep=None) |
Divides str into a list divided by sep . |
s.str.contains(pat) |
Checks if each element contains a given string pat . |
s.str.count(pat) |
Counts the number of times each element contains a given string pat . |
s.str.replace(pat, repl, regex=False) |
Replaces all instances of pat with repl . |
df.explode(column) |
For list-like columns, separates each value of each list into its own row, copying other elements. |
Dates and times
Function | Description |
---|---|
pd.to_datetime(arg, format=None) |
Converts arg into a pandas datetime object, which makes it easier to work with dates and times. |
Plotting
Function | Description |
---|---|
df.plot(x=None, y=None, kind='line', title=None, xlabel=None, ylabel=None) |
Plots the given DataFrame or Series . |
Details
Reading in data
pd.read_csv(filepath, sep=',', index_col=0)
Reads in a csv file and turns it into a DataFrame
.
Parameters:
filepath
(str
): File path to csv (make sure to include folders!)index_col
(False
,str
, orint
): Specifies the index with the column name or index. We primarily useindex_col=False
to not include an index at all.
Returns: New DataFrame
from csv.
Viewing data
df.col
Accesses the column named col
in df
as a Series
.
Note: col
must be a proper name in Python (i.e., you should be able to name a variable col
—meaning no spaces, can’t start with a number, etc.).
df['col']
Accesses the column named col
in df
as a Series
.
Note: Unlike the df.col
syntax, col
here just needs to be a valid string.
df[['col1', 'col2', ...]]
Accesses a list of columns (here, ['col1', 'col2', ...]
) as a DataFrame
.
df.iloc[i]
Accesses row i
in df
as a DataFrame
.
Note: df[i]
does not do the same thing.
df[start:stop:step]
Accesses the rows start
(inclusive) to stop
(exclusive) using step size step
. This behaves the same way as ordinary sequence slicing (e.g., step
is optional).
df.head(n=5)
Shows the first n
rows.
Parameters:
n
(int
): Number of rows to show
Returns: New DataFrame
with specified rows.
df.tail(n=5)
Shows the last n
rows.
Parameters:
n
(int
): Number of rows to show
Returns: New DataFrame
with specified rows.
Modifying data
df.rename(columns=None)
Renames columns of df
with a given dictionary mapping old names to new names.
Parameters:
columns
(Dict[str, str]
): A dictionary mapping old names (usually strings) to new names.
Returns: New DataFrame
with renamed columns.
Notes:
- By default, keys in
columns
that are not actual columns indf
will be ignored. - Although
columns
is a keyword argument, for our purposes it is required.
df.drop(columns=None)
Deletes (drops) the specified columns.
Parameters:
columns
(List[str]
): A list of column names to drop.
Returns: New DataFrame
without dropped columns.
Notes:
- By default, if you try to drop a nonexistent column, it will throw an error.
- Although
columns
is a keyword argument, for our purposes it is required.
df.dropna()
Deletes (drops) all rows that have at least one missing value.
Returns: New DataFrame
without dropped rows.
df.fillna(value=0)
Replaces all missing values with value
.
Parameters:
value
: Value to replace missing values with. We will primarily have this be a scalar (str
,int
, etc.) but advanced uses can assign different values to different columns, etc.
Returns: New DataFrame
with filled values.
Describing data
df.shape
A tuple of the dimensions of df
of the form (num_rows, num_cols)
.
df.info()
Prints out summary information about a DataFrame
, including the number of columns, non-null values, and datatypes.
Returns: None
df.dtypes
A Series
containing the data type of each column.
df.columns
A Series
containing all column labels.
df.describe()
Generates a DataFrame
with descriptive statistics (count, mean, standard deviation, median, quartiles, etc.).
Returns: A Series
containing descriptive statistics for each column.
Note: Also works with Series
.
df.empty
A bool
that indicates whether an entire DataFrame
(or Series
) is empty.
df.count()
Counts the number of non-null elements in each column.
Returns: A Series
with the non-null element counts for each column.
df.value_counts(normalize=False)
Counts the number of each value that appears in each column.
Parameters:
normalize
(bool
): Whether to use raw counts (an integer number) or the proportion of values (a value from 0 to 1).
Returns: A Series
containing the value counts for each column.
Notes:
- By default the method sorts values in descending order.
- For our purposes, it’s easiest if you only call this on a
Series
, as calling it on aDataFrame
may result in multi-indexing, which we do not cover.
df.nunique()
Counts the number of unique values in each column.
Returns: A Series
with the number of unique values for each column.
df.isna()
Checks if each cell in a DataFrame
is null and returns a boolean index.
Returns: A DataFrame
of the same shape as df
, where each value is True
if the corresponding value in df
is NaN
and False
otherwise.
df.min(axis=0)
Finds the minimum of each row or column.
Parameters:
axis
(0
or1
): The axis to find the minimum along.0
means finding the minimum of each column.1
means finding the minimum of each row.
Returns: A scalar (single number) if applied on a Series
or a Series
with the minimum for each row/column if applied on a DataFrame
.
df.max(axis=0)
Finds the maximum of each row or column.
Parameters:
axis
(0
or1
): The axis to find the maximum along.0
means finding the maximum of each column.1
means finding the maximum of each row.
Returns: A scalar (single number) if applied on a Series
or a Series
with the maximum for each row/column if applied on a DataFrame
.
df.std(axis=0)
Finds the standard deviation of each row or column.
Parameters:
axis
(0
or1
): The axis to find the standard deviation along.0
means finding the standard deviation of each column.1
means finding the standard deviation of each row.
Returns: A scalar (single number) if applied on a Series
or a Series
with the standard deviation for each row/column if applied on a DataFrame
.
df.mean(axis=0)
Finds the mean/average of each row or column.
Parameters:
axis
(0
or1
): The axis to find the mean along.0
means finding the mean of each column.1
means finding the mean of each row.
Returns: A scalar (single number) if applied on a Series
or a Series
with the mean for each row/column if applied on a DataFrame
.
df.median(axis=0)
Finds the median of each row or column.
Parameters:
axis
(0
or1
): The axis to find the median along.0
means finding the median of each column.1
means finding the median of each row.
Returns: A scalar (single number) if applied on a Series
or a Series
with the median for each row/column if applied on a DataFrame
.
df.sum(axis=0)
Finds the sum of each row or column.
Parameters:
axis
(0
or1
): The axis to find the sum along.0
means finding the sum of each column.1
means finding the sum of each row.
Returns: A scalar (single number) if applied on a Series
or a Series
with the sum for each row/column if applied on a DataFrame
.
Strings (and other sequences)
s.str.len()
Gets the length of each element (this works on lists and other sequences too).
Returns: A Series
containing the length of each element.
s.str.upper()
Makes a Series
uppercase.
Returns: A new Series
with every element uppercase.
Note: Analogous to str.upper()
s.str.lower()
Makes a Series
lowercase.
Returns: A new Series
with every element lowercase.
Note: Analogous to str.lower()
s.str.strip()
Removes whitespace at the beginning and end of every element.
Returns: A new Series
with every element stripped.
Note: Analogous to str.strip()
s.str.split(sep)
Divides str
into a list divided by sep
.
Parameters:
sep
(str
): The separator string to divide each element using. If this is not provided (i.e.,None
) the default is to split by whitespace. Note that there are also options to make this a regex.
Returns: A new Series
of lists containing the divided str
s.
Note: Analogous to str.split(sep)
s.str.get(i)
Extracts the i
th index of each element.
Parameters:
i
(int
): The index to extract.
Returns: A new Series
with just the i
th element of each original sequence.
Note: Analogous to str[i]
(but importantly not slicing).
s.str.contains(pat)
Checks if each element contains a given string pat
.
Parameters:
pat
(str
): String to check membership of. Note that this can be a “regex,” which you don’t need to know about but may appear occasionally in the course.
Returns: A Series
of bool
s that indicate whether each element contains pat
.
s.str.count(pat)
Counts the number of times each element contains a given string pat
.
Parameters:
pat
(str
): String to count. Note that this can be a “regex,” which you don’t need to know about but may appear occasionally in the course.
Returns: A Series
of counts that pat
appears in each element.
s.str.replace(pat, repl, regex=False)
Replaces all instances of pat
with repl
.
Parameters:
pat
(str
): Pattern to find.repl
(str
): String to replacepat
with.regex
(bool
): Determines whetherpat
andrepl
can be regex expressions. Unless we explicitly tell you otherwise, you can leave this alone.
Returns: A new Series
with replaced values.
df.explode(column)
For list-like columns, separates each value of each list into its own row, copying other elements.
Parameters:
column
(str
orList[str]
): Either the column name to “explode” or a list of column names to explode. For the most part, stick to just a single column.
Returns: A DataFrame
with exploded rows.
Note: If applied to a Series
, you don’t need to specify any parameters.
Dates and times
pd.to_datetime(arg, format=None)
Converts arg
into a pandas
datetime
object, which makes it easier to work with dates and times.
Parameters:
arg
(int
,str
,Series
,DataFrame
, etc.): The thing to convert into adatetime
object. Whilearg
can be a wide range of types, we will primarily use it as aSeries
.format
(str
): How you want to format thedatetime
object. For the most part, you should use'mixed'
, which tries to automatically guess how the dates/times are formatted in the original data. There are more advanced options available.
Returns: The same type as arg
(e.g., a Series
), with dates converted into datetime
objects.
Plotting
df.plot(x=None, y=None, kind='line', title=None, xlabel=None, ylabel=None)
Plots the given DataFrame
or Series
.
Parameters:
x
(label likeint
orstr
): The variable to use for the $x$-axis. By default, this is the index.y
(label or list of labels): The variable(s) to use for the $y$-axis. By default, this is all numerical columns.kind
(str
): Type of plot. A full list of options is provided below.title
(str
): Title for the plot.xlabel
(str
): Label for the $x$-axis.ylabel
(str
): Label for the $y$-axis.
Returns: The “plot” object (don’t worry about the specifics).
Note: There are many more optional parameters that you can use to modify the plot. Refer to the official documentation for this.
Types of plots:
'line'
: line plot (default)'bar'
: vertical bar plot'barh'
: horizontal bar plot'hist'
: histogram'box'
: boxplot'kde'
: Kernel Density Estimation plot'density'
: same as ‘kde’'area'
: area plot'pie'
: pie plot'scatter'
: scatter plot (DataFrame only)'hexbin'
: hexbin plot (DataFrame only)