Data Visualization Design 6 | Lab: Basic Pandas

Series: Data Visualization Design

  1. Pandas Methods
  • import pandas
import pandas as pd
  • read data from a csv or txt file
df = pd.read_csv('data.csv')
  • change data type of a column to datetime64 when reading the data
df = pd.read_csv('data.csv', parse_dates=sequence_of_colkeywords)
  • change data type of a column to category when reading the data
df = pd.read_csv('data.csv', dtype={'colkeyword': 'category'})
  • change data type of a column to datetime64 at any time after loading
df['Time'] = pd.to_datetime(df['Time'])
  • change data type of a column to category at any time after loading
df['Category'] = df['Category'].astype('category')
  • keep only the date of a datetime64 column
df['Time'] = df['Time'].dt.date
  • keep only the time of a datetime64 column
df['Time'] = df['Time'].dt.time
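A minimal sketch of the datetime conversions above, using hypothetical toy data (the column name and values are made up for illustration):

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Time': ['2021-01-01 10:30:00', '2021-01-02 12:00:00']})

df['Time'] = pd.to_datetime(df['Time'])  # object -> datetime64[ns]
dates = df['Time'].dt.date               # keep only the date part
times = df['Time'].dt.time               # keep only the time part
```

Note that `.dt.date` and `.dt.time` return plain Python `date`/`time` objects, so the resulting series has dtype `object`, not `datetime64`.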
  • return the first 5 rows of a dataframe
df.head()
  • return the first 10 rows of a dataframe
df.head(10)
  • check the keywords, data types, and null values in a dataframe
df.info()
  • check the sample size, mean, standard deviation, min values, max values of each column in a dataframe
df.describe()
  • transpose a dataframe
df.T
  • get a list of column names
list(df.columns)
  • get the number of rows and the number of columns of a dataframe
df.shape
  • get the mean value of a specific column
df['Values'].mean()
  • get the maximum value of a specific column
df['Values'].max()
  • get the minimum value of a specific column
df['Values'].min()
  • get the sum of values in a specific column
df['Values'].sum()
  • get the number of non-null values in a specific column
df['Values'].count()
  • get a Series of the values of a column at the 10%, 25%, 50%, 75%, and 90% quantiles
df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9])
  • get a numpy array of the values of a column at the 10%, 25%, 50%, 75%, and 90% quantiles
df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9]).values
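A minimal sketch of the two quantile forms above, using hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# quantile() returns a Series indexed by the quantile levels
qs = df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9])
arr = qs.values                     # the underlying numpy array
```

The Series form keeps the quantile levels as the index, which is convenient for labeling; the `.values` form is what plotting functions usually want.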
  • format the displayed output to two decimal digits (display only; the underlying data is unchanged)
df.style.format("{:.2f}")
  • change a list to a pd.Series object
s = pd.Series(myList)
  • each column of a dataframe is a pd.Series object
type(df['Values'])
  • we can call sum, min, max, mean, count, etc. on a pd.Series object
s.sum()
s.min()
s.mean()
s.max()
s.count()
  • change string type column to lower case
df['Name'].str.lower()
  • extract the first lowercase letter from a string type column using a regular expression
df['Name'].str.extract("([a-z])")
  • set an index for a Series; if the index parameter is omitted, a default index from 0 to n-1 is used
p = pd.Series(vector, index=['t','u','v','w'])
  • set an index for a DataFrame; if the index parameter is omitted, a default index from 0 to n-1 is used
p = pd.DataFrame(matrix, index=['t','u','v','w'])
  • turn a column of the dataframe into the index (returns a new dataframe)
df = df.set_index('Index')
  • get rows by index
df.loc['u']
  • get multiple rows by index
df.loc[['u', 'w']]
  • get the first row by default index (0 to n-1)
df.iloc[0]
  • get some rows by default index (0 to n-1) and slicing
df.iloc[2:10]
  • get a column by column name
df['Values']
  • get a column in a range of rows (loc slicing is label-based and includes the endpoint)
df.loc[4:10, 'Values']
  • get multiple columns by column name
df[['Name', 'Values']]
  • series subtraction (aligned automatically by index, even between different dataframes)
df['New_values'] - df['Values']
  • create a boolean series based on a certain condition
df['Values'] > 100
  • filter a series with a boolean series built from a relational operator
df['Values'].loc[df['Values'] > 100]
  • filter a dataframe with a boolean series built from a relational operator
df.loc[df['Values'] > 100]
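A minimal sketch of boolean filtering, using hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Values': [50, 150, 250]})

mask = df['Values'] > 100   # boolean Series: False, True, True
filtered = df.loc[mask]     # keep only the rows where the condition holds
```

The mask is itself a pd.Series of booleans aligned to the dataframe's index, which is why it can be passed straight to `.loc`.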
  • generate a numpy array from 0 to 1 with 8 points including endpoint 1 (requires import numpy as np)
np.linspace(0, 1, 8)
  • generate a numpy array from 0 to 1 with 8 points not including endpoint 1
np.linspace(0, 1, 8, endpoint=False)
  • get the numpy array underlying a series
df['Values'].values

Note that any arithmetic involving a NaN results in a NaN. (NaN is a special numpy floating-point value that represents a missing value.)
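A minimal sketch of NaN propagation, using a hypothetical toy series:

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing value
s = pd.Series([1.0, np.nan, 3.0])

total_skip = s.sum()               # NaN ignored by default -> 4.0
total_keep = s.sum(skipna=False)   # NaN propagates -> nan
```

Plain arithmetic behaves the same way: `np.nan + 1` is `nan`, which is why aggregation methods skip NaN by default.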

  • get a boolean series showing where the null values are
df['Values'].isnull()

Note that this can also be df['Values'].isna()

  • get a boolean series showing where the non-null values are
~df['Values'].isnull()
  • get all the non-null values in a series
df['Values'].loc[~df['Values'].isnull()]
  • get all the non-null values in a dataframe
df.loc[~df['Values'].isnull()]
  • drop all rows containing null values in a dataframe
df.dropna()
  • replace all the NaN values in a series with a given value
df['Values'].fillna(-1)
  • replace all the NaN values in a dataframe with a given value
df.fillna(-1)
  • sum a column without ignoring NaN values (by default NaN is ignored, i.e. skipna=True)
df['Values'].sum(skipna=False)
  • check whether each column of the dataframe contains any NaN values
df.isnull().any()
  • check whether the dataframe contains any NaN values at all
df.isnull().any().any()
  • get the unique values in a series
df['Name'].unique()
  • count the occurrences of each unique value in a series
df['Category'].value_counts()
  • mark a category type series as ordered
df['Category'] = df['Category'].cat.as_ordered()
  • return the integer codes of the items in a category type series
df['Category'].cat.codes
  • reorder the categories of a category type series
df['Category'] = df['Category'].cat.reorder_categories(myOrderList)
  • sort a series by its value (ascending)
df['Values'].sort_values()
  • reversely sort a series by its value (not ascending)
df['Values'].sort_values(ascending=False)
  • sort a dataframe by values of a column (ascending)
df.sort_values(by=['Values'])
  • apply a given function (len in this case) to each value in a series
df['List'].apply(len)
  • calculate the correlation coefficient of two series (requires import numpy as np)
np.corrcoef(df['Values'], df['New_values'])[0,1]
  • group the dataframe by a column and compute the mean of the other columns (or count/max/min/etc.)
df.groupby('Category').mean()
  • group the dataframe by two columns and compute the mean of the other columns (or count/max/min/etc.)
df.groupby(['Category','Time']).mean()
  • get a specific group by its key value (called on a groupby object)
df.groupby('Category').get_group(val)
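A minimal sketch of grouping, using hypothetical toy data; note that get_group is a method of the groupby object, not of the dataframe itself:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Category': ['x', 'x', 'y'], 'Values': [1, 3, 10]})

grouped = df.groupby('Category')
means = grouped['Values'].mean()   # per-group means, indexed by category
x_rows = grouped.get_group('x')    # the original rows of one group
```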
  • merge two different dataframes (outer means union of keys)
df_1.merge(df_2, how="outer", left_on=colname_1, right_on=colname_2)
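A minimal sketch of an outer merge, using hypothetical toy data; unmatched keys survive the union, with NaN filling the missing cells:

```python
import pandas as pd

# hypothetical toy data for illustration
df_1 = pd.DataFrame({'Name': ['a', 'b'], 'Values': [1, 2]})
df_2 = pd.DataFrame({'Name': ['b', 'c'], 'Extra': [20, 30]})

# outer merge keeps the union of keys from both sides
merged = df_1.merge(df_2, how='outer', left_on='Name', right_on='Name')
```

Here only 'b' appears in both frames, so the rows for 'a' and 'c' get NaN in the columns coming from the other side.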
  • cut a series into ranges at the given quantiles of its values
df['Values'] = pd.qcut(df['Values'], q=[.1, .5, .9])
  • cut a series into a given number of ranges; each range will contain roughly the same number of values (for equal-length ranges use pd.cut instead)
df['Values'] = pd.qcut(df['Values'], 3)
  • create a color list based on the range of values. This is often used for heatmaps and geographic map visualizations.
colors = ['#122345', '#090909', '#abc123']
df["Color"] = pd.qcut(df["Values"], 3, labels=colors)
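A minimal sketch of the color-binning trick above, using hypothetical toy data and the same hex colors:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Values': [1, 2, 3, 4, 5, 6]})
colors = ['#122345', '#090909', '#abc123']

# qcut into 3 equal-count bins, each labelled with a hex color
df['Color'] = pd.qcut(df['Values'], 3, labels=colors)
```

Because qcut bins by quantiles, each color covers the same number of rows (two here), which is what you usually want for a choropleth-style map.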
  • iterate over the dataframe row by row (yields (index, row) pairs)
df.iterrows()
  • turn a dataframe into a dictionary (for loop)
df_dict = {}
for i, row in df.iterrows():
    df_dict[row["Name"]] = row["Values"]
  • turn a dataframe into a dictionary (dictionary comprehension)
df_dict = {row["Name"]: row["Values"] for i, row in df.iterrows()}
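A minimal sketch of the dataframe-to-dictionary pattern, using hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Name': ['a', 'b'], 'Values': [1, 2]})

# iterrows yields (index, row) pairs; each row is a pd.Series
df_dict = {row['Name']: row['Values'] for i, row in df.iterrows()}
```

This key-to-value mapping is handy for lookups during plotting, e.g. mapping a region name to its color.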