Data Visualization Design 6 | Lab: Basic Pandas

- Pandas Methods
- import pandas
import pandas as pd
- read from csv or txt
df = pd.read_csv('data.csv')
- change the data type of a column to datetime64 when reading the data
df = pd.read_csv('data.csv', parse_dates=['colkeyword'])
- change the data type of a column to category when reading the data
df = pd.read_csv('data.csv', dtype={'colkeyword': 'category'})
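For example, the two read_csv options above can be combined; a minimal sketch assuming a hypothetical data.csv with a 'Time' column and a 'Category' column:
import pandas as pd
df = pd.read_csv('data.csv', parse_dates=['Time'], dtype={'Category': 'category'})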
- change the data type of a column to datetime64 at any time after reading
df['Time'] = pd.to_datetime(df['Time'])
- change the data type of a column to category at any time after reading
df['Category'] = df['Category'].astype('category')
- keep only the date part of a datetime64 column
df['Time'] = df['Time'].dt.date
- keep only the time part of a datetime64 column
df['Time'] = df['Time'].dt.time
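A small self-contained sketch of the dt accessor (the timestamps here are made up for illustration):
import pandas as pd
df = pd.DataFrame({'Time': pd.to_datetime(['2021-01-01 08:30', '2021-01-02 17:45'])})
df['Time'].dt.date   # datetime.date objects
df['Time'].dt.time   # datetime.time objects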
- return the first 5 rows of a dataframe
df.head()
- return the first 10 rows of a dataframe
df.head(10)
- check the column names, data types, and non-null counts of a dataframe
df.info()
- check the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column in a dataframe
df.describe()
- transpose a dataframe
df.T
- get a list of column names
list(df.columns)
- get the number of rows and the number of columns of a dataframe
df.shape
- get the mean value of a specific column
df['Values'].mean()
- get the maximum value of a specific column
df['Values'].max()
- get the minimum value of a specific column
df['Values'].min()
- get the sum of values in a specific column
df['Values'].sum()
- get the number of values in a specific column
df['Values'].count()
- get a series of the values at the 10%, 25%, 50%, 75%, and 90% quantiles of a column
df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9])
- get the numpy array of the values at those quantiles
df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9]).values
- format the displayed output to two decimal digits (display only; the underlying values are unchanged)
df.style.format("{:.2f}")
- change a list to a pd.Series object
s = pd.Series(myList)
- each column of a dataframe is a pd.Series object; check its data type with dtype
df['Values'].dtype
- we can apply sum, min, max, mean, count, etc. to a pd.Series object
s.sum()
s.min()
s.mean()
s.max()
s.count()
- change string type column to lower case
df['Name'].str.lower()
- extract the first lowercase letter from a string-type column using a regular expression
df['Name'].str.extract("([a-z])")
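A minimal sketch of the string accessor, using a made-up 'Name' column:
import pandas as pd
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol']})
df['Name'].str.lower()              # alice, bob, carol
df['Name'].str.extract("([a-z])")   # first lowercase letter of each name: l, o, a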
- set an index for a Series; if no index parameter is given, the default index runs from 0 to n-1
p = pd.Series(vector, index=['t','u','v','w'])
- set an index for a DataFrame; if no index parameter is given, the default index runs from 0 to n-1
p = pd.DataFrame(matrix, index=['t','u','v','w'])
- turn a column of the dataframe into the index (set_index returns a new dataframe)
df = df.set_index('Index')
- get rows by index
df.loc['u']
- get multiple rows by index
df.loc[['u', 'w']]
- get the first row by default index (0 to n-1)
df.iloc[0]
- get a slice of rows by default integer position (0 to n-1)
df.iloc[2:10]
- get a column by column name
df['Values']
- get a column over a range of row labels
df.loc[4:10, 'Values']
- get multiple columns by column name
df[['Name', 'Values']]
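The label- and position-based selections above, shown on a made-up four-row dataframe:
import pandas as pd
df = pd.DataFrame({'Name': ['a', 'b', 'c', 'd'], 'Values': [1, 2, 3, 4]},
                  index=['t', 'u', 'v', 'w'])
df.loc['u']             # row with index label 'u'
df.loc[['u', 'w']]      # rows with labels 'u' and 'w'
df.iloc[0]              # first row by position
df[['Name', 'Values']]  # multiple columns by name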
- series subtraction (automatically aligned by index, even between different dataframes)
df['New_values'] - df['Values']
- create a boolean series based on a certain condition
df['Values'] > 100
- filter the values of a series using a boolean series (relational operator)
df['Values'].loc[df['Values'] > 100]
- filter the rows of a dataframe using a boolean series (relational operator)
df.loc[df['Values'] > 100]
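Boolean indexing in one small sketch, with made-up values in a 'Values' column:
import pandas as pd
df = pd.DataFrame({'Values': [50, 150, 250]})
mask = df['Values'] > 100   # boolean series, True where the condition holds
df['Values'].loc[mask]      # only the matching values: 150, 250
df.loc[mask]                # only the matching rows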
- generate a numpy array from 0 to 1 with 8 points including endpoint 1
np.linspace(0, 1, 8)
- generate a numpy array from 0 to 1 with 8 points not including endpoint 1
np.linspace(0, 1, 8, endpoint=False)
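np.linspace comes from NumPy, so it needs its own import; a quick sketch:
import numpy as np
np.linspace(0, 1, 8)                  # 8 points, endpoint 1 included
np.linspace(0, 1, 8, endpoint=False)  # 8 points, stops short of 1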
- get the numpy array underlying a series
df['Values'].values
Note that any arithmetic involving NaN results in NaN. (NaN is a NumPy floating-point value that represents a missing value.)
- get a boolean series showing where the values are null
df['Values'].isnull()
Note that this can also be df['Values'].isna()
- get a boolean series showing where the values are not null
~df['Values'].isnull()
- get all the non-null values in a series
df['Values'].loc[~df['Values'].isnull()]
- get all rows of a dataframe where a column is not null
df.loc[~df['Values'].isnull()]
- drop all rows containing null values from a dataframe
df.dropna()
- replace all NaN values with a given value in a series
df['Values'].fillna(-1)
- replace all NaN values with a given value in a dataframe
df.fillna(-1)
- sum a column without ignoring NaN values (ignored by default, skipna=True)
df['Values'].sum(skipna=False)
- check whether each column of the dataframe contains any NaN value
df.isnull().any()
- check whether the dataframe contains any NaN value at all
df.isnull().any().any()
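A missing-value sketch on an in-memory series (the values are made up):
import numpy as np
import pandas as pd
s = pd.Series([1.0, np.nan, 3.0])
s.isnull()            # True where the value is missing
s.fillna(-1)          # replace NaN with -1
s.sum(skipna=False)   # nan, because one value is missing
s.dropna()            # drop the missing entry
s.isnull().any()      # True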
- get the unique values in a series
df['Name'].unique()
- count how many times each unique value occurs in a series
df['Category'].value_counts()
- mark a category-type series as ordered
df['Category'] = df['Category'].cat.as_ordered()
- return the integer codes of the categories in a category-type series
df['Category'].cat.codes
- reorder the categories of a category-type series
df['Category'] = df['Category'].cat.reorder_categories(myOrderList)
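An ordered-category sketch with made-up size labels (myOrderList stands in for your own category order):
import pandas as pd
s = pd.Series(['small', 'large', 'medium'], dtype='category')
myOrderList = ['small', 'medium', 'large']
s = s.cat.reorder_categories(myOrderList, ordered=True)
s.cat.codes   # 0, 2, 1 in the new order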
- sort a series by its value (ascending)
df['Values'].sort_values()
- sort a series by its values in descending order
df['Values'].sort_values(ascending=False)
- sort a dataframe by values of a column (ascending)
df.sort_values(by=['Values'])
- apply a given function (len in this case) to each value in a series
df['List'].apply(len)
- calculate the correlation coefficient between two series
np.corrcoef(df['Values'], df['New_values'])[0,1]
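np.corrcoef returns a 2x2 correlation matrix; index [0, 1] picks the off-diagonal entry. A sketch with made-up numbers:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
np.corrcoef(x, y)[0, 1]   # close to 1.0 for strongly correlated data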
- Group the dataframe by a column and compute the mean of the other columns (or count/max/min/etc.)
df.groupby('Category').mean()
- Group the dataframe by two columns and compute the mean of the other columns (or count/max/min/etc.)
df.groupby(['Category','Time']).mean()
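A groupby sketch with a made-up 'Category'/'Values' dataframe:
import pandas as pd
df = pd.DataFrame({'Category': ['a', 'a', 'b'], 'Values': [1, 3, 5]})
df.groupby('Category').mean()             # mean of 'Values' per category: a -> 2, b -> 5
df.groupby('Category')['Values'].count()  # number of rows per category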
- get a specific group from a groupby object by its key
df.groupby('Category').get_group(val)
- merge two different dataframes (outer means union of keys)
df_1.merge(df_2, how="outer", left_on=colname_1, right_on=colname_2)
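A merge sketch with two made-up dataframes keyed on 'Name' (colname_1 and colname_2 above stand for the join columns):
import pandas as pd
df_1 = pd.DataFrame({'Name': ['a', 'b'], 'Values': [1, 2]})
df_2 = pd.DataFrame({'Name': ['b', 'c'], 'Other': [3, 4]})
df_1.merge(df_2, how="outer", left_on='Name', right_on='Name')   # union of keys: a, b, c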
- cut a series into intervals at the given quantiles
df['Values'] = pd.qcut(df['Values'], q=[.1, .5, .9])
- cut a series into a given number of quantile bins; each bin contains roughly the same number of values (equal frequency, not equal width)
df['Values'] = pd.qcut(df['Values'], 3)
- create a color column based on the value ranges. This is often used for heatmaps and geographic (choropleth) map visualizations.
colors = ['#122345', '#090909', '#abc123']
df["Color"] = pd.qcut(df["Values"], 3, labels=colors)
- turn the dataframe into an iterable over its rows (yields index, row pairs)
df.iterrows()
- turn a dataframe into a dictionary (for loop)
df_dict = {}
for i, row in df.iterrows():
    df_dict[row["Name"]] = row["Values"]
- turn a dataframe into a dictionary (dictionary comprehension)
df_dict = {row["Name"]: row["Values"] for i, row in df.iterrows()}