Data Visualization Design 6 | Lab: Basic Pandas

Series: Data Visualization Design

  1. Pandas Methods
  • import pandas
import pandas as pd
  • read data from a csv or txt file
df = pd.read_csv('data.csv')
  • change data type of a column to datetime64 when reading the data
df = pd.read_csv('data.csv', parse_dates=sequence_of_colkeywords)
  • change data type of a column to category when reading the data
df = pd.read_csv('data.csv', dtype={'colkeyword': 'category'})
  • change data type of a column to datetime64 at any time after loading
df['Time'] = pd.to_datetime(df['Time'])
  • change data type of a column to category at any time after loading
df['Category'] = df['Category'].astype('category')
  • keep only the date of a datetime64 column
df['Time'] = df['Time'].dt.date
  • keep only the time of a datetime64 column
df['Time'] = df['Time'].dt.time
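A minimal sketch of the datetime conversions above, using hypothetical toy data (the column name and values are made up for illustration):

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Time': ['2021-01-01 10:30:00', '2021-01-02 12:00:00']})

df['Time'] = pd.to_datetime(df['Time'])  # object -> datetime64[ns]
dates = df['Time'].dt.date               # keep only the date part
times = df['Time'].dt.time               # keep only the time part
```

Note that `.dt.date` and `.dt.time` return plain Python `date`/`time` objects, so the resulting series has dtype `object`, not `datetime64`.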
  • return the first 5 rows of a dataframe
df.head()
  • return the first 10 rows of a dataframe
df.head(10)
  • check the keywords, data types, and null values in a dataframe
df.info()
  • check the sample size, mean, standard deviation, min values, max values of each column in a dataframe
df.describe()
  • transpose a dataframe
df.T
  • get a list of column names
list(df.columns)
  • get the number of rows and the number of columns of a dataframe
df.shape
  • get the mean value of a specific column
df['Values'].mean()
  • get the maximum value of a specific column
df['Values'].max()
  • get the minimum value of a specific column
df['Values'].min()
  • get the sum of values in a specific column
df['Values'].sum()
  • get the number of non-null values in a specific column
df['Values'].count()
  • get a Series of the values of a column at the 10%, 25%, 50%, 75%, and 90% quantiles
df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9])
  • get a numpy array of the values of a column at the 10%, 25%, 50%, 75%, and 90% quantiles
df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9]).values
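A minimal sketch of the two quantile forms above, using hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

# quantile() returns a Series indexed by the quantile levels
qs = df['Values'].quantile(q=[0.1, 0.25, 0.5, 0.75, 0.9])
arr = qs.values                     # the underlying numpy array
```

The Series form keeps the quantile levels as the index, which is convenient for labeling; the `.values` form is what plotting functions usually want.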
  • format the displayed output to two decimal digits (display only; the underlying data is unchanged)
df.style.format("{:.2f}")
  • change a list to a pd.Series object
s = pd.Series(myList)
  • each column of a dataframe is a pd.Series object
type(df['Values'])
  • we can call sum, min, max, mean, count, etc. on a pd.Series object
s.sum()
s.min()
s.mean()
s.max()
s.count()
  • change string type column to lower case
df['Name'].str.lower()
  • extract the first lowercase letter from a string type column using a regular expression
df['Name'].str.extract("([a-z])")
  • set an index for a Series; if the index parameter is omitted, a default index from 0 to n-1 is used
p = pd.Series(vector, index=['t','u','v','w'])
  • set an index for a DataFrame; if the index parameter is omitted, a default index from 0 to n-1 is used
p = pd.DataFrame(matrix, index=['t','u','v','w'])
  • turn a column of the dataframe into the index (returns a new dataframe)
df = df.set_index('Index')
  • get rows by index
df.loc['u']
  • get multiple rows by index
df.loc[['u', 'w']]
  • get the first row by default index (0 to n-1)
df.iloc[0]
  • get some rows by default index (0 to n-1) and slicing
df.iloc[2:10]
  • get a column by column name
df['Values']
  • get a column in a range of rows (loc slicing is label-based and includes the endpoint)
df.loc[4:10, 'Values']
  • get multiple columns by column name
df[['Name', 'Values']]
  • series subtraction (aligned automatically by index, even between different dataframes)
df['New_values'] - df['Values']
  • create a boolean series based on a certain condition
df['Values'] > 100
  • filter a series with a boolean series built from a relational operator
df['Values'].loc[df['Values'] > 100]
  • filter a dataframe with a boolean series built from a relational operator
df.loc[df['Values'] > 100]
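A minimal sketch of boolean filtering, using hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Values': [50, 150, 250]})

mask = df['Values'] > 100   # boolean Series: False, True, True
filtered = df.loc[mask]     # keep only the rows where the condition holds
```

The mask is itself a pd.Series of booleans aligned to the dataframe's index, which is why it can be passed straight to `.loc`.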
  • generate a numpy array from 0 to 1 with 8 points including endpoint 1 (requires import numpy as np)
np.linspace(0, 1, 8)
  • generate a numpy array from 0 to 1 with 8 points not including endpoint 1
np.linspace(0, 1, 8, endpoint=False)
  • get the numpy array underlying a series
df['Values'].values

Note that any arithmetic involving a NaN results in a NaN. (NaN is a special numpy floating-point value that represents a missing value.)
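A minimal sketch of NaN propagation, using a hypothetical toy series:

```python
import numpy as np
import pandas as pd

# hypothetical toy data with one missing value
s = pd.Series([1.0, np.nan, 3.0])

total_skip = s.sum()               # NaN ignored by default -> 4.0
total_keep = s.sum(skipna=False)   # NaN propagates -> nan
```

Plain arithmetic behaves the same way: `np.nan + 1` is `nan`, which is why aggregation methods skip NaN by default.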

  • get a boolean series showing where the null values are
df['Values'].isnull()

Note that this can also be df['Values'].isna()

  • get a boolean series showing where the non-null values are
~df['Values'].isnull()
  • get all the non-null values in a series
df['Values'].loc[~df['Values'].isnull()]
  • get all the non-null values in a dataframe
df.loc[~df['Values'].isnull()]
  • drop all rows containing null values in a dataframe
df.dropna()
  • replace all the NaN values in a series with a given value
df['Values'].fillna(-1)
  • replace all the NaN values in a dataframe with a given value
df.fillna(-1)
  • sum a column without ignoring NaN values (by default NaN is ignored, i.e. skipna=True)
df['Values'].sum(skipna=False)
  • check whether each column of the dataframe contains any NaN values
df.isnull().any()
  • check whether the dataframe contains any NaN values at all
df.isnull().any().any()
  • get the unique values in a series
df['Name'].unique()
  • count the occurrences of each unique value in a series
df['Category'].value_counts()
  • mark a category type series as ordered
df['Category'] = df['Category'].cat.as_ordered()
  • return the integer codes of the items in a category type series
df['Category'].cat.codes
  • reorder the categories of a category type series
df['Category'] = df['Category'].cat.reorder_categories(myOrderList)
  • sort a series by its value (ascending)
df['Values'].sort_values()
  • reversely sort a series by its value (not ascending)
df['Values'].sort_values(ascending=False)
  • sort a dataframe by values of a column (ascending)
df.sort_values(by=['Values'])
  • apply a given function (len in this case) to each value in a series
df['List'].apply(len)
  • calculate the correlation coefficient of two series (requires import numpy as np)
np.corrcoef(df['Values'], df['New_values'])[0,1]
  • group the dataframe by a column and compute the mean of the other columns (or count/max/min/etc.)
df.groupby('Category').mean()
  • group the dataframe by two columns and compute the mean of the other columns (or count/max/min/etc.)
df.groupby(['Category','Time']).mean()
  • get a specific group by its key value (called on a groupby object)
df.groupby('Category').get_group(val)
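A minimal sketch of grouping, using hypothetical toy data; note that get_group is a method of the groupby object, not of the dataframe itself:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Category': ['x', 'x', 'y'], 'Values': [1, 3, 10]})

grouped = df.groupby('Category')
means = grouped['Values'].mean()   # per-group means, indexed by category
x_rows = grouped.get_group('x')    # the original rows of one group
```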
  • merge two different dataframes (outer means union of keys)
df_1.merge(df_2, how="outer", left_on=colname_1, right_on=colname_2)
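A minimal sketch of an outer merge, using hypothetical toy data; unmatched keys survive the union, with NaN filling the missing cells:

```python
import pandas as pd

# hypothetical toy data for illustration
df_1 = pd.DataFrame({'Name': ['a', 'b'], 'Values': [1, 2]})
df_2 = pd.DataFrame({'Name': ['b', 'c'], 'Extra': [20, 30]})

# outer merge keeps the union of keys from both sides
merged = df_1.merge(df_2, how='outer', left_on='Name', right_on='Name')
```

Here only 'b' appears in both frames, so the rows for 'a' and 'c' get NaN in the columns coming from the other side.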
  • cut a series into ranges at the given quantiles of its values
df['Values'] = pd.qcut(df['Values'], q=[.1, .5, .9])
  • cut a series into a given number of ranges; each range will contain roughly the same number of values (for equal-length ranges use pd.cut instead)
df['Values'] = pd.qcut(df['Values'], 3)
  • create a color list based on the range of values. This is often used for heatmaps and geographic map visualizations.
colors = ['#122345', '#090909', '#abc123']
df["Color"] = pd.qcut(df["Values"], 3, labels=colors)
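A minimal sketch of the color-binning trick above, using hypothetical toy data and the same hex colors:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Values': [1, 2, 3, 4, 5, 6]})
colors = ['#122345', '#090909', '#abc123']

# qcut into 3 equal-count bins, each labelled with a hex color
df['Color'] = pd.qcut(df['Values'], 3, labels=colors)
```

Because qcut bins by quantiles, each color covers the same number of rows (two here), which is what you usually want for a choropleth-style map.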
  • iterate over the dataframe row by row (yields (index, row) pairs)
df.iterrows()
  • turn a dataframe into a dictionary (for loop)
df_dict = {}
for i, row in df.iterrows():
    df_dict[row["Name"]] = row["Values"]
  • turn a dataframe into a dictionary (dictionary comprehension)
df_dict = {row["Name"]: row["Values"] for i, row in df.iterrows()}
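A minimal sketch of the dataframe-to-dictionary pattern, using hypothetical toy data:

```python
import pandas as pd

# hypothetical toy data for illustration
df = pd.DataFrame({'Name': ['a', 'b'], 'Values': [1, 2]})

# iterrows yields (index, row) pairs; each row is a pd.Series
df_dict = {row['Name']: row['Values'] for i, row in df.iterrows()}
```

This key-to-value mapping is handy for lookups during plotting, e.g. mapping a region name to its color.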