
Python for Data 9: Pandas DataFrames

In the last lesson, we learned that Numpy’s ndarrays are well-suited for performing math operations on one and two-dimensional arrays of numeric values, but they fall short when it comes to dealing with heterogeneous data sets. To store data from an external source like an Excel workbook or database, we need a data structure that can hold different data types. It is also desirable to be able to refer to rows and columns in the data by custom labels rather than numbered indexes.

The pandas library offers data structures designed with this in mind: the series and the DataFrame. Series are 1-dimensional labeled arrays similar to numpy’s ndarrays, while DataFrames are labeled 2-dimensional structures that essentially function as spreadsheet tables.

Series: Creation and Access

Before we get into DataFrames, we’ll take a brief detour to explore pandas series. Series are very similar to ndarrays: the main difference is that with series, you can provide custom index labels, and operations performed on series automatically align the data based on those labels.

To create a new series, first load the numpy and pandas libraries:

import numpy as np
import pandas as pd 

Define a new series by passing a collection of homogeneous data like an ndarray or a list, along with a list of associated indexes, to pd.Series():

my_series = pd.Series( data = [2,3,5,4],             # Data
                       index= ['a', 'b', 'c', 'd'])  # Indexes

display(type(my_series))

display(my_series)
pandas.core.series.Series
a    2
b    3
c    5
d    4
dtype: int64

You can also create a series from a dictionary, in which case the dictionary keys act as the labels and the values act as the data:

my_dict = {"x": 2, "a": 5, "b": 4, "c": 8}

my_series2 = pd.Series(my_dict)

my_series2 
x    2
a    5
b    4
c    8
dtype: int64

Similar to a dictionary, you can access items in a series by the labels:

my_series["a"]
2

Numeric indexing also works:

my_series[0]
2

If you take a slice of a series, you get both the values and the labels contained in the slice:

my_series[1:3]
b    3
c    5
dtype: int64
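Integer indexing with plain square brackets can be ambiguous on a labeled series (and recent versions of pandas deprecate it), so pandas also provides the explicit accessors .loc (label-based) and .iloc (position-based). A small sketch:

```python
import pandas as pd

my_series = pd.Series(data=[2, 3, 5, 4],
                      index=['a', 'b', 'c', 'd'])

# .loc selects by label; .iloc selects by integer position
print(my_series.loc['b'])    # 3
print(my_series.iloc[1])     # 3
print(my_series.iloc[1:3])   # same slice as my_series[1:3]
```

Using the explicit accessors makes it unambiguous whether you mean a label or a position, which matters most when a series happens to have integer labels.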

As mentioned earlier, operations performed on two series align by label:

my_series + my_series
a     4
b     6
c    10
d     8
dtype: int64

If you perform an operation with two series that have different labels, the unmatched labels will return a value of NaN (not a number):

my_series + my_series2
a     7.0
b     7.0
c    13.0
d     NaN
x     NaN
dtype: float64
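If you would rather treat a missing label as zero instead of producing NaN, series arithmetic methods like add() take a fill_value argument. A small sketch using the two series from above:

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=['a', 'b', 'c', 'd'])
my_series2 = pd.Series({"x": 2, "a": 5, "b": 4, "c": 8})

# add() with fill_value=0 substitutes 0 for any label missing
# from one of the two series before adding
summed = my_series.add(my_series2, fill_value=0)
print(summed)
```

With fill_value=0, the labels d and x now yield 4.0 and 2.0 instead of NaN.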

Other than labeling, series behave much like numpy’s ndarrays. A series is even a valid argument to many of the numpy array functions we covered last time:

np.mean(my_series)        # numpy array functions generally work on series
3.5
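Series also carry their own built-in summary methods, which are often more convenient than the numpy equivalents. A brief sketch:

```python
import pandas as pd

my_series = pd.Series([2, 3, 5, 4], index=['a', 'b', 'c', 'd'])

# Built-in series summary methods
print(my_series.mean())    # 3.5
print(my_series.max())     # 5
print(my_series.idxmax())  # 'c' -- the label of the largest value
```

Methods like idxmax() take advantage of the labels, returning the index label where the extreme value occurs rather than just the value itself.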

DataFrame: Creation and Indexing

A DataFrame is a 2D table with labeled columns that can each hold different types of data. DataFrames are essentially a Python implementation of the types of tables you’d see in an Excel workbook or SQL database. DataFrames are the de facto standard data structure for working with tabular data in Python; we’ll be using them a lot throughout the remainder of this guide.

You can create a DataFrame out of a variety of data sources like dictionaries, 2D numpy arrays and series using the pd.DataFrame() function. Dictionaries provide an intuitive way to create DataFrames: when passed to pd.DataFrame(), a dictionary’s keys become column labels and the values become the columns themselves:

# Create a dictionary with some different data types as values

my_dict = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" : pd.Series([4.5, 5, 6.1], 
                                index=["Joe","Bob","Frans"]),
           "siblings" : 1,
           "gender" : "M"}

df = pd.DataFrame(my_dict)   # Convert the dict to DataFrame

df                           # Show the DataFrame

Notice that values in the dictionary you use to make a DataFrame can be a variety of sequence objects, including lists, ndarrays, tuples and series. If you pass in a singular value like a single number or string, that value is duplicated for every row in the DataFrame (in this case, gender is set to “M” and siblings to 1 for all records).

Also note that in the DataFrame above, the rows were automatically given indexes that align with the indexes of the series we passed in for the “height” column. If we did not use a series with index labels to create our DataFrame, it would be given numeric row index labels by default:

my_dict2 = {"name" : ["Joe","Bob","Frans"],
           "age" : np.array([10,15,20]),
           "weight" : (75,123,239),
           "height" :[4.5, 5, 6.1],
           "siblings" : 1,
           "gender" : "M"}

df2 = pd.DataFrame(my_dict2)   # Convert the dict to DataFrame

df2                            # Show the DataFrame

You can provide custom row labels when creating a DataFrame by adding the index argument:

df2 = pd.DataFrame(my_dict2,
                   index = my_dict["name"] )

df2

A DataFrame behaves like a dictionary of Series objects that each have the same length and indexes. This means we can get, add and delete columns in a DataFrame the same way we would when dealing with a dictionary:

# Get a column by name

display(df2['weight'])    # Single brackets return the column as a series

display(df2[["weight"]])  # Double brackets return a one-column DataFrame

Alternatively, you can get a column by label using “dot” notation (this only works when the column name is a valid Python identifier and doesn’t collide with an existing DataFrame attribute):

df2.weight
Joe       75
Bob      123
Frans    239
Name: weight, dtype: int64

# Delete a column

df2.drop(['name'], axis=1, inplace=True)   # Alternatively: del df2['name']

# Add a new column

df2["IQ"] = [130, 105, 115]

df2

Inserting a single value into a DataFrame causes it to populate across all the rows:

df2["Married"] = False

df2

When inserting a Series into a DataFrame, rows are matched by index. Unmatched rows will be filled with NaN:


df2["College"] = pd.Series(["Harvard"],
                           index=["Frans"])

df2

You can select rows, columns, or both by label with df.loc[row, column]:

df2.loc["Joe"]          # Select row "Joe"
age             10
weight          75
height         4.5
siblings         1
gender           M
IQ             130
Married      False
College        NaN
Name: Joe, dtype: object
df2.loc["Joe","IQ"]     # Select row "Joe" and column "IQ"
130
df2.loc["Joe":"Bob" , "IQ":"College"]   # Slice by label (inclusive of both endpoints)

Select rows or columns by numeric index with df.iloc[row, column]:

df2.iloc[0]          # Get row 0
age             10
weight          75
height         4.5
siblings         1
gender           M
IQ             130
Married      False
College        NaN
Name: Joe, dtype: object
df2.iloc[0, 5]       # Get row 0, column 5
130
df2.iloc[0:2, 5:8]   # Slice by numeric row and column index

You can also select rows by passing in a sequence of boolean (True/False) values. Rows where the corresponding boolean is True are returned:

boolean_index = [False, True, True]  

df2[boolean_index] 

This sort of logical True/False indexing is useful for subsetting data when combined with logical operations. For example, say we wanted to get a subset of our DataFrame with all persons who are over 12 years old. We can do it with boolean indexing:

# Create a boolean sequence with a logical comparison
boolean_index = df2["age"] > 12

# Use the index to get the rows where age > 12
df2[boolean_index]

You can do this sort of indexing all in one operation without assigning the boolean sequence to a variable:

df2[ df2["age"] > 12 ]
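You can also combine multiple conditions with the element-wise operators & (and), | (or) and ~ (not); note that each comparison needs its own parentheses. A sketch that rebuilds the example frame from the my_dict2 data above:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from my_dict2
my_dict2 = {"name": ["Joe", "Bob", "Frans"],
            "age": np.array([10, 15, 20]),
            "weight": (75, 123, 239),
            "height": [4.5, 5, 6.1],
            "siblings": 1,
            "gender": "M"}

df2 = pd.DataFrame(my_dict2, index=my_dict2["name"])

# Rows where age > 12 AND weight < 200: only Bob qualifies
subset = df2[(df2["age"] > 12) & (df2["weight"] < 200)]
print(subset)
```

The bitwise operators are required here because Python’s plain and/or keywords do not work element-wise on pandas objects.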

Learn more about indexing DataFrames in the pandas documentation.

Exploring DataFrames

Exploring data is an important first step in most data analyses. DataFrames come with a variety of functions to help you explore and summarize the data they contain.

First, let’s load in a data set to explore: the Titanic Disaster training data. (We will cover reading and writing data in more detail in the next lesson.)

titanic_train = pd.read_csv("https://code.datasciencedojo.com/datasciencedojo/tutorials/raw/master/Datasets/titanic.csv")

type(titanic_train)
pandas.core.frame.DataFrame

Notice that the Titanic data is loaded as a DataFrame. We can check the dimensions and size of a DataFrame with df.shape:

titanic_train.shape      # Check dimensions
(891, 12)

The output shows that the Titanic training data has 891 rows and 12 columns.

We can check the first n rows of the data with the df.head() function:

titanic_train.head()    # Check the first 5 rows

Similarly, we can check the last few rows with df.tail():

titanic_train.tail(6)   # Check the last 6 rows

With large data sets, df.head() and df.tail() are useful to get a sense of what the data looks like without printing hundreds or thousands of rows to the screen. Since each row specifies a different person, let’s try setting the row indexes equal to the passengers’ names. You can access and assign new row indexes with df.index:

titanic_train.index = titanic_train["Name"]  # Set index to name
del titanic_train["Name"]                    # Delete name column

print(titanic_train.index[0:10])             # Print new indexes
Index(['Braund, Mr. Owen Harris',
       'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
       'Heikkinen, Miss. Laina',
       'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
       'Allen, Mr. William Henry', 'Moran, Mr. James',
       'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard',
       'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
       'Nasser, Mrs. Nicholas (Adele Achem)'],
      dtype='object', name='Name')
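An equivalent one-step alternative is the df.set_index() method, which moves a column into the row index and removes it from the columns by default. A sketch on a small stand-in table (hypothetical rows, not the full Titanic data set):

```python
import pandas as pd

# Small stand-in table (hypothetical rows, not the full Titanic data)
df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                            "Heikkinen, Miss. Laina"],
                   "Survived": [0, 1]})

# set_index returns a new DataFrame with "Name" as the row index
df = df.set_index("Name")

print(df.index[0])           # 'Braund, Mr. Owen Harris'
print("Name" in df.columns)  # False: the column moved into the index
```

Because set_index returns a new DataFrame rather than modifying in place, it chains nicely with other DataFrame operations.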

You can access the column labels with df.columns:

titanic_train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

Use the df.describe() command to get a quick statistical summary of your data set. The summary includes the mean, median, min, max and a few key percentiles for numeric columns:

titanic_train.describe()    # Summarize the numeric columns

Since the columns of a DataFrame are series and series are closely related to numpy’s arrays, many ndarray functions work on DataFrames, operating on each column of the DataFrame:

np.mean(titanic_train,
        axis=0)          # Get the mean of each numeric column
FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

To get an overview of the overall structure of a DataFrame, use the df.info() function:

titanic_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, Braund, Mr. Owen Harris to Dooley, Mr. Patrick
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 83.5+ KB
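The non-null counts in df.info() show that Age, Cabin and Embarked contain missing values. You can tally missing values per column directly with isnull().sum(). A sketch on a small stand-in frame (hypothetical values, not the real data set):

```python
import numpy as np
import pandas as pd

# Stand-in frame with deliberate gaps
df = pd.DataFrame({"Age":   [22.0, np.nan, 26.0],
                   "Cabin": [np.nan, "C85", np.nan]})

# Summing isnull() counts the True values (missing cells) per column
missing = df.isnull().sum()
print(missing)
```

This is often the first check after loading a new data set, since missing values affect nearly every downstream operation.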

Wrap Up

Pandas DataFrames are the workhorse data structure for data analysis in Python. They provide an intuitive structure that mirrors the sorts of data tables we’re used to seeing in spreadsheet programs, and indexing functionality that follows the same pattern as other Python data structures. This brief introduction only scratches the surface; DataFrames offer a host of other indexing options and functions, many of which we will see in future lessons.