Pandas DataFrames Practice - Statistics and Data Science in Python

import numpy as np
import pandas as pd

Data Structures / Creation¶

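Pandas has two core data structures, Series and DataFrame. Older versions also provided a three-dimensional Panel, which has since been removed; higher-dimensional data is now represented with a MultiIndex on a Series or DataFrame.

Series is a one-dimensional labeled array;
DataFrame is a two-dimensional tabular data structure;
MultiIndex is a hierarchical (multi-level) index rather than a container of its own; it lets a Series or DataFrame represent data with three or more dimensions.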
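As a minimal sketch of how a MultiIndex lets a one-dimensional Series carry two levels of labels, the example below uses hypothetical level names (`city`, `year`) and made-up values; none of these names come from the original notebook.

```python
import pandas as pd

# Hypothetical two-level labels: (city, year). Values are illustrative only.
idx = pd.MultiIndex.from_product(
    [["Beijing", "Shanghai"], [2020, 2021]], names=["city", "year"]
)
s_multi = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Select every year for one city through the outer level.
beijing = s_multi.loc["Beijing"]

# Move the inner level into columns to obtain a 2-D table.
table = s_multi.unstack("year")
```

Here `unstack` turns the inner index level into columns, which is one common way to move between a multi-level Series and a DataFrame.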

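Series: One-Dimensional Structure¶

A Series is a one-dimensional labeled array that can hold data of any type. The pd.Series() function creates a Series object.

The first argument is the data to store; in the array-based example below it is a one-dimensional array of random numbers generated by NumPy.
The second argument, index, holds the labels for the data. In a Python list or a NumPy array the index is always an integer position (a subscript), whereas in Pandas the index labels can be of any hashable type.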
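To illustrate the two arguments, here is a small sketch that passes explicit string labels through `index`. The values and the labels `x`, `y`, `z` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data and labels (not from the original notebook).
s_labeled = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

by_label = s_labeled["y"]        # lookup by index label
by_position = s_labeled.iloc[1]  # lookup by integer position
```

Both lookups return the same element; the label-based form is what distinguishes a Series from a plain NumPy array.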

From an array¶

dt1 = np.random.randn(5)
display(type(dt1))

s_array = pd.Series(dt1)
display(type(s_array))
display(s_array)

numpy.ndarray

pandas.core.series.Series

0    0.257091
1    2.107121
2    0.110101
3    1.278196
4   -1.325080
dtype: float64

# The default index is an integer range starting at 0 with step 1. A custom index can be passed to pd.Series() via index=
s_array.index

RangeIndex(start=0, stop=5, step=1)

From a dict¶

dt2 = {'b':1, 'a':0, 'c':2}
display(type(dt2))

s_dict = pd.Series(dt2)

display(type(s_dict))
display(s_dict)

dict

pandas.core.series.Series

b    1
a    0
c    2
dtype: int64

s_dict.index

Index(['b', 'a', 'c'], dtype='object')

From a scalar¶

If data is a scalar value, provide an index; the value is repeated to match the length of the index.

dt3 = 10
display(type(dt3))

s_int = pd.Series(data=dt3, index=range(5))
s_int

int

0    10
1    10
2    10
3    10
4    10
dtype: int64

DataFrame: Two-Dimensional Structure¶

A DataFrame is a two-dimensional structure, similar to an Excel sheet or a database table.

From a dict of Series¶

From dict of Series or dicts

df_dict_s = pd.DataFrame(
    {
        'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
        'two': pd.Series([1., 2., 3., 4.], index=['b', 'c', 'a', 'd'])
    }
)

# Note that rows are aligned by index label, not by position
df_dict_s
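The alignment behaviour can be checked directly. The sketch below rebuilds the same frame under a new name, `df_aligned` (a name introduced here only for illustration), and inspects a few cells.

```python
import numpy as np
import pandas as pd

df_aligned = pd.DataFrame(
    {
        "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
        "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["b", "c", "a", "d"]),
    }
)

# The row index is the union of both Series indexes.
rows = list(df_aligned.index)

# "one" has no label "d", so that cell is filled with NaN.
missing = df_aligned.loc["d", "one"]

# Values in "two" follow their labels, not their original positions.
two_at_a = df_aligned.loc["a", "two"]
```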

From a dict of arrays¶

From dict of ndarrays / lists

df_dict_array = pd.DataFrame(
    {
        "one": [1.0, 2.0, 3.0, 4.0], 
        "two": [4.0, 3.0, 2.0, 1.0]
    }
)
df_dict_array

From a list of dicts¶

From a list of dicts

df_lst_dict = pd.DataFrame(
    [
        {"a": 1, "b": 2}, 
        {"a": 5, "b": 10, "c": 20}
    ]
)
df_lst_dict
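When the dicts do not share the same keys, the columns are the union of all keys and the gaps become NaN. A small sketch, using the hypothetical names `records` and `df_records`:

```python
import pandas as pd

records = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df_records = pd.DataFrame(records)

cols = list(df_records.columns)              # union of all keys
c_missing = df_records["c"].isna().tolist()  # first record has no "c"
c_dtype = df_records["c"].dtype              # NaN forces a float column
```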

From a dict of tuples (multi-level index)¶

From a dict of tuples

df_dict_tuple = pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
        ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
    }
)
df_dict_tuple

Reading files¶

pd.read_csv()
pd.read_excel()
pd.read_json()
pd.read_html()
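As a minimal sketch of two of these readers, the example below parses in-memory text instead of real files, so it runs without any external data. The contents of `csv_text` and `json_text` are made up for illustration.

```python
import io
import pandas as pd

# In-memory text standing in for a CSV file.
csv_text = "name,score\nalice,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# In-memory text standing in for a JSON file, one object per row.
json_text = '[{"name": "alice", "score": 90}, {"name": "bob", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text), orient="records")
```

In practice the first argument would usually be a file path or URL rather than a `StringIO` buffer.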

Viewing Data¶

Attributes such as index, columns, and shape give a quick overview of a DataFrame.

Structure¶

df_dict_s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

df_dict_s.columns

Index(['one', 'two'], dtype='object')

df_dict_s.shape

(4, 2)

Viewing rows¶

df_dict_s.head()

df_dict_s.tail(2)

Summary statistics¶

df_dict_s.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   one     3 non-null      float64
 1   two     4 non-null      float64
dtypes: float64(2)
memory usage: 268.0+ bytes

df_dict_s.describe()

Other¶

df_dict_s.to_numpy()

array([[ 1.,  3.],
       [ 2.,  1.],
       [ 3.,  2.],
       [nan,  4.]])

Indexing¶

Indexes are central to Pandas: through them we can access any data in a Series or DataFrame.

A DataFrame has both a row index and a column index.
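A short sketch of how the row and column indexes are used for lookup. The frame `df_idx` and its values are hypothetical and serve only to show the access patterns.

```python
import pandas as pd

df_idx = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

col_two = df_idx["two"]        # select a column by its column label
row_b = df_idx.loc["b"]        # select a row by its row label
cell = df_idx.loc["c", "one"]  # row label plus column label gives one value
first_row = df_idx.iloc[0]     # select a row by integer position
```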

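Creation and conversion¶

Besides being specified when a Series or DataFrame is created, an index can also be created on its own.

Use pd.Index to create the row index and the column index separately, then pass them as index and columns to build a DataFrame.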

ind = pd.Index(['e', 'd', 'a', 'b'])
col = pd.Index(['A', 'B', 'C'], name='cols')
df = pd.DataFrame(np.random.randn(4, 3), index=ind, columns=col)

df

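Sorting¶

Use the sort_index method to sort the row index of the df created above.

Pandas functions often take an axis parameter that specifies whether to work along the row index or the column index:

axis=0 is equivalent to axis='index'
axis=1 is equivalent to axis='columns'

axis=0 refers to the row index. For sort_index, axis=0 sorts the row labels and axis=1 sorts the column labels. For reductions such as sum, axis=0 moves down the rows and therefore yields one result per column, while axis=1 moves across the columns and yields one result per row.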

df.sort_index(axis=0)
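The effect of axis on both sorting and reductions can be seen on a tiny frame. `df_axis` and its labels are made up for this sketch.

```python
import pandas as pd

df_axis = pd.DataFrame(
    {"B": [1, 2], "A": [10, 20]},
    index=["y", "x"],
)

col_sums = df_axis.sum(axis=0)  # collapses the rows: one result per column
row_sums = df_axis.sum(axis=1)  # collapses the columns: one result per row

by_rows = df_axis.sort_index(axis=0)  # sorts row labels: x, y
by_cols = df_axis.sort_index(axis=1)  # sorts column labels: A, B
```

Note that sort_index returns a new DataFrame; `df_axis` itself keeps its original order.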

Querying Data¶
