Pandas is oriented around vectorized (SIMD-style) operations on whole arrays, and its API directly reflects this. Some operations may be less intuitive than their counterparts in the SQL or stream-processing worlds. Operations such as apply and the iter* family (iterrows, itertuples), which call user-defined Python once per row or column, are generally slow.
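As a rough illustration of that last point, the sketch below computes the same column two ways. The frame and the names `fast` and `slow` are made up for this example; no timings are claimed, only that the results agree.

```python
import numpy as np
import pandas as pd

# Hypothetical data, chosen only for illustration.
df = pd.DataFrame({'A': np.arange(5), 'B': np.arange(5, 10)})

# Vectorized: arithmetic is applied to whole columns at once.
fast = df['A'] * 2 + df['B']

# Row-wise apply: the lambda is called once per row from Python.
slow = df.apply(lambda row: row['A'] * 2 + row['B'], axis=1)

# Both produce the same values; only the cost differs.
print(fast.tolist() == slow.tolist())
```

On a frame this small the difference is invisible; the row-wise version mainly becomes a problem as the row count grows.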
import pandas as pd
import numpy as np
# by column
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
# by row
df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

# if the headers lie about the number of columns, ignore them and supply new ones
df = pd.read_csv(f1, skiprows=[0], names=['A', 'B', 'C'])
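A self-contained sketch of the same read_csv call, assuming a file whose header names two columns while the rows carry three. The string `raw` is a hypothetical stand-in for the file `f1`.

```python
import io
import pandas as pd

# Hypothetical file contents: the header names two columns, the rows have three.
raw = "x,y\n1,2,3\n4,5,6\n"

# skiprows=[0] drops the misleading header line; names= supplies replacements.
df = pd.read_csv(io.StringIO(raw), skiprows=[0], names=['A', 'B', 'C'])
print(df)
```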
# if the data has merged cells
df.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas

>>> df[df['A'] >= 2]
   A  B
1  2  7
2  3  8

loc selects by index (aka labels). iloc selects by ordinal (aka locations).
>>> aa.loc[:, 'A':'B']
   A  B
0  2  7
1  3  8

An index is like an id column in the SQL world. It's an extra column that defaults to a range from 0 to length-1 and appears as the unlabeled leftmost column when printing.
After selection, the index column is unchanged and may have to be reset.
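A minimal sketch of why this matters for loc and iloc, reusing the small frame from earlier in these notes. The variable names `first_by_position`, `first_by_label`, and `b` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
a = df[df['A'] >= 2]             # keeps the original labels 1 and 2

first_by_position = a.iloc[0]    # position 0 is the row labelled 1
first_by_label = a.loc[1]        # label 1 is that same row
# a.loc[0] would raise KeyError, because label 0 was filtered out.

# reset_index(drop=True) renumbers from 0 and discards the old labels,
# so no leftover 'index' column has to be deleted afterwards.
b = a.reset_index(drop=True)
print(b)
```

Passing drop=True is one way to skip the separate del step; whether to keep the old labels depends on whether they still carry meaning.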
>>> a = df[df['A'] >= 2]
>>> a
   A  B
1  2  7
2  3  8
>>> a.reset_index(inplace=True)
>>> a
   index  A  B
0      1  2  7
1      2  3  8
>>> del a['index']
>>> a
   A  B
0  2  7
1  3  8

An existing column can be set as the index. A multi-index is a composite key, ordered to be hierarchical.
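A minimal sketch of how such a composite key can be looked up, assuming a frame with columns A, B, and C. The names `mi`, `outer`, and `both` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
mi = df.set_index(['A', 'B'])

# Outer level only: every row whose A is 1, now indexed by the remaining level B.
outer = mi.loc[1]

# Full key as a tuple: every row with A == 1 and B == 5.
# The key is not unique in this data, so two rows come back.
both = mi.loc[(1, 5), :]

print(outer)
print(both)
```

Because the levels are hierarchical, a lookup on A alone works directly, while a lookup on B alone would need a different approach.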
>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.set_index(['A', 'B'])
     C
A B
1 2  8
  5  7
  5  6
3 8  4

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
df2 = pd.DataFrame({'A': [1], 'C': [5]})
df1.merge(df2, on='A') # inner
df1.merge(df2, on='A', how='outer')

>>> pd.concat([df1, df2])
   A    B    C
0  1  6.0  NaN
1  2  7.0  NaN
2  3  8.0  NaN
0  1  NaN  5.0

>>> pd.concat([df1, df1]).drop_duplicates()
   A  B
0  1  6
1  2  7
2  3  8

>>> df = pd.DataFrame({'A': [1, 1, 1, 3], 'B': [2, 5, 5, 8], 'C': [8, 7, 6, 4]})
>>> df.groupby('A').sum()
    B   C
A
1  12  21
3   8   4
>>> df.groupby('A').apply(print)
   A  B  C
0  1  2  8
1  1  5  7
2  1  5  6
   A  B  C
3  3  8  4
Empty DataFrame
Columns: []
Index: []

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> for k in df.itertuples(): print(k)
Pandas(Index=0, A=1, B=6)
Pandas(Index=1, A=2, B=7)
Pandas(Index=2, A=3, B=8)

df.head()
df.tail()
df.to_string()

>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})
>>> df.apply(lambda x: -x) # elementwise functions are applied to each column
   A  B
0 -1 -6
1 -2 -7
2 -3 -8
>>> df.apply(np.sum) # reductions collapse each column. default is axis 0 (down the rows)
A     6
B    21
dtype: int64
>>> df.apply(sum, axis=1) # axis 1 reduces across the columns, one result per row
0     7
1     9
2    11
dtype: int64
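For these particular cases the same values are available without apply. The sketch below uses built-in DataFrame operations on the same frame; the result names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})

negated = -df                  # same values as df.apply(lambda x: -x)
col_totals = df.sum()          # same values as df.apply(np.sum)
row_totals = df.sum(axis=1)    # same values as df.apply(sum, axis=1)

print(col_totals.tolist(), row_totals.tolist())
```

apply remains useful when no built-in covers the operation, at the cost of running Python per column or per row.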