the tl;dr of https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428
to select a column of data, use brackets: df['column_name']
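a minimal sketch, with a made-up DataFrame:

```python
import pandas as pd

# hypothetical data
df = pd.DataFrame({'fruit': ['apple', 'banana'], 'tastiness': [8, 6]})

# brackets return the column as a Series
tastiness = df['tastiness']
```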
to select rows of data, use .loc, or slice a datetime index: df['2019-01-01':'2019-02-28']
- if performance is the primary concern, use a NumPy array instead of Pandas
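a sketch of both row-selection forms plus the NumPy escape hatch; the data is made up:

```python
import pandas as pd

# hypothetical daily data spanning Jan-Feb 2019
df = pd.DataFrame(
    {'sales': range(59)},
    index=pd.date_range('2019-01-01', periods=59, freq='D'),
)

first_day = df.loc['2019-01-01']         # .loc selects rows by label
jan_feb = df['2019-01-01':'2019-02-28']  # a DatetimeIndex allows slice selection
sales_arr = df['sales'].to_numpy()       # raw NumPy array for speed-critical code
```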
use read_csv and its many arguments for reading files
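a few of those arguments for illustration; 'sales.csv' and its columns are hypothetical:

```python
import pandas as pd

df = pd.read_csv(
    'sales.csv',                             # hypothetical file
    usecols=['date', 'fruit', 'tastiness'],  # read only these columns
    parse_dates=['date'],                    # convert strings to datetimes
    index_col='date',                        # use the parsed column as the index
)
```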
use the .isna method to find NaN rows
use ~ to negate a boolean mask
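a sketch of filtering with .isna and ~, on made-up data:

```python
import numpy as np
import pandas as pd

# hypothetical data with a missing value
df = pd.DataFrame({'fruit': ['apple', 'banana', 'cherry'],
                   'tastiness': [8.0, np.nan, 9.0]})

mask = df['tastiness'].isna()  # True where tastiness is NaN
missing = df[mask]             # rows with NaN
present = df[~mask]            # ~ flips the mask: rows without NaN
```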
prefer the operators (+ - * / ** // % and < > == !=) over their method equivalents (add, lt, gt, eq, ne, ...)
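both spellings side by side, on a toy Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# preferred: plain operators
doubled = s * 2
big = s > 2

# the method equivalents do the same thing, more verbosely
doubled_m = s.mul(2)
big_m = s.gt(2)
```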
use pandas aggregation methods instead of built-in math functions
- e.g. df['column_name'].sum() instead of sum(df['column_name'])
- e.g. df['column_name'].max() instead of max(df['column_name'])
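a quick illustration of why, with made-up numbers; the pandas methods skip NaN and run in compiled code, while the built-ins loop in Python:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 4.0])

total = s.sum()     # 7.0: the pandas method skips NaN
largest = s.max()   # 4.0

bad_total = sum(s)  # nan: the built-in propagates NaN
```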
prefer df.groupby(...).agg(...) for group-by aggregation (runnable sketch after this list)
- Good:
  - df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})
  - e.g. df.groupby('fruit').agg({'tastiness': 'mean'})
  - e.g. df.groupby('fruit').agg({'tastiness': 'mean', 'weight': ['mean', 'median']})
- OK:
  - df.groupby('grouping column')['aggregating column'].agg('aggregating function')
  - e.g. df.groupby('fruit')['tastiness'].mean()
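a runnable sketch of the Good and OK forms, on a made-up fruit table:

```python
import pandas as pd

# hypothetical data
df = pd.DataFrame({
    'fruit': ['apple', 'apple', 'banana', 'banana'],
    'tastiness': [7, 9, 6, 8],
    'weight': [150, 170, 120, 110],
})

# Good: the dict names both the column and the function,
# and extends to several columns/functions at once
good = df.groupby('fruit').agg({'tastiness': 'mean',
                                'weight': ['mean', 'median']})

# OK: select the column first, then aggregate
ok = df.groupby('fruit')['tastiness'].mean()
```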
for going from wide to long format, prefer melt over stack
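a melt sketch on made-up wide data; unlike stack, melt lets you name the new columns directly:

```python
import pandas as pd

# hypothetical wide data: one column per year
wide = pd.DataFrame({'fruit': ['apple', 'banana'],
                     '2018': [10, 12],
                     '2019': [11, 15]})

long = wide.melt(id_vars='fruit', var_name='year', value_name='sales')
```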
for going from long to wide format, prefer pivot_table over unstack or pivot
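the reverse trip with pivot_table, again on made-up data; unlike pivot, it aggregates duplicate index/column pairs instead of raising:

```python
import pandas as pd

# hypothetical long data
long = pd.DataFrame({'fruit': ['apple', 'apple', 'banana', 'banana'],
                     'year': [2018, 2019, 2018, 2019],
                     'sales': [10, 11, 12, 15]})

wide = long.pivot_table(index='fruit', columns='year',
                        values='sales', aggfunc='sum')
```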