Intermediate
Pandas Complete Guide
Master data manipulation dengan Python Pandas dari dasar hingga advanced
⏱️ 45 min read
📅 Updated Jan 2025
👤 By DataLearn Team
📚 Apa yang akan kamu pelajari
- DataFrame & Series - struktur data fundamental
- Data loading dari berbagai sumber (CSV, Excel, SQL)
- Data cleaning & preprocessing
- Filtering, grouping, & aggregation
- Merge, join, & reshape data
- Time series analysis
1. Setup & Introduction
Instalasi
pip install pandas
pip install pandas numpy matplotlib openpyxl
Import & Basic Setup
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
2. Data Structures: Series & DataFrame
Series
Series adalah array 1-dimensional dengan label (index).
s = pd.Series([10, 20, 30, 40],
index=['a', 'b', 'c', 'd'])
print(s)
Output:
a 10
b 20
c 30
d 40
dtype: int64
DataFrame
DataFrame adalah tabel 2-dimensional (rows × columns).
data = {
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['Jakarta', 'Bandung', 'Surabaya']
}
df = pd.DataFrame(data)
print(df)
Output:
name age city
0 Alice 25 Jakarta
1 Bob 30 Bandung
2 Charlie 35 Surabaya
3. Loading Data
📊 Data Loading Cheat Sheet
CSV
pd.read_csv('file.csv')
Excel
pd.read_excel('file.xlsx')
SQL
pd.read_sql(query, conn)
JSON
pd.read_json('file.json')
Parquet
pd.read_parquet('file.parquet')
Clipboard
pd.read_clipboard()
CSV dengan Options
df = pd.read_csv(
'data.csv',
sep=',',
encoding='utf-8',
header=0,
index_col=0,
usecols=['col1', 'col2'],
nrows=1000,
na_values=['N/A', 'NULL'],
parse_dates=['date_column']
)
4. Data Exploration
🔍 Exploration Methods
First rowsdf.head()
Last rowsdf.tail()
Dimensionsdf.shape
Column namesdf.columns
Data typesdf.dtypes
Summary statsdf.describe()
Info lengkapdf.info()
Null countdf.isnull().sum()
5. Selecting & Filtering Data
Select Columns
df['name']
df.name
df[['name', 'age']]
df.iloc[:, 0:3]
df.iloc[:, [0, 2]]
Filter Rows
df[df['age'] > 25]
df[df['city'] == 'Jakarta']
df[(df['age'] > 25) & (df['city'] == 'Jakarta')]
df[(df['age'] < 20) | (df['age'] > 60)]
df[df['city'].isin(['Jakarta', 'Bandung'])]
df[df['name'].str.contains('Ali')]
6. Data Cleaning
Handling Missing Values
df.isnull().sum()
df.isnull().sum().sum()
df.dropna()
df.dropna(subset=['col1'])
df.dropna(how='all')
df.dropna(thresh=2)
df.fillna(0)
df.fillna({'age': 0, 'name': 'Unknown'})
df['age'].fillna(df['age'].mean())
df.fillna(method='ffill')
df.fillna(method='bfill')
Remove Duplicates
df.duplicated().sum()
df.duplicated(subset=['name', 'age']).sum()
df.drop_duplicates()
df.drop_duplicates(subset=['name'], keep='first')
7. Data Transformation
Create/Modify Columns
df['age_plus_10'] = df['age'] + 10
df['category'] = df['age'].apply(
lambda x: 'young' if x < 30 else 'old'
)
import numpy as np
df['status'] = np.where(df['age'] >= 18, 'adult', 'minor')
8. Group By & Aggregation
df.groupby('city')['age'].mean()
df.groupby('city').agg({
'age': ['mean', 'min', 'max'],
'salary': ['sum', 'mean']
})
df.groupby(['city', 'gender'])['age'].mean()
df.groupby('city').agg({'age': 'mean'}).reset_index()
9. Merge & Join
pd.merge(df1, df2, on='id', how='inner')
pd.merge(df1, df2, on='id', how='left')
pd.merge(df1, df2, on='id', how='right')
pd.merge(df1, df2, on='id', how='outer')
pd.merge(df1, df2, left_on='user_id', right_on='id')
df1.join(df2, how='left')
pd.concat([df1, df2], axis=0)
pd.concat([df1, df2], axis=1)
10. Pivot Tables
pivot = df.pivot_table(
values='sales',
index='region',
columns='month',
aggfunc='sum',
fill_value=0
)
pivot = df.pivot_table(
values=['sales', 'profit'],
index=['region', 'category'],
aggfunc={'sales': 'sum', 'profit': 'mean'}
)
11. Time Series
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
df.resample('M')['value'].mean()
df.resample('D')['value'].sum()
df['2023-01-01':'2023-12-31']
df[df.index.month == 1]
✏️ Hands-on Exercise: Sales Analysis
Kamu punya dataset penjualan dengan kolom: date, product, region, quantity, price
- Load data dan convert
date ke datetime
- Buat kolom
revenue = quantity × price
- Cari total revenue per region
- Cari produk terlaris per bulan
- Buat pivot table: region vs month dengan total revenue
Download dataset & jawaban →
12. Performance Tips
🚀 Optimizing Pandas
- Vectorization: Hindari loop, gunakan built-in operations
- Categorical data: Convert ke category dtype untuk memory efisiensi
- Chunking: Baca file besar dengan
chunksize
- Use PyArrow:
pd.read_csv(..., engine='pyarrow') lebih cepat
- Polars: Alternatif lebih cepat untuk data besar