Intermediate

Pandas Complete Guide

Master data manipulation dengan Python Pandas dari dasar hingga advanced

⏱️ 45 min read 📅 Updated Jan 2025 👤 By DataLearn Team

📚 Apa yang akan kamu pelajari

1. Setup & Introduction

Instalasi

# Install pandas pip install pandas # Install dengan dependencies umum pip install pandas numpy matplotlib openpyxl

Import & Basic Setup

import pandas as pd import numpy as np # Set display options pd.set_option('display.max_columns', None) pd.set_option('display.max_rows', 100)

2. Data Structures: Series & DataFrame

Series

Series adalah array 1-dimensional dengan label (index).

# Create Series s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd']) print(s)
Output:
a 10
b 20
c 30
d 40
dtype: int64

DataFrame

DataFrame adalah tabel 2-dimensional (rows × columns).

# Create DataFrame from dictionary data = { 'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35], 'city': ['Jakarta', 'Bandung', 'Surabaya'] } df = pd.DataFrame(data) print(df)
Output:
name age city
0 Alice 25 Jakarta
1 Bob 30 Bandung
2 Charlie 35 Surabaya

3. Loading Data

📊 Data Loading Cheat Sheet

CSV pd.read_csv('file.csv')
Excel pd.read_excel('file.xlsx')
SQL pd.read_sql(query, conn)
JSON pd.read_json('file.json')
Parquet pd.read_parquet('file.parquet')
Clipboard pd.read_clipboard()

CSV dengan Options

# Load CSV dengan berbagai options df = pd.read_csv( 'data.csv', sep=',', # delimiter encoding='utf-8', # encoding header=0, # row untuk header index_col=0, # column untuk index usecols=['col1', 'col2'], # hanya load kolom tertentu nrows=1000, # limit jumlah rows na_values=['N/A', 'NULL'], # nilai yang dianggap NA parse_dates=['date_column'] # parse sebagai datetime )

4. Data Exploration

🔍 Exploration Methods

First rowsdf.head()
Last rowsdf.tail()
Dimensionsdf.shape
Column namesdf.columns
Data typesdf.dtypes
Summary statsdf.describe()
Info lengkapdf.info()
Null countdf.isnull().sum()

5. Selecting & Filtering Data

Select Columns

# Single column (returns Series) df['name'] df.name # Multiple columns (returns DataFrame) df[['name', 'age']] # Select by position df.iloc[:, 0:3] # columns 0-2 df.iloc[:, [0, 2]] # columns 0 dan 2

Filter Rows

# Single condition df[df['age'] > 25] df[df['city'] == 'Jakarta'] # Multiple conditions df[(df['age'] > 25) & (df['city'] == 'Jakarta')] # AND df[(df['age'] < 20) | (df['age'] > 60)] # OR # isin untuk multiple values df[df['city'].isin(['Jakarta', 'Bandung'])] # String contains df[df['name'].str.contains('Ali')]

6. Data Cleaning

Handling Missing Values

# Detect missing values df.isnull().sum() # count per column df.isnull().sum().sum() # total count # Drop missing values df.dropna() # drop rows dengan NA df.dropna(subset=['col1']) # drop jika NA di kolom tertentu df.dropna(how='all') # drop jika semua kolom NA df.dropna(thresh=2) # drop jika kurang dari 2 non-NA # Fill missing values df.fillna(0) # isi dengan 0 df.fillna({'age': 0, 'name': 'Unknown'}) # isi per kolom df['age'].fillna(df['age'].mean()) # isi dengan mean df.fillna(method='ffill') # forward fill df.fillna(method='bfill') # backward fill

Remove Duplicates

# Check duplicates df.duplicated().sum() df.duplicated(subset=['name', 'age']).sum() # Remove duplicates df.drop_duplicates() df.drop_duplicates(subset=['name'], keep='first')

7. Data Transformation

Create/Modify Columns

# New column df['age_plus_10'] = df['age'] + 10 # With condition df['category'] = df['age'].apply( lambda x: 'young' if x < 30 else 'old' ) # Using numpy where import numpy as np df['status'] = np.where(df['age'] >= 18, 'adult', 'minor')

8. Group By & Aggregation

# Basic groupby df.groupby('city')['age'].mean() # Multiple aggregations df.groupby('city').agg({ 'age': ['mean', 'min', 'max'], 'salary': ['sum', 'mean'] }) # Multiple groupby columns df.groupby(['city', 'gender'])['age'].mean() # Reset index df.groupby('city').agg({'age': 'mean'}).reset_index()

9. Merge & Join

# Merge (SQL-style joins) pd.merge(df1, df2, on='id', how='inner') pd.merge(df1, df2, on='id', how='left') pd.merge(df1, df2, on='id', how='right') pd.merge(df1, df2, on='id', how='outer') # Merge dengan different column names pd.merge(df1, df2, left_on='user_id', right_on='id') # Join (index-based) df1.join(df2, how='left') # Concatenate pd.concat([df1, df2], axis=0) # vertical (add rows) pd.concat([df1, df2], axis=1) # horizontal (add columns)

10. Pivot Tables

# Pivot table pivot = df.pivot_table( values='sales', index='region', columns='month', aggfunc='sum', fill_value=0 ) # Multiple aggregations pivot = df.pivot_table( values=['sales', 'profit'], index=['region', 'category'], aggfunc={'sales': 'sum', 'profit': 'mean'} )

11. Time Series

# Parse datetime df['date'] = pd.to_datetime(df['date']) # Set as index df.set_index('date', inplace=True) # Resampling df.resample('M')['value'].mean() # monthly average df.resample('D')['value'].sum() # daily sum # Time-based filtering df['2023-01-01':'2023-12-31'] df[df.index.month == 1] # January only

✏️ Hands-on Exercise: Sales Analysis

Kamu punya dataset penjualan dengan kolom: date, product, region, quantity, price

  1. Load data dan convert date ke datetime
  2. Buat kolom revenue = quantity × price
  3. Cari total revenue per region
  4. Cari produk terlaris per bulan
  5. Buat pivot table: region vs month dengan total revenue

Download dataset & jawaban →

12. Performance Tips

🚀 Optimizing Pandas