Essential Pandas CheatSheet Part 1: Quick Pocket Guide

Whether you're just getting started with pandas or looking to sharpen your data manipulation skills, this cheat sheet has got you covered! We've distilled the most essential functions, tips, and tricks into an easy-to-follow guide that anyone—from beginners to seasoned users—can rely on. Use it as a quick reference to learn new techniques, refresh your memory, or take your pandas skills to the next level. No matter where you are on your data journey, this pocket guide will help you navigate through pandas with confidence and ease!

Let’s Dive In!!

Before we start, a quick note if you're a beginner: pandas provides two core classes for handling data:

  1. Series: a one-dimensional labelled array that can hold data of any type (integers, strings, Python objects, etc.).

  2. DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.


1. Basics: Series and DataFrame

import numpy as np
import pandas as pd

Creating a Series

A Pandas Series is like a single column of data, similar to a column in a spreadsheet.

ser = pd.Series([1,2,3,4,np.nan,5])   # Series creation (the NaN makes the dtype float)

ser[0] = 12  # Assign a value at a specific index
ser.index  # Shows the index (a RangeIndex by default)
ser.index.array   # Shows the index as an array
ser.to_numpy()   # Converts the data within the Series into a NumPy array

In simpler terms: imagine you have data in a spreadsheet column. ser.to_numpy() takes that data and puts it into a format that NumPy can work with directly for calculations and other operations.
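
For example, a minimal sketch of what the conversion returns:

ser = pd.Series([1, 2, 3, 4, np.nan, 5])
arr = ser.to_numpy()
print(type(arr))  # <class 'numpy.ndarray'>
print(arr.dtype)  # float64 -- the NaN forces a float dtype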

Creating a DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

There are several ways to create a DataFrame, but don't worry about mastering them all right now.

# From dictionary of List/ ndarrays
d = {
    "one":[1,2,3,4],
    "two":[5,6,7,8],
}

df = pd.DataFrame(d)
# From Dictionary of Series
d1 = {
    "one":pd.Series([1,2,3], index=["a","b","c"]),
    "two":pd.Series([4,5,6,7], index=["a","b","c","d"])
}

df1 = pd.DataFrame(d1)
# From structured or record array
# (A structured array is a type of array that allows you to store heterogeneous data—meaning different data types—in each element.)
data = np.zeros((2,),dtype = [("A","i4"),("B","f4"),("C","a10")])
# Fills the structured array with the given tuples.
data[:] = [(1,2.0,"Hello"),(2,3.0,"World")]
# Converts the structured array into a pandas DataFrame.
pd.DataFrame(data)

Creates a structured NumPy array with 2 rows.

The dtype specifies:

  1. "A": 4-byte integer (i4)

  2. "B": 4-byte float (f4)

  3. "C": String of up to 10 characters (a10).

# From a list of dicts
data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
pd.DataFrame(data2)

# From a dict of tuples
pd.DataFrame(
    {
        ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
        ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
        ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
        ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
        ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
    }
)

''' O/P -- the tuple keys become a MultiIndex on both the rows and the columns:

       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
'''

# From a Series
ser = pd.Series(range(3), index=list("abc"), name="ser")
pd.DataFrame(ser)

A namedtuple is a factory function in Python's collections module that creates tuple-like classes with named fields. It combines the efficiency of tuples with the readability of objects, making your code cleaner and more self-explanatory.

# From a list of namedtuples
from collections import namedtuple
Point = namedtuple("Point","x y")
pd.DataFrame([Point(0,0),(0,3),(2,)])  # Shorter tuples are padded with NaN; a longer tuple raises a ValueError
# From a list of dataclasses
from dataclasses import make_dataclass

Point = make_dataclass("Point" , [("x",int),("y",int)])
pd.DataFrame([Point(0,0),Point(0,3),Point(2,3)])

A dataclass in Python is a simple way to create classes that mainly store data. make_dataclass is a function that lets you create dataclasses dynamically at runtime instead of using the @dataclass decorator. This is useful when you don't know the exact fields of the class ahead of time and need to define them on the fly.
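
For comparison, here is the same Point written with the @dataclass decorator; both approaches produce an equivalent DataFrame:

from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])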

# Alternate constructors
pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))

pd.DataFrame.from_dict(
    dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
    orient="index",
    columns=["one", "two", "three"],
)

pd.DataFrame.from_records(data, index="C")  # Reuses the structured array from above; column "C" becomes the index
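
To see what orient changes, the two from_dict calls above should produce roughly the following: the default orient="columns" turns the dict keys into columns, while orient="index" turns them into row labels.

''' O/P (default orient="columns")
   A  B
0  1  4
1  2  5
2  3  6
'''

''' O/P (orient="index")
   one  two  three
A    1    2      3
B    4    5      6
'''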

Reading and Writing Data

df = pd.read_csv('file_name.csv')  # Read CSV
df.to_csv('file_name.csv', index=False)  # Write CSV

df = pd.read_excel('file_name.xlsx')  # Read Excel
df.to_excel('file_name.xlsx', index=False)  # Write Excel

df = pd.read_json('file_name.json')  # Read JSON
df.to_json('file_name.json')  # Write JSON

df = pd.read_parquet('file_name.parquet')  # Read Parquet
df.to_parquet('file_name.parquet', index=False)  # Write Parquet

Note: If the data file is not in the same location as your code file, specify the full file path.


2. Data Exploration & Selection

Basic Info

df.head()  # First 5 rows
df.head(10) # First 10 rows
df.tail()  # Last 5 rows
df.info()  # Overview of DataFrame
df.describe()  # Statistical summary
df.shape  # (rows, columns)
df.columns  # Column names
df.index  # Row index
df.dtypes  # Data types
df.memory_usage()  # Memory usage of DataFrame

Selection & Filtering

Pandas DataFrames are primarily designed for accessing data by row and column labels (with loc) or by integer positions (with iloc).

# Three ways to select column 'A'
df.loc[:, ['A']]  # Using loc to select all rows (:) and column 'A'
# or
df['A']          # Using the column name directly for a Series
# or
df.iloc[:, df.columns.get_loc('A')] # Using iloc with column index

# More Examples
df.loc[2,['A','B']]  # Row with label 2 and columns 'A', 'B'
df.iloc[0:2,1]  # First two rows (positions 0 and 1) and the 2nd column

  1. df.loc[:, ['A']]: This is the recommended way to select all rows and a specific column using labels. loc uses labels for both rows and columns: : selects all rows, and ['A'] selects the column labelled 'A'.

  2. df.iloc[:, df.columns.get_loc('A')]: This approach uses iloc for integer-based indexing. df.columns.get_loc('A') gets the integer position of the column 'A', and then iloc selects all rows (:) and that specific column index.
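
A quick sanity check on a toy DataFrame (values are illustrative) highlights the key slicing difference: loc's end label is inclusive, while iloc's end position is exclusive.

demo = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})
demo.loc[0:1, 'A']   # Rows labelled 0 and 1 -- the end label is INCLUDED
demo.iloc[0:1, 0]    # Row at position 0 only -- the end position is EXCLUDED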

# Selecting a column
df['Age']  # Series
df[['Name', 'Salary']]  # DataFrame
# Filtering rows
df[df['Age'] > 30]  # Conditional filtering
df.query('Age > 30 & Salary < 70000')  # Query-based filtering

3. Handling Missing Values & NA


df.isnull().sum()  # Count missing values
df.dropna()  # Drop missing values
df.fillna(value=0)  # Replace missing values with 0
df.fillna(df.mean(numeric_only=True), inplace=True)  # Fill with column means (numeric columns only)
df.interpolate()  # Fill missing values using interpolation
df.ffill()  # Forward fill
df.bfill()  # Backward fill
df.dropna(axis=1, thresh=2)  # Drop columns with fewer than 2 non-null values
  • df.ffill()

    In simpler terms:

    Imagine you have a column in your DataFrame with some empty cells. The ffill() method will look for the nearest previous non-empty cell in that column and copy its value into the empty cell. It does this for all empty cells, effectively "filling them forward" with the previous known value.

    Example:

    If your DataFrame df looks like this:

     A  B
0  1.0  4
1  NaN  5
2  3.0  6

After running df.ffill(), it would become:

     A  B
0  1.0  4
1  1.0  5  # The NaN in column 'A' is replaced with the previous value 1.0
2  3.0  6

df.bfill()

In simpler words:

If you have a column with missing data, bfill() looks for the next valid value and fills the gaps moving backward. Imagine it's like filling in the blanks from the future.

Example:

Applying df.bfill() to the same starting DataFrame transforms it into:

     A  B
0  1.0  4
1  3.0  5  # The NaN in column 'A' is replaced with the next valid value 3.0
2  3.0  6

4. Changing & Assigning Values


df['Age'] = df['Age'] + 1  # Increment column
df['New_Column'] = df['Salary'] / 1000  # Create new column
df.rename(columns={'Salary': 'Annual Salary'}, inplace=True)  # Rename column
df.replace({'Alice': 'Alicia'}, inplace=True)  # Replace values
df.at[0, 'Salary'] = 52000  # Assign value at specific row-column index
df.iat[0, 2] = 52000  # Faster assignment using integer location indexing
df.assign(NewColumn=df['Age'] * 2)  # Assign a new column dynamically (returns a new DataFrame)

5. Boolean Handling & Masking


mask = df['Age'] > 30
df[mask]  # Select rows where Age > 30
df.where(df['Age'] > 30, other=0)  # Replace values that don't satisfy condition
df.mask(df['Age'] > 30, other=0)  # Opposite of where
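
A tiny example (ages are made up) makes the difference concrete: where() keeps values where the condition is True, while mask() replaces them.

ages = pd.Series([25, 35, 45])
ages.where(ages > 30, other=0)  # 0, 35, 45 -- entries where the condition is False are replaced
ages.mask(ages > 30, other=0)   # 25, 0, 0  -- entries where the condition is True are replaced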

6. Sorting & Reordering Data

df.sort_values(by='Salary', ascending=False)  # Sort by Salary
df.sort_index()  # Sort by index
df.reindex(columns=['Name', 'Salary', 'Age'])  # Change column order

7. Datetime Operations


df['Date'] = pd.to_datetime(df['Date'])  # Convert to datetime
df['Year'] = df['Date'].dt.year  # Extract year
df['Month'] = df['Date'].dt.month  # Extract month
df['Day'] = df['Date'].dt.day  # Extract day
df['DayOfWeek'] = df['Date'].dt.day_name()  # Extract day name
df['Week'] = df['Date'].dt.isocalendar().week  # Extract week number
df['Time'] = df['Date'].dt.time  # Extract time
df['ElapsedDays'] = (df['Date'] - df['Date'].min()).dt.days  # Days since first date
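
A self-contained sketch with made-up dates, in case you want to try these accessors directly:

dates = pd.DataFrame({'Date': ['2024-01-15', '2024-02-20']})
dates['Date'] = pd.to_datetime(dates['Date'])
dates['Date'].dt.year        # 2024, 2024
dates['Date'].dt.day_name()  # Monday, Tuesday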

8. Performance Optimization & Handling Large Datasets

df['Category'] = df['Category'].astype('category')  # Convert column to category for memory efficiency
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')  # Optimize integer storage
df['Salary'] = pd.to_numeric(df['Salary'], downcast='float')  # Optimize float storage
df.sample(frac=0.1)  # Random sampling for large dataframes

9. Apply Function & Iteration


df['New_Salary'] = df['Salary'].apply(lambda x: x * 1.1)  # Increase Salary by 10%
for index, row in df.iterrows():
    print(row['Name'], row['Salary'])
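
Note that iterrows() is convenient but slow on large DataFrames. itertuples() is a faster alternative, and vectorized operations beat both:

for row in df.itertuples(index=False):
    print(row.Name, row.Salary)   # Attribute access, faster than iterrows()

df['New_Salary'] = df['Salary'] * 1.1  # Vectorized -- no Python loop at all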

10. Query & Slicing (Revisited)


df.query('Age > 30 & Salary < 70000')  # Query-based filtering
df.iloc[0:5]  # Select first 5 rows
df.loc[df['Age'] > 30, ['Name', 'Salary']]  # Select specific columns with condition

11. Copying & Views


df_copy = df.copy()  # Deep copy
df_subset = df[['Name', 'Age']]  # Column subset -- a new DataFrame (a copy, not a true view)
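
A quick check (with hypothetical values) confirms that a copy is independent of the original:

orig = pd.DataFrame({'Name': ['Alice'], 'Age': [30]})
dup = orig.copy()
dup.loc[0, 'Age'] = 99
print(orig.loc[0, 'Age'])  # Still 30 -- modifying the copy leaves the original intact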

12. Exporting & Saving Data (Revisited)

df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False)
df.to_json('output.json')
df.to_parquet('output.parquet', index=False)  # Save as Parquet for efficient storage

13. Categorical Data Optimization


df['Category'] = df['Category'].astype('category')  # Convert to category for memory efficiency
df['Category'].cat.codes  # Access category codes
df['Category'].cat.categories  # Access category labels
df['Category'].cat.add_categories(['NewCategory'])  # Add new category
df['Category'].cat.remove_unused_categories()  # Remove unused categories
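
The savings are easy to verify on repetitive string data (exact byte counts will vary by machine):

colors = pd.Series(['red', 'green', 'blue'] * 100_000)
print(colors.memory_usage(deep=True))                     # Large: one Python string per row
print(colors.astype('category').memory_usage(deep=True))  # Far smaller: integer codes + 3 labels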

14. Method Chaining


df = (df.dropna()
        .assign(NewColumn=lambda x: x['ExistingColumn'] * 2)
        .query("Age > 30")
        .sort_values("Salary", ascending=False)
     )

15. Handling Large Datasets Efficiently


import dask.dataframe as dd
df = dd.read_csv('large_file.csv')  # Load large file efficiently
df = pd.read_csv('large_file.csv', dtype={'column_name': 'category'})  # Optimize dtype
df['column'] = pd.to_numeric(df['column'], downcast='integer')  # Reduce memory usage
df.memory_usage(deep=True)  # Check memory usage
df.sample(frac=0.1)  # Randomly sample 10% of data
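
Plain pandas can also stream a large CSV in chunks instead of loading it all at once; the file and column names below are placeholders:

total = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    total += chunk['column'].sum()  # Aggregate chunk by chunk to keep memory bounded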

16. Apply Function (Preview)


df['SalaryIncremented'] = df['Salary'].apply(lambda x: x * 1.1)  # Apply a function to each element
df['Salary Range From'].apply(np.sqrt)  # NumPy ufuncs can be applied as well

17. Query & Filtering (Revisited)


df.query('Age > 30 & Salary < 70000')  # Query-based filtering
df[df['Age'].between(30, 40)]  # Filter between values
df[df['Name'].str.contains("Alice")]  # Filter strings
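
One caveat worth knowing: str.contains() returns NaN for missing values, which breaks boolean indexing; pass na=False to treat them as non-matches.

df[df['Name'].str.contains("Alice", na=False)]  # Missing names count as False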

Conclusion

That wraps up Part 1 of our Pandas cheatsheet, covering the essential basics you need to get started with data manipulation in Python. Mastering these fundamentals will set a strong foundation for more advanced techniques. In Part 2, we'll dive deeper into powerful features like the apply function, merging and joining datasets, complex aggregations, pivoting, string manipulations, time series data handling, and much more. If you're excited for the next part and want a comprehensive cheatsheet, drop a comment with #PART2 to let me know!

Keep Learning, Keep Exploring!!