
Data Querying for Data ScientistsπŸ”—

A Comprehensive Guide to Using Pandas, SQL, PySpark, and Polars for Data Manipulation Techniques, with Practical Examples and VisualisationsπŸ”—

If you want to run the code yourself, you can download the Jupyter notebook for each section:

- Download: All
- Download: Pandas
- Download: SQL
- Download: PySpark
- Download: Polars

Or you can follow along on this page...

IntroductionπŸ”—

Working as a Data Scientist or Data Engineer often involves querying data from various sources. There are many tools and libraries available to perform these tasks, each with its own strengths and weaknesses. Also, there are many different ways to achieve similar results, depending on the tool or library used. It's important to be familiar with these different methods to choose the best one for your specific use case.

This article provides a comprehensive guide on how to query data using different tools and libraries, including Pandas, SQL, PySpark, and Polars. Each section will cover the setup, data creation, and various querying techniques such as filtering, grouping, joining, window functions, ranking, and sorting. The output will be identical across all tools, but the transformations will be implemented using the specific syntax and features of each library, allowing you to compare the different approaches and understand the nuances of each method.

Overview of the Different LibrariesπŸ”—

Before we dive into the querying techniques, let's take a moment to understand the different libraries and tools we will be using in this article. Each library has its own strengths and weaknesses, and understanding these can help you choose the right tool for your specific use case.

Throughout this article, you can easily switch between the different libraries by selecting the appropriate tab. Each section will provide the same functionality, but implemented using the specific syntax and features of each library.

Pandas is a powerful data manipulation library in Python that provides data structures and functions for working with structured data. It is widely used for data analysis and manipulation tasks.

Historically, Pandas was one of the first libraries to provide a DataFrame structure, which is similar to a table in a relational database. It allows for easy data manipulation, filtering, grouping, and aggregation. Pandas is built on top of NumPy and provides a high-level interface for working with data. It is particularly well-suited for small to medium-sized datasets and is often used in Data Science and Machine Learning workflows.

Pandas provides a rich set of functionalities for data manipulation, including filtering, grouping, joining, and window functions. It also integrates well with other libraries such as Matplotlib and Seaborn for data visualization, making it a popular choice among data scientists and analysts.

While Pandas is both powerful and popular, it is important to note that it operates in-memory, which means that it may not be suitable for very large datasets that do not fit into memory. In such cases, other libraries like PySpark or Polars may be more appropriate.
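
As a rough illustration of this in-memory constraint, you can check how much memory a DataFrame actually occupies before deciding whether Pandas is the right tool. The following is a minimal sketch using a hypothetical DataFrame (not part of this article's dataset):

Check DataFrame memory usage (illustrative)
import numpy as np
import pandas as pd

# Hypothetical DataFrame, purely for illustration
df_example = pd.DataFrame({"value": np.random.uniform(0, 1, 1_000_000)})

# Report the approximate in-memory footprint in megabytes
memory_mb = df_example.memory_usage(deep=True).sum() / 1024**2
print(f"Approximate in-memory size: {memory_mb:.2f} MB")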

SQL (Structured Query Language) is a standard language for managing and manipulating relational databases. It is widely used for querying and modifying data in databases. SQL is a declarative language, meaning that you specify what you want to retrieve or manipulate without detailing how to do it. This makes SQL queries concise and expressive. SQL is particularly well-suited for working with large datasets and complex queries. It provides powerful features for filtering, grouping, joining, and aggregating data. SQL is the backbone of many database systems.

SQL is actually a language (like Python is a language), not a library (like Pandas is a library), and it is used to interact with relational databases. The core of the SQL language is an ISO standard, which means that the basic syntax and functionality are consistent across different database systems. However, each database system may have its own extensions or variations of SQL, which can lead to differences in syntax and features. Each database system can therefore be considered a variation (or dialect) of SQL, with its own specific features, optimizations, and syntax enhancements.

Some of the more popular SQL dialects include SQLite, PostgreSQL, MySQL, Microsoft SQL Server (T-SQL), and Oracle (PL/SQL).
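
As a small illustration of how these dialects differ, the same "top five products by price" query is written slightly differently in SQLite and SQL Server. The queries below are a sketch only, referencing the product table defined later in this article:

SQL dialect differences (illustrative)
# SQLite, PostgreSQL, and MySQL use LIMIT to restrict the number of rows returned
sqlite_style: str = "SELECT product_name, price FROM product ORDER BY price DESC LIMIT 5"

# SQL Server (T-SQL) uses TOP instead of LIMIT
tsql_style: str = "SELECT TOP 5 product_name, price FROM product ORDER BY price DESC"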

PySpark is the Python API for Apache Spark, a distributed computing framework that allows for large-scale data processing. PySpark provides a high-level interface for working with Spark, making it easier to write distributed data processing applications in Python. It is particularly well-suited for big data processing and analytics.

PySpark provides a DataFrame API similar to Pandas, but it is designed to work with large datasets that do not fit into memory. It allows for distributed data processing across a cluster of machines, making it suitable for big data applications. PySpark supports various data sources, including HDFS, S3, ADLS, and JDBC, and provides powerful features for filtering, grouping, joining, and aggregating data.
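
As a hedged sketch of what reading from such sources can look like (the paths and connection details below are hypothetical, and an existing SparkSession named spark is assumed):

Read from external sources (illustrative)
# Read CSV files from S3 (via the s3a connector), with a header row and inferred schema
df_from_s3 = spark.read.csv("s3a://my-bucket/sales/*.csv", header=True, inferSchema=True)

# Read Parquet files from HDFS
df_from_hdfs = spark.read.parquet("hdfs:///data/products/")

# Read a table over JDBC (connection details are placeholders)
df_from_jdbc = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales_db")
    .option("dbtable", "public.sales")
    .load()
)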

While PySpark is a powerful tool for big data processing, it can be more complex to set up and use compared to Pandas. It requires a Spark cluster and may have a steeper learning curve for those unfamiliar with distributed computing concepts. However, it is an excellent choice for processing large datasets and performing complex data transformations.

Polars is a fast DataFrame library for Python that is designed for high-performance data manipulation. It is built on top of Rust and provides a DataFrame API similar to Pandas, but with a focus on performance and memory efficiency. Polars is particularly well-suited for large datasets and complex queries.

Polars supports lazy evaluation, which allows for optimizations in query execution. Polars also provides powerful features for filtering, grouping, joining, and aggregating data, making it a great choice for data analysis tasks.
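
As a minimal sketch of lazy evaluation (assuming a hypothetical sales.csv file, which is not part of this article's dataset), the query below is only planned until .collect() is called, giving Polars the chance to optimise the whole pipeline before executing it:

Lazy evaluation (illustrative)
import polars as pl

lazy_total_sales = (
    pl.scan_csv("sales.csv")                                   # builds a LazyFrame; nothing is read yet
    .filter(pl.col("sales_amount") > 500)                      # predicates can be pushed down by the optimiser
    .group_by("category")                                      # `.groupby()` in older Polars versions
    .agg(pl.col("sales_amount").sum().alias("total_sales"))
)
df_total_sales = lazy_total_sales.collect()                    # the optimised plan is executed here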

While Polars is a relatively new library compared to Pandas, it has gained popularity for its performance and ease of use. Although it is not a strict drop-in replacement for Pandas, its API is similar enough that existing code can often be migrated with only modest changes, allowing users to leverage its performance benefits. It is particularly useful for data scientists and analysts who need to work with large datasets and require fast data manipulation capabilities. The setup is simple and straightforward, similar to Pandas, and less complex than PySpark, making it a great choice for data analysis tasks that require high performance and memory efficiency.

SetupπŸ”—

Before we start querying data, we need to set up our environment. This includes importing the necessary libraries, creating sample data, and defining constants that will be used throughout the article. The following sections will guide you through this setup process. The code for this article is also available on GitHub: querying-data.

Setup
# StdLib Imports
from typing import Any

# Third Party Imports
import numpy as np
import pandas as pd
from plotly import express as px, graph_objects as go, io as pio


# Set seed for reproducibility
np.random.seed(42)

# Determine the number of records to generate
n_records = 100

# Set default Plotly template
pio.templates.default = "simple_white+gridon"

# Set Pandas display options
pd.set_option("display.max_columns", None)
Setup
# StdLib Imports
import sqlite3
from typing import Any

# Third Party Imports
import numpy as np
import pandas as pd
from plotly import express as px, graph_objects as go, io as pio


# Set default Plotly template
pio.templates.default = "simple_white+gridon"

# Set Pandas display options
pd.set_option("display.max_columns", None)
Setup
# StdLib Imports
from typing import Any

# Third Party Imports
import numpy as np
from plotly import express as px, graph_objects as go, io as pio
from pyspark.sql import (
    DataFrame as psDataFrame,
    SparkSession,
    Window,
    functions as F,
    types as T,
)


# Set seed for reproducibility
np.random.seed(42)

# Determine the number of records to generate
n_records = 100

# Set default Plotly template
pio.templates.default = "simple_white+gridon"
Setup
# StdLib Imports
from typing import Any

# Third Party Imports
import numpy as np
import polars as pl
from plotly import express as px, graph_objects as go, io as pio


# Set seed for reproducibility
np.random.seed(42)

# Determine the number of records to generate
n_records = 100

# Set default Plotly template
pio.templates.default = "simple_white+gridon"

# Set Polars display options
pl.Config.set_tbl_cols(-1)

Once the setup is complete, we can proceed to create our sample data. This data will be used for querying and will be consistent across all libraries. All tables will be created from scratch with randomly generated data to simulate a real-world scenario. This ensures that the examples are self-contained and can be run without any external dependencies, and that there are no concerns about data privacy or security.

For the below data creation steps, we will be defining the tables using Python dictionaries. Each dictionary will represent a table, with keys as column names and values as lists of data. We will then convert these dictionaries into DataFrames or equivalent structures in each library.

First, we will create a sales fact table. This table will contain information about sales transactions, including the date, customer ID, product ID, category, sales amount, and quantity sold.

Create Sales Fact Data
sales_data: dict[str, Any] = {
    "date": pd.date_range(start="2023-01-01", periods=n_records, freq="D"),
    "customer_id": np.random.randint(1, 100, n_records),
    "product_id": np.random.randint(1, 50, n_records),
    "category": np.random.choice(["Electronics", "Clothing", "Food", "Books", "Home"], n_records),
    "sales_amount": np.random.uniform(10, 1000, n_records).round(2),
    "quantity": np.random.randint(1, 10, n_records),
}

Next, we will create a product dimension table. This table will contain information about products, including the product ID, name, price, category, and supplier ID.

Create Product Dimension Data
product_data: dict[str, Any] = {
    "product_id": np.arange(1, 51),
    "product_name": [f"Product {i}" for i in range(1, 51)],
    "price": np.random.uniform(10, 500, 50).round(2),
    "category": np.random.choice(["Electronics", "Clothing", "Food", "Books", "Home"], 50),
    "supplier_id": np.random.randint(1, 10, 50),
}

Finally, we will create a customer dimension table. This table will contain information about customers, including the customer ID, name, city, state, and segment.

Create Customer Dimension Data
customer_data: dict[str, Any] = {
    "customer_id": np.arange(1, 101),
    "customer_name": [f"Customer {i}" for i in range(1, 101)],
    "city": np.random.choice(["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"], 100),
    "state": np.random.choice(["NY", "CA", "IL", "TX", "AZ"], 100),
    "segment": np.random.choice(["Consumer", "Corporate", "Home Office"], 100),
}

Now that we have our sample data created, we can proceed to the querying section. Each of the following sections will demonstrate how to perform similar operations using the different libraries and methods, allowing you to compare and contrast their capabilities.

Create the DataFramesπŸ”—

To create the dataframes in Pandas, we will use the data we generated earlier. We will convert the dictionaries into Pandas DataFrames, which will allow us to perform various data manipulation tasks.

Create DataFrames
df_sales_pd: pd.DataFrame = pd.DataFrame(sales_data)
df_product_pd: pd.DataFrame = pd.DataFrame(product_data)
df_customer_pd: pd.DataFrame = pd.DataFrame(customer_data)

Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each DataFrame. To do this, we will use the .head() method to display the first 5 rows of each DataFrame, and then pass the result to the print() function to display the DataFrame in a readable format.

Check Sales DataFrame
print(f"Sales DataFrame: {len(df_sales_pd)}")
print(df_sales_pd.head(5))
print(df_sales_pd.head(5).to_markdown())
Sales DataFrame: 100
        date  customer_id  product_id     category  sales_amount  quantity
0 2023-01-01           52          45         Food        490.76         7
1 2023-01-02           93          41  Electronics        453.94         5
2 2023-01-03           15          29         Home        994.51         5
3 2023-01-04           72          15  Electronics        184.17         7
4 2023-01-05           61          45         Food         27.89         9
Check Product DataFrame
print(f"Product DataFrame: {len(df_product_pd)}")
print(df_product_pd.head(5))
print(df_product_pd.head(5).to_markdown())
Product DataFrame: 50
   product_id product_name   price  category  supplier_id
0           1    Product 1  257.57      Food            8
1           2    Product 2  414.96  Clothing            5
2           3    Product 3  166.82  Clothing            8
3           4    Product 4  448.81      Food            4
4           5    Product 5  200.71      Food            8
Check Customer DataFrame
print(f"Customer DataFrame: {len(df_customer_pd)}")
print(df_customer_pd.head(5))
print(df_customer_pd.head(5).to_markdown())
Customer DataFrame: 100
   customer_id customer_name         city state      segment
0            1    Customer 1      Phoenix    NY    Corporate
1            2    Customer 2      Phoenix    CA  Home Office
2            3    Customer 3      Phoenix    NY  Home Office
3            4    Customer 4  Los Angeles    NY     Consumer
4            5    Customer 5  Los Angeles    IL  Home Office

To create the dataframes in SQL, we will use the data we generated earlier. Firstly, we need to create the SQLite database. This will be an in-memory database for demonstration purposes, but in a real-world scenario, you would typically connect to a persistent (on-disk) database. To do this, we will use the sqlite3 library to create a connection to the database, passing the :memory: parameter to the .connect() function. The result is a temporary database that exists only for the lifetime of the connection.
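
For comparison, here is a minimal sketch of the two connection styles (the on-disk file name is hypothetical):

In-memory vs on-disk connections (illustrative)
import sqlite3

# Temporary database: lost as soon as the connection is closed
conn_memory: sqlite3.Connection = sqlite3.connect(":memory:")

# Persistent database: stored on disk and reusable between sessions
conn_disk: sqlite3.Connection = sqlite3.connect("sales_analysis.db")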

Next, we will convert the dictionaries into Pandas DataFrames, which will then be loaded into the SQLite database. This allows us to perform various data manipulation tasks using SQL queries.

Create DataFrames
# Creates SQLite database and tables
conn: sqlite3.Connection = sqlite3.connect(":memory:")
pd.DataFrame(sales_data).to_sql("sales", conn, index=False, if_exists="replace")
pd.DataFrame(product_data).to_sql("product", conn, index=False, if_exists="replace")
pd.DataFrame(customer_data).to_sql("customer", conn, index=False, if_exists="replace")

Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each table. To do this, we will use the pd.read_sql() function to execute SQL queries and retrieve the data from the database. We will then pass the results to the print() function to display the DataFrame in a readable format.

Check Sales DataFrame
print(f"Sales Table: {len(pd.read_sql('SELECT * FROM sales', conn))}")
print(pd.read_sql("SELECT * FROM sales LIMIT 5", conn))
print(pd.read_sql("SELECT * FROM sales LIMIT 5", conn).to_markdown())
Sales Table: 100
                  date  customer_id  product_id     category  sales_amount  quantity
0  2023-01-01 00:00:00           52          45         Food        490.76         7
1  2023-01-02 00:00:00           93          41  Electronics        453.94         5
2  2023-01-03 00:00:00           15          29         Home        994.51         5
3  2023-01-04 00:00:00           72          15  Electronics        184.17         7
4  2023-01-05 00:00:00           61          45         Food         27.89         9
Check Product DataFrame
print(f"Product Table: {len(pd.read_sql('SELECT * FROM product', conn))}")
print(pd.read_sql("SELECT * FROM product LIMIT 5", conn))
print(pd.read_sql("SELECT * FROM product LIMIT 5", conn).to_markdown())
Product Table: 50
   product_id product_name   price  category  supplier_id
0           1    Product 1  257.57      Food            8
1           2    Product 2  414.96  Clothing            5
2           3    Product 3  166.82  Clothing            8
3           4    Product 4  448.81      Food            4
4           5    Product 5  200.71      Food            8
Check Customer DataFrame
print(f"Customer Table: {len(pd.read_sql('SELECT * FROM customer', conn))}")
print(pd.read_sql("SELECT * FROM customer LIMIT 5", conn))
print(pd.read_sql("SELECT * FROM customer LIMIT 5", conn).to_markdown())
Customer Table: 100
   customer_id customer_name         city state      segment
0            1    Customer 1      Phoenix    NY    Corporate
1            2    Customer 2      Phoenix    CA  Home Office
2            3    Customer 3      Phoenix    NY  Home Office
3            4    Customer 4  Los Angeles    NY     Consumer
4            5    Customer 5  Los Angeles    IL  Home Office

Spark DataFrames are similar to Pandas DataFrames, but they are designed to work with large datasets that do not fit into memory. They can be distributed across a cluster of machines, allowing for parallel processing of data.

To create the dataframes in PySpark, we will use the data we generated earlier. We will first create a Spark session, which is the entry point to using PySpark. Then, we will parse the dictionaries into PySpark DataFrames, which will allow us to perform various data manipulation tasks.

The Spark session is created using the SparkSession.builder interface, which allows us to configure the session with various options such as the application name. The .getOrCreate() method is used to either get an existing session or create a new one if it doesn't exist.
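
If you need more control over the session, the builder also accepts extra configuration. The following is a hedged sketch; the settings are illustrative only and not required for this article:

Configure the Spark Session (illustrative)
spark_configured: SparkSession = (
    SparkSession.builder
    .appName("SalesAnalysis")
    .master("local[*]")                              # run locally using all available cores
    .config("spark.sql.shuffle.partitions", "8")     # example tuning option
    .getOrCreate()
)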

Create Spark Session
spark: SparkSession = SparkSession.builder.appName("SalesAnalysis").getOrCreate()

Once the Spark session is created, we can create the DataFrames from the dictionaries. We will use the .createDataFrame() method on the Spark session to convert the dictionaries into PySpark DataFrames. The .createDataFrame() method expects the data to be oriented by row, meaning that the data should be in the form of a list of dictionaries, where each dictionary represents a row of data. However, our data is currently oriented by column, where each dictionary has keys as column names and values as lists of data. Therefore, we first need to convert the dictionaries from column orientation to row orientation. The easiest way to do this is to load the data into Pandas DataFrames, and then use those to create our PySpark DataFrames.

A good description of how to create PySpark DataFrames from Python Dictionaries can be found in the PySpark documentation: PySpark Create DataFrame From Dictionary.
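
For illustration, the column-to-row conversion can also be done manually with a dictionary comprehension, without going through Pandas. The following is only a sketch using a small hypothetical dictionary; the next block uses the Pandas route:

Convert column-oriented data to rows (illustrative)
column_oriented: dict[str, list] = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
row_oriented: list[dict] = [
    {key: values[i] for key, values in column_oriented.items()}
    for i in range(len(column_oriented["id"]))
]
# row_oriented == [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
df_example_ps: psDataFrame = spark.createDataFrame(row_oriented)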

Create DataFrames
df_sales_ps: psDataFrame = spark.createDataFrame(pd.DataFrame(sales_data))
df_product_ps: psDataFrame = spark.createDataFrame(pd.DataFrame(product_data))
df_customer_ps: psDataFrame = spark.createDataFrame(pd.DataFrame(customer_data))

Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each DataFrame. To do this, we will use the .show() method to display the first 5 rows of each DataFrame. The .show() method is used to display the data in a tabular format, similar to how it would be displayed in a SQL database.

Check Sales DataFrame
print(f"Sales DataFrame: {df_sales_ps.count()}")
df_sales_ps.show(5)
print(df_sales_ps.limit(5).toPandas().to_markdown())
Sales DataFrame: 100
+-------------------+-----------+----------+-----------+------------+--------+
|               date|customer_id|product_id|   category|sales_amount|quantity|
+-------------------+-----------+----------+-----------+------------+--------+
|2023-01-01 00:00:00|         52|        45|       Food|      490.76|       7|
|2023-01-02 00:00:00|         93|        41|Electronics|      453.94|       5|
|2023-01-03 00:00:00|         15|        29|       Home|      994.51|       5|
|2023-01-04 00:00:00|         72|        15|Electronics|      184.17|       7|
|2023-01-05 00:00:00|         61|        45|       Food|       27.89|       9|
+-------------------+-----------+----------+-----------+------------+--------+
only showing top 5 rows
Check Product DataFrame
print(f"Product DataFrame: {df_product_ps.count()}")
df_product_ps.show(5)
print(df_product_ps.limit(5).toPandas().to_markdown())
Product DataFrame: 50
+----------+------------+------+--------+-----------+
|product_id|product_name| price|category|supplier_id|
+----------+------------+------+--------+-----------+
|         1|   Product 1|257.57|    Food|          8|
|         2|   Product 2|414.96|Clothing|          5|
|         3|   Product 3|166.82|Clothing|          8|
|         4|   Product 4|448.81|    Food|          4|
|         5|   Product 5|200.71|    Food|          8|
+----------+------------+------+--------+-----------+
only showing top 5 rows
Check Customer DataFrame
print(f"Customer DataFrame: {df_customer_ps.count()}")
df_customer_ps.show(5)
print(df_customer_ps.limit(5).toPandas().to_markdown())
Customer DataFrame: 100
+-----------+-------------+-----------+-----+-----------+
|customer_id|customer_name|       city|state|    segment|
+-----------+-------------+-----------+-----+-----------+
|          1|   Customer 1|    Phoenix|   NY|  Corporate|
|          2|   Customer 2|    Phoenix|   CA|Home Office|
|          3|   Customer 3|    Phoenix|   NY|Home Office|
|          4|   Customer 4|Los Angeles|   NY|   Consumer|
|          5|   Customer 5|Los Angeles|   IL|Home Office|
+-----------+-------------+-----------+-----+-----------+
only showing top 5 rows

To create the dataframes in Polars, we will use the data we generated earlier. We will convert the dictionaries into Polars DataFrames, which will allow us to perform various data manipulation tasks.

Create DataFrames
df_sales_pl: pl.DataFrame = pl.DataFrame(sales_data)
df_product_pl: pl.DataFrame = pl.DataFrame(product_data)
df_customer_pl: pl.DataFrame = pl.DataFrame(customer_data)

Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each DataFrame. To do this, we will use the .head() method to display the first 5 rows of each DataFrame, and then pass the result to the print() function to display the DataFrame in a readable format.

Check Sales DataFrame
print(f"Sales DataFrame: {df_sales_pl.shape[0]}")
print(df_sales_pl.head(5))
print(df_sales_pl.head(5).to_pandas().to_markdown())
Sales DataFrame: 100
shape: (5, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity β”‚
β”‚ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      β”‚
β”‚ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ══════════════β•ͺ══════════║
β”‚ 2023-01-01 00:00:00 ┆ 52          ┆ 45         ┆ Food        ┆ 490.76       ┆ 7        β”‚
β”‚ 2023-01-02 00:00:00 ┆ 93          ┆ 41         ┆ Electronics ┆ 453.94       ┆ 5        β”‚
β”‚ 2023-01-03 00:00:00 ┆ 15          ┆ 29         ┆ Home        ┆ 994.51       ┆ 5        β”‚
β”‚ 2023-01-04 00:00:00 ┆ 72          ┆ 15         ┆ Electronics ┆ 184.17       ┆ 7        β”‚
β”‚ 2023-01-05 00:00:00 ┆ 61          ┆ 45         ┆ Food        ┆ 27.89        ┆ 9        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Check Product DataFrame
print(f"Product DataFrame: {df_product_pl.shape[0]}")
print(df_product_pl.head(5))
print(df_product_pl.head(5).to_pandas().to_markdown())
Product DataFrame: 50
shape: (5, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ product_id ┆ product_name ┆ price  ┆ category ┆ supplier_id β”‚
β”‚ ---        ┆ ---          ┆ ---    ┆ ---      ┆ ---         β”‚
β”‚ i64        ┆ str          ┆ f64    ┆ str      ┆ i64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ══════════════β•ͺ════════β•ͺ══════════β•ͺ═════════════║
β”‚ 1          ┆ Product 1    ┆ 257.57 ┆ Food     ┆ 8           β”‚
β”‚ 2          ┆ Product 2    ┆ 414.96 ┆ Clothing ┆ 5           β”‚
β”‚ 3          ┆ Product 3    ┆ 166.82 ┆ Clothing ┆ 8           β”‚
β”‚ 4          ┆ Product 4    ┆ 448.81 ┆ Food     ┆ 4           β”‚
β”‚ 5          ┆ Product 5    ┆ 200.71 ┆ Food     ┆ 8           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
Check Customer DataFrame
print(f"Customer DataFrame: {df_customer_pl.shape[0]}")
print(df_customer_pl.head(5))
print(df_customer_pl.head(5).to_pandas().to_markdown())
Customer DataFrame: 100
shape: (5, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ customer_id ┆ customer_name ┆ city        ┆ state ┆ segment     β”‚
β”‚ ---         ┆ ---           ┆ ---         ┆ ---   ┆ ---         β”‚
β”‚ i64         ┆ str           ┆ str         ┆ str   ┆ str         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═══════════════β•ͺ═════════════β•ͺ═══════β•ͺ═════════════║
β”‚ 1           ┆ Customer 1    ┆ Phoenix     ┆ NY    ┆ Corporate   β”‚
β”‚ 2           ┆ Customer 2    ┆ Phoenix     ┆ CA    ┆ Home Office β”‚
β”‚ 3           ┆ Customer 3    ┆ Phoenix     ┆ NY    ┆ Home Office β”‚
β”‚ 4           ┆ Customer 4    ┆ Los Angeles ┆ NY    ┆ Consumer    β”‚
β”‚ 5           ┆ Customer 5    ┆ Los Angeles ┆ IL    ┆ Home Office β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1. Filtering and SelectingπŸ”—

This first section will demonstrate how to filter and select data from the DataFrames. This is a common operation in data analysis, allowing us to focus on specific subsets of the data.

In Pandas, we can use boolean indexing to filter rows based on specific conditions. As you can see in this first example, this is done using square brackets, within which we define a column and a condition. In the below example, we use a string value to filter categorical data.

For more information about filtering in Pandas, see the Pandas documentation on filtering.

Filter sales data for specific category
electronics_sales_pd: pd.DataFrame = df_sales_pd[df_sales_pd["category"] == "Electronics"]
print(f"Number of Electronics Sales: {len(electronics_sales_pd)}")
print(electronics_sales_pd.head(5))
print(electronics_sales_pd.head(5).to_markdown())
Number of Electronics Sales: 28
         date  customer_id  product_id     category  sales_amount  quantity
1  2023-01-02           93          41  Electronics        453.94         5
3  2023-01-04           72          15  Electronics        184.17         7
8  2023-01-09           75           9  Electronics        746.73         2
10 2023-01-11           88           1  Electronics        314.98         9
11 2023-01-12           24          44  Electronics        547.11         8

In SQL, we can use the WHERE clause to filter rows based on specific conditions. The syntax should be very familiar to anyone who has worked with SQL before. We can use the pd.read_sql() function to execute SQL queries and retrieve the data from the database. The result is a Pandas DataFrame that contains only the rows that match the specified condition. In the below example, we filter for sales in the "Electronics" category.

For more information about filtering in SQL, see the SQL WHERE clause documentation.

Filter sales for a specific category
electronics_sales_txt: str = """
    SELECT *
    FROM sales
    WHERE category = 'Electronics'
"""
electronics_sales_sql: pd.DataFrame = pd.read_sql(electronics_sales_txt, conn)
print(f"Number of Electronics Sales: {len(electronics_sales_sql)}")
print(pd.read_sql(electronics_sales_txt + "LIMIT 5", conn))
print(pd.read_sql(electronics_sales_txt + "LIMIT 5", conn).to_markdown())
Number of Electronics Sales: 28
                  date  customer_id  product_id     category  sales_amount  quantity
0  2023-01-02 00:00:00           93          41  Electronics        453.94         5
1  2023-01-04 00:00:00           72          15  Electronics        184.17         7
2  2023-01-09 00:00:00           75           9  Electronics        746.73         2
3  2023-01-11 00:00:00           88           1  Electronics        314.98         9
4  2023-01-12 00:00:00           24          44  Electronics        547.11         8

In PySpark, we can use the .filter() (or the .where()) method to filter rows based on specific conditions. This process is effectively doing a boolean indexing operation to filter the DataFrame. The syntax is similar to SQL, where we can specify the condition as a string or using column expressions. In the below example, we filter for sales in the "Electronics" category.

For more information about filtering in PySpark, see the PySpark documentation on filtering.

Filter sales for a specific category
electronics_sales_ps: psDataFrame = df_sales_ps.filter(df_sales_ps["category"] == "Electronics")
print(f"Number of Electronics Sales: {electronics_sales_ps.count()}")
electronics_sales_ps.show(5)
print(electronics_sales_ps.limit(5).toPandas().to_markdown())
Number of Electronics Sales: 28
+-------------------+-----------+----------+-----------+------------+--------+
|               date|customer_id|product_id|   category|sales_amount|quantity|
+-------------------+-----------+----------+-----------+------------+--------+
|2023-01-02 00:00:00|         93|        41|Electronics|      453.94|       5|
|2023-01-04 00:00:00|         72|        15|Electronics|      184.17|       7|
|2023-01-09 00:00:00|         75|         9|Electronics|      746.73|       2|
|2023-01-11 00:00:00|         88|         1|Electronics|      314.98|       9|
|2023-01-12 00:00:00|         24|        44|Electronics|      547.11|       8|
+-------------------+-----------+----------+-----------+------------+--------+
only showing top 5 rows

In Polars, we can use the .filter() method to filter rows based on specific conditions. The syntax is similar to Pandas, where we can specify the condition using column expressions. In the below example, we filter for sales in the "Electronics" category.

For more information about filtering in Polars, see the Polars documentation on filtering.

Filter sales for a specific category
electronics_sales_pl: pl.DataFrame = df_sales_pl.filter(df_sales_pl["category"] == "Electronics")
print(f"Number of Electronics Sales: {len(electronics_sales_pl)}")
print(electronics_sales_pl.head(5))
print(electronics_sales_pl.head(5).to_pandas().to_markdown())
Number of Electronics Sales: 28
shape: (5, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity β”‚
β”‚ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      β”‚
β”‚ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ══════════════β•ͺ══════════║
β”‚ 2023-01-02 00:00:00 ┆ 93          ┆ 41         ┆ Electronics ┆ 453.94       ┆ 5        β”‚
β”‚ 2023-01-04 00:00:00 ┆ 72          ┆ 15         ┆ Electronics ┆ 184.17       ┆ 7        β”‚
β”‚ 2023-01-09 00:00:00 ┆ 75          ┆ 9          ┆ Electronics ┆ 746.73       ┆ 2        β”‚
β”‚ 2023-01-11 00:00:00 ┆ 88          ┆ 1          ┆ Electronics ┆ 314.98       ┆ 9        β”‚
β”‚ 2023-01-12 00:00:00 ┆ 24          ┆ 44         ┆ Electronics ┆ 547.11       ┆ 8        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

We can also use numerical filtering, as you can see in the next example, where we filter for sales amounts greater than $500.

When it comes to numerical filtering in Pandas, the process is similar to the previous example, where we use boolean indexing to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500.

Filter for high value transactions
high_value_sales_pd: pd.DataFrame = df_sales_pd[df_sales_pd["sales_amount"] > 500]
print(f"Number of high-value Sales: {len(high_value_sales_pd)}")
print(high_value_sales_pd.head(5))
print(high_value_sales_pd.head(5).to_markdown())
Number of high-value Sales: 43
         date  customer_id  product_id     category  sales_amount  quantity
2  2023-01-03           15          29         Home        994.51         5
8  2023-01-09           75           9  Electronics        746.73         2
9  2023-01-10           75          24        Books        723.73         6
11 2023-01-12           24          44  Electronics        547.11         8
12 2023-01-13            3           8     Clothing        513.73         5

When it comes to numerical filtering in SQL, the process is similar to the previous example, where we use the WHERE clause to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500.

Filter for high value transactions
high_value_sales_txt: str = """
    SELECT *
    FROM sales
    WHERE sales_amount > 500
"""
high_value_sales_sql: pd.DataFrame = pd.read_sql(high_value_sales_txt, conn)
print(f"Number of high-value Sales: {len(high_value_sales_sql)}")
print(pd.read_sql(high_value_sales_txt + "LIMIT 5", conn))
print(pd.read_sql(high_value_sales_txt + "LIMIT 5", conn).to_markdown())
Number of high-value Sales: 43
                  date  customer_id  product_id     category  sales_amount  quantity
0  2023-01-03 00:00:00           15          29         Home        994.51         5
1  2023-01-09 00:00:00           75           9  Electronics        746.73         2
2  2023-01-10 00:00:00           75          24        Books        723.73         6
3  2023-01-12 00:00:00           24          44  Electronics        547.11         8
4  2023-01-13 00:00:00            3           8     Clothing        513.73         5

When it comes to numerical filtering in PySpark, the process is similar to the previous example, where we use the .filter() (or .where()) method to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500.

Also note here that we have passed a string value to the .filter() method, instead of using the pure-Python syntax shown above. This is because the .filter() method can also accept a SQL-like string expression, which is a common practice in PySpark.
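
For reference, the same filter can also be written with a column expression via F.col(); both forms are equivalent here (the variable name below is only for illustration):

Filter using a column expression (equivalent)
high_value_sales_ps_alt: psDataFrame = df_sales_ps.filter(F.col("sales_amount") > 500)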

Filter for high value transactions
high_value_sales_ps: psDataFrame = df_sales_ps.filter("sales_amount > 500")
print(f"Number of high-value Sales: {high_value_sales_ps.count()}")
high_value_sales_ps.show(5)
print(high_value_sales_ps.limit(5).toPandas().to_markdown())
Number of high-value Sales: 43
+-------------------+-----------+----------+-----------+------------+--------+
|               date|customer_id|product_id|   category|sales_amount|quantity|
+-------------------+-----------+----------+-----------+------------+--------+
|2023-01-03 00:00:00|         15|        29|       Home|      994.51|       5|
|2023-01-09 00:00:00|         75|         9|Electronics|      746.73|       2|
|2023-01-10 00:00:00|         75|        24|      Books|      723.73|       6|
|2023-01-12 00:00:00|         24|        44|Electronics|      547.11|       8|
|2023-01-13 00:00:00|          3|         8|   Clothing|      513.73|       5|
+-------------------+-----------+----------+-----------+------------+--------+
only showing top 5 rows

When it comes to numerical filtering in Polars, the process is similar to the previous example, where we use the .filter() method to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500.

Also note here that we have used the pl.col() function to specify the column we want to filter on. This is different from the previous examples, where we used the column name directly. The use of pl.col() is a common practice in Polars to specify the column name in a more readable way.

Filter for high value transactions
high_value_sales_pl: pl.DataFrame = df_sales_pl.filter(pl.col("sales_amount") > 500)
print(f"Number of high-value Sales: {len(high_value_sales_pl)}")
print(high_value_sales_pl.head(5))
print(high_value_sales_pl.head(5).to_pandas().to_markdown())
Number of high-value Sales: 43
shape: (5, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity β”‚
β”‚ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      β”‚
β”‚ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ══════════════β•ͺ══════════║
β”‚ 2023-01-03 00:00:00 ┆ 15          ┆ 29         ┆ Home        ┆ 994.51       ┆ 5        β”‚
β”‚ 2023-01-09 00:00:00 ┆ 75          ┆ 9          ┆ Electronics ┆ 746.73       ┆ 2        β”‚
β”‚ 2023-01-10 00:00:00 ┆ 75          ┆ 24         ┆ Books       ┆ 723.73       ┆ 6        β”‚
β”‚ 2023-01-12 00:00:00 ┆ 24          ┆ 44         ┆ Electronics ┆ 547.11       ┆ 8        β”‚
β”‚ 2023-01-13 00:00:00 ┆ 3           ┆ 8          ┆ Clothing    ┆ 513.73       ┆ 5        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

In addition to subsetting a table by rows (aka filtering), we can also subset a table by columns (aka selecting). This allows us to create a new DataFrame with only the relevant columns we want to work with. This is useful when we want to focus on specific attributes of the data, such as dates, categories, or sales amounts.

To select specific columns in Pandas, we can use the double square brackets syntax to specify the columns we want to keep in the DataFrame. This allows us to create a new DataFrame with only the relevant columns.

For more information about selecting specific columns, see the Pandas documentation on selecting columns.

Select specific columns
sales_summary_pd: pd.DataFrame = df_sales_pd[["date", "category", "sales_amount"]]
print(f"Sales Summary DataFrame: {len(sales_summary_pd)}")
print(sales_summary_pd.head(5))
print(sales_summary_pd.head(5).to_markdown())
Sales Summary DataFrame: 100
        date     category  sales_amount
0 2023-01-01         Food        490.76
1 2023-01-02  Electronics        453.94
2 2023-01-03         Home        994.51
3 2023-01-04  Electronics        184.17
4 2023-01-05         Food         27.89

To select specific columns in SQL, we can use the SELECT statement to specify the columns we want to retrieve from the table. This allows us to create a new DataFrame with only the relevant columns. We can use the pd.read_sql() function to execute SQL queries and retrieve the data from the database.

For more information about selecting specific columns in SQL, see the SQL SELECT statement documentation.

Select specific columns
sales_summary_txt: str = """
    SELECT date, category, sales_amount
    FROM sales
"""
sales_summary_sql: pd.DataFrame = pd.read_sql(sales_summary_txt, conn)
print(f"Selected columns in Sales: {len(sales_summary_sql)}")
print(pd.read_sql(sales_summary_txt + "LIMIT 5", conn))
print(pd.read_sql(sales_summary_txt + "LIMIT 5", conn).to_markdown())
Selected columns in Sales: 100
                  date     category  sales_amount
0  2023-01-01 00:00:00         Food        490.76
1  2023-01-02 00:00:00  Electronics        453.94
2  2023-01-03 00:00:00         Home        994.51
3  2023-01-04 00:00:00  Electronics        184.17
4  2023-01-05 00:00:00         Food         27.89

To select specific columns in PySpark, we can use the .select() method to specify the columns we want to keep in the DataFrame. This allows us to create a new DataFrame with only the relevant columns. The syntax is similar to SQL, where we can specify the column names as strings.

Select specific columns
sales_summary_ps: psDataFrame = df_sales_ps.select("date", "category", "sales_amount")
print(f"Sales Summary DataFrame: {sales_summary_ps.count()}")
sales_summary_ps.show(5)
print(sales_summary_ps.limit(5).toPandas().to_markdown())
Sales Summary DataFrame: 100
+-------------------+-----------+------------+
|               date|   category|sales_amount|
+-------------------+-----------+------------+
|2023-01-01 00:00:00|       Food|      490.76|
|2023-01-02 00:00:00|Electronics|      453.94|
|2023-01-03 00:00:00|       Home|      994.51|
|2023-01-04 00:00:00|Electronics|      184.17|
|2023-01-05 00:00:00|       Food|       27.89|
+-------------------+-----------+------------+
only showing top 5 rows

To select specific columns in Polars, we can use the .select() method to specify the columns we want to keep in the DataFrame. This allows us to create a new DataFrame with only the relevant columns.

Select specific columns
sales_summary_pl: pl.DataFrame = df_sales_pl.select(["date", "category", "sales_amount"])
print(f"Sales Summary DataFrame: {len(sales_summary_pl)}")
print(sales_summary_pl.head(5))
print(sales_summary_pl.head(5).to_pandas().to_markdown())
Sales Summary DataFrame: 100
shape: (5, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date                ┆ category    ┆ sales_amount β”‚
β”‚ ---                 ┆ ---         ┆ ---          β”‚
β”‚ datetime[ns]        ┆ str         ┆ f64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ══════════════║
β”‚ 2023-01-01 00:00:00 ┆ Food        ┆ 490.76       β”‚
β”‚ 2023-01-02 00:00:00 ┆ Electronics ┆ 453.94       β”‚
β”‚ 2023-01-03 00:00:00 ┆ Home        ┆ 994.51       β”‚
β”‚ 2023-01-04 00:00:00 ┆ Electronics ┆ 184.17       β”‚
β”‚ 2023-01-05 00:00:00 ┆ Food        ┆ 27.89        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2. Grouping and AggregationπŸ”—

The second section will cover grouping and aggregation techniques. These operations are essential for summarizing data and extracting insights from large datasets.

In Pandas, we can use the .agg() method to perform aggregation operations on DataFrames. This method allows us to apply multiple aggregation functions to different columns in a single operation.

Basic aggregation
sales_stats_pd: pd.DataFrame = df_sales_pd.agg(
    {
        "sales_amount": ["sum", "mean", "min", "max", "count"],
        "quantity": ["sum", "mean", "min", "max"],
    }
)
print(f"Sales Statistics: {len(sales_stats_pd)}")
print(sales_stats_pd)
print(sales_stats_pd.to_markdown())
Sales Statistics: 5
       sales_amount  quantity
sum      48227.0500    464.00
mean       482.2705      4.64
min         15.1300      1.00
max        994.6100      9.00
count      100.0000       NaN

In SQL, we can use the aggregate functions like SUM(), AVG(), MIN(), MAX(), and COUNT() to perform aggregation operations on tables.

Note here that we are not using the GROUP BY clause, which is typically used to group rows that have the same values in specified columns into summary rows. Instead, we are performing a basic aggregation on the entire table.

Basic aggregation
sales_stats_txt: str = """
    SELECT
        SUM(sales_amount) AS sales_sum,
        AVG(sales_amount) AS sales_mean,
        MIN(sales_amount) AS sales_min,
        MAX(sales_amount) AS sales_max,
        COUNT(*) AS sales_count,
        SUM(quantity) AS quantity_sum,
        AVG(quantity) AS quantity_mean,
        MIN(quantity) AS quantity_min,
        MAX(quantity) AS quantity_max
    FROM sales
"""
print(f"Sales Statistics: {len(pd.read_sql(sales_stats_txt, conn))}")
print(pd.read_sql(sales_stats_txt, conn))
print(pd.read_sql(sales_stats_txt, conn).to_markdown())
Sales Statistics: 1
   sales_sum  sales_mean  sales_min  sales_max  sales_count  quantity_sum  quantity_mean  quantity_min  quantity_max
0   48227.05    482.2705      15.13     994.61          100           464           4.64             1             9

In PySpark, we can use the .agg() method to perform aggregation operations on DataFrames. This method allows us to apply multiple aggregation functions to different columns in a single operation.

Basic aggregation
sales_stats_ps: psDataFrame = df_sales_ps.agg(
    F.sum("sales_amount").alias("sales_sum"),
    F.avg("sales_amount").alias("sales_mean"),
    F.expr("MIN(sales_amount) AS sales_min"),
    F.expr("MAX(sales_amount) AS sales_max"),
    F.count("*").alias("sales_count"),
    F.expr("SUM(quantity) AS quantity_sum"),
    F.expr("AVG(quantity) AS quantity_mean"),
    F.min("quantity").alias("quantity_min"),
    F.max("quantity").alias("quantity_max"),
)
print(f"Sales Statistics: {sales_stats_ps.count()}")
sales_stats_ps.show(5)
print(sales_stats_ps.limit(5).toPandas().to_markdown())
Sales Statistics: 1
+---------+----------+---------+---------+-----------+------------+-------------+------------+------------+
|sales_sum|sales_mean|sales_min|sales_max|sales_count|quantity_sum|quantity_mean|quantity_min|quantity_max|
+---------+----------+---------+---------+-----------+------------+-------------+------------+------------+
| 48227.05|  482.2705|    15.13|   994.61|        100|         464|         4.64|           1|           9|
+---------+----------+---------+---------+-----------+------------+-------------+------------+------------+

In Polars, we can use the .select() method to perform aggregation operations on DataFrames. This method allows us to apply multiple aggregation functions to different columns in a single operation.

Basic aggregation
sales_stats_pl: pl.DataFrame = df_sales_pl.select(
    pl.col("sales_amount").sum().alias("sales_sum"),
    pl.col("sales_amount").mean().alias("sales_mean"),
    pl.col("sales_amount").min().alias("sales_min"),
    pl.col("sales_amount").max().alias("sales_max"),
    pl.col("quantity").sum().alias("quantity_sum"),
    pl.col("quantity").mean().alias("quantity_mean"),
    pl.col("quantity").min().alias("quantity_min"),
    pl.col("quantity").max().alias("quantity_max"),
)
print(f"Sales Statistics: {len(sales_stats_pl)}")
print(sales_stats_pl)
print(sales_stats_pl.to_pandas().to_markdown())
Sales Statistics: 1
shape: (1, 8)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ sales_sum ┆ sales_mean ┆ sales_min ┆ sales_max ┆ quantity_sum ┆ quantity_mean ┆ quantity_min ┆ quantity_max β”‚
β”‚ ---       ┆ ---        ┆ ---       ┆ ---       ┆ ---          ┆ ---           ┆ ---          ┆ ---          β”‚
β”‚ f64       ┆ f64        ┆ f64       ┆ f64       ┆ i64          ┆ f64           ┆ i64          ┆ i64          β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════β•ͺ═══════════β•ͺ═══════════β•ͺ══════════════β•ͺ═══════════════β•ͺ══════════════β•ͺ══════════════║
β”‚ 48227.05  ┆ 482.2705   ┆ 15.13     ┆ 994.61    ┆ 464          ┆ 4.64          ┆ 1            ┆ 9            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| | sales_sum | sales_mean | sales_min | sales_max | quantity_sum | quantity_mean | quantity_min | quantity_max |
|---|---|---|---|---|---|---|---|---|
| 0 | 48227 | 482.27 | 15.13 | 994.61 | 464 | 4.64 | 1 | 9 |

It is also possible to group the data by a specific column and then apply aggregation functions to summarize the data by group.

In Pandas, this is done using the .groupby() method to group the data by one or more columns, followed by the .agg() method to apply the aggregation functions that summarize each group.

Group by category and aggregate
category_sales_pd: pd.DataFrame = df_sales_pd.groupby("category").agg(
    {
        "sales_amount": ["sum", "mean", "count"],
        "quantity": "sum",
    }
)
print(f"Category Sales Summary: {len(category_sales_pd)}")
print(category_sales_pd)
print(category_sales_pd.to_markdown())
Category Sales Summary: 5
            sales_amount                   quantity
                     sum        mean count      sum
category
Books           10154.83  441.514348    23      100
Clothing         7325.31  457.831875    16       62
Electronics     11407.45  407.408929    28      147
Food            12995.57  541.482083    24      115
Home             6343.89  704.876667     9       40
| category | ('sales_amount', 'sum') | ('sales_amount', 'mean') | ('sales_amount', 'count') | ('quantity', 'sum') |
|---|---|---|---|---|
| Books | 10154.8 | 441.514 | 23 | 100 |
| Clothing | 7325.31 | 457.832 | 16 | 62 |
| Electronics | 11407.5 | 407.409 | 28 | 147 |
| Food | 12995.6 | 541.482 | 24 | 115 |
| Home | 6343.89 | 704.877 | 9 | 40 |
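
As a side note, if you prefer flat column names instead of the MultiIndex produced by the dictionary-style .agg(), Pandas also supports named aggregation. This is a minimal sketch of that alternative, using the same df_sales_pd data (the category_sales_flat_pd name is just illustrative):

# Named aggregation: each keyword becomes a flat output column name
category_sales_flat_pd: pd.DataFrame = df_sales_pd.groupby("category").agg(
    total_sales=("sales_amount", "sum"),
    average_sales=("sales_amount", "mean"),
    transaction_count=("sales_amount", "count"),
    total_quantity=("quantity", "sum"),
)
print(category_sales_flat_pd.head(5))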

In SQL, we can use the GROUP BY clause to group rows that have the same values in specified columns into summary rows. We can then apply aggregate functions like SUM(), AVG(), and COUNT() in the SELECT clause to summarize the data by group.

Group by category and aggregate
category_sales_txt: str = """
    SELECT
        category,
        SUM(sales_amount) AS total_sales,
        AVG(sales_amount) AS average_sales,
        COUNT(*) AS transaction_count,
        SUM(quantity) AS total_quantity
    FROM sales
    GROUP BY category
"""
print(f"Category Sales Summary: {len(pd.read_sql(category_sales_txt, conn))}")
print(pd.read_sql(category_sales_txt + "LIMIT 5", conn))
print(pd.read_sql(category_sales_txt + "LIMIT 5", conn).to_markdown())
Category Sales Summary: 5
      category  total_sales  average_sales  transaction_count  total_quantity
0        Books     10154.83     441.514348                 23             100
1     Clothing      7325.31     457.831875                 16              62
2  Electronics     11407.45     407.408929                 28             147
3         Food     12995.57     541.482083                 24             115
4         Home      6343.89     704.876667                  9              40
| | category | total_sales | average_sales | transaction_count | total_quantity |
|---|---|---|---|---|---|
| 0 | Books | 10154.8 | 441.514 | 23 | 100 |
| 1 | Clothing | 7325.31 | 457.832 | 16 | 62 |
| 2 | Electronics | 11407.5 | 407.409 | 28 | 147 |
| 3 | Food | 12995.6 | 541.482 | 24 | 115 |
| 4 | Home | 6343.89 | 704.877 | 9 | 40 |

In PySpark, we can use the .groupBy() method to group data by one or more columns and then apply aggregation functions using the .agg() method.

Group by category and aggregate
category_sales_ps: psDataFrame = df_sales_ps.groupBy("category").agg(
    F.sum("sales_amount").alias("total_sales"),
    F.avg("sales_amount").alias("average_sales"),
    F.count("*").alias("transaction_count"),
    F.sum("quantity").alias("total_quantity"),
)
print(f"Category Sales Summary: {category_sales_ps.count()}")
category_sales_ps.show(5)
print(category_sales_ps.limit(5).toPandas().to_markdown())
Category Sales Summary: 5
+-----------+------------------+------------------+-----------------+--------------+
|   category|       total_sales|     average_sales|transaction_count|total_quantity|
+-----------+------------------+------------------+-----------------+--------------+
|       Home| 6343.889999999999| 704.8766666666666|                9|            40|
|       Food|          12995.57| 541.4820833333333|               24|           115|
|Electronics|11407.449999999999|407.40892857142853|               28|           147|
|   Clothing|7325.3099999999995|457.83187499999997|               16|            62|
|      Books|          10154.83|  441.514347826087|               23|           100|
+-----------+------------------+------------------+-----------------+--------------+
| | category | total_sales | average_sales | transaction_count | total_quantity |
|---|---|---|---|---|---|
| 0 | Home | 6343.89 | 704.877 | 9 | 40 |
| 1 | Food | 12995.6 | 541.482 | 24 | 115 |
| 2 | Electronics | 11407.4 | 407.409 | 28 | 147 |
| 3 | Clothing | 7325.31 | 457.832 | 16 | 62 |
| 4 | Books | 10154.8 | 441.514 | 23 | 100 |

In Polars, we can use the .group_by() method to group data by one or more columns and then apply aggregation functions using the .agg() method.

Group by category and aggregate
category_sales_pl: pl.DataFrame = df_sales_pl.group_by("category").agg(
    pl.col("sales_amount").sum().alias("total_sales"),
    pl.col("sales_amount").mean().alias("average_sales"),
    pl.col("sales_amount").count().alias("transaction_count"),
    pl.col("quantity").sum().alias("total_quantity"),
)
print(f"Category Sales Summary: {len(category_sales_pl)}")
print(category_sales_pl.head(5))
print(category_sales_pl.head(5).to_pandas().to_markdown())
Category Sales Summary: 5
shape: (5, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category    ┆ total_sales ┆ average_sales ┆ transaction_count ┆ total_quantity β”‚
β”‚ ---         ┆ ---         ┆ ---           ┆ ---               ┆ ---            β”‚
β”‚ str         ┆ f64         ┆ f64           ┆ u32               ┆ i64            β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ═══════════════β•ͺ═══════════════════β•ͺ════════════════║
β”‚ Food        ┆ 12995.57    ┆ 541.482083    ┆ 24                ┆ 115            β”‚
β”‚ Electronics ┆ 11407.45    ┆ 407.408929    ┆ 28                ┆ 147            β”‚
β”‚ Books       ┆ 10154.83    ┆ 441.514348    ┆ 23                ┆ 100            β”‚
β”‚ Home        ┆ 6343.89     ┆ 704.876667    ┆ 9                 ┆ 40             β”‚
β”‚ Clothing    ┆ 7325.31     ┆ 457.831875    ┆ 16                ┆ 62             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| | category | total_sales | average_sales | transaction_count | total_quantity |
|---|---|---|---|---|---|
| 0 | Food | 12995.6 | 541.482 | 24 | 115 |
| 1 | Electronics | 11407.5 | 407.409 | 28 | 147 |
| 2 | Books | 10154.8 | 441.514 | 23 | 100 |
| 3 | Home | 6343.89 | 704.877 | 9 | 40 |
| 4 | Clothing | 7325.31 | 457.832 | 16 | 62 |

We can rename the columns for clarity by simply assigning new names.

In Pandas, we use the .columns attribute of the DataFrame. This makes it easier to understand the results of the aggregation.

It's also possible to rename columns using the .rename() method, which allows for more flexibility in renaming specific columns from within 'dot-method' chains.

Rename columns for clarity
category_sales_renamed_pd: pd.DataFrame = category_sales_pd.copy()
category_sales_renamed_pd.columns = [
    "Total Sales",
    "Average Sales",
    "Transaction Count",
    "Total Quantity",
]
print(f"Renamed Category Sales Summary: {len(category_sales_renamed_pd)}")
print(category_sales_renamed_pd.head(5))
print(category_sales_renamed_pd.head(5).to_markdown())
Renamed Category Sales Summary: 5
             Total Sales  Average Sales  Transaction Count  Total Quantity
category
Books           10154.83     441.514348                 23             100
Clothing         7325.31     457.831875                 16              62
Electronics     11407.45     407.408929                 28             147
Food            12995.57     541.482083                 24             115
Home             6343.89     704.876667                  9              40
| category | Total Sales | Average Sales | Transaction Count | Total Quantity |
|---|---|---|---|---|
| Books | 10154.8 | 441.514 | 23 | 100 |
| Clothing | 7325.31 | 457.832 | 16 | 62 |
| Electronics | 11407.5 | 407.409 | 28 | 147 |
| Food | 12995.6 | 541.482 | 24 | 115 |
| Home | 6343.89 | 704.877 | 9 | 40 |
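
For completeness, the .rename() approach mentioned above can achieve the same outcome inside a method chain. This is a minimal sketch, assuming the same df_sales_pd data (the category_sales_chained_pd name is just illustrative):

# Aggregate with flat names, then rename within the same chain
category_sales_chained_pd: pd.DataFrame = (
    df_sales_pd.groupby("category")
    .agg(total_sales=("sales_amount", "sum"), total_quantity=("quantity", "sum"))
    .rename(columns={"total_sales": "Total Sales", "total_quantity": "Total Quantity"})
)
print(category_sales_chained_pd.head(5))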

In SQL, we can use the AS keyword to rename columns in the SELECT clause. This allows us to provide more descriptive names for the aggregated columns.

In this example, we provide the same aggregation as before, but from within a subquery. Then, in the parent query, we rename the columns for clarity.

Rename columns for clarity
category_sales_renamed_txt: str = """
    SELECT
        category,
        total_sales AS `Total Sales`,
        average_sales AS `Average Sales`,
        transaction_count AS `Transaction Count`,
        total_quantity AS `Total Quantity`
    FROM (
        SELECT
            category,
            SUM(sales_amount) AS total_sales,
            AVG(sales_amount) AS average_sales,
            COUNT(*) AS transaction_count,
            SUM(quantity) AS total_quantity
        FROM sales
        GROUP BY category
    ) AS sales_summary
"""
print(f"Renamed Category Sales Summary: {len(pd.read_sql(category_sales_renamed_txt, conn))}")
print(pd.read_sql(category_sales_renamed_txt + "LIMIT 5", conn))
print(pd.read_sql(category_sales_renamed_txt + "LIMIT 5", conn).to_markdown())
Renamed Category Sales Summary: 5
             Total Sales  Average Sales  Transaction Count  Total Quantity
category
Books           10154.83     441.514348                 23             100
Clothing         7325.31     457.831875                 16              62
Electronics     11407.45     407.408929                 28             147
Food            12995.57     541.482083                 24             115
Home             6343.89     704.876667                  9              40
| category | Total Sales | Average Sales | Transaction Count | Total Quantity |
|---|---|---|---|---|
| Books | 10154.8 | 441.514 | 23 | 100 |
| Clothing | 7325.31 | 457.832 | 16 | 62 |
| Electronics | 11407.5 | 407.409 | 28 | 147 |
| Food | 12995.6 | 541.482 | 24 | 115 |
| Home | 6343.89 | 704.877 | 9 | 40 |

In PySpark, we can use the .withColumnsRenamed() method to rename columns in a DataFrame. This allows us to provide more descriptive names for the aggregated columns.

Rename columns for clarity
category_sales_renamed_ps: psDataFrame = category_sales_ps.withColumnsRenamed(
    {
        "total_sales": "Total Sales",
        "average_sales": "Average Sales",
        "transaction_count": "Transaction Count",
        "total_quantity": "Total Quantity",
    }
)
print(f"Renamed Category Sales Summary: {category_sales_renamed_ps.count()}")
category_sales_renamed_ps.show(5)
print(category_sales_renamed_ps.limit(5).toPandas().to_markdown())
Renamed Category Sales Summary: 5
+-----------+------------------+------------------+-----------------+--------------+
|   category|       Total Sales|     Average Sales|Transaction Count|Total Quantity|
+-----------+------------------+------------------+-----------------+--------------+
|       Home| 6343.889999999999| 704.8766666666666|                9|            40|
|       Food|          12995.57| 541.4820833333333|               24|           115|
|Electronics|11407.449999999999|407.40892857142853|               28|           147|
|   Clothing|7325.3099999999995|457.83187499999997|               16|            62|
|      Books|          10154.83|  441.514347826087|               23|           100|
+-----------+------------------+------------------+-----------------+--------------+
| | category | Total Sales | Average Sales | Transaction Count | Total Quantity |
|---|---|---|---|---|---|
| 0 | Home | 6343.89 | 704.877 | 9 | 40 |
| 1 | Food | 12995.6 | 541.482 | 24 | 115 |
| 2 | Electronics | 11407.4 | 407.409 | 28 | 147 |
| 3 | Clothing | 7325.31 | 457.832 | 16 | 62 |
| 4 | Books | 10154.8 | 441.514 | 23 | 100 |
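
Alternatively, the renaming step can be avoided altogether by aliasing the columns directly at aggregation time. This is a minimal sketch of that option (the category_sales_aliased_ps name is just illustrative):

# Alias the aggregated columns with the final display names up front
category_sales_aliased_ps: psDataFrame = df_sales_ps.groupBy("category").agg(
    F.sum("sales_amount").alias("Total Sales"),
    F.avg("sales_amount").alias("Average Sales"),
    F.count("*").alias("Transaction Count"),
    F.sum("quantity").alias("Total Quantity"),
)
category_sales_aliased_ps.show(5)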

In Polars, we can use the .rename() method to rename columns in a DataFrame. This allows us to provide more descriptive names for the aggregated columns.

Rename columns for clarity
category_sales_renamed_pl: pl.DataFrame = category_sales_pl.rename(
    {
        "total_sales": "Total Sales",
        "average_sales": "Average Sales",
        "transaction_count": "Transaction Count",
        "total_quantity": "Total Quantity",
    }
)
print(f"Renamed Category Sales Summary: {len(category_sales_renamed_pl)}")
print(category_sales_renamed_pl.head(5))
print(category_sales_renamed_pl.head(5).to_pandas().to_markdown())
Renamed Category Sales Summary: 5
shape: (5, 5)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ category    ┆ Total Sales ┆ Average Sales ┆ Transaction Count ┆ Total Quantity β”‚
β”‚ ---         ┆ ---         ┆ ---           ┆ ---               ┆ ---            β”‚
β”‚ str         ┆ f64         ┆ f64           ┆ u32               ┆ i64            β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ═══════════════β•ͺ═══════════════════β•ͺ════════════════║
β”‚ Food        ┆ 12995.57    ┆ 541.482083    ┆ 24                ┆ 115            β”‚
β”‚ Electronics ┆ 11407.45    ┆ 407.408929    ┆ 28                ┆ 147            β”‚
β”‚ Books       ┆ 10154.83    ┆ 441.514348    ┆ 23                ┆ 100            β”‚
β”‚ Home        ┆ 6343.89     ┆ 704.876667    ┆ 9                 ┆ 40             β”‚
β”‚ Clothing    ┆ 7325.31     ┆ 457.831875    ┆ 16                ┆ 62             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| | category | Total Sales | Average Sales | Transaction Count | Total Quantity |
|---|---|---|---|---|---|
| 0 | Food | 12995.6 | 541.482 | 24 | 115 |
| 1 | Electronics | 11407.5 | 407.409 | 28 | 147 |
| 2 | Books | 10154.8 | 441.514 | 23 | 100 |
| 3 | Home | 6343.89 | 704.877 | 9 | 40 |
| 4 | Clothing | 7325.31 | 457.832 | 16 | 62 |

Having aggregated the data, we can now visualize the results using Plotly. This allows us to create interactive visualizations that can help us better understand the data. The simplest way to do this is to use the Plotly Express module, which provides a high-level interface for creating visualizations. Here, we have utilised the px.bar() function to create a bar chart of the total sales by category.

The Plotly px.bar() function is able to receive a Pandas DataFrame directly, making it easy to create visualizations from the aggregated data. However, we first need to convert the index labels into a column so that we can use them as the x-axis in the bar chart. We do this with the .reset_index() method.

Plot the results
fig: go.Figure = px.bar(
    data_frame=category_sales_renamed_pd.reset_index(),
    x="category",
    y="Total Sales",
    title="Total Sales by Category",
    text="Transaction Count",
    labels={"Total Sales": "Total Sales ($)", "category": "Product Category"},
)
fig.write_html("images/pt2_total_sales_by_category_pd.html", include_plotlyjs="cdn", full_html=True)
fig.show()

The Plotly px.bar() function can also receive a Pandas DataFrame, so we can use the results of the SQL query directly. Since the method we are using already returns the group labels in an individual column, we can use that directly in Plotly as the labels for the x-axis.

Plot the results
fig: go.Figure = px.bar(
    data_frame=pd.read_sql(category_sales_renamed_txt, conn),
    x="category",
    y="Total Sales",
    title="Total Sales by Category",
    text="Transaction Count",
    labels={"Total Sales": "Total Sales ($)", "category": "Product Category"},
)
fig.write_html("images/pt2_total_sales_by_category_sql.html", include_plotlyjs="cdn", full_html=True)
fig.show()

Plotly is unfortunately not able to directly receive a PySpark DataFrame, so we need to convert it to a Pandas DataFrame first. This is done using the .toPandas() method, which converts the PySpark DataFrame to a Pandas DataFrame.

Plot the results
fig: go.Figure = px.bar(
    data_frame=category_sales_renamed_ps.toPandas(),
    x="category",
    y="Total Sales",
    title="Total Sales by Category",
    text="Transaction Count",
    labels={"Total Sales": "Total Sales ($)", "category": "Product Category"},
)
fig.write_html("images/pt2_total_sales_by_category_ps.html", include_plotlyjs="cdn", full_html=True)
fig.show()

Plotly is also able to receive a Polars DataFrame, so we can use the results of the aggregation directly.

Plot the results
fig: go.Figure = px.bar(
    data_frame=category_sales_renamed_pl,
    x="category",
    y="Total Sales",
    title="Total Sales by Category",
    text="Transaction Count",
    labels={"Total Sales": "Total Sales ($)", "category": "Product Category"},
)
fig.write_html("images/pt2_total_sales_by_category_pl.html", include_plotlyjs="cdn", full_html=True)
fig.show()

3. JoiningπŸ”—

The third section will demonstrate how to join DataFrames to combine data from different sources. This is a common operation in data analysis, allowing us to enrich our data with additional information.

Here, we will join the sales DataFrame with the product DataFrame to get additional information about the products sold.

In Pandas, we can use the pd.merge() function to combine rows from two or more tables based on a related column between them. In this case, we will join the sales table with the product table on the product_id column.

Join sales with product data
sales_with_product_pd: pd.DataFrame = pd.merge(
    left=df_sales_pd,
    right=df_product_pd[["product_id", "product_name", "price"]],
    on="product_id",
    how="left",
)
print(f"Sales with Product Information: {len(sales_with_product_pd)}")
print(sales_with_product_pd.head(5))
print(sales_with_product_pd.head(5).to_markdown())
Sales with Product Information: 100
        date  customer_id  product_id     category  sales_amount  quantity  product_name   price
0 2023-01-01           52          45         Food        490.76         7    Product 45  493.14
1 2023-01-02           93          41  Electronics        453.94         5    Product 41  193.39
2 2023-01-03           15          29         Home        994.51         5    Product 29   80.07
3 2023-01-04           72          15  Electronics        184.17         7    Product 15  153.67
4 2023-01-05           61          45         Food         27.89         9    Product 45  493.14
| | date | customer_id | product_id | category | sales_amount | quantity | product_name | price |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 |
| 1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 |
| 2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 |
| 3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 |
| 4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 |
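
If you want to check how well the join keys matched, pd.merge() also accepts an indicator argument, which adds a _merge column flagging whether each row found a match in both tables or only in the left one. This is a minimal sketch (the sales_with_product_checked_pd name is just illustrative):

sales_with_product_checked_pd: pd.DataFrame = pd.merge(
    left=df_sales_pd,
    right=df_product_pd[["product_id", "product_name", "price"]],
    on="product_id",
    how="left",
    indicator=True,  # adds a '_merge' column with values such as 'both' or 'left_only'
)
print(sales_with_product_checked_pd["_merge"].value_counts())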

In SQL, we can use the JOIN clause to combine rows from two or more tables based on a related column between them. In this case, we will join the sales table with the product table on the product_id column.

Join sales with product data
sales_with_product_txt: str = """
    SELECT s.*, p.product_name, p.price
    FROM sales s
    LEFT JOIN product p ON s.product_id = p.product_id
"""
print(f"Sales with Product Information: {len(pd.read_sql(sales_with_product_txt, conn))}")
print(pd.read_sql(sales_with_product_txt + "LIMIT 5", conn))
print(pd.read_sql(sales_with_product_txt + "LIMIT 5", conn).to_markdown())
Sales with Product Information: 100
                  date  customer_id  product_id     category  sales_amount  quantity  product_name   price
0  2023-01-01 00:00:00           52          45         Food        490.76         7    Product 45  493.14
1  2023-01-02 00:00:00           93          41  Electronics        453.94         5    Product 41  193.39
2  2023-01-03 00:00:00           15          29         Home        994.51         5    Product 29   80.07
3  2023-01-04 00:00:00           72          15  Electronics        184.17         7    Product 15  153.67
4  2023-01-05 00:00:00           61          45         Food         27.89         9    Product 45  493.14
| | date | customer_id | product_id | category | sales_amount | quantity | product_name | price |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 |
| 1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 |
| 2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 |
| 3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 |
| 4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 |

In PySpark, we can use the .join() method to combine rows from two or more DataFrames based on a related column between them. In this case, we will join the sales DataFrame with the product DataFrame on the product_id column.

Join sales with product data
sales_with_product_ps: psDataFrame = df_sales_ps.join(
    other=df_product_ps.select("product_id", "product_name", "price"),
    on="product_id",
    how="left",
)
print(f"Sales with Product Information: {sales_with_product_ps.count()}")
sales_with_product_ps.show(5)
print(sales_with_product_ps.limit(5).toPandas().to_markdown())
Sales with Product Information: 100
+----------+-------------------+-----------+-----------+------------+--------+------------+------+
|product_id|               date|customer_id|   category|sales_amount|quantity|product_name| price|
+----------+-------------------+-----------+-----------+------------+--------+------------+------+
|         1|2023-01-06 00:00:00|         21|   Clothing|      498.95|       5|   Product 1|257.57|
|         1|2023-01-11 00:00:00|         88|Electronics|      314.98|       9|   Product 1|257.57|
|         1|2023-02-11 00:00:00|         55|       Food|       199.0|       5|   Product 1|257.57|
|         1|2023-04-04 00:00:00|         85|       Food|      146.97|       7|   Product 1|257.57|
|         5|2023-01-21 00:00:00|         64|Electronics|      356.58|       5|   Product 5|200.71|
+----------+-------------------+-----------+-----------+------------+--------+------------+------+
only showing top 5 rows
| | product_id | date | customer_id | category | sales_amount | quantity | product_name | price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-11 00:00:00 | 88 | Electronics | 314.98 | 9 | Product 1 | 257.57 |
| 1 | 1 | 2023-02-11 00:00:00 | 55 | Food | 199 | 5 | Product 1 | 257.57 |
| 2 | 5 | 2023-01-21 00:00:00 | 64 | Electronics | 356.58 | 5 | Product 5 | 200.71 |
| 3 | 5 | 2023-02-18 00:00:00 | 39 | Books | 79.71 | 8 | Product 5 | 200.71 |
| 4 | 6 | 2023-03-23 00:00:00 | 34 | Electronics | 48.45 | 8 | Product 6 | 15.31 |
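
When the right-hand table is small (as a product dimension table usually is), it can be worth hinting to Spark that it should be broadcast to all executors, which avoids shuffling the larger table. This is a minimal sketch using F.broadcast() (the sales_with_product_bc_ps name is just illustrative):

# Broadcast the small product table to every executor before joining
sales_with_product_bc_ps: psDataFrame = df_sales_ps.join(
    other=F.broadcast(df_product_ps.select("product_id", "product_name", "price")),
    on="product_id",
    how="left",
)
sales_with_product_bc_ps.show(5)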

In Polars, we can use the .join() method to combine rows from two or more DataFrames based on a related column between them. In this case, we will join the sales DataFrame with the product DataFrame on the product_id column.

Join sales with product data
sales_with_product_pl: pl.DataFrame = df_sales_pl.join(
    df_product_pl.select(["product_id", "product_name", "price"]),
    on="product_id",
    how="left",
)
print(f"Sales with Product Information: {len(sales_with_product_pl)}")
print(sales_with_product_pl.head(5))
print(sales_with_product_pl.head(5).to_pandas().to_markdown())
Sales with Product Information: 100
shape: (5, 8)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity ┆ product_name ┆ price  β”‚
β”‚ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      ┆ ---          ┆ ---    β”‚
β”‚ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      ┆ str          ┆ f64    β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ══════════════β•ͺ══════════β•ͺ══════════════β•ͺ════════║
β”‚ 2023-01-01 00:00:00 ┆ 52          ┆ 45         ┆ Food        ┆ 490.76       ┆ 7        ┆ Product 45   ┆ 493.14 β”‚
β”‚ 2023-01-02 00:00:00 ┆ 93          ┆ 41         ┆ Electronics ┆ 453.94       ┆ 5        ┆ Product 41   ┆ 193.39 β”‚
β”‚ 2023-01-03 00:00:00 ┆ 15          ┆ 29         ┆ Home        ┆ 994.51       ┆ 5        ┆ Product 29   ┆ 80.07  β”‚
β”‚ 2023-01-04 00:00:00 ┆ 72          ┆ 15         ┆ Electronics ┆ 184.17       ┆ 7        ┆ Product 15   ┆ 153.67 β”‚
β”‚ 2023-01-05 00:00:00 ┆ 61          ┆ 45         ┆ Food        ┆ 27.89        ┆ 9        ┆ Product 45   ┆ 493.14 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| | date | customer_id | product_id | category | sales_amount | quantity | product_name | price |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 |
| 1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 |
| 2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 |
| 3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 |
| 4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 |
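
If the data were larger, the same join could also be expressed through Polars' lazy API, which lets the query optimiser plan the whole pipeline before executing it. This is a minimal sketch (the sales_with_product_lazy_pl name is just illustrative):

sales_with_product_lazy_pl: pl.DataFrame = (
    df_sales_pl.lazy()
    .join(
        df_product_pl.lazy().select(["product_id", "product_name", "price"]),
        on="product_id",
        how="left",
    )
    .collect()  # nothing is executed until .collect() is called
)
print(sales_with_product_lazy_pl.head(5))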

In the next step, we will join the resulting DataFrame with the customer DataFrame to get customer information for each sale. This allows us to create a complete view of the sales data, including product and customer details.

This process is similar to the previous step, but now we will extend the sales_with_product DataFrame to join it with the customer DataFrame on the customer_id column. This will give us a complete view of the sales data, including product and customer details.

Join with customer information to get a complete view
complete_sales_pd: pd.DataFrame = pd.merge(
    sales_with_product_pd,
    df_customer_pd[["customer_id", "customer_name", "city", "state"]],
    on="customer_id",
    how="left",
)
print(f"Complete Sales Data with Customer Information: {len(complete_sales_pd)}")
print(complete_sales_pd.head(5))
print(complete_sales_pd.head(5).to_markdown())
Complete Sales Data with Customer Information: 100
        date  customer_id  product_id     category  sales_amount  quantity  product_name   price  customer_name      city  state
0 2023-01-01           52          45         Food        490.76         7    Product 45  493.14    Customer 52   Phoenix     TX
1 2023-01-02           93          41  Electronics        453.94         5    Product 41  193.39    Customer 93  New York     TX
2 2023-01-03           15          29         Home        994.51         5    Product 29   80.07    Customer 15  New York     CA
3 2023-01-04           72          15  Electronics        184.17         7    Product 15  153.67    Customer 72   Houston     IL
4 2023-01-05           61          45         Food         27.89         9    Product 45  493.14    Customer 61   Phoenix     IL
| | date | customer_id | product_id | category | sales_amount | quantity | product_name | price | customer_name | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 | Customer 52 | Phoenix | TX |
| 1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 | Customer 93 | New York | TX |
| 2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 | Customer 15 | New York | CA |
| 3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 | Customer 72 | Houston | IL |
| 4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 | Customer 61 | Phoenix | IL |

This process is similar to the previous step, but now we will extend the sales_with_product DataFrame to join it with the customer DataFrame on the customer_id column. This will give us a complete view of the sales data, including product and customer details.

Join with customer information to get a complete view
complete_sales_txt: str = """
    SELECT
        s.*,
        p.product_name,
        p.price,
        c.customer_name,
        c.city,
        c.state
    FROM sales s
    LEFT JOIN product p ON s.product_id = p.product_id
    LEFT JOIN customer c ON s.customer_id = c.customer_id
"""
print(f"Complete Sales Data with Customer Information: {len(pd.read_sql(complete_sales_txt, conn))}")
print(pd.read_sql(complete_sales_txt + "LIMIT 5", conn))
print(pd.read_sql(complete_sales_txt + "LIMIT 5", conn).to_markdown())
Complete Sales Data with Customer Information: 100
                  date  customer_id  product_id     category  sales_amount  quantity  product_name   price  customer_name      city  state
0  2023-01-01 00:00:00           52          45         Food        490.76         7    Product 45  493.14    Customer 52   Phoenix     TX
1  2023-01-02 00:00:00           93          41  Electronics        453.94         5    Product 41  193.39    Customer 93  New York     TX
2  2023-01-03 00:00:00           15          29         Home        994.51         5    Product 29   80.07    Customer 15  New York     CA
3  2023-01-04 00:00:00           72          15  Electronics        184.17         7    Product 15  153.67    Customer 72   Houston     IL
4  2023-01-05 00:00:00           61          45         Food         27.89         9    Product 45  493.14    Customer 61   Phoenix     IL
| | date | customer_id | product_id | category | sales_amount | quantity | product_name | price | customer_name | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 | Customer 52 | Phoenix | TX |
| 1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 | Customer 93 | New York | TX |
| 2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 | Customer 15 | New York | CA |
| 3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 | Customer 72 | Houston | IL |
| 4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 | Customer 61 | Phoenix | IL |

This process is similar to the previous step, but now we will extend the sales_with_product DataFrame to join it with the customer DataFrame on the customer_id column. This will give us a complete view of the sales data, including product and customer details.

Join with customer information to get a complete view
complete_sales_ps: psDataFrame = sales_with_product_ps.alias("s").join(
    other=df_customer_ps.select("customer_id", "customer_name", "city", "state").alias("c"),
    on="customer_id",
    how="left",
)
print(f"Complete Sales Data with Customer Information: {complete_sales_ps.count()}")
complete_sales_ps.show(5)
print(complete_sales_ps.limit(5).toPandas().to_markdown())
Complete Sales Data with Customer Information: 100
+-----------+----------+-------------------+-----------+------------+--------+------------+------+-------------+-----------+-----+
|customer_id|product_id|               date|   category|sales_amount|quantity|product_name| price|customer_name|       city|state|
+-----------+----------+-------------------+-----------+------------+--------+------------+------+-------------+-----------+-----+
|         39|         5|2023-02-18 00:00:00|      Books|       79.71|       8|   Product 5|200.71|  Customer 39|Los Angeles|   NY|
|         88|         1|2023-01-11 00:00:00|Electronics|      314.98|       9|   Product 1|257.57|  Customer 88|Los Angeles|   TX|
|         85|         1|2023-04-04 00:00:00|       Food|      146.97|       7|   Product 1|257.57|  Customer 85|    Phoenix|   CA|
|         55|         1|2023-02-11 00:00:00|       Food|       199.0|       5|   Product 1|257.57|  Customer 55|Los Angeles|   NY|
|         21|         1|2023-01-06 00:00:00|   Clothing|      498.95|       5|   Product 1|257.57|  Customer 21|Los Angeles|   IL|
+-----------+----------+-------------------+-----------+------------+--------+------------+------+-------------+-----------+-----+
only showing top 5 rows
| | customer_id | product_id | date | category | sales_amount | quantity | product_name | price | customer_name | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88 | 1 | 2023-01-11 00:00:00 | Electronics | 314.98 | 9 | Product 1 | 257.57 | Customer 88 | Los Angeles | TX |
| 1 | 55 | 1 | 2023-02-11 00:00:00 | Food | 199 | 5 | Product 1 | 257.57 | Customer 55 | Los Angeles | NY |
| 2 | 64 | 5 | 2023-01-21 00:00:00 | Electronics | 356.58 | 5 | Product 5 | 200.71 | Customer 64 | Los Angeles | NY |
| 3 | 39 | 5 | 2023-02-18 00:00:00 | Books | 79.71 | 8 | Product 5 | 200.71 | Customer 39 | Los Angeles | NY |
| 4 | 34 | 6 | 2023-03-23 00:00:00 | Electronics | 48.45 | 8 | Product 6 | 15.31 | Customer 34 | Los Angeles | NY |

This process is similar to the previous step, but now we will extend the sales_with_product DataFrame to join it with the customer DataFrame on the customer_id column. This will give us a complete view of the sales data, including product and customer details.

Join with customer information to get a complete view
complete_sales_pl: pl.DataFrame = sales_with_product_pl.join(
    df_customer_pl.select(["customer_id", "customer_name", "city", "state"]),
    on="customer_id",
    how="left",
)
print(f"Complete Sales Data with Customer Information: {len(complete_sales_pl)}")
print(complete_sales_pl.head(5))
print(complete_sales_pl.head(5).to_pandas().to_markdown())
Complete Sales Data with Customer Information: 100
shape: (5, 11)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity ┆ product_name ┆ price  ┆ customer_name ┆ city     ┆ state β”‚
β”‚ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      ┆ ---          ┆ ---    ┆ ---           ┆ ---      ┆ ---   β”‚
β”‚ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      ┆ str          ┆ f64    ┆ str           ┆ str      ┆ str   β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════β•ͺ═════════════β•ͺ══════════════β•ͺ══════════β•ͺ══════════════β•ͺ════════β•ͺ═══════════════β•ͺ══════════β•ͺ═══════║
β”‚ 2023-01-01 00:00:00 ┆ 52          ┆ 45         ┆ Food        ┆ 490.76       ┆ 7        ┆ Product 45   ┆ 493.14 ┆ Customer 52   ┆ Phoenix  ┆ TX    β”‚
β”‚ 2023-01-02 00:00:00 ┆ 93          ┆ 41         ┆ Electronics ┆ 453.94       ┆ 5        ┆ Product 41   ┆ 193.39 ┆ Customer 93   ┆ New York ┆ TX    β”‚
β”‚ 2023-01-03 00:00:00 ┆ 15          ┆ 29         ┆ Home        ┆ 994.51       ┆ 5        ┆ Product 29   ┆ 80.07  ┆ Customer 15   ┆ New York ┆ CA    β”‚
β”‚ 2023-01-04 00:00:00 ┆ 72          ┆ 15         ┆ Electronics ┆ 184.17       ┆ 7        ┆ Product 15   ┆ 153.67 ┆ Customer 72   ┆ Houston  ┆ IL    β”‚
β”‚ 2023-01-05 00:00:00 ┆ 61          ┆ 45         ┆ Food        ┆ 27.89        ┆ 9        ┆ Product 45   ┆ 493.14 ┆ Customer 61   ┆ Phoenix  ┆ IL    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜
| | date | customer_id | product_id | category | sales_amount | quantity | product_name | price | customer_name | city | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 | Customer 52 | Phoenix | TX |
| 1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 | Customer 93 | New York | TX |
| 2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 | Customer 15 | New York | CA |
| 3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 | Customer 72 | Houston | IL |
| 4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 | Customer 61 | Phoenix | IL |

Once we have the complete sales data, we can calculate the revenue for each sale by multiplying the price and quantity (columns from different tables). We can also compare this calculated revenue with the sales amount to identify any discrepancies.

In Pandas, we can calculate the revenue for each sale by multiplying the price and quantity columns. We can then compare this calculated revenue with the sales_amount column to identify any discrepancies.

Notice here that the syntax for Pandas uses the DataFrame object directly, and we can access the columns using the square-bracket ([]) indexing operator.

Calculate revenue and compare with sales amount
complete_sales_pd["calculated_revenue"] = complete_sales_pd["price"] * complete_sales_pd["quantity"]
complete_sales_pd["price_difference"] = complete_sales_pd["sales_amount"] - complete_sales_pd["calculated_revenue"]
print(f"Complete Sales Data with Calculated Revenue and Price Difference: {len(complete_sales_pd)}")
print(complete_sales_pd[["sales_amount", "price", "quantity", "calculated_revenue", "price_difference"]].head(5))
print(
    complete_sales_pd[["sales_amount", "price", "quantity", "calculated_revenue", "price_difference"]]
    .head(5)
    .to_markdown()
)
Complete Sales Data with Calculated Revenue and Price Difference: 100
   sales_amount   price  quantity  calculated_revenue  price_difference
0        490.76  493.14         7             3451.98          -2961.22
1        453.94  193.39         5              966.95           -513.01
2        994.51   80.07         5              400.35            594.16
3        184.17  153.67         7             1075.69           -891.52
4         27.89  493.14         9             4438.26          -4410.37
| | sales_amount | price | quantity | calculated_revenue | price_difference |
|---|---|---|---|---|---|
| 0 | 490.76 | 493.14 | 7 | 3451.98 | -2961.22 |
| 1 | 453.94 | 193.39 | 5 | 966.95 | -513.01 |
| 2 | 994.51 | 80.07 | 5 | 400.35 | 594.16 |
| 3 | 184.17 | 153.67 | 7 | 1075.69 | -891.52 |
| 4 | 27.89 | 493.14 | 9 | 4438.26 | -4410.37 |
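
If you prefer to keep the calculation inside a 'dot-method' chain rather than assigning columns with the [] operator, the .assign() method can achieve the same result. This is a minimal sketch (the complete_sales_assigned_pd name is just illustrative):

complete_sales_assigned_pd: pd.DataFrame = complete_sales_pd.assign(
    calculated_revenue=lambda df: df["price"] * df["quantity"],
    price_difference=lambda df: df["sales_amount"] - (df["price"] * df["quantity"]),
)
print(complete_sales_assigned_pd[["sales_amount", "calculated_revenue", "price_difference"]].head(5))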

In SQL, we can calculate the revenue for each sale by multiplying the price and quantity columns. We can then compare this calculated revenue with the sales_amount column to identify any discrepancies.

Calculate revenue and compare with sales amount
revenue_comparison_txt: str = """
    SELECT
        s.sales_amount,
        p.price,
        s.quantity,
        (p.price * s.quantity) AS calculated_revenue,
        (s.sales_amount - (p.price * s.quantity)) AS price_difference
    FROM sales s
    LEFT JOIN product p ON s.product_id = p.product_id
"""
print(
    f"Complete Sales Data with Calculated Revenue and Price Difference: {len(pd.read_sql(revenue_comparison_txt, conn))}"
)
print(pd.read_sql(revenue_comparison_txt + "LIMIT 5", conn))
print(pd.read_sql(revenue_comparison_txt + "LIMIT 5", conn).to_markdown())
Complete Sales Data with Calculated Revenue and Price Difference: 100
   sales_amount   price  quantity  calculated_revenue  price_difference
0        490.76  493.14         7             3451.98          -2961.22
1        453.94  193.39         5              966.95           -513.01
2        994.51   80.07         5              400.35            594.16
3        184.17  153.67         7             1075.69           -891.52
4         27.89  493.14         9             4438.26          -4410.37
| | sales_amount | price | quantity | calculated_revenue | price_difference |
|---|---|---|---|---|---|
| 0 | 490.76 | 493.14 | 7 | 3451.98 | -2961.22 |
| 1 | 453.94 | 193.39 | 5 | 966.95 | -513.01 |
| 2 | 994.51 | 80.07 | 5 | 400.35 | 594.16 |
| 3 | 184.17 | 153.67 | 7 | 1075.69 | -891.52 |
| 4 | 27.89 | 493.14 | 9 | 4438.26 | -4410.37 |

In PySpark, we can calculate the revenue for each sale by multiplying the price and quantity columns. We can then compare this calculated revenue with the sales_amount column to identify any discrepancies.

Notice here that the syntax for PySpark uses the .withColumns() method to add multiple new columns to the DataFrame simultaneously. This method takes a dictionary where the keys are the names of the new columns and the values are the expressions used to compute them. The mathematical computation we have shown here uses two different methods:

  1. With the PySpark API, we can use the F.col() function to refer to the columns, and multiply them directly
  2. With the Spark SQL API, we can use the F.expr() function to write a SQL-like expression for the calculation.
Calculate revenue and compare with sales amount
complete_sales_ps: psDataFrame = complete_sales_ps.withColumns(
    {
        "calculated_revenue": F.col("price") * F.col("quantity"),
        "price_difference": F.expr("sales_amount - (price * quantity)"),
    },
).select("sales_amount", "price", "quantity", "calculated_revenue", "price_difference")
print(f"Complete Sales Data with Calculated Revenue and Price Difference: {complete_sales_ps.count()}")
complete_sales_ps.show(5)
print(complete_sales_ps.limit(5).toPandas().to_markdown())
Complete Sales Data with Calculated Revenue and Price Difference: 100
+------------+------+--------+------------------+------------------+
|sales_amount| price|quantity|calculated_revenue|  price_difference|
+------------+------+--------+------------------+------------------+
|       79.71|200.71|       8|           1605.68|          -1525.97|
|      314.98|257.57|       9|           2318.13|          -2003.15|
|      146.97|257.57|       7|           1802.99|          -1656.02|
|       199.0|257.57|       5|           1287.85|          -1088.85|
|      498.95|257.57|       5|           1287.85|-788.8999999999999|
+------------+------+--------+------------------+------------------+
only showing top 5 rows
| | sales_amount | price | quantity | calculated_revenue | price_difference |
|---|---|---|---|---|---|
| 0 | 48.45 | 15.31 | 8 | 122.48 | -74.03 |
| 1 | 79.71 | 200.71 | 8 | 1605.68 | -1525.97 |
| 2 | 314.98 | 257.57 | 9 | 2318.13 | -2003.15 |
| 3 | 199 | 257.57 | 5 | 1287.85 | -1088.85 |
| 4 | 356.58 | 200.71 | 5 | 1003.55 | -646.97 |

In Polars, we can calculate the revenue for each sale by multiplying the price and quantity columns. We can then compare this calculated revenue with the sales_amount column to identify any discrepancies.

Notice here that the syntax for Polars uses the .with_columns() method to add multiple new columns to the DataFrame simultaneously. This method takes a list of expressions, where each expression defines a new column.

Calculate revenue and compare with sales amount
complete_sales_pl: pl.DataFrame = complete_sales_pl.with_columns(
    (pl.col("price") * pl.col("quantity")).alias("calculated_revenue"),
    (pl.col("sales_amount") - (pl.col("price") * pl.col("quantity"))).alias("price_difference"),
)
print(f"Complete Sales Data with Calculated Revenue and Price Difference: {len(complete_sales_pl)}")
print(complete_sales_pl.select(["sales_amount", "price", "quantity", "calculated_revenue", "price_difference"]).head(5))
print(
    complete_sales_pl.select(["sales_amount", "price", "quantity", "calculated_revenue", "price_difference"])
    .head(5)
    .to_pandas()
    .to_markdown()
)
Complete Sales Data with Calculated Revenue and Price Difference: 100
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ sales_amount ┆ price  ┆ quantity ┆ calculated_revenue ┆ price_difference β”‚
β”‚ ---          ┆ ---    ┆ ---      ┆ ---                ┆ ---              β”‚
β”‚ f64          ┆ f64    ┆ i64      ┆ f64                ┆ f64              β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════β•ͺ══════════β•ͺ════════════════════β•ͺ══════════════════║
β”‚ 490.76       ┆ 493.14 ┆ 7        ┆ 3451.98            ┆ -2961.22         β”‚
β”‚ 453.94       ┆ 193.39 ┆ 5        ┆ 966.95             ┆ -513.01          β”‚
β”‚ 994.51       ┆ 80.07  ┆ 5        ┆ 400.35             ┆ 594.16           β”‚
β”‚ 184.17       ┆ 153.67 ┆ 7        ┆ 1075.69            ┆ -891.52          β”‚
β”‚ 27.89        ┆ 493.14 ┆ 9        ┆ 4438.26            ┆ -4410.37         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| | sales_amount | price | quantity | calculated_revenue | price_difference |
|---|---|---|---|---|---|
| 0 | 490.76 | 493.14 | 7 | 3451.98 | -2961.22 |
| 1 | 453.94 | 193.39 | 5 | 966.95 | -513.01 |
| 2 | 994.51 | 80.07 | 5 | 400.35 | 594.16 |
| 3 | 184.17 | 153.67 | 7 | 1075.69 | -891.52 |
| 4 | 27.89 | 493.14 | 9 | 4438.26 | -4410.37 |

4. Window FunctionsπŸ”—

Window functions allow us to perform calculations across a set of rows related to the current row. This is particularly useful for time series data, where we may want to calculate rolling averages, cumulative sums, or other metrics based on previous or subsequent rows.

To understand more about the nuances of the window functions, check out some of these guides:

In this section, we will demonstrate how to use window functions to analyze sales data over time. We will start by converting the date column to a datetime type, which is necessary for time-based calculations. We will then group the data by date and calculate the total sales for each day.

The first thing that we will do is to group the sales data by date and calculate the total sales for each day. This will give us a daily summary of sales, which we can then use to analyze trends over time.

In Pandas, we can use the .groupby() method to group the data by the date column, followed by the .agg() method to calculate the total sales for each day. This will then set us up for further time-based calculations in the following steps.

Time-based window function
df_sales_pd["date"] = pd.to_datetime(df_sales_pd["date"])  # Ensure correct date type
daily_sales_pd: pd.DataFrame = (
    df_sales_pd.groupby(df_sales_pd["date"].dt.date)
    .agg(total_sales=("sales_amount", "sum"))
    .reset_index()
    .sort_values("date")
)
print(f"Daily Sales Summary: {len(daily_sales_pd)}")
print(daily_sales_pd.head(5))
print(daily_sales_pd.head(5).to_markdown())
Daily Sales Summary: 100
         date  total_sales
0  2023-01-01        490.76
1  2023-01-02        453.94
2  2023-01-03        994.51
3  2023-01-04        184.17
4  2023-01-05         27.89
| | date | total_sales |
|---|---|---|
| 0 | 2023-01-01 | 490.76 |
| 1 | 2023-01-02 | 453.94 |
| 2 | 2023-01-03 | 994.51 |
| 3 | 2023-01-04 | 184.17 |
| 4 | 2023-01-05 | 27.89 |

In SQL, we can use the GROUP BY clause to group the data by the date column and then use the SUM() function to calculate the total sales for each day. This will give us a daily summary of sales, which we can then use to analyze trends over time.

Time-based window function
daily_sales_txt: str = """
    SELECT
        date,
        SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY date
    ORDER BY date
"""
print(f"Daily Sales Summary: {len(pd.read_sql(daily_sales_txt, conn))}")
print(pd.read_sql(daily_sales_txt + "LIMIT 5", conn))
print(pd.read_sql(daily_sales_txt + "LIMIT 5", conn).to_markdown())
Daily Sales Summary: 100
                  date  total_sales
0  2023-01-01 00:00:00       490.76
1  2023-01-02 00:00:00       453.94
2  2023-01-03 00:00:00       994.51
3  2023-01-04 00:00:00       184.17
4  2023-01-05 00:00:00        27.89
| | date | total_sales |
|---|---|---|
| 0 | 2023-01-01 00:00:00 | 490.76 |
| 1 | 2023-01-02 00:00:00 | 453.94 |
| 2 | 2023-01-03 00:00:00 | 994.51 |
| 3 | 2023-01-04 00:00:00 | 184.17 |
| 4 | 2023-01-05 00:00:00 | 27.89 |

In PySpark, we can use the .groupBy() method to group the data by the date column, followed by the .agg() method to calculate the total sales for each day. This will then set us up for further time-based calculations in the following steps.

Time-based window function
df_sales_ps: psDataFrame = df_sales_ps.withColumn("date", F.to_date(df_sales_ps["date"]))
daily_sales_ps: psDataFrame = (
    df_sales_ps.groupBy("date").agg(F.sum("sales_amount").alias("total_sales")).orderBy("date")
)
print(f"Daily Sales Summary: {daily_sales_ps.count()}")
daily_sales_ps.show(5)
print(daily_sales_ps.limit(5).toPandas().to_markdown())
Daily Sales Summary: 100
+----------+-----------+
|      date|total_sales|
+----------+-----------+
|2023-01-01|     490.76|
|2023-01-02|     453.94|
|2023-01-03|     994.51|
|2023-01-04|     184.17|
|2023-01-05|      27.89|
+----------+-----------+
only showing top 5 rows
| | date | total_sales |
|---|---|---|
| 0 | 2023-01-01 | 490.76 |
| 1 | 2023-01-02 | 453.94 |
| 2 | 2023-01-03 | 994.51 |
| 3 | 2023-01-04 | 184.17 |
| 4 | 2023-01-05 | 27.89 |

In Polars, we can use the .group_by() method to group the data by the date column, followed by the .agg() method to calculate the total sales for each day. This will then set us up for further time-based calculations in the following steps.

Time-based window function
df_sales_pl: pl.DataFrame = df_sales_pl.with_columns(pl.col("date").cast(pl.Date))
daily_sales_pl: pl.DataFrame = (
    df_sales_pl.group_by("date").agg(pl.col("sales_amount").sum().alias("total_sales")).sort("date")
)
print(f"Daily Sales Summary: {len(daily_sales_pl)}")
print(daily_sales_pl.head(5))
print(daily_sales_pl.head(5).to_pandas().to_markdown())
Daily Sales Summary: 100
shape: (5, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date       ┆ total_sales β”‚
β”‚ ---        ┆ ---         β”‚
β”‚ date       ┆ f64         β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════║
β”‚ 2023-01-01 ┆ 490.76      β”‚
β”‚ 2023-01-02 ┆ 453.94      β”‚
β”‚ 2023-01-03 ┆ 994.51      β”‚
β”‚ 2023-01-04 ┆ 184.17      β”‚
β”‚ 2023-01-05 ┆ 27.89       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
| | date | total_sales |
|---|---|---|
| 0 | 2023-01-01 00:00:00 | 490.76 |
| 1 | 2023-01-02 00:00:00 | 453.94 |
| 2 | 2023-01-03 00:00:00 | 994.51 |
| 3 | 2023-01-04 00:00:00 | 184.17 |
| 4 | 2023-01-05 00:00:00 | 27.89 |

Next, we will calculate the lag and lead values for the sales amount. This allows us to compare the current day's sales with the previous and next days' sales.

In Pandas, we can calculate the lag and lead values for the sales amount by using the .shift() method. This method shifts the values in a column by a specified number of periods, allowing us to create lag and lead columns.

Note that the .shift() method simply shifts the values in the column up or down by a given number of rows, so we can use it to create lag and lead columns. The method does not sort the data itself; it assumes that the DataFrame is already in the correct order. If you need to guarantee the ordering, apply the .sort_values() method before calling .shift(), as shown in the short sketch after the output below.

Calculate lag and lead
daily_sales_pd["previous_day_sales"] = daily_sales_pd["total_sales"].shift(1)
daily_sales_pd["next_day_sales"] = daily_sales_pd["total_sales"].shift(-1)
print(f"Daily Sales with Lag and Lead: {len(daily_sales_pd)}")
print(daily_sales_pd.head(5))
print(daily_sales_pd.head(5).to_markdown())
Daily Sales with Lag and Lead: 100
         date  total_sales  previous_day_sales  next_day_sales
0  2023-01-01       490.76                 NaN          453.94
1  2023-01-02       453.94              490.76          994.51
2  2023-01-03       994.51              453.94          184.17
3  2023-01-04       184.17              994.51           27.89
4  2023-01-05        27.89              184.17          498.95
| | date | total_sales | previous_day_sales | next_day_sales |
|---|---|---|---|---|
| 0 | 2023-01-01 | 490.76 | nan | 453.94 |
| 1 | 2023-01-02 | 453.94 | 490.76 | 994.51 |
| 2 | 2023-01-03 | 994.51 | 453.94 | 184.17 |
| 3 | 2023-01-04 | 184.17 | 994.51 | 27.89 |
| 4 | 2023-01-05 | 27.89 | 184.17 | 498.95 |
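
If the DataFrame may not already be in chronological order, the ordering can be made explicit before shifting, as mentioned above. This is a minimal sketch; daily_sales_pd is already sorted here, so it is purely illustrative (the daily_sales_sorted_pd name is also just illustrative):

# Sort explicitly so that the shift is guaranteed to follow the date order
daily_sales_sorted_pd: pd.DataFrame = daily_sales_pd.sort_values("date").reset_index(drop=True)
daily_sales_sorted_pd["previous_day_sales"] = daily_sales_sorted_pd["total_sales"].shift(1)
daily_sales_sorted_pd["next_day_sales"] = daily_sales_sorted_pd["total_sales"].shift(-1)
print(daily_sales_sorted_pd.head(5))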

In SQL, we can use the LAG() and LEAD() window functions to calculate the lag and lead values for the sales amount. These functions allow us to access data from previous and next rows in the result set without needing to join the table to itself.

The important part to note here is that the LAG() and LEAD() functions are used in conjunction with the OVER clause, which defines the window over which the function operates. In this case, we are ordering by the date column to ensure that the lag and lead values are calculated based on the chronological order of the sales data.

Calculate lag and lead
lag_lead_txt: str = """
    SELECT
        date AS sale_date,
        SUM(sales_amount) AS total_sales,
        LAG(SUM(sales_amount)) OVER (ORDER BY date) AS previous_day_sales,
        LEAD(SUM(sales_amount)) OVER (ORDER BY date) AS next_day_sales
    FROM sales
    GROUP BY date
    ORDER BY date
"""
lag_lead_df_sql: pd.DataFrame = pd.read_sql(lag_lead_txt, conn)
print(f"Daily Sales with Lag and Lead: {len(lag_lead_df_sql)}")
print(pd.read_sql(lag_lead_txt + " LIMIT 5", conn))
print(pd.read_sql(lag_lead_txt + " LIMIT 5", conn).to_markdown())
Daily Sales with Lag and Lead: 100
             sale_date  total_sales  previous_day_sales  next_day_sales
0  2023-01-01 00:00:00        490.76                 NaN          453.94
1  2023-01-02 00:00:00        453.94              490.76          994.51
2  2023-01-03 00:00:00        994.51              453.94          184.17
3  2023-01-04 00:00:00        184.17              994.51           27.89
4  2023-01-05 00:00:00         27.89              184.17          498.95
| | sale_date | total_sales | previous_day_sales | next_day_sales |
|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 490.76 | nan | 453.94 |
| 1 | 2023-01-02 00:00:00 | 453.94 | 490.76 | 994.51 |
| 2 | 2023-01-03 00:00:00 | 994.51 | 453.94 | 184.17 |
| 3 | 2023-01-04 00:00:00 | 184.17 | 994.51 | 27.89 |
| 4 | 2023-01-05 00:00:00 | 27.89 | 184.17 | 498.95 |
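
The LAG() and LEAD() functions also accept an optional offset and default value, which is useful for looking further back than one row, provided the database engine supports window functions (SQLite 3.25+ does, for example). This is a minimal sketch comparing each day with the same day one week earlier (the lag_offset_txt name is just illustrative):

lag_offset_txt: str = """
    SELECT
        date AS sale_date,
        SUM(sales_amount) AS total_sales,
        LAG(SUM(sales_amount), 7, 0) OVER (ORDER BY date) AS sales_one_week_earlier
    FROM sales
    GROUP BY date
    ORDER BY date
"""
print(pd.read_sql(lag_offset_txt + " LIMIT 10", conn))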

In PySpark, we can use the .lag() and .lead() functions to calculate the lag and lead values for the sales amount. These functions are used in conjunction with a window specification that defines the order of the rows.

Note that in PySpark, we can define a Window function in one of two ways: using the PySpark API or using the Spark SQL API.

  1. The PySpark API: The PySpark API allows us to define a window specification using the Window() class, which provides methods to specify the ordering of the rows. We can then use the F.lag() and F.lead() functions to calculate the lag and lead values over a given window on the table.
  2. The Spark SQL API: The Spark SQL API is used through the F.expr() function, which allows us to write SQL-like expressions for the calculations. This is similar to how we would write SQL queries, but it is executed within the PySpark context.

In the example below, we show how the previous day's sales can be calculated using the F.lag() function from the PySpark API, and the next day's sales using the LEAD() function via the Spark SQL API. Functionally, both methods achieve the same result; they simply use slightly different syntax, so which one you use is primarily a matter of preference.

Calculate lag and lead
window_spec_ps: Window = Window.orderBy("date")
daily_sales_ps: psDataFrame = daily_sales_ps.withColumns(
    {
        "previous_day_sales": F.lag("total_sales").over(window_spec_ps),
        "next_day_sales": F.expr("LEAD(total_sales) OVER (ORDER BY date)"),
    },
)
print(f"Daily Sales with Lag and Lead: {daily_sales_ps.count()}")
daily_sales_ps.show(5)
print(daily_sales_ps.limit(5).toPandas().to_markdown())
Daily Sales with Lag and Lead: 100
+----------+-----------+------------------+--------------+
|      date|total_sales|previous_day_sales|next_day_sales|
+----------+-----------+------------------+--------------+
|2023-01-01|     490.76|              NULL|        453.94|
|2023-01-02|     453.94|            490.76|        994.51|
|2023-01-03|     994.51|            453.94|        184.17|
|2023-01-04|     184.17|            994.51|         27.89|
|2023-01-05|      27.89|            184.17|        498.95|
+----------+-----------+------------------+--------------+
only showing top 5 rows
| | date | total_sales | previous_day_sales | next_day_sales |
|---|---|---|---|---|
| 0 | 2023-01-01 | 490.76 | nan | 453.94 |
| 1 | 2023-01-02 | 453.94 | 490.76 | 994.51 |
| 2 | 2023-01-03 | 994.51 | 453.94 | 184.17 |
| 3 | 2023-01-04 | 184.17 | 994.51 | 27.89 |
| 4 | 2023-01-05 | 27.89 | 184.17 | 498.95 |
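
Window specifications can also be partitioned, so that the lag is calculated separately within each group rather than across the whole table. This is a minimal sketch that looks at the previous sale within each category (the category_window_ps and sales_lag_by_category_ps names are just illustrative):

# Partition by category, then order by date within each partition
category_window_ps: Window = Window.partitionBy("category").orderBy("date")
sales_lag_by_category_ps: psDataFrame = df_sales_ps.withColumn(
    "previous_sale_in_category",
    F.lag("sales_amount").over(category_window_ps),
)
sales_lag_by_category_ps.select("date", "category", "sales_amount", "previous_sale_in_category").show(5)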

In Polars, we can use the .shift() method to calculate the lag and lead values for the sales amount. This method shifts the values in a column by a specified number of periods, allowing us to create lag and lead columns.

Note that the .shift() method simply shifts the values in the column up or down by a given number of rows, so we can use it to create lag and lead columns. The method does not sort the data itself; it assumes that the DataFrame is already in the correct order. If you need to guarantee the ordering, apply the .sort() method before calling .shift(), as shown in the short sketch after the output below.

Calculate lag and lead
daily_sales_pl: pl.DataFrame = daily_sales_pl.with_columns(
    pl.col("total_sales").shift(1).alias("previous_day_sales"),
    pl.col("total_sales").shift(-1).alias("next_day_sales"),
)
print(f"Daily Sales with Lag and Lead: {len(daily_sales_pl)}")
print(daily_sales_pl.head(5))
print(daily_sales_pl.head(5).to_pandas().to_markdown())
Daily Sales with Lag and Lead: 100
shape: (5, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date       ┆ total_sales ┆ previous_day_sales ┆ next_day_sales β”‚
β”‚ ---        ┆ ---         ┆ ---                ┆ ---            β”‚
β”‚ date       ┆ f64         ┆ f64                ┆ f64            β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════════════β•ͺ════════════════║
β”‚ 2023-01-01 ┆ 490.76      ┆ null               ┆ 453.94         β”‚
β”‚ 2023-01-02 ┆ 453.94      ┆ 490.76             ┆ 994.51         β”‚
β”‚ 2023-01-03 ┆ 994.51      ┆ 453.94             ┆ 184.17         β”‚
β”‚ 2023-01-04 ┆ 184.17      ┆ 994.51             ┆ 27.89          β”‚
β”‚ 2023-01-05 ┆ 27.89       ┆ 184.17             ┆ 498.95         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
date total_sales previous_day_sales next_day_sales
0 2023-01-01 00:00:00 490.76 nan 453.94
1 2023-01-02 00:00:00 453.94 490.76 994.51
2 2023-01-03 00:00:00 994.51 453.94 184.17
3 2023-01-04 00:00:00 184.17 994.51 27.89
4 2023-01-05 00:00:00 27.89 184.17 498.95

Now, we can calculate the day-over-day change in sales by subtracting the previous day's sales from the current day's sales. Secondly, we can calculate the percentage change in sales using the formula:

((current_day_sales - previous_day_sales) / previous_day_sales) * 100

In Pandas, we can calculate the day-over-day change in sales by subtracting the previous_day_sales column from the total_sales column. This is a fairly straightforward calculation.

We can also calculate the percentage change in sales using the .pct_change() method, which calculates the percentage change between the current and previous values. Under the hood, this method calculates the fractional change using the formula:

((value_current_row - value_previous_row) / value_previous_row)

We therefore need to multiply the result by 100 to express it as a percentage.
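
To make that equivalence concrete, here is a minimal sketch computing the same figure by hand from the lag column (the column name pct_change_manual is purely illustrative):

Manual equivalent of .pct_change()
daily_sales_pd["pct_change_manual"] = (
    (daily_sales_pd["total_sales"] - daily_sales_pd["previous_day_sales"])
    / daily_sales_pd["previous_day_sales"]
) * 100  # matches .pct_change() * 100 row for row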

Calculate day-over-day change
daily_sales_pd["day_over_day_change"] = daily_sales_pd["total_sales"] - daily_sales_pd["previous_day_sales"]
daily_sales_pd["pct_change"] = daily_sales_pd["total_sales"].pct_change() * 100
print(f"Daily Sales with Day-over-Day Change: {len(daily_sales_pd)}")
print(daily_sales_pd.head(5))
print(daily_sales_pd.head(5).to_markdown())
Daily Sales with Day-over-Day Change: 100
         date  total_sales  previous_day_sales  next_day_sales  day_over_day_change  pct_change
0  2023-01-01       490.76                 NaN          453.94                  NaN         NaN
1  2023-01-02       453.94              490.76          994.51               -36.82   -7.502649
2  2023-01-03       994.51              453.94          184.17               540.57  119.084020
3  2023-01-04       184.17              994.51           27.89              -810.34  -81.481333
4  2023-01-05        27.89              184.17          498.95              -156.28  -84.856383
date total_sales previous_day_sales next_day_sales day_over_day_change pct_change
0 2023-01-01 490.76 nan 453.94 nan nan
1 2023-01-02 453.94 490.76 994.51 -36.82 -7.50265
2 2023-01-03 994.51 453.94 184.17 540.57 119.084
3 2023-01-04 184.17 994.51 27.89 -810.34 -81.4813
4 2023-01-05 27.89 184.17 498.95 -156.28 -84.8564

In SQL, we can calculate the day-over-day change in sales by subtracting the previous_day_sales column from the total_sales column. We can also calculate the percentage change in sales using the formula:

((current_day_sales - previous_day_sales) / previous_day_sales) * 100
Calculate day-over-day change
dod_change_txt: str = """
    SELECT
        sale_date,
        total_sales,
        previous_day_sales,
        next_day_sales,
        total_sales - previous_day_sales AS day_over_day_change,
        ((total_sales - previous_day_sales) / previous_day_sales) * 100 AS pct_change
    FROM (
        SELECT
            date AS sale_date,
            SUM(sales_amount) AS total_sales,
            LAG(SUM(sales_amount)) OVER (ORDER BY date) AS previous_day_sales,
            LEAD(SUM(sales_amount)) OVER (ORDER BY date) AS next_day_sales
        FROM sales
        GROUP BY date
    )
    ORDER BY sale_date
"""
dod_change_df_sql: pd.DataFrame = pd.read_sql(dod_change_txt, conn)
print(f"Daily Sales with Day-over-Day Change: {len(dod_change_df_sql)}")
print(dod_change_df_sql.head(5))
print(dod_change_df_sql.head(5).to_markdown())
Daily Sales with Day-over-Day Change: 100
             sale_date  total_sales  previous_day_sales  next_day_sales  day_over_day_change  pct_change
0  2023-01-01 00:00:00       490.76                 NaN          453.94                  NaN         NaN
1  2023-01-02 00:00:00       453.94              490.76          994.51               -36.82   -7.502649
2  2023-01-03 00:00:00       994.51              453.94          184.17               540.57  119.084020
3  2023-01-04 00:00:00       184.17              994.51           27.89              -810.34  -81.481333
4  2023-01-05 00:00:00        27.89              184.17          498.95              -156.28  -84.856383
sale_date total_sales previous_day_sales next_day_sales day_over_day_change pct_change
0 2023-01-01 00:00:00 490.76 nan 453.94 nan nan
1 2023-01-02 00:00:00 453.94 490.76 994.51 -36.82 -7.50265
2 2023-01-03 00:00:00 994.51 453.94 184.17 540.57 119.084
3 2023-01-04 00:00:00 184.17 994.51 27.89 -810.34 -81.4813
4 2023-01-05 00:00:00 27.89 184.17 498.95 -156.28 -84.8564

In PySpark, we can calculate the day-over-day change in sales by subtracting the previous_day_sales column from the total_sales column. We can also calculate the percentage change in sales using the formula:

((current_day_sales - previous_day_sales) / previous_day_sales) * 100

Here, we have again shown these calculations using two different methods: the PySpark API and the Spark SQL API. Realistically, either column can be calculated with either method.

Calculate day-over-day change
daily_sales_ps: psDataFrame = daily_sales_ps.withColumns(
    {
        "day_over_day_change": F.col("total_sales") - F.col("previous_day_sales"),
        "pct_change": F.expr("((total_sales - previous_day_sales) / previous_day_sales) * 100").alias("pct_change"),
    }
)
print(f"Daily Sales with Day-over-Day Change: {daily_sales_ps.count()}")
daily_sales_ps.show(5)
print(daily_sales_ps.limit(5).toPandas().to_markdown())
Daily Sales with Day-over-Day Change: 100
+----------+-----------+------------------+--------------+-------------------+------------------+
|      date|total_sales|previous_day_sales|next_day_sales|day_over_day_change|        pct_change|
+----------+-----------+------------------+--------------+-------------------+------------------+
|2023-01-01|     490.76|              NULL|        453.94|               NULL|              NULL|
|2023-01-02|     453.94|            490.76|        994.51| -36.81999999999999|-7.502648952644875|
|2023-01-03|     994.51|            453.94|        184.17|  540.5699999999999|119.08401991452612|
|2023-01-04|     184.17|            994.51|         27.89|            -810.34|-81.48133251551015|
|2023-01-05|      27.89|            184.17|        498.95|-156.27999999999997|-84.85638268990606|
+----------+-----------+------------------+--------------+-------------------+------------------+
only showing top 5 rows
date total_sales previous_day_sales next_day_sales day_over_day_change pct_change
0 2023-01-01 490.76 nan 453.94 nan nan
1 2023-01-02 453.94 490.76 994.51 -36.82 -7.50265
2 2023-01-03 994.51 453.94 184.17 540.57 119.084
3 2023-01-04 184.17 994.51 27.89 -810.34 -81.4813
4 2023-01-05 27.89 184.17 498.95 -156.28 -84.8564
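
As noted above, either column could also be written entirely with the PySpark API. A minimal sketch of the pct_change calculation without F.expr() (the names daily_sales_api_ps and pct_change_api are purely illustrative):

pct_change via the PySpark API
daily_sales_api_ps: psDataFrame = daily_sales_ps.withColumn(
    "pct_change_api",  # illustrative column name; same values as the F.expr() version above
    (F.col("total_sales") - F.col("previous_day_sales")) / F.col("previous_day_sales") * 100,
)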

In Polars, we can calculate the day-over-day change in sales by subtracting the previous_day_sales column from the total_sales column. We can also calculate the percentage change in sales using the formula:

((current_day_sales - previous_day_sales) / previous_day_sales) * 100
Calculate day-over-day change
daily_sales_pl: pl.DataFrame = daily_sales_pl.with_columns(
    (pl.col("total_sales") - pl.col("previous_day_sales")).alias("day_over_day_change"),
    (pl.col("total_sales") / pl.col("previous_day_sales") - 1).alias("pct_change") * 100,
)
print(f"Daily Sales with Day-over-Day Change: {len(daily_sales_pl)}")
print(daily_sales_pl.head(5))
print(daily_sales_pl.head(5).to_pandas().to_markdown())
Daily Sales with Day-over-Day Change: 100
shape: (5, 6)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date       ┆ total_sales ┆ previous_day_sales ┆ next_day_sales ┆ day_over_day_change ┆ pct_change β”‚
β”‚ ---        ┆ ---         ┆ ---                ┆ ---            ┆ ---                 ┆ ---        β”‚
β”‚ date       ┆ f64         ┆ f64                ┆ f64            ┆ f64                 ┆ f64        β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════════════β•ͺ════════════════β•ͺ═════════════════════β•ͺ════════════║
β”‚ 2023-01-01 ┆ 490.76      ┆ null               ┆ 453.94         ┆ null                ┆ null       β”‚
β”‚ 2023-01-02 ┆ 453.94      ┆ 490.76             ┆ 994.51         ┆ -36.82              ┆ -7.502649  β”‚
β”‚ 2023-01-03 ┆ 994.51      ┆ 453.94             ┆ 184.17         ┆ 540.57              ┆ 119.08402  β”‚
β”‚ 2023-01-04 ┆ 184.17      ┆ 994.51             ┆ 27.89          ┆ -810.34             ┆ -81.481333 β”‚
β”‚ 2023-01-05 ┆ 27.89       ┆ 184.17             ┆ 498.95         ┆ -156.28             ┆ -84.856383 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
date total_sales previous_day_sales next_day_sales day_over_day_change pct_change
0 2023-01-01 00:00:00 490.76 nan 453.94 nan nan
1 2023-01-02 00:00:00 453.94 490.76 994.51 -36.82 -7.50265
2 2023-01-03 00:00:00 994.51 453.94 184.17 540.57 119.084
3 2023-01-04 00:00:00 184.17 994.51 27.89 -810.34 -81.4813
4 2023-01-05 00:00:00 27.89 184.17 498.95 -156.28 -84.8564

Next, we will calculate the rolling average of sales over a 7-day window. Rolling averages (aka moving averages) are useful for smoothing out short-term fluctuations and highlighting longer-term trends in the data. This is particularly useful in time series analysis, where we want to understand the underlying trend in the data without being overly influenced by short-term variations. It is also a very common technique used in financial analysis to analyze stock prices, sales data, and other time series data.

In Pandas, we can calculate the 7-day moving average of sales using the .rolling() method. This method allows us to specify a window size (in this case, window=7 which is 7 days) and calculate the mean over that window. The min_periods parameter ensures that we get a value even if there are fewer than 7 days of data available at the start of the series. Finally, the .mean() method calculates the average over the specified window.

Calculate 7-day moving average
daily_sales_pd["7d_moving_avg"] = daily_sales_pd["total_sales"].rolling(window=7, min_periods=1).mean()
print(f"Daily Sales with 7-Day Moving Average: {len(daily_sales_pd)}")
print(daily_sales_pd.head(5))
print(daily_sales_pd.head(5).to_markdown())
Daily Sales with 7-Day Moving Average: 100
         date  total_sales  previous_day_sales  next_day_sales  day_over_day_change  pct_change  7d_moving_avg
0  2023-01-01       490.76                 NaN          453.94                  NaN         NaN     490.760000
1  2023-01-02       453.94              490.76          994.51               -36.82   -7.502649     472.350000
2  2023-01-03       994.51              453.94          184.17               540.57  119.084020     646.403333
3  2023-01-04       184.17              994.51           27.89              -810.34  -81.481333     530.845000
4  2023-01-05        27.89              184.17          498.95              -156.28  -84.856383     430.254000
date total_sales previous_day_sales next_day_sales day_over_day_change pct_change 7d_moving_avg
0 2023-01-01 490.76 nan 453.94 nan nan 490.76
1 2023-01-02 453.94 490.76 994.51 -36.82 -7.50265 472.35
2 2023-01-03 994.51 453.94 184.17 540.57 119.084 646.403
3 2023-01-04 184.17 994.51 27.89 -810.34 -81.4813 530.845
4 2023-01-05 27.89 184.17 498.95 -156.28 -84.8564 430.254
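
Note that window=7 is a row-based window: it averages the last seven rows, which equals the last seven calendar days only because this dataset has exactly one row per date. If dates could be missing, a time-based window is the safer choice. A minimal sketch, assuming the date column is a datetime dtype and sorted (the column name 7d_moving_avg_time is purely illustrative):

Time-based 7-day moving average
daily_sales_pd["7d_moving_avg_time"] = (
    daily_sales_pd.rolling("7D", on="date", min_periods=1)["total_sales"].mean()
)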

In SQL, we can calculate the 7-day moving average of sales using the AVG() window function with the OVER clause. It is important to include this OVER clause, because it is what the SQL engine uses to determine that it should be a Window function, rather than a regular aggregate function (which is specified using the GROUP BY clause).

Here in our example, there are three different parts to the Window function:

  1. The ORDER BY clause: This specifies the order of the rows in the window. In this case, we are ordering by the sale_date column.
  2. The ROWS BETWEEN clause: This specifies the range of rows to include in the window. In this case, we include the rows from the 6 preceding rows up to and including the current row, giving a total of 7 rows in the window for each row of the result. It is important that you specify the ORDER BY clause before the ROWS BETWEEN clause to ensure that the correct rows are included in the window.
  3. The AVG() function: This calculates the average of the total_sales column over the specified window.

Another peculiarity to note here is the use of the sub-query. The sub-query first calculates the daily sales, including the previous and next day sales and the day-over-day change, because the moving average needs to be calculated over the daily totals rather than the individual sales transactions. The sub-query therefore aggregates the sales data by date before the moving average is calculated, and the only addition in the outer query is the moving average itself.

Calculate 7-day moving average
rolling_avg_txt: str = """
    SELECT
        sale_date,
        total_sales,
        previous_day_sales,
        next_day_sales,
        day_over_day_change,
        pct_change,
        AVG(total_sales) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS "7d_moving_avg"
    FROM (
        SELECT
            date AS sale_date,
            SUM(sales_amount) AS total_sales,
            LAG(SUM(sales_amount)) OVER (ORDER BY date) AS previous_day_sales,
            LEAD(SUM(sales_amount)) OVER (ORDER BY date) AS next_day_sales,
            SUM(sales_amount) - LAG(SUM(sales_amount)) OVER (ORDER BY date) AS day_over_day_change,
            (SUM(sales_amount) / NULLIF(LAG(SUM(sales_amount)) OVER (ORDER BY date), 0) - 1) * 100 AS pct_change
        FROM sales
        GROUP BY date
    ) AS daily_sales
    ORDER BY sale_date
"""
window_df_sql: pd.DataFrame = pd.read_sql(rolling_avg_txt, conn)
print(f"Daily Sales with 7-Day Moving Average: {len(window_df_sql)}")
print(pd.read_sql(rolling_avg_txt + " LIMIT 5", conn))
print(pd.read_sql(rolling_avg_txt + " LIMIT 5", conn).to_markdown())
Daily Sales with 7-Day Moving Average: 100
             sale_date  total_sales  previous_day_sales  next_day_sales  day_over_day_change  pct_change  7d_moving_avg
0  2023-01-01 00:00:00       490.76                 NaN          453.94                  NaN         NaN     490.760000
1  2023-01-02 00:00:00       453.94              490.76          994.51               -36.82   -7.502649     472.350000
2  2023-01-03 00:00:00       994.51              453.94          184.17               540.57  119.084020     646.403333
3  2023-01-04 00:00:00       184.17              994.51           27.89              -810.34  -81.481333     530.845000
4  2023-01-05 00:00:00        27.89              184.17          498.95              -156.28  -84.856383     430.254000
sale_date total_sales previous_day_sales next_day_sales day_over_day_change pct_change 7d_moving_avg
0 2023-01-01 00:00:00 490.76 nan 453.94 nan nan 490.76
1 2023-01-02 00:00:00 453.94 490.76 994.51 -36.82 -7.50265 472.35
2 2023-01-03 00:00:00 994.51 453.94 184.17 540.57 119.084 646.403
3 2023-01-04 00:00:00 184.17 994.51 27.89 -810.34 -81.4813 530.845
4 2023-01-05 00:00:00 27.89 184.17 498.95 -156.28 -84.8564 430.254
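
If the nesting feels hard to read, the same query can also be written with a common table expression (CTE) instead of a sub-query. A minimal sketch of just the moving-average logic, assuming the same sales table and conn connection used throughout (the variable name rolling_avg_cte_txt is purely illustrative):

Calculate 7-day moving average with a CTE
rolling_avg_cte_txt: str = """
    WITH daily_sales AS (
        SELECT
            date AS sale_date,
            SUM(sales_amount) AS total_sales
        FROM sales
        GROUP BY date
    )
    SELECT
        sale_date,
        total_sales,
        AVG(total_sales) OVER (ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS "7d_moving_avg"
    FROM daily_sales
    ORDER BY sale_date
"""
print(pd.read_sql(rolling_avg_cte_txt, conn).head(5))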

In PySpark, we can calculate the 7-day moving average of sales using the F.avg() function in combination with the Window() class. The Window() class allows us to define a window specification for the calculation. We can use the .orderBy() method to specify the order of the rows in the window, and the .rowsBetween() method to specify the range of rows to include in the window. The F.avg() function is then able to calculate the average of the total_sales column over the specified window.

As with many aspects of PySpark, there are multiple ways to achieve the same result. In this case, we can use either the F.avg() function with the Window() class, or we can use the SQL expression syntax with the F.expr() function. Both methods will yield the same result.

Calculate 7-day moving average
daily_sales_ps: psDataFrame = daily_sales_ps.withColumns(
    {
        "7d_moving_avg": F.avg("total_sales").over(Window.orderBy("date").rowsBetween(-6, 0)),
        "7d_rolling_avg": F.expr("AVG(total_sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)"),
    }
)
print(f"Daily Sales with 7-Day Moving Average: {daily_sales_ps.count()}")
daily_sales_ps.show(5)
print(daily_sales_ps.limit(5).toPandas().to_markdown())
Daily Sales with 7-Day Moving Average: 100
+----------+-----------+------------------+--------------+-------------------+------------------+-----------------+-----------------+
|      date|total_sales|previous_day_sales|next_day_sales|day_over_day_change|        pct_change|    7d_moving_avg|   7d_rolling_avg|
+----------+-----------+------------------+--------------+-------------------+------------------+-----------------+-----------------+
|2023-01-01|     490.76|              NULL|        453.94|               NULL|              NULL|           490.76|           490.76|
|2023-01-02|     453.94|            490.76|        994.51| -36.81999999999999|-7.502648952644875|           472.35|           472.35|
|2023-01-03|     994.51|            453.94|        184.17|  540.5699999999999|119.08401991452612|646.4033333333333|646.4033333333333|
|2023-01-04|     184.17|            994.51|         27.89|            -810.34|-81.48133251551015|          530.845|          530.845|
|2023-01-05|      27.89|            184.17|        498.95|-156.27999999999997|-84.85638268990606|          430.254|          430.254|
+----------+-----------+------------------+--------------+-------------------+------------------+-----------------+-----------------+
only showing top 5 rows
date total_sales previous_day_sales next_day_sales day_over_day_change pct_change 7d_moving_avg 7d_rolling_avg
0 2023-01-01 490.76 nan 453.94 nan nan 490.76 490.76
1 2023-01-02 453.94 490.76 994.51 -36.82 -7.50265 472.35 472.35
2 2023-01-03 994.51 453.94 184.17 540.57 119.084 646.403 646.403
3 2023-01-04 184.17 994.51 27.89 -810.34 -81.4813 530.845 530.845
4 2023-01-05 27.89 184.17 498.95 -156.28 -84.8564 430.254 430.254
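
As with Pandas, .rowsBetween(-6, 0) is a row-count frame rather than a true calendar window. If the data could have missing dates, a range-based frame over the date (cast to seconds) keeps the window at seven calendar days. A minimal sketch, assuming date is a date or timestamp column (the names day_window and 7d_moving_avg_range are purely illustrative):

Range-based 7-day window
day_window = Window.orderBy(F.col("date").cast("timestamp").cast("long")).rangeBetween(-6 * 86400, 0)
daily_sales_range_ps: psDataFrame = daily_sales_ps.withColumn(
    "7d_moving_avg_range",  # illustrative column name; covers the 6 previous days plus the current day
    F.avg("total_sales").over(day_window),
)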

In Polars, we can calculate the 7-day moving average of sales using the .rolling_mean() method. This method allows us to specify a window size (in this case, window_size=7 which is 7 days) and calculate the mean over that window. The min_samples=1 parameter ensures that we get a value even if there are fewer than 7 days of data available at the start of the series.

Calculate 7-day moving average
daily_sales_pl: pl.DataFrame = daily_sales_pl.with_columns(
    pl.col("total_sales").rolling_mean(window_size=7, min_samples=1).alias("7d_moving_avg"),
)
print(f"Daily Sales with 7-Day Moving Average: {len(daily_sales_pl)}")
print(daily_sales_pl.head(5))
print(daily_sales_pl.head(5).to_pandas().to_markdown())
Daily Sales with 7-Day Moving Average: 100
shape: (5, 7)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ date       ┆ total_sales ┆ previous_day_sales ┆ next_day_sales ┆ day_over_day_change ┆ pct_change ┆ 7d_moving_avg β”‚
β”‚ ---        ┆ ---         ┆ ---                ┆ ---            ┆ ---                 ┆ ---        ┆ ---           β”‚
β”‚ date       ┆ f64         ┆ f64                ┆ f64            ┆ f64                 ┆ f64        ┆ f64           β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•ͺ═════════════β•ͺ════════════════════β•ͺ════════════════β•ͺ═════════════════════β•ͺ════════════β•ͺ═══════════════║
β”‚ 2023-01-01 ┆ 490.76      ┆ null               ┆ 453.94         ┆ null                ┆ null       ┆ 490.76        β”‚
β”‚ 2023-01-02 ┆ 453.94      ┆ 490.76             ┆ 994.51         ┆ -36.82              ┆ -7.502649  ┆ 472.35        β”‚
β”‚ 2023-01-03 ┆ 994.51      ┆ 453.94             ┆ 184.17         ┆ 540.57              ┆ 119.08402  ┆ 646.403333    β”‚
β”‚ 2023-01-04 ┆ 184.17      ┆ 994.51             ┆ 27.89          ┆ -810.34             ┆ -81.481333 ┆ 530.845       β”‚
β”‚ 2023-01-05 ┆ 27.89       ┆ 184.17             ┆ 498.95         ┆ -156.28             ┆ -84.856383 ┆ 430.254       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
date total_sales previous_day_sales next_day_sales day_over_day_change pct_change 7d_moving_avg
0 2023-01-01 00:00:00 490.76 nan 453.94 nan nan 490.76
1 2023-01-02 00:00:00 453.94 490.76 994.51 -36.82 -7.50265 472.35
2 2023-01-03 00:00:00 994.51 453.94 184.17 540.57 119.084 646.403
3 2023-01-04 00:00:00 184.17 994.51 27.89 -810.34 -81.4813 530.845
4 2023-01-05 00:00:00 27.89 184.17 498.95 -156.28 -84.8564 430.254

Finally, we can visualize the daily sales data along with the 7-day moving average using Plotly. This allows us to see the trends in sales over time and how the moving average smooths out the fluctuations in daily sales.

For this, we will again utilise Plotly to create an interactive line chart that displays both the daily sales and the 7-day moving average. The chart will have the date on the x-axis and the sales amount on the y-axis, with two lines representing the daily sales and the moving average.

The graph will be instantiated using the go.Figure() class, and using the .add_trace() method we will add two traces to the figure: one for the daily sales and one for the 7-day moving average. The go.Scatter() class is used to create the line traces, by defining mode="lines" to display the data as a line chart.

Finally, we will use the .update_layout() method to set the titles for the chart, and the position of the legend.

Plotly is easily able to handle Pandas DataFrames, so we can pass the columns from the DataFrame directly to create the traces for the daily sales and the 7-day moving average.

Plot results
fig: go.Figure = (
    go.Figure()
    .add_trace(
        go.Scatter(
            x=daily_sales_pd["date"],
            y=daily_sales_pd["total_sales"],
            mode="lines",
            name="Daily Sales",
        )
    )
    .add_trace(
        go.Scatter(
            x=daily_sales_pd["date"],
            y=daily_sales_pd["7d_moving_avg"],
            mode="lines",
            name="7-Day Moving Average",
            line_width=3,
        ),
    )
    .update_layout(
        title="Daily Sales with 7-Day Moving Average",
        xaxis_title="Date",
        yaxis_title="Sales Amount ($)",
        legend_orientation="h",
        legend_yanchor="bottom",
        legend_y=1,
    )
)
fig.write_html("images/pt4_daily_sales_with_7d_avg_pd.html", include_plotlyjs="cdn", full_html=True)
fig.show()

Plotly is easily able to handle Pandas DataFrames, so we can pass the columns from the DataFrame directly to create the traces for the daily sales and the 7-day moving average.

Plot results
fig: go.Figure = (
    go.Figure()
    .add_trace(
        go.Scatter(
            x=window_df_sql["sale_date"],
            y=window_df_sql["total_sales"],
            mode="lines",
            name="Daily Sales",
        )
    )
    .add_trace(
        go.Scatter(
            x=window_df_sql["sale_date"],
            y=window_df_sql["7d_moving_avg"],
            mode="lines",
            name="7-Day Moving Average",
            line_width=3,
        )
    )
    .update_layout(
        title="Daily Sales with 7-Day Moving Average",
        xaxis_title="Date",
        yaxis_title="Sales Amount ($)",
        legend_orientation="h",
        legend_yanchor="bottom",
        legend_y=1,
    )
)
fig.write_html("images/pt4_daily_sales_with_7d_avg_sql.html", include_plotlyjs="cdn", full_html=True)
fig.show()

Plotly is not able to interpret PySpark DataFrames directly, so we need to convert the PySpark DataFrame to a Pandas DataFrame before plotting. This can be done using the .toPandas() method; here we convert once and reuse the result, since each call to .toPandas() triggers a full Spark job. We can then pass the columns from the resulting Pandas DataFrame to create the traces for the daily sales and the 7-day moving average.

Plot results
daily_sales_ps_pd: pd.DataFrame = daily_sales_ps.toPandas()  # convert once and reuse for both traces
fig: go.Figure = (
    go.Figure()
    .add_trace(
        go.Scatter(
            x=daily_sales_ps_pd["date"],
            y=daily_sales_ps_pd["total_sales"],
            mode="lines",
            name="Daily Sales",
        )
    )
    .add_trace(
        go.Scatter(
            x=daily_sales_ps_pd["date"],
            y=daily_sales_ps_pd["7d_moving_avg"],
            mode="lines",
            name="7-Day Moving Average",
            line_width=3,
        ),
    )
    .update_layout(
        title="Daily Sales with 7-Day Moving Average",
        xaxis_title="Date",
        yaxis_title="Sales Amount ($)",
        legend_orientation="h",
        legend_yanchor="bottom",
        legend_y=1,
    )
)
fig.write_html("images/pt4_daily_sales_with_7d_avg_ps.html", include_plotlyjs="cdn", full_html=True)
fig.show()

Plotly is easily able to handle Polars DataFrames, so we can pass the columns from the DataFrame directly to create the traces for the daily sales and the 7-day moving average.

Plot results
fig: go.Figure = (
    go.Figure()
    .add_trace(
        go.Scatter(
            x=daily_sales_pl["date"],
            y=daily_sales_pl["total_sales"],
            mode="lines",
            name="Daily Sales",
        )
    )
    .add_trace(
        go.Scatter(
            x=daily_sales_pl["date"],
            y=daily_sales_pl["7d_moving_avg"],
            mode="lines",
            name="7-Day Moving Average",
            line_width=3,
        )
    )
    .update_layout(
        title="Daily Sales with 7-Day Moving Average",
        xaxis_title="Date",
        yaxis_title="Sales Amount ($)",
        legend_orientation="h",
        legend_yanchor="bottom",
        legend_y=1,
    )
)
fig.write_html("images/pt4_daily_sales_with_7d_avg_pl.html", include_plotlyjs="cdn", full_html=True)
fig.show()

5. Ranking and PartitioningπŸ”—

The fifth section will demonstrate how to rank and partition data. This is useful for identifying top performers, such as the highest spending customers or the most popular products.

In Pandas, we can use the .rank() method to rank values in a DataFrame. This method allows us to specify the ranking method (e.g., dense, average, min, max) and whether to rank in ascending or descending order.

Rank customers by total spending
customer_spending_pd: pd.DataFrame = df_sales_pd.groupby("customer_id").agg(total_spending=("sales_amount", "sum"))
customer_spending_pd["rank"] = customer_spending_pd["total_spending"].rank(method="dense", ascending=False)
customer_spending_pd: pd.DataFrame = customer_spending_pd.sort_values("rank").reset_index()
print(f"Customer Spending Summary: {len(customer_spending_pd)}")
print(customer_spending_pd.head(5))
print(customer_spending_pd.head(5).to_markdown())
Customer Spending Summary: 61
   customer_id  total_spending  rank
0           15         2297.55   1.0
1            4         2237.49   2.0
2           62         2177.35   3.0
3           60         2086.09   4.0
4           21         2016.95   5.0
customer_id total_spending rank
0 15 2297.55 1
1 4 2237.49 2
2 62 2177.35 3
3 60 2086.09 4
4 21 2016.95 5
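
The method argument only matters when there are ties. A minimal, self-contained sketch on a toy Series (hypothetical values) showing how "dense" differs from "min":

Ranking methods on tied values
tied = pd.Series([100, 100, 90, 80])
print(tied.rank(method="dense", ascending=False).tolist())  # [1.0, 1.0, 2.0, 3.0] (no gaps after ties)
print(tied.rank(method="min", ascending=False).tolist())    # [1.0, 1.0, 3.0, 4.0] (rank 2 is skipped)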

In SQL, we can use the DENSE_RANK() window function to rank values in a query. This function assigns a rank to each row within a partition of a given result set, with no gaps in the ranking values. Note that this function can only be used in conjunction with the OVER clause, which defines it as a Window function. The ORDER BY clause within the OVER clause specifies the order in which the rows are ranked.

Rank customers by total spending
customer_spending_txt: str = """
    SELECT
        customer_id,
        SUM(sales_amount) AS total_spending,
        DENSE_RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS rank
    FROM sales
    GROUP BY customer_id
    ORDER BY rank
"""
customer_spending_sql: pd.DataFrame = pd.read_sql(customer_spending_txt, conn)
print(f"Customer Spending Summary: {len(customer_spending_sql)}")
print(customer_spending_sql.head(5))
print(customer_spending_sql.head(5).to_markdown())
Customer Spending Summary: 61
   customer_id  total_spending  rank
0           15         2297.55     1
1            4         2237.49     2
2           62         2177.35     3
3           60         2086.09     4
4           21         2016.95     5
customer_id total_spending rank
0 15 2297.55 1
1 4 2237.49 2
2 62 2177.35 3
3 60 2086.09 4
4 21 2016.95 5

In PySpark, we can use the F.dense_rank() function in combination with the Window() class to rank values in a DataFrame. The Window() class allows us to define a window specification for the calculation, and the F.dense_rank() function calculates the dense rank of each row within that window.

Rank customers by total spending
customer_spending_ps: psDataFrame = (
    df_sales_ps.groupBy("customer_id")
    .agg(F.sum("sales_amount").alias("total_spending"))
    .withColumn("rank", F.dense_rank().over(Window.orderBy(F.desc("total_spending"))))
    .orderBy("rank")
)
print(f"Customer Spending Summary: {customer_spending_ps.count()}")
customer_spending_ps.show(5)
print(customer_spending_ps.limit(5).toPandas().to_markdown())
Customer Spending Summary: 61
+-----------+------------------+----+
|customer_id|    total_spending|rank|
+-----------+------------------+----+
|         15|           2297.55|   1|
|          4|           2237.49|   2|
|         62|           2177.35|   3|
|         60|2086.0899999999997|   4|
|         21|           2016.95|   5|
+-----------+------------------+----+
customer_id total_spending rank
0 15 2297.55 1
1 4 2237.49 2
2 62 2177.35 3
3 60 2086.09 4
4 21 2016.95 5

In Polars, we can use the .rank() method to rank values in a DataFrame. This method allows us to specify the ranking method (e.g., dense, average, min, max) and whether to rank in ascending or descending order.

Rank customers by total spending
customer_spending_pl: pl.DataFrame = (
    df_sales_pl.group_by("customer_id")
    .agg(pl.col("sales_amount").sum().alias("total_spending"))
    .with_columns(
        pl.col("total_spending").rank(method="dense", descending=True).alias("rank"),
    )
    .sort("rank")
)
print(f"Customer Spending Summary: {len(customer_spending_pl)}")
print(customer_spending_pl.head(5))
print(customer_spending_pl.head(5).to_pandas().to_markdown())
Customer Spending Summary: 61
shape: (5, 3)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ customer_id ┆ total_spending ┆ rank β”‚
β”‚ ---         ┆ ---            ┆ ---  β”‚
β”‚ i64         ┆ f64            ┆ u32  β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════════β•ͺ══════║
β”‚ 15          ┆ 2297.55        ┆ 1    β”‚
β”‚ 4           ┆ 2237.49        ┆ 2    β”‚
β”‚ 62          ┆ 2177.35        ┆ 3    β”‚
β”‚ 60          ┆ 2086.09        ┆ 4    β”‚
β”‚ 21          ┆ 2016.95        ┆ 5    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
customer_id total_spending rank
0 15 2297.55 1
1 4 2237.49 2
2 62 2177.35 3
3 60 2086.09 4
4 21 2016.95 5

Next, we will rank products based on the quantity sold, partitioned by the product category. This will help us identify the most popular products within each category.

In Pandas, it is first necessary to group the sales data by category and product_id, then aggregate with the .sum() of the quantity to find the total_quantity sold for each product. After that, we can use the .rank() method to rank the products within each category based on the total quantity sold.

It is important to note here that we are calling the .rank() method from within an .assign() method. This is a common pattern in Pandas for creating new columns based on columns already existing on the DataFrame without mutating the original, since .assign() returns a new DataFrame. Here, we use the .groupby() method to group the DataFrame by category, then apply the .rank() method to the total_quantity column within each category. In this way, we produce a partitioned ranking of products by quantity sold within each category.

Rank products by quantity sold, by category
product_popularity_pd: pd.DataFrame = (
    df_sales_pd.groupby(["category", "product_id"])
    .agg(
        total_quantity=("quantity", "sum"),
    )
    .reset_index()
    .assign(
        rank=lambda x: x.groupby("category")["total_quantity"].rank(method="dense", ascending=False),
    )
    .sort_values(["rank", "category"])
    .reset_index(drop=True)
)
print(f"Product Popularity Summary: {len(product_popularity_pd)}")
print(product_popularity_pd.head(10))
print(product_popularity_pd.head(10).to_markdown())
Product Popularity Summary: 78
      category  product_id  total_quantity  rank
0        Books          11              14   1.0
1     Clothing           7               9   1.0
2  Electronics          37              16   1.0
3         Food          45              34   1.0
4         Home           3              10   1.0
5        Books          28               9   2.0
6     Clothing          35               8   2.0
7  Electronics          35              11   2.0
8         Food           1              16   2.0
9         Home           9               5   2.0
category product_id total_quantity rank
0 Books 11 14 1
1 Clothing 7 9 1
2 Electronics 37 16 1
3 Food 45 34 1
4 Home 3 10 1
5 Books 28 9 2
6 Clothing 35 8 2
7 Electronics 35 11 2
8 Food 1 16 2
9 Home 9 5 2

In SQL, we can use the RANK() window function to rank products within each category based on the total quantity sold. The PARTITION BY clause allows us to partition the data by category, and the ORDER BY clause specifies the order in which the rows are ranked within each partition.

Rank products by quantity sold, by category
product_popularity_txt: str = """
    SELECT
        category,
        product_id,
        SUM(quantity) AS total_quantity,
        RANK() OVER (PARTITION BY category ORDER BY SUM(quantity) DESC) AS rank
    FROM sales
    GROUP BY category, product_id
    ORDER BY rank
"""
print(f"Product Popularity: {len(pd.read_sql(product_popularity_txt, conn))}")
print(pd.read_sql(product_popularity_txt + "LIMIT 10", conn))
print(pd.read_sql(product_popularity_txt + "LIMIT 10", conn).to_markdown())
Product Popularity: 78
      category  product_id  total_quantity  rank
0        Books          11              14     1
1     Clothing           7               9     1
2  Electronics          37              16     1
3         Food          45              34     1
4         Home           3              10     1
5        Books          28               9     2
6     Clothing          35               8     2
7  Electronics          35              11     2
8         Food           1              16     2
9         Home          48               5     2
category product_id total_quantity rank
0 Books 11 14 1
1 Clothing 7 9 1
2 Electronics 37 16 1
3 Food 45 34 1
4 Home 3 10 1
5 Books 28 9 2
6 Clothing 35 8 2
7 Electronics 35 11 2
8 Food 1 16 2
9 Home 48 5 2

In PySpark, we can use the F.dense_rank() function in combination with the Window() class to rank products within each category based on the total quantity sold. We define the partitioning by using the .partitionBy() method and passing in the "category" column, and the ordering by using the .orderBy() method and passing in the "total_quantity" column wrapped in F.desc() to order the products by total quantity sold in descending order.

Here, we have also provided an alternative way to define the rank by using the Spark SQL method. The outcome is the same; it is simply written as a SQL-like expression.

Rank products by quantity sold, by category
product_popularity_ps: psDataFrame = (
    df_sales_ps.groupBy("category", "product_id")
    .agg(F.sum("quantity").alias("total_quantity"))
    .withColumns(
        {
            "rank_p": F.dense_rank().over(Window.partitionBy("category").orderBy(F.desc("total_quantity"))),
            "rank_s": F.expr("DENSE_RANK() OVER (PARTITION BY category ORDER BY total_quantity DESC)"),
        }
    )
    .orderBy("rank_p")
)
print(f"Product Popularity Summary: {product_popularity_ps.count()}")
product_popularity_ps.show(10)
print(product_popularity_ps.limit(10).toPandas().to_markdown())
Product Popularity Summary: 78
+-----------+----------+--------------+------+------+
|   category|product_id|total_quantity|rank_p|rank_s|
+-----------+----------+--------------+------+------+
|   Clothing|         7|             9|     1|     1|
|      Books|        11|            14|     1|     1|
|Electronics|        37|            16|     1|     1|
|       Food|        45|            34|     1|     1|
|       Home|         3|            10|     1|     1|
|      Books|        28|             9|     2|     2|
|Electronics|        35|            11|     2|     2|
|       Home|        29|             5|     2|     2|
|       Home|        48|             5|     2|     2|
|       Home|         9|             5|     2|     2|
+-----------+----------+--------------+------+------+
only showing top 10 rows
category product_id total_quantity rank_p rank_s
0 Clothing 7 9 1 1
1 Books 11 14 1 1
2 Electronics 37 16 1 1
3 Food 45 34 1 1
4 Home 3 10 1 1
5 Books 28 9 2 2
6 Clothing 35 8 2 2
7 Electronics 35 11 2 2
8 Food 1 16 2 2
9 Home 29 5 2 2

In Polars, we can use the .rank() method to rank products within each category based on the total quantity sold. We first group the sales data by category and product_id, then aggregate with the .sum() of the quantity to find the total_quantity sold for each product. After that, we use the .rank() method to rank the products within each category, and define the partitioning by using the .over() method and passing in partition_by="category".

Rank products by quantity sold, by category
product_popularity_pl: pl.DataFrame = (
    df_sales_pl.group_by("category", "product_id")
    .agg(pl.sum("quantity").alias("total_quantity"))
    .with_columns(
        pl.col("total_quantity").rank(method="dense", descending=True).over(partition_by="category").alias("rank")
    )
    .sort("rank", "category")
)
print(f"Product Popularity Summary: {len(product_popularity_pl)}")
print(product_popularity_pl.head(10))
print(product_popularity_pl.head(10).to_pandas().to_markdown())
Product Popularity Summary: 78
shape: (10, 4)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
β”‚ category    ┆ product_id ┆ total_quantity ┆ rank β”‚
β”‚ ---         ┆ ---        ┆ ---            ┆ ---  β”‚
β”‚ str         ┆ i64        ┆ i64            ┆ u32  β”‚
β•žβ•β•β•β•β•β•β•β•β•β•β•β•β•β•ͺ════════════β•ͺ════════════════β•ͺ══════║
β”‚ Books       ┆ 11         ┆ 14             ┆ 1    β”‚
β”‚ Clothing    ┆ 7          ┆ 9              ┆ 1    β”‚
β”‚ Electronics ┆ 37         ┆ 16             ┆ 1    β”‚
β”‚ Food        ┆ 45         ┆ 34             ┆ 1    β”‚
β”‚ Home        ┆ 3          ┆ 10             ┆ 1    β”‚
β”‚ Books       ┆ 28         ┆ 9              ┆ 2    β”‚
β”‚ Clothing    ┆ 35         ┆ 8              ┆ 2    β”‚
β”‚ Electronics ┆ 35         ┆ 11             ┆ 2    β”‚
β”‚ Food        ┆ 1          ┆ 16             ┆ 2    β”‚
β”‚ Home        ┆ 48         ┆ 5              ┆ 2    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜
category product_id total_quantity rank
0 Books 11 14 1
1 Clothing 7 9 1
2 Electronics 37 16 1
3 Food 45 34 1
4 Home 3 10 1
5 Books 28 9 2
6 Clothing 35 8 2
7 Electronics 35 11 2
8 Food 1 16 2
9 Home 48 5 2

ConclusionπŸ”—

This comprehensive guide has demonstrated how to perform essential data querying and manipulation operations across four powerful tools: Pandas, SQL, PySpark, and Polars. Each tool brings unique advantages to the data processing landscape, and understanding their strengths helps you choose the right tool for your specific use case.

Tool Comparison and Use CasesπŸ”—

  • Pandas has an extensive ecosystem, making it ideal for:

    • Small to medium datasets (up to millions of rows)
    • Interactive data exploration and visualization
    • Data preprocessing for machine learning workflows
    • Quick statistical analysis and reporting

    Pandas remains the go-to choice for exploratory data analysis and rapid prototyping.

  • SQL excels in:

    • Working with relational databases and data warehouses
    • Complex joins and subqueries
    • Declarative data transformations
    • Team environments where SQL knowledge is widespread

    SQL provides the universal language of data, with unmatched expressiveness for complex queries.

  • PySpark is great when you need:

    • Processing datasets that don't fit in memory (terabytes or larger)
    • Distributed computing across clusters
    • Integration with Hadoop ecosystem components
    • Scalable machine learning with MLlib

    PySpark unlocks the power of distributed computing for big data scenarios.

  • Polars is particularly valuable for:

    • Large datasets that require fast processing (gigabytes to small terabytes)
    • Performance-critical applications
    • Memory-constrained environments
    • Lazy evaluation and query optimization

    Polars emerges as the high-performance alternative with excellent memory efficiency.

Key Techniques CoveredπŸ”—

Throughout this guide, we've explored fundamental data manipulation patterns that remain consistent across all tools:

  1. Data Filtering and Selection - Essential for subsetting data based on conditions
  2. Grouping and Aggregation - Critical for summarizing data by categories
  3. Joining and Merging - Necessary for combining data from multiple sources
  4. Window Functions - Powerful for time-series analysis and advanced calculations
  5. Ranking and Partitioning - Useful for identifying top performers and comparative analysis

Best Practices and RecommendationsπŸ”—

When working with any of these tools, consider these best practices:

  • Start with the right tool: Match your tool choice to your data size, infrastructure, and team expertise
  • Understand your data: Always examine data types, null values, and distributions before processing
  • Optimize for readability: Write clear, well-documented code that your future self and teammates can understand
  • Profile performance: Measure execution time and memory usage, especially for large datasets
  • Leverage built-in optimizations: Use vectorized operations, avoid loops, and take advantage of lazy evaluation where available

Moving ForwardπŸ”—

The data landscape continues to evolve rapidly, with new tools and techniques emerging regularly. The fundamental concepts demonstrated in this guideβ€”filtering, grouping, joining, and analytical functionsβ€”remain constant across platforms. By mastering these core concepts, you'll be well-equipped to adapt to new tools and technologies as they arise.

Whether you're analyzing customer behavior, processing sensor data, or building machine learning models, the techniques in this guide provide a solid foundation for effective data manipulation. Remember that the best tool is often the one that best fits your specific requirements for performance, scalability, and team capabilities.

Continue practicing with real datasets, explore advanced features of each tool, and stay curious about emerging technologies in the data processing ecosystem. The skills you've learned here will serve as building blocks for increasingly sophisticated data analysis and engineering tasks.