Data Querying for Data Scientists
A Comprehensive Guide to Using Pandas, SQL, PySpark, and Polars for Data Manipulation, with Practical Examples and Visualisations
If you want to run the code yourself, you can download the Jupyter notebook for each section:

- Download: All
- Download: Pandas
- Download: SQL
- Download: PySpark
- Download: Polars

Or you can follow along on this page...
Introduction
Working as a Data Scientist or Data Engineer often involves querying data from various sources. There are many tools and libraries available to perform these tasks, each with its own strengths and weaknesses. Also, there are many different ways to achieve similar results, depending on the tool or library used. It's important to be familiar with these different methods to choose the best one for your specific use case.
This article provides a comprehensive guide on how to query data using different tools and libraries, including Pandas, SQL, PySpark, and Polars. Each section will cover the setup, data creation, and various querying techniques such as filtering, grouping, joining, window functions, ranking, and sorting. The output will be identical across all tools, but the transformations will be implemented using the specific syntax and features of each library, allowing you to compare the different approaches and understand the nuances of each method.
Overview of the Different Libraries
Before we dive into the querying techniques, let's take a moment to understand the different libraries and tools we will be using in this article. Each library has its own strengths and weaknesses, and understanding these can help you choose the right tool for your specific use case.
Throughout this article, you can easily switch between the different libraries by selecting the appropriate tab. Each section will provide the same functionality, but implemented using the specific syntax and features of each library.
Pandas is a powerful data manipulation library in Python that provides data structures and functions for working with structured data. It is widely used for data analysis and manipulation tasks.
Historically, Pandas was one of the first libraries to provide a DataFrame structure, which is similar to a table in a relational database. It allows for easy data manipulation, filtering, grouping, and aggregation. Pandas is built on top of NumPy and provides a high-level interface for working with data. It is particularly well-suited for small to medium-sized datasets and is often used in Data Science and Machine Learning workflows.
Pandas provides a rich set of functionalities for data manipulation, including filtering, grouping, joining, and window functions. It also integrates well with other libraries such as Matplotlib and Seaborn for data visualization, making it a popular choice among data scientists and analysts.
While Pandas is both powerful and popular, it is important to note that it operates in-memory, which means that it may not be suitable for very large datasets that do not fit into memory. In such cases, other libraries like PySpark or Polars may be more appropriate.
SQL (Structured Query Language) is a standard language for managing and manipulating relational databases. It is widely used for querying and modifying data in databases. SQL is a declarative language, meaning that you specify what you want to retrieve or manipulate without detailing how to do it. This makes SQL queries concise and expressive. SQL is particularly well-suited for working with large datasets and complex queries. It provides powerful features for filtering, grouping, joining, and aggregating data. SQL is the backbone of many database systems.
SQL is actually a language (like Python is a language), not a library (like Pandas is a library), and it is used to interact with relational databases. The core of the SQL language is an ISO standard, which means that the basic syntax and functionality are consistent across different database systems. However, each database system may have its own extensions or variations of SQL, which can lead to differences in syntax and features. In effect, each database system implements its own dialect of SQL, with its own specific features, optimisations, and syntax enhancements.
Some of the more popular SQL dialects include, for example, PostgreSQL, MySQL, SQLite, Microsoft SQL Server (T-SQL), and Oracle Database (PL/SQL). The examples in this article use SQLite, accessed from Python through the built-in sqlite3 module.
PySpark is the Python API for Apache Spark, a distributed computing framework that allows for large-scale data processing. PySpark provides a high-level interface for working with Spark, making it easier to write distributed data processing applications in Python. It is particularly well-suited for big data processing and analytics.
PySpark provides a DataFrame API similar to Pandas, but it is designed to work with large datasets that do not fit into memory. It allows for distributed data processing across a cluster of machines, making it suitable for big data applications. PySpark supports various data sources, including HDFS, S3, ADLS, and JDBC, and provides powerful features for filtering, grouping, joining, and aggregating data.
While PySpark is a powerful tool for big data processing, it can be more complex to set up and use compared to Pandas. It requires a Spark cluster and may have a steeper learning curve for those unfamiliar with distributed computing concepts. However, it is an excellent choice for processing large datasets and performing complex data transformations.
Polars is a fast DataFrame library for Python that is designed for high-performance data manipulation. It is built on top of Rust and provides a DataFrame API similar to Pandas, but with a focus on performance and memory efficiency. Polars is particularly well-suited for large datasets and complex queries.
Polars supports lazy evaluation, which allows for optimizations in query execution. Polars also provides powerful features for filtering, grouping, joining, and aggregating data, making it a great choice for data analysis tasks.
While Polars is a relatively new library compared to Pandas, it has gained popularity for its performance and ease of use. Its expression-based API will feel familiar to Pandas users, although it is not a strict drop-in replacement, so some code changes are needed when migrating existing Pandas code. It is particularly useful for data scientists and analysts who need to work with large datasets and require fast data manipulation capabilities. The setup is simple and straightforward, similar to Pandas, and less complex than PySpark. It is a great choice for data analysis tasks that require high performance and memory efficiency.
Setup
Before we start querying data, we need to set up our environment. This includes importing the necessary libraries, creating sample data, and defining constants that will be used throughout the article. The following sections will guide you through this setup process. The code for this article is also available on GitHub: querying-data.
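As a rough indication, the combined setup across the four tabs might look like the following sketch; the constant names, values, and random seed are illustrative assumptions rather than the notebook's exact code:

```python
import sqlite3

import numpy as np
import pandas as pd
import polars as pl
import plotly.express as px
from pyspark.sql import SparkSession, functions as F

# Illustrative constants for the randomly generated tables below
SEED = 42          # assumed random seed for reproducibility
N_SALES = 100      # number of sales transactions
N_PRODUCTS = 50    # number of products
N_CUSTOMERS = 100  # number of customers

np.random.seed(SEED)
```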
Once the setup is complete, we can proceed to create our sample data. This data will be used for querying and will be consistent across all libraries. All tables will be created from scratch with randomly generated data to simulate a real-world scenario. This ensures that the examples are self-contained and can be run without any external dependencies, and that there are no concerns about data privacy or security.
For the below data creation steps, we will be defining the tables using Python dictionaries. Each dictionary will represent a table, with keys as column names and values as lists of data. We will then convert these dictionaries into DataFrames or equivalent structures in each library.
First, we will create a sales fact table. This table will contain information about sales transactions, including the date, customer ID, product ID, category, sales amount, and quantity sold.
Create Sales Fact Data
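A sketch of how such a column-oriented dictionary might be built with NumPy; the value ranges are inferred from the sample output further down and are therefore approximate:

```python
CATEGORIES = ["Electronics", "Clothing", "Food", "Books", "Home"]

sales_data = {
    "date": pd.date_range("2023-01-01", periods=N_SALES, freq="D"),
    "customer_id": np.random.randint(1, N_CUSTOMERS + 1, N_SALES),
    "product_id": np.random.randint(1, N_PRODUCTS + 1, N_SALES),
    "category": np.random.choice(CATEGORIES, N_SALES),
    "sales_amount": np.round(np.random.uniform(10, 1000, N_SALES), 2),
    "quantity": np.random.randint(1, 10, N_SALES),
}
```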
Next, we will create a product dimension table. This table will contain information about products, including the product ID, name, price, category, and supplier ID.
Create Product Dimension Data
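A similar sketch for the product dimension (again, the ranges are approximate):

```python
product_data = {
    "product_id": list(range(1, N_PRODUCTS + 1)),
    "product_name": [f"Product {i}" for i in range(1, N_PRODUCTS + 1)],
    "price": np.round(np.random.uniform(10, 500, N_PRODUCTS), 2),
    "category": np.random.choice(CATEGORIES, N_PRODUCTS),
    "supplier_id": np.random.randint(1, 11, N_PRODUCTS),
}
```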
Finally, we will create a customer dimension table. This table will contain information about customers, including the customer ID, name, city, state, and segment.
Create Customer Dimension Data
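And one for the customer dimension:

```python
customer_data = {
    "customer_id": list(range(1, N_CUSTOMERS + 1)),
    "customer_name": [f"Customer {i}" for i in range(1, N_CUSTOMERS + 1)],
    "city": np.random.choice(
        ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"], N_CUSTOMERS
    ),
    "state": np.random.choice(["NY", "CA", "IL", "TX", "AZ"], N_CUSTOMERS),
    "segment": np.random.choice(["Consumer", "Corporate", "Home Office"], N_CUSTOMERS),
}
```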
Now that we have our sample data created, we can proceed to the querying section. Each of the following sections will demonstrate how to perform similar operations using the different libraries and methods, allowing you to compare and contrast their capabilities.
Create the DataFrames
To create the dataframes in Pandas, we will use the data we generated earlier. We will convert the dictionaries into Pandas DataFrames, which will allow us to perform various data manipulation tasks.
Create DataFrames
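A minimal sketch, assuming the dictionary and DataFrame names used in the earlier sketches (the notebook's own names may differ):

```python
df_sales_pd = pd.DataFrame(sales_data)
df_product_pd = pd.DataFrame(product_data)
df_customer_pd = pd.DataFrame(customer_data)
```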
Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each DataFrame. To do this, we will use the .head()
method to display the first 5 rows of each DataFrame, and then pass the result to the print()
function to display the DataFrame in a readable format.
Check Sales DataFrame
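Roughly, for the sales table:

```python
print(f"Sales DataFrame: {len(df_sales_pd)}")
print(df_sales_pd.head())
```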
Sales DataFrame: 100
date customer_id product_id category sales_amount quantity
0 2023-01-01 52 45 Food 490.76 7
1 2023-01-02 93 41 Electronics 453.94 5
2 2023-01-03 15 29 Home 994.51 5
3 2023-01-04 72 15 Electronics 184.17 7
4 2023-01-05 61 45 Food 27.89 9
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 |
Check Product DataFrame
Product DataFrame: 50
product_id product_name price category supplier_id
0 1 Product 1 257.57 Food 8
1 2 Product 2 414.96 Clothing 5
2 3 Product 3 166.82 Clothing 8
3 4 Product 4 448.81 Food 4
4 5 Product 5 200.71 Food 8
product_id | product_name | price | category | supplier_id | |
---|---|---|---|---|---|
0 | 1 | Product 1 | 257.57 | Food | 8 |
1 | 2 | Product 2 | 414.96 | Clothing | 5 |
2 | 3 | Product 3 | 166.82 | Clothing | 8 |
3 | 4 | Product 4 | 448.81 | Food | 4 |
4 | 5 | Product 5 | 200.71 | Food | 8 |
Check Customer DataFrame
Customer DataFrame: 100
customer_id customer_name city state segment
0 1 Customer 1 Phoenix NY Corporate
1 2 Customer 2 Phoenix CA Home Office
2 3 Customer 3 Phoenix NY Home Office
3 4 Customer 4 Los Angeles NY Consumer
4 5 Customer 5 Los Angeles IL Home Office
customer_id | customer_name | city | state | segment | |
---|---|---|---|---|---|
0 | 1 | Customer 1 | Phoenix | NY | Corporate |
1 | 2 | Customer 2 | Phoenix | CA | Home Office |
2 | 3 | Customer 3 | Phoenix | NY | Home Office |
3 | 4 | Customer 4 | Los Angeles | NY | Consumer |
4 | 5 | Customer 5 | Los Angeles | IL | Home Office |
To create the dataframes in SQL, we will use the data we generated earlier. Firstly, we need to create the SQLite database. This will be an in-memory database for demonstration purposes, but in a real-world scenario, you would typically connect to a persistent (on-disk) database. To do this, we will use the sqlite3
library to create a connection to the database, which we define with the :memory:
parameter on the .connect()
function. This creates a temporary database that exists only for the lifetime of the connection.
Next, we will convert the dictionaries into Pandas DataFrames, which will then be loaded into the SQLite database. This allows us to perform various data manipulation tasks using SQL queries.
Create DataFrames
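A minimal sketch of this step, assuming the table names sales, product, and customer and a connection object called conn (all illustrative):

```python
# Create an in-memory SQLite database and load each table into it
conn = sqlite3.connect(":memory:")

pd.DataFrame(sales_data).to_sql("sales", conn, index=False)
pd.DataFrame(product_data).to_sql("product", conn, index=False)
pd.DataFrame(customer_data).to_sql("customer", conn, index=False)
```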
Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each DataFrame. To do this, we will use the pd.read_sql()
function to execute SQL queries and retrieve the data from the database. We will then pass the results to the print()
function to display the DataFrame in a readable format.
Check Sales DataFrame
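Roughly, for the sales table:

```python
sales_check = pd.read_sql("SELECT * FROM sales", conn)
print(f"Sales Table: {len(sales_check)}")
print(sales_check.head())
```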
Sales Table: 100
date customer_id product_id category sales_amount quantity
0 2023-01-01 00:00:00 52 45 Food 490.76 7
1 2023-01-02 00:00:00 93 41 Electronics 453.94 5
2 2023-01-03 00:00:00 15 29 Home 994.51 5
3 2023-01-04 00:00:00 72 15 Electronics 184.17 7
4 2023-01-05 00:00:00 61 45 Food 27.89 9
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 |
Check Product DataFrame
Product Table: 50
product_id product_name price category supplier_id
0 1 Product 1 257.57 Food 8
1 2 Product 2 414.96 Clothing 5
2 3 Product 3 166.82 Clothing 8
3 4 Product 4 448.81 Food 4
4 5 Product 5 200.71 Food 8
product_id | product_name | price | category | supplier_id | |
---|---|---|---|---|---|
0 | 1 | Product 1 | 257.57 | Food | 8 |
1 | 2 | Product 2 | 414.96 | Clothing | 5 |
2 | 3 | Product 3 | 166.82 | Clothing | 8 |
3 | 4 | Product 4 | 448.81 | Food | 4 |
4 | 5 | Product 5 | 200.71 | Food | 8 |
Check Customer DataFrame
Customer Table: 100
customer_id customer_name city state segment
0 1 Customer 1 Phoenix NY Corporate
1 2 Customer 2 Phoenix CA Home Office
2 3 Customer 3 Phoenix NY Home Office
3 4 Customer 4 Los Angeles NY Consumer
4 5 Customer 5 Los Angeles IL Home Office
customer_id | customer_name | city | state | segment | |
---|---|---|---|---|---|
0 | 1 | Customer 1 | Phoenix | NY | Corporate |
1 | 2 | Customer 2 | Phoenix | CA | Home Office |
2 | 3 | Customer 3 | Phoenix | NY | Home Office |
3 | 4 | Customer 4 | Los Angeles | NY | Consumer |
4 | 5 | Customer 5 | Los Angeles | IL | Home Office |
Spark DataFrames are similar to Pandas DataFrames, but they are designed to work with large datasets that do not fit into memory. They can be distributed across a cluster of machines, allowing for parallel processing of data.
To create the dataframes in PySpark, we will use the data we generated earlier. We will first create a Spark session, which is the entry point to using PySpark. Then, we will convert the dictionaries into PySpark DataFrames, which will allow us to perform various data manipulation tasks.
The PySpark session is created using the .builder
method on the SparkSession
class, which allows us to configure the session with various options such as the application name. The .getOrCreate()
method is used to either get an existing session or create a new one if it doesn't exist.
Create Spark Session
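A minimal sketch; the application name is illustrative:

```python
spark = SparkSession.builder.appName("querying-data").getOrCreate()
```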
Once the Spark session is created, we can create the DataFrames from the dictionaries. We will use the .createDataFrame() method on the Spark session to convert the dictionaries into PySpark DataFrames. The .createDataFrame() method expects the data to be oriented by row, meaning the data should be a list of dictionaries where each dictionary represents one row. However, our data is currently oriented by column, where each dictionary has keys as column names and values as lists of data. Therefore, we first need to convert the dictionaries from column orientation to row orientation. The easiest way to do this is to load the data into Pandas DataFrames and then create the PySpark DataFrames from those.
A good description of how to create PySpark DataFrames from Python Dictionaries can be found in the PySpark documentation: PySpark Create DataFrame From Dictionary.
Create DataFrames
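A sketch of this step, routing the dictionaries through Pandas as described above (names are illustrative):

```python
df_sales_ps = spark.createDataFrame(pd.DataFrame(sales_data))
df_product_ps = spark.createDataFrame(pd.DataFrame(product_data))
df_customer_ps = spark.createDataFrame(pd.DataFrame(customer_data))
```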
Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each DataFrame. To do this, we will use the .show()
method to display the first 5
rows of each DataFrame. The .show()
method is used to display the data in a tabular format, similar to how it would be displayed in a SQL database.
Check Sales DataFrame
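Roughly, for the sales table:

```python
print(f"Sales DataFrame: {df_sales_ps.count()}")
df_sales_ps.show(5)
```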
Sales DataFrame: 100
+-------------------+-----------+----------+-----------+------------+--------+
| date|customer_id|product_id| category|sales_amount|quantity|
+-------------------+-----------+----------+-----------+------------+--------+
|2023-01-01 00:00:00| 52| 45| Food| 490.76| 7|
|2023-01-02 00:00:00| 93| 41|Electronics| 453.94| 5|
|2023-01-03 00:00:00| 15| 29| Home| 994.51| 5|
|2023-01-04 00:00:00| 72| 15|Electronics| 184.17| 7|
|2023-01-05 00:00:00| 61| 45| Food| 27.89| 9|
+-------------------+-----------+----------+-----------+------------+--------+
only showing top 5 rows
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 |
Check Product DataFrame
Product DataFrame: 50
+----------+------------+------+--------+-----------+
|product_id|product_name| price|category|supplier_id|
+----------+------------+------+--------+-----------+
| 1| Product 1|257.57| Food| 8|
| 2| Product 2|414.96|Clothing| 5|
| 3| Product 3|166.82|Clothing| 8|
| 4| Product 4|448.81| Food| 4|
| 5| Product 5|200.71| Food| 8|
+----------+------------+------+--------+-----------+
only showing top 5 rows
product_id | product_name | price | category | supplier_id | |
---|---|---|---|---|---|
0 | 1 | Product 1 | 257.57 | Food | 8 |
1 | 2 | Product 2 | 414.96 | Clothing | 5 |
2 | 3 | Product 3 | 166.82 | Clothing | 8 |
3 | 4 | Product 4 | 448.81 | Food | 4 |
4 | 5 | Product 5 | 200.71 | Food | 8 |
Check Customer DataFrame
Customer DataFrame: 100
+-----------+-------------+-----------+-----+-----------+
|customer_id|customer_name| city|state| segment|
+-----------+-------------+-----------+-----+-----------+
| 1| Customer 1| Phoenix| NY| Corporate|
| 2| Customer 2| Phoenix| CA|Home Office|
| 3| Customer 3| Phoenix| NY|Home Office|
| 4| Customer 4|Los Angeles| NY| Consumer|
| 5| Customer 5|Los Angeles| IL|Home Office|
+-----------+-------------+-----------+-----+-----------+
only showing top 5 rows
customer_id | customer_name | city | state | segment | |
---|---|---|---|---|---|
0 | 1 | Customer 1 | Phoenix | NY | Corporate |
1 | 2 | Customer 2 | Phoenix | CA | Home Office |
2 | 3 | Customer 3 | Phoenix | NY | Home Office |
3 | 4 | Customer 4 | Los Angeles | NY | Consumer |
4 | 5 | Customer 5 | Los Angeles | IL | Home Office |
To create the dataframes in Polars, we will use the data we generated earlier. We will convert the dictionaries into Polars DataFrames, which will allow us to perform various data manipulation tasks.
Create DataFrames
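A minimal sketch, assuming the same dictionaries as before (depending on the Polars version, the Pandas date range may need converting with .to_numpy() first):

```python
# Polars accepts column-oriented dictionaries directly
df_sales_pl = pl.DataFrame(sales_data)
df_product_pl = pl.DataFrame(product_data)
df_customer_pl = pl.DataFrame(customer_data)
```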
Once the data is created, we can check that it has been loaded correctly by displaying the first few rows of each DataFrame. To do this, we will use the .head()
method to display the first 5
rows of each DataFrame, and then pass the result to the print()
function to display the DataFrame in a readable format.
Check Sales DataFrame
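Roughly:

```python
print(f"Sales DataFrame: {df_sales_pl.height}")
print(df_sales_pl.head())
```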
Sales DataFrame: 100
shape: (5, 6)
┌─────────────────────┬─────────────┬────────────┬─────────────┬──────────────┬──────────┐
│ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity │
│ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      │
│ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      │
╞═════════════════════╪═════════════╪════════════╪═════════════╪══════════════╪══════════╡
│ 2023-01-01 00:00:00 ┆ 52          ┆ 45         ┆ Food        ┆ 490.76       ┆ 7        │
│ 2023-01-02 00:00:00 ┆ 93          ┆ 41         ┆ Electronics ┆ 453.94       ┆ 5        │
│ 2023-01-03 00:00:00 ┆ 15          ┆ 29         ┆ Home        ┆ 994.51       ┆ 5        │
│ 2023-01-04 00:00:00 ┆ 72          ┆ 15         ┆ Electronics ┆ 184.17       ┆ 7        │
│ 2023-01-05 00:00:00 ┆ 61          ┆ 45         ┆ Food        ┆ 27.89        ┆ 9        │
└─────────────────────┴─────────────┴────────────┴─────────────┴──────────────┴──────────┘
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 |
Check Product DataFrame
Product DataFrame: 50
shape: (5, 5)
┌────────────┬──────────────┬────────┬──────────┬─────────────┐
│ product_id ┆ product_name ┆ price  ┆ category ┆ supplier_id │
│ ---        ┆ ---          ┆ ---    ┆ ---      ┆ ---         │
│ i64        ┆ str          ┆ f64    ┆ str      ┆ i64         │
╞════════════╪══════════════╪════════╪══════════╪═════════════╡
│ 1          ┆ Product 1    ┆ 257.57 ┆ Food     ┆ 8           │
│ 2          ┆ Product 2    ┆ 414.96 ┆ Clothing ┆ 5           │
│ 3          ┆ Product 3    ┆ 166.82 ┆ Clothing ┆ 8           │
│ 4          ┆ Product 4    ┆ 448.81 ┆ Food     ┆ 4           │
│ 5          ┆ Product 5    ┆ 200.71 ┆ Food     ┆ 8           │
└────────────┴──────────────┴────────┴──────────┴─────────────┘
product_id | product_name | price | category | supplier_id | |
---|---|---|---|---|---|
0 | 1 | Product 1 | 257.57 | Food | 8 |
1 | 2 | Product 2 | 414.96 | Clothing | 5 |
2 | 3 | Product 3 | 166.82 | Clothing | 8 |
3 | 4 | Product 4 | 448.81 | Food | 4 |
4 | 5 | Product 5 | 200.71 | Food | 8 |
Check Customer DataFrame
Customer DataFrame: 100
shape: (5, 5)
┌─────────────┬───────────────┬─────────────┬───────┬─────────────┐
│ customer_id ┆ customer_name ┆ city        ┆ state ┆ segment     │
│ ---         ┆ ---           ┆ ---         ┆ ---   ┆ ---         │
│ i64         ┆ str           ┆ str         ┆ str   ┆ str         │
╞═════════════╪═══════════════╪═════════════╪═══════╪═════════════╡
│ 1           ┆ Customer 1    ┆ Phoenix     ┆ NY    ┆ Corporate   │
│ 2           ┆ Customer 2    ┆ Phoenix     ┆ CA    ┆ Home Office │
│ 3           ┆ Customer 3    ┆ Phoenix     ┆ NY    ┆ Home Office │
│ 4           ┆ Customer 4    ┆ Los Angeles ┆ NY    ┆ Consumer    │
│ 5           ┆ Customer 5    ┆ Los Angeles ┆ IL    ┆ Home Office │
└─────────────┴───────────────┴─────────────┴───────┴─────────────┘
customer_id | customer_name | city | state | segment | |
---|---|---|---|---|---|
0 | 1 | Customer 1 | Phoenix | NY | Corporate |
1 | 2 | Customer 2 | Phoenix | CA | Home Office |
2 | 3 | Customer 3 | Phoenix | NY | Home Office |
3 | 4 | Customer 4 | Los Angeles | NY | Consumer |
4 | 5 | Customer 5 | Los Angeles | IL | Home Office |
1. Filtering and Selecting
This first section will demonstrate how to filter and select data from the DataFrames. This is a common operation in data analysis, allowing us to focus on specific subsets of the data.
In Pandas, we can use boolean indexing to filter rows based on specific conditions. As you can see in this first example, this looks like using square brackets, within which we define a column and a condition. In the below example, we can use string values to filter categorical data.
For more information about filtering in Pandas, see the Pandas documentation on filtering.
Filter sales data for specific category
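A rough sketch of this filter, reusing the illustrative DataFrame names from the earlier sketches:

```python
# Boolean indexing: keep only the rows where category == "Electronics"
electronics_sales = df_sales_pd[df_sales_pd["category"] == "Electronics"]
print(f"Number of Electronics Sales: {len(electronics_sales)}")
print(electronics_sales.head())
```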
Number of Electronics Sales: 28
date customer_id product_id category sales_amount quantity
1 2023-01-02 93 41 Electronics 453.94 5
3 2023-01-04 72 15 Electronics 184.17 7
8 2023-01-09 75 9 Electronics 746.73 2
10 2023-01-11 88 1 Electronics 314.98 9
11 2023-01-12 24 44 Electronics 547.11 8
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
8 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
10 | 2023-01-11 00:00:00 | 88 | 1 | Electronics | 314.98 | 9 |
11 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
In SQL, we can use the WHERE
clause to filter rows based on specific conditions. The syntax should be very familiar to anyone who has worked with SQL before. We can use the pd.read_sql()
function to execute SQL queries and retrieve the data from the database. The result is a Pandas DataFrame that contains only the rows that match the specified condition. In the below example, we filter for sales in the "Electronics" category.
For more information about filtering in SQL, see the SQL WHERE clause documentation.
Filter sales for a specific category
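A rough sketch of the query, assuming the table is named sales:

```python
electronics_sales = pd.read_sql(
    """
    SELECT *
    FROM sales
    WHERE category = 'Electronics'
    """,
    conn,
)
print(f"Number of Electronics Sales: {len(electronics_sales)}")
print(electronics_sales.head())
```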
Number of Electronics Sales: 28
date customer_id product_id category sales_amount quantity
0 2023-01-02 00:00:00 93 41 Electronics 453.94 5
1 2023-01-04 00:00:00 72 15 Electronics 184.17 7
2 2023-01-09 00:00:00 75 9 Electronics 746.73 2
3 2023-01-11 00:00:00 88 1 Electronics 314.98 9
4 2023-01-12 00:00:00 24 44 Electronics 547.11 8
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
1 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
2 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
3 | 2023-01-11 00:00:00 | 88 | 1 | Electronics | 314.98 | 9 |
4 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
In PySpark, we can use the .filter()
(or the .where()
) method to filter rows based on specific conditions. This process is effectively doing a boolean indexing operation to filter the DataFrame. The syntax is similar to SQL, where we can specify the condition as a string or using column expressions. In the below example, we filter for sales in the "Electronics" category.
For more information about filtering in PySpark, see the PySpark documentation on filtering.
Filter sales for a specific category
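A rough sketch using a column expression:

```python
electronics_sales = df_sales_ps.filter(F.col("category") == "Electronics")
print(f"Number of Electronics Sales: {electronics_sales.count()}")
electronics_sales.show(5)
```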
Number of Electronics Sales: 28
+-------------------+-----------+----------+-----------+------------+--------+
| date|customer_id|product_id| category|sales_amount|quantity|
+-------------------+-----------+----------+-----------+------------+--------+
|2023-01-02 00:00:00| 93| 41|Electronics| 453.94| 5|
|2023-01-04 00:00:00| 72| 15|Electronics| 184.17| 7|
|2023-01-09 00:00:00| 75| 9|Electronics| 746.73| 2|
|2023-01-11 00:00:00| 88| 1|Electronics| 314.98| 9|
|2023-01-12 00:00:00| 24| 44|Electronics| 547.11| 8|
+-------------------+-----------+----------+-----------+------------+--------+
only showing top 5 rows
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
1 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
2 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
3 | 2023-01-11 00:00:00 | 88 | 1 | Electronics | 314.98 | 9 |
4 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
In Polars, we can use the .filter()
method to filter rows based on specific conditions. The syntax is similar to Pandas, where we can specify the condition using column expressions. In the below example, we filter for sales in the "Electronics" category.
For more information about filtering in Polars, see the Polars documentation on filtering.
Filter sales for a specific category
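A rough sketch using a Polars expression:

```python
electronics_sales = df_sales_pl.filter(pl.col("category") == "Electronics")
print(f"Number of Electronics Sales: {electronics_sales.height}")
print(electronics_sales.head())
```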
Number of Electronics Sales: 28
shape: (5, 6)
┌─────────────────────┬─────────────┬────────────┬─────────────┬──────────────┬──────────┐
│ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity │
│ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      │
│ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      │
╞═════════════════════╪═════════════╪════════════╪═════════════╪══════════════╪══════════╡
│ 2023-01-02 00:00:00 ┆ 93          ┆ 41         ┆ Electronics ┆ 453.94       ┆ 5        │
│ 2023-01-04 00:00:00 ┆ 72          ┆ 15         ┆ Electronics ┆ 184.17       ┆ 7        │
│ 2023-01-09 00:00:00 ┆ 75          ┆ 9          ┆ Electronics ┆ 746.73       ┆ 2        │
│ 2023-01-11 00:00:00 ┆ 88          ┆ 1          ┆ Electronics ┆ 314.98       ┆ 9        │
│ 2023-01-12 00:00:00 ┆ 24          ┆ 44         ┆ Electronics ┆ 547.11       ┆ 8        │
└─────────────────────┴─────────────┴────────────┴─────────────┴──────────────┴──────────┘
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 |
1 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 |
2 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
3 | 2023-01-11 00:00:00 | 88 | 1 | Electronics | 314.98 | 9 |
4 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
We can also use numerical filtering, as you can see in the next example, where we filter for sales amounts greater than $500.
When it comes to numerical filtering in Pandas, the process is similar to the previous example, where we use boolean indexing to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500
.
Filter for high value transactions
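Roughly:

```python
high_value_sales = df_sales_pd[df_sales_pd["sales_amount"] > 500]
print(f"Number of high-value Sales: {len(high_value_sales)}")
print(high_value_sales.head())
```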
Number of high-value Sales: 43
date customer_id product_id category sales_amount quantity
2 2023-01-03 15 29 Home 994.51 5
8 2023-01-09 75 9 Electronics 746.73 2
9 2023-01-10 75 24 Books 723.73 6
11 2023-01-12 24 44 Electronics 547.11 8
12 2023-01-13 3 8 Clothing 513.73 5
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
8 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
9 | 2023-01-10 00:00:00 | 75 | 24 | Books | 723.73 | 6 |
11 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
12 | 2023-01-13 00:00:00 | 3 | 8 | Clothing | 513.73 | 5 |
When it comes to numerical filtering in SQL, the process is similar to the previous example, where we use the WHERE
clause to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500
.
Filter for high value transactions
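Roughly:

```python
high_value_sales = pd.read_sql(
    """
    SELECT *
    FROM sales
    WHERE sales_amount > 500
    """,
    conn,
)
print(f"Number of high-value Sales: {len(high_value_sales)}")
print(high_value_sales.head())
```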
Number of high-value Sales: 43
date customer_id product_id category sales_amount quantity
0 2023-01-03 00:00:00 15 29 Home 994.51 5
1 2023-01-09 00:00:00 75 9 Electronics 746.73 2
2 2023-01-10 00:00:00 75 24 Books 723.73 6
3 2023-01-12 00:00:00 24 44 Electronics 547.11 8
4 2023-01-13 00:00:00 3 8 Clothing 513.73 5
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
1 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
2 | 2023-01-10 00:00:00 | 75 | 24 | Books | 723.73 | 6 |
3 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
4 | 2023-01-13 00:00:00 | 3 | 8 | Clothing | 513.73 | 5 |
When it comes to numerical filtering in PySpark, the process is similar to the previous example, where we use the .filter()
(or .where()
) method to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500
.
Also note here that we have passed a string value to the .filter() method, instead of using the column-expression syntax shown above. This is because the .filter() method can also accept a SQL-like string expression, which is a common practice in PySpark.
Filter for high value transactions
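A sketch using the SQL-like string form:

```python
high_value_sales = df_sales_ps.filter("sales_amount > 500")
print(f"Number of high-value Sales: {high_value_sales.count()}")
high_value_sales.show(5)
```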
Number of high-value Sales: 43
+-------------------+-----------+----------+-----------+------------+--------+
| date|customer_id|product_id| category|sales_amount|quantity|
+-------------------+-----------+----------+-----------+------------+--------+
|2023-01-03 00:00:00| 15| 29| Home| 994.51| 5|
|2023-01-09 00:00:00| 75| 9|Electronics| 746.73| 2|
|2023-01-10 00:00:00| 75| 24| Books| 723.73| 6|
|2023-01-12 00:00:00| 24| 44|Electronics| 547.11| 8|
|2023-01-13 00:00:00| 3| 8| Clothing| 513.73| 5|
+-------------------+-----------+----------+-----------+------------+--------+
only showing top 5 rows
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
1 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
2 | 2023-01-10 00:00:00 | 75 | 24 | Books | 723.73 | 6 |
3 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
4 | 2023-01-13 00:00:00 | 3 | 8 | Clothing | 513.73 | 5 |
When it comes to numerical filtering in Polars, the process is similar to the previous example, where we use the .filter()
method to filter rows based on a given condition, but here we use a numerical value instead of a string value. In the below example, we filter for sales amounts greater than 500
.
Also note here that we have used the pl.col()
function to specify the column we want to filter on. This is different from the previous examples, where we used the column name directly. The use of pl.col()
is a common practice in Polars to specify the column name in a more readable way.
Filter for high value transactions
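Roughly:

```python
high_value_sales = df_sales_pl.filter(pl.col("sales_amount") > 500)
print(f"Number of high-value Sales: {high_value_sales.height}")
print(high_value_sales.head())
```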
Number of high-value Sales: 43
shape: (5, 6)
┌─────────────────────┬─────────────┬────────────┬─────────────┬──────────────┬──────────┐
│ date                ┆ customer_id ┆ product_id ┆ category    ┆ sales_amount ┆ quantity │
│ ---                 ┆ ---         ┆ ---        ┆ ---         ┆ ---          ┆ ---      │
│ datetime[ns]        ┆ i64         ┆ i64        ┆ str         ┆ f64          ┆ i64      │
╞═════════════════════╪═════════════╪════════════╪═════════════╪══════════════╪══════════╡
│ 2023-01-03 00:00:00 ┆ 15          ┆ 29         ┆ Home        ┆ 994.51       ┆ 5        │
│ 2023-01-09 00:00:00 ┆ 75          ┆ 9          ┆ Electronics ┆ 746.73       ┆ 2        │
│ 2023-01-10 00:00:00 ┆ 75          ┆ 24         ┆ Books       ┆ 723.73       ┆ 6        │
│ 2023-01-12 00:00:00 ┆ 24          ┆ 44         ┆ Electronics ┆ 547.11       ┆ 8        │
│ 2023-01-13 00:00:00 ┆ 3           ┆ 8          ┆ Clothing    ┆ 513.73       ┆ 5        │
└─────────────────────┴─────────────┴────────────┴─────────────┴──────────────┴──────────┘
date | customer_id | product_id | category | sales_amount | quantity | |
---|---|---|---|---|---|---|
0 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 |
1 | 2023-01-09 00:00:00 | 75 | 9 | Electronics | 746.73 | 2 |
2 | 2023-01-10 00:00:00 | 75 | 24 | Books | 723.73 | 6 |
3 | 2023-01-12 00:00:00 | 24 | 44 | Electronics | 547.11 | 8 |
4 | 2023-01-13 00:00:00 | 3 | 8 | Clothing | 513.73 | 5 |
In addition to subsetting a table by rows (aka filtering), we can also subset a table by columns (aka selecting). This allows us to create a new DataFrame with only the relevant columns we want to work with. This is useful when we want to focus on specific attributes of the data, such as dates, categories, or sales amounts.
To select specific columns in Pandas, we can use the double square brackets syntax to specify the columns we want to keep in the DataFrame. This allows us to create a new DataFrame with only the relevant columns.
For more information about selecting specific columns, see the Pandas documentation on selecting columns.
Select specific columns
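Roughly:

```python
sales_summary = df_sales_pd[["date", "category", "sales_amount"]]
print(f"Sales Summary DataFrame: {len(sales_summary)}")
print(sales_summary.head())
```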
Sales Summary DataFrame: 100
date category sales_amount
0 2023-01-01 Food 490.76
1 2023-01-02 Electronics 453.94
2 2023-01-03 Home 994.51
3 2023-01-04 Electronics 184.17
4 2023-01-05 Food 27.89
date | category | sales_amount | |
---|---|---|---|
0 | 2023-01-01 00:00:00 | Food | 490.76 |
1 | 2023-01-02 00:00:00 | Electronics | 453.94 |
2 | 2023-01-03 00:00:00 | Home | 994.51 |
3 | 2023-01-04 00:00:00 | Electronics | 184.17 |
4 | 2023-01-05 00:00:00 | Food | 27.89 |
To select specific columns in SQL, we can use the SELECT
statement to specify the columns we want to retrieve from the table. This allows us to create a new DataFrame with only the relevant columns. We can use the pd.read_sql()
function to execute SQL queries and retrieve the data from the database.
For more information about selecting specific columns in SQL, see the SQL SELECT statement documentation.
Select specific columns
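Roughly:

```python
sales_summary = pd.read_sql(
    "SELECT date, category, sales_amount FROM sales",
    conn,
)
print(f"Selected columns in Sales: {len(sales_summary)}")
print(sales_summary.head())
```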
Selected columns in Sales: 100
date category sales_amount
0 2023-01-01 00:00:00 Food 490.76
1 2023-01-02 00:00:00 Electronics 453.94
2 2023-01-03 00:00:00 Home 994.51
3 2023-01-04 00:00:00 Electronics 184.17
4 2023-01-05 00:00:00 Food 27.89
date | category | sales_amount | |
---|---|---|---|
0 | 2023-01-01 00:00:00 | Food | 490.76 |
1 | 2023-01-02 00:00:00 | Electronics | 453.94 |
2 | 2023-01-03 00:00:00 | Home | 994.51 |
3 | 2023-01-04 00:00:00 | Electronics | 184.17 |
4 | 2023-01-05 00:00:00 | Food | 27.89 |
To select specific columns in PySpark, we can use the .select()
method to specify the columns we want to keep in the DataFrame. This allows us to create a new DataFrame with only the relevant columns. The syntax is similar to SQL, where we can specify the column names as strings.
Select specific columns
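Roughly:

```python
sales_summary = df_sales_ps.select("date", "category", "sales_amount")
print(f"Sales Summary DataFrame: {sales_summary.count()}")
sales_summary.show(5)
```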
Sales Summary DataFrame: 100
+-------------------+-----------+------------+
| date| category|sales_amount|
+-------------------+-----------+------------+
|2023-01-01 00:00:00| Food| 490.76|
|2023-01-02 00:00:00|Electronics| 453.94|
|2023-01-03 00:00:00| Home| 994.51|
|2023-01-04 00:00:00|Electronics| 184.17|
|2023-01-05 00:00:00| Food| 27.89|
+-------------------+-----------+------------+
only showing top 5 rows
date | category | sales_amount | |
---|---|---|---|
0 | 2023-01-01 00:00:00 | Food | 490.76 |
1 | 2023-01-02 00:00:00 | Electronics | 453.94 |
2 | 2023-01-03 00:00:00 | Home | 994.51 |
3 | 2023-01-04 00:00:00 | Electronics | 184.17 |
4 | 2023-01-05 00:00:00 | Food | 27.89 |
To select specific columns in Polars, we can use the .select()
method to specify the columns we want to keep in the DataFrame. This allows us to create a new DataFrame with only the relevant columns.
Select specific columns
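Roughly:

```python
sales_summary = df_sales_pl.select(["date", "category", "sales_amount"])
print(f"Sales Summary DataFrame: {sales_summary.height}")
print(sales_summary.head())
```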
Sales Summary DataFrame: 100
shape: (5, 3)
┌─────────────────────┬─────────────┬──────────────┐
│ date                ┆ category    ┆ sales_amount │
│ ---                 ┆ ---         ┆ ---          │
│ datetime[ns]        ┆ str         ┆ f64          │
╞═════════════════════╪═════════════╪══════════════╡
│ 2023-01-01 00:00:00 ┆ Food        ┆ 490.76       │
│ 2023-01-02 00:00:00 ┆ Electronics ┆ 453.94       │
│ 2023-01-03 00:00:00 ┆ Home        ┆ 994.51       │
│ 2023-01-04 00:00:00 ┆ Electronics ┆ 184.17       │
│ 2023-01-05 00:00:00 ┆ Food        ┆ 27.89        │
└─────────────────────┴─────────────┴──────────────┘
date | category | sales_amount | |
---|---|---|---|
0 | 2023-01-01 00:00:00 | Food | 490.76 |
1 | 2023-01-02 00:00:00 | Electronics | 453.94 |
2 | 2023-01-03 00:00:00 | Home | 994.51 |
3 | 2023-01-04 00:00:00 | Electronics | 184.17 |
4 | 2023-01-05 00:00:00 | Food | 27.89 |
2. Grouping and Aggregation
The second section will cover grouping and aggregation techniques. These operations are essential for summarizing data and extracting insights from large datasets.
In Pandas, we can use the .agg()
method to perform aggregation operations on DataFrames. This method allows us to apply multiple aggregation functions to different columns in a single operation.
Basic aggregation
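A sketch that reproduces the shape of the output below (count is applied to sales_amount only, which is why quantity shows NaN for that row); the exact details are assumptions:

```python
sales_stats = df_sales_pd.agg({
    "sales_amount": ["sum", "mean", "min", "max", "count"],
    "quantity": ["sum", "mean", "min", "max"],
})
print(f"Sales Statistics: {len(sales_stats)}")
print(sales_stats)
```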
Sales Statistics: 5
sales_amount quantity
sum 48227.0500 464.00
mean 482.2705 4.64
min 15.1300 1.00
max 994.6100 9.00
count 100.0000 NaN
sales_amount | quantity | |
---|---|---|
sum | 48227.1 | 464 |
mean | 482.271 | 4.64 |
min | 15.13 | 1 |
max | 994.61 | 9 |
count | 100 | nan |
In SQL, we can use the aggregate functions like SUM()
, AVG()
, MIN()
, MAX()
, and COUNT()
to perform aggregation operations on tables.
Note here that we are not using the GROUP BY
clause, which is typically used to group rows that have the same values in specified columns into summary rows. Instead, we are performing a basic aggregation on the entire table.
Basic aggregation
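A sketch of the query:

```python
sales_stats = pd.read_sql(
    """
    SELECT
        SUM(sales_amount) AS sales_sum,
        AVG(sales_amount) AS sales_mean,
        MIN(sales_amount) AS sales_min,
        MAX(sales_amount) AS sales_max,
        COUNT(sales_amount) AS sales_count,
        SUM(quantity) AS quantity_sum,
        AVG(quantity) AS quantity_mean,
        MIN(quantity) AS quantity_min,
        MAX(quantity) AS quantity_max
    FROM sales
    """,
    conn,
)
print(f"Sales Statistics: {len(sales_stats)}")
print(sales_stats)
```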
Sales Statistics: 1
sales_sum sales_mean sales_min sales_max sales_count quantity_sum quantity_mean quantity_min quantity_max
0 48227.05 482.2705 15.13 994.61 100 464 4.64 1 9
sales_sum | sales_mean | sales_min | sales_max | sales_count | quantity_sum | quantity_mean | quantity_min | quantity_max | |
---|---|---|---|---|---|---|---|---|---|
0 | 48227.1 | 482.271 | 15.13 | 994.61 | 100 | 464 | 4.64 | 1 | 9 |
In PySpark, we can use the .agg()
method to perform aggregation operations on DataFrames. This method allows us to apply multiple aggregation functions to different columns in a single operation.
Basic aggregation
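A sketch using pyspark.sql.functions (imported as F in the setup sketch):

```python
sales_stats = df_sales_ps.agg(
    F.sum("sales_amount").alias("sales_sum"),
    F.mean("sales_amount").alias("sales_mean"),
    F.min("sales_amount").alias("sales_min"),
    F.max("sales_amount").alias("sales_max"),
    F.count("sales_amount").alias("sales_count"),
    F.sum("quantity").alias("quantity_sum"),
    F.mean("quantity").alias("quantity_mean"),
    F.min("quantity").alias("quantity_min"),
    F.max("quantity").alias("quantity_max"),
)
print(f"Sales Statistics: {sales_stats.count()}")
sales_stats.show()
```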
Sales Statistics: 1
+---------+----------+---------+---------+-----------+------------+-------------+------------+------------+
|sales_sum|sales_mean|sales_min|sales_max|sales_count|quantity_sum|quantity_mean|quantity_min|quantity_max|
+---------+----------+---------+---------+-----------+------------+-------------+------------+------------+
| 48227.05| 482.2705| 15.13| 994.61| 100| 464| 4.64| 1| 9|
+---------+----------+---------+---------+-----------+------------+-------------+------------+------------+
sales_sum | sales_mean | sales_min | sales_max | sales_count | quantity_sum | quantity_mean | quantity_min | quantity_max | |
---|---|---|---|---|---|---|---|---|---|
0 | 48227.1 | 482.271 | 15.13 | 994.61 | 100 | 464 | 4.64 | 1 | 9 |
In Polars, we can use the .select()
method to perform aggregation operations on DataFrames. This method allows us to apply multiple aggregation functions to different columns in a single operation.
Basic aggregation
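A sketch using Polars expressions inside .select():

```python
sales_stats = df_sales_pl.select(
    pl.col("sales_amount").sum().alias("sales_sum"),
    pl.col("sales_amount").mean().alias("sales_mean"),
    pl.col("sales_amount").min().alias("sales_min"),
    pl.col("sales_amount").max().alias("sales_max"),
    pl.col("quantity").sum().alias("quantity_sum"),
    pl.col("quantity").mean().alias("quantity_mean"),
    pl.col("quantity").min().alias("quantity_min"),
    pl.col("quantity").max().alias("quantity_max"),
)
print(f"Sales Statistics: {sales_stats.height}")
print(sales_stats)
```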
Sales Statistics: 1
shape: (1, 8)
┌───────────┬────────────┬───────────┬───────────┬──────────────┬───────────────┬──────────────┬──────────────┐
│ sales_sum ┆ sales_mean ┆ sales_min ┆ sales_max ┆ quantity_sum ┆ quantity_mean ┆ quantity_min ┆ quantity_max │
│ ---       ┆ ---        ┆ ---       ┆ ---       ┆ ---          ┆ ---           ┆ ---          ┆ ---          │
│ f64       ┆ f64        ┆ f64       ┆ f64       ┆ i64          ┆ f64           ┆ i64          ┆ i64          │
╞═══════════╪════════════╪═══════════╪═══════════╪══════════════╪═══════════════╪══════════════╪══════════════╡
│ 48227.05  ┆ 482.2705   ┆ 15.13     ┆ 994.61    ┆ 464          ┆ 4.64          ┆ 1            ┆ 9            │
└───────────┴────────────┴───────────┴───────────┴──────────────┴───────────────┴──────────────┴──────────────┘
sales_sum | sales_mean | sales_min | sales_max | quantity_sum | quantity_mean | quantity_min | quantity_max | |
---|---|---|---|---|---|---|---|---|
0 | 48227 | 482.27 | 15.13 | 994.61 | 464 | 4.64 | 1 | 9 |
It is also possible to group the data by a specific column and then apply aggregation functions to summarize the data by group.
In Pandas, this is done by calling the .groupby() method to group the data by one or more columns, followed by the .agg() method to apply the aggregation functions.
Group by category and aggregate
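A sketch that matches the grouped output below:

```python
category_summary = df_sales_pd.groupby("category").agg({
    "sales_amount": ["sum", "mean", "count"],
    "quantity": "sum",
})
print(f"Category Sales Summary: {len(category_summary)}")
print(category_summary)
```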
Category Sales Summary: 5
sales_amount quantity
sum mean count sum
category
Books 10154.83 441.514348 23 100
Clothing 7325.31 457.831875 16 62
Electronics 11407.45 407.408929 28 147
Food 12995.57 541.482083 24 115
Home 6343.89 704.876667 9 40
category | ('sales_amount', 'sum') | ('sales_amount', 'mean') | ('sales_amount', 'count') | ('quantity', 'sum') |
---|---|---|---|---|
Books | 10154.8 | 441.514 | 23 | 100 |
Clothing | 7325.31 | 457.832 | 16 | 62 |
Electronics | 11407.5 | 407.409 | 28 | 147 |
Food | 12995.6 | 541.482 | 24 | 115 |
Home | 6343.89 | 704.877 | 9 | 40 |
In SQL, we can use the GROUP BY
clause to group rows that have the same values in specified columns into summary rows. We can then apply aggregate functions like SUM()
, AVG()
, and COUNT()
in the SELECT
clause to summarize the data by group.
Group by category and aggregate
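A sketch of the query:

```python
category_summary_sql = pd.read_sql(
    """
    SELECT
        category,
        SUM(sales_amount) AS total_sales,
        AVG(sales_amount) AS average_sales,
        COUNT(*) AS transaction_count,
        SUM(quantity) AS total_quantity
    FROM sales
    GROUP BY category
    ORDER BY category
    """,
    conn,
)
print(f"Category Sales Summary: {len(category_summary_sql)}")
print(category_summary_sql)
```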
Category Sales Summary: 5
category total_sales average_sales transaction_count total_quantity
0 Books 10154.83 441.514348 23 100
1 Clothing 7325.31 457.831875 16 62
2 Electronics 11407.45 407.408929 28 147
3 Food 12995.57 541.482083 24 115
4 Home 6343.89 704.876667 9 40
category | total_sales | average_sales | transaction_count | total_quantity | |
---|---|---|---|---|---|
0 | Books | 10154.8 | 441.514 | 23 | 100 |
1 | Clothing | 7325.31 | 457.832 | 16 | 62 |
2 | Electronics | 11407.5 | 407.409 | 28 | 147 |
3 | Food | 12995.6 | 541.482 | 24 | 115 |
4 | Home | 6343.89 | 704.877 | 9 | 40 |
In PySpark, we can use the .groupBy()
method to group data by one or more columns and then apply aggregation functions using the .agg()
method.
Group by category and aggregate
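A sketch:

```python
category_summary_ps = df_sales_ps.groupBy("category").agg(
    F.sum("sales_amount").alias("total_sales"),
    F.mean("sales_amount").alias("average_sales"),
    F.count("*").alias("transaction_count"),
    F.sum("quantity").alias("total_quantity"),
)
print(f"Category Sales Summary: {category_summary_ps.count()}")
category_summary_ps.show()
```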
Category Sales Summary: 5
+-----------+------------------+------------------+-----------------+--------------+
| category| total_sales| average_sales|transaction_count|total_quantity|
+-----------+------------------+------------------+-----------------+--------------+
| Home| 6343.889999999999| 704.8766666666666| 9| 40|
| Food| 12995.57| 541.4820833333333| 24| 115|
|Electronics|11407.449999999999|407.40892857142853| 28| 147|
| Clothing|7325.3099999999995|457.83187499999997| 16| 62|
| Books| 10154.83| 441.514347826087| 23| 100|
+-----------+------------------+------------------+-----------------+--------------+
category | total_sales | average_sales | transaction_count | total_quantity | |
---|---|---|---|---|---|
0 | Home | 6343.89 | 704.877 | 9 | 40 |
1 | Food | 12995.6 | 541.482 | 24 | 115 |
2 | Electronics | 11407.4 | 407.409 | 28 | 147 |
3 | Clothing | 7325.31 | 457.832 | 16 | 62 |
4 | Books | 10154.8 | 441.514 | 23 | 100 |
In Polars, we can use the .group_by()
method to group data by one or more columns and then apply aggregation functions using the .agg()
method.
Group by category and aggregate
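A sketch (pl.len() counts the rows in each group; older Polars versions use pl.count() instead):

```python
category_summary_pl = df_sales_pl.group_by("category").agg(
    pl.col("sales_amount").sum().alias("total_sales"),
    pl.col("sales_amount").mean().alias("average_sales"),
    pl.len().alias("transaction_count"),
    pl.col("quantity").sum().alias("total_quantity"),
)
print(f"Category Sales Summary: {category_summary_pl.height}")
print(category_summary_pl)
```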
Category Sales Summary: 5
shape: (5, 5)
┌─────────────┬─────────────┬───────────────┬───────────────────┬────────────────┐
│ category    ┆ total_sales ┆ average_sales ┆ transaction_count ┆ total_quantity │
│ ---         ┆ ---         ┆ ---           ┆ ---               ┆ ---            │
│ str         ┆ f64         ┆ f64           ┆ u32               ┆ i64            │
╞═════════════╪═════════════╪═══════════════╪═══════════════════╪════════════════╡
│ Food        ┆ 12995.57    ┆ 541.482083    ┆ 24                ┆ 115            │
│ Electronics ┆ 11407.45    ┆ 407.408929    ┆ 28                ┆ 147            │
│ Books       ┆ 10154.83    ┆ 441.514348    ┆ 23                ┆ 100            │
│ Home        ┆ 6343.89     ┆ 704.876667    ┆ 9                 ┆ 40             │
│ Clothing    ┆ 7325.31     ┆ 457.831875    ┆ 16                ┆ 62             │
└─────────────┴─────────────┴───────────────┴───────────────────┴────────────────┘
category | total_sales | average_sales | transaction_count | total_quantity | |
---|---|---|---|---|---|
0 | Food | 12995.6 | 541.482 | 24 | 115 |
1 | Electronics | 11407.5 | 407.409 | 28 | 147 |
2 | Books | 10154.8 | 441.514 | 23 | 100 |
3 | Home | 6343.89 | 704.877 | 9 | 40 |
4 | Clothing | 7325.31 | 457.832 | 16 | 62 |
We can rename the columns for clarity by simply assigning new names.
In Pandas, we use the .columns
attribute of the DataFrame. This makes it easier to understand the results of the aggregation.
It's also possible to rename columns using the .rename()
method, which allows for more flexibility in renaming specific columns from within 'dot-method' chains.
Rename columns for clarity
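Roughly, continuing from the grouped result above:

```python
# Flatten the MultiIndex columns into descriptive names
category_summary.columns = [
    "Total Sales", "Average Sales", "Transaction Count", "Total Quantity"
]
print(f"Renamed Category Sales Summary: {len(category_summary)}")
print(category_summary)
```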
Renamed Category Sales Summary: 5
Total Sales Average Sales Transaction Count Total Quantity
category
Books 10154.83 441.514348 23 100
Clothing 7325.31 457.831875 16 62
Electronics 11407.45 407.408929 28 147
Food 12995.57 541.482083 24 115
Home 6343.89 704.876667 9 40
category | Total Sales | Average Sales | Transaction Count | Total Quantity |
---|---|---|---|---|
Books | 10154.8 | 441.514 | 23 | 100 |
Clothing | 7325.31 | 457.832 | 16 | 62 |
Electronics | 11407.5 | 407.409 | 28 | 147 |
Food | 12995.6 | 541.482 | 24 | 115 |
Home | 6343.89 | 704.877 | 9 | 40 |
In SQL, we can use the AS
keyword to rename columns in the SELECT
clause. This allows us to provide more descriptive names for the aggregated columns.
In this example, we provide the same aggregation as before, but from within a subquery. Then, in the parent query, we rename the columns for clarity.
Rename columns for clarity
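A sketch of the subquery approach described above; the quoted aliases and the index_col argument are assumptions:

```python
renamed_summary_sql = pd.read_sql(
    """
    SELECT
        category,
        total_sales AS "Total Sales",
        average_sales AS "Average Sales",
        transaction_count AS "Transaction Count",
        total_quantity AS "Total Quantity"
    FROM (
        SELECT
            category,
            SUM(sales_amount) AS total_sales,
            AVG(sales_amount) AS average_sales,
            COUNT(*) AS transaction_count,
            SUM(quantity) AS total_quantity
        FROM sales
        GROUP BY category
    )
    ORDER BY category
    """,
    conn,
    index_col="category",
)
print(f"Renamed Category Sales Summary: {len(renamed_summary_sql)}")
print(renamed_summary_sql)
```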
Renamed Category Sales Summary: 5
Total Sales Average Sales Transaction Count Total Quantity
category
Books 10154.83 441.514348 23 100
Clothing 7325.31 457.831875 16 62
Electronics 11407.45 407.408929 28 147
Food 12995.57 541.482083 24 115
Home 6343.89 704.876667 9 40
category | Total Sales | Average Sales | Transaction Count | Total Quantity |
---|---|---|---|---|
Books | 10154.8 | 441.514 | 23 | 100 |
Clothing | 7325.31 | 457.832 | 16 | 62 |
Electronics | 11407.5 | 407.409 | 28 | 147 |
Food | 12995.6 | 541.482 | 24 | 115 |
Home | 6343.89 | 704.877 | 9 | 40 |
In PySpark, we can use the .withColumnsRenamed()
method to rename columns in a DataFrame. This allows us to provide more descriptive names for the aggregated columns.
Rename columns for clarity
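Roughly (.withColumnsRenamed() takes a mapping of old to new names and is available from PySpark 3.4):

```python
renamed_summary_ps = category_summary_ps.withColumnsRenamed({
    "total_sales": "Total Sales",
    "average_sales": "Average Sales",
    "transaction_count": "Transaction Count",
    "total_quantity": "Total Quantity",
})
print(f"Renamed Category Sales Summary: {renamed_summary_ps.count()}")
renamed_summary_ps.show()
```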
Renamed Category Sales Summary: 5
+-----------+------------------+------------------+-----------------+--------------+
| category| Total Sales| Average Sales|Transaction Count|Total Quantity|
+-----------+------------------+------------------+-----------------+--------------+
| Home| 6343.889999999999| 704.8766666666666| 9| 40|
| Food| 12995.57| 541.4820833333333| 24| 115|
|Electronics|11407.449999999999|407.40892857142853| 28| 147|
| Clothing|7325.3099999999995|457.83187499999997| 16| 62|
| Books| 10154.83| 441.514347826087| 23| 100|
+-----------+------------------+------------------+-----------------+--------------+
category | Total Sales | Average Sales | Transaction Count | Total Quantity | |
---|---|---|---|---|---|
0 | Home | 6343.89 | 704.877 | 9 | 40 |
1 | Food | 12995.6 | 541.482 | 24 | 115 |
2 | Electronics | 11407.4 | 407.409 | 28 | 147 |
3 | Clothing | 7325.31 | 457.832 | 16 | 62 |
4 | Books | 10154.8 | 441.514 | 23 | 100 |
In Polars, we can use the .rename()
method to rename columns in a DataFrame. This allows us to provide more descriptive names for the aggregated columns.
Rename columns for clarity
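Roughly:

```python
renamed_summary_pl = category_summary_pl.rename({
    "total_sales": "Total Sales",
    "average_sales": "Average Sales",
    "transaction_count": "Transaction Count",
    "total_quantity": "Total Quantity",
})
print(f"Renamed Category Sales Summary: {renamed_summary_pl.height}")
print(renamed_summary_pl)
```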
Renamed Category Sales Summary: 5
shape: (5, 5)
┌─────────────┬─────────────┬───────────────┬───────────────────┬────────────────┐
│ category    ┆ Total Sales ┆ Average Sales ┆ Transaction Count ┆ Total Quantity │
│ ---         ┆ ---         ┆ ---           ┆ ---               ┆ ---            │
│ str         ┆ f64         ┆ f64           ┆ u32               ┆ i64            │
╞═════════════╪═════════════╪═══════════════╪═══════════════════╪════════════════╡
│ Food        ┆ 12995.57    ┆ 541.482083    ┆ 24                ┆ 115            │
│ Electronics ┆ 11407.45    ┆ 407.408929    ┆ 28                ┆ 147            │
│ Books       ┆ 10154.83    ┆ 441.514348    ┆ 23                ┆ 100            │
│ Home        ┆ 6343.89     ┆ 704.876667    ┆ 9                 ┆ 40             │
│ Clothing    ┆ 7325.31     ┆ 457.831875    ┆ 16                ┆ 62             │
└─────────────┴─────────────┴───────────────┴───────────────────┴────────────────┘
category | Total Sales | Average Sales | Transaction Count | Total Quantity | |
---|---|---|---|---|---|
0 | Food | 12995.6 | 541.482 | 24 | 115 |
1 | Electronics | 11407.5 | 407.409 | 28 | 147 |
2 | Books | 10154.8 | 441.514 | 23 | 100 |
3 | Home | 6343.89 | 704.877 | 9 | 40 |
4 | Clothing | 7325.31 | 457.832 | 16 | 62 |
Having aggregated the data, we can now visualize the results using Plotly. This allows us to create interactive visualizations that can help us better understand the data. The simplest way to do this is to use the Plotly Express module, which provides a high-level interface for creating visualizations. Here, we have utilised the px.bar()
function to create a bar chart of the total sales by category.
The Plotly px.bar()
function is able to receive a Pandas DataFrame directly, making it easy to create visualizations from the aggregated data. However, we first need to convert the index labels into a column so that we can use them as the x-axis in the bar chart. We do this with the .reset_index()
method.
Plot the results
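A sketch of the chart, using the renamed summary from the previous step (the chart title is illustrative):

```python
fig = px.bar(
    category_summary.reset_index(),
    x="category",
    y="Total Sales",
    title="Total Sales by Category",
)
fig.show()
```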
The Plotly px.bar()
function can also receive a Pandas DataFrame, so we can use the results of the SQL query directly. Since the method we are using already returns the group labels in an individual column, we can use that directly in Plotly as the labels for the x-axis.
Plot the results
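Roughly, reusing the grouped SQL result from earlier:

```python
fig = px.bar(
    category_summary_sql,
    x="category",
    y="total_sales",
    title="Total Sales by Category",
)
fig.show()
```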
Plotly is unfortunately not able to directly receive a PySpark DataFrame, so we need to convert it to a Pandas DataFrame first. This is done using the .toPandas()
method, which converts the PySpark DataFrame to a Pandas DataFrame.
Plot the results
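Roughly:

```python
fig = px.bar(
    category_summary_ps.toPandas(),  # convert to Pandas for Plotly
    x="category",
    y="total_sales",
    title="Total Sales by Category",
)
fig.show()
```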
Plotly is also able to receive a Polars DataFrame, so we can use the results of the aggregation directly.
Plot the results
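Roughly (recent Plotly versions accept a Polars DataFrame directly):

```python
fig = px.bar(
    category_summary_pl,
    x="category",
    y="total_sales",
    title="Total Sales by Category",
)
fig.show()
```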
3. Joining
The third section will demonstrate how to join DataFrames to combine data from different sources. This is a common operation in data analysis, allowing us to enrich our data with additional information.
Here, we will join the sales
DataFrame with the product
DataFrame to get additional information about the products sold.
In Pandas, we can use the pd.merge()
function to combine rows from two or more tables based on a related column between them. In this case, we will join the sales
table with the product
table on the product_id
column.
Join sales with product data
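A sketch, keeping only the product columns that appear in the output below:

```python
sales_with_product = pd.merge(
    df_sales_pd,
    df_product_pd[["product_id", "product_name", "price"]],
    on="product_id",
    how="left",
)
print(f"Sales with Product Information: {len(sales_with_product)}")
print(sales_with_product.head())
```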
Sales with Product Information: 100
date customer_id product_id category sales_amount quantity product_name price
0 2023-01-01 52 45 Food 490.76 7 Product 45 493.14
1 2023-01-02 93 41 Electronics 453.94 5 Product 41 193.39
2 2023-01-03 15 29 Home 994.51 5 Product 29 80.07
3 2023-01-04 72 15 Electronics 184.17 7 Product 15 153.67
4 2023-01-05 61 45 Food 27.89 9 Product 45 493.14
date | customer_id | product_id | category | sales_amount | quantity | product_name | price | |
---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 |
In SQL, we can use the JOIN
clause to combine rows from two or more tables based on a related column between them. In this case, we will join the sales
table with the product
table on the product_id
column.
Join sales with product data
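A sketch of the query:

```python
sales_with_product_sql = pd.read_sql(
    """
    SELECT
        s.*,
        p.product_name,
        p.price
    FROM sales AS s
    INNER JOIN product AS p
        ON s.product_id = p.product_id
    """,
    conn,
)
print(f"Sales with Product Information: {len(sales_with_product_sql)}")
print(sales_with_product_sql.head())
```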
Sales with Product Information: 100
date customer_id product_id category sales_amount quantity product_name price
0 2023-01-01 00:00:00 52 45 Food 490.76 7 Product 45 493.14
1 2023-01-02 00:00:00 93 41 Electronics 453.94 5 Product 41 193.39
2 2023-01-03 00:00:00 15 29 Home 994.51 5 Product 29 80.07
3 2023-01-04 00:00:00 72 15 Electronics 184.17 7 Product 15 153.67
4 2023-01-05 00:00:00 61 45 Food 27.89 9 Product 45 493.14
date | customer_id | product_id | category | sales_amount | quantity | product_name | price | |
---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 |
In PySpark, we can use the .join()
method to combine rows from two or more DataFrames based on a related column between them. In this case, we will join the sales
DataFrame with the product
DataFrame on the product_id
column.
Join sales with product data
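Roughly:

```python
sales_with_product_ps = df_sales_ps.join(
    df_product_ps.select("product_id", "product_name", "price"),
    on="product_id",
    how="left",
)
print(f"Sales with Product Information: {sales_with_product_ps.count()}")
sales_with_product_ps.show(5)
```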
Sales with Product Information: 100
+----------+-------------------+-----------+-----------+------------+--------+------------+------+
|product_id| date|customer_id| category|sales_amount|quantity|product_name| price|
+----------+-------------------+-----------+-----------+------------+--------+------------+------+
| 1|2023-01-06 00:00:00| 21| Clothing| 498.95| 5| Product 1|257.57|
| 1|2023-01-11 00:00:00| 88|Electronics| 314.98| 9| Product 1|257.57|
| 1|2023-02-11 00:00:00| 55| Food| 199.0| 5| Product 1|257.57|
| 1|2023-04-04 00:00:00| 85| Food| 146.97| 7| Product 1|257.57|
| 5|2023-01-21 00:00:00| 64|Electronics| 356.58| 5| Product 5|200.71|
+----------+-------------------+-----------+-----------+------------+--------+------------+------+
only showing top 5 rows
product_id | date | customer_id | category | sales_amount | quantity | product_name | price | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 2023-01-11 00:00:00 | 88 | Electronics | 314.98 | 9 | Product 1 | 257.57 |
1 | 1 | 2023-02-11 00:00:00 | 55 | Food | 199 | 5 | Product 1 | 257.57 |
2 | 5 | 2023-01-21 00:00:00 | 64 | Electronics | 356.58 | 5 | Product 5 | 200.71 |
3 | 5 | 2023-02-18 00:00:00 | 39 | Books | 79.71 | 8 | Product 5 | 200.71 |
4 | 6 | 2023-03-23 00:00:00 | 34 | Electronics | 48.45 | 8 | Product 6 | 15.31 |
In Polars, we can use the .join()
method to combine rows from two or more DataFrames based on a related column between them. In this case, we will join the sales
DataFrame with the product
DataFrame on the product_id
column.
Join sales with product data
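A minimal sketch, assuming the `sales` and `product` Polars DataFrames created in the setup section (variable names are assumptions):

```python
# Join the sales DataFrame to the product DataFrame on the shared key column.
sales_with_product = sales.join(product, on="product_id", how="inner")

print(f"Sales with Product Information: {sales_with_product.height}")
print(sales_with_product.head(5))
```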
Sales with Product Information: 100
shape: (5, 8)
┌─────────────────────┬─────────────┬────────────┬─────────────┬──────────────┬──────────┬──────────────┬────────┐
│ date                │ customer_id │ product_id │ category    │ sales_amount │ quantity │ product_name │ price  │
│ ---                 │ ---         │ ---        │ ---         │ ---          │ ---      │ ---          │ ---    │
│ datetime[ns]        │ i64         │ i64        │ str         │ f64          │ i64      │ str          │ f64    │
╞═════════════════════╪═════════════╪════════════╪═════════════╪══════════════╪══════════╪══════════════╪════════╡
│ 2023-01-01 00:00:00 │ 52          │ 45         │ Food        │ 490.76       │ 7        │ Product 45   │ 493.14 │
│ 2023-01-02 00:00:00 │ 93          │ 41         │ Electronics │ 453.94       │ 5        │ Product 41   │ 193.39 │
│ 2023-01-03 00:00:00 │ 15          │ 29         │ Home        │ 994.51       │ 5        │ Product 29   │ 80.07  │
│ 2023-01-04 00:00:00 │ 72          │ 15         │ Electronics │ 184.17       │ 7        │ Product 15   │ 153.67 │
│ 2023-01-05 00:00:00 │ 61          │ 45         │ Food        │ 27.89        │ 9        │ Product 45   │ 493.14 │
└─────────────────────┴─────────────┴────────────┴─────────────┴──────────────┴──────────┴──────────────┴────────┘
date | customer_id | product_id | category | sales_amount | quantity | product_name | price | |
---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 |
In the next step, we will join the resulting DataFrame with the customer
DataFrame to get customer information for each sale. This allows us to create a complete view of the sales data, including product and customer details.
This process is similar to the previous step, but now we will extend the sales_with_product
DataFrame to join it with the customer
DataFrame on the customer_id
column. This will give us a complete view of the sales data, including product and customer details.
Join with customer information to get a complete view
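A minimal sketch, assuming the `sales_with_product` DataFrame from the previous step and the `customer` DataFrame from the setup section (names are assumptions):

```python
# Extend the joined sales/product data with the customer attributes.
complete_sales = sales_with_product.merge(
    customer,           # assumed to hold customer_id, customer_name, city, state
    on="customer_id",
    how="inner",
)

print(f"Complete Sales Data with Customer Information: {len(complete_sales)}")
print(complete_sales.head())
```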
Complete Sales Data with Customer Information: 100
date customer_id product_id category sales_amount quantity product_name price customer_name city state
0 2023-01-01 52 45 Food 490.76 7 Product 45 493.14 Customer 52 Phoenix TX
1 2023-01-02 93 41 Electronics 453.94 5 Product 41 193.39 Customer 93 New York TX
2 2023-01-03 15 29 Home 994.51 5 Product 29 80.07 Customer 15 New York CA
3 2023-01-04 72 15 Electronics 184.17 7 Product 15 153.67 Customer 72 Houston IL
4 2023-01-05 61 45 Food 27.89 9 Product 45 493.14 Customer 61 Phoenix IL
date | customer_id | product_id | category | sales_amount | quantity | product_name | price | customer_name | city | state | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 | Customer 52 | Phoenix | TX |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 | Customer 93 | New York | TX |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 | Customer 15 | New York | CA |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 | Customer 72 | Houston | IL |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 | Customer 61 | Phoenix | IL |
This process is similar to the previous step, but now we will extend the sales_with_product
DataFrame to join it with the customer
DataFrame on the customer_id
column. This will give us a complete view of the sales data, including product and customer details.
Join with customer information to get a complete view
Complete Sales Data with Customer Information: 100
date customer_id product_id category sales_amount quantity product_name price customer_name city state
0 2023-01-01 00:00:00 52 45 Food 490.76 7 Product 45 493.14 Customer 52 Phoenix TX
1 2023-01-02 00:00:00 93 41 Electronics 453.94 5 Product 41 193.39 Customer 93 New York TX
2 2023-01-03 00:00:00 15 29 Home 994.51 5 Product 29 80.07 Customer 15 New York CA
3 2023-01-04 00:00:00 72 15 Electronics 184.17 7 Product 15 153.67 Customer 72 Houston IL
4 2023-01-05 00:00:00 61 45 Food 27.89 9 Product 45 493.14 Customer 61 Phoenix IL
date | customer_id | product_id | category | sales_amount | quantity | product_name | price | customer_name | city | state | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 | Customer 52 | Phoenix | TX |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 | Customer 93 | New York | TX |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 | Customer 15 | New York | CA |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 | Customer 72 | Houston | IL |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 | Customer 61 | Phoenix | IL |
This process is similar to the previous step, but now we will extend the sales_with_product
DataFrame to join it with the customer
DataFrame on the customer_id
column. This will give us a complete view of the sales data, including product and customer details.
Join with customer information to get a complete view
Complete Sales Data with Customer Information: 100
+-----------+----------+-------------------+-----------+------------+--------+------------+------+-------------+-----------+-----+
|customer_id|product_id| date| category|sales_amount|quantity|product_name| price|customer_name| city|state|
+-----------+----------+-------------------+-----------+------------+--------+------------+------+-------------+-----------+-----+
| 39| 5|2023-02-18 00:00:00| Books| 79.71| 8| Product 5|200.71| Customer 39|Los Angeles| NY|
| 88| 1|2023-01-11 00:00:00|Electronics| 314.98| 9| Product 1|257.57| Customer 88|Los Angeles| TX|
| 85| 1|2023-04-04 00:00:00| Food| 146.97| 7| Product 1|257.57| Customer 85| Phoenix| CA|
| 55| 1|2023-02-11 00:00:00| Food| 199.0| 5| Product 1|257.57| Customer 55|Los Angeles| NY|
| 21| 1|2023-01-06 00:00:00| Clothing| 498.95| 5| Product 1|257.57| Customer 21|Los Angeles| IL|
+-----------+----------+-------------------+-----------+------------+--------+------------+------+-------------+-----------+-----+
only showing top 5 rows
customer_id | product_id | date | category | sales_amount | quantity | product_name | price | customer_name | city | state | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 88 | 1 | 2023-01-11 00:00:00 | Electronics | 314.98 | 9 | Product 1 | 257.57 | Customer 88 | Los Angeles | TX |
1 | 55 | 1 | 2023-02-11 00:00:00 | Food | 199 | 5 | Product 1 | 257.57 | Customer 55 | Los Angeles | NY |
2 | 64 | 5 | 2023-01-21 00:00:00 | Electronics | 356.58 | 5 | Product 5 | 200.71 | Customer 64 | Los Angeles | NY |
3 | 39 | 5 | 2023-02-18 00:00:00 | Books | 79.71 | 8 | Product 5 | 200.71 | Customer 39 | Los Angeles | NY |
4 | 34 | 6 | 2023-03-23 00:00:00 | Electronics | 48.45 | 8 | Product 6 | 15.31 | Customer 34 | Los Angeles | NY |
This process is similar to the previous step, but now we will extend the sales_with_product
DataFrame to join it with the customer
DataFrame on the customer_id
column. This will give us a complete view of the sales data, including product and customer details.
Join with customer information to get a complete view
Complete Sales Data with Customer Information: 100
shape: (5, 11)
┌─────────────────────┬─────────────┬────────────┬─────────────┬──────────────┬──────────┬──────────────┬────────┬───────────────┬──────────┬───────┐
│ date                │ customer_id │ product_id │ category    │ sales_amount │ quantity │ product_name │ price  │ customer_name │ city     │ state │
│ ---                 │ ---         │ ---        │ ---         │ ---          │ ---      │ ---          │ ---    │ ---           │ ---      │ ---   │
│ datetime[ns]        │ i64         │ i64        │ str         │ f64          │ i64      │ str          │ f64    │ str           │ str      │ str   │
╞═════════════════════╪═════════════╪════════════╪═════════════╪══════════════╪══════════╪══════════════╪════════╪═══════════════╪══════════╪═══════╡
│ 2023-01-01 00:00:00 │ 52          │ 45         │ Food        │ 490.76       │ 7        │ Product 45   │ 493.14 │ Customer 52   │ Phoenix  │ TX    │
│ 2023-01-02 00:00:00 │ 93          │ 41         │ Electronics │ 453.94       │ 5        │ Product 41   │ 193.39 │ Customer 93   │ New York │ TX    │
│ 2023-01-03 00:00:00 │ 15          │ 29         │ Home        │ 994.51       │ 5        │ Product 29   │ 80.07  │ Customer 15   │ New York │ CA    │
│ 2023-01-04 00:00:00 │ 72          │ 15         │ Electronics │ 184.17       │ 7        │ Product 15   │ 153.67 │ Customer 72   │ Houston  │ IL    │
│ 2023-01-05 00:00:00 │ 61          │ 45         │ Food        │ 27.89        │ 9        │ Product 45   │ 493.14 │ Customer 61   │ Phoenix  │ IL    │
└─────────────────────┴─────────────┴────────────┴─────────────┴──────────────┴──────────┴──────────────┴────────┴───────────────┴──────────┴───────┘
date | customer_id | product_id | category | sales_amount | quantity | product_name | price | customer_name | city | state | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 52 | 45 | Food | 490.76 | 7 | Product 45 | 493.14 | Customer 52 | Phoenix | TX |
1 | 2023-01-02 00:00:00 | 93 | 41 | Electronics | 453.94 | 5 | Product 41 | 193.39 | Customer 93 | New York | TX |
2 | 2023-01-03 00:00:00 | 15 | 29 | Home | 994.51 | 5 | Product 29 | 80.07 | Customer 15 | New York | CA |
3 | 2023-01-04 00:00:00 | 72 | 15 | Electronics | 184.17 | 7 | Product 15 | 153.67 | Customer 72 | Houston | IL |
4 | 2023-01-05 00:00:00 | 61 | 45 | Food | 27.89 | 9 | Product 45 | 493.14 | Customer 61 | Phoenix | IL |
Once we have the complete sales data, we can calculate the revenue for each sale by multiplying the price and quantity (columns from different tables). We can also compare this calculated revenue with the sales amount to identify any discrepancies.
In Pandas, we can calculate the revenue for each sale by multiplying the price
and quantity
columns. We can then compare this calculated revenue with the sales_amount
column to identify any discrepancies.
Notice here that the syntax for Pandas uses the DataFrame object directly, and we can access the columns using the square-bracket ([]) indexing operator.
Calculate revenue and compare with sales amount
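A minimal sketch, assuming the `complete_sales` DataFrame produced by the join steps above:

```python
# Revenue implied by the product price and quantity sold.
complete_sales["calculated_revenue"] = complete_sales["price"] * complete_sales["quantity"]
# Difference between the recorded sales amount and the calculated revenue.
complete_sales["price_difference"] = complete_sales["sales_amount"] - complete_sales["calculated_revenue"]

print(f"Complete Sales Data with Calculated Revenue and Price Difference: {len(complete_sales)}")
print(complete_sales[["sales_amount", "price", "quantity", "calculated_revenue", "price_difference"]].head())
```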
Complete Sales Data with Calculated Revenue and Price Difference: 100
sales_amount price quantity calculated_revenue price_difference
0 490.76 493.14 7 3451.98 -2961.22
1 453.94 193.39 5 966.95 -513.01
2 994.51 80.07 5 400.35 594.16
3 184.17 153.67 7 1075.69 -891.52
4 27.89 493.14 9 4438.26 -4410.37
sales_amount | price | quantity | calculated_revenue | price_difference | |
---|---|---|---|---|---|
0 | 490.76 | 493.14 | 7 | 3451.98 | -2961.22 |
1 | 453.94 | 193.39 | 5 | 966.95 | -513.01 |
2 | 994.51 | 80.07 | 5 | 400.35 | 594.16 |
3 | 184.17 | 153.67 | 7 | 1075.69 | -891.52 |
4 | 27.89 | 493.14 | 9 | 4438.26 | -4410.37 |
In SQL, we can calculate the revenue for each sale by multiplying the price
and quantity
columns. We can then compare this calculated revenue with the sales_amount
column to identify any discrepancies.
Calculate revenue and compare with sales amount
Complete Sales Data with Calculated Revenue and Price Difference: 100
sales_amount price quantity calculated_revenue price_difference
0 490.76 493.14 7 3451.98 -2961.22
1 453.94 193.39 5 966.95 -513.01
2 994.51 80.07 5 400.35 594.16
3 184.17 153.67 7 1075.69 -891.52
4 27.89 493.14 9 4438.26 -4410.37
sales_amount | price | quantity | calculated_revenue | price_difference | |
---|---|---|---|---|---|
0 | 490.76 | 493.14 | 7 | 3451.98 | -2961.22 |
1 | 453.94 | 193.39 | 5 | 966.95 | -513.01 |
2 | 994.51 | 80.07 | 5 | 400.35 | 594.16 |
3 | 184.17 | 153.67 | 7 | 1075.69 | -891.52 |
4 | 27.89 | 493.14 | 9 | 4438.26 | -4410.37 |
In PySpark, we can calculate the revenue for each sale by multiplying the price
and quantity
columns. We can then compare this calculated revenue with the sales_amount
column to identify any discrepancies.
Notice here that the syntax for PySpark uses the .withColumns
method to add multiple new columns to the DataFrame simultaneously. This method takes a dictionary where the keys are the names of the new columns and the values are the expressions to compute those columns. The mathematical computation we have shown here uses two different methods:
- With the PySpark API, we can use the
F.col()
function to refer to the columns, and multiply them directly - With the Spark SQL API, we can use the
F.expr()
function to write a SQL-like expression for the calculation.
Calculate revenue and compare with sales amount
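A minimal sketch of the two approaches, assuming the `complete_sales` Spark DataFrame from the join steps above:

```python
from pyspark.sql import functions as F

# Sketch only: `complete_sales` is assumed to be the joined DataFrame from the previous step.
complete_sales = complete_sales.withColumns(
    {
        # PySpark API: multiply the columns directly using F.col()
        "calculated_revenue": F.col("price") * F.col("quantity"),
        # Spark SQL API: the same idea written as a SQL-like expression
        "price_difference": F.expr("sales_amount - (price * quantity)"),
    }
)

complete_sales.select(
    "sales_amount", "price", "quantity", "calculated_revenue", "price_difference"
).show(5)
```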
Complete Sales Data with Calculated Revenue and Price Difference: 100
+------------+------+--------+------------------+------------------+
|sales_amount| price|quantity|calculated_revenue| price_difference|
+------------+------+--------+------------------+------------------+
| 79.71|200.71| 8| 1605.68| -1525.97|
| 314.98|257.57| 9| 2318.13| -2003.15|
| 146.97|257.57| 7| 1802.99| -1656.02|
| 199.0|257.57| 5| 1287.85| -1088.85|
| 498.95|257.57| 5| 1287.85|-788.8999999999999|
+------------+------+--------+------------------+------------------+
only showing top 5 rows
sales_amount | price | quantity | calculated_revenue | price_difference | |
---|---|---|---|---|---|
0 | 48.45 | 15.31 | 8 | 122.48 | -74.03 |
1 | 79.71 | 200.71 | 8 | 1605.68 | -1525.97 |
2 | 314.98 | 257.57 | 9 | 2318.13 | -2003.15 |
3 | 199 | 257.57 | 5 | 1287.85 | -1088.85 |
4 | 356.58 | 200.71 | 5 | 1003.55 | -646.97 |
In Polars, we can calculate the revenue for each sale by multiplying the price
and quantity
columns. We can then compare this calculated revenue with the sales_amount
column to identify any discrepancies.
Notice here that the syntax for Polars uses the .with_columns
method to add multiple new columns to the DataFrame simultaneously. This method takes a list of expressions, where each expression defines a new column.
Calculate revenue and compare with sales amount
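A minimal sketch, assuming the `complete_sales` Polars DataFrame from the join steps above:

```python
import polars as pl

complete_sales = complete_sales.with_columns(
    [
        (pl.col("price") * pl.col("quantity")).alias("calculated_revenue"),
        (pl.col("sales_amount") - pl.col("price") * pl.col("quantity")).alias("price_difference"),
    ]
)

print(
    complete_sales.select(
        ["sales_amount", "price", "quantity", "calculated_revenue", "price_difference"]
    ).head(5)
)
```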
Complete Sales Data with Calculated Revenue and Price Difference: 100
┌──────────────┬────────┬──────────┬────────────────────┬──────────────────┐
│ sales_amount │ price  │ quantity │ calculated_revenue │ price_difference │
│ ---          │ ---    │ ---      │ ---                │ ---              │
│ f64          │ f64    │ i64      │ f64                │ f64              │
╞══════════════╪════════╪══════════╪════════════════════╪══════════════════╡
│ 490.76       │ 493.14 │ 7        │ 3451.98            │ -2961.22         │
│ 453.94       │ 193.39 │ 5        │ 966.95             │ -513.01          │
│ 994.51       │ 80.07  │ 5        │ 400.35             │ 594.16           │
│ 184.17       │ 153.67 │ 7        │ 1075.69            │ -891.52          │
│ 27.89        │ 493.14 │ 9        │ 4438.26            │ -4410.37         │
└──────────────┴────────┴──────────┴────────────────────┴──────────────────┘
sales_amount | price | quantity | calculated_revenue | price_difference | |
---|---|---|---|---|---|
0 | 490.76 | 493.14 | 7 | 3451.98 | -2961.22 |
1 | 453.94 | 193.39 | 5 | 966.95 | -513.01 |
2 | 994.51 | 80.07 | 5 | 400.35 | 594.16 |
3 | 184.17 | 153.67 | 7 | 1075.69 | -891.52 |
4 | 27.89 | 493.14 | 9 | 4438.26 | -4410.37 |
4. Window Functions
Window functions are a powerful feature that allow us to perform calculations across a set of rows related to the current row. This is particularly useful for time series data, where we may want to calculate rolling averages, cumulative sums, or other metrics based on previous or subsequent rows.
In this section, we will demonstrate how to use window functions to analyze sales data over time. We will start by converting the date
column to a datetime type, which is necessary for time-based calculations. We will then group the data by date and calculate the total sales for each day.
The first thing that we will do is to group the sales data by date and calculate the total sales for each day. This will give us a daily summary of sales, which we can then use to analyze trends over time.
In Pandas, we can use the .groupby()
method to group the data by the date
column, followed by the .agg()
method to calculate the total sales for each day. This will then set us up for further time-based calculations in the following steps.
Time-based window function
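A minimal sketch, assuming the `sales` DataFrame from the setup section (variable name is an assumption):

```python
import pandas as pd

# Ensure the date column is a datetime type for time-based work.
sales["date"] = pd.to_datetime(sales["date"])

# Group by date and sum the sales amount for each day.
daily_sales = (
    sales.groupby("date", as_index=False)
    .agg(total_sales=("sales_amount", "sum"))
)

print(f"Daily Sales Summary: {len(daily_sales)}")
print(daily_sales.head())
```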
Daily Sales Summary: 100
date total_sales
0 2023-01-01 490.76
1 2023-01-02 453.94
2 2023-01-03 994.51
3 2023-01-04 184.17
4 2023-01-05 27.89
date | total_sales | |
---|---|---|
0 | 2023-01-01 | 490.76 |
1 | 2023-01-02 | 453.94 |
2 | 2023-01-03 | 994.51 |
3 | 2023-01-04 | 184.17 |
4 | 2023-01-05 | 27.89 |
In SQL, we can use the GROUP BY
clause to group the data by the date
column and then use the SUM()
function to calculate the total sales for each day. This will give us a daily summary of sales, which we can then use to analyze trends over time.
Time-based window function
Daily Sales Summary: 100
date total_sales
0 2023-01-01 00:00:00 490.76
1 2023-01-02 00:00:00 453.94
2 2023-01-03 00:00:00 994.51
3 2023-01-04 00:00:00 184.17
4 2023-01-05 00:00:00 27.89
date | total_sales | |
---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 |
1 | 2023-01-02 00:00:00 | 453.94 |
2 | 2023-01-03 00:00:00 | 994.51 |
3 | 2023-01-04 00:00:00 | 184.17 |
4 | 2023-01-05 00:00:00 | 27.89 |
In PySpark, we can use the .groupBy()
method to group the data by the date
column, followed by the .agg()
method to calculate the total sales for each day. This will then set us up for further time-based calculations in the following steps.
Time-based window function
Daily Sales Summary: 100
+----------+-----------+
| date|total_sales|
+----------+-----------+
|2023-01-01| 490.76|
|2023-01-02| 453.94|
|2023-01-03| 994.51|
|2023-01-04| 184.17|
|2023-01-05| 27.89|
+----------+-----------+
only showing top 5 rows
date | total_sales | |
---|---|---|
0 | 2023-01-01 | 490.76 |
1 | 2023-01-02 | 453.94 |
2 | 2023-01-03 | 994.51 |
3 | 2023-01-04 | 184.17 |
4 | 2023-01-05 | 27.89 |
In Polars, we can use the .group_by()
method to group the data by the date
column, followed by the .agg()
method to calculate the total sales for each day. This will then set us up for further time-based calculations in the following steps.
Time-based window function
Daily Sales Summary: 100
shape: (5, 2)
┌────────────┬─────────────┐
│ date       │ total_sales │
│ ---        │ ---         │
│ date       │ f64         │
╞════════════╪═════════════╡
│ 2023-01-01 │ 490.76      │
│ 2023-01-02 │ 453.94      │
│ 2023-01-03 │ 994.51      │
│ 2023-01-04 │ 184.17      │
│ 2023-01-05 │ 27.89       │
└────────────┴─────────────┘
date | total_sales | |
---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 |
1 | 2023-01-02 00:00:00 | 453.94 |
2 | 2023-01-03 00:00:00 | 994.51 |
3 | 2023-01-04 00:00:00 | 184.17 |
4 | 2023-01-05 00:00:00 | 27.89 |
Next, we will calculate the lag and lead values for the sales amount. This allows us to compare the current day's sales with the previous and next days' sales.
In Pandas, we can calculate the lag and lead values for the sales amount by using the .shift()
method. This method shifts the values in a column by a specified number of periods, allowing us to create lag and lead columns.
Note that the .shift() method simply shifts the values in the column up or down by a given number of rows, so we can use it to create lag and lead columns. The method does not sort the data itself; it assumes the DataFrame is already in the desired order. If you need to guarantee the order, call the .sort_values() method before applying .shift().
Calculate lag and lead
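A minimal sketch, assuming the `daily_sales` DataFrame produced by the grouping step above:

```python
# Sort first so the lag/lead values follow chronological order.
daily_sales = daily_sales.sort_values("date")

daily_sales["previous_day_sales"] = daily_sales["total_sales"].shift(1)   # lag
daily_sales["next_day_sales"] = daily_sales["total_sales"].shift(-1)      # lead

print(daily_sales.head())
```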
Daily Sales with Lag and Lead: 100
date total_sales previous_day_sales next_day_sales
0 2023-01-01 490.76 NaN 453.94
1 2023-01-02 453.94 490.76 994.51
2 2023-01-03 994.51 453.94 184.17
3 2023-01-04 184.17 994.51 27.89
4 2023-01-05 27.89 184.17 498.95
date | total_sales | previous_day_sales | next_day_sales | |
---|---|---|---|---|
0 | 2023-01-01 | 490.76 | nan | 453.94 |
1 | 2023-01-02 | 453.94 | 490.76 | 994.51 |
2 | 2023-01-03 | 994.51 | 453.94 | 184.17 |
3 | 2023-01-04 | 184.17 | 994.51 | 27.89 |
4 | 2023-01-05 | 27.89 | 184.17 | 498.95 |
In SQL, we can use the LAG()
and LEAD()
window functions to calculate the lag and lead values for the sales amount. These functions allow us to access data from previous and next rows in the result set without needing to join the table to itself.
The part that is important to note here is that the LAG()
and LEAD()
functions are used in conjunction with the OVER
clause, which defines the window over which the function operates. In this case, we are ordering by the date
column to ensure that the lag and lead values are calculated based on the chronological order of the sales data.
Calculate lag and lead
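A minimal sketch of the query; the daily aggregation is done in a sub-query here, `conn` stands in for the connection from the setup section, and the exact dialect may differ in the notebook.

```python
import pandas as pd

query = """
SELECT
    sale_date,
    total_sales,
    LAG(total_sales)  OVER (ORDER BY sale_date) AS previous_day_sales,
    LEAD(total_sales) OVER (ORDER BY sale_date) AS next_day_sales
FROM (
    SELECT date AS sale_date, SUM(sales_amount) AS total_sales
    FROM sales
    GROUP BY date
) AS daily
ORDER BY sale_date
"""

daily_sales = pd.read_sql(query, conn)
print(daily_sales.head())
```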
Daily Sales with Lag and Lead: 100
sale_date total_sales previous_day_sales next_day_sales
0 2023-01-01 00:00:00 490.76 NaN 453.94
1 2023-01-02 00:00:00 453.94 490.76 994.51
2 2023-01-03 00:00:00 994.51 453.94 184.17
3 2023-01-04 00:00:00 184.17 994.51 27.89
4 2023-01-05 00:00:00 27.89 184.17 498.95
sale_date | total_sales | previous_day_sales | next_day_sales | |
---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 | nan | 453.94 |
1 | 2023-01-02 00:00:00 | 453.94 | 490.76 | 994.51 |
2 | 2023-01-03 00:00:00 | 994.51 | 453.94 | 184.17 |
3 | 2023-01-04 00:00:00 | 184.17 | 994.51 | 27.89 |
4 | 2023-01-05 00:00:00 | 27.89 | 184.17 | 498.95 |
In PySpark, we can use the .lag()
and .lead()
functions to calculate the lag and lead values for the sales amount. These functions are used in conjunction with a window specification that defines the order of the rows.
Note that in PySpark, we can define a Window function in one of two ways: using the PySpark API or using the Spark SQL API.
- The PySpark API: The PySpark API allows us to define a window specification using the
Window()
class, which provides methods to specify the ordering of the rows. We can then use theF.lag()
andF.lead()
functions to calculate the lag and lead values over a given window on the table. - The Spark SQL API: The Spark SQL API is used through the
F.expr()
function, which allows us to write SQL-like expressions for the calculations. This is similar to how we would write SQL queries, but it is executed within the PySpark context.
In the example below, we show how the previous day's sales can be calculated using the .lag() function in the PySpark API, and the next day's sales using the LEAD() function in the Spark SQL API. Functionally, both methods achieve the same result; they simply use slightly different syntax, so which one you choose is primarily a matter of preference.
Calculate lag and lead
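A minimal sketch, assuming the `daily_sales` Spark DataFrame from the grouping step above:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Window ordered by date, so lag/lead follow chronological order.
window_spec = Window.orderBy("date")

daily_sales = (
    daily_sales
    # PySpark API: lag over the ordered window
    .withColumn("previous_day_sales", F.lag("total_sales", 1).over(window_spec))
    # Spark SQL API: the equivalent lead written as a SQL-like expression
    .withColumn("next_day_sales", F.expr("LEAD(total_sales, 1) OVER (ORDER BY date)"))
)

daily_sales.show(5)
```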
Daily Sales with Lag and Lead: 100
+----------+-----------+------------------+--------------+
| date|total_sales|previous_day_sales|next_day_sales|
+----------+-----------+------------------+--------------+
|2023-01-01| 490.76| NULL| 453.94|
|2023-01-02| 453.94| 490.76| 994.51|
|2023-01-03| 994.51| 453.94| 184.17|
|2023-01-04| 184.17| 994.51| 27.89|
|2023-01-05| 27.89| 184.17| 498.95|
+----------+-----------+------------------+--------------+
only showing top 5 rows
date | total_sales | previous_day_sales | next_day_sales | |
---|---|---|---|---|
0 | 2023-01-01 | 490.76 | nan | 453.94 |
1 | 2023-01-02 | 453.94 | 490.76 | 994.51 |
2 | 2023-01-03 | 994.51 | 453.94 | 184.17 |
3 | 2023-01-04 | 184.17 | 994.51 | 27.89 |
4 | 2023-01-05 | 27.89 | 184.17 | 498.95 |
In Polars, we can use the .shift()
method to calculate the lag and lead values for the sales amount. This method shifts the values in a column by a specified number of periods, allowing us to create lag and lead columns.
Note that the .shift() method simply shifts the values in the column up or down by a given number of rows, so we can use it to create lag and lead columns. The method does not sort the data itself; it assumes the DataFrame is already in the desired order. If you need to guarantee the order, call the .sort() method before applying .shift().
Calculate lag and lead
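A minimal sketch, assuming the `daily_sales` Polars DataFrame from the grouping step above:

```python
import polars as pl

daily_sales = daily_sales.sort("date").with_columns(
    [
        pl.col("total_sales").shift(1).alias("previous_day_sales"),   # lag
        pl.col("total_sales").shift(-1).alias("next_day_sales"),      # lead
    ]
)

print(daily_sales.head(5))
```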
Daily Sales with Lag and Lead: 100
shape: (5, 4)
┌────────────┬─────────────┬────────────────────┬────────────────┐
│ date       │ total_sales │ previous_day_sales │ next_day_sales │
│ ---        │ ---         │ ---                │ ---            │
│ date       │ f64         │ f64                │ f64            │
╞════════════╪═════════════╪════════════════════╪════════════════╡
│ 2023-01-01 │ 490.76      │ null               │ 453.94         │
│ 2023-01-02 │ 453.94      │ 490.76             │ 994.51         │
│ 2023-01-03 │ 994.51      │ 453.94             │ 184.17         │
│ 2023-01-04 │ 184.17      │ 994.51             │ 27.89          │
│ 2023-01-05 │ 27.89       │ 184.17             │ 498.95         │
└────────────┴─────────────┴────────────────────┴────────────────┘
date | total_sales | previous_day_sales | next_day_sales | |
---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 | nan | 453.94 |
1 | 2023-01-02 00:00:00 | 453.94 | 490.76 | 994.51 |
2 | 2023-01-03 00:00:00 | 994.51 | 453.94 | 184.17 |
3 | 2023-01-04 00:00:00 | 184.17 | 994.51 | 27.89 |
4 | 2023-01-05 00:00:00 | 27.89 | 184.17 | 498.95 |
Now, we can calculate the day-over-day change in sales. This is done by subtracting the previous day's sales from the current day's sales. Then secondly, we can calculate the percentage change in sales using the formula:
((current_day_sales - previous_day_sales) / previous_day_sales) * 100
In Pandas, we can calculate the day-over-day change in sales by subtracting the previous_day_sales
column from the total_sales
column. This is a fairly straight-forward calculation.
We can also calculate the percentage change in sales using the .pct_change()
method, which calculates the percentage change between the current and previous values. Under the hood, this method calculates the fractional change using the formula:
((value_current_row - value_previous_row) / value_previous_row)
Therefore, we need to multiply the result by 100 to express it as a percentage.
Calculate day-over-day change
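A minimal sketch, assuming the `daily_sales` DataFrame with the lag/lead columns added in the previous step:

```python
# Absolute change from the previous day.
daily_sales["day_over_day_change"] = daily_sales["total_sales"] - daily_sales["previous_day_sales"]

# Fractional change from the previous row, multiplied by 100 to get a percentage.
daily_sales["pct_change"] = daily_sales["total_sales"].pct_change() * 100

print(daily_sales.head())
```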
Daily Sales with Day-over-Day Change: 100
date total_sales previous_day_sales next_day_sales day_over_day_change day_over_day_change 7d_moving_avg
0 2023-01-01 490.76 NaN 453.94 NaN NaN 490.760000
1 2023-01-02 453.94 490.76 994.51 -36.82 -36.82 472.350000
2 2023-01-03 994.51 453.94 184.17 540.57 540.57 646.403333
3 2023-01-04 184.17 994.51 27.89 -810.34 -810.34 530.845000
4 2023-01-05 27.89 184.17 498.95 -156.28 -156.28 430.254000
date | total_sales | previous_day_sales | next_day_sales | pct_change | day_over_day_change | 7d_moving_avg | |
---|---|---|---|---|---|---|---|
0 | 2023-01-01 | 490.76 | nan | 453.94 | nan | nan | 490.76 |
1 | 2023-01-02 | 453.94 | 490.76 | 994.51 | -7.50265 | -36.82 | 472.35 |
2 | 2023-01-03 | 994.51 | 453.94 | 184.17 | 119.084 | 540.57 | 646.403 |
3 | 2023-01-04 | 184.17 | 994.51 | 27.89 | -81.4813 | -810.34 | 530.845 |
4 | 2023-01-05 | 27.89 | 184.17 | 498.95 | -84.8564 | -156.28 | 430.254 |
In SQL, we can calculate the day-over-day change in sales by subtracting the previous_day_sales
column from the total_sales
column. We can also calculate the percentage change in sales using the formula:
((current_day_sales - previous_day_sales) / previous_day_sales) * 100
Day-over-day change already calculated
Daily Sales with Day-over-Day Change: 100
sale_date total_sales previous_day_sales next_day_sales day_over_day_change pct_change
0 2023-01-01 00:00:00 490.76 NaN 453.94 NaN NaN
1 2023-01-02 00:00:00 453.94 490.76 994.51 -36.82 -7.502649
2 2023-01-03 00:00:00 994.51 453.94 184.17 540.57 119.084020
3 2023-01-04 00:00:00 184.17 994.51 27.89 -810.34 -81.481333
4 2023-01-05 00:00:00 27.89 184.17 498.95 -156.28 -84.856383
sale_date | total_sales | previous_day_sales | next_day_sales | day_over_day_change | pct_change | |
---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 | nan | 453.94 | nan | nan |
1 | 2023-01-02 00:00:00 | 453.94 | 490.76 | 994.51 | -36.82 | -7.50265 |
2 | 2023-01-03 00:00:00 | 994.51 | 453.94 | 184.17 | 540.57 | 119.084 |
3 | 2023-01-04 00:00:00 | 184.17 | 994.51 | 27.89 | -810.34 | -81.4813 |
4 | 2023-01-05 00:00:00 | 27.89 | 184.17 | 498.95 | -156.28 | -84.8564 |
In PySpark, we can calculate the day-over-day change in sales by subtracting the previous_day_sales
column from the total_sales
column. We can also calculate the percentage change in sales using the formula:
((current_day_sales - previous_day_sales) / previous_day_sales) * 100
Here, we have again shown these calculations using two different methods: the PySpark API and the Spark SQL API. Either method can be used to achieve the same result.
Calculate day-over-day change
Daily Sales with Day-over-Day Change: 100
+----------+-----------+------------------+--------------+-------------------+------------------+
| date|total_sales|previous_day_sales|next_day_sales|day_over_day_change| pct_change|
+----------+-----------+------------------+--------------+-------------------+------------------+
|2023-01-01| 490.76| NULL| 453.94| NULL| NULL|
|2023-01-02| 453.94| 490.76| 994.51| -36.81999999999999|-7.502648952644875|
|2023-01-03| 994.51| 453.94| 184.17| 540.5699999999999|119.08401991452612|
|2023-01-04| 184.17| 994.51| 27.89| -810.34|-81.48133251551015|
|2023-01-05| 27.89| 184.17| 498.95|-156.27999999999997|-84.85638268990606|
+----------+-----------+------------------+--------------+-------------------+------------------+
only showing top 5 rows
date | total_sales | previous_day_sales | next_day_sales | day_over_day_change | pct_change | |
---|---|---|---|---|---|---|
0 | 2023-01-01 | 490.76 | nan | 453.94 | nan | nan |
1 | 2023-01-02 | 453.94 | 490.76 | 994.51 | -36.82 | -7.50265 |
2 | 2023-01-03 | 994.51 | 453.94 | 184.17 | 540.57 | 119.084 |
3 | 2023-01-04 | 184.17 | 994.51 | 27.89 | -810.34 | -81.4813 |
4 | 2023-01-05 | 27.89 | 184.17 | 498.95 | -156.28 | -84.8564 |
In Polars, we can calculate the day-over-day change in sales by subtracting the previous_day_sales
column from the total_sales
column. We can also calculate the percentage change in sales using the formula:
((current_day_sales - previous_day_sales) / previous_day_sales) * 100
Calculate day-over-day change
Daily Sales with Day-over-Day Change: 100
shape: (5, 6)
┌────────────┬─────────────┬────────────────────┬────────────────┬─────────────────────┬────────────┐
│ date       │ total_sales │ previous_day_sales │ next_day_sales │ day_over_day_change │ pct_change │
│ ---        │ ---         │ ---                │ ---            │ ---                 │ ---        │
│ date       │ f64         │ f64                │ f64            │ f64                 │ f64        │
╞════════════╪═════════════╪════════════════════╪════════════════╪═════════════════════╪════════════╡
│ 2023-01-01 │ 490.76      │ null               │ 453.94         │ null                │ null       │
│ 2023-01-02 │ 453.94      │ 490.76             │ 994.51         │ -36.82              │ -7.502649  │
│ 2023-01-03 │ 994.51      │ 453.94             │ 184.17         │ 540.57              │ 119.08402  │
│ 2023-01-04 │ 184.17      │ 994.51             │ 27.89          │ -810.34             │ -81.481333 │
│ 2023-01-05 │ 27.89       │ 184.17             │ 498.95         │ -156.28             │ -84.856383 │
└────────────┴─────────────┴────────────────────┴────────────────┴─────────────────────┴────────────┘
date | total_sales | previous_day_sales | next_day_sales | day_over_day_change | pct_change | |
---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 | nan | 453.94 | nan | nan |
1 | 2023-01-02 00:00:00 | 453.94 | 490.76 | 994.51 | -36.82 | -7.50265 |
2 | 2023-01-03 00:00:00 | 994.51 | 453.94 | 184.17 | 540.57 | 119.084 |
3 | 2023-01-04 00:00:00 | 184.17 | 994.51 | 27.89 | -810.34 | -81.4813 |
4 | 2023-01-05 00:00:00 | 27.89 | 184.17 | 498.95 | -156.28 | -84.8564 |
Next, we will calculate the rolling average of sales over a 7-day window. Rolling averages (aka moving averages) are useful for smoothing out short-term fluctuations and highlighting longer-term trends in the data. This is particularly useful in time series analysis, where we want to understand the underlying trend in the data without being overly influenced by short-term variations. It is also a very common technique used in financial analysis to analyze stock prices, sales data, and other time series data.
In Pandas, we can calculate the 7-day moving average of sales using the .rolling()
method. This method allows us to specify a window size (in this case, window=7
which is 7 days) and calculate the mean over that window. The min_periods
parameter ensures that we get a value even if there are fewer than 7 days of data available at the start of the series. Finally, the .mean()
method calculates the average over the specified window.
Calculate 7-day moving average
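A minimal sketch, assuming the `daily_sales` DataFrame sorted by date:

```python
# 7-day rolling mean; min_periods=1 yields a value even for the first few days.
daily_sales["7d_moving_avg"] = (
    daily_sales["total_sales"]
    .rolling(window=7, min_periods=1)
    .mean()
)

print(daily_sales.head())
```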
Daily Sales with 7-Day Moving Average: 100
date total_sales previous_day_sales next_day_sales day_over_day_change pct_change 7d_moving_avg
0 2023-01-01 490.76 NaN 453.94 NaN NaN 490.760000
1 2023-01-02 453.94 490.76 994.51 -36.82 -7.502649 472.350000
2 2023-01-03 994.51 453.94 184.17 540.57 119.084020 646.403333
3 2023-01-04 184.17 994.51 27.89 -810.34 -81.481333 530.845000
4 2023-01-05 27.89 184.17 498.95 -156.28 -84.856383 430.254000
date | total_sales | previous_day_sales | next_day_sales | day_over_day_change | pct_change | 7d_moving_avg | |
---|---|---|---|---|---|---|---|
0 | 2023-01-01 | 490.76 | nan | 453.94 | nan | nan | 490.76 |
1 | 2023-01-02 | 453.94 | 490.76 | 994.51 | -36.82 | -7.50265 | 472.35 |
2 | 2023-01-03 | 994.51 | 453.94 | 184.17 | 540.57 | 119.084 | 646.403 |
3 | 2023-01-04 | 184.17 | 994.51 | 27.89 | -810.34 | -81.4813 | 530.845 |
4 | 2023-01-05 | 27.89 | 184.17 | 498.95 | -156.28 | -84.8564 | 430.254 |
In SQL, we can calculate the 7-day moving average of sales using the AVG()
window function with the OVER
clause. It is important to include this OVER
clause, because it is what the SQL engine uses to determine that it should be a Window function, rather than a regular aggregate function (which is specified using the GROUP BY
clause).
Here in our example, there are three different parts to the Window function:
- The
ORDER BY
clause: This specifies the order of the rows in the window. In this case, we are ordering by thesale_date
column. - The
ROWS BETWEEN
clause: This specifies the range of rows to include in the window. In this case, the window spans from the 6 preceding rows up to the current row. This means that for each row, the window will include the current row and the 6 rows before it, giving us a total of 7 rows in the window. It is important that you specify the ORDER BY
clause before theROWS BETWEEN
clause to ensure that the correct rows are included in the window. - The
AVG()
function: This calculates the average of thetotal_sales
column over the specified window.
Another point to note here is the use of the sub-query. The sub-query first calculates the daily sales, including the previous and next day sales and the day-over-day change, because the moving average needs to be calculated over the daily totals rather than over the individual sales transactions. The only addition in the outer query is the moving average calculation itself.
Calculate 7-day moving average
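A minimal sketch of the query structure described above; `conn` is a placeholder for the connection from the setup section, and the exact query in the notebook may be laid out differently.

```python
import pandas as pd

query = """
SELECT
    daily.*,
    AVG(total_sales) OVER (
        ORDER BY sale_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS "7d_moving_avg"
FROM (
    SELECT
        date AS sale_date,
        SUM(sales_amount) AS total_sales,
        LAG(SUM(sales_amount))  OVER (ORDER BY date) AS previous_day_sales,
        LEAD(SUM(sales_amount)) OVER (ORDER BY date) AS next_day_sales,
        SUM(sales_amount) - LAG(SUM(sales_amount)) OVER (ORDER BY date) AS day_over_day_change,
        (SUM(sales_amount) - LAG(SUM(sales_amount)) OVER (ORDER BY date))
            / LAG(SUM(sales_amount)) OVER (ORDER BY date) * 100 AS pct_change
    FROM sales
    GROUP BY date
) AS daily
ORDER BY sale_date
"""

daily_sales = pd.read_sql(query, conn)
print(daily_sales.head())
```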
Daily Sales with 7-Day Moving Average: 100
sale_date total_sales previous_day_sales next_day_sales day_over_day_change pct_change 7d_moving_avg
0 2023-01-01 00:00:00 490.76 NaN 453.94 NaN NaN 490.760000
1 2023-01-02 00:00:00 453.94 490.76 994.51 -36.82 -7.502649 472.350000
2 2023-01-03 00:00:00 994.51 453.94 184.17 540.57 119.084020 646.403333
3 2023-01-04 00:00:00 184.17 994.51 27.89 -810.34 -81.481333 530.845000
4 2023-01-05 00:00:00 27.89 184.17 498.95 -156.28 -84.856383 430.254000
sale_date | total_sales | previous_day_sales | next_day_sales | day_over_day_change | pct_change | 7d_moving_avg | |
---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 | nan | 453.94 | nan | nan | 490.76 |
1 | 2023-01-02 00:00:00 | 453.94 | 490.76 | 994.51 | -36.82 | -7.50265 | 472.35 |
2 | 2023-01-03 00:00:00 | 994.51 | 453.94 | 184.17 | 540.57 | 119.084 | 646.403 |
3 | 2023-01-04 00:00:00 | 184.17 | 994.51 | 27.89 | -810.34 | -81.4813 | 530.845 |
4 | 2023-01-05 00:00:00 | 27.89 | 184.17 | 498.95 | -156.28 | -84.8564 | 430.254 |
In PySpark, we can calculate the 7-day moving average of sales using the F.avg()
function in combination with the Window()
class. The Window()
class allows us to define a window specification for the calculation. We can use the .orderBy()
method to specify the order of the rows in the window, and the .rowsBetween()
method to specify the range of rows to include in the window. The F.avg()
function is then able to calculate the average of the total_sales
column over the specified window.
As with many aspects of PySpark, there are multiple ways to achieve the same result. In this case, we can use either the F.avg()
function with the Window()
class, or we can use the SQL expression syntax with the F.expr()
function. Both methods will yield the same result.
Calculate 7-day moving average
Daily Sales with 7-Day Moving Average: 100
+----------+-----------+------------------+--------------+-------------------+------------------+-----------------+-----------------+
| date|total_sales|previous_day_sales|next_day_sales|day_over_day_change| pct_change| 7d_moving_avg| 7d_rolling_avg|
+----------+-----------+------------------+--------------+-------------------+------------------+-----------------+-----------------+
|2023-01-01| 490.76| NULL| 453.94| NULL| NULL| 490.76| 490.76|
|2023-01-02| 453.94| 490.76| 994.51| -36.81999999999999|-7.502648952644875| 472.35| 472.35|
|2023-01-03| 994.51| 453.94| 184.17| 540.5699999999999|119.08401991452612|646.4033333333333|646.4033333333333|
|2023-01-04| 184.17| 994.51| 27.89| -810.34|-81.48133251551015| 530.845| 530.845|
|2023-01-05| 27.89| 184.17| 498.95|-156.27999999999997|-84.85638268990606| 430.254| 430.254|
+----------+-----------+------------------+--------------+-------------------+------------------+-----------------+-----------------+
only showing top 5 rows
date | total_sales | previous_day_sales | next_day_sales | day_over_day_change | pct_change | 7d_moving_avg | 7d_rolling_avg | |
---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 | 490.76 | nan | 453.94 | nan | nan | 490.76 | 490.76 |
1 | 2023-01-02 | 453.94 | 490.76 | 994.51 | -36.82 | -7.50265 | 472.35 | 472.35 |
2 | 2023-01-03 | 994.51 | 453.94 | 184.17 | 540.57 | 119.084 | 646.403 | 646.403 |
3 | 2023-01-04 | 184.17 | 994.51 | 27.89 | -810.34 | -81.4813 | 530.845 | 530.845 |
4 | 2023-01-05 | 27.89 | 184.17 | 498.95 | -156.28 | -84.8564 | 430.254 | 430.254 |
In Polars, we can calculate the 7-day moving average of sales using the .rolling_mean()
method. This method allows us to specify a window size (in this case, window_size=7
which is 7 days) and calculate the mean over that window. The min_samples=1
parameter ensures that we get a value even if there are fewer than 7 days of data available at the start of the series.
Calculate 7-day moving average
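A minimal sketch, assuming the `daily_sales` Polars DataFrame sorted by date:

```python
import polars as pl

daily_sales = daily_sales.with_columns(
    pl.col("total_sales")
    .rolling_mean(window_size=7, min_samples=1)
    .alias("7d_moving_avg")
)

print(daily_sales.head(5))
```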
Daily Sales with 7-Day Moving Average: 100
shape: (5, 7)
┌────────────┬─────────────┬────────────────────┬────────────────┬─────────────────────┬────────────┬───────────────┐
│ date       │ total_sales │ previous_day_sales │ next_day_sales │ day_over_day_change │ pct_change │ 7d_moving_avg │
│ ---        │ ---         │ ---                │ ---            │ ---                 │ ---        │ ---           │
│ date       │ f64         │ f64                │ f64            │ f64                 │ f64        │ f64           │
╞════════════╪═════════════╪════════════════════╪════════════════╪═════════════════════╪════════════╪═══════════════╡
│ 2023-01-01 │ 490.76      │ null               │ 453.94         │ null                │ null       │ 490.76        │
│ 2023-01-02 │ 453.94      │ 490.76             │ 994.51         │ -36.82              │ -7.502649  │ 472.35        │
│ 2023-01-03 │ 994.51      │ 453.94             │ 184.17         │ 540.57              │ 119.08402  │ 646.403333    │
│ 2023-01-04 │ 184.17      │ 994.51             │ 27.89          │ -810.34             │ -81.481333 │ 530.845       │
│ 2023-01-05 │ 27.89       │ 184.17             │ 498.95         │ -156.28             │ -84.856383 │ 430.254       │
└────────────┴─────────────┴────────────────────┴────────────────┴─────────────────────┴────────────┴───────────────┘
date | total_sales | previous_day_sales | next_day_sales | day_over_day_change | pct_change | 7d_moving_avg | |
---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 490.76 | nan | 453.94 | nan | nan | 490.76 |
1 | 2023-01-02 00:00:00 | 453.94 | 490.76 | 994.51 | -36.82 | -7.50265 | 472.35 |
2 | 2023-01-03 00:00:00 | 994.51 | 453.94 | 184.17 | 540.57 | 119.084 | 646.403 |
3 | 2023-01-04 00:00:00 | 184.17 | 994.51 | 27.89 | -810.34 | -81.4813 | 530.845 |
4 | 2023-01-05 00:00:00 | 27.89 | 184.17 | 498.95 | -156.28 | -84.8564 | 430.254 |
Finally, we can visualize the daily sales data along with the 7-day moving average using Plotly. This allows us to see the trends in sales over time and how the moving average smooths out the fluctuations in daily sales.
For this, we will again utilise Plotly to create an interactive line chart that displays both the daily sales and the 7-day moving average. The chart will have the date on the x-axis and the sales amount on the y-axis, with two lines representing the daily sales and the moving average.
The graph will be instantiated using the go.Figure()
class, and using the .add_trace()
method we will add two traces to the figure: one for the daily sales and one for the 7-day moving average. The go.Scatter()
class is used to create the line traces, by defining mode="lines"
to display the data as a line chart.
Finally, we will use the .update_layout()
method to set the titles for the chart, and the position of the legend.
Plotly is easily able to handle Pandas DataFrames, so we can pass the columns from the DataFrame directly to create the traces for the daily sales and the 7-day moving average.
Plot results
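A minimal sketch of the chart, assuming the `daily_sales` DataFrame with the 7-day moving average column; the exact titles and legend placement below are assumptions.

```python
import plotly.graph_objects as go

fig = go.Figure()

# One line trace for the raw daily sales, one for the smoothed moving average.
fig.add_trace(
    go.Scatter(x=daily_sales["date"], y=daily_sales["total_sales"], mode="lines", name="Daily Sales")
)
fig.add_trace(
    go.Scatter(x=daily_sales["date"], y=daily_sales["7d_moving_avg"], mode="lines", name="7-Day Moving Average")
)

fig.update_layout(
    title="Daily Sales with 7-Day Moving Average",   # assumed title text
    xaxis_title="Date",
    yaxis_title="Sales Amount",
    legend=dict(orientation="h", yanchor="bottom", y=1.02),  # assumed legend position
)

fig.show()
```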
Plotly is easily able to handle Pandas DataFrames, so we can pass the columns from the DataFrame directly to create the traces for the daily sales and the 7-day moving average.
Plot results
Plotly is not able to interpret PySpark DataFrames directly, so we need to convert the PySpark DataFrame to a Pandas DataFrame before plotting. This can be done using the .toPandas()
method. We can then pass the columns from the Pandas DataFrame to create the traces for the daily sales and the 7-day moving average.
Plot results
Plotly is easily able to handle Polars DataFrames, so we can pass the columns from the DataFrame directly to create the traces for the daily sales and the 7-day moving average.
Plot results
5. Ranking and Partitioning
The fifth section will demonstrate how to rank and partition data. This is useful for identifying top performers, such as the highest spending customers or the most popular products.
In Pandas, we can use the .rank()
method to rank values in a DataFrame. This method allows us to specify the ranking method (e.g., dense, average, min, max) and whether to rank in ascending or descending order.
Rank customers by total spending
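A minimal sketch, assuming the `sales` DataFrame from the setup section:

```python
# Total spending per customer.
customer_spending = (
    sales.groupby("customer_id", as_index=False)
    .agg(total_spending=("sales_amount", "sum"))
)

# Dense rank, highest spender first.
customer_spending["rank"] = customer_spending["total_spending"].rank(method="dense", ascending=False)
customer_spending = customer_spending.sort_values("rank")

print(f"Customer Spending Summary: {len(customer_spending)}")
print(customer_spending.head())
```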
Customer Spending Summary: 61
customer_id total_spending rank
0 15 2297.55 1.0
1 4 2237.49 2.0
2 62 2177.35 3.0
3 60 2086.09 4.0
4 21 2016.95 5.0
customer_id | total_spending | rank | |
---|---|---|---|
0 | 15 | 2297.55 | 1 |
1 | 4 | 2237.49 | 2 |
2 | 62 | 2177.35 | 3 |
3 | 60 | 2086.09 | 4 |
4 | 21 | 2016.95 | 5 |
In SQL, we can use the DENSE_RANK()
window function to rank values in a query. This function assigns a rank to each row within a partition of a given result set, with no gaps in the ranking values. Note that this function can only be used in conjunction with the OVER
clause, which defines it as a Window function. The ORDER BY
clause within the OVER
clause specifies the order in which the rows are ranked.
Rank customers by total spending
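A minimal sketch of the query; `conn` is again a placeholder for the connection from the setup section, and the exact dialect may differ in the notebook.

```python
import pandas as pd

query = """
SELECT
    customer_id,
    SUM(sales_amount) AS total_spending,
    DENSE_RANK() OVER (ORDER BY SUM(sales_amount) DESC) AS "rank"
FROM sales
GROUP BY customer_id
ORDER BY "rank"
"""

customer_spending = pd.read_sql(query, conn)
print(customer_spending.head())
```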
Customer Spending Summary: 61
customer_id total_spending rank
0 15 2297.55 1
1 4 2237.49 2
2 62 2177.35 3
3 60 2086.09 4
4 21 2016.95 5
customer_id | total_spending | rank | |
---|---|---|---|
0 | 15 | 2297.55 | 1 |
1 | 4 | 2237.49 | 2 |
2 | 62 | 2177.35 | 3 |
3 | 60 | 2086.09 | 4 |
4 | 21 | 2016.95 | 5 |
In PySpark, we can use the F.dense_rank()
function in combination with the Window()
class to rank values in a DataFrame. The Window()
class allows us to define a window specification for the calculation, and the F.dense_rank()
function calculates the dense rank of each row within that window.
Rank customers by total spending
Customer Spending Summary: 61
+-----------+------------------+----+
|customer_id| total_spending|rank|
+-----------+------------------+----+
| 15| 2297.55| 1|
| 4| 2237.49| 2|
| 62| 2177.35| 3|
| 60|2086.0899999999997| 4|
| 21| 2016.95| 5|
+-----------+------------------+----+
customer_id | total_spending | rank | |
---|---|---|---|
0 | 15 | 2297.55 | 1 |
1 | 4 | 2237.49 | 2 |
2 | 62 | 2177.35 | 3 |
3 | 60 | 2086.09 | 4 |
4 | 21 | 2016.95 | 5 |
In Polars, we can use the .rank()
method to rank values in a DataFrame. This method allows us to specify the ranking method (e.g., dense, average, min, max) and whether to rank in ascending or descending order.
Rank customers by total spending
Customer Spending Summary: 61
shape: (5, 3)
┌─────────────┬────────────────┬──────┐
│ customer_id │ total_spending │ rank │
│ ---         │ ---            │ ---  │
│ i64         │ f64            │ u32  │
╞═════════════╪════════════════╪══════╡
│ 15          │ 2297.55        │ 1    │
│ 4           │ 2237.49        │ 2    │
│ 62          │ 2177.35        │ 3    │
│ 60          │ 2086.09        │ 4    │
│ 21          │ 2016.95        │ 5    │
└─────────────┴────────────────┴──────┘
customer_id | total_spending | rank | |
---|---|---|---|
0 | 15 | 2297.55 | 1 |
1 | 4 | 2237.49 | 2 |
2 | 62 | 2177.35 | 3 |
3 | 60 | 2086.09 | 4 |
4 | 21 | 2016.95 | 5 |
Next, we will rank products based on the quantity sold, partitioned by the product category. This will help us identify the most popular products within each category.
In Pandas it is first necessary to group the sales data by category
and product_id
, then aggregate the data for the .sum()
of the quantity
to find the total_quantity
sold for each product. After that, we can use the .rank()
method to rank the products within each category based on the total quantity sold.
It is important to note here that we are implementing the .rank()
method from within an .assign()
method. This is a common pattern in Pandas for creating new columns based on columns that already exist on the DataFrame, without modifying the original DataFrame. Here, we are using the .groupby()
method to group the DataFrame by category
, and then applying the .rank()
method to the total_quantity
column within each category. In this way, we are creating a partitioned DataFrame that ranks products by quantity sold within each category.
Rank products by quantity sold, by category
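A minimal sketch of the aggregate-then-rank pattern described above, assuming the `sales` DataFrame from the setup section:

```python
product_popularity = (
    sales.groupby(["category", "product_id"], as_index=False)
    .agg(total_quantity=("quantity", "sum"))
    .assign(
        # Rank products within each category by total quantity sold (descending).
        rank=lambda df: df.groupby("category")["total_quantity"].rank(method="dense", ascending=False)
    )
    .sort_values(["rank", "category"])
)

print(f"Product Popularity Summary: {len(product_popularity)}")
print(product_popularity.head(10))
```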
Product Popularity Summary: 78
category product_id total_quantity rank
0 Books 11 14 1.0
1 Clothing 7 9 1.0
2 Electronics 37 16 1.0
3 Food 45 34 1.0
4 Home 3 10 1.0
5 Books 28 9 2.0
6 Clothing 35 8 2.0
7 Electronics 35 11 2.0
8 Food 1 16 2.0
9 Home 9 5 2.0
category | product_id | total_quantity | rank | |
---|---|---|---|---|
0 | Books | 11 | 14 | 1 |
1 | Clothing | 7 | 9 | 1 |
2 | Electronics | 37 | 16 | 1 |
3 | Food | 45 | 34 | 1 |
4 | Home | 3 | 10 | 1 |
5 | Books | 28 | 9 | 2 |
6 | Clothing | 35 | 8 | 2 |
7 | Electronics | 35 | 11 | 2 |
8 | Food | 1 | 16 | 2 |
9 | Home | 9 | 5 | 2 |
In SQL, we can use the RANK()
window function to rank products within each category based on the total quantity sold. The PARTITION BY
clause allows us to partition the data by category
, and the ORDER BY
clause specifies the order in which the rows are ranked within each partition.
Rank products by quantity sold, by category
Product Popularity: 78
category product_id total_quantity rank
0 Books 11 14 1
1 Clothing 7 9 1
2 Electronics 37 16 1
3 Food 45 34 1
4 Home 3 10 1
5 Books 28 9 2
6 Clothing 35 8 2
7 Electronics 35 11 2
8 Food 1 16 2
9 Home 48 5 2
category | product_id | total_quantity | rank | |
---|---|---|---|---|
0 | Books | 11 | 14 | 1 |
1 | Clothing | 7 | 9 | 1 |
2 | Electronics | 37 | 16 | 1 |
3 | Food | 45 | 34 | 1 |
4 | Home | 3 | 10 | 1 |
5 | Books | 28 | 9 | 2 |
6 | Clothing | 35 | 8 | 2 |
7 | Electronics | 35 | 11 | 2 |
8 | Food | 1 | 16 | 2 |
9 | Home | 48 | 5 | 2 |
In PySpark, we can use the F.dense_rank()
function in combination with the Window()
class to rank products within each category based on the total quantity sold. We can define the partitioning by using the .partitionBy()
method and passing in the "category"
column. We can then define the ordering by using the .orderBy()
method and passing in the "total_quantity"
expression to order the products by total quantity sold in descending order with the F.desc()
method.
Here, we have also provided an alternative way to define the rank by using the Spark SQL method. The outcome is the same; it is simply written as a SQL-like expression.
Rank products by quantity sold, by category
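A minimal sketch of the two approaches, assuming the `sales` Spark DataFrame from the setup section:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Partition by category, order by total quantity sold (descending).
window_spec = Window.partitionBy("category").orderBy(F.desc("total_quantity"))

product_popularity = (
    sales.groupBy("category", "product_id")
    .agg(F.sum("quantity").alias("total_quantity"))
    # PySpark API: dense rank over the partitioned window
    .withColumn("rank_p", F.dense_rank().over(window_spec))
    # Spark SQL API: the same rank written as a SQL-like expression
    .withColumn("rank_s", F.expr("DENSE_RANK() OVER (PARTITION BY category ORDER BY total_quantity DESC)"))
    .orderBy("rank_p", "category")
)

product_popularity.show(10)
```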
Product Popularity Summary: 78
+-----------+----------+--------------+------+------+
| category|product_id|total_quantity|rank_p|rank_s|
+-----------+----------+--------------+------+------+
| Clothing| 7| 9| 1| 1|
| Books| 11| 14| 1| 1|
|Electronics| 37| 16| 1| 1|
| Food| 45| 34| 1| 1|
| Home| 3| 10| 1| 1|
| Books| 28| 9| 2| 2|
|Electronics| 35| 11| 2| 2|
| Home| 29| 5| 2| 2|
| Home| 48| 5| 2| 2|
| Home| 9| 5| 2| 2|
+-----------+----------+--------------+------+------+
only showing top 10 rows
category | product_id | total_quantity | rank_p | rank_s | |
---|---|---|---|---|---|
0 | Clothing | 7 | 9 | 1 | 1 |
1 | Books | 11 | 14 | 1 | 1 |
2 | Electronics | 37 | 16 | 1 | 1 |
3 | Food | 45 | 34 | 1 | 1 |
4 | Home | 3 | 10 | 1 | 1 |
5 | Books | 28 | 9 | 2 | 2 |
6 | Clothing | 35 | 8 | 2 | 2 |
7 | Electronics | 35 | 11 | 2 | 2 |
8 | Food | 1 | 16 | 2 | 2 |
9 | Home | 29 | 5 | 2 | 2 |
In Polars, we can use the .rank()
method to rank products within each category based on the total quantity sold. We first group the sales data by category
and product_id
, then aggregate the data for the .sum()
of the quantity
to find the total_quantity
sold for each product. After that, we can use the .rank()
method to rank the products within each category based on the total quantity sold. Finally, we can define the partitioning by using the .over()
method and passing in partition_by="category".
Rank products by quantity sold, by category
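A minimal sketch, assuming the `sales` Polars DataFrame from the setup section:

```python
import polars as pl

product_popularity = (
    sales.group_by(["category", "product_id"])
    .agg(pl.col("quantity").sum().alias("total_quantity"))
    .with_columns(
        # Dense rank within each category, highest quantity first.
        pl.col("total_quantity")
        .rank(method="dense", descending=True)
        .over(partition_by="category")
        .alias("rank")
    )
    .sort(["rank", "category"])
)

print(f"Product Popularity Summary: {product_popularity.height}")
print(product_popularity.head(10))
```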
Product Popularity Summary: 78
shape: (10, 4)
┌─────────────┬────────────┬────────────────┬──────┐
│ category    │ product_id │ total_quantity │ rank │
│ ---         │ ---        │ ---            │ ---  │
│ str         │ i64        │ i64            │ u32  │
╞═════════════╪════════════╪════════════════╪══════╡
│ Books       │ 11         │ 14             │ 1    │
│ Clothing    │ 7          │ 9              │ 1    │
│ Electronics │ 37         │ 16             │ 1    │
│ Food        │ 45         │ 34             │ 1    │
│ Home        │ 3          │ 10             │ 1    │
│ Books       │ 28         │ 9              │ 2    │
│ Clothing    │ 35         │ 8              │ 2    │
│ Electronics │ 35         │ 11             │ 2    │
│ Food        │ 1          │ 16             │ 2    │
│ Home        │ 48         │ 5              │ 2    │
└─────────────┴────────────┴────────────────┴──────┘
category | product_id | total_quantity | rank | |
---|---|---|---|---|
0 | Books | 11 | 14 | 1 |
1 | Clothing | 7 | 9 | 1 |
2 | Electronics | 37 | 16 | 1 |
3 | Food | 45 | 34 | 1 |
4 | Home | 3 | 10 | 1 |
5 | Books | 28 | 9 | 2 |
6 | Clothing | 35 | 8 | 2 |
7 | Electronics | 35 | 11 | 2 |
8 | Food | 1 | 16 | 2 |
9 | Home | 48 | 5 | 2 |
Conclusion
This comprehensive guide has demonstrated how to perform essential data querying and manipulation operations across four powerful tools: Pandas, SQL, PySpark, and Polars. Each tool brings unique advantages to the data processing landscape, and understanding their strengths helps you choose the right tool for your specific use case.
Tool Comparison and Use Cases
- Pandas has an extensive ecosystem, making it ideal for:
  - Small to medium datasets (up to millions of rows)
  - Interactive data exploration and visualization
  - Data preprocessing for machine learning workflows
  - Quick statistical analysis and reporting
Pandas remains the go-to choice for exploratory data analysis and rapid prototyping.
- SQL excels in:
  - Working with relational databases and data warehouses
  - Complex joins and subqueries
  - Declarative data transformations
  - Team environments where SQL knowledge is widespread
SQL provides the universal language of data, with unmatched expressiveness for complex queries.
- PySpark is great for when you need:
  - Processing datasets that don't fit in memory (terabytes or larger)
  - Distributed computing across clusters
  - Integration with Hadoop ecosystem components
  - Scalable machine learning with MLlib
PySpark unlocks the power of distributed computing for big data scenarios.
- Polars is particularly valuable for:
  - Large datasets that require fast processing (gigabytes to small terabytes)
  - Performance-critical applications
  - Memory-constrained environments
  - Lazy evaluation and query optimization
Polars emerges as the high-performance alternative with excellent memory efficiency.
Key Techniques Covered
Throughout this guide, we've explored fundamental data manipulation patterns that remain consistent across all tools:
- Data Filtering and Selection - Essential for subsetting data based on conditions
- Grouping and Aggregation - Critical for summarizing data by categories
- Joining and Merging - Necessary for combining data from multiple sources
- Window Functions - Powerful for time-series analysis and advanced calculations
- Ranking and Partitioning - Useful for identifying top performers and comparative analysis
Best Practices and Recommendations
When working with any of these tools, consider these best practices:
- Start with the right tool: Match your tool choice to your data size, infrastructure, and team expertise
- Understand your data: Always examine data types, null values, and distributions before processing
- Optimize for readability: Write clear, well-documented code that your future self and teammates can understand
- Profile performance: Measure execution time and memory usage, especially for large datasets
- Leverage built-in optimizations: Use vectorized operations, avoid loops, and take advantage of lazy evaluation where available
Moving Forward
The data landscape continues to evolve rapidly, with new tools and techniques emerging regularly. The fundamental concepts demonstrated in this guideβfiltering, grouping, joining, and analytical functionsβremain constant across platforms. By mastering these core concepts, you'll be well-equipped to adapt to new tools and technologies as they arise.
Whether you're analyzing customer behavior, processing sensor data, or building machine learning models, the techniques in this guide provide a solid foundation for effective data manipulation. Remember that the best tool is often the one that best fits your specific requirements for performance, scalability, and team capabilities.
Continue practicing with real datasets, explore advanced features of each tool, and stay curious about emerging technologies in the data processing ecosystem. The skills you've learned here will serve as building blocks for increasingly sophisticated data analysis and engineering tasks.