Types

toolbox_pyspark.types

Summary

The types module is used to get, check, and change a DataFrame's column data types.

get_column_types

get_column_types(
    dataframe: psDataFrame, output_type: str = "psDataFrame"
) -> Union[psDataFrame, pdDataFrame]

Summary

This is a convenience function to return the data types from a given table as either a pyspark.sql.DataFrame or pandas.DataFrame.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to be checked.	required
`output_type`	`str`	How should the data be returned? As `pdDataFrame` or `psDataFrame`. For pandas, use one of: `[ "pandas", "pandas.DataFrame", "pd.df", "pd.DataFrame", "pddf", "pdDataFrame", "pd", "pdDF", ]` For pyspark, use one of: `[ "pyspark", "spark.DataFrame", "spark", "pyspark.DataFrame", "ps.df", "ps.DataFrame", "psdf", "psDataFrame", "ps", "psDF", ]` Any other options are invalid. Defaults to `"psDataFrame"`.	`'psDataFrame'`

Raises:

Type	Description
`TypeError`	If any of the inputs passed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`InvalidDataFrameNameError`	If the given value passed to `output_type` is not one of the given valid types.
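
The alias handling described for `output_type` can be sketched in plain Python. This is only an illustration, not the library's implementation: the names `PANDAS_ALIASES`, `PYSPARK_ALIASES`, and `resolve_output_type` are hypothetical, and a plain `ValueError` stands in for the library's own exception. The aliases themselves are copied from the parameter description above.

```python
# Hypothetical alias tuples, copied from the lists in the parameter description.
PANDAS_ALIASES = (
    "pandas", "pandas.DataFrame", "pd.df", "pd.DataFrame",
    "pddf", "pdDataFrame", "pd", "pdDF",
)
PYSPARK_ALIASES = (
    "pyspark", "spark.DataFrame", "spark", "pyspark.DataFrame",
    "ps.df", "ps.DataFrame", "psdf", "psDataFrame", "ps", "psDF",
)


def resolve_output_type(output_type: str = "psDataFrame") -> str:
    """Map an accepted alias to either "pandas" or "pyspark"."""
    if output_type in PANDAS_ALIASES:
        return "pandas"
    if output_type in PYSPARK_ALIASES:
        return "pyspark"
    # ValueError is a stand-in; the library raises its own error type.
    raise ValueError(f"Invalid value for `output_type`: '{output_type}'.")


print(resolve_output_type("pd"))  # pandas
print(resolve_output_type())      # pyspark
```

The check is an exact, case-sensitive membership test, so a value such as `"Pandas"` would be rejected in this sketch.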

Returns:

Type	Description
`Union[psDataFrame, pdDataFrame]`	A DataFrame in which each row represents a column of the original `dataframe` object. It has two columns: the column name from `dataframe`, and the data type of that column in `dataframe`.
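
To make the shape of the result concrete, the following plain-Python sketch turns a `dtypes`-style list of `(name, type)` tuples into one record per column. The helper name `dtypes_to_rows` is hypothetical and is used only for illustration; the column labels `col_name` and `col_type` are the ones shown in the examples below.

```python
from typing import Dict, List, Tuple


def dtypes_to_rows(dtypes: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build one record per column, keyed by `col_name` and `col_type`."""
    return [{"col_name": name, "col_type": dtype} for name, dtype in dtypes]


# Same shape as the `df.dtypes` output of a PySpark DataFrame.
dtypes = [("a", "bigint"), ("b", "string"), ("c", "bigint"), ("d", "string")]
for row in dtypes_to_rows(dtypes):
    print(row)
```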

Examples

Set up
>>> # Imports
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.types import get_column_types
>>>
>>> # Instantiate Spark
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create data
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1, 1, 1, 1],
...             "d": ["2", "2", "2", "2"],
...         }
...     )
... )
>>>
>>> # Check
>>> print(df.dtypes)

Terminal

[
    ("a", "bigint"),
    ("b", "string"),
    ("c", "bigint"),
    ("d", "string"),
]

Example 1: Return PySpark
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | bigint   |
| b        | string   |
| c        | bigint   |
| d        | string   |
+----------+----------+

Conclusion: Successfully printed PySpark output.

Example 2: Return Pandas
>>> print(get_column_types(df, "pd"))

Terminal

   col_name  col_type
0         a    bigint
1         b    string
2         c    bigint
3         d    string

Conclusion: Successfully printed Pandas output.

Example 3: Invalid output
>>> print(get_column_types(df, "foo"))

Terminal

InvalidDataFrameNameError: Invalid value for `output_type`: "foo".
Must be one of: ["pandas.DataFrame", "pandas", "pd.DataFrame", "pd.df", "pddf", "pdDataFrame", "pdDF", "pd", "spark.DataFrame", "pyspark.DataFrame", "pyspark", "spark", "ps.DataFrame", "ps.df", "psdf", "psDataFrame", "psDF", "ps"]

Conclusion: Invalid input.

Source code in src/toolbox_pyspark/types.py

@typechecked
def get_column_types(
    dataframe: psDataFrame,
    output_type: str = "psDataFrame",
) -> Union[psDataFrame, pdDataFrame]:
    """
    !!! note "Summary"
        This is a convenience function to return the data types from a given table as either a `#!py pyspark.sql.DataFrame` or `#!py pandas.DataFrame`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to be checked.

        output_type (str, optional):
            How should the data be returned? As `#!py pdDataFrame` or `#!py psDataFrame`.

            For `#!py pandas`, use one of:

            ```{.sh .shell  title="Terminal"}
            [
                "pandas", "pandas.DataFrame",
                "pd.df",  "pd.DataFrame",
                "pddf",   "pdDataFrame",
                "pd",     "pdDF",
            ]
            ```

            For `#!py pyspark` use one of:

            ```{.sh .shell  title="Terminal"}
            [
                "pyspark", "spark.DataFrame",
                "spark",   "pyspark.DataFrame",
                "ps.df",   "ps.DataFrame",
                "psdf",    "psDataFrame",
                "ps",      "psDF",
            ]
            ```

            Any other options are invalid.<br>
            Defaults to `#!py "psDataFrame"`.

    Raises:
        TypeError:
            If any of the inputs passed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        InvalidDataFrameNameError:
            If the given value passed to `#!py output_type` is not one of the given valid types.

    Returns:
        (Union[psDataFrame, pdDataFrame]):
            The DataFrame where each row represents a column on the original `#!py dataframe` object, and which has two columns:

            1. The column name from `#!py dataframe`; and
            2. The data type for that column in `#!py dataframe`.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> # Imports
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.types import get_column_types
        >>>
        >>> # Instantiate Spark
        >>> spark = SparkSession.builder.getOrCreate()
        >>>
        >>> # Create data
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...             "c": [1, 1, 1, 1],
        ...             "d": ["2", "2", "2", "2"],
        ...         }
        ...     )
        ... )
        >>>
        >>> # Check
        >>> print(df.dtypes)
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        [
            ("a", "bigint"),
            ("b", "string"),
            ("c", "bigint"),
            ("d", "string"),
        ]
        ```
        </div>

        ```{.py .python linenums="1" title="Example 1: Return PySpark"}
        >>> get_column_types(df).show()
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        +----------+----------+
        | col_name | col_type |
        +----------+----------+
        | a        | bigint   |
        | b        | string   |
        | c        | bigint   |
        | d        | string   |
        +----------+----------+
        ```
        !!! success "Conclusion: Successfully printed PySpark output."
        </div>

        ```{.py .python linenums="1" title="Example 2: Return Pandas"}
        >>> print(get_column_types(df, "pd"))
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
           col_name  col_type
        0         a    bigint
        1         b    string
        2         c    bigint
        3         d    string
        ```
        !!! success "Conclusion: Successfully printed Pandas output."
        </div>

        ```{.py .python linenums="1" title="Example 3: Invalid output"}
        >>> print(get_column_types(df, "foo"))
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidDataFrameNameError: Invalid value for `output_type`: "foo".
        Must be one of: ["pandas.DataFrame", "pandas", "pd.DataFrame", "pd.df", "pddf", "pdDataFrame", "pdDF", "pd", "spark.DataFrame", "pyspark.DataFrame", "pyspark", "spark", "ps.DataFrame", "ps.df", "psdf", "psDataFrame", "psDF", "ps"]
        ```
        !!! failure "Conclusion: Invalid input."
        </div>
    """
    if output_type not in VALID_DATAFRAME_NAMES:
        raise InvalidDataFrameNameError(
            f"Invalid value for `output_type`: '{output_type}'.\n"
            f"Must be one of: {VALID_DATAFRAME_NAMES}"
        )
    output = pd.DataFrame(dataframe.dtypes, columns=["col_name", "col_type"])
    if output_type in VALID_PYSPARK_DATAFRAME_NAMES:
        return dataframe.sparkSession.createDataFrame(output)
    else:
        return output

cast_column_to_type

cast_column_to_type(
    dataframe: psDataFrame,
    column: str,
    datatype: Union[str, type, T.DataType],
) -> psDataFrame

Summary

This is a convenience function for casting a single column on a given table to another data type.

Details

At its core, it will call the function like this:

dataframe = dataframe.withColumn(column, F.col(column).cast(datatype))

Wrapping the call in this function adds validation that the column exists and makes it convenient to re-declare the same column.
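
The validate-then-cast pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's source: `MissingColumnError`, `check_column_exists`, and `cast_with_check` are hypothetical names, and the error class only stands in for the library's `ColumnDoesNotExistError`. The casting line is the same `withColumn` call shown in the Details.

```python
from typing import Sequence


class MissingColumnError(LookupError):
    """Hypothetical stand-in for the library's `ColumnDoesNotExistError`."""


def check_column_exists(columns: Sequence[str], column: str) -> None:
    """Raise if `column` is not one of `columns`."""
    if column not in columns:
        raise MissingColumnError(
            f"Column '{column}' does not exist in DataFrame. "
            f"Try one of: {list(columns)}."
        )


def cast_with_check(dataframe, column: str, datatype):
    """Validate first, then cast and return the re-declared DataFrame."""
    check_column_exists(dataframe.columns, column)
    # Imported here so the validation helper above can be used without Spark.
    from pyspark.sql import functions as F

    return dataframe.withColumn(column, F.col(column).cast(datatype))
```

Because the existence check runs before the cast, a misspelled column name fails with a message listing the available columns rather than with a Spark analysis error.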

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to be updated.	required
`column`	`str`	The column to be updated.	required
`datatype`	`Union[str, type, DataType]`	The datatype to be cast to. Must be a valid `pyspark` DataType. Use one of the following: `[ "string", "char", "varchar", "binary", "boolean", "decimal", "float", "double", "byte", "short", "integer", "long", "date", "timestamp", "void", "timestamp_ntz", ]`	required

Raises:

Type	Description
`TypeError`	If any of the inputs passed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If the `column` does not exist within `dataframe.columns`.
`ParseException`	If the given `datatype` is not a valid PySpark DataType.

Returns:

Type	Description
`DataFrame`	The updated DataFrame.

Examples

Set up
>>> # Imports
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.types import cast_column_to_type, get_column_types
>>>
>>> # Instantiate Spark
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create data
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1, 1, 1, 1],
...             "d": ["2", "2", "2", "2"],
...         }
...     )
... )
>>>
>>> # Check
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | bigint   |
| b        | string   |
| c        | bigint   |
| d        | string   |
+----------+----------+

Example 1: Valid casting
>>> df = cast_column_to_type(df, "a", "string")
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | string   |
| b        | string   |
| c        | bigint   |
| d        | string   |
+----------+----------+

Conclusion: Successfully cast column to type.

Example 2: Invalid column
>>> df = cast_column_to_type(df, "x", "string")

Terminal

ColumnDoesNotExistError: Column "x" does not exist in DataFrame.
Try one of: ["a", "b", "c", "d"].

Conclusion: Column x does not exist as a valid column.

Example 3: Invalid datatype
>>> df = cast_column_to_type(df, "b", "foo")

Terminal

ParseException: DataType "foo" is not supported.

Conclusion: Datatype foo is not valid.

cast_columns_to_type

cast_columns_to_type(
    dataframe: psDataFrame,
    columns: Union[str, str_list],
    datatype: Union[str, type, T.DataType],
) -> psDataFrame

Summary

Cast multiple columns to a given type.

Details

An extension of cast_column_to_type() to allow casting of multiple columns simultaneously.
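
One way to picture this extension is a helper that accepts either a single name or a list and applies a single-column cast once per name. The sketch below is illustrative only: `as_list`, `cast_many`, and `cast_one_in_schema` are hypothetical names, and a plain `{column: type}` dict stands in for the DataFrame schema so the example runs without Spark.

```python
from functools import reduce
from typing import Callable, List, TypeVar, Union

T = TypeVar("T")


def as_list(columns: Union[str, List[str]]) -> List[str]:
    """Wrap a single column name in a list; copy a list unchanged."""
    return [columns] if isinstance(columns, str) else list(columns)


def cast_many(
    table: T,
    columns: Union[str, List[str]],
    datatype: str,
    cast_one: Callable[[T, str, str], T],
) -> T:
    """Apply `cast_one` once per column, threading the result through."""
    return reduce(
        lambda acc, col: cast_one(acc, col, datatype), as_list(columns), table
    )


# Toy stand-in: a dict of {column: type} plays the role of the DataFrame schema.
def cast_one_in_schema(schema: dict, column: str, datatype: str) -> dict:
    return {**schema, column: datatype}


schema = {"a": "bigint", "b": "string", "c": "bigint", "d": "string"}
print(cast_many(schema, ["a", "c"], "string", cast_one_in_schema))
# {'a': 'string', 'b': 'string', 'c': 'string', 'd': 'string'}
```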

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to be updated.	required
`columns`	`Union[str, str_list]`	The list of columns to be updated. They must all be valid columns existing on `dataframe`.	required
`datatype`	`Union[str, type, DataType]`	The datatype to be cast to. Must be a valid PySpark DataType. Use one of the following: `[ "string", "char", "varchar", "binary", "boolean", "decimal", "float", "double", "byte", "short", "integer", "long", "date", "timestamp", "void", "timestamp_ntz", ]`	required

Raises:

Type	Description
`TypeError`	If any of the inputs passed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If any of the `columns` do not exist within `dataframe.columns`.
`ParseException`	If the given `datatype` is not a valid PySpark DataType.

Returns:

Type	Description
`DataFrame`	The updated DataFrame.

Examples

Set up
>>> # Imports
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.types import cast_columns_to_type, get_column_types
>>>
>>> # Instantiate Spark
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create data
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1, 1, 1, 1],
...             "d": ["2", "2", "2", "2"],
...         }
...     )
... )
>>>
>>> # Check
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | bigint   |
| b        | string   |
| c        | bigint   |
| d        | string   |
+----------+----------+

Example 1: Basic usage
>>> df = cast_columns_to_type(df, ["a"], "string")
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | string   |
| b        | string   |
| c        | bigint   |
| d        | string   |
+----------+----------+

Conclusion: Successfully cast column to type.

Example 2: Multiple columns
>>> df = cast_columns_to_type(df, ["c", "d"], "string")
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | string   |
| b        | string   |
| c        | string   |
| d        | string   |
+----------+----------+

Conclusion: Successfully cast columns to type.

Example 3: Invalid column
>>> df = cast_columns_to_type(df, ["x", "y"], "string")

Terminal

ColumnDoesNotExistError: Columns ["x", "y"] do not exist in DataFrame.
Try one of: ["a", "b", "c", "d"].

Conclusion: Columns ["x", "y"] do not exist as valid columns.

Example 4: Invalid datatype
>>> df = cast_columns_to_type(df, ["a", "b"], "foo")

Terminal

ParseException: DataType "foo" is not supported.

Conclusion: Datatype foo is not valid.

map_cast_columns_to_type

map_cast_columns_to_type(
    dataframe: psDataFrame,
    columns_type_mapping: dict[
        Union[str, type, T.DataType],
        Union[str, str_list, str_tuple],
    ],
) -> psDataFrame

Summary

Take a dictionary mapping in which the keys are the types and the values are the column(s), and apply it to the given DataFrame.

Details

Applies cast_columns_to_type() and cast_column_to_type() under the hood.
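
The `{type: columns}` mapping can be read as a list of individual casts. The sketch below shows one way to flatten such a mapping into `(column, type)` pairs, which could then be applied one by one. The helper `expand_mapping` is a hypothetical name used for illustration, and nothing here is taken from the library's source.

```python
from typing import Any, Dict, List, Sequence, Tuple, Union

Columns = Union[str, Sequence[str]]


def expand_mapping(mapping: Dict[Any, Columns]) -> List[Tuple[str, Any]]:
    """Flatten `{type: columns}` into `(column, type)` pairs, in mapping order."""
    pairs: List[Tuple[str, Any]] = []
    for datatype, columns in mapping.items():
        # A bare string is a single column name, not a sequence of characters.
        names = [columns] if isinstance(columns, str) else list(columns)
        pairs.extend((name, datatype) for name in names)
    return pairs


mapping = {"int": ["a", "c"], "str": ["b"], "float": "d"}
print(expand_mapping(mapping))
# [('a', 'int'), ('c', 'int'), ('b', 'str'), ('d', 'float')]
```

Treating a bare string as a single column is what allows both `"float": "d"` and `"str": ["b"]` in the same mapping.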

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to transform.	required
`columns_type_mapping`	`Dict[Union[str, type, DataType], Union[str, str_list, str_tuple]]`	The mapping of the columns to manipulate. The format must be: `{type: columns}`. Where the keys are the relevant type to cast to, and the values are the column(s) for casting.	required

Returns:

Type	Description
`DataFrame`	The transformed DataFrame.

Examples

Set up
>>> # Imports
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.types import get_column_types, map_cast_columns_to_type
>>>
>>> # Instantiate Spark
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create data
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1, 1, 1, 1],
...             "d": ["2", "2", "2", "2"],
...         }
...     )
... )
>>>
>>> # Check
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | bigint   |
| b        | string   |
| c        | bigint   |
| d        | string   |
+----------+----------+

Example 1: Basic usage
>>> df = map_cast_columns_to_type(df, {"str": ["a", "c"]})
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | string   |
| b        | string   |
| c        | string   |
| d        | string   |
+----------+----------+

Conclusion: Successfully cast columns to type.

Example 2: Multiple types
>>> df = map_cast_columns_to_type(df, {"int": ["a", "c"], "str": ["b"], "float": "d"})
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | bigint   |
| b        | string   |
| c        | bigint   |
| d        | float    |
+----------+----------+

Conclusion: Successfully cast columns to types.

Example 3: All to single type
>>> df = map_cast_columns_to_type(df, {str: [col for col in df.columns]})
>>> get_column_types(df).show()

Terminal

+----------+----------+
| col_name | col_type |
+----------+----------+
| a        | string   |
| b        | string   |
| c        | string   |
| d        | string   |
+----------+----------+

Conclusion: Successfully cast all columns to type.