Checks

toolbox_pyspark.checks 🔗

Summary

The checks module is used to check and validate various attributed about a given pyspark dataframe.

Column Existence🔗

Summary

The checks module is used to check and validate various attributed about a given pyspark dataframe.

ColumnExistsResult `dataclass` 🔗

Source code in src/toolbox_pyspark/checks.py

@dataclass
class ColumnExistsResult:
    result: bool
    missing_cols: str_list

    def __iter__(self):
        for field in fields(self):
            yield getattr(self, field.name)

init 🔗

__init__(result: bool, missing_cols: str_list) -> None

result `instance-attribute` 🔗

result: bool

missing_cols `instance-attribute` 🔗

missing_cols: str_list

iter 🔗

__iter__()

Source code in src/toolbox_pyspark/checks.py

def __iter__(self):
    for field in fields(self):
        yield getattr(self, field.name)

column_exists 🔗

column_exists(
    dataframe: psDataFrame,
    column: str,
    match_case: bool = False,
) -> bool

Summary

Check whether a given column exists as a valid column within dataframe.columns.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`column`	`str`	The column to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. If `False`, will default to: `column.upper()`. Default: `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.

Returns:

Type	Description
`bool`	`True` if exists or `False` otherwise.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import column_exists
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example1: Column Exists
>>> result = column_exists(df, "a")
>>> print(result)

Terminal

True

Conclusion: Column exists.

Example 2: Column Missing
>>> result = column_exists(df, "c")
>>> print(result)

Terminal

False

Conclusion: Column does not exist.

See Also

column_exists
columns_exists
assert_column_exists
assert_columns_exists
warn_column_missing
warn_columns_missing

Source code in src/toolbox_pyspark/checks.py

@typechecked
def column_exists(
    dataframe: psDataFrame,
    column: str,
    match_case: bool = False,
) -> bool:
    """
    !!! note "Summary"
        Check whether a given `#!py column` exists as a valid column within `#!py dataframe.columns`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        column (str):
            The column to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            If `#!py False`, will default to: `#!py column.upper()`.<br>
            Default: `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
        (bool):
            `#!py True` if exists or `#!py False` otherwise.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import column_exists
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example1: Column Exists"}
        >>> result = column_exists(df, "a")
        >>> print(result)
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        True
        ```
        !!! success "Conclusion: Column exists."
        </div>

        ```{.py .python linenums="1" title="Example 2: Column Missing"}
        >>> result = column_exists(df, "c")
        >>> print(result)
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: Column does not exist."
        </div>

    ??? tip "See Also"
        - [`column_exists`][toolbox_pyspark.checks.column_exists]
        - [`columns_exists`][toolbox_pyspark.checks.columns_exists]
        - [`assert_column_exists`][toolbox_pyspark.checks.assert_column_exists]
        - [`assert_columns_exists`][toolbox_pyspark.checks.assert_columns_exists]
        - [`warn_column_missing`][toolbox_pyspark.checks.warn_column_missing]
        - [`warn_columns_missing`][toolbox_pyspark.checks.warn_columns_missing]
    """
    return _columns_exists(dataframe, [column], match_case).result

columns_exists 🔗

columns_exists(
    dataframe: psDataFrame,
    columns: str_collection,
    match_case: bool = False,
) -> bool

Summary

Check whether all of the values in columns exist in dataframe.columns.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`columns`	`Union[str_list, str_tuple, str_set]`	The columns to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. If `False`, will default to: `[col.upper() for col in columns]`. Default: `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.

Returns:

Type	Description
`bool`	`True` if all columns exist or `False` otherwise.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import columns_exists
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: Columns exist
>>> columns_exists(df, ["a", "b"])

Terminal

True

Conclusion: All columns exist.

Example 2: One column missing
>>> columns_exists(df, ["b", "d"])

Terminal

False

Conclusion: One column is missing.

Example 3: All columns missing
>>> columns_exists(df, ["c", "d"])

Terminal

False

Conclusion: All columns are missing.

See Also

column_exists
columns_exists
assert_column_exists
assert_columns_exists
warn_column_missing
warn_columns_missing

Source code in src/toolbox_pyspark/checks.py

@typechecked
def columns_exists(
    dataframe: psDataFrame,
    columns: str_collection,
    match_case: bool = False,
) -> bool:
    """
    !!! note "Summary"
        Check whether all of the values in `#!py columns` exist in `#!py dataframe.columns`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        columns (Union[str_list, str_tuple, str_set]):
            The columns to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            If `#!py False`, will default to: `#!py [col.upper() for col in columns]`.<br>
            Default: `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
        (bool):
            `#!py True` if all columns exist or `#!py False` otherwise.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import columns_exists
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: Columns exist"}
        >>> columns_exists(df, ["a", "b"])
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        True
        ```
        !!! success "Conclusion: All columns exist."
        </div>

        ```{.py .python linenums="1" title="Example 2: One column missing"}
        >>> columns_exists(df, ["b", "d"])
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: One column is missing."
        </div>

        ```{.py .python linenums="1" title="Example 3: All columns missing"}
        >>> columns_exists(df, ["c", "d"])
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: All columns are missing."
        </div>

    ??? tip "See Also"
        - [`column_exists`][toolbox_pyspark.checks.column_exists]
        - [`columns_exists`][toolbox_pyspark.checks.columns_exists]
        - [`assert_column_exists`][toolbox_pyspark.checks.assert_column_exists]
        - [`assert_columns_exists`][toolbox_pyspark.checks.assert_columns_exists]
        - [`warn_column_missing`][toolbox_pyspark.checks.warn_column_missing]
        - [`warn_columns_missing`][toolbox_pyspark.checks.warn_columns_missing]
    """
    return _columns_exists(dataframe, columns, match_case).result

assert_column_exists 🔗

assert_column_exists(
    dataframe: psDataFrame,
    column: str,
    match_case: bool = False,
) -> None

Summary

Check whether a given column exists as a valid column within dataframe.columns.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`column`	`str`	The column to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. If `False`, will default to: `column.upper()`. Default: `True`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If the `column` does not exist within `dataframe.columns`.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `ColumnDoesNotExistError` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import assert_column_exists
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1,2,3,4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: No error
>>> assert_column_exists(df, "a")

Terminal

None

Conclusion: Column exists.

Example 2: Error raised
>>> assert_column_exists(df, "c")

Terminal

ColumnDoesNotExistError: Column "c" does not exist in "dataframe".
Try one of: ["a", "b"].

Conclusion: Column does not exist.

See Also

column_exists
columns_exists
assert_column_exists
assert_columns_exists
warn_column_missing
warn_columns_missing

Source code in src/toolbox_pyspark/checks.py

@typechecked
def assert_column_exists(
    dataframe: psDataFrame,
    column: str,
    match_case: bool = False,
) -> None:
    """
    !!! note "Summary"
        Check whether a given `#!py column` exists as a valid column within `#!py dataframe.columns`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        column (str):
            The column to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            If `#!py False`, will default to: `#!py column.upper()`.<br>
            Default: `#!py True`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ColumnDoesNotExistError:
            If the `#!py column` does not exist within `#!py dataframe.columns`.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py ColumnDoesNotExistError` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import assert_column_exists
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1,2,3,4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No error"}
        >>> assert_column_exists(df, "a")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        None
        ```
        !!! success "Conclusion: Column exists."
        </div>

        ```{.py .python linenums="1" title="Example 2: Error raised"}
        >>> assert_column_exists(df, "c")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ColumnDoesNotExistError: Column "c" does not exist in "dataframe".
        Try one of: ["a", "b"].
        ```
        !!! failure "Conclusion: Column does not exist."
        </div>

    ??? tip "See Also"
        - [`column_exists`][toolbox_pyspark.checks.column_exists]
        - [`columns_exists`][toolbox_pyspark.checks.columns_exists]
        - [`assert_column_exists`][toolbox_pyspark.checks.assert_column_exists]
        - [`assert_columns_exists`][toolbox_pyspark.checks.assert_columns_exists]
        - [`warn_column_missing`][toolbox_pyspark.checks.warn_column_missing]
        - [`warn_columns_missing`][toolbox_pyspark.checks.warn_columns_missing]
    """
    if not column_exists(dataframe, column, match_case):
        raise ColumnDoesNotExistError(
            f"Column '{column}' does not exist in 'dataframe'.\n"
            f"Try one of: {dataframe.columns}."
        )

assert_columns_exists 🔗

assert_columns_exists(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    match_case: bool = False,
) -> None

Summary

Check whether all of the values in columns exist in dataframe.columns.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`columns`	`Union[str_list, str_tuple, str_set]`	The columns to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. If `False`, will default to: `[col.upper() for col in columns]`. Default: `True`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If any of the `columns` do not exist within `dataframe.columns`.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `ColumnDoesNotExistError` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import assert_columns_exists
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: No error
>>> assert_columns_exists(df, ["a", "b"])

Terminal

None

Conclusion: Columns exist.

Example 2: One column missing
>>> assert_columns_exists(df, ["b", "c"])

Terminal

ColumnDoesNotExistError: Columns ["c"] do not exist in "dataframe".
Try one of: ["a", "b"].

Conclusion: Column "c" does not exist.

Example 3: Multiple columns missing
>>> assert_columns_exists(df, ["b", "c", "d"])

Terminal

ColumnDoesNotExistError: Columns ["c", "d"] do not exist in "dataframe".
Try one of: ["a", "b"].

Conclusion: Columns "c" and "d" does not exist.

See Also

column_exists
columns_exists
assert_column_exists
assert_columns_exists
warn_column_missing
warn_columns_missing

Source code in src/toolbox_pyspark/checks.py

@typechecked
def assert_columns_exists(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    match_case: bool = False,
) -> None:
    """
    !!! note "Summary"
        Check whether all of the values in `#!py columns` exist in `#!py dataframe.columns`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        columns (Union[str_list, str_tuple, str_set]):
            The columns to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            If `#!py False`, will default to: `#!py [col.upper() for col in columns]`.<br>
            Default: `#!py True`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ColumnDoesNotExistError:
            If any of the `#!py columns` do not exist within `#!py dataframe.columns`.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py ColumnDoesNotExistError` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import assert_columns_exists
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No error"}
        >>> assert_columns_exists(df, ["a", "b"])
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        None
        ```
        !!! success "Conclusion: Columns exist."
        </div>

        ```{.py .python linenums="1" title="Example 2: One column missing"}
        >>> assert_columns_exists(df, ["b", "c"])
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ColumnDoesNotExistError: Columns ["c"] do not exist in "dataframe".
        Try one of: ["a", "b"].
        ```
        !!! failure "Conclusion: Column "c" does not exist."
        </div>

        ```{.py .python linenums="1" title="Example 3: Multiple columns missing"}
        >>> assert_columns_exists(df, ["b", "c", "d"])
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ColumnDoesNotExistError: Columns ["c", "d"] do not exist in "dataframe".
        Try one of: ["a", "b"].
        ```
        !!! failure "Conclusion: Columns "c" and "d" does not exist."
        </div>

    ??? tip "See Also"
        - [`column_exists`][toolbox_pyspark.checks.column_exists]
        - [`columns_exists`][toolbox_pyspark.checks.columns_exists]
        - [`assert_column_exists`][toolbox_pyspark.checks.assert_column_exists]
        - [`assert_columns_exists`][toolbox_pyspark.checks.assert_columns_exists]
        - [`warn_column_missing`][toolbox_pyspark.checks.warn_column_missing]
        - [`warn_columns_missing`][toolbox_pyspark.checks.warn_columns_missing]
    """
    columns = [columns] if is_type(columns, str) else columns
    (exist, missing_cols) = _columns_exists(dataframe, columns, match_case)
    if not exist:
        raise ColumnDoesNotExistError(
            f"Columns {missing_cols} do not exist in 'dataframe'.\n"
            f"Try one of: {dataframe.columns}"
        )

warn_column_missing 🔗

warn_column_missing(
    dataframe: psDataFrame,
    column: str,
    match_case: bool = False,
) -> None

Summary

Check whether a given column exists as a valid column within dataframe.columns.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`column`	`str`	The column to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `ColumnDoesNotExistWarning` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import warn_column_missing
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: No error
>>> warn_column_missing(df, ["a", "b"])

Terminal

None

Conclusion: Columns exist.

Example 2: Warning raised
>>> warn_column_missing(df, "c")

Terminal

ColumnDoesNotExistWarning: Column "c" does not exist in "dataframe".
Try one of: ["a", "b"].

Conclusion: Column does not exist.

See Also

column_exists
columns_exists
[assert_column_exists][toolbox_pyspark.checks.assert_column_exists
assert_columns_exists
warn_column_missing
warn_columns_missing

Source code in src/toolbox_pyspark/checks.py

@typechecked
def warn_column_missing(
    dataframe: psDataFrame,
    column: str,
    match_case: bool = False,
) -> None:
    """
    !!! summary "Summary"
        Check whether a given `#!py column` exists as a valid column within `#!py dataframe.columns`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        column (str):
            The column to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py ColumnDoesNotExistWarning` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import warn_column_missing
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No error"}
        >>> warn_column_missing(df, ["a", "b"])
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        None
        ```
        !!! success "Conclusion: Columns exist."
        </div>

        ```{.py .python linenums="1" title="Example 2: Warning raised"}
        >>> warn_column_missing(df, "c")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ColumnDoesNotExistWarning: Column "c" does not exist in "dataframe".
        Try one of: ["a", "b"].
        ```
        !!! failure "Conclusion: Column does not exist."
        </div>

    ??? tip "See Also"
        - [`column_exists`][toolbox_pyspark.checks.column_exists]
        - [`columns_exists`][toolbox_pyspark.checks.columns_exists]
        - [`assert_column_exists`][toolbox_pyspark.checks.assert_column_exists
        - [`assert_columns_exists`][toolbox_pyspark.checks.assert_columns_exists]
        - [`warn_column_missing`][toolbox_pyspark.checks.warn_column_missing]
        - [`warn_columns_missing`][toolbox_pyspark.checks.warn_columns_missing]
    """
    if not column_exists(dataframe, column, match_case):
        warn(
            f"Column '{column}' does not exist in 'dataframe'.\n"
            f"Try one of: {dataframe.columns}.",
            ColumnDoesNotExistWarning,
        )

warn_columns_missing 🔗

warn_columns_missing(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    match_case: bool = False,
) -> None

Summary

Check whether all of the values in columns exist in dataframe.columns.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`columns`	`Union[str, str_collection]`	The columns to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `ColumnDoesNotExistWarning` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import warn_columns_missing
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: No error
>>> warn_columns_missing(df, ["a", "b"])

Terminal

None

Conclusion: Columns exist.

Example 2: One column missing
>>> warn_columns_missing(df, ["b", "c"])

Terminal

ColumnDoesNotExistWarning: Columns ["c"] do not exist in "dataframe".
Try one of: ["a", "b"].

Conclusion: Column "c" does not exist.

Example 3: Multiple columns missing
>>> warn_columns_missing(df, ["b", "c", "d"])

Terminal

ColumnDoesNotExistWarning: Columns ["c", "d"] do not exist in "dataframe".
Try one of: ["a", "b"].

Conclusion: Columns "c" and "d" does not exist.

See Also

column_exists
columns_exists
assert_column_exists
assert_columns_exists
warn_column_missing
warn_columns_missing

Source code in src/toolbox_pyspark/checks.py

@typechecked
def warn_columns_missing(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    match_case: bool = False,
) -> None:
    """
    !!! summary "Summary"
        Check whether all of the values in `#!py columns` exist in `#!py dataframe.columns`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        columns (Union[str, str_collection]):
            The columns to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py ColumnDoesNotExistWarning` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import warn_columns_missing
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No error"}
        >>> warn_columns_missing(df, ["a", "b"])
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        None
        ```
        !!! success "Conclusion: Columns exist."
        </div>

        ```{.py .python linenums="1" title="Example 2: One column missing"}
        >>> warn_columns_missing(df, ["b", "c"])
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ColumnDoesNotExistWarning: Columns ["c"] do not exist in "dataframe".
        Try one of: ["a", "b"].
        ```
        !!! failure "Conclusion: Column "c" does not exist."
        </div>

        ```{.py .python linenums="1" title="Example 3: Multiple columns missing"}
        >>> warn_columns_missing(df, ["b", "c", "d"])
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ColumnDoesNotExistWarning: Columns ["c", "d"] do not exist in "dataframe".
        Try one of: ["a", "b"].
        ```
        !!! failure "Conclusion: Columns "c" and "d" does not exist."
        </div>

    ??? tip "See Also"
        - [`column_exists`][toolbox_pyspark.checks.column_exists]
        - [`columns_exists`][toolbox_pyspark.checks.columns_exists]
        - [`assert_column_exists`][toolbox_pyspark.checks.assert_column_exists]
        - [`assert_columns_exists`][toolbox_pyspark.checks.assert_columns_exists]
        - [`warn_column_missing`][toolbox_pyspark.checks.warn_column_missing]
        - [`warn_columns_missing`][toolbox_pyspark.checks.warn_columns_missing]
    """
    columns = [columns] if is_type(columns, str) else columns
    (exist, missing_cols) = _columns_exists(dataframe, columns, match_case)
    if not exist:
        warn(
            f"Columns {missing_cols} do not exist in 'dataframe'.\n"
            f"Try one of: {dataframe.columns}",
            ColumnDoesNotExistWarning,
        )

Column Values🔗

Summary

The checks module is used to check and validate various attributed about a given pyspark dataframe.

column_contains_value 🔗

column_contains_value(
    dataframe: psDataFrame,
    column: str,
    value: str,
    match_case: bool = False,
) -> bool

Summary

Check whether a given column contains a specific value in dataframe.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`column`	`str`	The column to check.	required
`value`	`str`	The value to check for.	required
`match_case`	`bool`	Whether or not to match the string case for the value. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If the `column` does not exist within `dataframe.columns`.

Returns:

Type	Description
`bool`	`True` if the column contains the value, `False` otherwise.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import column_contains_value
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: Value exists
>>> column_contains_value(df, "b", "a")

Terminal

True

Conclusion: Value exists in column.

Example 2: Value does not exist
>>> column_contains_value(df, "b", "z")

Terminal

False

Conclusion: Value does not exist in column.

See Also

assert_column_exists

Source code in src/toolbox_pyspark/checks.py

@typechecked
def column_contains_value(
    dataframe: psDataFrame,
    column: str,
    value: str,
    match_case: bool = False,
) -> bool:
    """
    !!! note "Summary"
        Check whether a given `#!py column` contains a specific `#!py value` in `#!py dataframe`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        column (str):
            The column to check.
        value (str):
            The value to check for.
        match_case (bool, optional):
            Whether or not to match the string case for the value.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ColumnDoesNotExistError:
            If the `#!py column` does not exist within `#!py dataframe.columns`.

    Returns:
        (bool):
            `#!py True` if the column contains the value, `#!py False` otherwise.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import column_contains_value
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: Value exists"}
        >>> column_contains_value(df, "b", "a")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        True
        ```
        !!! success "Conclusion: Value exists in column."
        </div>

        ```{.py .python linenums="1" title="Example 2: Value does not exist"}
        >>> column_contains_value(df, "b", "z")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: Value does not exist in column."
        </div>

    ??? tip "See Also"
        - [`assert_column_exists`][toolbox_pyspark.checks.assert_column_exists]
    """
    assert_column_exists(dataframe, column, match_case)

    if not match_case:
        value = value.lower()
        dataframe = dataframe.withColumn(column, F.lower(F.col(column)))

    return dataframe.filter(f"{column} = '{value}'").count() > 0

Type Checks🔗

Summary

The checks module is used to check and validate various attributed about a given pyspark dataframe.

is_vaid_spark_type 🔗

is_vaid_spark_type(datatype: str) -> bool

Summary

Check whether a given datatype is a correct and valid pyspark data type.

Parameters:

Name	Type	Description	Default
`datatype`	`str`	The name of the data type to check.	required

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`InvalidPySparkDataTypeError`	If the given `datatype` is not a valid `pyspark` data type.

Returns:

Type	Description
`bool`	`True` if the datatype is valid, `False` otherwise.

Examples

Set up
>>> from toolbox_pyspark.checks import is_vaid_spark_type

Loop through all valid types
>>> type_names = ["string", "char", "varchar", "binary", "boolean", "decimal", "float", "double", "byte", "short", "integer", "long", "date", "timestamp", "timestamp_ntz", "void"]
>>> for type_name in type_names:
...     is_vaid_spark_type(type_name)

Nothing is returned each time. Because they're all valid.

Conclusion: They're all valid.

Check some invalid types
>>> type_names = ["np.ndarray", "pd.DataFrame", "dict"]
>>> for type_name in type_names:
...     is_vaid_spark_type(type_name)

Terminal

InvalidPySparkDataTypeError: DataType 'np.ndarray' is not valid.
Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]

Terminal

InvalidPySparkDataTypeError: DataType 'pd.DataFrame' is not valid.
Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]

Terminal

InvalidPySparkDataTypeError: DataType 'dict' is not valid.
Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]

Conclusion: All of these types are invalid.

See Also

assert_valid_spark_type

Source code in src/toolbox_pyspark/checks.py

@typechecked
def is_vaid_spark_type(datatype: str) -> bool:
    """
    !!! note "Summary"
        Check whether a given `#!py datatype` is a correct and valid `#!py pyspark` data type.

    Params:
        datatype (str):
            The name of the data type to check.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        InvalidPySparkDataTypeError:
            If the given `#!py datatype` is not a valid `#!py pyspark` data type.

    Returns:
        (bool):
            `#!py True` if the datatype is valid, `#!py False` otherwise.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> from toolbox_pyspark.checks import is_vaid_spark_type
        ```

        ```{.py .python linenums="1" title="Loop through all valid types"}
        >>> type_names = ["string", "char", "varchar", "binary", "boolean", "decimal", "float", "double", "byte", "short", "integer", "long", "date", "timestamp", "timestamp_ntz", "void"]
        >>> for type_name in type_names:
        ...     is_vaid_spark_type(type_name)
        ```
        <div class="result" markdown>
        Nothing is returned each time. Because they're all valid.
        !!! success "Conclusion: They're all valid."
        </div>

        ```{.py .python linenums="1" title="Check some invalid types"}
        >>> type_names = ["np.ndarray", "pd.DataFrame", "dict"]
        >>> for type_name in type_names:
        ...     is_vaid_spark_type(type_name)
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeError: DataType 'np.ndarray' is not valid.
        Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]
        ```
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeError: DataType 'pd.DataFrame' is not valid.
        Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]
        ```
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeError: DataType 'dict' is not valid.
        Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]
        ```
        !!! failure "Conclusion: All of these types are invalid."
        </div>

    ??? tip "See Also"
        - [`assert_valid_spark_type`][toolbox_pyspark.checks.assert_valid_spark_type]
    """
    return datatype in VALID_PYSPARK_TYPE_NAMES

assert_valid_spark_type 🔗

assert_valid_spark_type(datatype: str) -> None

Summary

Assert whether a given datatype is a correct and valid pyspark data type.

Parameters:

Name	Type	Description	Default
`datatype`	`str`	The name of the data type to check.	required

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`InvalidPySparkDataTypeError`	If the given `datatype` is not a valid `pyspark` data type.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `InvalidPySparkDataTypeError` exception is raised, or nothing.

Examples

Set up
>>> from toolbox_pyspark.checks import assert_valid_spark_type

Example 1: Valid type
>>> assert_valid_spark_type("string")

Terminal

None

Conclusion: Valid type.

Example 2: Invalid type
>>> assert_valid_spark_type("invalid_type")

Terminal

InvalidPySparkDataTypeError: DataType 'invalid_type' is not valid.
Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]

Conclusion: Invalid type.

See Also

is_vaid_spark_type

Source code in src/toolbox_pyspark/checks.py

@typechecked
def assert_valid_spark_type(datatype: str) -> None:
    """
    !!! note "Summary"
        Assert whether a given `#!py datatype` is a correct and valid `#!py pyspark` data type.

    Params:
        datatype (str):
            The name of the data type to check.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        InvalidPySparkDataTypeError:
            If the given `#!py datatype` is not a valid `#!py pyspark` data type.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py InvalidPySparkDataTypeError` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> from toolbox_pyspark.checks import assert_valid_spark_type
        ```

        ```{.py .python linenums="1" title="Example 1: Valid type"}
        >>> assert_valid_spark_type("string")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        None
        ```
        !!! success "Conclusion: Valid type."
        </div>

        ```{.py .python linenums="1" title="Example 2: Invalid type"}
        >>> assert_valid_spark_type("invalid_type")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeError: DataType 'invalid_type' is not valid.
        Must be one of: ["binary", "bool", "boolean", "byte", "char", "date", "decimal", "double", "float", "int", "integer", "long", "short", "str", "string", "timestamp", "timestamp_ntz", "varchar", "void"]
        ```
        !!! failure "Conclusion: Invalid type."
        </div>

    ??? tip "See Also"
        - [`is_vaid_spark_type`][toolbox_pyspark.checks.is_vaid_spark_type]
    """
    if not is_vaid_spark_type(datatype):
        raise InvalidPySparkDataTypeError(
            f"DataType '{datatype}' is not valid.\n"
            f"Must be one of: {VALID_PYSPARK_TYPE_NAMES}"
        )

Column Types🔗

Summary

The checks module is used to check and validate various attributed about a given pyspark dataframe.

ColumnsAreTypeResult `dataclass` 🔗

Source code in src/toolbox_pyspark/checks.py

@dataclass
class ColumnsAreTypeResult:
    result: bool
    invalid_types: list[tuple[str, str]]

    def __iter__(self):
        for field in fields(self):
            yield getattr(self, field.name)

init 🔗

__init__(
    result: bool, invalid_types: list[tuple[str, str]]
) -> None

result `instance-attribute` 🔗

result: bool

invalid_types `instance-attribute` 🔗

invalid_types: list[tuple[str, str]]

iter 🔗

__iter__()

Source code in src/toolbox_pyspark/checks.py

def __iter__(self):
    for field in fields(self):
        yield getattr(self, field.name)

column_is_type 🔗

column_is_type(
    dataframe: psDataFrame,
    column: str,
    datatype: str,
    match_case: bool = False,
) -> bool

Summary

Check whether a given column is of a given datatype in dataframe.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`column`	`str`	The column to check.	required
`datatype`	`str`	The data type to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If the `column` does not exist within `dataframe.columns`.
`InvalidPySparkDataTypeError`	If the `datatype` is not a valid `pyspark` data type.

Returns:

Type	Description
`bool`	`True` if the column is of the given `datatype`, `False` otherwise.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import column_is_type
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: Column is of type
>>> column_is_type(df, "a", "integer")

Terminal

True

Conclusion: Column is the correct type.

Example 2: Column is not of type
>>> column_is_type(df, "b", "integer")

Terminal

False

Conclusion: Column is not the correct type.

See Also

column_is_type
columns_are_type
assert_column_is_type
assert_columns_are_type
warn_column_invalid_type
warn_columns_invalid_type

Source code in src/toolbox_pyspark/checks.py

@typechecked
def column_is_type(
    dataframe: psDataFrame,
    column: str,
    datatype: str,
    match_case: bool = False,
) -> bool:
    """
    !!! note "Summary"
        Check whether a given `#!py column` is of a given `#!py datatype` in `#!py dataframe`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        column (str):
            The column to check.
        datatype (str):
            The data type to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ColumnDoesNotExistError:
            If the `#!py column` does not exist within `#!py dataframe.columns`.
        InvalidPySparkDataTypeError:
            If the `#!py datatype` is not a valid `#!py pyspark` data type.

    Returns:
        (bool):
            `#!py True` if the column is of the given `#!py datatype`, `#!py False` otherwise.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import column_is_type
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: Column is of type"}
        >>> column_is_type(df, "a", "integer")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        True
        ```
        !!! success "Conclusion: Column is the correct type."
        </div>

        ```{.py .python linenums="1" title="Example 2: Column is not of type"}
        >>> column_is_type(df, "b", "integer")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: Column is not the correct type."
        </div>

    ??? tip "See Also"
        - [`column_is_type`][toolbox_pyspark.checks.column_is_type]
        - [`columns_are_type`][toolbox_pyspark.checks.columns_are_type]
        - [`assert_column_is_type`][toolbox_pyspark.checks.assert_column_is_type]
        - [`assert_columns_are_type`][toolbox_pyspark.checks.assert_columns_are_type]
        - [`warn_column_invalid_type`][toolbox_pyspark.checks.warn_column_invalid_type]
        - [`warn_columns_invalid_type`][toolbox_pyspark.checks.warn_columns_invalid_type]
    """
    return _columns_are_type(dataframe, column, datatype, match_case).result

columns_are_type 🔗

columns_are_type(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    datatype: str,
    match_case: bool = False,
) -> bool

Summary

Check whether the given columns are of a given datatype in dataframe.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`columns`	`Union[str, str_collection]`	The columns to check.	required
`datatype`	`str`	The data type to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If any of the `columns` do not exist within `dataframe.columns`.
`InvalidPySparkDataTypeError`	If the `datatype` is not a valid `pyspark` data type.

Returns:

Type	Description
`bool`	`True` if all the columns are of the given `datatype`, `False` otherwise.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import columns_are_type
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1.1, 2.2, 3.3, 4.4],
...         }
...     )
... )

Example 1: Columns are of type
>>> columns_are_type(df, ["a", "c"], "double")

Terminal

True

Conclusion: Columns are the correct type.

Example 2: Columns are not of type
>>> columns_are_type(df, ["a", "b"], "double")

Terminal

False

Conclusion: Columns are not the correct type.

Example 3: Single column is of type
>>> columns_are_type(df, "a", "integer")

Terminal

True

Conclusion: Column is the correct type.

Example 4: Single column is not of type
>>> columns_are_type(df, "b", "integer")

Terminal

False

Conclusion: Column is not the correct type.

See Also

column_is_type
columns_are_type
assert_column_is_type
assert_columns_are_type
warn_column_invalid_type
[warn_columns_invalid_type][toolbox_pyspark.checks.warn_columns_invalid

Source code in src/toolbox_pyspark/checks.py

@typechecked
def columns_are_type(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    datatype: str,
    match_case: bool = False,
) -> bool:
    """
    !!! note "Summary"
        Check whether the given `#!py columns` are of a given `#!py datatype` in `#!py dataframe`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        columns (Union[str, str_collection]):
            The columns to check.
        datatype (str):
            The data type to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ColumnDoesNotExistError:
            If any of the `#!py columns` do not exist within `#!py dataframe.columns`.
        InvalidPySparkDataTypeError:
            If the `#!py datatype` is not a valid `#!py pyspark` data type.

    Returns:
        (bool):
            `#!py True` if all the columns are of the given `#!py datatype`, `#!py False` otherwise.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import columns_are_type
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...             "c": [1.1, 2.2, 3.3, 4.4],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: Columns are of type"}
        >>> columns_are_type(df, ["a", "c"], "double")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        True
        ```
        !!! success "Conclusion: Columns are the correct type."
        </div>

        ```{.py .python linenums="1" title="Example 2: Columns are not of type"}
        >>> columns_are_type(df, ["a", "b"], "double")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: Columns are not the correct type."
        </div>

        ```{.py .python linenums="1" title="Example 3: Single column is of type"}
        >>> columns_are_type(df, "a", "integer")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        True
        ```
        !!! success "Conclusion: Column is the correct type."
        </div>

        ```{.py .python linenums="1" title="Example 4: Single column is not of type"}
        >>> columns_are_type(df, "b", "integer")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: Column is not the correct type."
        </div>

    ??? tip "See Also"
        - [`column_is_type`][toolbox_pyspark.checks.column_is_type]
        - [`columns_are_type`][toolbox_pyspark.checks.columns_are_type]
        - [`assert_column_is_type`][toolbox_pyspark.checks.assert_column_is_type]
        - [`assert_columns_are_type`][toolbox_pyspark.checks.assert_columns_are_type]
        - [`warn_column_invalid_type`][toolbox_pyspark.checks.warn_column_invalid_type]
        - [`warn_columns_invalid_type`][toolbox_pyspark.checks.warn_columns_invalid
    """
    return _columns_are_type(dataframe, columns, datatype, match_case).result

assert_column_is_type 🔗

assert_column_is_type(
    dataframe: psDataFrame,
    column: str,
    datatype: str,
    match_case: bool = False,
) -> None

Summary

Check whether a given column is of a given datatype in dataframe.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`column`	`str`	The column to check.	required
`datatype`	`str`	The data type to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If the `column` does not exist within `dataframe.columns`.
`InvalidPySparkDataTypeError`	If the given `column` is not of the given `datatype`.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `InvalidPySparkDataTypeError` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import assert_column_is_type
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: No error
>>> assert_column_is_type(df, "a", "integer")

Terminal

None

Conclusion: Column is of type.

Example 2: Error raised
>>> assert_column_is_type(df, "b", "integer")

Terminal

InvalidPySparkDataTypeError: Column 'b' is not of type 'integer'.

Conclusion: Column is not of type.

See Also

column_is_type
columns_are_type
assert_column_is_type
assert_columns_are_type
warn_column_invalid_type
[warn_columns_invalid_type][toolbox_pyspark.checks.warn_columns_invalid

Source code in src/toolbox_pyspark/checks.py

@typechecked
def assert_column_is_type(
    dataframe: psDataFrame,
    column: str,
    datatype: str,
    match_case: bool = False,
) -> None:
    """
    !!! note "Summary"
        Check whether a given `#!py column` is of a given `#!py datatype` in `#!py dataframe`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        column (str):
            The column to check.
        datatype (str):
            The data type to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ColumnDoesNotExistError:
            If the `#!py column` does not exist within `#!py dataframe.columns`.
        InvalidPySparkDataTypeError:
            If the given `#!py column` is not of the given `#!py datatype`.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py InvalidPySparkDataTypeError` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import assert_column_is_type
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No error"}
        >>> assert_column_is_type(df, "a", "integer")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        None
        ```
        !!! success "Conclusion: Column is of type."
        </div>

        ```{.py .python linenums="1" title="Example 2: Error raised"}
        >>> assert_column_is_type(df, "b", "integer")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeError: Column 'b' is not of type 'integer'.
        ```
        !!! failure "Conclusion: Column is not of type."
        </div>

    ??? tip "See Also"
        - [`column_is_type`][toolbox_pyspark.checks.column_is_type]
        - [`columns_are_type`][toolbox_pyspark.checks.columns_are_type]
        - [`assert_column_is_type`][toolbox_pyspark.checks.assert_column_is_type]
        - [`assert_columns_are_type`][toolbox_pyspark.checks.assert_columns_are_type]
        - [`warn_column_invalid_type`][toolbox_pyspark.checks.warn_column_invalid_type]
        - [`warn_columns_invalid_type`][toolbox_pyspark.checks.warn_columns_invalid
    """
    result, invalid_types = _columns_are_type(dataframe, column, datatype, match_case)
    if not result:
        raise InvalidPySparkDataTypeError(
            f"Column '{column}' is type '{invalid_types[0][1]}', "
            f"which is not the required type: '{datatype}'."
        )

assert_columns_are_type 🔗

assert_columns_are_type(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    datatype: str,
    match_case: bool = False,
) -> None

Summary

Check whether the given columns are of a given datatype in dataframe.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`columns`	`Union[str, str_collection]`	The columns to check.	required
`datatype`	`str`	The data type to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`ColumnDoesNotExistError`	If any of the `columns` do not exist within `dataframe.columns`.
`InvalidPySparkDataTypeError`	If any of the given `columns` are not of the given `datatype`.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `InvalidPySparkDataTypeError` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import assert_columns_are_type
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1.1, 2.2, 3.3, 4.4],
...         }
...     )
... )

Example 1: No error
>>> assert_columns_are_type(df, ["a", "c"], "double")

Terminal

None

Conclusion: Columns are of type.

Example 2: Error raised
>>> assert_columns_are_type(df, ["a", "b"], "double")

Terminal

InvalidPySparkDataTypeError: Columns ['a', 'b'] are types ['int', 'string'], which are not the required type: 'double'.

Conclusion: Columns are not of type.

Example 3: Single column is of type
>>> assert_columns_are_type(df, "a", "integer")

Terminal

None

Conclusion: Column is of type.

Example 4: Single column is not of type
>>> assert_columns_are_type(df, "b", "integer")

Terminal

InvalidPySparkDataTypeError: Columns ['b'] are types ['string'], which are not the required type: 'integer'.

Conclusion: Column is not of type.

See Also

column_is_type
columns_are_type
assert_column_is_type
assert_columns_are_type
warn_column_invalid_type
[warn_columns_invalid_type][toolbox_pyspark.checks.warn_columns_invalid

Source code in src/toolbox_pyspark/checks.py

@typechecked
def assert_columns_are_type(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    datatype: str,
    match_case: bool = False,
) -> None:
    """
    !!! note "Summary"
        Check whether the given `#!py columns` are of a given `#!py datatype` in `#!py dataframe`.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        columns (Union[str, str_collection]):
            The columns to check.
        datatype (str):
            The data type to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ColumnDoesNotExistError:
            If any of the `#!py columns` do not exist within `#!py dataframe.columns`.
        InvalidPySparkDataTypeError:
            If any of the given `#!py columns` are not of the given `#!py datatype`.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py InvalidPySparkDataTypeError` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import assert_columns_are_type
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...             "c": [1.1, 2.2, 3.3, 4.4],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No error"}
        >>> assert_columns_are_type(df, ["a", "c"], "double")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        None
        ```
        !!! success "Conclusion: Columns are of type."
        </div>

        ```{.py .python linenums="1" title="Example 2: Error raised"}
        >>> assert_columns_are_type(df, ["a", "b"], "double")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeError: Columns ['a', 'b'] are types ['int', 'string'], which are not the required type: 'double'.
        ```
        !!! failure "Conclusion: Columns are not of type."
        </div>

        ```{.py .python linenums="1" title="Example 3: Single column is of type"}
        >>> assert_columns_are_type(df, "a", "integer")
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        None
        ```
        !!! success "Conclusion: Column is of type."
        </div>

        ```{.py .python linenums="1" title="Example 4: Single column is not of type"}
        >>> assert_columns_are_type(df, "b", "integer")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeError: Columns ['b'] are types ['string'], which are not the required type: 'integer'.
        ```
        !!! failure "Conclusion: Column is not of type."
        </div>

    ??? tip "See Also"
        - [`column_is_type`][toolbox_pyspark.checks.column_is_type]
        - [`columns_are_type`][toolbox_pyspark.checks.columns_are_type]
        - [`assert_column_is_type`][toolbox_pyspark.checks.assert_column_is_type]
        - [`assert_columns_are_type`][toolbox_pyspark.checks.assert_columns_are_type]
        - [`warn_column_invalid_type`][toolbox_pyspark.checks.warn_column_invalid_type]
        - [`warn_columns_invalid_type`][toolbox_pyspark.checks.warn_columns_invalid
    """
    result, invalid_types = _columns_are_type(dataframe, columns, datatype, match_case)
    if not result:
        raise InvalidPySparkDataTypeError(
            f"Columns {[col for col, _ in invalid_types]} are types {[typ for _, typ in invalid_types]}, "
            f"which are not the required type: '{datatype}'."
        )

warn_column_invalid_type 🔗

warn_column_invalid_type(
    dataframe: psDataFrame,
    column: str,
    datatype: str,
    match_case: bool = False,
) -> None

Summary

Check whether a given column is of a given datatype in dataframe and raise a warning if not.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`column`	`str`	The column to check.	required
`datatype`	`str`	The data type to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `InvalidPySparkDataTypeWarning` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import warn_column_invalid_type
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )

Example 1: No warning
>>> warn_column_invalid_type(df, "a", "integer")

Terminal

None

Conclusion: Column is of type.

Example 2: Warning raised
>>> warn_column_invalid_type(df, "b", "integer")

Terminal

InvalidPySparkDataTypeWarning: Column 'b' is type 'string', which is not the required type: 'integer'.

Conclusion: Column is not of type.

See Also

column_is_type
columns_are_type
assert_column_is_type
assert_columns_are_type
warn_column_invalid_type
[warn_columns_invalid_type][toolbox_pyspark.checks.warn_columns_invalid

Source code in src/toolbox_pyspark/checks.py

@typechecked
def warn_column_invalid_type(
    dataframe: psDataFrame,
    column: str,
    datatype: str,
    match_case: bool = False,
) -> None:
    """
    !!! note "Summary"
        Check whether a given `#!py column` is of a given `#!py datatype` in `#!py dataframe` and raise a warning if not.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        column (str):
            The column to check.
        datatype (str):
            The data type to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py InvalidPySparkDataTypeWarning` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import warn_column_invalid_type
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No warning"}
        >>> warn_column_invalid_type(df, "a", "integer")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        None
        ```
        !!! success "Conclusion: Column is of type."
        </div>

        ```{.py .python linenums="1" title="Example 2: Warning raised"}
        >>> warn_column_invalid_type(df, "b", "integer")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeWarning: Column 'b' is type 'string', which is not the required type: 'integer'.
        ```
        !!! failure "Conclusion: Column is not of type."
        </div>

    ??? tip "See Also"
        - [`column_is_type`][toolbox_pyspark.checks.column_is_type]
        - [`columns_are_type`][toolbox_pyspark.checks.columns_are_type]
        - [`assert_column_is_type`][toolbox_pyspark.checks.assert_column_is_type]
        - [`assert_columns_are_type`][toolbox_pyspark.checks.assert_columns_are_type]
        - [`warn_column_invalid_type`][toolbox_pyspark.checks.warn_column_invalid_type]
        - [`warn_columns_invalid_type`][toolbox_pyspark.checks.warn_columns_invalid
    """
    result, invalid_types = _columns_are_type(dataframe, column, datatype, match_case)
    if not result:
        warn(
            f"Column '{column}' is type '{invalid_types[0][1]}', "
            f"which is not the required type: '{datatype}'.",
            InvalidPySparkDataTypeWarning,
        )

warn_columns_invalid_type 🔗

warn_columns_invalid_type(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    datatype: str,
    match_case: bool = False,
) -> None

Summary

Check whether the given columns are of a given datatype in dataframe and raise a warning if not.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The DataFrame to check.	required
`columns`	`Union[str, str_collection]`	The columns to check.	required
`datatype`	`str`	The data type to check.	required
`match_case`	`bool`	Whether or not to match the string case for the columns. Defaults to `False`.	`False`

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `InvalidPySparkDataTypeWarning` exception is raised, or nothing.

Examples

Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.checks import warn_columns_invalid_type
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1.1, 2.2, 3.3, 4.4],
...         }
...     )
... )

Example 1: No warning
>>> warn_columns_invalid_type(df, ["a", "c"], "double")

Terminal

None

Conclusion: Columns are of type.

Example 2: Warning raised
>>> warn_columns_invalid_type(df, ["a", "b"], "double")

Terminal

InvalidPySparkDataTypeWarning: Columns ['a', 'b'] are types ['int', 'string'], which are not the required type: 'double'.

Conclusion: Columns are not of type.

See Also

column_is_type
columns_are_type
assert_column_is_type
assert_columns_are_type
warn_column_invalid_type
[warn_columns_invalid_type][toolbox_pyspark.checks.warn_columns_invalid

Source code in src/toolbox_pyspark/checks.py

@typechecked
def warn_columns_invalid_type(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
    datatype: str,
    match_case: bool = False,
) -> None:
    """
    !!! note "Summary"
        Check whether the given `#!py columns` are of a given `#!py datatype` in `#!py dataframe` and raise a warning if not.

    Params:
        dataframe (psDataFrame):
            The DataFrame to check.
        columns (Union[str, str_collection]):
            The columns to check.
        datatype (str):
            The data type to check.
        match_case (bool, optional):
            Whether or not to match the string case for the columns.<br>
            Defaults to `#!py False`.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py InvalidPySparkDataTypeWarning` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.checks import warn_columns_invalid_type
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...             "c": [1.1, 2.2, 3.3, 4.4],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: No warning"}
        >>> warn_columns_invalid_type(df, ["a", "c"], "double")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        None
        ```
        !!! success "Conclusion: Columns are of type."
        </div>

        ```{.py .python linenums="1" title="Example 2: Warning raised"}
        >>> warn_columns_invalid_type(df, ["a", "b"], "double")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        InvalidPySparkDataTypeWarning: Columns ['a', 'b'] are types ['int', 'string'], which are not the required type: 'double'.
        ```
        !!! failure "Conclusion: Columns are not of type."
        </div>

    ??? tip "See Also"
        - [`column_is_type`][toolbox_pyspark.checks.column_is_type]
        - [`columns_are_type`][toolbox_pyspark.checks.columns_are_type]
        - [`assert_column_is_type`][toolbox_pyspark.checks.assert_column_is_type]
        - [`assert_columns_are_type`][toolbox_pyspark.checks.assert_columns_are_type]
        - [`warn_column_invalid_type`][toolbox_pyspark.checks.warn_column_invalid_type]
        - [`warn_columns_invalid_type`][toolbox_pyspark.checks.warn_columns_invalid
    """
    result, invalid_types = _columns_are_type(dataframe, columns, datatype, match_case)
    if not result:
        warn(
            f"Columns {[col for col, _ in invalid_types]} are types {[typ for _, typ in invalid_types]}, "
            f"which are not the required type: '{datatype}'.",
            InvalidPySparkDataTypeWarning,
        )

Table Existence🔗

Summary

The checks module is used to check and validate various attributed about a given pyspark dataframe.

table_exists 🔗

table_exists(
    name: str,
    path: str,
    data_format: SPARK_FORMATS,
    spark_session: SparkSession,
) -> bool

Summary

Will try to read table from path using format, and if successful will return True otherwise False.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table to check exists.	required
`path`	`str`	The directory where the table should be existing.	required
`data_format`	`str`	The format of the table to try checking.	required
`spark_session`	`SparkSession`	The `spark` session to use for the importing.	required

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.

Returns:

Type	Description
`bool`	Returns `True` if the table exists, `False` otherwise.

Examples

Set up
>>> # Imports
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.io import write_to_path
>>> from toolbox_pyspark.checks import table_exists
>>>
>>> # Constants
>>> write_name = "test_df"
>>> write_path = f"./test"
>>> write_format = "parquet"
>>>
>>> # Instantiate Spark
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create data
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )
>>>
>>> # Write data
>>> write_to_path(df, f"{write_name}.{write_format}", write_path)

Example 1: Table exists
>>> table_exists("test_df.parquet", "./test", "parquet", spark)

Terminal

True

Conclusion: Table exists.

Example 2: Table does not exist
>>> table_exists("bad_table_name.parquet", "./test", "parquet", spark)

Terminal

False

Conclusion: Table does not exist.

See Also

assert_table_exists

Source code in src/toolbox_pyspark/checks.py

@typechecked
def table_exists(
    name: str,
    path: str,
    data_format: SPARK_FORMATS,
    spark_session: SparkSession,
) -> bool:
    """
    !!! note "Summary"
        Will try to read `#!py table` from `#!py path` using `#!py format`, and if successful will return `#!py True` otherwise `#!py False`.

    Params:
        name (str):
            The name of the table to check exists.
        path (str):
            The directory where the table should be existing.
        data_format (str):
            The format of the table to try checking.
        spark_session (SparkSession):
            The `#!py spark` session to use for the importing.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
        (bool):
            Returns `#!py True` if the table exists, `False` otherwise.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> # Imports
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.io import write_to_path
        >>> from toolbox_pyspark.checks import table_exists
        >>>
        >>> # Constants
        >>> write_name = "test_df"
        >>> write_path = f"./test"
        >>> write_format = "parquet"
        >>>
        >>> # Instantiate Spark
        >>> spark = SparkSession.builder.getOrCreate()
        >>>
        >>> # Create data
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        >>>
        >>> # Write data
        >>> write_to_path(df, f"{write_name}.{write_format}", write_path)
        ```

        ```{.py .python linenums="1" title="Example 1: Table exists"}
        >>> table_exists("test_df.parquet", "./test", "parquet", spark)
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        True
        ```
        !!! success "Conclusion: Table exists."
        </div>

        ```{.py .python linenums="1" title="Example 2: Table does not exist"}
        >>> table_exists("bad_table_name.parquet", "./test", "parquet", spark)
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        False
        ```
        !!! failure "Conclusion: Table does not exist."
        </div>

    ??? tip "See Also"
        - [`assert_table_exists`][toolbox_pyspark.checks.assert_table_exists]
    """
    try:
        _ = read_from_path(
            name=name,
            path=path,
            data_format=data_format,
            spark_session=spark_session,
        )
    except Exception:
        return False
    return True

assert_table_exists 🔗

assert_table_exists(
    name: str,
    path: str,
    data_format: SPARK_FORMATS,
    spark_session: SparkSession,
) -> None

Summary

Assert whether a table exists at a given path using data_format.

Parameters:

Name	Type	Description	Default
`name`	`str`	The name of the table to check exists.	required
`path`	`str`	The directory where the table should be existing.	required
`data_format`	`str`	The format of the table to try checking.	required
`spark_session`	`SparkSession`	The `spark` session to use for the importing.	required

Raises:

Type	Description
`TypeError`	If any of the inputs parsed to the parameters of this function are not the correct type. Uses the `@typeguard.typechecked` decorator.
`TableDoesNotExistError`	If the table does not exist at the specified location.

Returns:

Type	Description
`type(None)`	Nothing is returned. Either an `TableDoesNotExistError` exception is raised, or nothing.

Examples

Set up
>>> # Imports
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.io import write_to_path
>>> from toolbox_pyspark.checks import assert_table_exists
>>>
>>> # Constants
>>> write_name = "test_df"
>>> write_path = f"./test"
>>> write_format = "parquet"
>>>
>>> # Instantiate Spark
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create data
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...         }
...     )
... )
>>>
>>> # Write data
>>> write_to_path(df, f"{write_name}.{write_format}", write_path)

Example 1: Table exists
>>> assert_table_exists("test_df.parquet", "./test", "parquet", spark)

Terminal

None

Conclusion: Table exists.

Example 2: Table does not exist
>>> assert_table_exists("bad_table_name.parquet", "./test", "parquet", spark)

Terminal

TableDoesNotExistError: Table 'bad_table_name.parquet' does not exist at path './test'.

Conclusion: Table does not exist.

See Also

table_exists

Source code in src/toolbox_pyspark/checks.py

@typechecked
def assert_table_exists(
    name: str,
    path: str,
    data_format: SPARK_FORMATS,
    spark_session: SparkSession,
) -> None:
    """
    !!! note "Summary"
        Assert whether a table exists at a given `path` using `data_format`.

    Params:
        name (str):
            The name of the table to check exists.
        path (str):
            The directory where the table should be existing.
        data_format (str):
            The format of the table to try checking.
        spark_session (SparkSession):
            The `#!py spark` session to use for the importing.

    Raises:
        TypeError:
            If any of the inputs parsed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        TableDoesNotExistError:
            If the table does not exist at the specified location.

    Returns:
        (type(None)):
            Nothing is returned. Either an `#!py TableDoesNotExistError` exception is raised, or nothing.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> # Imports
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.io import write_to_path
        >>> from toolbox_pyspark.checks import assert_table_exists
        >>>
        >>> # Constants
        >>> write_name = "test_df"
        >>> write_path = f"./test"
        >>> write_format = "parquet"
        >>>
        >>> # Instantiate Spark
        >>> spark = SparkSession.builder.getOrCreate()
        >>>
        >>> # Create data
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...         }
        ...     )
        ... )
        >>>
        >>> # Write data
        >>> write_to_path(df, f"{write_name}.{write_format}", write_path)
        ```

        ```{.py .python linenums="1" title="Example 1: Table exists"}
        >>> assert_table_exists("test_df.parquet", "./test", "parquet", spark)
        ```
        <div class="result" markdown>
        ```{.sh .shell title="Terminal"}
        None
        ```
        !!! success "Conclusion: Table exists."
        </div>

        ```{.py .python linenums="1" title="Example 2: Table does not exist"}
        >>> assert_table_exists("bad_table_name.parquet", "./test", "parquet", spark)
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        TableDoesNotExistError: Table 'bad_table_name.parquet' does not exist at path './test'.
        ```
        !!! failure "Conclusion: Table does not exist."
        </div>

    ??? tip "See Also"
        - [`table_exists`][toolbox_pyspark.checks.table_exists]
    """
    if not table_exists(
        name=name, path=path, data_format=data_format, spark_session=spark_session
    ):
        raise TableDoesNotExistError(f"Table '{name}' does not exist at path '{path}'.")

Checks

toolbox_pyspark.checks 🔗

Column Existence🔗

ColumnExistsResult dataclass 🔗

__init__ 🔗

result instance-attribute 🔗

missing_cols instance-attribute 🔗

__iter__ 🔗

column_exists 🔗

columns_exists 🔗

assert_column_exists 🔗

assert_columns_exists 🔗

warn_column_missing 🔗

warn_columns_missing 🔗

Column Values🔗

column_contains_value 🔗

Type Checks🔗

is_vaid_spark_type 🔗

assert_valid_spark_type 🔗

Column Types🔗

ColumnsAreTypeResult dataclass 🔗

__init__ 🔗

result instance-attribute 🔗

invalid_types instance-attribute 🔗

__iter__ 🔗

column_is_type 🔗

columns_are_type 🔗

assert_column_is_type 🔗

assert_columns_are_type 🔗

warn_column_invalid_type 🔗

warn_columns_invalid_type 🔗

Table Existence🔗

table_exists 🔗

assert_table_exists 🔗

ColumnExistsResult `dataclass` 🔗

init 🔗

result `instance-attribute` 🔗

missing_cols `instance-attribute` 🔗

iter 🔗

ColumnsAreTypeResult `dataclass` 🔗

init 🔗

result `instance-attribute` 🔗

invalid_types `instance-attribute` 🔗

iter 🔗