Info

toolbox_pyspark.info

Summary

The info module provides utility functions for retrieving information from pyspark dataframes.

extract_column_values

extract_column_values(
    dataframe: psDataFrame,
    column: str,
    distinct: bool = True,
    return_type: Union[
        LITERAL_PYSPARK_DATAFRAME_NAMES,
        LITERAL_PANDAS_DATAFRAME_NAMES,
        LITERAL_NUMPY_ARRAY_NAMES,
        LITERAL_LIST_OBJECT_NAMES,
    ] = "pd",
) -> Optional[
    Union[psDataFrame, pdDataFrame, npArray, list]
]

Summary

Retrieve the values from a specified column in a pyspark dataframe.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataframe | DataFrame | The DataFrame to retrieve the column values from. | required |
| column | str | The column to retrieve the values from. | required |
| distinct | bool | Whether to retrieve only distinct values. | True |
| return_type | Union[LITERAL_PYSPARK_DATAFRAME_NAMES, LITERAL_PANDAS_DATAFRAME_NAMES, LITERAL_NUMPY_ARRAY_NAMES, LITERAL_LIST_OBJECT_NAMES] | The type of object to return. | 'pd' |

Raises:

| Type | Description |
| --- | --- |
| TypeError | If any of the inputs passed to the parameters of this function are not the correct type. Uses the @typeguard.typechecked decorator. |
| ValueError | If the return_type is not one of the valid options. |
| ColumnDoesNotExistError | If the column does not exist within dataframe.columns. |
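
The `return_type` validation described above can be pictured with a short, Spark-free sketch. The alias values (`"ps"`, `"pd"`, `"np"`, `"list"` and their long forms) and the helper `resolve_return_type` are assumptions made for illustration only; the actual contents of the `LITERAL_*` and `VALID_*` names are not shown on this page and may differ.

```python
from typing import Literal, get_args

# Hypothetical alias definitions; the library's real LITERAL_* types may differ.
PYSPARK_NAMES = Literal["ps", "pyspark"]
PANDAS_NAMES = Literal["pd", "pandas"]
NUMPY_NAMES = Literal["np", "numpy"]
LIST_NAMES = Literal["list"]

# Flatten the Literal types into one alias -> canonical-kind lookup table.
KINDS = {
    "pyspark": PYSPARK_NAMES,
    "pandas": PANDAS_NAMES,
    "numpy": NUMPY_NAMES,
    "list": LIST_NAMES,
}
ALIASES = {alias: kind for kind, names in KINDS.items() for alias in get_args(names)}


def resolve_return_type(return_type: str) -> str:
    """Map an alias such as "pd" to its canonical kind, or raise ValueError."""
    if return_type not in ALIASES:
        raise ValueError(f"Invalid return type: {return_type}")
    return ALIASES[return_type]


print(resolve_return_type("pd"))  # pandas
print(resolve_return_type("np"))  # numpy
```

Deriving the accepted names from the `Literal` types with `typing.get_args` keeps the type hints and the runtime check in sync, which is one plausible reason to define both kinds of names side by side.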

Returns:

| Type | Description |
| --- | --- |
| Optional[Union[psDataFrame, pdDataFrame, npArray, list]] | The values from the specified column in the specified return type. |

Examples

Set up
>>> # Imports
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.info import extract_column_values
>>>
>>> # Instantiate Spark
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> # Create data
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1, 1, 1, 1],
...             "d": ["2", "3", "3", "3"],
...             "e": ["a", "a", "b", "b"],
...         }
...     )
... )
>>>
>>> # Check
>>> df.show()
Terminal
+---+---+---+---+---+
| a | b | c | d | e |
+---+---+---+---+---+
| 1 | a | 1 | 2 | a |
| 2 | b | 1 | 3 | a |
| 3 | c | 1 | 3 | b |
| 4 | d | 1 | 3 | b |
+---+---+---+---+---+

Example 1: Retrieve all values as pyspark DataFrame
>>> result = extract_column_values(df, "e", distinct=False, return_type="ps")
>>> result.show()
Terminal
+---+
| e |
+---+
| a |
| a |
| b |
| b |
+---+

Conclusion: Successfully retrieved all values as pyspark DataFrame.

Example 2: Retrieve distinct values as pandas DataFrame
>>> result = extract_column_values(df, "b", distinct=True, return_type="pd")
>>> print(result)
Terminal
   b
0  a
1  b
2  c
3  d

Conclusion: Successfully retrieved distinct values as pandas DataFrame.

Example 3: Retrieve all values as list
>>> result = extract_column_values(df, "c", distinct=False, return_type="list")
>>> print(result)
Terminal
[1, 1, 1, 1]

Conclusion: Successfully retrieved all values as list.

Example 4: Retrieve distinct values as numpy array
>>> result = extract_column_values(df, "d", distinct=True, return_type="np")
>>> print(result)
Terminal
[['2']
 ['3']]

Conclusion: Successfully retrieved distinct values as numpy array.

Example 5: Invalid column
>>> result = extract_column_values(df, "invalid", distinct=True, return_type="pd")
Terminal
ColumnDoesNotExistError: Column 'invalid' does not exist. Did you mean one of the following? [a, b, c, d, e]

Conclusion: Failed to retrieve values due to invalid column.

Example 6: Invalid return type
>>> result = extract_column_values(df, "b", distinct=True, return_type="invalid")
Terminal
ValueError: Invalid return type: invalid

Conclusion: Failed to retrieve values due to invalid return type.

See Also

get_distinct_values

Source code in src/toolbox_pyspark/info.py
@typechecked
def extract_column_values(
    dataframe: psDataFrame,
    column: str,
    distinct: bool = True,
    return_type: Union[
        LITERAL_PYSPARK_DATAFRAME_NAMES,
        LITERAL_PANDAS_DATAFRAME_NAMES,
        LITERAL_NUMPY_ARRAY_NAMES,
        LITERAL_LIST_OBJECT_NAMES,
    ] = "pd",
) -> Optional[Union[psDataFrame, pdDataFrame, npArray, list]]:
    """
    !!! note "Summary"
        Retrieve the values from a specified column in a `pyspark` dataframe.

    Params:
        dataframe (psDataFrame):
            The DataFrame to retrieve the column values from.
        column (str):
            The column to retrieve the values from.
        distinct (bool, optional):
            Whether to retrieve only distinct values.<br>
            Defaults to `#!py True`.
        return_type (Union[LITERAL_PYSPARK_DATAFRAME_NAMES, LITERAL_PANDAS_DATAFRAME_NAMES, LITERAL_NUMPY_ARRAY_NAMES, LITERAL_LIST_OBJECT_NAMES], optional):
            The type of object to return.<br>
            Defaults to `#!py "pd"`.

    Raises:
        TypeError:
            If any of the inputs passed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.
        ValueError:
            If the `return_type` is not one of the valid options.
        ColumnDoesNotExistError:
            If the `#!py column` does not exist within `#!py dataframe.columns`.

    Returns:
        (Optional[Union[psDataFrame, pdDataFrame, npArray, list]]):
            The values from the specified column in the specified return type.

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> # Imports
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.info import extract_column_values
        >>>
        >>> # Instantiate Spark
        >>> spark = SparkSession.builder.getOrCreate()
        >>>
        >>> # Create data
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...             "c": [1, 1, 1, 1],
        ...             "d": ["2", "3", "3", "3"],
        ...             "e": ["a", "a", "b", "b"],
        ...         }
        ...     )
        ... )
        >>>
        >>> # Check
        >>> df.show()
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        +---+---+---+---+---+
        | a | b | c | d | e |
        +---+---+---+---+---+
        | 1 | a | 1 | 2 | a |
        | 2 | b | 1 | 3 | a |
        | 3 | c | 1 | 3 | b |
        | 4 | d | 1 | 3 | b |
        +---+---+---+---+---+
        ```
        </div>

        ```{.py .python linenums="1" title="Example 1: Retrieve all values as pyspark DataFrame"}
        >>> result = extract_column_values(df, "e", distinct=False, return_type="ps")
        >>> result.show()
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        +---+
        | e |
        +---+
        | a |
        | a |
        | b |
        | b |
        +---+
        ```
        !!! success "Conclusion: Successfully retrieved all values as pyspark DataFrame."
        </div>

        ```{.py .python linenums="1" title="Example 2: Retrieve distinct values as pandas DataFrame"}
        >>> result = extract_column_values(df, "b", distinct=True, return_type="pd")
        >>> print(result)
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
           b
        0  a
        1  b
        2  c
        3  d
        ```
        !!! success "Conclusion: Successfully retrieved distinct values as pandas DataFrame."
        </div>

        ```{.py .python linenums="1" title="Example 3: Retrieve all values as list"}
        >>> result = extract_column_values(df, "c", distinct=False, return_type="list")
        >>> print(result)
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        [1, 1, 1, 1]
        ```
        !!! success "Conclusion: Successfully retrieved all values as list."
        </div>

        ```{.py .python linenums="1" title="Example 4: Retrieve distinct values as numpy array"}
        >>> result = extract_column_values(df, "d", distinct=True, return_type="np")
        >>> print(result)
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        [['2']
         ['3']]
        ```
        !!! success "Conclusion: Successfully retrieved distinct values as numpy array."
        </div>

        ```{.py .python linenums="1" title="Example 5: Invalid column"}
        >>> result = extract_column_values(df, "invalid", distinct=True, return_type="pd")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ColumnDoesNotExistError: Column 'invalid' does not exist. Did you mean one of the following? [a, b, c, d, e]
        ```
        !!! failure "Conclusion: Failed to retrieve values due to invalid column."
        </div>

        ```{.py .python linenums="1" title="Example 6: Invalid return type"}
        >>> result = extract_column_values(df, "b", distinct=True, return_type="invalid")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ValueError: Invalid return type: invalid
        ```
        !!! failure "Conclusion: Failed to retrieve values due to invalid return type."
        </div>

    ??? tip "See Also"
        - [`get_distinct_values`][toolbox_pyspark.info.get_distinct_values]
    """

    assert_column_exists(dataframe, column)

    dataframe = dataframe.select(column)

    if distinct:
        dataframe = dataframe.distinct()

    if return_type in VALID_PYSPARK_DATAFRAME_NAMES:
        return dataframe
    elif return_type in VALID_PANDAS_DATAFRAME_NAMES:
        return dataframe.toPandas()
    elif return_type in VALID_NUMPY_ARRAY_NAMES:
        return dataframe.select(column).toPandas().to_numpy()
    elif return_type in VALID_LIST_OBJECT_NAMES:
        return dataframe.select(column).toPandas()[column].tolist()

get_distinct_values

get_distinct_values(
    dataframe: psDataFrame,
    columns: Union[str, str_collection],
) -> tuple[Any, ...]

Summary

Retrieve the distinct values from one or more specified columns in a pyspark dataframe.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| dataframe | DataFrame | The DataFrame to retrieve the distinct column values from. | required |
| columns | Union[str, str_collection] | The column(s) to retrieve the distinct values from. | required |

Raises:

| Type | Description |
| --- | --- |
| TypeError | If any of the inputs passed to the parameters of this function are not the correct type. Uses the @typeguard.typechecked decorator. |

Returns:

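| Type | Description |
| --- | --- |
| tuple[Any, ...] | The distinct values from the specified column(s). A single column yields a flat tuple of values; multiple columns yield a tuple containing one tuple per distinct row. |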

Examples
Set up
>>> import pandas as pd
>>> from pyspark.sql import SparkSession
>>> from toolbox_pyspark.info import get_distinct_values
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame(
...     pd.DataFrame(
...         {
...             "a": [1, 2, 3, 4],
...             "b": ["a", "b", "c", "d"],
...             "c": [1, 1, 1, 1],
...             "d": ["2", "2", "2", "2"],
...         }
...     )
... )

Example 1: Retrieve distinct values
>>> result = get_distinct_values(df, "b")
>>> print(result)
Terminal
('a', 'b', 'c', 'd')

Conclusion: Successfully retrieved distinct values.

Example 2: Invalid column
>>> result = get_distinct_values(df, "invalid")
Terminal
AnalysisException: Column 'invalid' does not exist. Did you mean one of the following? [a, b, c, d]

Conclusion: Failed to retrieve values due to invalid column.

See Also

extract_column_values

Source code in src/toolbox_pyspark/info.py
@typechecked
def get_distinct_values(
    dataframe: psDataFrame, columns: Union[str, str_collection]
) -> tuple[Any, ...]:
    """
    !!! note "Summary"
        Retrieve the distinct values from one or more specified columns in a `pyspark` dataframe.

    Params:
        dataframe (psDataFrame):
            The DataFrame to retrieve the distinct column values from.
        columns (Union[str, str_collection]):
            The column(s) to retrieve the distinct values from.

    Raises:
        TypeError:
            If any of the inputs passed to the parameters of this function are not the correct type. Uses the [`@typeguard.typechecked`](https://typeguard.readthedocs.io/en/stable/api.html#typeguard.typechecked) decorator.

    Returns:
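        (tuple[Any, ...]):
            The distinct values from the specified column(s). A single column yields a flat tuple of values; multiple columns yield a tuple containing one tuple per distinct row.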

    ???+ example "Examples"

        ```{.py .python linenums="1" title="Set up"}
        >>> import pandas as pd
        >>> from pyspark.sql import SparkSession
        >>> from toolbox_pyspark.info import get_distinct_values
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame(
        ...     pd.DataFrame(
        ...         {
        ...             "a": [1, 2, 3, 4],
        ...             "b": ["a", "b", "c", "d"],
        ...             "c": [1, 1, 1, 1],
        ...             "d": ["2", "2", "2", "2"],
        ...         }
        ...     )
        ... )
        ```

        ```{.py .python linenums="1" title="Example 1: Retrieve distinct values"}
        >>> result = get_distinct_values(df, "b")
        >>> print(result)
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        ('a', 'b', 'c', 'd')
        ```
        !!! success "Conclusion: Successfully retrieved distinct values."
        </div>

        ```{.py .python linenums="1" title="Example 2: Invalid column"}
        >>> result = get_distinct_values(df, "invalid")
        ```
        <div class="result" markdown>
        ```{.txt .text title="Terminal"}
        AnalysisException: Column 'invalid' does not exist. Did you mean one of the following? [a, b, c, d]
        ```
        !!! failure "Conclusion: Failed to retrieve values due to invalid column."
        </div>

    ??? tip "See Also"
        - [`extract_column_values`][toolbox_pyspark.info.extract_column_values]
    """
    columns = [columns] if is_type(columns, str) else columns
    rows: list[T.Row] = dataframe.select(*columns).distinct().collect()
    if len(columns) == 1:
        return tuple(row[columns[0]] for row in rows)
    return tuple(tuple(row[col] for col in columns) for row in rows)
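
The two return shapes produced by the last three lines (a flat tuple for one column, a tuple of tuples for several) can be seen in a Spark-free sketch. Here plain dicts stand in for the collected `Row` objects, `distinct_values_sketch` is a hypothetical name, and first-seen order is kept, which Spark's `distinct()` does not guarantee.

```python
from typing import Any, Sequence, Union

# Plain dicts standing in for the rows of the set-up DataFrame (an assumption).
ROWS: list[dict[str, Any]] = [
    {"a": 1, "b": "a", "c": 1, "d": "2"},
    {"a": 2, "b": "b", "c": 1, "d": "2"},
    {"a": 3, "b": "c", "c": 1, "d": "2"},
    {"a": 4, "b": "d", "c": 1, "d": "2"},
]


def distinct_values_sketch(
    rows: list[dict[str, Any]], columns: Union[str, Sequence[str]]
) -> tuple[Any, ...]:
    """Mimic the shaping logic: flat tuple for one column, nested for many."""
    # Normalise a single column name into a one-element list.
    cols = [columns] if isinstance(columns, str) else list(columns)
    # De-duplicate whole rows restricted to the selected columns.
    unique_rows = list(dict.fromkeys(tuple(row[col] for col in cols) for row in rows))
    if len(cols) == 1:
        return tuple(row[0] for row in unique_rows)
    return tuple(unique_rows)


print(distinct_values_sketch(ROWS, "b"))         # ('a', 'b', 'c', 'd')
print(distinct_values_sketch(ROWS, ["c", "d"]))  # ((1, '2'),)
```

Callers that pass a collection containing a single column name still receive a flat tuple, since the shape depends on the number of columns rather than on the argument type.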