
Modules

Overview

There are 15 modules in this package, which together cover 81 functions.

Module Descriptions

| Module | Description |
| --- | --- |
| constants | The constants module is used to hold the definitions of all constant values used across the package. |
| io | The io module is used for reading and writing tables to/from directories. |
| checks | The checks module is used to check and validate various attributes of a given PySpark DataFrame. |
| types | The types module is used to get, check, and change a DataFrame's column data types. |
| keys | The keys module is used for creating new columns to act as keys (primary and foreign), to be used for joins with other tables, or to create relationships within downstream applications, like PowerBI. |
| scale | The scale module is used for rounding a column (or columns) to a given rounding accuracy. |
| dimensions | The dimensions module is used for checking the dimensions of PySpark DataFrames. |
| columns | The columns module is used to fetch columns from a given DataFrame using convenient syntax. |
| datetime | The datetime module is used for fixing column names that contain datetime data, adding conversions to local datetimes, and for splitting datetime columns into their date and time components. |
| info | The info module is used to provide utility functions for retrieving information from PySpark DataFrames. |
| formatting | The formatting module provides functions for formatting numbers and displaying intermediary tables, schemas, and columns. |
| cleaning | The cleaning module is used to clean, fix, and fetch various aspects of a given DataFrame. |
| duplication | The duplication module is used for duplicating data from an existing DataFrame, or unioning multiple DataFrames together. |
| schema | The schema module is used for checking, validating, and viewing any schema differences between two tables, either from in-memory variables or from locations on disk. |
| delta | The delta module is for various processes related to Delta Lake tables, including optimising tables, merging tables, retrieving table history, and transferring between locations. |
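
For orientation, here is a minimal sketch in plain PySpark of the kinds of operations described above (column checks, whitespace trimming, rounding). It deliberately uses only standard pyspark.sql calls rather than this package's own functions, since their exact signatures are not shown on this page, and the sample data is hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input data: a key column with stray whitespace and an unrounded value.
df = spark.createDataFrame(
    [(" a1 ", 1.2345), ("b2", 2.3456)],
    ["customer_id", "amount"],
)

# The kind of validation the checks module describes: assert that expected columns exist.
missing = [col for col in ["customer_id", "amount"] if col not in df.columns]
assert not missing, f"Missing columns: {missing}"

# The kind of fix the cleaning module describes: trim whitespace from a column.
df = df.withColumn("customer_id", F.trim("customer_id"))

# The kind of rounding the scale module describes: round a column to 2 decimal places.
df = df.withColumn("amount", F.round("amount", 2))

df.show()
```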

Functions by Module

| Module | Functions |
| --- | --- |
| constants | |
| io | read_from_path(), write_to_path(), transfer_by_path(), read_from_table(), write_to_table(), transfer_by_table(), read(), write(), transfer(), load_from_path(), save_to_path(), load_from_table(), save_to_table(), load(), save() |
| checks | column_exists(), columns_exists(), assert_column_exists(), assert_columns_exists(), warn_column_missing(), warn_columns_missing(), is_vaid_spark_type(), assert_valid_spark_type(), column_is_type(), columns_are_type(), assert_column_is_type(), assert_columns_are_type(), warn_column_invalid_type(), warn_columns_invalid_type(), column_contains_value(), table_exists(), assert_table_exists() |
| types | get_column_types(), cast_column_to_type(), cast_columns_to_type(), map_cast_columns_to_type() |
| keys | add_keys_from_columns(), add_key_from_columns() |
| scale | round_column(), round_columns() |
| dimensions | get_dims(), get_dims_of_tables() |
| columns | get_columns(), get_columns_by_likeness(), rename_columns(), reorder_columns(), delete_columns() |
| datetime | rename_datetime_columns(), rename_datetime_column(), add_local_datetime_columns(), add_local_datetime_column(), split_datetime_column(), split_datetime_columns() |
| info | extract_column_values(), get_distinct_values() |
| formatting | format_numbers(), display_intermediary_table(), display_intermediary_schema(), display_intermediary_columns() |
| cleaning | create_empty_dataframe(), keep_first_record_by_columns(), convert_dataframe(), update_nullability(), trim_spaces_from_column(), trim_spaces_from_columns(), apply_function_to_column(), apply_function_to_columns(), drop_matching_rows() |
| duplication | duplicate_union_dataframe(), union_all() |
| schema | check_schemas_match(), view_schema_differences() |
| delta | load_table(), count_rows(), get_history(), optimise_table(), retry_optimise_table(), merge_spark_to_delta(), merge_delta_to_delta(), retry_merge_spark_to_delta(), DeltaLoader() |
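
To ground the delta entries above, the following sketch shows the underlying delta-spark operations (table history, file compaction, and merging) that functions such as get_history(), optimise_table(), and merge_spark_to_delta() appear to correspond to. It uses the standard delta.tables.DeltaTable API rather than this package's own signatures, which are not documented on this page; the table path and DataFrame are hypothetical.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical incoming updates and a hypothetical Delta table location.
updates_df = spark.createDataFrame([("c1", 100.0)], ["customer_id", "amount"])
target = DeltaTable.forPath(spark, "/mnt/curated/sales")

# Retrieve table history (the kind of information get_history() describes).
target.history(10).show()

# Compact small files (the kind of maintenance optimise_table() describes).
# Requires delta-spark 2.0 or later.
target.optimize().executeCompaction()

# Merge a Spark DataFrame into the Delta table
# (the kind of operation merge_spark_to_delta() describes).
(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```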

Testing

This package is fully tested with:

  1. Unit tests
  2. Lint tests
  3. MyPy tests
  4. Build tests

Latest Code Coverage