Elwood Functions
Main Module
The main module of Elwood provides functionality for data transformation and standardization. It includes functions for reading and processing data, as well as performing various transformations.
Standardization processor
process Function
The process function in the Elwood library is responsible for data transformation and standardization based on a JSON schema. It takes various parameters and performs a series of steps to transform input data and produce standardized output. Below is an overview of the function’s parameters and its workflow:
Parameters
fp(str): The file path to the input data file that needs to be processed.mp(str): The filename for the JSON mapper from spacetag. This mapper defines the schema for the data transformation.admin(str): The administrative level for geocoding. If not provided, it may be inferred from the JSON schema.output_file(str): The base filename for the output parquet files.write_output(bool, default True): Specifies whether to write output files.gadm(gpd.GeoDataFrame, default None): Optional specification of a GeoDataFrame of GADM shapes for geocoding.overrides(dict, default None): User-provided overrides for data transformation.
Workflow
- Read the User defined values from the specified
mp(mapper) file. - Determine the file type from the transformation details.
- Process the input data file based on the determined file type and transformation metadata.
- Run the
standardizerfunction to perform data standardization using the provided mapper, admin level, and GADM shapes. - Apply manual user overrides to hardcoded countries for GADM.
- Optionally enforce feature column data types and separate string data from other types.
- Write normalized data to parquet files with compression, if specified.
- Return the normalized dataframe and a dictionary of renamed columns.
Example Usage
from elwood import elwood
norm_df, renamed_col_dict = elwood.process(
fp="data.csv",
mp="mapper.json",
admin="admin2",
output_file="output",
gadm=gadm_data,
overrides=overrides_data
)
Transformation Functions
Elwood provides several transformation functions that operate on dataframes:
normalize_features: Scales numerical features in a dataframe to a 0 to 1 scale. Takes in a dataframe and an optional output_file name. Outputs a normalized dataframe and optionally writes an output parquet file.output = elwood.normalize_features( dataframe=df, output_file="data_parquet" ) # `output` is the normalized dataframe result. Also outputs a parquet file called data_parquet_normalized.parquet.gzipnormalize_features_robust: Performs robust normalization of numerical features. Takes in a dataframe and an optional output_file name. Outputs a normalized dataframe and optionally writes an output parquet file.output = elwood.normalize_features_robust( dataframe=df, output_file="data_parquet" ) # `output` is the normalized dataframe result. Also outputs a parquet file called data_parquet_normalized.parquet.gzipclip_geo: Clips data based on geographical shapes or shapefiles. The geo_columns input is a dict object containing the two column names for the lat/lon columns in the dataframe. The polygons_list is a list of lists containing lists of objects that represent polygon shapes to clip to. This polygons_list is ultimately a geopandas clipping maskclipped_frame = elwood.clip_geo( dataframe=df, geo_columns={"lon_column": "longitude", "lat_column": "latitude"}, polygons_list=[[10.0, 5.0], [20.0, 5.0], [20.0, 10.0], [10.0, 10.0]] ) # Returns a dataframe with all columns removed outside of the specified shapes.clip_dataframe_time: Clips data based on time ranges. The time_column input is a string which is the name of target time column. The time_ranges input is a list of dictionary objects containing “start” and “end” datetime valuesclipped_frame = elwood.clip_dataframe_time( dataframe=df, time_column="Date", time_ranges=[{"start":"01-01-2022", "end":"01-05-2022"}, {"start": "01-01-2023", "end": "12-01-2023"}] ) # Returns a dataframe with all columns removed outside of the specified time ranges.rescale_dataframe_time: Rescales a dataframe’s time periodicity using aggregation functions. Thedataframeinput is apandas.DataFramecontaining a column of time values to be rescaled. Thetime_columninput is a string representing the name of the target time column. Thetime_bucketinput is aDateOffset,Timedelta, or string representing a time bucketing rule (e.g., ‘M’, ‘A’, ‘2H’) used to aggregate the time. Theaggregation_functionsinput is a list of strings containing aggregation functions to apply to the data (e.g., [‘sum’]). The optionalgeo_columnsinput can be used for specifying geo columns.scaled_frame = elwood.rescale_dataframe_time( dataframe=df, time_column="Date", time_bucket='M', aggregation_functions=['sum'], geo_columns=None ) # Returns a dataframe with rescaled time periodicity using specified aggregation functions.regrid_dataframe_geo: Regrids a dataframe with detectable geo-resolution.- The
dataframeinput is of typepandas.DataFrameor of typexarray.Datasetand represents the dataframe to be regridded. - The
geo_columnsinput is of typeDictand contains the columns representing geo information. - The
time_columninput is of typeStringand is the name of the target time column. - The
scale_multiinput is of typeFloatand represents the scaling factor for geo-resolution. Theaggregation_functionsinput is of typeDict[str, str]and is a Dict of column names and aggregation functions to apply to that column. - The optional
scaleinput, if provided, is of typefloatordict. It represents a user override for initial scale of the data. If the desired latitude and longitude scales are different, pass a dict with{"lat_scale": <Desired Latitude Scale>, "lon_scale": <Desired Longitude Scale>}The function will auto assess data scale if not provided. - The optional
xarray_returninput, if provided, expects a boolean value. This argument determines if the function should return the data inxarray.Datasetformat. IfTrue, the regridded data will be returned as anxarray.Datasetobject.regridded_frame = regrid_dataframe_geo( dataframe=df, geo_columns=["latitude", "longitude"], time_column="Date", scale_multi=2, aggregation_functions={"Rainfall": "sum", "Temperature": "mean"} ) # Returns a dataframe regridded with newly specified geo-resolution. get_boundary_box: Returns the minimum and maximum x, y coordinates for a geographical dataset with latitude and longitude.boundary_box = elwood.get_boundary_box( dataframe=df, geo_columns={"latitude_column": "Latitude", "longitude_column": "Longitude"} ) # Returns an object containing xmin, xmax, ymin, and ymax coordinates.get_temporal_boundary: Returns the minimum and maximum time values in a dataframe. The return is a dictionary object with a ‘min’ key and a ‘max’ key.temporal_boundary = elwood.get_temporal_boundary( dataframe=df, time_column="Timestamp" ) # Returns an object containing a 'min' key for the minimum time value, and a 'max' key for the maximum time value.
File Processing Functions
Elwood includes functions to process raster and NetCDF files:
raster2df: Converts a raster file to a dataframe. TheInRasterinput is a string representing the path to the input raster file. The optionalfeature_nameinput is a string that specifies the name of the feature column in the resulting dataframe. The optionalbandinput is an integer representing the band number to extract from the raster. The optionalnodatavalinput is an integer representing the nodata value in the raster. The optionaldateinput is a string representing a date associated with the raster data. The optionalband_nameinput is a string specifying the name of the band column in the resulting dataframe. The optionalbandsinput is a dictionary specifying additional bands to extract with their corresponding names. The optionalband_typeinput is a string indicating the type of data in the band column.df_result = elwood.raster2df( InRaster="path/to/raster/file.tif", feature_name="feature", band=0, nodataval=-9999, date="2023-08-25", band_name="feature2", bands={"band1": 1, "band2": 2}, band_type="category" ) # Returns a dataframe converted from the raster file.netcdf2df: Converts a NetCDF file to a dataframe. Thenetcdfinput is a string representing the path to the NetCDF file.df_result = elwood.netcdf2df("path/to/netcdf/file.nc") # Returns a dataframe converted from the NetCDF file.