nmdc_api_utilities package
Submodules
nmdc_api_utilities.api module
nmdc_api_utilities.collection module
nmdc_api_utilities.data_processing module
- class nmdc_api_utilities.data_processing.DataProcessing
Bases: object
- build_filter(attributes, exact_match=False)
Create a MongoDB filter using $regex for each attribute in the input dictionary. For nested attributes, use dot notation.
- Parameters:
attributes (dict) – Dictionary of attribute names and the values to match using regex. Example: {"name": "example", "description": "example", "geo_loc_name": "example"}
exact_match (bool) – Whether each attribute value must match exactly (True) or may match partially (False). Defaults to False, so the caller does not need to supply an exact value. Internally, this flag determines whether the attribute value is wrapped in a regex expression.
Returns: dict: A MongoDB filter dictionary.
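To illustrate the kind of dictionary described above, here is a minimal sketch. It is not the package's implementation: the helper name sketch_regex_filter, the escaping of special characters, the case-insensitive "$options" flag, and the "outer.inner" key are all assumptions made for illustration.

```python
import re


def sketch_regex_filter(attributes: dict, exact_match: bool = False) -> dict:
    """Hypothetical stand-in for build_filter, written only for illustration."""
    if exact_match:
        # Exact match: pass the values through unchanged.
        return dict(attributes)
    # Partial match: wrap each value in a $regex clause.
    # Escaping and the "i" option are assumptions of this sketch.
    return {
        name: {"$regex": re.escape(str(value)), "$options": "i"}
        for name, value in attributes.items()
    }


# "outer.inner" is a placeholder showing dot notation for a nested attribute.
partial = sketch_regex_filter({"name": "example", "outer.inner": "example"})
exact = sketch_regex_filter({"name": "example"}, exact_match=True)
print(partial)
print(exact)
```

With exact_match left at its default, every value ends up inside a "$regex" clause; with exact_match=True the input dictionary is returned as given.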
- convert_to_df(data: list) → DataFrame
Convert a list of dictionaries to a pandas dataframe.
- Parameters:
data (list) – A list of dictionaries.
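A roughly equivalent conversion in plain pandas might look like the sketch below. The sample records and field names are made up, and the method itself may do more than this.

```python
import pandas as pd

# Made-up records standing in for API results.
records = [
    {"id": "rec-1", "name": "first"},
    {"id": "rec-2", "name": "second"},
]

# Each dictionary becomes a row; each key becomes a column.
df = pd.DataFrame(records)
print(df)
```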
- extract_field(api_results: list, field_name: str) → list
Extract a field from the API results.
- Parameters:
api_results (list) – A list of dictionaries.
field_name (str) – The name of the field to extract.
- Returns:
A list of IDs.
- Return type:
list
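The extraction described above can be sketched with a list comprehension. The helper name sketch_extract_field and the sample records are hypothetical, and skipping records that lack the field is an assumption of this sketch, not documented behavior.

```python
def sketch_extract_field(api_results: list, field_name: str) -> list:
    """Hypothetical stand-in for extract_field, written only for illustration."""
    # Assumption: records that lack the field are skipped rather than raising.
    return [record[field_name] for record in api_results if field_name in record]


# Made-up records standing in for API results.
api_results = [
    {"id": "rec-1", "name": "first"},
    {"id": "rec-2", "name": "second"},
    {"name": "no id here"},
]
ids = sketch_extract_field(api_results, "id")
print(ids)
```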
- merge_dataframes(column: str, df1: DataFrame, df2: DataFrame) → DataFrame
Merge two dataframes.
- Parameters:
column (str) – The column to merge on.
df1 (pd.DataFrame) – The first dataframe to merge.
df2 (pd.DataFrame) – The second dataframe to merge.
- Returns:
pd.DataFrame
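A merge on a shared column can be sketched in plain pandas as follows. The data is made up, and the inner join type is an assumption; the documentation above does not say which join the method performs.

```python
import pandas as pd

# Made-up dataframes that share an "id" column.
df1 = pd.DataFrame({"id": ["a", "b"], "name": ["x", "y"]})
df2 = pd.DataFrame({"id": ["a", "b"], "size": [1, 2]})

# Assumption: an inner join on the shared column.
merged = pd.merge(df1, df2, on="id", how="inner")
print(merged)
```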
- merge_df(df1, df2, key1: str, key2: str)
Merge two dataframes of API results. This function joins new results with the previous results that were used to make the new API request, matching on one key column from each dataframe.
- Parameters:
df1, df2 – The two dataframes to merge.
key1 (str) – The column name in df1 that is matched against key2 in df2.
key2 (str) – The column name in df2 that is matched against key1 in df1.
Columns that contain list-like elements are identified automatically and exploded, because drop_duplicates cannot handle list elements.
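The explode-then-merge idea can be sketched as below. This is an illustration under stated assumptions, not the package's code: the helper names, the choice to explode each dataframe before merging, the inner join, and the sample data are all hypothetical.

```python
import pandas as pd


def _explode_list_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Explode every column that holds list-like cells, then drop duplicate rows."""
    for col in df.columns:
        has_lists = df[col].map(lambda v: isinstance(v, (list, tuple, set))).any()
        if has_lists:
            # One output row per element of each list-like cell.
            df = df.explode(col)
    return df.drop_duplicates().reset_index(drop=True)


def sketch_merge_df(df1, df2, key1: str, key2: str) -> pd.DataFrame:
    """Hypothetical stand-in for merge_df, written only for illustration."""
    left = _explode_list_columns(df1)
    right = _explode_list_columns(df2)
    # Assumption: an inner join matching key1 in df1 against key2 in df2.
    merged = pd.merge(left, right, left_on=key1, right_on=key2)
    return merged.drop_duplicates().reset_index(drop=True)


# Made-up data: each result lists the samples it was derived from.
samples = pd.DataFrame({"sample_id": ["s1", "s2"]})
results = pd.DataFrame({"result_id": ["r1", "r2"], "inputs": [["s1", "s2"], ["s1"]]})
merged = sketch_merge_df(samples, results, "sample_id", "inputs")
print(merged)
```

Exploding first means the "inputs" column holds one scalar per row, so it can be matched against "sample_id" and the rows can be deduplicated.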
- rename_columns(df: DataFrame, new_col_names: list) → DataFrame
Rename columns in a pandas dataframe.
- Parameters:
df (pd.DataFrame) – The pandas dataframe whose columns will be renamed.
new_col_names (list) – A list of new column names. Names MUST be in the same order as the columns in the dataframe.
- Example:
If the current column names are ['old_col1', 'old_col2', 'old_col3'], pass in the new names as ['new_col1', 'new_col2', 'new_col3'].
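The positional renaming in the example above can be sketched in plain pandas. This shows one way to get the same effect and is not necessarily how the method is implemented; the data values are made up.

```python
import pandas as pd

df = pd.DataFrame({"old_col1": [1], "old_col2": [2], "old_col3": [3]})
new_col_names = ["new_col1", "new_col2", "new_col3"]

# Pair each existing column with the new name at the same position.
renamed = df.rename(columns=dict(zip(df.columns, new_col_names)))
print(list(renamed.columns))
```

Because names are paired by position, a list in the wrong order silently mislabels columns, which is why the order requirement matters.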