API Reference

This page documents objects and functions provided by flux-data-qaqc.

Data

class fluxdataqaqc.Data(config)[source]

Bases: Plot, Convert

An object for interfacing flux-data-qaqc with input metadata (config) and time series input. It provides methods and attributes for parsing, temporal analysis, visualization, and filtering data.

A Data object is initialized from a config file (see Setting up a config file) with metadata for an eddy covariance tower or other dataset containing time series meteorological data. It serves as a starting point in the Python API of the energy balance closure analysis and data validation routines that are provided by flux-data-qaqc.

Manual pre-filtering of data based on user-defined quality is aided by the Data.apply_qc_flags method. Weighted or non-weighted means of variables with multiple sensors/recordings are computed upon initialization if these options are declared in the config file. The Data class also includes the Data.df property, which returns the time series data in the form of a pandas.DataFrame object for custom workflows. Data inherits line and scatter plot methods from Plot, which allows for the creation of interactive visualizations of input time series data.

climate_file

Absolute path to climate input file.

Type:: pathlib.Path

config

Config parser instance created from the data within the config.ini file.

Type:: configparser.ConfigParser

config_file

Absolute path to config.ini file used for initialization of Data instance.

Type:: pathlib.Path

header

Header as found in input climate file.

Type:: numpy.ndarray or pandas.DataFrame.index

elevation

Site elevation in meters as set in config.ini.

Type:: float

inv_map

Dictionary with input climate file names as keys and internal names as values. May only include pairs when they differ.

Type:: dict

latitude

Site latitude in decimal degrees, set in config.

Type:: float

longitude

Site longitude in decimal degrees, set in config.

Type:: float

out_dir

Default directory to save output of QaQc.write or QaQc.plot methods.

Type:: pathlib.Path

plot_file

Path to plot file once it is created/saved by Data.plot.

Type:: pathlib.Path or None

site_id

Site ID as found in site_id config.ini entry.

Type:: str

soil_var_weight_pairs

Dictionary with names and weights for weighted averaging of soil heat flux or soil moisture variables.

Type:: dict

qc_var_pairs

Dictionary with variable names as keys and QC value columns (numeric or characters) as values.

Type:: dict

units

Dictionary with internal variable names as keys and units as found in config as values.

Type:: dict

variables

Dictionary with internal names for variables as keys and names as found in the input data as values.

Type:: dict

variable_names_dict

Dictionary with internal variable names as keys and keys in config.ini file as values.

Type:: dict

xl_parser

Engine for reading Excel files with pandas. If None, use ‘openpyxl’.

Type:: str or None

apply_qc_flags(threshold=None, flag=None, threshold_inequality='lt')[source]

Apply user-provided QC values or flags for climate variables to filter poor-quality data by converting them to null values; updates Data.df.

Specifically, where the QC value is < threshold, the variable’s value for that date-time is changed to null. The other option is to use a column of flags, e.g. ‘x’ for data values to be filtered out. The threshold value or flag may be specified in the config file’s METADATA section; otherwise they should be assigned as keyword arguments here.

Specification of which QC (flag or numeric threshold) columns should be applied to which variables is set in the DATA section of the config file. For datasets with QC value columns with names identical to the variable they correspond to with the suffix “_QC” the QC column names for each variable do not need to be specified in the config file.

Keyword Arguments:

threshold (float) – default None. Threshold for QC values, if the QC value is below threshold replace that variable’s value with null.
flag (str, list, or tuple) – default None. Character flag signifying data to filter out. Can be list or tuple of multiple flags.
threshold_inequality (str) – default ‘lt’. ‘lt’ for filtering values that are less than threshold value, ‘gt’ for filtering values that are greater.

Returns:

None

Example

If the input time series file has a column with numeric quality values named “LE_QC” which signify the data quality for latent energy measurements, then in the config.ini file’s DATA section the following must be specified:

[DATA]
latent_heat_flux_qc = LE_QC
...

Next, specify the threshold used to filter this column when calling Data.apply_qc_flags. For example, if you want to remove all data entries of latent energy where the “LE_QC” value is below 0.5, then the threshold value would be 0.5. The threshold can either be set in the config file or passed as an argument. If it is set in the config file, i.e.:

[METADATA]
qc_threshold = 0.5

Then you would simply call the method and this threshold would be applied to all QC columns specified in the config file,

>>> from fluxdataqaqc import Data
>>> d = Data('path/to/config.ini')
>>> d.apply_qc_flags()

Alternatively, if the threshold is not defined in the config file or if you would like to use a different value then pass it in,

>>> d.apply_qc_flags(threshold=2.5)

Lastly, this method can also filter based on a single character flag or a list of flags, e.g. “x” or “bad”, given that the column containing these is specified in the config file for whichever variable they are to be applied to. For example, if a flag column contains multiple flags signifying different data quality control info and two in particular signify poor quality data, say “b” and “a”, then apply them either in the config file:

[METADATA]
qc_flag = b,a

Or within Python

>>> d.apply_qc_flags(flag=['b', 'a'])

For more explanation and examples see the “Configuration Options” section of the online documentation.
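The filtering rule described for Data.apply_qc_flags can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the function name `filter_by_qc` and the list-based inputs are assumptions made for the example, whereas flux-data-qaqc operates on the columns of Data.df.

```python
def filter_by_qc(values, qc, threshold=None, flag=None,
                 threshold_inequality='lt'):
    """Return a copy of values with entries set to None where QC fails.

    Hypothetical helper for illustration only; it mimics the documented
    behavior of Data.apply_qc_flags on plain lists.
    """
    # accept a single flag or a list/tuple of flags
    if flag is None:
        flags = set()
    elif isinstance(flag, (list, tuple)):
        flags = set(flag)
    else:
        flags = {flag}

    filtered = []
    for value, q in zip(values, qc):
        bad = False
        if isinstance(q, str):
            # character flag column: drop values whose flag is listed
            bad = q in flags
        elif threshold is not None and q is not None:
            # numeric QC column: compare against the threshold
            if threshold_inequality == 'lt':
                bad = q < threshold
            else:  # 'gt'
                bad = q > threshold
        filtered.append(None if bad else value)
    return filtered


le = [10.2, 11.5, 9.8, 12.1]
# numeric QC values, filter where QC < 0.5
print(filter_by_qc(le, [0.9, 0.3, 0.7, 0.1], threshold=0.5))
# character flags, filter where the flag is 'b' or 'a'
print(filter_by_qc(le, ['', 'b', 'a', 'ok'], flag=['b', 'a']))
```

Entries that fail the check become None in the returned list, which plays the role of the null values written to Data.df.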

property df

Pull variables out of the config and climate time series files and load them into a datetime-indexed pandas.DataFrame.

Metadata about input time series file format: “missing_data_value”, “skiprows”, and “date_parser” are utilized when first loading the df into memory. Also, weighted and non-weighted averaging of multiple measurements of the same climatic variable occurs on the first call of Data.df, if these options are declared in the config file. For more details and example uses of these config options please see the “Configuration Options” section of the online documentation.

Returns:: df (pandas.DataFrame)

Examples

You can utilize the df property as with any pandas.DataFrame object. However, if you would like to make changes to the data you must first make a copy, then make the changes and then reassign it to Data.df, e.g. if you wanted to add 5 degrees to air temp.

>>> from fluxdataqaqc import Data
>>> d = Data('path_to_config.ini')
>>> df = d.df.copy()
>>> df['air_temp_column'] = df['air_temp_column'] + 5
>>> d.df = df

The functionality shown above allows for user-controlled preprocessing and modification of any time series data in the initial dataset. It also allows for adding new columns, but if they are variables used by flux-data-qaqc, e.g. Rn or other energy balance variables, be sure to also update Data.variables and Data.units with the appropriate entries. New or modified values will be used in any further analysis/plotting routines within flux-data-qaqc.

By default the names of variables as found within input data are retained in Data.df. However, you can use the flux-data-qaqc naming scheme, which can be viewed in Data.variable_names_dict, by using the Data.inv_map dictionary, which maps from user-defined names to internal names (as opposed to Data.variables, which maps from internal names to user-defined names). For example if your input data had the following names for LE, H, Rn, and G set in your config:

[DATA]
net_radiation_col = Net radiation, W/m2
ground_flux_col = Soil-heat flux, W/m2
latent_heat_flux_col = Latent-heat flux, W/m2
sensible_heat_flux_col = Sensible-heat flux, W/m2

Then the Data.df will utilize the same names, e.g.

>>> # d is a Data instance
>>> d.df.head()

produces:

date	Net radiation, W/m2	Latent-heat flux, W/m2	Sensible-heat flux, W/m2	Soil-heat flux, W/m2
10/1/2009 0:00	-54.02421778	0.70761	0.95511	-40.42365926
10/1/2009 0:30	-51.07744708	0.04837	-1.24935	-33.35383253
10/1/2009 1:00	-50.99438925	0.68862	1.91101	-43.17900525
10/1/2009 1:30	-51.35032377	-1.85829	-15.4944	-40.86201497
10/1/2009 2:00	-51.06604228	-1.80485	-19.1357	-39.80936855

Here is how you could rename your dataframe using flux-data-qaqc internal names,

>>> d.df.rename(columns=d.inv_map).head()

date	Rn	LE	H	G
10/1/2009 0:00	-54.02421778	0.70761	0.95511	-40.42365926
10/1/2009 0:30	-51.07744708	0.04837	-1.24935	-33.35383253
10/1/2009 1:00	-50.99438925	0.68862	1.91101	-43.17900525
10/1/2009 1:30	-51.35032377	-1.85829	-15.4944	-40.86201497
10/1/2009 2:00	-51.06604228	-1.80485	-19.1357	-39.80936855

A minor note on variable naming, if your input data variables use exactly the same names used by flux-data-qaqc, they will be renamed by adding the prefix “input_”, e.g. “G” becomes “input_G” on the first time reading the data from disk, i.e. the first time accessing Data.df.
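The relationship between Data.variables and Data.inv_map can be illustrated with plain dictionaries. The sketch below assumes that inv_map is simply the inverse of variables, restricted to pairs whose names differ (as the inv_map attribute description suggests). The function name `invert_variable_map` is hypothetical, and the column names are taken from the config example above.

```python
def invert_variable_map(variables):
    """Map user-defined column names back to internal names.

    Pairs whose user-defined and internal names are identical are
    skipped, since renaming them would have no effect.
    """
    return {
        user_name: internal
        for internal, user_name in variables.items()
        if user_name != internal
    }


# internal name -> name found in the input file
variables = {
    'Rn': 'Net radiation, W/m2',
    'G': 'Soil-heat flux, W/m2',
    'LE': 'Latent-heat flux, W/m2',
    'H': 'Sensible-heat flux, W/m2',
}
inv_map = invert_variable_map(variables)
print(inv_map['Net radiation, W/m2'])
```

A dictionary of this shape is what pandas.DataFrame.rename expects for its columns argument, which is why passing the inverse map renames input columns to internal names.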

Note

The temporal frequency of the input data is retained, unlike QaQc.df, which automatically resamples time series data to daily frequency.

hourly_ASCE_refET(reference='short', anemometer_height=None)[source]

Calculate hourly ASCE standardized short (ETo) or tall (ETr) reference ET from input data and wind measurement height.

If the input data’s temporal frequency is finer than hourly (e.g. 30 minute), the input data will be resampled to hourly and the output reference ET time series will be returned as a datetime-indexed pandas.Series object. If the input data is already hourly, the resulting time series will automatically be merged into the Data.df dataframe and named “ASCE_ETo” or “ASCE_ETr” respectively.

Keyword Arguments:

reference (str) – default “short”, calculate tall or short ASCE reference ET.
anemometer_height (float or None) – wind measurement height in meters, default None. If None then look for the “anemometer_height” entry in the METADATA section of the config.ini; if not there then print a warning and use 2 meters.

Returns:

None or pandas.Series

Hint

The input variables needed to run this method are: vapor pressure, wind speed, incoming shortwave radiation, and average air temperature. If vapor pressure deficit and average air temperature exist, the actual vapor pressure will automatically be calculated.
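The derivation of actual vapor pressure from vapor pressure deficit and air temperature mentioned in the hint can be sketched as follows. This uses the standard FAO-56/ASCE saturation vapor pressure equation (kPa, degrees Celsius). Whether flux-data-qaqc uses exactly this form internally is an assumption, and the function names are hypothetical.

```python
import math


def saturation_vapor_pressure(t_air_c):
    """Saturation vapor pressure (kPa) at air temperature t_air_c (deg C)."""
    return 0.6108 * math.exp(17.27 * t_air_c / (t_air_c + 237.3))


def actual_vapor_pressure(vpd_kpa, t_air_c):
    """Actual vapor pressure (kPa) from vapor pressure deficit and air temp."""
    # VPD is defined as es - ea, so ea = es - VPD
    return saturation_vapor_pressure(t_air_c) - vpd_kpa


es = saturation_vapor_pressure(20.0)
ea = actual_vapor_pressure(1.0, 20.0)
print(round(es, 3), round(ea, 3))
```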

plot(ncols=1, output_type='save', out_file=None, suptitle='', plot_width=1000, plot_height=450, sizing_mode='fixed', merge_tools=False, link_x=True, **kwargs)[source]

Creates a series of interactive diagnostic line and scatter plots of input data in whichever temporal frequency it is in.

The main interactive features of the plots include: pan, selection and scroll zoom, hover tool that shows paired variable values including date, and linked x-axes that pan/zoom together for daily and monthly time series plots.

It is possible to change the format of the output plots including adjusting the dimensions of subplots, defining the number of columns of subplots, setting a super title that accepts HTML, and other options. If variables are not present for plots they will not be created and a warning message will be printed. There are two options for output: open a temporary file for viewing or save a copy to Data.out_dir.

A list of all potential time series plots created:

energy balance components
radiation components
multiple soil heat flux measurements
air temperature
vapor pressure and vapor pressure deficit
wind speed
precipitation
latent energy
multiple soil moisture measurements

Keyword Arguments:

ncols (int) – default 1. Number of columns of subplots.
output_type (str) – default “save”. How to output plots, “save”, “show” in browser, “notebook” for Jupyter Notebooks, “return_figs” to return a list of bokeh.plotting.figure.Figure objects, or “return_grid” to return the bokeh.layouts.gridplot.
out_file (str or None) – default None. Path to save output file, if None save output to Data.out_dir with the name [site_id]_input_plots.html where [site_id] is Data.site_id.
suptitle (str or None) – default None. Super title to go above plots, accepts HTML/CSS syntax.
plot_width (int) – default 1000. Width of subplots in pixels.
plot_height (int) – default 450. Height of subplots in pixels, note for subplots the height will be forced as the same as plot_width.
sizing_mode (str) – default “fixed”. Bokeh option to scale dimensions of bokeh.layouts.gridplot.
merge_tools (bool) – default False. Merges all subplots toolbars into a single location if True.
link_x (bool) – default True. If True link x axes of daily time series plots and monthly time series plots so that when zooming or panning on one plot they all zoom accordingly, the axes will also be of the same length.

Example

Starting from a correctly formatted config.ini and climate time series file, this example shows how to read in the data and produce a series of plots of input data as it is found in the input data file (unlike QaQc.plot which produces plots at daily and monthly temporal frequency). This example also shows how to display a title at the top of plot with the site’s location and site ID.

>>> from fluxdataqaqc import Data
>>> d = Data('path/to/config.ini')
>>> # create plot title from site ID and location in N. America
>>> title = "<b>Site:</b> {}; <b>Lat:</b> {}N; <b>Long:</b> {}W".format(
...     d.site_id, d.latitude, d.longitude
... )
>>> d.plot(
...     ncols=2, output_type='show', plot_width=500, suptitle=title
... )

Note, we specified the width of plots to be smaller than default because we want both columns of subplots to be viewable on the screen.

Tip

To reset all subplots at once, refresh the page with your web browser.

Note

Additional keyword arguments that are recognized by bokeh.layouts.gridplot are also accepted by Data.plot.
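When out_file is None, the out_file description says the plot is saved to Data.out_dir under the name [site_id]_input_plots.html. A minimal sketch of how such a path could be assembled with pathlib is shown below. The helper `default_plot_path` and the site ID used are hypothetical and only illustrate the naming convention.

```python
from pathlib import Path


def default_plot_path(out_dir, site_id):
    """Build the documented default file name for Data.plot output."""
    return Path(out_dir) / '{}_input_plots.html'.format(site_id)


# 'site_a' is a made-up site ID for illustration
print(default_plot_path('output', 'site_a'))
```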

QaQc

class fluxdataqaqc.QaQc(data=None, drop_gaps=True, daily_frac=1.0, max_interp_hours=2, max_interp_hours_night=4)[source]

Bases: Plot, Convert

Numerical routines for correcting daily energy balance closure for eddy covariance data and other data analysis tools.

Two routines are provided for improving energy balance closure by adjusting the turbulent fluxes (latent energy and sensible heat): the Energy Balance Ratio method (modified from FLUXNET) and the Bowen Ratio method.
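As a rough illustration of the Bowen Ratio idea, the sketch below partitions the available energy \(Rn - G\) between LE and H while keeping their ratio fixed. This is the textbook form of the approach; any filtering or other details applied by flux-data-qaqc are not shown, and the function name is hypothetical.

```python
def bowen_ratio_correction(rn, g, le, h):
    """Adjust LE and H so that LE + H equals Rn - G, keeping H / LE fixed."""
    br = h / le                      # Bowen ratio
    le_corr = (rn - g) / (1.0 + br)  # partition the available energy
    h_corr = le_corr * br
    return le_corr, h_corr


# made-up daily mean fluxes in W/m2
le_corr, h_corr = bowen_ratio_correction(rn=150.0, g=10.0, le=80.0, h=40.0)
print(le_corr, h_corr)
```

After the adjustment the corrected fluxes sum to the available energy, so closure is forced exactly for that day.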

The QaQc object also has multiple tools for temporal frequency aggregation and resampling, estimation of climatic and statistical variables (e.g. ET and potential shortwave radiation), downloading gridMET reference ET, managing data and metadata, interactive validation plots, and managing a structure for input and output data files. Input data is expected to be a Data instance or a pandas.DataFrame.

Keyword Arguments:

data (Data) – Data instance to create QaQc instance.
drop_gaps (bool) – default True. If True, automatically filter variables on days where the fraction of sub-daily measurements present is less than daily_frac.
daily_frac (float) – default 1.00. Fraction of sub-daily data required otherwise the daily value will be filtered out if drop_gaps is True. E.g. if daily_frac = 0.5 and the input data is hourly, then data on days with less than 12 hours of data will be forced to null within QaQc.df. This is important because systematic diurnal gaps will affect the automatic resampling that occurs when creating a QaQc instance and the daily data is used in closure corrections, other calculations, and plots. If sub-daily linear interpolation is applied to energy balance variables the gaps are counted after the interpolation.
max_interp_hours (None or float) – default 2. Length of largest gap to fill with linear interpolation in energy balance variables if input data’s temporal frequency is less than daily. This value will be used to fill gaps when \(Rn > 0\) or \(Rn\) is missing during each day.
max_interp_hours_night (None or float) – default 4. Length of largest gap to fill with linear interpolation in energy balance variables if input data’s temporal frequency is less than daily when \(Rn < 0\) within 12:00PM-12:00PM daily intervals.
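The effect of daily_frac can be sketched for a single day of sub-daily samples. This is a simplified illustration under the assumption that the daily value is a mean and that gaps are marked with None. The function name is hypothetical, and the library itself works on pandas objects after any interpolation has been applied.

```python
def apply_daily_frac(samples, n_samples_per_day, daily_frac=1.0):
    """Return the daily mean, or None if too few sub-daily samples exist.

    samples: sub-daily values for one day, with None marking gaps.
    """
    valid = [s for s in samples if s is not None]
    if not valid or len(valid) / n_samples_per_day < daily_frac:
        return None  # day is filtered out
    return sum(valid) / len(valid)


hourly = [100.0] * 10 + [None] * 14   # only 10 of 24 hours present
print(apply_daily_frac(hourly, 24, daily_frac=0.5))   # below 12 hours
print(apply_daily_frac(hourly, 24, daily_frac=0.25))  # at least 6 hours
```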

agg_dict

Dictionary with internal variable names as keys and method of temporal resampling (e.g. “mean” or “sum”) as values.

Type:: dict

config

Config parser instance created from the data within the config.ini file.

Type:: configparser.ConfigParser

config_file

Absolute path to config.ini file used for initialization of the fluxdataqaqc.Data instance used to create the QaQc instance.

Type:: pathlib.Path

corrected

False until an energy balance closure correction has been run by calling QaQc.correct_data.

Type:: bool

corr_methods

List of Energy Balance Closure correction routines usable by QaQc.correct_data.

Type:: tuple

corr_meth

Name of most recently applied energy balance closure correction.

Type:: str or None

elevation

Site elevation in meters.

Type:: float

gridMET_exists

True if path to matching gridMET time series file exists on disk and has time series for reference ET and precipitation and the dates for these fully overlap with the energy balance variables, i.e. the date index of QaQc.df.

Type:: bool

gridMET_meta

Dictionary with information for gridMET variables that may be downloaded using QaQc.download_gridMET.

Type:: dict

inv_map

Dictionary with input climate file names as keys and internal names as values. May only include pairs when they differ.

Type:: dict

latitude

Site latitude in decimal degrees.

Type:: float

longitude

Site longitude in decimal degrees.

Type:: float

out_dir

Default directory to save output of QaQc.write or QaQc.plot methods.

Type:: pathlib.Path

n_samples_per_day

If the initial time series temporal frequency is less than daily then this value will be updated to the number of samples detected per day, useful for post-processing based on the count of sub-daily gaps in energy balance variables, e.g. “LE_subday_gaps”.

Type:: int

plot_file

Path to plot file once it is created/saved by QaQc.plot.

Type:: pathlib.Path or None

site_id

Site ID.

Type:: str

temporal_freq

Temporal frequency of initial (as found in input climate file) data as determined by pandas.infer_freq.

Type:: str

units

Dictionary with internal variable names as keys and units as found in config as values.

Type:: dict

variables

Dictionary with internal variable names as keys and names as found in the input data as values.

Type:: dict

Note

Upon initialization of a QaQc instance the temporal frequency of the input data is checked using pandas.infer_freq, which does not always correctly parse datetime indices. If it is not able to determine the temporal frequency, the time series will be resampled to daily frequency, but if it is in fact already at daily frequency the data will be unchanged. In this case QaQc.temporal_freq will be set to “na”.

agg_dict = {'ASCE_ETo': 'sum', 'ASCE_ETr': 'sum', 'ET': 'sum', 'ET_corr': 'sum', 'ET_fill': 'sum', 'ET_fill_val': 'sum', 'ET_gap': 'sum', 'ET_user_corr': 'sum', 'G': 'mean', 'G_subday_gaps': 'sum', 'H': 'mean', 'H_corr': 'mean', 'H_subday_gaps': 'sum', 'H_user_corr': 'mean', 'LE': 'mean', 'LE_corr': 'mean', 'LE_subday_gaps': 'sum', 'LE_user_corr': 'mean', 'Rn': 'mean', 'Rn_subday_gaps': 'sum', 'br': 'mean', 'ebr': 'mean', 'ebr_5day_clim': 'mean', 'ebr_corr': 'mean', 'ebr_user_corr': 'mean', 'energy': 'mean', 'flux': 'mean', 'flux_corr': 'mean', 'gridMET_ETo': 'sum', 'gridMET_ETr': 'sum', 'gridMET_prcp': 'sum', 'lw_in': 'mean', 'lw_out': 'mean', 'ppt': 'sum', 'ppt_corr': 'sum', 'rh': 'mean', 'rso': 'mean', 'sw_in': 'mean', 'sw_out': 'mean', 'sw_pot': 'mean', 't_avg': 'mean', 't_dew': 'mean', 't_max': 'mean', 't_min': 'mean', 'vp': 'mean', 'vpd': 'mean', 'ws': 'mean'}

corr_methods = ('ebr', 'br', 'lin_regress')

correct_data(meth='ebr', et_gap_fill=True, y='Rn', refET='ETr', x=['G', 'LE', 'H'], fit_intercept=False)[source]

Correct turbulent fluxes to improve energy balance closure using an Energy Balance Ratio method modified from FLUXNET.

Currently three correction options are available: ‘ebr’ (Energy Balance Ratio), ‘br’ (Bowen Ratio), and ‘lin_regress’ (least squares linear regression). If you use one method followed by another, the corrected versions of LE, H, ET, ebr, etc. will be overwritten with the most recently used approach.

This method also computes potential clear sky radiation (saved as “rso”) using an ASCE approach based on station elevation and latitude. ET is calculated from raw and corrected LE using daily air temperature to correct the latent heat of vaporization; if air temperature is not available in the input data then it is assumed to be 20 degrees Celsius.
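The conversion from latent energy to ET can be sketched as below. The example assumes the linear temperature dependence of the latent heat of vaporization given in FAO-56 (2.501 - 0.002361 T, in MJ/kg). Whether flux-data-qaqc uses identical constants is an assumption, and the function name is hypothetical.

```python
def le_to_et(le_w_m2, t_air_c=20.0):
    """Convert mean daily latent energy (W/m2) to ET (mm/day)."""
    # latent heat of vaporization in J/kg, adjusted for air temperature
    lambda_v = (2.501 - 0.002361 * t_air_c) * 1e6
    # W/m2 -> J/m2/day, then divide by J/kg; 1 kg/m2 of water equals 1 mm
    return le_w_m2 * 86400.0 / lambda_v


# 100 W/m2 at the fallback air temperature of 20 degrees Celsius
print(round(le_to_et(100.0), 2))
```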

Corrected or otherwise newly calculated variables are named using the following suffixes to distinguish them:

uncorrected LE, H, etc. from input data have no suffix
_corr uses adjusted LE, H, etc. from the correction method used
_user_corr uses corrected LE, H, etc. found in data file (if provided)

Parameters:

y (str) – name of dependent variable for regression, must be in QaQc.variables keys, or a user-added variable. Only used if meth='lin_regress'.
x (str or list) – name or list of independent variables for regression, names must be in QaQc.variables keys, or a user-added variable. Only used if meth='lin_regress'.

Keyword Arguments:

meth (str) – default ‘ebr’. Method to correct energy balance.
et_gap_fill (bool) – default True. If True fill any remaining gaps in corrected ET with ETr * ETrF, where ETr is alfalfa reference ET from gridMET and ETrF is the filtered, smoothed (7 day moving avg. min 2 days) and linearly interpolated crop coefficient. The number of days in each month for which corrected ET was filled is provided in QaQc.monthly_df as the column “ET_gap”.
refET (str) – default “ETr”. Which gridMET reference product to use for ET gap filling, “ETr” or “ETo” are valid options.
fit_intercept (bool) – default False. Fit intercept for regression or set to zero if False. Only used if meth='lin_regress'.
apply_coefs (bool) – default False. If True then apply fitted coefficients to their respective variables for linear regression correction method, rename the variables with the suffix “_corr”.

Returns:

None

Example

Starting from a correctly formatted config.ini and climate time series file, this example shows how to read in the data and apply the energy balance ratio correction without gap-filling with gridMET ETr x ETrF.

>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.corrected
    False

Now apply the energy balance closure correction

>>> q.correct_data(meth='ebr', et_gap_fill=False)
>>> q.corrected
    True

Note

If et_gap_fill is set to True (default) the gap filled days of corrected ET will be used to recalculate LE_corr for those days with the gap filled values, i.e. LE_corr will also be gap-filled.

Note

The ebr_corr variable or energy balance closure ratio is calculated from the corrected versions of LE and H independent of the method. When using the ‘ebr’ method the energy balance correction factor (what is applied to the raw H and LE) is left as calculated (inverse of ebr) and saved as ebc_cf.
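The core idea behind the energy balance ratio and its correction factor can be sketched for a single day. This is deliberately simplified: the routine documented here is described as modified from FLUXNET and may filter or smooth the ratio over time, which is omitted. The function names are hypothetical.

```python
def energy_balance_ratio(rn, g, le, h):
    """Ratio of turbulent fluxes to available energy."""
    return (le + h) / (rn - g)


def ebr_correct(rn, g, le, h):
    """Scale LE and H by the inverse of the energy balance ratio."""
    cf = 1.0 / energy_balance_ratio(rn, g, le, h)  # correction factor
    return le * cf, h * cf


# made-up daily mean fluxes in W/m2
rn, g, le, h = 150.0, 10.0, 80.0, 40.0
le_corr, h_corr = ebr_correct(rn, g, le, h)
print(energy_balance_ratio(rn, g, le, h))            # raw closure
print(energy_balance_ratio(rn, g, le_corr, h_corr))  # after correction
```

With an unsmoothed daily factor as used here the corrected ratio is 1 by construction; when the factor is derived from other days, the corrected ratio will generally differ from 1.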

See also

For explanation of the linear regression method see the QaQc.lin_regress method. Calling that method with the keyword argument apply_coefs=True, \(Rn\) as the y variable, and the other energy balance components as the x variables will give the same result as the default inputs to this function when meth='lin_regress'.

daily_ASCE_refET(reference='short', anemometer_height=None)[source]

Calculate daily ASCE standardized short (ETo) or tall (ETr) reference ET from input data and wind measurement height.

The resulting time series will automatically be merged into the QaQc.df dataframe named “ASCE_ETo” or “ASCE_ETr” respectively.

Keyword Arguments:

reference (str) – default “short”, calculate tall or short ASCE reference ET.
anemometer_height (float or None) – wind measurement height in meters, default None. If None then look for the “anemometer_height” entry in the METADATA section of the config.ini; if not there then print a warning and use 2 meters.

Returns:

None

Note

If the hourly ASCE variables were previously calculated from a Data instance they will be overwritten, as they are saved with the same names.

property df: See fluxdataqaqc.Data.df as the only difference is that the QaQc.df is first resampled to daily frequency.

download_gridMET(variables=None)[source]

Download reference ET (alfalfa and grass) and precipitation from gridMET for all days in flux station time series by default.

Also has ability to download other specific gridMET variables by passing a list of gridMET variable names. Possible variables and their long form can be found in QaQc.gridMET_meta.

Upon download, gridMET time series for the nearest gridMET cell will be merged into the instance’s dataframe attribute QaQc.df and all gridMET variable names will have the prefix “gridMET_” for identification.

The gridMET time series file will be saved to a subdirectory called “gridMET_data” within the directory that contains the config file for the current QaQc instance and named with the site ID and gridMET cell centroid lat and long coordinates in decimal degrees.

Parameters:: variables (None, str, list, or tuple) – default None. List of gridMET variable names to download, if None download ETr and precipitation. See the keys of the QaQc.gridMET_meta dictionary for a list of all variables that can be downloaded by this method.
Returns:: None

Note

Any previously downloaded gridMET time series will be overwritten when calling the method. However, if using the gap filling method of the “ebr” correction routine, the download will not overwrite currently existing data so long as gridMET reference ET and precipitation are on disk and their path is properly set in the config file.

classmethod from_dataframe(df, site_id, elev_m, lat_dec_deg, var_dict, drop_gaps=True, daily_frac=1.0, max_interp_hours=2, max_interp_hours_night=4)[source]

Create a QaQc object from a pandas.DataFrame object.

Parameters:

df (pandas.DataFrame) – DataFrame of climate variables with datetime index named ‘date’
site_id (str) – site identifier such as station name
elev_m (int or float) – elevation of site in meters
lat_dec_deg (float) – latitude of site in decimal degrees
var_dict (dict) – dictionary that maps flux-data-qaqc variable names to user’s columns in df e.g. {‘Rn’: ‘netrad’, …} see fluxdataqaqc.Data.variable_names_dict for list of flux-data-qaqc variable names

Returns:

None

Note

When using this method, any output files (CSVs, plots) will be saved to a directory named “output” in the current working directory.

gridMET_meta = {'ETr': {'name': 'daily_mean_reference_evapotranspiration_alfalfa', 'nc_suffix': 'agg_met_etr_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_ETr', 'units': 'mm'}, 'pet': {'name': 'daily_mean_reference_evapotranspiration_grass', 'nc_suffix': 'agg_met_pet_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_ETo', 'units': 'mm'}, 'pr': {'name': 'precipitation_amount', 'nc_suffix': 'agg_met_pr_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_prcp', 'units': 'mm'}, 'sph': {'name': 'daily_mean_specific_humidity', 'nc_suffix': 'agg_met_sph_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_q', 'units': 'kg/kg'}, 'srad': {'name': 'daily_mean_shortwave_radiation_at_surface', 'nc_suffix': 'agg_met_srad_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_srad', 'units': 'w/m2'}, 'tmmn': {'name': 'daily_minimum_temperature', 'nc_suffix': 'agg_met_tmmn_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_tmin', 'units': 'K'}, 'tmmx': {'name': 'daily_maximum_temperature', 'nc_suffix': 'agg_met_tmmx_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_tmax', 'units': 'K'}, 'vs': {'name': 'daily_mean_wind_speed', 'nc_suffix': 'agg_met_vs_1979_CurrentYear_CONUS.nc#fillmismatch', 'rename': 'gridMET_u10', 'units': 'm/s'}}

lin_regress(y, x, fit_intercept=False, apply_coefs=False)[source]

Least squares linear regression on single or multiple independent variables.

For example, if the dependent variable (y) is \(Rn\) and the independent variables (x) are \(LE\) and \(H\), then the linear regression will solve for the best fit coefficients of \(Rn = c_0 + c_1 LE + c_2 H\). Any number of variables in the QaQc.variables can be used for x and one for y.
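For the simplest case of a single independent variable, the least squares coefficients have a closed form, sketched below with and without an intercept (mirroring the fit_intercept option). This is an illustration only: the function names and data are made up, and the multi-variable case solved by QaQc.lin_regress needs a linear algebra routine that is not shown here.

```python
def fit_through_origin(x, y):
    """Least squares slope c1 for y = c1 * x (no intercept)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def fit_with_intercept(x, y):
    """Least squares (c0, c1) for y = c0 + c1 * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    c1 = sxy / sxx
    return y_mean - c1 * x_mean, c1


# made-up values: y is x shifted by a constant offset of 10
x = [50.0, 100.0, 150.0]
y = [60.0, 110.0, 160.0]
c0, c1 = fit_with_intercept(x, y)
print(c0, c1)
print(fit_through_origin(x, y))
```

Without an intercept the constant offset is absorbed into the slope, which is why the two fits give different coefficients for the same data.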

If the variables chosen for regression are part of the energy balance components, i.e. \(Rn, G, LE, H\), and apply_coefs=True, then the best fit coefficients will be applied to their respective variables with consideration of the energy balance equation, i.e. the signs of the coefficients will be corrected according to \(Rn - G = LE + H\). For example, if y='H' and x=['Rn','G','LE'], i.e. solving \(H = c_0 + c_1 Rn + c_2 G + c_3 LE\), then the coefficients \(c_2\) and \(c_3\) will be multiplied by -1 before applying them to correct \(G\) and \(LE\) according to the energy balance equation.

This method returns a pandas.DataFrame object containing results of the linear regression including the coefficient values, number of data pairs used in the regression, the root-mean-square-error, and coefficient of determination. This table can also be retrieved from the QaQc.lin_regress_results instance attribute.

Parameters:

y (str) – name of dependent variable for regression, must be in QaQc.variables keys, or a user-added variable.
x (str or list) – name or list of independent variables for regression, names must be in QaQc.variables keys, or a user-added variable.

Keyword Arguments:

fit_intercept (bool) – default False. Fit intercept for regression or set to zero if False.
apply_coefs (bool) – default False. If True then apply fitted coefficients to their respective variables, rename the variables with the suffix “_corr”.

Returns:

pandas.DataFrame

Example

Let’s say we wanted to compute the linear relationship between net radiation and the other energy balance components, which may be useful if we have strong confidence in net radiation measurements for example. The resulting coefficients of regression would give us an idea of whether the other components were “under-measured” or “over-measured”. Then, starting with a Data instance:

>>> Q = QaQc(Data_instance)
>>> Q.lin_regress(y='Rn', x=['G','H','LE'], fit_intercept=True)
>>> Q.lin_regress_result

This would produce something like the following,
SITE_ID	Y (dependent var.)	c0 (intercept)	c1 (coef on G)	c2 (coef on LE)	c3 (coef on H)	RMSE (w/m2)	r2 (coef. det.)	n (sample count)
a_site	Rn	6.99350781229883	1.552	1.054	0.943	18.25	0.91	3386

In this case the intercept is telling us that there may be a systematic or constant error in the independent variables and that \(G\) is “under-measured” at the sensor by over 50 percent, etc. if we assume daily \(Rn\) is accurate as measured.

Tip

You may also use multiple linear regression to correct energy balance components using the QaQc.correct_data method by passing the meth='lin_regress' keyword argument.

property monthly_df

Temporally resample time series data to monthly frequency based on monthly means or sums based on QaQc.agg_dict, provides data as pandas.DataFrame.

Note that monthly means or sums are forced to null values if more than 20 percent of a month’s days are missing in the daily data (QaQc.df). Also, for variables that are summed (e.g. ET or precipitation) missing days (if less than 20 percent of the month) will be filled with the month’s daily mean value before summation.
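The gap handling for summed variables can be sketched for one month of daily values. The sketch assumes that a month is set to null when more than 20 percent of its days are missing and that smaller gaps are filled with the month's daily mean before summing. The function name, the None markers, and the handling of exactly 20 percent missing are assumptions for illustration.

```python
def monthly_sum(daily_values, max_missing_frac=0.2):
    """Aggregate one month of daily values (None marks a missing day)."""
    valid = [v for v in daily_values if v is not None]
    n_missing = len(daily_values) - len(valid)
    if not valid or n_missing / len(daily_values) > max_missing_frac:
        return None  # too many gaps, the month is set to null
    daily_mean = sum(valid) / len(valid)
    # fill the missing days with the month's daily mean before summing
    return sum(valid) + n_missing * daily_mean


print(monthly_sum([2.0] * 27 + [None] * 3))   # 10 percent missing
print(monthly_sum([2.0] * 23 + [None] * 7))   # about 23 percent missing
```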

If a QaQc instance has not yet run an energy balance correction i.e. QaQc.corrected = False before accessing monthly_df then the default routine of data correction (energy balance ratio method) will be conducted.

Utilize the QaQc.monthly_df property the same way as fluxdataqaqc.Data.df; see its API documentation for examples.

Tip

If you have additional variables in QaQc.df or would like to change the aggregation method for the monthly time series, adjust the instance attribute QaQc.agg_dict before accessing the QaQc.monthly_df.

plot(ncols=1, output_type='save', out_file=None, suptitle='', plot_width=1000, plot_height=450, sizing_mode='scale_both', merge_tools=False, link_x=True, **kwargs)[source]

Creates a series of interactive diagnostic line and scatter plots of input and computed daily and monthly aggregated data.

The main interactive features of the plots include: pan, selection and scroll zoom, a hover tool that shows paired variable values including date, and linked x-axes that pan/zoom together for daily and monthly time series plots.

It is possible to change the format of the output plots including adjusting the dimensions of subplots, defining the number of columns of subplots, setting a super title that accepts HTML, and other options. If the variables needed for a plot are not present, that plot will not be created and a warning message will be printed. There are two main options for output: open a temporary file for viewing or save a copy to QaQc.out_dir.

A list of all potential time series plots created:

energy balance components
radiation components
incoming shortwave radiation with ASCE potential clear sky (daily only)
multiple soil heat flux measurements
air temperature
vapor pressure and vapor pressure deficit
wind speed
station precipitation and gridMET precipitation
initial and corrected latent energy
initial, corrected, gap filled, and reference evapotranspiration
crop coefficient and smoothed and interpolated crop coefficient
initial and corrected energy balance ratio
multiple soil moisture measurements

A list of all potential scatter plots created:

radiative versus turbulent fluxes, initial and corrected
initial versus corrected latent energy
initial versus corrected evapotranspiration

Keyword Arguments:

ncols (int) – default 1. Number of columns of subplots.
output_type (str) – default “save”. How to output plots: “save”, “show” in browser, “notebook” for Jupyter Notebooks, “return_figs” to return a list of Bokeh bokeh.plotting.figure.Figure objects, or “return_grid” to return the bokeh.layouts.gridplot.
out_file (str or None) – default None. Path to save output file, if None save output to QaQc.out_dir with the name [site_id]_plots.html where [site_id] is QaQc.site_id.
suptitle (str or None) – default None. Super title to go above plots, accepts HTML/CSS syntax.
plot_width (int) – default 1000. Width of subplots in pixels.
plot_height (int) – default 450. Height of subplots in pixels, note for subplots the height will be forced as the same as plot_width.
sizing_mode (str) – default “scale_both”. Bokeh option to scale dimensions of bokeh.layouts.gridplot.
merge_tools (bool) – default False. Merges all subplots toolbars into a single location if True.
link_x (bool) – default True. If True link x axes of daily time series plots and monthly time series plots so that when zooming or panning on one plot they all zoom accordingly, the axes will also be of the same length.

Example

Starting from a correctly formatted config.ini and climate time series file, this example shows how to read in the data and produce the default series of plots for viewing with the addition of text at the top of plot that states the site’s location and ID.

>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()
>>> # create plot title from site ID and location in N. America
>>> title = "<b>Site:</b> {}; <b>Lat:</b> {}N; <b>Long:</b> {}W".format(
...     q.site_id, q.latitude, q.longitude
... )
>>> q.plot(
...     ncols=2, output_type='show', plot_width=500, suptitle=title
... )

Note, we specified the width of plots to be smaller than default because we want both columns of subplots to be viewable on one page.

Tip

To reset all subplots at once, refresh the page with your web browser.

Note

Additional keyword arguments that are recognized by bokeh.layouts.gridplot are also accepted by QaQc.plot.

write(out_dir=None, use_input_names=False)[source]

Save daily and monthly time series of initial and “corrected” data in CSV format.

Note, if the energy balance closure correction (QaQc.correct_data) has not been run, this method will run it with default options before saving time series files to disk.

The default location for saving output time series files is an “output” subdirectory of the parent directory containing the config.ini file that was used to create the fluxdataqaqc.Data and QaQc objects. The names of the files will start with the site_id and have either the “daily_data” or “monthly_data” suffix.

Keyword Arguments:

out_dir (str or None) – default None. Directory to save CSVs, if None save to out_dir instance variable (typically “output” directory where config.ini file exists).
use_input_names (bool) – default False. If False use flux-data-qaqc variable names in the output file header, or if True use the user’s input variable names where possible (for variables that were read in and not modified or calculated by flux-data-qaqc).

Returns:

None

Example

Starting from a config.ini file,

>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> # note no energy balance closure correction has been run
>>> q.corrected
    False
>>> q.write()
>>> q.corrected
    True

Note

To save data created by multiple correction routines, be sure to run the correction and then save to different output directories otherwise output files will be overwritten with the most recently used correction option.

Plot

class fluxdataqaqc.Plot[source]

Bases: object

Container of plot routines of fluxdataqaqc including static methods that can be used to create and update interactive line and scatter plots from an arbitrary pandas.DataFrame instance.

Note

The Data and QaQc objects both inherit all methods of Plot, so they can easily be used to create custom interactive time series plots of input data (in fluxdataqaqc.Data.df) and of daily and monthly data (in fluxdataqaqc.QaQc.df and QaQc.monthly_df).

static add_lines(fig, df, plt_vars, colors, x_name, source, labels=None, **kwargs)[source]

Add multiple time series to a bokeh.plotting.figure.Figure object using data from a datetime indexed pandas.DataFrame with an interactive hover tool.

Interactive hover shows the values of all time series data and date that is added to the figure.

Parameters:

fig (bokeh.plotting.figure.Figure) – a figure instance to add the lines to.
df (pandas.DataFrame) – pandas.DataFrame containing time series data.
plt_vars (list) – list of data columns in df to plot.
colors (list) – list of line colors for variables in plt_vars.
x_name (str) – name of the x-axis variable, e.g. the datetime index, in the pandas.DataFrame (df) containing data to plot.
source (bokeh.models.sources.ColumnDataSource) – column data source created from the pandas.DataFrame with data to plot, i.e. df.
labels (list or None) – default None. Labels for each plot variable in plt_vars.

Returns:

None if none of the variables in plt_vars are found in df, otherwise the updated figure.

Return type:

ret (None or bokeh.plotting.figure.Figure)

Example

Similar to Plot.line_plot we first need to create a bokeh.models.sources.ColumnDataSource from a pandas.DataFrame. This example shows how to plot two variables, daily corrected latent energy and sensible heat on the same plot.

>>> from fluxdataqaqc import Data, QaQc, Plot
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()

Now the QaQc instance should have the “LE_corr” (corrected latent energy) and “H_corr” (corrected sensible heat) columns. We can now make a bokeh.models.sources.ColumnDataSource from fluxdataqaqc.QaQc.df or fluxdataqaqc.QaQc.monthly_df,

>>> from bokeh.plotting import ColumnDataSource, figure, show
>>> df = q.df
>>> plt_vars = ['LE_corr', 'H_corr']
>>> colors = ['blue', 'red']
>>> labels = ['LE', 'H']
>>> source = ColumnDataSource(df)
>>> fig = figure(
...     x_axis_label='date', y_axis_label='Corrected Turbulent Flux'
... )
>>> Plot.add_lines(
...     fig, df, plt_vars, colors, 'date', source, labels=labels
... )
>>> show(fig)

Note

This method is also available from the Data and QaQc objects.

static line_plot(fig, x, y, source, color, label=None, x_axis_type='date', **kwargs)[source]

Add a single time series to a bokeh.plotting.figure.Figure object using data from a datetime indexed pandas.DataFrame with an interactive hover tool.

Interactive hover shows the values of all time series data and date that is added to the figure.

Parameters:

fig (bokeh.plotting.figure.Figure) – a figure instance to add the line to.
x (str) – name of the datetime index or column in the pandas.DataFrame containing data to plot.
y (str) – name of the column in the pandas.DataFrame to plot.
source (bokeh.models.sources.ColumnDataSource) – column data source created from the pandas.DataFrame with data to plot.
color (str) – color of plot line, see Bokeh for color options.
label (str or None) – default None. Label for plot legend (for y).
x_axis_type (str or None) – default ‘date’. If “date” then the x-axis will be formatted as month-day-year.

Returns:

None

Example

To use the Plot.line_plot function we first need to create a bokeh.models.sources.ColumnDataSource from a pandas.DataFrame. Let’s say we want to plot the monthly time series of corrected latent energy, starting from a config.ini file,

>>> from fluxdataqaqc import Data, QaQc, Plot
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()

Now the QaQc instance should have the “LE_corr” (corrected latent energy) column. We can now make a bokeh.models.sources.ColumnDataSource from fluxdataqaqc.QaQc.df or fluxdataqaqc.QaQc.monthly_df,

>>> from bokeh.plotting import ColumnDataSource, figure, show
>>> source = ColumnDataSource(q.monthly_df)
>>> # create the figure before using line_plot
>>> fig = figure(x_axis_label='date', y_axis_label='Corrected LE')
>>> Plot.line_plot(
...     fig, 'date', 'LE_corr', source, color='red', line_width=3
... )
>>> show(fig)

Notice, line_width is not an argument to Plot.line_plot but it is an acceptable keyword argument to bokeh.plotting.figure.Figure and therefore will work as expected.

Note

This method is also available from the Data and QaQc objects.

static scatter_plot(fig, x, y, source, color, label='', lsrl=True, date_name='date', **kwargs)[source]

Add paired time series data to an interactive Bokeh scatter plot bokeh.plotting.figure.Figure.

Handles missing data points (gaps) by masking out indices in x and y where one or both are null. The lsrl option adds the best fit least squares linear regression line with y-intercept through zero and reports the slope of the line in the figure legend. Interactive hover shows the values of all paired (x,y) data and date that is added to the figure.

Returns:: minimum and maximum x and minimum and maximum y values of paired data which can be used for adding a one to one line to the figure or other uses.
Return type:: (tuple)
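
The gap masking and zero-intercept regression described above can be illustrated with a short NumPy sketch. The arrays are made-up values, the ordering of the extents tuple is an assumption, and the formula used is the standard least squares slope for a line through the origin, not necessarily the exact code inside Plot.scatter_plot.

```python
import numpy as np

# paired daily values with gaps (illustrative values only)
x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([0.9, 2.1, 3.0, np.nan, 5.2])

# keep only the pairs where both x and y are present
mask = ~(np.isnan(x) | np.isnan(y))
xm, ym = x[mask], y[mask]

# slope of the least squares line forced through the origin:
# minimizing sum((y - m*x)^2) over m gives m = sum(x*y) / sum(x^2)
slope = float(np.sum(xm * ym) / np.sum(xm ** 2))

# extents of the paired data, e.g. for drawing a one to one line
extents = (xm.min(), xm.max(), ym.min(), ym.max())

print(round(slope, 2), extents)
```

Only three of the five pairs survive the mask here, so the slope and the extents are computed from those three pairs alone.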

Example

Let’s say that we wanted to run the energy balance ratio closure correction including gap filling with reference ET * crop coefficient and then plot corrected ET versus the calculated ET from reference ET (named “ET_fill” in flux-data-qaqc) which is calculated on all days even those without gaps. Similar to Plot.line_plot we first need to create a bokeh.models.sources.ColumnDataSource from a pandas.DataFrame.

>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()

Now the QaQc instance should have the “ET_corr” (corrected ET) and “ET_fill” (ET calculated from reference ET and crop coefficient) columns. We can now make a bokeh.models.sources.ColumnDataSource from fluxdataqaqc.QaQc.df or fluxdataqaqc.QaQc.monthly_df,

>>> from bokeh.plotting import ColumnDataSource, figure, show
>>> df = q.df
>>> source = ColumnDataSource(df)
>>> fig = figure(
...     x_axis_label='ET, corrected', y_axis_label='ET, gap fill'
... )
>>> # note, we are calling this plot method from a QaQc instance
>>> q.scatter_plot(
...     fig, 'ET_corr', 'ET_fill', source, 'red', label='lslr'
... )
>>> show(fig)

The label keyword argument will be used in the legend and since the least squares linear regression line between x and y is being calculated the slope of the line will also be printed in the legend. In this case, if the slope of the regression line is 0.94 then the legend will read “lslr, slope=0.94”.

Note

Extra keyword arguments (accepted by bokeh.plotting.figure.Figure) will be passed to the scatter plot but not to the least squares regression line plot.

Note

This method is also available from the Data and QaQc objects.

utility classes and functions

Collection of utility objects and functions for the fluxdataqaqc module.

class fluxdataqaqc.util.Convert[source]

Bases: object

Tools for unit conversions for the flux-data-qaqc module.

allowable_units = {'G': ['w/m2', 'mj/m2'], 'H': ['w/m2', 'mj/m2'], 'LE': ['w/m2', 'mj/m2'], 'Rn': ['w/m2', 'mj/m2'], 'lw_in': ['w/m2', 'mj/m2'], 'lw_out': ['w/m2', 'mj/m2'], 'ppt': ['mm', 'in', 'm'], 'sw_in': ['w/m2'], 'sw_out': ['w/m2', 'mj/m2'], 't_avg': ['c', 'f', 'k'], 't_max': ['c', 'f', 'k'], 't_min': ['c', 'f', 'k'], 'vp': ['kpa', 'hpa', 'pa'], 'vpd': ['kpa', 'hpa', 'pa'], 'ws': ['m/s', 'mph']}

classmethod convert(var_name, initial_unit, desired_unit, df)[source]

Given a valid initial and desired variable dimension for a variable within a pandas.DataFrame, make the conversion and return the updated pandas.DataFrame.

For a list of variables that require certain units within flux-data-qaqc see Convert.allowable_units (names of allowable options of input variable dimensions) and Convert.required_units (for the mandatory dimensions of certain variables before running QaQc calculations).

Parameters:

var_name (str) – name of variable to convert in df.
initial_unit (str) – name of initial unit of variable, must be valid from Convert.allowable_units.
desired_unit (str) – name of units to convert to, also must be valid.
df (pandas.DataFrame) – pandas.DataFrame containing variable to be converted, i.e. with var_name in columns.

Returns:

updated dataframe with specified variable’s units converted

Return type:

df (pandas.DataFrame)

Note

Many potential dimensions may not be provided for automatic conversion, if so you may need to update your variable dimensions manually, e.g. within a Data.df before creating a QaQc instance. Unit conversions are required for variables that can potentially be used in calculations within Data or QaQc.
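
As a rough illustration of the manual route mentioned in the note, the sketch below converts two columns whose units are not listed in Convert.allowable_units using plain pandas arithmetic. The stand-alone DataFrame, its values, and the choice of km/h and cm as starting units are assumptions made for the example; in practice the same arithmetic would be applied to the relevant columns of your own time series.

```python
import pandas as pd

# stand-in for a daily time series with wind speed in km/h and
# precipitation in cm, neither of which is an allowable input unit
df = pd.DataFrame(
    {'ws': [3.6, 7.2, 18.0], 'ppt': [0.0, 0.25, 1.2]},
    index=pd.date_range('2020-01-01', periods=3, freq='D'),
)

# km/h -> m/s (the required unit for 'ws')
df['ws'] = df['ws'] / 3.6

# cm -> mm (the required unit for 'ppt')
df['ppt'] = df['ppt'] * 10.0

print(df)
```

After a manual conversion like this, the units declared for those variables should match the converted values so that no second conversion is attempted.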

pretty_unit_names = {'c': 'C', 'f': 'F', 'hpa': 'hPa', 'k': 'K', 'kpa': 'kPa', 'pa': 'Pa'}

required_units = {'G': 'w/m2', 'H': 'w/m2', 'LE': 'w/m2', 'Rn': 'w/m2', 'lw_in': 'w/m2', 'lw_out': 'w/m2', 'ppt': 'mm', 'sw_in': 'w/m2', 'sw_out': 'w/m2', 't_avg': 'c', 't_max': 'c', 't_min': 'c', 'vp': 'kpa', 'vpd': 'kpa', 'ws': 'm/s'}

fluxdataqaqc.util.monthly_resample(df, cols, agg_str, thresh=0.75)[source]

Resample dataframe to monthly frequency while excluding months missing more than a specified percentage of days of the month.

Parameters:

df (pandas.DataFrame) – datetime indexed DataFrame instance
cols (list) – list of columns in df to resample to monthly frequency
agg_str (str) – resample function as string, e.g. ‘mean’ or ‘sum’

Keyword Arguments:

thresh (float) – threshold (decimal fraction) of how many days in a month must exist for it to be temporally resampled, otherwise the monthly value for the month will be null.

Returns:

datetime indexed DataFrame that has been resampled to monthly time frequency.

Return type:

ret (pandas.DataFrame)

Note

If taking monthly totals (agg_str = ‘sum’) missing days will be filled with the month’s daily mean before summation.
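
The behaviour described for monthly_resample can be sketched with plain pandas. The function below is a hypothetical stand-in written for illustration (its name and internals are assumptions, not the fluxdataqaqc implementation); it relies on the fact that filling missing days with the month’s daily mean and then summing is equivalent to multiplying that mean by the number of days in the month.

```python
import numpy as np
import pandas as pd

def monthly_resample_sketch(df, cols, agg_str, thresh=0.75):
    """Hypothetical illustration of threshold-based monthly resampling."""
    monthly = df[cols].resample('MS')
    # number of non-null days per month and calendar days in each month
    counts = monthly.count()
    dim = pd.Series(
        np.asarray(counts.index.days_in_month), index=counts.index
    )
    if agg_str == 'sum':
        # mean-filled sum: present sum + missing days * mean = mean * days
        out = monthly.mean().mul(dim, axis=0)
    else:
        out = monthly.agg(agg_str)
    # months with too small a fraction of valid days become null
    frac = counts.div(dim, axis=0)
    return out.where(frac >= thresh)

# two months of daily data with a constant value of 1.0
idx = pd.date_range('2020-01-01', '2020-02-29', freq='D')
daily = pd.DataFrame({'ET': np.ones(len(idx))}, index=idx)
daily.iloc[0:3, 0] = np.nan     # January is missing 3 of 31 days
daily.iloc[31:41, 0] = np.nan   # February is missing 10 of 29 days

monthly = monthly_resample_sketch(daily, ['ET'], 'sum')
print(monthly)
```

January keeps a mean-filled total of 31.0 because about 90 percent of its days are present, while February is set to null because only about 66 percent of its days are present, which is below the 0.75 threshold.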

fluxdataqaqc.util.write_configs(meta_df, data_dict, out_dir=None)[source]

Write multiple config files based on collection of site metadata and a dictionary containing variable information.

Useful for creating config files for flux-data-qaqc for batches of flux stations that utilize the same naming conventions and formatting.

Parameters:

meta_df (pandas.DataFrame) – dataframe that contains the following columns (or more) that describe metadata for multiple climate stations: ‘site_id’, ‘climate_file_path’, ‘station_longitude’, ‘station_elevation’, ‘station_latitude’, and ‘missing_data_value’. Elevation should be in meters and latitude is in decimal degrees. Additional metadata columns will be added to the config file for each site, e.g. ‘QC_flag’, ‘anemometer_height’, and any others.
data_dict (dict) – dictionary that maps flux-data-qaqc config names to the user’s column names in the input file header, e.g. {‘net_radiation_col’: ‘netrad’, ‘net_radiation_units’ : ‘w/m2’}. Anything that can appear in the “DATA” section of a flux-data-qaqc config file can be present here, including QC flag names and multiple soil moisture names and weights.

Keyword Arguments:

out_dir (str or None) – default None. Directory to save config files, if None then save to current working directory.

Returns:

list of pathlib.Path objects of the full paths to each config file written.

Return type:

configs (list)

Raises:

Exception – if one of the mandatory metadata columns does not exist in meta_df.
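
Example

A minimal sketch of preparing the two inputs and checking the mandatory metadata columns before calling write_configs. The site IDs, file paths, coordinates, and the missing data value are hypothetical, and the column check shown here is an illustration written for this example rather than the check performed inside the function.

```python
import pandas as pd

# metadata columns listed as mandatory in the parameter description
mandatory = [
    'site_id', 'climate_file_path', 'station_latitude',
    'station_longitude', 'station_elevation', 'missing_data_value',
]

# hypothetical metadata for two stations that share one file format
meta_df = pd.DataFrame({
    'site_id': ['site_a', 'site_b'],
    'climate_file_path': ['data/site_a.csv', 'data/site_b.csv'],
    'station_latitude': [39.5, 40.1],
    'station_longitude': [-119.8, -117.3],
    'station_elevation': [1370.0, 1525.0],   # meters
    'missing_data_value': [-9999, -9999],
})

# mapping of config names to the column names used in the input files
data_dict = {
    'net_radiation_col': 'netrad',
    'net_radiation_units': 'w/m2',
}

# fail early with a clear message if a mandatory column is absent
missing = [c for c in mandatory if c not in meta_df.columns]
if missing:
    raise ValueError('meta_df is missing columns: {}'.format(missing))

# with flux-data-qaqc installed, the config files could then be written:
# from fluxdataqaqc.util import write_configs
# configs = write_configs(meta_df, data_dict, out_dir='configs')
```

Each row of meta_df corresponds to one config file, while data_dict is shared by every site in the batch.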