API Reference
This page documents objects and functions provided by flux-data-qaqc.
Data
- class fluxdataqaqc.data.Data(config)[source]
An object for interfacing flux-data-qaqc with input metadata (config) and time series input. It provides methods and attributes for parsing, temporal analysis, visualization, and filtering data.
A Data object is initialized from a config file (see Setting up a config file) with metadata for an eddy covariance tower or other dataset containing time series meteorological data. It serves as a starting point in the Python API of the energy balance closure analysis and data validation routines that are provided by flux-data-qaqc.
Manual pre-filtering of data based on user-defined quality is aided with the Data.apply_qc_flags method. Weighted or non-weighted averaging of variables with multiple sensors/recordings is performed upon initialization if these options are declared in the config file. The Data class also includes the Data.df property, which returns the time series data in the form of a pandas.DataFrame object for custom workflows. Data inherits line and scatter plot methods from Plot, which allows for the creation of interactive visualizations of input time series data.
- climate_file
Absolute path to climate input file.
- Type:
- config
Config parser instance created from the data within the config.ini file.
- header
Header as found in input climate file.
- Type:
- inv_map
Dictionary with input climate file names as keys and internal names as values. May only include pairs when they differ.
- Type:
- out_dir
Default directory to save output of QaQc.write or QaQc.plot methods.
- Type:
- plot_file
Path to plot file once it is created/saved by Data.plot.
- Type:
pathlib.Path or None
- soil_var_weight_pairs
Dictionary with names and weights for weighted averaging of soil heat flux or soil moisture variables.
- Type:
- qc_var_pairs
Dictionary with variable names as keys and QC value columns (numeric or characters) as values.
- Type:
- units
Dictionary with internal variable names as keys and units as found in config as values.
- Type:
- variables
Dictionary with internal names for variables as keys and names as found in the input data as values.
- Type:
- variable_names_dict
Dictionary with internal variable names as keys and keys in config.ini file as values.
- Type:
- apply_qc_flags(threshold=None, flag=None, threshold_inequality='lt')[source]
Apply user-provided QC values or flags for climate variables to filter poor-quality data by converting them to null values; updates Data.df.
Specifically, where the QC value is < threshold, the variable’s value for that date-time is changed to null. The other option is to use a column of flags, e.g. ‘x’ for data values to be filtered out. The threshold value or flag may be specified in the config file’s METADATA section; otherwise they should be assigned as keyword arguments here.
Specification of which QC (flag or numeric threshold) columns should be applied to which variables is set in the DATA section of the config file. For datasets whose QC value columns have names identical to the variable they correspond to with the suffix “_QC”, the QC column names for each variable do not need to be specified in the config file.
- Keyword Arguments:
threshold (float) – default None. Threshold for QC values; if the QC value is below the threshold, replace that variable’s value with null.
flag (str, list, or tuple) – default None. Character flag signifying data to filter out. Can be a list or tuple of multiple flags.
threshold_inequality (str) – default ‘lt’. ‘lt’ for filtering values that are less than the threshold value, ‘gt’ for filtering values that are greater.
- Returns:
Example
If the input time series file has a column with numeric quality values named “LE_QC” which signify the data quality for latent energy measurements, then in the config.ini file’s DATA section the following must be specified:
[DATA]
latent_heat_flux_qc = LE_QC
...
Now you must specify the threshold below which values in this column are filtered out when using Data.apply_qc_flags. For example, if you want to remove all data entries of latent energy where the “LE_QC” value is below 0.5, then the threshold value would be 0.5. The threshold can either be set in the config file or passed as an argument. If it is set in the config file, i.e.:
[METADATA]
qc_threshold = 0.5
Then you would simply call the method and this threshold would be applied to all QC columns specified in the config file,
>>> from fluxdataqaqc import Data
>>> d = Data('path/to/config.ini')
>>> d.apply_qc_flags()
Alternatively, if the threshold is not defined in the config file or if you would like to use a different value then pass it in,
>>> d.apply_qc_flags(threshold=2.5)
Lastly, this method can also filter based on a single character flag or a list of flags, e.g. “x” or “bad”, given that the column containing these is specified in the config file for whichever variable they are to be applied to. For example, if a flag column contains multiple flags signifying different data quality control info and two in particular signify poor quality data, say “b” and “a”, then apply them either in the config file:
[METADATA]
qc_flag = b,a
Or within Python:
>>> d.apply_qc_flags(flag=['b', 'a'])
For more explanation and examples see the “Configuration Options” section of the online documentation.
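The filtering rule described above can be sketched in plain Python. This is a minimal illustration of the documented behaviour, not the library's implementation: the helper name `apply_qc` and the sample values are hypothetical, and only the keyword names (`threshold`, `flag`, `threshold_inequality`) come from the method signature.

```python
import math


def apply_qc(values, qc, threshold=None, flag=None, threshold_inequality="lt"):
    """Return a copy of values with poor-quality entries replaced by NaN.

    Hypothetical helper: mirrors the documented rule, not the library code.
    """
    # Accept a single flag or a list/tuple of flags.
    flags = {flag} if isinstance(flag, str) else set(flag or ())
    out = []
    for value, q in zip(values, qc):
        bad = False
        # Numeric QC column: compare against the threshold.
        if threshold is not None and isinstance(q, (int, float)):
            bad = q < threshold if threshold_inequality == "lt" else q > threshold
        # Character QC column: filter when the flag matches.
        if q in flags:
            bad = True
        out.append(math.nan if bad else value)
    return out


le = [110.0, 95.5, 87.2, 120.3]           # made-up latent energy values
by_threshold = apply_qc(le, [0.9, 0.2, 0.7, 0.4], threshold=0.5)
by_flag = apply_qc(le, ["g", "b", "g", "a"], flag=["b", "a"])
```

In both calls the second and fourth entries are nulled, which corresponds to what `Data.apply_qc_flags` is described as doing to the matching columns of `Data.df`.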
- calc_pes(gpp_var='gpp', coeff_kj_per_mol=422.0, clip_negative_gpp=True, inplace=True)[source]
Calculate photosynthetic energy storage (pes) and photosynthetic energy storage flux term (pes_flux) from sub-daily GPP.
Simple approach based on Meyers & Hollinger (2004). By default, negative GPP values are assumed invalid and are clipped to zero.
- Keyword Arguments:
gpp_var (str) – default ‘gpp’. Internal variable name for gross primary productivity.
coeff_kj_per_mol (float) – default 422.0. Energy equivalent per mol CO2 fixed, in kJ mol-1 CO2.
clip_negative_gpp (bool) – default True. If True, negative GPP values are set to zero before calculating photosynthetic energy storage.
inplace (bool) – default True. If True, add pes and pes_flux to Data.df. If False, return them in a new DataFrame.
- Returns:
pandas.DataFrame or None. If inplace=False, returns a DataFrame with pes and pes_flux. Otherwise returns None.
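As a rough illustration of the unit arithmetic behind a coefficient given in kJ per mol CO2, the sketch below converts GPP to an energy flux. It assumes GPP is expressed in µmol CO2 m-2 s-1, which this page does not state, and the function name is hypothetical. It does not reproduce how the library distinguishes `pes` from `pes_flux`.

```python
def pes_flux_from_gpp(gpp_umol, coeff_kj_per_mol=422.0, clip_negative_gpp=True):
    """Convert GPP to an energy flux in W m-2.

    Assumption (not from the docs): GPP is in umol CO2 m-2 s-1.
    """
    out = []
    for g in gpp_umol:
        if clip_negative_gpp and g < 0:
            g = 0.0  # negative GPP treated as invalid and clipped to zero
        # umol -> mol (1e-6) and kJ -> J (1e3) gives J s-1 m-2, i.e. W m-2
        out.append(g * 1e-6 * coeff_kj_per_mol * 1e3)
    return out


flux = pes_flux_from_gpp([10.0, -2.0, 25.0])
```

Under that unit assumption, 10 µmol m-2 s-1 of GPP corresponds to roughly 4.2 W m-2, which gives a sense of how small this term is relative to the main energy balance components.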
- property df
Pull variables out of the config and climate time series files and load them into a datetime-indexed pandas.DataFrame.
Metadata about the input time series file format (“missing_data_value”, “skiprows”, and “date_parser”) are utilized when first loading the df into memory. Also, weighted and non-weighted averaging of multiple measurements of the same climatic variable occurs on the first call of Data.df, if these options are declared in the config file. For more details and example uses of these config options please see the “Configuration Options” section of the online documentation.
- Returns:
df (pandas.DataFrame)
Examples
You can utilize the df property as with any pandas.DataFrame object. However, if you would like to make changes to the data you must first make a copy, then make the changes, and then reassign it to Data.df, e.g. if you wanted to add 5 degrees to air temperature:
>>> from fluxdataqaqc import Data
>>> d = Data('path_to_config.ini')
>>> df = d.df.copy()
>>> df['air_temp_column'] = df['air_temp_column'] + 5
>>> d.df = df
The functionality shown above allows for user-controlled preprocessing and modification of any time series data in the initial dataset. It also allows for adding new columns, but if they are variables used by flux-data-qaqc, e.g. Rn or other energy balance variables, be sure to also update Data.variables and Data.units with the appropriate entries. New or modified values will be used in any further analysis/plotting routines within flux-data-qaqc.
By default the names of variables as found within input data are retained in QaQc.df; however, you can use the naming scheme of flux-data-qaqc, which can be viewed in Data.variable_names_dict, by using the Data.inv_map dictionary, which maps names from user-defined to internal names (as opposed to Data.variables, which maps from internal names to user-defined). For example, if your input data had the following names for LE, H, Rn, and G set in your config:
[DATA]
net_radiation_col = Net radiation, W/m2
ground_flux_col = Soil-heat flux, W/m2
latent_heat_flux_col = Latent-heat flux, W/m2
sensible_heat_flux_col = Sensible-heat flux, W/m2
Then the Data.df will utilize the same names, e.g.
>>> # d is a Data instance
>>> d.df.head()
produces:
| date | Net radiation, W/m2 | Latent-heat flux, W/m2 | Sensible-heat flux, W/m2 | Soil-heat flux, W/m2 |
| --- | --- | --- | --- | --- |
| 10/1/2009 0:00 | -54.02421778 | 0.70761 | 0.95511 | -40.42365926 |
| 10/1/2009 0:30 | -51.07744708 | 0.04837 | -1.24935 | -33.35383253 |
| 10/1/2009 1:00 | -50.99438925 | 0.68862 | 1.91101 | -43.17900525 |
| 10/1/2009 1:30 | -51.35032377 | -1.85829 | -15.4944 | -40.86201497 |
| 10/1/2009 2:00 | -51.06604228 | -1.80485 | -19.1357 | -39.80936855 |
Here is how you could rename your dataframe using flux-data-qaqc internal names,
>>> d.df.rename(columns=d.inv_map).head()
| date | Rn | LE | H | G |
| --- | --- | --- | --- | --- |
| 10/1/2009 0:00 | -54.02421778 | 0.70761 | 0.95511 | -40.42365926 |
| 10/1/2009 0:30 | -51.07744708 | 0.04837 | -1.24935 | -33.35383253 |
| 10/1/2009 1:00 | -50.99438925 | 0.68862 | 1.91101 | -43.17900525 |
| 10/1/2009 1:30 | -51.35032377 | -1.85829 | -15.4944 | -40.86201497 |
| 10/1/2009 2:00 | -51.06604228 | -1.80485 | -19.1357 | -39.80936855 |
A minor note on variable naming: if your input data variables use exactly the same names used by flux-data-qaqc, they will be renamed by adding the prefix “input_”, e.g. “G” becomes “input_G”, the first time the data is read from disk, i.e. the first time Data.df is accessed.
Note
The temporal frequency of the input data is retained, unlike QaQc.df, which automatically resamples time series data to daily frequency.
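The relationship between the two name mappings described for this property can be sketched with plain dictionaries. This is an illustration only: the dictionaries below are hand-written stand-ins for `Data.variables` and `Data.inv_map`, populated with the column names from the example config above.

```python
# Stand-in for Data.variables: internal name -> name found in the input data.
variables = {
    "Rn": "Net radiation, W/m2",
    "G": "Soil-heat flux, W/m2",
    "LE": "Latent-heat flux, W/m2",
    "H": "Sensible-heat flux, W/m2",
}

# Stand-in for Data.inv_map: input name -> internal name.
# Pairs are kept only when the two names differ, as described for inv_map.
inv_map = {user: internal for internal, user in variables.items() if user != internal}

# Renaming one record's keys, analogous to df.rename(columns=inv_map).
row = {"Net radiation, W/m2": -54.02421778, "Latent-heat flux, W/m2": 0.70761}
renamed = {inv_map.get(name, name): value for name, value in row.items()}
```

Names that are absent from the inverse map are left unchanged, which matches how `pandas.DataFrame.rename` treats columns missing from the mapping.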
- hourly_ASCE_refET(reference='short', anemometer_height=None)[source]
Calculate hourly ASCE standardized short (ETo) or tall (ETr) reference ET from input data and wind measurement height.
If the input data’s temporal frequency is finer than hourly, the input data will be resampled to hourly and the output reference ET time series will be returned as a datetime-indexed pandas.Series object. If the input data is already hourly, the resulting time series will automatically be merged into the Data.df dataframe and named “ASCE_ETo” or “ASCE_ETr” respectively.
- Keyword Arguments:
reference (str) – default “short”. Calculate tall or short ASCE reference ET.
anemometer_height (float or None) – default None. Wind measurement height in meters. If None, look for the “anemometer_height” entry in the METADATA section of the config.ini; if it is not there, print a warning and use 2 meters.
- Returns:
Hint
The input variables needed to run this method are: vapor pressure, wind speed, incoming shortwave radiation, and average air temperature. If vapor pressure deficit and average air temperature exist, the actual vapor pressure will automatically be calculated.
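The hint above says actual vapor pressure can be derived from vapor pressure deficit and average air temperature. A minimal sketch of that relation follows, assuming the widely used FAO-56/ASCE saturation vapor pressure equation; this page does not state which formula the library uses, and the function names are hypothetical.

```python
import math


def saturation_vapor_pressure(t_c):
    """Saturation vapor pressure in kPa at air temperature t_c (deg C).

    Standard FAO-56/ASCE form; assumed here, not confirmed for the library.
    """
    return 0.6108 * math.exp(17.27 * t_c / (t_c + 237.3))


def actual_vapor_pressure(t_avg_c, vpd_kpa):
    """Actual vapor pressure (kPa) from average air temp and VPD (kPa)."""
    # VPD is defined as saturation minus actual vapor pressure.
    return saturation_vapor_pressure(t_avg_c) - vpd_kpa


ea = actual_vapor_pressure(20.0, 1.0)
```

Using a single average temperature is a simplification; reference ET procedures often average saturation vapor pressure over several temperatures instead.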
- plot(ncols=1, output_type='save', out_file=None, suptitle='', plot_width=1000, plot_height=450, sizing_mode='fixed', merge_tools=False, link_x=True, **kwargs)[source]
Creates a series of interactive diagnostic line and scatter plots of input data in whichever temporal frequency it is in.
The main interactive features of the plots include: pan, selection and scroll zoom, a hover tool that shows paired variable values including date, and linked x-axes that pan/zoom together for daily and monthly time series plots.
It is possible to change the format of the output plots including adjusting the dimensions of subplots, defining the number of columns of subplots, setting a super title that accepts HTML, and other options. If variables are not present for plots they will not be created and a warning message will be printed. There are two options for output: open a temporary file for viewing or saving a copy to QaQc.out_dir.
A list of all potential time series plots created:
energy balance components
radiation components
multiple soil heat flux measurements
air temperature
vapor pressure and vapor pressure deficit
wind speed
precipitation
latent energy
multiple soil moisture measurements
- Keyword Arguments:
ncols (int) – default 1. Number of columns of subplots.
output_type (str) – default “save”. How to output plots: “save”, “show” in browser, “notebook” for Jupyter Notebooks, “return_figs” to return a list of Bokeh bokeh.plotting.figure.Figure objects, or “return_grid” to return the bokeh.layouts.gridplot.
out_file (str or None) – default None. Path to save output file; if None, save output to Data.out_dir with the name [site_id]_input_plots.html where [site_id] is Data.site_id.
suptitle (str or None) – default ''. Super title to go above plots, accepts HTML/CSS syntax.
plot_width (int) – default 1000. Width of subplots in pixels.
plot_height (int) – default 450. Height of subplots in pixels, note for subplots the height will be forced as the same as plot_width.
sizing_mode (str) – default “fixed”. Bokeh option to scale dimensions of bokeh.layouts.gridplot.
merge_tools (bool) – default False. Merges all subplots’ toolbars into a single location if True.
link_x (bool) – default True. If True, link x axes of daily time series plots and monthly time series plots so that when zooming or panning on one plot they all zoom accordingly; the axes will also be of the same length.
Example
Starting from a correctly formatted config.ini and climate time series file, this example shows how to read in the data and produce a series of plots of input data as it is found in the input data file (unlike QaQc.plot, which produces plots at daily and monthly temporal frequency). This example also shows how to display a title at the top of the plot with the site’s location and site ID.
>>> from fluxdataqaqc import Data
>>> d = Data('path/to/config.ini')
>>> # create plot title from site ID and location in N. America
>>> title = "<b>Site:</b> {}; <b>Lat:</b> {}N; <b>Long:</b> {}W".format(
...     d.site_id, d.latitude, d.longitude
... )
>>> d.plot(
...     ncols=2, output_type='show', plot_width=500, suptitle=title
... )
Note, we specified the width of plots to be smaller than default because we want both columns of subplots to be viewable on the screen.
Tip
To reset all subplots at once, refresh the page with your web browser.
Note
Additional keyword arguments that are recognized by bokeh.layouts.gridplot are also accepted by Data.plot.
See also
- write(out_dir=None, use_input_names=False)[source]
Save the time series of initially read-in data as a CSV file after performing default name formatting and unit conversions. The file name will be in the format “[site_ID]_input_data.csv”.
The default location for saving output time series files is within an “output” subdirectory of the parent directory containing the config.ini file.
- Keyword Arguments:
out_dir (str or None) – default None. Directory to save CSVs; if None, save to the out_dir instance variable (typically the “output” directory where the config.ini file exists).
use_input_names (bool) – default False. If False, use flux-data-qaqc variable names in the output file header; if True, use the user’s input variable names where possible (for variables that were read in and not modified or calculated by flux-data-qaqc).
- Returns:
Example
Starting from a config.ini file,
>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> d.write()
QaQc
- class fluxdataqaqc.qaqc.QaQc(data=None, drop_gaps=True, daily_frac=1.0, max_interp_hours=2, max_interp_hours_night=4)[source]
Numerical routines for correcting daily energy balance closure for eddy covariance data and other data analysis tools.
Two routines are provided for improving energy balance closure by adjusting the turbulent fluxes (latent energy and sensible heat): the Energy Balance Ratio method (modified from FLUXNET) and the Bowen Ratio method.
The QaQc object also has multiple tools for temporal frequency aggregation and resampling, estimation of climatic and statistical variables (e.g. ET and potential shortwave radiation), downloading gridMET reference ET, managing data and metadata, interactive validation plots, and managing a structure for input and output data files. Input data is expected to be a Data instance or a pandas.DataFrame.
- Keyword Arguments:
drop_gaps (bool) – default True. If True, automatically filter variables on days where the fraction of sub-daily measurements present is less than daily_frac.
daily_frac (float) – default 1.00. Fraction of sub-daily data required; otherwise the daily value will be filtered out if drop_gaps is True. E.g. if daily_frac = 0.5 and the input data is hourly, then data on days with less than 12 hours of data will be forced to null within QaQc.df. This is important because systematic diurnal gaps will affect the automatic resampling that occurs when creating a QaQc instance, and the daily data is used in closure corrections, other calculations, and plots. If sub-daily linear interpolation is applied to energy balance variables, the gaps are counted after the interpolation.
max_interp_hours (None or float) – default 2. Length of largest gap to fill with linear interpolation in energy balance variables if the input data’s temporal frequency is less than daily. This value will be used to fill gaps when \(Rn > 0\) or \(Rn\) is missing during each day.
max_interp_hours_night (None or float) – default 4. Length of largest gap to fill with linear interpolation in energy balance variables if the input data’s temporal frequency is less than daily when \(Rn < 0\) within 12:00PM-12:00PM daily intervals.
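The `daily_frac` rule described above can be sketched as follows. This is a simplified stand-in for the behaviour the docs describe, not the library's code: the function `daily_means` and the sample data are hypothetical, and interpolation of short gaps is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_means(samples, samples_per_day, daily_frac=1.0):
    """Average (timestamp, value) pairs per day, nulling sparse days."""
    by_day = defaultdict(list)
    for stamp, value in samples:
        by_day[stamp.date()].append(value)
    result = {}
    for day, values in by_day.items():
        valid = [v for v in values if v is not None]
        # Keep the day only if enough of its sub-daily samples are present.
        enough = len(valid) / samples_per_day >= daily_frac
        result[day] = sum(valid) / len(valid) if valid and enough else None
    return result


start = datetime(2009, 10, 1)
# Day 1 is complete; day 2 has only 10 of 24 hourly values.
hourly = [(start + timedelta(hours=h), 10.0) for h in range(24)]
hourly += [
    (start + timedelta(days=1, hours=h), 10.0 if h < 10 else None)
    for h in range(24)
]
means = daily_means(hourly, samples_per_day=24, daily_frac=0.5)
```

With `daily_frac=0.5` the second day falls below the 12 hours required and is set to null, matching the hourly example given for `daily_frac`.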
- agg_dict
Dictionary with internal variable names as keys and method of temporal resampling (e.g. “mean” or “sum”) as values.
- Type:
- config
Config parser instance created from the data within the config.ini file.
- config_file
Absolute path to config.ini file used for initialization of the fluxdataqaqc.Data instance used to create the QaQc instance.
- Type:
- corrected
False until an energy balance closure correction has been run by calling QaQc.correct_data.
- Type:
- corr_methods
List of Energy Balance Closure correction routines usable by QaQc.correct_data.
- Type:
- gridMET_exists
True if the path to a matching gridMET time series file exists on disk, the file has time series for reference ET and precipitation, and the dates for these fully overlap with the energy balance variables, i.e. the date index of QaQc.df.
- Type:
- gridMET_meta
Dictionary with information for gridMET variables that may be downloaded using QaQc.download_gridMET.
- Type:
- inv_map
Dictionary with input climate file names as keys and internal names as values. May only include pairs when they differ.
- Type:
- out_dir
Default directory to save output of QaQc.write or QaQc.plot methods.
- Type:
- n_samples_per_day
If the initial time series temporal frequency is less than daily, this value will be updated to the number of samples detected per day. Useful for post-processing based on the count of sub-daily gaps in energy balance variables, e.g. “LE_subday_gaps”.
- Type:
- plot_file
Path to plot file once it is created/saved by QaQc.plot.
- Type:
pathlib.Path or None
- temporal_freq
Temporal frequency of initial (as found in input climate file) data as determined by pandas.infer_freq.
- Type:
- units
Dictionary with internal variable names as keys and units as found in config as values.
- Type:
- variables
Dictionary with internal variable names as keys and names as found in the input data as values.
- Type:
Note
Upon initialization of a QaQc instance the temporal frequency of the input data is checked using pandas.infer_freq, which does not always correctly parse datetime indices. If it is not able to correctly determine the temporal frequency, the time series will be resampled to daily frequency, but if it is in fact already at daily frequency the data will be unchanged. In this case the QaQc.temporal_freq will be set to “na”.
- correct_data(meth='ebr', et_gap_fill=True, y='Rn', refET='ETr', x=['G', 'LE', 'H'], fit_intercept=False)[source]
Correct turbulent fluxes to improve energy balance closure using an Energy Balance Ratio method modified from FLUXNET.
Currently three correction options are available: ‘ebr’ (Energy Balance Ratio), ‘br’ (Bowen Ratio), and ‘lin_regress’ (least squares linear regression). If you use one method followed by another, the corrected versions of LE, H, ET, ebr, etc. will be overwritten with the most recently used approach.
This method also computes potential clear sky radiation (saved as “rso”) using an ASCE approach based on station elevation and latitude. ET is calculated from raw and corrected LE using daily air temperature to correct the latent heat of vaporization; if air temperature is not available in the input data, it is assumed to be 20 degrees Celsius.
Corrected or otherwise newly calculated variables are named using the following suffixes to distinguish them:
uncorrected LE, H, etc. from input data have no suffix
_corr uses adjusted LE, H, etc. from the correction method used
_user_corr uses corrected LE, H, etc. found in data file (if provided)
- Parameters:
y (str) – name of dependent variable for regression, must be in QaQc.variables keys, or a user-added variable. Only used if meth='lin_regress'.
x (str or list) – name or list of independent variables for regression, names must be in QaQc.variables keys, or a user-added variable. Only used if meth='lin_regress'.
- Keyword Arguments:
meth (str) – default ‘ebr’. Method to correct energy balance.
et_gap_fill (bool) – default True. If True, fill any remaining gaps in corrected ET with ETr * ETrF, where ETr is alfalfa reference ET from gridMET and ETrF is the filtered, smoothed (7 day moving avg. min 2 days) and linearly interpolated crop coefficient. The number of days in each month for which corrected ET is filled is provided in QaQc.monthly_df as the column “ET_gap”.
refET (str) – default “ETr”. Which gridMET reference product to use for ET gap filling, “ETr” or “ETo” are valid options.
fit_intercept (bool) – default False. Fit intercept for regression or set to zero if False. Only used if meth='lin_regress'.
apply_coefs (bool) – default False. If True then apply fitted coefficients to their respective variables for linear regression correction method, rename the variables with the suffix “_corr”.
- Returns:
Example
Starting from a correctly formatted config.ini and climate time series file, this example shows how to read in the data and apply the energy balance ratio correction without gap-filling with gridMET ETr x ETrF.
>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.corrected
False
Now apply the energy balance closure correction
>>> q.correct_data(meth='ebr', et_gap_fill=False)
>>> q.corrected
True
Note
If et_gap_fill is set to True (default), the gap-filled days of corrected ET will be used to recalculate LE_corr for those days, i.e. LE_corr will also be gap-filled.
Note
The ebr_corr variable or energy balance closure ratio is calculated from the corrected versions of LE and H independent of the method. When using the ‘ebr’ method the energy balance correction factor (what is applied to the raw H and LE) is left as calculated (inverse of ebr) and saved as ebc_cf.
See also
For an explanation of the linear regression method see the QaQc.lin_regress method. Calling that method with the keyword argument apply_coefs=True, \(Rn\) as the y variable, and the other energy balance components as the x variables will give the same result as the default inputs to this function when meth='lin_regress'.
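To make the quantities named in this method concrete, here is a deliberately simplified single-day sketch of an energy balance ratio, its inverse as a correction factor, and a conversion of LE to ET. It assumes the balance \(Rn - G = LE + H\). The routine documented here is described as modified from FLUXNET and involves steps that are not reproduced below. The latent heat formula and the function names are assumptions for illustration.

```python
def energy_balance_ratio(rn, g, le, h):
    """Ratio of turbulent fluxes to available energy (all in W m-2)."""
    return (h + le) / (rn - g)


def correct_fluxes(rn, g, le, h):
    """Scale LE and H by the inverse of the energy balance ratio."""
    ebc_cf = 1.0 / energy_balance_ratio(rn, g, le, h)
    return le * ebc_cf, h * ebc_cf, ebc_cf


def le_to_et_mm_per_day(le_w_m2, t_air_c=20.0):
    """Convert daily mean LE (W m-2) to ET (mm/day).

    Assumed latent heat of vaporization: 2.501 - 0.002361 * T in MJ/kg.
    """
    lam = 2.501 - 0.002361 * t_air_c
    # 1 W m-2 over a day is 0.0864 MJ m-2; 1 kg of water per m2 is 1 mm.
    return le_w_m2 * 0.0864 / lam


le_corr, h_corr, ebc_cf = correct_fluxes(rn=150.0, g=10.0, le=80.0, h=32.0)
et_corr = le_to_et_mm_per_day(le_corr)
```

In this made-up case the ratio is 0.8, so both turbulent fluxes are scaled by 1.25 and the corrected fluxes sum to the available energy of 140 W m-2.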
- daily_ASCE_refET(reference='short', anemometer_height=None)[source]
Calculate daily ASCE standardized short (ETo) or tall (ETr) reference ET from input data and wind measurement height.
The resulting time series will automatically be merged into the Data.df dataframe and named “ASCE_ETo” or “ASCE_ETr” respectively.
- Keyword Arguments:
reference (str) – default “short”. Calculate tall or short ASCE reference ET.
anemometer_height (float or None) – default None. Wind measurement height in meters. If None, look for the “anemometer_height” entry in the METADATA section of the config.ini; if it is not there, print a warning and use 2 meters.
- Returns:
Note
If the hourly ASCE variables were previously calculated from a Data instance, they will be overwritten, as they are saved with the same names.
- property df
See fluxdataqaqc.Data.df, as the only difference is that the QaQc.df is first resampled to daily frequency.
- download_gridMET(variables=None)[source]
Download reference ET (alfalfa and grass) and precipitation from gridMET for all days in flux station time series by default.
It can also download other specific gridMET variables by passing a list of gridMET variable names. Possible variables and their long form can be found in QaQc.gridMET_meta.
Upon download, the gridMET time series for the nearest gridMET cell will be merged into the instance’s dataframe attribute QaQc.df, and all gridMET variable names will have the prefix gridMET_ for identification.
The gridMET time series file will be saved to a subdirectory called “gridMET_data” within the directory that contains the config file for the current QaQc instance and named with the site ID and gridMET cell centroid lat and long coordinates in decimal degrees.
- Parameters:
variables (None, str, list, or tuple) – default None. List of gridMET variable names to download; if None, download ETr and precipitation. See the keys of the QaQc.gridMET_meta dictionary for a list of all variables that can be downloaded by this method.
- Returns:
Note
Any previously downloaded gridMET time series will be overwritten when calling the method. However, if using the gap filling method of the “ebr” correction routine, the download will not overwrite currently existing data so long as gridMET reference ET and precipitation are on disk and their path is properly set in the config file.
- classmethod from_dataframe(df, site_id, elev_m, lat_dec_deg, var_dict, drop_gaps=True, daily_frac=1.0, max_interp_hours=2, max_interp_hours_night=4)[source]
Create a QaQc object from a pandas.DataFrame object.
- Parameters:
df (pandas.DataFrame) – DataFrame of climate variables with datetime index named ‘date’
site_id (str) – site identifier such as station name
lat_dec_deg (float) – latitude of site in decimal degrees
var_dict (dict) – dictionary that maps flux-data-qaqc variable names to user’s columns in df, e.g. {‘Rn’: ‘netrad’, …}; see fluxdataqaqc.Data.variable_names_dict for a list of flux-data-qaqc variable names
- Returns:
None
Note
When using this method, any output files (CSVs, plots) will be saved to a directory named “output” in the current working directory.
- lin_regress(y, x, fit_intercept=False, apply_coefs=False)[source]
Least squares linear regression on single or multiple independent variables.
For example, if the dependent variable (y) is \(Rn\) and the independent variables (x) are \(LE\) and \(H\), then the linear regression will solve for the best fit coefficients of \(Rn = c_0 + c_1 LE + c_2 H\). Any number of variables in QaQc.variables can be used for x and one for y.
If the variables chosen for regression are part of the energy balance components, i.e. \(Rn, G, LE, H\), and apply_coefs=True, then the best fit coefficients will be applied to their respective variables with consideration of the energy balance equation, i.e. the signs of the coefficients will be corrected according to \(Rn - G = LE + H\). For example, if y=H and x=['Rn','G','LE'], i.e. solving \(H = c_0 + c_1 Rn + c_2 G + c_3 LE\), then the coefficients \(c_2\) and \(c_3\) will be multiplied by -1 before applying them to correct \(G\) and \(LE\) according to the energy balance equation.
This method returns a pandas.DataFrame object containing results of the linear regression including the coefficient values, the number of data pairs used in the regression, the root-mean-square-error, and the coefficient of determination. This table can also be retrieved from the QaQc.lin_regress_results instance attribute.
- Arguments:
- y (str): name of dependent variable for regression, must be in QaQc.variables keys, or a user-added variable.
- x (str or list): name or list of independent variables for regression, names must be in QaQc.variables keys, or a user-added variable.
- Keyword Arguments:
- fit_intercept (bool): default False. Fit intercept for regression or set to zero if False.
- apply_coefs (bool): default False. If True then apply fitted coefficients to their respective variables, rename the variables with the suffix “_corr”.
- Returns:
- Example:
Let’s say we wanted to compute the linear relationship between net radiation and the other energy balance components, which may be useful if, for example, we have strong confidence in net radiation measurements. The resulting coefficients of regression would give us an idea of whether the other components were “under-measured” or “over-measured”. Then, starting with a Data instance:
>>> Q = QaQc(Data_instance)
>>> Q.lin_regress(y='Rn', x=['G','H','LE'], fit_intercept=True)
>>> Q.lin_regress_results
This would produce something like the following,
| SITE_ID | Y (dependent var.) | c0 (intercept) | c1 (coef on G) | c2 (coef on LE) | c3 (coef on H) | RMSE (W/m2) | r2 (coef. det.) | n (sample count) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| a_site | Rn | 6.99350781229883 | 1.552 | 1.054 | 0.943 | 18.25 | 0.91 | 3386 |
In this case the intercept is telling us that there may be a systematic or constant error in the independent variables and that \(G\) is “under-measured” at the sensor by over 50 percent, etc., if we assume daily \(Rn\) is accurate as measured.
- Tip:
You may also use multiple linear regression to correct energy balance components using the QaQc.correct_data method by passing the meth='lin_regress' keyword argument.
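For readers who want to see what a least squares fit on several independent variables involves, the sketch below solves the normal equations in plain Python. It is a generic illustration with made-up data, not the routine used by `QaQc.lin_regress`, and the function name `least_squares` is hypothetical.

```python
def least_squares(y, xs, fit_intercept=False):
    """Return coefficients of y ~ c0 + c1*x1 + ... via normal equations.

    With fit_intercept=False the intercept is fixed at zero and omitted.
    """
    n = len(y)
    cols = ([[1.0] * n] if fit_intercept else []) + [list(x) for x in xs]
    k = len(cols)
    # Augmented matrix [X^T X | X^T y].
    a = [
        [sum(ci * cj for ci, cj in zip(cols[i], cols[j])) for j in range(k)]
        + [sum(ci * yi for ci, yi in zip(cols[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Made-up fluxes where Rn is an exact linear combination of LE and H.
le = [50.0, 80.0, 120.0, 60.0, 90.0]
h = [30.0, 20.0, 45.0, 70.0, 10.0]
rn = [1.1 * l + 0.9 * s for l, s in zip(le, h)]
coefs = least_squares(rn, [le, h])
```

Here the fit recovers coefficients close to 1.1 and 0.9. With real data, coefficients above one on a component suggest it was "under-measured" relative to the dependent variable, as discussed in the example above.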
- property monthly_df
Temporally resample time series data to monthly frequency using monthly means or sums as set in QaQc.agg_dict; provides data as a pandas.DataFrame.
Note that monthly means or sums are forced to null values if more than 20 percent of a month’s days are missing in the daily data (QaQc.df). Also, for variables that are summed (e.g. ET or precipitation), missing days (if less than 20 percent of the month) will be filled with the month’s daily mean value before summation.
If a QaQc instance has not yet run an energy balance correction, i.e. QaQc.corrected=False, before accessing monthly_df, then the default routine of data correction (energy balance ratio method) will be conducted.
Utilize the QaQc.monthly_df property the same way as fluxdataqaqc.Data.df; see its API documentation for examples.
Tip
If you have additional variables in QaQc.df or would like to change the aggregation method for the monthly time series, adjust the instance attribute QaQc.agg_dict before accessing QaQc.monthly_df.
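The monthly summation rule can be sketched as below. The sketch assumes that a month is nulled when more than 20 percent of its days are missing and that smaller gaps are filled with the month's daily mean before summing. The function name, the handling of exactly 20 percent, and the sample data are assumptions for illustration.

```python
def monthly_sum(daily_values, max_missing_frac=0.2):
    """Sum one month of daily values, with None marking missing days."""
    valid = [v for v in daily_values if v is not None]
    n_missing = len(daily_values) - len(valid)
    # Too many missing days: the monthly value is forced to null.
    if not valid or n_missing / len(daily_values) > max_missing_frac:
        return None
    # Fill the remaining gaps with the month's daily mean before summing.
    mean = sum(valid) / len(valid)
    return sum(valid) + n_missing * mean


mostly_complete = [3.0] * 28 + [None] * 3    # about 10 percent missing
sparse = [3.0] * 20 + [None] * 11            # about 35 percent missing
total = monthly_sum(mostly_complete)
nulled = monthly_sum(sparse)
```

For the mostly complete month, the three missing days are each filled with the mean of 3.0, so the total reflects all 31 days rather than only the 28 observed ones.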
- plot(ncols=1, output_type='save', out_file=None, suptitle='', plot_width=1000, plot_height=450, sizing_mode='fixed', merge_tools=False, link_x=True, **kwargs)[source]
Creates a series of interactive diagnostic line and scatter plots of input and computed daily and monthly aggregated data.
The main interactive features of the plots include: pan, selection and scroll zoom, a hover tool that shows paired variable values including date, and linked x-axes that pan/zoom together for daily and monthly time series plots.
It is possible to change the format of the output plots including adjusting the dimensions of subplots, defining the number of columns of subplots, setting a super title that accepts HTML, and other options. If variables are not present for plots they will not be created and a warning message will be printed. There are two options for output: open a temporary file for viewing or saving a copy to QaQc.out_dir.
A list of all potential time series plots created:
energy balance components
radiation components
incoming shortwave radiation with ASCE potential clear sky (daily only)
multiple soil heat flux measurements
air temperature
vapor pressure and vapor pressure deficit
wind speed
station precipitation and gridMET precipitation
initial and corrected latent energy
initial, corrected, gap filled, and reference evapotranspiration
crop coefficient and smoothed and interpolated crop coefficient
initial and corrected energy balance ratio
multiple soil moisture measurements
A list of all potential scatter plots created:
radiative versus turbulent fluxes, initial and corrected
initial versus corrected latent energy
initial versus corrected evapotranspiration
- Keyword Arguments:
ncols (int) – default 1. Number of columns of subplots.
output_type (str) – default “save”. How to output plots: “save”, “show” in browser, “notebook” for Jupyter Notebooks, “return_figs” to return a list of bokeh.plotting.figure.Figure objects, or “return_grid” to return the bokeh.layouts.gridplot.
out_file (str or None) – default None. Path to save output file, if None save output to QaQc.out_dir with the name [site_id]_plots.html where [site_id] is QaQc.site_id.
suptitle (str or None) – default '' (empty string). Super title to go above plots, accepts HTML/CSS syntax.
plot_width (int) – default 1000. Width of subplots in pixels.
plot_height (int) – default 450. Height of subplots in pixels, note for subplots the height will be forced as the same as plot_width.
sizing_mode (str) – default “fixed”. Bokeh option to scale dimensions of bokeh.layouts.gridplot.
merge_tools (bool) – default False. Merges all subplots toolbars into a single location if True.
link_x (bool) – default True. If True link x axes of daily time series plots and monthly time series plots so that when zooming or panning on one plot they all zoom accordingly, the axes will also be of the same length.
Example
Starting from a correctly formatted config.ini and climate time series file, this example shows how to read in the data and produce the default series of plots for viewing, with the addition of text at the top of the plot that states the site’s location and ID.
>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()
>>> # create plot title from site ID and location in N. America
>>> title = "<b>Site:</b> {}; <b>Lat:</b> {}N; <b>Long:</b> {}W".format(
...     q.site_id, q.latitude, q.longitude
... )
>>> q.plot(
...     ncols=2, output_type='show', plot_width=500, suptitle=title
... )
Note, we specified the width of plots to be smaller than default because we want both columns of subplots to be viewable on one page.
Tip
To reset all subplots at once, refresh the page with your web browser.
Note
Additional keyword arguments that are recognized by bokeh.layouts.gridplot are also accepted by QaQc.plot.
- write(out_dir=None, use_input_names=False)[source]
Save daily and monthly time series of initial and “corrected” data in CSV format.
Note, if the energy balance closure correction (QaQc.correct_data) has not been run, this method will run it with default options before saving time series files to disk.
The default location for saving output time series files is within an “output” subdirectory of the parent directory containing the config.ini file that was used to create the fluxdataqaqc.Data and QaQc objects. The names of the files will start with the site_id and have either the “daily_data” or “monthly_data” suffix.
- Keyword Arguments:
out_dir (str or None) – default None. Directory to save CSVs, if None save to out_dir instance variable (typically “output” directory where config.ini file exists).
use_input_names (bool) – default False. If False use flux-data-qaqc variable names in the output file header, or if True use the user’s input variable names where possible (for variables that were read in and not modified or calculated by flux-data-qaqc).
- Returns:
Example
Starting from a config.ini file,
>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> # note no energy balance closure correction has been run
>>> q.corrected
False
>>> q.write()
>>> q.corrected
True
Note
To save data created by multiple correction routines, be sure to run the correction and then save to different output directories otherwise output files will be overwritten with the most recently used correction option.
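The effect of use_input_names can be pictured with a small plain-Python sketch. It is not the library's implementation: the mapping and header below are made-up values, with the mapping shaped like Data.inv_map (input names as keys, internal names as values). Columns that were computed by flux-data-qaqc have no input name and keep their internal name.

```python
# hypothetical mapping of user input names -> internal names (shaped like Data.inv_map)
inv_map = {"netrad": "Rn", "latent": "LE"}

# hypothetical header of a daily output file using internal names
internal_header = ["Rn", "LE", "LE_corr"]

# invert the mapping so internal names point back to the user's names
to_input = {internal: user for user, internal in inv_map.items()}

# swap names where a user name exists, otherwise keep the internal name
header = [to_input.get(name, name) for name in internal_header]
print(header)
```

Here “LE_corr” is left unchanged because it was calculated rather than read in.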
Plot
- class fluxdataqaqc.plot.Plot[source]
Bases: object
Container of plot routines of fluxdataqaqc including static methods that can be used to create and update interactive line and scatter plots from an arbitrary pandas.DataFrame instance.
Note
The Data and QaQc objects both inherit all methods of Plot, allowing them to be used for custom interactive time series plots of input data (in Data.df) and of daily and monthly data in QaQc.df and QaQc.monthly_df.
- static add_lines(fig, df, plt_vars, colors, x_name, source, labels=None, **kwargs)[source]
Add multiple time series to a bokeh.plotting.figure.Figure object using data from a datetime indexed pandas.DataFrame with an interactive hover tool.
Interactive hover shows the values of all time series data and date that is added to the figure.
- Parameters:
fig (bokeh.plotting.figure.Figure) – a figure instance to add the lines to.
df (pandas.DataFrame) – pandas.DataFrame containing time series data.
plt_vars (list) – list of data columns in df to plot.
colors (list) – list of line colors for variables in plt_vars.
x_name (str) – name of the x-axis variable, e.g. the datetime index, in the pandas.DataFrame (df) containing data to plot.
source (bokeh.models.sources.ColumnDataSource) – column data source created from the pandas.DataFrame with data to plot, i.e. df.
labels (list or None) – default None. Labels for each plot variable in plt_vars.
- Returns:
if none of the variables in plt_vars are found in df then return None, otherwise returns the updated figure.
- Return type:
ret (None or bokeh.plotting.figure.Figure)
Example
Similar to Plot.line_plot we first need to create a bokeh.models.sources.ColumnDataSource from a pandas.DataFrame. This example shows how to plot two variables, daily corrected latent energy and sensible heat, on the same plot.
>>> from fluxdataqaqc import Data, QaQc, Plot
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()
Now the QaQc instance should have the “LE_corr” (corrected latent energy) and “H_corr” (corrected sensible heat) columns, we can now make a bokeh.models.sources.ColumnDataSource from fluxdataqaqc.QaQc.df or fluxdataqaqc.QaQc.monthly_df,
>>> from bokeh.plotting import ColumnDataSource, figure, show
>>> df = q.df
>>> plt_vars = ['LE_corr', 'H_corr']
>>> colors = ['blue', 'red']
>>> labels = ['LE', 'H']
>>> source = ColumnDataSource(df)
>>> fig = figure(
...     x_axis_label='date', y_axis_label='Corrected Turbulent Flux'
... )
>>> Plot.add_lines(
...     fig, df, plt_vars, colors, 'date', source, labels=labels
... )
>>> show(fig)
- static line_plot(fig, x, y, source, color, label=None, x_axis_type='date', **kwargs)[source]
Add a single time series to a bokeh.plotting.figure.Figure object using data from a datetime indexed pandas.DataFrame with an interactive hover tool.
Interactive hover shows the values of all time series data and date that is added to the figure.
- Parameters:
fig (bokeh.plotting.figure.Figure) – a figure instance to add the line to.
x (str) – name of the datetime index or column in the pandas.DataFrame containing data to plot.
y (str) – name of the column in the pandas.DataFrame to plot.
source (bokeh.models.sources.ColumnDataSource) – column data source created from the pandas.DataFrame with data to plot.
color (str) – color of plot line, see Bokeh for color options.
label (str or None) – default None. Label for plot legend (for y).
x_axis_type (str or None) – default ‘date’. If “date” then the x-axis will be formatted as month-day-year.
- Returns:
Example
To use the Plot.line_plot function we first need to create a bokeh.models.sources.ColumnDataSource from a pandas.DataFrame. Let’s say we want to plot the monthly time series of corrected latent energy, starting from a config.ini file,
>>> from fluxdataqaqc import Data, QaQc, Plot
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()
Now the QaQc instance should have the “LE_corr” (corrected latent energy) column, we can now make a bokeh.models.sources.ColumnDataSource from fluxdataqaqc.QaQc.df or fluxdataqaqc.QaQc.monthly_df,
>>> from bokeh.plotting import ColumnDataSource, figure, show
>>> source = ColumnDataSource(q.monthly_df)
>>> # create the figure before using line_plot
>>> fig = figure(x_axis_label='date', y_axis_label='Corrected LE')
>>> Plot.line_plot(
...     fig, 'date', 'LE_corr', source, color='red', line_width=3
... )
>>> show(fig)
Notice, line_width is not an argument to Plot.line_plot but it is an acceptable keyword argument to bokeh.plotting.figure.Figure and therefore will work as expected.
- static scatter_plot(fig, x, y, source, color, label='', lsrl=True, date_name='date', **kwargs)[source]
Add paired time series data to an interactive Bokeh scatter plot (bokeh.plotting.figure.Figure).
Handles missing data points (gaps) by masking out indices in x and y where one or both are null. The lsrl option adds the best fit least squares linear regression line with y-intercept through zero and reports the slope of the line in the figure legend. Interactive hover shows the values of all paired (x,y) data and date that is added to the figure.
- Returns:
minimum and maximum x and minimum and maximum y values of paired data, which can be used for adding a one to one line to the figure or other uses.
- Return type:
(tuple)
Example
Let’s say that we wanted to run the energy balance ratio closure correction including gap filling with reference ET * crop coefficient and then plot corrected ET versus the calculated ET from reference ET (named “ET_fill” in flux-data-qaqc) which is calculated on all days, even those without gaps. Similar to Plot.line_plot we first need to create a bokeh.models.sources.ColumnDataSource from a pandas.DataFrame.
>>> from fluxdataqaqc import Data, QaQc
>>> d = Data('path/to/config.ini')
>>> q = QaQc(d)
>>> q.correct_data()
Now the QaQc instance should have the “ET_corr” (corrected ET) and “ET_fill” (ET calculated from reference ET and crop coefficient) columns, we can now make a bokeh.models.sources.ColumnDataSource from fluxdataqaqc.QaQc.df or fluxdataqaqc.QaQc.monthly_df,
>>> from bokeh.plotting import ColumnDataSource, figure, show
>>> df = q.df
>>> source = ColumnDataSource(df)
>>> fig = figure(
...     x_axis_label='ET, corrected', y_axis_label='ET, gap fill'
... )
>>> # note, we are calling this plot method from a QaQc instance
>>> q.scatter_plot(
...     fig, 'ET_corr', 'ET_fill', source, 'red', label='lslr'
... )
>>> show(fig)
The label keyword argument will be used in the legend, and since the least squares linear regression line between x and y is being calculated, the slope of the line will also be printed in the legend. In this case, if the slope of the regression line is 0.94 then the legend will read “lslr, slope=0.94”.
Note
Extra keyword arguments (accepted by
bokeh.plotting.figure.Figure) will be passed to the scatter plot but not to the least squares regression line plot.
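The masking and zero-intercept regression described for scatter_plot can be illustrated without Bokeh. The following is a plain-Python sketch under stated assumptions, not the library's implementation: paired_stats is a hypothetical helper, and the input values are made up. It drops pairs where either value is missing, fits y = slope * x through the origin, and returns the minimum and maximum of the paired data.

```python
import math


def paired_stats(x, y):
    """Mask null pairs, then fit y = slope * x through the origin (hypothetical helper)."""
    pairs = [
        (xi, yi) for xi, yi in zip(x, y)
        if xi is not None and yi is not None
        and not math.isnan(xi) and not math.isnan(yi)
    ]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    # least squares slope with the intercept fixed at zero: sum(x*y) / sum(x*x)
    slope = sum(a * b for a, b in pairs) / sum(a * a for a in xs)
    return slope, (min(xs), max(xs), min(ys), max(ys))


# the third and fourth pairs are dropped because one member is missing
slope, limits = paired_stats(
    [1.0, 2.0, None, 4.0], [2.0, 4.0, 5.0, float("nan")]
)
print(round(slope, 2), limits)
```

The returned limits are the kind of values one could use to draw a one to one line across the paired data range.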
utility classes and functions
Collection of utility objects and functions for the fluxdataqaqc
module.
- class fluxdataqaqc.util.Convert[source]
Bases: object
Tools for unit conversions for flux-data-qaqc module.
- allowable_units = {'G': ['w/m2', 'mj/m2'], 'H': ['w/m2', 'mj/m2'], 'LE': ['w/m2', 'mj/m2'], 'Rn': ['w/m2', 'mj/m2'], 'blh': ['m'], 'co2': ['umol/mol', 'ppm'], 'fc': ['umolco2/m2/s'], 'gpp': ['umolco2/m2/s'], 'lw_in': ['w/m2', 'mj/m2'], 'lw_out': ['w/m2', 'mj/m2'], 'mo_length': ['m'], 'nee': ['umolco2/m2/s'], 'ppfd_in': ['umolphoton/m2/s'], 'ppt': ['mm', 'in', 'm'], 'reco': ['umolco2/m2/s'], 'sc': ['umolco2/m2/s'], 'sigmav': ['m/s'], 'sw_in': ['w/m2'], 'sw_out': ['w/m2', 'mj/m2'], 't_avg': ['c', 'f', 'k'], 't_max': ['c', 'f', 'k'], 't_min': ['c', 'f', 'k'], 'ustar': ['m/s'], 'vp': ['kpa', 'hpa', 'pa'], 'vpd': ['kpa', 'hpa', 'pa'], 'ws': ['m/s', 'mph'], 'zeta': ['dimensionless', 'nondimensional']}
- classmethod convert(var_name, initial_unit, desired_unit, df)[source]
Given a valid initial and desired variable dimension for a variable within a pandas.DataFrame, make the conversion and return the updated pandas.DataFrame.
For a list of variables that require certain units within flux-data-qaqc see Convert.allowable_units (names of allowable options of input variable dimensions) and Convert.required_units (for the mandatory dimensions of certain variables before running QaQc calculations).
- Parameters:
var_name (str) – name of variable to convert in df.
initial_unit (str) – name of initial unit of variable, must be valid from Convert.allowable_units.
desired_unit (str) – name of units to convert to, also must be valid.
df (pandas.DataFrame) – pandas.DataFrame containing variable to be converted, i.e. with var_name in columns.
- Returns:
updated dataframe with specified variable’s units converted
- Return type:
df (pandas.DataFrame)
Note
Many potential dimensions may not be provided for automatic conversion, if so you may need to update your variable dimensions manually, e.g. within a Data.df before creating a QaQc instance. Unit conversions are required for variables that can potentially be used in calculations within Data or QaQc.
- pretty_unit_names = {'c': 'C', 'dimensionless': '—', 'f': 'F', 'hpa': 'hPa', 'j/m2': 'J m⁻²', 'k': 'K', 'kpa': 'kPa', 'm/s': 'm s⁻¹', 'pa': 'Pa', 'umol/mol': 'μmol mol⁻¹', 'umolco2/m2/s': 'μmol CO₂ m⁻² s⁻¹', 'umolphoton/m2/s': 'μmol photons m⁻² s⁻¹', 'w/m2': 'W m⁻²'}
- required_units = {'G': 'w/m2', 'H': 'w/m2', 'LE': 'w/m2', 'Rn': 'w/m2', 'blh': 'm', 'co2': 'umol/mol', 'fc': 'umolco2/m2/s', 'gpp': 'umolco2/m2/s', 'lw_in': 'w/m2', 'lw_out': 'w/m2', 'mo_length': 'm', 'nee': 'umolco2/m2/s', 'ppfd_in': 'umolphoton/m2/s', 'ppt': 'mm', 'reco': 'umolco2/m2/s', 'sc': 'umolco2/m2/s', 'sigmav': 'm/s', 'sw_in': 'w/m2', 'sw_out': 'w/m2', 't_avg': 'c', 't_max': 'c', 't_min': 'c', 'ustar': 'm/s', 'vp': 'kpa', 'vpd': 'kpa', 'ws': 'm/s', 'zeta': 'dimensionless'}
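When a unit is not handled automatically, the conversion has to be done by hand before creating a QaQc instance. The sketch below shows one way such a lookup-based conversion could be organized in plain Python. It is an illustration only: CONVERSIONS and convert_values are hypothetical names, the table covers only a few of the unit names listed in Convert.allowable_units, and it operates on a list rather than a pandas.DataFrame.

```python
# hypothetical lookup of conversion functions keyed by (initial, desired) unit
CONVERSIONS = {
    ("f", "c"): lambda v: (v - 32.0) * 5.0 / 9.0,
    ("k", "c"): lambda v: v - 273.15,
    ("hpa", "kpa"): lambda v: v / 10.0,
    ("pa", "kpa"): lambda v: v / 1000.0,
    ("in", "mm"): lambda v: v * 25.4,
    ("mph", "m/s"): lambda v: v * 0.44704,
}


def convert_values(values, initial_unit, desired_unit):
    """Convert a list of numbers between two unit names (hypothetical helper)."""
    initial_unit, desired_unit = initial_unit.lower(), desired_unit.lower()
    if initial_unit == desired_unit:
        return list(values)
    try:
        func = CONVERSIONS[(initial_unit, desired_unit)]
    except KeyError:
        raise ValueError(
            f"no conversion from {initial_unit} to {desired_unit}"
        ) from None
    return [func(v) for v in values]


# air temperature from Fahrenheit to Celsius, vapor pressure from hPa to kPa
t_c = convert_values([32.0, 212.0], "F", "c")
vp_kpa = convert_values([1013.0], "hpa", "kpa")
print(t_c, vp_kpa)
```

Raising an error for an unknown unit pair, rather than returning the data unchanged, makes it harder to carry wrong units into later calculations.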
- fluxdataqaqc.util.monthly_resample(df, cols, agg_str, thresh=0.75)[source]
Resample dataframe to monthly frequency while excluding months missing more than a specified percentage of days of the month.
- Parameters:
df (pandas.DataFrame) – datetime indexed DataFrame instance
cols (list) – list of columns in df to resample to monthly frequency
agg_str (str) – resample function as string, e.g. ‘mean’ or ‘sum’
- Keyword Arguments:
thresh (float) – threshold (decimal fraction) of how many days in a month must exist for it to be temporally resampled, otherwise the monthly value for the month will be null.
- Returns:
datetime indexed DataFrame that has been resampled to monthly time frequency.
- Return type:
ret (pandas.DataFrame)
Note
If taking monthly totals (agg_str = ‘sum’), missing days will be filled with the month’s daily mean before summation.
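The threshold and gap filling behavior for monthly sums can be sketched in plain Python. This is an illustration, not the library's code: monthly_totals is a hypothetical helper, the data are made up, and treating a month as valid when its fraction of present days is greater than or equal to thresh is an assumption about how the boundary is handled.

```python
import calendar
from collections import defaultdict
from datetime import date, timedelta


def monthly_totals(daily, thresh=0.75):
    """daily maps datetime.date -> float, or None for a missing day (hypothetical helper)."""
    by_month = defaultdict(list)
    for day, value in daily.items():
        values = by_month[(day.year, day.month)]  # register the month even if empty
        if value is not None:
            values.append(value)
    totals = {}
    for (year, month), values in sorted(by_month.items()):
        n_days = calendar.monthrange(year, month)[1]
        if not values or len(values) / n_days < thresh:
            totals[(year, month)] = None  # too few days, monthly value is null
        else:
            # missing days take the month's daily mean before summing
            totals[(year, month)] = sum(values) / len(values) * n_days
    return totals


# June 2020 has 27 of 30 days (2.0 each); July 2020 has only 10 of 31 days
june = {date(2020, 6, 1) + timedelta(days=i): 2.0 for i in range(27)}
july = {date(2020, 7, 1) + timedelta(days=i): 1.0 for i in range(10)}
totals = monthly_totals({**june, **july})
print(totals)
```

June passes the 0.75 threshold and its three missing days are filled with the daily mean, so its total reflects all 30 days; July falls below the threshold and is left null.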
- fluxdataqaqc.util.write_configs(meta_df, data_dict, out_dir=None)[source]
Write multiple config files based on collection of site metadata and a dictionary containing variable information.
Useful for creating config files for flux-data-qaqc for batches of flux stations that utilize the same naming conventions and formatting.
- Parameters:
meta_df (pandas.DataFrame) – dataframe that contains the following columns (or more) that describe metadata for multiple climate stations: ‘site_id’, ‘climate_file_path’, ‘station_longitude’, ‘station_elevation’, ‘station_latitude’, and ‘missing_data_value’. Elevation should be in meters and latitude is in decimal degrees. Additional metadata columns will be added to the config file for each site, e.g. ‘QC_flag’, ‘anemometer_height’, and any others.
data_dict (dict) – dictionary that maps flux-data-qaqc config names to the user’s column names in the input file’s header, e.g. {‘net_radiation_col’: ‘netrad’, ‘net_radiation_units’: ‘w/m2’}. Anything that can appear in the “DATA” section of a flux-data-qaqc config file can be present here, including QC flag names and multiple soil moisture names and weights.
- Keyword Arguments:
out_dir (str or None) – default None. Directory to save config files, if None then save to current working directory.
- Returns:
list of pathlib.Path objects of full paths to each config file written.
- Return type:
configs (list)
- Raises:
Exception – if one of the mandatory metadata columns does not exist in meta_df.
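To make the batch idea concrete, here is a standard-library sketch of writing one INI file per site from shared variable information. It does not call write_configs and is not its implementation: write_site_configs is a hypothetical helper, the site metadata are made up, a list of dictionaries stands in for the rows of meta_df, and the “METADATA” section name and the output file names are assumptions (only the “DATA” section is named in the text above).

```python
import configparser
import tempfile
from pathlib import Path


def write_site_configs(sites, data_dict, out_dir):
    """Write one INI file per site and return the paths written (hypothetical helper)."""
    out_dir = Path(out_dir)
    written = []
    for meta in sites:
        if "site_id" not in meta:
            raise KeyError("each site needs a 'site_id' entry")
        cfg = configparser.ConfigParser()
        cfg.optionxform = str  # keep option names such as QC_flag case-sensitive
        cfg["METADATA"] = {k: str(v) for k, v in meta.items()}  # assumed section name
        cfg["DATA"] = dict(data_dict)
        path = out_dir / f"{meta['site_id']}.ini"  # assumed file naming
        with open(path, "w") as f:
            cfg.write(f)
        written.append(path)
    return written


# made-up metadata for two sites that share the same input file layout
sites = [
    {"site_id": "site_a", "climate_file_path": "site_a.csv",
     "station_latitude": 40.1, "station_longitude": -105.2,
     "station_elevation": 1600, "missing_data_value": -9999},
    {"site_id": "site_b", "climate_file_path": "site_b.csv",
     "station_latitude": 38.5, "station_longitude": -109.6,
     "station_elevation": 1300, "missing_data_value": -9999},
]
data_dict = {"net_radiation_col": "netrad", "net_radiation_units": "w/m2"}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_site_configs(sites, data_dict, tmp)
    check = configparser.ConfigParser()
    check.read(paths[0])
    net_rad_col = check["DATA"]["net_radiation_col"]
    file_names = [p.name for p in paths]
print(file_names, net_rad_col)
```

Because the variable mapping is shared, only the per-site metadata differ between the files, which is what makes this approach convenient for stations with a common naming convention.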