Tutorial

This tutorial demonstrates the most important features of the flux-data-qaqc Python package for management, analysis, and visualization of eddy covariance time series data. It is recommended to read the Installation and Configuration Options and Caveats tutorials before this one.

A Jupyter Notebook of this tutorial is available here.

Tip

Currently, the software does not include a command line interface; to use it you must use Python, e.g. write your own scripts or work in an interactive shell. However, you will see that common workflows can be accomplished with a few (5-10) lines of code, and you can simply follow the templates given here to make custom scripts.

Description of example datasets

The data for this example comes from the “Twitchell Alfalfa” AmeriFlux eddy covariance flux tower site in California. The site is located in alfalfa fields and exhibits a mild Mediterranean climate with dry and hot summers; for more information on this site or to download data click here.

Loading input

The loading and management of input climatic data and metadata from a config.ini file is done using the fluxdataqaqc.Data object. In a nutshell, a Data object is created from a properly formatted config file (see Setting up a config file) and has tools for parsing input climate data, averaging input climate time series, accessing/managing metadata, flag-based data filtering, and creating interactive visualizations of input data.

There is only one argument to create a Data object, the path to the config.ini file:

>>> # imports for code snippets within tutorial
>>> import pandas as pd
>>> from fluxdataqaqc import Data, QaQc, Plot
>>> from bokeh.plotting import figure, show, ColumnDataSource
>>> from bokeh.models.formatters import DatetimeTickFormatter
>>> from bokeh.models import LinearAxis, Range1d
>>> # create a Data object from the config.ini file
>>> config_path = 'US-Tw3_config.ini'
>>> d = Data(config_path)

Attributes of a Data object

Below are some of the useful attributes of the Data object and how they may be used.

The full path to the config.ini file that was used to create the Data instance can be accessed; note that it is returned as a system-dependent pathlib.Path object. E.g. on my Linux machine the path is:

>>> d.config_file
    PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/US-Tw3_config.ini')

On a Windows machine the path will have the appropriate backslashes.

Similarly, to access the climate time series file:

>>> d.climate_file
    PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/AMF_US-Tw3_BASE_HH_5-5.csv')

The Data.config attribute is a configparser.ConfigParser object; it allows you to access and modify metadata and data entries in the config file in multiple ways. In flux-data-qaqc it is mainly used for accessing information about the input data.

>>> # get a list of all entries in the METADATA section of the config.ini
>>> d.config.items('METADATA') # access the DATA section the same way
    [('climate_file_path', 'AMF_US-Tw3_BASE_HH_5-5.csv'),
     ('station_latitude', '38.1159'),
     ('station_longitude', '-121.6467'),
     ('station_elevation', '-9.0'),
     ('missing_data_value', '-9999'),
     ('skiprows', '2'),
     ('date_parser', '%Y%m%d%H%M'),
     ('site_id', 'US-Tw3'),
     ('country', 'USA'),
     ('doi_contributor_name', 'Dennis Baldocchi'),
     ('doi_contributor_role', 'Author'),
     ('doi_contributor_email', 'baldocchi@berkeley.edu'),
     ('doi_contributor_institution', 'University of California, Berkeley'),
     ('doi_organization', 'California Department of Water Resources'),
     ('doi_organization_role', 'Sponsor'),
     ('flux_measurements_method', 'Eddy Covariance'),
     ('flux_measurements_variable', 'CO2'),
     ('flux_measurements_operations', 'Continuous operation'),
     ('site_name', 'Twitchell Alfalfa'),
     ('igbp', 'CRO'),
     ('igbp_comment',
      'alfalfa is a fast growing leguminous crop raised for animal feed of low stature.  It is planted in rows and typically reaches 60-70 cm in height prior to harvest.'),
     ('land_ownership', 'public'),
     ('network', 'AmeriFlux'),
     ('reference_paper',
      'Baldocchi, D., Penuelas, J. (2018) The Physics And Ecology Of Mining Carbon Dioxide From The Atmosphere By Ecosystems, Global Change Biology, 45(), 9275–9287'),
     ('reference_doi', '10.1111/gcb.14559'),
     ('reference_usage', 'Reference'),
     ('research_topic',
      'The research approach of the University of California, Berkeley Biometeorology Laboratory involves the coordinated use of experimental measurements and theoretical models to understand the physical, biological, and chemical processes that control trace gas fluxes between the biosphere and atmosphere and to quantify their temporal and spatial variations. The research objectives of the Mayberry Wetland, Twitchell Wetland, Sherman Island, Twitchell Island, Twitchell Alfalfa,  and Twitchell Corn sites are as follows: 1) Describe differences in the fluxes of CO2, CH4, H2O, and energy between different land uses, 2) Understand the mechanisms controlling these fluxes, 3) Use ecosystem modeling to understand controls on these mechanisms under different environmental scenarios. These six sites were selected to capture a wide range of inundated conditions within the Sacramento-San Joaquin River Delta. The research focuses on the eddy covariance technique to measure CH4, CO2, H2O, and energy fluxes and works to combine measurements of both net fluxes and partitioned fluxes in order to achieve a mechanistic understanding of the ecological controls on current and future carbon flux in the Delta.'),
     ('terrain', 'Flat'),
     ('aspect', 'FLAT'),
     ('wind_direction', 'W'),
     ('surface_homogeneity', '370.0'),
     ('site_desc',
      "The Twitchell Alfalfa site is an alfalfa field owned by the state of California and leased to third parties for farming. The tower was installed on May 24, 2013. This site and the surrounding region are part of the San Joaquin - Sacramento River Delta drained beginning in the 1850's and subsequently used for agriculture. The field has been alfalfa for X years…., Crop rotation occurs every 5-6 years.  The site is harvested by mowing and bailing several times per year.  The field is fallow typically between November and February. The site is irrigated by periodically-flooded ditches surrounding the field. The site is irrigated by raising, and subsequently lowering the water table??"),
     ('site_funding', 'California Department of Water Resources'),
     ('team_member_name', 'Joe Verfaillie'),
     ('team_member_role', 'Technician'),
     ('team_member_email', 'jverfail@berkeley.edu'),
     ('team_member_institution', 'University of California, Berkeley'),
     ('url_ameriflux', 'http://ameriflux.lbl.gov/sites/siteinfo/US-Tw3'),
     ('utc_offset', '-8'),
     ('mat', '15.6'),
     ('map', '421.0'),
     ('land_owner', 'California Department of Water Resources'),
     ('climate_koeppen', 'Csa'),
     ('doi', '10.17190/AMF/1246149'),
     ('doi_citation',
      'Dennis Baldocchi (2013-) AmeriFlux US-Tw3 Twitchell Alfalfa, 10.17190/AMF/1246149'),
     ('doi_dataproduct', 'AmeriFlux'),
     ('team_member_address',
      'Department of Environmental Science, Policy and Management, 137 Mulford Hall, 345 Hilgard Hall,Berkeley, CA USA 94720-3110'),
     ('url', 'http://nature.berkeley.edu/biometlab/sites.php?tab=US-Tw3'),
     ('dom_dist_mgmt', 'Agriculture'),
     ('site_snow_cover_days', '0.0'),
     ('state', 'CA'),
     ('location_date_start', '20130524.0'),
     ('acknowledgement',
      'Biometeorology Lab, University of California, Berkeley, PI:  Dennis Baldocchi')]

A useful method is configparser.ConfigParser.get, which takes the section of the config file and the “option” and returns the value:

>>> d.config.get(section='METADATA', option='site_name')
    'Twitchell Alfalfa'
>>> # the section and option keyword names may be omitted
>>> d.config.get('METADATA', 'site_name')
    'Twitchell Alfalfa'

Tip

If you are unsure whether an entry or option exists in the config file, use the fallback keyword argument:

>>> # 'site name' does not exist, so the fallback value is returned
>>> d.config.get('METADATA', 'site name', fallback='na')
    'na'

Some metadata entries are added as Data attributes for easier access because they are used in multiple ways later; these include:

  • site_id*

  • elevation*

  • latitude*

  • longitude*

  • na_val

  • qc_threshold

  • qc_flag

*Mandatory METADATA entries in the config file; see Setting up a config file for further explanation.
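For example, using the metadata shown above, the mandatory entries can be read straight off the Data instance (the exact numeric types are assumptions; the values come from the METADATA section):

>>> # shortcut attributes parsed from the METADATA section
>>> d.site_id
    'US-Tw3'
>>> d.latitude, d.longitude, d.elevation
    (38.1159, -121.6467, -9.0)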

To view all the columns as found in the header row of the input time series climate file:

>>> d.header
    array(['TIMESTAMP_START', 'TIMESTAMP_END', 'CO2', 'H2O', 'CH4', 'FC',
           'FCH4', 'FC_SSITC_TEST', 'FCH4_SSITC_TEST', 'G', 'H', 'LE',
           'H_SSITC_TEST', 'LE_SSITC_TEST', 'WD', 'WS', 'USTAR', 'ZL', 'TAU',
           'MO_LENGTH', 'V_SIGMA', 'W_SIGMA', 'TAU_SSITC_TEST', 'PA', 'RH',
           'TA', 'VPD_PI', 'T_SONIC', 'T_SONIC_SIGMA', 'SWC_1_1_1',
           'SWC_1_2_1', 'TS_1_1_1', 'TS_1_2_1', 'TS_1_3_1', 'TS_1_4_1',
           'TS_1_5_1', 'NETRAD', 'PPFD_DIF', 'PPFD_IN', 'PPFD_OUT', 'SW_IN',
           'SW_OUT', 'LW_IN', 'LW_OUT', 'P', 'FC_PI_F', 'RECO_PI_F',
           'GPP_PI_F', 'H_PI_F', 'LE_PI_F'], dtype='<U15')

Note

Not all of the header columns will necessarily be loaded, only those specified in the config file. Also, no data other than the header line is loaded into memory when creating a Data object; the time series data is only loaded when Data.df is first accessed, which improves efficiency for workflows involving only metadata.
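Because of this lazy loading, metadata-only scripts stay lightweight. For example, here is a sketch of a loop that collects site information from several config files without ever reading the time series data (the single-entry config list is just illustrative):

>>> # metadata-only loop; no climate data is read because .df is never accessed
>>> configs = ['US-Tw3_config.ini']  # extend with paths to other site configs
>>> for cfg in configs:
...     site = Data(cfg)
...     print(site.site_id, site.config.get('METADATA', 'site_name', fallback='na'))
    US-Tw3 Twitchell Alfalfa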

Variable names and units

In flux-data-qaqc there are two naming schemes for climate variables: the names as defined by the column headers in the input time series file, and the internal names used for some input variables and for calculated variables created by the package. We will refer to these two sets as “user-defined” and “internal” names hereafter.

The Data.variables attribute maps the internal to user-defined variable names:

>>> d.variables
    {'date': 'TIMESTAMP_START',
     'Rn': 'NETRAD',
     'G': 'G',
     'LE': 'LE_PI_F',
     'H': 'H_PI_F',
     'sw_in': 'SW_IN',
     'sw_out': 'SW_OUT',
     'lw_in': 'LW_IN',
     'lw_out': 'LW_OUT',
     'vpd': 'VPD_PI',
     't_avg': 'T_SONIC',
     'ws': 'WS',
     'theta_1': 'SWC_1_1_1',
     'theta_2': 'SWC_1_2_1'}

And the Data.inv_map attribute maps the user-defined names back to internal names where they differ; however, it is only created once the data is loaded by accessing Data.df.

>>> # a similar dictionary attribute for input units
>>> d.units
    {'Rn': 'w/m2',
     'G': 'w/m2',
     'LE': 'w/m2',
     'H': 'w/m2',
     'sw_in': 'w/m2',
     'sw_out': 'w/m2',
     'lw_in': 'w/m2',
     'lw_out': 'w/m2',
     'vpd': 'hPa',
     't_avg': 'C',
     'ws': 'm/s',
     'theta_1': '(%): Soil water content (volumetric), range 0-100',
     'theta_2': '(%): Soil water content (volumetric), range 0-100'}

Accessing input data

The Data.df property gives access to the time series input climate data for columns specified in the config file as a datetime-indexed pandas.DataFrame object. This object has numerous powerful built-in tools for time series analysis and visualization.

>>> # first 5 datetimes that are not gaps
>>> d.df.dropna().head()
                            input_G        WS     VPD_PI    T_SONIC  SWC_1_1_1  SWC_1_2_1      NETRAD        SW_IN      SW_OUT       LW_IN      LW_OUT     H_PI_F     LE_PI_F  theta_mean
    date
    2013-05-24 12:30:00  122.194848  3.352754  18.853678  25.682739     6.6790    26.1655  652.648719  1027.756939  212.800000  300.363524  462.671744  95.487930  375.841436    16.42225
    2013-05-24 13:00:00  108.054863  3.882154  18.560999  26.057700     6.7065    26.1600  629.990486   997.749437  209.933333  303.269447  461.095065  96.584383  371.619775    16.43325
    2013-05-24 13:30:00   79.330662  4.646089  18.900260  26.067374     6.7120    26.1545  595.817687   954.988747  206.733333  303.017852  455.455579  84.066406  358.194935    16.43325
    2013-05-24 14:00:00   52.366527  5.048825  20.440061  25.961307     6.7395    26.1325  549.039365   900.975244  201.333333  298.914731  449.517276  69.449710  406.528564    16.43600
    2013-05-24 14:30:00   35.658417  5.302946  21.064824  25.954462     6.7450    26.1215  493.519695   833.458365  192.066667  296.791541  444.663544  47.774030  315.295309    16.43325

Tip

There are many online tutorials on how to use the pandas.DataFrame and its powerful data analysis tools; to get started you may want to visit pandas’ own list of tutorials here.

By default the column names in Data.df are retained from the user-defined names unless a column was named exactly the same as an internal name. For example, the input ground heat flux column in this dataset is named “G”, therefore it was renamed to “input_G”:

>>> d.df.columns
    Index(['input_G', 'WS', 'VPD_PI', 'T_SONIC', 'SWC_1_1_1', 'SWC_1_2_1',
           'NETRAD', 'SW_IN', 'SW_OUT', 'LW_IN', 'LW_OUT', 'H_PI_F', 'LE_PI_F',
           'theta_mean'],
          dtype='object')
>>> # the new name was also updated in Data.variables
>>> d.variables.get('G')
    'input_G'

As stated earlier, Data.inv_map maps the user-defined names to internal flux-data-qaqc names only after loading Data.df:

>>> d.inv_map
    {'TIMESTAMP_START': 'date',
     'NETRAD': 'Rn',
     'input_G': 'G',
     'LE_PI_F': 'LE',
     'H_PI_F': 'H',
     'SW_IN': 'sw_in',
     'SW_OUT': 'sw_out',
     'LW_IN': 'lw_in',
     'LW_OUT': 'lw_out',
     'VPD_PI': 'vpd',
     'T_SONIC': 't_avg',
     'WS': 'ws',
     'SWC_1_1_1': 'theta_1',
     'SWC_1_2_1': 'theta_2'}

Tip

The Data.inv_map is mainly used to rename the dataframe to internal names. This can be very useful when creating custom workflows with the flux-data-qaqc API because it lets you hard-code the internal variable names into your workflow and apply it to different eddy covariance datasets that may use different names for the same variables. For example, let’s say we wanted to make HTML tables of basic statistics of just the energy balance components for many datasets and save each file using the user-defined names:

>>> d = Data('US-Tw3_config.ini')
>>> df = d.df.rename(columns=d.inv_map)
>>> # get some metadata for saving
>>> site_id = d.site_id
>>> vars_we_want = ['H', 'LE', 'Rn', 'G']
>>> # rename variables, calculate basic statistics table and save to HTML
>>> df[vars_we_want].rename(columns=d.variables).describe().to_html('{}.html'.format(site_id))
    Calculating mean for var: THETA from columns: ['SWC_1_1_1', 'SWC_1_2_1']
    WARNING: renaming column G to input_G
>>> # which produces the following HTML table with user-defined names:
>>> from IPython.display import HTML
>>> HTML(filename='{}.html'.format(site_id))
                 H_PI_F       LE_PI_F        NETRAD       input_G
    count  88224.000000  88224.000000  75907.000000  83635.000000
    mean      11.406226     68.404030    101.920827      2.518969
    std       69.703895     99.070142    216.580113     28.046602
    min     -526.371895   -202.958294   -123.084525    -47.545875
    25%      -31.398696      1.952121    -61.986728    -17.174822
    50%       -8.100198     18.846609    -10.864949     -8.162624
    75%       38.495230    105.196206    239.810124     17.611898
    max      399.902074    603.686401    748.806624    152.888745

Another powerful feature of the Data.df property is that it is datetime-indexed using the input data’s temporal frequency; view the date index like so:

>>> d.df.index
    DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:30:00',
                   '2013-01-01 01:00:00', '2013-01-01 01:30:00',
                   '2013-01-01 02:00:00', '2013-01-01 02:30:00',
                   '2013-01-01 03:00:00', '2013-01-01 03:30:00',
                   '2013-01-01 04:00:00', '2013-01-01 04:30:00',
                   ...
                   '2018-06-04 19:00:00', '2018-06-04 19:30:00',
                   '2018-06-04 20:00:00', '2018-06-04 20:30:00',
                   '2018-06-04 21:00:00', '2018-06-04 21:30:00',
                   '2018-06-04 22:00:00', '2018-06-04 22:30:00',
                   '2018-06-04 23:00:00', '2018-06-04 23:30:00'],
                  dtype='datetime64[ns]', name='date', length=95088, freq=None)

Datetime-indexed pandas.DataFrame objects have useful features for time series analysis, like grouping and calculating statistics by time aggregates. The example below shows how to calculate the day-of-year mean for energy balance components; it also demonstrates how to use the add_lines plotting method available to Data, QaQc, and Plot objects.

>>> # convert to internal names, copy dataframe
>>> df = d.df.rename(columns=d.inv_map)
>>> # day of year mean of input energy balance components
>>> vars_we_want = ['H', 'LE', 'Rn', 'G']
>>> doy_means = df[vars_we_want].groupby(d.df.index.dayofyear).mean()
>>> # create a Bokeh figure
>>> fig = figure(x_axis_label='day of year', y_axis_label='day of year mean (w/m2)')
>>> # arguments needed for creating interactive plots
>>> plt_vars = vars_we_want
>>> colors = ['red', 'blue', 'black', 'green']
>>> x_name = 'date'
>>> source = ColumnDataSource(doy_means)
>>> Plot.add_lines(fig, doy_means, plt_vars, colors, x_name, source, labels=vars_we_want,
...     x_axis_type=None)
>>> show(fig)
[Interactive Bokeh plot: day-of-year mean energy balance components]

Note

The x_axis_type=None argument is unique to Plot.add_lines and Plot.line_plot; in this case it means do not force the x-axis format to a datetime representation (the default is x_axis_type='date').

See also

Some routines occur automatically when creating a Data object, including calculation of weighted and non-weighted averages of soil heat flux and soil moisture, which is described in Averaging data from multiple sensors.

Modifying input data

A last note on the Data object (the same goes for the QaQc object): Data.df is a class property, which in this case means that it can be reassigned with a different pandas.DataFrame. This is critical for manual pre-filtering and validation of data before proceeding with energy balance closure routines. A simple example is shown here:

>>> # add 5 to air temperature, this would affect ET calculations later
>>> x = d.df
>>> x['T_SONIC'] += 5
>>> d.df = x

A realistic use of the reassignability of the Data.df and QaQc.df properties is shown in manual cleaning of poor quality data.
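As another sketch of this pattern, suppose you wanted to null out physically implausible latent energy values before applying corrections; the ±1000 W/m2 bounds below are arbitrary placeholders, not recommended thresholds:

>>> df = d.df
>>> # mask LE values outside an arbitrary plausible range (column name from this dataset)
>>> bad = (df['LE_PI_F'] < -1000) | (df['LE_PI_F'] > 1000)
>>> df.loc[bad, 'LE_PI_F'] = float('nan')
>>> # reassign the cleaned dataframe back to the Data instance
>>> d.df = df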

See also

The Data.apply_qc_flags method allows for reading in quality control flags with the input data and filtering specific data out based on user-defined numeric or character flags. This routine is specific to Data and includes several attributes that are added to a Data instance; for a full explanation and examples see Quality-based data filtering.
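For example, assuming QC value or flag columns have been declared in the config file, a hypothetical call might look like one of the following; the threshold and flag keyword names are assumptions mirroring the qc_threshold and qc_flag attributes above, so check the linked documentation for the exact signature:

>>> # hypothetical usage; requires QC flag/value columns declared in the config file
>>> d.apply_qc_flags(threshold=0.5)   # numeric QC-value based filtering (value assumed)
>>> d.apply_qc_flags(flag='x')        # character-flag based filtering (flag assumed)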

Visualize input data

The Data.plot method creates a series of interactive time series plots of input data; potential plots include:

  • energy balance components

  • radiation components

  • multiple soil heat flux measurements

  • air temperature

  • vapor pressure and vapor pressure deficit

  • wind speed

  • precipitation

  • latent energy

  • multiple soil moisture measurements

If any of these variables are not found the plot(s) will not be added.

The most useful interactive features of plots created by flux-data-qaqc are:

  • pan/zoom

  • hover tooltips showing variable names, values, and dates

  • linked x-axes on time series plots

  • save plot option (can save specific subplot zoomed in)

Here is an example,

>>> d.plot(output_type='notebook', plot_width=700)

The output plot is not shown in the online documentation due to memory constraints.

Hint

The plot methods of Data and QaQc objects have the keyword argument output_type, which by default is set to “save”; the other two options are “notebook” for showing within a Jupyter Notebook and “show”, which opens a temporary file in the default web browser.

If you would rather save the plot, and perhaps want 2 columns of subplots,

>>> d.plot(ncols=2, plot_width=500)

After saving a plot without specifying the output file path (keyword argument out_file), it will be saved to an “output” directory located where the config file is, with the file name based on Data.site_id plus the suffix “_input_plots”:

>>> # where the plot file was saved by default
>>> d.plot_file
    PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/output/US-Tw3_input_plots.html')

The following plot is not shown due to excessive memory usage needed to build online documentation.

>>> # view output plots within a Jupyter notebook
>>> from IPython.display import HTML
>>> HTML(filename=d.plot_file)

Hint

The QaQc.plot method shown below is similar; however, it may include added plots of calculated and corrected variables (if they exist) and will always plot data at daily and monthly temporal frequency, because daily frequency is required before applying flux-data-qaqc energy balance closure corrections.

Temporal resampling

The QaQc object holds several tools for managing data and for eddy covariance data analysis, but one of its primary features is temporal resampling of input data to daily and monthly frequencies. The resampling of time series data to daily frequency occurs upon the creation of a QaQc instance if the frequency within the preceding Data object is not already daily:

>>> # the frequency of the input data is 30 minute
>>> d.df.index[0:5]
    DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:30:00',
                   '2013-01-01 01:00:00', '2013-01-01 01:30:00',
                   '2013-01-01 02:00:00'],
                  dtype='datetime64[ns]', name='date', freq=None)
>>> # creating a QaQc instance will automatically convert to daily
>>> q = QaQc(d)
    The input data temporal frequency appears to be less than daily.
    Data is being resampled to daily temporal frequency.
    Filtering days with less then 100.0% or 48/48 sub-daily measurements
    Converting vpd from hpa to kpa
>>> # first 5 datetime indices are dates now
>>> q.df.index[0:5]
    DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
                   '2013-01-05'],
                  dtype='datetime64[ns]', name='date', freq=None)

The method used for aggregating different variables, e.g. mean or sum, when resampling to daily or monthly frequency is defined in the QaQc.agg_dict class attribute:

>>> # these are the internal names as keys and temporal aggregation method as values
>>> QaQc.agg_dict
    {'energy': 'mean',
     'flux': 'mean',
     'flux_corr': 'mean',
     'br': 'mean',
     'ET': 'sum',
     'ET_corr': 'sum',
     'ET_gap': 'sum',
     'ET_fill': 'sum',
     'ET_fill_val': 'sum',
     'ET_user_corr': 'sum',
     'ebr': 'mean',
     'ebr_corr': 'mean',
     'ebr_user_corr': 'mean',
     'ebr_5day_clim': 'mean',
     'gridMET_ETr': 'sum',
     'gridMET_prcp': 'sum',
     'lw_in': 'mean',
     't_avg': 'mean',
     'rso': 'mean',
     'sw_pot': 'mean',
     'sw_in': 'mean',
     'vp': 'mean',
     'vpd': 'mean',
     'ppt': 'sum',
     'ws': 'mean',
     'Rn': 'mean',
     'sw_out': 'mean',
     'lw_out': 'mean',
     'G': 'mean',
     'LE': 'mean',
     'LE_corr': 'mean',
     'LE_user_corr': 'mean',
     'H': 'mean',
     'H_corr': 'mean',
     'H_user_corr': 'mean'}
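To see what this mapping does in practice, here is a minimal sketch of the equivalent pandas operation: resample a renamed daily dataframe to monthly frequency with a per-column aggregation map. This is illustrative only; QaQc.monthly_df additionally applies gap-based filtering internally:

>>> # sketch: monthly resample driven by QaQc.agg_dict, defaulting to mean
>>> df = q.df.rename(columns=q.inv_map)
>>> agg = {c: QaQc.agg_dict.get(c, 'mean') for c in df.columns}
>>> monthly_sketch = df.resample('M').agg(agg)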

Note

There are several calculated variables above that may not look familiar; many are calculated by the energy balance closure correction routines and described in Closure Methodologies. Also, any other variables (not found in QaQc.agg_dict) that exist in QaQc.df before accessing QaQc.monthly_df for the first time will be averaged in the monthly time series dataframe (QaQc.monthly_df).

The QaQc constructor tries to infer the temporal frequency of the input time series data; however, the method is not always accurate. To access the inferred initial temporal frequency of the data, view the QaQc.temporal_freq attribute:

>>> q.temporal_freq
    '30T'

If the inferred input frequency was accurate you will see a pandas datetime alias; in this case ‘30T’ is thirty minutes. If the temporal frequency is not automatically detected, you should be able to rely on the n_samples_per_day instance attribute, which is estimated directly by the QaQc constructor:

>>> q.n_samples_per_day
    48

Filter days with sub-daily gaps

The drop_gaps and daily_frac keyword arguments used when creating a QaQc instance allow you to control how days with sub-daily measurement gaps will or will not be filtered out when resampling to daily frequency.

Sub-daily gaps in the energy balance variables \(LE\), \(H\), \(Rn\), and \(G\), and in the daily ASCE standardized reference ET inputs, e.g. hourly \(ea\) (“vp”), \(rs\) (“sw_in”), \(t_min\), \(t_max\), and \(ws\), can be linearly interpolated automatically before daily aggregation. Interpolation is performed over gap lengths measured in hours; the longest gap to interpolate when \(Rn \ge 0\) is controlled by the QaQc keyword argument max_interp_hours (default 2 hours), and the longest gap to interpolate when \(Rn < 0\) by max_interp_hours_night (default 4 hours).

Important

By default the QaQc constructor will first linearly interpolate the energy balance and ASCE reference ET variables (\(LE\), \(H\), \(Rn\), \(G\), \(ea\) (“vp”), \(rs\) (“sw_in”), \(t_min\), \(t_max\), and \(ws\)) according to the maximum gap lengths (max_interp_hours and max_interp_hours_night), and then count sub-daily gaps and drop days (set values to null) in all climate data columns (not QC flag or sub-daily gap count columns) where any of the sub-daily data are missing, because by default drop_gaps=True and daily_frac=1.0. In other words, if you have hourly input data for \(LE\) and one hour was missing on a given day, by default that hour will be linearly interpolated before the daily mean is calculated. On the other hand, if another climate variable, e.g. wind direction, had a single hour missing on a given day, that day would be filtered out by the QaQc constructor. This is important because the daily time series is what is used in all energy balance closure correction and daily ASCE standardized reference ET algorithms.

The percentage of sub-daily samples required (set by the daily_frac argument) and the maximum length of gaps to linearly interpolate (set by max_interp_hours and max_interp_hours_night) complement each other and are used in tandem. For example, if the input data is half-hourly and you only want a maximum of 4 hours to be interpolated on any given day, with each interpolated gap no longer than 2 hours, then you would pass the following parameters to the QaQc constructor:

>>> q = QaQc(d, daily_frac=20/24, max_interp_hours=2, max_interp_hours_night=2)
    The input data temporal frequency appears to be less than daily.
    Data is being resampled to daily temporal frequency.
    Linearly interpolating gaps in energy balance components up to 2 hours when Rn < 0 and up to 2 hours when Rn >= 0.
    Filtering days with less then 83.33333333333334% or 40/48 sub-daily measurements

In this case we set daily_frac=20/24 because we are allowing a maximum of 4 hours of total gaps in a day; in other words, we require 40 of the 48 half-hourly samples to exist, otherwise the day is filtered out. Remember, because linear interpolation of gaps is done before counting sub-daily gaps, this could result in retaining days with more than 4 hours of gaps in the original time series of energy balance components. You may also pass the daily_frac argument as a decimal fraction, e.g. \(0.8333 \approx 20/24\).
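The arithmetic is simple enough to verify directly:

>>> daily_frac = 20/24
>>> n_samples_per_day = 48    # half-hourly data
>>> # minimum number of sub-daily samples a day must contain to be kept
>>> daily_frac * n_samples_per_day
    40.0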

To not drop any days and take daily means/sums based on whatever data exists in a given day, without any interpolation of energy balance variables:

>>> q = QaQc(d, drop_gaps=False, max_interp_hours=None)
    The input data temporal frequency appears to be less than daily.
    Data is being resampled to daily temporal frequency.

Let’s view a comparison of \(Rn\) using different options for filtering days with sub-daily gaps in the working dataset, because it has several periods of systematic gaps which skew daily mean \(Rn\) upwards if not filtered carefully:

>>> # make an empty pandas dataframe for Rn series
>>> Rn_df = pd.DataFrame()
>>> # recreate multiple QaQc instances using different sub-day gap filters
>>> q = QaQc(d, drop_gaps=False, max_interp_hours=None)
>>> Rn_df['sub_day_gaps'] = q.df.Rn_subday_gaps
>>> Rn_df['no_filter_no_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, drop_gaps=False)
>>> Rn_df['no_filter_with_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=0.5) # filter days with less than 50% data
>>> Rn_df['require_50'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=0.75)
>>> Rn_df['require_75'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=1, max_interp_hours=24, max_interp_hours_night=24)
>>> Rn_df['require_100_with_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=1, max_interp_hours=None)
>>> Rn_df['require_100_no_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> # plot to compare results of day-gap filter
>>> fig = figure(x_axis_label='date', y_axis_label='mean daily net radiation (w/m2), filtered based on sub-daily gaps')
>>> # arguments needed for creating interactive line plots
>>> colors = ['red', 'darkred','orange', 'blue', 'black', 'tan']
>>> plt_vars = ['no_filter_no_interp', 'no_filter_with_interp', 'require_50', 'require_75', 'require_100_with_interp', 'require_100_no_interp']
>>> labels = ['no filter wout/interp.', 'no filter w/interp.', 'require > 50% w/interp.', 'require > 75% w/interp.', 'require 100% w/interp.', 'require 100% wout/interp.']
>>> x_name = 'date'
>>> source = ColumnDataSource(Rn_df)
>>> Plot.add_lines(fig, Rn_df, plt_vars, colors, x_name, source, labels=labels)
>>> # add daily gap counts to secondary y
>>> fig.extra_y_ranges['gap_counts'] = Range1d(start=0, end=48)
>>> fig.add_layout(LinearAxis(y_range_name='gap_counts', axis_label='number of sub-daily gaps'), 'right')
>>> fig.circle('date', 'sub_day_gaps', legend_label='n sub-day gaps', y_range_name='gap_counts',
...     color='silver', source=source
... )
>>> fig.hover[0].tooltips.append(('sub_day_gaps','@{}'.format('sub_day_gaps')))
>>> fig.legend.location = 'top_right'
>>> show(fig)
[Interactive Bokeh plot: daily mean Rn under different sub-daily gap-filtering options, with gap counts on a secondary axis]

Try zooming in on the gaps filled by the “no filter wout/interp.” line to compare which days are retained/filtered by different options; also, remove lines by clicking on them in the legend to compare subsets of options.

Tip

For a more fine-grained approach to filtering out days where, for example, multiple 2 hour gaps were filled, use the newly created daily gap count columns: “LE_subday_gaps”, “H_subday_gaps”, “Rn_subday_gaps”, and “G_subday_gaps”:

>>> q = QaQc(d)
>>> df = q.df.rename(columns=q.inv_map)

For example, you could post-filter out days in any given energy balance variable, in this case \(Rn\), where sub-daily gaps exceed a threshold:

>>> df.loc[(df.Rn_subday_gaps > 4) & (df.Rn.notna()), ['Rn','Rn_subday_gaps']]
                        Rn  Rn_subday_gaps
    date
    2015-06-09  101.710194             5.0
    2015-11-20   47.990988             5.0
    2016-01-15   72.495973             8.0
    2018-01-06   79.507008             7.0
    2018-05-10  160.997332             6.0
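To actually drop those days, mask them and reassign the dataframe (a sketch; the 4-gap threshold is arbitrary):

>>> # null Rn on days where more than 4 sub-daily gaps were present
>>> df.loc[df.Rn_subday_gaps > 4, 'Rn'] = float('nan')
>>> # rename back to user-defined names before reassigning to the QaQc instance
>>> q.df = df.rename(columns=q.variables)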

Monthly time series

The QaQc.monthly_df property allows for creating the monthly time series of input and calculated variables provided by QaQc.correct_data. It uses the same temporal aggregation methods as the daily time series, i.e. from QaQc.agg_dict. Although there are many similarities, there is an important difference between QaQc.df and QaQc.monthly_df beyond the obvious: accessing QaQc.monthly_df will automatically run the default energy balance closure correction routine provided by QaQc.correct_data if it has not yet been run. You can check whether it has been run at any time:

>>> q.corrected
    False

To show how this works let’s access the monthly data and show the monthly statistics of the “corrected” evapotranspiration (ET_corr):

>>> # first note, ET_corr is not in the dataset yet
>>> 'ET_corr' in q.df.columns
    False

Now access the monthly time series,

>>> q.monthly_df;
>>> 'ET_corr' in q.df.columns
    True

By accessing the monthly dataframe, the energy balance closure correction was applied automatically:

>>> q.monthly_df.ET_corr.describe()
    count     61.000000
    mean      87.858135
    std       49.938287
    min       11.370062
    25%       41.418994
    50%       84.383190
    75%      127.500125
    max      192.033481
    Name: ET_corr, dtype: float64
>>> q.corrected
    True

Note

The QaQc.monthly_df also filters out months with more than 30% of days missing by default. To calculate monthly time series with other threshold fractions of required days, use the util.monthly_resample function and adjust the keyword argument thresh, the fraction (0-1) of days of the month required to not be gaps, otherwise the month’s value is forced to null. E.g., if you wanted to calculate the monthly mean air temperature requiring 30 and 90 percent of the days in the month to not be gaps:

>>> from fluxdataqaqc.util import monthly_resample
>>> # select just t_avg for example
>>> cols = ['t_avg']
>>> df = q.df.rename(columns=q.inv_map)
>>> # create temporary df with different monthly resample results
>>> tmp_df = monthly_resample(df, cols, 'mean', thresh=0.9).rename(
...     columns={'t_avg': 'thresh_90'}
... )
>>> # join temp dataframe with monthly resample results using a different thresh
>>> monthly_gap_comp = tmp_df.join(monthly_resample(df, cols, 'mean', thresh=0.3).rename(
...     columns={'t_avg': 'thresh_30'})
... )
>>> # plot to compare results of the monthly day-count threshold filter
>>> fig = figure(x_axis_label='date', y_axis_label='monthly mean air temperature (C), filtered based on daily gaps')
>>> # arguments needed for creating interactive line plots
>>> x = 'date'
>>> source = ColumnDataSource(monthly_gap_comp)
>>> # this example also shows how to use other Bokeh plot arguments
>>> Plot.line_plot(fig,'date','thresh_30',source,'red',label='require > 30%', line_alpha=0.5)
>>> Plot.line_plot(fig,'date','thresh_90',source,'black',label='require > 90%',line_dash='dotted', line_width=2)
>>> fig.legend.location = 'top_right'
>>> show(fig)
[Interactive Bokeh plot: monthly mean air temperature under 30% and 90% day-count thresholds]

Energy balance corrections

flux-data-qaqc provides routines that adjust surface energy balance fluxes to improve energy balance closure of eddy covariance flux station data. These routines ultimately result in a corrected daily and monthly time series of latent energy, sensible heat, and evapotranspiration with the option to gap-fill days in corrected ET with ET calculated from gridMET reference ET and fraction of reference ET.

There are three methods currently implemented:

  • Energy Balance Ratio method (default), modified from the FLUXNET method

  • Bowen Ratio approach (forces closure)

  • Multiple least squares regression - user defines LHS and RHS from \(LE\), \(H\), \(Rn\), and \(G\)

Detailed descriptions of methods including daily ET gap-filling methods can be found in the online documentation Closure Methodologies page. A few important notes on the API of these methods and other hydro-climatic statistical variables that are calculated are shown below.

The QaQc.correct_data method is used to run energy balance closure corrections. Here are a few tips on using them,

>>> # note above the monthly_df automatically applied the 'ebr' Energy Balance Ratio correction
>>> q.corr_meth
    'ebr'
>>> # potential correction options
>>> q.corr_methods
    ('ebr', 'br', 'lin_regress')
>>> # to specify the Bowen Ratio method:
>>> q.correct_data(meth='br')
>>> # the most recently used correction method is now shown
>>> q.corr_meth
    'br'

Tip

After applying any energy balance closure correction routine, all previously corrected variables will be overwritten or dropped in QaQc.df, QaQc.monthly_df, and QaQc.variables. Therefore, to compare different methods on the same data, make a copy of the df or monthly_df properties before running the next correction, e.g.

>>> # make copies of daily results of different correction options
>>> q.correct_data(meth='ebr')
>>> ebr_gapfilled = q.df
>>> q.correct_data(meth='ebr', etr_gap_fill=False)
>>> ebr_notgapfilled = q.df
>>> q.correct_data(meth='br')
>>> br_gapfilled = q.df
>>> q.correct_data(meth='br', etr_gap_fill=False)
>>> br_notgapfilled = q.df
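With these copies in hand the results can be compared directly, e.g. using the corrected ET column (calculated variables keep their internal names, so ET_corr is assumed to be present after each correction):

>>> # mean daily corrected ET (mm/day) under each correction option
>>> ebr_gapfilled.ET_corr.mean(), br_gapfilled.ET_corr.mean()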

ET gap-filling

A few notes on the option that uses reference ET and the fraction of daily reference ET to fill large gaps in corrected ET, i.e. the keyword argument QaQc.correct_data(etr_gap_fill=True):

  • The nearest gridMET cell’s time series data for precipitation and alfalfa reference ET will be downloaded if it is not found at the gridmet_file_path entry of the config.ini file.

  • If the path to a gridMET file is not found it is re-downloaded, and the config file will be updated with the new path and resaved.

  • Only the time period overlapping the eddy covariance time series data is downloaded, i.e. the period in QaQc.df.index.

  • When a gridMET file is downloaded it will always be saved in a subdirectory called “gridMET_data” located where the config file is, and named using the QaQc.site_id and the gridMET cell centroid latitude and longitude.

  • Corrected latent energy (\(LE_{corr}\)) gaps are also backwards filled from gap-filled ET.

Caution

gridMET only exists within the contiguous United States and from 1979 to present; therefore, if your station lies outside of this region or you are analyzing eddy flux data recorded before 1979, this option will not be usable and you should always run corrections with etr_gap_fill=False to avoid potential errors.

The Bowen Ratio correction method will produce the ‘br’ variable which is the Bowen Ratio.

Other calculations

By default, QaQc.correct_data also calculates ET from input latent energy (LE) and air temperature, corrected ET from corrected LE and air temperature, and potential clear sky radiation (ASCE formulation). In addition, the Data object attempts to calculate vapor pressure deficit from vapor pressure and air temperature, or vapor pressure from vapor pressure deficit and air temperature, if they exist at hourly or shorter temporal frequency.

Evapotranspiration

The evapotranspiration (ET) calculations are described in Steps 7 and 8 (correct turbulent fluxes, EBR, and ET) of the Energy Balance Ratio correction explanation.

ASCE clear sky radiation

Daily ASCE potential clear sky radiation (\(R_{so}\)) is calculated using equation 19 in the “ASCE Standardized Reference Evapotranspiration Equation” final report by the Task Committee on Standardization of Reference Evapotranspiration, Environmental and Water Resources Institute of the American Society of Civil Engineers, January 2005, available here. This calculation is a simple method based primarily on elevation and latitude which results in a theoretical envelope of \(R_{so}\) as a function of the day of year,

\[R_{so} = \left(0.75 + 2 \times 10^{-5} z \right) R_a\]

where \(z\) is elevation in meters and \(R_a\) is daily extraterrestrial radiation (radiation in the absence of an atmosphere), which itself is a well-behaved function of solar declination, the day of the year, and the solar constant (see equations 21-29 in the ASCE report).
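As a quick sketch of the calculation (mirroring equation 19 above, not necessarily the package’s internal implementation):

>>> def clear_sky_radiation(ra, elev_m):
...     # ASCE eqn. 19: ra is extraterrestrial radiation, elev_m is elevation in meters
...     return (0.75 + 2e-5 * elev_m) * ra
>>> # at sea level, Rso is 75% of extraterrestrial radiation
>>> clear_sky_radiation(40.0, 0.0)
    30.0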

Vapor pressure/deficit

The Data object will attempt to calculate vapor pressure or vapor pressure deficit if one exists but not the other, and an average air temperature time series also exists with the input data at hourly or shorter temporal frequency. The Tetens equation (eqn. 37 in the ASCE report) is an accurate approximation for saturation vapor pressure (\(es\)) in kPa as a function of air temperature,

\[es = 0.6108 e^{\left(\frac{17.27 \cdot T}{(T + 237.3)}\right)}\]

where \(T\) is average hourly air temperature in degrees Celsius. Vapor pressure deficit (\(vpd\)) is,

\[vpd = es - ea,\]

where \(ea\) is actual vapor pressure in kPa. Note, the equations above are defined for hourly measurements; however, they are applied to hourly or shorter mean variables (\(T\), \(ea\), or \(vpd\)) within flux-data-qaqc and then converted to daily means. If these variables are not present in the input data at hourly or shorter frequencies then they are not calculated.

These equations can be rearranged to solve for either \(es\) or \(vpd\) given the other variable and air temperature. For example, if given \(T\) and \(vpd\), then to get actual vapor pressure,

\[es = 0.6108 e^{\left(\frac{17.27 \cdot T}{(T + 237.3)}\right)}\]
\[ea = es - vpd.\]

In flux-data-qaqc actual vapor pressure is named “vp”, not “ea”. Also, during these calculations, if relative humidity is not found in the input dataset it will subsequently be estimated as

\[rh = 100 \times \frac{ea}{es}.\]
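A self-contained sketch of these conversions, mirroring the equations above (the 25 C and 1.2 kPa inputs are arbitrary example values):

>>> import math
>>> def sat_vapor_pressure(t):
...     # Tetens approximation (ASCE eqn. 37), t in C, returns es in kPa
...     return 0.6108 * math.exp(17.27 * t / (t + 237.3))
>>> es = sat_vapor_pressure(25.0)
>>> vpd = 1.2              # example vapor pressure deficit, kPa
>>> ea = es - vpd          # actual vapor pressure, named "vp" internally
>>> rh = 100 * ea / es     # relative humidity estimate, percent
>>> round(es, 3), round(ea, 3), round(rh, 1)
    (3.168, 1.968, 62.1)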

Hint

The same calculations are available at the daily timestep but are not automatically applied because the hourly or higher temporal frequency calculation is preferred. To apply the estimates of vapor pressure or vapor pressure deficit, saturation vapor pressure, and relative humidity with daily data, one must call the QaQc._calc_vpd_from_vp method from a QaQc instance.

Legend for calculated variable names

Atmospheric and related variables created by energy balance closure corrections are described in Closure Methodologies, and other calculations can be found throughout this documentation. Below is a reference table with basic descriptions of all variables that are calculated or renamed by flux-data-qaqc using its standardized naming scheme. The names found here (in the “Variable” column) are the default names that are saved to the header in both daily and monthly time series output CSV files, as well as what is shown in the interactive plots upon hovering with a cursor. Note, not all of the following variables will be estimated by flux-data-qaqc in all scenarios; the list of variables that may be found in output files depends entirely on the input data provided in the configuration file and on which calculations are used. For example, ASCE_ETo will only exist if the hourly or daily methods for computing ASCE standardized reference ET are used and the required input variables exist.

Variable           Description                                                        Unit
-----------------  -----------------------------------------------------------------  ------------
ASCE_ETo           short/grass ASCE standardized reference ET                         mm time-1
ASCE_ETr           tall/alfalfa ASCE standardized reference ET                        mm time-1
br                 Bowen ratio                                                        unitless
ebc_cf             energy balance closure correction factor (inverse of ebr_corr)     unitless
ebr                input energy balance ratio                                         unitless
ebr_5day_clim      5 day climatology of the filtered Energy Balance Ratio             unitless
ebr_corr           corrected energy balance ratio                                     unitless
energy             input Rn - G                                                       w m-2
es                 saturation vapor pressure                                          kPa
ET                 ET calculated from input LE and average air temperature            mm time-1
ET_corr            ET calculated from LE_corr                                         mm time-1
ET_fill            gridMET_ETr * ETrF_filtered (fills gaps in ET_corr)                mm time-1
ET_fill_val        value of ET_fill on gap days                                       mm time-1
ET_gap             True on gap days in ET_corr, False otherwise                       unitless
ET_user_corr       corrected ET, user-provided                                        mm time-1
EToF               fraction of reference ET for ET_corr, i.e. ET_corr / gridMET_ETo   unitless
EToF_filtered      filtered and gap-filled EToF                                       unitless
ETrF               fraction of reference ET for ET_corr, i.e. ET_corr / gridMET_ETr   unitless
ETrF_filtered      filtered and gap-filled ETrF                                       unitless
flux               input LE + H                                                       w m-2
flux_corr          LE_corr + H_corr                                                   w m-2
G                  average or single soil heat flux                                   w m-2
G_[1,2,3,…]        soil heat flux at sensor                                           w m-2
G_subday_gaps      number of gaps in initial G per day                                unitless
gridMET_ETo        gridMET short/grass reference ET (nearest cell)                    mm time-1
gridMET_ETr        gridMET tall/alfalfa reference ET (nearest cell)                   mm time-1
gridMET_prcp       gridMET precipitation (nearest cell)                               mm time-1
gridMET_[other]    other optional gridMET variables (nearest cell)                    NA
H_corr             corrected sensible heat                                            w m-2
H_subday_gaps      number of gaps in initial H per day                                unitless
H_user_corr        corrected sensible heat, user-provided                             w m-2
LE_corr            corrected latent energy                                            w m-2
LE_subday_gaps     number of gaps in initial LE per day                               unitless
LE_user_corr       corrected latent energy, user-provided                             w m-2
lw_in              incoming longwave radiation                                        w m-2
lw_out             outgoing longwave radiation                                        w m-2
ppt                precipitation                                                      mm time-1
rh                 relative humidity                                                  %
Rn                 net radiation                                                      w m-2
Rn_subday_gaps     number of gaps in initial Rn per day                               unitless
rso                clear sky radiation (ASCE formulation)                             w m-2
sw_in              incoming shortwave radiation                                       w m-2
sw_out             outgoing shortwave radiation                                       w m-2
sw_pot             potential shortwave radiation, user-provided                       w m-2
t_avg              average temperature                                                C
t_dew              dew point temperature                                              C
t_max              maximum temperature                                                C
t_min              minimum temperature                                                C
theta              average or single soil moisture                                    user defined
theta_[1,2,3,…]    soil moisture at sensor                                            user defined
vp                 actual vapor pressure                                              kPa
vpd                vapor pressure deficit                                             kPa
wd                 wind direction                                                     user defined
ws                 wind speed                                                         m s-1

A note on units

Upon creation of a QaQc object, variables are checked for valid input units and converted to required units needed for internal calculations when running QaQc.correct_data and for certain default plots (see below). For a list of valid input units for different variables refer to the QaQc.allowable_units attribute:

>>> q.allowable_units
    {'LE': ['w/m2'],
     'H': ['w/m2'],
     'Rn': ['w/m2'],
     'G': ['w/m2'],
     'lw_in': ['w/m2'],
     'lw_out': ['w/m2'],
     'sw_in': ['w/m2'],
     'sw_out': ['w/m2'],
     'ppt': ['mm', 'in'],
     'vp': ['kpa', 'hpa'],
     'vpd': ['kpa', 'hpa'],
     't_avg': ['c', 'f']}

For each variable above, if the input units are among those allowable, they will automatically be converted to the required units.

To know which units are required for each variable, view QaQc.required_units:

>>> q.required_units
    {'LE': 'w/m2',
     'H': 'w/m2',
     'Rn': 'w/m2',
     'G': 'w/m2',
     'lw_in': 'w/m2',
     'lw_out': 'w/m2',
     'sw_in': 'w/m2',
     'sw_out': 'w/m2',
     'ppt': 'mm',
     'vp': 'kpa',
     'vpd': 'kpa',
     't_avg': 'c'}

Note

The list of allowable units is a work in progress; if your input units are not available, consider raising an issue on GitHub or providing the conversion directly with a pull request. Automatic unit conversions are handled within the util.Convert class using the util.Convert.convert class method.

Save resampled and corrected data

The QaQc.write method conveniently writes daily and monthly time series of input and calculated variables to comma separated value (CSV) files. If the QaQc.correct_data method has not yet been run, it will be run first, and the monthly time series will be created using the default parameters for the correction routine (Energy Balance Ratio method with ETr-based gap filling).

The default output directory for time series files can be accessed or changed via the out_dir attribute; if not changed, it will be located in the same directory as the config.ini file. The daily and monthly time series file names will begin with the QaQc.site_id followed by “daily_data” or “monthly_data” respectively. For example,

>>> # new QaQc instance
>>> q = QaQc(d)
>>> # a platform dependent pathlib.Path object
>>> q.out_dir
    PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/output')

The listing below shows that no time series output files have been written to QaQc.out_dir yet (only the input plots file from earlier),

>>> # print files in output directory that begin with the site_id
>>> [f.name for f in q.out_dir.glob('{}*'.format(q.site_id))]
    ['US-Tw3_input_plots.html']
>>> q.corrected
    False
>>> # writing files also ran corrections since they were not yet run
>>> q.write()
>>> q.corrected
    True

Now the respective daily and monthly time series have been written to QaQc.out_dir,

>>> [f.name for f in q.out_dir.glob('{}*'.format(q.site_id))]
    ['US-Tw3_daily_data.csv', 'US-Tw3_monthly_data.csv', 'US-Tw3_input_plots.html']
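Since the outputs are plain CSV files they can be reloaded with pandas; for example (the 'date' index column name follows the time series shown throughout this tutorial):

>>> # reload the daily output file written above
>>> daily = pd.read_csv(q.out_dir / 'US-Tw3_daily_data.csv', index_col='date', parse_dates=True)
>>> 'ET_corr' in daily.columns
    True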

Hint

You can override the default name of the output directory used to save the daily and monthly time series with the out_dir keyword argument to QaQc.write; this option keeps the location within the directory of the config file but changes the name, whereas to change the entire output directory path, adjust the QaQc.out_dir attribute directly. Also, the naming scheme of output files will use the user-defined names for all input variables.

Visualize resampled and corrected data

Similar to Data.plot, the QaQc.plot method creates a series of default time series and scatter plots of input and, in this case, calculated variables. The temporal frequency of plots from QaQc.plot will always be daily and monthly, and additional plots are created for validation of energy balance closure corrections; otherwise the same options, such as number of subplot columns, super title, subplot dimensions, output type, and output file path, are available. Similar to QaQc.write and QaQc.monthly_df, if the data has not yet been corrected, the plot method will correct it using the default parameters before creating the plots.

Here is an example of the default daily and monthly time series plots produced after running the Energy Balance Ratio closure correction:

>>> q = QaQc(d)
>>> q.plot(output_type='notebook', plot_width=700)
[Interactive Bokeh plot: default daily and monthly time series and scatter plots after the Energy Balance Ratio correction]