Tutorial
This tutorial demonstrates the most important features of the
flux-data-qaqc
Python package for management, analysis, and visualization
of eddy covariance time series data. It is recommended to read the
Installation and Configuration Options and Caveats tutorials before this one.
A Jupyter Notebook of this tutorial is available here.
Tip
Currently, the software does not include a command line interface therefore to use the software you must use Python, e.g. make your own scripts or use an interactive shell. However, you will see that common workflows can be accomplished with a few (5-10) lines of code and you can simply follow the templates given here to make custom scripts.
Description of example datasets
The data for this example comes from the “Twitchell Alfalfa” AmeriFlux eddy covariance flux tower site in California. The site is located in alfalfa fields and exhibits a mild Mediterranean climate with dry and hot summers, for more information on this site or to download data click here.
Loading input
The loading and management of input climatic data and metadata from a
config.ini file is done using the fluxdataqaqc.Data
object. In a
nutshell, a Data
object is created from a properly formatted config file
(see Setting up a config file) and has tools for parsing input climate
data, averaging input climate time series, accessing/managing metadata,
flag-based data filtering, and creating interactive visualizations of input
data.
There is only one argument to create a Data object, the path to the config.ini file:
>>> # imports for code snippets within tutorial
>>> import pandas as pd
>>> from fluxdataqaqc import Data, QaQc, Plot
>>> from bokeh.plotting import figure, show, ColumnDataSource
>>> from bokeh.models.formatters import DatetimeTickFormatter
>>> from bokeh.models import LinearAxis, Range1d
>>> # create a Data object from the config.ini file
>>> config_path = 'US-Tw3_config.ini'
>>> d = Data(config_path)
Attributes of a Data object
Below are some of the useful attributes of the Data
object and how
they may be used.
The full path to the config.ini file that was used to create the
Data
instance can be accessed, note that it will return a
system-depenedent pathlib.Path
object. E.g. on my Linux machine the
path is:
>>> d.config_file
PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/US-Tw3_config.ini')
On a Windows machine the path will have the appropriate backslashes.
Similarly to access the climate time series file:
>>> d.climate_file
PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/AMF_US-Tw3_BASE_HH_5-5.csv')
The Data.config
attribute is a configparser.ConfigParser
object,
it allows you to access metadata and data in the config file in multiple
ways and to modify them. In flux-data-qaqc
it is mainly used for
accessing information about the input data.
>>> # get a list of all entries in the METADATA section of the config.ini
>>> d.config.items('METADATA') # access the DATA section the same way
[('climate_file_path', 'AMF_US-Tw3_BASE_HH_5-5.csv'),
('station_latitude', '38.1159'),
('station_longitude', '-121.6467'),
('station_elevation', '-9.0'),
('missing_data_value', '-9999'),
('skiprows', '2'),
('date_parser', '%Y%m%d%H%M'),
('site_id', 'US-Tw3'),
('country', 'USA'),
('doi_contributor_name', 'Dennis Baldocchi'),
('doi_contributor_role', 'Author'),
('doi_contributor_email', 'baldocchi@berkeley.edu'),
('doi_contributor_institution', 'University of California, Berkeley'),
('doi_organization', 'California Department of Water Resources'),
('doi_organization_role', 'Sponsor'),
('flux_measurements_method', 'Eddy Covariance'),
('flux_measurements_variable', 'CO2'),
('flux_measurements_operations', 'Continuous operation'),
('site_name', 'Twitchell Alfalfa'),
('igbp', 'CRO'),
('igbp_comment',
'alfalfa is a fast growing leguminous crop raised for animal feed of low stature. It is planted in rows and typically reaches 60-70 cm in height prior to harvest.'),
('land_ownership', 'public'),
('network', 'AmeriFlux'),
('reference_paper',
'Baldocchi, D., Penuelas, J. (2018) The Physics And Ecology Of Mining Carbon Dioxide From The Atmosphere By Ecosystems, Global Change Biology, 45(), 9275–9287'),
('reference_doi', '10.1111/gcb.14559'),
('reference_usage', 'Reference'),
('research_topic',
'The research approach of the University of California, Berkeley Biometeorology Laboratory involves the coordinated use of experimental measurements and theoretical models to understand the physical, biological, and chemical processes that control trace gas fluxes between the biosphere and atmosphere and to quantify their temporal and spatial variations. The research objectives of the Mayberry Wetland, Twitchell Wetland, Sherman Island, Twitchell Island, Twitchell Alfalfa, and Twitchell Corn sites are as follows: 1) Describe differences in the fluxes of CO2, CH4, H2O, and energy between different land uses, 2) Understand the mechanisms controlling these fluxes, 3) Use ecosystem modeling to understand controls on these mechanisms under different environmental scenarios. These six sites were selected to capture a wide range of inundated conditions within the Sacramento-San Joaquin River Delta. The research focuses on the eddy covariance technique to measure CH4, CO2, H2O, and energy fluxes and works to combine measurements of both net fluxes and partitioned fluxes in order to achieve a mechanistic understanding of the ecological controls on current and future carbon flux in the Delta.'),
('terrain', 'Flat'),
('aspect', 'FLAT'),
('wind_direction', 'W'),
('surface_homogeneity', '370.0'),
('site_desc',
"The Twitchell Alfalfa site is an alfalfa field owned by the state of California and leased to third parties for farming. The tower was installed on May 24, 2013. This site and the surrounding region are part of the San Joaquin - Sacramento River Delta drained beginning in the 1850's and subsequently used for agriculture. The field has been alfalfa for X years…., Crop rotation occurs every 5-6 years. The site is harvested by mowing and bailing several times per year. The field is fallow typically between November and February. The site is irrigated by periodically-flooded ditches surrounding the field. The site is irrigated by raising, and subsequently lowering the water table??"),
('site_funding', 'California Department of Water Resources'),
('team_member_name', 'Joe Verfaillie'),
('team_member_role', 'Technician'),
('team_member_email', 'jverfail@berkeley.edu'),
('team_member_institution', 'University of California, Berkeley'),
('url_ameriflux', 'http://ameriflux.lbl.gov/sites/siteinfo/US-Tw3'),
('utc_offset', '-8'),
('mat', '15.6'),
('map', '421.0'),
('land_owner', 'California Department of Water Resources'),
('climate_koeppen', 'Csa'),
('doi', '10.17190/AMF/1246149'),
('doi_citation',
'Dennis Baldocchi (2013-) AmeriFlux US-Tw3 Twitchell Alfalfa, 10.17190/AMF/1246149'),
('doi_dataproduct', 'AmeriFlux'),
('team_member_address',
'Department of Environmental Science, Policy and Management, 137 Mulford Hall, 345 Hilgard Hall,Berkeley, CA USA 94720-3110'),
('url', 'http://nature.berkeley.edu/biometlab/sites.php?tab=US-Tw3'),
('dom_dist_mgmt', 'Agriculture'),
('site_snow_cover_days', '0.0'),
('state', 'CA'),
('location_date_start', '20130524.0'),
('acknowledgement',
'Biometeorology Lab, University of California, Berkeley, PI: Dennis Baldocchi')]
A useful method is the configparser.ConfigParser.get
which takes the
section of the config file and the “option” and returns the value:
>>> d.config.get(section='METADATA', option='site_name')
'Twitchell Alfalfa'
>>> # section and option are optional keywords
>>> d.config.get('METADATA', 'site_name')
'Twitchell Alfalfa'
Tip
If you are unsure if an entry or option exists in the config file, use the
fallback
keyword argument
>>> # section and option are optional keywords
>>> d.config.get('METADATA', 'site name', fallback='na')
'na'
Some metadata entries are added as Data
attributes for easier access
as they are used in multiple ways later, these include:
site_id\(^*\)
elevation\(^*\)
latitude\(^*\)
longitude\(^*\)
na_val
qc_threshold
qc_flag
\(^*\)mandatory METADATA entries in the config file, see Setting up a config file for further explanation.
View all the columns as found in the header row of the input time series climate file.
>>> d.header
array(['TIMESTAMP_START', 'TIMESTAMP_END', 'CO2', 'H2O', 'CH4', 'FC',
'FCH4', 'FC_SSITC_TEST', 'FCH4_SSITC_TEST', 'G', 'H', 'LE',
'H_SSITC_TEST', 'LE_SSITC_TEST', 'WD', 'WS', 'USTAR', 'ZL', 'TAU',
'MO_LENGTH', 'V_SIGMA', 'W_SIGMA', 'TAU_SSITC_TEST', 'PA', 'RH',
'TA', 'VPD_PI', 'T_SONIC', 'T_SONIC_SIGMA', 'SWC_1_1_1',
'SWC_1_2_1', 'TS_1_1_1', 'TS_1_2_1', 'TS_1_3_1', 'TS_1_4_1',
'TS_1_5_1', 'NETRAD', 'PPFD_DIF', 'PPFD_IN', 'PPFD_OUT', 'SW_IN',
'SW_OUT', 'LW_IN', 'LW_OUT', 'P', 'FC_PI_F', 'RECO_PI_F',
'GPP_PI_F', 'H_PI_F', 'LE_PI_F'], dtype='<U15')
Note
All of the header columns will not necessarily be loaded, only those
specified in the config file. Also, no data other than the header line is
loaded into memory when creating a Data
object, the time series data
is only loaded when calling Data.df
for increased efficiency for
some workflows involving only metadata.
Variable names and units
In flux-data-qaqc
there are two naming schemes for climate
variables, the names as defined by the column headers in the input time
series file and the internal names for some variables and calculated
variables created by the package. We will refer to these two sets as
“user-defined” and “internal” names hereforth.
The Data.variables
attribute maps the internal to user-defined
variable names:
>>> d.variables
{'date': 'TIMESTAMP_START',
'Rn': 'NETRAD',
'G': 'G',
'LE': 'LE_PI_F',
'H': 'H_PI_F',
'sw_in': 'SW_IN',
'sw_out': 'SW_OUT',
'lw_in': 'LW_IN',
'lw_out': 'LW_OUT',
'vpd': 'VPD_PI',
't_avg': 'T_SONIC',
'ws': 'WS',
'theta_1': 'SWC_1_1_1',
'theta_2': 'SWC_1_2_1'}
And, the Data.inv_map
maps the internal to user-defined names if
they differ, however this is only created once the data is loaded by
calling Data.df
.
>>> # a similar dictionary attribute for input units
>>> d.units
{'Rn': 'w/m2',
'G': 'w/m2',
'LE': 'w/m2',
'H': 'w/m2',
'sw_in': 'w/m2',
'sw_out': 'w/m2',
'lw_in': 'w/m2',
'lw_out': 'w/m2',
'vpd': 'hPa',
't_avg': 'C',
'ws': 'm/s',
'theta_1': '(%): Soil water content (volumetric), range 0-100',
'theta_2': '(%): Soil water content (volumetric), range 0-100'}
Accessing input data
The Data.df
property gves access to the time series input climate
data for columns specified in the config file as a datetime-indexed
pandas.DataFrame
object. This object has numerous powerful built in
tools for time series analysis and visualization.
>>> # first 5 datetimes that are not gaps
>>> d.df.dropna().head()
input_G | WS | VPD_PI | T_SONIC | SWC_1_1_1 | SWC_1_2_1 | NETRAD | SW_IN | SW_OUT | LW_IN | LW_OUT | H_PI_F | LE_PI_F | theta_mean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
date | ||||||||||||||
2013-05-24 12:30:00 | 122.194848 | 3.352754 | 18.853678 | 25.682739 | 6.6790 | 26.1655 | 652.648719 | 1027.756939 | 212.800000 | 300.363524 | 462.671744 | 95.487930 | 375.841436 | 16.42225 |
2013-05-24 13:00:00 | 108.054863 | 3.882154 | 18.560999 | 26.057700 | 6.7065 | 26.1600 | 629.990486 | 997.749437 | 209.933333 | 303.269447 | 461.095065 | 96.584383 | 371.619775 | 16.43325 |
2013-05-24 13:30:00 | 79.330662 | 4.646089 | 18.900260 | 26.067374 | 6.7120 | 26.1545 | 595.817687 | 954.988747 | 206.733333 | 303.017852 | 455.455579 | 84.066406 | 358.194935 | 16.43325 |
2013-05-24 14:00:00 | 52.366527 | 5.048825 | 20.440061 | 25.961307 | 6.7395 | 26.1325 | 549.039365 | 900.975244 | 201.333333 | 298.914731 | 449.517276 | 69.449710 | 406.528564 | 16.43600 |
2013-05-24 14:30:00 | 35.658417 | 5.302946 | 21.064824 | 25.954462 | 6.7450 | 26.1215 | 493.519695 | 833.458365 | 192.066667 | 296.791541 | 444.663544 | 47.774030 | 315.295309 | 16.43325 |
Tip
There are many tutorials on how to use the pandas.DataFrame
and its
powerful data analysis tools for multiple purposes online, to get started
you may want to visit Panda’s own list of tutorials here.
By default the column names in Data.df
are retained from
user-defined names unless they were named exactly the same as an
internal name. For example the input ground heat flux column in this
dataset is named “G”, therefore it was renamed as “input_g”
>>> d.df.columns
Index(['input_G', 'WS', 'VPD_PI', 'T_SONIC', 'SWC_1_1_1', 'SWC_1_2_1',
'NETRAD', 'SW_IN', 'SW_OUT', 'LW_IN', 'LW_OUT', 'H_PI_F', 'LE_PI_F',
'theta_mean'],
dtype='object')
>>> # the new name was also updated in Data.variables
>>> d.variables.get('G')
'input_G'
As stated earlier, Data.inv_map
maps the user-defined names to
internal flux-data-qaqc
names only after loading Data.df
:
>>> d.inv_map
{'TIMESTAMP_START': 'date',
'NETRAD': 'Rn',
'input_G': 'G',
'LE_PI_F': 'LE',
'H_PI_F': 'H',
'SW_IN': 'sw_in',
'SW_OUT': 'sw_out',
'LW_IN': 'lw_in',
'LW_OUT': 'lw_out',
'VPD_PI': 'vpd',
'T_SONIC': 't_avg',
'WS': 'ws',
'SWC_1_1_1': 'theta_1',
'SWC_1_2_1': 'theta_2'}
Tip
The Data.inv_map
is mainly used to rename the dataframe to internal
names, this can be very useful if you are creating your own custom workflows
using the flux-data-qaqc
API because it allows you to only know the
internal names of variables therefore they can be hard coded into your
workflow and applied to different eddy covariance datasets. For example,
let’s say we wanted to make HTML tables of basic statistics of just the
energy balance components for many datasets (that may have different names
for the same variables) and save the file using the user-defined names:
>>> d = Data('US-Tw3_config.ini')
>>> df = d.df.rename(columns=d.inv_map)
>>> # get some metadata for saving
>>> site_id = d.site_id
>>> vars_we_want = ['H', 'LE', 'Rn', 'G']
>>> # rename variables, calculate basice statistics table and save to HTML
>>> df[vars_we_want].rename(columns=d.variables).describe().to_html('{}.html'.format(site_id))
Calculating mean for var: THETA from columns: ['SWC_1_1_1', 'SWC_1_2_1']
WARNING: renaming column G to input_G
>>> # which produces the following HTML table with user-defined names:
>>> from IPython.display import HTML
>>> HTML(filename='{}.html'.format(site_id))
H_PI_F | LE_PI_F | NETRAD | input_G | |
---|---|---|---|---|
count | 88224.000000 | 88224.000000 | 75907.000000 | 83635.000000 |
mean | 11.406226 | 68.404030 | 101.920827 | 2.518969 |
std | 69.703895 | 99.070142 | 216.580113 | 28.046602 |
min | -526.371895 | -202.958294 | -123.084525 | -47.545875 |
25% | -31.398696 | 1.952121 | -61.986728 | -17.174822 |
50% | -8.100198 | 18.846609 | -10.864949 | -8.162624 |
75% | 38.495230 | 105.196206 | 239.810124 | 17.611898 |
max | 399.902074 | 603.686401 | 748.806624 | 152.888745 |
Another powerful feature of the Data.df
property is that it is
datetime-indexed using the input data’s temporal frequency, view the
date index like so:
>>> d.df.index
DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:30:00',
'2013-01-01 01:00:00', '2013-01-01 01:30:00',
'2013-01-01 02:00:00', '2013-01-01 02:30:00',
'2013-01-01 03:00:00', '2013-01-01 03:30:00',
'2013-01-01 04:00:00', '2013-01-01 04:30:00',
...
'2018-06-04 19:00:00', '2018-06-04 19:30:00',
'2018-06-04 20:00:00', '2018-06-04 20:30:00',
'2018-06-04 21:00:00', '2018-06-04 21:30:00',
'2018-06-04 22:00:00', '2018-06-04 22:30:00',
'2018-06-04 23:00:00', '2018-06-04 23:30:00'],
dtype='datetime64[ns]', name='date', length=95088, freq=None)
Datetime-indexed pandas.DataFrame
objects have useful features for
time series analysis like grouping and calculating statistics by time
aggregates. The example below shows how to calculate the day of year
mean for energy balance components, it also demonstrates how to use the
add_lines
plotting method available to Data
, QaQc
, and
Plot
objects.
>>> # convert to internal names, copy dataframe
>>> df = d.df.rename(columns=d.inv_map)
>>> # day of year mean of input energy balance components
>>> vars_we_want = ['H', 'LE', 'Rn', 'G']
>>> doy_means = df[vars_we_want].groupby(d.df.index.dayofyear).mean()
>>> # create a Bokeh figure
>>> fig = figure(x_axis_label='day of year', y_axis_label='day of year mean (w/m2)')
>>> # arguements needed for creating interactive plots
>>> plt_vars = vars_we_want
>>> colors = ['red', 'blue', 'black', 'green']
>>> x_name = 'date'
>>> source = ColumnDataSource(doy_means)
>>> Plot.add_lines(fig, doy_means, plt_vars, colors, x_name, source, labels=vars_we_want,
>>> x_axis_type=None)
>>> show(fig)
Note
The x_axis_type=None
is a unique argument to Plot.add_lines
and
Plot.line_plot
that in this case means to not try to force the
x-axis format to a datetime representation, default is
x_axis_type='date'
.
See also
Some routines occur automatically when creating a Data
object,
including calcuation of weighted and non-weighted averages of soil heat flux
and soil moisture which is described in Averaging data from multiple sensors.
Modifying input data
A last note on the Data
object (same goes for the QaQc
object)
is that Data.df
is a class property, in this case that means that it
can be reassigned with a different pandas.DataFrame
. This is
critical for manual pre-filtering and validation of data before
proceeding with energy balance closure routines. A simple example is
shown here:
>>> # add 5 to air temperature, this would effect ET calculations later
>>> x = d.df
>>> x['T_SONIC'] += 5
>>> d.df = x
A realistic use of the reassignability of the Data.df
and
QaQc.df
properties is shown in manual cleaning of poor quality
data.
See also
The Data.apply_qc_flags
method allows for reading in quality
control flags with the input data and filtering specific data out based on
user-defined numeric or character flags. This routine is specific to
Data
and includes several attributes that are added to a
Data
instance, for full explanation and examples see
Quality-based data filtering.
Visualize input data
The Data.plot
method create a series of interactive time series
plots of input data, potential plots inlcude:
energy balance components
radiation components
multiple soil heat flux measurements
air temperature
vapor pressure and vapor pressure deficit
wind speed
precipitation
latent energy
multiple soil moisture measurements
If any of these variables are not found the plot(s) will not be added.
The most useful interactive features of plots created by
flux-data-qaqc
are:
pan/zoom
hover tolltips on var names, values, date
linked x-axes on time series plots
save plot option (can save specific subplot zoomed in)
Here is an example,
>>> d.plot(output_type='notebook', plot_width=700)
The output plot is not shown in the online documentation due to memory constraints.
Hint
The plot methods of Data
and QaQc
objects have the keyword
argument output_type
which by default is set to “save”, the other two
options are “notebook” for showing within a Jupyter Notebook and “show”
which opens a temporary file in the default web browser.
If you rather save the plot, and maybe you want 2 columns of plots,
>>> d.plot(ncols=2, plot_width=500)
After saving a plot without specifying the output file path (keyword
argument out_file
), it will be saved to an “output” directory where
the config file is with the file name based on Data.site_id
with the
suffix “_input_plots”:
>>> # where the plot file was saved by default
>>> d.plot_file
PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/output/US-Tw3_input_plots.html')
The following plot is not shown due to excessive memory usage needed to build online documentation.
>>> # view outplot plots within Jupyter notebook
>>> from IPython.display import HTML
>>> HTML(filename=d.plot_file)
Hint
The QaQc.plot
method shown below is similar however it may include
added plots with calculated and corrected variables (if they exist) and will
always plot data in daily and monthly temporal frequency because daily
frequency is required before applying flux-data-qaqc
energy balance
closure corrections.
Temporal resampling
The QaQc
object holds several tools for managing data and eddy
covariance data analysis, but one of it’s primary features is temporal
resampling of input data to daily and monthly frequencies. The
resampling of time series data to daily frequency occurs upon the
creation of a QaQc
instance if the frequency within the preceeding
Data
object is not already daily:
>>> # the frequency of the input data is 30 minute
>>> d.df.index[0:5]
DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:30:00',
'2013-01-01 01:00:00', '2013-01-01 01:30:00',
'2013-01-01 02:00:00'],
dtype='datetime64[ns]', name='date', freq=None)
>>> # creating a QaQc instance will automatically convert to daily
>>> q = QaQc(d)
The input data temporal frequency appears to be less than daily.
Data is being resampled to daily temporal frequency.
Filtering days with less then 100.0% or 48/48 sub-daily measurements
Converting vpd from hpa to kpa
>>> # first 5 datetime indices are dates now
>>> q.df.index[0:5]
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05'],
dtype='datetime64[ns]', name='date', freq=None)
The method used for aggregating different variables, e.g. mean or sum,
when resampling to daily or monthly frequency is defined in the
QaQc.agg_dict
class attribute:
>>> # these are the internal names as keys and temporal aggregation method as values
>>> QaQc.agg_dict
{'energy': 'mean',
'flux': 'mean',
'flux_corr': 'mean',
'br': 'mean',
'ET': 'sum',
'ET_corr': 'sum',
'ET_gap': 'sum',
'ET_fill': 'sum',
'ET_fill_val': 'sum',
'ET_user_corr': 'sum',
'ebr': 'mean',
'ebr_corr': 'mean',
'ebr_user_corr': 'mean',
'ebr_5day_clim': 'mean',
'gridMET_ETr': 'sum',
'gridMET_prcp': 'sum',
'lw_in': 'mean',
't_avg': 'mean',
'rso': 'mean',
'sw_pot': 'mean',
'sw_in': 'mean',
'vp': 'mean',
'vpd': 'mean',
'ppt': 'sum',
'ws': 'mean',
'Rn': 'mean',
'sw_out': 'mean',
'lw_out': 'mean',
'G': 'mean',
'LE': 'mean',
'LE_corr': 'mean',
'LE_user_corr': 'mean',
'H': 'mean',
'H_corr': 'mean',
'H_user_corr': 'mean'}
Note
There are several calculated variables above that may not look familiar,
many are calculated by the energy balance closure correction routines and
described in Closure Methodologies. Also, any other variables (not
found in QaQc.agg_dict
that exist in a QaQc.df
before
accessing QaQc.monthly_df
the first time will be averaged in the
monthly time series dataframe (QaQc.monthly_df
).
The QaQc
constructor tries to infer the temporal frequency of the
input time series data, however the method is not always accurate, to
access the inferred initial temporal frequency of the data view the
QaQc.temporal_freq
attribute:
>>> q.temporal_freq
'30T'
If the inferred input frequency was accurate you will see a Pandas datetime
alias,
in this case ‘30T’ is thirty minutes. If the temporal frequency is not automatically detected you should be able to rely on the n_samples_per_day
instance attribute that is manually estimated by the QaQc
constructor:
>>> q.n_samples_per_day
48
Filter days with sub-daily gaps
The drop_gaps
and daily_frac
keyword arguments used when creating a QaQc
instance allow you to control how days with sub-daily measurement gaps will or will not be filtered out when resampling to daily frequency.
Sub-daily gaps in energy balance variables \(LE\), \(H\), \(Rn\), and \(G\) , and daily ASCE standaridized reference ET inputs, e.g. hourly \(ea\) (“vp”), \(rs\) (“sw_in”), \(t_min\), \(t_max\), and \(ws\), can be linearly interpolated automatically before daily aggregations. Interpolation is performed over gap lengths measured in hours, with options to control the longest length of gap to interpolate when \(Rn \ge 0\) controlled by the QaQc
keyword argument max_interp_hours
(default 2 hours) and the longest gap to interpolate when \(Rn < 0\) set by the max_interp_hours_night
(default 4 hours).:math:`
Important
By default the QaQc
constructor will first linearly interpolate
energy balance and ASCE ref. ET variables (\(LE\), \(H\),
\(Rn\), \(G\), \(ea\) (“vp”), \(rs\) (“sw_in”),
\(t_min\), \(t_max\), and \(ws\)) according to the maximum gap
lengths (max_interp_hours
and max_interp_hours_night
) and then count
sub-daily gaps and drop days (set values to null) for all climate data
columns (not QC flag or sub-daily gap count columns) where any of the
sub-daily data are missing because by default drop_gaps=True
and
daily_frac=1.0
. In other words, if you have hourly input data
for(\(LE\) and one hour was missing on a given day, by default that hour
will be linearly interpolated before calculating the daily time series and
the daily mean will be calculated after. On the other hand, if other climate
variables had a single hour missing on a given day, e.g. wind direction,
this day would be filtered out by the QaQc
constructor. This is
important because the daily time series is what is used in all energy
balance closure correction and daily ASCE standardized reference ET
algorithms.
The percentage of sub-daily samples to require set by the daily_frac
argument and the maximum length of gaps to linearly interpolate set by max_interp_hours
and max_interp_hours_night
complement each other and are used in tandem. For example, if the input data is half-hourly and you only want a maximum of 4 hours to be interpolated on any given day and gap lengths to interpolate should be no more than 2 hours each then you would pass the following parameters to the QaQc
constructor:
>>> q = QaQc(d, daily_frac=20/24, max_interp_hours=2, max_interp_hours_night=2)
The input data temporal frequency appears to be less than daily.
Data is being resampled to daily temporal frequency.
Linearly interpolating gaps in energy balance components up to 2 hours when Rn < 0 and up to 2 hours when Rn >= 0.
Filtering days with less then 83.33333333333334% or 40/48 sub-daily measurements
In this case we set daily_frac=20/24
because we are only allowing a maximum of 4 hours of total gaps in the day in other words we are requiring 40 of the 48 half hourly samples to exist before we filter out a day. Remember, because linear interpolation of gaps is done before counting sub-daily gaps, this could result in retaining days with more than 4 hours of gaps in the original time series of energy balance components. You may also pass the daily_frac
arugment as a decimal fraction, e.g. \(0.8333 \approx 20/24\).
To not drop any days and take daily means/sums based on whatever data exists in a given day without any interpolation of energy balance variables,
>>> q = QaQc(d, drop_gaps=False, max_interp_hours=None)
The input data temporal frequency appears to be less than daily.
Data is being resampled to daily temporal frequency.
Let’s view a comparison of \(Rn\) using different options of filtering days with sub-daily gaps in the working dataset, because it has several periods of systematic gaps which cause upwards skewing of daily mean \(Rn\) if not filtered carefully:
>>> # make an empty pandas dataframe for Rn series
>>> Rn_df = pd.DataFrame()
>>> # recreate multiplt QaQc instances using different sub-day gap filters
>>> q = QaQc(d, drop_gaps=False, max_interp_hours=None)
>>> Rn_df['sub_day_gaps'] = q.df.Rn_subday_gaps
>>> Rn_df['no_filter_no_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, drop_gaps=False)
>>> Rn_df['no_filter_with_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=0.5) # filter days with less than 50% data
>>> Rn_df['require_50'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=0.75)
>>> Rn_df['require_75'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=1, max_interp_hours=24, max_interp_hours_night=24)
>>> Rn_df['require_100_with_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> q = QaQc(d, daily_frac=1, max_interp_hours=None)
>>> Rn_df['require_100_no_interp'] = q.df.rename(columns=q.inv_map).Rn
>>> # plot to compare results of day-gap filter
>>> fig = figure(x_axis_label='date', y_axis_label='mean daily net radiation (w/m2), filtered based on sub-daily gaps')
>>> # arguments needed for creating interactive line plots
>>> colors = ['red', 'darkred','orange', 'blue', 'black', 'tan']
>>> plt_vars = ['no_filter_no_interp', 'no_filter_with_interp', 'require_50', 'require_75', 'require_100_with_interp', 'require_100_no_interp']
>>> labels = ['no filter wout/interp.', 'no filter w/interp.', 'require > 50% w/interp.', 'require > 75% w/interp.', 'require 100% w/interp.', 'require 100% wout/interp.']
>>> x_name = 'date'
>>> source = ColumnDataSource(Rn_df)
>>> Plot.add_lines(fig, Rn_df, plt_vars, colors, x_name, source, labels=labels)
>>> # add daily gap counts to secondary y
>>> fig.extra_y_ranges['gap_counts'] = Range1d(start=0, end=48)
>>> fig.add_layout(LinearAxis(y_range_name='gap_counts', axis_label='number of sub-daily gaps'), 'right')
>>> fig.circle('date', 'sub_day_gaps', legend_label='n sub-day gaps', y_range_name='gap_counts',
>>> color='silver', source=source
>>> )
>>> fig.hover[0].tooltips.append(('sub_day_gaps','@{}'.format('sub_day_gaps')))
>>> fig.legend.location = 'top_right'
>>> show(fig)
Try zooming in on the gaps filled by the “no filter wout/interp.” line to compare which days are retained/filtered by different options, also remove lines by clicking on them in the legend to compare subsets of options.
Tip
For a more fine-grained approach to filtering out days where perhaps multiple 2 hour gaps were filled use the newly created daily gap count columns: “LE_subday_gaps”, “H_subday_gaps”, “Rn_subday_gaps”, and “G_subday_gaps”:
>>> q = QaQc(d)
>>> df = q.df.rename(columns=q.inv_map)
For example, you could post-filter out days in any given energy balance variable, in this case \(Rn\) where sub-daily gaps exceed a threshold:
>>> df.loc[(df.Rn_subday_gaps > 4) & (df.Rn.notna()), ['Rn','Rn_subday_gaps']]
Rn Rn_subday_gaps date 2015-06-09 101.710194 5.0 2015-11-20 47.990988 5.0 2016-01-15 72.495973 8.0 2018-01-06 79.507008 7.0 2018-05-10 160.997332 6.0
Monthly time series
The QaQc.monthly_df
property allows for creating the monthly time
series of input anc calculated variables provided by
QaQc.correct_data
. It uses the same temporal aggregation methods as
the daily time series i.e. from QaQc.agg_dict
. Although there are
many similarities there are important differences between QaQc.df
and QaQc.monthly_df
other than the obvious: when accessing the
QaQc.monthly_df
it will automatically run the default energy balance
closure correction routine provided by QaQc.correct_data
if it has
not yet been run. You can check if it has been run at anytime by:
>>> q.corrected
False
To show how this works let’s access the monthly data and show the monthly statistics of the “corrected” evapotranspiration (ET_corr):
>>> # first note, ET_corr is not in the dataset yet
>>> 'ET_corr' in q.df.columns
False
Now access the monthly time series,
>>> q.monthly_df;
>>> 'ET_corr' in q.df.columns
True
By calling the monthly dataframe, the energy balance closure was applied automatically
>>> q.monthly_df.ET_corr.describe()
count 61.000000
mean 87.858135
std 49.938287
min 11.370062
25% 41.418994
50% 84.383190
75% 127.500125
max 192.033481
Name: ET_corr, dtype: float64
>>> q.corrected
True
Note
The QaQc.monthly_df
also filters out months with less than 30% of
days of the month missing by default. To calculate monthly time series with
other threshold fractions of days required use the
util.monthly_resample
function and adjust the keyword argument
thresh
which is the fraction (0-1) of days of the month required to not
be gaps otherwise the month’s value will be forced to null, e.g. if you
wanted to caclulate the monthly mean air temperature requiring 30 and 90
percent of the days in the month to not be gaps:
>>> from fluxdataqaqc.util import monthly_resample
>>> # select just t_avg for example
>>> cols = ['t_avg']
>>> df = q.df.rename(columns=q.inv_map)
>>> # create temporary df with different monthly resample results
>>> tmp_df = monthly_resample(df, cols, 'mean', thresh=0.9).rename(
>>> columns={'t_avg': 'thresh_90'}
>>> )
>>> # join temp dataframe with monthly resample results using different thresh
>>> monthly_gap_comp = tmp_df.join(monthly_resample(df, cols, 'mean', thresh=0.3).rename(
>>> columns={'t_avg': 'thresh_30'})
>>> )
>>> # plot to compare results of day-gap filter
>>> fig = figure(x_axis_label='date', y_axis_label='monthy mean air temperature (C), filtered based on daily gaps')
>>> # arguments needed for creating interactive line plots
>>> x = 'date'
>>> source = ColumnDataSource(monthly_gap_comp)
>>> # this example also shows how to use other Bokeh plot arguments
>>> Plot.line_plot(fig,'date','thresh_30',source,'red',label='require > 30%', line_alpha=0.5)
>>> Plot.line_plot(fig,'date','thresh_90',source,'black',label='require > 90%',line_dash='dotted', line_width=2)
>>> fig.legend.location = 'top_right'
>>> show(fig)
Energy balance corrections
flux-data-qaqc
provides routines that adjust surface energy balance fluxes
to improve energy balance closure of eddy covariance flux station data.
These routines ultimately result in a corrected daily and monthly time series
of latent energy, sensible heat, and evapotranspiration with the option to
gap-fill days in corrected ET with ET calculated from gridMET reference ET and
fraction of reference ET.
There are two methods currently implemented:
Energy Balance Ratio method (default), modified from the FLUXNET method
Bowen Ratio approach (forces closure)
Multiple least squares regression - user defines LHS and RHS from \(LE\), \(H\), \(Rn\), and \(G\),
Detailed descriptions of methods including daily ET gap-filling methods can be found in the online documentation Closure Methodologies page. A few important notes on the API of these methods and other hydro-climatic statistical variables that are calculated are shown below.
The QaQc.correct_data
method is used to run energy balance closure
corrections. Here are a few tips on using them,
>>> # note above the monthly_df automatically applied the 'ebr' Energy Balance Ratio correction
>>> q.corr_meth
'ebr'
>>> # potential correction options
>>> q.corr_methods
('ebr', 'br', 'lin_regress')
>>> # to specify the Bowen Raito method:
>>> q.correct_data(meth='br')
>>> # the most recently used correction method is now shown
>>> q.corr_meth
'br'
Tip
After applying any energy balance closure correction routine all previous
corrected variables will be overwritten or dropped in QaQc.df
,
QaQc.monthly_df
, and QaQc.variables
, therefore to make a comparison
of different methods on the same data make a copy of the df
or
monthly_df
properties before running the next correction, e.g.
>>> # make copies of daily results of different correction options
>>> q.correct_data(meth='ebr')
>>> ebr_gapfilled = q.df
>>> q.correct_data(meth='ebr', etr_gap_fill=False)
>>> ebr_notgapfilled = q.df
>>> q.correct_data(meth='br')
>>> br_gapfilled = q.df
>>> q.correct_data(meth='br', etr_gap_fill=False)
>>> br_notgapfilled = q.df
ET gap-filling
A few notes on the option that uses reference ET and fraction of daily
reference ET to fill in large gaps in corrected ET, i.e. the keyword
argument QaQc.correct_data(etr_gap_fill = True)
.
The nearest gridMET cell’s time series data for precipitation and alfalfa reference ET is attempted to be downloaded if it is not found in the
gridmet_file_path
entry of the config.ini file.If the path to a gridMET file is not found it is re-downloaded, the config file will be updated with the new path and resaved.
Only the overlapping time period that matches the eddy covariance time series data is attempted to be downloaded, i.e. the period in
QaQc.df.index
.When a gridMET file is downloaded it will always be saved in a subdirectory where the config file is located called “gridMET_data” and named using the
QaQc.site_id
and gridMET cell centroid latitude and longitude.Corrected latent energy (\(LE_{corr}\)) gaps are also backwards filled from gap-filled ET.
Caution
gridMET only exists within
the contiguous United States and from 1979 to present, therefore if your
station lies outside of this region or you are analyzing eddy flux data
recorded before 1979 this option will not be ususable and you should always
run corrections with etr_gap_fill=False
to avoid potential errors.
The Bowen Ratio correction method will produce the ‘br’ variable which is the Bowen Ratio.
Other calculations
By default, QaQc.correct_data
also calculates ET from input latent
energy (LE) and air temperature, corrected ET from corrected LE and air
temperature, potential clear sky radiation (ASCE formulation), and the
Data
object attempts to calculate vapor pressure deficit from vapor
pressure and air temperature or vapor pressure from vapor pressure
deficit and air temperature if they exist at hourly or shorter temporal
frequency.
Evapotranspiration
The evapotranspiration (ET) calculations are described in Steps 7 and 8 correct turbulent fluxes, EBR, and ET of the Energy Balance Ratio correction explanation.
ASCE clear sky radiation
Daily ASCE potential clear sky radiation (\(R_{so}\)) is calculated using equation 19 in the “ASCE Standardized Reference Evapotranspiration Equation” final report by the Task Committee on Standardization of Reference Evapotranspiration Environmental and Water Resources Institute of the American Society of Civil Engineers January, 2005 here. This calculation is a simple method based primarily on elevation and latitude which results in a theoretical envelope of \(R_{so}\) as a function of the day of year,
where \(z\) is elevation in meters and \(R_a\) is daily extraterrestrial radiation (radiation with in the absence of an atmosphere), which itself is a well-behaved function of solar declination, the day of the year and the solar constant (see equations 21-29 in the ASCE report).
Vapor pressure/deficit
The Data
object will attempt to calculate vapor pressure or vapor
pressure deficit if one exists but not the other and average air
temperature time series also exists with the input data at hourly or
shorter temporal frequency. The Tetens equation (eqn. 37 in the ASCE report)
s an accurate approximation for saturation vapor pressure (\(es\)) in kPa as a function of air temperature,
where \(T\) is average hourly air temperature in degrees celcius. Vapor pressure deficit (\(vpd\)) is,
where \(ea\) is actual vapor pressure in kPa. Note, The
equations above are defined for hourly measurements however they are used for
hourly or shorter mean variables (\(T\), \(ea\), or \(vpd\))
within flux-data-qaqc
and then converted to daily means, if they are
not present in the input data at hourly or shorter frequencies then they
are not calculated.
These equations can be rearanged to solve for either \(es\) or \(vpd\) given the other variable and air temperature. For example, if given \(T\) and \(vpd\), then to get actual vapor pressure
In flux-data-qaqc
actual vapor pressure is named “vp” not “ea”. Also, during these calculations, if relative humidity is not found in the input dataset then it will subsequently be estimated as
Hint
The same calculations are available at the daily timestep but are not
automatically applied as the hourly or higher temporal frequency calculation
is preffered. To apply the estimates of vapor pressure or vapor pressure
deficit, and saturation vapor pressure, and relative humidity with daily data
one must call the QaQc._calc_vpd_from_vp
method from a QaQc
instance.
Legend for calculated variable names
Atmospheric and related variables created by energy balance closure corrections
are described in Closure Methodologies and other calculations can be
found throughout this documentation. Below is a reference table with basic
descriptions of all variables that are calculated or renamed by flux-data-qaqc
using its standardized naming scheme. The names found here (in the “Variable”
column) are the default names that are saved to the header in both daily and
monthly time series output CSV files as well as what are shown in the
interactive plots upon hovering with a cursor. Note, not all of the following
variables will be estimated by flux-data-qaqc
in all scenarios, the list of
variables that may be found in output files is completely dependent on the
input data provided in the configuration file and also dependend on which
calculations are used. For example, ASCE_ETo will only exist if the hourly or
daily methods for computing ASCE standardized reference ET are used and the
required input variables exist.
Variable
Description
Unit
ASCE_ETo
short/grass ASCE standardized reference ET
mm time-1
ASCE_ETr
tall/alfalfa ASCE standardized reference ET
mm time-1
br
bowen ratio
unitless
ebc_cf
energy balance closure correction factor (inverse of ebr_corr)
unitless
ebr
input energy balance ratio
unitless
ebr_5day_clim
5 day climatology of the filtered Energy Balance Ratio
unitless
ebr_corr
corrected energy balance ratio
unitless
energy
input Rn - G
w m-2
es
saturation vapor pressure
kPa
ET
ET calculated from input LE and average air temperature
mm time-1
ET_corr
ET calculated from LE_corr
mm time-1
ET_fill
gridMET_ETr * ETrF_filtered (fills gaps in ET_corr)
mm time-1
ET_fill_val
value of ET_fill on gap days
mm time-1
ET_gap
True on gap days in ET_corr, False otherwise
unitless
ET_user_corr
corrected ET, user-provided
mm time-1
EToF
fraction of reference ET for ET_corr, i.e. ET_corr / gridMET_ETo
unitless
EToF_filtered
filtered and gap-filled EToF
unitless
ETrF
fraction of reference ET for ET_corr, i.e. ET_corr / gridMET_ETr
unitless
ETrF_filtered
filtered and gap-filled EtrF
unitless
flux
input LE + H
w m-2
flux_corr
LE_corr + H_corr
w m-2
G
average or single soil heat flux
w m-2
G_[1,2,3,…]
soil heat flux at sensor
w m-2
G_subday_gaps
number of gaps in initial G per day
unitless
gridMET_ETo
gridMET short/grass reference ET (nearest cell)
mm time-1
gridMET_ETr
gridMET tall/alfalfa reference ET (nearest cell)
mm time-1
gridMET_prcp
gridMET precipitation (nearest cell)
mm time-1
gridMET_[other]
other optional gridMET variables (nearest cell)
NA
H_corr
corrected sensible heat
w m-2
H_subday_gaps
number of gaps in initial H per day
unitless
H_user_corr
corrected sensible heat, user-provided
w m-2
LE_corr
corrected latent energy
w m-2
LE_subday_gaps
number of gaps in initial LE per day
unitless
LE_user_corr
corrected latent energy, user-provided
w m-2
lw_in
incoming longwave radiation
w m-2
lw_out
outgoing longwave radiation
w m-2
ppt
precipitation
mm time-1
rh
relative humidity
%
Rn
net radiation
w m-2
Rn_subday_gaps
number of gaps in initial Rn per day
unitless
rso
clear sky radiation (ASCE formulation)
w m-2
sw_in
incoming shortwave radiation
w m-2
sw_out
outgoing shortwave radiation
w m-2
sw_pot
potential shortwave radiation, user-provided
w m-2
t_avg
average temperature
C
t_dew
dew point temperature
C
t_max
maximum temperature
C
t_min
minimum temperature
C
theta
average or single soil moisture
user defined
theta_[1,2,3,…]
soil moisture at sensor
user defined
vp
actual vapor pressure
kPa
vpd
vapor pressure deficit
kPa
wd
wind direction
user defined
ws
wind speed
m s-1
A note on units
Upon creation of a QaQc
object, variables are checked for valid input
units and converted to required units needed for internal calculations when
running QaQc.correct_data
and for certain default plots (see below).
For a list of valid input units for different variables refer to the
QaQc.allowable_units
attribute:
>>> q.allowable_units
{'LE': ['w/m2'],
'H': ['w/m2'],
'Rn': ['w/m2'],
'G': ['w/m2'],
'lw_in': ['w/m2'],
'lw_out': ['w/m2'],
'sw_in': ['w/m2'],
'sw_out': ['w/m2'],
'ppt': ['mm', 'in'],
'vp': ['kpa', 'hpa'],
'vpd': ['kpa', 'hpa'],
't_avg': ['c', 'f']}
For each variable above, if given one of the units allowable the units will automatically be converted to the required units.
To know which variables are required to be in particular units view
Qc.required_units
:
>>> q.required_units
{'LE': 'w/m2',
'H': 'w/m2',
'Rn': 'w/m2',
'G': 'w/m2',
'lw_in': 'w/m2',
'lw_out': 'w/m2',
'sw_in': 'w/m2',
'sw_out': 'w/m2',
'ppt': 'mm',
'vp': 'kpa',
'vpd': 'kpa',
't_avg': 'c'}
Note
The list of allowable units is a work in progress, if your input units are
not available consider raising an issue on GitHub or providing the
conversion directly with a pull request. Automatic unit conversions are
handled within the util.Convert
class using the
util.Convert.convert
class method.
Save resampled and corrected data
The QaQc.write
method conveniently writes daily and monthly time
series of input and calculated variables to comma separated value (CSV)
files. If the QaQc.correct_data
method has not yet been run it will
be and the monthly time series will also be created using the default
parameters for the correction routine (Energy Balance Ratio method with
ETr-based gap filling).
The default output directory for time series files can be
accessed/changed by the out_dir
attribute, if not changed it will be
located in the same directory of the config.ini file. The daily and
monthly time series file names will begin with the QaQc.site_id
followed by “daily_data” or “monthly_data” resepctively. For example,
>>> # new QaQc instance
>>> q = QaQc(d)
>>> # a platform dependent pathlib.Path object
>>> q.out_dir
PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/output')
The line below shows that no output files have been written to
QaQc.out_dir
yet,
>>> # print files in output directory that begin with the site_id
>>> [f.name for f in q.out_dir.glob('{}*'.format(q.site_id))]
['US-Tw3_input_plots.html']
>>> q.corrected
False
>>> # writing files also ran corrections since they were not yet run
>>> q.write()
>>> q.corrected
True
Now the respective daily and monthly time series have been written to
QaQc.out_dir
,
>>> [f.name for f in q.out_dir.glob('{}*'.format(q.site_id))]
['US-Tw3_daily_data.csv', 'US-Tw3_monthly_data.csv', 'US-Tw3_input_plots.html']
Hint
You can overwrite the default name of the output directory to save the daily
and monthly time series using the out_dir
keyword argument to
QaQc.write
, this option keeps the location within the directory of
the config file but just changes the name, whereas to change the entire
output directory path adjust the QaQc.out_dir
attribute directly.
Also, the naming scheme of output files created will use user-defined names
for all input variables.
Visualize resampled and corrected data
Similar to the Data.plot
, the QaQc.plot
method creates a
series of default time series and scatter plots of input and in this
case calculated variables. The temporal frequency of plots from
QaQc.plot
will always be daily and monthly and additional plots are
created for validation of energy balance closure corrections, otherwise the
same options such as number of subplot columns, super title, subplot
dimensions, output type, output file path, etc. are available. Similar to
QaQc.write
and QaQc.monthly_df
, if the data has not yet been
corrected the plot
method will correct it using the default parameters
before creating the plots.
Here is an example of the default daily and monthly time series plots produced after running the Energy Balance Ratio closure correction:
>>> q = QaQc(d)
>>> q.plot(output_type='notebook', plot_width=700)