Tutorial ======== This tutorial demonstrates the most important features of the ``flux-data-qaqc`` Python package for management, analysis, and visualization of eddy covariance time series data. It is recommended to read the :ref:`Installation` and :ref:`Configuration Options and Caveats` tutorials before this one. A Jupyter Notebook of this tutorial is available `here `__. .. Tip:: Currently, the software does not include a command line interface therefore to use the software you must use Python, e.g. make your own scripts or use an interactive shell. However, you will see that common workflows can be accomplished with a few (5-10) lines of code and you can simply follow the templates given here to make custom scripts. Description of example datasets ------------------------------- The data for this example comes from the “Twitchell Alfalfa” AmeriFlux eddy covariance flux tower site in California. The site is located in alfalfa fields and exhibits a mild Mediterranean climate with dry and hot summers, for more information on this site or to download data click `here `__. Loading input ------------- The loading and management of input climatic data and metadata from a config.ini file is done using the :obj:`fluxdataqaqc.Data` object. In a nutshell, a :obj:`.Data` object is created from a properly formatted config file (see :ref:`Setting up a config file`) and has tools for parsing input climate data, averaging input climate time series, accessing/managing metadata, flag-based data filtering, and creating interactive visualizations of input data. There is only one argument to create a Data object, the path to the config.ini file: >>> # imports for code snippets within tutorial >>> import pandas as pd >>> from fluxdataqaqc import Data, QaQc, Plot >>> from bokeh.plotting import figure, show, ColumnDataSource >>> from bokeh.models.formatters import DatetimeTickFormatter >>> from bokeh.models import LinearAxis, Range1d >>> # create a Data object from the config.ini file >>> config_path = 'US-Tw3_config.ini' >>> d = Data(config_path) Attributes of a Data object ^^^^^^^^^^^^^^^^^^^^^^^^^^^ Below are some of the useful attributes of the :obj:`.Data` object and how they may be used. The full path to the config.ini file that was used to create the :obj:`.Data` instance can be accessed, note that it will return a system-depenedent :obj:`pathlib.Path` object. E.g. on my Linux machine the path is: >>> d.config_file PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/US-Tw3_config.ini') On a Windows machine the path will have the appropriate backslashes. Similarly to access the climate time series file: >>> d.climate_file PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/AMF_US-Tw3_BASE_HH_5-5.csv') The :attr:`.Data.config` attribute is a :obj:`configparser.ConfigParser` object, it allows you to access metadata and data in the config file in multiple ways and to modify them. In ``flux-data-qaqc`` it is mainly used for accessing information about the input data. >>> # get a list of all entries in the METADATA section of the config.ini >>> d.config.items('METADATA') # access the DATA section the same way [('climate_file_path', 'AMF_US-Tw3_BASE_HH_5-5.csv'), ('station_latitude', '38.1159'), ('station_longitude', '-121.6467'), ('station_elevation', '-9.0'), ('missing_data_value', '-9999'), ('skiprows', '2'), ('date_parser', '%Y%m%d%H%M'), ('site_id', 'US-Tw3'), ('country', 'USA'), ('doi_contributor_name', 'Dennis Baldocchi'), ('doi_contributor_role', 'Author'), ('doi_contributor_email', 'baldocchi@berkeley.edu'), ('doi_contributor_institution', 'University of California, Berkeley'), ('doi_organization', 'California Department of Water Resources'), ('doi_organization_role', 'Sponsor'), ('flux_measurements_method', 'Eddy Covariance'), ('flux_measurements_variable', 'CO2'), ('flux_measurements_operations', 'Continuous operation'), ('site_name', 'Twitchell Alfalfa'), ('igbp', 'CRO'), ('igbp_comment', 'alfalfa is a fast growing leguminous crop raised for animal feed of low stature. It is planted in rows and typically reaches 60-70 cm in height prior to harvest.'), ('land_ownership', 'public'), ('network', 'AmeriFlux'), ('reference_paper', 'Baldocchi, D., Penuelas, J. (2018) The Physics And Ecology Of Mining Carbon Dioxide From The Atmosphere By Ecosystems, Global Change Biology, 45(), 9275–9287'), ('reference_doi', '10.1111/gcb.14559'), ('reference_usage', 'Reference'), ('research_topic', 'The research approach of the University of California, Berkeley Biometeorology Laboratory involves the coordinated use of experimental measurements and theoretical models to understand the physical, biological, and chemical processes that control trace gas fluxes between the biosphere and atmosphere and to quantify their temporal and spatial variations. The research objectives of the Mayberry Wetland, Twitchell Wetland, Sherman Island, Twitchell Island, Twitchell Alfalfa, and Twitchell Corn sites are as follows: 1) Describe differences in the fluxes of CO2, CH4, H2O, and energy between different land uses, 2) Understand the mechanisms controlling these fluxes, 3) Use ecosystem modeling to understand controls on these mechanisms under different environmental scenarios. These six sites were selected to capture a wide range of inundated conditions within the Sacramento-San Joaquin River Delta. The research focuses on the eddy covariance technique to measure CH4, CO2, H2O, and energy fluxes and works to combine measurements of both net fluxes and partitioned fluxes in order to achieve a mechanistic understanding of the ecological controls on current and future carbon flux in the Delta.'), ('terrain', 'Flat'), ('aspect', 'FLAT'), ('wind_direction', 'W'), ('surface_homogeneity', '370.0'), ('site_desc', "The Twitchell Alfalfa site is an alfalfa field owned by the state of California and leased to third parties for farming. The tower was installed on May 24, 2013. This site and the surrounding region are part of the San Joaquin - Sacramento River Delta drained beginning in the 1850's and subsequently used for agriculture. The field has been alfalfa for X years…., Crop rotation occurs every 5-6 years. The site is harvested by mowing and bailing several times per year. The field is fallow typically between November and February. The site is irrigated by periodically-flooded ditches surrounding the field. The site is irrigated by raising, and subsequently lowering the water table??"), ('site_funding', 'California Department of Water Resources'), ('team_member_name', 'Joe Verfaillie'), ('team_member_role', 'Technician'), ('team_member_email', 'jverfail@berkeley.edu'), ('team_member_institution', 'University of California, Berkeley'), ('url_ameriflux', 'http://ameriflux.lbl.gov/sites/siteinfo/US-Tw3'), ('utc_offset', '-8'), ('mat', '15.6'), ('map', '421.0'), ('land_owner', 'California Department of Water Resources'), ('climate_koeppen', 'Csa'), ('doi', '10.17190/AMF/1246149'), ('doi_citation', 'Dennis Baldocchi (2013-) AmeriFlux US-Tw3 Twitchell Alfalfa, 10.17190/AMF/1246149'), ('doi_dataproduct', 'AmeriFlux'), ('team_member_address', 'Department of Environmental Science, Policy and Management, 137 Mulford Hall, 345 Hilgard Hall,Berkeley, CA USA 94720-3110'), ('url', 'http://nature.berkeley.edu/biometlab/sites.php?tab=US-Tw3'), ('dom_dist_mgmt', 'Agriculture'), ('site_snow_cover_days', '0.0'), ('state', 'CA'), ('location_date_start', '20130524.0'), ('acknowledgement', 'Biometeorology Lab, University of California, Berkeley, PI: Dennis Baldocchi')] A useful method is the :meth:`configparser.ConfigParser.get` which takes the section of the config file and the “option” and returns the value: >>> d.config.get(section='METADATA', option='site_name') 'Twitchell Alfalfa' >>> # section and option are optional keywords >>> d.config.get('METADATA', 'site_name') 'Twitchell Alfalfa' .. Tip:: If you are unsure if an entry or option exists in the config file, use the ``fallback`` keyword argument >>> # section and option are optional keywords >>> d.config.get('METADATA', 'site name', fallback='na') 'na' Some metadata entries are added as :obj:`.Data` attributes for easier access as they are used in multiple ways later, these include: - site_id\ :math:`^*` - elevation\ :math:`^*` - latitude\ :math:`^*` - longitude\ :math:`^*` - na_val - qc_threshold - qc_flag :math:`^*`\ mandatory **METADATA** entries in the config file, see :ref:`Setting up a Config File` for further explanation. View all the columns as found in the header row of the input time series climate file. >>> d.header array(['TIMESTAMP_START', 'TIMESTAMP_END', 'CO2', 'H2O', 'CH4', 'FC', 'FCH4', 'FC_SSITC_TEST', 'FCH4_SSITC_TEST', 'G', 'H', 'LE', 'H_SSITC_TEST', 'LE_SSITC_TEST', 'WD', 'WS', 'USTAR', 'ZL', 'TAU', 'MO_LENGTH', 'V_SIGMA', 'W_SIGMA', 'TAU_SSITC_TEST', 'PA', 'RH', 'TA', 'VPD_PI', 'T_SONIC', 'T_SONIC_SIGMA', 'SWC_1_1_1', 'SWC_1_2_1', 'TS_1_1_1', 'TS_1_2_1', 'TS_1_3_1', 'TS_1_4_1', 'TS_1_5_1', 'NETRAD', 'PPFD_DIF', 'PPFD_IN', 'PPFD_OUT', 'SW_IN', 'SW_OUT', 'LW_IN', 'LW_OUT', 'P', 'FC_PI_F', 'RECO_PI_F', 'GPP_PI_F', 'H_PI_F', 'LE_PI_F'], dtype='>> d.variables {'date': 'TIMESTAMP_START', 'Rn': 'NETRAD', 'G': 'G', 'LE': 'LE_PI_F', 'H': 'H_PI_F', 'sw_in': 'SW_IN', 'sw_out': 'SW_OUT', 'lw_in': 'LW_IN', 'lw_out': 'LW_OUT', 'vpd': 'VPD_PI', 't_avg': 'T_SONIC', 'ws': 'WS', 'theta_1': 'SWC_1_1_1', 'theta_2': 'SWC_1_2_1'} And, the :attr:`.Data.inv_map` maps the internal to user-defined names if they differ, however this is only created once the data is loaded by calling :attr:`.Data.df`. >>> # a similar dictionary attribute for input units >>> d.units {'Rn': 'w/m2', 'G': 'w/m2', 'LE': 'w/m2', 'H': 'w/m2', 'sw_in': 'w/m2', 'sw_out': 'w/m2', 'lw_in': 'w/m2', 'lw_out': 'w/m2', 'vpd': 'hPa', 't_avg': 'C', 'ws': 'm/s', 'theta_1': '(%): Soil water content (volumetric), range 0-100', 'theta_2': '(%): Soil water content (volumetric), range 0-100'} Accessing input data ^^^^^^^^^^^^^^^^^^^^ The :py:attr:`Data.df` property gves access to the time series input climate data for columns specified in the config file as a datetime-indexed :obj:`pandas.DataFrame` object. This object has numerous powerful built in tools for time series analysis and visualization. >>> # first 5 datetimes that are not gaps >>> d.df.dropna().head() .. raw:: html
input_G WS VPD_PI T_SONIC SWC_1_1_1 SWC_1_2_1 NETRAD SW_IN SW_OUT LW_IN LW_OUT H_PI_F LE_PI_F theta_mean
date
2013-05-24 12:30:00 122.194848 3.352754 18.853678 25.682739 6.6790 26.1655 652.648719 1027.756939 212.800000 300.363524 462.671744 95.487930 375.841436 16.42225
2013-05-24 13:00:00 108.054863 3.882154 18.560999 26.057700 6.7065 26.1600 629.990486 997.749437 209.933333 303.269447 461.095065 96.584383 371.619775 16.43325
2013-05-24 13:30:00 79.330662 4.646089 18.900260 26.067374 6.7120 26.1545 595.817687 954.988747 206.733333 303.017852 455.455579 84.066406 358.194935 16.43325
2013-05-24 14:00:00 52.366527 5.048825 20.440061 25.961307 6.7395 26.1325 549.039365 900.975244 201.333333 298.914731 449.517276 69.449710 406.528564 16.43600
2013-05-24 14:30:00 35.658417 5.302946 21.064824 25.954462 6.7450 26.1215 493.519695 833.458365 192.066667 296.791541 444.663544 47.774030 315.295309 16.43325

.. Tip:: There are *many* tutorials on how to use the :obj:`pandas.DataFrame` and its powerful data analysis tools for multiple purposes online, to get started you may want to visit Panda’s own list of tutorials `here `__. By default the column names in :attr:`.Data.df` are retained from user-defined names unless they were named exactly the same as an internal name. For example the input ground heat flux column in this dataset is named “G”, therefore it was renamed as “input_g” >>> d.df.columns Index(['input_G', 'WS', 'VPD_PI', 'T_SONIC', 'SWC_1_1_1', 'SWC_1_2_1', 'NETRAD', 'SW_IN', 'SW_OUT', 'LW_IN', 'LW_OUT', 'H_PI_F', 'LE_PI_F', 'theta_mean'], dtype='object') >>> # the new name was also updated in Data.variables >>> d.variables.get('G') 'input_G' As stated earlier, :attr:`.Data.inv_map` maps the user-defined names to internal ``flux-data-qaqc`` names only after loading :attr:`.Data.df`: >>> d.inv_map {'TIMESTAMP_START': 'date', 'NETRAD': 'Rn', 'input_G': 'G', 'LE_PI_F': 'LE', 'H_PI_F': 'H', 'SW_IN': 'sw_in', 'SW_OUT': 'sw_out', 'LW_IN': 'lw_in', 'LW_OUT': 'lw_out', 'VPD_PI': 'vpd', 'T_SONIC': 't_avg', 'WS': 'ws', 'SWC_1_1_1': 'theta_1', 'SWC_1_2_1': 'theta_2'} .. Tip:: The :attr:`.Data.inv_map` is mainly used to rename the dataframe to internal names, this can be very useful if you are creating your own custom workflows using the ``flux-data-qaqc`` API because it allows you to only know the internal names of variables therefore they can be hard coded into your workflow and applied to different eddy covariance datasets. For example, let’s say we wanted to make HTML tables of basic statistics of just the energy balance components for many datasets (that may have different names for the same variables) and save the file using the user-defined names: >>> d = Data('US-Tw3_config.ini') >>> df = d.df.rename(columns=d.inv_map) >>> # get some metadata for saving >>> site_id = d.site_id >>> vars_we_want = ['H', 'LE', 'Rn', 'G'] >>> # rename variables, calculate basice statistics table and save to HTML >>> df[vars_we_want].rename(columns=d.variables).describe().to_html('{}.html'.format(site_id)) Calculating mean for var: THETA from columns: ['SWC_1_1_1', 'SWC_1_2_1'] WARNING: renaming column G to input_G >>> # which produces the following HTML table with user-defined names: >>> from IPython.display import HTML >>> HTML(filename='{}.html'.format(site_id)) .. raw:: html :file: _static/tutorial/US-Tw3.html .. raw:: html
Another powerful feature of the :attr:`.Data.df` property is that it is datetime-indexed using the input data’s temporal frequency, view the date index like so: >>> d.df.index DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:30:00', '2013-01-01 01:00:00', '2013-01-01 01:30:00', '2013-01-01 02:00:00', '2013-01-01 02:30:00', '2013-01-01 03:00:00', '2013-01-01 03:30:00', '2013-01-01 04:00:00', '2013-01-01 04:30:00', ... '2018-06-04 19:00:00', '2018-06-04 19:30:00', '2018-06-04 20:00:00', '2018-06-04 20:30:00', '2018-06-04 21:00:00', '2018-06-04 21:30:00', '2018-06-04 22:00:00', '2018-06-04 22:30:00', '2018-06-04 23:00:00', '2018-06-04 23:30:00'], dtype='datetime64[ns]', name='date', length=95088, freq=None) Datetime-indexed :obj:`pandas.DataFrame` objects have useful features for time series analysis like grouping and calculating statistics by time aggregates. The example below shows how to calculate the day of year mean for energy balance components, it also demonstrates how to use the ``add_lines`` plotting method available to :obj:`.Data`, :obj:`.QaQc`, and :obj:`.Plot` objects. >>> # convert to internal names, copy dataframe >>> df = d.df.rename(columns=d.inv_map) >>> # day of year mean of input energy balance components >>> vars_we_want = ['H', 'LE', 'Rn', 'G'] >>> doy_means = df[vars_we_want].groupby(d.df.index.dayofyear).mean() >>> # create a Bokeh figure >>> fig = figure(x_axis_label='day of year', y_axis_label='day of year mean (w/m2)') >>> # arguements needed for creating interactive plots >>> plt_vars = vars_we_want >>> colors = ['red', 'blue', 'black', 'green'] >>> x_name = 'date' >>> source = ColumnDataSource(doy_means) >>> Plot.add_lines(fig, doy_means, plt_vars, colors, x_name, source, labels=vars_we_want, >>> x_axis_type=None) >>> show(fig) .. raw:: html :file: _static/tutorial/doy_mean_example.html .. Note:: The ``x_axis_type=None`` is a unique argument to :meth:`.Plot.add_lines` and :meth:`.Plot.line_plot` that in this case means to not try to force the x-axis format to a datetime representation, default is ``x_axis_type='date'``. .. seealso:: Some routines occur automatically when creating a :obj:`.Data` object, including calcuation of weighted and non-weighted averages of soil heat flux and soil moisture which is described in :ref:`Averaging data from multiple sensors`. Modifying input data ^^^^^^^^^^^^^^^^^^^^ A last note on the :obj:`.Data` object (same goes for the :obj:`.QaQc` object) is that :attr:`Data.df` is a class property, in this case that means that it can be reassigned with a different :obj:`pandas.DataFrame`. This is critical for manual pre-filtering and validation of data before proceeding with energy balance closure routines. A simple example is shown here: >>> # add 5 to air temperature, this would effect ET calculations later >>> x = d.df >>> x['T_SONIC'] += 5 >>> d.df = x A realistic use of the reassignability of the :attr:`.Data.df` and :attr:`.QaQc.df` properties is shown in `manual cleaning of poor quality data `__. .. seealso:: The :meth:`.Data.apply_qc_flags` method allows for reading in quality control flags with the input data and filtering specific data out based on user-defined numeric or character flags. This routine is specific to :obj:`.Data` and includes several attributes that are added to a :obj:`.Data` instance, for full explanation and examples see :ref:`Quality-based data filtering`. Visualize input data -------------------- The :meth:`.Data.plot` method create a series of interactive time series plots of input data, potential plots inlcude: - energy balance components - radiation components - multiple soil heat flux measurements - air temperature - vapor pressure and vapor pressure deficit - wind speed - precipitation - latent energy - multiple soil moisture measurements If any of these variables are not found the plot(s) will not be added. The most useful interactive features of plots created by ``flux-data-qaqc`` are: - pan/zoom - hover tolltips on var names, values, date - linked x-axes on time series plots - save plot option (can save specific subplot zoomed in) Here is an example, >>> d.plot(output_type='notebook', plot_width=700) The output plot is not shown in the online documentation due to memory constraints. .. hint:: The plot methods of :obj:`.Data` and :obj:`.QaQc` objects have the keyword argument ``output_type`` which by default is set to “save”, the other two options are “notebook” for showing within a Jupyter Notebook and “show” which opens a temporary file in the default web browser. If you rather save the plot, and maybe you want 2 columns of plots, >>> d.plot(ncols=2, plot_width=500) After saving a plot without specifying the output file path (keyword argument ``out_file``), it will be saved to an “output” directory where the config file is with the file name based on :attr:`.Data.site_id` with the suffix “\_input_plots”: >>> # where the plot file was saved by default >>> d.plot_file PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/output/US-Tw3_input_plots.html') The following plot is not shown due to excessive memory usage needed to build online documentation. >>> # view outplot plots within Jupyter notebook >>> from IPython.display import HTML >>> HTML(filename=d.plot_file) .. hint:: The :meth:`.QaQc.plot` method shown below is similar however it may include added plots with calculated and corrected variables (if they exist) and will always plot data in daily and monthly temporal frequency because daily frequency is required before applying ``flux-data-qaqc`` energy balance closure corrections. Temporal resampling ------------------- The :obj:`.QaQc` object holds several tools for managing data and eddy covariance data analysis, but one of it’s primary features is temporal resampling of input data to daily and monthly frequencies. The resampling of time series data to daily frequency occurs upon the creation of a :obj:`.QaQc` instance if the frequency within the preceeding :obj:`.Data` object is not already daily: >>> # the frequency of the input data is 30 minute >>> d.df.index[0:5] DatetimeIndex(['2013-01-01 00:00:00', '2013-01-01 00:30:00', '2013-01-01 01:00:00', '2013-01-01 01:30:00', '2013-01-01 02:00:00'], dtype='datetime64[ns]', name='date', freq=None) >>> # creating a QaQc instance will automatically convert to daily >>> q = QaQc(d) The input data temporal frequency appears to be less than daily. Data is being resampled to daily temporal frequency. Filtering days with less then 100.0% or 48/48 sub-daily measurements Converting vpd from hpa to kpa >>> # first 5 datetime indices are dates now >>> q.df.index[0:5] DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05'], dtype='datetime64[ns]', name='date', freq=None) The method used for aggregating different variables, e.g. mean or sum, when resampling to daily or monthly frequency is defined in the :attr:`QaQc.agg_dict` class attribute: >>> # these are the internal names as keys and temporal aggregation method as values >>> QaQc.agg_dict {'energy': 'mean', 'flux': 'mean', 'flux_corr': 'mean', 'br': 'mean', 'ET': 'sum', 'ET_corr': 'sum', 'ET_gap': 'sum', 'ET_fill': 'sum', 'ET_fill_val': 'sum', 'ET_user_corr': 'sum', 'ebr': 'mean', 'ebr_corr': 'mean', 'ebr_user_corr': 'mean', 'ebr_5day_clim': 'mean', 'gridMET_ETr': 'sum', 'gridMET_prcp': 'sum', 'lw_in': 'mean', 't_avg': 'mean', 'rso': 'mean', 'sw_pot': 'mean', 'sw_in': 'mean', 'vp': 'mean', 'vpd': 'mean', 'ppt': 'sum', 'ws': 'mean', 'Rn': 'mean', 'sw_out': 'mean', 'lw_out': 'mean', 'G': 'mean', 'LE': 'mean', 'LE_corr': 'mean', 'LE_user_corr': 'mean', 'H': 'mean', 'H_corr': 'mean', 'H_user_corr': 'mean'} .. note:: There are several calculated variables above that may not look familiar, many are calculated by the energy balance closure correction routines and described in :ref:`Closure Methodologies`. Also, any other variables (not found in :attr:`.QaQc.agg_dict` that exist in a :attr:`.QaQc.df` before accessing :attr:`.QaQc.monthly_df` the first time will be averaged in the monthly time series dataframe (:attr:`.QaQc.monthly_df`). The :obj:`.QaQc` constructor tries to infer the temporal frequency of the input time series data, however the method is not always accurate, to access the inferred initial temporal frequency of the data view the :attr:`.QaQc.temporal_freq` attribute: >>> q.temporal_freq '30T' If the inferred input frequency was accurate you will see a `Pandas datetime alias `__, in this case ‘30T’ is thirty minutes. If the temporal frequency is not automatically detected you should be able to rely on the ``n_samples_per_day`` instance attribute that is manually estimated by the ``QaQc`` constructor: >>> q.n_samples_per_day 48 Filter days with sub-daily gaps ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The ``drop_gaps`` and ``daily_frac`` keyword arguments used when creating a :obj:`.QaQc` instance allow you to control how days with sub-daily measurement gaps will or will not be filtered out when resampling to daily frequency. Sub-daily gaps in energy balance variables :math:`LE`, :math:`H`, :math:`Rn`, and :math:`G` , and daily ASCE standaridized reference ET inputs, e.g. hourly :math:`ea` ("vp"), :math:`rs` ("sw_in"), :math:`t_min`, :math:`t_max`, and :math:`ws`, can be linearly interpolated automatically before daily aggregations. Interpolation is performed over gap lengths measured in hours, with options to control the longest length of gap to interpolate when :math:`Rn \ge 0` controlled by the :obj:`.QaQc` keyword argument ``max_interp_hours`` (default 2 hours) and the longest gap to interpolate when :math:`Rn < 0` set by the ``max_interp_hours_night`` (default 4 hours).:math:` .. Important:: By default the :obj:`.QaQc` constructor will first linearly interpolate energy balance and ASCE ref. ET variables (:math:`LE`, :math:`H`, :math:`Rn`, :math:`G`, :math:`ea` ("vp"), :math:`rs` ("sw_in"), :math:`t_min`, :math:`t_max`, and :math:`ws`) according to the maximum gap lengths (``max_interp_hours`` and ``max_interp_hours_night``) and then count sub-daily gaps and drop days (set values to null) for all climate data columns (not QC flag or sub-daily gap count columns) where any of the sub-daily data are missing because by default ``drop_gaps=True`` and ``daily_frac=1.0``. In other words, if you have hourly input data for(:math:`LE` and one hour was missing on a given day, by default that hour will be linearly interpolated before calculating the daily time series and the daily mean will be calculated after. On the other hand, if other climate variables had a single hour missing on a given day, e.g. wind direction, this day would be filtered out by the :obj:`.QaQc` constructor. This is important because the daily time series is what is used in all energy balance closure correction and daily ASCE standardized reference ET algorithms. The percentage of sub-daily samples to require set by the ``daily_frac`` argument and the maximum length of gaps to linearly interpolate set by ``max_interp_hours`` and ``max_interp_hours_night`` complement each other and are used in tandem. For example, if the input data is half-hourly and you only want a maximum of 4 hours to be interpolated on any given day and gap lengths to interpolate should be no more than 2 hours each then you would pass the following parameters to the :obj:`.QaQc` constructor: >>> q = QaQc(d, daily_frac=20/24, max_interp_hours=2, max_interp_hours_night=2) The input data temporal frequency appears to be less than daily. Data is being resampled to daily temporal frequency. Linearly interpolating gaps in energy balance components up to 2 hours when Rn < 0 and up to 2 hours when Rn >= 0. Filtering days with less then 83.33333333333334% or 40/48 sub-daily measurements In this case we set ``daily_frac=20/24`` because we are only allowing a maximum of 4 hours of total gaps in the day in other words we are requiring 40 of the 48 half hourly samples to exist before we filter out a day. Remember, because linear interpolation of gaps is done before counting sub-daily gaps, this could result in retaining days with more than 4 hours of gaps in the original time series of energy balance components. You may also pass the ``daily_frac`` arugment as a decimal fraction, e.g. :math:`0.8333 \approx 20/24`. To not drop any days and take daily means/sums based on whatever data exists in a given day *without* any interpolation of energy balance variables, >>> q = QaQc(d, drop_gaps=False, max_interp_hours=None) The input data temporal frequency appears to be less than daily. Data is being resampled to daily temporal frequency. Let’s view a comparison of :math:`Rn` using different options of filtering days with sub-daily gaps in the working dataset, because it has several periods of systematic gaps which cause upwards skewing of daily mean :math:`Rn` if not filtered carefully: >>> # make an empty pandas dataframe for Rn series >>> Rn_df = pd.DataFrame() >>> # recreate multiplt QaQc instances using different sub-day gap filters >>> q = QaQc(d, drop_gaps=False, max_interp_hours=None) >>> Rn_df['sub_day_gaps'] = q.df.Rn_subday_gaps >>> Rn_df['no_filter_no_interp'] = q.df.rename(columns=q.inv_map).Rn >>> q = QaQc(d, drop_gaps=False) >>> Rn_df['no_filter_with_interp'] = q.df.rename(columns=q.inv_map).Rn >>> q = QaQc(d, daily_frac=0.5) # filter days with less than 50% data >>> Rn_df['require_50'] = q.df.rename(columns=q.inv_map).Rn >>> q = QaQc(d, daily_frac=0.75) >>> Rn_df['require_75'] = q.df.rename(columns=q.inv_map).Rn >>> q = QaQc(d, daily_frac=1, max_interp_hours=24, max_interp_hours_night=24) >>> Rn_df['require_100_with_interp'] = q.df.rename(columns=q.inv_map).Rn >>> q = QaQc(d, daily_frac=1, max_interp_hours=None) >>> Rn_df['require_100_no_interp'] = q.df.rename(columns=q.inv_map).Rn >>> # plot to compare results of day-gap filter >>> fig = figure(x_axis_label='date', y_axis_label='mean daily net radiation (w/m2), filtered based on sub-daily gaps') >>> # arguments needed for creating interactive line plots >>> colors = ['red', 'darkred','orange', 'blue', 'black', 'tan'] >>> plt_vars = ['no_filter_no_interp', 'no_filter_with_interp', 'require_50', 'require_75', 'require_100_with_interp', 'require_100_no_interp'] >>> labels = ['no filter wout/interp.', 'no filter w/interp.', 'require > 50% w/interp.', 'require > 75% w/interp.', 'require 100% w/interp.', 'require 100% wout/interp.'] >>> x_name = 'date' >>> source = ColumnDataSource(Rn_df) >>> Plot.add_lines(fig, Rn_df, plt_vars, colors, x_name, source, labels=labels) >>> # add daily gap counts to secondary y >>> fig.extra_y_ranges['gap_counts'] = Range1d(start=0, end=48) >>> fig.add_layout(LinearAxis(y_range_name='gap_counts', axis_label='number of sub-daily gaps'), 'right') >>> fig.circle('date', 'sub_day_gaps', legend_label='n sub-day gaps', y_range_name='gap_counts', >>> color='silver', source=source >>> ) >>> fig.hover[0].tooltips.append(('sub_day_gaps','@{}'.format('sub_day_gaps'))) >>> fig.legend.location = 'top_right' >>> show(fig) .. raw:: html :file: _static/tutorial/filter_subday.html Try zooming in on the gaps filled by the "no filter wout/interp." line to compare which days are retained/filtered by different options, also remove lines by clicking on them in the legend to compare subsets of options. .. Tip:: For a more fine-grained approach to filtering out days where perhaps multiple 2 hour gaps were filled use the newly created daily gap count columns: "LE_subday_gaps", "H_subday_gaps", "Rn_subday_gaps", and "G_subday_gaps": >>> q = QaQc(d) >>> df = q.df.rename(columns=q.inv_map) For example, you could post-filter out days in any given energy balance variable, in this case :math:`Rn` where sub-daily gaps exceed a threshold: >>> df.loc[(df.Rn_subday_gaps > 4) & (df.Rn.notna()), ['Rn','Rn_subday_gaps']] .. raw:: html
Rn Rn_subday_gaps
date
2015-06-09 101.710194 5.0
2015-11-20 47.990988 5.0
2016-01-15 72.495973 8.0
2018-01-06 79.507008 7.0
2018-05-10 160.997332 6.0

Monthly time series ^^^^^^^^^^^^^^^^^^^ The :attr:`.QaQc.monthly_df` property allows for creating the monthly time series of input anc calculated variables provided by :meth:`.QaQc.correct_data`. It uses the same temporal aggregation methods as the daily time series i.e. from :attr:`.QaQc.agg_dict`. Although there are many similarities there are important differences between :attr:`.QaQc.df` and :attr:`.QaQc.monthly_df` other than the obvious: when accessing the :attr:`.QaQc.monthly_df` it will automatically run the default energy balance closure correction routine provided by :meth:`.QaQc.correct_data` *if* it has not yet been run. You can check if it has been run at anytime by: >>> q.corrected False To show how this works let’s access the monthly data and show the monthly statistics of the “corrected” evapotranspiration (ET_corr): >>> # first note, ET_corr is not in the dataset yet >>> 'ET_corr' in q.df.columns False Now access the monthly time series, >>> q.monthly_df; >>> 'ET_corr' in q.df.columns True By calling the monthly dataframe, the energy balance closure was applied automatically >>> q.monthly_df.ET_corr.describe() count 61.000000 mean 87.858135 std 49.938287 min 11.370062 25% 41.418994 50% 84.383190 75% 127.500125 max 192.033481 Name: ET_corr, dtype: float64 >>> q.corrected True .. note:: The :attr:`.QaQc.monthly_df` also filters out months with less than 30% of days of the month missing by default. To calculate monthly time series with other threshold fractions of days required use the :func:`.util.monthly_resample` function and adjust the keyword argument ``thresh`` which is the fraction (0-1) of days of the month required to not be gaps otherwise the month’s value will be forced to null, e.g. if you wanted to caclulate the monthly mean air temperature requiring 30 and 90 percent of the days in the month to not be gaps: >>> from fluxdataqaqc.util import monthly_resample >>> # select just t_avg for example >>> cols = ['t_avg'] >>> df = q.df.rename(columns=q.inv_map) >>> # create temporary df with different monthly resample results >>> tmp_df = monthly_resample(df, cols, 'mean', thresh=0.9).rename( >>> columns={'t_avg': 'thresh_90'} >>> ) >>> # join temp dataframe with monthly resample results using different thresh >>> monthly_gap_comp = tmp_df.join(monthly_resample(df, cols, 'mean', thresh=0.3).rename( >>> columns={'t_avg': 'thresh_30'}) >>> ) >>> # plot to compare results of day-gap filter >>> fig = figure(x_axis_label='date', y_axis_label='monthy mean air temperature (C), filtered based on daily gaps') >>> # arguments needed for creating interactive line plots >>> x = 'date' >>> source = ColumnDataSource(monthly_gap_comp) >>> # this example also shows how to use other Bokeh plot arguments >>> Plot.line_plot(fig,'date','thresh_30',source,'red',label='require > 30%', line_alpha=0.5) >>> Plot.line_plot(fig,'date','thresh_90',source,'black',label='require > 90%',line_dash='dotted', line_width=2) >>> fig.legend.location = 'top_right' >>> show(fig) .. raw:: html :file: _static/tutorial/filter_monthly.html .. raw:: html
Energy balance corrections -------------------------- ``flux-data-qaqc`` provides routines that adjust surface energy balance fluxes to improve energy balance closure of eddy covariance flux station data. These routines ultimately result in a corrected daily and monthly time series of latent energy, sensible heat, and evapotranspiration with the option to gap-fill days in corrected ET with ET calculated from gridMET reference ET and fraction of reference ET. There are two methods currently implemented: * Energy Balance Ratio method (default), modified from the `FLUXNET method `__ * Bowen Ratio approach (forces closure) * Multiple least squares regression - user defines LHS and RHS from :math:`LE`, :math:`H`, :math:`Rn`, and :math:`G`, Detailed descriptions of methods including daily ET gap-filling methods can be found in the online documentation :ref:`Closure Methodologies` page. A few important notes on the API of these methods and other hydro-climatic statistical variables that are calculated are shown below. The :meth:`.QaQc.correct_data` method is used to run energy balance closure corrections. Here are a few tips on using them, >>> # note above the monthly_df automatically applied the 'ebr' Energy Balance Ratio correction >>> q.corr_meth 'ebr' >>> # potential correction options >>> q.corr_methods ('ebr', 'br', 'lin_regress') >>> # to specify the Bowen Raito method: >>> q.correct_data(meth='br') >>> # the most recently used correction method is now shown >>> q.corr_meth 'br' .. Tip:: After applying any energy balance closure correction routine all previous corrected variables will be overwritten or dropped in :attr:`.QaQc.df`, :attr:`.QaQc.monthly_df`, and :attr:`.QaQc.variables`, therefore to make a comparison of different methods on the same data make a copy of the ``df`` or ``monthly_df`` properties before running the next correction, e.g. >>> # make copies of daily results of different correction options >>> q.correct_data(meth='ebr') >>> ebr_gapfilled = q.df >>> q.correct_data(meth='ebr', etr_gap_fill=False) >>> ebr_notgapfilled = q.df >>> q.correct_data(meth='br') >>> br_gapfilled = q.df >>> q.correct_data(meth='br', etr_gap_fill=False) >>> br_notgapfilled = q.df ET gap-filling ^^^^^^^^^^^^^^ A few notes on the option that uses reference ET and fraction of daily reference ET to fill in large gaps in corrected ET, i.e. the keyword argument ``QaQc.correct_data(etr_gap_fill = True)``. - The nearest `gridMET `__ cell’s time series data for precipitation and alfalfa reference ET is attempted to be downloaded if it is not found in the ``gridmet_file_path`` entry of the config.ini file. - If the path to a gridMET file is not found it is re-downloaded, the config file will be updated with the new path and resaved. - Only the overlapping time period that matches the eddy covariance time series data is attempted to be downloaded, i.e. the period in ``QaQc.df.index``. - When a gridMET file is downloaded it will always be saved in a subdirectory where the config file is located called “gridMET_data” and named using the :attr:`QaQc.site_id` and gridMET cell centroid latitude and longitude. - Corrected latent energy (:math:`LE_{corr}`) gaps are also backwards filled from gap-filled ET. .. caution:: `gridMET `__ only exists within the contiguous United States and from 1979 to present, therefore if your station lies outside of this region or you are analyzing eddy flux data recorded before 1979 this option will not be ususable and you should always run corrections with ``etr_gap_fill=False`` to avoid potential errors. The Bowen Ratio correction method will produce the ‘br’ variable which is the Bowen Ratio. Other calculations ------------------ By default, :meth:`.QaQc.correct_data` also calculates ET from input latent energy (LE) and air temperature, corrected ET from corrected LE and air temperature, potential clear sky radiation (ASCE formulation), and the :obj:`.Data` object attempts to calculate vapor pressure deficit from vapor pressure and air temperature or vapor pressure from vapor pressure deficit and air temperature if they exist at hourly or shorter temporal frequency. Evapotranspiration ^^^^^^^^^^^^^^^^^^ The evapotranspiration (ET) calculations are described in :ref:`Steps 7 and 8 correct turbulent fluxes, EBR, and ET` of the Energy Balance Ratio correction explanation. ASCE clear sky radiation ^^^^^^^^^^^^^^^^^^^^^^^^ Daily ASCE potential clear sky radiation (:math:`R_{so}`) is calculated using equation 19 in the “ASCE Standardized Reference Evapotranspiration Equation” final report by the Task Committee on Standardization of Reference Evapotranspiration Environmental and Water Resources Institute of the American Society of Civil Engineers January, 2005 `here `__. This calculation is a simple method based primarily on elevation and latitude which results in a theoretical envelope of :math:`R_{so}` as a function of the day of year, .. math:: R_{so} = \left(5 + 2 \times 10^{-5} z \right) R_a where :math:`z` is elevation in meters and :math:`R_a` is daily extraterrestrial radiation (radiation with in the absence of an atmosphere), which itself is a well-behaved function of solar declination, the day of the year and the solar constant (see equations 21-29 in the `ASCE report `__). Vapor pressure/deficit ^^^^^^^^^^^^^^^^^^^^^^ The :obj:`.Data` object will attempt to calculate vapor pressure or vapor pressure deficit if one exists but not the other and average air temperature time series also exists with the input data at hourly or shorter temporal frequency. The Tetens equation (eqn. 37 in the ASCE report) s an accurate approximation for saturation vapor pressure (:math:`es`) in kPa as a function of air temperature, .. math:: es = 0.6108 e^{\left(\frac{17.27 \cdot T}{(T + 237.3)}\right)} where :math:`T` is average hourly air temperature in degrees celcius. Vapor pressure deficit (:math:`vpd`) is, .. math:: vpd = es - ea, where :math:`ea` is actual vapor pressure in kPa. **Note,** The equations above are defined for hourly measurements however they are used for hourly or shorter mean variables (:math:`T`, :math:`ea`, or :math:`vpd`) within ``flux-data-qaqc`` and then converted to daily means, if they are not present in the input data at hourly or shorter frequencies then they are not calculated. These equations can be rearanged to solve for either :math:`es` or :math:`vpd` given the other variable and air temperature. For example, if given :math:`T` and :math:`vpd`, then to get actual vapor pressure .. math:: es = 0.6108 e^{\left(\frac{17.27 \cdot T}{(T + 237.3)}\right)} .. math:: ea = es - vpd. In ``flux-data-qaqc`` actual vapor pressure is named "vp" not "ea". Also, during these calculations, if relative humidity is not found in the input dataset then it will subsequently be estimated as .. math:: rh = 100 \times \frac{ea}{es}. .. hint:: The same calculations are available at the daily timestep but are not automatically applied as the hourly or higher temporal frequency calculation is preffered. To apply the estimates of vapor pressure or vapor pressure deficit, and saturation vapor pressure, and relative humidity with daily data one must call the :meth:`.QaQc._calc_vpd_from_vp` method from a :obj:`.QaQc` instance. Legend for calculated variable names ------------------------------------ Atmospheric and related variables created by energy balance closure corrections are described in :ref:`Closure Methodologies` and other calculations can be found throughout this documentation. Below is a reference table with basic descriptions of all variables that are calculated or renamed by ``flux-data-qaqc`` using its standardized naming scheme. The names found here (in the "Variable" column) are the default names that are saved to the header in both daily and monthly time series output CSV files as well as what are shown in the interactive plots upon hovering with a cursor. Note, not all of the following variables will be estimated by ``flux-data-qaqc`` in all scenarios, the list of variables that may be found in output files is completely dependent on the input data provided in the configuration file and also dependend on which calculations are used. For example, ASCE_ETo will only exist if the hourly or daily methods for computing ASCE standardized reference ET are used and the required input variables exist. ================= ================================================================================ ============ Variable Description Unit ================= ================================================================================ ============ ASCE_ETo short/grass ASCE standardized reference ET mm time-1 ASCE_ETr tall/alfalfa ASCE standardized reference ET mm time-1 br bowen ratio unitless ebc_cf energy balance closure correction factor (inverse of ebr_corr) unitless ebr input energy balance ratio unitless ebr_5day_clim 5 day climatology of the filtered Energy Balance Ratio unitless ebr_corr corrected energy balance ratio unitless energy input Rn - G w m-2 es saturation vapor pressure kPa ET ET calculated from input LE and average air temperature mm time-1 ET_corr ET calculated from LE_corr mm time-1 ET_fill gridMET_ETr * ETrF_filtered (fills gaps in ET_corr) mm time-1 ET_fill_val value of ET_fill on gap days mm time-1 ET_gap True on gap days in ET_corr, False otherwise unitless ET_user_corr corrected ET, user-provided mm time-1 EToF fraction of reference ET for ET_corr, i.e. ET_corr / gridMET_ETo unitless EToF_filtered filtered and gap-filled EToF unitless ETrF fraction of reference ET for ET_corr, i.e. ET_corr / gridMET_ETr unitless ETrF_filtered filtered and gap-filled EtrF unitless flux input LE + H w m-2 flux_corr LE_corr + H_corr w m-2 G average or single soil heat flux w m-2 G_[1,2,3,…] soil heat flux at sensor w m-2 G_subday_gaps number of gaps in initial G per day unitless gridMET_ETo gridMET short/grass reference ET (nearest cell) mm time-1 gridMET_ETr gridMET tall/alfalfa reference ET (nearest cell) mm time-1 gridMET_prcp gridMET precipitation (nearest cell) mm time-1 gridMET_[other] other optional gridMET variables (nearest cell) NA H_corr corrected sensible heat w m-2 H_subday_gaps number of gaps in initial H per day unitless H_user_corr corrected sensible heat, user-provided w m-2 LE_corr corrected latent energy w m-2 LE_subday_gaps number of gaps in initial LE per day unitless LE_user_corr corrected latent energy, user-provided w m-2 lw_in incoming longwave radiation w m-2 lw_out outgoing longwave radiation w m-2 ppt precipitation mm time-1 rh relative humidity % Rn net radiation w m-2 Rn_subday_gaps number of gaps in initial Rn per day unitless rso clear sky radiation (ASCE formulation) w m-2 sw_in incoming shortwave radiation w m-2 sw_out outgoing shortwave radiation w m-2 sw_pot potential shortwave radiation, user-provided w m-2 t_avg average temperature C t_dew dew point temperature C t_max maximum temperature C t_min minimum temperature C theta average or single soil moisture user defined theta_[1,2,3,…] soil moisture at sensor user defined vp actual vapor pressure kPa vpd vapor pressure deficit kPa wd wind direction user defined ws wind speed m s-1 ================= ================================================================================ ============ A note on units --------------- Upon creation of a :obj:`.QaQc` object, variables are checked for valid input units and converted to required units needed for internal calculations when running :meth:`.QaQc.correct_data` and for certain default plots (see below). For a list of valid input units for different variables refer to the :attr:`.QaQc.allowable_units` attribute: >>> q.allowable_units {'LE': ['w/m2'], 'H': ['w/m2'], 'Rn': ['w/m2'], 'G': ['w/m2'], 'lw_in': ['w/m2'], 'lw_out': ['w/m2'], 'sw_in': ['w/m2'], 'sw_out': ['w/m2'], 'ppt': ['mm', 'in'], 'vp': ['kpa', 'hpa'], 'vpd': ['kpa', 'hpa'], 't_avg': ['c', 'f']} For each variable above, if given one of the units allowable the units will automatically be converted to the required units. To know which variables are required to be in particular units view :attr:`.Qc.required_units`: >>> q.required_units {'LE': 'w/m2', 'H': 'w/m2', 'Rn': 'w/m2', 'G': 'w/m2', 'lw_in': 'w/m2', 'lw_out': 'w/m2', 'sw_in': 'w/m2', 'sw_out': 'w/m2', 'ppt': 'mm', 'vp': 'kpa', 'vpd': 'kpa', 't_avg': 'c'} .. note:: The list of allowable units is a work in progress, if your input units are not available consider raising an issue on `GitHub `__ or providing the conversion directly with a pull request. Automatic unit conversions are handled within the :obj:`.util.Convert` class using the :meth:`.util.Convert.convert` class method. Save resampled and corrected data --------------------------------- The :meth:`.QaQc.write` method conveniently writes daily and monthly time series of input and calculated variables to comma separated value (CSV) files. If the :meth:`.QaQc.correct_data` method has not yet been run it will be and the monthly time series will also be created using the default parameters for the correction routine (Energy Balance Ratio method with ETr-based gap filling). The default output directory for time series files can be accessed/changed by the ``out_dir`` attribute, if not changed it will be located in the same directory of the config.ini file. The daily and monthly time series file names will begin with the :attr:`.QaQc.site_id` followed by “daily_data” or “monthly_data” resepctively. For example, >>> # new QaQc instance >>> q = QaQc(d) >>> # a platform dependent pathlib.Path object >>> q.out_dir PosixPath('/home/john/flux-data-qaqc/examples/Basic_usage/output') The line below shows that no output files have been written to :attr:`.QaQc.out_dir` yet, >>> # print files in output directory that begin with the site_id >>> [f.name for f in q.out_dir.glob('{}*'.format(q.site_id))] ['US-Tw3_input_plots.html'] >>> q.corrected False >>> # writing files also ran corrections since they were not yet run >>> q.write() >>> q.corrected True Now the respective daily and monthly time series have been written to :attr:`.QaQc.out_dir`, >>> [f.name for f in q.out_dir.glob('{}*'.format(q.site_id))] ['US-Tw3_daily_data.csv', 'US-Tw3_monthly_data.csv', 'US-Tw3_input_plots.html'] .. hint:: You can overwrite the default name of the output directory to save the daily and monthly time series using the ``out_dir`` keyword argument to :meth:`.QaQc.write`, this option keeps the location within the directory of the config file but just changes the name, whereas to change the entire output directory path adjust the :attr:`.QaQc.out_dir` attribute directly. Also, the naming scheme of output files created will use user-defined names for all input variables. Visualize resampled and corrected data -------------------------------------- Similar to the :meth:`.Data.plot`, the :meth:`.QaQc.plot` method creates a series of default time series and scatter plots of input and in this case calculated variables. The temporal frequency of plots from :meth:`.QaQc.plot` will always be daily and monthly and additional plots are created for validation of energy balance closure corrections, otherwise the same options such as number of subplot columns, super title, subplot dimensions, output type, output file path, etc. are available. Similar to :meth:`.QaQc.write` and :attr:`.QaQc.monthly_df`, if the data has not yet been corrected the ``plot`` method will correct it using the default parameters before creating the plots. Here is an example of the default daily and monthly time series plots produced after running the Energy Balance Ratio closure correction: >>> q = QaQc(d) >>> q.plot(output_type='notebook', plot_width=700) .. raw:: html :file: _static/tutorial/US-Tw3_plots.html