
SINGLE SITE ANOMALY DETECTION AND CORRECTION

This script performs anomaly detection and correction for multiple sensors at a single monitoring site for one year. It imports data, performs initial rules-based anomaly detection, uses models (ARIMA and four variants of LSTM) with associated thresholds to detect anomalies, aggregates the results into an overall detection, and performs correction based on ARIMA.

Site: Logan River at Main Street

Sensors: temperature, specific conductance, pH, dissolved oxygen

Year: 2017

Created: Amber Jones, March 2021

Import Libraries and Functions

import pandas as pd
import matplotlib.pyplot as plt
import math
from PyHydroQC import anomaly_utilities
from PyHydroQC import calibration
from PyHydroQC import model_workflow
from PyHydroQC import rules_detect
from PyHydroQC import ARIMA_correct
from PyHydroQC import modeling_utilities
from PyHydroQC.model_workflow import ModelType

Retrieve data

Creates an object with a data frame specific to each sensor.

Functions called:

  • get_data
site = 'MainStreet'
sensors = ['temp', 'cond', 'ph', 'do']
sensor_array = anomaly_utilities.get_data(sensors=sensors, filename='MS2017.csv', path='../LRO_data/')
for snsr in sensor_array:
    print(snsr + str(sensor_array[snsr]))

temp                      raw
datetime                 
2017-01-01 00:00:00  2.02
2017-01-01 00:15:00  2.01
2017-01-01 00:30:00  2.00
2017-01-01 00:45:00  1.99
2017-01-01 01:00:00  1.96
...                   ...
2017-12-31 22:45:00  2.34
2017-12-31 23:00:00  2.31
2017-12-31 23:15:00  2.28
2017-12-31 23:30:00  2.26
2017-12-31 23:45:00  2.23

[35019 rows x 1 columns]
cond                       raw
datetime                  
2017-01-01 00:00:00  410.8
2017-01-01 00:15:00  410.8
2017-01-01 00:30:00  410.9
2017-01-01 00:45:00  410.8
2017-01-01 01:00:00  410.9
...                    ...
2017-12-31 22:45:00  382.2
2017-12-31 23:00:00  382.0
2017-12-31 23:15:00  382.0
2017-12-31 23:30:00  381.9
2017-12-31 23:45:00  381.9

[35019 rows x 1 columns]
ph                      raw
datetime                 
2017-01-01 00:00:00  8.54
2017-01-01 00:15:00  8.54
2017-01-01 00:30:00  8.54
2017-01-01 00:45:00  8.54
2017-01-01 01:00:00  8.54
...                   ...
2017-12-31 22:45:00  8.47
2017-12-31 23:00:00  8.47
2017-12-31 23:15:00  8.47
2017-12-31 23:30:00  8.47
2017-12-31 23:45:00  8.47

[35019 rows x 1 columns]
do                       raw
datetime                  
2017-01-01 00:00:00  11.78
2017-01-01 00:15:00  11.78
2017-01-01 00:30:00  11.78
2017-01-01 00:45:00  11.78
2017-01-01 01:00:00  11.79
...                    ...
2017-12-31 22:45:00  11.77
2017-12-31 23:00:00  11.78
2017-12-31 23:15:00  11.79
2017-12-31 23:30:00  11.80
2017-12-31 23:45:00  11.80

[35019 rows x 1 columns]

Import Parameters

# Parameters may be specified in a parameters file or in this script
from PyHydroQC.parameters import site_params, LSTM_params, calib_params
for snsr in sensors:
    print(snsr + str(site_params[site][snsr]))
print('LSTM' + str(LSTM_params))
print('calib' + str(calib_params))

temp{'max_range': 20, 'min_range': -2, 'persist': 30, 'window_sz': 30, 'alpha': 1e-05, 'threshold_min': 0.4, 'widen': 1, 'pdq': [0, 0, 0]}
cond{'max_range': 2700, 'min_range': 150, 'persist': 30, 'window_sz': 40, 'alpha': 1e-06, 'threshold_min': 5.0, 'widen': 1, 'pdq': [1, 1, 5]}
ph{'max_range': 9.5, 'min_range': 7.5, 'persist': 45, 'window_sz': 20, 'alpha': 0.0001, 'threshold_min': 0.03, 'widen': 1, 'pdq': [3, 1, 1]}
do{'max_range': 15, 'min_range': 5, 'persist': 45, 'window_sz': 30, 'alpha': 1e-05, 'threshold_min': 0.25, 'widen': 1, 'pdq': [1, 1, 1]}
LSTM{'time_steps': 5, 'samples': 20000, 'cells': 128, 'dropout': 0.2, 'patience': 6}
calib{'persist_low': 3, 'persist_high': 7, 'hour_low': 7, 'hour_high': 17}

Rules Based Anomaly Detection

Performs checks for range and persistence. Min/max range and duration are defined in the parameters. Data outside the range or persisting longer than the duration are detected as anomalous and corrected by linear interpolation. The output is an 'observed' column of intermediate results used for subsequent modeling.
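
The two rules can be illustrated with a minimal pandas sketch on hypothetical data; this stands in for the package's range_check, persistence, and interpolate functions rather than reproducing them:

```python
import pandas as pd

# Hypothetical data illustrating the two rules (not the PyHydroQC internals).
df = pd.DataFrame({'raw': [2.0, 2.1, 25.0, 2.2, 2.2, 2.2, 2.2, 2.3]})

# Range check: flag values outside [min_range, max_range].
max_range, min_range = 20, -2
df['anomaly'] = (df['raw'] > max_range) | (df['raw'] < min_range)

# Persistence: flag runs of identical values longer than a set length.
persist = 3
run_id = (df['raw'] != df['raw'].shift()).cumsum()    # label each run of repeats
run_len = df.groupby(run_id)['raw'].transform('size')
df['anomaly'] |= run_len > persist

# Correction by linear interpolation over the flagged points.
df['observed'] = df['raw'].mask(df['anomaly']).interpolate()
print(df)
```

Here the spike of 25.0 fails the range check and the four repeated 2.2 values fail persistence; both are replaced by interpolated values in 'observed'.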

Functions called:

  • range_check
  • persistence
  • interpolate
range_count = dict()
persist_count = dict()
rules_metrics = dict()
for snsr in sensor_array:
    sensor_array[snsr], range_count[snsr] = rules_detect.range_check(df=sensor_array[snsr],
                                                                     maximum=site_params[site][snsr]['max_range'],
                                                                     minimum=site_params[site][snsr]['min_range'])
    sensor_array[snsr], persist_count[snsr] = rules_detect.persistence(df=sensor_array[snsr],
                                                                       length=site_params[site][snsr]['persist'],
                                                                       output_grp=True)
    sensor_array[snsr] = rules_detect.interpolate(df=sensor_array[snsr])
print('Rules based detection complete.\n')
Rules based detection complete.


Calibration and Drift Correction

Detect Calibration Events

Two separate rules-based approaches attempt to identify calibration events. Events detected here should be reviewed, compared to field records, and organized as input to the following step, drift correction. A subset of sensors is selected (indices 1:4) because temperature is not calibrated. Calibration approaches:

  1. Concurrent persistence: points are identified as an event where persistence falls within a certain window (e.g., after a sensor is returned to the water, it is 'stuck' and reports the same value for several time steps), the time of day falls within a certain window, and these conditions overlap for all sensors. Hours and durations are defined in the parameters.

  2. Edge detection: Edges are determined using the differences in data before and after each point. The width may be a single point (width = 1) or greater. If the difference exceeds a threshold (a parameter setting), then it is identified as an event.
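
The edge-detection idea can be sketched with a point-to-point difference on hypothetical data (the package's calib_edge_detect adds windowing and other logic on top of this):

```python
import pandas as pd

# Hypothetical series with an abrupt shift; threshold chosen for illustration.
obs = pd.Series([410.8, 410.9, 410.8, 350.2, 350.3, 350.1],
                index=pd.date_range('2017-01-02 14:00', periods=6, freq='15min'))
width, threshold = 1, 60

# With width=1 this is the difference between adjacent points; larger widths
# would compare values farther apart.
diff = obs.diff(width).abs()
candidates = obs.index[diff > threshold]
print(candidates)
```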

Functions called:

  • calib_overlap
    • calib_detect
  • calib_edge_detect
# Concurrent persistence
calib_sensors = sensors[1:4]
input_array = dict()
for snsr in calib_sensors:
    input_array[snsr] = sensor_array[snsr]
all_calib, all_calib_dates, df_all_calib, calib_dates_overlap = calibration.calib_overlap(sensor_names=calib_sensors,
                                                                                          input_array=input_array,
                                                                                          calib_params=calib_params)
print('Possible calibration events identified by concurrent persistence.')
print(calib_dates_overlap)

# Edge detection
calib_candidates = dict()
edge_diff = dict()
threshold = dict()
# Set threshold for each variable - the level of change for a difference to be identified as a calibration event.
# This can be an iterative process.
threshold['cond'] = 60
threshold['ph'] = 0.1
threshold['do'] = 0.4
print('Possible calibration events identified by edge detection.')
for snsr in calib_sensors:
    # Width is the window of time to consider in the edge detect difference.
    # 1 determines the difference between each point independently.
    # Higher numbers use the difference over a longer window.
    calib_candidates[snsr], edge_diff[snsr] = calibration.calib_edge_detect(observed=sensor_array[snsr]['observed'],
                                                                            width=1,
                                                                            calib_params=calib_params,
                                                                            threshold=threshold[snsr])
    print(snsr)
    print(calib_candidates[snsr])
Possible calibration events identified by concurrent persistence.
DatetimeIndex(['2017-01-31 06:45:00', '2017-01-31 07:00:00',
               '2017-01-31 07:15:00', '2017-01-31 07:30:00',
               '2017-01-31 07:45:00', '2017-02-17 12:30:00',
               '2017-03-16 15:30:00', '2017-03-16 15:45:00',
               '2017-03-16 16:00:00', '2017-03-22 13:15:00',
               '2017-03-22 13:30:00', '2017-04-07 11:00:00',
               '2017-04-10 11:00:00', '2017-04-14 17:30:00',
               '2017-04-14 17:45:00', '2017-04-14 18:00:00',
               '2017-05-18 08:30:00', '2017-05-18 08:45:00',
               '2017-05-18 09:00:00', '2017-05-18 09:15:00',
               '2017-05-18 09:30:00', '2017-05-19 08:15:00',
               '2017-05-19 08:30:00', '2017-05-19 08:45:00',
               '2017-05-19 09:00:00', '2017-05-19 09:15:00',
               '2017-05-19 09:30:00', '2017-05-19 09:45:00',
               '2017-05-19 10:00:00', '2017-05-19 10:15:00',
               '2017-05-25 08:45:00', '2017-05-25 09:00:00',
               '2017-05-25 09:15:00', '2017-05-26 15:30:00',
               '2017-05-26 15:45:00', '2017-06-12 13:15:00',
               '2017-06-12 13:30:00', '2017-06-12 13:45:00',
               '2017-06-12 14:00:00', '2017-06-12 14:15:00',
               '2017-06-12 14:30:00', '2017-06-12 14:45:00',
               '2017-06-13 13:15:00', '2017-06-13 13:30:00',
               '2017-06-13 13:45:00', '2017-06-13 14:00:00',
               '2017-06-13 14:15:00', '2017-06-13 14:30:00',
               '2017-06-13 14:45:00', '2017-06-13 15:00:00',
               '2017-06-13 15:15:00', '2017-06-21 08:45:00',
               '2017-06-21 09:00:00', '2017-06-21 09:15:00',
               '2017-06-21 09:30:00', '2017-07-17 10:15:00',
               '2017-07-17 10:30:00', '2017-07-17 10:45:00',
               '2017-08-02 10:30:00', '2017-08-10 11:30:00',
               '2017-08-10 11:45:00', '2017-08-10 12:00:00',
               '2017-08-10 12:15:00', '2017-08-10 12:30:00',
               '2017-08-10 12:45:00', '2017-10-03 10:15:00',
               '2017-10-11 10:45:00', '2017-10-11 11:00:00',
               '2017-10-11 11:15:00', '2017-10-11 11:30:00',
               '2017-10-11 11:45:00', '2017-10-12 12:45:00',
               '2017-10-12 13:00:00', '2017-10-18 12:30:00',
               '2017-10-18 12:45:00', '2017-10-19 12:45:00',
               '2017-10-20 11:45:00', '2017-11-10 13:45:00',
               '2017-11-10 14:00:00', '2017-11-10 14:15:00',
               '2017-11-10 14:30:00', '2017-11-10 14:45:00',
               '2017-11-23 13:15:00', '2017-11-23 13:30:00',
               '2017-11-23 13:45:00', '2017-11-23 14:00:00',
               '2017-11-23 14:15:00', '2017-11-29 14:00:00',
               '2017-11-29 14:15:00', '2017-11-29 14:30:00',
               '2017-12-01 06:45:00', '2017-12-01 07:00:00',
               '2017-12-01 07:15:00', '2017-12-01 07:30:00',
               '2017-12-19 08:00:00', '2017-12-29 07:00:00',
               '2017-12-29 07:15:00', '2017-12-29 07:30:00'],
              dtype='datetime64[ns]', name='datetime', freq=None)
Possible calibration events identified by edge detection.
cond
[Timestamp('2017-01-02 14:30:00') Timestamp('2017-01-09 10:30:00')
 Timestamp('2017-01-20 12:30:00') Timestamp('2017-01-23 16:00:00')
 Timestamp('2017-01-24 13:30:00') Timestamp('2017-01-25 16:30:00')
 Timestamp('2017-02-07 11:45:00') Timestamp('2017-02-24 13:30:00')
 Timestamp('2017-02-27 15:15:00') Timestamp('2017-03-07 15:45:00')
 Timestamp('2017-12-20 16:45:00') Timestamp('2017-12-25 14:30:00')]
ph
[Timestamp('2017-01-09 10:00:00') Timestamp('2017-06-23 10:45:00')
 Timestamp('2017-10-03 15:00:00') Timestamp('2017-11-14 16:45:00')
 Timestamp('2017-12-18 14:45:00')]
do
[Timestamp('2017-06-23 10:45:00')]

Find Gap Values and Perform Linear Drift Correction

Drift correction is used when data are shifted due to calibration or cleaning of a sensor. To perform drift correction, the routine requires a start date, an end date, and a gap value by which to shift the data for each event. A separate function determines the gap value for each calibration event. This requires calibration dates determined either by the detection functions above (after review) or compiled from field records. The determined gap values should be reviewed and adjusted as needed.
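
The linear drift correction idea can be sketched as follows, under the assumption that the gap accumulates linearly over the correction period (a simplification of lin_drift_cor, with hypothetical data and gap value):

```python
import numpy as np
import pandas as pd

# Hypothetical event: a gap of +10 measured at the calibration at the end of
# the period. The shift is assumed to accumulate linearly from start to end.
idx = pd.date_range('2017-03-01', periods=5, freq='15min')
observed = pd.Series([400.0, 399.0, 398.0, 397.0, 396.0], index=idx)
gap = 10.0

frac = np.linspace(0, 1, len(observed))   # 0 at the start, 1 at the end
corrected = observed + gap * frac
print(corrected)
```

The first point is unchanged and the last point receives the full gap, so the corrected series meets the post-calibration data without a step.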

Functions called:

  • find_gap
  • lin_drift_cor
# Subset of sensors that are calibrated
calib_sensors = sensors[1:4]
# Initialize data structures
calib_dates = dict()
gaps = dict()
shifts = dict()
tech_shifts = dict()

for snsr in calib_sensors:
    # Import calibration dates
    calib_dates[snsr] = pd.read_csv('../LRO_data/' + site + '_' + snsr + '_calib_dates.csv',
                                    header=1, parse_dates=True, infer_datetime_format=True)
    calib_dates[snsr]['start'] = pd.to_datetime(calib_dates[snsr]['start'])
    calib_dates[snsr]['end'] = pd.to_datetime(calib_dates[snsr]['end'])
    # Ensure the date range of calibrations corresponds to the imported data
    calib_dates[snsr] = calib_dates[snsr].loc[(calib_dates[snsr]['start'] > min(sensor_array[snsr].index)) &
                                              (calib_dates[snsr]['end'] < max(sensor_array[snsr].index))]
    # Initialize dataframe to store determined gap values and associated dates
    gaps[snsr] = pd.DataFrame(columns=['end', 'gap'],
                              index=range(min(calib_dates[snsr].index), max(calib_dates[snsr].index)+1))
    if len(calib_dates[snsr]) > 0:
        # Initialize data structures
        shifts[snsr] = []
        tech_shifts[snsr] = []
        # Loop through each calibration event date.
        for i in range(min(calib_dates[snsr].index), max(calib_dates[snsr].index)+1):
            # Apply find_gap routine, add to dataframe, add output of shifts to list.
            gap, end, shifted = calibration.find_gap(observed=sensor_array[snsr]['observed'],
                                                     calib_date=calib_dates[snsr]['end'][i],
                                                     hours=2,
                                                     show_shift=True)
            gaps[snsr].loc[i, 'end'] = end
            gaps[snsr].loc[i, 'gap'] = gap
            shifts[snsr].append(shifted)

    # Plotting to examine gap values
    l = len(calib_dates[snsr])
    ncol = math.ceil(math.sqrt(l))
    nrow = math.ceil(l/ncol)
    hours = 6

    fig, ax = plt.subplots(nrows=nrow, ncols=ncol, figsize=(nrow, ncol+1), facecolor='w')
    for i, axi in enumerate(ax.flat):
        if i < l:
            axi.plot(sensor_array[snsr]['observed'].loc[
                    pd.to_datetime(calib_dates[snsr]['end'].iloc[i]) - pd.Timedelta(hours=hours):
                    pd.to_datetime(calib_dates[snsr]['end'].iloc[i]) + pd.Timedelta(hours=hours)
                    ])
            axi.plot(shifts[snsr][i], 'c')
print('Gap value determination complete.')
Gap value determination complete.

# Review gaps and make adjustments as needed before performing drift correction
# For example:
gaps['cond'].loc[gaps['cond'].index[5], 'gap'] = -5

# Perform Linear Drift Correction
calib_sensors = sensors[1:4]
for snsr in calib_sensors:
    # Set start dates for drift correction at the previously identified calibration.
    gaps[snsr]['start'] = gaps[snsr]['end'].shift(1)
    gaps[snsr].loc[gaps[snsr].index[0], 'start'] = sensor_array[snsr].index[0]

    if len(gaps[snsr]) > 0:
        for i in range(min(gaps[snsr].index), max(gaps[snsr].index) + 1):
            result, sensor_array[snsr]['observed'] = calibration.lin_drift_cor(observed=sensor_array[snsr]['observed'],
                                                                               start=gaps[snsr]['start'][i],
                                                                               end=gaps[snsr]['end'][i],
                                                                               gap=gaps[snsr]['gap'][i],
                                                                               replace=True)
print('Linear drift correction complete.')
Linear drift correction complete.

Model Based Anomaly Detection

Generates five models. Each model predicts one step ahead using immediately adjacent data. Anomalies are detected by comparing the model residual (predictions - observations) to a threshold. Dynamic thresholds are based on model variability. Settings for ARIMA, LSTM, and threshold determination are in the parameters. Results from all models are aggregated so that a detection by any model counts as an anomaly.
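
The residual-plus-dynamic-threshold idea can be sketched as follows. This is a simplified stand-in for set_dynamic_threshold and detect_anomalies: the residuals are simulated, and the scaling factor of 4 is an assumption for illustration, not the package's formula:

```python
import numpy as np
import pandas as pd

# Simulated residuals with one injected spike, standing in for an actual
# model's prediction errors.
rng = np.random.default_rng(0)
residuals = pd.Series(rng.normal(0, 0.1, 200))
residuals.iloc[100] = 2.0

window_sz, threshold_min = 30, 0.4
# Threshold scales with local variability but never drops below threshold_min.
dyn = (4 * residuals.rolling(window_sz, min_periods=1, center=True).std()).clip(lower=threshold_min)
detected = residuals.abs() > dyn
print('points flagged:', int(detected.sum()))
```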

ARIMA Detection

ARIMA models use a combination of past data in a linear form to predict the next value. These models are univariate. Each ARIMA model requires the parameters p, d, q, which can be defined automatically as shown here.
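
The underlying idea can be sketched with a hand-rolled AR(1) model on first differences. The coefficient phi is assumed for illustration and the data are hypothetical; the package fits a full ARIMA(p, d, q) instead:

```python
import numpy as np

# Hypothetical series with a spike; phi stands in for a fitted coefficient.
values = np.array([2.02, 2.01, 2.00, 1.99, 1.96, 3.50, 1.95])
phi = 0.5

diffs = np.diff(values)          # d=1: model the first differences
pred_diff = phi * diffs[:-1]     # one-step-ahead predicted change
pred = values[1:-1] + pred_diff  # predicted next value
residuals = values[2:] - pred    # large residuals flag the spike
print(np.round(residuals, 3))
```

The residuals stay near zero through the smooth stretch and jump at the spike, which is what the threshold comparison exploits.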

Functions called:

  • pdq
  • ARIMA_detect:
    • build_ARIMA_model
    • set_dynamic_threshold
    • detect_anomalies
    • anomaly_events
all_pdq = dict()
for snsr in sensors:
    all_pdq[snsr] = modeling_utilities.pdq(data=sensor_array[snsr]['observed'])
    print(snsr + ' (p, d, q) = ' + str(all_pdq[snsr]))
    site_params[site][snsr]['pdq'] = all_pdq[snsr]

ARIMA = dict()
for snsr in sensors:
    ARIMA[snsr] = model_workflow.ARIMA_detect(df=sensor_array[snsr], sensor=snsr, params=site_params[site][snsr],
                                              rules=False, plots=False, summary=False, compare=False)
print('\nARIMA detection complete.\n')
temp (p, d, q) = [1, 1, 1]
cond (p, d, q) = [3, 1, 3]
ph (p, d, q) = [3, 1, 4]
do (p, d, q) = [4, 1, 2]

Processing ARIMA detections.
temp ARIMA model complete.
Threshold determination complete.
ratio of detections: 0.028556 %

Processing ARIMA detections.
cond ARIMA model complete.
Threshold determination complete.
ratio of detections: 0.177047 %

Processing ARIMA detections.
ph ARIMA model complete.
Threshold determination complete.
ratio of detections: 0.045689 %

Processing ARIMA detections.
do ARIMA model complete.
Threshold determination complete.
ratio of detections: 0.017134 %

ARIMA detection complete.


LSTM Detection

LSTM models create a neural network that uses a sequence of values to make a prediction. Settings and hyperparameters are defined in the parameters.
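
Sequencing the data for one-step-ahead prediction can be sketched as follows (a simplified stand-in for the package's create_sequenced_dataset; the reshape to (samples, time_steps, features) matches what an LSTM layer expects):

```python
import numpy as np

# Each sample is the preceding time_steps values; the target is the next point.
def to_sequences(values, time_steps=5):
    X, y = [], []
    for i in range(len(values) - time_steps):
        X.append(values[i:i + time_steps])
        y.append(values[i + time_steps])
    # Reshape to (samples, time_steps, features) for an LSTM layer.
    return np.array(X).reshape(-1, time_steps, 1), np.array(y)

X, y = to_sequences(np.arange(10, dtype=float))
print(X.shape, y.shape)   # (5, 5, 1) (5,)
```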

DATA: univariate, MODEL: vanilla

Uses a single sensor and data prior to predict the next point.

Functions called:

  • LSTM_detect_univar:
    • LSTM_univar:
      • create_scaler
      • create_sequenced_dataset
      • create_training_dataset
      • create_vanilla_model
      • train_model
    • set_dynamic_threshold
    • detect_anomalies
    • anomaly_events
LSTM_univar = dict()
for snsr in sensors:
    LSTM_univar[snsr] = model_workflow.LSTM_detect_univar(df=sensor_array[snsr], sensor=snsr, 
                                                          params=site_params[site][snsr], LSTM_params=LSTM_params, 
                                                          model_type=ModelType.VANILLA, rules=False, plots=False, 
                                                          summary=False, compare=False, model_output=False, 
                                                          model_save=False)
print('\nLSTM univariate detection complete.\n')
/Users/amber/.virtualenvs/LRO-anomaly-detection/lib/python3.7/site-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.
  and should_run_async(code)


Processing LSTM univariate ModelType.VANILLA detections.
1095/1095 [==============================] - 2s 2ms/step - loss: 1.3198
temp ModelType.VANILLA LSTM model complete.
Threshold determination complete.
ratio of detections: 0.031416 %

Processing LSTM univariate ModelType.VANILLA detections.
1095/1095 [==============================] - 2s 2ms/step - loss: 0.1020
cond ModelType.VANILLA LSTM model complete.
Threshold determination complete.
ratio of detections: 0.139944 %

Processing LSTM univariate ModelType.VANILLA detections.
1095/1095 [==============================] - 2s 1ms/step - loss: 0.1891
ph ModelType.VANILLA LSTM model complete.
Threshold determination complete.
ratio of detections: 0.014280 %

Processing LSTM univariate ModelType.VANILLA detections.
1095/1095 [==============================] - 2s 2ms/step - loss: 0.1516
do ModelType.VANILLA LSTM model complete.
Threshold determination complete.
ratio of detections: 0.002856 %

LSTM univariate detection complete.


DATA: univariate, MODEL: bidirectional

Uses a single sensor and data before and after to predict a point.

Functions called:

  • LSTM_detect_univar:
    • LSTM_univar_bidir:
      • create_scaler
      • create_bidir_sequenced_dataset
      • create_bidir_training_dataset
      • create_bidir_model
      • train_model
    • set_dynamic_threshold
    • detect_anomalies
    • anomaly_events
LSTM_univar_bidir = dict()
for snsr in sensors:
    LSTM_univar_bidir[snsr] = model_workflow.LSTM_detect_univar(df=sensor_array[snsr], sensor=snsr, 
                                                                params=site_params[site][snsr], LSTM_params=LSTM_params,
                                                                model_type=ModelType.BIDIRECTIONAL,rules=False, 
                                                                plots=False, summary=False, compare=False, 
                                                                model_output=False, model_save=False)
print('\nLSTM univariate bidirectional detection complete.\n')


Processing LSTM univariate ModelType.BIDIRECTIONAL detections.
1095/1095 [==============================] - 4s 4ms/step - loss: 1.2945
temp ModelType.BIDIRECTIONAL LSTM model complete.
Threshold determination complete.
ratio of detections: 0.039990 %

Processing LSTM univariate ModelType.BIDIRECTIONAL detections.
1095/1095 [==============================] - 5s 4ms/step - loss: 0.0739
cond ModelType.BIDIRECTIONAL LSTM model complete.
Threshold determination complete.
ratio of detections: 0.128538 %

Processing LSTM univariate ModelType.BIDIRECTIONAL detections.
1095/1095 [==============================] - 4s 4ms/step - loss: 0.1575
ph ModelType.BIDIRECTIONAL LSTM model complete.
Threshold determination complete.
ratio of detections: 0.022851 %

Processing LSTM univariate ModelType.BIDIRECTIONAL detections.
1095/1095 [==============================] - 6s 6ms/step - loss: 0.1609 - ETA: 4s - loss: 0.3184
do ModelType.BIDIRECTIONAL LSTM model complete.
Threshold determination complete.
ratio of detections: 0.005713 %

LSTM univariate bidirectional detection complete.


DATA: multivariate, MODEL: vanilla

Uses multiple sensors as inputs and outputs and data prior to predict the next point.

Functions called:

  • LSTM_detect_multivar:
    • LSTM_multivar:
      • create_scaler
      • create_sequenced_dataset
      • create_training_dataset
      • create_vanilla_model
      • train_model
    • set_dynamic_threshold
    • detect_anomalies
    • anomaly_events
LSTM_multivar = model_workflow.LSTM_detect_multivar(sensor_array=sensor_array, sensors=sensors, 
                                                    params=site_params[site], LSTM_params=LSTM_params, 
                                                    model_type=ModelType.VANILLA, rules=False, plots=False, 
                                                    summary=False, compare=False, model_output=False, model_save=False)
print('\nLSTM multivariate detection complete.\n')


Processing LSTM multivariate ModelType.VANILLA detections.
Raw data shape: (35019, 4)
Observed data shape: (35019, 4)
Initial anomalies data shape: (35019, 4)
1095/1095 [==============================] - 2s 2ms/step - loss: 0.4319
multivariate ModelType.VANILLA LSTM model complete.

ratio of detections: 0.039984 %
ratio of detections: 0.165648 %
ratio of detections: 0.014280 %
ratio of detections: 0.002856 %
Threshold determination complete.

LSTM multivariate detection complete.


DATA: multivariate, MODEL: bidirectional

Uses multiple sensors as inputs and outputs and data before and after to predict a point.

Functions called:

  • LSTM_detect_multivar:
    • LSTM_multivar_bidir:
      • create_scaler
      • create_bidir_sequenced_dataset
      • create_bidir_training_dataset
      • create_bidir_model
      • train_model
    • set_dynamic_threshold
    • detect_anomalies
    • anomaly_events
LSTM_multivar_bidir = model_workflow.LSTM_detect_multivar(sensor_array=sensor_array, sensors=sensors, 
                                                          params=site_params[site], LSTM_params=LSTM_params, 
                                                          model_type=ModelType.BIDIRECTIONAL, rules=False, plots=False, 
                                                          summary=False, compare=False, model_output=False, 
                                                          model_save=False)
print('\nLSTM multivariate bidirectional detection complete.\n')


Processing LSTM multivariate ModelType.BIDIRECTIONAL detections.
Raw data shape: (35019, 4)
Observed data shape: (35019, 4)
Initial anomalies data shape: (35019, 4)
1095/1095 [==============================] - 6s 5ms/step - loss: 0.4033 - ETA: 0s - loss: 0.4226
multivariate ModelType.BIDIRECTIONAL LSTM model complete.

ratio of detections: 0.034277 %
ratio of detections: 0.145677 %
ratio of detections: 0.014282 %
ratio of detections: 0.002856 %
Threshold determination complete.

LSTM multivariate bidirectional detection complete.


Aggregate Detections for All Models

Aggregates the results from all models.
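
The aggregation reduces to a logical OR across models: a point flagged by any model is flagged overall. A hypothetical sketch (not the aggregate_results implementation):

```python
import pandas as pd

# Hypothetical per-model flags for three time steps.
flags = pd.DataFrame({
    'ARIMA':         [False, True,  False],
    'LSTM_univar':   [False, False, True],
    'LSTM_multivar': [False, False, False],
})
detected = flags.any(axis=1)   # True wherever any model detects
print(detected.tolist())       # [False, True, True]
```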

Functions called:

  • aggregate_results
results_all = dict()
for snsr in sensors:
    models = dict()
    models['ARIMA'] = ARIMA[snsr].df
    models['LSTM_univar'] = LSTM_univar[snsr].df_anomalies
    models['LSTM_univar_bidir'] = LSTM_univar_bidir[snsr].df_anomalies
    models['LSTM_multivar'] = LSTM_multivar.all_data[snsr]
    models['LSTM_multivar_bidir'] = LSTM_multivar_bidir.all_data[snsr]
    results_all[snsr] = anomaly_utilities.aggregate_results(df=sensor_array[snsr], 
                                                            models=models, 
                                                            verbose=True, 
                                                            compare=False)

Correction

Correction is performed using piecewise ARIMA: small models are determined for each period of anomalous data. A forecast and a backcast, which consider data before and after the anomalous period, are merged together to ensure there are no discontinuities.
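
The merge step can be sketched as a cross-fade between the forecast and backcast over the anomalous period. The linear weights and data here are assumptions for illustration; the package's xfade may weight differently:

```python
import numpy as np

# Hypothetical forecast (cast forward from data before the gap) and backcast
# (cast backward from data after the gap) over a 5-point anomalous period.
forecast = np.array([10.0, 10.2, 10.4, 10.6, 10.8])
backcast = np.array([11.0, 11.0, 11.0, 11.0, 11.0])

w = np.linspace(0, 1, len(forecast))   # 0 = all forecast, 1 = all backcast
blended = (1 - w) * forecast + w * backcast
print(blended)
```

The blend starts at the forecast value and ends at the backcast value, so the correction joins the good data on both sides without a discontinuity.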

Functions called:

  • generate_corrections
    • group_bools
    • ARIMA_group
    • xfade
corrections = dict()
for snsr in sensors:
    corrections[snsr] = ARIMA_correct.generate_corrections(df=results_all[snsr], 
                                                           observed='observed', 
                                                           anomalies='detected_event', 
                                                           savecasts=True)
    print(snsr + ' correction complete.')
/Users/amber/.virtualenvs/LRO-anomaly-detection/lib/python3.7/site-packages/pmdarima/arima/auto.py:460: UserWarning: Input time-series is completely constant; returning a (0, 0, 0) ARMA.
  warnings.warn('Input time-series is completely constant; '

temp correction complete.
cond correction complete.
ph correction complete.
do correction complete.