Detecting Anomalies for Failed Messages

Last Updated On January 01, 2018

Prerequisites

Using anomaly detection for BizTalk or .NET metrics requires the following:

  • ESB Feature Pack: required for collecting metrics from BizTalk Server analytic events and emitting telemetry related to ESB workloads. The steps to configure the BizTalk analytics provider for InfluxDb are described here.
  • Trace Agent (“Hydrofone”): required for collecting ETW telemetry from BizTalk, WCF and WF applications. This agent is included in ESB Feature Pack and is deployed as a Windows service.
  • InfluxDb
  • Kapacitor

Common Scenarios

The following metrics may be considered for anomaly detection:

  • WCF, WF events that represent performance, faults or errors.
  • Tracking data intercepted by ESB Feature Pack from BizTalk analytic events.
  • Metrics from ESB Feature Pack for failed messages, service level agreements for consumed external services, and/or itinerary services.
  • Performance counters or any other metrics collected using TICK stack.

Design Considerations

ESB Feature Pack provides integration for storing the telemetry it collects in time series databases. Hydrofone can be used to collect telemetry from ETW with or without BizTalk Server. This telemetry can contain events that represent BizTalk tracking data, ESB, WCF, WF or IIS activity. Collected telemetry is processed by the InfluxData TICK stack (InfluxDb and Kapacitor): the InfluxDb time series database stores the time series, and Kapacitor is configured to run anomaly detection queries and alerting. The diagram below shows the relationship between BizTalk Server, the TICK stack and ESB Feature Pack:

Anomaly detection tasks can be configured in Kapacitor as either stream or batch tasks. In streaming scenarios, InfluxDb sends time series data to Kapacitor nodes as it is written. In batch scenarios, a given Kapacitor node pulls data from InfluxDb on a periodic basis. For most BizTalk and .NET related scenarios, batch mode should be sufficient for anomaly detection, and it simplifies deployment with redundant Kapacitor nodes.

Implementation Considerations

In the example scenario, a corporate development team needs to be alerted when failed message routing occurs in BizTalk. Because the BizTalk farm is shared by multiple teams, alerts should be generated only for specific values of a BizTalk artifact and/or a context property present in the message. In this example, the name of the BizTalk receive port or receive location is used to generate alerts.

The first option is to use BizTalk analytic tracing events intercepted by ESB Feature Pack. These events are sent to InfluxDb over UDP by the analytic events provider included in the pack. To analyze these events, create a new file named bts_failedrl_alert.tick with the following streaming query to be used in Kapacitor:

Kapacitor Query
// Evaluate each BizTalk analytic event as it arrives
stream
    |from()
        .measurement('Infragravity.ESB.All.Exceptions_Receive')
        .groupBy('Computer')
        .where(lambda: "InboundTransportLocation (http://schemas.microsoft.com/BizTalk/2003/system-properties)" == '/ESBP.ItineraryServices.WCF/ProcessItinerary.svc')
    // Every matching event raises a critical alert in Slack
    |alert()
        .id('FailedMessage')
        .message('Instance failed {{index .Tags "Computer"}} : {{index .Fields "ActivityIdentity (http://schemas.microsoft.com/BizTalk/2003/messagetracking-properties)"}}')
        .crit(lambda: TRUE)
        .slack()
The above query directs Kapacitor to analyze tracking data for the port that subscribes to failed message routing. An alert is sent to a Slack channel when the event property matches the specified receive location. Next, open a command prompt in the directory where this file is stored, then define and enable the task in Kapacitor:

Kapacitor Query
>kapacitor define bts_failedrl_alert -tick bts_failedrl_alert.tick -type stream -dbrp tddsdb.autogen
>kapacitor enable bts_failedrl_alert
The first command registers a new Kapacitor task that receives metrics from the retention policy ‘autogen’ in the time series database named ‘tddsdb’. This database contains all BizTalk analytic events with tracking data. The second command enables the task. The configuration of the ESB Feature Pack analytics provider for InfluxDb is described here.
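To make the UDP transport mentioned above more concrete, the following minimal Python sketch formats a point in InfluxDB line protocol and sends it as a UDP datagram. It is an illustration only and not the ESB Feature Pack provider's code. The host name, the port 8089 (the usual default for an InfluxDB 1.x UDP listener) and the sample tag and field values are assumptions.

```python
import socket


def _escape(text, chars):
    """Backslash-escape each character in chars, in the order given."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; escape backslashes first, then quotes
    return '"' + _escape(str(value), ("\\", '"')) + '"'


def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement[,tag=value...] field=value[,...] [timestamp]."""
    key_chars = (",", "=", " ")
    tag_part = "".join(
        f",{_escape(k, key_chars)}={_escape(v, key_chars)}"
        for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, key_chars)}={_format_field(v)}" for k, v in fields.items()
    )
    line = f"{_escape(measurement, (',', ' '))}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


def send_udp(line, host="localhost", port=8089):
    """Send one line as a UDP datagram (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical point shaped like the data the tasks in this article query
example = to_line(
    "MessageFailed",
    {"Computer": "BTS01"},
    {"Container": "Infragravity.OnRamp.WCF"},
)
```

Because UDP is connectionless, `send_udp` returns without any confirmation that InfluxDb received the point.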

The second query, shown below, performs anomaly detection on telemetry sent by Hydrofone to the InfluxDb time series database. Create a new file named esb_failedrl_alert.tick with the following content:

Kapacitor Query
batch
    |query('''
        SELECT count(E2EActivityId) as count
        FROM "testdb"."autogen"."MessageFailed"
        WHERE Container = 'Infragravity.OnRamp.WCF'
    ''')
        .period(2m)
        .every(1m)
        .groupBy(time(1m, -10s), 'Computer')
        .align()
        .offset(10s)
    |alert()
        .id('FailedMessage')
        .message('Receive failed {{index .Tags "Computer"}} : {{index .Fields "count"}}')
        .warn(lambda: "count" > 1)
        .crit(lambda: "count" > 5)
        .slack()
The batch query shown above analyzes the failed message rate for a given receive port and sends a warning or critical alert to a Slack channel when the rate exceeds the preconfigured limits. Next, open a command prompt in the directory where this file is stored, then define and enable the task in Kapacitor:

Kapacitor Query
>kapacitor define esb_failedrl_alert -tick esb_failedrl_alert.tick -type batch -dbrp testdb.autogen
>kapacitor enable esb_failedrl_alert
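
The threshold logic of the batch task above can be sketched in a few lines of Python. This is a simplified illustration, not Kapacitor's implementation: it counts failed-message events per one-minute bucket and per computer, then maps each count to a level using the same limits as the .warn() and .crit() expressions. The computer names and timestamps are hypothetical, and the -10s bucket offset used in the task is ignored for brevity.

```python
from collections import Counter

WARN_ABOVE = 1  # mirrors .warn(lambda: "count" > 1)
CRIT_ABOVE = 5  # mirrors .crit(lambda: "count" > 5)


def count_per_minute(events):
    """Count events per (minute bucket start, computer).

    events: iterable of (epoch_seconds, computer) pairs.
    """
    counts = Counter()
    for timestamp, computer in events:
        bucket_start = timestamp - timestamp % 60
        counts[(bucket_start, computer)] += 1
    return counts


def alert_level(count, warn_above=WARN_ABOVE, crit_above=CRIT_ABOVE):
    """Return the most severe level whose threshold the count exceeds."""
    if count > crit_above:
        return "CRITICAL"
    if count > warn_above:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical failed-message events: (epoch seconds, computer name)
    sample = [(0, "BTS01"), (12, "BTS01"), (30, "BTS01"), (45, "BTS01"), (70, "BTS02")]
    for (bucket, computer), count in sorted(count_per_minute(sample).items()):
        print(bucket, computer, count, alert_level(count))
```

With these limits, a single failed message in a bucket does not raise an alert, two to five raise a warning, and six or more raise a critical alert.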

As you can see, the above queries use two different types of telemetry collected by ESB Feature Pack and sent to the TICK stack. The same approach can be applied to any metric stored using the TICK stack.

After failed messages were generated for the BizTalk receive port and location defined in the tasks above, the following alerts were sent by Kapacitor to the Slack channel:

The messages shown in Slack indicate that the batch query sent an alert for four failed messages and later changed its state to info due to the decreased rate of failed messages over time. Note that a similar method can be applied to measure service level agreements using metrics from IIS, WCF or WF. The query below measures the service level agreement for WCF services on BizTalk Server and sends alerts when thresholds are exceeded:

Kapacitor Query
batch
    // Select just duration from WCF OperationCompleted event
    |query('''
        SELECT mean("Duration") as theDuration FROM "appfdb"."autogen"."OperationCompleted"
    ''')
        .period(1m)
        .every(1m)
        .groupBy('Computer', 'EventSource')
    |eval(lambda: sigma("theDuration"))
        .as('sigma')
        .keep()
    |alert()
        .id('Wcf SLA')
        .message('{{.Level}}:{{ .Name }}:Investigate {{index .Tags "Computer"}}:{{index .Tags "EventSource"}}:{{index .Fields "theDuration"}} ms')
        .info(lambda: "theDuration" < 1000 OR "sigma" < 1000)
        .warn(lambda: "theDuration" > 1000 OR "sigma" > 1000)
        .crit(lambda: "theDuration" > 4000 OR "sigma" > 4000)
        .stateChangesOnly()
        .slack()

This Kapacitor batch task computes the mean response time of WCF service operations every minute, grouped by computer and event source, and sends alerts only when the alert state changes.
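
The sigma() function used in the task above returns the number of standard deviations by which a value differs from the running mean, so its result is a dimensionless ratio rather than a duration in milliseconds. The Python sketch below shows one way such a value can be computed, using Welford's online algorithm. It is an illustrative assumption and is not taken from Kapacitor's source; the class name and sample durations are hypothetical.

```python
import math


class RunningSigma:
    """Tracks a running mean and variance with Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Add x to the statistics and return its distance from the
        updated running mean, in standard deviations."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < 2:
            return 0.0  # not enough data to estimate a deviation
        variance = self.m2 / (self.n - 1)  # sample variance
        if variance == 0.0:
            return 0.0  # all values identical so far
        return abs(x - self.mean) / math.sqrt(variance)


if __name__ == "__main__":
    tracker = RunningSigma()
    # Hypothetical mean durations (ms), one per evaluation interval
    for duration in (120.0, 130.0, 125.0, 128.0, 900.0):
        print(duration, round(tracker.update(duration), 2))
```

In this sketch a sudden jump in duration produces a sigma value well above that of the preceding, stable samples, which is the kind of deviation an alert threshold on sigma is meant to catch.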

Summary

The following benefits and liabilities should be considered when using the TICK stack and its integration with ESB Feature Pack for anomaly detection:

Benefits

  • Increased data privacy for collected metrics.
  • A common approach to anomaly detection with the TICK stack, using metrics collected from any software product or platform. This method enables alerting based on predictive analytics, thresholds or the absence of signals for collected metrics.
  • Ability to tailor anomaly detection queries for metrics represented by time series with Kapacitor included in InfluxData TICK stack.
  • Flexibility to analyze metrics using prediction algorithms supported by Kapacitor and multiple alert notification channels.
  • Ability to leverage additional telemetry that does not exist in BizTalk Server and store it in a time series database on-premises or in the cloud for further analysis.
  • Ability to intercept BizTalk tracking data as analytic events using ESB Feature Pack and store it in a time series database on-premises or in the cloud.
  • Interception of analytic events from ETW (BizTalk ESB, WCF, WF) using Hydrofone for streaming them to a time series or relational database on-premises or in the cloud.
  • Collecting any performance counters using Telegraf included in InfluxData TICK stack.

Liabilities

  • Deploying the TICK stack for telemetry collection and anomaly detection requires additional planning.
  • The current version of ESB Feature Pack supports only UDP as the transport protocol for sending BizTalk analytic events as tracking data to a time series database.
  • Event Tracing for Windows (ETW) is non-transactional, so events may be lost upon server failure.