Connectivity Monitoring SCOM management pack. The making of.

Part One. Why?

In the world around us, everything is online: a web site, a REST API endpoint, a cloud store, a status page, an Azure function, and many other online resources. If these online items are part of your distributed application, or you own them and provide them as a service to other people, it is crucial to monitor these resources’ availability and alert on any issues. Most likely there is already some sort of monitoring on each of these resources’ backend, but end-user-like testing is always the final truth (“It works on my computer” was never the best answer from a support person, eh?). Thus, I was looking for a Swiss-army-knife-like solution covering the whole variety of end-user-like connectivity monitoring: from a simple ping, through secure connection verification, to synthetic transactions. A comprehensive, easy to configure, and intuitive to use solution is required.

Previously, I could name three solutions for that purpose. The first one is the most popular “ping” management pack from the leading management pack development company OpsLogix. It has an intuitive UI and is easy to configure, but unfortunately it can only do a ping probe. The second one is the community-driven URLGenie. It’s a great management pack for Web-resource availability testing, but the configuration part is very specific. And the final one is SCOM’s native synthetic transaction monitoring. It’s the most flexible and comprehensive testing, but honestly, who knows how to configure it? Another downside of synthetic transaction monitoring is alerting: it’s very unclear.

Part Two. A new player.

Summarizing my experience with the current solutions, I made the following manifesto for an ideal connectivity testing management pack:

  • All configuration must be done via a UI embedded into the SCOM Console application. In particular, the UI must:
    • Add/Remove new targets.
    • Turn on/off specific tests.
    • Configure tests (like ping, TCP connect, HTTP probe, JSON probe, etc.)
    • Add/Remove tests.
  • No configuration, or only a very small portion of it, should be done via overrides.
  • Flexibly choose the probe source (i.e. don’t lock tests to management servers as the connection source).
  • A template mechanism to enable/modify tests for multiple targets at once.
  • Support extensions/add-ons (i.e. other developers can add new tests).

These requirements came from my observations of user experience. Being a SCOM administrator/developer, I noticed that the SCOM Console is very counterintuitive for people new to SCOM. Don’t get me wrong, it’s an awesome application for anyone who is experienced with SCOM and knows what/where/why. But for any newbie, it’s a nightmare. Same with configuration: if enabling a management pack requires applying certain overrides, most likely this MP will be abandoned, because people expect plug-and-play behavior, or at least GUI/wizard-based configuration. It’s a blessing and a curse of the Windows world. On the Linux side you always have to read the documentation, because all configuration is written; on the Windows side, most applications are configured via a GUI, so people don’t expect to go looking for “hidden gems” that aren’t visible right away. In other words, if anything doesn’t work out of the box (because its configuration is hidden and a careful read of the documentation is required to make it work), then it’s broken.

Therefore, a big proportion of my efforts at my current position went into making the configuration of my management packs as easy and as intuitive as possible. This one should not be an exception.

Part Three. Design.

The design of a SCOM management pack starts with defining its classes and their relations to other new or existing classes (or re-using existing ones). If you’re not familiar with SCOM objects, classes, and class instances, please refer to “Understanding Classes and Objects”: https://docs.microsoft.com/en-us/previous-versions/system-center/system-center-2012-r2/hh457568(v=sc.12).

In overview, there are three sets of management pack abstractions built on existing pieces of SCOM infrastructure and, on top of that, a requirement to support templates:

  • sources, which are Health Services (aka SCOM Agents) or Resource Pools
  • destinations, which are test targets
  • tests, which are probes executed at sources against destinations

To sort out the template strategy, I made two projections. In the first projection, the base is a test; then, for each test, a number of destinations can be defined. The other projection has a destination as the base; then a number of tests is defined for each destination. A source-based projection is not required, because the base class (whether test- or destination-based) will be freely movable between sources. Both projections have pros and cons, but neither of them removes the template requirement. The test-based projection allows defining the same test for multiple destinations (i.e. templating), but it breaks when different sources are required. Therefore, after certain considerations, I decided to go with the destination-based projection. In other words, I’m going to design the base class as a destination, and then test classes will be hosted on it and will enable particular tests.

So, first of all, I needed a base target class. Initially I was thinking about a class which would have a URL as a property, or, perhaps, as a key. A URL means a string like schema://host.name/path/subpath. But considering that not all tests need a full path or schema, and that some tests will need a different path portion, I decided to make the base class contain only a host name or FQDN (fully qualified domain name). All tests where a schema and/or path is required will have it as a part of their configuration.

The base class will be unhosted. This is to meet the “flexible source” requirement. If we don’t host our class on another class (effectively, don’t host it on a particular computer), then we can change the serving agent (or agent pool) on the fly, which means changing the test source on the fly. This is a very big difference from both the OpsLogix and URLGenie management packs. In those solutions, when a target class instance is created, it is hosted on a particular computer. That means a target cannot be moved to another source without deleting and re-creating it (and of course losing any overrides applied). Therefore, having our target class unhosted lets us move it between different SCOM agents by submitting a relationship between our instance and the desired Management Action Point class (which represents an agent or a pool).

So far I have: the base class represents an FQDN, and the base class is unhosted. The next design choice is the key field. An obvious choice is the FQDN value itself. It makes lots of sense: if a source can connect to its destination FQDN address on one protocol, it should be OK to enable another protocol. It also makes sense to allow only one registered destination to avoid duplicates, when an administrator adds the same destination again to add another test, instead of reusing an existing destination and just adding the test. However, there are some situations when hypothetical users of my MP would need duplicate FQDNs, for example in a split-brain scenario, when the same resource is published internally and externally under the same name. Therefore, it’s better not to constrain the FQDN value to be unique. But what are the other choices? A unique name the user must invent? A generated GUID? Both options are not user friendly, and to be honest, it should not be “no constraint”, but a constraint which can be suppressed. Therefore, I decided to make a compound key from two fields: FQDN and an order number. The order number allows adding duplicates where required, but makes accidental duplicates harder. As a result, the final base class definition for FQDN is the following:

<ClassType ID="Maximus.Connectivity.Monitoring.FullyQualifiedDomainName" Accessibility="Public" Base="System!System.Entity" Abstract="false"
           Hosted="false" Singleton="false"
           Comment="The root class to define a destination address for other test classes hosted at this.">
  <!-- Inherited property 
  <Property ID="DisplayName" Type="string" AutoIncrement="false" Key="false" CaseSensitive="false" MaxLength="4000" MinLength="0" Required="false" Scale="0" />
  -->
  <Property ID="FullyQualifiedDomainName" Type="string" Key="true" Comment="Just FQDN without any protocol prefixes."/>
  <Property ID="TargetIndex" Type="int" Key="true" DefaultValue="0" Comment="Enables FQDN duplicates when required." />
  <Property ID="Description" Type="string" MaxLength="1024" />
</ClassType>

The next step is to design the test classes. The test classes will contain test settings (like a TCP port number or timeout), and the existence of an instance of a test class will enable a particular test probe, or probes if the test subject is complex. This is how I’m going to achieve flexible GUI-based configuration. However, before I can start designing the test classes, I need to think about the template implementation. After certain considerations, I didn’t find a way to make templates override test settings automatically. Therefore, all I need is a template reference (i.e. a unique name) in each test settings object. Having this, the GUI will disable edit controls for such an object. On the other hand, when a template is edited, it will be the editor’s task to propagate all template settings to the referencing test objects (a sketch of this propagation follows the base class definition below). This should keep all “templated” test objects consistent, unless the editor (i.e. SCOM Console) crashes or test objects are edited externally. At this stage, I only plan to implement support for single test object templates, i.e. to implement a particular test at a destination, a user will need to create a test object and link it to a template. Later, I may think about implementing a test set template applicable to a destination, i.e. a template enabling multiple tests for all linked destinations.

So, to make all test objects able to reference a template, I need an abstract base class with a template reference property for all test classes to inherit from. This is how I’m going to ensure that all test classes have a compulsory template reference. On top of that, having a base class gives a future ability to query for all tests. I also decided to add a few common test settings, like the suppression settings MatchCount and SampleCount (though each monitor associated with a test object may not use them). The next common parameter is the interval between tests: IntervalSeconds. This also gives an advantage for cookdown: as there are no monitor overrides (except actual thresholds), all test settings, including the interval between tests, will always be the same for any monitors and rules. See my other post for cookdown details: https://maxcoreblog.com/2020/08/05/implementing-scom-managed-modules-part-2/. Another interesting common property I’m adding is CollectPerformanceData. Its purpose is to turn on and off all performance collection rules targeting a test class. Controlling this via the GUI is more comfortable than applying an enable/disable override to certain performance collection rules (which also need to be located in a list of dozens).

And the final decision is the key for the test classes. First I planned to have no key in the base class and let key(s) be defined in child classes if required. Say, a ping test class may need no key, as multiple pings aren’t required, while a TCP test may have a port number as a key to avoid duplicates. But later I decided to make a single key in the base class: TestId of guid type. This allows creating, say, multiple ping instances with different buffer sizes, for whatever reason. On the other hand, child classes may still declare additional keys, like in the example with the TCP port, or just allow duplicates. As a result, the base abstract class looks like the below:

<ClassType ID="Maximus.Connectivity.Monitoring.Test" Accessibility="Public"
           Base="System!System.Entity" Abstract="true"
           Hosted="true" Singleton="false"
           Comment="Base class for all tests. Has template name.">
  <Property ID="TestId" Type="guid" Key="true"/>
  <Property ID="TemplateReference" Type="string" Comment="Template reference"/>
  <Property ID="IntervalSeconds" Type="int" DefaultValue="600" />
  <Property ID="MatchCount" Type="int" DefaultValue="3"/>
  <Property ID="SampleCount" Type="int" DefaultValue="4"/>
  <Property ID="CollectPerformanceData" Type="bool" DefaultValue="false"/>
</ClassType>
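
Since templates did not make it into the first release, here is only a hedged C# sketch of the editor-side propagation described above (the helper itself is hypothetical; the SDK calls are standard): find every test object referencing the edited template, overwrite its base-class settings, and submit the changes as an incremental discovery.

// Hypothetical sketch: propagate template settings to referencing test objects.
// TemplateReference and the other property names come from the class definition
// above; this helper only covers base-class settings like IntervalSeconds.
using System.Collections.Generic;
using Microsoft.EnterpriseManagement;
using Microsoft.EnterpriseManagement.Common;
using Microsoft.EnterpriseManagement.Configuration;
using Microsoft.EnterpriseManagement.ConnectorFramework;

static class TemplatePropagation
{
  public static void Propagate(ManagementGroup mg, string templateName, IDictionary<string, object> templateSettings)
  {
    ManagementPackClass testBase = mg.EntityTypes.GetClasses(
      new ManagementPackClassCriteria("Name = 'Maximus.Connectivity.Monitoring.Test'"))[0];

    // Find all test objects linked to the edited template.
    var criteria = new EnterpriseManagementObjectCriteria(
      $"TemplateReference = '{templateName}'", testBase);
    var discovery = new IncrementalDiscoveryData();
    foreach (EnterpriseManagementObject testObject in
             mg.EntityObjects.GetObjectReader<EnterpriseManagementObject>(criteria, ObjectQueryOptions.Default))
    {
      foreach (KeyValuePair<string, object> setting in templateSettings)
        testObject[testBase, setting.Key].Value = setting.Value; // overwrite with template values
      discovery.Add(testObject);
    }
    discovery.Commit(mg); // incremental discovery updates existing instances by key
  }
}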

And now, it’s time to make the very first and easiest test class — the ping test. It looks like the following:

<ClassType ID="Maximus.Connectivity.Monitoring.Test.Ping" Accessibility="Public"
           Base="Maximus.Connectivity.Monitoring.Test" Abstract="false"
           Hosted="true"
           Comment="Enables Ping test and defines additional parameters.">
  <Property ID="MaxTTL" Type="int" DefaultValue="128"/>
  <Property ID="BufferSize" Type="int" DefaultValue="32"/>
  <Property ID="Timeout" Type="int" DefaultValue="5000"/>
  <Property ID="DontFragment" Type="bool" DefaultValue="true" />
</ClassType>
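
These four properties map almost one-to-one onto the .NET ping API. For reference, here is a minimal C# sketch of the core probe logic (assumed; the actual PingPA module lives in the repository linked at the end of this post):

// Minimal sketch of the ping probe core (assumed, not the actual PingPA code).
// MaxTTL, BufferSize, Timeout and DontFragment are the class properties above.
using System.Net.NetworkInformation;
using System.Text;

static class PingProbeCore
{
  public static PingReply Probe(string fqdn, int maxTtl, int bufferSize, int timeout, bool dontFragment)
  {
    using (Ping ping = new Ping())
    {
      PingOptions options = new PingOptions(maxTtl, dontFragment); // TTL and DF flag
      byte[] buffer = Encoding.ASCII.GetBytes(new string('a', bufferSize)); // payload of BufferSize bytes
      return ping.Send(fqdn, timeout, buffer, options); // timeout is in milliseconds
    }
  }
}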

Part Four. Implementation: GUI.

Now, I have the bare minimum to start implementing the destination and test editor. First, I decided to merge the editor and the status view in one control. That decision also means that view and edit interfaces will be available in the Monitoring section of the SCOM Console. To be honest, this is against Microsoft design guidelines: adding, removing, and editing interfaces should be in the Administration section, so that users who are not full administrators won’t see them. It is also against the RBAC (Role Based Access Control) model, and may actually cause exceptions when non-admin users add objects, or the newly added objects won’t be visible to the current user, or both. However, I made this decision for two reasons. The first and main one is to make it more intuitive. The second reason is that editing test objects is closer to making overrides than to creating new entities (compared to adding new monitored servers, for instance).

So, for the visual design, I’m going to use a grid view with a details pane, where the master view shows the list of destinations, and the details pane shows the associated test objects. Editing commands will be available in the Tasks pane and in the control’s context menu. The details pane should show tips for new destination/test creation when no object is available/selected. I’m going to use the https://github.com/MaxxVolk/Maximus.Base.Library library to ease the development of the UI. In particular, the view control is inherited from the SimpleGridViewWithDetails class, and the details pane class is a child of the SimpleGridViewDetailsPlane class. Leveraging these base classes, I can get something into the SCOM Console very quickly, just by implementing a few abstract methods.

After a few days of development (one of which was wasted in a failed attempt to crack how GridControl works), I got the following UI:

Final user interface for Connectivity monitoring UI.

In the screenshot above, the first area is a new folder added by the management pack to the SCOM Monitoring section. The second area is the list of destinations. The third area, which is the details pane, is a list of all test objects associated with the selected destination. Finally, the fourth area is part of the standard Tasks pane, but with new commands, which allow editing destinations and tests. There is also a context menu associated with the master view, as in the screenshot below:

Master View context menu.

Next, I needed a way to add/edit/remove destination and test objects. Although the base library has a data-binding-compatible SCOM class instance editor, which can easily be bound to a data control like DataGridView, I decided to go with individual forms and do incremental discoveries on the form submission event for each object. The first reason is to avoid loading all existing destination objects into the editor (and I expect many of them) when a single destination needs to be added or altered. There is also another, more important reason, but I will touch on it later when describing the test object editor.

The destination object class is fairly simple and doesn’t have many properties, but the main challenge here is to make an interface to assign a managing SCOM agent, which will run all the tests related to a destination object. I also decided to go further and give a choice to use either an individual SCOM agent or a pool of gateway/management servers. This is yet another reason not to use the editor from the standard library: it can support only one management action point type at a time, i.e. either Resource Pools or individual Agents, but not both. This is because the SCOM backend uses two different relationships to make pools or agents manage an entity: ‘Microsoft.SystemCenter.HealthServiceShouldManageEntity‘ for individual agents and ‘Microsoft.SystemCenter.ManagementActionPointShouldManageEntity‘ for resource pools. Ironically enough, they both result in ‘Microsoft.SystemCenter.ManagementActionPointManagesEntity‘, which reflects the actual management point. But the ManagementActionPointManagesEntity relationship cannot be inserted directly; it’s a result of processing the HealthServiceShouldManageEntity and ManagementActionPointShouldManageEntity relationships. So, the new destination form has two main areas: one to edit the object itself and one to select an action point for it. Plus lots of pre-checking and validation code. It looks like below:

New Destination form.

Note that after a destination is created, its FQDN and Index properties cannot be changed, because they are keys. The only way to change the FQDN while keeping all tests would be to export, edit the exported data, and then import it back. But on the bright side, it’s possible to move the action point between resource pools or agents, or change the action point type from pools to agents and back.
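
For reference, the form’s submission logic boils down to the following SDK calls (a hypothetical C# sketch; the real form also handles validation and the resource pool variant via ManagementActionPointShouldManageEntity): create the unhosted instance, link it to a Health Service with the “should manage” relationship, and submit both via incremental discovery.

// Hypothetical sketch of the destination creation logic (assumed helper names).
using Microsoft.EnterpriseManagement;
using Microsoft.EnterpriseManagement.Common;
using Microsoft.EnterpriseManagement.Configuration;
using Microsoft.EnterpriseManagement.ConnectorFramework;

static class DestinationWriter
{
  public static void CreateDestination(ManagementGroup mg, EnterpriseManagementObject healthService,
    string fqdn, int index)
  {
    ManagementPackClass destinationClass = mg.EntityTypes.GetClasses(
      new ManagementPackClassCriteria("Name = 'Maximus.Connectivity.Monitoring.FullyQualifiedDomainName'"))[0];

    // The class is unhosted, so the instance is created standalone.
    var destination = new CreatableEnterpriseManagementObject(mg, destinationClass);
    destination[destinationClass, "FullyQualifiedDomainName"].Value = fqdn;
    destination[destinationClass, "TargetIndex"].Value = index;

    // HealthServiceShouldManageEntity tells SCOM which agent runs the workflows;
    // the engine then derives ManagementActionPointManagesEntity on its own.
    ManagementPackRelationship shouldManage = mg.EntityTypes.GetRelationshipClasses(
      new ManagementPackRelationshipCriteria("Name = 'Microsoft.SystemCenter.HealthServiceShouldManageEntity'"))[0];
    var relationship = new CreatableEnterpriseManagementRelationshipObject(mg, shouldManage);
    relationship.SetSource(healthService);
    relationship.SetTarget(destination);

    // Submit only the new items, without a full discovery snapshot.
    var discovery = new IncrementalDiscoveryData();
    discovery.Add(destination);
    discovery.Add(relationship);
    discovery.Commit(mg);
  }
}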

The next big design decision was about test objects. Initially, I was thinking about individual editors for each test type. This approach has an advantage: I could group properties of the same type together making editing more intuitive, put in more tips and help suggestions, etc. But I was too lazy to make all the individual forms. And laziness, as we know, is the engine of progress. Therefore, I decided to go with a single editor for all test objects, which obtains class properties and other information dynamically via the SCOM SDK. And this decision turned out to be genius — I only realized its consequences later. Making the UI and editor compatible with ALL test object classes inherited from the abstract test base class makes my management pack extendable. What does that mean? It means that anyone can write another management pack implementing just a new test class and rules/monitors targeting that test class. They will not need to implement any UI, because the existing UI from my management pack dynamically enumerates all children of the base test class. Therefore, any test classes from other management packs will “magically” appear in the base MP UI. This means anyone can develop an extension to my MP; this is why it’s extendable.
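
Such dynamic enumeration is straightforward with the SDK. Here is a hypothetical C# sketch (assuming nothing beyond the abstract base class name) of how concrete test classes, including those from extension MPs, can be listed:

// Hypothetical helper (assumed, not the actual MP source): list all concrete
// test classes derived from the abstract base test class, including classes
// shipped by extension management packs.
using System.Collections.Generic;
using Microsoft.EnterpriseManagement;
using Microsoft.EnterpriseManagement.Configuration;

static class TestClassEnumerator
{
  public static List<ManagementPackClass> GetTestClasses(ManagementGroup mg)
  {
    ManagementPackClass baseTest = mg.EntityTypes.GetClasses(
      new ManagementPackClassCriteria("Name = 'Maximus.Connectivity.Monitoring.Test'"))[0];

    var result = new List<ManagementPackClass>();
    foreach (ManagementPackClass candidate in mg.EntityTypes.GetClasses())
    {
      if (candidate.Abstract)
        continue;
      // Walk the inheritance chain up towards System.Entity looking for the base test class.
      for (ManagementPackClass parent = candidate.Base?.GetElement();
           parent != null;
           parent = parent.Base?.GetElement())
      {
        if (parent.Id == baseTest.Id)
        {
          result.Add(candidate); // this class will appear in the 'new test' UI
          break;
        }
      }
    }
    return result;
  }
}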

But to make it real, I needed a good component to edit object properties, and a way to translate a SCOM object into a suitable interface consumable by that component. Luckily for me, .NET has all the tools for this. First, there is the PropertyGrid control (https://docs.microsoft.com/en-us/dotnet/api/system.windows.forms.propertygrid?view=net-5.0). It’s the same control Visual Studio uses to edit object properties in the Designer. And there are a few attributes and the ICustomTypeDescriptor interface to make a “translator” from any other shape into a form the PropertyGrid control understands. The resulting new test object form looks very familiar to anyone who has used Visual Studio:

New/Edit Test Object form.

The form features three main areas. The first is a description of the test class. Yes, it’s simply the class description value from the management pack’s language pack. The SCOM SDK has a mechanism similar to Reflection in C#, so all class definitions and metadata can be read programmatically. That also sets a requirement for authors of extending management packs to write proper descriptions for all classes and their properties. The second area is the actual property editor. An implementation of the ICustomTypeDescriptor interface and a custom PropertyDescriptor ensures that all SCOM properties have the correct data type, name, description, and category to present them well in the UI. It also hides some auxiliary properties, like references to the destination object. Finally, the third area shows a description of the currently edited property. Again, this comes from the description element of a management pack. For example:

<DisplayString ElementID="Maximus.Connectivity.Monitoring.Test.SSL">
  <Name>Connectivity Test - Secure Socket Layer Connection</Name>
  <Description>When created, this object will add the 'Certificate Expiration Monitor', 'Allowed SSL Protocol Monitor', 'Certificate Policy Monitor', 'Certificate Validity Monitor', 'Disabled SSL Protocols Monitor' and 'Weak Certificate Hash Signature Algorithm Monitor' monitors. Note, that certificate expiration warning threshold is set in monitor overrides, not in the object property. Default threshold is 15 days. The 'Collect Performance Data' option is not supported.</Description>
</DisplayString>
<DisplayString ElementID="Maximus.Connectivity.Monitoring.Test.SSL" SubElementID="ApplicationPolicy">
  <Name>Application Policy OIDs</Name>
  <Description>List of OIDs for X509 Chain Application Policy</Description>
</DisplayString>
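
To give an idea of the “translator”, here is a heavily simplified C# sketch (assumed, not the MP’s actual code) of a custom PropertyDescriptor that surfaces a single SCOM property, taking its display name and description from exactly this kind of language pack metadata:

// Simplified sketch (assumed) of a custom PropertyDescriptor exposing one
// SCOM object property to the PropertyGrid control.
using System;
using System.ComponentModel;
using Microsoft.EnterpriseManagement.Common;
using Microsoft.EnterpriseManagement.Configuration;

class ScomPropertyDescriptor : PropertyDescriptor
{
  private readonly EnterpriseManagementObject _instance;
  private readonly ManagementPackProperty _property;

  public ScomPropertyDescriptor(EnterpriseManagementObject instance, ManagementPackProperty property)
    : base(property.Name, null)
  {
    _instance = instance;
    _property = property;
  }

  // Name and description come from the MP language pack metadata, which is why
  // extension authors must provide proper <DisplayString> elements.
  public override string DisplayName => _property.DisplayName ?? _property.Name;
  public override string Description => _property.Description ?? string.Empty;

  public override Type ComponentType => typeof(EnterpriseManagementObject);
  public override Type PropertyType => typeof(string); // simplified: map MP property types properly in real code
  public override bool IsReadOnly => _property.Key;    // key properties cannot change after creation

  public override object GetValue(object component) => _instance[_property].Value;
  public override void SetValue(object component, object value) => _instance[_property].Value = value;
  public override bool CanResetValue(object component) => false;
  public override void ResetValue(object component) { }
  public override bool ShouldSerializeValue(object component) => false;
}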

So, this self-reflecting approach was the main factor in avoiding the editor from the base library. That editor requires an object adapter class to be defined for each edited SCOM class, therefore it cannot adopt new test classes on the fly. But a custom property descriptor, which queries metadata from the SCOM SDK, can.

This is the end of Part Four — the UI is ready. But note that in the first release I implemented neither templates (despite discussing them so much) nor bulk import/export.

Part Five. Monitoring.

There is not much to cover in this part, as it’s pretty standard monitoring development using managed modules. However, there are a few special things due to the somewhat unconventional usage of SCOM classes, so it’s better to create a mini framework. The framework principles for me, and recommendations for extension authors, are:

  • When designing test objects and monitors, put all configuration parameters into object properties, but keep thresholds as monitor parameters.
  • When writing class and property descriptions, make sure they make sense in the new test object property editor.
  • Inherit test classes from ‘Maximus.Connectivity.Monitoring.Test‘ to make them visible for the UI.
  • Be careful when defining new keys in new test classes. Generally, there is no need for them.
  • Try to keep the test object to probe action/data source ratio at 1 to 1. In other words, if it’s not possible to put all required tests into one single PA/DS, then create another test object for the different tests.
  • When defining a primary Probe Action or Data Source module for new rule(s) or monitor(s) use Native/Simple pattern (see below).

What is the “Native/Simple pattern”? Usually, when defining a new module type (probe action, data source, write action, or condition detection), or a new unit monitor type, it’s better to always keep all parameters in the module’s configuration section, like below:

<Configuration>
  <xsd:element minOccurs="1" name="TestDisplayName" type="xsd:string" />
  <xsd:element minOccurs="1" name="FullyQualifiedDomainName" type="xsd:string" />
  <xsd:element minOccurs="1" name="TargetIndex" type="xsd:integer" />
</Configuration>

And if this is a composite workflow, like a data source composed of a scheduler and a probe action, then do the same: copy all configuration from the downstream probe action, add the current module’s specific configuration, then push configuration parameters to downstream modules using config references like ‘$Config/TestDisplayName$‘. In this approach (which is entirely correct and makes lots of sense), actual parameter values, or object property queries, are only used in the “final product”, i.e. in monitors and rules. In other words, “variables” aka configuration elements should be used in all modules and types, and only replaced with actual values in top-level, end-user-facing rules and monitors. Doing this, an MP author allows other authors to easily reuse workflows, because they are not bound to any particular classes or objects. Instead, they accept input parameters, which may come either from direct configuration (an explicit value) or from any object property of any class.

However, with the test objects this doesn’t work quite so well, mostly because they have dozens of properties, and because properties might be added in future releases to support new tests or enhance existing ones. I would love to give anyone who needs a ping probe in their MP a public reusable probe action to ping any target, but that means I would have to pass all the parameters through all the workflow definitions. This is absolutely doable for the ping probe, which has just 4 parameters, but becomes complex for the SSL one — it has 20 parameters. Copying that number of configuration elements and config references makes MP code bulky and hard to read, and significantly increases the chance of making a mistake, especially if a new release needs to add another property. Therefore, for this MP I made yet another decision to go against the design guidelines and substitute object properties as early as possible. I called it the “Native/Simple pattern”. How does it work? First, I define a primary managed-code probe action (or data source), which has all configuration as variables (there is no choice on this). This action is called native and is never used by any composite workflow except one. Next, I define a “simple” composite probe action and substitute all parameters from object properties, making it config-less. See the example:

Native probe action:

<ProbeActionModuleType ID="Maximus.Connectivity.Monitoring.Ping.NativeProbeAction" Accessibility="Public" Batching="false">
  <Configuration>
    <xsd:element minOccurs="1" name="TestDisplayName" type="xsd:string" />
    
    <xsd:element minOccurs="1" name="FullyQualifiedDomainName" type="xsd:string" />
    <xsd:element minOccurs="1" name="TargetIndex" type="xsd:integer" />
    
    <xsd:element minOccurs="1" name="MaxTTL" type="xsd:integer" />
    <xsd:element minOccurs="1" name="BufferSize" type="xsd:integer" />
    <xsd:element minOccurs="1" name="Timeout" type="xsd:integer" />
    <xsd:element minOccurs="1" name="DontFragment" type="xsd:boolean" />
  </Configuration>
  <OverrideableParameters />
  <ModuleImplementation>
    <Managed>
      <Assembly>Maximus.Connectivity.Monitoring.Modules.DeployableAssembly</Assembly>
      <Type>Maximus.Connectivity.Modules.PingPA</Type>
    </Managed>
  </ModuleImplementation>
  <OutputType>System!System.PropertyBagData</OutputType>
  <InputType>System!System.BaseData</InputType>
</ProbeActionModuleType>

And Simple probe action:

<ProbeActionModuleType ID="Maximus.Connectivity.Monitoring.Ping.ProbeAction" Accessibility="Public" Batching="false">
  <Configuration />
  <OverrideableParameters />
  <ModuleImplementation>
    <Composite>
      <MemberModules>
        <ProbeAction ID="PA_SubstituteParametersFromObject" TypeID="Maximus.Connectivity.Monitoring.Ping.NativeProbeAction">
          <TestDisplayName>$Target/Host/Property[Type="System!System.Entity"]/DisplayName$</TestDisplayName>
          <FullyQualifiedDomainName>$Target/Host/Property[Type="Maximus.Connectivity.Monitoring.FullyQualifiedDomainName"]/FullyQualifiedDomainName$</FullyQualifiedDomainName>
          <TargetIndex>$Target/Host/Property[Type="Maximus.Connectivity.Monitoring.FullyQualifiedDomainName"]/TargetIndex$</TargetIndex>
          <MaxTTL>$Target/Property[Type="Maximus.Connectivity.Monitoring.Test.Ping"]/MaxTTL$</MaxTTL>
          <BufferSize>$Target/Property[Type="Maximus.Connectivity.Monitoring.Test.Ping"]/BufferSize$</BufferSize>
          <Timeout>$Target/Property[Type="Maximus.Connectivity.Monitoring.Test.Ping"]/Timeout$</Timeout>
          <DontFragment>$Target/Property[Type="Maximus.Connectivity.Monitoring.Test.Ping"]/DontFragment$</DontFragment>
        </ProbeAction>
      </MemberModules>
      <Composition>
        <Node ID="PA_SubstituteParametersFromObject" />
      </Composition>
    </Composite>
  </ModuleImplementation>
  <OutputType>System!System.PropertyBagData</OutputType>
  <InputType>System!System.BaseData</InputType>
</ProbeActionModuleType>

Similarly, an upstream composite data source with a scheduler will take the scheduler interval from the object property:

<DataSource ID="DS_Scheduler" TypeID="System!System.SimpleScheduler">
  <IntervalSeconds>$Target/Property[Type="Maximus.Connectivity.Monitoring.Test"]/IntervalSeconds$</IntervalSeconds>
  <SyncTime />
</DataSource>

And, as I already mentioned, although this approach is against the guidelines, it has the great advantage of making cookdown unbreakable. In other words, regardless of the number of monitors/rules analyzing/recording output data from a module, the module runs only once, and then the SCOM engine sends a copy of the data to all consumers.

Source code.

You can find all the source code for this management pack in the following repository: https://github.com/MaxxVolk/Maximus.Connectivity.Monitoring.

The End.
