The DevOps Jedi

Taking the cloud by storm one line of code at a time....

Creating A Modern Azure Diagnostics Policy That Generative AI Couldn't - Yet!

2024-07-23 | 15 min read | Governance | Darren Johnson

Background

Before diving into the detail, I think it’s important to set the context, as all environments are different and not everyone’s requirements are the same. This particular organisation had a large enterprise environment aligned to Microsoft’s Cloud Adoption Framework Landing Zone Design, so they had multiple Subscriptions under a number of well-structured Management Groups and wanted to leverage Policy Driven Governance to ensure consistency.

A centralised Log Analytics Workspace had been deployed in a Management Subscription which was to be used for Azure Monitor, and would receive diagnostic logs for all available resources configured via their Diagnostic Settings.

Requirements

The requirements for the policy to collect diagnostic data for Azure Monitor were as follows:

  • All available Logs must be collected using Category Groups
  • All Category Groups must be enabled, and the policy must be configured to allow for new Category Groups being added in the future
  • All Metrics must be collected
  • All Diagnostic data must be sent to the centralised Log Analytics Workspace
  • Where the resources support it, all diagnostic data must be sent to the ‘Resource Specific’ Log Analytics Destination Table

If you haven’t come across Resource Specific tables in Log Analytics before, they store data in dedicated tables, which makes querying the data a lot simpler and has the added benefit of letting you fine-tune data retention values at the table level.

Engineering Approach

To see what was required to be configured in a policy, I decided to reverse engineer the solution by configuring the desired settings in the portal in my own Azure tenant, then viewing the resulting JSON. This is a technique I use a lot to automate the configuration of Azure resources, ensuring consistent results whether I’m configuring them via code or through the portal.

Below is an example of a Service Bus resource that has been configured to meet the requirements, and has the option selected to send logs to the ‘Resource Specific’ destination table in Log Analytics.

Configured Service Bus Diagnostics Portal View

Once the settings had been configured, I clicked the JSON View link to display the resulting JSON.

Configured Service Bus Diagnostics JSON View

You can see I have highlighted a few sections of the JSON that became relevant during the creation and testing process, so let me explain these further:

  • API version - this is the version of the API the portal used to provision the diagnostic settings
  • categoryGroup - there are separate Category Groups defined for both allLogs and audit, and the category for these is defined and set to null
  • AllMetrics - this still uses category and there is no categoryGroup defined
  • logAnalyticsDestinationType - this is set to Dedicated

The resulting JSON was formatted as follows:

{
    "id": "/subscriptions/SUBSCRIPTION-GUID/resourceGroups/RESOURCEGROUP-NAME/providers/Microsoft.ServiceBus/namespaces/RESOURCE-NAME/providers/microsoft.insights/diagnosticSettings/DIAGNOSTIC-SETTING-NAME",
    "name": "setByPolicy",
    "properties": {
        "logs": [
            {
                "category": null,
                "categoryGroup": "allLogs",
                "enabled": true,
                "retentionPolicy": {
                    "days": 0,
                    "enabled": false
                }
            },
            {
                "category": null,
                "categoryGroup": "audit",
                "enabled": false,
                "retentionPolicy": {
                    "days": 0,
                    "enabled": false
                }
            }
        ],
        "metrics": [
            {
                "enabled": true,
                "retentionPolicy": {
                    "days": 0,
                    "enabled": false
                },
                "category": "AllMetrics"
            }
        ],
        "workspaceId": "/subscriptions/SUBSCRIPTION-GUID/resourceGroups/RESOURCEGROUP-NAME/providers/Microsoft.OperationalInsights/workspaces/LOG-ANALYTICS-WORKSPACE-RESOURCE-NAME",
        "logAnalyticsDestinationType": "Dedicated"
    }
}
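To make the reverse-engineering step a little more repeatable, the fields of interest can be pulled out of the JSON View output programmatically. The following is only a minimal Python sketch: the helper name `summarise_diagnostic_setting` is hypothetical, and the embedded JSON is a trimmed copy of the output above with placeholder values.

```python
import json

# Trimmed copy of the portal's JSON View output (placeholder values only).
PORTAL_JSON = """
{
    "name": "setByPolicy",
    "properties": {
        "logs": [
            {"category": null, "categoryGroup": "allLogs", "enabled": true},
            {"category": null, "categoryGroup": "audit", "enabled": false}
        ],
        "metrics": [
            {"category": "AllMetrics", "enabled": true}
        ],
        "workspaceId": "/subscriptions/SUBSCRIPTION-GUID/resourceGroups/RESOURCEGROUP-NAME/providers/Microsoft.OperationalInsights/workspaces/LOG-ANALYTICS-WORKSPACE-RESOURCE-NAME",
        "logAnalyticsDestinationType": "Dedicated"
    }
}
"""


def summarise_diagnostic_setting(raw: str) -> dict:
    """Hypothetical helper: extract the fields a policy would need to set or check."""
    props = json.loads(raw)["properties"]
    return {
        # Log entries keyed by category group (entries without one are skipped).
        "category_groups": {
            log["categoryGroup"]: log["enabled"]
            for log in props.get("logs", [])
            if log.get("categoryGroup")
        },
        # Metrics still use "category" rather than "categoryGroup".
        "metric_categories": {
            metric["category"]: metric["enabled"]
            for metric in props.get("metrics", [])
        },
        "destination_type": props.get("logAnalyticsDestinationType"),
        "workspace_id": props.get("workspaceId"),
    }


if __name__ == "__main__":
    print(json.dumps(summarise_diagnostic_setting(PORTAL_JSON), indent=2))
```

Running the same helper against the JSON from other resource types is one way to spot which of them expose category groups or accept a destination type.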

Checking For Existing Policies

I didn’t want to have to create a custom policy, and before going down this route it’s important to ensure there isn’t already a policy created by Microsoft that provides this functionality. To do this I used the excellent AzAdvertizer website. I chose this resource because it allowed me to search for policies based on their content as opposed to just their name.

My initial search is always based on the policy definition and I do this by entering a search string in the Policy definition compressed column. This is great because any whitespace is removed from the JSON string which ensures you get all relevant results and avoid any human errors in the JSON definition.

I started by searching for "categoryGroup":"allLogs", which brought back hundreds of results. I then filtered this down further by clicking the funnel filter icon, selecting AND Contains and entering "category":"AllMetrics". This brought back over 30 results, so I clicked the filter icon again and added another AND Contains search for "logAnalyticsDestinationType":"Dedicated".

AzAdvertizer Search

This returned no results at all. Suspecting I’d made a typo, or just got the search logic wrong, I tried searching just for policies that contained "logAnalyticsDestinationType":"Dedicated". Only 1 policy was returned.
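The search logic described above can be reproduced offline. This is a minimal Python sketch, assuming the compressed column is simply the definition serialised without whitespace; the two sample definitions are made up for illustration and are not real built-in policies.

```python
import json


def compress(definition: dict) -> str:
    """Serialise a definition with no insignificant whitespace."""
    return json.dumps(definition, separators=(",", ":"))


def and_contains(definition: dict, fragments: list) -> bool:
    """AND Contains: every fragment must appear in the compressed JSON."""
    compressed = compress(definition)
    return all(fragment in compressed for fragment in fragments)


# Hypothetical, heavily trimmed definitions used only to exercise the filter.
SAMPLE_POLICIES = {
    "logs-and-metrics": {
        "logs": [{"categoryGroup": "allLogs", "enabled": True}],
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    },
    "dedicated-only": {
        "logs": [{"category": "OperationalLogs", "enabled": True}],
        "logAnalyticsDestinationType": "Dedicated",
    },
}

SEARCH = [
    '"categoryGroup":"allLogs"',
    '"category":"AllMetrics"',
    '"logAnalyticsDestinationType":"Dedicated"',
]


def search(fragments: list) -> list:
    """Return the names of sample policies matching all fragments."""
    return [
        name
        for name, definition in SAMPLE_POLICIES.items()
        if and_contains(definition, fragments)
    ]


if __name__ == "__main__":
    print(search(SEARCH[:2]))  # ['logs-and-metrics']
    print(search(SEARCH))      # []
    print(search(SEARCH[2:]))  # ['dedicated-only']
```

Searching the compressed form means a fragment such as "categoryGroup":"allLogs" matches regardless of how the original JSON was indented.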

At this point I knew I was going to have to create a custom policy to meet the requirements.

Calling AI - ‘Your Partner In Problem Solving’

They say that ‘a problem shared is a problem halved’, so I decided to try and get AI to generate a policy I could use instead of creating one from scratch.

Now before I dive into how I used AI for this, I have to state I am a big fan of Generative AI. It can save a lot of time and effort and really help enhance solutions. However it is not always right, and you should always validate and test its output.

There is a lot of hype around AI and Prompt Engineering at present and I don’t want to make this post all about that, so I will just share the prompt I used with Microsoft Copilot, ChatGPT & Google Gemini.

Act as an Azure Expert.

You have a requirement to ensure “Diagnostic Settings” are configured on all Azure “Service Bus” instances.

You need to ensure the “Category Groups” for “allLogs”, “audit” and “AllMetrics” are selected to “Send to Log Analytics Workspace” and use the “Resource Specific” destination table.

Please generate a policy definition to achieve this. The policy must check that the correct log analytics workspace ID is configured and not just that category groups are enabled.

The ID of the log analytics workspace will be provided by the user when assigning the policy at the Management Group level so should not be specified in the definition.

Any remediation should be carried out using the built in “Log Analytics Contributor” role to configure the settings, custom roles are not to be used.

Before generating the policy definition please ask any questions that will help you produce a more accurate response.

Now there are a couple of things to note from this prompt. Firstly, I made an error in stating “AllMetrics” is a Category Group; it is a Category, as we saw earlier. Secondly, I added an extra requirement to ensure the policy checks that the correct Log Analytics Workspace is configured. I did this after inspecting existing policies and discovering that they check Logs and Metrics are enabled but don’t validate where the data is being sent, which seemed pointless to me.

All 3 Bots asked similar questions and came pretty close to giving me what I needed. However none were 100% accurate.

The Policy Definition

I took the base policy definition Copilot generated and stripped it back to include only the information I was interested in. I also didn’t want any logic defined in the policy that I didn’t fully understand. I won’t go through all the settings that were incorrect, but I will now break down the sections of my final working policy definition further.

The properties Object

This is fairly self explanatory and contains the following:

  • displayName - the name of the policy displayed in the Azure portal
  • policyType - the type of the policy which is always Custom for policies not created by Microsoft
  • mode - this is set to All as per Microsoft’s recommendation and specifies that the policy evaluates resource groups, subscriptions and all resource types
  • description - the description of the policy as displayed in the Azure portal
  • metadata - the metadata object stores information about the policy definition
  • parameters - the parameters object stores and validates information passed to the policy when it is assigned and can be used to set default values
  • policyRule - the policyRule object is the conditional logic that needs to be matched for the policy to apply
{
    "properties": {
        "displayName": "Deploy Diagnostic Settings - Service Bus - Azure Monitor",
        "policyType": "Custom",
        "mode": "All",
        "description": "Configures diagnostic logs for the category groups 'allLogs' and 'audit' as well as the category 'AllMetrics' for a Service Bus resource to send to a Log Analytics workspace for use with Azure Monitor. The diagnostic logs are sent to the 'Resource Specific' Destination Table. NOTE: The identity carrying out remediation tasks requires the 'Log Analytics Contributor' role to be assigned on both the resource and the Log Analytics Workspace.",
        "metadata": {
        },
        "parameters": {
        },
        "policyRule": {
        }
    }
}

The main focus of this post will be on the policyRule object as the metadata and parameters objects are kept simple.

The metadata Object

The metadata object stores extra data about the policy definition that you want to add in addition to Azure’s default values. Here we are just setting the version and the category:

"metadata": {
    "version": "1.0.0",
    "category": "Monitoring"
}

The parameters Object

The parameters object makes policy definitions more flexible and reusable, as it allows you to define values that can be customised when the policy is assigned. Here we define a single logAnalytics parameter, which holds the Log Analytics Workspace Resource ID.

"parameters": {
    "logAnalytics": {
        "type": "String",
        "metadata": {
            "displayName": "Log Analytics Workspace Resource ID",
            "description": "Specifies the resource ID of Log Analytics workspace to send logs to."
        }
    }
}

The policyRule Object

The policy rule object is made up of if and then objects. In the if object you set conditional logic that determines when the policy should be applied. In the then object you specify the action or effect that occurs when the conditions in the if object are true.

"policyRule": {
    "if": {
    },
    "then": {
    }
}

The if Object

The if object specifies conditional logic that needs to be met in order for the policy to apply. In this case, the type of resource needs to match "Microsoft.ServiceBus/namespaces".

"if": {
    "field": "type",
    "equals": "Microsoft.ServiceBus/namespaces"
}

The then Object

The then object is what happens when the if condition has been met. It contains the following:

  • effect - the effect determines what action is taken if the compliance criteria are not met. In this case, the effect is set to deployIfNotExists which will automatically deploy an Azure Resource Manager (ARM) template
  • details - the details object provides additional information and parameters
"then": {
    "effect": "deployIfNotExists",
    "details": {
        "type": "Microsoft.Insights/diagnosticSettings",
        "existenceCondition": {
        },
        "roleDefinitionIds": [
        ],
        "deployment": {
        }
    }
}

The details Object

For this example the details object contains the following:

  • type - the type of the related resource the policy checks for and deploys, in this case the diagnostic setting
  • existenceCondition - the compliance criteria that must be met
  • roleDefinitionIds - a list of RBAC role(s) required to be assigned to perform the deployment
  • deployment - the ARM template to be deployed

The existenceCondition Object

The existenceCondition object is key, as this is where the compliance of the policy will be determined. Think of this as all the things that need to be true for a resource to be considered compliant. This section needs extra consideration as it evaluates conditional logic for both deployment and compliance. I say this because after creating the policy and watching it successfully apply, the resource was still marked as non-compliant.

Let me explain further.

I initially used the logic from the JSON created when the settings were configured via the portal which has separate objects for the allLogs and audit category groups. Part of the existenceCondition object looked like this:

"existenceCondition": {
    "allOf": [
        {
            "field": "Microsoft.ServiceBus/namespaces/diagnosticSettings[*].logs[*].categoryGroup",
            "contains": "allLogs"
        },
        {
            "field": "Microsoft.ServiceBus/namespaces/diagnosticSettings[*].logs[*].categoryGroup",
            "contains": "audit"
        }
    ]
}

From reading this I can see that categoryGroup needs to contain both allLogs and audit to be considered compliant, and this works perfectly for deployment.

However when I checked the compliance of the policy I discovered it used different logic to check these settings.

Compliance Check Logic

I wasn’t sure how to fix this one so I turned to Copilot again and asked:

Act as an Azure Expert.

You have created a policy as defined below but compliance fails as it sees ‘properties.logs[*].categoryGroup’ as ‘[“allLogs”,“audit”]’ the definition is below, how do I fix this?

Copilot replied with the following:

The issue you’re facing is likely due to the way the policy is checking for the categoryGroup in the existenceCondition. The policy is currently checking for allLogs and audit separately, but the compliance check is seeing them together as ["allLogs","audit"].

To fix this, you can modify the existenceCondition to check if both allLogs and audit are in the categoryGroup array. Here’s how you can modify it:

"existenceCondition": {
  "allOf": [
    {
      "equals": "true",
      "field": "Microsoft.Insights/diagnosticSettings/logs.enabled"
    },
    {
      "in": ["allLogs", "audit"],
      "field": "microsoft.insights/diagnosticSettings/logs[*].categoryGroup"
    },
    {
      "equals": "true",
      "field": "Microsoft.Insights/diagnosticSettings/metrics.enabled"
    },
    {
      "field": "Microsoft.Insights/diagnosticSettings/workspaceId",
      "matchInsensitively": "[parameters('logAnalytics')]"
    },
    {
      "equals": "Dedicated",
      "field": "Microsoft.Insights/diagnosticSettings/logAnalyticsDestinationType"
    }
  ]
}

In this modification, I replaced the separate checks for allLogs and audit with a single check using the in operator. This checks if the categoryGroup array contains both allLogs and audit.

Please try this modification and see if it resolves the compliance issue. If the issue persists, please let me know so we can investigate further.

I implemented the changes as I understood exactly what it was suggesting, and sure enough compliance was happy. Go Copilot!!!

You will notice from the JSON above that there are additional settings in the existenceCondition object that I supplied to Copilot to meet our requirements:

  • There is a check to ensure that both logs and metrics are enabled
  • There is a check to ensure that the configured workspaceId matches the logAnalytics parameter specified when the policy is assigned
  • The logAnalyticsDestinationType is set to Dedicated which we saw from the JSON the portal produced earlier
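One nuance is worth checking against Microsoft's documentation before relying on it. As I understand it, a field condition on a `[*]` alias is evaluated against each array member and only passes if every member passes. Under that assumption, the `in` check passes when every categoryGroup value is one of the listed values, rather than requiring both values to be present. The Python sketch below is a simplified model of that behaviour, not the real policy engine; it ignores details such as case-insensitive comparison, and `every_element` is a hypothetical helper.

```python
def every_element(values, predicate):
    """Simplified model of a [*] alias in a field condition: all members must pass."""
    return all(predicate(value) for value in values)


# What the compliance check saw for properties.logs[*].categoryGroup.
category_groups = ["allLogs", "audit"]

# First attempt: two separate `contains` conditions combined with allOf.
# "audit" does not contain "allLogs" (and vice versa), so neither passes.
first_attempt = (
    every_element(category_groups, lambda value: "allLogs" in value)
    and every_element(category_groups, lambda value: "audit" in value)
)

# Revised attempt: a single `in` condition listing the accepted values.
accepted = ["allLogs", "audit"]
revised = every_element(category_groups, lambda value: value in accepted)

# Under this model, a setting with only one of the groups would also pass.
only_all_logs = every_element(["allLogs"], lambda value: value in accepted)

if __name__ == "__main__":
    print(first_attempt)  # False
    print(revised)        # True
    print(only_all_logs)  # True
```

If both groups must be present, that is guaranteed here by the deployment template rather than by the `in` condition itself.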

The roleDefinitionIds Object

The roleDefinitionIds object is a list of RBAC roles that the identity performing the remediation will need to be assigned for a successful deployment to take place. In this case only one role is required, Log Analytics Contributor, and it is specified by its ID of 92aaf0da-9dab-42b6-94a3-d43ce8d16293.

"roleDefinitionIds": [
    "/providers/Microsoft.Authorization/roleDefinitions/92aaf0da-9dab-42b6-94a3-d43ce8d16293"
]
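If you build definitions with a script, the role definition path can be generated from the GUID rather than typed by hand. This is a small Python sketch; `role_definition_id` is a hypothetical helper, and the GUID check only confirms the format, not that the role exists.

```python
import uuid

# ID of the built-in Log Analytics Contributor role, as used above.
LOG_ANALYTICS_CONTRIBUTOR = "92aaf0da-9dab-42b6-94a3-d43ce8d16293"


def role_definition_id(role_guid: str) -> str:
    """Hypothetical helper: build the tenant-level role definition resource ID."""
    # uuid.UUID raises ValueError if the string is not a well-formed GUID.
    normalised = str(uuid.UUID(role_guid))
    return f"/providers/Microsoft.Authorization/roleDefinitions/{normalised}"


if __name__ == "__main__":
    print(role_definition_id(LOG_ANALYTICS_CONTRIBUTOR))
```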

The deployment Object

The deployment object is the ARM template that is deployed when the effect is set to deployIfNotExists. I’ve used ARM templates in the past and can understand them, but don’t consider myself an expert by any means. However I was able to use some of the JSON generated by Copilot as starter code and modify it accordingly:

"deployment": {
    "properties": {
        "mode": "Incremental",
        "template": {
            "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
            "contentVersion": "1.0.0.0",
            "parameters": {
                "name": {
                    "type": "string"
                },
                "logAnalytics": {
                    "type": "string"
                }
            },
            "resources": [
                {
                    "type": "Microsoft.ServiceBus/namespaces/providers/diagnosticSettings",
                    "name": "[concat(parameters('name'), '/Microsoft.Insights/setByPolicy')]",
                    "apiVersion": "2021-05-01-preview",
                    "properties": {
                        "workspaceId": "[parameters('logAnalytics')]",
                        "logAnalyticsDestinationType": "Dedicated",
                        "logs": [
                            {
                                "categoryGroup": "allLogs",
                                "enabled": true
                            },
                            {
                                "categoryGroup": "audit",
                                "enabled": true
                            }
                        ],
                        "metrics": [
                            {
                                "category": "AllMetrics",
                                "enabled": true
                            }
                        ]
                    }
                }
            ]
        },
        "parameters": {
            "name": {
                "value": "[field('name')]"
            },
            "logAnalytics": {
                "value": "[parameters('logAnalytics')]"
            }
        }
    }
}

The key changes I had to make as I went along were in the resources array, where I changed the following:

  • The name was updated to setByPolicy to align with other diagnostic settings that had been deployed via policies elsewhere within the tenant
  • The apiVersion was updated to match the version that was used in the portal when configuring the settings as this version introduced support for the categoryGroup attribute
  • The logAnalyticsDestinationType was set to Dedicated
  • In the list of logs I had to update category to categoryGroup
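To see what the template resolves to for a given namespace, the name expression and properties can be mirrored in a few lines of Python. This is only an illustrative sketch: both function names and the `sb-example` namespace are hypothetical, and the workspace ID is a placeholder.

```python
def setting_resource_name(namespace_name: str, setting_name: str = "setByPolicy") -> str:
    """Mirror of concat(parameters('name'), '/Microsoft.Insights/setByPolicy')."""
    return f"{namespace_name}/Microsoft.Insights/{setting_name}"


def setting_properties(workspace_id: str) -> dict:
    """Mirror of the template's properties object for a given workspace ID."""
    return {
        "workspaceId": workspace_id,
        "logAnalyticsDestinationType": "Dedicated",
        # Logs are selected by category group, not by individual category.
        "logs": [
            {"categoryGroup": group, "enabled": True}
            for group in ("allLogs", "audit")
        ],
        # Metrics still use the category attribute.
        "metrics": [{"category": "AllMetrics", "enabled": True}],
    }


if __name__ == "__main__":
    # Hypothetical namespace name and placeholder workspace ID.
    print(setting_resource_name("sb-example"))
    print(setting_properties("/subscriptions/SUBSCRIPTION-GUID/..."))
```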

Putting It All Together

When comparing the final JSON file to the content in this post, you may notice that a few of the fields appear in a different order to that shown earlier. This is because I like any JSON I write to be as readable as possible. JSON parsers don’t care what order the keys appear in within an object, as long as the document is syntactically correct. A quick example is the snippet below, taken from Copilot’s output, where equals comes before field:

{
    "equals": "true",
    "field": "Microsoft.Insights/diagnosticSettings/logs.enabled"
}

This is technically the same as the version below, but the ordering below is how I would write it because it makes more sense to me:

{
    "field": "Microsoft.Insights/diagnosticSettings/logs.enabled",
    "equals": "true"
}
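If you want to convince yourself that the ordering makes no difference, a quick Python check with the standard json module shows both forms parse to the same object:

```python
import json

copilot_order = json.loads(
    '{"equals": "true", "field": "Microsoft.Insights/diagnosticSettings/logs.enabled"}'
)
preferred_order = json.loads(
    '{"field": "Microsoft.Insights/diagnosticSettings/logs.enabled", "equals": "true"}'
)

# Python dictionaries compare by keys and values, not by insertion order.
same_object = copilot_order == preferred_order

if __name__ == "__main__":
    print(same_object)  # True
```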

The finished working policy definition JSON file is below, so feel free to modify it and use within your own environments, after testing of course!

{
    "properties": {
        "displayName": "Deploy Diagnostic Settings - Service Bus - Azure Monitor",
        "policyType": "Custom",
        "mode": "All",
        "description": "Configures diagnostic logs for the category groups 'allLogs' and 'audit' as well as the category 'AllMetrics' for a Service Bus resource to send to a Log Analytics workspace for use with Azure Monitor. The diagnostic logs are sent to the 'Resource Specific' Destination Table. NOTE: The identity carrying out remediation tasks requires the 'Log Analytics Contributor' role to be assigned on both the resource and the Log Analytics Workspace.",
        "metadata": {
            "version": "1.0.0",
            "category": "Monitoring"
        },
        "parameters": {
            "logAnalytics": {
                "type": "String",
                "metadata": {
                    "displayName": "Log Analytics Workspace Resource ID",
                    "description": "Specifies the resource ID of Log Analytics workspace to send logs to."
                }
            }
        },
        "policyRule": {
            "if": {
                "field": "type",
                "equals": "Microsoft.ServiceBus/namespaces"
            },
            "then": {
                "effect": "deployIfNotExists",
                "details": {
                    "type": "Microsoft.Insights/diagnosticSettings",
                    "existenceCondition": {
                        "allOf": [
                            {
                                "field": "Microsoft.Insights/diagnosticSettings/logs.enabled",
                                "equals": "true"
                            },
                            {
                                "field": "microsoft.insights/diagnosticSettings/logs[*].categoryGroup",
                                "in": [
                                    "allLogs",
                                    "audit"
                                ]
                            },
                            {
                                "field": "Microsoft.Insights/diagnosticSettings/metrics.enabled",
                                "equals": "true"
                            },
                            {
                                "field": "Microsoft.Insights/diagnosticSettings/workspaceId",
                                "matchInsensitively": "[parameters('logAnalytics')]"
                            },
                            {
                                "field": "Microsoft.Insights/diagnosticSettings/logAnalyticsDestinationType",
                                "equals": "Dedicated"
                            }
                        ]
                    },
                    "roleDefinitionIds": [
                        "/providers/Microsoft.Authorization/roleDefinitions/92aaf0da-9dab-42b6-94a3-d43ce8d16293"
                    ],
                    "deployment": {
                        "properties": {
                            "mode": "Incremental",
                            "template": {
                                "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
                                "contentVersion": "1.0.0.0",
                                "parameters": {
                                    "name": {
                                        "type": "string"
                                    },
                                    "logAnalytics": {
                                        "type": "string"
                                    }
                                },
                                "resources": [
                                    {
                                        "type": "Microsoft.ServiceBus/namespaces/providers/diagnosticSettings",
                                        "name": "[concat(parameters('name'), '/Microsoft.Insights/setByPolicy')]",
                                        "apiVersion": "2021-05-01-preview",
                                        "properties": {
                                            "workspaceId": "[parameters('logAnalytics')]",
                                            "logAnalyticsDestinationType": "Dedicated",
                                            "logs": [
                                                {
                                                    "categoryGroup": "allLogs",
                                                    "enabled": true
                                                },
                                                {
                                                    "categoryGroup": "audit",
                                                    "enabled": true
                                                }
                                            ],
                                            "metrics": [
                                                {
                                                    "category": "AllMetrics",
                                                    "enabled": true
                                                }
                                            ]
                                        }
                                    }
                                ]
                            },
                            "parameters": {
                                "name": {
                                    "value": "[field('name')]"
                                },
                                "logAnalytics": {
                                    "value": "[parameters('logAnalytics')]"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
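Before uploading a definition like this, a quick sanity check can catch a parameter that is referenced in the rule but never declared. The Python sketch below is one possible approach rather than an official validator. `undeclared_policy_parameters` is a hypothetical helper, `DEFINITION` is a heavily trimmed stand-in for the full file above, and the embedded ARM template is skipped because its `parameters('...')` calls refer to the template's own parameters.

```python
import copy
import json
import re

# Matches parameters('someName') inside policy expressions.
PARAM_REF = re.compile(r"parameters\('([^']+)'\)")

# Heavily trimmed stand-in for the full definition above.
DEFINITION = {
    "properties": {
        "parameters": {"logAnalytics": {"type": "String"}},
        "policyRule": {
            "if": {"field": "type", "equals": "Microsoft.ServiceBus/namespaces"},
            "then": {
                "effect": "deployIfNotExists",
                "details": {
                    "existenceCondition": {
                        "field": "Microsoft.Insights/diagnosticSettings/workspaceId",
                        "matchInsensitively": "[parameters('logAnalytics')]",
                    },
                    "deployment": {
                        "properties": {
                            "template": {
                                "parameters": {"name": {"type": "string"}},
                                "resources": [{"name": "[parameters('name')]"}],
                            },
                            "parameters": {
                                "name": {"value": "[field('name')]"},
                                "logAnalytics": {"value": "[parameters('logAnalytics')]"},
                            },
                        }
                    },
                },
            },
        },
    }
}


def undeclared_policy_parameters(definition: dict) -> set:
    """Return policy parameters referenced in the rule but not declared."""
    props = definition["properties"]
    declared = set(props.get("parameters", {}))
    rule = copy.deepcopy(props["policyRule"])
    # Drop the embedded ARM template from the copy before scanning.
    deployment = rule.get("then", {}).get("details", {}).get("deployment", {})
    deployment.get("properties", {}).pop("template", None)
    referenced = set(PARAM_REF.findall(json.dumps(rule)))
    return referenced - declared


if __name__ == "__main__":
    print(undeclared_policy_parameters(DEFINITION))  # set()
```

To run it against the real file, load the JSON with `json.load` and pass the resulting dictionary in place of `DEFINITION`.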

I covered how to add the policy definition using PowerShell in my previous post, so head over there to check that out.