The DevOps Jedi

Taking the cloud by storm one line of code at a time....

Deleting An Azure Resource That Is Stuck In A 'Deleting Resource' Status

2024-08-19 | 4 min read | Azure | Darren Johnson

Background

I have recently been deploying some of Azure’s Artificial Intelligence (AI) resources via Terraform, and after standing them up and destroying them a few times I hit an error where an Azure Machine Learning Workspace v2 (MLW) would not delete and the terraform destroy command timed out. It had been a long time since I’d seen this behaviour in Azure, and the previous workaround was to just delete the resource(s) via the portal. However, on this occasion that didn’t work. I couldn’t find the answer online and I’m writing this from memory, so apologies in advance if I have missed something.

Azure Machine Learning Workspace Dependencies

When building an MLW, there are a number of prerequisite resources that need to be in place:

  • Application Insights (although this can be deleted after the MLW deployment)
  • Container Registry
  • Key Vault
  • Log Analytics Workspace
  • Storage Account

In Terraform, when one resource references another via its resource type and name (for example, by passing the other resource’s id as an argument), an implicit dependency is automatically detected. This helps Terraform know in what order resources must be created and deleted. You can also add a depends_on argument to manually enforce a dependency between resources.
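To see which dependencies Terraform has actually inferred, one option is to filter the output of terraform graph, which prints the dependency graph in DOT format. The snippet below is a minimal sketch rather than something from my original setup: graph_edges is a hypothetical helper name, and it assumes each edge sits on a line containing '->'.

```shell
# Print only the dependency edges ("A" -> "B") from DOT text on stdin.
# graph_edges is a hypothetical helper, not a Terraform command.
graph_edges() {
  grep -- '->' | sed 's/^[[:space:]]*//'
}

# Usage (run inside an initialised Terraform working directory):
#   terraform graph | graph_edges
```

Each line of output should name a resource and something it depends on, which makes it easier to spot a dependency you expected but Terraform has not picked up.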

The challenge is that Azure’s APIs allow resources to be created asynchronously (at the same time), but when destroying, the order can be stricter and can require dependent resources to still exist before the deletion is allowed. In theory this is a good thing, but if the error messages don’t surface this then you can get stuck, which is what happened to me.

The Problem

I had created and torn down these resources numerous times in the last week without any issues. I had completed a customer demo and decided to destroy the resources to save the ongoing cost, which is when I hit the problem.

Most resources in Azure aren’t that expensive, and should this happen it wouldn’t normally be too much of an issue. In this case, however, I needed the most secure configuration, which involved configuring ‘Private with Approved Outbound’ networking. In reality this meant that a managed Virtual Network, including an Azure Firewall instance ($$$), was provisioned which I had no control over. Being unable to delete the workspace meant I would keep incurring the costs for these resources, which was not good!

Attempted Solution 1

When Terraform failed, I tried deleting the resource via the portal which appeared to be working:

[Screenshot: MLW ‘Deleting resource’ portal view]

However, I left this overnight and there was still no update or additional information to go on.

When I logged back in to the portal the ‘Deleting resource’ flag had gone, so I attempted to access the Machine Learning Studio to see if I could get any additional information. I immediately got an error stating:

"WorkspaceIsDeleting: GUID Request can't be accepted while workspace is in deleting state."

At this point I knew I had to try something else, or log a support call with Microsoft.

Attempted Solution 2

Next I found this link which, despite being over 3 years old, was from someone at Microsoft suggesting using the AZ CLI to delete the resource. I tried this using the command below:

az ml workspace delete --resource-group RG_NAME --workspace-name MLW_NAME --all-resources

I left this running, and it ran, and it ran and…. you get the idea.

Eight hours had passed, so I left it running overnight, during which time my Azure VM installed required updates and rebooted.

Time to start over and back to square 1….

How I Found The Problem

Experience told me I could enable verbose logging when using the AZ CLI, and on checking the updated documentation I saw there were a few extra arguments I wanted to try:

  • --permanently-delete - this would ensure the MLW wasn’t placed into a soft delete state
  • --yes - to avoid prompting for confirmation (ideal when running via a pipeline)
  • --verbose - increase the logging verbosity (i.e. I would get more info)
  • --no-wait - do not wait for long running operations to finish (like I had previously)

I updated my command as follows (changing the --workspace-name argument to just --name as per the updated documentation):

az ml workspace delete --name MLW_NAME --resource-group RG_NAME --all-resources --permanently-delete --yes --verbose --no-wait

This time the command immediately failed with the error below:

(ResourceNotFound) The Resource 'Microsoft.insights/components/APP_INSIGHTS_NAME' under resource group 'RG_NAME' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix
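If you are running this from a pipeline, it can help to pull the missing resource out of the error text so the log says clearly what needs to be recreated. This is a rough sketch only: missing_resource is a hypothetical helper, and it assumes the error has exactly the wording shown above, which may vary between CLI versions.

```shell
# Read CLI output on stdin and print "<resource group> <resource>" for each
# ResourceNotFound error worded like the one above; other lines are ignored.
# missing_resource is a hypothetical helper name.
missing_resource() {
  sed -n "s/.*(ResourceNotFound) The Resource '\([^']*\)' under resource group '\([^']*\)' was not found.*/\2 \1/p"
}

# Usage, assuming the delete output was saved to a file called delete.log:
#   missing_resource < delete.log
```

For the error above this prints RG_NAME followed by Microsoft.insights/components/APP_INSIGHTS_NAME.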

The Fix

I recreated all the dependent resources via Terraform (but not the MLW itself), and re-ran the command above. This time it sailed through in about 30 seconds, and after waiting another minute or two for the portal to catch up, I was able to confirm the resources had indeed been deleted. Result!
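With hindsight, a quick check that the dependent resources still exist before attempting the delete would have saved me a lot of waiting. The following is a sketch under assumptions, not what I ran at the time: check_deps is a hypothetical helper, you have to supply the full resource IDs yourself, and az resource show can also fail for other reasons (for example, not being logged in), so treat MISSING as a prompt to investigate.

```shell
# Report whether each resource ID passed as an argument can be found.
# Returns 1 if any lookup fails, 0 otherwise.
# check_deps is a hypothetical helper name.
check_deps() {
  missing=0
  for id in "$@"; do
    if az resource show --ids "$id" --output none 2>/dev/null; then
      echo "present: $id"
    else
      echo "MISSING: $id"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (SUB_ID is a placeholder for your subscription ID):
#   check_deps \
#     "/subscriptions/SUB_ID/resourceGroups/RG_NAME/providers/Microsoft.insights/components/APP_INSIGHTS_NAME"
```

Anything reported as MISSING is a candidate to recreate before re-running the delete.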

Key Takeaway: When using the AZ CLI to delete resources, always use the --verbose argument, and the --no-wait argument if it is available, as this will give you more information than the portal or the API, which is where Terraform gets its error information from.
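If you want to reuse this in a pipeline, the command can be wrapped in a small function that keeps the verbose output in a log file. This is a minimal sketch: delete_mlw and the default log file name are hypothetical, while the az arguments are the same ones used earlier in this post.

```shell
# Delete a Machine Learning Workspace with verbose logging, saving all output
# (stdout and stderr) to a log file and passing back the CLI's exit status.
# delete_mlw and mlw-delete.log are hypothetical names.
delete_mlw() {
  name="$1"
  rg="$2"
  log="${3:-mlw-delete.log}"
  status=0
  az ml workspace delete \
    --name "$name" --resource-group "$rg" \
    --all-resources --permanently-delete --yes --verbose --no-wait \
    > "$log" 2>&1 || status=$?
  cat "$log"
  return "$status"
}

# Usage:
#   delete_mlw MLW_NAME RG_NAME
```

Because --no-wait returns straight away, a zero exit status only means the request was accepted, so it is still worth confirming afterwards that the workspace has actually gone.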