Deep observability insights at any scale with Consul
Consul helps you securely connect applications running in any environment, at any scale. Consul observability features enhance your service mesh capabilities with enriched metrics, logs, and distributed traces so you can improve performance and debug your distributed services with precision.
In this tutorial, you will enable observability features for your Consul data plane and control plane. You will use Grafana to explore dashboards that provide information regarding health, performance, security, and operations. In the process, you will learn how using these features can provide you with deep insights, reduce operational overhead, and contribute to a more holistic view of your service mesh applications.
Scenario overview
HashiCups is a coffee shop demo application. It has a microservices architecture and uses Consul service mesh to securely connect the services. At the beginning of this tutorial, you will use Terraform to deploy the HashiCups microservices, a self-managed Consul cluster, and an observability suite on Elastic Kubernetes Service (EKS).
You will enable Consul observability features for your service mesh environment that will provide insights into the health and performance of your data plane and control plane. You will use these features to diagnose and troubleshoot traffic problems between services on the data plane.
In this tutorial, you will:
- Deploy the following resources with Terraform:
- Elastic Kubernetes Service (EKS) cluster
- A self-managed Consul datacenter on EKS
- Grafana and Prometheus on EKS
- HashiCups demo application
- Perform the following Consul data plane procedures:
- Review and enable observability features
- Explore dashboards with Grafana
- Troubleshoot the HashiCups demo application
- Perform the following Consul control plane procedures:
- Review and enable observability features
- Explore dashboards with Grafana
- Perform the following HCP Consul procedures:
- Review and enable HCP observability features
- Explore dashboards with HCP Consul portal
- Clean up your demo environment
Prerequisites
For this tutorial, you will need:
- An AWS account configured for use with Terraform
- An HCP account
- aws-cli >= 2.0
- terraform >= 1.0
- consul >= 1.16.0
- consul-k8s >= 1.2.0
- git >= 2.0
- helm >= 3.0
- kubectl >= 1.24
Clone GitHub repository
Clone the GitHub repository containing the configuration files and resources.
Change into the directory that contains the complete configuration files for this tutorial.
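The two steps above can be sketched as follows. The repository URL and directory name here are placeholders, not the tutorial's actual repository — substitute the URL given in the tutorial.

```shell
# Placeholder URL -- replace with the repository named in the tutorial
git clone https://github.com/hashicorp-education/learn-consul-observability.git
cd learn-consul-observability
```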
Deploy infrastructure and demo application
With these Terraform configuration files, you are ready to deploy your infrastructure.
Initialize your Terraform configuration to download the necessary providers and modules.
Then, deploy the resources. Confirm the run by entering `yes`.
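Run from the directory containing the Terraform configuration, the two commands look like this:

```shell
terraform init    # downloads the required providers and modules
terraform apply   # review the plan, then confirm by entering: yes
```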
Note
The Terraform deployment could take up to 15 minutes to complete. Feel free to explore the next sections of this tutorial while waiting for the environment to complete initialization.
Connect to your infrastructure
Now that you have deployed the Kubernetes cluster, configure `kubectl` to interact with it.
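One way to do this for EKS is shown below. The region and cluster name are assumptions — match them to your Terraform outputs.

```shell
# Region and cluster name are assumptions; check your Terraform outputs
aws eks update-kubeconfig --region us-east-1 --name <cluster-name>

# Verify connectivity to the cluster
kubectl get nodes
```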
Configure your CLI to interact with Consul datacenter
In this section, you will set environment variables in your terminal so the Consul CLI can interact with your Consul datacenter. The Consul CLI reads these environment variables for behavior defaults and references their values when you run `consul` commands.
Set the Consul destination address. By default, Consul listens on port `8500` for HTTP and `8501` for HTTPS.
Retrieve the ACL bootstrap token from the respective Kubernetes secret and set it as an environment variable.
Disable SSL verification checks to simplify communication with your Consul datacenter.
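As a sketch, the three environment variables might be set like this. The Kubernetes secret name in the comment is an assumption — check your Helm release for the exact name.

```shell
# Consul HTTPS API address (port 8501 by default)
export CONSUL_HTTP_ADDR=https://127.0.0.1:8501

# ACL bootstrap token -- in the tutorial this comes from a Kubernetes secret,
# for example (secret name is an assumption):
#   kubectl get secret consul-bootstrap-acl-token \
#     -o jsonpath='{.data.token}' | base64 -d
export CONSUL_HTTP_TOKEN="<paste-bootstrap-token-here>"

# Skip TLS certificate verification (development/demo only)
export CONSUL_HTTP_SSL_VERIFY=false
```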
Note
In a production environment, we recommend keeping SSL verification enabled. Only disable it for development and demonstration purposes when your Consul datacenter does not have TLS configured.
Run the `consul catalog services` CLI command to print all known services from your Consul catalog. This confirms that you can communicate with your Consul environment.
Enable Consul data plane observability features
The Consul data plane is responsible for authorizing, forwarding, and observing every network packet that flows between the services in your service mesh.
Consul data plane observability features provide detailed statistics and logging data so you can understand distributed traffic flow and debug problems as they occur.
Review and enable data plane metrics
Consul lets you expose Prometheus metrics for your service mesh applications and sidecars. Review the highlighted lines in the values file below to see the parameters that enable this feature.
Refer to the Consul metrics for Kubernetes documentation to learn more about metrics configuration options and details.
Configure your Consul cluster to let Prometheus collect metrics from your data plane.
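Applied with Helm, the update might look like the sketch below. The release name, namespace, and values file name are assumptions — use the names from your deployment.

```shell
# Release name, namespace, and values file are assumptions
helm upgrade consul hashicorp/consul \
  --namespace consul \
  --values values.yaml \
  --wait
```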
Note
The Helm upgrade could take up to 5 minutes to complete. Feel free to explore the next sections of this tutorial while waiting for your updated Consul environment to become available.
Review the official Helm chart values to learn more about these settings.
Review and enable data plane logging
The `ProxyDefaults` configuration entry lets you set global defaults across all sidecar proxies in your Consul service mesh. The `proxy/proxy-defaults.yaml` file enables `accessLogs` for all of your Consul data plane sidecar proxies.
Review the Consul proxy defaults documentation to learn more.
Configure your proxy defaults to enable access logs.
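Applying the file is a single `kubectl` command; the commented YAML is a rough sketch of what such a `ProxyDefaults` resource contains, not the tutorial's exact file.

```shell
# proxy/proxy-defaults.yaml enables access logs globally, roughly:
#   apiVersion: consul.hashicorp.com/v1alpha1
#   kind: ProxyDefaults
#   metadata:
#     name: global
#   spec:
#     accessLogs:
#       enabled: true
kubectl apply -f proxy/proxy-defaults.yaml
```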
Restart sidecar proxies
You need to restart your sidecar proxies to apply the updated configuration. To do so, redeploy your HashiCups application.
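A sketch of the restart — the deployment names here are assumptions; list yours with `kubectl get deployments`.

```shell
# Deployment names are assumptions -- check with: kubectl get deployments
for d in nginx frontend public-api product-api payments postgres; do
  kubectl rollout restart deployment "$d"
done

# Wait for one of the rollouts to finish
kubectl rollout status deployment nginx
```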
Prometheus will now begin scraping the `/metrics` endpoint of each proxy sidecar on port `20200`. Refer to the Consul metrics for Kubernetes documentation to learn more about changing the Consul metrics collection default parameters.
Generate traffic in the demo application
In this section, you will visit your demo application, HashiCups, to generate traffic that will populate the Consul proxy metrics dashboards in Grafana.
Retrieve the HashiCups URL.
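For example, if the application is exposed through an API gateway service of type LoadBalancer, the URL can be read from the service's status. The service name is an assumption — check with `kubectl get svc`.

```shell
# Service name is an assumption -- check with: kubectl get svc
kubectl get svc api-gateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```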
Open the Consul API Gateway's URL in your browser and explore the HashiCups UI. Notice that the HashiCups UI displays an expected error message.
In the next section, you will use the data plane logs and metrics dashboards to troubleshoot the HashiCups demo application.
Explore Consul data plane metrics and logs with Grafana
Consul proxy access logs and proxy metrics provide you with detailed health and performance information for your service mesh applications. In this section, you will use Grafana to examine how this information provides insights into your distributed applications.
Event and error insights
Consul proxy access logs provide detailed event and error information for your service mesh applications. This includes upstream/downstream application connections, request status codes, errors, and additional information that you can use to gain deep insights into your distributed applications.
Navigate to the data plane logs dashboard.
Note
The Grafana dashboard may take a few moments to fully load in your browser.
In this scenario, notice that the `nginx` app is experiencing a large number of `503: Service Unavailable` errors. When you filter for the `503` response code in the raw logs, Grafana shows that `nginx` returns an error when it attempts to call the `/api` path. Referencing the HashiCups diagram, the `/api` path sends traffic to the `public-api` service.
Consul proxy access logs contain a large set of information that you can utilize to monitor your service mesh applications. Refer to the Consul access logs documentation for a complete list of available logs.
Health insights
Consul proxy metrics provide information for monitoring the health of your service mesh applications, such as requests by status code, upstream/downstream connections, rejected connections, and Envoy cluster state. The majority of these metrics are available for any service mesh application and require no additional service configuration.
Navigate to the data plane health monitoring dashboard.
Note
The Grafana dashboard may take a few moments to fully load in your browser.
In this scenario, notice that only 5 of the 6 HashiCups services are running and that the `public-api` service is not present in the list of active HTTP downstream connections. The status code dashboards also show a large number of `503: Service Unavailable` errors for the `nginx` service.
Consul proxy metrics contain a large set of statistics that you can use to monitor your service mesh applications. Refer to the Envoy proxy statistics overview for a complete list of available metrics.
Performance insights
Consul proxy metrics provide you with information for monitoring the performance of your service mesh applications such as network traffic statistics, CPU usage by pod, Envoy connections per second, and upstream/downstream connection data. The majority of these metrics are available for any service mesh applications and require no additional application configuration.
Navigate to the data plane performance monitoring dashboard.
Note
The Grafana dashboard may take a few moments to fully load in your browser.
In this scenario, notice that the `public-api` service is not present in the upstream requests dashboard. Even though the CPU and network usage dashboards show that the `public-api` pod is present, the pod is processing very little activity.
Consul proxy metrics contain a large set of statistics that you can utilize to monitor your service mesh applications. Refer to the Envoy proxy statistics overview for a complete list of available metrics.
Restore HashiCups functionality
In this section, you will restore HashiCups functionality by using the insights from the data plane metrics and log dashboards.
The data plane dashboards show that only 5 of the 6 HashiCups services are running in the service mesh and that the `public-api` service is not present in the list of active HTTP downstream connections. The CPU and network usage dashboards show that the `public-api` pod is present, but processing very little activity.
Based on this information, you can deduce that there is an error with the `public-api` service. List the pod details in the `default` namespace, where the HashiCups pods are running.
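For example:

```shell
kubectl get pods --namespace default
```

A pod with an injected sidecar shows `2/2` in the `READY` column; a `public-api` pod showing `1/1` points at the missing proxy.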
Notice that `public-api` has only one container in its pod. Since the service itself is running, this means that no Consul proxy sidecar exists in this pod.
Open the `hashicups/public-api.yaml` file and investigate the deployment resource configuration.
Notice that the Consul proxy sidecar annotation is set to `false`. This signals Consul not to inject a proxy sidecar into the `public-api` pod. Update this value to `true` and save your changes.
Redeploy your `public-api` deployment so Consul injects a proxy sidecar into its pod.
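For example — the label selector below is an assumption; adjust it to the labels in your manifests:

```shell
kubectl apply -f hashicups/public-api.yaml

# Label selector is an assumption -- expect READY 2/2 once the sidecar is injected
kubectl get pods -l app=public-api
```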
Open the HashiCups URL in your browser and refresh the HashiCups UI.
Notice that the HashiCups UI functions correctly. You have successfully resolved the problem using Consul's data plane observability.
Enable Consul control plane observability features
The Consul control plane is responsible for providing policy and configuration for all running data planes in your service mesh. The control plane turns all of your data planes into a distributed system.
Consul control plane observability features provide detailed statistics and logging data to give you insight into the operational health and performance of your Consul cluster.
Review and enable control plane metrics
Consul lets you expose Prometheus metrics for your service mesh applications and sidecars. Review the highlighted lines in the values file below to see the parameters that enable this feature.
Refer to the Consul metrics for Kubernetes documentation and official Helm chart values to learn more about metrics configuration options and details.
Configure your Consul cluster to let Prometheus collect metrics from your control plane.
Note
The Helm upgrade could take up to 5 minutes to complete. Feel free to explore the next sections of this tutorial while waiting for your updated Consul environment to become available.
In addition to configuring Consul, you need to modify the anonymous ACL policy to allow `agent:read` permissions so Prometheus can scrape metrics from the secured Consul servers.
Review the Consul ACL Policies documentation to learn more.
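A sketch of the change with the Consul CLI — the policy name is an assumption, while the token ID shown is Consul's well-known anonymous token accessor:

```shell
# Create a policy granting agent:read on all agents (policy name is an assumption)
consul acl policy create -name "anonymous-agent-read" \
  -rules 'agent_prefix "" { policy = "read" }'

# Attach the policy to the built-in anonymous token
consul acl token update -id 00000000-0000-0000-0000-000000000002 \
  -policy-name "anonymous-agent-read"
```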
Note
In a production environment, we recommend using the Prometheus Consul Exporter for the most secure, restrictive access to Consul metrics on port `8501`.
Explore Consul control plane metrics and logs with Grafana
Consul control plane metrics and logs provide you with detailed health and performance information for your Consul servers. In this section, you will use Grafana to examine how this information provides insights into your Consul control plane.
Health and performance insights
Navigate to the control plane monitoring dashboard.
Note
The Grafana dashboard may take a few moments to fully load in your browser.
Notice that the example dashboard panes provide detailed performance insights for the Consul control plane.
Consul contains a large set of statistics that you can utilize to monitor your service mesh control plane. Refer to the Consul telemetry overview for a complete list and description of available metrics.
Event and error insights
Navigate to the control plane logs dashboard.
Note
The Grafana dashboard may take a few moments to fully load in your browser.
Notice that the example dashboard panes provide detailed event and error insights for your Consul control plane.
Enable HCP Consul Observability
The HCP Consul management plane provides deeper insight into your Consul deployments through cloud-based observability. It seamlessly links new and existing self-managed Consul clusters, simplifying observability for distributed Consul deployments.
Link your self-managed Consul cluster to HCP
Log in to the HCP cloud portal in your browser.
Click Get Started with Consul.
Click Self-Managed Consul and for linking method select Link existing. Click the Get Started button once complete.
Enter a name for your Consul cluster and select the Kubernetes runtime. We recommend using the cluster’s datacenter name as the cluster ID in this field. Click the Continue button once complete.
Select your preferred tool for updating your Consul deployment, Consul-K8S CLI or Helm, then perform only the first step, which sets the secrets used to authenticate with HCP.
Confirm that you set the Kubernetes secrets required for linking your self-managed Consul cluster to HCP Consul Central. You should find five secrets whose names start with `consul-hcp`.
Review and link your cluster to HCP Consul Central
Consul lets you connect your self-managed cluster with HCP Consul. Review the highlighted lines in the values file below to see the parameters that enable this feature.
Configure your Consul cluster to link to HCP Consul Central.
Note
The Helm upgrade could take up to 5 minutes to complete. Feel free to explore the next sections of this tutorial while waiting for your updated Consul environment to become available.
Review the official Helm chart values to learn more about these settings.
Create intentions for the Consul telemetry collector
The Consul telemetry collector runs as a service in your mesh. To receive data plane metrics from your sidecar proxies, you need to create a service intention that authorizes proxies to push metrics to the collector.
Create intentions for the Consul telemetry collector.
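On Kubernetes, one way to express this intention is a `ServiceIntentions` resource like the sketch below. The source wildcard allows every mesh service to push metrics; in production you would list only the services you expect.

```shell
kubectl apply -f - <<EOF
apiVersion: consul.hashicorp.com/v1alpha1
kind: ServiceIntentions
metadata:
  name: consul-telemetry-collector
spec:
  destination:
    name: consul-telemetry-collector
  sources:
    - name: "*"
      action: allow
EOF
```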
Review the Consul telemetry collector documentation to learn more.
Restart sidecar proxies
You need to restart your sidecar proxies to apply the updated configuration. To do so, redeploy your HashiCups application.
Your sidecars will now begin forwarding metrics to your HCP observability dashboard.
Explore HCP Consul observability dashboard
HCP Consul control plane metrics provide you with detailed health and performance information for your self-managed or HCP-managed Consul servers. In this section, you will examine how these metrics and logs provide insights into your Consul control plane and data plane.
Return to the HCP dashboard page in your browser. It may take a moment to sync with your self-managed Consul cluster.
Click Observability on the navigation pane and explore the observability insights of your self-managed Consul cluster.
HCP Consul contains a large set of statistics that you can utilize to monitor your service mesh control plane. Refer to the HCP Consul observability documentation for a complete list and description of available metrics.
Clean up resources
Destroy the Terraform resources to clean up your environment. Confirm the destroy operation by entering `yes`.
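Run from the same directory you applied from:

```shell
terraform destroy   # confirm by entering: yes
```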
Note
Due to race conditions among the various cloud resources created in this tutorial, you may need to run the `destroy` operation twice to ensure all resources are properly removed.
Open the HCP Consul portal and unlink your self-managed cluster to clean up your HCP resources.
Next steps
In this tutorial, you enabled observability features for your Consul data plane and control plane to enhance the health and performance monitoring of your service mesh applications. You saw how these features can provide you with faster incident resolution, increased application understanding, and reduced operational overhead.
For more information about the topics covered in this tutorial, refer to the following resources: