On-premises content inspection monitoring and troubleshooting guide

This guide outlines monitoring practices to ensure the on-premises content inspection (CI) stack is healthy. The monitoring strategy is based on key metrics provided by the CI stack, which rely on the response status codes and response times of scan requests.

Additionally, this guide will be extended with new metrics that can be collected from endpoint Sensors running version 24.07 or higher. These metrics add the ability to detect connectivity issues between Sensors and the on-premises CI instance and to measure the health of the CI stack as observed by the endpoints (e.g., including retries).

Monitoring and Alerting

The nginx_ingress_controller_response_duration_seconds metric from the Ingress-NGINX controller provides insights into the latency of content inspection queries and the associated HTTP status codes. Under normal conditions, CI scans should complete within three minutes. If a request times out, the endpoints are configured to retry up to three times. It is recommended to use this metric for the following purposes:

Note: You are most likely using a managed ingress service from your provider, so this particular metric might not be accessible. If that's the case, use your provider's ingress metrics for latency and status-code breakdowns, and complement them with the CI stack metrics documented below.

  1. First-Attempt Failure Rate Analysis: Monitor the occurrence of request failures (indicated by 4xx and 5xx error codes) relative to successful queries (status 200). This analysis should reveal internal issues within the system components, though it reflects only the first-attempt failure rate since endpoints will retry on failures.
  2. Latency Assessment: Evaluate whether latencies remain within acceptable limits. Longer latencies can indicate scalability issues and may lead to endpoint retries.

Alerting

Cyberhaven recommends implementing the following alerts based on the ingress latency and status metric described above (nginx_ingress_controller_response_duration_seconds, or your provider's equivalent); an example Prometheus rule file is sketched after this list:

  1. Success Rate below 75%: The following PromQL query computes the first-attempt success rate over 15-minute windows:

    Note: In managed ingress setups, nginx_ingress_controller_response_duration_seconds_count may not be exposed. Use your provider's ingress metrics for request counts by HTTP status and adapt the query accordingly.

    sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status="200"}[15m]))
    /
    sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status!="404"}[15m]))

    If the value falls to 0.75 or below, an alert should be triggered.

  2. Median Latency Exceeds 180 Seconds: This indicates that more than half of the requests are at risk of timing out and requiring retries. The following PromQL query can be used to extract the median latency in seconds:

    histogram_quantile(0.50, sum by (le) (increase(nginx_ingress_controller_response_duration_seconds_bucket{path=~"/v2/dlp/check.*",status="200"}[${__interval}])))

    If the value goes above 180 seconds for a considerable time window, such as 15 minutes, an alert should be triggered.

    Note that the stack is designed to be self-adjusting and typically does not require manual intervention from the customer side. If these alerts are triggered, the CI stack may not be functioning properly; immediate attention is required, and Cyberhaven support should be contacted to investigate and resolve the issue.
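
The two alerts above could be expressed as a Prometheus rule file similar to the following minimal sketch. The group and alert names, severities, and for: durations are illustrative; in managed ingress setups, substitute your provider's equivalent request-count and latency metrics.

groups:
  - name: content-inspection-ingress   # illustrative group name
    rules:
      - alert: CIFirstAttemptSuccessRateLow   # hypothetical alert name
        expr: |
          sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status="200"}[15m]))
          /
          sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status!="404"}[15m]))
          <= 0.75
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "First-attempt CI success rate is at or below 75%"

      - alert: CIMedianLatencyHigh   # hypothetical alert name
        expr: |
          histogram_quantile(0.50, sum by (le) (rate(nginx_ingress_controller_response_duration_seconds_bucket{path=~"/v2/dlp/check.*",status="200"}[15m])))
          > 180
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Median CI scan latency exceeds 180 seconds"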

Additional CI stack metrics

The CI stack exports several additional metrics that can be leveraged to further analyze and narrow down observed issues, as well as to gather detailed operational statistics from the system. It is important to note that the primary recommendation remains to rely on ingress-level metrics documented above for observed success rates and latencies. However, CI stack-specific metrics, exported via Prometheus by the dlp-coordinator component, provide valuable supplemental insights. By default, the dlp-coordinator service should already be configured with Prometheus scraping enabled.

Below is a list of useful dlp-coordinator Prometheus metrics along with their descriptions; example queries over these metrics are sketched after the list.

1. dlp_coordinator_fatal_errors_per_component (Counter)

This metric aggregates fatal errors per component within the CI stack. If no issues are observed in the ingress-level metrics, these errors are typically benign. For example, Tika might fail to extract text from a document, but the original payload would still be passed to the scanner engine, allowing content inspection to continue. This metric is especially useful when diagnosing root causes of reduced success rates, as it can help quickly pinpoint problematic components.

2. dlp_coordinator_file_processing_time (Histogram)

This metric provides a histogram of request processing times in seconds, categorized by status (success or failure). It is particularly helpful in further investigating delays that might have been identified by the ingress-level metrics, offering a more granular view of processing time variations across different components.

3. dlp_coordinator_file_size (Histogram)

This metric tracks the histogram of file sizes in bytes sent for inspection. It can be valuable for understanding usage patterns and overall demand on the CI stack, providing insights into typical file sizes processed by the system.

4. dlp_coordinator_retries_per_component (Counter)

This metric tracks the number of retries per component within the CI stack. It helps quantify how often operations had to be retried due to transient errors, offering insights into potential bottlenecks or temporary issues that were automatically handled by the system.
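
As a sketch of how these metrics can be queried, the following Prometheus recording rules derive the 95th-percentile processing time and file size and the per-component retry rate; the same expressions can be used directly in dashboards. The rule names are illustrative, and the _bucket suffixes and label names (status, component) are assumptions based on the descriptions above and may differ in your deployed version.

groups:
  - name: dlp-coordinator-stats   # illustrative group name
    rules:
      # 95th-percentile request processing time in seconds, by status
      - record: dlp_coordinator:file_processing_time:p95
        expr: histogram_quantile(0.95, sum by (le, status) (rate(dlp_coordinator_file_processing_time_bucket[5m])))
      # 95th-percentile file size in bytes sent for inspection
      - record: dlp_coordinator:file_size:p95
        expr: histogram_quantile(0.95, sum by (le) (rate(dlp_coordinator_file_size_bucket[5m])))
      # Retries per second, broken down by CI component
      - record: dlp_coordinator:retries:rate5m
        expr: sum by (component) (rate(dlp_coordinator_retries_per_component[5m]))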

OCR metrics

The following metrics are exported via Prometheus and are useful to monitor the performance of the OCR service. They offer a view of OCR processing performance, including workload, success rates, failure patterns, and processing time distribution, to support informed monitoring and optimization.

dlp_ocr_files_to_ocr (Counter)

This metric tracks the number of files received for OCR processing. It provides an overview of the workload, can help identify fluctuations in incoming files for OCR, and is useful for capacity planning and monitoring intake volumes. A considerable percentage of requests are expected to involve images that need to be OCRed; these images may be either standalone (e.g., screenshots) or extracted from other document types.

dlp_ocr_files_ocr_success (Counter)

This metric counts the total number of files that have been successfully processed by the OCR service. A decrease in this metric might signal issues within the OCR processing pipeline. Note that it is expected that OCR will not succeed on all images for a variety of legitimate reasons. This metric does not differentiate among the possible reasons. A success rate that is consistently below 90% may need to be further debugged. To get the success rate, use this metric as well as the metric below.

dlp_ocr_files_with_errors (Counter)

This metric tallies the total number of files that encountered errors during OCR processing. Monitoring it can help diagnose the causes behind failures and identify trends that may require attention to improve overall OCR success rates. If the success rate is low, the OCR service likely needs to be scaled up further and may need tuning to scale up more quickly.
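
To be notified when the OCR success rate drops below the 90% guideline mentioned above, a Prometheus alert rule along the lines of the sketch below could be used. The group and alert names and the for: duration are illustrative, and the threshold should be tuned since some OCR failures are expected for legitimate reasons.

groups:
  - name: dlp-ocr   # illustrative group name
    rules:
      - alert: OCRSuccessRateLow   # hypothetical alert name
        expr: |
          sum(rate(dlp_ocr_files_ocr_success[15m]))
          /
          sum(rate(dlp_ocr_files_to_ocr[15m]))
          < 0.90
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "OCR success rate has been below 90% for 30 minutes"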

General Recommendation for Minimum Replicas

To enhance reliability and resiliency, particularly in environments prone to load spikes, it is recommended to increase the minReplicas to reflect the rough baseline values observed during normal operation. This ensures that the components can handle sudden surges in requests, even after periods of inactivity. A downscaled environment may struggle to cope with unexpected traffic increases, potentially impacting performance.

A practical approach is to take note of the typical number of replicas during regular functioning. The default minReplicas value might be too low, so adjusting it to match the rough average of replicas seen in normal conditions is advised. This should be applied to the following HPAs:

  • dlp-coordinator
  • content-inspection-scanner
  • dlp-tika
  • dlp-ocr

By setting minReplicas according to this rough baseline, the system can better handle varying loads and maintain stability.
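
For example, if the dlp-coordinator HPA typically runs around four replicas during normal operation, its minReplicas could be raised accordingly. The fragment below is a minimal sketch showing only the relevant portion of the HPA spec; the replica numbers are assumptions, and all other fields should stay as deployed. The same change applies to the other HPAs listed above.

# Relevant portion of the HPA spec; other fields stay as deployed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dlp-coordinator
spec:
  minReplicas: 4    # rough baseline observed during normal operation (assumed value)
  maxReplicas: 20   # keep your existing maximum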

Troubleshooting

When observed success rates fall below the recommended thresholds, it is important to investigate key areas to identify and resolve potential issues. The sections below outline critical aspects to examine, including service stability, component failures, and scalability challenges. By addressing these factors, operational issues within the on-premises content inspection (CI) stack can be diagnosed and remediated to restore optimal performance.

Service Stability

Ensure that all deployed services are operational and not experiencing frequent restarts. While occasional restarts are normal, multiple restarts due to crashes or out-of-memory (OOM) issues require immediate attention. If such problems are detected, it's recommended to contact Cyberhaven with the logs from the crashed container, including the exit code and any relevant context.
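
A sketch of an alert on frequent restarts is shown below. It assumes kube-state-metrics is deployed (providing kube_pod_container_status_restarts_total) and reuses the dlp/content/external container naming pattern used elsewhere in this guide; the alert name and threshold are illustrative.

groups:
  - name: ci-stability   # illustrative group name
    rules:
      - alert: CIContainerRestarting   # hypothetical alert name
        expr: |
          increase(kube_pod_container_status_restarts_total{container=~"^(dlp|content|external).*"}[1h]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} restarted more than 3 times in the last hour"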

Identify Failing Components

Failures can often be isolated to a specific component or path. The dlp_coordinator_fatal_errors_per_component metric can help identify which components are experiencing issues. If the number of failures significantly exceeds those observed in previous time windows where the system was operating normally, it's essential to check the logs of the affected containers. Failures may be caused by missing permissions or communication issues with the Cyberhaven SaaS platform, which should be evident in the logs. These problems can sometimes arise after operational changes, such as modifications to the service accounts used for authentication to SaaS services. If the issue cannot be resolved internally, it is advisable to contact Cyberhaven with the logs and relevant metrics for further diagnosis.
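
One way to compare current failures against a previous window in which the system was healthy is an offset-based expression, sketched below as an alert. The component label name, the one-day offset, and the doubling threshold are assumptions to adapt to your environment.

groups:
  - name: ci-component-failures   # illustrative group name
    rules:
      - alert: CIComponentFatalErrorsIncreased   # hypothetical alert name
        expr: |
          sum by (component) (increase(dlp_coordinator_fatal_errors_per_component[1h]))
          > 2 * sum by (component) (increase(dlp_coordinator_fatal_errors_per_component[1h] offset 1d))
        labels:
          severity: warning
        annotations:
          summary: "Fatal errors for {{ $labels.component }} more than doubled compared with the same window yesterday"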

Scalability Issues

A drop in success rates may indicate scalability problems if the components are not scaling up as expected. It is recommended to monitor the number of replicas for each horizontally scalable component (e.g., content-inspection-scanner, dlp-coordinator, dlp-tika) to check for any decrease in replica counts or changes in scaling behavior. Additionally, investigating Horizontal Pod Autoscaler (HPA) failures could reveal resource constraints, such as underprovisioning (i.e., inability to scale due to lack of resources in the node pool) or missing scaling metrics. As an immediate remedy, consider manually scaling up the component by increasing the minimum number of replicas beyond the levels observed during periods of normal operation, and contact Cyberhaven for further support.
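
To catch the case where an HPA is pinned at its maximum and cannot scale further, the sketch below uses the standard kube-state-metrics HPA series (assuming kube-state-metrics is deployed; the horizontalpodautoscaler label name is its default). The alert name and regex covering the HPAs listed earlier are illustrative.

groups:
  - name: ci-scaling   # illustrative group name
    rules:
      - alert: CIHpaAtMaxReplicas   # hypothetical alert name
        expr: |
          kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler=~"dlp-.*|content-inspection-scanner"}
          >= kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler=~"dlp-.*|content-inspection-scanner"}
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.horizontalpodautoscaler }} has been at its maximum replica count for 15 minutes"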

Creating a diagnosis bundle

To get debugging information from Kubernetes, run the diagnosis script and attach the resulting archive. The script only extracts information about Kubernetes events, not private information. If you are concerned about any sensitive information in the container logs, you can use the --disable-collect-logs option to skip log collection.

  • Script: diagnosis.sh (request the latest version from Cyberhaven support)

OCR metrics dashboard

Use these PromQL queries to build a dashboard either in GCP or in Prometheus, then share screenshots of the charts for the relevant period.

To collect these metrics, deploy Prometheus in your cluster or use a managed Prometheus service.
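
If you run your own Prometheus, a minimal scrape job sketch is shown below. It relies on the common prometheus.io/* pod annotations for discovery, which is an assumption about how the CI pods are annotated; managed Prometheus services typically provide their own collection configuration instead.

scrape_configs:
  - job_name: kubernetes-pods   # illustrative job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Honor a custom metrics path, if annotated
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: '(.+)'
      # Honor a custom metrics port, if annotated
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__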

**Average processing time**

Prometheus: sum(increase(dlp_ocr_file_processing_time_sum[5m])) / sum(increase(dlp_ocr_file_processing_time_count[5m]))
GCP: avg(increase(dlp_ocr_file_processing_time[${__interval}]))

**Success rate**

(sum(rate(dlp_ocr_files_to_ocr[5m])) - sum(rate(dlp_ocr_files_with_errors[5m])))
/
sum(rate(dlp_ocr_files_to_ocr[5m])) * 100

**Files to OCR**

Prometheus: sum(rate(dlp_ocr_files_to_ocr[5m]))
GCP: sum(rate(dlp_ocr_files_to_ocr[${__interval}]))

**Number of replicas for dlp-ocr**

kube_deployment_spec_replicas{deployment="dlp-ocr"}

**Container restarts**

topk(30, sum by ("container_name")(rate({"__name__"="kubernetes_io:container_restart_count","monitored_resource"="k8s_container","container_name"=~"^dlp.*|^content.*|^extern.*"}[${__interval}])))

**Termination reasons**

sum by (reason,container)(avg_over_time(kube_pod_container_status_last_terminated_reason{container=~"^(dlp|content|external).*"}[${__interval}]))