Content Inspection Monitoring
On-premises content inspection monitoring guide
Introduction
This guide outlines monitoring practices to ensure that the on-premises content inspection (CI) stack is healthy. The monitoring strategy is based on key metrics provided by the CI stack, derived from the response status codes and response times of scan requests.
Additionally, this guide will be extended with new metrics that can be collected from endpoint Sensors running version 24.07 or higher. These metrics will have the added benefit of detecting connectivity issues between Sensors and the on-premises CI instance and of measuring the health of the CI stack as observed by the endpoints (including retries). Further development is ongoing to refine the collection and analysis of these new metrics, ensuring they provide actionable insights into the operational state of the CI stack.
Monitoring and Alerting
The nginx_ingress_controller_response_duration_seconds metric from the Ingress-NGINX controller provides insights into the latency of content inspection queries and the associated HTTP status codes. Under normal conditions, CI scans should complete within three minutes. If a request times out, the endpoints are configured to retry up to three times. It is recommended to use this metric for the following purposes:
- First-Attempt Failure Rate Analysis: Monitor the occurrence of request failures (indicated by 4xx and 5xx error codes) relative to successful queries (status 200). This analysis should reveal internal issues within the system components, though it reflects only the first-attempt failure rate since endpoints will retry on failures.
- Latency Assessment: Evaluate whether latencies remain within acceptable limits. Longer latencies can indicate scalability issues and may lead to endpoint retries. Example queries for both purposes are sketched after this list.
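As a starting point for these two analyses, queries along the following lines can be used in Grafana or the Prometheus UI. This is a sketch: the path label value is taken from the alerting queries below and may need to be adapted to your ingress configuration.
# First-attempt request rate broken down by HTTP status code (15-minute window)
sum by (status) (rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check"}[15m]))
# Approximate 95th percentile latency of successful scans, in seconds
histogram_quantile(0.95, sum by (le) (rate(nginx_ingress_controller_response_duration_seconds_bucket{path="/v2/dlp/check",status="200"}[15m])))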
Alerting
Cyberhaven recommends implementing the following alerts based on the nginx_ingress_controller_response_duration_seconds metric:
- Success Rate Below 75%: The following PromQL query computes the first-attempt success rate over 15-minute windows:
sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status="200"}[15m]))
/
sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status!="404"}[15m]))
If the value falls to 0.75 or below, an alert should be triggered.
IMPACT: While endpoints are designed to retry failed requests, these retries are limited. If the success rate remains below 75% for extended periods, such as 15 minutes, it could result in some data not being fully analyzed, potentially leading to missed content matches. To accurately determine the extent of the impact, additional log analysis would be required to identify which specific requests failed during the affected time period.
- Median Latency Exceeds 180 Seconds: This indicates that more than half of the requests are at risk of timing out and requiring retries. The following PromQL query can be used to extract the median latency in seconds:
sum(histogram_quantile(0.50, increase(nginx_ingress_controller_response_duration_seconds_bucket{path=~"/v2/dlp/check.*",status="200"}[${__interval}]))) by (pod)
If the value stays above 180 seconds for a considerable time window, such as 15 minutes, an alert should be triggered. Example alert expressions for both recommended alerts are sketched after this list.
IMPACT: Requests that exceed 180 seconds are prone to timeouts, which reduces the success rate of the content inspection stack. As described above, a sustained low success rate can lead to incomplete data analysis and potential gaps in content matches.
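For reference, the two alert conditions above can be expressed directly as PromQL expressions, for example as the expr of a Prometheus alerting rule. This is a sketch: the thresholds and 15-minute windows follow the recommendations above, the sustained-duration requirement would typically be expressed with a for: clause in the rule definition, and the median-latency expression aggregates buckets across pods rather than per pod, which is a simplification of the dashboard query shown earlier.
# Alert condition: first-attempt success rate at or below 75% over 15 minutes
(
  sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status="200"}[15m]))
  /
  sum(rate(nginx_ingress_controller_response_duration_seconds_count{path="/v2/dlp/check",status!="404"}[15m]))
) <= 0.75
# Alert condition: median latency of successful scans above 180 seconds over 15 minutes
histogram_quantile(0.50, sum by (le) (rate(nginx_ingress_controller_response_duration_seconds_bucket{path=~"/v2/dlp/check.*",status="200"}[15m]))) > 180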
Note that the stack is designed to be self-adjusting and typically does not require manual intervention from the customer side. In the event these alerts are triggered, it suggests that the CI stack may not be functioning properly. Immediate attention is required, and Cyberhaven support should be contacted to investigate and resolve the issue.
Additional CI stack metrics
The CI stack exports several additional metrics that can be leveraged to further analyze and narrow down observed issues, as well as to gather detailed operational statistics from the system. It is important to note that the primary recommendation remains to rely on the ingress-level metrics documented above for observed success rates and latencies. However, CI stack-specific metrics, exported via Prometheus by the dlp-coordinator component, provide valuable supplemental insights. By default, the dlp-coordinator service should already be configured with Prometheus scraping enabled.
Below is a list of useful dlp-coordinator Prometheus metrics along with their descriptions:
1. dlp_coordinator_fatal_errors_per_component (Counter)
- This metric aggregates fatal errors per component within the CI stack. If no issues are observed in the ingress-level metrics, these errors are typically benign. For example, Tika might fail to extract text from a document, but the original payload would still be passed to the scanner engine, allowing content inspection to continue. This metric is especially useful when diagnosing root causes of reduced success rates, as it can help quickly pinpoint problematic components.
2. dlp_coordinator_file_processing_time (Histogram)
- This metric provides a histogram of request processing times in seconds, categorized by status (success or failure). It is particularly helpful in further investigating delays that might have been identified by the ingress-level metrics, offering a more granular view of processing time variations across different components.
3. dlp_coordinator_file_size (Histogram)
- This metric tracks the histogram of file sizes in bytes sent for inspection. It can be valuable for understanding usage patterns and overall demand on the CI stack, providing insights into typical file sizes processed by the system.
4. dlp_coordinator_retries_per_component (Counter)
- This metric tracks the number of retries per component within the CI stack. It helps quantify how often operations had to be retried due to transient errors, offering insights into potential bottlenecks or temporary issues that were automatically handled by the system.
These additional metrics offer deeper visibility into the CI stack’s internal behavior and performance, allowing you to complement the ingress-level metrics with component-specific data for comprehensive system monitoring and troubleshooting.
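For example, queries along the following lines can help narrow down issues using these metrics. This is a sketch: label names such as component and status, as well as the standard _bucket suffix for histogram series, are assumptions based on the metric descriptions above; depending on the client library, counters may also be exposed with a _total suffix, so the names should be checked against the series actually exported by dlp-coordinator.
# Rate of fatal errors per CI component (assumes a component label)
sum by (component) (rate(dlp_coordinator_fatal_errors_per_component[15m]))
# 95th percentile request processing time in seconds, split by status (assumes a status label)
histogram_quantile(0.95, sum by (le, status) (rate(dlp_coordinator_file_processing_time_bucket[15m])))
# Median size in bytes of files sent for inspection
histogram_quantile(0.50, sum by (le) (rate(dlp_coordinator_file_size_bucket[15m])))
# Rate of retries per CI component (assumes a component label)
sum by (component) (rate(dlp_coordinator_retries_per_component[15m]))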
OCR service metrics
The following metrics are exported via Prometheus and are useful to monitor the performance of the OCR service.
- dlp_ocr_files_to_ocr (Counter)
This metric tracks the number of files received for OCR processing. It provides an overview of the workload and can help identify fluctuations in incoming files for OCR, which makes it helpful for capacity planning and monitoring intake volumes. A considerable percentage of requests is expected to involve images that need to be OCR-ed; these images may be either standalone (e.g., screenshots) or extracted from other document types.
- dlp_ocr_files_ocr_success (Counter)
This metric counts the total number of files that have been successfully processed by the OCR service. A decrease in this metric might signal issues within the OCR processing pipeline. Note that OCR is not expected to succeed on all images, for a variety of legitimate reasons, and this metric does not differentiate among those reasons. A success rate that is consistently below 90% may warrant further debugging. To compute the success rate, combine this metric with the metric below, as sketched in the example query after this list.
- dlp_ocr_files_with_errors (Counter)
This metric tallies the total number of files that encountered errors during OCR processing. Monitoring it can help diagnose the causes behind failures and identify trends that may require attention to improve overall OCR success rates. If the success rate is low, the OCR service may need to be scaled up further or tuned to scale up more quickly.
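Building on the note above, the OCR success rate can be approximated with a query along these lines. This is a sketch: it assumes the two counters cover the same set of files, and counters may be exposed with a _total suffix depending on the client library.
# Approximate OCR success rate over the last hour
sum(increase(dlp_ocr_files_ocr_success[1h]))
/
(sum(increase(dlp_ocr_files_ocr_success[1h])) + sum(increase(dlp_ocr_files_with_errors[1h])))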
These metrics offer a view of OCR processing performance, including workload, success rates, and failure patterns, to support informed monitoring and optimization.