Content Inspection (On‑Premises)
This overview is intended for DevOps and system administrators who operate the Content Inspection (CI) stack in customer‑controlled environments. For a high‑level introduction to the overall on‑premises solution, see On‑Premises Deployment.
Purpose & functionality
The CI stack scans files and payloads to detect sensitive content using your datasets and policies. It correlates matches with endpoint and cloud events to enrich lineage and detections. The system runs scanners inside your cloud accounts to honor data-locality requirements, with only metadata and results sent to the SaaS platform.
Components
- dlp-coordinator: orchestrates scan jobs, handles scheduling and back-pressure, and submits results to the SaaS service.
- ci-scanner: worker processes that retrieve artifacts, perform content analysis, and emit findings.
- dlp-tika: content parsers and extractors that handle a broad range of file types, performing text extraction and structure normalization.
- dlp-ocr: optical character recognition service that extracts text from images and PDFs.
- Storage integration: The system reads and writes cache objects in your cloud storage, which can be Google Cloud Storage (GCS), AWS S3, or Azure Blob Storage (AzBlob).
- SaaS channel: A secure control plane is used to receive policies and keys and submit results and telemetry to the Cyberhaven SaaS platform.
Data flow
- Policies and datasets are synchronized from the SaaS platform to the CI stack via the control plane.
- The dlp-coordinator accepts and enqueues scan jobs.
- The ci-scanner pulls jobs, fetches content from configured sources or the cache, and then invokes the dlp-tika and, if necessary, dlp-ocr components.
- Extracted text and structures are matched against datasets, and any matches are attached to the job results.
- Results and metrics are sent back to the SaaS platform for evaluation, lineage stitching, and detections.
- The cache buckets are updated based on retention and eviction policies to manage costs and performance.
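The steps above can be sketched as a simplified, in-process Python model. The component names come from this page, but the function boundaries, the substring-based matcher, and the stub extraction logic are illustrative assumptions, not the real APIs or policy format:

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class ScanJob:
    artifact: str          # path or object key of the content to scan
    content_type: str      # e.g. "text/plain" or "image/png"
    findings: list = field(default_factory=list)

# Datasets are modeled here as simple substring patterns; the real
# matching engine and dataset format are not part of this sketch.
DATASETS = {"ssn": "123-45-6789"}

def extract_text(job):
    # Stand-in for dlp-tika (parsing) and dlp-ocr (image/PDF text).
    if job.content_type.startswith("image/"):
        return f"ocr-text-of:{job.artifact}"                  # dlp-ocr path
    return f"parsed-text-of:{job.artifact} 123-45-6789"       # dlp-tika path

def scan(queue):
    # ci-scanner loop: pull a job, extract, match, attach findings.
    results = []
    while not queue.empty():
        job = queue.get()
        text = extract_text(job)
        for name, pattern in DATASETS.items():
            if pattern in text:
                job.findings.append(name)
        results.append(job)   # the coordinator would submit these upstream
    return results

# dlp-coordinator side: accept and enqueue scan jobs.
jobs = Queue()
jobs.put(ScanJob("report.docx", "application/msword"))
jobs.put(ScanJob("scan.png", "image/png"))
done = scan(jobs)
```

In the real stack the queue, extraction, and matching run in separate pods, and only metadata and findings leave your environment.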
Configuration & identity model
The deployment uses a single Helm chart with values files to control images, tags, ingress/TLS, storage, and feature flags (see Install). A consolidated identity (single service account, managed identity, or role) is used for pulling images and accessing storage to avoid per‑component identities. Access to the image registry can be configured via cloud IAM (preferred) or a pull secret. Provider‑specific values for identity, storage, and ingress are supplied per the provider guides: AWS • Azure • GCP.
Security & performance considerations
Security
- TLS termination should be configured at the ingress with valid certificates for your hostnames.
- A least-privilege IAM model should be used for image registry, cache buckets, and any data sources.
- Secrets are managed via Kubernetes Secrets or your secret store, and install tokens should be rotated periodically.
Performance
- The number of replicas and concurrency for components like ci-scanner can be tuned to match workload volume.
- Monitoring resource usage and queue depth is recommended.
- Cache and bucket sizing should be planned to balance cost and performance.
- It is advised to use cluster autoscaling and set resource requests and limits to prevent resource contention.
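As an illustration of the kind of sizing arithmetic these recommendations imply, here is a small Python sketch that derives a ci-scanner replica count from observed queue depth and per-replica throughput. The function name, thresholds, and parameters are made up for the example; an HPA or cluster autoscaler would enforce the actual bounds in practice:

```python
import math

def desired_replicas(queue_depth, drain_rate_per_replica,
                     target_drain_seconds, min_replicas=1, max_replicas=20):
    """Pick a replica count so the backlog drains within the target window.

    queue_depth: current number of queued scan jobs
    drain_rate_per_replica: jobs one replica completes per second
    target_drain_seconds: how quickly the backlog should clear
    """
    if queue_depth == 0:
        return min_replicas
    needed = math.ceil(queue_depth / (drain_rate_per_replica * target_drain_seconds))
    # Clamp to the allowed range to avoid flapping and runaway scale-out.
    return max(min_replicas, min(max_replicas, needed))

# 600 queued jobs, 2 jobs/s per replica, drain within 60 s -> 5 replicas.
replicas = desired_replicas(600, 2.0, 60)
```

Pairing a rule like this with resource requests and limits keeps scale-out predictable instead of letting contention decide scheduling.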
Monitoring
To track the health and throughput of the CI stack, you should monitor key metrics, including work queue depth, time-to-drain, job success/error rates, and parser/scan latencies. The dlp_coordinator_fatal_errors_per_component and dlp_coordinator_file_processing_time metrics provide detailed insights. For OCR specifically, you can monitor the number of files to OCR, success rates, and errors. It is also important to check pod and container health, including restarts, CPU, and memory usage.
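Two of the signals above, time-to-drain and job success/error rates, are simple derivations from raw readings. A minimal Python sketch, where the input values are hard-coded stand-ins for whatever your monitoring system reports:

```python
def time_to_drain(queue_depth, completion_rate):
    """Estimated seconds until the work queue empties, assuming no new jobs.

    completion_rate: jobs finished per second across all ci-scanner replicas.
    Returns None when the queue cannot drain at the current rate.
    """
    if completion_rate <= 0:
        return None
    return queue_depth / completion_rate

def error_ratio(success_count, error_count):
    """Share of failed jobs over a sampling window (0.0 when idle)."""
    total = success_count + error_count
    return error_count / total if total else 0.0

# Example readings (stand-in values, not real metric output):
ttd = time_to_drain(queue_depth=1200, completion_rate=40.0)   # 30.0 seconds
err = error_ratio(success_count=980, error_count=20)          # 0.02
```

A rising time-to-drain with a flat completion rate is usually the earliest sign that scanner capacity, not job volume, is the bottleneck.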
Refer to Monitoring and Troubleshooting for commands, example queries, and log locations.