Version: 25.09

Matching Elements Reference

Overview

Matching Elements are the components that perform pattern recognition and content analysis to identify sensitive data. They work in conjunction with Policy Elements to create comprehensive detection rules that can identify various types of sensitive information with high accuracy.

Element Categories

Keyword Matching Elements

Keyword elements match specific terms or phrases within content.

Keyword Element Structure

<Keyword id="keyword-identifier">
  <Group matchStyle="word">
    <Term>keyword1</Term>
    <Term>keyword2</Term>
    <Term>keyword3</Term>
  </Group>
</Keyword>

Keyword Attributes

| Attribute | Type | Description | Required | Values |
| --- | --- | --- | --- | --- |
| id | String | Unique identifier for the keyword list | Yes | Must be unique within rule pack |

Group Attributes

| Attribute | Type | Description | Required | Default | Values |
| --- | --- | --- | --- | --- | --- |
| matchStyle | String | How keywords should be matched | No | word | word, string, regex |

Match Styles

| Style | Description | Example | Use Case |
| --- | --- | --- | --- |
| word | Match whole words only | "account" matches "account number" but not "accountability" | Precise term matching |
| string | Match substring anywhere | "account" matches "accountability" | Broader pattern matching |
| regex | Treat terms as regular expressions | \baccount\b for word boundaries | Complex pattern matching |
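
The three match styles can be approximated with ordinary regular expressions. The Python sketch below illustrates the semantics described in the table; it is not the engine's implementation, and the helper name term_matches is hypothetical.

```python
import re

def term_matches(term: str, text: str, match_style: str = "word") -> bool:
    """Approximate the word/string/regex match styles (hypothetical helper)."""
    if match_style == "word":
        # Whole-word match: the term must sit between word boundaries.
        pattern = r"\b" + re.escape(term) + r"\b"
    elif match_style == "string":
        # Substring match: the term may appear anywhere, even inside a word.
        pattern = re.escape(term)
    elif match_style == "regex":
        # The term is already a regular expression.
        pattern = term
    else:
        raise ValueError(f"unknown match style: {match_style}")
    return re.search(pattern, text) is not None

print(term_matches("account", "account number", "word"))        # True
print(term_matches("account", "accountability", "word"))        # False
print(term_matches("account", "accountability", "string"))      # True
print(term_matches(r"\baccount\b", "accountability", "regex"))  # False
```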

Keyword Examples

Financial Terms:

<Keyword id="financial-keywords">
  <Group matchStyle="word">
    <Term>account</Term>
    <Term>balance</Term>
    <Term>payment</Term>
    <Term>transaction</Term>
    <Term>routing</Term>
    <Term>deposit</Term>
  </Group>
</Keyword>

Personal Information:

<Keyword id="personal-keywords">
  <Group matchStyle="word">
    <Term>social security</Term>
    <Term>SSN</Term>
    <Term>date of birth</Term>
    <Term>DOB</Term>
    <Term>driver license</Term>
  </Group>
</Keyword>

Regular Expression Elements

Regular expression elements provide powerful pattern matching capabilities for structured data.

Regex Element Structure

<Regex id="regex-identifier">
  <Pattern>regular-expression-pattern</Pattern>
</Regex>

Common Regex Patterns

Social Security Numbers:

<Regex id="ssn-pattern">
  <Pattern>\b\d{3}-?\d{2}-?\d{4}\b</Pattern>
</Regex>

Credit Card Numbers:

<Regex id="credit-card-pattern">
  <Pattern>\b(?:\d{4}[-\s]?){3}\d{4}\b</Pattern>
</Regex>

Phone Numbers:

<Regex id="phone-pattern">
  <Pattern>\b(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b</Pattern>
</Regex>

Email Addresses:

<Regex id="email-pattern">
  <Pattern>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b</Pattern>
</Regex>

IP Addresses:

<Regex id="ip-address-pattern">
  <Pattern>\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b</Pattern>
</Regex>
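
Patterns like these can be tried outside the rule pack before deployment. As a minimal sketch, the Python snippet below exercises the IP address pattern shown above and confirms that it rejects octets greater than 255. Python's re module is used here only for illustration; the engine's regex dialect may differ, and the function name find_ipv4 is hypothetical.

```python
import re

# Same expression as the ip-address-pattern element, split across two lines.
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)

def find_ipv4(text: str) -> list[str]:
    """Return every IPv4-looking address in the text."""
    return IPV4.findall(text)

print(find_ipv4("Server 192.168.1.10 replied"))  # ['192.168.1.10']
print(find_ipv4("bad 256.1.1.1"))                # []
```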

Built-in Function Elements

Built-in functions provide validated pattern matching with additional verification logic.

Available Functions

| Function Name | Description | Validation Method |
| --- | --- | --- |
| Func_credit_card_formatted | Credit card numbers with formatting | Luhn algorithm + format check |
| Func_credit_card_unformatted | Credit card numbers without formatting | Luhn algorithm |
| Func_ssn_formatted | SSN with dashes (XXX-XX-XXXX) | Format validation |
| Func_ssn_unformatted | SSN without formatting (XXXXXXXXX) | Format validation |
| Func_us_phone_formatted | US phone with formatting | Format validation |
| Func_us_phone_unformatted | US phone without formatting | Format validation |
| Func_email_address | Email addresses | RFC 5322 validation |
| Func_ip_address | IP addresses | IPv4/IPv6 validation |

Function Usage

<!-- Credit card with Luhn validation -->
<IdMatch idRef="Func_credit_card_formatted"/>

<!-- SSN with format validation -->
<IdMatch idRef="Func_ssn_formatted"/>

<!-- Email with RFC validation -->
<IdMatch idRef="Func_email_address"/>
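
For readers unfamiliar with the Luhn algorithm named in the table, the following Python sketch shows the standard checksum. It illustrates the general technique only and is not the code behind Func_credit_card_formatted; the function name luhn_valid is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

A checksum of this kind is why a validated function tends to produce fewer false positives than a bare 16-digit regex.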

Localized String Elements

Localized strings supply language-specific names and descriptions for resources. Keyword groups can also be scoped to a language with the langcode attribute.

LocalizedStrings Structure

<LocalizedStrings>
  <Resource idRef="financial-terms">
    <Name default="true" langcode="en">Financial Terms</Name>
    <Name langcode="es">Términos Financieros</Name>
    <Description default="true" langcode="en">Common financial terminology</Description>
    <Description langcode="es">Terminología financiera común</Description>
  </Resource>
</LocalizedStrings>

Localized Keywords

<Keyword id="payment-terms-localized">
  <Group matchStyle="word" langcode="en">
    <Term>payment</Term>
    <Term>invoice</Term>
    <Term>bill</Term>
  </Group>
  <Group matchStyle="word" langcode="es">
    <Term>pago</Term>
    <Term>factura</Term>
    <Term>cuenta</Term>
  </Group>
  <Group matchStyle="word" langcode="fr">
    <Term>paiement</Term>
    <Term>facture</Term>
    <Term>compte</Term>
  </Group>
</Keyword>

Advanced Matching Techniques

Proximity-Based Matching

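Elements can be required to occur within a specified distance of each other. In the example below, patternsProximity="300" on the Entity sets that distance for the elements in its patterns.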

<Entity id="credit-card-with-context" patternsProximity="300">
  <Pattern confidenceLevel="90">
    <IdMatch idRef="Func_credit_card_formatted"/>
    <Match idRef="credit-card-keywords"/>
  </Pattern>
</Entity>
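
The idea behind proximity matching can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the engine's algorithm: distance is assumed to be measured in characters between match start positions, and the helper within_proximity and the sample keyword are hypothetical.

```python
import re

CARD = r"\b(?:\d{4}[-\s]?){3}\d{4}\b"

def within_proximity(text: str, primary: str, keywords: list[str], proximity: int) -> bool:
    """True if any keyword starts within `proximity` characters of a primary match."""
    if not keywords:
        return False
    keyword_re = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    keyword_starts = [m.start() for m in keyword_re.finditer(text)]
    # Assumption: distance is the gap between the start offsets of the two matches.
    return any(
        abs(start - m.start()) <= proximity
        for m in re.finditer(primary, text)
        for start in keyword_starts
    )

near = "Card number: 4111-1111-1111-1111"
far = "card number" + " " * 400 + "4111-1111-1111-1111"
print(within_proximity(near, CARD, ["card number"], 300))  # True
print(within_proximity(far, CARD, ["card number"], 300))   # False
```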

Conditional Matching

Use logical operators to create complex matching conditions.

Any Element

Require a minimum number of the listed patterns to match. With minMatches="2", at least two of the three must match:

<Any minMatches="2">
  <Match idRef="financial-keywords"/>
  <Match idRef="payment-keywords"/>
  <Match idRef="banking-keywords"/>
</Any>

All Element

Require all patterns to match:

<All>
  <Match idRef="ssn-pattern"/>
  <Match idRef="personal-keywords"/>
  <Match idRef="government-keywords"/>
</All>

Not Element

Exclude certain patterns:

<Pattern confidenceLevel="80">
  <IdMatch idRef="ssn-pattern"/>
  <Match idRef="personal-context"/>
  <Not>
    <Match idRef="test-data-keywords"/>
  </Not>
</Pattern>
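
The Any, All, and Not elements behave like threshold, conjunction, and negation over the results of their children. The Python sketch below models that logic for the Not example above. It is a simplified illustration: the keyword lists are invented for the example, keyword checks are assumed to be whole-word and case-insensitive, and proximity is ignored.

```python
import re

SSN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

def contains_any(text: str, terms: list[str]) -> bool:
    """Whole-word, case-insensitive keyword check (assumed semantics)."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE) for t in terms)

def any_of(results: list[bool], min_matches: int = 1) -> bool:
    """Models <Any minMatches="n">: at least n children matched."""
    return sum(results) >= min_matches

def all_of(results: list[bool]) -> bool:
    """Models <All>: every child matched."""
    return all(results)

def pattern_matches(text: str) -> bool:
    """SSN plus personal context, unless test-data keywords are present (<Not>)."""
    has_ssn = SSN.search(text) is not None
    has_context = contains_any(text, ["SSN", "social security"])   # hypothetical list
    is_test_data = contains_any(text, ["sample", "test data"])     # hypothetical list
    return all_of([has_ssn, has_context]) and not is_test_data

print(pattern_matches("SSN: 123-45-6789"))         # True
print(pattern_matches("Sample SSN: 123-45-6789"))  # False
print(any_of([True, False, True], min_matches=2))  # True
```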

Case Sensitivity

Control case sensitivity for string matching:

<Keyword id="case-sensitive-terms">
  <Group matchStyle="string" caseSensitive="true">
    <Term>API</Term>
    <Term>SQL</Term>
    <Term>XML</Term>
  </Group>
</Keyword>

<Keyword id="case-insensitive-terms">
  <Group matchStyle="word" caseSensitive="false">
    <Term>confidential</Term>
    <Term>secret</Term>
    <Term>private</Term>
  </Group>
</Keyword>
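
The effect of the caseSensitive setting can be pictured with a short Python sketch. It assumes substring matching (the string match style) and uses a hypothetical helper, keyword_found; it is not the engine's implementation.

```python
import re

def keyword_found(term: str, text: str, case_sensitive: bool) -> bool:
    """Substring search with optional case folding (hypothetical helper)."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(re.escape(term), text, flags) is not None

print(keyword_found("API", "rotate the api token", case_sensitive=True))   # False
print(keyword_found("API", "rotate the api token", case_sensitive=False))  # True
print(keyword_found("SQL", "MySQL dump", case_sensitive=True))             # True
```

Case-sensitive matching suits acronyms such as API or SQL, where the lowercase form is often an unrelated word fragment.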

Pattern Validation

Format Validation

Built-in functions provide format validation for common data types:

<!-- Validates credit card format and Luhn checksum -->
<IdMatch idRef="Func_credit_card_formatted"/>

<!-- Validates SSN format (XXX-XX-XXXX) -->
<IdMatch idRef="Func_ssn_formatted"/>

<!-- Validates email format per RFC 5322 -->
<IdMatch idRef="Func_email_address"/>

Custom Validation

Create custom validation using regex with specific constraints:

<!-- US ZIP code validation -->
<Regex id="us-zip-code">
  <Pattern>\b\d{5}(?:-\d{4})?\b</Pattern>
</Regex>

<!-- Canadian postal code validation -->
<Regex id="canadian-postal-code">
  <Pattern>\b[A-Za-z]\d[A-Za-z][-\s]?\d[A-Za-z]\d\b</Pattern>
</Regex>
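
Both patterns can be checked quickly in Python before they go into a rule pack. The sketch below uses Python's re module purely as a test bench, on the assumption that its syntax is close enough to the engine's for these expressions.

```python
import re

US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")
CA_POSTAL = re.compile(r"\b[A-Za-z]\d[A-Za-z][-\s]?\d[A-Za-z]\d\b")

print(US_ZIP.search("Seattle, WA 98101").group())  # 98101
print(US_ZIP.search("98101-1234").group())         # 98101-1234
print(US_ZIP.search("order 1234567"))              # None (seven digits)
print(CA_POSTAL.search("K1A 0B1").group())         # K1A 0B1
print(CA_POSTAL.search("K1A 0B"))                  # None (incomplete)

# Format-only validation is broad: any standalone five-digit number passes.
print(US_ZIP.search("invoice 12345") is not None)  # True
```

The last line shows why format checks like these are usually paired with supporting keywords.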

Performance Optimization

Keyword Optimization

  1. Use Specific Terms: Prefer specific over generic keywords
  2. Limit List Size: Keep keyword lists under 1000 terms
  3. Group Related Terms: Organize keywords logically
  4. Use Word Matching: Prefer word matching over string matching

Regex Optimization

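  1. Anchor Patterns: Use \b for word boundaries
  2. Avoid Backtracking: Avoid nested or overlapping quantifiers, and use non-capturing groups (?:) where the captured text is not needed
  3. Limit Quantifiers: Avoid excessive * and + operators; prefer bounded forms such as {2,4}
  4. Test Performance: Validate regex performance with large content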
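
To illustrate what a non-capturing group changes, the Python sketch below compiles the credit card pattern with and without capturing. The non-capturing form stores no submatch, so there is less bookkeeping and the full match is what gets reported. Behavior is shown for Python's re module and may differ in detail from the engine.

```python
import re

capturing = re.compile(r"\b(\d{4}[-\s]?){3}\d{4}\b")
non_capturing = re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b")

print(capturing.groups)      # 1
print(non_capturing.groups)  # 0

# With a capturing group, findall returns the group's last repetition,
# not the whole card number.
print(capturing.findall("4111-1111-1111-1111"))      # ['1111-']
print(non_capturing.findall("4111-1111-1111-1111"))  # ['4111-1111-1111-1111']
```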

Pattern Ordering

Order patterns by selectivity (most specific first):

<Entity id="optimized-detection">
  <!-- Most specific pattern first -->
  <Pattern confidenceLevel="95">
    <IdMatch idRef="Func_credit_card_formatted"/>
    <Match idRef="specific-keywords"/>
  </Pattern>
  <!-- Less specific patterns follow -->
  <Pattern confidenceLevel="75">
    <IdMatch idRef="credit-card-regex"/>
    <Any minMatches="2">
      <Match idRef="general-keywords"/>
      <Match idRef="context-keywords"/>
    </Any>
  </Pattern>
</Entity>
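
One way to picture ordered evaluation is a loop that stops at the first pattern that matches. The Python sketch below is a conceptual model only: whether the engine short-circuits this way is an assumption, and the names ScoredPattern and first_matching_confidence, along with the sample predicates, are hypothetical.

```python
from typing import Callable, Optional

# A pattern is modeled as (confidence level, predicate over the text).
ScoredPattern = tuple[int, Callable[[str], bool]]

def first_matching_confidence(text: str, patterns: list[ScoredPattern]) -> Optional[int]:
    """Return the confidence of the first pattern that matches, or None."""
    for confidence, predicate in patterns:
        if predicate(text):
            return confidence  # later, less specific patterns are skipped
    return None

patterns: list[ScoredPattern] = [
    (95, lambda t: "4111" in t and "card number" in t.lower()),  # most specific
    (75, lambda t: "4111" in t),                                 # less specific
]

print(first_matching_confidence("Card number 4111...", patterns))  # 95
print(first_matching_confidence("ref 4111", patterns))             # 75
print(first_matching_confidence("nothing here", patterns))         # None
```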

Common Patterns Library

Financial Data Patterns

<!-- Credit Card Numbers -->
<Regex id="visa-pattern">
  <Pattern>\b4\d{3}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b</Pattern>
</Regex>

<Regex id="mastercard-pattern">
  <Pattern>\b5[1-5]\d{2}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b</Pattern>
</Regex>

<Regex id="amex-pattern">
  <Pattern>\b3[47]\d{2}[-\s]?\d{6}[-\s]?\d{5}\b</Pattern>
</Regex>

<!-- Bank Account Numbers -->
<Regex id="bank-account-pattern">
  <Pattern>\b\d{8,17}\b</Pattern>
</Regex>

<!-- Routing Numbers -->
<Regex id="routing-number-pattern">
  <Pattern>\b[0-9]{9}\b</Pattern>
</Regex>
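
The routing number pattern accepts any nine-digit number, so it benefits from a second check. US ABA routing numbers carry a checksum with weights 3, 7, 1 repeated across the nine digits, and the Python sketch below applies it as a post-filter. This filter is an illustration outside the rule pack schema, not a feature described in this reference; the function names are hypothetical.

```python
import re

ROUTING = re.compile(r"\b[0-9]{9}\b")

def aba_checksum_ok(candidate: str) -> bool:
    """ABA check: 3*d1 + 7*d2 + 1*d3 + ... repeated, must be divisible by 10."""
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    weights = (3, 7, 1) * 3
    total = sum(int(d) * w for d, w in zip(candidate, weights))
    return total % 10 == 0

def find_routing_numbers(text: str) -> list[str]:
    """Nine-digit candidates from the regex that also pass the checksum."""
    return [m for m in ROUTING.findall(text) if aba_checksum_ok(m)]

print(find_routing_numbers("ABA 021000021, ref 123456789"))  # ['021000021']
```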

Personal Information Patterns

<!-- Driver License Numbers -->
<Regex id="drivers-license-pattern">
  <Pattern>\b[A-Z]{1,2}\d{6,8}\b</Pattern>
</Regex>

<!-- Passport Numbers -->
<Regex id="passport-pattern">
  <Pattern>\b[A-Z]{2}\d{7}\b</Pattern>
</Regex>

<!-- Medical Record Numbers -->
<Regex id="medical-record-pattern">
  <Pattern>\bMRN[-\s]?\d{6,10}\b</Pattern>
</Regex>

Technical Patterns

<!-- API Keys -->
<Regex id="api-key-pattern">
  <Pattern>\b[A-Za-z0-9]{32,64}\b</Pattern>
</Regex>

<!-- Database Connection Strings -->
<Regex id="connection-string-pattern">
  <Pattern>(?i)(?:server|data source|host)=.+?(?:;|$)</Pattern>
</Regex>

<!-- File Paths: &lt; and &gt; are XML-escaped inside the pattern -->
<Regex id="file-path-pattern">
  <Pattern>(?:[A-Za-z]:\\|/)(?:[^\\/:*?"&lt;&gt;|\r\n]+[\\\/])*[^\\/:*?"&lt;&gt;|\r\n]*</Pattern>
</Regex>

Testing and Validation

Pattern Testing

  1. Positive Tests: Verify patterns match intended content
  2. Negative Tests: Ensure patterns don't match unintended content
  3. Edge Cases: Test boundary conditions and special characters
  4. Performance Tests: Measure execution time with large content
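
The positive and negative tests above can be automated with a small harness. The Python sketch below runs the SSN pattern from this reference against hand-picked samples; the sample strings and the helper run_cases are hypothetical, and Python's re module stands in for the engine.

```python
import re

SSN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")

POSITIVE = ["123-45-6789", "SSN 123456789 on file"]
NEGATIVE = ["1234-56-7890", "phone 555-1234", "order 1234567890"]

def run_cases(pattern: re.Pattern, positive: list[str], negative: list[str]) -> list[str]:
    """Return a description of every failed case; an empty list means all passed."""
    failures = []
    for text in positive:
        if not pattern.search(text):
            failures.append(f"missed: {text!r}")
    for text in negative:
        if pattern.search(text):
            failures.append(f"false positive: {text!r}")
    return failures

print(run_cases(SSN, POSITIVE, NEGATIVE))  # []
```

Edge cases and larger performance samples can be added to the same lists as the pattern evolves.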

Validation Checklist

  • Regex patterns are syntactically correct
  • Keyword lists contain relevant terms
  • Match styles are appropriate for use case
  • Case sensitivity settings are correct
  • Proximity values are reasonable
  • Performance is acceptable with test content

Best Practices

Design Guidelines

  1. Start Simple: Begin with basic patterns, add complexity gradually
  2. Use Built-ins: Prefer built-in functions over custom regex when available
  3. Test Thoroughly: Validate with representative content samples
  4. Document Patterns: Include comments explaining complex regex patterns

Performance Guidelines

  1. Optimize Regex: Use efficient regex patterns
  2. Limit Scope: Use appropriate proximity values
  3. Order Elements: Place most selective patterns first
  4. Monitor Performance: Track execution times and resource usage

Maintenance Guidelines

  1. Version Control: Track pattern changes over time
  2. Regular Review: Periodically assess pattern effectiveness
  3. Update Keywords: Keep keyword lists current with evolving terminology
  4. Performance Monitoring: Watch for degradation over time