How Redaction Category Detection Works

Understanding what the redaction categories detect (and what they don’t)

Written by Ciara Scott
Updated over a week ago

The redaction system uses pattern-based detection to identify sensitive information in documents. Each redaction category relies on a combination of keywords, formats, and structural signals to determine whether content should be redacted.

💡 Tip
For the most precise extraction of data points, we recommend using Dragon AI. It’s designed to understand context and structure beyond pattern-based detection, making it better suited for complex or ambiguous data.

Personally Identifiable Information (PII)

Email Address

What the system looks for
A standard email structure containing:

  • A username

  • An @ symbol

  • A valid domain and extension

Will match

  • user@example.com

Won’t match

  • user@example (missing domain extension)

  • user.example.com (missing @ symbol)
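
As a rough illustration of the structure described above, the following Python sketch checks for a username, an @ symbol, and a domain with an extension. The pattern and the `find_emails` helper are assumptions made for this example, not the pattern the redaction system actually uses.

```python
import re

# Hypothetical pattern: username, "@", domain, then a dot-separated extension.
EMAIL_RE = re.compile(
    r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b"
)

def find_emails(text: str) -> list[str]:
    """Return every substring that looks like an email address."""
    return [m.group(0) for m in EMAIL_RE.finditer(text)]
```

Under this sketch, `user@example` is skipped because nothing follows the domain, and `user.example.com` is skipped because it has no @ symbol.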


Phone Number

What the system looks for
Recognized phone number formats that include separators and optional country or area codes.

Will match

  • 123-456-7890

  • (555) 123-4567

  • +1 415-555-2671

Won’t match

  • 1234567890 (no separators)

  • 123-456 (too short)
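
The sketch below shows one way a separator-based phone pattern could behave. It is an assumption built only from the examples above (US-style groupings, an optional `+` country code), and `find_phone_numbers` is a hypothetical helper, so the real detector may accept more formats.

```python
import re

# Hypothetical pattern: optional "+<country code>", a 3-digit area code
# (with or without parentheses), then 3 and 4 digits joined by separators.
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+\d{1,3}[ .-])?(?:\(\d{3}\) ?|\d{3}[ .-])\d{3}[ .-]\d{4}(?!\w)"
)

def find_phone_numbers(text: str) -> list[str]:
    """Return phone-like substrings that contain separators."""
    return [m.group(0) for m in PHONE_RE.finditer(text)]
```

Because every group must be followed by a separator, a bare run such as `1234567890` never satisfies the pattern.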


Address

What the system looks for
Street addresses with clear structural indicators.

Required signals

  • A street-type keyword (e.g. Street, Road, Avenue, Lane) or a unit prefix (e.g. Apt, Suite, Flat)

  • A valid postal code (US ZIP or UK postcode)

Will match

  • 123 Main Street, New York, NY 10001

  • Flat 3, 78 Victoria Road, Edinburgh, EH1 2JW

  • PO Box 123, London, AA1 1AA

Won’t match

  • 123 Main (missing street type and postal code)

  • Order #12345 (numbers without address context)
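
A minimal sketch of the two signals, assuming both must be present for a match. The keyword list comes from the examples in this article (plus PO Box, which appears among the matching examples), and the postcode patterns are simplified approximations rather than the system's own rules.

```python
import re

# Assumed keyword list, taken from the examples in this article.
STREET_OR_UNIT = re.compile(
    r"\b(?:Street|Road|Avenue|Lane|Apt|Suite|Flat|PO Box)\b", re.IGNORECASE
)
# Simplified postal code shapes: US ZIP (optionally ZIP+4) and UK postcode.
US_ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")

def has_address_signals(text: str) -> bool:
    """True when text has a street/unit keyword and a postal code."""
    has_keyword = STREET_OR_UNIT.search(text) is not None
    has_postcode = bool(US_ZIP.search(text) or UK_POSTCODE.search(text))
    return has_keyword and has_postcode
```

`Order #12345` contains a five-digit number that resembles a ZIP code, but it is rejected here because no street or unit keyword accompanies it.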


US Social Security Number (SSN)

What the system looks for
An SSN in the standard dashed format, with correct digit groupings, that passes known validation rules.

Will match

  • 123-45-6789

Won’t match

  • 123456789 (missing dashes)

  • 123-45-67890 (incorrect number of digits)

  • 000-00-0000 (invalid values)
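
The article does not list the validation rules, so the sketch below assumes the commonly cited ones: the first group cannot be 000, 666, or 900-999, the second cannot be 00, and the third cannot be 0000. `find_ssns` is a hypothetical helper for illustration.

```python
import re

# Assumed rules: area not 000/666/9xx, group not 00, serial not 0000.
SSN_RE = re.compile(r"\b(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

def find_ssns(text: str) -> list[str]:
    """Return dashed SSN-like values that pass the assumed rules."""
    return [m.group(0) for m in SSN_RE.finditer(text)]
```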


Age

What the system looks for
Age expressions that combine numbers with time units or age-specific language.

Will match

  • Age: 25 years

  • 30 years old

  • 6 months

Won’t match

  • 25 (number without age context)

  • Year 25 (not age-related)
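
One way to picture "a number plus age context" is the pattern below. The unit list and the treatment of an `Age:` label are assumptions for this example, and `find_ages` is a hypothetical helper.

```python
import re

# Hypothetical pattern: either an "age" label followed by a number,
# or a number followed by a time unit (optionally "old").
AGE_RE = re.compile(
    r"\bage\s*:?\s*\d{1,3}(?:\s*(?:years?|months?))?(?:\s+old)?"
    r"|\b\d{1,3}\s*(?:years?|months?|weeks?|days?)(?:\s+old)?\b",
    re.IGNORECASE,
)

def find_ages(text: str) -> list[str]:
    """Return age-like expressions found in text."""
    return [m.group(0) for m in AGE_RE.finditer(text)]
```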


Gender

What the system looks for
Explicit gender terms or gender labels with context.

Will match

  • male, female, non-binary, gender-fluid, transgender

  • Gender: M

  • Gender: F

Won’t match

  • M or F on their own (no gender context)
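
The difference between a labelled and an unlabelled `M` can be sketched as follows. The term list mirrors the examples above, and both the pattern and the `has_gender` helper are illustrative assumptions.

```python
import re

# Hypothetical pattern: an explicit gender term, or a "Gender:" label
# followed by a single-letter code.
GENDER_RE = re.compile(
    r"\b(?:male|female|non-binary|gender-fluid|transgender)\b"
    r"|\bgender\s*:\s*[MF]\b",
    re.IGNORECASE,
)

def has_gender(text: str) -> bool:
    """True when text contains a gender term or a labelled code."""
    return GENDER_RE.search(text) is not None
```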


UK National Insurance Number

What the system looks for
A valid UK National Insurance number format with correct prefixes and suffixes.

Will match

  • AB 12 34 56 C

Won’t match

  • Invalid prefixes or formats that do not meet NI standards
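
The article does not spell out the NI rules, so this sketch assumes the commonly published ones: the letters D, F, I, Q, U, and V are not used in the prefix, O is not used as the second letter, a handful of prefixes such as GB and ZZ are unallocated, and the suffix is A to D. `is_ni_number` is a hypothetical helper.

```python
import re

# Assumed NI rules; the redaction system's own checks may differ.
NI_RE = re.compile(
    r"(?!BG|GB|NK|KN|TN|NT|ZZ)"           # unallocated prefixes
    r"[A-CEGHJ-PR-TW-Z][A-CEGHJ-NPR-TW-Z]"  # two prefix letters
    r" ?\d{2} ?\d{2} ?\d{2} ?"             # six digits, optional spaces
    r"[A-D]"                                # suffix letter
)

def is_ni_number(candidate: str) -> bool:
    """True when the whole string looks like a valid NI number."""
    return NI_RE.fullmatch(candidate) is not None
```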


US ZIP Code

What the system looks for
A ZIP code paired with a US state abbreviation.

Will match

  • NY 10001

  • CA 90210-1234

Won’t match

  • 10001 (missing state)

  • 12345 (no state context)
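
The pairing of state and ZIP can be sketched like this. The state list (50 states plus DC) and the single-space separator are assumptions for the example, and `find_state_zips` is a hypothetical helper.

```python
import re

# Assumed set of accepted abbreviations: 50 states plus DC.
US_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN "
    "MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA "
    "WA WV WI WY DC".split()
)
STATE_ZIP_RE = re.compile(r"\b([A-Z]{2}) (\d{5}(?:-\d{4})?)\b")

def find_state_zips(text: str) -> list[str]:
    """Return 'ST 12345' or 'ST 12345-6789' values with a known state."""
    return [
        m.group(0)
        for m in STATE_ZIP_RE.finditer(text)
        if m.group(1) in US_STATES
    ]
```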


Financial and Business Information

Credit Card Number

What the system looks for
Credit card numbers that are 15 or 16 digits long and pass validation.

Will match

  • 4532-0151-1283-0366 (Visa)

  • 6011-0009-9013-9424 (Discover)

Won’t match

  • 1234567812345678 (fails validation)

  • Random numeric sequences with invalid lengths or formatting
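
The article does not name the validation it applies. Card numbers are commonly checked with the Luhn checksum, so the sketch below assumes that check; `luhn_valid` is a hypothetical helper, not the system's implementation.

```python
def luhn_valid(number: str) -> bool:
    """Apply the Luhn checksum to a 15- or 16-digit card number."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) not in (15, 16):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Under this assumption, both matching examples above pass the checksum, while `1234567812345678` has the right length but fails it.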


Account Number

What the system looks for
Standalone numeric sequences between 8 and 17 digits.

Will match

  • 123456789012345

Won’t match

  • Numbers embedded in URLs

  • Numbers with prefixes like ACCT-
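
"Standalone" can be approximated by checking the characters on either side of the digits. The lookaround character sets below are assumptions chosen to exclude URL paths and prefixes such as `ACCT-`, and `find_account_numbers` is a hypothetical helper.

```python
import re

# Hypothetical pattern: 8-17 digits not touching letters, digits,
# hyphens, or common URL characters.
ACCOUNT_RE = re.compile(r"(?<![\w/.\-=?&#])\d{8,17}(?![\w\-])")

def find_account_numbers(text: str) -> list[str]:
    """Return standalone digit runs of 8 to 17 digits."""
    return [m.group(0) for m in ACCOUNT_RE.finditer(text)]
```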


Amount

What the system looks for
Monetary values with a currency symbol or currency code.

Will match

  • $1,500.00

  • EUR 25.50

  • -£100.00

Won’t match

  • 1500 (no currency context)
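
A small sketch of the currency-context requirement. The symbol and code lists here ($, £, €, USD, EUR, GBP) are assumptions for illustration, since the article does not list which currencies are supported; `find_amounts` is a hypothetical helper.

```python
import re

# Hypothetical pattern: optional minus sign, a currency symbol or code,
# then digits with optional thousands separators and decimals.
AMOUNT_RE = re.compile(
    r"-?(?:[$£€]|\b(?:USD|EUR|GBP) ?)\d+(?:,\d{3})*(?:\.\d+)?"
)

def find_amounts(text: str) -> list[str]:
    """Return monetary values that carry a currency symbol or code."""
    return [m.group(0) for m in AMOUNT_RE.finditer(text)]
```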


IBAN

What the system looks for
International Bank Account Numbers with valid country codes and lengths.

Will match

  • GB82 WEST 1234 5698 7654 32

Won’t match

  • Incorrect lengths or invalid formats
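
IBANs carry two check digits that can be verified with the standard mod-97 calculation. The sketch below shows that check along with a generic length range; it does not validate per-country lengths or country codes, which the redaction system is described as doing. `iban_valid` is a hypothetical helper.

```python
import re

def iban_valid(iban: str) -> bool:
    """Check the overall shape and the mod-97 checksum of an IBAN."""
    s = iban.replace(" ", "").upper()
    # Country code, two check digits, then 11-30 alphanumeric characters.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move the first four characters to the end, map A-Z to 10-35,
    # and test the resulting integer modulo 97.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```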


SWIFT / BIC Code

What the system looks for
Valid 8- or 11-character bank identifier codes.

Will match

  • BOFAUS3N

  • DEUTDEFF500

Won’t match

  • Invalid formats or unsupported country codes
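
The 8- or 11-character shape follows the usual BIC layout: four letters for the bank, two for the country, two alphanumeric characters for the location, and an optional three-character branch code. The sketch below checks only that shape and does not verify that the country code is supported; `is_bic` is a hypothetical helper.

```python
import re

# Shape check only: bank (4 letters), country (2 letters),
# location (2 alphanumerics), optional branch (3 alphanumerics).
BIC_RE = re.compile(r"[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}(?:[A-Z0-9]{3})?")

def is_bic(candidate: str) -> bool:
    """True when the whole string has an 8- or 11-character BIC shape."""
    return BIC_RE.fullmatch(candidate) is not None
```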


Other Sensitive Data

Date

What the system looks for
Commonly used date formats.

Will match

  • 12/25/2023

  • 25-Dec-2023

  • January 15, 2024
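
The three example formats can be covered by one alternation, sketched below. The full set of formats the system recognizes is not listed in this article, so treat the pattern and the `find_dates` helper as assumptions limited to these examples.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)
MONTHS_SHORT = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Hypothetical pattern covering 12/25/2023, 25-Dec-2023, January 15, 2024.
DATE_RE = re.compile(
    rf"\b(?:\d{{1,2}}/\d{{1,2}}/\d{{4}}"
    rf"|\d{{1,2}}-(?:{MONTHS_SHORT})-\d{{4}}"
    rf"|(?:{MONTHS}) \d{{1,2}}, \d{{4}})\b"
)

def find_dates(text: str) -> list[str]:
    """Return date-like substrings in the three example formats."""
    return [m.group(0) for m in DATE_RE.finditer(text)]
```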


Time

What the system looks for
Time values with optional AM/PM or timezone indicators.

Will match

  • 14:30:00

  • 2:30 PM

  • 14:30 UTC
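
A sketch of a time pattern with optional seconds, AM/PM, and timezone. The timezone list (UTC, GMT, Z) is an assumption for the example, and `find_times` is a hypothetical helper.

```python
import re

# Hypothetical pattern: HH:MM with optional :SS, AM/PM, and timezone.
TIME_RE = re.compile(
    r"\b(?:[01]?\d|2[0-3]):[0-5]\d(?::[0-5]\d)?"
    r"(?:\s?[AP]M)?(?:\s?(?:UTC|GMT|Z))?"
)

def find_times(text: str) -> list[str]:
    """Return time-like substrings found in text."""
    return [m.group(0) for m in TIME_RE.finditer(text)]
```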


URLs

What the system looks for
Fully qualified URLs that include a protocol.

Will match

  • https://example.com

  • https://api.example.com/path

Won’t match

  • www.example.com (missing protocol)
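
Requiring a protocol amounts to anchoring the match on a `scheme://` prefix, as in this sketch. The pattern accepts any scheme name and is an assumption for illustration; the system may restrict the schemes it recognizes. `find_urls` is a hypothetical helper.

```python
import re

# Hypothetical pattern: a scheme, "://", then everything up to whitespace.
URL_RE = re.compile(r"\b[a-zA-Z][a-zA-Z0-9+.-]*://[^\s<>]+")

def find_urls(text: str) -> list[str]:
    """Return URLs that include a protocol."""
    return [m.group(0) for m in URL_RE.finditer(text)]
```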


IP Addresses

What the system looks for
Valid IPv4 or IPv6 address formats.

Will match

  • 192.168.1.1

  • 2001:0db8:85a3:0000:0000:8a2e:0370:7334
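
For a quick check of whether a single value is a valid IPv4 or IPv6 address, Python's standard `ipaddress` module can be used. This sketch validates one candidate string at a time and does not show how the system locates addresses inside a document; `ip_version` is a hypothetical helper.

```python
import ipaddress
from typing import Optional

def ip_version(candidate: str) -> Optional[int]:
    """Return 4 or 6 for a valid IP address, or None otherwise."""
    try:
        return ipaddress.ip_address(candidate).version
    except ValueError:
        return None
```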


Six- and Eight-Digit Numbers

What the system looks for
Exact numeric sequences of six or eight digits.

Will match

  • 123456

  • 12345678

Won’t match

  • Numbers embedded in URLs

  • Numbers with invalid separators
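
"Exact" here means the digit run cannot be part of a longer number. The sketch below assumes that rule and approximates the URL exclusion by rejecting digits that follow a slash, dot, or hyphen; `find_six_or_eight_digits` is a hypothetical helper.

```python
import re

# Hypothetical pattern: exactly six or eight digits, not adjacent to
# other word characters, hyphens, or URL path characters.
SIX_EIGHT_RE = re.compile(r"(?<![\w/.\-])(?:\d{6}|\d{8})(?![\w\-])")

def find_six_or_eight_digits(text: str) -> list[str]:
    """Return standalone six- or eight-digit numbers."""
    return [m.group(0) for m in SIX_EIGHT_RE.finditer(text)]
```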
