Veriafy - Universal File Classification Platform

Understanding the irreversible hash representation that makes privacy-preserving classification possible

What is a Veriafy Vector?

A Veriafy Vector is a compact, irreversible mathematical representation of a file. It captures the semantic and perceptual features of content without storing the content itself.

Key Properties

Irreversible: Cannot be converted back to the original file
Compact: Roughly 80x to 4,000x smaller than the source file, depending on file type
Semantic: Preserves meaning for classification
Deterministic: Same input always produces the same vector

Vector Components

Each Veriafy Vector consists of two main components:

Perceptual Hash

A content-based fingerprint that captures structural features while remaining resilient to minor modifications. The algorithm depends on the file type:

• PDQ for images (256-bit)
• TMK for video (temporal)
• Chromaprint for audio
• SimHash for text
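
Perceptual hashes are usually compared by counting the bits in which they differ (the Hamming distance); a small distance suggests the same content after minor edits. The following is a minimal sketch of that comparison for hex-encoded hashes, not Veriafy's own matching code. The function names and the threshold of 31 bits are assumptions chosen for illustration.

```python
def hamming_distance_hex(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_near_duplicate(hash_a: str, hash_b: str, threshold: int = 31) -> bool:
    """Treat hashes within `threshold` bits as matching (assumed cutoff)."""
    return hamming_distance_hex(hash_a, hash_b) <= threshold


# Two synthetic 256-bit hashes (64 hex characters) that differ in one bit.
original = "f" * 64
edited = "f" * 63 + "e"
print(hamming_distance_hex(original, edited))
print(is_near_duplicate(original, edited))
```

A real deployment would tune the threshold per algorithm and per file type, since a cutoff that works for one hash family may be too strict or too loose for another.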

Semantic Embedding

A dense vector representation that captures the meaning and context of the content using neural network encoders:

• CLIP for images (512-dim)
• SBERT for text (768-dim)
• AudioCLIP for audio
• Custom embeddings for code
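
Semantic embeddings are typically compared with cosine similarity, which measures the angle between two vectors regardless of their length. The sketch below uses only the standard library and tiny synthetic vectors; the helper names are hypothetical, and a real 512- or 768-dimensional embedding would normally be handled with a numerical library.

```python
import math
from typing import Sequence


def l2_normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two same-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    # For unit vectors the dot product equals the cosine similarity.
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Synthetic 3-dimensional vectors standing in for real embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # same direction
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal
```

If the stored embeddings are already unit-normalized, the normalization step is redundant and a plain dot product gives the same result.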

Vector Structure

A Veriafy Vector is represented as a JSON object (the // comments below are annotations and are not part of the payload):

{
  "version": "1.0",
  "vector_id": "v_8f3a2b1c4d5e6f7a",
  "file_type": "image",
  "extractor": "pdq_clip",
  "created_at": "2025-01-08T12:00:00Z",
  "components": {
    "perceptual_hash": {
      "algorithm": "pdq",
      "value": "f8a3b2c1d4e5f6a7...",  // 256-bit hex
      "quality": 0.95
    },
    "semantic_embedding": {
      "model": "clip-vit-b32",
      "dimensions": 512,
      "value": [0.123, -0.456, ...]  // normalized float32
    }
  },
  "metadata": {
    "file_size_category": "medium",  // not exact size
    "aspect_ratio_bucket": "landscape",  // not exact ratio
    "duration_bucket": null  // for video/audio
  }
}
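
A consumer of this structure may want to sanity-check a vector before using it. The sketch below is a hypothetical validator, not part of any Veriafy SDK. It assumes that a PDQ value is exactly 64 hex characters (256 bits), that the embedding length must equal the "dimensions" field, and that "normalized" means unit L2 norm. The sample uses a synthetic 4-dimensional embedding to stay short.

```python
import math


def validate_vector(doc: dict) -> list[str]:
    """Return a list of problems found in a vector document (empty if none)."""
    problems = []
    components = doc.get("components", {})

    phash = components.get("perceptual_hash", {})
    if phash.get("algorithm") == "pdq":
        value = phash.get("value", "")
        is_hex = all(c in "0123456789abcdefABCDEF" for c in value)
        if len(value) != 64 or not is_hex:
            problems.append("pdq value must be 64 hex characters (256 bits)")

    embedding = components.get("semantic_embedding", {})
    values = embedding.get("value", [])
    if len(values) != embedding.get("dimensions"):
        problems.append("embedding length does not match 'dimensions'")
    elif values:
        # Assumption: "normalized" means the vector has unit L2 norm.
        norm = math.sqrt(sum(v * v for v in values))
        if abs(norm - 1.0) > 1e-3:
            problems.append("embedding is not unit-normalized")
    return problems


# Synthetic document: placeholder hash and a 4-dimensional unit vector.
sample = {
    "version": "1.0",
    "components": {
        "perceptual_hash": {"algorithm": "pdq", "value": "ab" * 32, "quality": 0.95},
        "semantic_embedding": {
            "model": "clip-vit-b32",
            "dimensions": 4,
            "value": [0.5, 0.5, 0.5, 0.5],
        },
    },
}
print(validate_vector(sample))  # prints []
```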

Why Irreversibility Matters

The mathematical properties of Veriafy Vectors guarantee that the original content cannot be reconstructed:

1. Hash Collision Space: PDQ produces 256 bits from millions of pixels. A vast number of distinct images map to the same hash, so there is no unique inverse.
2. Embedding Compression: CLIP compresses an image to 512 floats. The dimensionality reduction is lossy by design.
3. No Raw Features: Unlike some ML systems, Veriafy doesn't store intermediate features that could leak information.
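
The counting argument behind the first point can be made concrete. If every possible image maps to one of 2^256 hash values, the average number of images per hash is the number of possible images divided by 2^256. The sketch below works in log2 to keep the numbers readable. The 64x64 8-bit grayscale image is an illustrative assumption, far smaller than the multi-megapixel inputs the text describes.

```python
def average_preimages_log2(width: int, height: int, bits_per_pixel: int,
                           hash_bits: int = 256) -> int:
    """Return log2 of the average number of distinct images per hash value."""
    if width <= 0 or height <= 0 or bits_per_pixel <= 0:
        raise ValueError("image dimensions and bit depth must be positive")
    # There are 2**image_bits possible images and 2**hash_bits hash values.
    image_bits = width * height * bits_per_pixel
    return image_bits - hash_bits


# A tiny 64x64 grayscale image at 8 bits per pixel.
exponent = average_preimages_log2(64, 64, 8)
print(f"about 2^{exponent} images per hash value on average")  # 2^32512
```

Even for this tiny image size, each hash value corresponds on average to 2^32512 candidate images, so the hash alone cannot single out the original.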

Impossible Operations

From a Veriafy Vector, you cannot: view the image, play the audio, read the document, or extract any recognizable portion of the original content. This is guaranteed by mathematics, not policy.

Compression Ratios

File Type	Typical Size	Vector Size	Compression
Image (JPEG)	2 MB	4 KB	500x
Video (1 min)	50 MB	12 KB	4,000x
PDF Document	500 KB	6 KB	80x
Audio (3 min)	5 MB	8 KB	600x
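
The ratios in the table are the typical file size divided by the vector size, rounded. The short sketch below reproduces that arithmetic, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes); the function name is hypothetical.

```python
KB = 1_000          # assumption: decimal kilobytes
MB = 1_000 * KB


def compression_ratio(file_bytes: int, vector_bytes: int) -> float:
    """Return how many times smaller the vector is than the source file."""
    if vector_bytes <= 0:
        raise ValueError("vector size must be positive")
    return file_bytes / vector_bytes


# Typical sizes taken from the table above.
rows = {
    "Image (JPEG)": (2 * MB, 4 * KB),
    "Video (1 min)": (50 * MB, 12 * KB),
    "PDF Document": (500 * KB, 6 * KB),
    "Audio (3 min)": (5 * MB, 8 * KB),
}
for name, (file_bytes, vector_bytes) in rows.items():
    print(f"{name}: {compression_ratio(file_bytes, vector_bytes):,.0f}x")
# Image (JPEG): 500x
# Video (1 min): 4,167x
# PDF Document: 83x
# Audio (3 min): 625x
```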

Next Steps

• File Extractors
• Classification Models