Veriafy - Universal File Classification Platform

How Extractors Work

Extractors are specialized modules that process specific file types and convert them into Veriafy Vectors. Each extractor uses algorithms optimized for its content type to capture the most relevant features for classification.

Extraction Pipeline

1. Detect: Identify file type

2. Parse: Extract raw features

3. Hash: Perceptual fingerprint

4. Embed: Semantic vector
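The four stages can be sketched end to end as below. This is only an illustration of the stage order, not Veriafy's implementation: the `MAGIC_BYTES` table, the `Vector` dataclass, and every function name are hypothetical, and the hash and embedding are simple byte-histogram stand-ins rather than real perceptual or semantic algorithms.

```python
from dataclasses import dataclass

# Hypothetical magic-byte table; a real detector would cover far more formats.
MAGIC_BYTES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
}


@dataclass
class Vector:
    file_type: str
    fingerprint: str        # 64 hex characters (256 bits)
    embedding: list[float]  # 8 values


def detect(data: bytes) -> str:
    """Stage 1: identify the file type from its leading bytes."""
    for magic, name in MAGIC_BYTES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def parse(data: bytes) -> list[int]:
    """Stage 2: raw features. Stand-in: a 256-bucket byte histogram."""
    counts = [0] * 256
    for byte in data:
        counts[byte] += 1
    return counts


def fingerprint(features: list[int]) -> str:
    """Stage 3: compact fingerprint. Stand-in: one bit per bucket,
    set when that bucket's count is above the mean count."""
    mean = sum(features) / len(features)
    bits = "".join("1" if count > mean else "0" for count in features)
    return f"{int(bits, 2):064x}"


def embed(features: list[int]) -> list[float]:
    """Stage 4: dense vector. Stand-in: histogram folded into 8 normalized bins."""
    total = sum(features) or 1  # avoid dividing by zero on empty input
    return [sum(features[i:i + 32]) / total for i in range(0, 256, 32)]


def run_pipeline(data: bytes) -> Vector:
    features = parse(data)
    return Vector(detect(data), fingerprint(features), embed(features))
```

Calling `run_pipeline(b"%PDF-1.7 example")` returns a `Vector` whose `file_type` is `"pdf"`. Keeping each stage a separate function makes it straightforward to swap a stand-in for a real algorithm without touching the others.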

Available Extractors

Image Extractor

Processes images using perceptual hashing for structural features and CLIP for semantic understanding.

Formats: JPEG, PNG, WebP, GIF, BMP, TIFF

Algorithms: PDQ (perceptual hash), CLIP (semantic embedding)

Output: ~4 KB

Speed: ~5 ms
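To show what a perceptual hash buys you, here is a minimal sketch using a difference hash (dHash), a much simpler technique than PDQ and not the algorithm the Image Extractor uses. It assumes the image has already been decoded and downscaled to a small grayscale grid (8 rows of 9 values here); the function names are illustrative.

```python
def dhash(pixels: list[list[int]]) -> int:
    """Difference hash of a grayscale grid.

    Each bit records whether a pixel is brighter than its right-hand
    neighbour, so an 8x9 grid yields a 64-bit hash.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(left > right)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# A horizontal gradient, a uniformly brighter copy, and a mirrored copy.
gradient = [[x * 10 for x in range(9)] for _ in range(8)]
brighter = [[value + 5 for value in row] for row in gradient]
mirrored = [row[::-1] for row in gradient]
```

Because the hash depends only on the relative brightness of neighbouring pixels, `gradient` and `brighter` hash identically, while `mirrored` differs in every bit. Comparing hashes by Hamming distance is what lets near-duplicates match even when the bytes differ.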

Video Extractor

Extracts temporal fingerprints and keyframe embeddings for video content classification.

Formats: MP4, AVI, MOV, WebM, MKV

Algorithms: TMK (temporal matching), PDQF (frame hashing), CLIP (keyframes)

Output: ~12 KB

Speed: ~50 ms/min

Audio Extractor

Creates acoustic fingerprints and semantic embeddings from audio content.

Formats: MP3, WAV, FLAC, AAC, OGG

Algorithms: Chromaprint (acoustic fingerprint), AudioCLIP (semantic)

Output: ~8 KB

Speed: ~20 ms/min

Document Extractor

Processes text content using locality-sensitive hashing and transformer embeddings.

Formats: PDF, DOC, DOCX, TXT, RTF

Algorithms: SimHash (text fingerprint), SBERT (semantic embedding)

Output: ~6 KB

Speed: ~10 ms/page
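The SimHash idea can be sketched in a few lines. This is a textbook version under stated assumptions (whitespace tokens, equal token weights, 64-bit hashes derived from SHA-256); the tokenization and weighting Veriafy uses are not documented here, and the function names are illustrative.

```python
import hashlib

HASH_BITS = 64


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its SHA-256 digest)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Each token votes +1 or -1 on every bit position;
    the sign of the total decides the output bit."""
    weights = [0] * HASH_BITS
    for token in text.lower().split():
        h = token_hash(token)
        for bit in range(HASH_BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit, weight in enumerate(weights) if weight > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest similar text."""
    return bin(a ^ b).count("1")
```

Since every token contributes to the same running totals, documents that share most of their tokens tend to agree on most bits, and reordering the words does not change the hash at all in this bag-of-tokens variant.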

Office Extractor

Extracts features from spreadsheets and presentations including layout and content.

Formats: XLSX, PPTX, CSV

Algorithms: Structure hash, Content embedding

Output: ~8 KB

Speed: ~15 ms

Email Extractor

Processes email headers, body, and recursively handles attachments.

Formats: EML, MSG, MBOX

Algorithms: Header hash, Body embedding, Recursive attachment processing

Output: ~10 KB

Speed: ~20 ms
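Recursive attachment handling can be illustrated with Python's standard `email` package. The sketch below is not the Email Extractor itself: `summarize` and `build_sample` are hypothetical helpers, and a SHA-256 digest stands in for whatever per-attachment processing the real extractor applies.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize(msg: EmailMessage) -> dict:
    """Collect headers, the plain-text body, and one entry per attachment."""
    body = msg.get_body(preferencelist=("plain",))
    attachments = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            # An attached email: recurse into the nested message.
            attachments.append(summarize(part.get_content()))
        else:
            payload = part.get_payload(decode=True) or b""
            attachments.append({
                "filename": part.get_filename(),
                "sha256": hashlib.sha256(payload).hexdigest(),
            })
    return {
        "subject": str(msg["Subject"]),
        "from": str(msg["From"]),
        "body": body.get_content().strip() if body is not None else "",
        "attachments": attachments,
    }


def build_sample() -> bytes:
    """Build a message with one binary attachment and one attached email."""
    inner = EmailMessage()
    inner["Subject"] = "Inner"
    inner["From"] = "a@example.com"
    inner.set_content("forwarded text")

    outer = EmailMessage()
    outer["Subject"] = "Report"
    outer["From"] = "b@example.com"
    outer.set_content("see attachments")
    outer.add_attachment(b"\x00\x01\x02", maintype="application",
                         subtype="octet-stream", filename="data.bin")
    outer.add_attachment(inner)
    return outer.as_bytes()


summary = summarize(message_from_bytes(build_sample(), policy=policy.default))
```

Here `summary["attachments"]` holds a digest entry for `data.bin` followed by a nested summary for the attached email, which is the shape a recursive extractor needs before it hands each attachment to the extractor for its own file type.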

Code Extractor

Analyzes code structure via AST and semantic meaning via code-specific transformers.

Languages: Python, JavaScript, Java, C++, Go, Rust

Algorithms: AST hash (structure), CodeBERT (semantic)

Output: ~6 KB

Speed: ~15 ms
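A structural AST hash can be sketched for Python source with the standard `ast` module. This is an assumption about what "AST hash" might mean in practice, not the Code Extractor's actual algorithm: it hashes only the sequence of node types, so identifiers and literal values are ignored.

```python
import ast
import hashlib


def ast_structure_hash(source: str) -> str:
    """SHA-256 over the node-type sequence of the parsed syntax tree."""
    tree = ast.parse(source)
    # ast.walk visits nodes breadth-first in a deterministic order.
    shape = " ".join(type(node).__name__ for node in ast.walk(tree))
    return hashlib.sha256(shape.encode("utf-8")).hexdigest()


renamed_a = "def add(a, b):\n    return a + b\n"
renamed_b = "def plus(x, y):\n    return x + y\n"
different = "def add(a, b):\n    return a * b\n"
```

`renamed_a` and `renamed_b` produce the same hash because they differ only in names, while `different` swaps the `Add` node for `Mult` and therefore hashes differently. That property is what makes a structural hash useful for spotting renamed or lightly obfuscated copies.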

Archive Extractor

Recursively extracts and processes archive contents, creating composite vectors.

Formats: ZIP, TAR, RAR, 7Z, GZ

Algorithms: Recursive extraction, Manifest hash

Output: Variable

Speed: Variable
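Recursive extraction with a manifest hash might look like the sketch below, restricted to ZIP and built on the standard `zipfile` module. The manifest layout (sorted `path<TAB>sha256` lines), the `!/` separator for nested paths, and the `max_depth` guard are all assumptions made for illustration.

```python
import hashlib
import io
import zipfile


def manifest(data: bytes, prefix: str = "", max_depth: int = 3) -> list[tuple[str, str]]:
    """List (path, sha256) for every file, descending into nested ZIPs."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in sorted(archive.infolist(), key=lambda i: i.filename):
            if info.is_dir():
                continue
            content = archive.read(info)  # loads the whole member into memory
            path = prefix + info.filename
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(content)):
                # Nested archive: recurse, with a depth limit against zip bombs.
                entries.extend(manifest(content, path + "!/", max_depth - 1))
            else:
                entries.append((path, hashlib.sha256(content).hexdigest()))
    return entries


def manifest_hash(data: bytes) -> str:
    """Hash of the manifest, independent of member order and timestamps."""
    lines = "\n".join(f"{path}\t{digest}" for path, digest in manifest(data))
    return hashlib.sha256(lines.encode("utf-8")).hexdigest()


def make_zip(files: dict[str, bytes]) -> bytes:
    """Helper for the example: build an in-memory ZIP from name -> content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Two archives with the same files stored in a different order yield the same `manifest_hash`, even though their raw bytes differ. A production extractor would also want a size limit per member, since `archive.read` decompresses each file fully.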

Executable Extractor

Extracts binary features for malware detection without execution.

Formats: EXE, DLL, ELF, Mach-O

Algorithms: PE/ELF structure hash, Import table analysis, Entropy features

Output: ~10 KB

Speed: ~25 ms
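Entropy features are the easiest of these to illustrate, because they need only the raw bytes and never run the file. The sketch below computes Shannon entropy in bits per byte, overall and per fixed-size block; the block size and function names are illustrative choices, not Veriafy parameters.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


def block_entropies(data: bytes, block_size: int = 256) -> list[float]:
    """Entropy of each consecutive block, useful for locating dense regions."""
    return [
        shannon_entropy(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]
```

Compressed or encrypted regions approach 8 bits per byte, while padding and plain text sit much lower, so a per-block profile can flag packed sections inside an otherwise ordinary binary.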

Crypto Extractor

Analyzes cryptocurrency-related files for fraud and compliance.

Formats: Wallet files, Transaction data

Algorithms: Address clustering, Transaction graph

Output: ~8 KB

Speed: ~30 ms
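One common way to cluster addresses is the common-input-ownership heuristic: addresses spent together as inputs of one transaction are assumed to share an owner. The source does not say which heuristic the Crypto Extractor applies, so treat the following union-find sketch, including its input format (each transaction as a list of input addresses), as an assumption.

```python
def cluster_addresses(transactions: list[list[str]]) -> list[list[str]]:
    """Group addresses that appear together as inputs of any transaction."""
    parent: dict[str, str] = {}

    def find(address: str) -> str:
        parent.setdefault(address, address)
        while parent[address] != address:
            # Path halving keeps the trees shallow.
            parent[address] = parent[parent[address]]
            address = parent[address]
        return address

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for inputs in transactions:
        for address in inputs:
            find(address)  # register addresses that never co-occur
        for other in inputs[1:]:
            union(inputs[0], other)

    clusters: dict[str, set[str]] = {}
    for address in list(parent):
        clusters.setdefault(find(address), set()).add(address)
    return sorted(sorted(group) for group in clusters.values())
```

For the input `[["A", "B"], ["B", "C"], ["D"]]` this returns `[["A", "B", "C"], ["D"]]`: `B` links the first two transactions, and `D` stays alone. The heuristic is known to produce false merges for multi-party transactions, so clusters are evidence rather than proof of common ownership.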

3D Model Extractor

Processes 3D geometry and texture information for model classification.

Formats: OBJ, FBX, GLTF, STL

Algorithms: Mesh signature, Texture embedding

Output: ~12 KB

Speed: ~40 ms

Custom Extractor

Extensible framework for adding custom file type support.

Formats: User-defined

Algorithms: Plugin-based

Output: Variable

Speed: Variable

Custom Extractors

You can create custom extractors for proprietary file formats using the Veriafy SDK:

import veriafy
from veriafy.extractors import BaseExtractor, ExtractorResult

class MyFormatExtractor(BaseExtractor):
    # File extensions this extractor should be used for
    supported_extensions = ['.myformat', '.mf']

    def extract(self, file_path: str) -> ExtractorResult:
        # Parse your file format
        data = self.parse_file(file_path)

        # Generate perceptual hash
        phash = self.compute_hash(data)

        # Generate semantic embedding
        embedding = self.compute_embedding(data)

        return ExtractorResult(
            perceptual_hash=phash,
            semantic_embedding=embedding,
            metadata={'custom_field': 'value'}
        )

# Register the extractor
veriafy.register_extractor(MyFormatExtractor())
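To make the role of `supported_extensions` concrete, the following self-contained sketch shows one way a registry could route a file to an extractor by extension. It does not use or describe the Veriafy SDK: `Result`, `Extractor`, and `Registry` are hypothetical stand-ins for `ExtractorResult`, `BaseExtractor`, and whatever `register_extractor` does internally, and the example extractor returns placeholder values instead of reading the file.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Result:
    perceptual_hash: str
    semantic_embedding: list[float]
    metadata: dict = field(default_factory=dict)


class Extractor:
    supported_extensions: list[str] = []

    def extract(self, file_path: str) -> Result:
        raise NotImplementedError


class Registry:
    def __init__(self) -> None:
        self._by_extension: dict[str, Extractor] = {}

    def register(self, extractor: Extractor) -> None:
        # Later registrations override earlier ones for the same extension.
        for extension in extractor.supported_extensions:
            self._by_extension[extension.lower()] = extractor

    def extract(self, file_path: str) -> Result:
        extension = Path(file_path).suffix.lower()
        if extension not in self._by_extension:
            raise ValueError(f"no extractor registered for {extension!r}")
        return self._by_extension[extension].extract(file_path)


class MyFormatExtractor(Extractor):
    supported_extensions = [".myformat", ".mf"]

    def extract(self, file_path: str) -> Result:
        # Placeholder values; a real extractor would parse the file here.
        return Result("0" * 16, [0.0], {"handler": type(self).__name__})


registry = Registry()
registry.register(MyFormatExtractor())
result = registry.extract("report.MF")
```

Matching on the lowercased suffix means `report.MF` and `report.mf` reach the same extractor, and an unregistered extension fails loudly with a `ValueError` instead of silently returning nothing.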

Next Steps

Classification Models

Python SDK
