File Extractors

12 specialized extractors for different content types

How Extractors Work

Extractors are specialized modules that process specific file types and convert them into Veriafy Vectors. Each extractor uses algorithms optimized for its content type to capture the most relevant features for classification.

Extraction Pipeline

1. Detect: identify the file type
2. Parse: extract raw features
3. Hash: compute a perceptual fingerprint
4. Embed: compute a semantic vector
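
The four stages above can be sketched as a small pipeline. This is a minimal illustration, not Veriafy's implementation: the names `detect`, `parse`, `fingerprint`, `embed`, `run_pipeline`, and `Vector`, the magic-byte table, and the stand-in hash and embedding are all assumptions chosen to keep the example self-contained.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical magic-byte table; a real detector would cover many more types.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF-": "pdf",
}


@dataclass
class Vector:
    file_type: str
    fingerprint: str  # stand-in for a perceptual hash
    embedding: list   # stand-in for a semantic vector


def detect(data: bytes) -> str:
    """Stage 1: identify the file type from its leading bytes."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"


def parse(data: bytes) -> list:
    """Stage 2: extract raw features (here, a 16-bucket byte histogram)."""
    buckets = [0] * 16
    for b in data:
        buckets[b >> 4] += 1
    return buckets


def fingerprint(features: list) -> str:
    """Stage 3: placeholder fingerprint.

    A cryptographic digest is NOT perceptual; it only marks where the
    hashing step would sit in the pipeline.
    """
    return hashlib.sha256(str(features).encode("ascii")).hexdigest()[:16]


def embed(features: list) -> list:
    """Stage 4: placeholder embedding (histogram normalized to sum to 1)."""
    total = sum(features) or 1
    return [f / total for f in features]


def run_pipeline(data: bytes) -> Vector:
    features = parse(data)
    return Vector(detect(data), fingerprint(features), embed(features))
```

Calling `run_pipeline(b"%PDF-1.7 ...")` returns a `Vector` whose `file_type` is `"pdf"`. Keeping the stages as separate functions means each one can be swapped for a content-specific algorithm without touching the others.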

Available Extractors

Image Extractor

Processes images using perceptual hashing for structural features and CLIP for semantic understanding.

Formats: JPEG, PNG, WebP, GIF, BMP, TIFF
Algorithms: PDQ (perceptual hash), CLIP (semantic embedding)
Output: ~4 KB
Speed: ~5 ms
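
To illustrate what a perceptual hash does, the sketch below uses a simple average hash rather than PDQ, which is considerably more involved. The functions `average_hash` and `hamming` are hypothetical helpers, and the input is assumed to be an already-decoded grid of grayscale values.

```python
def average_hash(pixels):
    """Return an integer whose bits are 1 where a pixel exceeds the mean brightness.

    `pixels` is a list of rows of grayscale values (0-255). This is an
    average hash, used here as a simplified stand-in for PDQ.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


image = [[0, 50, 200, 250]] * 4
brighter = [[p + 5 for p in row] for row in image]

# A uniform brightness shift leaves the hash unchanged.
distance = hamming(average_hash(image), average_hash(brighter))
```

Unlike a cryptographic hash, small visual changes produce small Hamming distances, which is what makes this family of hashes useful for near-duplicate matching.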

Video Extractor

Extracts temporal fingerprints and keyframe embeddings for video content classification.

Formats: MP4, AVI, MOV, WebM, MKV
Algorithms: TMK (temporal matching), PDQF (frame hashing), CLIP (keyframes)
Output: ~12 KB
Speed: ~50 ms/min

Audio Extractor

Creates acoustic fingerprints and semantic embeddings from audio content.

Formats: MP3, WAV, FLAC, AAC, OGG
Algorithms: Chromaprint (acoustic fingerprint), AudioCLIP (semantic)
Output: ~8 KB
Speed: ~20 ms/min

Document Extractor

Processes text content using locality-sensitive hashing and transformer embeddings.

Formats: PDF, DOC, DOCX, TXT, RTF
Algorithms: SimHash (text fingerprint), SBERT (semantic embedding)
Output: ~6 KB
Speed: ~10 ms/page
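
SimHash, the text fingerprint named above, can be sketched in a few lines. This is a minimal version under stated assumptions: whitespace tokenization, equal token weights, and MD5 as the per-token hash are illustrative choices, not necessarily what Veriafy uses.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over lowercase whitespace-separated tokens."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits; a small distance suggests similar text."""
    return bin(a ^ b).count("1")
```

Because every token contributes votes to the same bit positions, documents that share most of their tokens tend to end up with fingerprints that differ in only a few bits.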

Office Extractor

Extracts features from spreadsheets and presentations including layout and content.

Formats: XLSX, PPTX, CSV
Algorithms: Structure hash, Content embedding
Output: ~8 KB
Speed: ~15 ms

Email Extractor

Processes email headers, body, and recursively handles attachments.

Formats: EML, MSG, MBOX
Algorithms: Header hash, Body embedding, Recursive attachment processing
Output: ~10 KB
Speed: ~20 ms
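
The split into headers, body, and attachments can be illustrated with Python's standard `email` package. The functions `header_hash` and `summarize`, and the choice of which headers to hash, are assumptions made for this sketch. A real extractor would hand each attachment back to the matching extractor; here the attachments are only listed.

```python
import hashlib
from email import message_from_bytes, policy
from email.message import EmailMessage


def header_hash(msg) -> str:
    """Hash a fixed subset of headers (the subset is an illustrative choice)."""
    parts = [f"{name}:{msg.get(name, '')}" for name in ("From", "To", "Subject")]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def summarize(raw: bytes) -> dict:
    """Separate an RFC 5322 message into header hash, plain body, and attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    attachments = [
        (part.get_filename(), part.get_content_type())
        for part in msg.walk()          # walk() also descends into nested parts
        if part.get_filename()
    ]
    return {
        "header_hash": header_hash(msg),
        "body": body.get_content() if body else "",
        "attachments": attachments,
    }


# Build a small message with one attachment to exercise the sketch.
example = EmailMessage()
example["From"] = "a@example.com"
example["To"] = "b@example.com"
example["Subject"] = "Report"
example.set_content("See attached.")
example.add_attachment(b"\x00\x01\x02", maintype="application",
                       subtype="octet-stream", filename="data.bin")

summary = summarize(example.as_bytes())
```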

Code Extractor

Analyzes code structure via AST and semantic meaning via code-specific transformers.

Languages: Python, JavaScript, Java, C++, Go, Rust
Algorithms: AST hash (structure), CodeBERT (semantic)
Output: ~6 KB
Speed: ~15 ms
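
A structural AST hash can be sketched for Python source with the standard `ast` module. The function `ast_structure_hash` is a hypothetical helper: it hashes only the sequence of node types, so renaming variables does not change the result. Flattening the tree this way discards some nesting information, so treat it as an approximation of the idea rather than a description of Veriafy's algorithm.

```python
import ast
import hashlib


def ast_structure_hash(source: str) -> str:
    """Hash the sequence of AST node types, ignoring identifiers and literals."""
    tree = ast.parse(source)
    # ast.walk visits nodes breadth-first in a deterministic order.
    shape = " ".join(type(node).__name__ for node in ast.walk(tree))
    return hashlib.sha256(shape.encode("utf-8")).hexdigest()


original = "def add(x, y):\n    return x + y\n"
renamed = "def plus(left, right):\n    return left + right\n"
changed = "def add(x, y):\n    return x * y\n"

same_structure = ast_structure_hash(original) == ast_structure_hash(renamed)
different_structure = ast_structure_hash(original) != ast_structure_hash(changed)
```

Here `original` and `renamed` hash identically because they differ only in names, while `changed` swaps the `Add` operator node for `Mult` and therefore hashes differently.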

Archive Extractor

Recursively extracts and processes archive contents, creating composite vectors.

Formats: ZIP, TAR, RAR, 7Z, GZ
Algorithms: Recursive extraction, Manifest hash
Output: Variable
Speed: Variable
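
Recursive extraction with a manifest hash might look like the following sketch, which handles ZIP only because that is what Python's standard library supports directly. The functions `manifest` and `manifest_hash` and the `max_depth` guard are assumptions for illustration; the guard is there because nested archives can otherwise be used to exhaust resources.

```python
import hashlib
import io
import zipfile


def manifest(data: bytes, prefix: str = "", max_depth: int = 3) -> list:
    """Return sorted (path, sha256) pairs for every file, descending into nested ZIPs."""
    entries = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            payload = zf.read(info)
            path = prefix + info.filename
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(payload)):
                # Nested archive: recurse and prefix inner paths with the outer name.
                entries.extend(manifest(payload, path + "/", max_depth - 1))
            else:
                entries.append((path, hashlib.sha256(payload).hexdigest()))
    return sorted(entries)


def manifest_hash(data: bytes) -> str:
    """Hash the sorted manifest so entry order inside the archive does not matter."""
    lines = "\n".join(f"{path} {digest}" for path, digest in manifest(data))
    return hashlib.sha256(lines.encode("utf-8")).hexdigest()
```

Sorting the manifest before hashing means two archives with the same files stored in a different order yield the same manifest hash.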

Executable Extractor

Extracts binary features for malware detection without execution.

Formats: EXE, DLL, ELF, Mach-O
Algorithms: PE/ELF structure hash, Import table analysis, Entropy features
Output: ~10 KB
Speed: ~25 ms
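
Of the algorithms listed, entropy features are the simplest to show. The sketch below computes Shannon entropy over raw bytes, both for a whole buffer and per fixed-size window. The function names and the 256-byte window are assumptions; a real extractor would more likely measure entropy per PE or ELF section.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, 8.0 at most."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def windowed_entropy(data: bytes, window: int = 256) -> list:
    """Entropy of each consecutive window, useful for spotting dense regions."""
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]
```

Regions with entropy close to 8 bits per byte are often compressed or encrypted, which is why entropy is a common static signal for packed binaries. The file is only read, never executed.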

Crypto Extractor

Analyzes cryptocurrency-related files for fraud and compliance.

Formats: Wallet files, Transaction data
Algorithms: Address clustering, Transaction graph
Output: ~8 KB
Speed: ~30 ms

3D Model Extractor

Processes 3D geometry and texture information for model classification.

Formats: OBJ, FBX, GLTF, STL
Algorithms: Mesh signature, Texture embedding
Output: ~12 KB
Speed: ~40 ms

Custom Extractor

Extensible framework for adding custom file type support.

Formats: User-defined
Algorithms: Plugin-based
Output: Variable
Speed: Variable

Custom Extractors

You can create custom extractors for proprietary file formats using the Veriafy SDK:

import veriafy
from veriafy.extractors import BaseExtractor, ExtractorResult

class MyFormatExtractor(BaseExtractor):
    supported_extensions = ['.myformat', '.mf']

    def extract(self, file_path: str) -> ExtractorResult:
        # Parse your file format
        data = self.parse_file(file_path)

        # Generate perceptual hash
        phash = self.compute_hash(data)

        # Generate semantic embedding
        embedding = self.compute_embedding(data)

        return ExtractorResult(
            perceptual_hash=phash,
            semantic_embedding=embedding,
            metadata={'custom_field': 'value'}
        )

# Register the extractor
veriafy.register_extractor(MyFormatExtractor())
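
How a registered extractor gets selected for a file is not described here. One plausible mechanism is a lookup keyed on `supported_extensions`, sketched below. The `Registry` class and `DummyExtractor` are hypothetical stand-ins and are not part of the Veriafy SDK.

```python
from pathlib import Path


class Registry:
    """Hypothetical extension-to-extractor lookup table."""

    def __init__(self):
        self._by_extension = {}

    def register(self, extractor):
        # Index the extractor under every extension it declares.
        for ext in extractor.supported_extensions:
            self._by_extension[ext.lower()] = extractor

    def lookup(self, file_path: str):
        """Return the extractor for this file's extension, or None."""
        return self._by_extension.get(Path(file_path).suffix.lower())


class DummyExtractor:
    """Stand-in with the same class attribute as the example above."""
    supported_extensions = ['.myformat', '.mf']


registry = Registry()
extractor = DummyExtractor()
registry.register(extractor)

match = registry.lookup("uploads/report.MF")   # extension match is case-insensitive
missing = registry.lookup("notes.txt")         # no extractor registered for .txt
```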
