File Extractors
12 specialized extractors for different content types
How Extractors Work
Extractors are specialized modules that process specific file types and convert them into Veriafy Vectors. Each extractor uses algorithms optimized for its content type to capture the most relevant features for classification.
Extraction Pipeline
Available Extractors
Image Extractor
Processes images using perceptual hashing for structural features and CLIP for semantic understanding.
Video Extractor
Extracts temporal fingerprints and keyframe embeddings for video content classification.
Audio Extractor
Creates acoustic fingerprints and semantic embeddings from audio content.
Document Extractor
Processes text content using locality-sensitive hashing and transformer embeddings.
Office Extractor
Extracts features from spreadsheets and presentations including layout and content.
Email Extractor
Processes email headers, body, and recursively handles attachments.
Code Extractor
Analyzes code structure via AST and semantic meaning via code-specific transformers.
Archive Extractor
Recursively extracts and processes archive contents, creating composite vectors.
Executable Extractor
Extracts binary features for malware detection without execution.
Crypto Extractor
Analyzes cryptocurrency-related files for fraud and compliance.
3D Model Extractor
Processes 3D geometry and texture information for model classification.
Custom Extractor
Extensible framework for adding custom file type support.
Custom Extractors
You can create custom extractors for proprietary file formats using the Veriafy SDK:
from veriafy.extractors import BaseExtractor, ExtractorResult
class MyFormatExtractor(BaseExtractor):
supported_extensions = ['.myformat', '.mf']
def extract(self, file_path: str) -> ExtractorResult:
# Parse your file format
data = self.parse_file(file_path)
# Generate perceptual hash
phash = self.compute_hash(data)
# Generate semantic embedding
embedding = self.compute_embedding(data)
return ExtractorResult(
perceptual_hash=phash,
semantic_embedding=embedding,
metadata={'custom_field': 'value'}
)
# Register the extractor
veriafy.register_extractor(MyFormatExtractor())