# Batch Processing

Process millions of files efficiently with Veriafy.
## Overview
Veriafy is optimized for high-throughput batch processing. With GPU acceleration, you can classify over 10,000 files per second while maintaining privacy guarantees.
- **10K+** vectors/second (GPU)
- **<1ms** per classification
- **Linear** scaling with workers
## CLI Batch Processing
```bash
# Process a directory
veriafy classify ./documents --model veriafy/fraud-detection --recursive

# With output file
veriafy classify ./images --model veriafy/nsfw-classifier \
  --output results.csv --format csv

# Parallel processing with multiple workers
veriafy classify ./data --model veriafy/malware-scanner \
  --workers 8 --batch-size 64

# Filter by file type
veriafy classify ./uploads --model veriafy/document-triage \
  --include "*.pdf" --include "*.docx"
```
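The CSV output is easy to post-process with standard tools. A minimal sketch that tallies non-`allow` rows, assuming the file includes `file` and `action` columns (the exact column names are an assumption; check the header your run produces):

```python
import csv
from collections import Counter

actions = Counter()
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Column names assumed; inspect the CSV header produced by your run.
        if row["action"] != "allow":
            actions[row["action"]] += 1

print(actions.most_common())
```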
## Python SDK
```python
from veriafy import Veriafy
from pathlib import Path
import asyncio

client = Veriafy(gpu=True)

# Simple batch processing
files = list(Path("./documents").glob("**/*.pdf"))
results = client.classify_batch(files, model="veriafy/fraud-detection")

# Process with a progress callback
def on_progress(completed, total):
    print(f"Progress: {completed}/{total} ({completed/total:.1%})")

results = client.classify_batch(
    files=files,
    model="veriafy/fraud-detection",
    batch_size=64,
    on_progress=on_progress,
)

# Async batch processing for maximum throughput
async def process_large_dataset():
    files = list(Path("./data").glob("**/*"))
    async for result in client.classify_batch_async(
        files=files,
        model="veriafy/document-triage",
        concurrency=100,
    ):
        if result.action != "allow":
            print(f"Flagged: {result.file} - {result.action}")

asyncio.run(process_large_dataset())
```
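The `on_progress` hook can drive any reporting you like. A sketch wiring it to a `tqdm` progress bar, assuming `completed` is a cumulative count, as the callback example above suggests (`tqdm` is not part of the Veriafy SDK, just a common choice):

```python
from pathlib import Path
from tqdm import tqdm
from veriafy import Veriafy

client = Veriafy(gpu=True)
files = list(Path("./documents").glob("**/*.pdf"))

bar = tqdm(total=len(files), unit="file")

def on_progress(completed, total):
    # The callback reports cumulative counts, so advance the bar by the delta.
    bar.update(completed - bar.n)

results = client.classify_batch(
    files, model="veriafy/fraud-detection", on_progress=on_progress,
)
bar.close()
```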
## Streaming Results

For very large datasets, use streaming to process results as they complete:
```python
from veriafy import Veriafy
from pathlib import Path
import json

client = Veriafy()

# Stream results to file as they complete
with open("results.jsonl", "w") as f:
    for result in client.classify_stream(
        files=Path("./data").glob("**/*"),
        model="veriafy/classifier",
    ):
        f.write(json.dumps(result.to_dict()) + "\n")
        f.flush()  # Ensure immediate write
```
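Because `results.jsonl` is append-only, it doubles as a crude checkpoint. A resume sketch, assuming `to_dict()` includes a `file` key matching the input path (the dictionary's keys are not documented here):

```python
import json
from pathlib import Path

# Collect files that already have a recorded result.
done = set()
results_path = Path("results.jsonl")
if results_path.exists():
    with results_path.open() as f:
        for line in f:
            done.add(json.loads(line)["file"])  # key name is an assumption

# Only classify files that have no result from a previous run.
remaining = [p for p in Path("./data").glob("**/*") if str(p) not in done]
```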
```python
# Process with database insertion (reuses `client` from the example above)
import sqlite3
from pathlib import Path

conn = sqlite3.connect("classifications.db")
cursor = conn.cursor()
# Schema inferred from the inserted tuple below; adjust types to your needs.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS results "
    "(vector_id TEXT, file TEXT, action TEXT, confidence REAL)"
)

files = list(Path("./data").glob("**/*"))
for result in client.classify_stream(files, model="veriafy/classifier"):
    cursor.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (result.vector_id, result.file, result.action, result.confidence),
    )
conn.commit()
```
## Distributed Processing

For enterprise-scale workloads, distribute processing across multiple machines:
```bash
# Start workers on multiple machines
# Machine 1
veriafy worker start --id worker-1 --redis redis://redis:6379

# Machine 2
veriafy worker start --id worker-2 --redis redis://redis:6379

# Submit a job from any machine
veriafy batch submit ./data --model veriafy/classifier \
  --redis redis://redis:6379 \
  --output s3://bucket/results/

# Monitor progress
veriafy batch status --redis redis://redis:6379
```
## Performance Tuning

Recommended starting points by hardware:
| Parameter | CPU | GPU |
|---|---|---|
| `batch_size` | 16-32 | 64-256 |
| `workers` | CPU cores | 1-2 per GPU |
| `concurrency` | 50-100 | 100-500 |
| `memory` | 4GB+ RAM | 8GB+ VRAM |
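The GPU column maps directly onto the SDK parameters shown earlier. A sketch using mid-range values from the table (the defaults of `classify_batch` are not documented here, so set these explicitly):

```python
import asyncio
from pathlib import Path
from veriafy import Veriafy

client = Veriafy(gpu=True)
files = list(Path("./data").glob("**/*"))

# Synchronous call: batch_size from the GPU column (64-256).
results = client.classify_batch(files, model="veriafy/classifier", batch_size=128)

# Async call: concurrency from the GPU column (100-500).
async def main():
    async for result in client.classify_batch_async(
        files=files, model="veriafy/classifier", concurrency=300,
    ):
        if result.action != "allow":
            print(result.file, result.action)

asyncio.run(main())
```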