Batch Processing

Process millions of files efficiently with Veriafy

Overview

Veriafy is optimized for high-throughput batch processing. With GPU acceleration, you can classify over 10,000 files per second while maintaining privacy guarantees; at that rate, a one-million-file corpus finishes in under two minutes on a single GPU.

10K+ vectors/second (GPU)
<1ms per classification
Linear scaling with workers

CLI Batch Processing

# Process a directory
veriafy classify ./documents --model veriafy/fraud-detection --recursive

# With output file
veriafy classify ./images --model veriafy/nsfw-classifier \
  --output results.csv --format csv

# Parallel processing with multiple workers
veriafy classify ./data --model veriafy/malware-scanner \
  --workers 8 --batch-size 64

# Filter by file type
veriafy classify ./uploads --model veriafy/document-triage \
  --include "*.pdf" --include "*.docx"

Python SDK

from veriafy import Veriafy
from pathlib import Path
import asyncio

client = Veriafy(gpu=True)

# Simple batch processing
files = list(Path("./documents").glob("**/*.pdf"))
results = client.classify_batch(files, model="veriafy/fraud-detection")

# Process with progress callback
def on_progress(completed, total):
    print(f"Progress: {completed}/{total} ({completed/total:.1%})")

results = client.classify_batch(
    files=files,
    model="veriafy/fraud-detection",
    batch_size=64,
    on_progress=on_progress,
)

# Async batch processing for maximum throughput
async def process_large_dataset():
    files = [p for p in Path("./data").glob("**/*") if p.is_file()]  # "**/*" also matches directories, so filter them out

    async for result in client.classify_batch_async(
        files=files,
        model="veriafy/document-triage",
        concurrency=100,
    ):
        if result.action != "allow":
            print(f"Flagged: {result.file} - {result.action}")

asyncio.run(process_large_dataset())
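
Once a synchronous batch returns, the results are ordinary Python objects, so summarizing them is one line. A small sketch, assuming each result exposes the same action attribute used in the async example above:

from collections import Counter

# Tally how many files received each action across the batch.
summary = Counter(result.action for result in results)
print(summary)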

Streaming Results

For very large datasets, use streaming to process results as they complete:

from veriafy import Veriafy
from pathlib import Path
import json

client = Veriafy()

# Stream results to file as they complete
with open("results.jsonl", "w") as f:
    for result in client.classify_stream(
        files=(p for p in Path("./data").glob("**/*") if p.is_file()),
        model="veriafy/classifier",
    ):
        f.write(json.dumps(result.to_dict()) + "\n")
        f.flush()  # Ensure immediate write

# Process with database insertion
import sqlite3

conn = sqlite3.connect("classifications.db")
cursor = conn.cursor()

# Column types are illustrative; match them to your own schema.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS results "
    "(vector_id TEXT, file TEXT, action TEXT, confidence REAL)"
)

files = [p for p in Path("./data").glob("**/*") if p.is_file()]

for result in client.classify_stream(files, model="veriafy/classifier"):
    cursor.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (result.vector_id, result.file, result.action, result.confidence)
    )
    conn.commit()  # one commit per row; see the batched variant below
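
Committing after every row keeps each result durable but makes SQLite the bottleneck. A batched variant, using the same assumed result fields as above, trades at most 1,000 in-flight rows for much higher insert throughput:

batch = []
for result in client.classify_stream(files, model="veriafy/classifier"):
    batch.append((result.vector_id, result.file, result.action, result.confidence))
    if len(batch) >= 1000:
        cursor.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", batch)
        conn.commit()
        batch.clear()

if batch:  # flush whatever remains after the stream ends
    cursor.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", batch)
    conn.commit()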

Distributed Processing

For enterprise-scale workloads, distribute processing across multiple machines:

# Start workers on multiple machines
# Machine 1
veriafy worker start --id worker-1 --redis redis://redis:6379

# Machine 2
veriafy worker start --id worker-2 --redis redis://redis:6379

# Submit job from any machine
veriafy batch submit ./data --model veriafy/classifier \
  --redis redis://redis:6379 \
  --output s3://bucket/results/

# Monitor progress
veriafy batch status --redis redis://redis:6379
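
To watch a job from a script instead of a terminal, one option is to shell out to the status command above. A sketch; the output format of veriafy batch status is not documented here, so the completion check is a placeholder to adapt:

import subprocess
import time

# Poll the documented status command until the job looks finished.
while True:
    status = subprocess.run(
        ["veriafy", "batch", "status", "--redis", "redis://redis:6379"],
        capture_output=True, text=True,
    ).stdout
    print(status)
    if "completed" in status.lower():  # assumption: adapt to the real output
        break
    time.sleep(30)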

Performance Tuning

Parameter      CPU          GPU
batch_size     16-32        64-256
workers        CPU cores    1-2 per GPU
concurrency    50-100       100-500
memory         4GB+         8GB+ VRAM
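
As a single-GPU starting point, the table translates into settings like these. A sketch reusing the classify_batch parameters shown earlier; the specific values are mid-range picks from the GPU column, not benchmarks for your hardware:

client = Veriafy(gpu=True)

results = client.classify_batch(
    files=files,
    model="veriafy/fraud-detection",
    batch_size=128,  # GPU column suggests 64-256
)

# For the async path, raise concurrency into the GPU range, e.g.
# client.classify_batch_async(..., concurrency=200)  # 100-500 on GPU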

