Multimodal ML Engineer

Whitecircle

Paris or London · Hybrid · Full Time · Salary not listed

Job details

TLDR: Multimodal ML Engineer to train and ship vision, audio, video, and speech models for an AI safety platform that operates at 100M+ API calls/month.

About us

White Circle is an AI Safety company building the safety, reliability, and optimization layer for AI systems. At the core of our platform are policies – simple natural-language rules that define what an AI model should and shouldn’t do. We automatically test, enforce, and continuously improve these policies at scale.

  • We’ve raised $11M from top funds, founders, and senior leaders at OpenAI, Anthropic, HuggingFace, Mistral, DeepMind, Datadog, Sentry, and others

  • We process 100M+ API calls every month

  • We fine-tune and train our own LLMs so they run faster and cheaper than any open or proprietary model

We’re a small, highly focused team. If you want to work deeply on hard problems, see your work ship to production quickly, and influence how AI safety is actually built – you’re the one we need.

You will

  • Train and fine-tune large-scale multimodal models (vision-language, audio, speech) from scratch and from pretrained checkpoints

  • Extend models across modalities: image understanding, video temporal modeling, long-context processing, and streaming audio

  • Design and run experiments: architecture changes, data mixes, training recipes

  • Build and maintain multimodal data pipelines — from raw images, video, and audio recordings to training-ready datasets, including synthetic data generation

  • Train and optimize MoE architectures for efficient multimodal inference

  • Build alignment pipelines: SFT, DPO, GRPO, reward modeling — across modalities, not just text

  • Optimize models for production: quantization, distillation, batching, streaming and low-latency serving

  • Deploy models end-to-end: from research checkpoint to production serving

  • Define evaluation metrics and benchmarks that actually matter for the product: visual QA, spatial reasoning, video comprehension, speech and audio understanding

You’ll fit right in if you

  • Have 3+ years of experience training large-scale deep learning models in multimodal domains (vision-language, audio, speech, or acoustic)

  • Have strong PyTorch skills and hands-on distributed training experience (DeepSpeed, FSDP, or similar)

  • Have deep experience with multimodal architectures: you understand how vision/audio encoders, projectors, and LLMs fit together (LLaVA, Qwen-VL, InternVL, Audio Flamingo, Qwen-Omni, Qwen-Audio, Whisper, HuBERT, Conformer, or similar)

  • Are hands-on with RLHF/alignment for multimodal models (GRPO, DPO, reward modeling), not just for text

  • Have experience with video and/or audio sequence modeling: temporal modeling, long-context processing, efficient attention, streaming inference

  • Have a track record of shipping models to production: you've hit latency targets and optimized inference, not just reported benchmark scores

  • Are comfortable with large-scale multimodal dataset curation: image-text pairs, video-instruction data, audio preprocessing, augmentation, synthetic data generation

  • Are familiar with MoE architectures and their tradeoffs for multimodal workloads

  • Have strong engineering fundamentals: clean code, version control, testing, documentation

A big plus:

  • Understanding of audio signal processing fundamentals (spectrograms, mel features, noise reduction)

Why White Circle

  • Paid time off in line with your local regulations, no matter where you work from

  • Work from Paris (hybrid) with a relocation package available, or work from London (note: we are unable to provide relocation support for London-based roles)

  • Comprehensive medical insurance for our France-based team (we are still setting up our UK office, so we cannot yet offer medical insurance for London-based roles)

  • All the hardware, tools, and services you need

  • Covered subscriptions for AI agents and IDEs

  • Team off-sites twice a year: we’ve recently been to the Alps and to Saint-Tropez

 

How we hire

  1. Introductory call with HR (25 min)

  2. Take-home test task

  3. Technical interview with Head of Applied Research (60 min)

  4. Final conversation with our CEO (45 min)
