Senior Research Scientist - Reinforcement Learning, MoEs job opportunity at Canva.



Date: 2026-02-25T09:00:06.207Z
Experience: General
Pattern: Full-time

Degree: General
Location: London, United Kingdom

Job Description

At Canva, our mission is to empower the world to design. We’re building AI that feels magical and lands real impact for millions of people - helping anyone create with confidence. We’re looking for a senior research scientist who lives and breathes reinforcement learning, agentic systems and mixture-of-experts (MoE) models to push the frontier of reasoning, tool use, latency and reliability - and ship it to users.

About the team

We are a cutting-edge post-training team developing new multimodal agentic systems. We work across multimodal modelling, post-training and design agents: we explore multimodal agentic architectures, build scalable training and evaluation loops, and partner closely with product and platform teams to turn breakthroughs into delightful product features. We are looking for a person with experience in post-training, reinforcement learning (RL) and mixture-of-experts models to join our team.

About the role

You’ll drive research directions and play a leading role in hands-on work across the agent stack, from reward design and policy optimization to planning, memory, tool orchestration, dataset construction, post-training, and the development of novel post-training approaches. You’ll design tight experiments, iterate quickly, and land trustworthy conclusions. Most importantly, you’ll help convert research into reliable, safe, and high-quality product experiences.

What you’ll do

- Develop agent systems (planning, multimodal tool use, retrieval, novel training approaches, modeling ablations) for real tasks in design, vision, and language.
- Scale post-training and RL across distributed systems (PyTorch) with efficient data loaders, tracing/telemetry, stable training of mixture-of-experts (MoE) architectures, and reproducible pipelines; profile, debug, and optimize.
- Contribute to the research agenda for RL/agentic systems aligned with Canva’s product goals; identify high-leverage bets and retire dead ends quickly.
- Build reward models and learning loops: RLHF/RLAIF, preference modeling, DPO/IPO-style objectives, offline/online RL, curriculum learning, and credit assignment for multi-step reasoning.
- Develop simulation and sandbox tasks that surface failure modes (planning errors, tool-use brittleness, hallucination, unsafe actions) and turn them into measurable targets.
- Help align on rigorous evaluation for agents (task success, reliability, latency, safety, regressions). Stand up offline suites and online A/B tests; favor simple, controlled experiments that generalize.
- Collaborate and ship: work shoulder-to-shoulder with product, design, safety, and platform to land research as reliable features, then iterate.
- Share and elevate: mentor teammates, present findings internally, and contribute back to the community when it helps the field and our users.

You’re likely a match if you have

- Depth in implementing and post-training MoEs/LLMs/VLMs/diffusion models, with a track record of shipped research or publications in MoEs, RL or agents.
- Experience modifying and adapting open-source models.
- Strong experience with experimental design: tight baselines, clean ablations, reproducibility, and clear, data-backed conclusions.
- Fluency in Python and PyTorch; you’re comfortable in large ML codebases and can profile, debug, and optimize training and inference.
- Practical experience building agent loops (planning, tool invocation, retrieval, memory) and evaluating multi-step reasoning quality.
- Hands-on experience with policy optimization, reward modeling, and preference learning (e.g., RLHF/RLAIF, DPO/IPO, actor-critic/PPO, offline RL).
- Experience with large-scale training (distributed training, experiment tracking, evaluation harnesses) and cloud multimodal tooling.
- Experience with RL for MoE architectures.

Nice to have

- Experience with video and audio modelling.
- Experience with multi-agent settings.
- Strength in alignment and safety evaluations, including red-teaming and risk mitigation for tool-using agents.
- Contributions to open-source, benchmarks, or shared evaluation suites for agents.
