Software Engineering Manager, LLM Training job opportunity at LinkedIn.



Date: 2026-05-12T22:51:23.690Z
LinkedIn Software Engineering Manager, LLM Training
Experience: General
Pattern: Full-time

LLM Training

Degree: General
Location: Mountain View, California, United States

Job Description

This role will be based in Mountain View, CA.

At LinkedIn, our approach to flexible work is centered on trust and optimized for culture, connection, clarity, and the evolving needs of our business. The work location of this role is hybrid, meaning it will be performed both from home and from a LinkedIn office on select days, as determined by the business needs of the team.

As a Software Engineering Manager of the Post-Training Infra team, you will architect the high-throughput systems required for Supervised Fine-Tuning (SFT) and RL, Multi-Teacher Distillation, Reinforcement Learning from Human Feedback (RLHF), Agentic Performance Optimization, and Agentic Research at scale. You won’t just be "running scripts"; you’ll be optimizing the engine that makes rapid model alignment possible.

Responsibilities

Distributed Training Enablement: Enable and support sophisticated parallelism strategies, including data, tensor, pipeline, context, and expert parallelism, for models exceeding 100B parameters. Provide optimized configurations, reference examples, and platform-level integration so that customer teams can effectively leverage these techniques.

Post-Training Expertise: Maintain deep expertise across the post-training landscape, including Multi-Teacher Distillation, RL-based alignment and optimization (RLHF, GRPO), Pruning, Quantization, and Speculative Decoding. Build and maintain reusable platform components that enable customer teams to efficiently leverage these techniques in their workflows.

Performance Engineering: Deep-dive into strategic customer workloads and drive workload-specific and platform-level optimizations, including Liger Kernels, FlashAttention, low-precision training, high-performance data I/O, and inter-node latency reduction.

Multi-Modal Strategy: Post-training strategy for video and audio models.

Framework & Ecosystem Mastery: Act as a bridge to the OSS community. You will contribute to and troubleshoot the "Post-Training Stack," including Liger, PyTorch, Hugging Face (Accelerate/Transformers), Megatron, Ray, VERL, SGLang, and vLLM.

Observability & Profiling: Develop advanced telemetry for large-scale training runs. You will use profiling tools to debug hardware-level stalls (NCCL timeouts, memory fragmentation) and provide internal teams with actionable insights into training stability.

Containerized Lifecycle Management: Lead the development of the "Golden Image" environment. Maintain and distribute optimized, containerized base images with compatible, validated builds of PyTorch, CUDA, and the broader training stack to ensure seamless training on our clusters.

Responsible AI & Compliance Partnership: Serve as the bridge between the training platform and Responsible AI teams, collaborating on data compliance, model evaluation, and safety processes. Ensure the platform provides the tooling and integration points needed for RAI teams to effectively apply their frameworks throughout the training lifecycle.

Agentic Strategy: Lead development of agents for autonomous model research and performance optimization.

Lead, coach, and manage a core team of engineers building the infrastructure.

Participate with senior management in developing a long-term technology roadmap for the team and company.

Dive deep into technical discussions to challenge the status quo, steer the team in the right direction, and push the envelope.

Communicate and collaborate effectively with stakeholders across engineering and business leadership.

Help the team realize their potential by setting clear expectations, openly evaluating performance, upholding accountability, and providing challenges to stretch their skills.

Drive a culture of operational excellence. Lead the team in defining performance goals and metrics and in building the infrastructure and tooling necessary to maintain a high quality bar and detect issues in real time.

Create an inclusive work environment that fosters autonomy, transparency, innovation, and learning, while holding a high bar for quality.
