Understanding the AI Engineering Org
Frontier AI companies like OpenAI, Anthropic, Google DeepMind, and Meta AI organize their technical teams into distinct clusters. Here's how to navigate them.
AI research companies hire differently from typical tech companies, and many roles blur the line between research and engineering. The sections below cover each role cluster. For each cluster you'll find plain-English job descriptions, required skills, keyword search operators, and screening questions.
Research Scientists & Engineers
The core scientists who run experiments and publish findings that push the frontier of what AI can do.
Roles covered: 8
RL, Post-Training & Pretraining
Engineers who shape how models learn — from initial training on massive datasets to fine-tuning behavior with reinforcement learning.
Roles covered: 7
Safety, Alignment & Interpretability
Specialists who ensure AI behaves safely, honestly, and in alignment with human values — one of the fastest-growing areas in AI.
Roles covered: 8
Infrastructure, GPU & Performance
Engineers who build the high-performance computing systems that make training and serving multi-billion parameter models possible.
Roles covered: 8
Agents, Products & Applied AI
Engineers building AI products that users interact with — autonomous agents, tools, and developer-facing systems.
Roles covered: 7
Engineering Management
Technical leaders who run research and engineering teams — they need both deep technical credibility and strong people skills.
Roles covered: 5
The "Research Engineer" Explained
This is the most confusing title for non-technical recruiters. Here's the breakdown.
In frontier AI companies, a Research Engineer is not the same as a Software Engineer. Research Engineers are hybrids: they write high-quality production code, and they also design and run scientific experiments. Think of them as "scientist + engineer." They sit between a pure researcher (who theorizes) and a pure engineer (who builds). Many senior technical hires at frontier companies fall into this category.
Research Scientist: Primarily a theorist and experimentalist. Generates hypotheses, designs experiments, and analyzes results. Often has a PhD and publishes papers. Code is a tool, not the output.
Research Engineer: The hybrid. Implements and runs experiments at scale, writes production-grade ML code, and partners with scientists to turn ideas into results. Combines deep coding skill with scientific thinking.
Software Engineer (ML/Infrastructure): Primarily an engineer. Builds the infrastructure and tools that make research possible, such as training pipelines, distributed systems, and tooling. Brings a strong SWE background plus ML knowledge.
Research Scientists & Core Researchers
These are the scientists who define what AI is capable of. They run experiments, publish papers, and push the frontier.
Research Scientists at AI companies are like academic researchers — but instead of publishing in university labs, they work inside companies and actually build the systems they study. A PhD is common but not always required. Strong candidates will have a publication record, a GitHub with ML code, or a Kaggle/Hugging Face profile.
What They Do
Designs and runs experiments to discover new AI capabilities or improve existing models. Generates novel ideas, tests hypotheses, analyzes results, and often co-authors papers. Works closely with Research Engineers who implement their ideas at scale.
What "Good" Looks Like on a Resume
Published papers at top conferences (NeurIPS, ICML, ICLR, ACL, EMNLP). First-author publications are a strong signal. A GitHub with reproducible ML experiments. Citation counts on Google Scholar and preprints on arXiv. A PhD in CS, Statistics, Math, or another quantitative field from a top program.
📋 Screening Questions
- Walk me through a research project you led from hypothesis to results. What were the most important decisions you made?
- What is your most cited or impactful paper, and what problem did it solve?
- How do you decide when an experimental result is meaningful vs. noise?
- How do you balance exploring new ideas versus improving on known approaches?
- Describe a time a research direction failed. What did you learn?
What They Do
Studies what is happening inside AI models — which parts of the neural network activate for which concepts, and why. This is sometimes called "mechanistic interpretability." The goal is to make AI more understandable and verifiably safe.
📋 Screening Questions
- What do you find most interesting or surprising about what we currently know about how transformer models store information internally?
- Describe a technique you've used to understand what a specific layer or attention head is doing in a model.
- How would you explain the concept of "polysemanticity" to a non-technical executive?
- What do you think are the biggest open problems in mechanistic interpretability right now?
What They Do
Studies how AI models affect society — labor markets, misinformation, access to information, fairness, and systemic risks. Publishes empirical research and informs internal policy decisions about model deployment.
📋 Screening Questions
- Describe a study you designed to measure how an AI system affected a real-world behavior or outcome.
- How do you approach studying phenomena that are difficult to measure (e.g., AI's effect on employment decisions)?
- How have you communicated complex research findings to non-academic audiences or policymakers?
What They Do
Extends language models to understand and generate images, video, and audio. Works on training multimodal models — those that can process and produce multiple types of content simultaneously. This is a rapidly growing frontier area.
📋 Screening Questions
- What approaches have you used to align visual and text representations in a shared embedding space?
- What are the main challenges of training a model that processes both images and text together?
- Describe a multimodal project you've worked on — what was the hardest technical challenge?
Reinforcement Learning, Post-Training & Pretraining
These engineers and scientists shape how AI models learn — from initial training on raw data to fine-tuning behavior with human feedback.
"Pretraining" is the initial phase, in which a model learns from massive amounts of text. "Post-training" is the follow-up phase that shapes how the model behaves, making it helpful, honest, and safe. "RL" (Reinforcement Learning) is a central technique in post-training: the model learns from feedback scores (rewards) rather than from direct examples.
What They Do
Implements and runs reinforcement learning experiments to improve model behavior. This work includes RLHF (RL from Human Feedback), algorithms such as PPO (Proximal Policy Optimization), and newer RL variants used to make models more helpful, honest, and safe. They run experiments, debug training instabilities, and scale RL pipelines.
📋 Screening Questions
- Explain RLHF in plain terms — what problem does it solve and what are its main limitations?
- What does reward hacking mean in RL, and how have you dealt with it in practice?
- Walk me through how you debugged a training run that was diverging or producing unexpected behavior.
- How do you decide what reward signal to use for a new RL task?
- What's the difference between on-policy and off-policy RL, and when does each matter for LLM training?
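If you want to recognize a solid answer about PPO, the core idea fits in a few lines. PPO limits how far a single update can move the model by clipping the ratio between the new and old policy's probability for an action. The sketch below is a minimal per-sample version of that clipped objective. It assumes a scalar probability ratio and a precomputed advantage estimate. Production RLHF code adds pieces this sketch omits, such as a KL penalty against a reference model and a value-function loss.

```python
def ppo_clipped_objective(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate objective (to be maximized).

    ratio:     pi_new(action) / pi_old(action) for one sampled action
    advantage: estimate of how much better that action was than expected
    epsilon:   clip range; 0.2 is a commonly used value, assumed here
    """
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Take the more pessimistic of the unclipped and clipped terms, so the
    # update gains nothing from pushing the ratio far outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # A good action (positive advantage) whose probability already rose 50%:
    # the objective is capped at 1.2x the advantage.
    print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
    # A bad action (negative advantage) whose probability halved:
    # the pessimistic clipped term (0.8 * -1.0) is kept.
    print(ppo_clipped_objective(ratio=0.5, advantage=-1.0))
```

The clipping is why PPO is described as a "trust region" style method. Each update stays close to the policy that generated the data, which is one of the main tools against the training instabilities mentioned above.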
What They Do
Builds the infrastructure and tooling that makes RL training possible at scale. While a Research Engineer runs experiments, this person builds the "engine room" — efficient RL training loops, rollout management, infrastructure that handles thousands of GPUs, and logging systems.
📋 Screening Questions
- How have you optimized a training pipeline to reduce wall-clock time or GPU cost?
- Describe how you'd design a distributed rollout system for a large RL training run.
- What bottlenecks have you identified and resolved in ML training infrastructure?
- How do you approach debugging a training pipeline that's silently producing incorrect results?
What They Do
Works on training the base language model — the initial massive training run that happens before any fine-tuning. This involves choosing data, designing training curricula, scaling experiments, and ensuring training runs don't fail or diverge when running on thousands of GPUs for weeks.
📋 Screening Questions
- What are scaling laws, and how would you use them to plan a new pretraining run?
- What strategies do you use to curate training data, and what quality signals do you rely on?
- Walk me through a training run you've managed. What went wrong, and how did you recover?
- How do you decide on the architecture choices (depth, width, attention heads) for a new model?
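Two widely cited rules of thumb make the scaling-law question concrete:

- Training compute is often approximated as C ≈ 6·N·D FLOPs, where N is the parameter count and D is the number of training tokens.
- DeepMind's Chinchilla work suggests roughly 20 training tokens per parameter for a compute-optimal model.

The sketch below combines the two to split a compute budget into model size and data size. Both constants are approximations. Labs typically fit their own scaling laws rather than relying on them directly.

```python
import math


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token
    (roughly 2 for the forward pass and 4 for the backward pass)."""
    return 6.0 * n_params * n_tokens


def compute_optimal_split(flop_budget: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens), assuming the rule of
    thumb D = tokens_per_param * N.

    Substituting into C = 6 * N * D gives C = 6 * k * N**2, so N = sqrt(C / (6k)).
    """
    n_params = math.sqrt(flop_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    budget = training_flops(70e9, 1.4e12)  # a 70B-parameter model on 1.4T tokens
    n, d = compute_optimal_split(budget)
    print(f"budget = {budget:.2e} FLOPs -> {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

A strong candidate will usually go beyond this arithmetic. Expect them to discuss fitting the curves on small runs, data quality, and the inference cost of the resulting model size.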
What They Do
Designs and trains the "reward models" used in RLHF. A reward model is a separate AI that scores another AI's outputs — essentially teaching the main AI what "good" looks like. This is a highly specialized, high-impact role that directly determines how helpful and safe the final model is.
📋 Screening Questions
- What are the failure modes of a reward model trained on human preference data?
- How would you design a human annotation process to collect high-quality preference data?
- Explain how you would evaluate whether a reward model is generalizing well vs. overfitting to annotator quirks.
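Reward models are commonly trained on pairs of responses where a human marked one as better. A standard Bradley-Terry style formulation pushes the score of the preferred response above the score of the rejected one. Held-out pairwise accuracy is a simple first check for the generalization question above. The functions below are a minimal sketch that assumes the reward model's scores have already been computed. The scores in the example are made up.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    Small when the human-preferred response scores well above the rejected one.
    """
    return softplus(-(score_chosen - score_rejected))


def pairwise_accuracy(pairs) -> float:
    """Fraction of (chosen, rejected) score pairs ranked the way annotators ranked them."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward-model scores on a held-out set of preference pairs.
    held_out = [(2.1, 0.3), (0.4, 1.2), (1.5, 1.4), (3.0, -1.0)]
    avg_loss = sum(preference_loss(c, r) for c, r in held_out) / len(held_out)
    print(f"held-out accuracy: {pairwise_accuracy(held_out):.2f}, mean loss: {avg_loss:.3f}")
```

A large gap between training accuracy and held-out accuracy is one signal that the reward model is fitting annotator quirks rather than general preferences.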
Safety, Alignment & Interpretability
One of the most important and fastest-growing areas in AI — ensuring models are safe, honest, and aligned with human values.
"AI Safety" at frontier labs means ensuring models don't do harmful things — whether intentionally or accidentally. "Alignment" means making sure the model pursues goals that humans actually want. "Interpretability" means understanding why the model does what it does. These are deeply philosophical AND deeply technical roles. Candidates often come from academia, philosophy of mind, or security research backgrounds.
What They Do
Works on the fundamental problem of making AI systems reliably pursue intended goals — and nothing else. Designs experiments to test whether models "understand" instructions, have hidden goals, or behave differently when unobserved. Papers in this area often combine philosophy, game theory, and ML.
📋 Screening Questions
- What does it mean for an AI system to be "aligned"? What would misalignment look like in practice?
- How would you design an experiment to test whether a model behaves consistently when it believes it is or isn't being evaluated?
- What do you think are the most important unsolved problems in AI alignment today?
- Describe a piece of alignment research you've found compelling — what was the key insight?
What They Do
Proactively attempts to find ways that AI models could be used for harmful cyber operations — like helping write malware, compromising systems, or enabling novel attacks. Their job is to find these risks before bad actors do, so the company can add safeguards.
📋 Screening Questions
- Describe a creative attack vector you've identified in a system — how did you find it and what made it non-obvious?
- How would you structure an evaluation to measure whether an AI model provides meaningful uplift to a malicious cyber actor?
- How do you think about the distinction between information that is "dangerous to provide" vs. "freely available online"?
What They Do
Studies and mitigates privacy risks in AI models — such as models memorizing and reproducing private training data, or being tricked into revealing personal information. Combines ML research with privacy law and data governance knowledge.
📋 Screening Questions
- What is a membership inference attack and how would you test for it in an LLM?
- How does differential privacy work conceptually, and what are its trade-offs when applied to LLM training?
- How would you evaluate whether a model is memorizing specific training examples it shouldn't reproduce?
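The simplest membership inference baseline relies on one observation: models tend to have lower loss on examples they were trained on. The sketch below guesses "member" whenever an example's loss falls under a threshold. It then scores those guesses against examples whose membership is already known. The loss values and the threshold are hypothetical. Stronger attacks in the research literature calibrate more carefully, but this baseline is a common starting point.

```python
def predict_members(losses, threshold):
    """Guess 'was in the training set' when the model's loss on an example is below threshold."""
    return [loss < threshold for loss in losses]


def attack_accuracy(member_losses, nonmember_losses, threshold):
    """Accuracy of the loss-threshold attack on examples with known membership.

    0.5 is chance level when the two groups are the same size; values well
    above it suggest the model may be memorizing its training examples.
    """
    hits = sum(predict_members(member_losses, threshold))
    correct_rejections = sum(not guess for guess in predict_members(nonmember_losses, threshold))
    return (hits + correct_rejections) / (len(member_losses) + len(nonmember_losses))


if __name__ == "__main__":
    # Hypothetical per-example losses from the model under test.
    train_losses = [0.12, 0.20, 0.95, 0.30]    # examples known to be in the training data
    heldout_losses = [1.10, 1.45, 0.25, 0.90]  # examples known to be excluded
    print(attack_accuracy(train_losses, heldout_losses, threshold=0.5))
```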
What They Do
Builds systems to monitor AI model behavior in production — detecting when models behave unexpectedly, drift from intended behavior, or exhibit patterns that could indicate emerging risks. Think of them as building the "vital signs monitoring" for deployed AI.
📋 Screening Questions
- How would you detect if a deployed model started behaving differently than during evaluation? What signals would you monitor?
- What's the difference between monitoring model performance and monitoring model safety? Do they require different approaches?
- Describe a monitoring or alerting system you've built for a production ML system.
Infrastructure, GPU & Performance Engineering
The engineers who build and optimize the massive computing systems that make frontier AI possible.
Training a frontier AI model requires thousands of specialized chips (GPUs and TPUs) running in perfect coordination for weeks or months. A single training run can cost tens of millions of dollars. The engineers in this section build the systems that make this scale possible — and squeeze every bit of performance out of the hardware.
What They Do
Optimizes how AI training and inference code runs on GPU hardware. Writes low-level CUDA kernels, profiles where compute is being wasted, and implements optimizations that can save millions of dollars per training run. This is one of the most specialized and highest-paid roles in AI infrastructure.
📋 Screening Questions
- What does "memory bandwidth bound" vs "compute bound" mean, and how do you identify which one a kernel is hitting?
- Walk me through how you would optimize a slow attention kernel. What tools would you use and where would you start?
- Explain FlashAttention — what problem does it solve and how?
- What profiling tools do you use for GPU performance work, and what do you look for first?
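The "memory-bandwidth bound vs. compute bound" question comes down to a ratio called arithmetic intensity: the floating-point operations performed per byte moved to or from memory. Compare that ratio with the hardware's peak FLOP/s divided by its peak memory bandwidth, which is the "ridge point" of the roofline model. If the kernel's intensity is below the ridge point, the kernel spends its time waiting on memory. The sketch below applies that test. The hardware numbers are hypothetical round figures, not the specification of any particular GPU, and the model ignores caching effects.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to memory."""
    return flops / bytes_moved


def roofline(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Classify a kernel with the simple roofline model and return its throughput ceiling."""
    intensity = arithmetic_intensity(flops, bytes_moved)
    ridge_point = peak_flops_per_s / peak_bytes_per_s
    # Attainable throughput is capped by either compute or memory traffic.
    attainable = min(peak_flops_per_s, intensity * peak_bytes_per_s)
    kind = "memory-bandwidth bound" if intensity < ridge_point else "compute bound"
    return kind, attainable


if __name__ == "__main__":
    PEAK_FLOPS = 100e12  # hypothetical: 100 TFLOP/s
    PEAK_BW = 2e12       # hypothetical: 2 TB/s, so the ridge point is 50 FLOPs/byte

    # Element-wise add of two float32 vectors: 1 FLOP per element,
    # 12 bytes moved per element (read 4 + read 4 + write 4).
    n = 10_000_000
    print("vector add:", roofline(n, 12 * n, PEAK_FLOPS, PEAK_BW))

    # Square fp16 matrix multiply: ~2*m^3 FLOPs, and 3 matrices of m^2 2-byte values
    # (assuming each matrix is moved to or from memory exactly once).
    m = 4096
    print("matmul:", roofline(2 * m**3, 3 * m * m * 2, PEAK_FLOPS, PEAK_BW))
```

This is the reasoning behind optimizations like FlashAttention. Reducing memory traffic raises arithmetic intensity, which moves a kernel out of the memory-bound regime.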
What They Do
Writes optimized code specifically for Google's TPU (Tensor Processing Unit) chips. TPUs are Google's custom-built AI hardware, used for both training and inference. This role requires deep knowledge of TPU architecture, the XLA compiler, and JAX, a very niche and highly sought-after skill set.
📋 Screening Questions
- What are the key architectural differences between TPU and GPU that affect how you write optimized code?
- How does XLA compilation work, and what are common pitfalls when writing JAX code intended for TPUs?
- Describe a performance optimization you've done specifically for TPU hardware.
What They Do
Builds and maintains the compute infrastructure that runs pretraining jobs — cluster management, job scheduling, fault tolerance (so a training run doesn't fail when one GPU out of thousands crashes), data loading pipelines, and storage systems for petabytes of training data.
📋 Screening Questions
- How would you design a fault-tolerant system for a training job running on 4,000 GPUs where individual GPUs fail periodically?
- What strategies do you use to make data loading fast enough to not bottleneck GPU utilization?
- Describe the most complex infrastructure system you've designed or maintained. What made it hard?
What They Do
Optimizes how data moves between GPUs during training. When thousands of GPUs train a single model together, they need to share gradients and parameters constantly — this communication is often the bottleneck. This engineer designs the networking stack (InfiniBand, RDMA, collective operations) to make it faster.
📋 Screening Questions
- What is AllReduce and why is it critical for distributed ML training? What are the common performance bottlenecks?
- How would you diagnose a distributed training run that's slower than expected due to communication overhead?
- What's the difference between ring-AllReduce and tree-AllReduce, and when would you choose one over the other?
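Ring AllReduce is easier to picture as a simulation. Each of N workers splits its gradient vector into N chunks, and the algorithm runs in two phases:

- In the reduce-scatter phase, the chunks travel around the ring and are summed.
- In the all-gather phase, the finished sums travel around again until every worker has the full result.

Each worker sends about 2(N-1)/N times the vector's size regardless of N, which is why the ring scales well in bandwidth. However, its 2(N-1) sequential steps add latency on large clusters, which is part of the motivation for tree-based variants. The plain-Python sketch below simulates only the data movement. A real collective library such as NCCL performs these transfers over the actual GPU interconnect.

```python
def ring_allreduce(worker_data):
    """Simulate a ring AllReduce (sum) over equal-length vectors, one per worker.

    Returns the list of per-worker vectors after the operation; every worker
    ends up holding the element-wise sum of all inputs.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Split each vector into n contiguous chunks (sizes may differ by one if uneven).
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    data = [list(v) for v in worker_data]  # copy so the inputs are untouched

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        # Gather all sends first: every worker sends simultaneously within a step.
        sends = []
        for i in range(n):
            c = (i - step) % n
            lo, hi = bounds[c]
            sends.append(((i + 1) % n, c, data[i][lo:hi]))
        for dst, c, payload in sends:
            lo, hi = bounds[c]
            for j, x in enumerate(payload):
                data[dst][lo + j] += x  # receiver adds the incoming partial sum

    # Phase 2: all-gather. Pass each completed chunk around the ring.
    for step in range(n - 1):
        sends = []
        for i in range(n):
            c = (i + 1 - step) % n
            lo, hi = bounds[c]
            sends.append(((i + 1) % n, c, data[i][lo:hi]))
        for dst, c, payload in sends:
            lo, hi = bounds[c]
            data[dst][lo:hi] = payload  # receiver overwrites with the finished chunk

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for worker, result in enumerate(ring_allreduce(grads)):
        print(f"worker {worker}: {result}")
```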
Agents, Products & Applied AI Engineering
Engineers building the AI products that users and developers actually interact with — autonomous agents, tools, and developer platforms.
An "AI Agent" is an AI that doesn't just answer questions — it takes actions: browsing the web, writing and running code, managing files, or completing multi-step tasks autonomously. These engineers build the systems that make agents work reliably and safely in the real world.
What They Do
Builds systems that allow AI models to autonomously use tools, take actions, and complete multi-step tasks. Designs architectures for how agents plan, use memory, call external APIs, and recover from errors. Evaluates agent performance on real-world tasks.
📋 Screening Questions
- What are the main failure modes you've seen in autonomous AI agents? How have you addressed them?
- How would you design an evaluation suite for a coding agent? What tasks would you include and how would you score them?
- Explain the trade-offs between giving an agent more autonomy vs. more human checkpoints.
- How do you handle cases where an agent encounters an ambiguous situation mid-task?
What They Do
Builds secure "sandboxes" — isolated computing environments where AI agents can run code safely without causing harm to production systems. When an AI agent writes and executes code, it needs to do so in a container that limits what it can access or break. This is critical security infrastructure for agentic AI.
📋 Screening Questions
- How would you design a sandboxed environment for running untrusted AI-generated code? What isolation mechanisms would you use?
- What are the differences between container-based isolation and VM-based isolation? When would you choose each?
- How have you approached the trade-off between security and performance in sandboxing systems?
What They Do
Crafts, tests, and refines the instructions given to AI models to make them perform specific tasks reliably. At frontier AI companies, this is a technical and experimental role — they run systematic evaluations, develop frameworks for how to best communicate complex requirements to models, and work closely with product teams.
📋 Screening Questions
- Walk me through how you would systematically improve a prompt that's producing inconsistent results. What's your process?
- How do you evaluate whether one prompt is better than another at a large scale?
- Give an example of a complex task you've successfully broken down into a prompt or chain of prompts. What made it hard?
- How do you think about the difference between "prompting" and "fine-tuning"? When would you use each?
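Suppose two prompts are run on the same set of test cases. McNemar's test is a simple way to check whether one prompt is really better rather than just lucky, because it looks only at the cases where the two prompts disagree. The sketch below assumes each case is graded pass/fail, by a person or an automated grader. It uses the uncorrected statistic and compares it against about 3.84, the 5% critical value of a chi-square distribution with one degree of freedom. The grades in the example are made up.

```python
def mcnemar(results_a, results_b):
    """McNemar's test statistic for paired pass/fail results on the same test cases.

    results_a[i] / results_b[i]: True if prompt A / prompt B passed test case i.
    Returns (statistic, cases only A passed, cases only B passed).
    """
    if len(results_a) != len(results_b):
        raise ValueError("both prompts must be graded on the same test cases")
    only_a = sum(1 for a, b in zip(results_a, results_b) if a and not b)
    only_b = sum(1 for a, b in zip(results_a, results_b) if b and not a)
    if only_a + only_b == 0:
        return 0.0, only_a, only_b  # the prompts never disagree
    statistic = (only_a - only_b) ** 2 / (only_a + only_b)
    return statistic, only_a, only_b


if __name__ == "__main__":
    # Hypothetical grades for 100 test cases: both pass 60, only A passes 30,
    # only B passes 10, and there are no cases where both fail.
    a = [True] * 60 + [True] * 30 + [False] * 10
    b = [True] * 60 + [False] * 30 + [True] * 10
    stat, only_a, only_b = mcnemar(a, b)
    verdict = "likely a real difference" if stat > 3.84 else "could be noise"
    print(f"A-only: {only_a}, B-only: {only_b}, statistic: {stat:.1f} ({verdict})")
```

Evaluating both prompts on the same cases and comparing only the disagreements removes much of the noise that comes from some test cases simply being easier than others.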
What They Do
Studies the economic implications of advanced AI — impacts on labor markets, productivity, inequality, and business models. Builds tools and datasets to measure AI's economic footprint. A rare hybrid of economics research + ML engineering.
📋 Screening Questions
- How would you design a study to measure whether AI tools increase or decrease productivity in a specific occupation?
- What data sources would you use to study AI's effect on labor markets, and what are their limitations?
- Describe a piece of economic research you've done — what was the most challenging aspect of establishing causal claims?
Engineering Management Roles
Technical leaders who manage AI research and engineering teams — they need both deep technical credibility AND strong people leadership skills.
AI engineering managers are almost always former engineers or researchers first. They cannot just manage — they need to understand what their teams are doing technically, make resourcing decisions, and maintain enough technical knowledge to evaluate work quality. Look for a track record of both personal technical contributions AND team leadership.
What They Do
Leads a team of GPU and ML accelerator engineers. Responsible for hiring, mentoring, project prioritization, and technical direction for the team that makes AI training hardware run as efficiently as possible. Reports to a VP or Director of Engineering.
📋 Screening Questions
- How do you balance giving your team technical autonomy while ensuring project delivery on time?
- Describe how you've hired for a highly specialized technical role. What signals did you look for beyond technical skills?
- Tell me about a situation where you had to make a difficult technical trade-off decision for your team. How did you approach it?
- How do you stay technically current enough to evaluate the work of GPU performance engineers while also handling management responsibilities?
- How have you handled a high-performing engineer who was struggling with their behavior or communication on the team?
What They Do
"Inference" is what happens when a trained model responds to a user — it's running the model in real-time at scale. This manager leads teams that make inference fast, cheap, and reliable for millions of API calls per day. "Routing" is about intelligently directing requests to the right model or hardware to balance cost, speed, and quality.
📋 Screening Questions
- What does "time-to-first-token" mean and why is it critical for user experience in LLM serving? How would you approach optimizing it?
- How would you think about the architecture of a system routing requests across multiple model sizes or data centers?
- Describe how you've managed a team delivering a latency-critical production system. What did you prioritize?
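Time-to-first-token (TTFT) is the delay between sending a request and receiving the first piece of the streamed response. It is what makes a chat interface feel responsive, and it is separate from how long the full answer takes. The sketch below measures both from any token stream. The `fake_stream` generator is a hypothetical stand-in for a streaming API response, with made-up delays.

```python
import time
from typing import Iterable, Optional, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[Optional[float], float, int]:
    """Return (time to first token, total time, token count) in seconds for a token stream."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first_token_at, total, count


def fake_stream(n_tokens: int = 5, first_delay: float = 0.2, per_token: float = 0.02):
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(first_delay)  # e.g., queueing plus processing the prompt
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token)  # steady-state generation of each later token
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, count = measure_stream(fake_stream())
    print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s for {count} tokens")
```

In production, teams usually track percentiles (such as p50 and p99) of TTFT across many requests rather than a single measurement.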
What They Do
Leads a team of interpretability researchers. Sets the research agenda, manages individual contributors, coordinates with other safety teams, and represents the team's work externally. Unlike pure engineering management, this role requires deep research expertise and the ability to evaluate the quality of novel research ideas.
📋 Screening Questions
- How do you evaluate the quality of a research idea before committing a team to pursue it for months?
- How have you mentored a researcher who was technically strong but struggling to produce impactful results?
- What is your own research vision for interpretability? Where do you think the field should be in 3 years?
Candidate Sourcing Guide
Where to find these candidates beyond LinkedIn — and how to use each platform effectively.
The best AI researchers and engineers are not actively applying for jobs. They are publishing papers, posting code on GitHub, sharing models on Hugging Face, or presenting at conferences. Your job is to find them where they work publicly. This guide will show you how.
| Platform | What to Search | Best For | How to Engage |
|---|---|---|---|
| arXiv.org | Search by topic: "mechanistic interpretability", "RLHF", "constitutional AI", "LLM scaling". Look at first-author names on recent papers. | Research Scientists, Alignment researchers, Pretraining engineers | Find their email in the paper. Send a short, personalized message mentioning their specific paper. Never use a template. |
| GitHub | Search repositories: language:Python topic:RLHF, topic:mechanistic-interpretability, topic:llm-training. Look at contributors to major ML repos (Transformers, vLLM, TRL). | Research Engineers, ML Systems Engineers, Infrastructure Engineers, GPU engineers | Check their profile for an email address or Twitter/X handle. Reference their specific project in outreach. Use the star count of their repos as a signal of impact. |
| Hugging Face | Browse the Models hub. Filter models by task (e.g., text generation) or search for keywords such as "safety". Look at who has published fine-tuned models, datasets, or popular "Spaces" (demos). Check profile pages for LinkedIn links. | Research Scientists, Alignment researchers, NLP specialists, Applied AI engineers | Hugging Face profiles often link to LinkedIn or GitHub. Reference a specific model or Space they created. Community forum discussions are another source. |
| Google Scholar | Search paper titles or topics, e.g., "large language model" safety. Look at author profiles and check h-index, citation count, and institution. Sort by date to find people who are actively publishing. | Research Scientists (especially senior/PhD), Research Managers, Societal Impact researchers | Find their institutional email on the paper or university page. Reference their research specifically. For academic candidates, emphasize the research freedom and resources at the company. |
| ResearchGate | Search by topic, institution, or paper title. Profiles show full publication history and co-author networks, which are useful for finding connected clusters of researchers. | Research Scientists, Social Science researchers, Privacy researchers | ResearchGate has a messaging feature. Lead with genuine interest in their research. Useful for finding researchers at international institutions. |
| Kaggle | Browse competition leaderboards in NLP and ML categories. Look for Grandmasters and Masters. Filter profiles by notebook activity. Competition winners in LLM-related tasks are strong ML engineering candidates. | ML Engineers, Applied Research Engineers, Data Scientists with ML depth | Many Kaggle profiles link to LinkedIn or GitHub. Reference their competition placement or a specific notebook they've published. |
| Conference Proceedings | NeurIPS, ICML, ICLR, ACL, EMNLP, USENIX Security, IEEE S&P. Look at accepted paper lists, especially papers from non-obvious institutions (startups, independent researchers). | All research roles, especially Research Scientists and Research Managers | Conference talks are often posted on YouTube, so you can reference their talk. Workshops (not just the main track) are where up-and-coming researchers present early-stage work. |
| Twitter / X | Search hashtags: #mechanisticinterpretability, #AIalignment, #RLHF, #LLM. Follow major AI researchers, since their networks are full of candidates. Repost (retweet) activity reveals who is engaged with which topics. | All AI research and engineering roles | Twitter DMs are often more effective than email for AI researchers. Keep it brief, genuine, and specific. Many researchers list a public email in their bio. |
Boolean Search Templates
Copy-paste these into LinkedIn Recruiter or other sourcing tools.
("Research Scientist" OR "Research Engineer") AND ("large language model" OR "LLM" OR "transformer") AND ("NeurIPS" OR "ICML" OR "ICLR" OR "publications")
("AI safety" OR "AI alignment" OR "interpretability" OR "RLHF") AND ("language model" OR "Claude" OR "GPT" OR "LLM")
("reinforcement learning from human feedback" OR "RLHF" OR "reward model" OR "PPO") AND ("language model" OR "LLM" OR "post-training")
("CUDA" OR "GPU kernel" OR "Triton" OR "GPU optimization") AND ("machine learning" OR "deep learning" OR "training")
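If you maintain many variants of these templates, a tiny helper can keep the quoting and parentheses consistent. The sketch below uses two hypothetical convenience functions; they are not features of any sourcing tool. As written, it reproduces the last template above. Before pasting, check which operators your tool supports, for example whether it requires uppercase AND/OR.

```python
def any_of(*terms: str) -> str:
    """Quote each term and join the alternatives with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def all_of(*groups: str) -> str:
    """Require every group to match by joining the groups with AND."""
    return " AND ".join(groups)


gpu_query = all_of(
    any_of("CUDA", "GPU kernel", "Triton", "GPU optimization"),
    any_of("machine learning", "deep learning", "training"),
)
print(gpu_query)
```

To build a query for another cluster, swap in that cluster's term lists, for example safety terms in the first group and model terms in the second.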
General Screening Tips for AI Roles
Cross-cutting advice that applies to all AI research and engineering roles.
AI researchers and engineers are often skeptical of recruiters who don't understand their work. The more specific and knowledgeable your outreach and screening conversations are, the better your response rates will be. Use this guide to sound informed — even if you don't understand every detail.
✓ Strong Candidate Signals — All AI Roles
- Has published papers at top venues (NeurIPS, ICML, ICLR, ACL, USENIX)
- Active GitHub with well-maintained ML repositories that others star or fork
- Can clearly explain the business/mission impact of their past work
- Speaks specifically about failures and what they learned
- Mentions collaboration with cross-functional teams, not just working solo
- Shows progression in scope — has gone from individual contributor to leading projects
- Can explain deeply technical concepts clearly without jargon overload
- Expresses genuine intellectual curiosity beyond just their current role
- Has presented research externally (conferences, blog posts, talks)
- Can describe trade-offs in technical decisions, not just "I used X technology"
⚠ Red Flags — Warning Signs Across AI Roles
- Claims expertise in every hot AI topic (GPT-4, Llama, RLHF, etc.) without depth in any
- Resume lists cutting-edge frameworks but can't explain why they chose them
- Can't point to a specific system they built that is in production or was published
- Vague about team size — "worked on a team" without clarity on their personal contribution
- For senior research roles: no publications, citations, or open-source contributions
- Can't describe a research failure or a wrong hypothesis they pursued
- Claims full credit for large team accomplishments with no specific personal ownership
- Dismissive of safety or ethics concerns ("that's someone else's job")
- Primarily uses commercial AutoML tools or ChatGPT outputs without understanding the underlying models
Universal Screening Questions
These apply across all AI research and engineering roles.
Opening Questions
- What are you working on right now that you're most excited about?
- What drew you to AI research or engineering specifically, rather than other fields of software?
- Tell me about the project you're most proud of. What was your specific contribution?
Mission & Values Questions
- How do you think about the responsibility that comes with working on very powerful AI systems?
- What do you think is the most important problem in AI to be working on right now, and why?
- How do you stay current with the pace of change in this field?
Depth Probe Questions
- Explain [a specific term from their resume] in simple terms, as if talking to a smart non-technical person.
- What's a technical decision you made that you'd make differently today? What changed?
- What's the most important paper or piece of research from the last year that influenced your work?
Collaboration Questions
- How do you work with people who have very different technical backgrounds from yours?
- Describe a time you disagreed with a technical direction your team was taking. How did you handle it?
- How have you communicated a complex technical result to a non-technical stakeholder?
Salary & Compensation Benchmarks
Approximate ranges for frontier AI companies (US-based). These vary significantly by level, location, and company stage.
| Role Level | Total Comp Range (US) | Key Compensation Notes |
|---|---|---|
| New Grad Research Engineer / Scientist | $200K – $350K | Significant equity component at frontier labs. Base ~$150–200K at top companies. |
| Mid-Level (3–5 yrs) Research Engineer | $300K – $500K | Equity refreshes add significantly. Equity vesting usually 4 years with 1-year cliff. |
| Senior Research Scientist / Engineer | $400K – $700K+ | Performance-based bonuses common. Level definitions vary widely by company. |
| Staff / Principal Research Engineer | $600K – $1M+ | Equity grants can dominate at this level, especially at pre-IPO companies. |
| Engineering Manager (Research Teams) | $450K – $800K+ | Similar to senior IC, sometimes slightly lower base; management premium varies. |
| GPU / Performance Engineer (Senior) | $500K – $900K+ | Extreme scarcity of talent drives premium. CUDA expertise is especially compensated. |
* Ranges are illustrative, based on publicly available data and industry reports as of 2024–2025. Actual compensation varies by company, location, negotiation, and the value of equity when it vests or is exercised.