This one-page guide translates common AI engineering tools, frameworks, and buzzwords into plain English. Use it to understand what sits in a modern AI tech stack, what each tool actually does, and what to ask candidates during an initial screen.
Think of an AI app like a restaurant. Python is the kitchen language. PyTorch/TensorFlow are the cooking tools. Hugging Face is the pantry of ready-made recipes. LangChain helps coordinate the waiter, kitchen, and delivery. FAISS helps the app look up the right notes quickly. FastAPI and Gradio let users interact with the system. MLflow keeps records. Docker and Kubernetes help run the whole thing reliably at scale.
Each card includes a simple explanation, what recruiters should listen for, a mini visual, and source links for docs, code, papers, or trusted video explainers.
A large language model (LLM) predicts and generates text. It powers chatbots, summarizers, copilots, and document Q&A systems.
Embeddings are a way to turn text, images, or audio into lists of numbers so the system can measure how similar two items are in meaning.
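To make "measuring similarity with numbers" concrete, here is a minimal sketch. The three-number vectors are hand-written, hypothetical values for illustration only; a real system would get much longer vectors from an embedding model. Cosine similarity is one common way to compare them, though not the only one.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, written by hand (not produced by a model).
toy_embeddings = {
    "refund policy": [0.9, 0.1, 0.0],
    "money-back guarantee": [0.8, 0.2, 0.1],
    "office parking": [0.0, 0.1, 0.9],
}

query_text = "refund policy"
query_vector = toy_embeddings[query_text]

# Score every other phrase against the query.
scores = {
    text: cosine_similarity(query_vector, vector)
    for text, vector in toy_embeddings.items()
    if text != query_text
}

# The phrase with the highest score is treated as the closest in meaning.
closest = max(scores, key=scores.get)
print(closest, scores)
```

A score near 1 means the two vectors point the same way (similar meaning in this toy setup); a score near 0 means they are unrelated.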
Retrieval-Augmented Generation means the app looks up relevant documents first, then asks the model to answer using that context.
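The "look up first, then ask" flow can be sketched in a few lines. This is a simplified illustration, not a production design: the word-overlap `score` function is a crude stand-in for embedding search, the documents are made up, and the final call to a model is left out, so the sketch stops at the prompt that would be sent.

```python
def score(question, document):
    """Count shared lowercase words; a crude stand-in for embedding similarity."""
    question_words = set(question.lower().split())
    document_words = set(document.lower().split())
    return len(question_words & document_words)

def retrieve(question, documents, top_k=1):
    """Return the top_k documents that best match the question."""
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Put the retrieved context in front of the question for the model."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical company documents, invented for this example.
documents = [
    "Refunds are issued within 14 days of purchase.",
    "The office is closed on public holidays.",
]

prompt = build_prompt("When are refunds issued?", documents)
print(prompt)
```

The point for a screen: the model only sees the refund document, so its answer is grounded in company data rather than in whatever it memorized during training.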
Fine-tuning means taking a base model and training it further on a company’s data or task so it performs better on that specific domain.
Inference is the moment the model is actually used to answer a real request. Training builds the model; inference runs it.
Evaluation is how teams measure whether the system is good enough. The yardstick might be accuracy, latency, hallucination rate, or business KPIs.
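As a rough illustration of two of these measures, the sketch below computes accuracy and average latency from a small evaluation log. The log entries, field names, and exact-match scoring rule are all assumptions made for the example; real teams often use looser matching or human review.

```python
# Hypothetical evaluation log: what the system should have said,
# what it actually said, and how long each answer took.
eval_log = [
    {"expected": "14 days", "answer": "14 days", "latency_ms": 400},
    {"expected": "shipped", "answer": "pending", "latency_ms": 600},
    {"expected": "Monday", "answer": "Monday", "latency_ms": 500},
    {"expected": "yes", "answer": "yes", "latency_ms": 500},
]

# Accuracy: share of answers that exactly match the expected answer.
correct = sum(1 for row in eval_log if row["answer"] == row["expected"])
accuracy = correct / len(eval_log)

# Latency: average response time in milliseconds.
avg_latency_ms = sum(row["latency_ms"] for row in eval_log) / len(eval_log)

print(f"accuracy={accuracy:.2f}, avg_latency_ms={avg_latency_ms:.0f}")
```

A candidate who has run evaluations should be able to describe something like this log: what was in it, who defined the expected answers, and which number the team cared about most.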
MLOps is the discipline of getting models from experiment into reliable production systems with versioning, monitoring, and repeatability.
With agents and tool use, the model does more than generate text: it can trigger tools such as search, calculators, databases, or workflows to get work done.
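One simple way to picture this: the model produces a small structured request naming a tool, and ordinary code runs that tool and hands back the result. The sketch below assumes a made-up request format and two hypothetical tools; real frameworks define their own formats, and the model's request is hard-coded here rather than generated.

```python
def calculator(expression):
    """Hypothetical tool: add two numbers written like '2 + 3'."""
    left, right = expression.split("+")
    return float(left) + float(right)

def lookup_order(order_id):
    """Hypothetical tool: look up an order status in a tiny fake database."""
    orders = {"A100": "shipped"}
    return orders.get(order_id, "unknown")

# Registry of tools the application allows the model to use.
TOOLS = {"calculator": calculator, "lookup_order": lookup_order}

def run_tool_call(call):
    """Run the tool named in a model-produced request and return its result."""
    tool = TOOLS[call["tool"]]
    return tool(call["argument"])

# In a real system the model would emit this request; here it is hard-coded.
model_request = {"tool": "lookup_order", "argument": "A100"}
result = run_tool_call(model_request)
print(result)
```

The registry is the design choice worth noticing: the application, not the model, decides which tools exist, which is where guardrails usually live.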
These are not “gotcha” questions. They are designed to surface practical depth, ownership, and whether the candidate shipped real AI systems.
| Question | What a solid answer sounds like | What weaker answers sound like |
|---|---|---|
| 1. What kind of AI applications have you built or supported? | Good: Names real use cases such as document Q&A, recommendation, forecasting, fraud detection, copilots, or image analysis, plus business impact and users. | Watch for: Only says “LLMs” or “GenAI” with no product, domain, users, or result. |
| 2. Which tools or frameworks did you use most, and why? | Good: Explains tradeoffs. Example: “PyTorch for fine-tuning, Hugging Face for models, FastAPI for serving, MLflow for tracking.” | Watch for: Name-drops tools but cannot explain what each one did. |
| 3. Did you work more on model training, model integration, or production deployment? | Good: Clearly separates responsibilities and describes ownership. Example: “I integrated existing models and productionized them rather than training from scratch.” | Watch for: Vague “end-to-end” claims without specifics. |
| 4. Have you used pre-trained models, fine-tuning, or both? | Good: Explains when they used each. Example: “We started with a pre-trained model, then fine-tuned for our domain because retrieval alone wasn’t enough.” | Watch for: Cannot explain the difference or why one approach was chosen. |
| 5. How did you evaluate whether the AI system was actually good? | Good: Mentions offline metrics, human review, latency, hallucinations, business KPIs, A/B tests, precision/recall, or task success. | Watch for: “We just tried it and it looked good.” Humanity’s favorite scientific method. |
| 6. Did your application use retrieval, embeddings, or a vector store? | Good: Can explain RAG simply: chunk data, create embeddings, store/search them, pass retrieved context into the model. | Watch for: Says yes but cannot explain how retrieval improved answers. |
| 7. How did you expose the model to users or other systems? | Good: Mentions APIs, FastAPI, batch pipelines, microservices, internal tools, chat UIs, Gradio demos, or product integration. | Watch for: Only built isolated notebooks. |
| 8. What was the hardest production challenge? | Good: Real answers include latency, cost, bad data, hallucinations, scaling, GPU limits, monitoring, or reliability. | Watch for: Says there were no real challenges, which is adorable and almost certainly false. |
| 9. How did you manage experiments, model versions, or prompts? | Good: Mentions MLflow, registries, Git, model cards, prompt versioning, evaluation logs, or release controls. | Watch for: No reproducibility process at all. |
| 10. Did you deploy with Docker, Kubernetes, cloud services, or something simpler? | Good: Explains packaging and runtime decisions based on scale, reliability, and team maturity. | Watch for: Claims cloud-native expertise but cannot describe basic deployment flow. |
| 11. What did you personally own versus what the team owned? | Good: Specific boundaries. Example: “I owned retrieval quality, prompt evaluations, and API integration. Another engineer owned infra.” | Watch for: Everything somehow belonged to them when success is mentioned and to “the team” when details are requested. |
| 12. If you had to improve the system today, what would you change first? | Good: Thoughtful priorities such as better evaluation, lower latency, improved retrieval, domain fine-tuning, caching, guardrails, or monitoring. | Watch for: No opinion, no diagnosis, no engineering judgment. |