What do they actually do?
Start here — this is the most important distinction to understand before screening candidates.
Data Engineers build the roads that data travels on. Data Scientists drive the car and figure out where to go. Without good roads, no one gets anywhere — but without a driver, the roads are useless.
Data Engineer: The Data Builder
Designs, builds, and maintains the systems that collect, store, and move data reliably. They make sure the right data gets to the right place at the right time — clean, fast, and at scale.
Think of them as: A plumber + electrician for data. They build the pipes and wiring that make everything else possible.
- Builds data pipelines (automated data flows)
- Manages databases and data warehouses
- Ensures data is clean, consistent, and accessible
- Works with huge volumes of data (millions to billions of rows)
- Focuses on reliability, speed, and scale
- Collaborates with Data Scientists to give them clean data
Data Scientist: The Data Analyst & Predictor
Uses data to answer business questions, find patterns, and build models that predict future outcomes. They turn numbers into insights and stories that help the business make better decisions.
Think of them as: A detective + scientist. They investigate data, form hypotheses, run experiments, and present findings.
- Builds machine learning / predictive models
- Runs statistical analyses and experiments (A/B tests)
- Creates dashboards and visualizations
- Communicates findings to business stakeholders
- Identifies trends, anomalies, and opportunities
- Uses clean data provided by Data Engineers
Side-by-Side Comparison
A quick reference to help you tell these roles apart at a glance.
| Dimension | Data Engineer | Data Scientist |
|---|---|---|
| Core Question | "How do we move and store data reliably?" | "What does the data tell us?" |
| Primary Output | Data pipelines, databases, infrastructure | Models, insights, reports, predictions |
| Daily Work | Writing code to move/transform data, fixing pipelines, managing cloud storage | Exploring data, building ML models, presenting findings |
| Key Skills | SQL, Python, cloud (AWS/GCP/Azure), Spark, Kafka, Airflow | Python/R, statistics, machine learning, visualization, storytelling |
| Works Most With | DevOps, Software Engineers, Data Scientists | Product managers, executives, Data Engineers, analysts |
| Coding Level | Heavy — writes production-grade code daily | Moderate to heavy — writes analysis/model code |
| Math/Stats Need | Light — mostly systems thinking | Heavy — statistics is core to the job |
| Career Path | Sr. DE → Staff DE → Data Architect → Head of Data Engineering | Sr. DS → Lead DS → Principal DS → Head of Data Science / Director |
| Salary Range (US) | $110K–$180K+ depending on seniority & location | $110K–$190K+ depending on seniority & domain |
A claim to do both equally well is a yellow flag at senior levels. These roles diverge significantly in skills and focus. Junior candidates sometimes overlap (especially at small companies), but experienced professionals typically identify strongly with one path.
Educational Backgrounds
What degrees should you expect to see — and what alternatives are equally valid?
Neither role has one single "correct" degree path. Many excellent candidates are self-taught or come from bootcamps. That said, certain degree fields appear more often. Focus on what they know and what they've built, not just the school name on a resume.
Common Degree Fields (Data Engineer)
- Computer Science (most common)
- Software Engineering
- Information Systems / Information Technology
- Electrical Engineering
- Mathematics or Applied Mathematics
- Bootcamp + self-study (increasingly accepted)
Common Degree Fields (Data Scientist)
- Statistics or Applied Statistics
- Mathematics
- Computer Science
- Machine Learning / AI (graduate programs)
- Physics, Economics, or Engineering (strong analytical base)
- Domain-specific fields (biology for biotech DS, finance for quant DS)
Degree Level Reality (Data Engineer)
- Bachelor's is standard for most roles
- Master's valued for senior/architect roles
- PhD rare and not typically needed
- Cloud certifications (AWS, GCP, Azure) often valued more than advanced degrees
- Strong GitHub portfolio can substitute for formal education
Degree Level Reality (Data Scientist)
- Bachelor's is the minimum baseline
- Master's is common and often preferred
- PhD common in research-heavy roles (tech companies, biotech, finance)
- Kaggle competition rankings carry real weight
- Published papers on arXiv or Google Scholar are strong signals for senior roles
Alternative Credentials to Recognize
Certifications and non-traditional credentials that signal competence.
High-Value Certifications (Data Engineer)
- AWS Certified Data Engineer – Associate
- Google Professional Data Engineer
- Azure Data Engineer Associate (DP-203)
- Databricks Certified Associate Developer for Apache Spark
- dbt Analytics Engineering Certification
- Snowflake SnowPro Core Certification
High-Value Credentials (Data Scientist)
- Kaggle Competition rankings (Expert, Master, Grandmaster)
- Google Professional Machine Learning Engineer
- AWS Certified Machine Learning – Specialty
- Coursera / DeepLearning.AI Specializations (Andrew Ng)
- Published papers on arXiv or conference proceedings
- Active GitHub with ML projects / notebooks
Technical Skills & Tools
The technologies you'll see on resumes — explained in plain English.
Don't worry about memorizing all these tools. Focus on clusters: Data Engineers work with data movement and storage tools; Data Scientists work with analysis and modeling tools. Both use Python, but for different purposes.
Programming Languages
Data Storage & Warehousing
Pipeline & Orchestration Tools
Cloud Platforms
Core Languages & Analysis
ML & Statistics Libraries
When reviewing resumes, look for clusters, not individual tools. A Data Engineer resume should show cloud platforms + pipeline tools + SQL. A Data Scientist resume should show ML libraries + statistical methods + visualization. If a resume lists every tool in both categories, probe for actual depth in the screening call.
Typical Experience & Background
What career paths look like for each role — and what to expect on a resume.
Both roles often start from software engineering or analytics backgrounds. Data Engineers tend to come up through backend engineering or database administration. Data Scientists tend to come from academia, analytics, or software roles with a quantitative twist.
Online Presence: What to Look For
Where to look beyond the resume — and how to interpret what you find.
GitHub
A code portfolio. Look for: active repositories, quality of commit messages, projects relevant to the role. For DEs: data pipeline projects. For DSs: ML notebooks, model implementations.
Kaggle
A competitive ML platform with public notebooks and competitions. Ranked profiles (Expert → Master → Grandmaster) signal real hands-on ML capability. Strong signal for Data Scientists especially.
Hugging Face
The central hub for open-source AI models, datasets, and "Spaces" (interactive demos). A public profile with model cards, fine-tuned models, or popular Spaces indicates active AI/NLP work.
arXiv
A preprint server for research papers in CS, math, and physics. If a Data Scientist has published papers here, they are contributing to academic/research communities — a strong signal for senior or research-heavy roles.
Google Scholar
Academic citation profiles. A Data Scientist with a Google Scholar profile has usually published research, but the index also includes preprints, so check where the work appeared. Look at citation count and h-index for research impact. Very relevant for biotech, finance, or research-driven companies.
ResearchGate
Academic social network for researchers. Profiles show publications, citations, and research interests. Similar to Google Scholar in purpose — useful for evaluating research-oriented Data Scientists in STEM domains.
YouTube Learning Resources
Channels you can share with hiring managers or watch yourself to build deeper understanding.
These are reputable, expert-level YouTube channels. You don't need to watch every video — even 15 minutes on "what is a data pipeline?" will make you a more confident screener. Subscriber counts are approximate as of 2024–2025.
▸ Data Engineering Channels
Seattle Data Guy
Practical data engineering tutorials — Airflow, Spark, dbt, Snowflake. Great for understanding what modern DE work actually looks like day-to-day.
Data with Zach
Data engineering careers, resume tips, interview prep, and practical advice for breaking into the field. Very recruiter-friendly tone.
Andreas Kretz
Covers the full data engineering ecosystem with tutorials on Kafka, Airflow, Spark, and cloud platforms. Great for building vocabulary around modern data stacks.
▸ Data Science Channels
Ken Jee
Data science career, job search tips, and portfolio advice. Excellent for understanding what hiring managers look for in DS candidates. Very practical for recruiters too.
StatQuest (Josh Starmer)
Makes statistics and ML concepts crystal clear with visual explanations. The best channel to help you understand terms on a DS resume — neural networks, regression, clustering, etc.
3Blue1Brown
Stunningly visual explanations of mathematics and neural networks. Watch the "Neural Networks" series to understand the math behind deep learning; it builds vocabulary without requiring you to code anything.
Tina Huang
DS career advice, resume reviews, day-in-the-life content. Excellent for understanding what DS candidates look for in jobs and how they think — useful for improving job descriptions and outreach.
Trusted Reference Websites
roadmap.sh
Visual career roadmaps for Data Engineer and Data Scientist paths. Excellent for understanding the full skill progression from beginner to expert.
roadmap.sh/data-engineer
Towards Data Science
Medium publication with thousands of practitioner-written articles on both DE and DS. Searchable by topic — great for looking up any term you encounter on a resume.
towardsdatascience.com
DataCamp Blog
Career guides, role comparisons, and skill breakdowns written for non-technical audiences. Their "Data Engineer vs Data Scientist" articles are particularly recruiter-friendly.
datacamp.com/blog
Screening Interview Guide
Questions to ask during initial phone screens — with guidance on evaluating responses.
You don't need to understand the technical details deeply. Focus on: Does the candidate speak fluently and specifically about their work? Do they use concrete examples? Do they seem to understand the why behind their choices, not just the tools? Vague or buzzword-heavy answers warrant a probe.
Data Engineer Screening Questions
"Can you explain what a data pipeline is and describe one you've built?"
"A data pipeline is an automated process that moves data from a source, transforms it, and loads it to a destination. At [Company], I built one that pulled clickstream events from our app via Kafka, cleaned and aggregated them using Spark, and loaded summaries into Snowflake every 15 minutes — we went from a 3-hour reporting lag to near-real-time."
"A data pipeline moves data from point A to B. I've built ETL pipelines using Airflow and loaded data into a data warehouse. We had issues with latency that I helped fix."
"Yes, I know what a pipeline is — it's basically moving data around. I've used Python for that kind of work." (No specifics, no depth, no example.)
"What's the largest dataset you've worked with? How did you handle performance at that scale?"
"We processed about 10TB of log data daily. I used partitioning and clustering in BigQuery to reduce query costs by 60%, and implemented incremental loads instead of full refreshes, cutting our job runtime from 4 hours to 45 minutes."
"I've worked with large datasets, maybe a few hundred GBs. We used Spark to distribute the processing across a cluster."
"I've worked with pretty big data. I'm not sure of the exact size but it was a lot. We just ran it on the server."
"Tell me about a time a data pipeline broke in production. What happened and how did you fix it?"
"Our Airflow DAG started failing silently because an upstream API changed its response schema. I added schema validation checks at the ingestion stage using Great Expectations, set up PagerDuty alerts, and added a dead-letter queue for failed records. We hadn't lost data, just delayed it 2 hours."
"We had a pipeline break because of a bad data input. I debugged the logs, found the issue, and fixed the transformation script. Took about a day."
"Pipelines break sometimes. I usually tell the team and we figure it out together. I haven't had a major incident I can think of."
"Which cloud platform have you used most for data work, and what services did you rely on?"
"Primarily AWS. I used S3 as the raw data lake, Glue for ETL jobs, Redshift for the warehouse, and Step Functions to orchestrate the workflow. I also set up IAM roles with least-privilege access for each pipeline component."
"I've used AWS — mainly S3 and some Lambda functions. I have the AWS Cloud Practitioner cert and I'm studying for the Data Engineer cert."
"I've used AWS a little. I know it's important. Mostly we had a DevOps team handle the cloud stuff."
Data Scientist Screening Questions
"Can you describe an ML model you built and what business problem it solved?"
"I built a churn prediction model for our SaaS product using gradient boosting (XGBoost). By identifying users likely to cancel 30 days out, the customer success team reduced monthly churn by 18%. I used SHAP values to explain the model's predictions to non-technical stakeholders."
"I built a classification model to predict customer churn using scikit-learn. The accuracy was about 82%. It helped the business target at-risk customers."
"I've built machine learning models using Python. I know classification, regression, and clustering. I can build whatever the business needs."
"How would you explain overfitting to a non-technical stakeholder?"
"I'd say: imagine you studied last year's exam questions so intensely that you memorized all the answers — but then the new exam has slightly different questions and you fail. Our model did the same thing: it 'memorized' the training data instead of learning the pattern. That's why I always test it on data it's never seen."
"Overfitting is when a model performs well on training data but poorly on test data. It means the model is too complex and not generalizing. I use cross-validation to check for it."
"Overfitting means the model is too fit to the data. You need to regularize it or get more data." (Can't explain to a non-technical audience.)
"Tell me about a time you had to present a data insight to a non-technical audience. How did you make it understandable?"
"I presented our A/B test results to the VP of Marketing. Instead of showing p-values, I framed it as: 'Version B generated 420 more sign-ups per week at the same cost — that's an extra $50K in annual recurring revenue.' I used one chart, told a before/after story, and had a clear recommendation ready."
"I simplified the technical details, used charts instead of tables, and focused on what the numbers meant for the business rather than the methodology."
"I tried to explain the model but they didn't really get it. I think non-technical people just struggle with this stuff. I prefer working with data teams."
"How do you decide when you have enough data to trust a result?"
"I use statistical power analysis before running an experiment to determine the minimum sample size needed to detect a meaningful effect with confidence. I pre-register my hypotheses and define success metrics before seeing results to avoid p-hacking. For A/B tests, I aim for 95% confidence and make sure we reach the predetermined sample size before calling it."
"I look at the p-value — if it's below 0.05, the result is statistically significant. I also try to run the experiment long enough to get a good sample."
"When the trend looks consistent over a few days and the numbers are pointing in the right direction."
Evaluation Guide: What to Look For
✓ Green Flags — Strong Candidate Signs
- Uses specific numbers, tools, and company names when describing past work
- Explains why they made technical choices, not just what they did
- Comfortable admitting what they don't know and how they'd find out
- Can translate technical concepts into plain business language
- Asks thoughtful questions about the role, team, and tech stack
- Has a visible portfolio (GitHub, Kaggle, publications) that matches what they claim
- Shows progression and growth across roles, not just lateral moves
- Mentions collaboration across teams (engineers, analysts, PMs)
⚠ Red Flags — Warning Signs
- Uses buzzwords (AI, Big Data, ML) without specifics or examples
- Can't describe a project they listed on their resume in detail
- Claims expertise in every technology currently trending
- Dismissive of communication or stakeholder work ("I just do the tech")
- Resume lists tools but candidate can't explain basic use cases for them
- Experience is limited to coursework and personal projects, with no professional context (a concern for mid-level and senior roles, not junior ones)
- Evasive or defensive when asked about failures or challenges
- For senior roles: no evidence of mentoring, leading, or designing systems
Screening Tips by Experience Level
Junior (0–2 years)
Focus on fundamentals and learning agility. Strong junior candidates have solid personal projects, can explain basic concepts clearly, and show genuine curiosity. Don't penalize for limited industry experience.
Mid-Level (2–5 years)
Look for ownership and depth. They should describe projects where they made decisions, not just executed others' plans. Can they explain trade-offs? Have they dealt with production issues?
Senior (5–9 years)
Expect system design thinking and business alignment. They should talk about architecture decisions, mentoring, and how their work tied to business goals — not just completing tickets.
Staff / Principal (9+ years)
Look for organizational impact. Do they describe building teams, defining standards, or driving technical strategy? At this level, interpersonal and influence skills matter as much as technical depth.