Data Science: Field Overview
Data science is an interdisciplinary field that extracts knowledge and actionable insight from structured and unstructured data. It combines statistical reasoning, machine learning, software engineering, and domain expertise to transform raw data into decisions and products.
Core Disciplines
A working data scientist draws on several overlapping areas:
- Statistics and probability — the mathematical foundation for inference, uncertainty quantification, and experimental design
- Machine learning — algorithms that learn patterns from data to generalise to unseen examples
- Data engineering — the pipelines, storage systems, and tooling that make data available and reliable
- Visualisation and communication — translating analytical findings so that people who did not run the analysis can understand and act on them
No practitioner is equally strong in all four. Teams distribute these strengths across roles.
The Data Science Workflow
Most analyses follow the CRISP-DM arc (Cross-Industry Standard Process for Data Mining):
- Business understanding — translate a decision problem into a measurable analytical objective
- Data understanding — explore available data, assess quality, identify gaps
- Data preparation — clean, join, transform, and engineer features
- Modelling — select, train, and tune a model or family of models
- Evaluation — assess performance against held-out data and the original objective
- Deployment — integrate the model or insight into a system or workflow
Iteration between stages is the rule rather than the exception. Practitioner surveys commonly put data preparation at 60–80% of total project time.
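The preparation, modelling, and evaluation stages above can be sketched in a few lines of scikit-learn. This is an illustrative sketch on synthetic data; the dataset, model, and metric choices are assumptions for the example, not prescriptions:

```python
# Sketch of the preparation → modelling → evaluation stages on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data preparation: a synthetic labelled dataset stands in for cleaned, joined data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Evaluation requires held-out data, so split before any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Modelling: a pipeline couples feature scaling to the estimator so that
# preprocessing fitted on the training set is reused, not refit, at test time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluation: score against the held-out split, never the training data.
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

The pipeline also illustrates the iteration point: swapping the estimator or adding a feature-engineering step changes one line, so cycling between preparation and modelling stays cheap.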
Exploratory Data Analysis
Before any model is trained, exploratory data analysis (EDA) builds a mental model of the data:
- Distributions of individual variables (histograms, box plots, empirical CDFs)
- Relationships between variables (scatter plots, correlation matrices)
- Anomalies, outliers, and missing values
- Class imbalance and selection bias
EDA is not a formal stage with a fixed output — it is a habit of sustained curiosity.
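A first EDA pass over a table often amounts to a handful of pandas one-liners. The dataset below is invented for illustration (the column names and values are assumptions), but the four calls map directly onto the checklist above:

```python
# A minimal EDA pass over a small illustrative dataset.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 41, 29, None, 52, 38],
    "income": [48_000, 61_000, 39_000, 45_000, 120_000, 52_000],
    "churned": [0, 0, 1, 0, 1, 0],
})

# Distributions of individual variables: per-column summary statistics.
print(df.describe())

# Anomalies and missing values: count of nulls per column.
print(df.isna().sum())

# Relationships between variables: pairwise correlation matrix.
print(df.corr())

# Class imbalance: relative frequency of each outcome label.
print(df["churned"].value_counts(normalize=True))
```

None of this replaces plotting histograms and scatter plots; it is the numeric counterpart one runs first to decide which pictures are worth drawing.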
Modelling Paradigms
| Paradigm | Objective | Examples |
|---|---|---|
| Supervised learning | Predict a labelled output | Regression, classification |
| Unsupervised learning | Discover structure in unlabelled data | Clustering, dimensionality reduction |
| Self-supervised learning | Predict masked or held-out parts of the input | Language modelling, masked autoencoders |
| Reinforcement learning | Maximise cumulative reward through interaction | Policy optimisation |
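The first two paradigms in the table can be contrasted on the same synthetic points. The dataset generator and the two estimators here are illustrative choices, not the only ones:

```python
# Supervised vs unsupervised learning on the same synthetic data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Three Gaussian clusters with known group labels y.
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Supervised: the labels y are provided, and the model learns to predict them.
clf = LogisticRegression().fit(X, y)
print("supervised training accuracy:", clf.score(X, y))

# Unsupervised: the same points, but the labels are withheld; the model
# must discover the group structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```

The only difference between the two calls is whether `y` is passed to `fit` — which is exactly the supervised/unsupervised distinction in the table.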
Relationship to Adjacent Fields
- Statistics — data science inherits statistical inference; it adds scale, automation, and a focus on prediction over explanation
- Machine learning research — data science applies what ML research produces
- Data engineering — data science consumes what data engineering builds; the boundary is porous
- Business intelligence — BI reports on what happened; data science attempts to explain why and predict what will happen