Data Science: Field Overview
Data science is an interdisciplinary field that extracts knowledge and actionable insight from structured and unstructured data. It combines statistical reasoning, machine learning, software engineering, and domain expertise to transform raw data into decisions and products.
Core Disciplines
A working data scientist draws on several overlapping areas:
- Statistics and probability — the mathematical foundation for inference, uncertainty quantification, and experimental design
- Machine learning — algorithms that learn patterns from data to generalise to unseen examples
- Data engineering — the pipelines, storage systems, and tooling that make data available and reliable
- Visualisation and communication — translating analytical findings so that people who did not run the analysis can understand and act on them
No practitioner is equally strong in all four. Teams distribute these strengths across roles.
The Data Science Workflow
Most analyses follow the CRISP-DM arc (Cross-Industry Standard Process for Data Mining):
- Business understanding — translate a decision problem into a measurable analytical objective
- Data understanding — explore available data, assess quality, identify gaps
- Data preparation — clean, join, transform, and engineer features
- Modelling — select, train, and tune a model or family of models
- Evaluation — assess performance against held-out data and the original objective
- Deployment — integrate the model or insight into a system or workflow
Iteration between stages is the rule rather than the exception. Practitioner surveys commonly put data preparation at 60–80% of total project time.
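The preparation, modelling, and evaluation stages above can be sketched in a few lines of scikit-learn. This is an illustrative sketch on synthetic data; the dataset, model, and metric choices are assumptions for the example, not prescriptions:

```python
# Sketch of the preparation → modelling → evaluation stages on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data preparation: a synthetic labelled dataset stands in for cleaned, joined data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Evaluation requires held-out data, so split before any fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Modelling: a pipeline couples feature scaling to the estimator so that
# preprocessing fitted on the training set is reused, not refit, at test time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluation: score against the held-out split, never the training data.
print(f"held-out accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

The pipeline also illustrates the iteration point: swapping the estimator or adding a feature-engineering step changes one line, so cycling between preparation and modelling stays cheap.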
Exploratory Data Analysis
Before any model is trained, exploratory data analysis (EDA) builds a mental model of the data:
- Distributions of individual variables (histograms, box plots, empirical CDFs)
- Relationships between variables (scatter plots, correlation matrices)
- Anomalies, outliers, and missing values
- Class imbalance and selection bias
EDA is not a formal stage with a fixed output — it is a habit of sustained curiosity.
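A first EDA pass over a table often amounts to a handful of pandas one-liners. The dataset below is invented for illustration (the column names and values are assumptions), but the four calls map directly onto the checklist above:

```python
# A minimal EDA pass over a small illustrative dataset.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 41, 29, None, 52, 38],
    "income": [48_000, 61_000, 39_000, 45_000, 120_000, 52_000],
    "churned": [0, 0, 1, 0, 1, 0],
})

# Distributions of individual variables: per-column summary statistics.
print(df.describe())

# Anomalies and missing values: count of nulls per column.
print(df.isna().sum())

# Relationships between variables: pairwise correlation matrix.
print(df.corr())

# Class imbalance: relative frequency of each outcome label.
print(df["churned"].value_counts(normalize=True))
```

None of this replaces plotting histograms and scatter plots; it is the numeric counterpart one runs first to decide which pictures are worth drawing.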
Modelling Paradigms
| Paradigm | Objective | Examples |
|---|---|---|
| Supervised learning | Predict a labelled output | Regression, classification |
| Unsupervised learning | Discover structure in unlabelled data | Clustering, dimensionality reduction |
| Self-supervised learning | Predict masked or held-out parts of the input | Language modelling, masked autoencoders |
| Reinforcement learning | Maximise cumulative reward through interaction | Policy optimisation |
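The first two paradigms in the table can be contrasted on the same synthetic points. The dataset generator and the two estimators here are illustrative choices, not the only ones:

```python
# Supervised vs unsupervised learning on the same synthetic data.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Three Gaussian clusters with known group labels y.
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Supervised: the labels y are provided, and the model learns to predict them.
clf = LogisticRegression().fit(X, y)
print("supervised training accuracy:", clf.score(X, y))

# Unsupervised: the same points, but the labels are withheld; the model
# must discover the group structure on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```

The only difference between the two calls is whether `y` is passed to `fit` — which is exactly the supervised/unsupervised distinction in the table.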
Relationship to Adjacent Fields
- Statistics — data science inherits statistical inference; it adds scale, automation, and a focus on prediction over explanation
- Machine learning research — data science applies what ML research produces
- Data engineering — data science consumes what data engineering builds; the boundary is porous
- Business intelligence — BI reports on what happened; data science attempts to explain why and predict what will happen