Business analytics 101: key concepts explained

Data science can feel like an alphabet soup of terms, so here is a practical glossary to help you navigate the basics: from statistics to machine learning to data-engineering and programming tools.

Algorithm — a set of repeatable instructions to solve a task. In data science / ML, algorithms define how data is transformed into predictions or insights. 

Average (Mean) — the arithmetic mean: the sum of values divided by the count. Used as a basic measure of central tendency, though sensitive to outliers.

Median — the middle value in an ordered dataset. If there are an even number of observations, it’s the average of the two central values. Unlike the mean, median is robust to outliers. 

Mode — the value that appears most frequently in a dataset. A dataset can be unimodal (one mode), multimodal (multiple), or have none. 
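
To make the three measures concrete, here is a minimal sketch using Python's built-in statistics module on a small, made-up list of order values:

```python
# Made-up list of order values; 15 appears most often, so it is the mode.
import statistics

orders = [12, 15, 15, 18, 22, 15, 30, 18]

print(statistics.mean(orders))    # 18.125 -> sum of values divided by count
print(statistics.median(orders))  # 16.5   -> average of the two central values (even count)
print(statistics.mode(orders))    # 15     -> most frequent value
```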

Variance & Standard Deviation — variance measures average squared deviation from the mean; standard deviation is its square root. They express how spread out data are. 
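
A short NumPy sketch on the same kind of made-up values; note that ddof=1 switches from the population formula to the sample estimate:

```python
# Variance is the average squared deviation from the mean; standard deviation is its square root.
import numpy as np

values = np.array([12, 15, 15, 18, 22, 15, 30, 18])

print(np.var(values))          # population variance (divides by n)
print(np.std(values))          # population standard deviation, in the same units as the data
print(np.std(values, ddof=1))  # sample standard deviation (divides by n - 1)
```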

Outlier — a data point that diverges markedly from the rest of the distribution. Outliers can distort statistics like mean, misleading analysis if not identified. 
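
One common rule of thumb (by no means the only one) flags points more than 1.5 interquartile ranges beyond the quartiles; here is a minimal NumPy sketch on an invented value list:

```python
# Flag values lying more than 1.5 interquartile ranges below Q1 or above Q3.
import numpy as np

values = np.array([12, 15, 15, 18, 22, 15, 30, 18, 250])  # 250 is the suspicious point

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(values[(values < lower) | (values > upper)])  # -> [250]
```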

Dataset / Dataframe — a structured collection of data, typically a table where rows are observations and columns are features (variables). In many data-science tools (e.g. Python/pandas), dataframes are the basic data structure. 
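
A minimal pandas sketch; the column names are invented purely for illustration:

```python
# Rows are observations (one customer each), columns are features; all names are invented.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 29, 41],
    "monthly_spend": [120.5, 80.0, 210.3],
})

print(df.shape)          # (3, 3) -> 3 rows, 3 columns
print(df["age"].mean())  # column-wise statistics come for free
```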

Big Data — very large and/or complex datasets — too big or unstructured for traditional processing tools — that require specialized tools, architectures or distributed processing. 

Data Pipeline — the sequence of data collection, cleaning, transformation, storage, and processing operations that bring raw data to a state ready for analysis. Often involves ETL/ELT processes. 

Data Wrangling (or Data Cleaning / Preprocessing) — the process of cleaning, normalizing, and transforming raw data into a usable format: handling missing values, correcting errors, standardizing formats, encoding features, etc. Essential before any analysis.
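
A small pandas sketch of a few typical cleaning steps on an invented table; the column names and fixes are illustrative, not a recipe:

```python
# Typical cleaning steps: fix types, standardize formats, handle missing values.
import pandas as pd

raw = pd.DataFrame({
    "signup_date": ["2024-01-05", "2024-02-18", None],
    "country": ["IT", "italy", "IT"],
    "revenue": ["100", "250.5", None],
})

clean = raw.copy()
clean["signup_date"] = pd.to_datetime(clean["signup_date"], errors="coerce")  # text -> dates, None -> NaT
clean["country"] = clean["country"].str.upper().replace({"ITALY": "IT"})      # standardize the format
clean["revenue"] = pd.to_numeric(clean["revenue"], errors="coerce")           # text -> numbers
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())         # one way to fill a gap

print(clean)
```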

Feature / Feature Engineering — a “feature” is a variable (column) used as input to a model. Feature engineering is the practice of creating or transforming variables to improve model performance (e.g. derive “age from date-of-birth”, encode categorical variables, create interaction variables). 
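
A brief pandas sketch of two common steps on invented data, deriving age from a date of birth and one-hot encoding a categorical column:

```python
# Two common feature-engineering steps: derive a numeric feature and encode a categorical one.
import pandas as pd

df = pd.DataFrame({
    "date_of_birth": pd.to_datetime(["1990-04-12", "1985-11-03", "2001-07-25"]),
    "plan": ["basic", "premium", "basic"],
})

reference_date = pd.Timestamp("2025-01-01")
df["age"] = (reference_date - df["date_of_birth"]).dt.days // 365  # rough age in whole years

df = pd.get_dummies(df, columns=["plan"])  # one-hot encode the categorical column
print(df)
```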

Supervised Learning — class of ML where the model is trained on labeled data (input features + known output), to learn to predict outputs for new data. 

Regression (Linear Regression) — a statistical / ML method that models the relationship between a dependent (target) variable and one or more independent (predictor) variables, assuming a linear relationship. Useful for forecasting continuous outcomes (e.g. sales amount, price, demand).
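
A minimal scikit-learn sketch fitting a line to invented ad-spend and sales figures:

```python
# Fit a line relating ad spend to sales; both series are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10], [20], [30], [40], [50]])  # ad spend (single predictor)
y = np.array([25, 41, 62, 78, 101])           # sales (continuous target)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # estimated slope and intercept
print(model.predict([[60]]))          # forecast for an unseen spend level
```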

Logistic Regression (Classification) — a model similar in concept to linear regression but used when the target variable is categorical (e.g. yes/no, churn/no churn). The model predicts probabilities of class membership.
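
A minimal scikit-learn sketch predicting churn from two invented features (the numbers are illustrative only):

```python
# Predict the probability of churn (1) vs. stay (0) from two invented features.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1, 200], [3, 150], [10, 20], [12, 5], [2, 180], [8, 30]])  # tenure, monthly usage
y = np.array([0, 0, 1, 1, 0, 1])                                          # 1 = churned

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([[5, 60]]))  # probability of each class for a new customer
```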

Decision Tree — a model that splits data based on feature values, forming a tree-like decision structure. Can be used for both classification and regression tasks. 

Random Forest (Ensemble Method) — an ensemble ML method that builds many decision trees (on random subsets of data and/or features) and aggregates their predictions (averaging for regression, voting for classification). Helps improve accuracy and reduce overfitting compared to a single tree. 
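
A short scikit-learn sketch training a forest on synthetic data; the settings are defaults or arbitrary choices, not recommendations:

```python
# Train 100 trees on random subsets and aggregate their votes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(forest.predict(X[:3]))        # class predicted by majority vote across the trees
print(forest.feature_importances_)  # which features the trees relied on most
```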

Gradient Boosting / Boosting (e.g. XGBoost, GBM) — technique that builds models sequentially: each new model tries to correct the mistakes of the previous ones. Very powerful for tabular data prediction tasks. 
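
A short sketch using scikit-learn's GradientBoostingClassifier on synthetic data; the learning rate and tree count are arbitrary example values:

```python
# Each shallow tree is fitted to the errors left by the trees built before it.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

boosted = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=2, random_state=42)
boosted.fit(X, y)
print(boosted.score(X, y))  # accuracy on the training data (optimistic; see cross-validation below)
```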

Clustering — a set of unsupervised learning methods where the goal is to group similar data points together (clusters) based on their features. Useful for segmentation, anomaly detection, and exploration when no labeled outcomes are available.
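
A minimal k-means sketch with scikit-learn on synthetic blobs; choosing k = 3 here is an assumption, and in practice you would need to justify it:

```python
# Group unlabeled points into 3 clusters; no target variable is used.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # the generated labels are ignored

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])      # cluster assignment for the first ten points
print(kmeans.cluster_centers_)  # coordinates of the three cluster centers
```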

Evaluation Metrics (Accuracy, AUC, ROC, F1-Score, etc.) — quantitative measures used to assess how good a model’s predictions are. Accuracy is common for classification (correct predictions / total predictions), but depending on the problem you may prefer precision, recall, AUC-ROC, etc.
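
A short scikit-learn sketch computing several of these metrics on invented predictions:

```python
# Compare hard predictions (accuracy, precision, recall, F1) and scores (ROC AUC) against the true labels.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard class predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # predicted probabilities for class 1

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))               # AUC uses scores, not hard labels
```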

Cross-Validation — a technique to evaluate model performance by splitting data into training/validation (and sometimes test) sets multiple times to ensure performance is stable and generalizes beyond a single split. 
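
A minimal 5-fold cross-validation sketch with scikit-learn on synthetic data:

```python
# Five different train/validation splits produce five scores instead of one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # a more stable estimate than any single split
```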

Overfitting / Underfitting — overfitting: when a model learns noise or idiosyncrasies of the training data and fails to generalize to new data. Underfitting: when a model is too simple to capture underlying patterns. Balancing complexity and generalization is key. 

Bias-Variance Tradeoff — reflects the tension between underfitting (high bias) and overfitting (high variance). A good model minimizes both via regularization, proper validation, etc. 

Hyperparameter — settings or parameters of a model defined before training (e.g. number of trees in random forest, learning rate in boosting, depth of tree). They influence how the model learns and generalizes. 
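
A small sketch of a grid search over two hyperparameters with scikit-learn; the grid values are arbitrary examples:

```python
# Try every combination in the grid with cross-validation and keep the best one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)  # the combination that scored best in cross-validation
```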

Training / Test Set — in supervised learning, data is split: a training set to build (fit) the model and a test set to evaluate how well it performs on unseen data. Sometimes a separate validation set is used for tuning. This split helps avoid overfitting.
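
A minimal scikit-learn sketch holding out 20% of a synthetic dataset as a test set:

```python
# Hold out 20% of the rows; the model is scored only on data it never saw during fitting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on unseen data
```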

Data Visualization — graphical representation of data (charts, plots, dashboards) to help humans interpret results: distributions, trends, outliers, relationships. Fundamental for presenting insights to non-technical stakeholders. 
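
A brief matplotlib sketch of two basic charts, a histogram and a scatter plot, on invented data:

```python
# A histogram shows a distribution; a scatter plot shows the relationship between two variables.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
spend = rng.normal(100, 20, size=200)               # invented customer spend
visits = spend * 0.05 + rng.normal(0, 1, size=200)  # invented, loosely related variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(spend, bins=20)
ax1.set_title("Distribution of spend")
ax2.scatter(spend, visits, s=10)
ax2.set_title("Spend vs. visits")
plt.tight_layout()
plt.show()
```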

Data Warehouse / Data Lake — architectures for storing large volumes of structured (warehouse) or unstructured/semi-structured (lake) data, often as part of a big-data or analytics infrastructure. 

ETL / ELT — Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT): processes to move data from source systems, clean/transform it, and store it for analysis. Fundamental in data pipelines. 

Data Governance — policies, roles, procedures that ensure data quality, privacy, security, and proper usage across an organization. Key to reliable analytics and compliance. 

Python — one of the leading programming languages in data science and machine learning, thanks to its readability, large ecosystem (pandas, NumPy, scikit-learn, TensorFlow, etc.), and strong community support.
