Machine learning fundamentals

Editorial Board Member: Lewis A. Hassell, M.D.
Author: Jerome Cheng, M.D.

Last author update: 12 April 2024
Last staff update: 12 April 2024

Copyright: 2003-2024, PathologyOutlines.com, Inc.

PubMed Search: Machine learning fundamentals

Cite this page: Cheng J. Machine learning fundamentals. PathologyOutlines.com website. https://www.pathologyoutlines.com/topic/informaticsmachinelearn.html. Accessed November 27th, 2024.
Definition / general
  • Science of using computer algorithms to learn from patterns present in data and make predictions based on the learned patterns
Essential features
  • Machine learning algorithm is used to create a model from a dataset; the model is then used to make predictions
  • Machine learning model can be built without any knowledge of computer programming; for beginners, Orange software is a good starting point (GitHub: Orange - Interactive Data Analysis [Accessed 3 April 2024])
  • Applications in pathology / laboratory medicine include molecular subtyping of cancer, image recognition / segmentation and identification of lesions in digital slides, digital slide stain normalization, information extraction from pathology reports
  • Publicly available datasets applicable to machine learning can be found on the internet
Terminology
  • AutoML (automated machine learning): tools / software libraries that build, choose and optimize machine learning models with minimal user input; in some cases, the user only needs to provide the dataset and designate the target feature and the software will start building machine learning models
  • Deep learning: refers to neural networks with many layers; transformers and convolutional neural networks fall under this category due to their large number of layers (e.g., convolution or pooling layers)
    • Other types of neural networks with multiple layers (such as artificial neural networks with several hidden layers) also fall under this classification
  • Overfitting: the machine learning model memorizes the training data and corresponding outcome it was given and performs poorly on data it has never seen before
  • Supervised machine learning: discovers relationships between feature variables and the target feature (label); the label has to be provided before the machine learning model can be trained to build a predictive model
  • Target / outcome variable (label): value predicted based on the values of other variables in a dataset; it is analogous to the dependent variable in statistics
    • Features that contribute towards making the prediction are equivalent to the independent variables in statistics
    • Feature and variable are terms that are used interchangeably
  • Unsupervised machine learning: unlike supervised machine learning, the data label is not needed to find patterns in a dataset (e.g., similar sets of data points forming unique clusters in a t-SNE plot or groups formed through k-means clustering)
Machine learning algorithms
  • Linear regression
    • Given a set of paired x and y values, it finds the line of best fit, minimizing the overall distance between the line and the points
    • Used in laboratories to validate a new method for a particular test by comparing the results between the new method and the reference method (Clin Chem 1998;44:2340)
    • Various software tools such as R and Python with scikit-learn simplify the process by performing all the necessary calculations
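    • A minimal sketch of such a method comparison using scikit-learn's LinearRegression; the paired measurements are hypothetical, illustrative values only

      import numpy as np
      from sklearn.linear_model import LinearRegression

      reference = np.array([4.2, 5.1, 6.3, 7.8, 9.0, 11.4]).reshape(-1, 1)  # reference method (x)
      new_method = np.array([4.0, 5.3, 6.1, 8.0, 9.2, 11.1])                # new method (y)

      model = LinearRegression().fit(reference, new_method)
      print(f"slope = {model.coef_[0]:.3f}, intercept = {model.intercept_:.3f}")
      # A slope near 1 and an intercept near 0 suggest good agreement between the methods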
  • Logistic regression (Ann Biol Clin (Paris) 1994;52:277)
    • Statistical method for solving classification problems
    • Equation based on the sigmoid function models the relationship between the target variable and 1 or more predictor variables
    • Predicts the probability of an outcome occurring
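    • A minimal sketch using scikit-learn's built in breast cancer dataset; scaling the features first (an added preprocessing choice) helps the solver converge

      from sklearn.datasets import load_breast_cancer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      clf = make_pipeline(StandardScaler(), LogisticRegression())
      clf.fit(X_train, y_train)
      print(clf.predict_proba(X_test[:3]))           # sigmoid derived class probabilities
      print("accuracy:", clf.score(X_test, y_test))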
  • Naïve Bayes
    • Based on Bayes theorem, with the assumption that the variables involved contribute independently to the outcome variable; this assumption may be wrong, hence the description naïve
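    • A minimal sketch of Gaussian naïve Bayes on scikit-learn's built in iris dataset

      from sklearn.datasets import load_iris
      from sklearn.model_selection import train_test_split
      from sklearn.naive_bayes import GaussianNB

      X, y = load_iris(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      nb = GaussianNB().fit(X_train, y_train)        # each feature treated as an independent contributor
      print("accuracy:", nb.score(X_test, y_test))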
  • Decision trees
    • Optimal decision tree is constructed to fit the dataset
    • Each node in the tree consists of a feature variable and each node splits into branches based on the value of the variable
    • End of a branch is the value predicted for the target variable
    • Outside of machine learning, decision trees are often used in diagnosis and treatment guidelines
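    • A minimal sketch that fits a shallow decision tree and prints its nodes and branches, mirroring the feature based splits described above

      from sklearn.datasets import load_iris
      from sklearn.tree import DecisionTreeClassifier, export_text

      X, y = load_iris(return_X_y=True)
      tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
      print(export_text(tree))   # each node tests a feature; each leaf holds a predicted value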
  • Random forest
    • Versatile machine learning method
    • Results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables
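    • A minimal sketch of a random forest classifier pooling 200 trees (the tree count is an illustrative choice)

      from sklearn.datasets import load_breast_cancer
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import train_test_split

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
      print("accuracy:", rf.score(X_test, y_test))   # final prediction pools votes across all trees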
  • Gradient boosted trees
    • Like random forests, it combines results from multiple decision trees; however, the trees are built sequentially, with each new tree correcting errors made by the trees before it
    • Applied to nonimage (tabular) datasets, it has often achieved better predictive accuracy than random forests and other machine learning methods
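    • A minimal sketch using scikit-learn's GradientBoostingClassifier (libraries such as XGBoost and LightGBM are common alternatives)

      from sklearn.datasets import load_breast_cancer
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.model_selection import train_test_split

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
      print("accuracy:", gb.score(X_test, y_test))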
  • Support vector machines
    • Finds the line, plane or hyperplane that best separates the points in a dataset into classes
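    • A minimal sketch of a support vector classifier; feature scaling (an added preprocessing choice) is generally advisable for SVMs

      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import train_test_split
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.svm import SVC

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      svm = make_pipeline(StandardScaler(), SVC(kernel="linear")).fit(X_train, y_train)
      print("accuracy:", svm.score(X_test, y_test))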
  • K-means clustering
    • Unsupervised machine learning algorithm that groups similar sets of data together
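    • A minimal sketch that clusters the iris measurements into 3 groups without using the species labels

      from sklearn.cluster import KMeans
      from sklearn.datasets import load_iris

      X, _ = load_iris(return_X_y=True)              # labels deliberately ignored
      km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
      print(km.labels_[:10])                         # cluster assignment for the first 10 samples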
  • Dimensional reduction methods
    • E.g., PCA, t-SNE, autoencoders
    • Reduces number of feature dimensions for easier interpretation or visualization
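    • A minimal sketch that reduces the 30 breast cancer features to 2 principal components for plotting

      from sklearn.datasets import load_breast_cancer
      from sklearn.decomposition import PCA

      X, _ = load_breast_cancer(return_X_y=True)
      pca = PCA(n_components=2)
      X2 = pca.fit_transform(X)                      # shape: (n_samples, 2)
      print("explained variance ratio:", pca.explained_variance_ratio_)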
  • Neural networks (Psychol Rev 1958;65:386)
    • Inspired by interconnections between neurons in biological neural networks
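    • A minimal sketch of a small feedforward network (multilayer perceptron) with 2 hidden layers; layer sizes are illustrative choices

      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import train_test_split
      from sklearn.neural_network import MLPClassifier
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      net = make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0))
      net.fit(X_train, y_train)                      # weights and biases adjusted by backpropagation
      print("accuracy:", net.score(X_test, y_test))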
  • Convolutional neural network
    • Type of neural network often used for image classification (e.g., cancer subtype or benign versus malignant) (Urol Clin North Am 2024;51:15)
    • Also used for semantic segmentation, object detection with classification, generation of fake images and style transfer (generative adversarial network), natural language processing
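    • A minimal, untrained sketch in PyTorch of a small convolutional network for 64 x 64 pixel, 3 channel image patches classified as benign versus malignant; all layer sizes are illustrative assumptions

      import torch
      import torch.nn as nn

      class SmallCNN(nn.Module):
          def __init__(self, n_classes=2):
              super().__init__()
              self.features = nn.Sequential(
                  nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                  nn.MaxPool2d(2),                   # 64 x 64 -> 32 x 32
                  nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                  nn.MaxPool2d(2),                   # 32 x 32 -> 16 x 16
              )
              self.classifier = nn.Linear(32 * 16 * 16, n_classes)

          def forward(self, x):
              return self.classifier(self.features(x).flatten(1))

      model = SmallCNN()
      logits = model(torch.rand(4, 3, 64, 64))       # a batch of 4 random "patches"
      print(logits.shape)                            # torch.Size([4, 2])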
  • Generative adversarial network
    • Involves 2 neural networks
      • Generator: creates fake data
      • Discriminator: distinguishes real from fake data
    • Goal of training is for the generator to become better at creating images that look real to the discriminator
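    • A minimal sketch in PyTorch of the adversarial training loop on toy 1 dimensional "real" data drawn from a normal distribution; network sizes, learning rates and step count are illustrative assumptions

      import torch
      import torch.nn as nn

      G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
      D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
      opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
      opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
      bce = nn.BCELoss()

      for step in range(2000):
          real = 4 + 1.25 * torch.randn(64, 1)       # "real" samples
          fake = G(torch.randn(64, 8))               # generator output

          # Discriminator learns to label real as 1 and fake as 0
          d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
          opt_d.zero_grad(); d_loss.backward(); opt_d.step()

          # Generator learns to make the discriminator output 1 for its fakes
          g_loss = bce(D(fake), torch.ones(64, 1))
          opt_g.zero_grad(); g_loss.backward(); opt_g.step()

      print(G(torch.randn(5, 8)).detach().squeeze()) # samples should drift toward the real distribution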
  • Autoencoders
    • Type of neural network that transforms an input into an intermediate representation, from which the original input is recreated
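    • A minimal sketch in PyTorch that compresses 64 dimensional inputs to a 4 dimensional intermediate representation and reconstructs them; all dimensions are illustrative assumptions

      import torch
      import torch.nn as nn

      encoder = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 4))
      decoder = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 64))
      opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
      mse = nn.MSELoss()

      x = torch.rand(256, 64)                        # hypothetical data
      for epoch in range(200):
          recon = decoder(encoder(x))                # encode, then reconstruct
          loss = mse(recon, x)                       # reconstruction error
          opt.zero_grad(); loss.backward(); opt.step()
      print("final reconstruction loss:", loss.item())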
  • Transformers
    • Neural network architecture built around the attention mechanism, which weighs the relationships among all elements of an input sequence; the basis of modern large language models
  • Natural language processing
    • Enables computers to analyze and extract information from human language (e.g., information extraction from pathology reports)
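    • A minimal sketch applying a pretrained transformer to text with the Hugging Face transformers library (an assumed third party install; the first run downloads a default pretrained model, so an internet connection is assumed)

      from transformers import pipeline

      classifier = pipeline("sentiment-analysis")
      print(classifier("The tumor cells are strongly positive for keratin."))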
Steps in building a machine learning model
  • Data collection and preparation
    • Usually the most time consuming process in building models
    • Data can come from a variety of sources including questionnaires, internet searches, databases and images
    • CSV (comma separated values) file format, as well as other spreadsheet style formats are commonly used for training machine learning models; each feature is designated by a column and each row represents a record
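    • A minimal sketch of loading such a file with pandas; the file name and column names are hypothetical

      import pandas as pd

      df = pd.read_csv("cases.csv")                  # hypothetical CSV: 1 row per record
      X = df.drop(columns=["diagnosis"])             # feature columns
      y = df["diagnosis"]                            # target feature (label)
      print(df.head())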
  • Choose a programming language or machine learning platform
    • Python is currently the most popular programming language used in machine learning; the Python scikit-learn library includes many commonly used machine learning algorithms
    • Orange and Knime are open source GUI (graphical user interface) based machine learning platforms; these are easy to use and require no programming experience
  • Choose a machine learning algorithm
  • Set the hyperparameters for the model
  • Split the data into training, validation and holdout sets (see the sketch after this list)
    • Validation set
      • During the process of training, the model is tested on the validation set to assess its performance (e.g., accuracy or AUC ROC); hyperparameters of the model may be altered during training process to improve the performance of the model
        • Training accuracy much higher than the validation accuracy is a sign of overfitting
      • Also referred to as the development set
    • Holdout dataset
      • Used to assess the performance of the final model on unseen data after it has been fully trained; unlike the validation set, it is never used during training or tuning and should give a better measure of performance
    • For smaller datasets, the holdout set may be omitted and model accuracy can be assessed using cross validation methods
    • Commonly used ratios for training, validation and holdout sets
      • 70/15/15
      • 60/20/20
      • Very large datasets can have a larger proportion of the data in the training set (e.g., 80/10/10 or 90/5/5)
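    • A minimal sketch of a 70/15/15 split via 2 successive calls to scikit-learn's train_test_split

      from sklearn.datasets import load_breast_cancer
      from sklearn.model_selection import train_test_split

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
      X_val, X_hold, y_val, y_hold = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)
      print(len(X_train), len(X_val), len(X_hold))   # roughly 70% / 15% / 15%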
  • Train and test the model
    • Test the performance of the model on the validation set
    • Commonly used metrics
      • Accuracy
      • Log loss
      • AUC (area under ROC curve)
      • Precision
      • Recall
      • F1 score
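    • A minimal sketch computing the metrics above on a validation set with scikit-learn's metrics module

      from sklearn.datasets import load_breast_cancer
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                                   precision_score, recall_score, roc_auc_score)
      from sklearn.model_selection import train_test_split

      X, y = load_breast_cancer(return_X_y=True)
      X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
      clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

      pred = clf.predict(X_val)
      prob = clf.predict_proba(X_val)[:, 1]          # probability of the positive class
      print("accuracy :", accuracy_score(y_val, pred))
      print("log loss :", log_loss(y_val, prob))
      print("AUC      :", roc_auc_score(y_val, prob))
      print("precision:", precision_score(y_val, pred))
      print("recall   :", recall_score(y_val, pred))
      print("F1       :", f1_score(y_val, pred))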
    • For smaller sample sizes
      • K-fold cross validation: entire dataset is subdivided into K subsets; each subset acts as the validation / test set once and the rest of the data are used for training the model; model performance is assessed by averaging the results attained from each subset
      • Leave-one-out cross validation: model is trained using the entire dataset except for one data point and the model is validated with the data point that was left out; process is repeated until every data point has been used as the test data point
      • Random stratification: a set percentage of the dataset is randomly assigned to the training set and the remainder becomes the test set
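      • A minimal sketch of K-fold (K = 5) and leave one out cross validation with scikit-learn

        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import LeaveOneOut, cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = load_iris(return_X_y=True)
        clf = make_pipeline(StandardScaler(), LogisticRegression())

        print("5 fold mean accuracy:", cross_val_score(clf, X, y, cv=5).mean())
        print("leave one out mean accuracy:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())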
  • Optimize the model
    • Fine tune the model parameters
    • Add new data to the dataset
    • Change the features in the dataset
    • Add additional features
    • Image augmentation (for convolutional neural networks)
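    • A minimal sketch of an augmentation pipeline with torchvision (an assumed companion install to PyTorch); the specific transforms and parameters are illustrative choices

      from torchvision import transforms

      augment = transforms.Compose([
          transforms.RandomHorizontalFlip(),
          transforms.RandomVerticalFlip(),
          transforms.RandomRotation(degrees=15),
          transforms.ColorJitter(brightness=0.1, contrast=0.1),
          transforms.ToTensor(),                     # PIL image -> tensor for the network
      ])
      # tensor = augment(pil_image)  # yields a slightly different version of the patch each epoch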
  • Test the performance of the final model on the holdout set
    • Gives a better measure of real world model performance
Making your first model
  • Creating machine learning models with GUI based platforms like Orange is quick and easy
  • Download and install Orange through the following link (GitHub: Orange - Interactive Data Analysis [Accessed 3 April 2024])
  • Launch the program and a welcome screen will appear
  • Select the New option
  • In Orange, various tasks are done through widgets, which are represented by various icons on the left side of the user interface
  • Click on the Datasets widget found on the left side of the screen and drag it into the canvas (the empty portion of the screen on the right side); double click on it and choose a dataset from the list that appears; click on the Send Data button
  • Add the Test & Score widget to the canvas
  • Add the Random Forest widget to the canvas
  • A different machine learning algorithm can be chosen instead of Random Forest
  • Connect the Test & Score widget to the Datasets widget by clicking on the Test & Score widget and dragging the line that appears to the Datasets widget
  • Connect the Random Forest widget to Test & Score
  • Canvas should now show the Datasets and Random Forest widgets each connected to Test & Score
  • Double click on Test & Score and the performance of the newly built model will be displayed
Board review style question #1

Which machine learning algorithm predicts an outcome by combining results from multiple decision trees?

  A. Linear regression
  B. Neural networks
  C. Random forest
  D. Support vector machines
Board review style answer #1
C. Random forest. In random forest, results from multiple decision trees are pooled together to arrive at the final prediction; each tree is generated using a random subset of the input dataset along with a random subset of the feature variables. Answer A is incorrect because linear regression estimates the relationship between 2 variables using the line of best fit. Answer B is incorrect because neural networks do not use decision trees. Answer D is incorrect because support vector machines separate points in a dataset using lines, planes and hyperplanes.

Board review style question #2
Which of the following looks for the best fitting straight line that goes through a set of X and Y points?

  A. Convolutional neural networks
  B. Linear regression
  C. Neural networks
  D. Random forest
Board review style answer #2
B. Linear regression. In linear regression, an entire set of X and Y points is used to arrive at the linear equation y = bx + a, where b is the slope and a is a constant. Linear regression can be used for validation of laboratory test results. Answer A is incorrect because a convolutional neural network is a type of neural network often used for image classification. Answer C is incorrect because neural networks optimize model performance by adjusting weights and biases. Answer D is incorrect because random forest models combine results from multiple trees.
