ML - Lecture Notes Summary


  • Lecture Info

  • Objectives of ML

    • Example: Text Recognition

    • The ML Process

  • Types of Problems

    • Supervised Learning

    • Unsupervised Learning

    • Reinforcement Learning

  • Lecture Info

  • Problem Description

  • Example

  • Linear Models

  • Least Squares

  • Overfitting

  • Evaluating the Generalization

  • Regularization

  • Lecture Info

  • Discrete Random Variables

  • Continuous Random Variables

  • Expectation

  • Variance

  • Probability Distributions

  • Multivariate Distributions

    • Covariance

    • Random Vectors

    • Covariance Matrix

    • Correlation Matrix

    • Multinomial Distribution

    • Dirichlet Distribution

  • Gaussian Distribution

    • Univariate

    • Multivariate

      • Covariance Matrix

      • Spectral Properties of \(\Sigma\)

      • Linear Transformation

      • Marginal and conditional

      • Bayes formula

  • Lecture Info

  • Frequentists vs Bayesians

  • Bayesian Inference

  • Conjugate Distributions

    • Beta-Bernoulli

    • Beta-Binomial

    • Dirichlet-Multinomial

  • Text modeling

  • Lecture Info

  • Definition of Entropy

  • Properties

  • Conditional Entropy

  • KL Divergence

    • Convexity

    • Jensen's Inequality

    • Applying KL

  • Mutual Information

  • Lecture Info

  • Model Inference

    • Subproblems

  • Bayesian Learning

    • Over Model Space

    • Over Parameters Space

  • Point Estimate

    • Maximum Likelihood Estimate

      • Example (Bernoulli):

      • ML and Overfitting

    • Maximum a Posteriori Estimate

      • Example (Beta-Bernoulli)

      • Problems with MAP

  • Bayesian Estimate

  • Model Selection

    • Validation Process

      • Test set/Training set

      • Cross validation

    • Information Measures

      • Akaike Information Criterion (AIC)

      • Bayesian Information Criterion (BIC)

  • Lecture Info

  • Fitting in Terms of Probability

  • Frequentist Approach

  • Bayesian Approach

  • Bayesian Fitting

    • When Everything is Gaussian

  • Lecture Info

  • Linear Models

    • Basis Functions

      • Examples

      • Training Set

  • ML and Least Squares

    • Maximizing the Likelihood

    • Geometric Interpretation

    • Regularized Least Squares

      • Basic Form

      • General Form

  • Gradient Descent

  • Kernel Equivalent

    • Dual/Primal Formulation

    • Kernel as Distance Function

  • Lecture Info

  • What is LC?

  • Difference with Regression

  • Approaches to LC

  • Generalized Linear Models

  • Linear Discriminant Functions

    • Binary Classification

    • Multiclass Classification

      • First Approach

      • Second Approach

      • Third Approach

    • Generalized Discriminant Functions

  • Lecture Info

  • LC Using Regression

    • Why it Makes Sense?

    • Learning Functions

      • Coefficient Matrix

      • Prediction Matrix

      • Residual Matrix

      • Closed Form Solution

    • Considerations

  • Fisher Linear Discriminant

    • Example

    • Basic Approach

      • Measuring Separation

      • Deriving the Direction

      • Solution

    • Refinement

      • Measuring Separation

      • Formula for Within-Class Variance

      • Fisher Criterion

      • Deriving the Direction

      • Solution

    • Deriving a Threshold

  • Perceptron

    • Problems

    • Definition

    • Cost Function

    • Gradient Optimization

      • Basic Gradient Descent

      • Stochastic Gradient Descent

    • Convergence Theorem

    • Structure

  • Lecture Info

  • Naive Bayes Classifiers

    • Language Models

    • Bayesian Classifiers

      • Computing \(P(C_k)\)

      • Computing \(P(d|C_k)\)

  • Generative Models

    • Binary Case (Sigmoid)

    • General Case (Softmax)

  • Gaussian Discriminant Analysis

    • Same Covariance Matrix

      • Binary Case

      • Discriminant Function

      • Multiple Classes

      • Decision Boundaries

    • Different Covariance Matrices

    • Estimation with ML

      • Estimating \(\pi\)

      • Estimating \(\mu_1, \mu_2\)

      • Estimating \(\mathbf{\Sigma}\)

  • Lecture Info

  • With Discrete Features

    • Exponential Family

  • Generalized Linear Models

    • Hypothesis

    • GLM and Normal

    • GLM and Bernoulli

    • GLM and Categorical

    • Additional Regressions

      • Poisson

      • Exponential

  • Lecture Info

  • Discriminative Approach

  • Logistic Regression

    • Degrees of Freedom

    • ML Estimation

    • Gradient Ascent

  • Newton-Raphson Method

    • Linear Regression

    • Logistic Regression

  • Iterated Reweighted Least Squares

  • Lecture Info

  • Logistic Regression and GDA

  • Softmax Regression

    • Computing the Gradient

  • Probit Regression

    • Stochastic Threshold Model

    • Probit Activation Function

  • Bayesian Logistic Regression

    • Computing Posterior

    • Managing Intractability

  • Lecture Info

  • The Basic Problem

  • Sampling General Distributions

    • Easy Case

      • Example 1: Exponential

    • Rejection Sampling

    • Importance Sampling

  • Markov Chain Monte Carlo

    • Markov Chains

    • MCMC Idea

    • How to Use it

  • Metropolis Algorithm

    • Existence of Stationary Distribution

    • Uniqueness of Stationary Distribution

    • Why it Works

  • Metropolis-Hastings Algorithm

  • Gibbs Sampling

  • MCMC and Bayesian Models

    • Sampling the Evidence

    • Sampling the Predictive Distribution

  • Lecture Info

  • Parametric Approach

    • Estimating Parameters

      • Maximum Likelihood (ML)

      • Maximum a Posteriori (MAP)

    • Bayesian Approach

  • Non Parametric Approach

  • Histograms

  • Kernel Density Estimators

    • Parzen Windows

      • Drawbacks

    • Smooth Kernel Functions

      • Gaussian Kernel Examples

    • Classification with Parzen Windows

  • K-Nearest Neighbors (kNN)

    • Classification with kNN

    • Performance

  • Lecture Info

  • Nadaraya-Watson Model

  • Locally Weighted Regression

  • Local Logistic Regression

  • Lecture Info

  • Partitions of Gaussian Distributions

    • Marginal density

    • Conditional Density

  • Distributions over Functions (Finite Domains)

    • Gaussian Distributions

  • Gaussian Processes (Infinite Domains)

    • Sampling from GP

    • RBF Kernel

  • Gaussian Process Regression

    • Estimating Kernel Parameters

  • Lecture Info

  • Main Idea

  • Binary Classifiers

  • Optimal Margin Classifiers

    • Functional Margins

    • Geometric Margin

      • Maximum Margin Hyperplanes

    • Classification Details

    • Computing The Optimal Margin

    • Duality

  • Lecture Info

  • Recap

  • Lagrangian Method

    • Karush-Kuhn-Tucker Theorem

    • Applying Karush-Kuhn-Tucker Theorem

    • Defining the Dual Problem

    • Solving the Dual Problem

  • Classification with SVM

  • Non Separability Case

    • Slack Variables

    • KKT Conditions

    • Dual Formulation

    • Item Characterization

    • Classification

  • Extensions

  • Lecture Info

  • Computational Issues

  • Loss Functions in Classification

    • 0/1 Loss

      • Regularization

      • Problems

    • Surrogate Smooth Loss Functions

    • Convex Loss Functions

    • Convex Surrogate Loss Functions

    • Hinge Loss

      • Subgradient

      • Perceptron

    • Logistic Loss

  • Regularization Terms

  • SVM and Gradient Descent

  • Lecture Info

  • Kernel Functions

    • Why Kernels?

    • Definition

    • Verification

    • Construction

    • Relevant Kernels

  • Kernels and SGD in SVM

  • Lecture Info

  • Multilayer Networks

  • Multi-Layered Perceptron

    • First Layer

    • Second Layer

    • Inner Layers

    • Output Layer

  • 3-Layer Networks

  • Lecture Info

  • Approximating Functions

  • Training with ML

    • Regression

    • Binary Classification

    • Multiclass Classification

  • Iterative Methods to Minimize Loss

    • Gradient Descent

    • On-line (stochastic) Gradient Descent

    • Batch Gradient Descent

    • Backpropagation

      • Example: 3-layered network

      • Computational Efficiency

  • Lecture Info

  • Deep Networks

    • Types

    • Learning

    • Loss Functions

    • Regularization

    • Vanishing Gradient

    • Exploding Gradient

  • Lecture Info

  • Convolutional Neural Networks

    • MLP Problems with Images

    • Convolution Operation

    • Local Connections

    • ConvNet Structure

    • Types of Layers

    • Example: ConvNet for CIFAR-10

  • Convolutional Layer

    • Depth and Stride

    • Connections Between Layers

    • Real-World Example

    • Summary

  • Pooling Layer

  • ReLU Layer

  • Layer Patterns

  • Case Studies

  • Lecture Info

  • Recurrent Neural Networks

    • Sequential Data

    • RNN Network Structure

      • Computing Recurrent States

      • Computing Output Value

      • Folded/Unfolded Representations

    • Learning

  • LSTM Networks

    • First Version

      • Input Layer

      • Output Layer

    • With Forget Layer

    • Variants

      • Peephole

      • GRU

  • Topologies

  • Lecture Info

  • Structure

  • Usage

  • Example: Iris Dataset

  • Pros/Cons

  • Construction

    • Partitioning

    • Impurity Measure

    • Goodness of split

    • Gini Index

    • Entropy as Impurity Measure

    • Other Measures of Impurity

    • When to Stop

    • Pruning

  • Lecture Info

  • Ensemble Methods

  • Bagging

    • Bootstrap Sample

    • Usage

    • Variant

    • Why it Works?

      • Classification

      • Regression

    • Out-of-Bag Error

    • Random Forest

  • Boosting

    • Adaboost

      • Binary Classification

      • Example

    • Additive Models

      • Fitting Additive Models

      • Forward Stagewise Additive Modeling

      • Adaboost as Additive Model

    • Gradient Boosting

      • Link to Gradient Descent

      • Algorithm

      • Regression

      • Classification

  • Lecture Info

  • Curse of Dimensionality

  • Dimensionality Reduction

  • PCA

    • Case \(d^{'} = 0\)

    • Case \(d^{'} = 1\)

      • Best way to project

      • Best direction

    • Case \(d^{'} > 1\)

    • Example

    • Choosing \(d^{'}\)

  • Lecture Info

  • SVD

    • Why it Exists?

  • PCA and SVD

  • Co-occurrence data

  • Latent Semantic Analysis

    • Assumption

    • Model

    • Problems

    • Solution (SVD)

    • Interpretation

  • Lecture Info

  • Clustering

    • Types

  • Partitional Clustering

    • Brute Force

    • Clustering Cost (Sum of Squares)

    • K-Means

      • Algorithm

      • Example

      • How to Choose \(K\)

  • Hierarchical Clustering

    • By Aggregation

    • Dendrogram

    • Cluster Similarity

  • Lecture Info

  • Mixtures of distributions

    • Example

  • Mixture Parameters Estimation

    • With Respect to \(\pi\)

    • With Respect to \(\lambda\)

    • Combining

    • With Respect to \(\theta\)

    • Analytical Intractability

  • Mixtures as Generative Process

    • Example

    • Probabilistic Clustering

    • Distributions with latent variables

  • Mixtures of Gaussian Distribution

  • Lecture Info

  • Complete Dataset

  • Gaussian Mixtures

    • With Complete Dataset

    • Log-Likelihood of Complete Dataset

    • Dealing with Latent Variables

      • M-step

      • E-Step

    • Expectation Maximization Algorithm