P19. A simple method for estimation of optimal dimensionality in PLS models based on regression vector features

A. A. Gowen1, G. Downey2, C. Esquerre2, C. P. O'Donnell1

1School of Agriculture, Veterinary Medicine and Food Science, UCD, Ireland

2Teagasc, Ashtown Food Research Centre, Ireland

PLS regression (PLSR) models are prone to overfitting when redundant latent variables are included that model the noise inherent in a given dataset. This can impair the model's predictive ability on future datasets. Numerous methods have been proposed to prevent overfitting in PLSR, including Wold's criterion, Monte Carlo cross-validation, smoothed partial least-squares regression [1] and the randomisation test [2]. One well-known sign of overfitting is the appearance of noise in regression vectors; this often takes the form of a reduction in apparent structure and the presence of sharp peaks with a high degree of directional oscillation, features that are currently assessed subjectively. We propose a simple method for objectively quantifying the shape of a regression vector; this measure can be combined with an indicator of model performance, such as root mean square error of cross-validation (RMSECV), to produce a new criterion for estimating PLS model dimensionality. The consistency of the new method is demonstrated on simulated and real datasets, and the method is compared with existing methods for estimating optimal model dimensionality.
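The abstract does not define the shape measure or how it is combined with RMSECV, so the following Python sketch is only an illustration of the general idea under stated assumptions, not the authors' method. It assumes a hypothetical roughness measure (total variation of the regression vector divided by its range, which grows as the vector oscillates) and a hypothetical criterion (the sum of min-max scaled roughness and min-max scaled RMSECV). The names `roughness` and `select_dimensionality` are introduced here for illustration; the regression vectors and RMSECV values are assumed to come from PLS models already fitted with 1, 2, 3, ... latent variables.

```python
from typing import List, Sequence


def roughness(b: Sequence[float]) -> float:
    """Hypothetical shape measure (an assumption, not the authors' definition).

    Total variation of the regression vector divided by its range.
    A monotone vector scores 1; oscillating vectors score higher.
    """
    value_range = max(b) - min(b)
    if value_range == 0:
        return 0.0  # a constant vector has no oscillation
    total_variation = sum(abs(b[i + 1] - b[i]) for i in range(len(b) - 1))
    return total_variation / value_range


def _minmax(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; return all zeros if they are all equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_dimensionality(reg_vectors: Sequence[Sequence[float]],
                          rmsecv: Sequence[float]) -> int:
    """Return the number of latent variables minimising a combined score.

    reg_vectors[k] and rmsecv[k] belong to the model with k + 1 latent
    variables. The score (scaled roughness + scaled RMSECV) is an
    illustrative choice, not the criterion proposed in the abstract.
    """
    if len(reg_vectors) != len(rmsecv):
        raise ValueError("need one RMSECV value per regression vector")
    rough = _minmax([roughness(b) for b in reg_vectors])
    err = _minmax(list(rmsecv))
    score = [r + e for r, e in zip(rough, err)]
    best = min(range(len(score)), key=score.__getitem__)
    return best + 1  # latent variable counts are 1-based


# Made-up regression vectors and RMSECV values for models with 1 to 3 latent variables.
vectors = [
    [0, 1, 2, 3, 2, 1, 0],     # smooth, but the model underfits
    [0, 2, 4, 5, 4, 2, 0],     # smooth, low error
    [0, 3, -2, 4, -3, 3, 0],   # slightly lower error, strongly oscillating
]
errors = [1.0, 0.5, 0.45]
chosen = select_dimensionality(vectors, errors)
```

With these made-up inputs, the third model has the lowest RMSECV but a much rougher regression vector, so the combined score favours the two-component model. Equal weighting of the two terms is an arbitrary choice made for this sketch.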

References:
1. S. Gourvenec, J. A. Fernandez Pierna, D. L. Massart, D. N. Rutledge, 2003. An evaluation of the PoLiSh smoothed regression and the Monte Carlo Cross-Validation for the determination of the complexity of a PLS model, Chemometrics and Intelligent Laboratory Systems, 68, 41-51.
2. M. P. Gomez-Carracedo, J. M. Andrade, D. N. Rutledge, N. M. Faber, 2007. Selecting the optimum number of partial least squares components for the calibration of attenuated total reflectance-mid-infrared spectra of undesigned kerosene samples, Analytica Chimica Acta, 585, 253-265.