iMESc help!
Introduction
Welcome to the iMESc help, your comprehensive guide to using iMESc: an interactive machine learning app designed to analyze environmental data. iMESc is a Shiny-based application that supports end-to-end machine learning workflows. It provides a wide range of resources to meet the various needs of environmental scientists, making it a versatile tool for data analysis.
This manual is organized into the following sections:
Setup: Step-by-step instructions to run iMESc.
Layout: The dashboard organization.
Widgets: The interactive widgets used in iMESc, such as buttons, dropdowns, and checkboxes, which enable seamless interactions with the app.
Datalist: Exploring the core concept of Datalists and their attributes.
Essentials of Building a Workflow in iMESc: Generic workflow steps commonly encountered while using iMESc.
Pre-processing Tools: In-depth coverage of the tools available for pre-processing data, including Datalist creation and data transformation.
Sidebar-menu: Details about the modules, analyses and algorithms, parametrization, and results.
Packages & Functions: Details about the main R packages and functions used in iMESc, along with their respective versions and analytical tasks.
1 Setup
Once R is installed, open RStudio;
Install the shiny package if it is not already installed;
Run the code below.
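For reference, a minimal launch sketch is shown below. The repository coordinates ("iMESc", "DaniloCVieira") are an assumption based on the app's public GitHub repository; verify them against the official installation instructions if the call fails.
# Minimal launch sketch (repository path assumed; check the official instructions)
if (!requireNamespace("shiny", quietly = TRUE)) install.packages("shiny")
shiny::runGitHub("iMESc", "DaniloCVieira")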
When you use the iMESc app for the first time, it will automatically install all the necessary packages, which may take several minutes to complete. Once this first installation is finished, subsequent launches of the app will be much faster: if the required packages are not yet loaded, they typically take a few seconds to load; if they are already loaded, iMESc starts almost instantly.
2 Layout
iMESc is designed with a dashboard layout, consisting of three main sections:
Pre-processing tools at the top left, containing widgets for Datalist options and data pre-processing.
Sidebar-menu on the left-hand side containing menu buttons.
The Main panel for viewing the analytical tasks.
Upon selecting a menu button, users will seamlessly navigate to a sub-screen housing the selected module. Each module features a header equipped with interactive widgets, along with multiple tab panels that support various functionalities.
To ensure an optimal display of iMESc content, we strongly recommend a minimum landscape resolution of 1377 x 768 pixels. Adhering to this resolution guarantees an enhanced user experience and proper visualization of all elements on the screen.
3 Widgets
The app is built using widgets: web elements that users interact with. The standard iMESc widgets are:
Widget | Task
---|---
Button | Performs an action when clicked
Picker/Dropdown | Allows the user to select only one of a predefined set of mutually exclusive options
Checkbox | Interactive box that can be toggled by the user to indicate an affirmative or negative choice
Checkbox group | A group of checkboxes
Radio buttons | Allows the user to choose only one of a predefined set of mutually exclusive options
File | A file upload control wizard
Numeric | A field to enter numbers
Text | A field to enter text
4 Datalist
iMESc manages data through Datalists (Fig. S4.1), which can include sheets and shapefiles (user-provided). The sheets are internally treated as data.frame objects in R, where rows represent observations and columns represent variables. Observations are matched among these attributes, regardless of the Datalists they come from, based on the row names, ensuring data consistency. To ensure proper handling of the sheet attributes, you must provide a unique ID and place it in the first column when uploading the data. iMESc automatically removes this first column and uses it as the row names of the respective attribute. Decimals in iMESc are represented with dots (e.g., 1/2 = 0.5); check this before uploading a file.
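For intuition, the snippet below sketches what this first-column convention amounts to in plain R. It is illustrative only, and "my_sheet.csv" is a hypothetical file name.
# Illustrative only: how a sheet's unique-ID column becomes row names
sheet <- read.csv("my_sheet.csv")      # "my_sheet.csv" is hypothetical
rownames(sheet) <- sheet[[1]]          # unique IDs from the first column
sheet <- sheet[, -1, drop = FALSE]     # the ID column itself is removed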
Required:
- Numeric-Attribute: Numerical data with continuous or discrete variables.

Optional:
- Factor-Attribute: Categorical data. If not provided, iMESc automatically generates this attribute as a sheet with a single column containing the IDs of the Numeric-Attribute.
- Coords-Attribute: Geographical coordinates (Longitude, Latitude) represented in decimal degrees, used for spatialization of data within the Spatial Tools.
- Base-Shape-Attribute: A polygon shape to clip or interpolate spatialized data for map generation.
- Layer-Shape-Attribute: Adds extra shapes for superimposition on maps.
- Extra-Shape-Attribute: Users can add other shapes as Extra-Shape-Attributes to further customize their maps.
Furthermore, Datalists have the capacity to store models trained using iMESc, permitting users to integrate and manage predictive models along with their datasets. To access all the available analyses in iMESc, you need to create a Datalist by either uploading your own data or using the example data provided. For guidance on uploading sheets and the required attribute formats, as well as how to use the example Datalists, please refer to the “Creating a Datalist” section.
5 Main Analyses
The table provided below serves as a convenient reference for the main analyses utilized in iMESc. It displays their locations in the sidebar menu, along with their abbreviations and corresponding packages. For more comprehensive information about each analysis, you can refer to their respective sections in this manual.
Sidebar menu | Analysis | Abbreviation | Package | Author
---|---|---|---|---
Descriptive Tools | Pearson's correlation | - | base | Pearson (1895)
 | Kendall's correlation | - | base | Kendall (1938)
 | Spearman's correlation | - | base | Spearman (1904)
 | Principal Component Analysis | PCA | base | Pearson (1901)
 | Nonmetric Multidimensional Scaling | MDS | vegan | Legendre & Anderson (1999)
 | Redundancy Analysis | RDA | vegan | Blanchet et al. (2008)
 | Piecewise Redundancy Analysis | pwRDA | segRDA | Vieira et al. (2019)
Spatial Tools | Kriging | | |
 | Inverse Distance Weighting | idw | - | Shepard (1968)
 | K-Nearest Neighbors | KNN | stats | Fix & Hodges (1951)
 | Support Vector Machine with Radial Basis Function Kernel | svmRadial | kernlab | Karatzoglou et al. (2004)
 | Gaussian Process with Radial Basis Function Kernel | gaussprRadial | kernlab | Karatzoglou et al. (2004)
 | Support Vector Machine with Radial Basis Function Kernel and Cost Parameter Optimization | svmRadialCost | kernlab | Karatzoglou et al. (2004)
Unsupervised Algorithms | Self-Organizing Maps | SOM | kohonen | Kohonen (1982)
 | Hierarchical Clustering | HC | factoextra | Sneath (1957)
 | Random Forest | rf | randomForest | Breiman (2001)
Supervised Algorithms | Stochastic Gradient Boosting | gbm | gbm | Friedman (2001)
 | Conditional Inference Random Forest | cforest | party | Hothorn et al. (2006)
 | Recursive Partitioning and Regression Trees | rpart | rpart | Breiman et al. (1984)
 | Tree Models from Genetic Algorithms | evtree | evtree | Grubinger et al. (2014)
 | Naive Bayes | nb | klaR | Duda et al. (2012)
 | K-Nearest Neighbors | knn | base | Fix & Hodges (1951)
 | Self-Organizing Maps | xyf | kohonen | Melssen et al. (2006)
 | Generalized Linear Model | glm | base | Nelder & Wedderburn (1972)
 | Gaussian Process with Radial Basis Function Kernel | gaussprRadial | kernlab | Karatzoglou et al. (2004)
 | Support Vector Machine with Linear Kernel | svmLinear | kernlab | Karatzoglou et al. (2004)
 | Support Vector Machine with Radial Basis Function Kernel | svmRadial | kernlab | Karatzoglou et al. (2004)
 | Support Vector Machine with Radial Basis Function Kernel and Cost Parameter Optimization | svmRadialCost | kernlab | Karatzoglou et al. (2004)
 | Stacked AutoEncoder Deep Neural Network | dnn | deepnet | Vincent et al. (2010)
 | Model Averaged Neural Network | avNNet | nnet | Ripley (1996)
 | Neural Network | nnet | nnet | Ripley (1996)
 | Neural Networks with Feature Extraction | pcaNNet | nnet | Ripley (1996)
 | Monotone Multi-Layer Perceptron Neural Network | monmlp | monmlp | Lang (2005)
 | Feature Selection using randomForest Genetic Algorithm | rfGA | caret | Kuhn (2008)
6 Essentials of Building a Workflow in iMESc
In this section, we will cover the four recurring steps to construct a workflow within iMESc:
6.1 Create Datalists Based on Model Specifications
- Use the pre-processing tools to create Datalists with their associated attributes, based on the chosen analytical method.
- For unsupervised methods, only the Numeric-Attribute (X) is needed. iMESc automatically recognizes the Numeric-Attribute associated with the Datalist as X.
- For supervised classification models, both the Numeric-Attribute (X) and the Factor-Attribute (Y) are needed. iMESc automatically recognizes the Numeric-Attribute and the Factor-Attribute associated with the Datalist as X and Y, respectively. X and Y can come from the same or different Datalists.
- For supervised regression models, X and Y are both Numeric-Attributes. iMESc automatically recognizes the Numeric-Attributes associated with the Datalist as both X and Y.
6.2 Pre-processing
- Use the pre-processing tools to handle missing values (Data Imputation tool).
- Transform the data as needed (e.g., scaling, centering, log transformations), especially for distance-based methods (e.g., PCA, SOM).
- Partition the data (based on Y) between training and testing for supervised machine learning methods. This action creates a column in the Factor-Attribute indicating the partitioning.
6.3 Save changes and models
Saving data changes or trained models is a recurring step throughout iMESc. Whenever saving is required, iMESc will indicate this with a flashing blue disc button.
Data changes can be saved as new Datalists or used to overwrite existing ones. Factor-, Coords-, and Shape-Attributes are automatically transferred to the new Datalist. Models previously saved in a Datalist are not transferred.
Trained models are saved within the Datalist used as the predictor (X). After training a model, users have the option to save it as a new model or overwrite an existing one. This action creates a new attribute within the Datalist (e.g., RF-Attribute for Random Forest models).
6.4 Loading and downloading a savepoint
Download a savepoint:
Open the pre-processing tools in iMESc.
Click the “Download” button in the “Create a savepoint” section.
The savepoint file (.rds) will be downloaded to your computer, capturing your workspace, including all Datalists and associated models.
Restore a savepoint:
Go to the pre-processing tools.
In the “Load a savepoint” section, use “Browse” to select the savepoint file on your computer.
Click “Upload” or “Load” to restore your workspace to that point.
Savepoints are incredibly useful for preserving your analysis progress and results. By downloading a savepoint, you can conveniently store your work; by uploading it later, you can seamlessly continue your analysis from where you left off. This ensures that your work remains intact even if you close the session or access iMESc from a different device, providing a reliable way to manage and preserve your analysis outputs and data for future use.
6.4.1 Extracting the Savepoint Results by R Code:
You can extract specific results and attributes from a savepoint using R code; iMESc itself is not required for this. To do so, use the following steps:
# Reading the Savepoint
savepoint <- readRDS("savepoint.rds")
savepoint$saved_data # To access all saved Datalists
names(savepoint$saved_data) # To access the names of the saved Datalists
# Accessing a specific Datalist named "datalist_name"
datalist_name <- savepoint$saved_data[['datalist_name']]
datalist_name
# Accessing specific attributes within the Datalist
attr(datalist_name, "factors") # To access the Factor-Attribute
attr(datalist_name, "coords") # To access the Coords-Attribute
attr(datalist_name, "base_shape") # To access the Base-Shape-Attribute
attr(datalist_name, "layer_shape") # To access the Layer-Shape-Attribute
attr(datalist_name, "extra_shape") # To access the Extra-Shapes-Attribute
# To extract saved models
attr(datalist_name, "som") # To access all SOM models saved in the Datalist
attr(datalist_name, "som")[["model_name"]] # To access a saved SOM model named 'model_name'
# To access other models, replace "som" with the corresponding model code:
# 'kmeans' (k-Means), 'nb' (Naive Bayes), 'svm' (Support Vector Machine), 'knn' (k-Nearest Neighbors),
# 'rf' (Random Forest), 'sgboost' (Stochastic Gradient Boosting), 'xyf' (supervised SOM).
Note: Ensure that you specify the correct path and filename of your savepoint file in the readRDS function. Modify “datalist_name” and “model_name” in the R code to access specific Datalists and saved models, respectively.
7 Pre-processing Tools
The Pre-processing Tools comprise a suite of functionalities for manipulating and preparing Datalists. These tools assist in refining the data, handling missing values, and generating custom palettes for graphical outputs. Below are the details of each tool:
7.1 Create a Datalist
To begin working with iMESc, you need to create a Datalist, which serves as the foundation for all analytical tasks. Click the “Create Datalist” button to open a modal dialog for Datalist creation. Every analytical task in iMESc requires a Datalist, which can be built from data uploaded by the user or generated using the example data.
7.1.1 Upload
Name the Datalist: Use the text widget to provide a name for the Datalist.
Numeric-Attribute: Upload a .csv or .xlsx file containing the numeric variables. This file is mandatory and should include observations as rows and variables as columns. The first row must contain the variable headings, and the first column should have observation labels. Columns containing characters (text or mixed numeric and non-numeric values) will be automatically transferred to the Factor-Attribute.
Factor-Attribute: Upload a .csv or .xlsx file containing categorical variables. This file should have observations as rows and categorical variables as columns. The first row must contain variable headings, and the first column should have observation labels. If the Factor-Attribute is not uploaded, the observation IDs will be used automatically. This attribute is crucial for labeling, grouping, and visualizing results based on factor levels. It can be replaced at any time with a new one using the “Replace Factor-Attribute” button.
Coords-Attribute: Upload a .csv or .xlsx file containing geographical coordinates. This file is optional for creating a Datalist but required for generating maps. The first column should contain the observation labels, the second column Longitude values, and the third column Latitude values (both in decimal degrees). The first row must contain the coordinate headings.
Base-Shape: Upload a single R file containing the polygon shape, such as an oceanic basin outline, to be used primarily with ggplot2 for map generation. This optional file provides the foundational geographical context for your visualizations. It can be generated using the SHP toolbox in the pre-processing tools, which converts shapefiles (.shp, .shx, and .dbf files) into an R file suitable for use as a base layer in ggplot2.
Layer-Shape: Upload a single R file containing an additional shape layer, such as a continent shape, to be used primarily with ggplot2 for map generation. This optional file can also be created using the SHP toolbox available in the pre-processing tools.
Best practices when uploading your sheet (a quick set of R checks follows this list)
Prepare your data: Use the first row as column headers and the first column as observation labels.
Ensure each label is filled with unique information, removing any duplicated names.
Check for empty cells in the observation label column.
Ensure that the column names are unique; duplicated names are not allowed.
Avoid using cells with blank spaces or special symbols.
Avoid beginning variable names with a number.
Note that R is case-sensitive, so “name” is different from “Name” or “NAME.”
Avoid blank rows and/or columns in your data.
Replace missing values with NA (not available).
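Most of these issues can be caught in R before uploading; a small sketch ("my_sheet.csv" is a hypothetical file, with observation labels in the first column):
# Pre-upload sanity checks
sheet <- read.csv("my_sheet.csv", check.names = FALSE)  # hypothetical file name
any(duplicated(sheet[[1]]))                  # TRUE flags duplicated observation labels
any(is.na(sheet[[1]]) | sheet[[1]] == "")    # TRUE flags empty label cells
anyDuplicated(names(sheet)) > 0              # TRUE flags duplicated column names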
7.1.2 Use Example Data
This option lets users explore the example data included in iMESc. After clicking “Create a Datalist,” select the “Use example data” radio button to proceed with the Datalist insertion. This action will insert two Datalists from Araçá Bay, located on the southeastern coast of Brazil:
envi_araca: Contains 141 samples with 9 environmental variables.
nema_araca: Contains 141 samples with 194 free-living marine nematode species.
Both Datalists comprise five attributes: Numeric, Factor, Coords, Base-Shape, and Layer-Shape. Studies that explored these data include Corte et al. (2017), Checon et al. (2018), and Vieira et al. (2021).
7.2 Options
This drop-down menu offers the user a range of tools for editing Datalists.
7.2.1 Rename Datalist
Change the name of a selected Datalist.
7.2.2 Merge Datalists
Combine two or more Datalists by columns or rows. This action affects both the Numeric- and Factor-Attribute data. When merging by rows, it also combines the associated Coords-Attributes (if any). When merging by columns, there is an option to fill missing columns with NA or to restrict the Datalists to their common columns.
Please note that saved models in one of the Datalists are not transferred to the merged Datalist.
7.2.3 Exchange Factor/Variables
The “Exchange Factors/Variables” functionality in iMESc allows you to convert or transfer data between numeric and factor formats. This powerful tool provides flexibility in handling your data and enables transitions between different data types.
From Datalist Selector: Select the source Datalist from which you want to exchange data.
From Attribute Selector: Within the selected Datalist, choose between the Numeric or Factor Attribute that you wish to convert or transfer.
To Datalist Selector: Select the target Datalist where you want to transfer or convert the data.
To Attribute Selector: Within the target Datalist, specify whether you want to convert the data to Numeric or Factor format.
From Numeric…
To Numeric: This option allows you to copy or transfer the selected numeric variables from the source Datalist to the target Datalist while preserving their numeric format.
To Factor: Convert the selected numeric variables to factors. The default is to transform each unique numeric value into a new level of the factor. You can use the “cut” option to categorize the variables into specified bins or levels. The initial guess for the number of bins can be determined by three methods: Sturges’, Scott’s, or Freedman–Diaconis. Additionally, you can manually define the number of bins and edit the names and order of the factor levels.
From Factor…
To Factor: With this option, you can copy or transfer the selected factors from the source Datalist to the target Datalist, maintaining their original factor format.
To Numeric: This conversion converts the selected factors to numeric data before copying or moving them. Two types of conversion are available (see the sketch after this list):
Binary: For each factor level, a single binary column is created, where 1 indicates the class of that observation.
Integer: A single column is created, representing the numeric (integer) representation of the factor levels (values).
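In plain R, the two conversions correspond roughly to the sketch below (illustrative only, not iMESc internals; the factor values are made up):
# Factor-to-numeric conversions, sketched with a toy factor
f <- factor(c("sand", "mud", "sand", "gravel"))
binary <- model.matrix(~ f - 1)   # Binary: one 0/1 column per factor level
codes  <- as.integer(f)           # Integer: one column of level codes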
7.2.4 Replace Attributes
The “Replace Attribute” option allows users to update existing Attributes within a Datalist by replacing them with new data from a CSV file.
7.2.5 Edit Datalist Columns
Modify the names of columns for both Numeric-Attributes and Factor-Attributes, and remove columns.
7.2.6 Edit Model names
Allows you to edit the names of saved models.
7.2.7 Transpose a Datalist
Rotate a Datalist (Numeric and Factor) from rows to columns. If a Coords-Attribute is associated with the Datalist, it will be removed.
7.2.8 SHP toolbox
This toolbox allows the creation of Base-Shapes, Layer-Shapes and Extra shapes.
Targets & Upload
- Select the Target Shape-Attribute: Base-Shape, Layer-Shape, or Extra-Shape.
- Select the Target Datalist.
- Upload the shapefiles all at once.
Shapefiles are a simple, nontopological format for storing the geometric location and attribute information of geographic features. The shapefile format defines the geometry and attributes of geographically referenced features in three or more files with specific file extensions that should be stored in the same folder. It requires at least three files:
.shp: The main file that stores the feature geometry.
.shx: The index file that stores the index of the feature geometry.
.dbf: The dBASE table that stores the attribute information of features.
There is a one-to-one relationship between geometry and attributes, which is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file.
Each file must have the same prefix, for example: basin.shp, basin.shx, and basin.dbf.
Filter & Crop
A setup box that appears after uploading and reading the shape files. Options include filtering specific features, cropping to an existing shape attribute, or manual cropping.
Create & Save
A setup box displayed after uploading and reading the shape files. Use it to save the new shape in the target Datalist.
7.2.9 Run Script
Execute custom R scripts using your Datalists within iMESc. Saved Datalists are accessible through the saved_data object.
Example:
names(saved_data) # Lists the names of the Datalists
attr(saved_data[["nema_araca"]], "factors") # Access the Factor-Attribute, where 'nema_araca' is the Datalist name
attr(saved_data[["nema_araca"]], "coords") # Access the Coords-Attribute
To permanently modify iMESc objects:
names(vals$saved_data)[1] <- "new name" # Renames the first Datalist
7.2.10 Datalist Manager
Manage saved Datalists and their attributes. The manager displays the size of each Datalist and provides options for deleting attributes.
7.2.11 Delete Datalists
Remove a Datalist entirely.
7.3 Filter observations
This tool allows manipulating the Numeric-Attribute by filtering observations based on certain criteria. The available options are:
Individual row selection: Manually select observations using Datalist IDs.
NA.omit: Remove all rows containing empty cells (NAs).
Remove Zero Variance: Remove rows with near-zero variance.
Match IDs with Datalist: Constrain the target Datalist to observations (IDs) from another Datalist.
Filter by Factors: Filter observations using a tree structured by the levels of the Factor-Attribute. You can click on the nodes to expand and select the factor levels. This function is available for factors with fewer than 100 levels.
7.4 Filter variables
This tool allows manipulating the Numeric-Attribute by filtering variables. The available options are:
- Individual Selection: Manually select specific variables (columns) to keep or remove from the Datalist.
- Value-based removal: Remove numeric variables contributing less than a specified percentage of the total sum across all observations. The methods for this option are:
- Abund<: Remove variables with a total value less than x-percent of the total sum across all observations. This is useful to exclude variables with low overall contribution.
- Freq<: Remove variables that occur in less than x-percent of the total number of observations. This is helpful when you want to exclude rarely occurring variables.
- Singletons: Remove variables that occur only once in the dataset. This option is relevant for counting data and helps eliminate variables with no meaningful variation.
- Correlation-based removal: This option uses the findCorrelation function from the caret package. It considers the absolute values of pair-wise correlations between variables. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable across the whole dataset and removes the variable with the largest mean absolute correlation. The exact argument is set to TRUE, meaning that the function re-evaluates the average correlations at each step.
- Remove Zero Variance: Remove columns with zero variance.
- Remove Near-Zero Variance: This option uses the nearZeroVar function from the caret package. It identifies and removes near-zero-variance predictors. Predictors with near-zero variance have either zero variance (only one unique value) or very few unique values relative to the number of samples, with a large frequency ratio between the most common and second most common values. Removing such predictors helps eliminate features that do not contribute much information. (A sketch of these caret calls follows this list.)
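A minimal sketch of the underlying caret calls (numeric_data stands for a Datalist's Numeric-Attribute; the cutoff value is illustrative):
# Correlation- and variance-based filtering with caret
library(caret)
corr_mat  <- cor(numeric_data)
drop_cols <- findCorrelation(corr_mat, cutoff = 0.9, exact = TRUE)  # columns to remove
nzv_cols  <- nearZeroVar(numeric_data)                              # near-zero-variance columns
drop_idx  <- unique(c(drop_cols, nzv_cols))
# Guard against an empty index vector, which would otherwise drop every column
filtered  <- if (length(drop_idx)) numeric_data[, -drop_idx] else numeric_data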
7.5 Transformations
The “Transformations” tool enables preprocessing of the Numeric-Attribute using various transformation methods.
7.5.1 Transformation
Provides a wide range of transformation options (a small R sketch follows this list):
None: No Transformation. Select this option if you do not want to apply any transformation to the Numeric-Attribute.
Log2: Logarithmic base 2 transformation as suggested by Anderson et al. (2006). It follows the formula log_b (x) + 1 for x > 0, where ‘b’ is the base of the logarithm. Zeros are left as zeros. Higher bases give less weight to quantities and more to presences, and logbase = Inf gives the presence/absence scaling. Note that this is not log(x+1).
Log10: Logarithmic base 10 transformation as suggested by Anderson et al. (2006). It follows the formula log_b (x) + 1 for x > 0, where ‘b’ is the base of the logarithm. Zeros are left as zeros. Higher bases give less weight to quantities and more to presences, and logbase = Inf gives the presence/absence scaling. Note that this is not log(x+1).
Total: Divide by the line (observation) total. This transformation scales the values based on the total sum of each observation.
Max: Divide by the column (variable) maximum. This transformation scales the values based on the maximum value of each variable.
Frequency: Divide by the column (variable) total and multiply by the number of non-zero items, so that the average of non-zero entries is one. This transformation scales the values based on the frequency of occurrence.
Range: Standardize column (variable) values into the range 0 … 1. If all values are constant, they will be transformed to 0. This transformation brings the values to a common scale.
Pa: Scale x to presence/absence scale (0/1). This transformation converts the values to binary (0 for absence, 1 for presence).
Chi.square: Divide by row sums and square root of column sums and adjust for the square root of the matrix total. This transformation is relevant for specific statistical analyses.
Hellinger: Square root of method = total. This transformation is used for certain distance calculations.
Sqrt2: Square root transformation. This transformation takes the square root of each value.
Sqrt4: 4th root transformation. This transformation takes the 4th root of each value.
Log2(x+1): Logarithmic base 2 transformation (x+1). This is a variant of the log2 transformation that adds 1 before taking the logarithm.
Log10(x+1): Logarithmic base 10 transformation (x+1). This is a variant of the log10 transformation that adds 1 before taking the logarithm.
BoxCox: Designed for non-negative responses. The Box-Cox transformation is a family of power transformations that maps non-normally distributed data to an approximately normal distribution.
YeoJohnson: Like the Box-Cox transformation, but it can accommodate predictors with zero and/or negative values. This is another family of power transformations.
ExpoTrans: Exponential transformation. This transformation applies the exponential function to each value.
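Several of these option descriptions mirror the method names of vegan's decostand function; a minimal sketch, assuming that mapping (numeric_data stands for the Numeric-Attribute):
# Sketch using vegan::decostand
library(vegan)
hel   <- decostand(numeric_data, method = "hellinger")           # Hellinger
log2t <- decostand(numeric_data, method = "log", logbase = 2)    # Log2: log_b(x) + 1 for x > 0
pa    <- decostand(numeric_data, method = "pa")                  # presence/absence (0/1)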
7.5.2 Scale and Centering
This tool uses the scale function from base R for scaling and centering operations. You have the following options (a one-line equivalent follows):
Scale: If checked, scaling is done by dividing the (centered) columns of “x” either by their standard deviations (if center is TRUE) or by the root mean square (if center is FALSE).
Center: If checked, centering is done by subtracting the column means (omitting NAs) of “x” from their corresponding columns.
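Equivalently, in plain R (x stands for the Numeric-Attribute):
# base R scale(): subtract column means, then divide by column standard deviations
x_scaled <- scale(x, center = TRUE, scale = TRUE)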
7.6 Data imputation
This tool provides methods for completing missing values with values estimated from the observed data. It is available only for Datalists that contain missing data in either the Numeric-Attribute or the Factor-Attribute. The preProcess function from the caret package is used for imputation. To impute missing values, follow these steps:
Choose the Target-Attribute.
Pick a Method (described below).
Click the blue “Flash” button. The “Save Changes” dialog will automatically pop up.
Save the Datalist with imputed values as a new Datalist or replace an existing one.
Methods for imputation (an illustrative sketch follows this list):
- Knn (caret): k-nearest neighbor imputation, available only for the Numeric-Attribute. It is carried out by finding the k closest samples (Euclidean distance) in the dataset. This method automatically centers and scales your data.
- BagImpute (caret): Available only for the Numeric-Attribute. Imputation via bagging fits a bagged tree model for each predictor (as a function of all the others). This method is simple and more accurate, and it accepts missing values, but it has a much higher computational cost.
- MedianImpute (caret): Available only for the Numeric-Attribute. Imputation via medians takes the median of each predictor in the training set and uses it to fill missing values. This method is simple and fast, and it accepts missing values, but it treats each predictor independently, which may lead to inaccuracies.
- pmm (mice): Predictive mean matching (PMM), available for both Numeric- and Factor-Attributes. It involves selecting observations with the closest predicted values as imputation candidates. This method maintains the distribution and variability of the data, making it suitable for normally distributed data.
- rf (mice): Random forest imputation, available for both Numeric- and Factor-Attributes. It uses an ensemble of decision trees to predict missing values. This non-parametric method can handle complex interactions and nonlinear relationships but may be computationally intensive.
- cart (mice): Classification and regression trees (CART) imputation, available for both Numeric- and Factor-Attributes. It applies decision trees for imputation, splitting the data into subsets that then yield a prediction model.
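An illustrative sketch of the two imputation back-ends (df stands for a Numeric-Attribute containing NAs; parameter values are illustrative):
# caret back-end: k-nearest neighbor imputation (centers and scales internally)
library(caret)
pp <- preProcess(df, method = "knnImpute", k = 5)
df_knn <- predict(pp, df)
# mice back-end: predictive mean matching
library(mice)
imp <- mice(df, method = "pmm", m = 1, printFlag = FALSE)
df_pmm <- complete(imp)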
7.7 Data partition
Data partitioning is a critical step in evaluating machine learning models. Creating distinct training and testing sets allows for an accurate assessment of the model’s performance on unseen data, avoiding issues like overfitting and yielding more reliable performance metrics. In iMESc, the Data Partition tool adds the partition as a factor in the Factor-Attribute. It uses the createDataPartition function from the caret package. Users can specify the percentage of observations to be used for the test set and choose between the following methods (a minimal sketch of the underlying call follows the list):
Balanced Sampling: Ensures balanced distributions within the splits for classification or regression models.
For classification models, random sampling is done within the levels of the target variable (y) to balance class distributions within the splits.
For regression models, samples are divided into sections based on percentiles of the numeric target variable (Y), with sampling performed within these subgroups.
Random Sampling: Simple random sampling is used.
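A minimal sketch of the underlying caret call (X and y stand for the Numeric-Attribute and the target variable; the training proportion is illustrative):
# caret::createDataPartition: sampling balanced on y
library(caret)
set.seed(1)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
train_set <- X[train_idx, ]
test_set  <- X[-train_idx, ]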
7.8 Aggregate
The “Aggregate” tool utilizes the aggregate function from base R. It aggregates individual cases of the Numeric-Attribute based on a grouping factor. The tool offers various calculation options to aggregate the data (a one-line example follows this list):
Mean: Calculates the mean of each group (selected factor).
Sum: Calculates the sum of values for each group (selected factor).
Median: Calculates the median of each group (selected factor).
Var: Calculates the variance of each group (selected factor).
SD: Calculates the standard deviation of each group (selected factor).
Min: Retrieves the minimum value for each group (selected factor).
Max: Retrieves the maximum value for each group (selected factor).
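For example, the mean option corresponds to a base R call like the following (numeric_data and group are stand-ins for the Numeric-Attribute and the grouping factor):
# base R aggregate: mean of each numeric variable per factor level
agg <- aggregate(numeric_data, by = list(group = group), FUN = mean)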
7.9 Create Palette
The “Create Palette” tool utilizes the colourpicker tool from the colourpicker package, enabling users to interactively select colors for their palette. Subsequently, iMESc employs colorRampPalette to generate customized color palettes suitable for graphical outputs.
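The palette generation step amounts to a call like the following (the colors shown are illustrative):
# colorRampPalette returns a function that interpolates the chosen colors
pal_fun <- colorRampPalette(c("#2C7BB6", "#FFFFBF", "#D7191C"))
pal_fun(10)  # ten interpolated hex colors for a graphical output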
7.10 Savepoint
Create: Creates a savepoint, a single R object (.rds) that can be downloaded and later reloaded or shared to restore the workspace.
Restore: Upload a savepoint (.rds file) to restore the workspace.
9 Packages & functions
In this section, we present the key packages and functions used in the development of iMESc. While iMESc relies on a wide range of packages, the tables below highlight those that play a crucial role throughout the app and in its various analytical tasks. Please note that these tables might not be exhaustive, but they cover the most relevant packages and functions used in iMESc. The version numbers provided are subject to change with future package updates.
9.1 Table 1: Packages and Functions for Analytical Tasks
In this table, we highlight the packages and functions used for various analytical tasks within iMESc.
Package | Version | Functions | Task
---|---|---|---
automap | 1.1.9 | autofitVariogram | Spatial Tools
aweSOM | 1.3 | somDist, somQuality | Self-Organizing Maps
caret | 6.0.94 | createDataPartition, findCorrelation, confusionMatrix, gafsControl, getModelInfo, postResample, varImp, MAE, multiClassSummary, RMSE, train, trainControl | Supervised Algorithms, Pre-processing tools
dendextend | 1.17.1 | as.ggdend, color_branches, get_leaves_branches_col, heights_per_k.dendrogram, highlight_branches_lwd, labels_colors, prepare.ggdend, theme_dendro | Hierarchical Clustering
GGally | 2.2.1 | ggally_cor, ggally_densityDiag, ggally_points, ggpairs, ggally_barDiag | Descriptive Tools
ggforce | 0.4.1 | geom_arc_bar | Self-Organizing Maps
ggparty | 1.0.0 | geom_edge, geom_edge_label, geom_node_info, geom_node_plot, geom_node_splitvar, ggparty | Supervised Algorithms
ggraph | 2.1.0 | geom_edge_diagonal, geom_edge_link, geom_node_label, geom_node_point, geom_node_text, ggraph | Supervised Algorithms
ggridges | 0.5.6 | geom_density_ridges | Descriptive Tools
gstat | 2.1.1 | gstat, gstat.cv, variogramLine, vgm, idw | Spatial Tools
kernlab | 0.9.32 | ksvm, sigest | Supervised Algorithms
klaR | 1.7.3 | dkernel | Supervised Algorithms
kohonen | 3.0.12 | getCodes, object.distances, somgrid, supersom, unit.distances, map, check.whatmap, nunits, classvec2classmat, classmat2classvec, add.cluster.boundaries, dist2WU | Self-Organizing Maps
lattice | 0.21.9 | bwplot, densityplot, dotplot, parallelplot, splom, trellis.par.set, xyplot | Compare Models
leaflet | 2.2.1 | leafletOutput, renderLeaflet | Spatial Tools
Metrics | 0.1.4 | mae, mape, mse, rmse | Supervised Algorithms
mice | 3.16.0 | complete, mice | Pre-processing tools
NeuralNetTools | 1.5.3 | olden, neuralweights | Supervised Algorithms
party | 1.3.14 | prettytree | Supervised Algorithms
partykit | 1.2.20 | as.party, as.partynode, gettree | Supervised Algorithms
pdp | 0.8.1 | partial, exemplar, plotPartial | Supervised Algorithms
plot3D | 1.4.1 | persp3D, perspbox | Spatial Tools
plotly | 4.10.4 | add_surface, add_trace, plot_ly, plotlyOutput, renderPlotly, style | Spatial Tools, Supervised Algorithms
randomForest | 4.7.1.1 | importance, getTree | Supervised Algorithms
randomForestExplainer | 0.10.1 | important_variables, min_depth_interactions, plot_importance_rankings, plot_min_depth_interactions, measure_importance, min_depth_distribution, plot_multi_way_importance | Supervised Algorithms
raster | 3.6.26 | crop, extent, mask, raster, rasterize, rasterToPoints, values, writeRaster, crs, rasterFromXYZ, ratify | Spatial Tools
segRDA | 1.0.2 | bp, extract, OrdData | Descriptive Tools
sf | 1.0.15 | st_as_sf, st_bbox, st_crs, st_set_crs, st_transform, st_cast, st_coordinates, st_geometry_type, st_point, st_sfc | Spatial Tools
shinyTree | 0.3.1 | get_selected, renderTree, shinyTree | Pre-processing tools
sp | 2.1.3 | coordinates, zerodist, CRS, spsample | Spatial Tools
vegan | 2.6.4 | decostand, diversity, estimateR, fisher.alpha, specnumber | Diversity tools
webshot | 0.5.5 | webshot | Spatial Tools
9.2 Table 2: Packages and Functions Used Throughout the App
This table presents the packages and their respective functions that are utilized across the entire app. These packages are essential for data manipulation, visualization, interactive features, and more.
Package | Version | Functions
---|---|---
base64enc | 0.1.3 | dataURI
colorspace | 2.1.0 | hex2RGB, mixcolor
colourpicker | 1.3.0 | colourInput
data.table | 1.15.0 | melt, rbindlist
devtools | 2.4.5 | source_url
DT | 0.32 | DTOutput, renderDT, datatable, dataTableOutput, formatStyle, renderDataTable, JS
e1071 | 1.7.14 | allShortestPaths
foreach | 1.5.2 | foreach
gbRd | 0.4.11 | Rd_fun
ggnewscale | 0.4.10 | new_scale_fill, new_scale_color, new_scale
ggplot2 | 3.5.1 | discrete_scale, element_rect, geom_sf, ggplot, layer, scale_color_gradientn, theme, aes, coord_cartesian, coord_flip, element_blank, element_line, element_text, geom_label, geom_segment, geom_text, ggtitle, margin, scale_colour_identity, scale_linetype_identity, scale_size_identity, scale_x_discrete, scale_y_reverse, theme_bw, theme_classic, theme_dark, theme_grey, theme_light, theme_linedraw, theme_minimal, theme_void, xlab, ylab, coord_fixed, geom_line, geom_point, geom_polygon, geom_vline, guide_legend, guides, scale_color_manual, scale_fill_gradientn, scale_fill_manual, scale_linetype_manual, scale_x_continuous, scale_y_continuous, sec_axis, standardise_aes_names, geom_raster, labs, scale_colour_manual, facet_wrap, geom_freqpoly, vars
ggrepel | 0.9.5 | geom_label_repel
htmlwidgets | 1.6.4 | saveWidget
igraph | 2.0.2 | delete_vertices, graph_from_data_frame, V
MLmetrics | 1.1.1 | R2_Score, F1_Score, FBeta_Score, Gini, MAE, MAPE, MedianAE, MedianAPE, MSE, NormalizedGini, Poisson_LogLoss, RAE, RMSE, RMSLE, RMSPE, RRSE
pROC | 1.18.5 | multiclass.roc
purrr | 1.0.2 | walk2
RColorBrewer | 1.1.3 | brewer.pal
readr | 2.1.4 | read_file
readxl | 1.4.3 | cell_cols, excel_sheets, read_excel
reshape | 0.8.9 | melt
reshape2 | 1.4.4 | melt
scales | 1.3.0 | cbreaks, extended_breaks, rescale, col_numeric, label_number
shiny | 1.8.0 | actionButton, actionLink, callModule, checkboxInput, column, div, downloadButton, downloadHandler, downloadLink, em, eventReactive, fluidRow, HTML, icon, insertTab, isolate, modalButton, modalDialog, moduleServer, navbarPage, need, NS, numericInput, observe, observeEvent, plotOutput, reactive, reactiveVal, reactiveValues, removeModal, removeTab, renderPlot, renderPrint, renderUI, req, selectInput, showModal, span, strong, tabPanel, tabsetPanel, textInput, uiOutput, updateCheckboxInput, updateNumericInput, updateTabsetPanel, updateTextInput, validate, withProgress, img, getDefaultReactiveDomain, incProgress, a, absolutePanel, br, code, conditionalPanel, h3, h4, h5, htmlOutput, p, renderTable, splitLayout, verbatimTextOutput
shinyBS | 0.61.1 | bsTooltip, addPopover, bsButton, popify, tipify
shinybusy | 0.3.2 | add_busy_spinner
shinydashboardPlus | 2.0.3 | dashboardFooter, dashboardHeader, dashboardPage, dashboardSidebar
shinyjs | 2.1.0 | delay, hide, onevent, runjs, hidden, useShinyjs, addClass, colourInput, reset, toggle, toggleClass, addCssClass, removeCssClass, toggleState
shinyWidgets | 0.8.1 | pickerInput, radioGroupButtons, updatePickerInput, updateVirtualSelect, virtualSelectInput, switchInput, updateSwitchInput, dropMenu, pickerOptions, updateRadioGroupButtons
sortable | 0.5.0 | rank_list
stringr | 1.5.1 | str_length, str_replace_all
tibble | 3.2.1 | rownames_to_column
10 Model-specific tuning (Supervised algorithms)
10.1 Random Forest
Parameter | Description
---|---
ntree | Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
replace | Should sampling of cases be done with or without replacement?
nodesize | Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5).
maxnodes | Maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than the maximum possible, a warning is issued.
nPerm | Number of times the OOB data are permuted per tree for assessing variable importance. A number larger than 1 gives a slightly more stable estimate, but is not very effective. Currently only implemented for regression.
norm.votes | If TRUE (default), the final result of votes is expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.
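These parameters are forwarded to the underlying randomForest call when a model is trained. The sketch below shows how such pass-through arguments look with caret::train, which iMESc uses for supervised models (X, y, and the parameter values are stand-ins):
# Sketch: caret forwards extra arguments (ntree, nodesize, ...) to randomForest
library(caret)
set.seed(7)
rf_fit <- train(
  x = X, y = y,                                        # stand-ins for predictors/response
  method = "rf",
  trControl = trainControl(method = "cv", number = 5), # illustrative resampling scheme
  ntree = 500,                                         # model-specific parameters from the table above
  nodesize = 1
)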
10.2 Naive Bayes
Parameter | Description
---|---
bw | The smoothing bandwidth to be used. The kernels are scaled such that this is the standard deviation of the smoothing kernel. Can also be a character string giving a rule to choose the bandwidth. The default is “nrd0”.
window | A character string giving the smoothing kernel to be used. Must partially match one of “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “cosine” or “optcosine”. Default is “gaussian”.
kernel | A character string giving the smoothing kernel to be used. Must partially match one of “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “cosine” or “optcosine”. Default is “gaussian”.
10.3 k-Nearest Neighbors
Parameter | Description
---|---
l | Minimum vote for a definite decision, otherwise doubt. Less than k-l dissenting votes are allowed, even if k is increased by ties.
use.all | Controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbors.
10.4 Stochastic Gradient Boosting
Parameter | Description
---|---
bag.fraction | The fraction of the training set observations randomly selected to propose the next tree in the expansion. Default is 0.5.
10.5 Self-Organizing Maps
Parameter | Description
---|---
rlen | The number of times the complete data set will be presented to the network.
alpha | Learning rate, a vector of two numbers indicating the amount of change. Default is to decline linearly from 0.05 to 0.01 over rlen updates. Not used for the batch algorithm.
maxNA.fraction | The maximal fraction of values that may be NA to prevent the row from being removed.
dist.fcts | Vector of distance functions to be used for the individual data layers. Default is “sumofsquares” for continuous data and “tanimoto” for factors.
mode | Type of learning algorithm.
normalizeDataLayers | Boolean indicating whether distance.weights should be calculated. If normalizeDataLayers == FALSE, user weights are applied to the data immediately.
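For reference, these map directly onto kohonen::supersom arguments; a minimal sketch (grid size and parameter values are illustrative; X stands for the Numeric-Attribute):
# Sketch: the SOM parameters above as kohonen::supersom arguments
library(kohonen)
som_fit <- supersom(
  list(as.matrix(scale(X))),            # scaled numeric data as a single layer
  grid = somgrid(5, 5, "hexagonal"),    # illustrative map size
  rlen = 100,
  alpha = c(0.05, 0.01),
  dist.fcts = "sumofsquares",
  mode = "online"
)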
10.6 Generalized Linear Model
Parameter | Description
---|---
method | The method to be used in fitting the model. The default method “glm.fit” uses iteratively reweighted least squares (IWLS).
singular.ok | Logical; if FALSE, a singular fit is an error.
epsilon | Positive convergence tolerance; the iterations converge when the relative change in deviance falls below epsilon.
maxit | Integer giving the maximal number of IWLS iterations.
10.7 Stacked AutoEncoder Deep Neural Network
Parameter | Description
---|---
activationfun | Activation function of the hidden units. Can be “sigm”, “linear” or “tanh”. Default is “sigm”.
learningrate | Learning rate for gradient descent. Default is 0.8.
momentum | Momentum for gradient descent. Default is 0.5.
learningrate_scale | Learning rate will be multiplied by this scale after every iteration. Default is 1.
output | Function of the output unit. Can be “sigm”, “linear” or “softmax”. Default is “sigm”.
sae_output | Function of the autoencoder output unit. Can be “sigm”, “linear” or “softmax”. Default is “linear”.
numepochs | Number of iterations over the samples. Default is 3.
batchsize | Size of the mini-batch. Default is 100.
10.8 Conditional Inference Random Forest
Parameter | Description
---|---
teststat | A character specifying the type of test statistic to be applied.
testtype | A character specifying how to compute the distribution of the test statistic.
mincriterion | The value of the test statistic or 1 - p-value that must be exceeded to implement a split.
savesplitstats | A logical determining whether standardized two-sample statistics for split point estimates are saved for each primary split.
ntree | Number of trees to grow in a forest.
replace | A logical indicating whether sampling of observations is done with or without replacement.
fraction | Fraction of the number of observations to draw without replacement (only relevant if replace = FALSE).
10.9 Gaussian Process with Radial Basis Function Kernel
Parameter | Description
---|---
scaled | A logical vector indicating the variables to be scaled. By default, data are scaled to zero mean and unit variance.
var | The initial noise variance for regression. Default is 0.001.
tol | Tolerance of termination criterion. Default is 0.001.
10.10 svmLinear - Support Vector Machines with Linear Kernel
Parameter | Description
---|---
nu | Parameter needed for nu-svc, one-svc, and nu-svr. Sets the upper bound on the training error and the lower bound on the fraction of data points that become Support Vectors (default: 0.2).
epsilon | Epsilon in the insensitive-loss function used for eps-svr, nu-svr, and eps-bsvm (default: 0.1).
class.weights | A named vector of weights for the different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1).
cross | If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the model’s quality: accuracy rate for classification and Mean Squared Error for regression.
tol | Tolerance of termination criterion (default: 0.001).
shrinking | Option whether to use the shrinking heuristics (default: TRUE).
10.11 svmRadial - Support Vector Machines with Radial Basis Function Kernel
Parameter | Description
---|---
nu | Parameter needed for nu-svc, one-svc, and nu-svr. Sets the upper bound on the training error and the lower bound on the fraction of data points that become Support Vectors (default: 0.2).
epsilon | Epsilon in the insensitive-loss function used for eps-svr, nu-svr, and eps-bsvm (default: 0.1).
class.weights | A named vector of weights for the different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1).
cross | If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the model’s quality: accuracy rate for classification and Mean Squared Error for regression.
tol | Tolerance of termination criterion (default: 0.001).
shrinking | Option whether to use the shrinking heuristics (default: TRUE).
10.12 svmRadialCost - Support Vector Machines with Radial Basis Function Kernel
Parameter | Description
---|---
nu | Parameter needed for nu-svc, one-svc, and nu-svr. Sets the upper bound on the training error and the lower bound on the fraction of data points that become Support Vectors (default: 0.2).
epsilon | Epsilon in the insensitive-loss function used for eps-svr, nu-svr, and eps-bsvm (default: 0.1).
class.weights | A named vector of weights for the different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1).
cross | If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the model’s quality: accuracy rate for classification and Mean Squared Error for regression.
tol | Tolerance of termination criterion (default: 0.001).
shrinking | Option whether to use the shrinking heuristics (default: TRUE).
10.13 avNNet - Model Averaged Neural Network
No specific parameters provided for this model.
10.14 nnet - Neural Network
Parameter | Description
---|---
linout | Switch for linear output units. Default is logistic output units.
entropy | Switch for entropy (= maximum conditional likelihood) fitting. Default is least-squares.
censored | A variant on softmax, in which non-zero targets mean possible classes. For softmax a row of (0, 1, 1) means one example each of classes 2 and 3, but for censored it means one example whose class is only known to be 2 or 3.
skip | Switch to add skip-layer connections from input to output.
rang | Initial random weights on [-rang, rang]. Value about 0.5 unless the inputs are large, in which case rang * max(abs(x)) should be about 1.
maxit | Maximum number of iterations (default: 100).
Hess | If true, returns the Hessian of the measure of fit at the best set of weights found.
MaxNWts | Maximum allowable number of weights. Increasing MaxNWts will likely slow down fitting.
abstol | Stop if the fit criterion falls below abstol, indicating an essentially perfect fit.
reltol | Stop if the optimizer is unable to reduce the fit criterion by a factor of at least 1 - reltol.
10.15 pcaNNet - Neural Networks with Feature Extraction
No specific parameters provided for this model.
10.16 rpart - CART
Parameter | Description
---|---
minsplit | The minimum number of observations that must exist in a node in order for a split to be attempted.
minbucket | The minimum number of observations in any terminal (leaf) node.
maxcompete | Number of competitor splits retained in the output. Useful to know not just which split was chosen, but which variable came in second, third, etc.
maxsurrogate | Number of surrogate splits retained in the output. Setting this to zero reduces compute time.
usesurrogate | How to use surrogates in the splitting process (0 = display only, 1 = use surrogates, 2 = use surrogates for missing primary variables).
xval | Number of cross-validations.
surrogatestyle | Controls the selection of the best surrogate (0 = total number of correct classifications, 1 = percent correct over non-missing values).
maxdepth | Maximum depth of any node of the final tree, with the root node counted as depth 0.
10.17 monmlp - Monotone Multi-Layer Perceptron Neural Network
Parameter | Description
---|---
hidden2 | Number of hidden nodes in the second hidden layer.
iter.max | Maximum number of iterations of the optimization algorithm.
n.trials | Number of repeated trials used to avoid local minima.
bag | Logical variable indicating whether to use bootstrap aggregation (bagging).
max.exceptions | Maximum number of exceptions of the optimization routine before fitting is terminated with an error.
method | The optimx optimization method to be used.
10.18 mlpML - Multi-Layer Perceptron, with multiple layers
Parameter | Description
---|---
size | Number of units in the hidden layer(s).
maxit | Maximum number of iterations to learn.
initFunc | The initialization function to use.
initFuncParams | The parameters for the initialization function.
learnFunc | The learning function to use.
learnFuncParams | The parameters for the learning function.
updateFunc | The update function to use.
updateFuncParams | The parameters for the update function.
hiddenActFunc | The activation function of all hidden units.
shufflePatterns | Should the patterns be shuffled?
10.19 evtree - Tree Models from Genetic Algorithms
Parameter | Description
---|---
minbucket | The minimum sum of weights in a terminal node.
minsplit | The minimum sum of weights in a node in order to be considered for splitting.
maxdepth | Maximum depth of the tree. Note that memory requirements increase by the square of the maximum tree depth.
11 References
Blanchet, F. G., Legendre, P., & Borcard, D. (2008). Forward selection of explanatory variables. Ecology, 89(9), 2623–2632. https://doi.org/10.1890/07-0986.1
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & Lopez, A. (2020). A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408, 189–215. https://doi.org/10.1016/j.neucom.2019.10.118
Checon, H. H., Vieira, D. C., Corte, G. N., Sousa, E. C. P. M., Fonseca, G., & Amaral, A. C. Z. (2018). Defining soft bottom habitats and potential indicator species as tools for monitoring coastal systems: A case study in a subtropical bay. Ocean & Coastal Management, 164, 68–78. https://doi.org/10.1016/j.ocecoaman.2018.03.035
Corte, G. N., Checon, H. H., Fonseca, G., Vieira, D. C., Gallucci, F., Domenico, M. Di, & Amaral, A. C. Z. (2017). Cross-taxon congruence in benthic communities: Searching for surrogates in marine sediments. Ecological Indicators, 78, 173–182. https://doi.org/10.1016/j.ecolind.2017.03.031
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification (2nd ed.). John Wiley & Sons.
Fix, E., & Hodges, J. L. (1989). Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. International Statistical Review / Revue Internationale de Statistique, 57(3), 238–247. http://www.jstor.org/stable/1403797
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN Model-Based Approach in Classification (pp. 986–996). https://doi.org/10.1007/978-3-540-39964-3_62
Gupta, B., Rawat, A., Jain, A., Arora, A., & Dhami, N. (2017). Analysis of Various Decision Tree Algorithms for Classification in Data Mining. International Journal of Computer Applications, 163(8), 15–19. https://doi.org/10.5120/ijca2017913660
Kalcheva, N., Todorova, M., & Marinova, G. (2020). Naive Bayes classifier, decision tree and AdaBoost ensemble algorithm – advantages and disadvantages. 153–157. https://doi.org/10.31410/ERAZ.2020.153
Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30, 81–93.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). https://doi.org/10.18637/jss.v028.i05
Legendre, P., & Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69(1), 1–24. https://doi.org/10.1890/0012-9615
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.
Melssen, W., Wehrens, R., & Buydens, L. (2006). Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems, 83(2), 99–113. https://doi.org/10.1016/j.chemolab.2006.02.003
Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine Series 1, 2, 559–572. https://api.semanticscholar.org/CorpusID:125037489
Pearson, K. (1920). Notes on the History of Correlation. Biometrika, 13, 25–45.
Shepard, D. (1968). A two-dimensional interpolation function for irregularly-spaced data. Proceedings of the 1968 23rd ACM National Conference On -, 517–524. https://doi.org/10.1145/800186.810616
Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27(2), 125–140.
Sneath, P. H. A. (1957). The application of computers to taxonomy. Journal of General Microbiology, 17(1), 201–226.
Spearman, C. (1987). The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 100(3/4), 441–471. http://www.jstor.org/stable/1422689
Vieira, D. C., Brustolin, M. C., Ferreira, F. C., & Fonseca, G. (2019). segRDA: An R package for performing piecewise redundancy analysis. Methods in Ecology and Evolution, 10(12), 2189–2194. https://doi.org/10.1111/2041-210X.13300
Vieira, D. C., Gallucci, F., Corte, G. N., Checon, H. H., Zacagnini Amaral, A. C., & Fonseca, G. (2021). The relative contribution of non-selection and selection processes in marine benthic assemblages. Marine Environmental Research, 163, 105223. https://doi.org/10.1016/j.marenvres.2020.105223
Yao, M., Zhu, Y., Li, J., Wei, H., & He, P. (2019). Research on Predicting Line Loss Rate in Low Voltage Distribution Network Based on Gradient Boosting Decision Tree. Energies, 12(13), 2522. https://doi.org/10.3390/en12132522