iMESchelp!

Introduction

Welcome to iMESchelp, your comprehensive guide to using iMESc: an interactive machine learning app designed to analyze environmental data. iMESc is a shiny-based application that supports end-to-end machine learning workflows. It provides a wide range of resources to meet the varied needs of environmental scientists, making it a versatile tool for data analysis.

This manual is organized into the following sections:

  1. Setup: Step-by-step instructions to run iMESc.

  2. Layout: The dashboard organization.

  3. Widgets: The interactive widgets used in iMESc, such as buttons, dropdowns, and checkboxes, which enable seamless interactions with the app.

  4. Datalist: Exploring the core concept of Datalists and their attributes.

  5. Main Analyses: A quick-reference table of the main analyses, their abbreviations, and the packages behind them.

  6. Essentials of Building a Workflow in iMESc: Generic workflow steps commonly encountered while using iMESc.

  7. Pre-processing Tools: In-depth coverage of the tools available for pre-processing data, including Datalist creation and data transformation.

  8. Sidebar-menu: Details about the modules, analyses and algorithms, parametrization, and results.

  9. Packages & Functions: Details about the main R packages and functions used in iMESc, along with their respective versions and analytical tasks.

1 Setup

  1. Install R and RStudio if you haven’t done so already;

  2. Once installed, open RStudio;

  3. Install the shiny package if it is not already installed:

install.packages('shiny')
  4. Run the code below.

shiny::runGitHub('iMESc','DaniloCVieira', ref='main')

When you run the iMESc app for the first time, it will automatically install all the necessary packages, which may take several minutes to complete. Once this first installation is finished, subsequent launches of the app will be much faster: if the required packages are not yet loaded, they typically take several seconds to load; if they are already loaded, iMESc will start almost instantly.

2 Layout

iMESc is designed with a dashboard layout, consisting of three main sections:

  1. Pre-processing tools at the top-left, containing widgets for Datalist options and data pre-processing.

  2. Sidebar-menu on the left-hand side containing menu buttons.

  3. The Main panel for viewing the analytical tasks.

Upon selecting a menu button, users will seamlessly navigate to a sub-screen housing the selected module. Each module features a header equipped with interactive widgets, along with multiple tab panels that support various functionalities.

To ensure an optimal display of iMESc content, we strongly recommend a minimum landscape resolution of 1377 x 768 pixels. This resolution ensures an enhanced user experience and proper visualization of all elements on the screen.

Fig S2.1 - iMESc layout

3 Widgets

The app is built using widgets: web elements that users interact with. The standard iMESc widgets are:

Table S3.1 - iMESc widgets
Widgets Task
Button Performs an action when clicked
Picker/Dropdown Allows the user to select only one of a predefined set of mutually exclusive options
Checkbox Interactive box that can be toggled by the user to indicate an affirmative or negative choice
Checkbox group A group of checkboxes
Radiobuttons Allows the user to choose only one of a predefined set of mutually exclusive options
File A file upload control wizard
Numeric A field to enter numbers
Text A field to enter text

4 Datalist

iMESc manages data through Datalists (Fig S4.1), which can include user-provided sheets and shapefiles. The sheets are internally treated as data.frame objects in R, where rows represent observations and columns represent variables. Observations are matched among these attributes, regardless of the Datalists they come from, based on their row names, ensuring data consistency. For proper handling of the sheet attributes, you must provide a unique ID in the first column when uploading the data. iMESc automatically removes this first column and uses it as the row names of the respective attribute. Decimals in iMESc are represented with dots (e.g., 1/2 = 0.5); check this before uploading a file.

Required

Numeric-Attribute: Numerical data with continuous or discrete variables.

Optional

Factor-Attribute: Categorical data. If not provided, iMESc automatically generates this attribute as a sheet with a single column containing the IDs of the Numeric-Attribute.
Coords-Attribute: Geographical coordinates (Longitude, Latitude) represented in decimal degrees, used for spatialization of data within the Spatial Tools.
Base-Shape-Attribute: A polygon shape to clip or interpolate spatialized data for map generation.
Layer-Shape-Attribute: Adds extra shapes for superimposition on maps.
Extra-Shape-Attribute: Users can add other shapes as Extra-Shape-Attributes to further customize their maps.

Furthermore, Datalists can store models trained in iMESc, permitting users to integrate and manage predictive models along with their datasets. To access all the available analyses in iMESc, you need to create a Datalist by either uploading your own data or using the example data provided. For guidance on uploading sheets and the required attribute formats, as well as how to use the example Datalists, please refer to the "Create a Datalist" section.

Fig. S4.1 - Schematic representation of a Datalist and its associated attributes

5 Main Analyses

The table provided below serves as a convenient reference for the main analyses utilized in iMESc. It displays their locations in the sidebar menu, along with their abbreviations and corresponding packages. For more comprehensive information about each analysis, you can refer to their respective sections in this manual.

Sidebar menu Analyses Abbreviation Package Author
Descriptive tools Pearson’s correlation - base Pearson (1895)
Kendall’s correlation - base Kendall (1938)
Spearman’s correlation - base Spearman (1904)
Principal Component Analysis PCA base Pearson (1901)
Nonmetric Multidimensional Scaling MDS vegan Legendre & Anderson (1999)
Redundancy Analysis RDA vegan Blanchet et al. (2008)
Piecewise Redundancy Analysis pwRDA segRDA Vieira et al. (2019)
Spatial tools Kriging
Inverse Distance Weighting idw gstat Shepard (1968)
K-Nearest neighbor KNN stats Fix & Hodges (1951)
Support Vector Machine with Radial Basis Function Kernel svmRadial kernlab Karatzoglou et al. (2004)
Gaussian Process with Radial Basis Function Kernel gaussprRadial kernlab Karatzoglou et al. (2004)
Support Vector Machine with Radial Basis Function Kernel and Cost Parameter Optimization svmRadialCost kernlab Karatzoglou et al. (2004)
Unsupervised Algorithms Self-Organizing Maps SOM kohonen Kohonen (1982)
Hierarchical Clustering HC factoextra Sneath (1957)
Random Forest rf randomForest Breiman (2001)

Supervised Algorithms

Stochastic gradient boosting gbm gbm Friedman (2001)
Conditional Inference Random Forest cforest party Hothorn et al. (2006)
Recursive Partitioning and Regression Trees rpart rpart Breiman et al. (1984)
Tree models from Genetic Algorithms evtree evtree Grubinger et al. (2014)
Naive Bayes nb klaR Duda et al. (2012)
K-Nearest Neighbors knn base Fix & Hodges (1951)
Self-Organizing Maps xyf kohonen Melssen et al. (2006)
Generalized Linear Model glm base Nelder & Wedderburn (1972)
Gaussian Process with Radial Basis Function Kernel gaussprRadial kernlab Karatzoglou et al. (2004)
Support Vector Machine with Linear Kernel svmLinear kernlab Karatzoglou et al. (2004)
Support Vector Machine with Radial Basis Function Kernel svmRadial kernlab Karatzoglou et al. (2004)
Support Vector Machine with Radial Basis Function Kernel and Cost Parameter Optimization svmRadialCost kernlab Karatzoglou et al. (2004)
Stacked AutoEncoder Deep Neural Network dnn deepnet Vincent et al. (2010)
Model Averaged Neural Network avNNet nnet Ripley (1996)
Neural Network nnet nnet Ripley (1996)
Neural Networks with Feature Extraction pcaNNet nnet Ripley (1996)
Monotone Multi-Layer Perceptron Neural Network monmlp monmlp Lang (2005)
Feature Selection using randomForest Genetic Algorithm rfGA caret Kuhn (2008)

6 Essentials of Building a Workflow in iMESc

In this section, we will cover the four recurring steps to construct a workflow within iMESc:

6.1 Create Datalists Based on Model Specifications

Fig. S6.1 - Conceptual Setup for the Models available in iMESc.
Table S6.1 - Datalist Creation and Selection Based on Model Specifications
Use pre-processing tools to create Datalists with their associated attributes, based on the chosen analytical method.
For unsupervised methods, only the Numeric-Attribute (X) is needed. iMESc automatically recognizes the Numeric-Attribute associated with the Datalist as X.
For supervised classification models, both the Numeric-Attribute (X) and the Factor-Attribute (Y) are needed. iMESc automatically recognizes the Numeric-Attribute and Factor-Attribute associated with the Datalist as X and Y, respectively. X and Y can be from the same or different Datalists.
For supervised regression models, X and Y are both Numeric-Attributes. iMESc automatically recognizes the Numeric-Attributes associated with the Datalist as both X and Y.

6.2 Pre-processing

Table S6.2 - Common Pre-processing steps
Use pre-processing tools to handle missing values (Data imputation tool).
Transform the data as needed (e.g., scaling, centering, log transformations), especially for distance-based methods (e.g., PCA, SOM).
Partition the data into training and testing sets for supervised machine learning methods. This action creates a column in the Factor-Attribute indicating the partition.

6.3 Save changes and models

Saving data changes or trained models is a recurring step throughout iMESc. Whenever saving is required, iMESc will indicate this with a flashing blue disc button.

  • Data changes can be saved as new Datalists or can overwrite existing ones. Factor, Coords, and Shape attributes are automatically transferred to the new Datalist. Models previously saved in a Datalist are not transferred.

  • Trained models are saved within the Datalist used as the predictor (X). After training a model, users have the option to save it as a new model or overwrite an existing one. This action creates a new attribute within the Datalist (e.g., RF-Attribute for Random Forest models).

6.4 Loading and downloading a savepoint

Download a savepoint:

  1. Open the pre-processing tools in iMESc.

  2. Click the “Download” button in the “Create a savepoint” section.

  3. The savepoint file (.rds) will be downloaded to your computer, capturing your workspace, including all Datalists and associated models.

Restore a savepoint:

    1. Go to the pre-processing tools.

    2. In the “Load a savepoint” section, use “Browse” to upload the savepoint file from your computer.

    3. Click “Upload” or “Load” to restore your workspace to that point.

Savepoints are incredibly useful for preserving your analysis progress and results. By downloading a savepoint, you can conveniently store your work; later, by uploading it, you can seamlessly continue your analysis from where you left off. This feature ensures that your work remains intact, even if you close the session or access iMESc from a different device.

Using savepoints streamlines your workflow and enhances your overall experience with iMESc, providing a reliable way to manage and preserve your analysis outputs and data for future use.

6.4.1 Extracting Savepoint Results by R Code

You can extract specific results and attributes from a savepoint using R code; iMESc itself is not required for this. Use the code below:

# Reading the Savepoint
savepoint <- readRDS("savepoint.rds")
savepoint$saved_data     # To access all saved Datalists
names(savepoint$saved_data)     # To access the names of the saved Datalists

# Accessing a specific Datalist named "datalist_name"
datalist_name <- savepoint$saved_data[['datalist_name']]
datalist_name

# Accessing specific attributes within the Datalist
attr(datalist_name, "factors")       # To access the Factor-Attribute
attr(datalist_name, "coords")        # To access the Coords-Attribute
attr(datalist_name, "base_shape")    # To access the Base-Shape-Attribute
attr(datalist_name, "layer_shape")   # To access the Layer-Shape-Attribute
attr(datalist_name, "extra_shape")   # To access the Extra-Shapes-Attribute

# To extract saved models
attr(datalist_name, "som")                 # To access all SOM models saved in the Datalist
attr(datalist_name, "som")[["model_name"]] # To access a saved SOM model named 'model_name'

# To access other models, replace "som" with the corresponding model name:
# 'kmeans' (k-Means), 'nb' (Naive Bayes), 'svm' (Support Vector Machine), 'knn' (k-Nearest Neighbors), 'rf' (Random Forest),
# 'sgboost' (stochastic gradient boosting), 'xyf' (supervised som).

Note: Ensure that you specify the correct path and filename of your savepoint file in the readRDS function. Modify “datalist_name” and “model_name” in the R code to access specific Datalists and saved models, respectively.

7 Pre-processing Tools

Video S7 - Pre-processing tools tutorial

The Pre-processing Tools comprise a suite of functionalities for manipulating and preparing Datalists. These tools assist in refining the data, handling missing values, and generating custom palettes for graphical outputs. Below are the details of each tool:

7.1 Create a Datalist

To begin working with iMESc, you need to create a Datalist, which serves as the foundation for all analytical tasks. Click the “Create Datalist” button to open a modal dialog for Datalist creation. A Datalist can be built from your own uploaded data or generated from the example data provided.

7.1.1 Upload

  • Name the Datalist: Use the text widget to provide a name for the Datalist.

  • Numeric-Attribute: Upload a .csv or .xlsx file containing the numeric variables. This file is mandatory and should include observations as rows and variables as columns. The first row must contain the variable headings, and the first column should have observation labels. Columns containing characters (text or mixed numeric and non-numeric values) will be automatically transferred to the Factor-Attribute.

  • Factor-Attribute: Upload a .csv or .xlsx file containing categorical variables. This file should have observations as rows and categorical variables as columns. The first row must contain variable headings, and the first column should have observation labels. If the Factor-Attribute is not uploaded, the observation IDs will be used automatically. This attribute is crucial for labeling, grouping, and visualizing results based on factor levels. It can be replaced at any time with a new one using the “Replace Factor-Attribute” button.

  • Coords-Attribute: Upload a .csv or .xlsx file containing geographical coordinates. This file is optional for creating a Datalist but required for generating maps. The first column should contain the observation labels, the second column Longitude values, and the third column Latitude values (both in decimal degrees). The first row must contain the coordinate headings.

  • Base-Shape: Upload a single R file containing the polygon shape, such as an oceanic basin outline, to be used primarily with ggplot2 for map generation. This optional file provides the foundational geographical context for your visualizations. It can be generated using the SHP toolbox in the pre-processing tools, which converts shapefiles (.shp, .shx, and .dbf files) into an R file suitable for use as a base layer in ggplot2.

  • Layer-Shape: Upload a single R file containing an additional shape layer, such as a continent shape, to be used primarily with ggplot2 for map generation. This optional file can also be created using the SHP toolbox available in the pre-processing tools.

Gif S7.1 - Creating a Datalist from Upload

Best practices when uploading your sheet (a short sketch follows this list):

  1. Prepare your data: Use the first row as column headers and the first column as observation labels.

  2. Ensure each label is filled with unique information, removing any duplicated names.

  3. Check for empty cells in the observation label column.

  4. Ensure that the column names are unique; duplicated names are not allowed.

  5. Avoid using cells with blank spaces or special symbols.

  6. Avoid beginning variable names with a number.

  7. Note that R is case-sensitive, so “name” is different from “Name” or “NAME.”

  8. Avoid blank rows and/or columns in your data.

  9. Replace missing values with NA (not available).
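
A minimal sketch of a compliant sheet written from R; the file name, IDs, and variables below are hypothetical and for illustration only:

env <- data.frame(
  ID    = c("st1", "st2", "st3"),  # unique observation labels, no duplicates or blanks
  depth = c(5.2, 10.1, NA),        # missing value coded as NA; decimals use dots
  mud   = c(35.0, 42.5, 18.3)      # variable names do not start with a number
)
write.csv(env, "envi_example.csv", row.names = FALSE)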

7.1.2 Use Example Data

This option allows users to explore the example data available in iMESc. After clicking “Create a Datalist,” select the “Use example data” radio button to proceed with Datalist insertion. This action will insert two Datalists from Araçá Bay, located on the southeastern coast of Brazil:

  • envi_araca: Contains 141 samples with 9 environmental variables.

  • nema_araca: Contains 141 samples with 194 free-living marine nematode species.

Both Datalists comprise five attributes: Numeric, Factor, Coords, Base-Shape, and Layer-Shape. Studies that explored these data include Corte et al. (2017), Checon et al. (2018), and Vieira et al. (2021).

Gif S7.2 - Creating a Datalist from Example Data

7.2 Options

Fig. S7.2.1 - Options

This drop-down menu offers the user a range of tools for editing Datalists.

7.2.1 Rename Datalist

Change the name of a selected Datalist.

7.2.2 Merge Datalists

Combine two or more Datalists by columns or rows. This action affects both Numeric and Factor-Attribute data. When merging by rows, it also combines associated Coords-Attribute (if any). When merging by columns, there is an option to fill missing columns with NA, or restrict the Datalists to common columns.

Please note that saved models in one of the Datalists are not transferred to the merged Datalist.

7.2.3 Exchange Factor/Variables

Video S7.2.3 - Exchange Factors/Variables tutorial

The “Exchange Factors/Variables” functionality in iMESc allows you to convert or transfer data between numeric and factor formats. This powerful tool provides flexibility in handling your data and enables transitions between different data types.

  1. From Datalist Selector: Select the source Datalist from which you want to exchange data.

  2. From Attribute Selector: Within the selected Datalist, choose between the Numeric or Factor Attribute that you wish to convert or transfer.

  3. To Datalist Selector: Select the target Datalist where you want to transfer or convert the data.

  4. To Attribute Selector: Within the target Datalist, specify whether you want to convert the data to Numeric or Factor format.

From Numeric…

  • To Numeric: This option allows you to copy or transfer the selected numeric variables from the source Datalist to the target Datalist while preserving their numeric format.

  • To Factor: Convert the selected numeric variables to factors. The default is to transform each unique numeric value into a new level of the factor. You can use the “cut” option to categorize the variables into specified bins or levels. The initial guess of bins can be determined by three methods: Sturges’, Scott’s, or Freedman-Diaconis. Additionally, you can manually define the number of bins and edit the names and order of the factor levels (see the sketch after these lists).

From Factor…

  • To Factor: With this option, you can copy or transfer the selected factors from the source Datalist to the target Datalist, maintaining their original factor format.

  • To Numeric: Convert the selected factors to numeric data before copying or moving them. Two types of conversion are available:

    • Binary: For each factor level, a single binary column is created, where 1 indicates the class of that observation.

    • Integer: A single column is created containing the integer codes of the factor levels.
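
The sketch below reproduces these conversions outside iMESc; the vector x and all object names are hypothetical, and Sturges’ rule stands in for the three binning methods:

x <- c(0.5, 1.2, 3.8, 2.1, 0.9, 4.4)

# Numeric -> Factor with the "cut" option; initial bin count from Sturges' rule
f <- cut(x, breaks = nclass.Sturges(x))

# Factor -> Numeric, "Binary": one 0/1 column per factor level
model.matrix(~ f - 1)

# Factor -> Numeric, "Integer": integer codes of the factor levels
as.integer(f)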

7.2.4 Replace Attributes

The “Replace Attribute” option allows users to update existing Attributes within a Datalist by replacing them with new data from a CSV file.

7.2.5 Edit Datalist Columns

Modify the names of columns for both Numeric-Attributes and Factor-Attributes, and remove columns.

7.2.6 Edit Model names

Allows you to edit the names of saved models.

7.2.7 Transpose a Datalist

Rotate a Datalist (Numeric and Factor) from rows to columns. If a Coords-Attribute is associated with the Datalist, it will be removed.

7.2.8 SHP toolbox

Video S7.2.8 - SHP toolbox tutorial

This toolbox allows the creation of Base-Shapes, Layer-Shapes, and Extra-Shapes.

Targets & Upload

  1. Select the Target Shape-Attribute: Base-Shape, Layer-Shape, or Extra-Shape.
  2. Select the Target Datalist.
  3. Upload the shape files* at once.

shape files*

Shapefiles are a simple, nontopological format for storing geometric location and attribute information of geographic features. The shapefile format defines the geometry and attributes of geographically referenced features in three or more files with specific file extensions that should be stored in the same project. It requires at least three files:

.shp: The main file that stores the feature geometry.

.shx: The index file that stores the index of the feature geometry.

.dbf: The dBASE table that stores the attribute information of features.

There is a one-to-one relationship between geometry and attributes, which is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file.

Each file must have the same prefix, for example: basin.shp, basin.shx, and basin.dbf
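
Conceptually, reading such a file set in R looks like the sketch below, using the sf package (listed in the Packages & Functions tables); the file name is hypothetical:

library(sf)
basin <- st_read("basin.shp")   # also expects basin.shx and basin.dbf in the same folder
plot(st_geometry(basin))        # quick visual check of the polygon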

Filter & Crop

A setup box that appears after uploading and reading the shape files. Options include filtering specific features, cropping to an existing shape attribute, or manual cropping.

Create & Save

A setup box displayed after uploading and reading the shape files. Use it to save the new shape in the target Datalist.

7.2.9 Run Script

Execute custom R scripts using user-created Datalists within iMESc. Saved Datalists are accessible from the saved_data object.

Example:

names(saved_data) # Lists the names of the Datalists.
attr(saved_data[["nema_araca"]],"factors") # access the Factor-Attribute, where 'nema_araca' is the Datalist name
attr(saved_data[["nema_araca"]],"coords") # access the Coords-Attribute

Permanently modify iMESc objects:

names(vals$saved_data)[1] <- "new name" #Modifies the name of the first Datalist.

7.2.10 Datalist Manager

Manage saved Datalists and their attributes. The manager displays the size of each Datalist and provides options for deleting attributes.

7.2.11 Delete Datalists

Remove a Datalist entirely.

7.3 Filter observations

This tool allows manipulating numeric attributes by filtering observations based on certain criteria (two of the filters are sketched after the list). The available options are:

  • Individual row selection: Manually select observations using Datalist IDs.

  • Na.omit: Remove all rows with any empty (NA) cells.

  • Remove Zero Variance: Remove rows with near-zero variance.

  • Match IDs with Datalist: Constrain the target Datalist to observations (IDs) from another Datalist.

  • Filter by Factors: Filter observations using a tree structured by the levels of the Factor-Attribute. You can click on the nodes to expand and select the factor levels. This function is available for factors with fewer than 100 levels.
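
For reference, the sketch below shows base-R equivalents of two of these filters on a hypothetical Numeric-Attribute:

df <- data.frame(a = c(1, NA, 3), b = c(4, 5, 6),
                 row.names = c("st1", "st2", "st3"))

na.omit(df)                  # "Na.omit": drop rows containing any NA

ids <- c("st1", "st3")       # IDs from another, hypothetical Datalist
df[rownames(df) %in% ids, ]  # "Match IDs with Datalist"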

7.4 Filter variables

This tool allows manipulating the Numeric-Attribute by filtering variables (the caret-based filters are sketched after the list). The available options are:

  • Individual Selection: Manually select specific variables (columns) to keep or remove from the Datalist.
  • Value-based removal: Remove numeric variables contributing less than a specified percentage of the total sum across all observations. The methods for this option are:
    • Abund<: Remove variables with a total value less than x-percent of the total sum across all observations. This is useful to exclude variables with low overall contribution.
    • Freq<: Remove variables that occur in less than x-percent of the total number of observations. This is helpful when you want to exclude rarely occurring variables.
    • Singletons: Remove variables that occur only once in the dataset. This option is relevant for counting data and helps eliminate variables with no meaningful variation.
  • Correlation-based removal: This option uses the findCorrelation function from the ‘caret’ package. It considers the absolute values of pair-wise correlations between variables. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable, considering the whole data, and removes the variable with the largest mean absolute correlation. The exact argument is set to TRUE, meaning that the function re-evaluates the average correlations at each step.
  • Remove Zero Variance: Remove columns with zero variance.
  • Remove near Zero Variance: This option uses the nearZeroVar function from the caret package. It identifies and removes near zero variance predictors. Predictors with near-zero variance have either zero variance (only one unique value) or very few unique values relative to the number of samples, with a large frequency ratio between the most common and second most common values. Removing such predictors can help eliminate features that do not contribute much information.
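
The sketch below shows the underlying caret calls on hypothetical data; iMESc builds the equivalent calls from the widget inputs:

library(caret)
set.seed(1)
x <- data.frame(a = rnorm(50), b = rnorm(50))
x$c <- x$a + rnorm(50, sd = 0.01)                # nearly collinear with 'a'
x$d <- rep(1, 50)                                # zero variance

nzv <- nearZeroVar(x)                            # zero / near-zero variance columns
x2  <- if (length(nzv)) x[, -nzv] else x

high <- findCorrelation(cor(x2), cutoff = 0.9, exact = TRUE)
x3   <- if (length(high)) x2[, -high] else x2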

7.5 Transformations

The “Transformations” tool enables preprocessing of the Numeric-Attribute using various transformation methods.

7.5.1 Transformation

Provides a wide range of transformation options (several are sketched after the list):

  1. None: No Transformation. Select this option if you do not want to apply any transformation to the Numeric-Attribute.

  2. Log2: Logarithmic base 2 transformation as suggested by Anderson et al. (2006). It follows the formula log_b (x) + 1 for x > 0, where ‘b’ is the base of the logarithm. Zeros are left as zeros. Higher bases give less weight to quantities and more to presences, and logbase = Inf gives the presence/absence scaling. Note that this is not log(x+1).

  3. Log10: Logarithmic base 10 transformation as suggested by Anderson et al. (2006). It follows the formula log_b (x) + 1 for x > 0, where ‘b’ is the base of the logarithm. Zeros are left as zeros. Higher bases give less weight to quantities and more to presences, and logbase = Inf gives the presence/absence scaling. Note that this is not log(x+1).

  4. Total: Divide by the row (observation) total. This transformation scales the values based on the total sum of each observation.

  5. Max: Divide by the column (variable) maximum. This transformation scales the values based on the maximum value of each variable.

  6. Frequency: Divide by the column (variable) total and multiply by the number of non-zero items, so that the average of non-zero entries is one. This transformation scales the values based on the frequency of occurrence.

  7. Range: Standardize column (variable) values into the range 0 … 1. If all values are constant, they will be transformed to 0. This transformation brings the values to a common scale.

  8. Pa: Scale x to presence/absence scale (0/1). This transformation converts the values to binary (0 for absence, 1 for presence).

  9. Chi.square: Divide by row sums and square root of column sums and adjust for the square root of the matrix total. This transformation is relevant for specific statistical analyses.

  10. Hellinger: Square root of method = total. This transformation is used for certain distance calculations.

  11. Sqrt2: Square root transformation. This transformation takes the square root of each value.

  12. Sqrt4: 4th root transformation. This transformation takes the 4th root of each value.

  13. Log2(x+1): Logarithmic base 2 transformation of (x+1). This is a variant of the log2 transformation that adds 1 before taking the logarithm.

  14. Log10(x+1): Logarithmic base 10 transformation of (x+1). This is a variant of the log10 transformation that adds 1 before taking the logarithm.

  15. BoxCox: Designed for non-negative responses. The Box-Cox transformation is a family of power transformations that converts non-normally distributed data to a set of data with an approximately normal distribution.

  16. YeoJohnson: Like the Box-Cox model, but it can accommodate predictors with zero and/or negative values. This is another family of power transformations.

  17. ExpoTrans: Exponential transformation. This transformation applies the exponential function to each value.
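
Several of these options correspond to methods of vegan::decostand (vegan appears in the Packages & Functions tables); a short sketch on vegan’s bundled example data:

library(vegan)
data(varespec)  # example community matrix shipped with vegan

decostand(varespec, method = "log", logbase = 2)  # Log2; zeros are left as zeros
decostand(varespec, method = "total")             # divide by row totals
decostand(varespec, method = "hellinger")         # Hellinger
decostand(varespec, method = "pa")                # presence/absence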

7.5.2 Scale and Centering

This tool uses the base-R function scale for scaling and centering operations (sketched after the list). You have the following options:

  • Scale: If checked, scaling is done by dividing the (centered) columns of “x” either by their standard deviations (if center is TRUE) or by the root mean square (if center is FALSE).

  • Center: If checked, centering is done by subtracting the column means (omitting NAs) of “x” from their corresponding columns.
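
The underlying base-R call, on a hypothetical numeric matrix:

x <- matrix(rnorm(20), ncol = 2)
scale(x, center = TRUE, scale = TRUE)  # subtract column means, divide by column SDs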

7.6 Data imputation

This tool provides methods for completing missing values with values estimated from the observed data. It is available only for Datalists that contain missing data either in the Numeric-Attribute or in the Factor-Attribute. The function preProcess from the caret package is used for imputation. To impute missing values, follow these steps:

  1. Choose the Target-Attribute.

  2. Pick a Method (described below).

  3. Click the blue “Flash” button. The “Save Changes” dialog will automatically pop-up.

  4. Save the Datalist with imputed values as a new Datalist or replace an existing one.

Methods for imputation (the caret and mice routes are sketched after the list):

  • Knn (caret): k-nearest neighbor imputation is only available for the Numeric-Attribute. It is carried out by finding the k closest samples (Euclidean distance) in the dataset. This method automatically centers and scales your data.

  • Bagimpute (caret): Only available for the Numeric-Attribute. Imputation via bagging fits a bagged tree model for each predictor (as a function of all the others). This method is simple, more accurate than median imputation, and accepts missing values, but it has a much higher computational cost.

  • MedianImpute (caret): Only available for the Numeric-Attribute. Imputation via medians takes the median of each predictor in the training set and uses them to fill missing values. This method is simple, fast, and accepts missing values but treats each predictor independently, which may lead to inaccuracies.

  • pmm (mice): Predictive mean matching (PMM) is available for both Numeric and Factor Attributes. It involves selecting observations with the closest predicted values as imputation candidates. This method maintains the distribution and variability of the data, making it suitable for data that is normally distributed.

  • rf (mice): Random forest imputation is available for both Numeric and Factor-Attributes. It uses an ensemble of decision trees to predict missing values. This non-parametric method can handle complex interactions and nonlinear relationships but may be computationally intensive.

  • cart (mice): Classification and regression trees (CART) imputation is available for both Numeric and Factor-Attributes. It applies decision trees for imputation, splitting the data into subsets that then form a prediction model.
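
A minimal sketch of the caret and mice routes on a small, hypothetical data frame:

library(caret)
library(mice)

df <- data.frame(a = c(1, NA, 3, 4, 6, 7),
                 b = c(2, 5, NA, 8, 4, 3),
                 c = c(9, 7, 6, 5, 2, 1))

# caret route: median imputation (knnImpute and bagImpute are selected the same way)
pp <- preProcess(df, method = "medianImpute")
predict(pp, df)

# mice route: predictive mean matching
imp <- mice(df, method = "pmm", m = 1, printFlag = FALSE)
complete(imp)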

7.7 Data partition

Data partitioning is a critical step in evaluating machine learning models. Creating distinct training and testing sets allows for accurate assessment of the model’s performance on unseen data, avoiding issues like overfitting and yielding more reliable performance metrics. In iMESc, the Data Partition tool adds the partition as a factor in the Factor-Attribute. It uses the createDataPartition function from the caret package (sketched after the list). Users can specify the percentage of observations to be used for the test set and choose between the following methods:

  • Balanced Sampling: Ensures balanced distributions within the splits for classification or regression models.

    • For classification models, random sampling is done within the levels of the target variable (y) to balance class distributions within the splits.

    • For regression models, samples are divided into sections based on percentiles of the numeric target variable (Y), with sampling performed within these subgroups.

  • Random Sampling: simple random sampling is used.
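
The underlying caret call, sketched with a hypothetical factor target:

library(caret)
y <- factor(c("A", "A", "B", "B", "A", "B", "A", "B"))
set.seed(1)
idx <- createDataPartition(y, p = 0.75, list = FALSE)  # sampled within levels of y
y_train <- y[idx]
y_test  <- y[-idx]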

7.8 Aggregate

The “Aggregate” tool utilizes the aggregate function from R base. This process involves aggregating individual cases of the Numeric-Attribute based on a grouping factor.

The tool offers various calculation options to aggregate the data (the underlying call is sketched after the list):

  • Mean: Calculates the mean of each group (selected factor).

  • Sum: Calculates the sum of values for each group (selected factor).

  • Median: Calculates the median of each group (selected factor).

  • Var: Calculates the variance of each group (selected factor).

  • SD: Calculates the standard deviation of each group (selected factor).

  • Min: Retrieves the minimum value for each group (selected factor).

  • Max: Retrieves the maximum value for each group (selected factor).
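
The underlying base-R call, sketched with a hypothetical Numeric-Attribute and grouping factor:

df    <- data.frame(depth = c(5, 10, 7, 12), mud = c(30, 45, 25, 50))
group <- factor(c("bay", "bay", "channel", "channel"))
aggregate(df, by = list(group = group), FUN = mean)  # one row per factor level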

7.9 Create Palette

The “Create Palette” tool utilizes the colourpicker tool from the colourpicker package, enabling users to interactively select colors for their palette. Subsequently, iMESc employs colorRampPalette to generate customized color palettes suitable for graphical outputs.
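
A short sketch of how the generated palettes behave (the colors here are arbitrary):

pal <- colorRampPalette(c("#2166AC", "#F7F7F7", "#B2182B"))
pal(10)  # ten interpolated hex colors, ready for a graphical output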

Gif S7.9 - Create Palette

7.10 Savepoint

  • Create: Creates a Savepoint, a single R object (.rds) that is downloaded and can later be reloaded or shared to restore the workspace.

  • Restore: Upload a Savepoint (.rds file) to restore the workspace.

Fig. S7.10.1 -

9 Packages & Functions

In this section, we present the key packages and functions used in the development of iMESc. While iMESc utilizes a wide range of packages and functions, this section highlights some of the key packages and their corresponding functions that play a crucial role throughout the entire app and in various analytical tasks. Please note that these tables might not be exhaustive, but they cover the most relevant packages and functions used in the iMESc software. The version numbers provided are subject to change with future package updates.

9.1 Table 1: Packages and Functions for Analytical Tasks

In this table, we highlight the packages and functions used for various analytical tasks within iMESc.

Package Version Functions Task
automap 1.1.9 autofitVariogram Spatial Tools
aweSOM 1.3 somDist, somQuality Self-Organizing Maps
caret 6.0.94 createDataPartition, findCorrelation, confusionMatrix, gafsControl, getModelInfo, postResample, varImp, MAE, multiClassSummary, RMSE, train, trainControl Supervised Algorithms, Pre-processing tools
dendextend 1.17.1 as.ggdend, color_branches, get_leaves_branches_col, heights_per_k.dendrogram, highlight_branches_lwd, labels_colors, prepare.ggdend, theme_dendro Hierarchical Clustering
GGally 2.2.1 ggally_cor, ggally_densityDiag, ggally_points, ggpairs, ggally_barDiag Descriptive Tools
ggforce 0.4.1 geom_arc_bar Self-Organizing Maps
ggparty 1.0.0 geom_edge, geom_edge_label, geom_node_info, geom_node_plot, geom_node_splitvar, ggparty Supervised Algorithms
ggraph 2.1.0 geom_edge_diagonal, geom_edge_link, geom_node_label, geom_node_point, geom_node_text, ggraph Supervised Algorithms
ggridges 0.5.6 geom_density_ridges Descriptive Tools
gstat 2.1.1 gstat, gstat.cv, variogramLine, vgm, idw Spatial Tools
kernlab 0.9.32 ksvm, sigest Supervised Algorithms
klaR 1.7.3 dkernel Supervised Algorithms
kohonen 3.0.12 getCodes, object.distances, somgrid, supersom, unit.distances, map, check.whatmap, nunits, classvec2classmat, classmat2classvec, add.cluster.boundaries, dist2WU Self-Organizing Maps
lattice 0.21.9 bwplot, densityplot, dotplot, parallelplot, splom, trellis.par.set, xyplot Compare Models
leaflet 2.2.1 leafletOutput, renderLeaflet Spatial Tools
Metrics 0.1.4 mae, mape, mse, rmse Supervised Algorithms
mice 3.16.0 complete, mice Pre-processing tools
NeuralNetTools 1.5.3 olden, neuralweights Supervised Algorithms
party 1.3.14 prettytree Supervised Algorithms
partykit 1.2.20 as.party, as.partynode, gettree Supervised Algorithms
pdp 0.8.1 partial, exemplar, plotPartial Supervised Algorithms
plot3D 1.4.1 persp3D, perspbox Spatial Tools
plotly 4.10.4 add_surface, add_trace, plot_ly, plotlyOutput, renderPlotly, style Spatial Tools, Supervised Algorithms
randomForest 4.7.1.1 importance, getTree Supervised Algorithms
randomForestExplainer 0.10.1 important_variables, min_depth_interactions, plot_importance_rankings, plot_min_depth_interactions, measure_importance, min_depth_distribution, plot_multi_way_importance Supervised Algorithms
raster 3.6.26 crop, extent, mask, raster, rasterize, rasterToPoints, values, writeRaster, crs, rasterFromXYZ, ratify Spatial Tools
segRDA 1.0.2 bp, extract, OrdData Descriptive tools
sf 1.0.15 st_as_sf, st_bbox, st_crs, st_set_crs, st_transform, st_cast, st_coordinates, st_geometry_type, st_point, st_sfc Spatial Tools
shinyTree 0.3.1 get_selected, renderTree, shinyTree Pre-processing tools
sp 2.1.3 coordinates, zerodist, CRS, spsample Spatial Tools
vegan 2.6.4 decostand, diversity, estimateR, fisher.alpha, specnumber Diversity tools
webshot 0.5.5 webshot Spatial Tools

9.2 Table 2: Packages and Functions Used Throughout the App

This table presents the packages and their respective functions that are utilized across the entire app. These packages are essential for data manipulation, visualization, interactive features, and more.

Package Version Functions
base64enc 0.1.3 dataURI
colorspace 2.1.0 hex2RGB, mixcolor
colourpicker 1.3.0 colourInput
data.table 1.15.0 melt, rbindlist
devtools 2.4.5 source_url
DT 0.32 DTOutput, renderDT, datatable, dataTableOutput, formatStyle, renderDataTable, JS
e1071 1.7.14 allShortestPaths
foreach 1.5.2 foreach
gbRd 0.4.11 Rd_fun
ggnewscale 0.4.10 new_scale_fill, new_scale_color, new_scale
ggplot2 3.5.1 discrete_scale, element_rect, geom_sf, ggplot, layer, scale_color_gradientn, theme, aes, coord_cartesian, coord_flip, element_blank, element_line, element_text, geom_label, geom_segment, geom_text, ggtitle, margin, scale_colour_identity, scale_linetype_identity, scale_size_identity, scale_x_discrete, scale_y_reverse, theme_bw, theme_classic, theme_dark, theme_grey, theme_light, theme_linedraw, theme_minimal, theme_void, xlab, ylab, coord_fixed, geom_line, geom_point, geom_polygon, geom_vline, guide_legend, guides, scale_color_manual, scale_fill_gradientn, scale_fill_manual, scale_linetype_manual, scale_x_continuous, scale_y_continuous, sec_axis, standardise_aes_names, geom_raster, labs, scale_colour_manual, facet_wrap, geom_freqpoly, vars
ggrepel 0.9.5 geom_label_repel
htmlwidgets 1.6.4 saveWidget
igraph 2.0.2 delete_vertices, graph_from_data_frame, V
MLmetrics 1.1.1 R2_Score, F1_Score, FBeta_Score, Gini, MAE, MAPE, MedianAE, MedianAPE, MSE, NormalizedGini, Poisson_LogLoss, RAE, RMSE, RMSLE, RMSPE, RRSE
pROC 1.18.5 multiclass.roc
purrr 1.0.2 walk2
RColorBrewer 1.1.3 brewer.pal
readr 2.1.4 read_file
readxl 1.4.3 cell_cols, excel_sheets, read_excel
reshape 0.8.9 melt
reshape2 1.4.4 melt
scales 1.3.0 cbreaks, extended_breaks, rescale, col_numeric, label_number
shiny 1.8.0 actionButton, actionLink, callModule, checkboxInput, column, div, downloadButton, downloadHandler, downloadLink, em, eventReactive, fluidRow, HTML, icon, insertTab, isolate, modalButton, modalDialog, moduleServer, navbarPage, need, NS, numericInput, observe, observeEvent, plotOutput, reactive, reactiveVal, reactiveValues, removeModal, removeTab, renderPlot, renderPrint, renderUI, req, selectInput, showModal, span, strong, tabPanel, tabsetPanel, textInput, uiOutput, updateCheckboxInput, updateNumericInput, updateTabsetPanel, updateTextInput, validate, withProgress, img, getDefaultReactiveDomain, incProgress, a, absolutePanel, br, code, conditionalPanel, h3, h4, h5, htmlOutput, p, renderTable, splitLayout, verbatimTextOutput
shinyBS 0.61.1 bsTooltip, addPopover, bsButton, popify, tipify
shinybusy 0.3.2 add_busy_spinner
shinydashboardPlus 2.0.3 dashboardFooter, dashboardHeader, dashboardPage, dashboardSidebar
shinyjs 2.1.0 delay, hide, onevent, runjs, hidden, useShinyjs, addClass, colourInput, reset, toggle, toggleClass, addCssClass, removeCssClass, toggleState
shinyWidgets 0.8.1 pickerInput, radioGroupButtons, updatePickerInput, updateVirtualSelect, virtualSelectInput, switchInput, updateSwitchInput, dropMenu, pickerOptions, updateRadioGroupButtons
sortable 0.5.0 rank_list
stringr 1.5.1 str_length, str_replace_all
tibble 3.2.1 rownames_to_column

10 Model-specific tuning (Supervised algorithms)

10.1 Random Forest

Parameter Description
ntree Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
replace Should sampling of cases be done with or without replacement?
nodesize Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values are different for classification (1) and regression (5).
maxnodes Maximum number of terminal nodes trees in the forest can have. If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than maximum possible, a warning is issued.
nPerm Number of times the OOB data are permuted per tree for assessing variable importance. Number larger than 1 gives slightly more stable estimate, but not very effective. Currently only implemented for regression.
norm.votes If TRUE (default), the final result of votes are expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs). Ignored for regression.
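
For reference, the sketch below shows how such parameters can be forwarded to randomForest through caret::train; the data, resampling scheme, and values are illustrative only, not iMESc defaults:

library(caret)
set.seed(1)
fit <- train(
  x = iris[, 1:4], y = iris$Species,
  method     = "rf",
  trControl  = trainControl(method = "cv", number = 5),
  tuneLength = 3,
  ntree = 200, nodesize = 1  # passed through to randomForest()
)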

10.2 Naive Bayes

Parameter Description
bw The smoothing bandwidth to be used. The kernels are scaled such that this is the standard deviation of the smoothing kernel. Can also be a character string giving a rule to choose the bandwidth. The default is “nrd0”.
window A character string giving the smoothing kernel to be used. Must partially match one of “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “cosine” or “optcosine”. Default is “gaussian”.
kernel A character string giving the smoothing kernel to be used. Must partially match one of “gaussian”, “rectangular”, “triangular”, “epanechnikov”, “biweight”, “cosine” or “optcosine”. Default is “gaussian”.

10.3 k-Nearest Neighbors

Parameter Description
l Minimum vote for a definite decision, otherwise doubt. Less than k-l dissenting votes are allowed, even if k is increased by ties.
use.all Controls handling of ties. If true, all distances equal to the kth largest are included. If false, a random selection of distances equal to the kth is chosen to use exactly k neighbors.

10.4 Stochastic Gradient Boosting

Parameter Description
bag.fraction The fraction of the training set observations randomly selected to propose the next tree in the expansion. Default is 0.5.

10.5 Self-Organizing Maps

Parameter Description
rlen The number of times the complete data set will be presented to the network.
alpha Learning rate, a vector of two numbers indicating the amount of change. Default is to decline linearly from 0.05 to 0.01 over rlen updates. Not used for the batch algorithm.
maxNA.fraction The maximal fraction of values that may be NA to prevent the row from being removed.
dist.fcts Vector of distance functions to be used for the individual data layers. Default is “sumofsquares” for continuous data, and “tanimoto” for factors.
mode Type of learning algorithm.
normalizeDataLayers Boolean, indicating whether distance.weights should be calculated. If normalizeDataLayers == FALSE, user weights are applied to the data immediately.
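
For reference, these parameters appear as follows in a direct kohonen call; the data, grid, and values are illustrative only:

library(kohonen)
set.seed(1)
x <- scale(as.matrix(iris[, 1:4]))
som_fit <- som(
  x,
  grid  = somgrid(5, 5, "hexagonal"),
  rlen  = 100,            # complete presentations of the data set
  alpha = c(0.05, 0.01),  # learning rate, declining linearly
  dist.fcts = "sumofsquares",
  maxNA.fraction = 0
)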

10.6 Generalized Linear Model

Parameter Description
method The method to be used in fitting the model. The default method “glm.fit” uses iteratively reweighted least squares (IWLS).
singular.ok Logical; if FALSE, a singular fit is an error.
epsilon Positive convergence tolerance; the iterations converge when |dev - dev_old|/(|dev| + 0.1) < epsilon.
maxit Integer giving the maximal number of IWLS iterations.

10.7 Stacked AutoEncoder Deep Neural Network

Parameter Description
activationfun Activation function of hidden unit. Can be “sigm”, “linear” or “tanh”. Default is “sigm”.
learningrate Learning rate for gradient descent. Default is 0.8.
momentum Momentum for gradient descent. Default is 0.5.
learningrate_scale Learning rate will be multiplied by this scale after every iteration. Default is 1.
output Function of output unit. Can be “sigm”, “linear” or “softmax”. Default is “sigm”.
sae_output Function of autoencoder output unit. Can be “sigm”, “linear” or “softmax”. Default is “linear”.
numepochs Number of iterations for samples. Default is 3.
batchsize Size of mini-batch. Default is 100.

10.8 Conditional Inference Random Forest

Parameter Description
teststat A character specifying the type of the test statistic to be applied.
testtype A character specifying how to compute the distribution of the test statistic.
mincriterion The value of the test statistic or 1 - p-value that must be exceeded to implement a split.
savesplitstats A logical determining whether standardized two-sample statistics for split point estimate are saved for each primary split.
ntree Number of trees to grow in a forest.
replace A logical indicating whether sampling of observations is done with or without replacement.
fraction Fraction of number of observations to draw without replacement (only relevant if replace = FALSE).

10.9 Gaussian Process with Radial Basis Function Kernel

Parameter Description
scaled A logical vector indicating the variables to be scaled. Default scales data to zero mean and unit variance.
var The initial noise variance for regression. Default is 0.001.
tol Tolerance of termination criterion. Default is 0.001.

10.10 svmLinear - Support Vector Machines with Linear Kernel

Parameter Description
nu Parameter needed for nu-svc, one-svc, and nu-svr. Sets the upper bound on training error and lower bound on fraction of data points to become Support Vectors (default: 0.2).
epsilon Epsilon in the insensitive-loss function used for eps-svr, nu-svr, and eps-bsvm (default: 0.1).
class.weights A named vector of weights for different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1).
cross If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the model’s quality: accuracy rate for classification and Mean Squared Error for regression.
tol Tolerance of termination criterion (default: 0.001).
shrinking Option whether to use the shrinking-heuristics (default: TRUE).

10.11 svmRadial - Support Vector Machines with Radial Basis Function Kernel

Parameter Description
nu Parameter needed for nu-svc, one-svc, and nu-svr. Sets the upper bound on training error and lower bound on fraction of data points to become Support Vectors (default: 0.2).
epsilon Epsilon in the insensitive-loss function used for eps-svr, nu-svr, and eps-bsvm (default: 0.1).
class.weights A named vector of weights for different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1).
cross If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the model’s quality: accuracy rate for classification and Mean Squared Error for regression.
tol Tolerance of termination criterion (default: 0.001).
shrinking Option whether to use the shrinking-heuristics (default: TRUE).

10.12 svmRadialCost - Support Vector Machines with Radial Basis Function Kernel

Parameter Description
nu Parameter needed for nu-svc, one-svc, and nu-svr. Sets the upper bound on training error and lower bound on fraction of data points to become Support Vectors (default: 0.2).
epsilon Epsilon in the insensitive-loss function used for eps-svr, nu-svr, and eps-bsvm (default: 0.1).
class.weights A named vector of weights for different classes, used for asymmetric class sizes. Not all factor levels have to be supplied (default weight: 1).
cross If an integer value k>0 is specified, a k-fold cross-validation on the training data is performed to assess the model’s quality: accuracy rate for classification and Mean Squared Error for regression.
tol Tolerance of termination criterion (default: 0.001).
shrinking Option whether to use the shrinking-heuristics (default: TRUE).

10.13 avNNet - Model Averaged Neural Network

No specific parameters provided for this model.

10.14 nnet - Neural Network

Parameter Description
linout Switch for linear output units. Default is logistic output units.
entropy Switch for entropy (= maximum conditional likelihood) fitting. Default is least-squares.
censored Variant on softmax, where non-zero targets mean possible classes. For softmax a row of (0, 1, 1) means one example each of classes 2 and 3, but for censored it means one example whose class is only known to be 2 or 3.
skip Switch to add skip-layer connections from input to output.
rang Initial random weights on [-rang, rang]. Value about 0.5 unless inputs are large, in which case rang * max(|x|) should be about 1.
maxit Maximum number of iterations (default: 100).
Hess If true, returns the Hessian of the measure of fit at the best set of weights found.
MaxNWts Maximum allowable number of weights. Increasing MaxNWts will likely slow down fitting.
abstol Stop if the fit criterion falls below abstol, indicating an essentially perfect fit.
reltol Stop if the optimizer is unable to reduce the fit criterion by a factor of at least 1 - reltol.

10.15 pcaNNet - Neural Networks with Feature Extraction

No specific parameters provided for this model.

10.16 rpart - CART

Parameter Description
minsplit The minimum number of observations that must exist in a node in order for a split to be attempted.
minbucket The minimum number of observations in any terminal (leaf) node.
maxcompete Number of competitor splits retained in the output. Useful to know not just which split was chosen, but which variable came in second, third, etc.
maxsurrogate Number of surrogate splits retained in the output. Setting to zero reduces compute time.
usesurrogate How to use surrogates in the splitting process (0 = display only, 1 = use surrogates, 2 = use surrogates for missing primary variables).
xval Number of cross-validations.
surrogatestyle Controls the selection of a best surrogate (0 = total number of correct classifications, 1 = percent correct over non-missing values).
maxdepth Maximum depth of any node of the final tree, with the root node counted as depth 0.

10.17 monmlp - Monotone Multi-Layer Perceptron Neural Network

Parameter Description
hidden2 Number of hidden nodes in the second hidden layer.
iter.max Maximum number of iterations of the optimization algorithm.
n.trials Number of repeated trials used to avoid local minima.
bag Logical variable indicating whether to use bootstrap aggregation (bagging).
max.exceptions Maximum number of exceptions of the optimization routine before fitting is terminated with an error.
method The optimx optimization method to be used.

10.18 mlpML - Multi-Layer Perceptron, with multiple layers

Parameter Description
size Number of units in the hidden layer(s).
maxit Maximum number of iterations to learn.
initFunc The initialization function to use.
initFuncParams The parameters for the initialization function.
learnFunc The learning function to use.
learnFuncParams The parameters for the learning function.
updateFunc The update function to use.
updateFuncParams The parameters for the update function.
hiddenActFunc The activation function of all hidden units.
shufflePatterns Should the patterns be shuffled?

10.19 evtree - Tree Models from Genetic Algorithms

Parameter Description
minbucket The minimum sum of weights in a terminal node.
minsplit The minimum sum of weights in a node in order to be considered for splitting.
maxdepth Maximum depth of the tree. Note that memory requirements increase by the square of the maximum tree depth.

11 References

Blanchet, F. G., Legendre, P., & Borcard, D. (2008). Forward selection of explanatory variables. Ecology, 89(9), 2623–2632. https://doi.org/10.1890/07-0986.1

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & Lopez, A. (2020). A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408, 189–215. https://doi.org/10.1016/j.neucom.2019.10.118

Checon, H. H., Vieira, D. C., Corte, G. N., Sousa, E. C. P. M., Fonseca, G., & Amaral, A. C. Z. (2018). Defining soft bottom habitats and potential indicator species as tools for monitoring coastal systems: A case study in a subtropical bay. Ocean & Coastal Management, 164, 68–78. https://doi.org/10.1016/j.ocecoaman.2018.03.035

Corte, G. N., Checon, H. H., Fonseca, G., Vieira, D. C., Gallucci, F., Domenico, M. Di, & Amaral, A. C. Z. (2017). Cross-taxon congruence in benthic communities: Searching for surrogates in marine sediments. Ecological Indicators, 78, 173–182. https://doi.org/10.1016/j.ecolind.2017.03.031

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification (2nd ed.). John Wiley & Sons.

Fix, E., & Hodges, J. L. (1989). Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. International Statistical Review / Revue Internationale de Statistique, 57(3), 238–247. http://www.jstor.org/stable/1403797

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.

Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN Model-Based Approach in Classification (pp. 986–996). https://doi.org/10.1007/978-3-540-39964-3_62

Gupta, B., Rawat, A., Jain, A., Arora, A., & Dhami, N. (2017). Analysis of Various Decision Tree Algorithms for Classification in Data Mining. International Journal of Computer Applications, 163(8), 15–19. https://doi.org/10.5120/ijca2017913660

Kalcheva, N., Todorova, M., & Marinova, G. (2020). Naive Bayes classifier, decision tree and AdaBoost ensemble algorithm: Advantages and disadvantages. 153–157. https://doi.org/10.31410/ERAZ.2020.153

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30, 81–93.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.

Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). https://doi.org/10.18637/jss.v028.i05

Legendre, P., & Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69(1), 1–24. https://doi.org/10.1890/0012-9615

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.

Melssen, W., Wehrens, R., & Buydens, L. (2006). Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems, 83(2), 99–113. https://doi.org/10.1016/j.chemolab.2006.02.003

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, Series 6, 2(11), 559–572. https://api.semanticscholar.org/CorpusID:125037489

Pearson, K. (1920). Notes on the History of Correlation. Biometrika, 13, 25–45.

Shepard, D. (1968). A two-dimensional interpolation function for irregularly-spaced data. Proceedings of the 1968 23rd ACM National Conference On -, 517–524. https://doi.org/10.1145/800186.810616

Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27(2), 125–140.

Sneath, P. H. A. (1957). The application of computers to taxonomy. Journal of General Microbiology, 17(1), 201–226.

Spearman, C. (1987). The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 100(3/4), 441–471. http://www.jstor.org/stable/1422689

Vieira, D. C., Brustolin, M. C., Ferreira, F. C., & Fonseca, G. (2019). segRDA: An R package for performing piecewise redundancy analysis. Methods in Ecology and Evolution, 10(12), 2189–2194. https://doi.org/10.1111/2041-210X.13300

Vieira, D. C., Gallucci, F., Corte, G. N., Checon, H. H., Zacagnini Amaral, A. C., & Fonseca, G. (2021). The relative contribution of non-selection and selection processes in marine benthic assemblages. Marine Environmental Research, 163, 105223. https://doi.org/10.1016/j.marenvres.2020.105223

Yao, M., Zhu, Y., Li, J., Wei, H., & He, P. (2019). Research on Predicting Line Loss Rate in Low Voltage Distribution Network Based on Gradient Boosting Decision Tree. Energies, 12(13), 2522. https://doi.org/10.3390/en12132522