iMESchelp!

Introduction

Welcome to iMESchelp, your comprehensive guide to using iMESc: an interactive machine learning app designed for analyzing environmental data. iMESc is a Shiny-based application that enables end-to-end machine learning workflows. It provides a wide range of resources to meet the varied needs of environmental scientists, making it a versatile tool for data analysis.

This manual is organized into the following sections:

  1. Setup: Step-by-step instructions to run iMESc.

  2. Layout: The dashboard organization.

  3. Widgets: The interactive widgets used in iMESc, such as buttons, dropdowns, and checkboxes, which enable seamless interactions with the app.

  4. Datalist: Exploring the core concept of Datalists and their attributes.

  5. Essentials of Building a Workflow in iMESc: Generic workflow steps commonly encountered while using iMESc.

  6. Pre-processing Tools: In-depth coverage of the tools available for pre-processing data, including Datalist creation and data transformation.

  7. Sidebar-menu: Details about the modules, analyses and algorithms, parametrization, and results.

  8. Packages & Functions: Details about the main R packages and functions used in iMESc, along with their respective versions and analytical tasks.

1 Setup

  1. Install R and RStudio if you haven’t done so already;

  2. Once installed, open RStudio;

  3. Install the shiny package if it is not already installed:

install.packages('shiny')
  4. Run the code below.

shiny::runGitHub('iMESc','DaniloCVieira', ref='main')

When you run iMESc for the first time, it will automatically install all the necessary packages, which may take several minutes to complete. Once this first installation is finished, subsequent launches are much faster: if the required packages are not yet loaded, they typically take several seconds to load; if they are already loaded, iMESc starts almost instantly.

2 Layout

iMESc is designed with a dashboard layout, consisting of three main sections:

  1. Pre-processing tools on the top-left, containing widgets for Datalist options and data pre-processing.

  2. Sidebar-menu on the left-hand side containing menu buttons.

  3. The Main panel for viewing the analytical tasks.

Upon selecting a menu button, users will seamlessly navigate to a sub-screen housing the selected module. Each module features a header equipped with interactive widgets, along with multiple tab panels that support various functionalities.

To ensure an optimal display of iMESc content, we strongly recommend a minimum landscape resolution of 1377 x 768 pixels. This resolution guarantees an enhanced user experience and proper visualization of all elements on the screen.

Fig S3.1 - iMESc layout

3 Widgets

The app is built using widgets: web elements that users interact with. The standard iMESc widgets are:

Table S4.1 - iMESc widgets
Widget | Task
Button | Performs an action when clicked
Picker/Dropdown | Allows the user to select one option from a predefined list
Checkbox | An interactive box that can be toggled by the user to indicate an affirmative or negative choice
Checkbox group | A group of checkboxes
Radio buttons | Allows the user to choose only one of a predefined set of mutually exclusive options
File | A file upload control
Numeric | A field for entering numbers
Text | A field for entering text

4 Datalist

iMESc manages data through Datalists (Fig. S4.1), which can include sheets and shapefiles (user-provided). The sheets are internally treated as data.frame objects in R, where rows represent observations and columns represent variables. Matching observations among these attributes, regardless of the Datalists they come from, is achieved through the row names, ensuring data consistency. To ensure proper handling of the sheet attributes, you must provide a unique ID in the first column when uploading the data. iMESc automatically removes this first column and uses it as the row names of the respective attribute. Decimals in iMESc are represented with dots (e.g., 1/2 = 0.5); check this before uploading a file.
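A minimal base-R sketch of this ID handling (column names and values are hypothetical):

```r
# A sheet as uploaded: the first column holds unique observation IDs
sheet <- data.frame(ID    = c("st01", "st02", "st03"),
                    depth = c(5.2, 10.1, 7.4),
                    sand  = c(63.0, 41.5, 55.2))

rownames(sheet) <- sheet[[1]]  # the ID column becomes the row names
sheet <- sheet[, -1]           # ...and is removed from the data

sheet["st02", "depth"]         # observations are then matched by row name
```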

Required

Numeric-Attribute: Numerical data with continuous or discrete variables.

Optional

Factor-Attribute: Categorical data. If not provided, iMESc automatically generates this attribute as a sheet with a single column containing the IDs of the Numeric-Attribute.
Coords-Attribute: Geographical coordinates (Longitude, Latitude) in decimal degrees, used for spatializing data within the Spatial Tools.
Base-Shape-Attribute: A polygon shape to clip or interpolate spatialized data for map generation.
Layer-Shape-Attribute: Adds extra shapes for superimposition on maps.
Extra-Shape-Attribute: Users can add other shapes as Extra-Shape-Attributes to further customize their maps.

Furthermore, Datalists can store models trained in iMESc, permitting users to integrate and manage predictive models along with their datasets.

Fig. S4.1 - Schematic representation of a Datalist and its associated attributes

To access all the available analyses in iMESc, you need to create a Datalist by either uploading your own data or using the example data provided. For guidance on uploading sheets and the required attribute formats, as well as how to utilize the example Datalists, please refer to the “Creating a Datalist” section.

5 Main Analyses

The table provided below serves as a convenient reference for the main analyses utilized in iMESc. It displays their locations in the sidebar menu, along with their abbreviations and corresponding packages. For more comprehensive information about each analysis, you can refer to their respective sections in this manual.

Table S6 - Overview of iMESc Analysis Techniques
Sidebar menu | Analysis | Abbreviation | Package | Author
Unsupervised | Self-Organizing Maps | SOM | kohonen | Kohonen (1982)
Unsupervised | Hierarchical Clustering | HC | factoextra | Sneath (1957)
Unsupervised | K-Means | K-Means | class | MacQueen (1967)
Supervised | Naive Bayes | NB | klaR | Duda et al. (2012)
Supervised | Support Vector Machine | SVM | kernlab | Cortes & Vapnik (1995)
Supervised | K-Nearest Neighbor | KNN | stats | Fix & Hodges (1951)
Supervised | Random Forest | RF | randomForest | Breiman (2001)
Supervised | Stochastic Gradient Boosting | GBM | gbm | Friedman (2001)
Supervised | Self-Organizing Maps (supervised) | XYF | kohonen | Shepard (1962)
Descriptive tools | Pearson's correlation | - | base | Pearson (1895)
Descriptive tools | Kendall's correlation | - | base | Kendall (1938)
Descriptive tools | Spearman's correlation | - | base | Spearman (1904)
Descriptive tools | Principal Component Analysis | PCA | base | Pearson (1901)
Descriptive tools | Nonmetric Multidimensional Scaling | MDS | vegan | Legendre & Anderson (1999)
Descriptive tools | Redundancy Analysis | RDA | vegan | Blanchet et al. (2008)
Descriptive tools | Piecewise Redundancy Analysis | pwRDA | segRDA | Vieira (2019)
Spatial tools | Inverse Distance Weighting | IDW | - | Shepard (1968)
Spatial tools | Mantel Correlogram | - | vegan | Mantel (1967)

6 Essentials of Building a Workflow in iMESc

In this section, we will cover the four recurring steps to construct a workflow within iMESc:

6.1 1. Create Datalists Based on Model Specifications

  • Use pre-processing tools to create Datalists with their associated attributes based on the chosen analytical method.

  • For unsupervised methods, only Numeric-Attribute (X) is needed.

  • For classification models, both Numeric-Attribute (X) and Factor-Attribute (Y) are needed. X and Y can be from the same or different Datalists.

  • For regression models, X and Y are both Numeric-Attributes from different Datalists.

*For the Naive Bayes Algorithm, Y is always the Factor-Attribute, and X can be either Factor-Attribute or Numeric-Attribute.

Fig. S6.1 - Conceptual setup for the models available in iMESc.

6.2 2. Pre-processing

  • Use pre-processing tools to handle missing values (Data imputation tool).

  • Transform the data as needed (e.g., scaling, centering, log transformations), especially for distance-based methods (e.g., PCA, SOM).

  • Partition the Y data between training and testing for supervised machine learning methods. This action creates a column in the Factor Attribute indicating the partitioning.

6.3 3. Save changes and models

Saving data changes or trained models is a recurring step throughout iMESc. Whenever saving is required, iMESc will indicate this with a flashing-blue disc button.

  • Saving data changes can be done as new Datalists or overwrite existing ones. Factor, Coords, and Shapes attributes are automatically transferred to the new Datalist. Models previously saved in a Datalist are not transferred.

  • Trained models are saved within the Datalist used as the predictor (X). After training a model, users have the option to save it as a new model or overwrite an existing one. This action creates a new attribute within the Datalist (e.g., RF-Attribute for Random Forest models).

    Save changes after a Transformation.
Train and save a model

6.4 4. Loading and downloading a save-point

  • Download a save-point:

    1. Open the pre-processing tools in iMESc.

    2. Click the “Download” button in the “Create a savepoint” section.

    3. The save-point file (.rds) will be downloaded to your computer, capturing your workspace, including all the Datalists and associated models.

  • Restore a save-point:

    1. Go to the pre-processing tools.

    2. In the “Load a savepoint” section, use “Browse” to upload the save-point file from your computer.

    3. Click “Upload” or “Load” to restore your workspace to that point.

Save-points are incredibly useful for preserving your analysis progress and results. By downloading a save-point, you can conveniently store your work, and later, by uploading it, you can seamlessly continue your analysis from where you left off. This feature ensures that your work remains intact, even if you close the session or access iMESc from a different device.

Using save-points streamlines your workflow and enhances your overall experience with iMESc, providing a reliable way to manage and preserve your analysis outputs and data for future use.


Savepoint

6.4.1 Extracting Savepoint Results with R Code

You can extract specific results and attributes from a Savepoint using R code; iMESc itself is not required for this. Use the following code:

# Reading the Savepoint
savepoint <- readRDS("savepoint.rds")
savepoint$saved_data     # To access all saved Datalists
names(savepoint$saved_data)     # To access the names of the saved Datalists

# Accessing a specific Datalist named "datalist_name"
datalist_name <- savepoint$saved_data[['datalist_name']]
datalist_name

# Accessing specific attributes within the Datalist
attr(datalist_name, "factors")       # To access the Factor-Attribute
attr(datalist_name, "coords")        # To access the Coords-Attribute
attr(datalist_name, "base_shape")    # To access the Base-Shape-Attribute
attr(datalist_name, "layer_shape")   # To access the Layer-Shape-Attribute
attr(datalist_name, "extra_shape")   # To access the Extra-Shapes-Attribute

# To extract saved models
attr(datalist_name, "som")                 # To access all SOM models saved in the Datalist
attr(datalist_name, "som")[["model_name"]] # To access a saved SOM model named 'model_name'

# To access other models, replace "som" with the corresponding model name:
# 'kmeans' (K-Means), 'nb' (Naive Bayes), 'svm' (Support Vector Machine), 'knn' (K-Nearest Neighbor),
# 'rf' (Random Forest), 'sgboost' (Stochastic Gradient Boosting), 'xyf' (supervised SOM).

Note: Ensure that you specify the correct path and filename of your save-point file in the readRDS function. Modify “datalist_name” and “model_name” in the R code to access specific Datalists and saved models, respectively.

7 Pre-processing Tools

The Pre-processing Tools comprise a suite of functionalities for manipulating and preparing Datalists. These tools assist in refining the data, handling missing values, and generating custom palettes for graphical outputs. Below are the details of each tool:

7.1 Create a Datalist

To begin working with iMESc, you need to create a Datalist, which serves as the foundation for all analytical tasks. Click the "Create Datalist" button to open a modal dialog for Datalist creation. All analytical tasks in iMESc require a Datalist, which can be uploaded by the user or generated from example data.

7.1.1 Upload

  • Name the Datalist: Use the text widget to provide a name for the Datalist.

  • Numeric-Attribute: Upload a .csv or .xlsx file containing the numeric variables. This file is mandatory and should include observations as rows and variables as columns. The first row must contain the variable headings, and the first column should have observation labels. Columns containing characters (text or mixed numeric and non-numeric values) will be automatically transferred to the Factor-Attribute.

Create a Datalist - Upload
  • Factor-Attribute: Upload a .csv or .xlsx file containing categorical variables. This file should have observations as rows and categorical variables as columns. The first row must contain variable headings, and the first column should have observation labels. If the Factor-Attribute is not uploaded, the observation IDs will be used automatically. This attribute is crucial for labeling, grouping, and visualizing results based on factor levels. It can be replaced at any time with a new one using the "Replace Factor-Attribute" button.

  • Coords-Attribute: Upload a .csv or .xlsx file containing geographical coordinates. This file is optional for creating a Datalist but required for generating maps. The first column should contain the observation labels, the second column Longitude values, and the third column Latitude values (both in decimal degrees). The first row must contain the coordinate headings.

  • Base-Shape: Upload a single R file containing the polygon shape to be used for generating maps, such as an oceanic basin shape. This optional file can be created using the SHP toolbox available in the pre-processing tools, which allows converting shapefiles (.shp, .shx, and .dbf files) into a single R file.

  • Layer-Shape: Upload a single R file containing an additional shape layer, such as a continent shape, to be used when generating maps. This optional file can also be created using the SHP toolbox available in the pre-processing tools.

7.1.1.1 Best practices when uploading your sheet

  1. Prepare your data: Use the first row as column headers and the first column as observation labels.

  2. Ensure each label is filled with unique information, removing any duplicated names.

  3. Check for empty cells in the observation label column.

  4. Ensure that the column names are unique; duplicated names are not allowed.

  5. Avoid using cells with blank spaces or special symbols.

  6. Avoid beginning variable names with a number.

  7. Note that R is case-sensitive, so “name” is different from “Name” or “NAME.”

  8. Avoid blank rows and/or columns in your data.

  9. Replace missing values with NA (not available).
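A few of these checks can be automated in base R before uploading. A sketch on a deliberately flawed, hypothetical sheet:

```r
# A hypothetical sheet that violates several of the best practices above
sheet <- data.frame(ID     = c("a1", "a2", "a2"),  # duplicated observation label
                    `1var` = c(1, NA, 3),          # variable name starts with a number
                    check.names = FALSE)

any(duplicated(sheet$ID))              # TRUE: labels are not unique
any(is.na(sheet$ID) | sheet$ID == "")  # FALSE: no empty label cells
any(grepl("^[0-9]", names(sheet)))     # TRUE: a name starts with a number
colSums(is.na(sheet))                  # missing values per column (coded as NA)
```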

7.1.2 Use Example Data

This option allows users to explore the example data available in iMESc. After clicking "Create a Datalist," select the "Use example data" radio button to proceed with Datalist insertion. This action will insert two Datalists from Araçá Bay, located on the southeastern coast of Brazil:

  • envi_araca: Contains 141 samples with 9 environmental variables.

  • nema_araca: Contains 141 samples with 194 free-living marine nematode species.

Both Datalists comprise five attributes: Numeric, Factor, Coords, Base-Shape, and Layer-Shape. Studies that explored these data include Corte et al. (2017), Checon et al. (2018), and Vieira et al. (2021).

Create a Datalist - Use example data

7.2 Options

This drop-down menu offers the user a range of tools for editing Datalists.

7.2.1 Rename Datalist

Change the name of a selected Datalist to enhance organization and clarity.

7.2.2 Merge Datalists

Combine two or more Datalists by columns or rows. This action affects both Numeric and Factor-Attribute data. When merging by rows, it also impacts the Coords-Attribute. Please note that saved models in one of the Datalists are not transferred to the merged Datalist.

7.2.3 Exchange Factor/Variables

The “Exchange Factors/Variables” functionality in iMESc allows you to seamlessly convert or transfer data between numeric and factor formats. This powerful tool provides flexibility in handling your data and enables smooth transitions between different data types.

  1. From Datalist Selector: Select the source Datalist from which you want to exchange data.

  2. From Attribute Selector: Within the selected Datalist, choose between the Numeric or Factor Attribute that you wish to convert or transfer.

  3. To Datalist Selector: Select the target Datalist where you want to transfer or convert the data.

  4. To Attribute Selector: Within the target Datalist, specify whether you want to convert the data to Numeric or Factor format.

Now, let’s look at the conversion options based on the selected attributes:

7.2.3.1 From Numeric…

  • To Numeric: This option allows you to copy or transfer the selected numeric variables from the source Datalist to the target Datalist while preserving their numeric format.

  • To Factor: Convert the selected numeric variables to factors. You can use the “cut” option to categorize the variables into specified bins or levels. If the “cut” option is unchecked, each unique value will become a separate factor level. Additionally, you have the flexibility to edit the names and order of the factor levels.

Numeric to Factor Conversion

7.2.3.2 From Factor…

  • To Factor: With this option, you can copy or transfer the selected factors from the source Datalist to the target Datalist, maintaining their original factor format.

  • To Numeric: This conversion allows you to convert the selected factors to numeric data before copying or moving them. You have two types of conversions available:

    • Binary: For each factor level, a binary column is created, where 1 indicates that the observation belongs to that level.

    • Integer: A single column is created, representing the numeric (integer) representation of the factor levels (values).
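In base-R terms, the two conversions behave like this (the factor values are hypothetical):

```r
f <- factor(c("mud", "sand", "mud", "gravel"))

# Integer: one column of integer level codes (levels are ordered alphabetically)
as.integer(f)                                    # 2 3 2 1

# Binary: one 0/1 column per factor level
sapply(levels(f), function(lv) as.integer(f == lv))
```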

Factor to Numeric conversion

7.2.4 Replace Factor-Attribute

The “Replace Factor-Attribute” option allows users to update existing Factor-Attributes within a Datalist by replacing them with new data from a CSV file.

7.2.5 Edit Datalist Columns

The “Edit Datalist Columns” feature enables users to modify the names of columns, including both Numeric-Attributes and Factor-Attributes, within a Datalist.

7.2.6 Edit/Merge models

This functionality allows you to:

  • Edit model names: Select the target Datalist, the model type (e.g., random forest), enter a new name, and confirm the changes.

  • Merge models: Select the models and the target Datalist. The selected models will be copied to the target Datalist, preserving the original models. In case of duplicate model names, iMESc automatically differentiates them by adding ‘.1’, ‘.2’, and so on.

7.2.7 Build IDs/Columns & Formula

Creates columns or builds IDs in the Numeric- or Factor-Attribute using three options:

  • “Find & Replace”: The user selects a target variable/factor, specifies a pattern to find, and another pattern to replace it with. The user can then create a new column using the results. This option employs the gsub function from base R to find the specified pattern and replace it with the desired one throughout the observations.

  • “Concatenate”: The user selects multiple target variables/factors, and like the Concatenate function in Excel, iMESc merges them. The user can build a new column or use the concatenated results as IDs for observations. This functionality in iMESc uses the paste0 function from R base.

  • “Apply formula”: Allows the user to apply custom formulas. The user selects multiple target variables to be available in the bucket and builds a formula by dragging the targets and formula elements (e.g., (, ), +, -,*, etc.). The user can then create a new column with the formula results.
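The first two options can be sketched with the same base-R functions (the patterns and values below are hypothetical):

```r
station <- c("st_01", "st_02", "st_03")

# "Find & Replace": gsub substitutes the pattern across all observations
gsub("st_", "Station-", station)    # "Station-01" "Station-02" "Station-03"

# "Concatenate": paste0 merges columns, e.g. to build unique observation IDs
year <- c(2020, 2020, 2021)
paste0(station, "_", year)          # "st_01_2020" "st_02_2020" "st_03_2021"
```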

7.2.8 Transpose a Datalist

This tool allows the rotation of a Datalist (Numeric and Factor) from rows to columns.

7.2.9 SHP toolbox

This toolbox allows the creation of Base-Shapes, Layer-Shapes and Extra shapes.

  1. Upload shape files* at once
  2. Select the shape attribution: Base-Shape, Layer-Shape or Extra-Shape
  3. Select the Target Datalist
  4. Include the shape or download a single R file to be uploaded later when creating a Datalist.

7.2.9.1 shape files*

Shapefiles are a simple, nontopological format for storing the geometric location and attribute information of geographic features. The shapefile format defines the geometry and attributes of geographically referenced features in three or more files with specific file extensions, which should be stored in the same folder. It requires at least three files:

.shp: The main file that stores the feature geometry.

.shx: The index file that stores the index of the feature geometry.

.dbf: The dBASE table that stores the attribute information of features.

There is a one-to-one relationship between geometry and attributes, which is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file.

Each file must have the same prefix, for example: basin.shp, basin.shx, and basin.dbf
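Outside iMESc, an equivalent conversion can be sketched with the sf package (an assumption; iMESc's own internals may differ, and the file names are hypothetical):

```r
library(sf)

# Reads basin.shp together with its companion basin.shx and basin.dbf
basin <- st_read("basin.shp")

# Store the shape as a single R file, like the one produced by the SHP toolbox
saveRDS(basin, "basin.rds")

# Later, e.g. when creating a Datalist:
basin <- readRDS("basin.rds")
```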

7.2.10 Datalist Manager

Manage saved Datalists and their attributes. The manager displays the size of each Datalist and offers options for deleting attributes and downloading data frames.

7.2.11 Delete Datalists

Remove a Datalist entirely.

7.3 Filter observations

This tool allows manipulating numeric attributes by filtering observations based on certain criteria. The available options are:

  • Na.omit: Remove all observations containing empty cells (NAs) by checking this box.

  • Match with: If checked, constrain the target Datalist to observations (IDs) from another Datalist.

  • Filter by Factors: After clicking this option, a Filter Tree will be displayed, structured by the levels of the Factor-Attribute. You can click on the nodes to expand and select the factor levels. This function is available for factors with fewer than 100 levels.

  • Individual Selection: After clicking this option, checkboxes will be displayed to select observations using Datalist IDs.

Filter observations

7.4 Filter variables

This tool allows manipulating the Numeric-Attribute by filtering variables. The available options for value-based removal are:

  • Abund<: Remove variables with a total value less than x-percent of the total sum across all observations. This is useful to exclude variables with low overall contribution.

  • Freq<: Remove variables that occur in less than x-percent of the total number of observations. This is helpful when you want to exclude rarely occurring variables.

  • Singletons: Remove variables that occur only once in the dataset. This option is relevant for counting data and helps eliminate variables with no meaningful variation.

  • Individual Selection: This option allows you to manually select specific variables to keep or remove from the dataset, using their column names.

  • Correlation-based removal: This option uses the findCorrelation function from the ‘caret’ package. It considers the absolute values of pair-wise correlations between variables. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable, considering the whole data, and removes the variable with the largest mean absolute correlation. The exact argument is set to TRUE, meaning that the function re-evaluates the average correlations at each step.

  • nearZeroVar removal: This option uses the nearZeroVar function from the caret package. It identifies and removes near zero variance predictors. Predictors with near-zero variance have either zero variance (only one unique value) or very few unique values relative to the number of samples, with a large frequency ratio between the most common and second most common values. Removing such predictors can help eliminate features that do not contribute much information.
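A small caret sketch of both filters (toy data; the variable names are hypothetical):

```r
library(caret)
set.seed(1)

x <- data.frame(a = rnorm(50))
x$b <- x$a + rnorm(50, sd = 0.01)   # 'b' is almost a copy of 'a'
x$c <- rnorm(50)
x$d <- rep(1, 50)                   # 'd' has zero variance

# Correlation-based removal: flags one of the highly correlated pair
findCorrelation(cor(x[, c("a", "b", "c")]), cutoff = 0.90, exact = TRUE)

# nearZeroVar removal: flags the zero-variance column 'd'
nearZeroVar(x)
```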

Filter variables

7.5 Transformations

The “Transformations” tool enables preprocessing of the Numeric-Attribute using various transformation methods.

7.5.1 Transformation

Provides a wide range of transformation options:

  1. None: No Transformation. Select this option if you do not want to apply any transformation to the Numeric-Attribute.

  2. Log2: Logarithmic base 2 transformation as suggested by Anderson et al. (2006). It follows the formula log_b(x) + 1 for x > 0, where ‘b’ is the base of the logarithm. Zeros are left as zeros. Higher bases give less weight to quantities and more to presences, and logbase = Inf gives the presence/absence scaling. Note that this is not log(x+1).

  3. Log10: Logarithmic base 10 transformation as suggested by Anderson et al. (2006). It follows the formula log_b(x) + 1 for x > 0, where ‘b’ is the base of the logarithm. Zeros are left as zeros. Higher bases give less weight to quantities and more to presences, and logbase = Inf gives the presence/absence scaling. Note that this is not log(x+1).

  4. Total: Divide by the line (observation) total. This transformation scales the values based on the total sum of each observation.

  5. Max: Divide by the column (variable) maximum. This transformation scales the values based on the maximum value of each variable.

  6. Frequency: Divide by the column (variable) total and multiply by the number of non-zero items, so that the average of non-zero entries is one. This transformation scales the values based on the frequency of occurrence.

  7. Range: Standardize column (variable) values into the range 0 … 1. If all values are constant, they will be transformed to 0. This transformation brings the values to a common scale.

  8. Pa: Scale x to presence/absence scale (0/1). This transformation converts the values to binary (0 for absence, 1 for presence).

  9. Chi.square: Divide by row sums and square root of column sums and adjust for the square root of the matrix total. This transformation is relevant for specific statistical analyses.

  10. Hellinger: Square root of method = total. This transformation is used for certain distance calculations.

  11. Sqrt2: Square root transformation. This transformation takes the square root of each value.

  12. Sqrt4: 4th root transformation. This transformation takes the 4th root of each value.

  13. Log2(x+1): Logarithmic base 2 transformation (x+1). This is a variant of the log2 transformation that adds 1 before taking the logarithm.

  14. Log10(x+1): Logarithmic base 10 transformation (x+1). This is a variant of the log10 transformation that adds 1 before taking the logarithm.

  15. BoxCox: Designed for non-negative responses. The Box-Cox transformation converts non-normally distributed data to a set of data with an approximately normal distribution; it is a family of power transformations.

  16. YeoJohnson: Like the Box-Cox model but can accommodate predictors with zero and/or negative values. This is another family of power transformations.

  17. ExpoTrans: Exponential transformation. This transformation applies the exponential function to each value.
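Several of these options mirror methods of the decostand function in the vegan package (an assumption based on the matching descriptions); a quick sketch on toy count data:

```r
library(vegan)

m <- matrix(c(0, 2, 8,
              1, 0, 4), nrow = 2, byrow = TRUE)   # toy counts

decostand(m, method = "total")              # divide by row (observation) totals
decostand(m, method = "hellinger")          # square root of method = "total"
decostand(m, method = "pa")                 # presence/absence (0/1)
decostand(m, method = "log", logbase = 2)   # log_b(x) + 1 for x > 0; zeros stay zero
```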

7.5.2 Scale and Centering

This tool uses the function scale from R base for scaling and centering operations. You have the following options:

  • Scale: If checked, scaling is done by dividing the (centered) columns of “x” either by their standard deviations (if center is TRUE) or by the root mean square (if center is FALSE).

  • Center: If checked, centering is done by subtracting the column means (omitting NAs) of “x” from their corresponding columns.
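A quick base-R illustration of both options:

```r
x <- matrix(c(1, 2, 3,
              10, 20, 30), ncol = 2)   # toy data, one column per variable

s <- scale(x, center = TRUE, scale = TRUE)

colMeans(s)       # ~0: column means removed by centering
apply(s, 2, sd)   # 1: columns divided by their standard deviations
```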

7.5.3 Random Rarefaction

  • This tool generates one randomly rarefied community given a sample size. If the sample size is equal to or larger than the observed number of individuals, the non-rarefied community will be returned.
Transformations

7.6 Data imputation

This tool provides methods for completing missing values with values estimated from the observed data. It is available only for Datalists that contain missing data either in the Numeric-Attribute or in the Factor-Attribute. The function preProcess from the caret package is used for imputation. To impute missing values, follow these steps:

  1. Choose the Target-Attribute.

  2. Pick a Method (described below).

  3. Click the blue “Flash” button. The “Save Changes” dialog will automatically pop-up.

  4. Save the Datalist with imputed values as a new Datalist or replace an existing one.

Data imputation

Methods for imputation:

  • Median/mode: Numeric-Attribute columns are imputed with the median, and Factor-Attribute columns are imputed with the mode.

  • Knn: k-nearest neighbor imputation is only available for the Numeric-Attribute. It is carried out by finding the k closest samples (Euclidean distance) in the dataset. This method automatically centers and scales your data.

  • Bagimpute: Only available for the Numeric-Attribute. Imputation via bagging fits a bagged tree model for each predictor (as a function of all the others). This method is more accurate than simple imputation and accepts missing values in the predictors, but it has a much higher computational cost.

  • MedianImpute: Only available for the Numeric-Attribute. Imputation via medians takes the median of each predictor in the training set and uses them to fill missing values. This method is simple, fast, and accepts missing values but treats each predictor independently, which may lead to inaccuracies.
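A minimal sketch of median imputation with caret's preProcess (toy data; the column names are hypothetical):

```r
library(caret)

df <- data.frame(a = c(1, 2, NA, 4),
                 b = c(10, NA, 30, 40))

pp <- preProcess(df, method = "medianImpute")
predict(pp, df)   # NAs replaced by the median of each column
```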

7.7 Data partition

This tool adds a partition as a factor in the Factor-Attribute. It uses the createDataPartition function from the caret package. Users can specify the percentage of observations to be used for the test and choose between the methods:

  • Random Sampling: simple random sampling is used.

  • Balanced Sampling: This option conducts random sampling within the levels of a chosen factor, ensuring a balanced sampling distribution across the factor levels.

Data Partition

Data partitioning is a critical step in evaluating machine learning models. Creating distinct training and testing sets allows for accurate assessment of the model’s performance on unseen data, avoiding issues like overfitting and obtaining more reliable performance metrics.
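The balanced option can be sketched with caret's createDataPartition (the grouping factor below is hypothetical):

```r
library(caret)
set.seed(42)

y <- factor(rep(c("A", "B"), times = c(60, 40)))   # hypothetical factor

# Sample 75% for training, stratified within the levels of y
train_idx <- createDataPartition(y, p = 0.75, list = FALSE)

table(y[train_idx])   # roughly preserves the 60/40 ratio of A to B
```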

7.8 Aggregate

The “Aggregate” tool utilizes the aggregate function from R base. This process involves aggregating individual cases of the Numeric-Attribute based on a grouping variable, which can be selected factors.

The tool offers various calculation options to aggregate the data:

  • Mean: Calculates the mean of each group (selected factor).

  • Sum: Calculates the sum of values for each group (selected factor).

  • Median: Calculates the median of each group (selected factor).

  • Var: Calculates the variance of each group (selected factor).

  • SD: Calculates the standard deviation of each group (selected factor).

  • Min: Retrieves the minimum value for each group (selected factor).

  • Max: Retrieves the maximum value for each group (selected factor).
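For example, aggregating by a factor with base R's aggregate (toy data):

```r
df <- data.frame(site  = factor(c("A", "A", "B", "B")),   # grouping factor
                 depth = c(4, 6, 10, 14))

aggregate(depth ~ site, data = df, FUN = mean)   # site A: 5, site B: 12
```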

Aggregate

7.9 Create Palette

The “Create Palette” tool utilizes the colourpicker tool from the colourpicker package, enabling users to interactively select colors for their palette. Subsequently, iMESc employs colorRampPalette to generate customized color palettes suitable for graphical outputs.
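In base-R terms, the palette-generation step works like this (the three colors below stand in for the user's interactive picks):

```r
# Interpolate between user-chosen colors
pal <- colorRampPalette(c("navy", "white", "firebrick"))

pal(5)   # a palette of 5 hex colors for graphical outputs
```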

Create Palette

7.10 Savepoint

  • Create: creates a Savepoint, a single R object (.rds file) that can be downloaded, shared, and reloaded later to restore the workspace.

  • Restore: uploads a Savepoint (.rds file) to restore the workspace.
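Conceptually, a Savepoint is an ordinary serialized R object: the workspace is written with saveRDS and read back with readRDS. The list contents below are illustrative, not iMESc's actual internal structure:

```r
# Sketch of the Savepoint round trip: one R object out, same object back in.
workspace <- list(datalists = list(), palettes = list(), models = list())
f <- tempfile(fileext = ".rds")
saveRDS(workspace, f)     # "Create": this .rds file is what gets downloaded
restored <- readRDS(f)    # "Restore": uploading the file recovers the object
identical(workspace, restored)
```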

7.11 Track changes

This group of panels appears when using the functionalities provided by the pre-processing toolbar. Users can track changes in real time through summary panels and histograms. The Missing Values panel is particularly important for checking the number of missing values per variable.

  • Changes: This panel displays the current and previous changes made during pre-processing.

  • Summary: The Summary panel shows the number of missing values, rows, and columns, as well as the minimum, average, median, and maximum values of the data.

  • Data: This panel presents a histogram of all numerical values in the dataset.

  • colSums: The colSums panel displays a histogram of column sums.

  • rowSums: The rowSums panel displays a histogram of row sums.

  • Missing Values: In the Missing Values panel, users can see the number of missing values per variable.
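The quantities shown in these panels correspond to standard base-R summaries; a minimal sketch on an illustrative data frame:

```r
# Base-R versions of the Track-changes panel quantities
m <- data.frame(x = c(1, NA, 3), y = c(4, 5, 6))
n_missing_per_var <- colSums(is.na(m))   # Missing Values panel
col_totals <- colSums(m, na.rm = TRUE)   # input to the colSums histogram
row_totals <- rowSums(m, na.rm = TRUE)   # input to the rowSums histogram
summary(unlist(m))                       # Summary panel: min/median/mean/max, NA count
```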

Track-Changes

9 Packages & functions

In this section, we present the key R packages and functions used in the development of iMESc. While iMESc draws on a wide range of packages, the tables below highlight those that play a crucial role throughout the app and in its various analytical tasks. Please note that these tables might not be exhaustive, and the version numbers provided are subject to change with future package updates.

9.0.1 Table 1: Packages and Functions for Analytical Tasks

In this table, we highlight the packages and functions used for various analytical tasks within iMESc.

Package Version Main Functions Task
ade4 1.7.22 dudi.pca, niche, niche.param Biodiversity tools, Niche analyses
aweSOM 1.3 somDist, somQuality Self-Organizing Maps
caret 6.0.94 trainControl, multiClassSummary, MAE, RMSE, preProcess, postResample, train, confusionMatrix, knn3, knnreg, progress, varImp, predict.train, getModelInfo, nearZeroVar, createDataPartition, resamples, caretTheme Supervised Algorithms, Pre-Processing
cluster 2.1.4 clusGap Kmeans
data.table 1.14.8 rbindlist, melt, fread, rleid Render tables
dendextend 1.17.1 cutree, color_branches, labels_colors, set Unsupervised Algorithms, K-Means
doParallel 1.0.17 registerDoParallel Supervised Algorithms, Permutation Importance
doSNOW 1.0.20 registerDoSNOW Supervised Algorithms, Permutation Importance
e1071 1.7.13 allShortestPaths, svm SVM
factoextra 1.0.7 hcut, fviz_nbclust, fviz_pca_biplot, fviz_gap_stat Unsupervised Algorithms, HC
geodist 0.0.8 geodist Spatial tools
ggraph 2.1.0 scale_color_viridis, ggraph, geom_edge_link, geom_edge_diagonal, geom_node_point, geom_node_text, geom_node_label Explore Saved Models, Random Forest, Decision Tree
ggridges 0.5.4 geom_density_ridges Descriptive tools, Ridges
gplots 3.1.3 lowess Descriptive tool, Pairs
igraph 1.4.2 graph_from_data_frame, delete_vertices, V, clusters, tree Explore Saved Models, Random Forest, Decision Tree
imputeMissings 0.0.3 impute Data Imputation
kernlab 0.9.32 size, error, sigest SVM
kohonen 3.0.11 getCodes, object.distances, unit.distances, check.whatmap, nunits, supersom, somgrid, dist2WU, classmat2classvec, add.cluster.boundaries, classvec2classmat, som, xyf Self-Organizing Maps
lattice 0.21.8 trellis.par.set, bwplot, dotplot, parallelplot, splom, xyplot Compare & Ensemble Models
Metrics 0.1.4 mae, mse, rmse, mape Metrics for contrasting Predicted vs Observed
parallel 4.3.0 stopCluster, makeCluster, clusterEvalQ, mclapply, parLapply, detectCores Supervised Algorithms, Permutation Importance
pdp 0.8.1 pdp Explore Saved Models, Random Forest
plot3D 1.4 scatter3D, segments3D Spatial tools
pROC 1.18.0 cov, var, multiclass.roc Ensemble Models
randomForestExplainer 0.10.1 measure_importance, min_depth_distribution, plot_multi_way_importance, important_variables, min_depth_interactions, plot_min_depth_interactions, plot_importance_rankings, plot_importance_ggpairs Random Forest
raster 3.6.20 proj4string, quantile, as.matrix, nrow, is.factor, cut, geom, coordinates, text, which.min, levels, mean, extent, flip, t, rasterToPoints, ncol, persp, as.factor, plot, as.raster, raster, rasterize, which.max, crop, stack, unique, scale, predict, weighted.mean, mask, crs, as.data.frame, ratify, as.list, rowSums, colSums, merge, aggregate, labels, hist, summary, lines, pairs, print, gain, density, barplot, all.equal, values, subset, image, writeRaster, head, boxplot, distance Spatial tools
Rcpp 1.0.10 sourceCpp Self-Organizing Maps
rgl 1.1.3 persp3d, par3d, polygon3d, points3d, text3d, open3d, rgl.dev.list, rglwidgetOutput, rgl.bg, rgl.open, rglwidget Spatial tools
scatterpie 0.1.9 geom_scatterpie Spatial tools
segRDA 1.0.2 bp, OrdData, pwRDA, smw Descriptive tools, Ridges
sf 1.0.12 st_coordinates, st_as_sf, as_Spatial, st_set_crs, st_bbox, st_crs, st_transform, st_make_grid, st_multipoint, st_sfc, st_point, read_sf, st_read, st_combine, st_crop, st_cast, st_sf Spatial tools
sp 1.6.0 SpatialLines, Lines, Line, split, SpatialPolygons, Polygons, Polygon, SpatialGridDataFrame, spsample, gridded, fullgrid, CRS Spatial tools
spatstat.explore 3.1.0 idw, auc, dkernel Spatial tools
spatstat.geom 3.1.0 rescale, rotate, coords Spatial tools
vegan 2.6.4 decostand, vegdist, scores, postMDS, wascores, initMDS, procrustes, ordiArrowMul, ordiArrowTextXY, rda, specnumber, diversity, mantel.correlog, rrarefy, rda, metaMDS Descriptive Tools, MDS, RDA, Spatial tools

9.0.2 Table 2: Packages and Functions Used Throughout the App

This table presents the packages and their respective functions that are utilized across the entire app. These packages are essential for data manipulation, visualization, interactive features, and more.

Package Version Main Functions Used in iMESc
base64enc 0.1.3 dataURI
beepr 1.3 beep
colorspace 2.1.0 lighten, darken, mixcolor, hex2RGB
dplyr 1.1.2 filter, intersect, between, groups, mutate, vars, group_by
DT 0.27 dataTableOutput, renderDataTable, datatable, DTOutput, replaceData
foreach 1.5.2 foreach, times
gbRd 0.4.11 Rd_fun
ggplot2 3.4.2 ggplot, geom_tile, aes, scale_fill_manual, scale_fill_gradientn, geom_sf, geom_sf_text, coord_sf, xlab, ylab, ggtitle, theme, element_blank, element_rect, unit, element_text, geom_text, geom_point, geom_hline, geom_vline, scale_radius, scale_color_manual, scale_colour_gradientn, guides, guide_legend, scale_size, element_line, discrete_scale, scale_color_gradientn, scale_x_log10, scale_x_continuous, scale_y_log10, scale_y_continuous, geom_contour, theme_bw, facet_grid, autoplot, geom_col, position_stack, coord_flip, scale_x_discrete, geom_errorbar, theme_set, theme_grey, aes_string, scale_alpha_discrete, scale_colour_manual, scale_x_sqrt, scale_y_sqrt, margin, geom_hex, standardise_aes_names, geom_polygon, coord_fixed, labs, geom_raster, geom_segment, geom_label, geom_bar, scale_shape_manual, position_dodge, layer, facet_wrap, theme_minimal, resolution, geom_line, geom_boxplot, geom_area
ggpubr 0.6.0 tab_add_footnote, tbody_style, table_cell_font, table_cell_bg
ggrepel 0.9.3 geom_text_repel, geom_label_repel
RColorBrewer 1.1.3 brewer.pal
readxl 1.4.2 excel_sheets, read_excel
rintrojs 0.3.2 introjsUI
rstudioapi 0.14 hasFun
scales 1.2.1 col_numeric, percent_format, alpha
shiny 1.7.4 splitLayout, HTML, getDefaultReactiveDomain, incProgress, req, div, validate, need, withProgress, actionLink, strong, span, icon, em, NS, setProgress, a, h5, absolutePanel, p, h4, code, runGitHub, h3, br, htmlOutput, verbatimTextOutput, column, updateTextInput, reactiveValues, img, uiOutput, actionButton, fileInput, renderPlot, fluidRow, removeModal, downloadButton, showModal, modalButton, textAreaInput, renderPrint, renderTable, checkboxGroupInput, numericInput, hr, callModule, observeEvent, updateRadioButtons, updateCheckboxInput, textInput, checkboxInput, conditionalPanel, updateActionLink, isolate, radioButtons, reactive, renderUI, plotOutput, reactiveVal, observe, renderText, updateSelectInput, modalDialog, updateTabsetPanel, reactiveValuesToList, sidebarPanel, updateNumericInput, downloadLink, selectInput, imageOutput, outputOptions, tabPanel, removeUI, insertUI, updateCheckboxGroupInput, includeCSS, tabsetPanel, sidebarLayout, mainPanel, withMathJax, helpText, tagList, tableOutput
shinyBS 0.61.1 tipify, popify, bsButton, bsPopover, updateButton, bsTooltip
shinybusy 0.3.1 add_busy_spinner
shinycssloaders 1.0.0 withSpinner
shinydashboard 0.7.2 updateTabItems, menuItem, menuSubItem
shinydashboardPlus 2.0.3 dashboardHeader
shinyjqui 0.4.1 JS
shinyjs 2.1.0 click, show, hide, runjs, delay, toggle, reset, toggleState, hidden, removeClass, addClass, useShinyjs, onclick, extendShinyjs
shinyTree 0.2.7 shinyTree, get_selected
shinyWidgets 0.7.6 pickerInput, updateRadioGroupButtons, updatePickerInput, updateSwitchInput, toggleDropdownButton, radioGroupButtons, noUiSliderInput, updateNumericInputIcon, dropdownButton, tooltipOptions, switchInput
sortable 0.5.0 sortable_js_capture_input
stringr 1.5.0 str_length, str_trunc
tools 4.3.0 Rd2HTML, package_dependencies
utils 4.3.0 data, getParseData, packageDescription, compareVersion, object.size, combn, txtProgressBar, setTxtProgressBar, flush.console, write.table, str, methods, zip, read.csv, capture.output, browseURL, packageVersion, find
viridisLite 0.4.2 turbo
wesanderson 0.3.6 wes_palette
writexl 1.4.2 write_xlsx

10 References

Blanchet, F. G., Legendre, P., & Borcard, D. (2008). Forward selection of explanatory variables. Ecology, 89(9), 2623–2632. https://doi.org/10.1890/07-0986.1

Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

Cervantes, J., Garcia-Lamont, F., Rodríguez-Mazahua, L., & Lopez, A. (2020). A comprehensive survey on support vector machine classification: Applications, challenges and trends. Neurocomputing, 408, 189–215. https://doi.org/10.1016/j.neucom.2019.10.118

Checon, H. H., Vieira, D. C., Corte, G. N., Sousa, E. C. P. M., Fonseca, G., & Amaral, A. C. Z. (2018). Defining soft bottom habitats and potential indicator species as tools for monitoring coastal systems: A case study in a subtropical bay. Ocean & Coastal Management, 164, 68–78. https://doi.org/10.1016/j.ocecoaman.2018.03.035

Corte, G. N., Checon, H. H., Fonseca, G., Vieira, D. C., Gallucci, F., Domenico, M. Di, & Amaral, A. C. Z. (2017). Cross-taxon congruence in benthic communities: Searching for surrogates in marine sediments. Ecological Indicators, 78, 173–182. https://doi.org/10.1016/j.ecolind.2017.03.031

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.

Duda, R. O., Hart, P. E., & Stork, D. G. (2012). Pattern classification (2nd ed.). John Wiley & Sons.

Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572. https://api.semanticscholar.org/CorpusID:125037489

Fix, E., & Hodges, J. L. (1989). Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. International Statistical Review / Revue Internationale de Statistique, 57(3), 238–247. http://www.jstor.org/stable/1403797

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.

Guo, G., Wang, H., Bell, D., Bi, Y., & Greer, K. (2003). KNN Model-Based Approach in Classification (pp. 986–996). https://doi.org/10.1007/978-3-540-39964-3_62

Gupta, B., Rawat, A., Jain, A., Arora, A., & Dhami, N. (2017). Analysis of Various Decision Tree Algorithms for Classification in Data Mining. International Journal of Computer Applications, 163(8), 15–19. https://doi.org/10.5120/ijca2017913660

Kalcheva, N., Todorova, M., & Marinova, G. (2020). Naive Bayes classifier, decision tree and AdaBoost ensemble algorithm – advantages and disadvantages. 153–157. https://doi.org/10.31410/ERAZ.2020.153

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30, 81–93.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.

Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5). https://doi.org/10.18637/jss.v028.i05

Legendre, P., & Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69(1), 1–24. https://doi.org/10.1890/0012-9615

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.

Mantel, N. (1967). The Detection of Disease Clustering and a Generalized Regression Approach. Cancer Research, 27(2 Part 1), 209–220.

Melssen, W., Wehrens, R., & Buydens, L. (2006). Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems, 83(2), 99–113. https://doi.org/10.1016/j.chemolab.2006.02.003

Pearson, K. (1920). Notes on the History of Correlation. Biometrika, 13, 25–45.

Shepard, D. (1968). A two-dimensional interpolation function for irregularly-spaced data. Proceedings of the 1968 23rd ACM National Conference On -, 517–524. https://doi.org/10.1145/800186.810616

Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27(2), 125–140.

Sneath, P. H. A. (1957). The application of computers to taxonomy. Journal of General Microbiology, 17(1), 201–226.

Spearman, C. (1987). The Proof and Measurement of Association between Two Things. The American Journal of Psychology, 100(3/4), 441–471. http://www.jstor.org/stable/1422689

Vieira, D. C., Brustolin, M. C., Ferreira, F. C., & Fonseca, G. (2019). segRDA: An R package for performing piecewise redundancy analysis. Methods in Ecology and Evolution, 10(12), 2189–2194. https://doi.org/10.1111/2041-210X.13300

Vieira, D. C., Gallucci, F., Corte, G. N., Checon, H. H., Zacagnini Amaral, A. C., & Fonseca, G. (2021). The relative contribution of non-selection and selection processes in marine benthic assemblages. Marine Environmental Research, 163, 105223. https://doi.org/10.1016/j.marenvres.2020.105223

Yao, M., Zhu, Y., Li, J., Wei, H., & He, P. (2019). Research on Predicting Line Loss Rate in Low Voltage Distribution Network Based on Gradient Boosting Decision Tree. Energies, 12(13), 2522. https://doi.org/10.3390/en12132522