This document provides a quick introduction to data wrangling. It is part of a series of tutorials for studying trait-environment relationship. The tutorial targets students and scientists in ecology with previous knowledge of the R software.

Be aware that this tutorial use base R functions (i.e. without any package). There are many other ways to process data in R, but it won’t be cover by this tutorial (e.g. tidyverse suite of packages).

1. Quick introduction

Problem

Trait-environment dataset are complex because they are made of multiple tables: the abundance information, the environmental conditions and the traits of the species. The question of trait-environment relationship (TER) analysis is how the trait and environment are related and how can they explain variations in species abundance?

Two practical questions will be addressed here:

  1. which trait and environmental variables should be included in TER?
  2. how to make sure the three tables are linked together?

We will use example dataset which is available in the zip archive NEAtl_FishTraitEnv_rawdata.zip, available for download here. To run the following examples, extract the archive in a folder, and set the working directory of R to this folder.

Variable selection - Theoretical considerations

The selection of variables is a key step for trait-environment analysis.

Variables must be generic enough so that most of species are concerned. For instance, if only one species is swimming in schools over 50+ species, then keeping the binary variable swimming in schools is not relevant for this dataset - it would give more weights to the single species swimming in schools (unless your initial objective has a special interest about swimming behavior).

Remove obviously correlated variables. For instance, age at maturation is highly correlated with lifespan. If both variables are included, double weight will be given to this characteristic. A good solution to remove this correlation is to compute ratios, e.g. age at maturation divided by lifespan which is the percentage of lifespan before maturation/first spawning. Yet, sometimes correlated variables could have different meaning, and it can be important to keep them both.

Remove the geographic variables or other variables that are not directly measured but a proxy. For instance, habitat or temperature tolerance are not a good trait to investigate trait-environment relationship because it will be directly correlated with temperature. If you include temperature tolerance as a variable, you will force the results to follow the temperature gradient. Similarly, latitude would be a poor environmental variable because it is mostly a proxy of temperature gradient.

Include enough variables to be able to depict all the diversity of traits and environmental condition that we want to consider. For traits, one could think of the main dimensions, growth, reproduction, mobility, feeding. Ideally, the number of variables per dimensions would be somehow balanced. Else, one must keep in mind that some dimensions are given more importance.

An exploratory analysis is a necessary step to select the relevant variables.

2. Traits dataset

There are many sources of information for trait dataset depending on the type of organisms. For fish, the largest resource is Fishbase. For benthos, the Biological Traits Information Catalogue (BIOTIC) collated multiple traits for benthic community ecology.

Other efforts to curate databases were published in data repository, such as:

2.1 Load traits dataset

Let’s load one example of trait database, extracted from Beukhof et al. 2019 DOI 10.1594/PANGAEA.900866

Make sure the file Fish_trait.csv is in your working directory, then load it in R.

trait <- read.csv("Fish_trait.csv")
dim(trait)
## [1] 775  15
names(trait)
##  [1] "family"             "genus"              "species"           
##  [4] "taxon"              "habitat"            "feeding.mode"      
##  [7] "tl"                 "offspring.size"     "spawning.type"     
## [10] "age.maturity"       "fecundity"          "length.infinity"   
## [13] "growth.coefficient" "length.max"         "age.max"

The table trait contain 15 variables (in column) characterizing 775 taxa (in rows). Among these variables:

  • family, genus, species: taxonomic information of the taxa
  • taxon: scientific name of the taxa
  • habitat: zone of the water column used by the taxon. Categorical variable with bathydemersal, bathypelagic, benthopelagic, demersal, non-pelagic, pelagic, reef-associated
  • feeding.mode: Main food source from stomach contents and biological descriptions of adults. Categorical variable with benthivorous, generalist, herbivorous, piscivorous, planktivorous
  • tl: trophic level based on the proportion of different prey in stomach
  • offspring.size: egg diameter, length of egg case or length of pup in mm
  • spawning.type: Type of spawning related to parental care. Categorical variable with bearer, external bearer, guarder hider, internal bearer, internal live bearer, nester, nonguarder, open water/substratum
  • age.maturity: age at first majority in years
  • fecundity: number of offspring produced by a female per year
  • length.infinity: Infinity length parameter estimated from the Von Bertalanffy equation, in cm
  • growth.coefficient: Growth coefficient K estimated from the Von Bertalanffy equation, expressed in year\(^{-1}\)
  • length.max: maximum body length in cm
  • age.max: lifespan in years

2.2 Check distribution of continuous variable

It is convenient to have continuous variables that don’t have strong outliers and roughly following a ‘normal distribution’. Therefore, it is recommended to check the distribution of each continuous variable and decide whether a transformation (log or square root) is needed.

Let’s see two examples: trophic levels and offspring size.

We can visually check the distribution and the presence of outliers by using the functions hist() or boxplot().

par(mfrow=c(1,2))
hist(trait$tl, main="Histogram", xlab="Trophic level")
boxplot(trait$tl, main="Boxplot", ylab="Trophic level")

summary(trait$tl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.320   3.580   3.639   3.920   4.570

Histograms may be the most intuitive graphic, but the visual output dependents on the number of bars plotted (the number of classes) and the separation.

Boxplot represent the key distribution statistics (quartiles) in a graph. The middle bold line represent the median, separating the dataset in two with half the dataset having higher values than the median. The gray box represent the first and third quartiles (i.e. the interquartile range, IQR), i.e. half the dataset are included in this box, 25% have higher value than the third quartile (Q3), and the other quarter have lower values than the first quartile (Q1). The outliers are represented as dots if they are higher than \(Q3 +1.5*IQR\) or lower than \(Q1-1.5*IQR\).

In both case, we see that the distribution of values are well spread, and tropic level doesn’t require data transformation.

Let’s now look at the distribution of offspring size (in mm).

par(mfrow=c(1,2))
hist(trait$offspring.size, main="Histogram", xlab="Offspring size")
boxplot(trait$offspring.size, main="Boxplot", ylab="Offspring size")

summary(trait$offspring.size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.05    1.00    1.60   22.83    3.75  900.00      19

Most species have offspring size lower than 4 mm (Q3), while few species have very large offspring (max 900 mm). Hence it is strongly recommended to transform this variable.

There are multiple choices of transformation. Log transformation is very efficient but require that all values are positive and higher than 0. Square-root transformation (or other power transformation) also requires positive values, but can handle 0s. Square-root transformation is less strong than log-transformation (but cube- or eighth-root transformation can be very efficient too).
In our case, the log-transformation seems the most appropriate for this variable.

par(mfrow=c(1,3))
boxplot(trait$offspring.size, main="Raw data")
boxplot(sqrt(trait$offspring.size),  main="Square root transform")
boxplot(log(trait$offspring.size),  main="Log transform")

So we create a new variable with log-transformed offspring size

trait$offspring.size_log <- log(trait$offspring.size)

The same is true for fecundity (very skewed distribution) so we will log-transform it as well.

trait$fecundity_log <- log(trait$fecundity)

2.3 Correlation among continous variables

A simple and graphical way to plot the correlation among continuous variable is to use the function pairs(). Additionally, we will define a function to show Spearman correlation coefficient rho which is based on rank (non parametric correlation).

panel.cor.m <- function(x, y, digits=2)
{
  usrold <- par()$usr; on.exit(par("usr"=usrold))
  par("usr" = c(0, 1, 0, 1))
  r <- cor(x, y,method = "spearman", use = "complete.obs")
  r2 <- abs(cor(x, y,method = "spearman", use = "complete.obs"))
  txt <- format(c(r, 0.123456789), digits=digits)[1]
  cex <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex * r2)

}

We must first select the continuous variables (it can’t display categorical variables).

continuous <- c("tl","offspring.size_log","age.maturity","fecundity_log",
                "length.infinity","growth.coefficient", "length.max","age.max")
pairs(trait[,continuous], lower.panel=panel.smooth, 
      upper.panel=panel.cor.m)

Length infinity is strongly positively correlated with maximum length (rho >0.7), and negatively with growth coefficient (rho <-0.75). So we will remove length infinity from further analysis. The other correlations are acceptable (here below 0.7).

Note: there is no rule set in stone for what is an acceptable level of correlation.

2.4 Check distribution of categorical variables

The use of categorical variable is allowed in most TER analysis, but often need small adjustment to the methodologies.

Before selecting categorical variable, it is important to check the distribution of the categories.

table(trait$spawning.type, useNA="ifany")
## 
##                  bearer     guarder non-guarder 
##          19          62         159         535
barplot(table(trait$spawning.type, useNA="ifany"))

The dataset is largely dominated by non-guarder. The three categories are non equally distributed, but the number of category is reasonably low and each category is represented by at least 62 taxa. However the issue here are the NAs, which in most cases are not tolerated in TER analysis.

Let’s look at another categorical variable: habitat.

nhabitat <- table(trait$habitat, useNA="ifany")
nhabitat
## 
##   bathydemersal    bathypelagic   benthopelagic        demersal     non-pelagic 
##             160              99              90             355               9 
##         pelagic reef-associated 
##              40              22
barplot(matrix(nhabitat, ncol=1), 
        col=rainbow(length(nhabitat)),
        legend.text=names(nhabitat),
        args.legend = list("x"="center"))

Habitat is the typical example of a variable with too many categories. The fact that there are only 9 taxa as non-pelagic, and 22 as reef-associated will give a relatively large weight to these taxa. It might be good to define broader categories (when possible) or simply discard this categorical variable.

2.5 Simplify the trait dataset

For simplicity we decided to keep only seven continuous variables. It is a good idea to keep the name of species as names of the rows in the trait table to help matching the rows at a later stage.

row.names(trait) <- trait$taxon

#set which traits to keep
keepTraits <- c("tl", "growth.coefficient" , "length.max",
                "age.max", "age.maturity", "fecundity_log", "offspring.size_log")

trait <- trait[,keepTraits]

dim(trait)
## [1] 775   7

3 Environemental variables

If you don’t have in-situ measurement, there are many other source of data if you have the coordinates of the sites. sdmpredictors is a comprehensive list of spatially defined environmental drivers (but considered constant in time, impossible to study temporal dynamics). Other sources of temporally resolved dataset is Copernicus. Please visit https://rfrelat.github.io/SpatialR.html for more information on how to extract spatially and temporally resolved data in R.

3.1 Load environmental dataset

Make sure the file Fish_Environment.csv is in your working directory, then load it in R.

env <- read.csv("Fish_Environment.csv")
dim(env)
## [1] 169   9
names(env)
## [1] "Longitude" "Latitude"  "Depth"     "SBT"       "SBS"       "Chl"      
## [7] "SBT_sea"   "Chl_sea"   "Fishing"

The table env contains 9 variables (in column) characterizing 169 sites (in rows). Among these variables:

  • Longitude and Latitude: coordinates of the grid cell
  • Depth: depth in meter, directly measured during the survey.
  • SBT: monthly sea bottom temperature in °C from the Global Ocean Physics Reanalysis (GLORYSs2v4)
  • SBS: monthly sea bottom salinity from the Global Ocean Physics Reanalysis (GLORYSs2v4)
  • Chl: Chlorophyll a concentration (in \(mg.m^{-3}\)) as a proxy for primary production and food availability from the GlobColour database
  • SBT_sea: seasonality of sea bottom temperature, calculated as the difference between the warmest and the coldest month of the year.
  • Chl_sea: seasonality of chlorophyll a concentration, calculated as the difference between the highest and the lowest primary production in the year
  • Fishing: the cumulative demersal fishing pressure in 2013, estimated globally by Halpern et al. 2015, DOI 10.1038/ncomms8615.

3.2 Check distribution and correlation among variables

The same considerations apply for the environmental variables as for the trait variables. To get an overview of the distribution of the variables, we can create a boxplot of the scaled variables (the scaling remove the difference in units).

boxplot(scale(env))

The distribution of SBS (salinity) is not wide, but we will keep it as it is.

3.2 Maping the environement variables

Another interesting feature is to look at the spatial distribution of environmental variables (to visually check if there is any error).

For that, we will define a ggplot2 customized function (no need to understand it fully).

library(ggplot2)
mapggplot<-function(x, y, Var, colpal, main="", 
                    xlab="Longitude", ylab="Latitude",
                    legx=0.9, legy=0.2){
  df <- data.frame(x,y,Var)
  ggplot() +
    geom_tile(data=df,aes(x=x,y=y,fill=Var)) + # 
    scale_fill_gradientn(name = main, colours=colpal, na.value = 'white')+#,limits=c(0,25100)) +
    borders(fill="gray44",colour="black") +
    coord_quickmap(xlim=c(range(x)),ylim=c(range(y)))+
    labs(x =xlab,y= ylab)+
    theme(legend.position = c(legx,legy))+
    theme(legend.title = element_text(size = 8),legend.text = element_text(size = 7))+
    guides(shape = guide_legend(override.aes = list(size = 0.2)))+
    theme(panel.background = element_rect(fill=alpha('light blue', 0.4), colour = 'black'))
}
#Choice of sequential color scale
colpal <- rev(heat.colors(6))
# or from RColorBrewer package
# colpal <- rev(RColorBrewer::brewer.pal(6,"OrRd")) 

mapggplot(env$Longitude, env$Latitude, env$Depth, colpal)

Conclusion

We presented some of the considerations that need to be tackled when preparing dataset for investigating trait-environment relationship. Once the dataset is ready, there are multiple methods that can be used, each of them differ slightly in their aim and hypothesis. Among them:

  1. Computing aggregated indicators of community weighted mean traits (CWM) and link them to environmental variables.
  2. Multivariate methods, such as RLQ analysis
  3. Hierarchical Modeling of Species Communities (HMSC)