The paleogeosciences are becoming more and more interdisciplinary, and studies increasingly rely on large collections of data derived from multiple data repositories. Integrating diverse datasets from multiple sources into complex workflows increases the challenge of creating reproducible and open science, as data formats and tools are often noninteroperable, requiring manual manipulation of data into standardized formats, resulting in a disconnect in data provenance and confounding reproducibility. Here we present a notebook that demonstrates how the Linked PaleoData (LiPD) framework is used as an interchange format to allow data from multiple data sources to be integrated in a complex workflow using emerging packages in R for geochronological uncertainty quantification and abrupt change detection. Specifically, in this notebook, we use the neotoma2 and lipdR packages to access paleoecological data from the Neotoma Database, and paleoclimate data from compilations hosted on Lipdverse. Age uncertainties for datasets from both sources are then quantified using the geoChronR package, and those data, and their associated age uncertainties, are then investigated for abrupt changes using the actR package, with accompanying visualizations. The result is an integrated, reproducible workflow in R that demonstrates how this complex series of multisource data integration, analysis and visualization can be integrated into an efficient, open scientific narrative.
This notebook showcases a workflow that uses emerging tools to efficiently access a pollen dataset from Neotoma (Williams et al. 2018), create an age model using Bacon (Blaauw and Christen 2011), store the age ensemble and create visualizations using geoChronR
(N. P. McKay, Emile-Geay, and Khider 2021), before finally conducting age-uncertain change detection using the Abrupt Change Toolkit in R (actR) (N. McKay and Emile-Geay 2022).
This notebook highlights recent advances in multiple R packages
neotoma2
package (Dominguez Vidana and Goring 2022)neotoma2lipd()
function in the lipdR
package (Heiser and McKay 2022)geoChronR
(N. P. McKay, Emile-Geay, and Khider 2021) to create age models and store ensemble dataactR
(N. McKay and Emile-Geay 2022)neotoma2
leverages Neotoma’s API v2.0 (Goring 2022). lipdR::neotoma2lipd()
converts neotoma2
site objects into lipdR
lipd objects. This notebook uses Bacon Bacon (Blaauw and Christen 2011) to create an ensemble age-depth model within the geoChronR
package (N. P. McKay, Emile-Geay, and Khider 2021). The changepoint detection algorithm implemented in actR
(N. McKay and Emile-Geay 2022) which is used in this notebook relies heavily on the changepoint
package Killick, Haynes, and Eckley (2016).
This notebook demonstrates how the Linked PaleoData (LiPD) framework can be used as an interchange format to allow data from multiple data sources to be integrated in a complex workflow using emerging packages in R for geochronological uncertainty quantification and abrupt change detection. Efficient workflows that support data from multiple sources are an important component of reproducible science.
This notebook describes several LinkedEarth activities. LinkedEarth is a community of paleoscientists working to develop standards and software to enable paleoscience in the era of Big Data. This community produces data products and standards, software, cyberinfrastructure, and training opportunities. LinkedEarth was launched as part of an EarthCube Integrative Activities project funded project (ICER-1540996), and is currently supported by an EarthCube Data Capabilities project (ICER-2126510).
Neotoma-LiPD interoperability, and the Abrubt Change Toolkit in R (actR) are LinkedEarth activities that are funded as part of the Belmont Forum’s Science-driven e-Infrastructure Innovation (SEI) initiative. In the US, this is NSF award ICER-1929460.
Include the recommended citation for the document. You may choose to use Zenodo, or another service (such as figShare or your University services) to obtain a DOI for your code repository. If you are using GitHub, consider adding a GitHub Citation file. Ensure that the title in the suggested citation, and the authors match those in your YAML header.
For example:
Thank you to Socorro Dominguez Vidana and Simon Goring for their work on the neotoma2
and to the scientists who generated and shared the Bambili pollen data and the MD03-2707 paleotemperature data.
You will need to download the Rmd, HTML and bibtex file from the earthcube repository, store it in a single directory and then set up a project (see this link for further discussion of this.) Then you can either knit the markdown file, or run the cells interactively.
Install all the required packages in the R setup section at the head of this document. As noted previously, use pacman to do the installation. Alternatively, use the install.R
script in the repository to help install some of the github packages prior to the pacman cells. Running the notebook on mybinder, is also an option.
The paleogeosciences are becoming more and more interdisciplinary, and studies increasingly rely on large collections of data derived from multiple data repositories. Integrating diverse datasets from multiple sources into complex workflows increases the challenge of creating reproducible and open science, as data formats and tools are often noninteroperable, requiring manual manipulation of data into standardized formats, resulting in a disconnect in data provenance and confounding reproducibility. Here, we illustrate a use case where a scientist wants to conduct age-uncertain abrupt-change analysis on two datasets from Africa - a pollen record from Lake Bambili, and a temperature reconstruction from the Gulf of Guinea.
Lake Bambili is a high-elevation (2273 masl) crater lake in Cameroon (05° 56′ 11.9 N, 10°14′ 31.6 E) in the highlands of equatorial West Africa. Recently, (Lézine et al. 2019) published a study using pollen data to show changes in forest composition over the past 90,000 years. In this notebook, we’ll take a look at abrupt shifts in these pollen assemblages, and compare them changes in a nearby climate record from the Gulf of Guinea.
Thankfully, Lezine et al. (2019) archived their pollen assemblage data at Neotoma, and we’ll use the neotoma2
package (Dominguez Vidana and Goring 2022) to access the data, in this case, searching by sitename.
L <- neotoma2::get_sites(sitename = "Bambili 2") %>% #Find the site of interest
neotoma2::get_downloads() %>% #download the data for this site
lipdR::neotoma2lipd() #convert the site object into a LiPD object
Now that we’ve downloaded the data and converted it to a LiPD object, it’s ready to be used in LinkedEarth’s ecosystem of tools built around the LiPD standard (e.g., geoChronR, actR, pyleoclim).
First, let’s quickly create a map using the mapLipd()
function in geoChronR to visualize Lake Bambili’s location in equatorial west Africa.
mapLipd(L,map.type = "stamen",extend.range = 5)
Figure 1. Map of Africa showing the location of Lake Bambili.
In this notebook, we want to take a look at some of the abrupt changes in ecosystems recorded by pollen assemblages in this lake, while propagating age uncertainties. Neotoma does not store age ensembles, so we will need to generate a new age model. Here we’ll use Bacon (Blaauw and Christen 2011), one of the methods built into geoChronR, to create an age model and import the ensembles into the LiPD structure. Here, we’ve specified alll of the parameters so it will run without input, but it’s often convenient to call runBacon()
with fewer inputs and run it interactively.
L <- geoChronR::runBacon(L,
lab.id.var = 'neotomaChronConrolId',
age.14c.var = NULL,
age.14c.uncertainty.var = NULL,
age.var = 'age',
age.uncertainty.var = 'ageUnc',
depth.var = 'depth',
reservoir.age.14c.var = NULL,
reservoir.age.14c.uncertainty.var = NULL,
rejected.ages.var = NULL,
plot.pdf = FALSE,
accept.suggestions = TRUE)
Now that the age model has finished running and the results are stored in the LiPD object, let’s plot the results.
plotChronEns(L) + ggtitle("Age - depth model for Lake Bambili")
## [1] "Found it! Moving on..."
## [1] "Found it! Moving on..."
## [1] "plotting your chron ensemble. This make take a few seconds..."
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
Figure 2. Age depth model for Lake Bambili, generated using Bacon. Ages (and their uncertainties) are shown as purple distributions. The median age model is shown in black, along with the 50 and 95% highest probability density regions in dark and light gray, respectively. Five randomly selected age ensemble members are shown in red.
Great, the uncertainties represented here are what we’d like to propagate through the abrupt change analysis. To do this, the first step is to “map” the age ensemble from the age model to the depths of the pollen data stored in the paleoData tables.
L <- mapAgeEnsembleToPaleoData(L)
## [1] "Bambili2.Lzine.2019"
## [1] "Looking for age ensemble...."
## [1] "Found it! Moving on..."
## [1] "Found it! Moving on..."
## [1] "getting depth from the paleodata table..."
## [1] "Found it! Moving on..."
Now that that’s done, let’s select the paleoData age ensemble for future reference.
ageEnsemble <- selectData(L,"ageEnsemble")
## [1] "Found it! Moving on..."
Now let’s do some analysis on the paleoecological data. For this example, we will just perform a simple calculation. We’re interested in percent of pollen counts that are classified as trees or shrubs by their “ecological group.”
To do this, we will take advantage of lipdR’s integration with tidyverse tools in R. So our first step will be to convert the LiPD dataset into a specialized “tibble” object. It’s a regular tibble, where each row is a variable in the dataset, which includes all the metadata, and nested variables for time and paleodata values. We then use dplyr
tools to calculate the total amount of tree and shrub pollen counts for each sample. Finally, we use the magrittr
pipe for efficiency and clarity. See the comments in the code block for step by step explanations.
TRSH <- as.lipdTsTibble(L) %>% #convert the LiPD object to a LiPD Timeseries Tibble object
dplyr::filter(paleoData_ecologicalgroup == "TRSH") %>% #retain only the variables that are in the trees and shrubs ecological group
lipdR::tabulateTs(time.var = "age") %>% #convert this filtered lipd-ts-tibble to a traditional paleoecological data table, with observations in the rows and age and pollen variables in the columns
dplyr::rowwise() %>% #set up the analysis to row runwise
dplyr::mutate(total = sum(c_across(-age),na.rm = TRUE)) #and use mutate to calculate a new column, the sum of all pollen counts.
Great, we now have a regular tibble that includes all our sample levels, and a new “total” column that we’ve calculated. This will be the numerator in our “percent trees and shrubs” calculation. Let’s repeat this process but with all of the non-aquatic pollen groups to ge the denominator.
total <- as.lipdTsTibble(L) %>%
dplyr::filter(paleoData_ecologicalgroup %in% c("TRSH","UPHE","VACR","SUCC","MANG")) %>%
tabulateTs(time.var = "age") %>%
rowwise() %>%
mutate(total = sum(c_across(-age),na.rm = TRUE))
Now we can calculate the percentage of tree and shrub pollen relative to total non-aquatic pollen.
treeShrubPercent <- TRSH$total/total$total * 100
Now let’s plot that percentage, along with the age ensemble, to visualize these data and their corresponding age uncertainties.
geoChronR::plotTimeseriesEnsRibbons(X = ageEnsemble, Y = treeShrubPercent) + ylab("Relative abundance of tree and shrub pollen (%)") + ggtitle("Tree/shrub pollen abundance at Lake Bambili.")
Figure 3. The relative abundance of tree and shrub pollen at Lake Bambili from ca. 90 to 20 ka, with associated age uncertainties. The median estimate is shown in black, along with the 50 and 95% highest probability density regions in dark and light gray, respectively.
Interesting, it does look like there are some rapid changes in this dataset that are worth further exploration.
We’ve now used the data we accessed from neotoma to generate an age ensemble, and a derived paleoecological indicator, and we’re ready to analyze these data to test whether or not the apparent abrupt shifts stand out given age uncertainties and a robust null hypothesis. To do this, we’ll use the “abrupt change toolkit in R (actR)” package to test for shifts in the mean of these data. actR can accept lipd-ts-tibble data directly, or data, ensembles and metadata separately. In this case, since we generated the paleoecological indicator outside of LiPD, we’ll use the latter strategy and specify the data explicitly.
treeShrubMeanShift <- actR::detectShift(time = ageEnsemble$values, #matrix of age ensemble data
time.variable.name = "age", #variable name
time.units = "BP", # and units
vals = treeShrubPercent, #our paleoecological indicator (This could also be an ensemble if desired)
vals.variable.name = "Tree/Shrub", #variable name
vals.units = "%", # and units
summary.bin.step = 500, #over what timestep should we examine the results?
time.range = c(20000,90000)) #over what time range should we conduct the analysis.
## Warning in if (!is.na(time.range)) {: the condition has length > 1 and only the
## first element will be used
## Testing null hypothesis with 100 simulations, each with 100 ensemble members.
## This will probably take about 2 minutes
Because this methodology tests for abrupt shifts across the age ensemble and repeats the entire process with a null model to evaluate the robust null hypothesis level, it can take awhile to run.
Let’s take a look at the results. The print()
or summarize()
methods will give some key details.
print(treeShrubMeanShift)
## Tree/Shrub: Shift in Mean results
## Searched for Shift in Mean with a minimum segment length of 1 years, summarizing the results over windows of 500 years.
##
## Overall result: Detected 2 Shift in Mean event(s) that were significant at the alpha < 0.05 level
## # A tibble: 2 × 3
## time_start time_end empirical_pvalue
## <dbl> <dbl> <dbl>
## 1 70153. 70653. 0.0200
## 2 65653. 66153. 0.0400
## Time uncertainty considered? TRUE Time ensemble supplied (n = 1000)
## Paleo uncertainty considered? TRUE Values ensemble generated in `propagateUncertainties()`)
## Error propagation ensemble members = 100
## Null hypothesis testing ensemble members = 100
## Members simulated using isospectral method
##
## Parameter choices:
## cpt.fun = changepoint::cpt.mean
## minimum.segment.length = 1
## method = AMOC
## penalty = MBIC
## ncpts.max = 1
These summaries are helpful, but ultimately, a visualization is often much more useful. Let’s plot the results:
treeShrubPlot <- plot(treeShrubMeanShift)
Figure 4. Results of the actR shift detection analysis for Lake Bambili tree and shrub abundance. Top panel: as in figure 3, except with red lines identifying the mean of intervals separated by significant detected shifts in the mean. Bottom panel: The black bars show the fraction of ensemble members for which a shift was detected during a each 500 year window, and the orange lines show 90 and 95th percentile thresholds results of the robust null hypothesis testing. Where the observed frequencies exceed the null hypothesis marks intervals where the detected shifts are robust at that confidence level.
Figure 4 visualizes many of the features of this analysis. First we see the identified intervals of abrupt change, as changes in the mean represented by the red line on the top panel, and as the black bars in the bottom panel, which show the proportion of ensemble members with detected abrupt changes during each 500-year window. Comparing this to the 95th percentile of changes found in the null model makes it clear that the shift that occurred about 71,000 years ago is a robust interval of abrupt change.
Now that we’ve identified an interval of rapid change, let’s take a look at a nearby climate record, a sea surface temperature (SST) reconstruction from the Gulf of Guinea (Weldeab et al. 2007), to take a first look of whether the change in ecological communities might be driven by climate. Neotoma doesn’t host SST reconstuctions, so we’ll get these data from lipdverse.org a community repository of LiPD datasets.
MD03_2707 <- readLipd("https://lipdverse.org/Temp12k/1_0_2/MD03_2707.Weldeab.2007.ensembles.lpd")
## [1] "Loading 1 datasets from /Volumes/data/tempdir//RtmpTqmA1m/MD03_2707.Weldeab.2007.ensembles.lpd..."
## [1] "reading: MD03_2707.Weldeab.2007.ensembles.lpd"
And as before, let’s take a look at the map.
mapLipd(MD03_2707,map.type = "stamen",extend.range = 5)
Figure 5. Location of core MD03-2707 in the Gulf of Guinea.
This dataset, originally published by Weldeab et al. (2007) and included in the Temperature 12k compilation (Kaufman et al. 2020), already includes age and temperature ensembles, so we don’t have to run another age model. We still will need to map the ensembles to the paleodata and convert it to a lipd-ts-tibble though.
MD03_2707_ts <- mapAgeEnsembleToPaleoData(MD03_2707) %>% as.lipdTsTibble()
## [1] "MD03_2707.Weldeab.2007"
## [1] "Looking for age ensemble...."
## [1] "Found it! Moving on..."
## [1] "Found it! Moving on..."
## [1] "getting depth from the paleodata table..."
## [1] "Found it! Moving on..."
As before, now we will use actR
to detect shifts in the mean, while propagating uncertainties, and comparing the results to a robust null hypothesis. This time, since all the data are stored in the LiPD object, we can just pass in the lipd-ts-tibble object directly, and then tell it which dataset we’d like to analyze.
MD03_2707_ms <- actR::detectShift(MD03_2707_ts, #the lipd-ts-tibble that includes all the data
vals.variable.name = "SST_from_d18o_ruber_pink_ensemble", #the variable to analyze
summary.bin.step = 500, #same as before
time.range = c(20000,90000)) #same as before
## Selected SST_from_d18o_ruber_pink_ensemble
## timeUnits is at least partially absent in the input data
## Testing null hypothesis with 100 simulations, each with 100 ensemble members.
## This will probably take about 1 minutes
And as before, we can print out the summary:
MD03_2707_ms
## MD03_2707.Weldeab.2007 - SST_from_d18o_ruber_pink_ensemble: Shift in Mean results
## Searched for Shift in Mean with a minimum segment length of 1 years, summarizing the results over windows of 500 years.
##
## Overall result: Detected 11 Shift in Mean event(s) that were significant at the alpha < 0.05 level
## # A tibble: 11 × 3
## time_start time_end empirical_pvalue
## <dbl> <dbl> <dbl>
## 1 75700 76200 0
## 2 75200 75700 0
## 3 74700 75200 0
## 4 74200 74700 0
## 5 76200 76700 0
## 6 76700 77200 0
## 7 77200 77700 0
## 8 78200 78700 0
## 9 73200 73700 0
## 10 73700 74200 0
## 11 77700 78200 0.0100
## Time uncertainty considered? TRUE Time ensemble supplied (n = 100)
## Paleo uncertainty considered? TRUE Values ensemble supplied (n = 1000)
## Error propagation ensemble members = 100
## Null hypothesis testing ensemble members = 100
## Members simulated using isospectral method
##
## Parameter choices:
## cpt.fun = changepoint::cpt.mean
## minimum.segment.length = 1
## method = AMOC
## penalty = MBIC
## ncpts.max = 1
Once again, we find some intervals with significant changes in the mean. Let’s look at the plot to see how the timing of the abrupt shifts matches the ecological change.
MD03_plot <- plot(MD03_2707_ms)
Figure 6. Results of the actR shift detection analysis for MDO3-2707 sea surface temperature estimates. Panels as in figure 4.
actR::detectShift()
, with the parameters used here, consistently identifies a change in mean between 80,000 and 70,000 years ago, which suggests that the shift in Gulf of Guinea temperatures may have slightly proceeded or occurred simultaneously with the decrease in trees and shrubs inferred from Lake Bambili pollen.
To compare the results directly, we plot the likelihood of each of the shifts on the same figure.
ggplot() +
geom_area(data = MD03_2707_ms$shiftDetection, aes(y = event_probability, x = time_mid, fill = "Gulf of Guinea SSTs")) +
geom_area(data = treeShrubMeanShift$shiftDetection, aes(y = event_probability, x = time_mid, fill = "Lake Bambili Tree/Shrub percentage"), alpha = 0.8) +
scale_fill_brewer("Dataset",palette = "Set1")+
scale_x_reverse("Age (yr BP)") +
ylab("Shift probability") +
theme_bw() +
ggtitle("Detected shifts in equatorial West Africa")
Figure 7. Detected shift frequency at Lake Bambili and in the Gulf Guinea. The fraction of ensemble members for which a shift was detected during a each 500 year window for each record, as shown in figures 4 and 6.
Clearly, there is more work to be done here to explore how changes in climate lead to community change in afromontane forests. This example simply highlights how these emerging tools can be used to efficiently and reproducibly work with paleogeoscientific data from multiple repositories.
Without these tools, preparing the data for analysis would be tedious, as the structure of the data can be heterogeneous. For example, without neotoma2
and neotoma2lipd()
, a user would access the Lake Bambili data by navigating the Neotoma Explorer website, finding Lake Bambili, and downloading csv files before manipulating for various analyses. The Gulf of Guinea SST data can be found at NOAA’s World Data Service for Paleoclimatology, and downloaded as a text file. The user would then edit that file as necessary before loading and manipulating for subsequent analysis.