1 Abstract

The paleogeosciences are becoming more and more interdisciplinary, and studies increasingly rely on large collections of data derived from multiple data repositories. Integrating diverse datasets from multiple sources into complex workflows increases the challenge of creating reproducible and open science, as data formats and tools are often noninteroperable, requiring manual manipulation of data into standardized formats, resulting in a disconnect in data provenance and confounding reproducibility. Here we present a notebook that demonstrates how the Linked PaleoData (LiPD) framework is used as an interchange format to allow data from multiple data sources to be integrated in a complex workflow using emerging packages in R for geochronological uncertainty quantification and abrupt change detection. Specifically, in this notebook, we use the neotoma2 and lipdR packages to access paleoecological data from the Neotoma Database, and paleoclimate data from compilations hosted on Lipdverse. Age uncertainties for datasets from both sources are then quantified using the geoChronR package, and those data, and their associated age uncertainties, are then investigated for abrupt changes using the actR package, with accompanying visualizations. The result is an integrated, reproducible workflow in R that demonstrates how this complex series of multisource data integration, analysis and visualization can be integrated into an efficient, open scientific narrative.

2 Purpose

This notebook showcases a workflow that uses emerging tools to efficiently access a pollen dataset from Neotoma (Williams et al. 2018), create an age model using Bacon (Blaauw and Christen 2011), store the age ensemble and create visualizations using geoChronR (N. P. McKay, Emile-Geay, and Khider 2021), before finally conducting age-uncertain change detection using the Abrupt Change Toolkit in R (actR) (N. McKay and Emile-Geay 2022).

3 Technical contributions

This notebook highlights recent advances in multiple R packages

4 Methodology

neotoma2 leverages Neotoma’s API v2.0 (Goring 2022). lipdR::neotoma2lipd() converts neotoma2 site objects into lipdR lipd objects. This notebook uses Bacon Bacon (Blaauw and Christen 2011) to create an ensemble age-depth model within the geoChronR package (N. P. McKay, Emile-Geay, and Khider 2021). The changepoint detection algorithm implemented in actR (N. McKay and Emile-Geay 2022) which is used in this notebook relies heavily on the changepoint package Killick, Haynes, and Eckley (2016).

5 Results

This notebook demonstrates how the Linked PaleoData (LiPD) framework can be used as an interchange format to allow data from multiple data sources to be integrated in a complex workflow using emerging packages in R for geochronological uncertainty quantification and abrupt change detection. Efficient workflows that support data from multiple sources are an important component of reproducible science.

6 Funding

This notebook describes several LinkedEarth activities. LinkedEarth is a community of paleoscientists working to develop standards and software to enable paleoscience in the era of Big Data. This community produces data products and standards, software, cyberinfrastructure, and training opportunities. LinkedEarth was launched as part of an EarthCube Integrative Activities project funded project (ICER-1540996), and is currently supported by an EarthCube Data Capabilities project (ICER-2126510).

Neotoma-LiPD interoperability, and the Abrubt Change Toolkit in R (actR) are LinkedEarth activities that are funded as part of the Belmont Forum’s Science-driven e-Infrastructure Innovation (SEI) initiative. In the US, this is NSF award ICER-1929460.

7 Citation

Include the recommended citation for the document. You may choose to use Zenodo, or another service (such as figShare or your University services) to obtain a DOI for your code repository. If you are using GitHub, consider adding a GitHub Citation file. Ensure that the title in the suggested citation, and the authors match those in your YAML header.

For example:

8 Acknowledgements

Thank you to Socorro Dominguez Vidana and Simon Goring for their work on the neotoma2 and to the scientists who generated and shared the Bambili pollen data and the MD03-2707 paleotemperature data.

9 Setup

You will need to download the Rmd, HTML and bibtex file from the earthcube repository, store it in a single directory and then set up a project (see this link for further discussion of this.) Then you can either knit the markdown file, or run the cells interactively.

9.1 Packages

Install all the required packages in the R setup section at the head of this document. As noted previously, use pacman to do the installation. Alternatively, use the install.R script in the repository to help install some of the github packages prior to the pacman cells. Running the notebook on mybinder, is also an option.

9.2 Data processing and analysis

The paleogeosciences are becoming more and more interdisciplinary, and studies increasingly rely on large collections of data derived from multiple data repositories. Integrating diverse datasets from multiple sources into complex workflows increases the challenge of creating reproducible and open science, as data formats and tools are often noninteroperable, requiring manual manipulation of data into standardized formats, resulting in a disconnect in data provenance and confounding reproducibility. Here, we illustrate a use case where a scientist wants to conduct age-uncertain abrupt-change analysis on two datasets from Africa - a pollen record from Lake Bambili, and a temperature reconstruction from the Gulf of Guinea.

9.2.1 Lake Bambili

Lake Bambili is a high-elevation (2273 masl) crater lake in Cameroon (05° 56′ 11.9 N, 10°14′ 31.6 E) in the highlands of equatorial West Africa. Recently, (Lézine et al. 2019) published a study using pollen data to show changes in forest composition over the past 90,000 years. In this notebook, we’ll take a look at abrupt shifts in these pollen assemblages, and compare them changes in a nearby climate record from the Gulf of Guinea.

Thankfully, Lezine et al. (2019) archived their pollen assemblage data at Neotoma, and we’ll use the neotoma2 package (Dominguez Vidana and Goring 2022) to access the data, in this case, searching by sitename.

L <- neotoma2::get_sites(sitename = "Bambili 2") %>% #Find the site of interest
  neotoma2::get_downloads() %>% #download the data for this site
  lipdR::neotoma2lipd() #convert the site object into a LiPD object

Now that we’ve downloaded the data and converted it to a LiPD object, it’s ready to be used in LinkedEarth’s ecosystem of tools built around the LiPD standard (e.g., geoChronR, actR, pyleoclim).

First, let’s quickly create a map using the mapLipd() function in geoChronR to visualize Lake Bambili’s location in equatorial west Africa.

mapLipd(L,map.type = "stamen",extend.range = 5)
**Figure 1.** Map of Africa showing the location of Lake Bambili.

Figure 1. Map of Africa showing the location of Lake Bambili.

9.2.2 Age modelling

In this notebook, we want to take a look at some of the abrupt changes in ecosystems recorded by pollen assemblages in this lake, while propagating age uncertainties. Neotoma does not store age ensembles, so we will need to generate a new age model. Here we’ll use Bacon (Blaauw and Christen 2011), one of the methods built into geoChronR, to create an age model and import the ensembles into the LiPD structure. Here, we’ve specified alll of the parameters so it will run without input, but it’s often convenient to call runBacon() with fewer inputs and run it interactively.

L <- geoChronR::runBacon(L,
              lab.id.var = 'neotomaChronConrolId',
              age.14c.var = NULL,
              age.14c.uncertainty.var = NULL,
              age.var = 'age',
              age.uncertainty.var = 'ageUnc',
              depth.var = 'depth',
              reservoir.age.14c.var = NULL,
              reservoir.age.14c.uncertainty.var = NULL,
              rejected.ages.var = NULL,
              plot.pdf = FALSE, 
              accept.suggestions = TRUE)

Now that the age model has finished running and the results are stored in the LiPD object, let’s plot the results.

plotChronEns(L) + ggtitle("Age - depth model for Lake Bambili")
## [1] "Found it! Moving on..."
## [1] "Found it! Moving on..."
## [1] "plotting your chron ensemble. This make take a few seconds..."
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
**Figure 2**. Age depth model for Lake Bambili, generated using Bacon. Ages (and their uncertainties) are shown as purple distributions. The median age model is shown in black, along with the 50 and 95% highest probability density regions in dark and light gray, respectively. Five randomly selected age ensemble members are shown in red.

Figure 2. Age depth model for Lake Bambili, generated using Bacon. Ages (and their uncertainties) are shown as purple distributions. The median age model is shown in black, along with the 50 and 95% highest probability density regions in dark and light gray, respectively. Five randomly selected age ensemble members are shown in red.

Great, the uncertainties represented here are what we’d like to propagate through the abrupt change analysis. To do this, the first step is to “map” the age ensemble from the age model to the depths of the pollen data stored in the paleoData tables.

L <- mapAgeEnsembleToPaleoData(L)
## [1] "Bambili2.Lzine.2019"
## [1] "Looking for age ensemble...."
## [1] "Found it! Moving on..."
## [1] "Found it! Moving on..."
## [1] "getting depth from the paleodata table..."
## [1] "Found it! Moving on..."

Now that that’s done, let’s select the paleoData age ensemble for future reference.

ageEnsemble <- selectData(L,"ageEnsemble")
## [1] "Found it! Moving on..."

9.2.3 Paleoecological data analysis

Now let’s do some analysis on the paleoecological data. For this example, we will just perform a simple calculation. We’re interested in percent of pollen counts that are classified as trees or shrubs by their “ecological group.”

To do this, we will take advantage of lipdR’s integration with tidyverse tools in R. So our first step will be to convert the LiPD dataset into a specialized “tibble” object. It’s a regular tibble, where each row is a variable in the dataset, which includes all the metadata, and nested variables for time and paleodata values. We then use dplyr tools to calculate the total amount of tree and shrub pollen counts for each sample. Finally, we use the magrittr pipe for efficiency and clarity. See the comments in the code block for step by step explanations.

TRSH <- as.lipdTsTibble(L) %>% #convert the LiPD object to a LiPD Timeseries Tibble object
  dplyr::filter(paleoData_ecologicalgroup == "TRSH") %>% #retain only the variables that are in the trees and shrubs ecological group
  lipdR::tabulateTs(time.var = "age") %>% #convert this filtered lipd-ts-tibble to a traditional paleoecological data table, with observations in the rows and age and pollen variables in the columns
  dplyr::rowwise() %>% #set up the analysis to row runwise
  dplyr::mutate(total = sum(c_across(-age),na.rm = TRUE)) #and use mutate to calculate a new column, the sum of all pollen counts. 

Great, we now have a regular tibble that includes all our sample levels, and a new “total” column that we’ve calculated. This will be the numerator in our “percent trees and shrubs” calculation. Let’s repeat this process but with all of the non-aquatic pollen groups to ge the denominator.

total <- as.lipdTsTibble(L) %>%
  dplyr::filter(paleoData_ecologicalgroup %in% c("TRSH","UPHE","VACR","SUCC","MANG")) %>%
  tabulateTs(time.var = "age") %>%
  rowwise() %>%
  mutate(total = sum(c_across(-age),na.rm = TRUE))

Now we can calculate the percentage of tree and shrub pollen relative to total non-aquatic pollen.

treeShrubPercent <- TRSH$total/total$total * 100

Now let’s plot that percentage, along with the age ensemble, to visualize these data and their corresponding age uncertainties.

geoChronR::plotTimeseriesEnsRibbons(X = ageEnsemble, Y =  treeShrubPercent) + ylab("Relative abundance of tree and shrub pollen (%)") + ggtitle("Tree/shrub pollen abundance at Lake Bambili.")
**Figure 3**. The relative abundance of tree and shrub pollen at Lake Bambili from ca. 90 to 20 ka, with associated age uncertainties. The median estimate is shown in black, along with the 50 and 95% highest probability density regions in dark and light gray, respectively.

Figure 3. The relative abundance of tree and shrub pollen at Lake Bambili from ca. 90 to 20 ka, with associated age uncertainties. The median estimate is shown in black, along with the 50 and 95% highest probability density regions in dark and light gray, respectively.

Interesting, it does look like there are some rapid changes in this dataset that are worth further exploration.

9.2.4 Abrupt changes equatorial west African forest composition

We’ve now used the data we accessed from neotoma to generate an age ensemble, and a derived paleoecological indicator, and we’re ready to analyze these data to test whether or not the apparent abrupt shifts stand out given age uncertainties and a robust null hypothesis. To do this, we’ll use the “abrupt change toolkit in R (actR)” package to test for shifts in the mean of these data. actR can accept lipd-ts-tibble data directly, or data, ensembles and metadata separately. In this case, since we generated the paleoecological indicator outside of LiPD, we’ll use the latter strategy and specify the data explicitly.

treeShrubMeanShift <- actR::detectShift(time = ageEnsemble$values, #matrix of age ensemble data
                  time.variable.name = "age", #variable name
                  time.units = "BP", # and units
                  vals = treeShrubPercent, #our paleoecological indicator (This could also be an ensemble if desired)
                  vals.variable.name = "Tree/Shrub", #variable name
                  vals.units = "%", # and units
                  summary.bin.step = 500, #over what timestep should we examine the results?
                  time.range = c(20000,90000)) #over what time range should we conduct the analysis. 
## Warning in if (!is.na(time.range)) {: the condition has length > 1 and only the
## first element will be used
## Testing null hypothesis with 100 simulations, each with 100 ensemble members. 
## This will probably take about 2 minutes

Because this methodology tests for abrupt shifts across the age ensemble and repeats the entire process with a null model to evaluate the robust null hypothesis level, it can take awhile to run.

Let’s take a look at the results. The print() or summarize() methods will give some key details.

print(treeShrubMeanShift)
## Tree/Shrub: Shift in Mean results
## Searched for Shift in Mean with a minimum segment length of 1 years, summarizing the results over windows of 500 years.
## 
## Overall result: Detected 2 Shift in Mean event(s) that were significant at the alpha < 0.05 level
## # A tibble: 2 × 3
##   time_start time_end empirical_pvalue
##        <dbl>    <dbl>            <dbl>
## 1     70153.   70653.           0.0200
## 2     65653.   66153.           0.0400
## Time uncertainty considered? TRUE Time ensemble supplied (n = 1000)
## Paleo uncertainty considered? TRUE Values ensemble generated in `propagateUncertainties()`)
## Error propagation ensemble members = 100
## Null hypothesis testing ensemble members = 100 
## Members simulated using isospectral method
## 
## Parameter choices:
## cpt.fun = changepoint::cpt.mean
## minimum.segment.length = 1
## method = AMOC
## penalty = MBIC
## ncpts.max = 1

These summaries are helpful, but ultimately, a visualization is often much more useful. Let’s plot the results:

treeShrubPlot <- plot(treeShrubMeanShift)
**Figure 4**. Results of the actR shift detection analysis for Lake Bambili tree and shrub abundance. Top panel: as in figure 3, except with red lines identifying the mean of intervals separated by significant detected shifts in the mean. Bottom panel: The black bars show the fraction of ensemble members for which a shift was detected during a each 500 year window, and the orange lines show 90 and 95th percentile thresholds results of the robust null hypothesis testing. Where the observed frequencies exceed the null hypothesis marks intervals where the detected shifts are robust at that confidence level.

Figure 4. Results of the actR shift detection analysis for Lake Bambili tree and shrub abundance. Top panel: as in figure 3, except with red lines identifying the mean of intervals separated by significant detected shifts in the mean. Bottom panel: The black bars show the fraction of ensemble members for which a shift was detected during a each 500 year window, and the orange lines show 90 and 95th percentile thresholds results of the robust null hypothesis testing. Where the observed frequencies exceed the null hypothesis marks intervals where the detected shifts are robust at that confidence level.

Figure 4 visualizes many of the features of this analysis. First we see the identified intervals of abrupt change, as changes in the mean represented by the red line on the top panel, and as the black bars in the bottom panel, which show the proportion of ensemble members with detected abrupt changes during each 500-year window. Comparing this to the 95th percentile of changes found in the null model makes it clear that the shift that occurred about 71,000 years ago is a robust interval of abrupt change.

References

Blaauw, M., and J. A. Christen. 2011. “Flexible Paleoclimate Age-Depth Models Using an Autoregressive Gamma Process.” Bayesian Analysis 6 (3): 457–74.
Dominguez Vidana, Socorro, and Simon Goring. 2022. “Neotoma2 r Package.” https://github.com/neotomadb/neotoma2.
Goring, Simon. 2022. “Neotoma API 2.0.” https://api.neotomadb.org/api-docs/ .
Heiser, Chris, and Nicholas McKay. 2022. “lipdR Package.” https://nickmckay.github.io/lipdR/.
Kaufman, Darrell, Nicholas McKay, Cody Routson, Michael Erb, Basil Davis, Oliver Heiri, Samuel Jaccard, et al. 2020. “A Global Database of Holocene Paleotemperature Records.” Scientific Data 7 (1): 1–34.
Killick, Rebecca, and Idris A. Eckley. 2014. changepoint: An R Package for Changepoint Analysis.” Journal of Statistical Software 58 (3): 1–19. http://www.jstatsoft.org/v58/i03/.
Killick, Rebecca, Kaylea Haynes, and Idris A. Eckley. 2016. changepoint: An R Package for Changepoint Analysis. https://CRAN.R-project.org/package=changepoint.
Lézine, Anne-Marie, Kenji Izumi, Masa Kageyama, and Gaston Achoundong. 2019. “A 90,000-Year Record of Afromontane Forest Responses to Climate Change.” Science 363 (6423): 177–81.
McKay, Nicholas P, Julien Emile-Geay, and Deborah Khider. 2021. “geoChronR–an r Package to Model, Analyze, and Visualize Age-Uncertain Data.” Geochronology 3 (1): 149–69.
McKay, Nicholas, and Julien Emile-Geay. 2022. “Abrupt Change Toolkit in r.” http://linked.earth/actR/.
Weldeab, Syee, David W Lea, Ralph R Schneider, and Nils Andersen. 2007. “155,000 Years of West African Monsoon and Ocean Thermal Evolution.” Science 316 (5829): 1303–7.
Williams, John W., Eric C. Grimm, Jessica L. Blois, Donald F. Charles, Edward B. Davis, Simon J. Goring, Russell W. Graham, et al. 2018. The Neotoma Paleoecology Database, a multiproxy, international, community-curated data resource.” Quaternary Research 89 (1): 156–77. https://doi.org/10.1017/qua.2017.105.