{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Cleaning with Pandas and Jupyter Notebooks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Author(s)\n",
    "\n",
    "- Author1 = {\"name\": \"Wai-Yin Kwan\", \"affiliation\": \"Whirl-i-Gig\", \"email\": \"wyk@whirl-i-gig.com\", \"orcid\": \"0000-0001-6113-0210\"}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Data-Cleaning-with-Pandas-and-Jupyter-Notebooks\" data-toc-modified-id=\"Data-Cleaning-with-Pandas-and-Jupyter-Notebooks-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Data Cleaning with Pandas and Jupyter Notebooks</a></span><ul class=\"toc-item\"><li><span><a href=\"#Author(s)\" data-toc-modified-id=\"Author(s)-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Author(s)</a></span></li><li><span><a href=\"#Purpose\" data-toc-modified-id=\"Purpose-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>Purpose</a></span></li><li><span><a href=\"#Technical-contributions\" data-toc-modified-id=\"Technical-contributions-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>Technical contributions</a></span></li><li><span><a href=\"#Methodology\" data-toc-modified-id=\"Methodology-1.4\"><span class=\"toc-item-num\">1.4&nbsp;&nbsp;</span>Methodology</a></span></li><li><span><a href=\"#Results\" data-toc-modified-id=\"Results-1.5\"><span class=\"toc-item-num\">1.5&nbsp;&nbsp;</span>Results</a></span></li><li><span><a href=\"#Funding\" data-toc-modified-id=\"Funding-1.6\"><span class=\"toc-item-num\">1.6&nbsp;&nbsp;</span>Funding</a></span></li><li><span><a href=\"#Keywords\" data-toc-modified-id=\"Keywords-1.7\"><span class=\"toc-item-num\">1.7&nbsp;&nbsp;</span>Keywords</a></span></li><li><span><a href=\"#Citation\" data-toc-modified-id=\"Citation-1.8\"><span class=\"toc-item-num\">1.8&nbsp;&nbsp;</span>Citation</a></span></li><li><span><a href=\"#Acknowledgements\" data-toc-modified-id=\"Acknowledgements-1.9\"><span class=\"toc-item-num\">1.9&nbsp;&nbsp;</span>Acknowledgements</a></span></li></ul></li><li><span><a href=\"#Setup\" data-toc-modified-id=\"Setup-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Setup</a></span><ul class=\"toc-item\"><li><span><a href=\"#Directory-structure\" data-toc-modified-id=\"Directory-structure-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>Directory structure</a></span></li><li><span><a href=\"#Library-import\" data-toc-modified-id=\"Library-import-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>Library import</a></span></li><li><span><a href=\"#Local-library-import\" data-toc-modified-id=\"Local-library-import-2.3\"><span class=\"toc-item-num\">2.3&nbsp;&nbsp;</span>Local library import</a></span></li><li><span><a href=\"#Set-file-path-variables\" data-toc-modified-id=\"Set-file-path-variables-2.4\"><span class=\"toc-item-num\">2.4&nbsp;&nbsp;</span>Set file path variables</a></span></li></ul></li><li><span><a href=\"#Best-practices-and-processing-notes\" data-toc-modified-id=\"Best-practices-and-processing-notes-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>Best practices and processing notes</a></span><ul class=\"toc-item\"><li><span><a href=\"#Use-conda-to-manage-dependencies-and-virtual-environments\" data-toc-modified-id=\"Use-conda-to-manage-dependencies-and-virtual-environments-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>Use conda to manage dependencies and virtual environments</a></span></li><li><span><a href=\"#Version-Control-/-Git\" data-toc-modified-id=\"Version-Control-/-Git-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>Version Control / Git</a></span></li><li><span><a href=\"#Unit-testing\" data-toc-modified-id=\"Unit-testing-3.3\"><span class=\"toc-item-num\">3.3&nbsp;&nbsp;</span>Unit testing</a></span></li><li><span><a href=\"#Changes-to-data-files-are-done-in-code\" data-toc-modified-id=\"Changes-to-data-files-are-done-in-code-3.4\"><span class=\"toc-item-num\">3.4&nbsp;&nbsp;</span>Changes to data files are done in code</a></span></li><li><span><a href=\"#Read-files\" data-toc-modified-id=\"Read-files-3.5\"><span class=\"toc-item-num\">3.5&nbsp;&nbsp;</span>Read files</a></span><ul class=\"toc-item\"><li><span><a href=\"#Example\" data-toc-modified-id=\"Example-3.5.1\"><span class=\"toc-item-num\">3.5.1&nbsp;&nbsp;</span>Example</a></span></li></ul></li><li><span><a href=\"#Viewing-the-dataframe\" data-toc-modified-id=\"Viewing-the-dataframe-3.6\"><span class=\"toc-item-num\">3.6&nbsp;&nbsp;</span>Viewing the dataframe</a></span></li><li><span><a href=\"#Basic-cleanup-pattern\" data-toc-modified-id=\"Basic-cleanup-pattern-3.7\"><span class=\"toc-item-num\">3.7&nbsp;&nbsp;</span>Basic cleanup pattern</a></span></li><li><span><a href=\"#View-the-changed-files\" data-toc-modified-id=\"View-the-changed-files-3.8\"><span class=\"toc-item-num\">3.8&nbsp;&nbsp;</span>View the changed files</a></span></li></ul></li><li><span><a href=\"#Data-processing-and-analysis\" data-toc-modified-id=\"Data-processing-and-analysis-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Data processing and analysis</a></span><ul class=\"toc-item\"><li><span><a href=\"#Basic-file-cleanup\" data-toc-modified-id=\"Basic-file-cleanup-4.1\"><span class=\"toc-item-num\">4.1&nbsp;&nbsp;</span>Basic file cleanup</a></span><ul class=\"toc-item\"><li><span><a href=\"#Before-cleanup\" data-toc-modified-id=\"Before-cleanup-4.1.1\"><span class=\"toc-item-num\">4.1.1&nbsp;&nbsp;</span>Before cleanup</a></span></li><li><span><a href=\"#After-cleanup\" data-toc-modified-id=\"After-cleanup-4.1.2\"><span class=\"toc-item-num\">4.1.2&nbsp;&nbsp;</span>After cleanup</a></span></li><li><span><a href=\"#Clean-up-all-files-and-save-the-changes\" data-toc-modified-id=\"Clean-up-all-files-and-save-the-changes-4.1.3\"><span class=\"toc-item-num\">4.1.3&nbsp;&nbsp;</span>Clean up all files and save the changes</a></span></li></ul></li><li><span><a href=\"#Remove-leading-and-trailing-white-spaces\" data-toc-modified-id=\"Remove-leading-and-trailing-white-spaces-4.2\"><span class=\"toc-item-num\">4.2&nbsp;&nbsp;</span>Remove leading and trailing white spaces</a></span><ul class=\"toc-item\"><li><span><a href=\"#Before-cleanup\" data-toc-modified-id=\"Before-cleanup-4.2.1\"><span class=\"toc-item-num\">4.2.1&nbsp;&nbsp;</span>Before cleanup</a></span></li><li><span><a href=\"#After-cleanup\" data-toc-modified-id=\"After-cleanup-4.2.2\"><span class=\"toc-item-num\">4.2.2&nbsp;&nbsp;</span>After cleanup</a></span></li><li><span><a href=\"#Clean-up-all-files-and-save-the-changes\" data-toc-modified-id=\"Clean-up-all-files-and-save-the-changes-4.2.3\"><span class=\"toc-item-num\">4.2.3&nbsp;&nbsp;</span>Clean up all files and save the changes</a></span></li></ul></li><li><span><a href=\"#Normalizing-columns-names\" data-toc-modified-id=\"Normalizing-columns-names-4.3\"><span class=\"toc-item-num\">4.3&nbsp;&nbsp;</span>Normalizing columns names</a></span><ul class=\"toc-item\"><li><span><a href=\"#Get-all-unique-column-names\" data-toc-modified-id=\"Get-all-unique-column-names-4.3.1\"><span class=\"toc-item-num\">4.3.1&nbsp;&nbsp;</span>Get all unique column names</a></span></li><li><span><a href=\"#Create-taxa-and-non-taxa-files\" data-toc-modified-id=\"Create-taxa-and-non-taxa-files-4.3.2\"><span class=\"toc-item-num\">4.3.2&nbsp;&nbsp;</span>Create taxa and non-taxa files</a></span></li><li><span><a href=\"#Normalize-headers\" data-toc-modified-id=\"Normalize-headers-4.3.3\"><span class=\"toc-item-num\">4.3.3&nbsp;&nbsp;</span>Normalize headers</a></span></li><li><span><a href=\"#Before-changing-headers\" data-toc-modified-id=\"Before-changing-headers-4.3.4\"><span class=\"toc-item-num\">4.3.4&nbsp;&nbsp;</span>Before changing headers</a></span></li><li><span><a href=\"#After-changing-headers\" data-toc-modified-id=\"After-changing-headers-4.3.5\"><span class=\"toc-item-num\">4.3.5&nbsp;&nbsp;</span>After changing headers</a></span></li><li><span><a href=\"#Change-headers-for-all-files-and-save-the-changes\" data-toc-modified-id=\"Change-headers-for-all-files-and-save-the-changes-4.3.6\"><span class=\"toc-item-num\">4.3.6&nbsp;&nbsp;</span>Change headers for all files and save the changes</a></span></li></ul></li><li><span><a href=\"#Turn-one-column-into-multiple-columns\" data-toc-modified-id=\"Turn-one-column-into-multiple-columns-4.4\"><span class=\"toc-item-num\">4.4&nbsp;&nbsp;</span>Turn one column into multiple columns</a></span><ul class=\"toc-item\"><li><span><a href=\"#Before-cleanup\" data-toc-modified-id=\"Before-cleanup-4.4.1\"><span class=\"toc-item-num\">4.4.1&nbsp;&nbsp;</span>Before cleanup</a></span></li><li><span><a href=\"#After-cleanup\" data-toc-modified-id=\"After-cleanup-4.4.2\"><span class=\"toc-item-num\">4.4.2&nbsp;&nbsp;</span>After cleanup</a></span></li><li><span><a href=\"#Clean-up-all-files-and-save-the-changes\" data-toc-modified-id=\"Clean-up-all-files-and-save-the-changes-4.4.3\"><span class=\"toc-item-num\">4.4.3&nbsp;&nbsp;</span>Clean up all files and save the changes</a></span></li></ul></li><li><span><a href=\"#Clean-up-row-values\" data-toc-modified-id=\"Clean-up-row-values-4.5\"><span class=\"toc-item-num\">4.5&nbsp;&nbsp;</span>Clean up row values</a></span><ul class=\"toc-item\"><li><span><a href=\"#Before-cleanup\" data-toc-modified-id=\"Before-cleanup-4.5.1\"><span class=\"toc-item-num\">4.5.1&nbsp;&nbsp;</span>Before cleanup</a></span></li><li><span><a href=\"#After-cleanup\" data-toc-modified-id=\"After-cleanup-4.5.2\"><span class=\"toc-item-num\">4.5.2&nbsp;&nbsp;</span>After cleanup</a></span></li><li><span><a href=\"#Clean-up-all-files-and-save-the-changes\" data-toc-modified-id=\"Clean-up-all-files-and-save-the-changes-4.5.3\"><span class=\"toc-item-num\">4.5.3&nbsp;&nbsp;</span>Clean up all files and save the changes</a></span></li></ul></li><li><span><a href=\"#Check-if-required-columns-exists\" data-toc-modified-id=\"Check-if-required-columns-exists-4.6\"><span class=\"toc-item-num\">4.6&nbsp;&nbsp;</span>Check if required columns exists</a></span></li></ul></li><li><span><a href=\"#References\" data-toc-modified-id=\"References-5\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>References</a></span></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "## Purpose\n",
    "\n",
    "Since its inception in the 1960s, the International Ocean Discovery Program (IODP) and its predecessors have recovered sediment cores from ocean basins around the world. The cores provide an extensive amount of information about microorganism fossils and lithologies spanning millions of years. Current challenges to using the scientific ocean drilling (SOD) data include: datasets are stored in different databases with different formats, databases have limited search capabilities, and no one database has all the SOD data. \n",
    "\n",
    "The [extending Ocean Drilling Pursuits (eODP)](https://eodp.github.io/) project aims to standardized SOD data from several sources into one standardize database structure. The standardized data is then added to the Paleobiology Database (paleobiodb.org) and Macrostrat (macrostrat.org) to make the data easily accessible to researchers. This notebook demonstrates how the eODP team uses Jupyter notebooks to clean and standardize 50+ years of SOD data.  This notebook is developed for Python 3.8+ and Jupyter Lab.\n",
    "\n",
    "## Technical contributions\n",
    "\n",
    "Demonstration of how to create a reproducible data cleaning workflow using Git, unit testing, directory organization, and modular code.\n",
    "\n",
    "\n",
    "## Methodology\n",
    "For eODP, we use multiple notebooks to clean and standardize over 10,000 CSVs of ocean core data. For this presentation, we are using one notebook to clean just 10 files.  \n",
    "\n",
    "For each data cleaning step we created: separate sections in this notebook, data cleaning functions in `scripts/normalize_data.py`, and unit tests in `test/test_normalize_data.py`\n",
    "\n",
    "## Results\n",
    "\n",
    "By making the raw data, processed data, and data cleaning code available, we hope to give ocean core drilling researchers the ability review our data cleaning process and customize the data cleaning steps if so desired. We also hope researchers can apply ideas and tips from our data cleaning process to their own datasets. \n",
    "\n",
    "## Funding\n",
    "\n",
    "- Award1 = {\"agency\": \"US National Science Foundation\", \"award_code\": \"1928362\", \"award_URL\": \"https://www.nsf.gov/awardsearch/showAward?AWD_ID=1928362 \"}\n",
    " \n",
    " \n",
    "## Keywords\n",
    "Include up to 5 keywords, using the template below.\n",
    "\n",
    "keywords=[\"data cleaning\", \"reproducibility\", \"version control\", \"testing\"]\n",
    "\n",
    "## Citation\n",
    "\n",
    "Wai-Yin Kwan, 2021. Data Cleaning with Pandas and Jupyter Notebooks. Accessed 5/30/2022 at https://github.com/wykhuh/ec_eodp_demo\n",
    "\n",
    "\n",
    "\n",
    "## Acknowledgements \n",
    "\n",
    "The template is licensed under a <a href=\"http://creativecommons.org/licenses/by/4.0/\">Creative Commons Attribution 4.0 International License.</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Directory structure\n",
    "\n",
    "- notebooks: Jupyter notebooks\n",
    "- processed_data: processed data files\n",
    "- raw_data: raw, unprocessed data files\n",
    "- scripts: custom data cleaning scripts \n",
    "- tests: unit tests written in pytest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Library import"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data manipulation\n",
    "import pandas as pd\n",
    "\n",
    "# handle file system\n",
    "from pathlib import Path "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Local library import\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include local library paths\n",
    "import sys\n",
    "sys.path.append(str(Path.cwd().parent))\n",
    "\n",
    "# Import local libraries\n",
    "from scripts.normalize_data import (\n",
    "    normalize_columns, \n",
    "    remove_bracket_text,\n",
    "    remove_whitespace,\n",
    "    normalize_expedition_section_cols,\n",
    "    print_df\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set file path variables\n",
    "\n",
    "We store the file paths as variables so that can access the paths in multiple data cleaning steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "normalized_nontaxa_path = Path('..', 'processed_data', 'normalized_nontaxa_list.csv')\n",
    "normalized_taxa_path = Path('..', 'processed_data',  'normalized_taxa_list.csv')\n",
    "\n",
    "taxa_list_path = Path('..', 'processed_data', 'drafts', 'taxa_list.csv')\n",
    "nontaxa_list_path = Path('..', 'processed_data', 'drafts', 'nontaxa_list.csv')\n",
    "\n",
    "\n",
    "# Use `Path` and `rglob` to get all the CSVs in `clean_data/taxa` directory.\n",
    "paths = list(Path('..', 'processed_data', 'clean_data', 'taxa').rglob('*.csv'))\n",
    "len(paths)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Best practices and processing notes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use conda to manage dependencies and virtual environments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The package versions are stored environmental.yml so that when other people run the code, they will install the same packages.\n",
    "\n",
    "We use virtual environments to avoid dependency conflicts with other projects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Version Control / Git\n",
    "\n",
    "We use git to track changes in the code and data files.\n",
    "\n",
    "We found it helpful to make a commit after each data cleaning step in order to make it easier keep track of the data cleaning steps, and undo the certain steps if needed. To undo a data cleaning step that hasn't been committed, use `git restore`. To undo a data cleaning step that has been committed, use `git revert`, `git reset --hard`, or `get rebase`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Unit testing\n",
    "\n",
    "We wrote unit tests for the data cleaning functions in order to reduce the chances that our data cleaning steps would alter the data files in unexpected ways. The tests are in the `./tests` directory. We used pytest for the unit tests.\n",
    "\n",
    "Since we use the same data cleaning functions in multiple notebooks, we created a separate tests directory instead of putting the tests inside the notebooks.\n",
    "\n",
    "To run the tests, use `pytest` from the command line. All green dots and `[100%]` means all the tests passed.\n",
    "\n",
    "![screenshot pytest results](images/pytest.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "## Changes to data files are done in code\n",
    "\n",
    "In order to make the data cleaning process reproducible, all the changes to the processed data files are done in code. We do not manually edit the processed files. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used `pandas.read_csv(path, dtype=str)` to read csv and treat all columns as strings. The reason why we used `dtype=str` is because `pandas.read_csv(path)`  will automatically convert the columns to strings, integers, floats, dates. This automatic conversion can change values in unexpected ways such as converting a column with integers and NaN into floats and NaN. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = paths[4]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use `dtype=str` so that the column has correct integer values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    NaN\n",
       "1     20\n",
       "2    NaN\n",
       "3     23\n",
       "4     35\n",
       "Name: Pulleniatina coiling (dextral), dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(path, nrows=5 , dtype=str)\n",
    "df['Pulleniatina coiling (dextral)']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we don't include `dtype`, the columns has incorrect float values. pandas automatically converts the integers to floats because of NaNs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     NaN\n",
       "1    20.0\n",
       "2     NaN\n",
       "3    23.0\n",
       "4    35.0\n",
       "Name: Pulleniatina coiling (dextral), dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(path, nrows=5)\n",
    "df['Pulleniatina coiling (dextral)']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Viewing the dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing that we found helpful when data cleaning is to view the dataframe and the total number of rows and columns to check the changes in the dataframe.\n",
    "\n",
    "`print_df` is a custom function that calls `pd.DataFrame.shape` and `pd.DataFrame.head()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(23, 23)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>Bottom [cm]</th>\n",
       "      <th>Top Depth [m]</th>\n",
       "      <th>Bottom Depth [m]</th>\n",
       "      <th>Datum name</th>\n",
       "      <th>Zone name (short)</th>\n",
       "      <th>Zone name</th>\n",
       "      <th>Preservation</th>\n",
       "      <th>Group abundance</th>\n",
       "      <th>...</th>\n",
       "      <th>Globorotalia flexuosa</th>\n",
       "      <th>Globorotalia hessi</th>\n",
       "      <th>Globorotalia tosaensis</th>\n",
       "      <th>Globorotalia tumida</th>\n",
       "      <th>Neogloboquadrina acostaensis (dextral)</th>\n",
       "      <th>Neogloboquadrina humerosa</th>\n",
       "      <th>Pulleniatina coiling (dextral)</th>\n",
       "      <th>Pulleniatina coiling (sinistral)</th>\n",
       "      <th>Pulleniatina finalis</th>\n",
       "      <th>Sphaeroidinella dahiscens sensu lato</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>362-U1480E-1H-CC-PAL-FORAM_CHEN</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>7.71</td>\n",
       "      <td>7.76</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>P [P43]</td>\n",
       "      <td>R [A101]</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>R [A102]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>362-U1480E-2H-CC-PAL-FORAM_CHEN</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>17.54</td>\n",
       "      <td>17.59</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>P [P43]</td>\n",
       "      <td>A [A101]</td>\n",
       "      <td>...</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>A [A102]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>362-U1480E-3H-1-W 130/132-FORAM_CHEN</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>18.6</td>\n",
       "      <td>18.62</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>P [P43]</td>\n",
       "      <td>R [A101]</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>F [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>C [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>F [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>A [A102]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>362-U1480E-3H-2-W 70/72-FORAM_CHEN</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>19.5</td>\n",
       "      <td>19.52</td>\n",
       "      <td>T Globorotalia tosaensis (Gradstein et al., 2012)</td>\n",
       "      <td>PT1a [AO12]</td>\n",
       "      <td>Globigerinoides fistulosus (Anthonissen and Og...</td>\n",
       "      <td>P [P43]</td>\n",
       "      <td>A [A101]</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>C [A102]</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>A [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>F [A102]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>362-U1480E-3H-3-W 130/132-FORAM_CHEN</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>21.6</td>\n",
       "      <td>21.62</td>\n",
       "      <td>T Neogloboquadrina acostaensis (Gradstein et a...</td>\n",
       "      <td>PT1a [AO12]</td>\n",
       "      <td>Globigerinoides fistulosus (Anthonissen and Og...</td>\n",
       "      <td>M [P43]</td>\n",
       "      <td>A [A101]</td>\n",
       "      <td>...</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>R [A102]</td>\n",
       "      <td>A [A102]</td>\n",
       "      <td>F [A102]</td>\n",
       "      <td>F [A102]</td>\n",
       "      <td>35</td>\n",
       "      <td>12</td>\n",
       "      <td>F [A102]</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 23 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                 Sample Top [cm] Bottom [cm] Top Depth [m]  \\\n",
       "0       362-U1480E-1H-CC-PAL-FORAM_CHEN        0           5          7.71   \n",
       "1       362-U1480E-2H-CC-PAL-FORAM_CHEN        0           5         17.54   \n",
       "2  362-U1480E-3H-1-W 130/132-FORAM_CHEN        0           2          18.6   \n",
       "3    362-U1480E-3H-2-W 70/72-FORAM_CHEN        0           2          19.5   \n",
       "4  362-U1480E-3H-3-W 130/132-FORAM_CHEN        0           2          21.6   \n",
       "\n",
       "  Bottom Depth [m]                                         Datum name  \\\n",
       "0             7.76                                                NaN   \n",
       "1            17.59                                                NaN   \n",
       "2            18.62                                                NaN   \n",
       "3            19.52  T Globorotalia tosaensis (Gradstein et al., 2012)   \n",
       "4            21.62  T Neogloboquadrina acostaensis (Gradstein et a...   \n",
       "\n",
       "  Zone name (short)                                          Zone name  \\\n",
       "0               NaN                                                NaN   \n",
       "1               NaN                                                NaN   \n",
       "2               NaN                                                NaN   \n",
       "3       PT1a [AO12]  Globigerinoides fistulosus (Anthonissen and Og...   \n",
       "4       PT1a [AO12]  Globigerinoides fistulosus (Anthonissen and Og...   \n",
       "\n",
       "  Preservation Group abundance  ... Globorotalia flexuosa Globorotalia hessi  \\\n",
       "0      P [P43]        R [A101]  ...                   NaN                NaN   \n",
       "1      P [P43]        A [A101]  ...              R [A102]                NaN   \n",
       "2      P [P43]        R [A101]  ...                   NaN           F [A102]   \n",
       "3      P [P43]        A [A101]  ...                   NaN           C [A102]   \n",
       "4      M [P43]        A [A101]  ...              R [A102]                NaN   \n",
       "\n",
       "  Globorotalia tosaensis Globorotalia tumida  \\\n",
       "0                    NaN            R [A102]   \n",
       "1                    NaN            A [A102]   \n",
       "2                    NaN            C [A102]   \n",
       "3               R [A102]            A [A102]   \n",
       "4               R [A102]            A [A102]   \n",
       "\n",
       "  Neogloboquadrina acostaensis (dextral) Neogloboquadrina humerosa  \\\n",
       "0                                    NaN                  R [A102]   \n",
       "1                                    NaN                  R [A102]   \n",
       "2                                    NaN                  F [A102]   \n",
       "3                                    NaN                       NaN   \n",
       "4                               F [A102]                  F [A102]   \n",
       "\n",
       "  Pulleniatina coiling (dextral) Pulleniatina coiling (sinistral)  \\\n",
       "0                            NaN                              NaN   \n",
       "1                             20                                0   \n",
       "2                            NaN                              NaN   \n",
       "3                             23                                0   \n",
       "4                             35                               12   \n",
       "\n",
       "  Pulleniatina finalis Sphaeroidinella dahiscens sensu lato  \n",
       "0                  NaN                             R [A102]  \n",
       "1             R [A102]                             A [A102]  \n",
       "2                  NaN                             A [A102]  \n",
       "3             R [A102]                             F [A102]  \n",
       "4             F [A102]                                  NaN  \n",
       "\n",
       "[5 rows x 23 columns]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = paths[4]\n",
    "df = pd.read_csv(path, dtype=str)\n",
    "\n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic cleanup pattern\n",
    "\n",
    "For each data cleanup step, we loop over all the files, create a dataframe for each file, execute some code to clean the data, and then save the revised file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "for path in paths:  \n",
    "    \n",
    "    df = pd.read_csv(path, dtype=str)\n",
    "    \n",
    "    # code to change file   \n",
    "    \n",
    "    df.to_csv(path, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## View the changed files\n",
    "\n",
    "After each cleanup step, we use the desktop application [Github Desktop](https://desktop.github.com/) to spot check the changes to the files. \n",
    "\n",
    "If the files look ok, make a commit. If the data cleaning step did not act as expected, undo the changes in the data files using `git restore`, update the data cleaning function and the tests, and rerun the data cleaning step. \n",
    "\n",
    "The screenshot below shows the changes to `Micropal_CSV_1/318_U1355A_Planktic_Forams.csv`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![screenshot of Github Desktop showing the changed data files](images/git_diff.png)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data processing and analysis\n",
    "\n",
    "There are many steps needed to clean up the data files. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic file cleanup\n",
    "\n",
    "pandas has methods that can be used to do some basic file cleanup.\n",
    "  \n",
    "- delete dataframe column if all values are NA  \n",
    "  dropna(axis='columns', how='all', inplace=True) - [pandas.DataFrame.dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)\n",
    "  \n",
    "\n",
    "- delete dataframe row if all values are NA  \n",
    "  dropna(axis='index', how='all', inplace=True) - [pandas.DataFrame.dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)\n",
    "  \n",
    "  \n",
    "- remove duplicate rows in dataframe  \n",
    "  drop_duplicates(inplace=True) - [pandas.DataFrame.drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Before cleanup\n",
    "\n",
    "The file has 20 rows and 41 columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(20, 41)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Exp</th>\n",
       "      <th>Site</th>\n",
       "      <th>Hole</th>\n",
       "      <th>Core</th>\n",
       "      <th>Core-Sect</th>\n",
       "      <th>Type</th>\n",
       "      <th>Section</th>\n",
       "      <th>A/W</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>...</th>\n",
       "      <th>Neogloboquadrina pachyderma (sinistral)</th>\n",
       "      <th>Neogloboquadrina pachyderma (dextral)</th>\n",
       "      <th>Neogloboquadrina pachyderma A (dextral, inflated form)</th>\n",
       "      <th>Neogloboquadrina pachyderma B (sinistral, inflated form)</th>\n",
       "      <th>Neogloboquadrina pachyderma C (sinistral, compressed form)</th>\n",
       "      <th>Orbulina universa</th>\n",
       "      <th>Comments</th>\n",
       "      <th>Ship File Links</th>\n",
       "      <th>Shore File Links</th>\n",
       "      <th>File Data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>341-U1417B-2H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "      <td>2-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>341-U1417B-4H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>4</td>\n",
       "      <td>4-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>sandy</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>341-U1417B-6H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>6</td>\n",
       "      <td>6-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>diatom ooze. pyritilize foram</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>341-U1417B-9H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>9</td>\n",
       "      <td>9-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Rich in sandy terrigenous grains with presence...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>341-U1417B-10H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>10</td>\n",
       "      <td>10-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Rich in planktic foraminifers, pebbles, pyrite</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 41 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  Sample  Exp   Site Hole Core Core-Sect Type Section  A/W  \\\n",
       "0   341-U1417B-2H-CC-PAL  341  U1417    B    2      2-CC    H      CC  PAL   \n",
       "1   341-U1417B-4H-CC-PAL  341  U1417    B    4      4-CC    H      CC  PAL   \n",
       "2   341-U1417B-6H-CC-PAL  341  U1417    B    6      6-CC    H      CC  PAL   \n",
       "3   341-U1417B-9H-CC-PAL  341  U1417    B    9      9-CC    H      CC  PAL   \n",
       "4  341-U1417B-10H-CC-PAL  341  U1417    B   10     10-CC    H      CC  PAL   \n",
       "\n",
       "  Top [cm]  ... Neogloboquadrina pachyderma (sinistral)  \\\n",
       "0        0  ...                                 D [A84]   \n",
       "1        0  ...                                 D [A84]   \n",
       "2        0  ...                                 D [A84]   \n",
       "3        0  ...                                     NaN   \n",
       "4        0  ...                                 D [A84]   \n",
       "\n",
       "  Neogloboquadrina pachyderma (dextral)  \\\n",
       "0                               R [A84]   \n",
       "1                               R [A84]   \n",
       "2                                   NaN   \n",
       "3                                   NaN   \n",
       "4                               R [A84]   \n",
       "\n",
       "  Neogloboquadrina pachyderma A (dextral, inflated form)  \\\n",
       "0                                                NaN       \n",
       "1                                                NaN       \n",
       "2                                                NaN       \n",
       "3                                                NaN       \n",
       "4                                                NaN       \n",
       "\n",
       "  Neogloboquadrina pachyderma B (sinistral, inflated form)  \\\n",
       "0                                            R [A84]         \n",
       "1                                                NaN         \n",
       "2                                                NaN         \n",
       "3                                                NaN         \n",
       "4                                                NaN         \n",
       "\n",
       "  Neogloboquadrina pachyderma C (sinistral, compressed form)  \\\n",
       "0                                                NaN           \n",
       "1                                                NaN           \n",
       "2                                                NaN           \n",
       "3                                                NaN           \n",
       "4                                                NaN           \n",
       "\n",
       "  Orbulina universa                                           Comments  \\\n",
       "0               NaN                                                NaN   \n",
       "1               NaN                                              sandy   \n",
       "2               NaN                      diatom ooze. pyritilize foram   \n",
       "3               NaN  Rich in sandy terrigenous grains with presence...   \n",
       "4               NaN     Rich in planktic foraminifers, pebbles, pyrite   \n",
       "\n",
       "  Ship File Links Shore File Links File Data  \n",
       "0             NaN              NaN       NaN  \n",
       "1             NaN              NaN       NaN  \n",
       "2             NaN              NaN       NaN  \n",
       "3             NaN              NaN       NaN  \n",
       "4             NaN              NaN       NaN  \n",
       "\n",
       "[5 rows x 41 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = paths[7]\n",
    "\n",
    "df = pd.read_csv(path, dtype=str)\n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### After cleanup \n",
    "\n",
    "The number of rows changed from 20 to 17, and the number of columns changed from 41 to 26."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(17, 26)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Exp</th>\n",
       "      <th>Site</th>\n",
       "      <th>Hole</th>\n",
       "      <th>Core</th>\n",
       "      <th>Core-Sect</th>\n",
       "      <th>Type</th>\n",
       "      <th>Section</th>\n",
       "      <th>A/W</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>...</th>\n",
       "      <th>Preservation</th>\n",
       "      <th>Group abundance</th>\n",
       "      <th>Globigerina umbilicata</th>\n",
       "      <th>Globigerinita glutinata</th>\n",
       "      <th>Neogloboquadrina inglei</th>\n",
       "      <th>Neogloboquadrina kagaensis</th>\n",
       "      <th>Neogloboquadrina pachyderma (sinistral)</th>\n",
       "      <th>Neogloboquadrina pachyderma (dextral)</th>\n",
       "      <th>Neogloboquadrina pachyderma B (sinistral, inflated form)</th>\n",
       "      <th>Comments</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>341-U1417B-2H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>2</td>\n",
       "      <td>2-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>G [P15]</td>\n",
       "      <td>R [A83]</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>341-U1417B-4H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>4</td>\n",
       "      <td>4-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>G [P15]</td>\n",
       "      <td>C  [A61]</td>\n",
       "      <td>A [A84]</td>\n",
       "      <td>P [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>sandy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>341-U1417B-6H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>6</td>\n",
       "      <td>6-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>P [P15]</td>\n",
       "      <td>P [A83]</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>diatom ooze. pyritilize foram</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>341-U1417B-9H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>9</td>\n",
       "      <td>9-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>B [A83]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Rich in sandy terrigenous grains with presence...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>341-U1417B-10H-CC-PAL</td>\n",
       "      <td>341</td>\n",
       "      <td>U1417</td>\n",
       "      <td>B</td>\n",
       "      <td>10</td>\n",
       "      <td>10-CC</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>G [P15]</td>\n",
       "      <td>A [A83]</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>P [A84]</td>\n",
       "      <td>P [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>D [A84]</td>\n",
       "      <td>R [A84]</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Rich in planktic foraminifers, pebbles, pyrite</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 26 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  Sample  Exp   Site Hole Core Core-Sect Type Section  A/W  \\\n",
       "0   341-U1417B-2H-CC-PAL  341  U1417    B    2      2-CC    H      CC  PAL   \n",
       "1   341-U1417B-4H-CC-PAL  341  U1417    B    4      4-CC    H      CC  PAL   \n",
       "2   341-U1417B-6H-CC-PAL  341  U1417    B    6      6-CC    H      CC  PAL   \n",
       "3   341-U1417B-9H-CC-PAL  341  U1417    B    9      9-CC    H      CC  PAL   \n",
       "4  341-U1417B-10H-CC-PAL  341  U1417    B   10     10-CC    H      CC  PAL   \n",
       "\n",
       "  Top [cm]  ... Preservation Group abundance Globigerina umbilicata  \\\n",
       "0        0  ...      G [P15]         R [A83]                D [A84]   \n",
       "1        0  ...      G [P15]        C  [A61]                A [A84]   \n",
       "2        0  ...      P [P15]         P [A83]                D [A84]   \n",
       "3        0  ...          NaN         B [A83]                    NaN   \n",
       "4        0  ...      G [P15]         A [A83]                D [A84]   \n",
       "\n",
       "  Globigerinita glutinata Neogloboquadrina inglei Neogloboquadrina kagaensis  \\\n",
       "0                 R [A84]                     NaN                        NaN   \n",
       "1                 P [A84]                     NaN                        NaN   \n",
       "2                     NaN                     NaN                        NaN   \n",
       "3                     NaN                     NaN                        NaN   \n",
       "4                 P [A84]                 P [A84]                        NaN   \n",
       "\n",
       "  Neogloboquadrina pachyderma (sinistral)  \\\n",
       "0                                 D [A84]   \n",
       "1                                 D [A84]   \n",
       "2                                 D [A84]   \n",
       "3                                     NaN   \n",
       "4                                 D [A84]   \n",
       "\n",
       "  Neogloboquadrina pachyderma (dextral)  \\\n",
       "0                               R [A84]   \n",
       "1                               R [A84]   \n",
       "2                                   NaN   \n",
       "3                                   NaN   \n",
       "4                               R [A84]   \n",
       "\n",
       "  Neogloboquadrina pachyderma B (sinistral, inflated form)  \\\n",
       "0                                            R [A84]         \n",
       "1                                                NaN         \n",
       "2                                                NaN         \n",
       "3                                                NaN         \n",
       "4                                                NaN         \n",
       "\n",
       "                                            Comments  \n",
       "0                                                NaN  \n",
       "1                                              sandy  \n",
       "2                      diatom ooze. pyritilize foram  \n",
       "3  Rich in sandy terrigenous grains with presence...  \n",
       "4     Rich in planktic foraminifers, pebbles, pyrite  \n",
       "\n",
       "[5 rows x 26 columns]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(axis='columns', how='all', inplace=True)  \n",
    "df.dropna(axis='index', how='all', inplace=True)\n",
    "df.drop_duplicates(inplace=True)\n",
    "\n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean up all files and save the changes\n",
    "\n",
    "Do basic file cleanup on all the files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "for path in paths:\n",
    "    df = pd.read_csv(path, dtype=str)\n",
    "    \n",
    "    df.dropna(axis='columns', how='all', inplace=True)  \n",
    "    df.dropna(axis='index', how='all', inplace=True)\n",
    "    df.drop_duplicates(inplace=True)\n",
    "    \n",
    "    df.to_csv(path, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remove leading and trailing white spaces\n",
    "\n",
    "We created a custom function `remove_whitespace` to remove all leading and trailing white spaces from both the headers and row values. \n",
    "\n",
    "Since we wanted to remove white spaces from both the headers and row values, we used `read_csv(header=None)` and `to_csv(header=False)` so that pandas treat the first row like any other row."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Before cleanup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2, 16)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "      <th>15</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Label ID</td>\n",
       "      <td>Exp</td>\n",
       "      <td>Site</td>\n",
       "      <td>Hole</td>\n",
       "      <td>Core</td>\n",
       "      <td>Core-Sect</td>\n",
       "      <td>Type</td>\n",
       "      <td>Section</td>\n",
       "      <td>A/W</td>\n",
       "      <td>Extra Sample ID Data</td>\n",
       "      <td>Top [cm]</td>\n",
       "      <td>Bottom [cm]</td>\n",
       "      <td>Top Depth [m]</td>\n",
       "      <td>Bottom Depth [m]</td>\n",
       "      <td>Globigerina bulloides</td>\n",
       "      <td>Neogloboquadrina cf. pachyderma</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>318-U1355A-1R-1-PAL-FORAM</td>\n",
       "      <td>318</td>\n",
       "      <td>U1355</td>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "      <td>1-1</td>\n",
       "      <td>R</td>\n",
       "      <td>1</td>\n",
       "      <td>PAL</td>\n",
       "      <td>FORAM</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.01</td>\n",
       "      <td>R[318_PF]</td>\n",
       "      <td>R[318_PF]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          0    1      2     3     4          5     6   \\\n",
       "0                   Label ID  Exp   Site  Hole  Core  Core-Sect  Type   \n",
       "1  318-U1355A-1R-1-PAL-FORAM  318  U1355     A     1        1-1     R   \n",
       "\n",
       "        7    8                     9         10           11             12  \\\n",
       "0  Section  A/W  Extra Sample ID Data  Top [cm]  Bottom [cm]  Top Depth [m]   \n",
       "1        1  PAL                 FORAM         0            1              0   \n",
       "\n",
       "                 13                     14                               15  \n",
       "0  Bottom Depth [m]  Globigerina bulloides  Neogloboquadrina cf. pachyderma  \n",
       "1              0.01              R[318_PF]                        R[318_PF]  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[0], dtype=str, header=None)\n",
    "\n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### After cleanup "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2, 16)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>11</th>\n",
       "      <th>12</th>\n",
       "      <th>13</th>\n",
       "      <th>14</th>\n",
       "      <th>15</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Label ID</td>\n",
       "      <td>Exp</td>\n",
       "      <td>Site</td>\n",
       "      <td>Hole</td>\n",
       "      <td>Core</td>\n",
       "      <td>Core-Sect</td>\n",
       "      <td>Type</td>\n",
       "      <td>Section</td>\n",
       "      <td>A/W</td>\n",
       "      <td>Extra Sample ID Data</td>\n",
       "      <td>Top [cm]</td>\n",
       "      <td>Bottom [cm]</td>\n",
       "      <td>Top Depth [m]</td>\n",
       "      <td>Bottom Depth [m]</td>\n",
       "      <td>Globigerina bulloides</td>\n",
       "      <td>Neogloboquadrina cf. pachyderma</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>318-U1355A-1R-1-PAL-FORAM</td>\n",
       "      <td>318</td>\n",
       "      <td>U1355</td>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "      <td>1-1</td>\n",
       "      <td>R</td>\n",
       "      <td>1</td>\n",
       "      <td>PAL</td>\n",
       "      <td>FORAM</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.01</td>\n",
       "      <td>R[318_PF]</td>\n",
       "      <td>R[318_PF]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          0    1      2     3     4          5     6   \\\n",
       "0                   Label ID  Exp   Site  Hole  Core  Core-Sect  Type   \n",
       "1  318-U1355A-1R-1-PAL-FORAM  318  U1355     A     1        1-1     R   \n",
       "\n",
       "        7    8                     9         10           11             12  \\\n",
       "0  Section  A/W  Extra Sample ID Data  Top [cm]  Bottom [cm]  Top Depth [m]   \n",
       "1        1  PAL                 FORAM         0            1              0   \n",
       "\n",
       "                 13                     14                               15  \n",
       "0  Bottom Depth [m]  Globigerina bulloides  Neogloboquadrina cf. pachyderma  \n",
       "1              0.01              R[318_PF]                        R[318_PF]  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[0], dtype=str, header=None)\n",
    "\n",
    "remove_whitespace(df)\n",
    "\n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean up all files and save the changes\n",
    "\n",
    "Remove white space from all files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "for path in paths:\n",
    "    df = pd.read_csv(path, dtype=str, header=None)\n",
    "    \n",
    "    remove_whitespace(df)\n",
    "    \n",
    "    df.to_csv(path, index=False, header=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Normalizing columns names\n",
    "\n",
    "For the ocean core drilling expedition 312 and later, the researchers for each expedition determined the format of their data files. This resulted in a lot of variability in the file columns. We had to standardized the columns names and update data files. For instance, 'Bottom [cm]' and 'Bottom[cm] [cm]' in the raw files are 'Bottom [cm]' in the processed files.\n",
    "\n",
    "Another challenge with parsing the files is that we had to standardized the taxa names. Issues include misspelling, taxa names change over time, and inconsistent ways of treating rank modifiers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get all unique column names\n",
    "\n",
    "In order to normalize the header header names, we needed to get all the headers for all the files. \n",
    "\n",
    "Since we only need the header names, use `nrow=0` with `read_csv`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>Bottom [cm]</th>\n",
       "      <th>Top Depth [m]</th>\n",
       "      <th>Bottom Depth [m]</th>\n",
       "      <th>Zone name (short)</th>\n",
       "      <th>Zone name</th>\n",
       "      <th>Additional zone name (short)</th>\n",
       "      <th>Additional zone name</th>\n",
       "      <th>Preservation</th>\n",
       "      <th>...</th>\n",
       "      <th>Globoturborotalita decoraperta _T and B</th>\n",
       "      <th>Globoturborotalita rubescens</th>\n",
       "      <th>Sphaeroidinellopsis seminulina _T_ _PL4</th>\n",
       "      <th>Sphaeroidinellopsis kochi _T</th>\n",
       "      <th>Globigerinoidesella fistulosa _T and B_ _Pt1a</th>\n",
       "      <th>Globigerinoides ruber _pink_ T</th>\n",
       "      <th>Globigerinoides extremus _T and B</th>\n",
       "      <th>Globigerinoides obliquus _T</th>\n",
       "      <th>Globigerinella calida _B</th>\n",
       "      <th>Sample comment</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>0 rows × 64 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [Sample, Top [cm], Bottom [cm], Top Depth [m], Bottom Depth [m], Zone name (short), Zone name, Additional zone name (short), Additional zone name, Preservation, Group abundance, Globorotalia flexuosa _T and B_, Globigerinoides ruber _pink_ _T_, Globigerinella calida _B_, Globorotalia tosaensis _T and B_ _Pt1b_, Globorotalia hessi _B_, Globigerinoides obliquus _T_, Globoturborotalita apertura _T and B_, Globigerinoidesella fistulosa _T and B_ _Pt1a_, Globorotalia truncatulinoides _B_, Globigerinoides extremus _T and B_, Pulleniatina finalis _B_, Globorotalia pseudomiocenica _T_ _PL6_, Globorotalia limbata _T_, Globoturborotalita decoraperta _T and B_, Globorotalia multicamerata _T_, Dentoglobigerina altispira _T_ _PL5_, Sphaeroidinellopsis seminulina _T_ _PL4_, Pulleniatina primalis  _Tand B_, Globorotalia plesiotumida _T_, Globorotalia margaritae _T and B_ _PL3_, Sphaeroidinellopsis kochi _T_, Sphaeroidinella dehiscens sensu lato _B_, Globorotalia tumida _B_ _PL1a_, Globorotalia plesiotumida _B_ _M13b_, Globorotalia limbata _B_, Pulleniatina primalis  _Tand B, Dextral:Sinistral _P. primalis_, Pulleniatina praecursor, Dextral:Sinistral _P. praecursor_, Pulleniatina obliquiloculata, Dextral:Sinistral _P. obliquiloculata_, Pulleniatina finalis _B, Pulleniatina praespectabilis, Dentoglobigerina altispira _T_ _PL5, Globorotalia limbata _B, Globorotalia multicamerata _T, Globorotalia pseudomiocenica _T_ _PL6, Globorotalia plesiotumida _T, Globorotalia tumida, Globorotalia margaritae _T and B_ _PL3, Globorotalia tosaensis _T and B_ _Pt1b, Globorotalia truncatulinoides _B, Globoturborotalita apertura _T and B, Globoturborotalita decoraperta _T and B, Globoturborotalita rubescens, Sphaeroidinellopsis seminulina _T_ _PL4, Sphaeroidinellopsis kochi _T, Globigerinoidesella fistulosa _T and B_ _Pt1a, Globigerinoides ruber _pink_ T, Globigerinoides extremus _T and B, Globigerinoides obliquus _T, Globigerinella calida _B, Sample comment]\n",
       "Index: []\n",
       "\n",
       "[0 rows x 64 columns]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.read_csv(paths[1], dtype=str, nrows=0)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We used `pandas.DataFrame.columns()` and python `set` to get all the unique columns for all the files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "158"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_columns = set()\n",
    "for path in paths:\n",
    "    df = pd.read_csv(path, dtype=str, nrows=0)\n",
    "    \n",
    "    all_columns.update(df.columns)\n",
    "    \n",
    "len(all_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'A/W',\n",
       " 'Additional zone name',\n",
       " 'Additional zone name (short)',\n",
       " 'Bottom Depth [m]',\n",
       " 'Bottom Depth[m] [m]',\n",
       " 'Bottom [cm]',\n",
       " 'Bottom[cm] [cm]',\n",
       " 'Candeina nitida',\n",
       " 'Comment (general)',\n",
       " 'Comments',\n",
       " 'Core',\n",
       " 'Core-Sect',\n",
       " 'Datum name',\n",
       " 'Dentoglobigerina altispira _T_ _PL5',\n",
       " 'Dentoglobigerina altispira _T_ _PL5_',\n",
       " 'Dextral:Sinistral _P. obliquiloculata_',\n",
       " 'Dextral:Sinistral _P. praecursor_',\n",
       " 'Dextral:Sinistral _P. primalis_',\n",
       " 'Exp',\n",
       " 'Extra Sample ID Data',\n",
       " 'Fragmentation',\n",
       " 'Fragmentation rank [auto-pop]',\n",
       " 'Genus/species (upper zone)',\n",
       " 'Genus/species lower zone)',\n",
       " 'Globigerina bulloides',\n",
       " 'Globigerina cf. woodi',\n",
       " 'Globigerina falconensis',\n",
       " 'Globigerina umbilicata',\n",
       " 'Globigerinella aequilateralis',\n",
       " 'Globigerinella calida',\n",
       " 'Globigerinella calida _B',\n",
       " 'Globigerinella calida _B_',\n",
       " 'Globigerinella obesa',\n",
       " 'Globigerinita glutinata',\n",
       " 'Globigerinita parkerae',\n",
       " 'Globigerinita uvula',\n",
       " 'Globigerinoides bulloideus',\n",
       " 'Globigerinoides conglobatus',\n",
       " 'Globigerinoides extremus _T and B',\n",
       " 'Globigerinoides extremus _T and B_',\n",
       " 'Globigerinoides fistulosus',\n",
       " 'Globigerinoides obliquus _T',\n",
       " 'Globigerinoides obliquus _T_',\n",
       " 'Globigerinoides quadrilobatus',\n",
       " 'Globigerinoides ruber',\n",
       " 'Globigerinoides ruber (pink)',\n",
       " 'Globigerinoides ruber (white)',\n",
       " 'Globigerinoides ruber _pink_ T',\n",
       " 'Globigerinoides ruber _pink_ _T_',\n",
       " 'Globigerinoides sacculifer',\n",
       " 'Globigerinoides sacculifer (without sack)',\n",
       " 'Globigerinoides tenellus',\n",
       " 'Globigerinoides trilobus',\n",
       " 'Globigerinoidesella fistulosa _T and B_ _Pt1a',\n",
       " 'Globigerinoidesella fistulosa _T and B_ _Pt1a_',\n",
       " 'Globoconella miozea',\n",
       " 'Globorotalia (Globoconella) inflata',\n",
       " 'Globorotalia (Globorotalia) tumida tumida',\n",
       " 'Globorotalia (Hirsutella) hirsuta',\n",
       " 'Globorotalia (Hirsutella) scitula',\n",
       " 'Globorotalia (Truncorotalia) crossaformis',\n",
       " 'Globorotalia (Truncorotalia) truncatulinoides',\n",
       " 'Globorotalia anfracta',\n",
       " 'Globorotalia crassaformis',\n",
       " 'Globorotalia crassaformis sensu lato',\n",
       " 'Globorotalia flexuosa',\n",
       " 'Globorotalia flexuosa _T and B_',\n",
       " 'Globorotalia hessi',\n",
       " 'Globorotalia hessi _B_',\n",
       " 'Globorotalia hirsuta',\n",
       " 'Globorotalia inflata',\n",
       " 'Globorotalia limbata _B',\n",
       " 'Globorotalia limbata _B_',\n",
       " 'Globorotalia limbata _T_',\n",
       " 'Globorotalia margaritae _T and B_ _PL3',\n",
       " 'Globorotalia margaritae _T and B_ _PL3_',\n",
       " 'Globorotalia menardii',\n",
       " 'Globorotalia multicamerata _T',\n",
       " 'Globorotalia multicamerata _T_',\n",
       " 'Globorotalia plesiotumida _B_ _M13b_',\n",
       " 'Globorotalia plesiotumida _T',\n",
       " 'Globorotalia plesiotumida _T_',\n",
       " 'Globorotalia pseudomiocenica _T_ _PL6',\n",
       " 'Globorotalia pseudomiocenica _T_ _PL6_',\n",
       " 'Globorotalia scitula',\n",
       " 'Globorotalia tosaensis',\n",
       " 'Globorotalia tosaensis _T and B_ _Pt1b',\n",
       " 'Globorotalia tosaensis _T and B_ _Pt1b_',\n",
       " 'Globorotalia truncatulinoides',\n",
       " 'Globorotalia truncatulinoides _B',\n",
       " 'Globorotalia truncatulinoides _B_',\n",
       " 'Globorotalia tumida',\n",
       " 'Globorotalia tumida _B_ _PL1a_',\n",
       " 'Globoturborotalita apertura _T and B',\n",
       " 'Globoturborotalita apertura _T and B_',\n",
       " 'Globoturborotalita decoraperta _T and B',\n",
       " 'Globoturborotalita decoraperta _T and B_',\n",
       " 'Globoturborotalita rubescens',\n",
       " 'Group Abundance',\n",
       " 'Group abundance',\n",
       " 'Group abundance (%)',\n",
       " 'Hole',\n",
       " 'Label ID',\n",
       " 'Neogloboquadrina acostaensis',\n",
       " 'Neogloboquadrina acostaensis (dextral)',\n",
       " 'Neogloboquadrina cf. pachyderma',\n",
       " 'Neogloboquadrina dutertrei',\n",
       " 'Neogloboquadrina humerosa',\n",
       " 'Neogloboquadrina incompta (dextral)',\n",
       " 'Neogloboquadrina inglei',\n",
       " 'Neogloboquadrina kagaensis',\n",
       " 'Neogloboquadrina nympha',\n",
       " 'Neogloboquadrina pachyderma (dextral)',\n",
       " 'Neogloboquadrina pachyderma (sin)',\n",
       " 'Neogloboquadrina pachyderma (sinistral)',\n",
       " 'Neogloboquadrina pachyderma B (sinistral, inflated form)',\n",
       " 'Neogloboquadrina pachyderma(dex)',\n",
       " 'Orbulina universa',\n",
       " 'PF Preservation',\n",
       " 'Preservation',\n",
       " 'Pulleniatina coiling (dextral)',\n",
       " 'Pulleniatina coiling (sinistral)',\n",
       " 'Pulleniatina finalis',\n",
       " 'Pulleniatina finalis _B',\n",
       " 'Pulleniatina finalis _B_',\n",
       " 'Pulleniatina obliquiloculata',\n",
       " 'Pulleniatina obliquiloculata (D)',\n",
       " 'Pulleniatina praecursor',\n",
       " 'Pulleniatina praespectabilis',\n",
       " 'Pulleniatina primalis  _Tand B',\n",
       " 'Pulleniatina primalis  _Tand B_',\n",
       " 'Sample',\n",
       " 'Sample comment',\n",
       " 'Section',\n",
       " 'Site',\n",
       " 'Sphaeroidinella dahiscens sensu lato',\n",
       " 'Sphaeroidinella dehiscens',\n",
       " 'Sphaeroidinella dehiscens s.l.',\n",
       " 'Sphaeroidinella dehiscens sensu lato _B_',\n",
       " 'Sphaeroidinellopsis kochi _T',\n",
       " 'Sphaeroidinellopsis kochi _T_',\n",
       " 'Sphaeroidinellopsis seminulina _T_ _PL4',\n",
       " 'Sphaeroidinellopsis seminulina _T_ _PL4_',\n",
       " 'Top Depth [m]',\n",
       " 'Top Depth[m] [m]',\n",
       " 'Top [cm]',\n",
       " 'Top[cm] [cm]',\n",
       " 'Type',\n",
       " 'Type (lower zone)',\n",
       " 'Upper boundary age av. [Ma]',\n",
       " 'Zone author (year)',\n",
       " 'Zone group',\n",
       " 'Zone name',\n",
       " 'Zone name (short)',\n",
       " 'pc_abundance_name_mode',\n",
       " 'pc_fossil_group',\n",
       " 'pc_fossil_name',\n",
       " 'physical_constituent_name'}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "all_columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then manually separate taxa names from other headers so that we could do some more processing on the taxa."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "taxa_columns = {\n",
    " 'Candeina nitida',\n",
    " 'Dentoglobigerina altispira _T_ _PL5',\n",
    " 'Dentoglobigerina altispira _T_ _PL5_',\n",
    " 'Dextral:Sinistral _P. obliquiloculata_',\n",
    " 'Dextral:Sinistral _P. praecursor_',\n",
    " 'Dextral:Sinistral _P. primalis_',\n",
    " 'Fragmentation rank [auto-pop]',\n",
    " 'Genus/species (upper zone)',\n",
    " 'Genus/species lower zone)',\n",
    " 'Globigerina bulloides',\n",
    " 'Globigerina cf. woodi',\n",
    " 'Globigerina falconensis',\n",
    " 'Globigerina umbilicata',\n",
    " 'Globigerinella aequilateralis',\n",
    " 'Globigerinella calida',\n",
    " 'Globigerinella calida _B',\n",
    " 'Globigerinella calida _B_',\n",
    " 'Globigerinella obesa',\n",
    " 'Globigerinita glutinata',\n",
    " 'Globigerinita parkerae',\n",
    " 'Globigerinita uvula',\n",
    " 'Globigerinoides bulloideus',\n",
    " 'Globigerinoides conglobatus',\n",
    " 'Globigerinoides extremus _T and B',\n",
    " 'Globigerinoides extremus _T and B_',\n",
    " 'Globigerinoides fistulosus',\n",
    " 'Globigerinoides obliquus _T',\n",
    " 'Globigerinoides obliquus _T_',\n",
    " 'Globigerinoides quadrilobatus',\n",
    " 'Globigerinoides ruber',\n",
    " 'Globigerinoides ruber (pink)',\n",
    " 'Globigerinoides ruber (white)',\n",
    " 'Globigerinoides ruber _pink_ T',\n",
    " 'Globigerinoides ruber _pink_ _T_',\n",
    " 'Globigerinoides sacculifer',\n",
    " 'Globigerinoides sacculifer (without sack)',\n",
    " 'Globigerinoides tenellus',\n",
    " 'Globigerinoides trilobus',\n",
    " 'Globigerinoidesella fistulosa _T and B_ _Pt1a',\n",
    " 'Globigerinoidesella fistulosa _T and B_ _Pt1a_',\n",
    " 'Globoconella miozea',\n",
    " 'Globorotalia (Globoconella) inflata',\n",
    " 'Globorotalia (Globorotalia) tumida tumida',\n",
    " 'Globorotalia (Hirsutella) hirsuta',\n",
    " 'Globorotalia (Hirsutella) scitula',\n",
    " 'Globorotalia (Truncorotalia) crossaformis',\n",
    " 'Globorotalia (Truncorotalia) truncatulinoides',\n",
    " 'Globorotalia anfracta',\n",
    " 'Globorotalia crassaformis',\n",
    " 'Globorotalia crassaformis sensu lato',\n",
    " 'Globorotalia flexuosa',\n",
    " 'Globorotalia flexuosa _T and B_',\n",
    " 'Globorotalia hessi',\n",
    " 'Globorotalia hessi _B_',\n",
    " 'Globorotalia hirsuta',\n",
    " 'Globorotalia inflata',\n",
    " 'Globorotalia limbata _B',\n",
    " 'Globorotalia limbata _B_',\n",
    " 'Globorotalia limbata _T_',\n",
    " 'Globorotalia margaritae _T and B_ _PL3',\n",
    " 'Globorotalia margaritae _T and B_ _PL3_',\n",
    " 'Globorotalia menardii',\n",
    " 'Globorotalia multicamerata _T',\n",
    " 'Globorotalia multicamerata _T_',\n",
    " 'Globorotalia plesiotumida _B_ _M13b_',\n",
    " 'Globorotalia plesiotumida _T',\n",
    " 'Globorotalia plesiotumida _T_',\n",
    " 'Globorotalia pseudomiocenica _T_ _PL6',\n",
    " 'Globorotalia pseudomiocenica _T_ _PL6_',\n",
    " 'Globorotalia scitula',\n",
    " 'Globorotalia tosaensis',\n",
    " 'Globorotalia tosaensis _T and B_ _Pt1b',\n",
    " 'Globorotalia tosaensis _T and B_ _Pt1b_',\n",
    " 'Globorotalia truncatulinoides',\n",
    " 'Globorotalia truncatulinoides _B',\n",
    " 'Globorotalia truncatulinoides _B_',\n",
    " 'Globorotalia tumida',\n",
    " 'Globorotalia tumida _B_ _PL1a_',\n",
    " 'Globoturborotalita apertura _T and B',\n",
    " 'Globoturborotalita apertura _T and B_',\n",
    " 'Globoturborotalita decoraperta _T and B',\n",
    " 'Globoturborotalita decoraperta _T and B_',\n",
    " 'Globoturborotalita rubescens',\n",
    " 'Neogloboquadrina acostaensis',\n",
    " 'Neogloboquadrina acostaensis (dextral)',\n",
    " 'Neogloboquadrina cf. pachyderma',\n",
    " 'Neogloboquadrina dutertrei',\n",
    " 'Neogloboquadrina humerosa',\n",
    " 'Neogloboquadrina incompta (dextral)',\n",
    " 'Neogloboquadrina inglei',\n",
    " 'Neogloboquadrina kagaensis',\n",
    " 'Neogloboquadrina nympha',\n",
    " 'Neogloboquadrina pachyderma (dextral)',\n",
    " 'Neogloboquadrina pachyderma (sin)',\n",
    " 'Neogloboquadrina pachyderma (sinistral)',\n",
    " 'Neogloboquadrina pachyderma B (sinistral, inflated form)',\n",
    " 'Neogloboquadrina pachyderma(dex)',\n",
    " 'Orbulina universa',\n",
    " 'Pulleniatina coiling (dextral)',\n",
    " 'Pulleniatina coiling (sinistral)',\n",
    " 'Pulleniatina finalis',\n",
    " 'Pulleniatina finalis _B',\n",
    " 'Pulleniatina finalis _B_',\n",
    " 'Pulleniatina obliquiloculata',\n",
    " 'Pulleniatina obliquiloculata (D)',\n",
    " 'Pulleniatina praecursor',\n",
    " 'Pulleniatina praespectabilis',\n",
    " 'Pulleniatina primalis  _Tand B',\n",
    " 'Pulleniatina primalis  _Tand B_',\n",
    " 'Sphaeroidinella dahiscens sensu lato',\n",
    " 'Sphaeroidinella dehiscens',\n",
    " 'Sphaeroidinella dehiscens s.l.',\n",
    " 'Sphaeroidinella dehiscens sensu lato _B_',\n",
    " 'Sphaeroidinellopsis kochi _T',\n",
    " 'Sphaeroidinellopsis kochi _T_',\n",
    " 'Sphaeroidinellopsis seminulina _T_ _PL4',\n",
    " 'Sphaeroidinellopsis seminulina _T_ _PL4_',\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "117"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(taxa_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since both `all_columns` and `taxa_columns` are sets, we can subtract them to get the nontaxa headers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'A/W',\n",
       " 'Additional zone name',\n",
       " 'Additional zone name (short)',\n",
       " 'Bottom Depth [m]',\n",
       " 'Bottom Depth[m] [m]',\n",
       " 'Bottom [cm]',\n",
       " 'Bottom[cm] [cm]',\n",
       " 'Comment (general)',\n",
       " 'Comments',\n",
       " 'Core',\n",
       " 'Core-Sect',\n",
       " 'Datum name',\n",
       " 'Exp',\n",
       " 'Extra Sample ID Data',\n",
       " 'Fragmentation',\n",
       " 'Group Abundance',\n",
       " 'Group abundance',\n",
       " 'Group abundance (%)',\n",
       " 'Hole',\n",
       " 'Label ID',\n",
       " 'PF Preservation',\n",
       " 'Preservation',\n",
       " 'Sample',\n",
       " 'Sample comment',\n",
       " 'Section',\n",
       " 'Site',\n",
       " 'Top Depth [m]',\n",
       " 'Top Depth[m] [m]',\n",
       " 'Top [cm]',\n",
       " 'Top[cm] [cm]',\n",
       " 'Type',\n",
       " 'Type (lower zone)',\n",
       " 'Upper boundary age av. [Ma]',\n",
       " 'Zone author (year)',\n",
       " 'Zone group',\n",
       " 'Zone name',\n",
       " 'Zone name (short)',\n",
       " 'pc_abundance_name_mode',\n",
       " 'pc_fossil_group',\n",
       " 'pc_fossil_name',\n",
       " 'physical_constituent_name'}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nontaxa_columns = all_columns - taxa_columns\n",
    "\n",
    "nontaxa_columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "41"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(nontaxa_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create taxa and non-taxa files\n",
    "\n",
    "We saved the the taxa and nontaxa headers to csv so that we can further process them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(117, 1)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>verbatim_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>62</th>\n",
       "      <td>Candeina nitida</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>53</th>\n",
       "      <td>Dentoglobigerina altispira _T_ _PL5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Dentoglobigerina altispira _T_ _PL5_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Dextral:Sinistral _P. obliquiloculata_</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Dextral:Sinistral _P. praecursor_</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                             verbatim_name\n",
       "62                         Candeina nitida\n",
       "53     Dentoglobigerina altispira _T_ _PL5\n",
       "21    Dentoglobigerina altispira _T_ _PL5_\n",
       "23  Dextral:Sinistral _P. obliquiloculata_\n",
       "17       Dextral:Sinistral _P. praecursor_"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "taxa_df = pd.DataFrame(taxa_columns, columns=['verbatim_name'])\n",
    "taxa_df.sort_values('verbatim_name', inplace=True)\n",
    "taxa_df.to_csv(taxa_list_path, index=False)\n",
    "\n",
    "print_df(taxa_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(41, 1)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>field</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>A/W</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Additional zone name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Additional zone name (short)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Bottom Depth [m]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Bottom Depth[m] [m]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                           field\n",
       "40                           A/W\n",
       "18          Additional zone name\n",
       "15  Additional zone name (short)\n",
       "29              Bottom Depth [m]\n",
       "33           Bottom Depth[m] [m]"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "non_taxa_df = pd.DataFrame(nontaxa_columns, columns=['field'])\n",
    "non_taxa_df.sort_values('field', inplace=True)\n",
    "non_taxa_df.to_csv(nontaxa_list_path, index=False)\n",
    "\n",
    "print_df(non_taxa_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalize headers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The software developer first uploads the taxa and nontaxa CSVs to Google Sheets, then project PIs and their students work together to manually normalize the original values. The PIs are currently writing a paper with details about the normalization process [5]. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(117, 2)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>verbatim_name</th>\n",
       "      <th>normalized_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Candeina nitida</td>\n",
       "      <td>Candeina nitida</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Dentoglobigerina altispira _T_ _PL5</td>\n",
       "      <td>Dentoglobigerina altispira</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Dentoglobigerina altispira _T_ _PL5_</td>\n",
       "      <td>Dentoglobigerina altispira</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Dextral:Sinistral _P. obliquiloculata_</td>\n",
       "      <td>Pulleniatina obliquiloculata (dextral)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Dextral:Sinistral _P. obliquiloculata_</td>\n",
       "      <td>Pulleniatina obliquiloculata (sinistral)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            verbatim_name  \\\n",
       "0                         Candeina nitida   \n",
       "1     Dentoglobigerina altispira _T_ _PL5   \n",
       "2    Dentoglobigerina altispira _T_ _PL5_   \n",
       "3  Dextral:Sinistral _P. obliquiloculata_   \n",
       "4  Dextral:Sinistral _P. obliquiloculata_   \n",
       "\n",
       "                            normalized_name  \n",
       "0                           Candeina nitida  \n",
       "1                Dentoglobigerina altispira  \n",
       "2                Dentoglobigerina altispira  \n",
       "3    Pulleniatina obliquiloculata (dextral)  \n",
       "4  Pulleniatina obliquiloculata (sinistral)  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "taxa_df = pd.read_csv(normalized_taxa_path, dtype=str)\n",
    "print_df(taxa_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(44, 2)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>field</th>\n",
       "      <th>normalized_field</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>A/W</td>\n",
       "      <td>A/W</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Additional zone name</td>\n",
       "      <td>Additional zone name</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Additional zone name (short)</td>\n",
       "      <td>Additional zone name (short)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Bottom Depth [m]</td>\n",
       "      <td>Bottom Depth [m]</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bottom Depth[m] [m]</td>\n",
       "      <td>Bottom Depth [m]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                          field              normalized_field\n",
       "0                           A/W                           A/W\n",
       "1          Additional zone name          Additional zone name\n",
       "2  Additional zone name (short)  Additional zone name (short)\n",
       "3              Bottom Depth [m]              Bottom Depth [m]\n",
       "4           Bottom Depth[m] [m]              Bottom Depth [m]"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nontaxa_df = pd.read_csv(normalized_nontaxa_path, dtype=str)\n",
    "print_df(nontaxa_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After the project PIs manually normalized the columns, we need to update the data files with the normalized columns. We create a dictionary that lists the original field name and normalized field name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Candeina nitida': 'Candeina nitida',\n",
       " 'Dentoglobigerina altispira _T_ _PL5': 'Dentoglobigerina altispira',\n",
       " 'Dentoglobigerina altispira _T_ _PL5_': 'Dentoglobigerina altispira',\n",
       " 'Dextral:Sinistral _P. obliquiloculata_': 'Pulleniatina obliquiloculata (sinistral)',\n",
       " 'Dextral:Sinistral _P. praecursor_': 'Pulleniatina praecursor (sinistral)',\n",
       " 'Dextral:Sinistral _P. primalis_': 'Pulleniatina primalis (sinistral)',\n",
       " 'Globigerina bulloides': 'Globigerina bulloides',\n",
       " 'Globigerina cf. woodi': 'Globigerina cf. woodi',\n",
       " 'Globigerina falconensis': 'Globigerina falconensis',\n",
       " 'Globigerina umbilicata': 'Globigerina umbilicata',\n",
       " 'Globigerinella aequilateralis': 'Globigerinella aequilateralis',\n",
       " 'Globigerinella calida': 'Globigerinella calida',\n",
       " 'Globigerinella calida _B': 'Globigerinella calida',\n",
       " 'Globigerinella calida _B_': 'Globigerinella calida',\n",
       " 'Globigerinella obesa': 'Globigerinella obesa',\n",
       " 'Globigerinita glutinata': 'Globigerinita glutinata',\n",
       " 'Globigerinita parkerae': 'Globigerinita parkerae',\n",
       " 'Globigerinita uvula': 'Globigerinita uvula',\n",
       " 'Globigerinoides bulloideus': 'Globigerinoides bulloideus',\n",
       " 'Globigerinoides conglobatus': 'Globigerinoides conglobatus',\n",
       " 'Globigerinoides extremus _T and B': 'Globigerinoides extremus',\n",
       " 'Globigerinoides extremus _T and B_': 'Globigerinoides extremus',\n",
       " 'Globigerinoides fistulosus': 'Globigerinoides fistulosus',\n",
       " 'Globigerinoides obliquus _T': 'Globigerinoides obliquus',\n",
       " 'Globigerinoides obliquus _T_': 'Globigerinoides obliquus',\n",
       " 'Globigerinoides quadrilobatus': 'Globigerinoides quadrilobatus',\n",
       " 'Globigerinoides ruber': 'Globigerinoides ruber',\n",
       " 'Globigerinoides ruber (pink)': 'Globigerinoides ruber (pink)',\n",
       " 'Globigerinoides ruber (white)': 'Globigerinoides ruber (white)',\n",
       " 'Globigerinoides ruber _pink_ T': 'Globigerinoides ruber (pink)',\n",
       " 'Globigerinoides ruber _pink_ _T_': 'Globigerinoides ruber (pink)',\n",
       " 'Globigerinoides sacculifer': 'Globigerinoides sacculifer',\n",
       " 'Globigerinoides sacculifer (without sack)': 'Globigerinoides sacculifer (without sac)',\n",
       " 'Globigerinoides tenellus': 'Globigerinoides tenellus',\n",
       " 'Globigerinoides trilobus': 'Globigerinoides trilobus',\n",
       " 'Globigerinoidesella fistulosa _T and B_ _Pt1a': 'Globigerinoidesella fistulosa',\n",
       " 'Globigerinoidesella fistulosa _T and B_ _Pt1a_': 'Globigerinoidesella fistulosa',\n",
       " 'Globoconella miozea': 'Globoconella miozea',\n",
       " 'Globorotalia (Globoconella) inflata': 'Globorotalia Globoconella inflata',\n",
       " 'Globorotalia (Globorotalia) tumida tumida': 'Globorotalia Globoconella tumida tumida',\n",
       " 'Globorotalia (Hirsutella) hirsuta': 'Globorotalia Hirsutella hirsuta',\n",
       " 'Globorotalia (Hirsutella) scitula': 'Globorotalia Hirsutella scitula',\n",
       " 'Globorotalia (Truncorotalia) crossaformis': 'Globorotalia Truncorotalia crossaformis',\n",
       " 'Globorotalia (Truncorotalia) truncatulinoides': 'Globorotalia Truncorotalia truncatulinoides',\n",
       " 'Globorotalia anfracta': 'Globorotalia anfracta',\n",
       " 'Globorotalia crassaformis': 'Globorotalia crassaformis',\n",
       " 'Globorotalia crassaformis sensu lato': 'Globorotalia s.l. crassaformis',\n",
       " 'Globorotalia flexuosa': 'Globorotalia flexuosa',\n",
       " 'Globorotalia flexuosa _T and B_': 'Globorotalia flexuosa',\n",
       " 'Globorotalia hessi': 'Globorotalia hessi',\n",
       " 'Globorotalia hessi _B_': 'Globorotalia hessi',\n",
       " 'Globorotalia hirsuta': 'Globorotalia hirsuta',\n",
       " 'Globorotalia inflata': 'Globorotalia inflata',\n",
       " 'Globorotalia limbata _B': 'Globorotalia limbata',\n",
       " 'Globorotalia limbata _B_': 'Globorotalia limbata',\n",
       " 'Globorotalia limbata _T_': 'Globorotalia limbata',\n",
       " 'Globorotalia margaritae _T and B_ _PL3': 'Globorotalia margaritae',\n",
       " 'Globorotalia margaritae _T and B_ _PL3_': 'Globorotalia margaritae',\n",
       " 'Globorotalia menardii': 'Globorotalia menardii',\n",
       " 'Globorotalia multicamerata _T': 'Globorotalia multicamerata',\n",
       " 'Globorotalia multicamerata _T_': 'Globorotalia multicamerata',\n",
       " 'Globorotalia plesiotumida _B_ _M13b_': 'Globorotalia plesiotumida',\n",
       " 'Globorotalia plesiotumida _T': 'Globorotalia plesiotumida',\n",
       " 'Globorotalia plesiotumida _T_': 'Globorotalia plesiotumida',\n",
       " 'Globorotalia pseudomiocenica _T_ _PL6': 'Globorotalia pseudomiocenica',\n",
       " 'Globorotalia pseudomiocenica _T_ _PL6_': 'Globorotalia pseudomiocenica',\n",
       " 'Globorotalia scitula': 'Globorotalia scitula',\n",
       " 'Globorotalia tosaensis': 'Globorotalia tosaensis',\n",
       " 'Globorotalia tosaensis _T and B_ _Pt1b': 'Globorotalia tosaensis',\n",
       " 'Globorotalia tosaensis _T and B_ _Pt1b_': 'Globorotalia tosaensis',\n",
       " 'Globorotalia truncatulinoides': 'Globorotalia truncatulinoides',\n",
       " 'Globorotalia truncatulinoides _B': 'Globorotalia truncatulinoides',\n",
       " 'Globorotalia truncatulinoides _B_': 'Globorotalia truncatulinoides',\n",
       " 'Globorotalia tumida': 'Globorotalia tumida',\n",
       " 'Globorotalia tumida _B_ _PL1a_': 'Globorotalia tumida',\n",
       " 'Globoturborotalita apertura _T and B': 'Globoturborotalita apertura',\n",
       " 'Globoturborotalita apertura _T and B_': 'Globoturborotalita apertura',\n",
       " 'Globoturborotalita decoraperta _T and B': 'Globoturborotalita decoraperta',\n",
       " 'Globoturborotalita decoraperta _T and B_': 'Globoturborotalita decoraperta',\n",
       " 'Globoturborotalita rubescens': 'Globoturborotalita rubescens',\n",
       " 'Neogloboquadrina acostaensis': 'Neogloboquadrina acostaensis',\n",
       " 'Neogloboquadrina acostaensis (dextral)': 'Neogloboquadrina acostaensis (dextral)',\n",
       " 'Neogloboquadrina cf. pachyderma': 'Neogloboquadrina cf. pachyderma',\n",
       " 'Neogloboquadrina dutertrei': 'Neogloboquadrina dutertrei',\n",
       " 'Neogloboquadrina humerosa': 'Neogloboquadrina humerosa',\n",
       " 'Neogloboquadrina incompta (dextral)': 'Neogloboquadrina incompta (dextral)',\n",
       " 'Neogloboquadrina inglei': 'Neogloboquadrina inglei',\n",
       " 'Neogloboquadrina kagaensis': 'Neogloboquadrina kagaensis',\n",
       " 'Neogloboquadrina nympha': 'Neogloboquadrina nympha',\n",
       " 'Neogloboquadrina pachyderma (dextral)': 'Neogloboquadrina pachyderma (dextral)',\n",
       " 'Neogloboquadrina pachyderma (sin)': 'Neogloboquadrina pachyderma (sinistral)',\n",
       " 'Neogloboquadrina pachyderma (sinistral)': 'Neogloboquadrina pachyderma (sinistral)',\n",
       " 'Neogloboquadrina pachyderma B (sinistral, inflated form)': 'Neogloboquadrina pachyderma B (sinistral, inflated form)',\n",
       " 'Neogloboquadrina pachyderma(dex)': 'Neogloboquadrina pachyderma (dextral)',\n",
       " 'Orbulina universa': 'Orbulina universa',\n",
       " 'Pulleniatina coiling (dextral)': 'Pulleniatina (dextral)',\n",
       " 'Pulleniatina coiling (sinistral)': 'Pulleniatina (sinistral)',\n",
       " 'Pulleniatina finalis': 'Pulleniatina finalis',\n",
       " 'Pulleniatina finalis _B': 'Pulleniatina finalis',\n",
       " 'Pulleniatina finalis _B_': 'Pulleniatina finalis',\n",
       " 'Pulleniatina obliquiloculata': 'Pulleniatina obliquiloculata',\n",
       " 'Pulleniatina obliquiloculata (D)': 'Pulleniatina obliquiloculata (dextral)',\n",
       " 'Pulleniatina praecursor': 'Pulleniatina praecursor',\n",
       " 'Pulleniatina praespectabilis': 'Pulleniatina praespectabilis',\n",
       " 'Pulleniatina primalis  _Tand B': 'Pulleniatina primalis',\n",
       " 'Pulleniatina primalis  _Tand B_': 'Pulleniatina primalis',\n",
       " 'Sphaeroidinella dahiscens sensu lato': 'Sphaeroidinella s.l. dahiscens',\n",
       " 'Sphaeroidinella dehiscens': 'Sphaeroidinella dehiscens',\n",
       " 'Sphaeroidinella dehiscens s.l.': 'Sphaeroidinella s.l. dehiscens',\n",
       " 'Sphaeroidinella dehiscens sensu lato _B_': 'Sphaeroidinella s.l. dehiscens',\n",
       " 'Sphaeroidinellopsis kochi _T': 'Sphaeroidinellopsis kochi',\n",
       " 'Sphaeroidinellopsis kochi _T_': 'Sphaeroidinellopsis kochi',\n",
       " 'Sphaeroidinellopsis seminulina _T_ _PL4': 'Sphaeroidinellopsis seminulina',\n",
       " 'Sphaeroidinellopsis seminulina _T_ _PL4_': 'Sphaeroidinellopsis seminulina'}"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "taxa_mapping = taxa_df.set_index('verbatim_name').to_dict()['normalized_name']\n",
    "taxa_mapping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'A/W': 'A/W',\n",
       " 'Additional zone name': 'Additional zone name',\n",
       " 'Additional zone name (short)': 'Additional zone name (short)',\n",
       " 'Bottom Depth [m]': 'Bottom Depth [m]',\n",
       " 'Bottom Depth[m] [m]': 'Bottom Depth [m]',\n",
       " 'Bottom [cm]': 'Bottom [cm]',\n",
       " 'Bottom[cm] [cm]': 'Bottom [cm]',\n",
       " 'Comment (general)': 'Comments',\n",
       " 'Comments': 'Comments',\n",
       " 'Core': 'Core',\n",
       " 'Core-Sect': 'Core-Sect',\n",
       " 'Datum name': 'Datum name',\n",
       " 'Exp': 'Exp',\n",
       " 'Extra Sample ID Data': 'Extra Sample ID Data',\n",
       " 'Fragmentation': 'Fragmentation',\n",
       " 'Fragmentation rank [auto-pop]': 'Fragmentation rank [auto-pop]',\n",
       " 'Genus/species (upper zone)': 'Genus/species (upper zone)',\n",
       " 'Genus/species lower zone)': 'Genus/species (lower zone)',\n",
       " 'Group Abundance': 'Group Abundance',\n",
       " 'Group abundance': 'Group Abundance',\n",
       " 'Group abundance (%)': 'Group Abundance',\n",
       " 'Hole': 'Hole',\n",
       " 'Label ID': 'Sample',\n",
       " 'PF Preservation': 'PF Preservation',\n",
       " 'Preservation': 'Preservation',\n",
       " 'Sample': 'Sample',\n",
       " 'Sample comment': 'Comments',\n",
       " 'Section': 'Section',\n",
       " 'Site': 'Site',\n",
       " 'Top Depth [m]': 'Top Depth [m]',\n",
       " 'Top Depth[m] [m]': 'Top Depth [m]',\n",
       " 'Top [cm]': 'Top [cm]',\n",
       " 'Top[cm] [cm]': 'Top [cm]',\n",
       " 'Type': 'Type',\n",
       " 'Type (lower zone)': 'Type (lower zone)',\n",
       " 'Upper boundary age av. [Ma]': 'Upper boundary age av. [Ma]',\n",
       " 'Zone author (year)': 'Zone author (year)',\n",
       " 'Zone group': 'Zone group',\n",
       " 'Zone name': 'Zone name',\n",
       " 'Zone name (short)': 'Zone name (short)',\n",
       " 'pc_abundance_name_mode': 'pc_abundance_name_mode',\n",
       " 'pc_fossil_group': 'pc_fossil_group',\n",
       " 'pc_fossil_name': 'pc_fossil_name',\n",
       " 'physical_constituent_name': 'physical_constituent_name'}"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nontaxa_mapping = nontaxa_df.set_index('field').to_dict()['normalized_field']\n",
    "nontaxa_mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Before changing headers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Label ID', 'Exp', 'Site', 'Hole', 'Core', 'Core-Sect', 'Type',\n",
       "       'Section', 'A/W', 'Extra Sample ID Data', 'Top [cm]', 'Bottom [cm]',\n",
       "       'Top Depth [m]', 'Bottom Depth [m]', 'Comments', 'Preservation',\n",
       "       'Group Abundance', 'Catapsydrax dissimilis', 'Catapsydrax unicavus',\n",
       "       'Globigerina bulloides', 'Globigerina ciperoensis',\n",
       "       'Globigerina parabulloides', 'Globigerina umbilicata',\n",
       "       'Globigerinita glutinata', 'Globigerinita uvula',\n",
       "       'Globoquadrina duterei', 'Globorotalia incisa',\n",
       "       'Globorotalia (Globoconella) praescitula',\n",
       "       'Globorotalia (Hirsutella) scitula', 'Globorotalia suterae',\n",
       "       'Neogloboquadrina atlantica atlantica',\n",
       "       'Neogloboquadrina atlantica praeatlantica',\n",
       "       'Neogloboquadrina dutertri subcretacea',\n",
       "       'Neogloboquadrina pachyderma (sin)', 'Neogloboquadrina pachyderma(dex)',\n",
       "       'Orbulina universa', 'Turborotalia quinqueloba', 'pc_fossil_group',\n",
       "       'physical_constituent_name', 'pc_fossil_name', 'pc_abundance_name_mode',\n",
       "       'pc_preservation_name_average', 'File Data'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[8], dtype=str)    \n",
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### After changing headers\n",
    "\n",
    "`normalize_columns` replaces the original column names with normalized column names. \n",
    "\n",
    "For the nontaxa mapping, 'Label ID' is changed to 'Sample'. For the taxa mapping, 'Neogloboquadrina pachyderma (sin)' is changed to 'Neogloboquadrina pachyderma (sinistral)',  and 'Neogloboquadrina pachyderma(dex)' to 'Neogloboquadrina pachyderma (dextral)'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Sample', 'Exp', 'Site', 'Hole', 'Core', 'Core-Sect', 'Type', 'Section',\n",
       "       'A/W', 'Extra Sample ID Data', 'Top [cm]', 'Bottom [cm]',\n",
       "       'Top Depth [m]', 'Bottom Depth [m]', 'Comments', 'Preservation',\n",
       "       'Group Abundance', 'Catapsydrax dissimilis', 'Catapsydrax unicavus',\n",
       "       'Globigerina bulloides', 'Globigerina ciperoensis',\n",
       "       'Globigerina parabulloides', 'Globigerina umbilicata',\n",
       "       'Globigerinita glutinata', 'Globigerinita uvula',\n",
       "       'Globoquadrina duterei', 'Globorotalia incisa',\n",
       "       'Globorotalia (Globoconella) praescitula',\n",
       "       'Globorotalia Hirsutella scitula', 'Globorotalia suterae',\n",
       "       'Neogloboquadrina atlantica atlantica',\n",
       "       'Neogloboquadrina atlantica praeatlantica',\n",
       "       'Neogloboquadrina dutertri subcretacea',\n",
       "       'Neogloboquadrina pachyderma (sinistral)',\n",
       "       'Neogloboquadrina pachyderma (dextral)', 'Orbulina universa',\n",
       "       'Turborotalia quinqueloba', 'pc_fossil_group',\n",
       "       'physical_constituent_name', 'pc_fossil_name', 'pc_abundance_name_mode',\n",
       "       'pc_preservation_name_average', 'File Data'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[8], dtype=str) \n",
    "normalize_columns(df, taxa_mapping)\n",
    "normalize_columns(df, nontaxa_mapping)\n",
    "\n",
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Change headers for all files and save the changes\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "for path in paths:\n",
    "    df = pd.read_csv(path, dtype=str)    \n",
    "    \n",
    "    normalize_columns(df, nontaxa_mapping)\n",
    "    normalize_columns(df, taxa_mapping)\n",
    "    \n",
    "    df.to_csv(path, index=False)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Turn one column into multiple columns \n",
    "\n",
    "For some files, `Sample` or `Label ID` column was given, but `Exp, Site, Hole, Core, Type, Section, A/W` columns where not given. All those columns should be in every file.\n",
    "\n",
    "`normalize_expedition_section_cols` converts `Sample` or `Label ID` into separate `Exp, Site, Hole, Core, Type, Section, A/W` columns. \n",
    "\n",
    "Sample: 363-U1483A-1H-2-W 75/77-FORAM  \n",
    "Exp: 363, Site: U1483, Hole: A, Core: 1, Type: H, Section: 2, A/W: W"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Before cleanup\n",
    "\n",
    "File has `Sample` column, but is missing `Exp, Site, Hole, Core, Type, Section, A/W` columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1, 9)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>Bottom [cm]</th>\n",
       "      <th>Top Depth [m]</th>\n",
       "      <th>Bottom Depth [m]</th>\n",
       "      <th>Preservation</th>\n",
       "      <th>Group Abundance</th>\n",
       "      <th>Globigerina bulloides</th>\n",
       "      <th>Neogloboquadrina pachyderma (sinistral)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>346-U1422A-1H-CC-PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9.71</td>\n",
       "      <td>9.71</td>\n",
       "      <td>VG</td>\n",
       "      <td>R</td>\n",
       "      <td>P</td>\n",
       "      <td>P</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 Sample Top [cm] Bottom [cm] Top Depth [m] Bottom Depth [m]  \\\n",
       "0  346-U1422A-1H-CC-PAL        0           0          9.71             9.71   \n",
       "\n",
       "  Preservation Group Abundance Globigerina bulloides  \\\n",
       "0           VG               R                     P   \n",
       "\n",
       "  Neogloboquadrina pachyderma (sinistral)  \n",
       "0                                       P  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[6], dtype=str)  \n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### After cleanup\n",
    "\n",
    "File has `Sample, Exp, Site, Hole, Core, Type, Section, A/W` columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1, 16)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>Bottom [cm]</th>\n",
       "      <th>Top Depth [m]</th>\n",
       "      <th>Bottom Depth [m]</th>\n",
       "      <th>Preservation</th>\n",
       "      <th>Group Abundance</th>\n",
       "      <th>Globigerina bulloides</th>\n",
       "      <th>Neogloboquadrina pachyderma (sinistral)</th>\n",
       "      <th>Exp</th>\n",
       "      <th>Site</th>\n",
       "      <th>Hole</th>\n",
       "      <th>Core</th>\n",
       "      <th>Type</th>\n",
       "      <th>Section</th>\n",
       "      <th>A/W</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>346-U1422A-1H-CC-PAL</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>9.71</td>\n",
       "      <td>9.71</td>\n",
       "      <td>VG</td>\n",
       "      <td>R</td>\n",
       "      <td>P</td>\n",
       "      <td>P</td>\n",
       "      <td>346</td>\n",
       "      <td>U1422</td>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "      <td>H</td>\n",
       "      <td>CC</td>\n",
       "      <td>PAL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 Sample Top [cm] Bottom [cm] Top Depth [m] Bottom Depth [m]  \\\n",
       "0  346-U1422A-1H-CC-PAL        0           0          9.71             9.71   \n",
       "\n",
       "  Preservation Group Abundance Globigerina bulloides  \\\n",
       "0           VG               R                     P   \n",
       "\n",
       "  Neogloboquadrina pachyderma (sinistral)  Exp   Site Hole Core Type Section  \\\n",
       "0                                       P  346  U1422    A    1    H      CC   \n",
       "\n",
       "   A/W  \n",
       "0  PAL  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[6], dtype=str) \n",
    "df = normalize_expedition_section_cols(df)\n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean up all files and save the changes\n",
    "\n",
    "Normalize `Sample, Exp, Site, Hole, Core, Type, Section, A/W` columns for all files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "for path in paths:\n",
    "    df = pd.read_csv(path, dtype=str)   \n",
    "    \n",
    "    df = normalize_expedition_section_cols(df)\n",
    "    \n",
    "    df.to_csv(path, index=False) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Clean up row values\n",
    "\n",
    "For some of the columns such as taxa, the rows contain abundance data that we want to keep and notes in brackets that we want to remove.  `remove_bracket_text` removes the [note] text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Before cleanup\n",
    "\n",
    "'Globigerina bulloides' and 'Neogloboquadrina cf. pachyderma' have [318_PF]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1, 16)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Exp</th>\n",
       "      <th>Site</th>\n",
       "      <th>Hole</th>\n",
       "      <th>Core</th>\n",
       "      <th>Core-Sect</th>\n",
       "      <th>Type</th>\n",
       "      <th>Section</th>\n",
       "      <th>A/W</th>\n",
       "      <th>Extra Sample ID Data</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>Bottom [cm]</th>\n",
       "      <th>Top Depth [m]</th>\n",
       "      <th>Bottom Depth [m]</th>\n",
       "      <th>Globigerina bulloides</th>\n",
       "      <th>Neogloboquadrina cf. pachyderma</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>318-U1355A-1R-1-PAL-FORAM</td>\n",
       "      <td>318</td>\n",
       "      <td>U1355</td>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "      <td>1-1</td>\n",
       "      <td>R</td>\n",
       "      <td>1</td>\n",
       "      <td>PAL</td>\n",
       "      <td>FORAM</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.01</td>\n",
       "      <td>R[318_PF]</td>\n",
       "      <td>R[318_PF]</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      Sample  Exp   Site Hole Core Core-Sect Type Section  \\\n",
       "0  318-U1355A-1R-1-PAL-FORAM  318  U1355    A    1       1-1    R       1   \n",
       "\n",
       "   A/W Extra Sample ID Data Top [cm] Bottom [cm] Top Depth [m]  \\\n",
       "0  PAL                FORAM        0           1             0   \n",
       "\n",
       "  Bottom Depth [m] Globigerina bulloides Neogloboquadrina cf. pachyderma  \n",
       "0             0.01             R[318_PF]                       R[318_PF]  "
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[0], dtype=str)    \n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### After cleanup\n",
    "\n",
    "Remove [318_PF] from 'Globigerina bulloides' and 'Neogloboquadrina cf. pachyderma'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1, 16)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sample</th>\n",
       "      <th>Exp</th>\n",
       "      <th>Site</th>\n",
       "      <th>Hole</th>\n",
       "      <th>Core</th>\n",
       "      <th>Core-Sect</th>\n",
       "      <th>Type</th>\n",
       "      <th>Section</th>\n",
       "      <th>A/W</th>\n",
       "      <th>Extra Sample ID Data</th>\n",
       "      <th>Top [cm]</th>\n",
       "      <th>Bottom [cm]</th>\n",
       "      <th>Top Depth [m]</th>\n",
       "      <th>Bottom Depth [m]</th>\n",
       "      <th>Globigerina bulloides</th>\n",
       "      <th>Neogloboquadrina cf. pachyderma</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>318-U1355A-1R-1-PAL-FORAM</td>\n",
       "      <td>318</td>\n",
       "      <td>U1355</td>\n",
       "      <td>A</td>\n",
       "      <td>1</td>\n",
       "      <td>1-1</td>\n",
       "      <td>R</td>\n",
       "      <td>1</td>\n",
       "      <td>PAL</td>\n",
       "      <td>FORAM</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.01</td>\n",
       "      <td>R</td>\n",
       "      <td>R</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      Sample  Exp   Site Hole Core Core-Sect Type Section  \\\n",
       "0  318-U1355A-1R-1-PAL-FORAM  318  U1355    A    1       1-1    R       1   \n",
       "\n",
       "   A/W Extra Sample ID Data Top [cm] Bottom [cm] Top Depth [m]  \\\n",
       "0  PAL                FORAM        0           1             0   \n",
       "\n",
       "  Bottom Depth [m] Globigerina bulloides Neogloboquadrina cf. pachyderma  \n",
       "0             0.01                     R                               R  "
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(paths[0], dtype=str) \n",
    "df = remove_bracket_text(df)\n",
    "print_df(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean up all files and save the changes\n",
    "\n",
    "Remove bracket text from all files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "for path in paths:\n",
    "    df = pd.read_csv(path, dtype=str)\n",
    "    \n",
    "    df = remove_bracket_text(df)\n",
    "    \n",
    "    df.to_csv(path, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check if required columns exists\n",
    "\n",
    "After we completed the data cleaning steps, we want to check if certain required columns are present in all the data files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "required_columns = {\n",
    " 'A/W',\n",
    " 'Bottom [cm]',\n",
    " 'Bottom Depth [m]',\n",
    " 'Core',\n",
    " 'Exp',\n",
    " 'Hole',\n",
    " 'Sample',\n",
    " 'Section',\n",
    " 'Site',\n",
    " 'Top [cm]',\n",
    " 'Top Depth [m]',\n",
    " 'Type'\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a file is missing some required columns, print the file name and the missing columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All files have required columns.\n"
     ]
    }
   ],
   "source": [
    "missing = False\n",
    "\n",
    "for path in paths:\n",
    "    df = pd.read_csv(path, dtype=str)    \n",
    "    cols = set(df.columns)\n",
    "    diff = required_columns - cols\n",
    "    \n",
    "    if(len(diff) > 0):\n",
    "        missing  = True\n",
    "        print(path)\n",
    "        print(required_columns - cols)\n",
    "    \n",
    "if not missing:\n",
    "    print('All files have required columns.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References\n",
    "\n",
    "1. Extending Ocean Drilling Pursuits website https://eodp.github.io\n",
    "2. Github repo for processing eODP data: https://github.com/eODP/data-processing. Folder for eODP raw data files: https://github.com/eODP/data-processing/tree/master/raw_data/\n",
    "3. The normalized nontaxa and taxa csv were provided by Andrew J. Fraass, Leah J. Levay, and Jocelyn A. Sessa.\n",
    "4. LeVay, Leah; Fraass, Andrew; Peters, Shanan; Sessa, Jocelyn; Kaufman, Seth; Kwan, Wai-Yin. Geologic data standardization for database entry: preparing diverse datasets for hosting and accessibility. EarthCube Annual Meeting. Poster.  \n",
    "5. Sessa, Jocelyn; Levay, Leah; Peters, Shanan; Fraass, Andrew. The extending Ocean Drilling Pursuits (eODP) project: synthesizing scientific ocean drilling data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "248.182px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}