Environmental Data Science

Spectral Attenuation Pipeline

A scientific data workflow for deriving optical attenuation outputs and related light-penetration indicators from field measurements.

Spectral Attenuation is a scientific data-processing project centered on underwater light measurements. The repository turns raw field spectra into reusable attenuation outputs, compact ML-ready datasets across FULL and PAR wavelength ranges, reproducible train/test splits, derived optical features, and a usable web interface for selected analysis workflows.

Overview

The core repository is a reproducible Python pipeline that turns raw depth-resolved measurement files into structured outputs for water-quality modeling. It includes data auditing, attenuation-output generation, dataset assembly, feature engineering, split generation, and derived optical summaries.

The processed datasets support two spectral configurations: a FULL range from 380 to 900 nm and a PAR range from 400 to 700 nm. Both are built from the same raw source table and exported as compact modeling datasets.

The repository also contains an application layer around the research code: a FastAPI backend and a React/Vite frontend that provide a usable interface for selected scientific analysis workflows.

Problem

Raw underwater light measurements are difficult to reuse directly for modeling because they are spread across heterogeneous field files, locations, dates, and measurement conventions.

Water-quality modeling workflows need consistent attenuation targets and aligned environmental features rather than ad hoc notebook processing.

Light-penetration analysis benefits from a reproducible way to turn field measurements into domain-specific optical metrics and downstream modeling features.

The main goal was to study how attenuation-related behavior connects to core environmental features such as chlorophyll, CDOM, and TSS within a reusable analysis workflow.

What This Repository Implements

Built a public pipeline entrypoint that runs audit, attenuation-output generation, dataset, feature, split, and derived-optics stages in a fixed reproducible order.

Implemented a domain-specific processing workflow that turns raw measurement files into station-level outputs, reusable datasets, and saved visual artifacts.

Built compact FULL and PAR modeling datasets with log-transformed targets and environmental features derived from chlorophyll, CDOM, and TSS inputs.

Added reproducible split generation for both random-by-sample and holdout-by-location evaluation setups.

Added a FastAPI backend and React frontend so the scientific workflow can also be accessed through a usable web interface.

Approach

Applied domain-specific preprocessing to raw field measurements so they could be transformed into structured attenuation outputs and reusable modeling inputs.
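As an illustration of what an attenuation output can look like, the sketch below estimates a diffuse attenuation coefficient from one depth profile at a single wavelength. It assumes the standard exponential decay model E(z) = E0 · exp(-Kd · z) and a simple log-linear least-squares fit. The repository's actual preprocessing is not described here, so the function name, inputs, and synthetic profile are hypothetical.

```python
import numpy as np


def fit_attenuation(depths_m, irradiance):
    """Estimate Kd (1/m) and surface irradiance E0 from a log-linear fit.

    Assumes E(z) = E0 * exp(-Kd * z); this is an illustrative model choice,
    not necessarily the method used in the repository.
    """
    z = np.asarray(depths_m, dtype=float)
    e = np.asarray(irradiance, dtype=float)
    valid = e > 0  # the logarithm is undefined for non-positive readings
    if valid.sum() < 2:
        raise ValueError("need at least two positive irradiance readings")
    # polyfit returns [slope, intercept] for a degree-1 fit of ln(E) against z
    slope, intercept = np.polyfit(z[valid], np.log(e[valid]), deg=1)
    return float(-slope), float(np.exp(intercept))


# Synthetic profile (hypothetical values): E0 = 100, Kd = 0.8 1/m
depths = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
profile = 100.0 * np.exp(-0.8 * depths)
kd, e0 = fit_attenuation(depths, profile)
```

Repeating a fit of this kind for each wavelength would yield one attenuation value per wavelength per station, which is the shape of output the later dataset stages consume.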

Generated saved station-level outputs and quality-check visuals to support review of the processed optical results.

Assembled a long-form raw dataset and transformed it into compact ML tables for FULL and PAR wavelength ranges with one row per sample and one target column per wavelength.
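The long-to-wide step can be sketched with pandas as follows. The column names (`sample_id`, `wavelength_nm`, `kd`), the `log_kd_` prefix, and the example values are assumptions made for illustration; only the FULL (380 to 900 nm) and PAR (400 to 700 nm) ranges and the log-transformed targets come from the description above.

```python
import numpy as np
import pandas as pd

# Wavelength ranges stated in the overview (nm, inclusive bounds assumed).
RANGES = {"FULL": (380, 900), "PAR": (400, 700)}


def build_wide_table(long_df, range_name):
    """Pivot long-form rows into one row per sample, one column per wavelength."""
    lo, hi = RANGES[range_name]
    in_range = long_df[long_df["wavelength_nm"].between(lo, hi)]
    wide = in_range.pivot_table(
        index="sample_id",
        columns="wavelength_nm",
        values="kd",
        aggfunc="mean",  # average replicates that share sample and wavelength
    )
    wide = np.log(wide)  # log-transformed targets (natural log assumed)
    wide.columns = [f"log_kd_{int(w)}" for w in wide.columns]
    return wide.reset_index()


# Tiny hypothetical long-form table with two samples and four wavelengths.
long_df = pd.DataFrame(
    {
        "sample_id": ["s1"] * 4 + ["s2"] * 4,
        "wavelength_nm": [380, 400, 700, 900] * 2,
        "kd": [2.0, 1.5, 0.6, 3.0, 2.4, 1.8, 0.9, 3.5],
    }
)
full = build_wide_table(long_df, "FULL")
par = build_wide_table(long_df, "PAR")
```

Because both tables are filtered views of the same long-form source, the PAR columns are a subset of the FULL columns in this sketch.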

Generated reproducible train/test exports with a fixed seed and an explicit holdout-by-location strategy to separate location generalization from random sample splits.
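A holdout-by-location split can be sketched as below: whole locations are drawn for the test set with a seeded generator, so no location contributes samples to both sides. The function name, the seed value, and the example location labels are hypothetical; the repository's own split code and seed are not shown here.

```python
import numpy as np


def holdout_by_location(locations, n_test_locations, seed=42):
    """Return (train_idx, test_idx, test_locations) with disjoint locations.

    `seed` is a placeholder value; any fixed seed makes the split repeatable.
    """
    locations = np.asarray(locations)
    unique = np.unique(locations)
    rng = np.random.default_rng(seed)
    # Sample whole locations, not individual rows, for the test set.
    test_locs = rng.choice(unique, size=n_test_locations, replace=False)
    test_mask = np.isin(locations, test_locs)
    train_idx = np.flatnonzero(~test_mask)
    test_idx = np.flatnonzero(test_mask)
    return train_idx, test_idx, set(test_locs.tolist())


# Hypothetical location label for each of ten samples.
sample_locations = ["A", "A", "B", "B", "B", "C", "D", "D", "E", "E"]
train_idx, test_idx, test_locs = holdout_by_location(
    sample_locations, n_test_locations=2, seed=0
)
```

A random-by-sample split would instead shuffle row indices directly, which lets samples from the same location appear in both train and test and therefore measures something different from location generalization.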

Derived higher-level optical and irradiance-related features for downstream analysis, interpretation, and interface-level workflows.

Reduced spectral feature correlation with PCA and fit a regression model to predict attenuation targets from chlorophyll, CDOM, and TSS inputs.
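One plausible reading of this step is sketched below: PCA compresses the strongly correlated per-wavelength targets into a few uncorrelated component scores, a regressor predicts those scores from the three environmental inputs, and predictions are mapped back to the full spectrum. Everything here is an assumption for illustration: the data are synthetic, the number of components is arbitrary, and scikit-learn's `LinearRegression` stands in for whatever model the project actually fit.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 60, 31

# Synthetic stand-ins: columns play the role of chlorophyll, CDOM, and TSS.
X = rng.lognormal(size=(n_samples, 3))
X_log = np.log1p(X)

# Synthetic log-scale spectra driven by the three inputs plus small noise.
weights = rng.normal(size=(3, n_wavelengths))
Y = X_log @ weights + 0.05 * rng.normal(size=(n_samples, n_wavelengths))

# Compress correlated wavelength columns into uncorrelated component scores.
pca = PCA(n_components=3).fit(Y)
scores = pca.transform(Y)

# Regress the component scores on the environmental inputs.
model = LinearRegression().fit(X_log, scores)

# Map predicted scores back to one value per wavelength.
Y_pred = pca.inverse_transform(model.predict(X_log))
```

Working in component space means the regressor fits a handful of targets instead of one per wavelength, at the cost of discarding whatever variance the dropped components carried.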

Results

The regression model reached an R² of 0.86 and an RMSE of about 0.65 (log scale) on the held-out test set, a solid result given the noise typical of environmental water data.

The pipeline covers the complete workflow from raw field measurements to modeling tables, split artifacts, irradiance derivatives, and reusable model exports.

A FastAPI backend and React/Vite frontend make the pipeline usable for interactive analysis, such as visualizing irradiance at depth and exporting results, rather than only running in a notebook.
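For reference, the two metrics can be computed on log-scale values as in the sketch below. The function name and the toy arrays are hypothetical and are not the project's data. The base of the logarithm is not stated; if it were the natural log, an RMSE of 0.65 would correspond to a typical multiplicative error of roughly exp(0.65) ≈ 1.9 on the original scale.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for predictions on the same (log) scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    ss_res = np.sum(residual**2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(np.mean(residual**2)))
    return float(r2), rmse


# Toy log-scale values (hypothetical), chosen so the arithmetic is easy to check.
log_true = np.array([0.0, 1.0, 2.0, 3.0])
log_pred = np.array([0.5, 0.5, 2.5, 2.5])
r2, rmse = r2_and_rmse(log_true, log_pred)
```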

Test R²

0.86

Held-out test set, log-scale targets.

Test RMSE

≈ 0.65

Log-scale error on the held-out test set.

Raw Dataset

129,729 rows

Long-form source measurements.

Model Samples

245

One row per water sample (FULL & PAR datasets).

Visuals

Outputs and diagrams from the project.

Example station-level optical output plot.

Quality-control plot for a station measurement.

Charts & Figures

Saved figures and chart artifacts referenced by the project.

Saved project figure showing summary feature importance.

Chart Data

PAR Random Split Sample Counts

Dataset split randomly at the sample level to evaluate general model performance under standard i.i.d. assumptions.

Train: 190

Test: 48

PAR Holdout-by-Location Sample Counts

Dataset split by location, ensuring that all samples from specific locations are held out for testing. This evaluates the model's ability to generalize to unseen geographic conditions.

Train: 207

Test: 31

PAR Holdout-by-Location Coverage

Number of unique locations in the saved split metadata.

Train Locations: 33

Test Locations: 5

Stack

Python, uv, Pandas, NumPy, SciPy, scikit-learn, CatBoost, Matplotlib, pvlib, FastAPI, Pydantic, React, Vite, Recharts

Challenges

Handling legacy and newer measurement branches with different calibration and metadata assumptions.

Aligning raw spectral text files, filename audits, replicate IDs, and depth metadata into a single processable pipeline.

Maintaining scientific pipeline logic while adding a clean application layer that does not rewrite the research project layout.

Lessons

Separating audit, feature building, splits, and irradiance generation into explicit stages makes scientific workflows easier to rerun and review.

Reproducibility benefits from shipping processed datasets and split metadata alongside code, not just notebooks.

A thin adapter layer is useful when adding an API to an existing research codebase because it limits coupling to internal project changes.

Wrapping a research codebase in an API surfaces exactly which functions are genuinely reusable and which still assume notebook context.
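The adapter idea from these lessons can be sketched in plain Python as below. All names are hypothetical: `_research_irradiance_at_depth` stands in for an internal research function, and the request and response dataclasses stand in for the typed models an API route would expose. The exponential decay formula is used only as a simple placeholder computation.

```python
import math
from dataclasses import dataclass


def _research_irradiance_at_depth(e0, kd, depth):
    """Hypothetical stand-in for an internal research function."""
    return e0 * math.exp(-kd * depth)


@dataclass(frozen=True)
class IrradianceRequest:
    surface_irradiance: float
    kd_per_m: float
    depth_m: float


@dataclass(frozen=True)
class IrradianceResponse:
    depth_m: float
    irradiance: float


def irradiance_adapter(request):
    """Validate input, call the research code, and return a stable shape.

    The API layer depends only on this function and the two dataclasses,
    so internal research functions can be renamed or refactored freely.
    """
    if request.depth_m < 0 or request.kd_per_m < 0:
        raise ValueError("depth and attenuation must be non-negative")
    value = _research_irradiance_at_depth(
        request.surface_irradiance, request.kd_per_m, request.depth_m
    )
    return IrradianceResponse(depth_m=request.depth_m, irradiance=value)


result = irradiance_adapter(IrradianceRequest(100.0, 0.5, 2.0))
```

A web route would then only translate HTTP input into an `IrradianceRequest` and serialize the `IrradianceResponse`, which keeps notebook-era assumptions out of the interface.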