# Data-driven Modeling

The emergence of organized multiscale patterns resulting from convection is ubiquitous and is observed across different cloud types. Reproducing such patterns in general circulation models remains a challenge because of the complex nature of clouds, which involve processes interacting over a wide range of spatio-temporal scales. Recent advances in data-driven modeling techniques hold considerable promise for discovering dynamical equations from partial observations of complex systems.

This study presents such a discovery from high-resolution satellite datasets of continental cloud fields. The model consists of stochastic differential equations that simulate with high fidelity the spatio-temporal coherence and variability of the cloud patterns, such as the characteristic lifetime of individual clouds or global organizational features governed by convective inertia-gravity waves. This is achieved through the model's lagged effects, associated with convection recirculation times, and through hidden variables that parameterize the unobserved processes and variables.

We parameterize sub-grid scale (SGS) fluxes in sinusoidally forced two-dimensional turbulence on the *β*-plane at high Reynolds numbers (Re ∼ 25,000) using simple 2-layer convolutional neural networks (CNNs) having only O(1000) parameters, two orders of magnitude fewer than recent studies employing deeper CNNs with 8–10 layers; we obtain stable, accurate, and long-term online, or a posteriori, solutions at 16× downscaling factors. Our methodology significantly improves training efficiency and the speed of online large-eddy simulation runs, while offering insights into the physics of closure in such turbulent flows. Our approach benefits from an extensive hyperparameter search over the learning rate and weight decay coefficient, as well as from cyclical learning rate annealing, which leads to more robust and accurate online solutions than fixed learning rates. Our CNNs use either the coarse velocity or the vorticity and strain fields as inputs, and output the two components of the deviatoric stress tensor, *S*_{d}. We minimize a loss between the SGS vorticity flux divergence (computed from the high-resolution solver) and that obtained from the CNN-modeled *S*_{d}, without requiring energy- or enstrophy-preserving constraints. The success of shallow CNNs in accurately parameterizing this class of turbulent flows implies that the SGS stresses have a weak non-local dependence on coarse fields; it also aligns with our physical conception that small scales are locally controlled by larger scales such as vortices and their strained filaments. Furthermore, 2-layer CNN parameterizations are more likely to be interpretable.
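To make the size and locality of such a network concrete, here is a minimal NumPy sketch of a 2-layer CNN forward pass on a doubly periodic grid. The abstract does not give the kernel size, hidden width, activation or initialization, so `k=5`, `c_hidden=8`, the ReLU and the random weights are illustrative assumptions only, chosen to land near O(1000) parameters. All function names are hypothetical, and no training is shown.

```python
import numpy as np

def conv2d_periodic(x, weights, bias):
    """Cross-correlate x (c_in, H, W) with weights (c_out, c_in, k, k) on a periodic grid."""
    c_out, c_in, k, _ = weights.shape
    half = k // 2
    out = np.zeros((c_out,) + x.shape[1:])
    for di in range(k):
        for dj in range(k):
            # np.roll supplies the periodic boundary conditions
            shifted = np.roll(x, shift=(half - di, half - dj), axis=(1, 2))
            # contract over input channels for this kernel offset
            out += np.einsum("oc,chw->ohw", weights[:, :, di, dj], shifted)
    return out + bias[:, None, None]

def init_params(c_in=2, c_hidden=8, c_out=2, k=5, seed=0):
    """Random weights; channel counts and kernel size are assumed, not from the paper."""
    rng = np.random.default_rng(seed)
    return {
        "w1": 0.1 * rng.standard_normal((c_hidden, c_in, k, k)),
        "b1": np.zeros(c_hidden),
        "w2": 0.1 * rng.standard_normal((c_out, c_hidden, k, k)),
        "b2": np.zeros(c_out),
    }

def shallow_cnn(x, params):
    """Two convolutional layers with a ReLU in between: coarse fields in, two stress components out."""
    hidden = np.maximum(conv2d_periodic(x, params["w1"], params["b1"]), 0.0)
    return conv2d_periodic(hidden, params["w2"], params["b2"])

def count_params(params):
    """Total number of trainable scalars."""
    return sum(int(p.size) for p in params.values())
```

With these assumed sizes, each output point depends only on a 9 × 9 neighbourhood of the coarse input (two stacked 5 × 5 kernels), which is one way to read the statement that the SGS stresses depend only weakly on non-local information.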

A theory of Ruelle–Pollicott (RP) resonances for stochastic differential systems is presented. These resonances are defined as the eigenvalues of the generator (Kolmogorov operator) of a given stochastic system. By relying on the theory of Markov semigroups, decomposition formulas of correlation functions and power spectral densities (PSDs) in terms of RP resonances are then derived. These formulas describe, for a broad class of stochastic differential equations (SDEs), how the RP resonances characterize the decay of correlations as well as the signal’s oscillatory components manifested by peaks in the PSD. It is then shown that a notion of reduced RP resonances can be rigorously defined as soon as the dynamics is partially observed within a reduced state space *V*. These reduced resonances are obtained from the spectral elements of reduced Markov operators acting on functions of the state space *V*, and can be estimated from time series. They inform us about the spectral elements of some coarse-grained version of the SDE generator. When the time lag at which the transitions are collected from partial observations in *V* is either sufficiently small or large, it is shown that the reduced RP resonances approximate the (weak) RP resonances of the generator of the conditional expectation in *V*, i.e. the optimal reduced system in *V* obtained by averaging out the contribution of the unobserved variables. The approach is illustrated on a stochastic slow-fast system for which it is shown that the reduced RP resonances allow for a good reconstruction of the correlation functions and PSDs, even when the time-scale separation is weak. The companion articles, Part II and Part III, deal with further practical aspects of the theory presented in this contribution. One important byproduct is the diagnostic usefulness that RP resonances provide for stochastic dynamics. This is illustrated in the case of a stochastic Hopf bifurcation in Part II.
There, it is shown that such a bifurcation has a clear manifestation in terms of a geometric organization of the RP resonances along discrete parabolas in the left half plane. Such geometric features formed by (reduced) RP resonances are extractable from time series and thus provide an unambiguous “signature” of nonlinear oscillations embedded within a stochastic background. Relying on the theory of reduced RP resonances presented in this contribution, Part III addresses the question of detection and characterization of such oscillations in a high-dimensional stochastic system, namely the Cane–Zebiak model of El Niño–Southern Oscillation subject to noise modeling fast atmospheric fluctuations.
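The following minimal sketch shows one way to estimate resonances from a time series. It is an illustration under stated assumptions, not the authors' implementation. The observed variable is coarse-grained into bins, a transition matrix is estimated at a lag τ, and its eigenvalues ζ_k are mapped to λ_k = log(ζ_k)/τ. The Ornstein–Uhlenbeck test process, the quantile binning and the function names are all choices made here for illustration.

```python
import numpy as np

def simulate_ou(theta, sigma, dt, n_steps, seed=0):
    """Euler-Maruyama path of the test process dX = -theta * X dt + sigma dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = 0.0
    noise = rng.standard_normal(n_steps - 1) * sigma * np.sqrt(dt)
    for k in range(n_steps - 1):
        x[k + 1] = x[k] - theta * x[k] * dt + noise[k]
    return x

def reduced_resonances(series, lag_steps, dt, n_bins=20, n_keep=4):
    """Estimate leading resonances from a coarse-grained Markov matrix at lag tau."""
    # quantile bins give every box a similar number of samples
    edges = np.quantile(series, np.linspace(0.0, 1.0, n_bins + 1))
    labels = np.digitize(series, edges[1:-1])  # labels in 0 .. n_bins - 1
    src, dst = labels[:-lag_steps], labels[lag_steps:]
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (src, dst), 1.0)         # count observed transitions
    transition = counts / counts.sum(axis=1, keepdims=True)  # row-stochastic
    zeta = np.linalg.eigvals(transition)
    zeta = zeta[np.argsort(-np.abs(zeta))][:n_keep]
    tau = lag_steps * dt
    # map Markov-matrix eigenvalues to generator-like eigenvalues
    return np.log(zeta.astype(complex)) / tau
```

For a path such as `simulate_ou(1.0, 1.0, 0.01, 400000)`, calling `reduced_resonances(x, lag_steps=50, dt=0.01)` returns a leading value at zero (the stationary state) followed by negative real parts that measure decay rates of correlations. The coarse binning and finite sample size bias these estimates, so they should be read as approximations.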

The response of a low-frequency mode of climate variability, El Niño–Southern Oscillation, to stochastic forcing is studied in a high-dimensional model of intermediate complexity, the fully-coupled Cane–Zebiak model (Zebiak and Cane 1987), from the spectral analysis of Markov operators governing the decay of correlations and resonances in the power spectrum. Noise-induced oscillations excited before a supercritical Hopf bifurcation are examined by means of complex resonances, the reduced Ruelle–Pollicott (RP) resonances, via a numerical application of the reduction approach of the first part of this contribution (Chekroun et al. 2019) to model simulations. The oscillations manifest themselves as peaks in the power spectrum which are associated with RP resonances organized along parabolas, as the bifurcation is neared. These resonances and the associated eigenvectors are furthermore well described by the small-noise expansion formulas obtained by Gaspard (2002) and made explicit in the second part of this contribution (Tantet et al. 2019). Beyond the bifurcation, the spectral gap between the imaginary axis and the real part of the leading resonances quantifies the diffusion of phase of the noise-induced oscillations and can be computed from the linearization of the model and from the diffusion matrix of the noise. In this model, the phase diffusion coefficient thus gives a measure of the predictability of oscillatory events representing ENSO. ENSO events being known to be locked to the seasonal cycle, these results should be extended to the non-autonomous case. More generally, the reduction approach theorized in Chekroun et al. (2019), complemented by our understanding of the spectral properties of reference systems such as the stochastic Hopf bifurcation, provides a promising methodology for the analysis of low-frequency variability in high-dimensional stochastic systems.

The decline in Arctic sea ice extent (SIE) is an area of active scientific research with profound socio-economic implications. Of particular interest are reliable methods for SIE forecasting on subseasonal time scales, in particular from early summer into fall, when sea ice coverage in the Arctic reaches its minimum. Here, we apply the recent data-adaptive harmonic (DAH) technique of Chekroun and Kondrashov (2017, *Chaos*, **27**) to the description, modeling and prediction of the Multisensor Analyzed Sea Ice Extent (MASIE, 2006–2016) data set. The DAH decomposition of MASIE identifies narrowband, spatio-temporal data-adaptive modes over four key Arctic regions. The time evolution of the DAH coefficients of these modes can be modelled and predicted by using a set of coupled Stuart–Landau stochastic differential equations that capture the modes’ frequencies and amplitude modulation in time. Retrospective forecasts show that our resulting multilayer Stuart–Landau model (MSLM) is quite skilful in predicting September SIE compared to year-to-year persistence; moreover, the DAH–MSLM approach provided accurate real-time prediction that was highly competitive for the 2016–2017 Sea Ice Outlook.

The multiscale variability of the ocean circulation due to its nonlinear dynamics remains a major challenge for theoretical understanding and practical ocean modeling. This paper demonstrates how the data-adaptive harmonic (DAH) decomposition and inverse stochastic modeling techniques introduced in Chekroun and Kondrashov (2017, *Chaos*, **27**) allow for reproducing with high fidelity the main statistical properties of multiscale variability in a coarse-grained eddy-resolving ocean flow. This fully data-driven approach relies on extraction of frequency-ranked time-dependent coefficients describing the evolution of spatio-temporal DAH modes (DAHMs) in the oceanic flow data. In turn, the time series of these coefficients are efficiently modeled by a family of low-order stochastic differential equations (SDEs) stacked per frequency, involving a fixed set of predictor functions and a small number of model coefficients. These SDEs take the form of stochastic oscillators, identified as multilayer Stuart–Landau models (MSLMs), and their use is justified by relying on the theory of Ruelle–Pollicott resonances. The good modeling skill shown by the resulting DAH-MSLM emulators demonstrates the feasibility of using a network of stochastic oscillators for the modeling of geophysical turbulence. In a certain sense, the original quasiperiodic Landau view of turbulence, amended to include stochasticity, may be well suited to describe turbulence.

The solar wind-magnetosphere coupling is studied by a new data-adaptive harmonic (DAH) decomposition approach for the spectral analysis and inverse modeling of multivariate time observations of complex nonlinear dynamical systems. DAH identifies frequency-based modes of interactions in the combined dataset of Auroral Electrojet (AE) index and solar wind forcing. The time evolution of these modes can be very efficiently simulated by using systems of stochastic differential equations (SDEs) that are stacked per frequency and formed by coupled Stuart-Landau oscillators. These systems of SDEs capture the modes’ frequencies as well as their amplitude modulations, and yield, in turn, an accurate modeling of the AE index’s statistical properties.

We present and apply a novel method of describing and modeling complex multivariate datasets in the geosciences and elsewhere. Data-adaptive harmonic (DAH) decomposition identifies narrow-banded, spatio-temporal modes (DAHMs) whose frequencies are not necessarily integer multiples of each other. The evolution in time of the DAH coefficients (DAHCs) of these modes can be modeled using a set of coupled Stuart-Landau stochastic differential equations that capture the modes’ frequencies and amplitude modulation in time and space. This methodology is applied first to a challenging synthetic dataset and then to Arctic sea ice concentration (SIC) data from the US National Snow and Ice Data Center (NSIDC). The 36-year (1979–2014) dataset is parsimoniously and accurately described by our DAHMs. Preliminary results indicate that simulations using our multilayer Stuart-Landau model (MSLM) of SICs are stable for much longer time intervals, beyond the end of the twenty-first century, and exhibit interdecadal variability consistent with past historical records. They also indicate that this MSLM is quite skillful in predicting September sea ice extent.
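As a rough illustration of the building block behind these models, the sketch below integrates a single stochastic Stuart-Landau oscillator, dz = [(μ + iω)z − |z|²z] dt + σ dW, with the Euler-Maruyama scheme. The normal form, parameter names and additive complex noise are assumptions made here. The multilayer models in the abstracts couple several such oscillators per frequency and estimate their coefficients from data, which this sketch does not attempt.

```python
import numpy as np

def simulate_stuart_landau(mu, omega, sigma, dt, n_steps, z0=0.1 + 0.0j, seed=0):
    """Euler-Maruyama integration of dz = [(mu + i*omega) z - |z|^2 z] dt + sigma dW."""
    rng = np.random.default_rng(seed)
    z = np.empty(n_steps, dtype=complex)
    z[0] = z0
    # complex Wiener increments with independent real and imaginary parts
    dw = (rng.standard_normal(n_steps - 1)
          + 1j * rng.standard_normal(n_steps - 1)) * np.sqrt(dt)
    for k in range(n_steps - 1):
        # linear growth and rotation, saturated by the cubic term
        drift = (mu + 1j * omega) * z[k] - abs(z[k]) ** 2 * z[k]
        z[k + 1] = z[k] + drift * dt + sigma * dw[k]
    return z
```

For μ > 0 and weak noise, the amplitude |z| settles near √μ while the phase rotates at angular frequency ω and slowly diffuses. A narrowband mode with modulated amplitude, as described above, has this qualitative behavior.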

Harmonic decompositions of multivariate time series are considered for which we adopt an integral operator approach with periodic semigroup kernels. Spectral decomposition theorems are derived that cover the important cases of two-time statistics drawn from a mixing invariant measure.

The corresponding eigenvalues can be grouped per Fourier frequency and are given, at each frequency, as the singular values of a cross-spectral matrix depending on the data. These eigenvalues furthermore obey a variational principle that allows us to define naturally a multidimensional power spectrum. The eigenmodes, for their part, exhibit a data-adaptive character manifested in their phase, which allows us in turn to define a multidimensional phase spectrum.
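The idea of reading a multidimensional power spectrum off the singular values of a cross-spectral matrix can be sketched as follows. This is a simplified illustration, not the DAH algorithm itself: the cross-spectral matrix is estimated here by averaging windowed periodograms over segments, which is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def cross_spectral_singular_values(data, dt, n_segments=8):
    """Singular values of a segment-averaged cross-spectral matrix at each frequency.

    data has shape (n_times, n_channels); returns (freqs, sing) where
    sing has shape (n_freqs, n_channels), sorted in descending order.
    """
    n_times, n_channels = data.shape
    seg_len = n_times // n_segments
    window = np.hanning(seg_len)
    freqs = np.fft.rfftfreq(seg_len, d=dt)
    csd = np.zeros((freqs.size, n_channels, n_channels), dtype=complex)
    for s in range(n_segments):
        seg = data[s * seg_len:(s + 1) * seg_len]
        seg = (seg - seg.mean(axis=0)) * window[:, None]
        spec = np.fft.rfft(seg, axis=0)  # shape (n_freqs, n_channels)
        # outer product of channel spectra at each frequency
        csd += spec[:, :, None] * np.conj(spec[:, None, :])
    csd /= n_segments
    # np.linalg.svd acts on the stack of per-frequency matrices
    sing = np.linalg.svd(csd, compute_uv=False)
    return freqs, sing
```

Plotting `sing[:, 0]` against `freqs` gives a single spectrum for the whole multivariate record. Peaks mark frequencies at which the channels share coherent variability, whatever their relative phases.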

The resulting data-adaptive harmonic (DAH) modes allow for reducing the data-driven modeling effort to elemental models stacked per frequency, only coupled at different frequencies by the same noise realization. In particular, the DAH decomposition extracts time-dependent coefficients stacked by Fourier frequency which can be efficiently modeled, provided the decay of temporal correlations is sufficiently well resolved, within a class of multilayer stochastic models (MSMs) tailored here on stochastic Stuart-Landau oscillators.

Applications to the Lorenz 96 model and to a stochastic heat equation driven by a space-time white noise are considered. In both cases, the DAH decomposition allows for an extraction of spatio-temporal modes revealing key features of the dynamics in the embedded phase space. The multilayer Stuart-Landau models (MSLMs) are shown to successfully model the typical patterns of the corresponding time-evolving fields, as well as their statistics of occurrence.

Proxy records from Greenland ice cores have been studied for several decades, yet many open questions remain regarding the climate variability encoded therein. Here, we use a Bayesian framework for inferring inverse, stochastic-dynamic models from *δ*^{18}O and dust records of unprecedented, subdecadal temporal resolution. The records stem from the North Greenland Ice Core Project (NGRIP) and we focus on the time interval 59 ka–22 ka b2k. Our model reproduces the dynamical characteristics of both the *δ*^{18}O and dust proxy records, including the millennial-scale Dansgaard–Oeschger variability, as well as statistical properties such as probability density functions, waiting times and power spectra, with no need for any external forcing. The crucial ingredients for capturing these properties are (i) high-resolution training data; (ii) cubic drift terms; (iii) nonlinear coupling terms between the *δ*^{18}O and dust time series; and (iv) non-Markovian contributions that represent short-term memory effects.

The comparison performed in Berry *et al.* [Phys. Rev. E **91**, 032915 (2015)] between the skill in predicting the El Niño-Southern Oscillation climate phenomenon by the prediction method of Berry *et al.* and the “past-noise” forecasting method of Chekroun *et al.* [Proc. Natl. Acad. Sci. USA **108**, 11766 (2011)] is flawed. Three specific misunderstandings in Berry *et al.* are pointed out and corrected.

A suite of empirical model experiments under the empirical model reduction framework is conducted to advance the understanding of ENSO diversity, nonlinearity, seasonality, and the memory effect in the simulation and prediction of tropical Pacific sea surface temperature (SST) anomalies. The model training and evaluation are carried out using 4000-yr preindustrial control simulation data from the coupled model GFDL CM2.1. The results show that multivariate models with tropical Pacific subsurface information and multilevel models with SST history information both improve the prediction skill dramatically. These two types of models represent the ENSO memory effect based on either the recharge oscillator or the time-delayed oscillator viewpoint. Multilevel SST models are somewhat more efficient, requiring fewer model coefficients. Nonlinearity is found necessary to reproduce the ENSO diversity feature for extreme events. The nonlinear models reconstruct the skewed probability density function of SST anomalies and improve the prediction of the skewed amplitude, though the role of nonlinearity may be slightly overestimated given the strong nonlinear ENSO in GFDL CM2.1. The models with periodic terms reproduce the SST seasonal phase locking but do not improve the prediction appreciably. The models with multiple ingredients capture several ENSO characteristics simultaneously and exhibit overall better prediction skill for more diverse target patterns. In particular, they alleviate the spring/autumn prediction barrier and reduce the tendency for predicted values to lag the target month value.
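The multilevel idea can be sketched with ordinary least squares. The increments of the state are regressed on the state, and the increments of the resulting residual are then regressed on the state and that residual, so that each extra level carries memory. This is a minimal linear sketch with a unit time step. The nonlinear and periodic terms discussed above are omitted, and the function names are hypothetical.

```python
import numpy as np

def fit_level(predictors, target):
    """Least-squares fit target ~ predictors @ coeffs; returns coeffs and residual."""
    coeffs, *_ = np.linalg.lstsq(predictors, target, rcond=None)
    return coeffs, target - predictors @ coeffs

def fit_multilevel(x, n_extra_levels=1):
    """Fit a linear multilevel regression model to x of shape (n_times, n_vars).

    Level 0 regresses x[t+1] - x[t] on x[t]. Each extra level regresses the
    increment of the previous residual on all earlier predictors plus that
    residual. Returns the coefficient matrices and residuals of every level.
    """
    predictors = x[:-1]
    target = np.diff(x, axis=0)
    coeffs_list, residuals = [], []
    for level in range(n_extra_levels + 1):
        coeffs, resid = fit_level(predictors, target)
        coeffs_list.append(coeffs)
        residuals.append(resid)
        if level < n_extra_levels:
            # the next level models the memory left in this residual
            target = np.diff(resid, axis=0)
            predictors = np.hstack([predictors[:-1], resid[:-1]])
    return coeffs_list, residuals
```

A common stopping criterion is to add levels until the last residual is close to white noise. On data driven by red noise, the level-0 residual stays serially correlated while the level-1 residual does not.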

Despite the importance of uncertainties encountered in climate model simulations, the fundamental mechanisms at the origin of sensitive behavior of long-term model statistics remain unclear. Variability of turbulent flows in the atmosphere and oceans exhibits recurrent large-scale patterns. These patterns, while evolving irregularly in time, manifest characteristic frequencies across a large range of time scales, from intraseasonal through interdecadal. Based on modern spectral theory of chaotic and dissipative dynamical systems, the associated low-frequency variability may be formulated in terms of Ruelle-Pollicott (RP) resonances. RP resonances encode information on the nonlinear dynamics of the system, and an approach for estimating them, as filtered through an observable of the system, is proposed. This approach relies on an appropriate Markov representation of the dynamics associated with a given observable. It is shown that, within this representation, the spectral gap, defined as the distance between the subdominant RP resonance and the unit circle, plays a major role in the roughness of parameter dependences. The model statistics are the most sensitive for the smallest spectral gaps; such small gaps turn out to correspond to regimes where the low-frequency variability is more pronounced and autocorrelations decay more slowly. The present approach is applied to analyze the rough parameter dependence encountered in key statistics of an El Niño–Southern Oscillation model of intermediate complexity. Theoretical arguments, however, strongly suggest that such links between model sensitivity and the decay of correlation properties are not limited to this particular model and could hold much more generally.

This paper presents a predictability study of the Madden-Julian Oscillation (MJO) that relies on combining empirical model reduction (EMR) with the “past-noise forecasting” (PNF) method. EMR is a data-driven methodology for constructing stochastic low-dimensional models that account for nonlinearity, seasonality and serial correlation in the estimated noise, while PNF constructs an ensemble of forecasts that accounts for interactions between (i) high-frequency variability (noise), estimated here by EMR, and (ii) the low-frequency mode of MJO, as captured by singular spectrum analysis (SSA). A key result is that, compared to an EMR ensemble driven by generic white noise, PNF is able to considerably improve prediction of MJO phase. When forecasts are initiated from weak MJO conditions, useful skill extends up to 30 days. PNF also significantly improves MJO prediction skill for forecasts that start over the Indian Ocean.

Interannual and interdecadal prediction are major challenges of climate dynamics. In this article we develop a prediction method for climate processes that exhibit low-frequency variability (LFV). The method constructs a nonlinear stochastic model from past observations and estimates a path of the “weather” noise that drives this model over previous finite-time windows. The method has two steps: (*i*) select noise samples, or “snippets”, from the past noise, which have forced the system during short-time intervals that resemble the LFV phase just preceding the currently observed state; and (*ii*) use these snippets to drive the system from the current state into the future. The method is placed in the framework of pathwise linear-response theory and is then applied to an El Niño–Southern Oscillation (ENSO) model derived by the empirical model reduction (EMR) methodology; this nonlinear model has 40 coupled slow and fast variables. The domain of validity of this forecasting procedure depends on the nature of the system’s pathwise response; it is shown numerically that the ENSO model’s response is linear on interannual time scales. As a result, the method’s skill at a 6- to 16-month lead is highly competitive when compared with currently used dynamical and statistical prediction methods for the Niño-3 index and the global sea surface temperature field.