Introduction
Visual summary of the fully probabilistic source inversion algorithm
PRISM presented in the companion paper, on the example
of a magnitude-5.7 earthquake in the US state of Virginia on 23 August 2011.
(a) Candidate source solutions are evaluated according to the
cross-correlation fit they produce between observed broadband, teleseismic
P waveforms (black) or SH waveforms (blue), and their modelled
counterparts (red). The present study is concerned with quantifying the noise
distribution on these cross-correlation measurements CC – one scalar per
source–receiver pair, 48 in total for this earthquake. (b) To
reduce the dimensionality of the model space to a number accessible to
Bayesian sampling, the source time function (STF) is parameterised as a
linear combination of 15 empirical orthogonal functions found to best span
the space of a large set of 900 reference STFs. (c) The “Bayesian beach ball”, a visual average of
the posterior ensemble of well-fitting solutions, conveys not only the nature
of the moment tensor but also the magnitude and nature of its uncertainties.
(d) The marginal probability of the hypocentre depth. (e)
Weighted average of STFs from the posterior ensemble of good solutions
permits assessment of the uncertainties in STF shape. This STF is clearly
unimodal and of less than 5 s duration. (f) As a secondary benefit,
this procedure yields the uncertainties (standard deviations) of
cross-correlation travel time measurements at all stations, and their
inter-station correlations. Travel times are the primary input data for
seismic tomography, and these insights into their uncertainties are not
readily available from other methods.
The quantitative estimation of seismic source characteristics is one of the
most important inverse problems in geophysics, from both scientific and
societal points of view. Source parameters not only can be used to locate
earthquakes and to understand earthquake mechanisms and their implications for
tectonic settings and seismic hazard, but they are also important in seismic
tomography, where accurate source information is a prerequisite for achieving
optimal fits between observed and modelled (waveform) data.
Estimation of seismic source parameters includes an earthquake's location,
depth, fault plane and temporal rupture evolution. The inverse problem is
non-linear, and parameter correlations result in trade-offs and
non-uniqueness, e.g. the well-known correlation between dip and scalar
moment. Source depth is a particularly
challenging parameter; for example, inversions often find multiple
local minima in waveform data misfits as a function of depth, even when source time functions (STFs)
are explicitly estimated. This makes global search methods and ensemble
sampling particularly attractive if the associated computational hurdles can
be surmounted. For finite-fault inversion of large earthquakes, Bayesian
methods have been developed in recent years, as they also have been for
non-kinematic inversions of regional events, but we focus on the inversion of source time functions of
intermediate-sized events (mb 5.5 to 7.5) from broadband, teleseismic
waveforms.
In a companion paper, we developed the PRobabilistic Inference of Source Mechanisms (PRISM) algorithm,
a fully probabilistic inversion for source depth, moment tensor and
STF, via sampling by both stages of the neighbourhood
algorithm (NA). Figure
sums up the procedure and its results.
The need for PRISM arose from our work in global-scale waveform tomography,
which fits broadband body-wave seismograms of moderate to large earthquakes
to modelled synthetics, up to the highest occurring frequencies
(≈1 Hz). This can only be achieved with good a priori
estimates of source depth, which strongly shapes the synthetic Green's
functions, and of source time functions, which convolve the Green's
functions. At the time, no data centre delivered routine estimates of
broadband STFs (by now, efforts other than ours are
underway). Hence we developed a
linearised, iterative approach that semi-automatically deconvolved broadband
source time functions, source depths and moment tensors for more than 2000
earthquakes, which were subsequently used in several waveform tomographies
.
The required human supervision time called for full automation,
preferably in a Bayesian setting that would circumvent the occasional
divergence of the non-linear optimisation and would automatically diagnose
parameter trade-offs of the kind described above. PRISM solved
this problem, but we left the justification of its misfit criterion and the
derivation of its noise model and likelihood function to the present study.
To render ensemble sampling with the NA computationally feasible, the
dimensionality of the model parameter space has to be as small as possible,
preferably fewer than 20. Depth is one parameter, and a normalised description
of the moment tensor requires five more (a more rigorous and uniform
parameterisation of the moment tensor has since been
derived elsewhere). Although latitude and longitude could easily be
added to this list, we do not consider them here, because the lateral
location problem is adequately addressed by existing data centres, e.g. the National Earthquake Information Center
(NEIC), and in any case we would re-estimate all hypocentres at
the time of tomographic inversion. The STF is a high-dimensional parameter
vector, which earlier studies parameterised simply as a
time series of 256 unknowns (10 Hz sampling rate, 25.6 s length). To reduce
its dimensionality for Bayesian sampling, we made use of a
dataset of >2000 deterministic earthquake source solutions (depth, moment
tensor and STF) obtained previously. We selected the 900
best-constrained STFs and decomposed this set into empirical orthogonal
functions (EOFs), denoted s_l(t). Any broadband STF s(t) of
events up to magnitudes of about 7.5 is well described by a linear
combination of the first L EOFs, where L ≈ 15 delivers sufficient
accuracy for our purpose: s(t) = ∑_{l=1}^{L} a_l s_l(t). These EOFs s_l(t), shown in
Fig. b, are the primary means by which we feed a
priori expert knowledge into the Bayesian sampling problem. PRISM's STF
parameterisation consists of the first L EOF weights al,
bringing the total dimensionality of the parameter space to ≈20.
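The EOF construction and projection described above can be sketched numerically. The following uses random placeholder data in place of the actual 900 reference STFs, so all shapes and values are assumptions for illustration only:

```python
import numpy as np

# Illustrative sketch of the EOF parameterisation of STFs.
# The reference STF matrix here is random placeholder data, NOT the
# actual dataset of 900 deterministic solutions used in the paper.
rng = np.random.default_rng(0)
n_stf, n_t = 900, 256                    # 900 STFs, 256 samples (10 Hz, 25.6 s)
S = rng.standard_normal((n_stf, n_t))    # stand-in reference STFs, one per row

# SVD of the STF matrix; rows of Vt are the empirical orthogonal functions
_, _, Vt = np.linalg.svd(S, full_matrices=False)
L = 15
eofs = Vt[:L]                            # first L EOFs s_l(t), orthonormal rows

# any STF s(t) is approximated as s(t) = sum_{l=1}^{L} a_l s_l(t)
s = S[0]
a = eofs @ s                             # EOF weights a_l
s_approx = a @ eofs                      # low-dimensional reconstruction
```

In this parameterisation, the 15 weights a_l are the only STF unknowns that enter the sampling.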
This space is sampled by both stages of the neighbourhood algorithm,
resulting in an ensemble of source solutions m (cf.
Table ). From this ensemble, marginal probabilities for
any model parameter can be estimated, e.g. for the depth
(Fig. d) or the STF
(Fig. e). As a visual means of conveying
uncertainties in the moment tensor, we invented “Bayesian beach ball” plots
(Fig. c), a superposition of many beach ball
representations in the a posteriori ensemble. A valuable side benefit is
full uncertainties on travel time measurements ΔTj at stations j.
These travel time delays are incidental in the context of source inversion
(as the time shifts between observed and synthetic seismograms that maximise
the cross-correlation coefficients CCj, Fig. f),
but they represent the primary input data for our seismic waveform
tomographies.
The primary measure of fit (or “input data”) for PRISM's source inversions
is the CCj. When parameter estimation is performed as a deterministic
optimisation problem, (only) a relative measure of fit or misfit is required:
the optimal solution is the one that yields the smallest misfit between
observations and model predictions, in our case the largest possible values
of cross-correlation coefficients CCj. By contrast, Bayesian parameter
estimation requires not just a measure of misfit but also a likelihood
function for it, which is derived from the probability distribution on the
data (the “noise model”). In the absence of a noise model, the likelihood of a
randomly drawn candidate solution cannot be evaluated. Obtaining a noise
model for a misfit requires much more information about the measurement
process and its statistics than the mere adoption of a misfit measure. This
is the big challenge of Bayesian “inversion”, which will be covered in this
paper.
Section argues for the adoption of the signal
decorrelation D=1-CC as a robust measure of misfit, where CC is
the normalised cross-correlation coefficient (Table ).
To our knowledge, the decorrelation D of seismological waveforms has not
previously been used as a misfit criterion in Bayesian inference, because its noise model and likelihood function were
unknown – a shortcoming D shared with other deterministic misfit choices,
such as the instantaneous phase coherence, time–frequency phase
misfits or multi-tapers.
Section shows that the popular ℓ2 and
ℓ1 norms would be sub-optimal misfit criteria
because noise in seismic signals is not simply additive Gaussian or
Laplacian but rather partly signal-generated, i.e. highly correlated across time
samples and stations, and better described by a transfer function.
Figure shows an example of this systematic noise
“coda”. Section defines the general
requirements of a good misfit criterion, and Sect.
demonstrates that the signal decorrelation D performs more robustly than
sample-by-sample (ℓp) norms on realistic seismological waveform data.
To identify a likelihood function L(m|d) of misfit
D in Sect. , we draw once more on
the prior knowledge contained in our set of deterministic source solutions
for 900 earthquakes and on the 200 000 measurements of CC=1-D
made to obtain them. From this large, representative and highly
quality-controlled dataset of confident source solutions, we obtain the
statistics of the residual misfits D, which we use to construct an
empirical likelihood L∗(m|d). Thus we can
instruct the probabilistic inversion to explore subspaces of solutions
m that yield similarly low levels of misfit D as these
best-fitting deterministic solutions.
Section presents a worked example for the construction
of a likelihood function L(m|d) from data of a
typical earthquake, the 2011 Virginia event used throughout this paper and
its companion . We conclude with a discussion in
Sect. .
Noise and misfit criteria
Three noise cases for compressional (P) waves in source
inversion; the waveforms were produced by the M 5.7 earthquake in Virginia
(23 August 2011). Station BFO has a high signal-to-noise ratio (no wiggles
preceding the P pulse), and the waveform is fit well by a WKBJ
synthetic using our best source solution for this earthquake. Station LPAZ
has a high signal-to-noise ratio, but 3-D structure produces a strong coda
following the P pulse, i.e. signal-generated, systematic “noise”
not fit by the synthetic waveform. Station LCO has a low signal-to-noise
ratio and a coda. Since the coda cannot be modelled, it must be considered
noise, albeit of a systematic nature and correlated across time samples and
across stations. By contrast, ambient noise is random and not correlated
across stations, only across time samples (since the signal is band-limited).
Bayesian inference
Bayesian inference estimates the posterior distribution π(m) of
the parameters m given d, using the prior distribution
p(m) of the model parameters m and the likelihood
L(m|d) of the data d, given the model
m, by applying Bayes' rule:
π(m|d) = (1/p(d)) · L(m|d) · p(m).
p(d) is the prior distribution of the data d and does not
depend on the experiment. A likelihood function
L(m|d) is equivalent to the probability distribution
p(d|m) of data d given the model parameters
m. It depends on the difference
between measured data d and predicted data g(m). This
difference or misfit is defined, following convention, as
Φ(d, g(m)) = −ln(L(m|d)),
so that a model with a high likelihood has a small misfit. Since the
likelihood of a model can vary by orders of magnitude, the logarithm brings
the misfit back to a natural scaling.
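A toy numerical illustration of Bayes' rule and the misfit–likelihood relation Φ = −ln L; the model grid, prior and misfit values below are hypothetical:

```python
import numpy as np

# Toy illustration of Bayes' rule and Phi = -ln(L) on a 1-D model grid.
# Grid, prior and misfit values are hypothetical.
models = np.linspace(0.0, 1.0, 5)                # candidate model parameter values
prior = np.full(models.size, 1.0 / models.size)  # flat prior p(m)
phi = (models - 0.6) ** 2 / 0.02                 # assumed misfit Phi(m|d)
likelihood = np.exp(-phi)                        # L(m|d) = exp(-Phi)
posterior = likelihood * prior
posterior /= posterior.sum()                     # normalisation acts as 1/p(d)
```

The grid point closest to the assumed misfit minimum receives the highest posterior probability, as expected.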
The exact formula for L(m|d) depends on the assumed
noise model and potential error sources in the forward model.
Equation () requires that the misfit criterion
take those into account as well. Next, we will show that this is
straightforward only for specific assumptions about the noise, which are
usually not realistic.
Metric-based misfit criteria
“Good” solutions m are associated with small misfits Φ,
where the exact definition of Φ depends on the nature of the data
d, which may be hand-picked arrival times; dispersion curves; or, in
our case, seismic displacement time series (“waveforms”). A waveform misfit
is generally a functional Φ^W: ℝ^N × ℝ^N → [0, ∞) on d, g(m) ∈ ℝ^N.
The misfit functional has similar properties to a metric on RN,
but it should be noted that there is no natural choice; rather, its
choice implies a strong assumption of prior knowledge about the statistical
properties of the noise on d. In the case of seismic waveform data,
the data vector d is the measured time-sampled seismogram
u_i, and the separate data are the samples u_i, i = 1, …, n,
of this time series. The vector g(m) is the synthetic
seismogram u_i^c, i = 1, …, n, predicted by the
forward operator g for the model m.
When the method of least squares is used to calculate the ℓ2 misfit,
Φ^W_{ℓ2}(m|d) = k′ · ½ (d − g(m))^T S_D^{−1} (d − g(m)),
the assumption is that the noise ϵ is additive and Gaussian-distributed:
d = g(m) + ϵ,  ϵ ∼ N(0, S_D).
The [N × N] data covariance matrix S_D ∈ Sym_N describes the correlation between the errors of individual
measurements d_i; k′ is a normalisation constant.
In the case of a seismic waveform u_i, Φ^W is
Φ^W_{ℓ2}(m|d) = k′ ∑_{i=1}^{n} ∑_{i′=1}^{n} (u_i − u_i^c) (S_D^{−1})_{i,i′} (u_{i′} − u_{i′}^c),
and S_D describes mainly the band-limited spectrum of
environmental noise. Since a simple time shifting of u_i or
u_i^c will violate the assumption of
Eq. (), the u_i or u_i^c need to be
aligned first. Because we assume this noise to be time-invariant, we can
build S_D from the autocorrelation function
R_ϵϵ of the (discrete) noise time series ϵ_i.
S_D is a Toeplitz matrix, whose rows are shifted
instances of the autocorrelation function R_ϵϵ:
S_{D,k,k+l} = R_ϵϵ(l) = ∑_{i=1}^{n} ϵ_i ϵ_{i−l}.
Examples of how to construct
S_D under the assumption of an autoregressive (AR)
noise model can be found in the literature.
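A minimal sketch of this construction, with a synthetic noise window standing in for a real pre-event recording (lengths and units are assumptions):

```python
import numpy as np
from scipy.linalg import toeplitz

# Sketch: estimate R_ee from a noise window, build the Toeplitz S_D,
# and evaluate the correlated l2 misfit for a residual u - u^c.
# All series here are synthetic placeholders.
rng = np.random.default_rng(1)
eps = rng.standard_normal(2048)       # stand-in pre-event noise recording
n = 64                                # length of the misfit window

# biased autocorrelation estimate R_ee(l) for lags 0..n-1
acf = np.array([eps[l:] @ eps[:eps.size - l] for l in range(n)]) / eps.size
S_D = toeplitz(acf)                   # rows are shifted copies of R_ee

residual = rng.standard_normal(n)     # (u - u^c) after alignment
phi_l2 = 0.5 * residual @ np.linalg.solve(S_D, residual)
```

The biased autocorrelation estimate keeps the Toeplitz matrix positive semidefinite, so the quadratic form is well defined.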
For the estimation of the parameters m of one earthquake source, we
would normally use seismograms measured at different stations, cut into a
total of n_S time windows u_i, indexed by j. The
overall misfit Φ(m) for a source solution is composed of
the misfits of the single waveforms Φ^W_{ℓ2,j}(m). If the
noise on each waveform j is assumed to be uncorrelated with the noise on
all others, then it is legitimate to define the overall misfit as being
simply additive:
Φ(m) = ∑_{j=1}^{n_S} Φ^W_{ℓ2,j}(m).
If the noise on the waveforms is correlated, then
Eq. () has to be extended, such that d,
m and SD contain all time samples of all
waveforms recorded at different stations. This effort has – to our best
knowledge – not been made in seismic inverse problems.
If each measurement i is considered to be uncorrelated with the others and
has a variance σ_i², then S_D is a diagonal
matrix with diagonal elements σ_i², and Eq. ()
reduces to
Φ^W_{ℓ2}(m|d) = (k′/2) ∑_{i=1}^{N} (d_i − g_i(m))² / σ_i²
or, in the case of waveforms,
Φ^W_{ℓ2}(m|d) = (k′/2) ∑_{i=1}^{N} (u_i − u_i^c)² / σ_i².
With a set of n_S waveforms u_{i,j}, the total misfit defined
in Eq. () becomes
Φ = (k′/2) ∑_{j=1}^{n_S} ∑_{i=1}^{n_j} (u_{i,j} − u_{i,j}^c)² / σ_i²,
the weighted least-squares criterion.
If the noise can be described well by the normal distribution, the
ℓ2 norm can be applied successfully. It is, however, very sensitive to
data d_i deviating strongly from the prediction g_i(m). Outlier
samples can dominate the whole inversion process, while the residual misfit
of almost-fitting parts of the waveform has no influence. Experience shows
that realistic noise on seismic waveforms usually has more outliers than
predicted by Eq. ().
Hence, it has been proposed to use the more outlier-resistant
ℓ1 norm as a misfit criterion between observed and modelled seismograms.
The underlying assumption is that noise on the time samples u_i is independently
Laplace-distributed with width b_i, i.e. has no temporal correlation:
d = g(m) + ϵ,  ϵ_i ∼ Laplace(0, b_i),
Φ^W_{ℓ1}(m|d) = ∑_i ( |d_i − g_i(m)| / b_i + ln 2b_i ).
Time samples of realistic, band-limited seismograms are strongly correlated,
which calls for the use of multivariate Laplace distributions. This is the
subject of ongoing research, but the
resulting probability density functions (PDFs) are still too complex to be used in
ensemble inference. To make things worse, seismograms recorded at different
stations j will generally also be correlated. Hence the simplicity of the
univariate Laplace distribution is not applicable, and the robustness of the
ℓ1 norm currently cannot be harnessed.
Other authors have proposed to use misfits based on general ℓp norms
(e.g. p = 1.5), which allow the robustness
of the misfit to be tuned to the noise on the data:
Φ^W_{ℓp}(m|d) = ( ∑_{i=1}^{n} |d_i − g_i(m)|^p / σ^p )^{1/p}.
The underlying noise model is an exponential power distribution. However, all
problems described for the ℓ1 norm apply here as well, and no
multivariate forms exist in general.
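The ℓp misfit above can be illustrated with a short sketch; the data vectors and σ are stand-ins, not real seismograms:

```python
import numpy as np

# Minimal sketch of the l_p misfit for independent samples; d, g(m)
# and sigma are illustrative stand-ins, not real seismograms.
def phi_lp(d, g, sigma, p):
    return (np.sum(np.abs(d - g) ** p) / sigma ** p) ** (1.0 / p)

d = np.array([0.0, 1.0, 2.0, 10.0])   # last sample is an outlier
g = np.array([0.0, 0.5, 2.0, 2.0])    # prediction g(m)
for p in (1.0, 1.5, 2.0):
    # the outlier's relative contribution to the misfit grows with p
    print(p, phi_lp(d, g, 1.0, p))
```

Varying p between 1 and 2 interpolates between the Laplace-like and Gaussian-like weighting of residuals.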
In summary, it is tempting to choose ℓp misfits based on the
time-sample-wise distance between observed and modelled waveforms, because the
underlying noise models are straightforward to state (uncorrelated or
correlated Gaussian, uncorrelated Laplacian) and to translate into
corresponding likelihood functions. Unfortunately, these noise models are
very crude approximations of the pervasive noise characteristics and
correlation found in real time series.
These serious shortcomings motivate our proposal of alternate misfit
criteria.
Noise-model-based misfit
In a Bayesian context, the likelihood L(m|d) is
defined by the noise model on the data. An equivalent function
L∗(m|d) can be constructed from the distribution
p(F) of any functional F: ℝ^n × ℝ^n → [0, ∞) of the observed and predicted waveforms
u_i, u_i^c. In our attempt to move beyond F being a
sample-wise distance between u_i and u_i^c, we
generally want a candidate F to meet the following conditions:
For u_i = u_i^c, F should take a fixed value, say 0.
With decreasing similarity of ui and uic, F should increase, irrespective of the exact definition of similarity
(Sect. will consider this further).
F should be robust against time shifts Δt = k·dt or amplitude errors a affecting the waveform u_i, i.e.
F(a·u_{i+k}, u_i^c) ≈ F(u_i, u_i^c) for any a ∈ ℝ, k ∈ ℕ,
because such unknown time shifts will affect real-world seismograms.
F should have discriminative power with respect to the model parameters m, combined with robustness against realistic noise and theoretical errors.
Concerning the noise, we need to be able to calculate the distribution of F
for a waveform afflicted by the typical three error sources: background
noise, waveform modelling error and instrument error.
Ambient noise ϵ_noise: this is noise from man-made or natural sources around the receiver.
It can be described very well by an additive term ϵ_noise ∼ N(0, S) (see Eq. ).
Waveform modelling error Tmodel,i:
the synthetic waveform uic can never be identical to the observed ui, even in the absence of ambient noise. In the context of source modelling, the earth's impulse response
(Green's function) can be considered a linear, time-invariant operator that acts on the source time function. The calculation of this Green's function is not perfect (e.g. due to errors in the earth model or
imperfect computational methods).
This systematic error has been called the theoretical density function, with the proposal to model it by an additive term on u_i^c, but we think that it should rather
take the form of a transfer function T_{model,i} between u_i and u_i^c, which will hopefully be Dirac-like in character. However, T_{model,i}
will include the site response (receiver-side reverberations), which can create a strong waveform coda; see Fig. . Hence, T_{model,i} could in practice be rather oscillatory.
Instrument error T_{inst,i}: a displacement seismogram u_i is assumed to have been corrected for the instrument response of its seismic sensor. In practice, this correction
may be imperfect, e.g. due to erroneous sensor metadata. We model this systematic error by another (hopefully Dirac-like) transfer function T_{inst,i} convolving u_i.
In summary, the difference between a modelled uic and
observed waveform ui is
u_i = u_i^c ∗ T_{model,i} ∗ T_{inst,i} + ϵ_{noise,i}.
It is this complex mixture of noises that the misfit criterion F should be
robust against while retaining discriminatory power towards the source model
parameters m.
Next, we will test the signal decorrelation D as an alternative to
ℓp norms against these four criteria.
Signal decorrelation coefficient as a misfit
Comparison of the ℓ1, ℓ2 norms and the signal
decorrelation D = 1 − CC as misfit criteria in noisy signals. A
perturbed synthetic waveform upertc for a 10 km deep explosion
source, measured at a station at 40∘ epicentral distance, was
compared to synthetic seismograms uc for other depths, using the three
misfit criteria. The shaded colours mark the 95 % quantiles of the misfit
values, calculated by perturbing the reference waveform with different random
seeds. The figure shows the relatively high robustness of the
cross-correlation coefficient in recognising reference signals in perturbed
measurements. For better visualisation, all misfit values have been
normalised separately to have an average value of 1 between 20 and 30 km.
Distance, in standard deviations, between the misfit value for the true
source depth and the plateau for depths of 20–30 km. See
Fig. for waveforms and misfit curves. The “weak-perturbation”
curve is calculated with perturbation factor α=0.1, and
the “strong-perturbation” curve with α=0.9 (see
Eq. ). For all SNR values, the decorrelation has a
higher discriminative power than ℓ1 or ℓ2.
We choose the signal decorrelation D as a misfit criterion, defined as
D(u_i, u_i^c) = 1 − max_k { CC_k(u_i, u_i^c) },
where
CC_k(u_i, u_i^c) = ∑_{i=1}^{n} (w_i u_{i−k}^c · u_i) / √( ∑_{i=1}^{n} (w_i u_{i−k}^c)² · ∑_{i=1}^{n} (w_i u_i)² )
is the normalised cross-correlation coefficient and k is the time delay
between u_i^c and u_i for which the normalised
cross-correlation function CC_k(u_i, u_i^c)
takes its maximum value. w_i is a window function that allows one to
select a time window for the cross-correlation measurement. D satisfies
three of the four criteria that we desired of a misfit in the last section:
three of the four criteria that we desired of a misfit in the last section:
D(u_i, u_i^c) takes the value 0 for identical signals u_i^c ≡ u_i,
since CC_{k=0}(u_i, u_i^c) = 1.
For u_i ≠ u_i^c, 0 < D(u_i, u_i^c) < 2, i.e. D values larger than for the case
u_i^c ≡ u_i, and D(u_i, u_i^c) increases with decreasing similarity of u_i and u_i^c.
If a time shift k′ is small compared to the window length, we have
CC_k(u_i, u_i^c) ≈ CC_{k+k′}(u_i, u_{i+k′}^c) and thus D(u_i, u_i^c) ≈ D(u_i, u_{i+k′}^c).
Due to the normalisation in Eq. (), D is amplitude-independent:
CC(u_i, u_i^c) = CC(u_i, a·u_i^c) and thus D(u_i, u_i^c) = D(u_i, a·u_i^c).
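These properties can be checked numerically. The sketch below assumes a Gaussian test pulse, a window w_i ≡ 1, and a simplified normalisation using full-trace energies (a good approximation when the window covers the whole signal):

```python
import numpy as np

# Sketch of D = 1 - max_k CC_k with w_i = 1; the normalisation uses
# full-trace energies for simplicity, a good approximation when the
# window covers the whole signal.
def decorrelation(u, uc):
    xcorr = np.correlate(u, uc, mode="full")          # all lags k
    norm = np.sqrt(np.sum(u ** 2) * np.sum(uc ** 2))
    return 1.0 - xcorr.max() / norm

t = np.linspace(0.0, 1.0, 200)
u = np.exp(-((t - 0.5) / 0.05) ** 2)                  # Gaussian test pulse

print(decorrelation(u, u))                     # identical signals: D = 0
print(decorrelation(u, 3.0 * np.roll(u, 10)))  # shifted + scaled: D stays ~0
```

The second call illustrates the shift- and amplitude-invariance properties listed above.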
The fourth criterion, discriminative power and robustness against noise, is
less straightforward to demonstrate. We proceed empirically by showing its
superior performance over the ℓ2 and ℓ1 misfits on an example of
the kind of waveforms we typically use for source inversion.
Figure shows in black a simulated, broadband, noise-free
P wave train, recorded at 40∘ epicentral distance. The
seismograms were modelled using the WKBJ method in the
IASP91 velocity model, assuming an explosion source with
M_0 = 10^20 N m. Since the chosen source depth is shallow (10 km), the
P pulse is followed within seconds by depth phases like pP,
which effectively permits inversion for source depth. However, once this
waveform gets perturbed by realistic modelling error (convolutive) and
additive noise, resulting in the red waveform, the fit to the unperturbed
original becomes tedious. A meaningful robustness test is as follows: if the perturbed
(red) waveform is modelled for different candidate source depths, will the
smallest misfit be achieved for the perturbed wave simulated at the correct
depth of 10 km? This is a meaningful test of robustness, because source
depth tends to be the most challenging parameter to retrieve in source
inversions. Algorithmically, the perturbation is done in two steps:
Perturbation by convolution with a “modelling error function” T_{error,i}, which encompasses the effects of T_{model,i} and T_{inst,i}.
It is defined as having a unit amplitude spectrum and a random phase spectrum between 0 and α·π/2:
u_{m.e.} = u_i^c ∗ T_{error,i}.
This method adds a realistic coda to the waveform, which simulates the effects of structure that was not included in the forward simulation. The parameter α regulates the perturbing
effect of the modelling error function.
By adding a band-limited noise term,
u_pert = u_{m.e.} + βϵ, where ϵ ∼ N(0, S_D);
the covariance matrix S_D is set to model band-limited noise with corner frequencies of (1/15, 1/6) Hz, similar to microseismic background noise at the seismic station.
The peak amplitude is normalised to that of u_i^c, so that the parameter β controls the relative amplitude of this noise term.
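The two perturbation steps can be sketched as follows; the pulse shape, filter implementation and parameter values are illustrative assumptions, not the exact procedure used for the figures:

```python
import numpy as np

# Sketch of the two-step waveform perturbation: (1) convolution with a
# unit-amplitude, random-phase "modelling error" transfer function,
# (2) addition of band-limited noise. All details are illustrative.
rng = np.random.default_rng(42)

def perturb(uc, alpha, beta, dt=0.1):
    n = len(uc)
    # step 1: random-phase transfer function with |T(f)| = 1
    phase = rng.uniform(0.0, alpha * np.pi / 2, n // 2 + 1)
    u_me = np.fft.irfft(np.fft.rfft(uc) * np.exp(1j * phase), n)
    # step 2: Gaussian noise crudely band-passed to (1/15, 1/6) Hz
    freqs = np.fft.rfftfreq(n, dt)
    noise_spec = np.fft.rfft(rng.standard_normal(n))
    noise_spec[(freqs < 1 / 15) | (freqs > 1 / 6)] = 0.0
    noise = np.fft.irfft(noise_spec, n)
    noise *= beta * np.abs(uc).max() / np.abs(noise).max()
    return u_me + noise

uc = np.exp(-((np.arange(256) * 0.1 - 12.8) / 0.5) ** 2)  # stand-in P pulse
u_pert = perturb(uc, alpha=0.4, beta=0.8)
```

Here α scales the random phase perturbation (the convolutive modelling error) and β the relative noise amplitude, mirroring the roles described above.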
Figure shows the resulting reference waveform (left) and
perturbed waveforms for α=0.4 and β=0.8, i.e. moderate
perturbation of the signal and strong background noise. The unperturbed
waveform ui is plotted in solid, thin black, the waveform
perturbed with modelling error um.e. in dotted blue and the
resulting reference trace in solid red. It bears little resemblance to the
unperturbed waveform.
The right plot shows the value of the three waveform misfits ℓ1,
ℓ2 and D between uic and upert
over varying source depths. It simulates an inversion for the depth of an
earthquake using seismic waveforms. The waveform contains the P and
pP arrivals. The depth is mainly constrained by the relative arrival
times of these phases and the resulting waveform of the whole P–pP wave
train. The perturbation of Eq. () adds
artificial coda with additional arrivals to the waveform, which a good
waveform misfit should be robust against. The misfit should have a
distinctively lower value for the “true” depth of 10 km than for any of
the others. To take into account the stochastic nature of these
perturbations, 500 realisations of upert were calculated
for the same parameters, α and β, but with different random numbers.
The coloured shades mark the 95 % (2σ) quantiles of the misfit
values; the solid line marks the median.
The ℓ2 misfit can no longer recognise u_i^c in
u_pert and assigns the lowest misfit to a depth of
3 km. An analysis of different noise and perturbation levels shows that the
ℓ2 norm is relatively robust against background noise, but not against
perturbations from a modelling error; see Fig. S1 in the Supplement. This
seems reasonable given the underlying noise model of this misfit.
The ℓ1 norm does better, in that it has a minimum at 9 km depth, close
to the true value. The zigzag shape, however, suggests that the value of 9 km
is a stochastic artefact. The median value at 9 to 10 km reaches only slightly below
the lower quartile for other depths, meaning that in practice the resolution
power of the ℓ1 norm for this kind of problem will be very limited. The
studies for different noise and perturbation levels show that it is generally
more robust against background noise and modelling error than the
ℓ2 norm but less so than the cross-correlation coefficient.
The cross-correlation misfit shows the largest difference between the plateau
of wrong depth solutions and the true one. For low noise levels, its minimum
is slightly wider than that of the ℓ1 norm. More values of α
and β are shown in Fig. S1. The analysis of the
confidence intervals shows that the values for CC scatter slightly more
than the ones for ℓ2 and much more than for ℓ1. To employ it in
Bayesian inference, a detailed analysis of the statistical properties will be
necessary. The analysis also shows that the actual values of D are
influenced more strongly by the background noise level than by the modelling
error. We will use that observation in Sect. .
Figure compares the resolution power of the three
misfits for different perturbation levels and signal-to-noise ratios (SNRs). It
shows the difference between the misfit value for the true depth 10 km and
the average misfit value for the depths between 20 and 30 km. The difference
is expressed in numbers of standard deviations (sigmas) from the 500 separate
noise realisations. The dashed line shows the result for weak perturbation
(α = 0.1), and the solid line for strong perturbation (α = 0.9). It
can be seen that, for strongly perturbed waveforms, the ℓ1 and ℓ2
norms cannot recognise the true depth at more than 2σ, even for high
signal-to-noise ratios, while the decorrelation D stays well above 3σ, even for SNRs of 6.
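The separation statistic used here can be sketched as follows, with synthetic placeholder misfit distributions rather than the paper's actual results:

```python
import numpy as np

# Sketch of the "distance in standard deviations" statistic: how far the
# misfit at the true depth lies below the plateau of wrong depths, in
# sigmas of the noise realisations. All numbers are synthetic placeholders.
rng = np.random.default_rng(7)
n_real = 500
misfit_true = rng.normal(0.2, 0.05, n_real)     # misfit at true depth
misfit_plateau = rng.normal(0.8, 0.05, n_real)  # average misfit, 20-30 km

# separation measured in sigmas of the true-depth misfit scatter
separation = (misfit_plateau.mean() - misfit_true.mean()) / misfit_true.std()
print(round(separation, 1))
```

A separation above roughly 2–3σ means the true depth is reliably distinguishable from the plateau across noise realisations.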
Empirical likelihood function for the signal decorrelation
Empirical likelihood function obtained from high-quality, deterministic source estimates
Probability distribution of D, the decorrelation of measured
and synthetic P waveforms used for deterministic source inversions.
(a) Empirical histogram of D is shown as grey bars. From 200 000
broadband, teleseismic P waveforms for 900 earthquakes, only
waveforms with signal-to-noise ratios between 20.0 and 21.0 were considered
for this figure (because the scaling parameters of analytic fitting functions
depend mainly on SNR). Coloured lines show best-fitting realisations of three
analytic probability density distributions: beta (red), exponential (green)
and log-normal (blue). The log-normal distribution yields the best fit to data.
(b) Quantile–quantile plot for the three candidate distributions of
(a) confirms that the log-normal distribution best fits the
empirical histogram of D. The values on the x axis are percentiles of the
cumulative histogram of D in our dataset. The y axis shows the
percentiles of the best-fitting distribution of each class. The closer the
percentiles are to the line y=x, the better the fit of the distribution to
the underlying data over the entire range of values. Both subfigures indicate
that a log-normal distribution best fits the values of D=1-CC.
In seismology, the cross-correlation coefficient CC=1-D has been
used as a measure of goodness of fit to detect predicted waveforms in noisy
signals, to filter bad recordings, to detect
temporal changes in repeating signals and to
estimate the spatial extents of earthquake clusters. It has only very rarely been used as a misfit criterion in
source inversion, and CC and D = 1 − CC have not been used
in probabilistic inversion; the main obstacle would have been their
unknown statistics.
We present an empirical solution to this problem by drawing on a large,
pre-existing database of cross-correlation measurements that we assembled in
the context of deterministic source inversions, as described in Section 1.
Essentially we assert that our human expert knowledge and extensive
experience have generated a large, representative and highly
quality-controlled set of 900 teleseismic source parameter estimates that are
sufficiently close to the true source parameters to reveal the statistics of
the noise in the measurements d these estimates m are
based upon. The measurements d consisted of 200 000
cross-correlation coefficients CC obtained from 200 000 broadband fits of
observed seismograms to WKBJ synthetics. The synthetic waveforms were
calculated using the WKBJ method in the velocity model IASP91,
with attenuation and density taken from PREM. To
the extent that our source solutions m_j approach the true source
parameters m_{0,j}, the histogram of the CC (or D = 1 − CC)
values approximates the probability density function of CC (or D) in the
presence of noise and modelling errors. Thus we can obtain an “empirical
likelihood function” L∗(m|d) even in the absence of
an analytically describable noise model. We preface the term “likelihood”
by “empirical” because strictly speaking the likelihood would be associated
with the noise model on the raw samples i, rather than with the noise on
the composite measure D. A similar approach has been adopted independently
and recently in the context of receiver-function
inversion. Note that the term “empirical likelihood” has been used
differently in statistics.
Our reasoning and procedure can be summed up as follows:
We can consider the measurements of the misfit functional Φ_j(m_0|d) for one earthquake at j = 1, …, n_S
recording receivers as realisations of a random process that follows a yet unknown probability density function p(x). m_0 are the true source parameters, and any misfit
Φ_j is therefore due to ambient noise and modelling errors in the seismograms, as described in Sect. .
In practice we never get to know m0 but only a (hopefully close) estimate mest, the result of a deterministic source inversion procedure.
Hence all we can actually observe is
Φ(mest|d), some of which is due to the
estimation error mest-m0. However, by
estimating mest carefully and repeatedly (for 900
different earthquakes), and by considering the resulting 900 sets of misfits
Φ (at 200 000 source–receiver pairs) jointly, the histogram of their
200 000 D values should approximate a histogram of the true
Φ(m0|d) as closely as we can hope to get.
Figure a shows this empirically obtained histogram
Φcumulative of D in grey (for the subset of P seismograms
with an SNR of 20; the reason for this choice is discussed below).
To evaluate the likelihood of a misfit value Φ′ encountered in a future (Bayesian) inversion, we could in principle compare it to this empirical
histogram Φcumulative. It would however be more convenient and computationally efficient to identify an analytic expression for the p(x) that
produced this histogram Φcumulative and to evaluate any Φ′ against this p(x).
The best we can do is to identify a suitable type of distribution and fit its parameters to the empirical histogram Φcumulative of Fig. a,
thus obtaining a PDF pfit(x) as our best estimate for the true
p(x).
The likelihood of a data vector d given model m is then considered to be
L^{*}(\mathbf{m}|\mathbf{d}) = p_{\mathrm{fit}}\left(\Phi(\mathbf{d}|\mathbf{m})\right).
Approximate log-normal distribution of decorrelation D
We will consider three candidate distributions for fitting an analytic
pfit(x): beta, exponential and log-normal. They are all
positive one-sided (defined only for D>0) and can take negligible values
for D>2, where strictly they should be 0.
Figure a shows their fits to the empirical histogram
after determining the best-fitting scale parameters for each.
The beta and the exponential distributions are seen to overestimate the
number of very small D values (i.e. values of CC ≈ 1). Hence these
distributions would predict more excellent waveform fits than observed. The
likelihood of actually well-fitting waveforms would be estimated too low;
i.e. we would be too pessimistic about the achievability of good waveform
fits.
The log-normal distribution clearly yields the best approximation of the
D histogram. This is confirmed by the quantile–quantile plot of
Fig. b. Hence we choose the log-normal distribution
to express our likelihood function.
The (univariate) log-normal distribution function is defined by two scale
parameters μ and σ:
f(x) = \frac{1}{x\sqrt{2\pi\sigma^{2}}}\,\exp\left(-\frac{(\ln x-\mu)^{2}}{2\sigma^{2}}\right).
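As an illustrative sketch (not the authors' implementation), fitting this log-normal to a set of decorrelation values amounts to taking the mean and standard deviation of ln(D); below we simulate samples in place of the 200 000 real measurements, so all numerical values are hypothetical:

```python
import math
import random
import statistics

def lognormal_pdf(x, mu, sigma):
    """Univariate log-normal density, defined for x > 0."""
    if x <= 0:
        return 0.0
    return (1.0 / (x * math.sqrt(2 * math.pi * sigma**2))
            * math.exp(-(math.log(x) - mu)**2 / (2 * sigma**2)))

# Stand-in for the empirical decorrelation values D = 1 - CC:
# here we simply simulate log-normally distributed samples.
random.seed(0)
true_mu, true_sigma = -1.5, 0.8   # hypothetical values
D_samples = [math.exp(random.gauss(true_mu, true_sigma)) for _ in range(20000)]

# Fitting a log-normal amounts to mean and std of ln(D).
log_D = [math.log(d) for d in D_samples]
mu_fit = statistics.fmean(log_D)
sigma_fit = statistics.stdev(log_D)

# Empirical likelihood of a misfit value D' from a future inversion:
D_prime = 0.3
likelihood = lognormal_pdf(D_prime, mu_fit, sigma_fit)
```

In practice the fitted (μ, σ) would be tabulated per SNR bin rather than over the whole data set, as described below.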
The log-normal distribution also yields the best fit to our synthetic data
from Sect. , as calculated with the perturbations in
Eqs. () and (). See
Fig. S4 for a corresponding quantile–quantile plot.
If random variable x in Eq. () is equated with the
decorrelation Dj of one waveform j, the logarithm ln(Dj) is
normally distributed with mean μ and standard deviation σ. This
fortunate link of our empirical D histogram to the Gaussian distribution
makes it trivial to express the joint, multivariate distribution of all
nS waveform measurements of an earthquake, collecting the Dj
in vector D and the inter-station covariances in nS×nS covariance matrix SD.
The nS-variate likelihood function for D becomes
L_D^{*} = \frac{\exp\left(-\frac{1}{2}\left(\ln\mathbf{D}-\boldsymbol{\mu}\right)^{T}\mathbf{S}_D^{-1}\left(\ln\mathbf{D}-\boldsymbol{\mu}\right)\right)}{\sqrt{(2\pi)^{n}\,|\det(\mathbf{S}_D)|}},
and the misfit becomes
\Phi = \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\left(\ln D_j-\mu_j\right)\left(\mathbf{S}_D^{-1}\right)_{jk}\left(\ln D_k-\mu_k\right) + \frac{1}{2}\ln\left((2\pi)^{n}|\det(\mathbf{S}_D)|\right).
This is the Mahalanobis distance, not between the individual samples of two
waveforms ui and uic as in
Eq. () but between the decorrelation Dj of these
two waveforms and its expected value μj, taking into account correlated
noise between two stations in SD.
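As a minimal sketch of this misfit evaluation for n=2 stations, with invented values for D, μ and S_D (the real ones come from the SNR-dependent fits described below):

```python
import math

def misfit_phi(D, mu, S):
    """Misfit Phi for n=2 stations: Mahalanobis distance of ln(D) from mu
    under covariance S, plus the normalisation term."""
    n = 2
    # invert the 2x2 covariance matrix by hand
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    Sinv = [[ S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det,  S[0][0] / det]]
    r = [math.log(D[j]) - mu[j] for j in range(n)]
    quad = sum(r[j] * Sinv[j][k] * r[k] for j in range(n) for k in range(n))
    return 0.5 * quad + 0.5 * math.log((2 * math.pi) ** n * abs(det))

# Hypothetical numbers: two stations with correlated errors
mu = [-1.5, -1.2]
S = [[0.64, 0.20],
     [0.20, 0.49]]
D = [0.25, 0.35]
phi = misfit_phi(D, mu, S)
```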
Thus the use of D as a misfit criterion reduces the number of misfit values
to nS per earthquake (the number of source–receiver paths, or
waveforms) compared to ∑j=1nSnj in the case of the
ℓ1 or ℓ2 norms (nj is the number of samples on waveform j).
In other words, Dj itself accounts for any correlations across time
samples on seismogram j and subsumes them into a single number, leaving
only spatial (inter-station) correlations to be dealt with in
SD and in the empirical likelihood function
L∗.
Distribution coefficients determined by signal-to-noise ratio
Colour shades map out a two-dimensional histogram of waveform
decorrelation D, as a function of waveform SNR along the y axis. All
200 000 waveform measurements from our 900 deterministic source inversions
entered this histogram. Black lines are the best-fitting log-normal
distributions for SNRs of 10, 20 and 30. (The 1-D histogram for
SNR =20 was discussed in Fig. .) Toward smaller
SNRs (high-noise conditions), the D distribution widens (more occurrences
of poorly fitting waveforms).
Here we describe how μ and SD can be
estimated for one earthquake. So far it was implicitly assumed that a single
distribution pfit might fit Φcumulative for
all source–receiver paths.
This may be an oversimplification since ambient noise levels
ϵnoise show significant diurnal and seasonal variations,
and are elevated at stations close to coastlines or cities
. Hence we might expect goodness of fit to
vary across stations, which could be modelled by adjusting the scale
parameters of the log-normal distribution for each station. Goodness of fit
is also influenced by earthquake magnitude, and by station distance and
back azimuth, so we might even require different scale parameters for each
source–receiver pair.
To avoid this level of complexity, recall the investigation of
Sect. that revealed the distribution of D to
be most sensitive to the level of ambient noise ϵnoise.
Hence we bin our 200 000 source–receiver pairs by SNR and estimate only one pair of (μ,σ) distribution
parameters per SNR bin. This hopefully subsumes all individual sources of
random misfit.
SNR is defined as the integrated spectral energy in the signal time window,
divided by that of a 120 s noise window prior to the arrival of the
first body-wave energy. Signal time windows ui,i=1,…,Nsignal are as follows: for P phase, 5 s before to
20.6 s after its theoretical arrival time in IASP91, on the
Z component; for SH phase, 10 s before to 41.2 s after,
on the T component. Noise time windows ni,i=1,…,Nnoise are as follows: for both P and SH phases,
-150 to -30 s before theoretical arrival time. We calculate SNRs for P and SH waves as
\mathrm{SNR} = \frac{N_{\mathrm{noise}}\sum_{i=1}^{N_{\mathrm{signal}}} u_i^{2}}{N_{\mathrm{signal}}\sum_{i=1}^{N_{\mathrm{noise}}} n_i^{2}}.
Note that this way the noise window of the P wave measurement contains only
ambient noise, whereas the SH wave noise window is in addition
afflicted by some signal-generated noise: P coda and phases like PP or PcP,
which get scattered into the transverse component due to lateral
heterogeneities and anisotropy in the real earth.
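The SNR definition of the equation above reduces to a few lines of code; the toy windows below stand in for the real signal and 120 s noise windows:

```python
def snr(signal, noise):
    """SNR as mean squared amplitude in the signal window divided by
    that in the noise window (normalised by window lengths)."""
    n_sig, n_noi = len(signal), len(noise)
    e_sig = sum(u * u for u in signal)
    e_noi = sum(n * n for n in noise)
    return (n_noi * e_sig) / (n_sig * e_noi)

# Hypothetical short windows instead of real seismogram samples
signal = [0.0, 2.0, -3.0, 1.5, -0.5]
noise = [0.1, -0.2, 0.15, -0.05]
ratio = snr(signal, noise)
```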
Figure shows the D histogram and three fitted
probability densities pfit(D), as a function of SNR. Under
low-noise conditions (high SNR), the log-normal distributions are narrower
and centred on smaller D misfit values, which seems plausible.
By fitting functions of the form h(SNR)=a1+a2⋅exp(a3⋅SNR) to the SNR-binned D histograms, we
determined distribution parameters μP(SNR),
μSH(SNR), σP(SNR) and
σSH(SNR) for SNR ranging from 1 to 1000 for
P waveforms and from 1 to 200 for SH waveforms (see Supplement for details).
Hence the log-normal distribution pfit(D) ascribed to a given
source–receiver pair depends only on the ambient signal-to-noise ratio of the
receiver i, and its scale parameters are given by
\mu_i = a_{\mu,1} + a_{\mu,2}\cdot\exp\left(a_{\mu,3}\cdot\mathrm{SNR}_i\right),\qquad \sigma_i = a_{\sigma,1} + a_{\sigma,2}\cdot\exp\left(a_{\sigma,3}\cdot\mathrm{SNR}_i\right).
The exact values for ai depend on the velocity model and the solution
method. Here, we used the WKBJ method, which results in a simplistic crustal
response. Other methods, like the spectral-element method, in combination with
a waveform database as implemented in Instaseis by
may produce more realistic seismograms, resulting in higher average values of
D. What matters is that the actual inversion uses exactly the same solver
and velocity model as was used to determine the distributions of D.
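In code, the mapping from SNR to scale parameters is a one-liner per parameter. The coefficients below are hypothetical placeholders, since, as just noted, the real values depend on the velocity model and forward solver:

```python
import math

def scale_params(snr, a_mu, a_sigma):
    """Log-normal scale parameters as a function of SNR:
    mu = a1 + a2*exp(a3*SNR), and likewise for sigma.
    The coefficients are hypothetical stand-ins for the fitted values."""
    mu = a_mu[0] + a_mu[1] * math.exp(a_mu[2] * snr)
    sigma = a_sigma[0] + a_sigma[1] * math.exp(a_sigma[2] * snr)
    return mu, sigma

a_mu = (-2.0, 1.2, -0.05)    # hypothetical coefficients
a_sigma = (0.5, 0.6, -0.04)

mu10, sig10 = scale_params(10.0, a_mu, a_sigma)     # noisy station
mu100, sig100 = scale_params(100.0, a_mu, a_sigma)  # quiet station
```

With decaying exponentials, a quieter station (higher SNR) gets a distribution that is narrower and centred on smaller D, consistent with the histogram behaviour described above.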
Estimating inter-station covariances
Correlation in misfit between neighbouring stations. The measured
Pearson correlation (see Eq. ) is plotted against the difference
in azimuth between two stations of the same earthquake. A fit function
g_{b_1,b_2,b_3}(\vartheta) = b_1 + b_2\cdot\exp(-b_3\vartheta^{2}) is
plotted as a dashed red line.
Decorrelation values D measured at different stations cannot be expected to
be uncorrelated, because systematic modelling errors (due to differences
between assumed earth model and true earth, and to methodical inadequacies in
the Green's function computations) will affect neighbouring stations in
similar ways. A reasonable guess is that stations at similar azimuths from
the source would show the strongest correlations because their wave paths
have sampled similar parts of the sub-surface, in particular similar parts of
the crust and upper mantle – regions to which the strongest modelling errors
can be ascribed.
To check these systematics, we calculated the Pearson correlation coefficient
r(ϑ) as a function of azimuthal distance ϑ as follows.
For each earthquake, we calculated the azimuthal distances ϑjk
between all station pairs (j,k) and binned those. A set {j,k}ϑ
then contains all stations pairs for one event that have the same azimuthal
distance ϑ (in bins of 5∘ width).
We need to adjust for the fact that stations j and k usually have
different SNR and hence different μj and σj in their
log-normal distributions of D. Hence we calculate the standard score of
each station j as zj=(ln(Dj)-μj)/σj and from
this the Pearson correlation coefficient of a ϑ bin
{j,k}ϑ, using all nϑ station pairs in that bin:
r(\vartheta) = \frac{1}{n_\vartheta-1}\sum_{\{j,k\}_\vartheta} z_j z_k.
The use of standard scores permits comparison of stations of different SNR and
hence log-normal distribution parameters. The values for r(ϑ) are
then fit by a function (see Fig. )
g(\vartheta) = b_1 + b_2\cdot\exp\left(-b_3\vartheta^{2}\right).
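The standard-score and binned-correlation computation can be sketched as follows (all station values are invented for illustration):

```python
import math

def standard_score(D, mu, sigma):
    """z = (ln(D) - mu) / sigma, making stations of different SNR comparable."""
    return (math.log(D) - mu) / sigma

def bin_correlation(pairs):
    """r(theta) = 1/(n-1) * sum of z_j * z_k over all station pairs
    falling in one azimuthal-distance bin."""
    n = len(pairs)
    return sum(zj * zk for zj, zk in pairs) / (n - 1)

# Hypothetical bin: three station pairs, each station with its own
# (SNR-dependent) log-normal parameters mu and sigma.
pairs = [
    (standard_score(0.20, -1.5, 0.8), standard_score(0.25, -1.4, 0.7)),
    (standard_score(0.60, -1.5, 0.8), standard_score(0.55, -1.2, 0.9)),
    (standard_score(0.30, -1.8, 0.6), standard_score(0.35, -1.6, 0.7)),
]
r = bin_correlation(pairs)
```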
This azimuth-dependent correlation coefficient g(ϑ) can be used to
fill the elements of covariance matrix SD in
Eq. ():
S_{D,ij} = \begin{cases} \sigma_i\sigma_j\cdot\left(b_1+b_2\cdot\exp(-b_3\vartheta^{2})\right), & i\neq j,\\ \sigma_i^{2}, & i=j. \end{cases}
An example of such a covariance matrix is shown in Fig. . It
is for the 2011 earthquake in the US state of Virginia that was used as a
detailed working example of Bayesian source inversion in the companion paper
.
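Assembling S_D from per-station σ_i and a fitted g(ϑ) might look like the following sketch (coefficients b and station values are hypothetical):

```python
import math

def build_covariance(sigmas, azimuths, b):
    """Covariance matrix S_D: sigma_i^2 on the diagonal, and
    sigma_i*sigma_j*g(theta_ij) off the diagonal, with
    g(theta) = b1 + b2*exp(-b3*theta**2)."""
    b1, b2, b3 = b
    n = len(sigmas)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                S[i][j] = sigmas[i] ** 2
            else:
                theta = abs(azimuths[i] - azimuths[j])
                theta = min(theta, 360.0 - theta)  # wrap azimuth difference
                S[i][j] = (sigmas[i] * sigmas[j]
                           * (b1 + b2 * math.exp(-b3 * theta**2)))
    return S

# Hypothetical values for three stations
sigmas = [0.8, 0.7, 0.9]
azimuths = [40.0, 45.0, 200.0]  # degrees from the source
b = (0.05, 0.6, 0.01)           # hypothetical fit coefficients b1, b2, b3
S = build_covariance(sigmas, azimuths, b)
```

The two azimuthally close stations end up strongly correlated, while the antipodal third station contributes only the floor b1, which produces the block structure visible in the figure.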
Visualisation of an inter-station covariance matrix
SD for misfit D (centre panel; cf.
Eq. ), on the example of an mb 5.7 earthquake that
occurred in the US state of Virginia in 2011. Two maps for P and
SH data show the recording seismic stations as dots; colour fill
indicates the SNR of each waveform measurement. Inter-station correlation
depends directly on the azimuthal proximity of two stations. This results in
a block-diagonal matrix structure for SD, because we have sorted stations
by azimuth from the source. Blocks correspond to groups of stations with an
expected high correlation of errors: (1) a Northern Hemisphere cluster of
P wave measurements (circled in dark red), (2) a South American cluster of
P waveforms (green) and (3) a Northern Hemisphere cluster of SH waveform
measurements (olive). P and SH measurements are modelled as being
uncorrelated. For the analysis, only stations between 32 and
85∘ epicentral distance have been used, as marked by the dashed lines.
Misfit distribution of waveform amplitude measurements
Waveform amplitudes have not been considered so far, even though they provide
crucial constraints on focal mechanisms. Our amplitude measurement consists
of a comparison of the logarithmic energy content ln(A) in a 1 s time
window around the peak i=i1,…,i2 of the measured seismogram and its
synthetic:
\Delta\ln(A)_j = \ln\sum_{i=i_1}^{i_2} u_{j,i}^{2} - \ln\sum_{i=i_1}^{i_2} \left(u_{j,i}^{c}\right)^{2}.
Again our goal is to approximate the distribution of this misfit in order to
obtain an empirical likelihood function. The distribution of Δln(A)
is almost symmetric around 0; see Fig. S2. The amplitude misfit
|Δln(A)| approximately follows a Laplace distribution, where
parameter k does not vary much with SNR (see Supplement). We construct
the likelihood function
L^{*}_{\mathrm{Amp}} = \sum_{j=1}^{n_S} \frac{1}{2k}\exp\left(-\frac{|\Delta\ln(A)_j|}{k}\right),
which assumes no correlation in amplitude misfit between two stations. This
assumption is not without problems, but motivated by the fact that amplitude
errors are often caused by localised site effects.
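A sketch of the amplitude misfit and its per-station Laplace density, with invented peak windows and a hypothetical scale parameter k:

```python
import math

def delta_ln_A(u, u_syn):
    """Logarithmic amplitude misfit between the observed and synthetic
    peak windows."""
    return (math.log(sum(x * x for x in u))
            - math.log(sum(x * x for x in u_syn)))

def laplace_density(dlnA, k):
    """Laplace density of the amplitude misfit for one station."""
    return math.exp(-abs(dlnA) / k) / (2 * k)

# Hypothetical 1 s peak windows of observed and synthetic seismograms
u_obs = [0.5, 1.8, -2.1, 0.9]
u_syn = [0.4, 1.5, -1.9, 1.0]
dlnA = delta_ln_A(u_obs, u_syn)
L_amp = laplace_density(dlnA, k=0.4)  # k: hypothetical scale parameter
```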
Application in Bayesian source inversion
In practice these concepts are integrated with the Bayesian source inversion
procedure of as follows:
1. For every new earthquake, download and archive a suitable selection of broadband, three-component, teleseismic seismograms (Δ=32 to 85∘).
A pragmatic approach is to use stations from a handful of international, permanent networks (e.g. II, IU, G and GE) to ensure high quality, reliability and relatively even azimuthal coverage,
avoiding station clustering in any particular region. This is easily automated using the freely available data management software ObsPyDMT .
2. Bandpass filter between 0.02 and 1.0 Hz. Rotate horizontal components to the RTZ system. Select signal time windows and noise time windows, and calculate SNR as defined in Eq. ().
3. For each station, and for P and SH separately, use SNR to calculate distribution parameters μi and σi from Eq. ().
4. Populate the diagonal of covariance matrix SD,ii with the σi2.
5. Estimate correlation coefficient r(ϑj,k) between two stations (j,k) using Eq. (). Fill off-diagonal elements:
S_{D,jk} = r(\vartheta_{j,k})\,\sigma_j\sigma_k.
6. Insert μi and SD in the likelihood equation (Eq. ), and combine with LAmp∗ (Fig. )
to create the total likelihood function
L^{*} = L_D^{*} + L^{*}_{\mathrm{Amp}}.
7. For each source model m proposed by the sampling algorithm, calculate synthetic seismograms and pass them through the filters of step 2.
8. Calculate the empirical likelihood L∗(m|d)
(Eq. ), which is multiplied with a suitable prior to
obtain a posterior probability for m. Parameterisation of
m, Bayesian sampling strategy and construction of the posterior
distribution of m are described in the companion paper
.
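Steps 3 to 6 of this recipe can be condensed into a short sketch for two stations (all coefficients are hypothetical, standing in for the fitted values from the Supplement):

```python
import math

# Hypothetical coefficients for mu(SNR), sigma(SNR) and g(theta)
a_mu = (-2.0, 1.2, -0.05)
a_sigma = (0.5, 0.6, -0.04)
b = (0.05, 0.6, 0.01)

snrs = [15.0, 40.0]
azimuths = [30.0, 38.0]

# Step 3: distribution parameters from SNR
mus = [a_mu[0] + a_mu[1] * math.exp(a_mu[2] * s) for s in snrs]
sigmas = [a_sigma[0] + a_sigma[1] * math.exp(a_sigma[2] * s) for s in snrs]

# Steps 4-5: covariance matrix from sigmas and azimuthal distance
theta = abs(azimuths[0] - azimuths[1])
r = b[0] + b[1] * math.exp(-b[2] * theta**2)
S = [[sigmas[0]**2, r * sigmas[0] * sigmas[1]],
     [r * sigmas[0] * sigmas[1], sigmas[1]**2]]

# Step 6: evaluate L_D* for a proposed model's decorrelation values D
D = [0.2, 0.3]
det = S[0][0] * S[1][1] - S[0][1] ** 2
Sinv = [[ S[1][1] / det, -S[0][1] / det],
        [-S[0][1] / det,  S[0][0] / det]]
res = [math.log(D[i]) - mus[i] for i in range(2)]
quad = sum(res[i] * Sinv[i][j] * res[j] for i in range(2) for j in range(2))
L_D = math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 2 * det)
```

In a real run, steps 7 and 8 would repeat the last block for every proposed source model m, using synthetics from the same solver and velocity model that produced the fitted coefficients.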
Discussion
The most common approach to Bayesian inversion is to assert a simple noise
model for which an analytic likelihood function is known: this determines the
measure of misfit. We have gone the opposite route in designing a misfit D
based on considerations of robustness and dimensionality reduction. Since no
noise model was known, we had to investigate the actual noise statistics and
thus derive an empirical noise model and likelihood function from the data
D. We were fortunate to find that the (multivariate) log-normal
distribution provides the best fit to our decorrelation data because it can
be evaluated almost as easily and cheaply as the most favourable of all
distributions, the Gaussian (normal) distribution.
In fact, analytic probability densities are known for only a few misfit
functionals. By far the most commonly used are the Gaussian (normal)
distribution, associated with the ℓ2 norm misfit, and the Laplace
distribution, associated with the ℓ1 norm. Evaluating residuals of data
fits against these analytic distributions is straightforward and fast, which
is important in the computationally expensive Bayesian realm.
In practice, however, the adoption of ℓ1 or ℓ2 misfits may be
inappropriate or even impossible. Gauss and Laplace functions may be poor
approximations of the actual distributions of data residuals. Even if
they can be deemed adequate for some measurements (e.g. for the sample-wise
distance of two time series), they may generate huge and non-sparse
covariance matrices (because time samples are numerous and correlated), which
are difficult to estimate from the data. Even worse in such multivariate
scenarios, analytic expressions of the joint distribution functions may not
exist – as is the case for the Laplace distribution (ℓ1 norm).
Effectively this often leaves as the only “choice” for a noise model the
(multivariate) normal distribution – whether or not it fits the data at
hand.
More often than not, real data contain many more outliers than expected by
the normal distribution, certainly in the case of seismic data. Under the
ℓ2 norm, outliers disproportionately bias the solution (deterministic
case) or posterior distribution (Bayesian case) and also affect convergence
in the Bayesian case. The problem may be mitigated by manual removal of very
poorly fitting waveforms, but this is usually time-intensive guesswork and
likely to result in other biases.
The ℓ1 norm is more robust against outliers, and with the same
motivation distance norms with non-integer exponents ℓp have been
proposed and successfully applied, including for source inversion
. But all norms with p≠2 share the serious
limitation that no analytic expressions are known for the multivariate case.
Samples of real-world, band-limited time series are correlated. If a measured
seismogram of length N samples is considered,
u_i = u_i^{c} + \epsilon_{\mathrm{noise},i},
then an (N×N) covariance matrix for ϵnoise needs
to be estimated under the ℓ2 norm. Hierarchical Bayesian methods can be
applied to estimate the noise level and covariance from the data itself
(see ), but in many cases it may be
more guessed than estimated.
The situation is further complicated if the noise model can no longer be
purely additive (“+ϵnoise”). We have argued that our
noise model needs to be
u_i = \left(u^{c} \ast T^{\mathrm{model}} \ast T^{\mathrm{inst}}\right)_i + \epsilon_{\mathrm{noise},i},
where the convolving terms represent systematic modelling errors. In theory this
type of error might be eliminated with computationally powerful waveform
forward modelling and more research into detailed earth structure. But since
those efforts would be tangential to the problem at hand (source inversion),
the cost would seem prohibitive. Hence we do want the option of treating the
modelling error as “just another source of noise”, to be accommodated by a
more sophisticated noise model, the analytic expression of which will be
unknown.
Another reason for leaving the Gaussian or ℓ2 realm might be a change
of measurement. In our case, the cross-correlation or decorrelation
measurements collapse N×2 samples of two time series into a single
scalar CC or D. Even if inter-sample correlations of the time series
actually were multivariate Gaussian, the statistics of CC or D would be
something more complicated. On the upside, the dimensionality of the
multivariate problem is reduced by a factor of N, which helps substantially
when forced to take the empirical path toward obtaining a likelihood
function. Thus inter-station covariances are the only correlations to
estimate, and the fact that they are simple covariances (second moments) is,
again, owed to the fortunate fact that the log-normal distribution yielded the
best fit to the misfit histogram.
We are not sure whether there is a theoretical reason that the log-normal
distribution should be associated with the decorrelation misfit D, and thus
effectively with CC. Whatever the case, this finding is highly relevant in
that it also opens up the path to Bayesian sampling of other optimisation
problems that have previously adopted the cross-correlation coefficient CC
of seismograms as their misfit criterion, e.g. other flavours of seismic
source inversion , seismic tomography
or the estimation of earthquake cluster sizes
.
As noted, the proposed empirical likelihood function
L∗(m|d) is no likelihood function in a strict
sense because it is not derived from the noise on the raw data samples but
rather from the noise (i.e. residual) of misfit functional D. For other
inverse problems, it has to be evaluated case by case whether a noise
model exists that can describe the difference between modelled and measured
seismograms completely as an additive term. If that is the case, a classical
likelihood can be used, but many inverse problems in seismology are similar
to the one presented here, and the proposed empirical likelihood offers a path
to a more thorough Bayesian treatment. It is just important to remember that
the distribution of D has to be determined from synthetic seismograms
calculated with the same velocity model and forward solver as are used for
the actual inversion.
Other misfit criteria have been used in optimisation contexts in seismology.
For the purpose of source parameter inversion, their noise properties could
be investigated along the lines laid out by this work, and their empirical
likelihood functions studied. But unless their noise distributions turn out
to be as simple as for the D misfit (they would essentially have to follow
the normal or log-normal distribution), these other misfit choices will be
computationally more costly to sample. It is pleasing that the
cross-correlation, long appreciated for its robust performance in
deterministic optimisation, is now also vindicated in a Bayesian context by
the results of our study.