Fully probabilistic seismic source inversion – Part 1: Efficient parameterisation

Seismic source inversion is a non-linear problem in seismology where not just the earthquake parameters themselves but also estimates of their uncertainties are of great practical importance. Probabilistic source inversion (Bayesian inference) is well suited to this challenge, provided that the parameter space can be chosen small enough to make Bayesian sampling computationally feasible. We propose a framework for PRobabilistic Inference of Seismic source Mechanisms (PRISM) that parameterises and samples earthquake depth, moment tensor, and source time function efficiently by using information from previous non-Bayesian inversions. The source time function is expressed as a weighted sum of a small number of empirical orthogonal functions, which were derived from a catalogue of > 1000 source time functions (STFs) by a principal component analysis. We use a likelihood model based on the cross-correlation misfit between observed and predicted waveforms. The resulting ensemble of solutions provides full uncertainty and covariance information for the source parameters, and permits propagating these source uncertainties into travel time estimates used for seismic tomography. The computational effort is such that routine, global estimation of earthquake mechanisms and source time functions from teleseismic broadband waveforms is now feasible.


Introduction
Seismic source inversion is one of the primary tasks of seismology, and the need to explain devastating ground movements was at the origin of the discipline. The interest is to locate the earthquake source using seismogram recordings, and to combine this information with geological knowledge, in order to estimate the probability of further earthquakes in the same region. This purpose is served well by a variety of existing source catalogues, global and regional. Large earthquakes and those in densely instrumented areas are being studied in detail, using extended-source frameworks like finite-fault inversion or back-projection.
Smaller earthquakes (M_S ≤ 7.5), and especially remote events with sparse data coverage, are better parameterised by a point source. Most catalogues determine only a location and a moment tensor solution, which often allows for identification of the associated fault. But the waveform data contain additional information: for earthquakes of M_S ≥ 5.5, it is generally possible to invert for the temporal evolution of the rupture, described by a time series called the source time function (STF) (Ruff, 1989; Houston, 2001). While the STF may further aid the understanding of earthquake mechanisms (Vallée, 2013) and hazard, or the interpretation of an event in a mining context (Gibowicz, 2009), our primary motivation for estimating it is a different one: the STF convolves the broadband Green function and strongly affects its waveform. Waveform tomography estimates three-dimensional earth structure by optimising the fit of observed to predicted waveforms, but at high frequencies (e.g. exceeding 0.1 Hz) such fits can only succeed when the source time function is incorporated into the predicted waveform (Sigloch and Nolet, 2006; Stähler et al., 2012). Hence the purpose here is to develop an automated procedure to routinely estimate broadband source time functions and point source parameters from global seismogram recordings, including a full treatment of parameter uncertainties.

A few recent catalogues now include STF estimates (Vallée et al., 2011; Garcia et al., 2013), but the treatment of parameter uncertainties is still incomplete. Uncertainties in the STF correlate most strongly with source depth estimates, especially for shallow earthquakes (Sigloch and Nolet, 2006), where surface-reflected phases (pP, sP) inevitably enter the time window for STF estimation (see Fig. 1). Inversion for the STF and the moment tensor is linear, whereas inversion for depth is inherently non-linear. Hence gradient-free optimisation techniques like simulated annealing (Kirkpatrick et al., 1983) or the first stage of the neighbourhood algorithm (NA) (Sambridge, 1999a) have become popular; Table 4 presents an overview of gradient-free source inversion algorithms from recent years. These optimisation algorithms provide only rudimentary uncertainty estimates.
A natural alternative, pursued here, is Bayesian sampling, where an ensemble of models is generated. The members of this ensemble are distributed according to the posterior probability density P(m|d), where m is the model parameter vector to estimate. Integrating over certain parameters of this joint posterior, or linear combinations thereof, yields marginal distributions over arbitrary individual parameters or parameter combinations. To the best of our knowledge, ensemble sampling in the context of source parameter estimation has been tried only twice so far (Wéber, 2006; Dębski, 2008), and was limited to a few events in either case.
A hurdle to using sampling algorithms has been the efficient parameterisation of the source time function. We propose a parameterisation based on empirical orthogonal wavelets (Sect. 2.1), which reduces the number of free parameters to less than 12 for the STF, and to around 18 in total. We show that this makes Bayesian sampling of the entire model space computationally feasible.
A normalised moment tensor is sampled explicitly, and the scalar moment and absolute values of the M_j are derived from the amplitude misfit (Sect. 2.2). Section 3 introduces Bayesian inference as a concept and explains the model space and prior assumptions. The ensemble inference is done with the neighbourhood algorithm (Sambridge, 1999a, b). In Sect. 4, the code is applied to a magnitude 5.7 earthquake in Virginia, 2011. Section 5 discusses aspects of our algorithm and potential alternatives, which we compare to related studies by other workers in Sect. 5.4 and in the Appendix.
Our procedure is called PRISM (PRobabilistic Inference of Source Mechanisms); by applying it routinely, we plan to publish ensemble solutions for intermediate-size earthquakes in the near future. A usage of the uncertainty information gained from the ensemble is demonstrated in Sect. 4.3, where the influence of source uncertainties on tomographic travel time observables is estimated. Further investigations of noise and of inter-station covariances are presented in a companion paper (Stähler et al., 2014).
Method

Parameterisation of the source time function
Source time function (STF) is a synonym for the moment rate ṁ(t) of a point source, denoting a time series that describes the rupture evolution of the earthquake. It is related to u(t), the vertical or transverse component of the displacement seismogram observed at location r_r, by convolution with the Green function:

u(t) = \sum_{j=1}^{3} \sum_{k=1}^{3} M_{j,k} [ s(t) * G_{j,k}(r_s, r_r, t) ],   (1)

where s(t) ≡ ṁ(t) is the STF; M_{j,k} denotes the elements of the symmetric, 3 × 3 moment tensor M; and G(r_s, r_r, t) is the Green function between the hypocentre r_s and the receiver location r_r.
Due to the symmetry of M, we can reduce Eq. (1) to a simpler form:

u(t) = \sum_{l=1}^{6} M_l [ s(t) * g_l(t) ],   (2)

where M_l are the six unique moment tensor elements and g_l are the respective derivatives of the Green function. The elements g_l are not 3-D vectors, because we compute only either the vertical component (for P waves) or the transverse component (for SH waves). In either case, g is a superposition of six partial functions g_l, corresponding to contributions from the six unique moment tensor elements M_l, with a weighting for the non-diagonal elements of M, which appear twice in Eq. (1). The orientation of the source is considered to remain fixed during the rupture, i.e., M_l does not depend on t, so that a single time series s(t) is sufficient to describe the rupture evolution.
For intermediate-size earthquakes (5.5 < M_W < 7.0), the STF typically has a duration of several seconds, which is not short compared to the rapid sequence of P-pP-sP or S-sS pulses that shallow earthquakes produce in broadband seismograms. Most earthquakes are shallow in this sense, i.e., shallower than 50 km. In order to assemble tomography-sized data sets, it is therefore imperative to account for the source time function in any waveform fitting attempt that goes to frequencies above ≈ 0.05 Hz (Sigloch and Nolet, 2006).
Equations (1) and (2) are linear in s(t), so that s(t) can be determined by deconvolving g from u if M_l is considered fixed. However, g depends strongly on source depth (the third component of the vector r_s), so that a misestimated source depth will strongly distort the shape of the STF, as demonstrated by Fig. 1. Another complication is the fact that observed seismograms u(t) (as opposed to the predicted Green functions) are time-shifted relative to each other due to 3-D heterogeneity in the earth, and should be empirically aligned before deconvolving s(t).
These issues can be overcome by solving iteratively for s(t) and M_l at a fixed depth (Sigloch and Nolet, 2006; Stähler et al., 2012), but the approach requires significant human interaction, which poses a challenge for the amounts of data now available for regional or global tomography. Moreover, such an optimisation approach does not provide systematic estimates of parameter uncertainties.
Monte Carlo sampling avoids the unstable deconvolution and permits straightforward estimation of full parameter uncertainties and covariances. However, the model space to sample grows exponentially with the number of parameters, and the STF adds a significant number of them. In a naive approach, this number could easily be on the order of 100, i.e., computationally prohibitive. For example, the STFs deconvolved in Fig. 1 were parameterised as a time series of 25 s duration, sampled at 10 Hz, thus yielding 250 unknowns. This is not efficient, since neighbouring samples are expected to be strongly correlated, and it raises the question of how many independent parameters or degrees of freedom the problem actually has.
Due to intrinsic attenuation in the earth, the highest frequencies still significantly represented in teleseismic P waves are around 1 Hz. If from experience we require a duration of 25 s to render the longest possible STFs occurring in our magnitude range (Houston, 2001), then the time-bandwidth product is 1 Hz × 25 s = 25, and the problem cannot have more degrees of freedom than that.
Efficient parameterisation then amounts to finding a basis of not more than 25 orthogonal functions that spans the subspace of the real-world, band-limited STFs just described. In fact, we can empirically decrease the number of parameters even further. With the method of Sigloch and Nolet (2006), we have semi-automatically deconvolved more than 3000 broadband STFs while building data sets for finite-frequency tomography. Of these, we propose to use the 1000 STFs that we consider most confidently determined as prior information on what the range of possible STFs looks like for earthquakes of magnitude 5.5 < M_W < 7.5. By performing a principal component analysis on this large set of prior STFs, we find that only around 10 empirical orthogonal wavelets are needed to satisfactorily explain almost all of the STFs, as shown in Fig. 2.
In concrete terms, we applied the MATLAB function princomp.m to a matrix containing the 1000 prior STFs in its rows. The mean over the matrix columns (time samples) was subtracted prior to performing the decomposition, and is shown in Fig. 2a as wavelet s_0(t). Principal component analysis then determines s_1(t) as the function orthonormal to s_0(t) that explains as much of the variance in the matrix rows as possible. After subtracting (optimally weighted) s_1(t) from each row, function s_2(t) is determined such that it is orthonormal to s_0(t) and s_1(t) and explains as much as possible of the remaining variance. Each subsequent iteration generates another orthonormal s_i, until i = 256, the number of time samples (matrix columns). The source time function can now be expressed as

s(t) = s_0(t) + \sum_{i=1}^{256} a_i s_i(t).   (3)

In this parameterisation, the new unknowns to solve for during source estimation are the a_i. Since principal component analysis has sorted the a_i by their importance in explaining a typical STF, we may choose to truncate this sum at a relatively low value N ≪ 256:

s_N(t) = s_0(t) + \sum_{i=1}^{N} a_i s_i(t).   (4)

In practice, N will be chosen based on the residual misfit between s(t) and s_N(t) that one is willing to tolerate. Figure 2b shows the dependence of this misfit on N. If we tolerate an average root mean square (RMS) misfit of 10 % of total signal variance, N = 10 basis functions are sufficient, compared to 16 when using a sinc basis. In the following we use N = 12. A set of potentially problematic STFs expressed in our basis functions is shown in an electronic supplement to this paper.
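The following is a minimal numpy sketch of the same construction, using an SVD-based principal component analysis in place of MATLAB's princomp.m. The catalogue matrix, the function names, and the misfit check are ours, and random placeholder data stand in for the ~1000 deconvolved STFs:

```python
# Sketch: deriving an empirical orthogonal basis for STFs by PCA.
# `stfs` is a placeholder for the real catalogue: one STF per row,
# 256 time samples each (25 s at ~10 Hz).
import numpy as np

rng = np.random.default_rng(0)
stfs = rng.random((1000, 256))          # placeholder STF catalogue

s0 = stfs.mean(axis=0)                  # mean STF, wavelet s_0(t)
U, svals, Vt = np.linalg.svd(stfs - s0, full_matrices=False)
basis = Vt                              # rows are orthonormal wavelets s_1(t), s_2(t), ...

N = 12                                  # truncation order used in the paper

def reconstruct(a):
    """Truncated STF of Eq. (4): s_N(t) = s_0(t) + sum_i a_i s_i(t)."""
    return s0 + a @ basis[:N]

# Coefficients a_i of one catalogue STF in the truncated basis,
# and the resulting residual misfit:
a = (stfs[0] - s0) @ basis[:N].T
s_N = reconstruct(a)
rms_misfit = np.linalg.norm(stfs[0] - s_N) / np.linalg.norm(stfs[0])
```

For a real catalogue, the curve of rms_misfit versus N would reproduce the trade-off shown in Fig. 2b.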

Parameterisation of the moment tensor
The orientation of the source can be parameterised either by a moment tensor using six parameters, or as a pure shear displacement source (Aki and Richards, 2002, p. 112) with strike, slip and dip (to which a term for an isotropic component might be added). Here we want to estimate the non-double-couple content of the solutions, and hence we sample the full moment tensor. The scalar moment is fixed to 1, so that only relative M_j are estimated. This is equivalent to sampling a hypersphere in the six-dimensional vector space of moment tensor elements. Uniform sampling on an n-dimensional hypersphere can be achieved by the method of Tashiro (1977), which transforms n − 1 uniformly distributed random variables x_i into n random variables r_i that are distributed uniformly on a hypersphere with \sum_{i=1}^{6} r_i^2 = 1. We identify the r_i with the moment tensor components and note that the non-diagonal elements M_kl, k ≠ l, appear twice in the sum (thus we actually sample an ellipsoid rather than a hypersphere). We then have M_l = r_l for the diagonal elements (l = 1, 2, 3) and M_l = r_l / \sqrt{2} for the off-diagonal elements (l = 4, 5, 6), so that \sum_k M_kk^2 + 2 \sum_{k<l} M_kl^2 = 1.
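A minimal sketch of this sampling step is given below. It uses the common Gaussian-normalisation trick, which yields the same uniform distribution on the hypersphere as Tashiro's (1977) transform of n − 1 uniform variables; the weighting of the off-diagonal elements follows the ellipsoid argument above:

```python
# Sketch: drawing a normalised moment tensor uniformly on the unit
# hypersphere (Gaussian normalisation in place of Tashiro's transform).
import numpy as np

rng = np.random.default_rng(1)

def random_moment_tensor():
    r = rng.standard_normal(6)
    r /= np.linalg.norm(r)              # now sum(r_i^2) = 1
    m_diag = r[:3]                      # Mxx, Myy, Mzz
    m_off = r[3:] / np.sqrt(2.0)        # Mxy, Mxz, Myz appear twice in Eq. (1)
    M = np.diag(m_diag)
    M[0, 1] = M[1, 0] = m_off[0]
    M[0, 2] = M[2, 0] = m_off[1]
    M[1, 2] = M[2, 1] = m_off[2]
    return M

M = random_moment_tensor()
# Frobenius norm check: sum_k Mkk^2 + 2 sum_{k<l} Mkl^2 = 1
assert np.isclose(np.sum(M * M), 1.0)
```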

Forward simulation
Broadband, teleseismic Green's functions for the P-pP-sP and SH-sSH wave trains are calculated with the WKBJ code of Chapman (1978), using IASP91 (Kennett and Engdahl, 1991) as the spherically symmetric reference model for the mantle. The reference crust at the receiver site is replaced by a two-layered crust predicted by the model CRUST2.0 (Bassin et al., 2000). It uses the mean of layers 3-5 (soft sediments, hard sediments, upper crust) from the surface to the Conrad discontinuity, and the mean of layers 6 and 7 (middle crust and lower crust) between the Conrad and the Moho. Values for intrinsic attenuation in the mantle and crust are taken from the spherically symmetric earth model PREM (Dziewoński, 1981). The synthetic waveforms are compared to the observed seismograms in time windows that start 10 s before the theoretical P wave arrival time (according to IASP91) and end 41.2 s after it.

Bayesian inversion
Bayesian inversion is an application of Bayes' rule:

P(m|d) = P(d|m) P(m) / P(d),

where m is a vector of model parameters (in our case depth, moment tensor elements M_j and STF weights a_i), and d is a vector of data, i.e., a concatenation of P and SH waveforms. These quantities are considered to be random variables that follow Bayes' rule. We can then identify P(m) with the prior probability density of a model. This is the information on the model parameters that we have independently of the experiment. The conditional probability of d given m, P(d|m), also called L(m|d), is the likelihood of a model m to produce the data d. The term P(d) is constant for all models and is therefore dropped in what follows. P(m|d) is called the posterior probability density (short, "the posterior") and denotes the probability assigned to a model m after having done the experiment.
Since the posterior P(m|d) may vary by orders of magnitude for different d, we work with its logarithm. We introduce the quantity χ(m|d) to denote some kind of data misfit, such that the likelihood can be written as

L(m|d) = (1/k) exp( −χ(m|d) ).   (9)

The normalisation constant k is

k = ∫ exp( −χ(m|d) ) dm   (10)

and is calculated by the neighbourhood algorithm in the ensemble inference stage.
In the case of multivariate, Gaussian-distributed noise on the data with covariance matrix S_D, i.e., d = g(m) + n with n ∼ N(0, S_D), where g(m) is the data predicted by model m, we would obtain the familiar expression

χ(m|d) = (1/2) (d − g(m))^T S_D^{-1} (d − g(m)).

This term is usually called the Mahalanobis distance or χ²-misfit.
We do not choose this sample-wise difference between observed and predicted waveforms as our measure of misfit. There are questions about the Gaussian noise assumption for real data, but mainly we consider there to be a measure that is more robust and better adapted to our purpose: the cross-correlation (mis-)fit between data and synthetics (Stähler et al., 2014), which essentially quantifies phase misfit. In the optimisation-based, linearised approach to tomography, fitting the phase shift between two waveforms remains a near-linear problem in a wider range around the reference model than fitting the waveforms sample-wise. The cross-correlation fit is defined as

CC(ΔT_i) = ∫ u_i(t) u_i^c(t + ΔT_i) dt / \sqrt{ ∫ u_i^2(t) dt ∫ (u_i^c)^2(t) dt },   (13)

where u_i(t) is the measured and u_i^c(t) the synthetic waveform for a model m at station i. In general, CC is a function of the time lag ΔT_i at which we compare the observed and predicted waveforms, but here we imply that ΔT_i has already been chosen so as to maximise CC(ΔT_i). (This value of ΔT_i that maximises the cross-correlation is called the "finite-frequency travel time anomaly" of waveform u_i(t), and represents the most important observable for finite-frequency tomography (Nolet, 2008; Sigloch and Nolet, 2006). Section 4.3, which discusses error propagation from source inversion into tomographic observables, further clarifies this motivation for the cross-correlation criterion.) Correlation CC(ΔT_i) measures goodness of fit, so we choose the decorrelation D_i = 1 − CC(ΔT_i) as our measure of misfit (one scalar per wave path i). From the large set of pre-existing deterministic source solutions described in Sect. 2.1, we estimated the distribution of this misfit D_i, based on our reference data set of about 1000 very confidently deconvolved STF solutions. For this large and highly quality-controlled set of earthquakes, we empirically find that the decorrelation D_i of its associated seismograms u_i(t) and u_i^c(t) follows a log-normal distribution in the presence of the actual noise and modelling errors. The statistics of this finding are discussed further in the companion paper (Stähler et al., 2014), but here we use it to state our likelihood function L, which is the multivariate log-normal distribution:

L(m|d) ∝ exp( −(1/2) (ln D − µ)^T S_D^{-1} (ln D − µ) ) / \prod_{i=1}^{n} D_i,   (14)

where D is the decorrelation vector into which the n decorrelation coefficients D_i are gathered. Each D_i was measured on a pair of observed/predicted broadband waveforms that contained either a P or an SH arrival. The parameters of this multivariate log-normal distribution are its mean vector µ, containing n means µ_i, and its covariance matrix S_D. Empirically we find that the µ_i and the standard deviations σ_i (diagonal elements of S_D) depend mainly on the signal-to-noise ratio (SNR) of waveform u_i. The data covariance between two stations i and j (off-diagonal elements of S_D) is predominantly a function of the distance between station i and station j. We estimate their values from the data set of the 1000 trustworthy STF solutions, i.e., from prior information, and proceed to use these µ and S_D in our Bayesian source inversions. It follows from Eq. (14) that the misfit is

χ(m|d) = (1/2) (ln D − µ)^T S_D^{-1} (ln D − µ) + \sum_{i=1}^{n} ln D_i.   (15)
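A minimal sketch of this misfit and likelihood follows, assuming the log-normal form of Eq. (14) given above. The waveforms, µ and S_D are placeholders (in the paper, µ and S_D come from the catalogue of ~1000 trusted solutions), and the cyclic shift via np.roll is a shortcut for a proper windowed cross-correlation:

```python
# Sketch: decorrelation misfit D_i = 1 - max_T CC(T) and the
# log-normal log-likelihood of Eq. (14), up to an additive constant.
import numpy as np

def decorrelation(u_obs, u_syn, max_lag):
    norm = np.sqrt(np.sum(u_obs**2) * np.sum(u_syn**2))
    cc = [np.sum(u_obs * np.roll(u_syn, lag)) / norm
          for lag in range(-max_lag, max_lag + 1)]
    return 1.0 - max(cc)

def log_likelihood(D, mu, S_D):
    resid = np.log(D) - mu
    return -0.5 * resid @ np.linalg.solve(S_D, resid) - np.sum(np.log(D))

rng = np.random.default_rng(2)
t = np.linspace(0.0, 51.2, 512)
u_obs = np.sin(2 * np.pi * 0.2 * t) + 0.1 * rng.standard_normal(t.size)
u_syn = np.sin(2 * np.pi * 0.2 * (t - 0.5))        # phase-shifted synthetic
D = np.array([decorrelation(u_obs, u_syn, 50), 0.05])
mu, S_D = np.log([0.05, 0.05]), 0.25 * np.eye(2)   # placeholder statistics
print(log_likelihood(D, mu, S_D))
```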

Construction of the prior probability density
A crucial step in Bayesian inference is the selection of prior probabilities P(m) on the model parameters m. Our model parameters are as follows:

m_1: source depth. We assume a uniform prior based on the assumed depth of the event in the National Earthquake Information Center (NEIC) catalogue. If the event is shallow according to the International Seismological Centre (ISC) catalogue (< 30 km), we draw from depths between 0 km and 50 km, i.e., m_1 ∼ U(0, 50). For deeper events, we draw from depths between 20 km and 100 km. Events deeper than 100 km have to be treated separately, using a longer time window in Eq. (13) that includes the surface-reflected phases pP and sP.
m_2, ..., m_13 = a_1, ..., a_12: the weights of the source time function (Eq. 4). The samples are chosen from uniform distributions with ranges shown in Table 1, but are subjected to an additional prior, π_STF (see below).
Intermediate-sized and large earthquakes are caused by the release of stress that has built up along a fault, driven by shear motion in the underlying, viscously flowing mantle. Hence the rupture is expected to proceed in only one direction, the direction that releases the stress. The source time function is defined as the time derivative of the moment, s(t) = ṁ(t). The moment is proportional to the stress and thus monotonic, and hence s(t) should be non-negative. In practice, an estimated STF is often not completely non-negative (unless this characteristic was strictly enforced). The reasons for smaller amounts of "negative energy" (time samples with negative values) in the STF include reverberations at heterogeneities close to the source, which produce systematic oscillations that are present in most or all of the observed seismograms. Motivated by waveform tomography, our primary aim is to fit predicted to observed waveforms. If a moderately non-negative STF produces better-fitting synthetics, then our pragmatic approach is to accept it, since we are not interested in source physics per se. However, we still need to moderately penalise negative samples in the STF, because otherwise they creep in unduly when the problem is underconstrained due to poor azimuthal receiver coverage. In such cases, severely negative STFs often produce marginally better fits by fitting the noise. Smaller earthquakes in other contexts, like mining tremors or dyke collapse in volcanic settings, may involve strong volume changes and therefore polarity changes in the STF (e.g. Chouet et al., 2003). However, such events are outside of the scope of this study.
Our approach is to penalise slightly negative STF estimates only mildly, but to increase the penalty severely once the fraction of "negative energy" I exceeds a certain threshold I_0. To quantify this, we define I as the squared negative part of the STF divided by the squared entire STF:

I = ∫ Θ(−s(t)) s^2(t) dt / ∫ s^2(t) dt,   (16)

where Θ is the Heaviside function. Based on I, we define a prior π_STF:

π_STF(m) = exp( −(I / I_0)^3 ),   (18)

where the third power and I_0 = 0.1 have been found to work best. In other words, up to 10 % of STF variance may be contributed by negative samples (mostly oscillations) without penalty, but any larger contribution is strongly penalised by the prior π_STF.
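A short sketch of this prior, assuming the exponential form of Eq. (18); the test STF is a synthetic placeholder:

```python
# Sketch: "negative energy" fraction I of Eq. (16) and prior pi_STF
# of Eq. (18): negligible penalty for I << I0, severe penalty above I0.
import numpy as np

def negative_energy(s):
    """I: energy of the negative part of s(t) over its total energy."""
    return np.sum(s[s < 0.0]**2) / np.sum(s**2)

def prior_stf(s, I0=0.1):
    return np.exp(-(negative_energy(s) / I0)**3)

t = np.linspace(0.0, 25.0, 256)
s = np.exp(-((t - 5.0) / 1.5)**2) - 0.05 * np.sin(2.0 * t)  # mildly oscillatory STF
print(negative_energy(s), prior_stf(s))
```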
The neighbourhood algorithm supports only uniform distributions on parameters. The introduction of π_STF as defined by Eq. (18) therefore leads to a certain inefficiency, in that parts of the model space are sampled that are essentially ruled out by the prior. We carefully selected the ranges of the a_i by examining their distributions across the 1000 catalogue solutions. A test was to count which fraction of random models was consistent with I < 0.1. For the ranges given in Table 1, we found that roughly 10 % of the random STF estimates had I < 0.1.
A second prior constraint is that earthquakes caused by stress release on a fault should involve no volume change, meaning that the isotropic component of the moment tensor, M_iso = tr(M)/3, should vanish. We therefore use a Gaussian prior,

π_iso(m) = exp( −( M_iso / (M_0 σ_iso) )^2 ),   (19)

where M_0 is the scalar moment, and σ_iso = 0.1 is chosen empirically.
Third, we also want to encourage the source to be double-couple-like. A suitable prior is defined on the compensated linear vector dipole (CLVD) content, which is the ratio ε = |λ_3| / |λ_1| between the smallest and largest deviatoric eigenvalues of the moment tensor:

π_CLVD(m) = exp( −(ε / σ_CLVD)^2 ).   (20)

In the absence of volume change, a moment tensor with ε = 0.5 corresponds to a purely CLVD source, while ε = 0 is a pure DC source. Again we have to decide on a sensible value for the characteristic constant σ_CLVD. We choose σ_CLVD = 0.2, which seems to be a reasonable value for the intermediate-sized earthquakes of the kind we are interested in (Kuge and Lay, 1994).
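The two moment tensor priors can be evaluated together, as in the following sketch; it assumes the Gaussian forms of Eqs. (19) and (20) given above, and the test tensor is a placeholder:

```python
# Sketch: isotropic and CLVD priors on the moment tensor,
# with sigma_iso = 0.1 and sigma_CLVD = 0.2 as in the text.
import numpy as np

def prior_moment_tensor(M, M0=1.0, sigma_iso=0.1, sigma_clvd=0.2):
    m_iso = np.trace(M) / 3.0                        # isotropic part
    pi_iso = np.exp(-(m_iso / (M0 * sigma_iso))**2)
    dev = M - m_iso * np.eye(3)                      # deviatoric part
    lam = np.sort(np.abs(np.linalg.eigvalsh(dev)))   # |eigenvalues|, ascending
    eps = lam[0] / lam[2]                            # CLVD ratio |l3|/|l1|
    pi_clvd = np.exp(-(eps / sigma_clvd)**2)
    return pi_iso * pi_clvd

M = np.diag([1.0, -0.4, -0.6])   # deviatoric, nearly double-couple placeholder
print(prior_moment_tensor(M))
```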

Sampling with the neighbourhood algorithm
Our efficient wavelet parameterisation of the STF reduces the total number of model parameters to around 18, but sampling this space remains non-trivial. The popular Metropolis-Hastings (MH) algorithm (Hastings, 1970) can handle problems of this dimensionality, but is non-trivial to use for sampling multimodal distributions (see the discussion for details). These problems are less severe for a Gibbs sampler, but this algorithm needs to know the conditional distribution p(x_j | x_1, ..., x_{j−1}, x_{j+1}, ..., x_n) along each parameter x_j in the n-dimensional model space (Geman and Geman, 1984). This conditional distribution is usually not available, especially not for non-linear inverse problems.
To overcome the problem of navigation in complex high-dimensional model spaces, the neighbourhood algorithm uses Voronoi cells (Sambridge, 1998) to approximate a map of the misfit landscape (Sambridge, 1999a, first stage), followed by a Gibbs sampler to appraise an ensemble based on this map (Sambridge, 1999b, second stage).
In order to point the map-making first stage of the NA towards a priori allowed models, we use a precalculated set of starting models. For that, the NA is run without forward simulations and without calculating the likelihood, so that only a map of the prior landscape is produced, from 32 768 samples (Fig. 3a). This means that from the start the map will be more detailed in a priori favourable regions, which avoids the algorithm wasting too much time refining the map in regions that are essentially ruled out by the prior.
Next, the prior landscape is loaded, and a forward simulation is run for each member in order to evaluate its posterior probability. This map is then further refined by 512 forward simulations around the 128 best models. This is repeated until a total of 65 536 models have been evaluated.
In the second stage of the NA, which is the sampling stage, 400 000 ensemble members are drawn according to the posterior landscape from the first step.This process runs on a 16-core Xeon machine and takes around 2 h in total per earthquake.
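The toy sketch below mimics this two-stage logic on a trivial posterior with the sample budget quoted above. It is emphatically not Sambridge's Voronoi-based NA: there are no Voronoi cells, the neighbourhood refinement is replaced by Gaussian perturbations of the best models, and importance resampling stands in for the Gibbs-sampler appraisal:

```python
# Schematic sketch of a two-stage map-and-appraise workflow
# (simplified stand-in for the neighbourhood algorithm).
import numpy as np

rng = np.random.default_rng(3)
ndim = 16

def log_posterior(m):                    # placeholder forward problem
    return -0.5 * np.sum(m**2)

models = rng.uniform(-1, 1, (32768, ndim))          # pre-mapped prior samples
logp = np.array([log_posterior(m) for m in models])

while len(models) < 65536:                          # stage 1: refinement
    best = models[np.argsort(logp)[-128:]]          # 128 best models
    new = np.repeat(best, 4, axis=0) + 0.05 * rng.standard_normal((512, ndim))
    models = np.vstack([models, new])               # 512 new models per step
    logp = np.append(logp, [log_posterior(m) for m in new])

w = np.exp(logp - logp.max())                       # stage 2: appraisal,
w /= w.sum()                                        # importance resampling in
idx = rng.choice(len(models), size=400000, p=w)     # place of the Gibbs sampler
ensemble = models[idx]
```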

2011/08/23 Virginia earthquake
In the following we present a fully worked example of a Bayesian source inversion, by applying our software to the M_W 5.7 earthquake that occurred in central Virginia on 23 August 2011 (Figs. 4 and 5; also compare Fig. 1). While not considered a typical earthquake region, events from this area have nevertheless been recorded since the early days of quantitative seismology (Taber, 1913). Due to its occurrence in an unusual but densely populated area, this relatively small earthquake was studied in considerable detail, affording us the opportunity to compare to the results of other workers. Moderate-sized events of this kind are typical of our targeted application of assembling a large catalogue. The greatest abundance of suitable events is found just below magnitude 6; toward smaller magnitudes, the teleseismic signal-to-noise ratio quickly deteriorates below the usable level.
For the inversion, we used a set of 41 P waveforms and 17 SH waveforms recorded by broadband stations at teleseismic distances (Fig. 4). For waveform modelling, a simplified version of the crustal model CRUST2.0 (Bassin et al., 2000) was assumed around the source region. Layers 3-5 of CRUST2.0 were averaged into one layer above the Conrad discontinuity, and layers 6-7 were averaged into one layer from the Conrad discontinuity to the Moho; the resulting values are given in Table 2. The algorithm ran 65 536 forward simulations to generate a map of the posterior landscape, and produced an ensemble of 400 000 members in the second step. From this ensemble, the source parameters were estimated. Table 3 shows the estimated credible intervals and the median of the probability distribution for the depth and the moment tensor. These quantiles represent only a tiny part of the information contained in the ensemble, i.e., two statistics of one-dimensional marginals derived from a 16-dimensional probability density function. Some credible intervals are large; for example, we cannot constrain the depth to a range narrower than 10 km with 90 % credibility. Using such credible interval estimates, routine production runs of our software should be able to clarify whether depth uncertainties in existing catalogues tend to be overly optimistic or pessimistic. The complete marginal distribution of the source depth estimate is shown in Fig. 4, bottom left. We aim for additional, informative ways of summarising and conveying the resulting ensemble. Figure 5 shows what we call a "Bayesian beach ball": an overlay of 1024 focal mechanisms drawn from the ensemble at random. The thrust-faulting character of the event is unambiguous, but the direction of slip is seen to be less well constrained. The estimate of the source time function and its uncertainty are displayed in Fig. 4, bottom right. Within their frequency limits, our teleseismic data prefer a single-pulsed rupture of roughly 3 s duration, with a certain probability of a much smaller foreshock immediately preceding the main event. Smaller aftershocks are possible, but not constrained by our inversion.
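Quantile summaries of the kind shown in Table 3 follow directly from the ensemble, as in this minimal sketch (a Gaussian placeholder stands in for the real depth marginal):

```python
# Sketch: median and 90 % credible interval of one marginal,
# computed from ensemble samples.
import numpy as np

rng = np.random.default_rng(4)
depth = rng.normal(5.9, 3.0, 400000)     # placeholder depth marginal

lo, med, hi = np.percentile(depth, [5, 50, 95])
print(f"depth: {med:.1f} km, 90% credible interval [{lo:.1f}, {hi:.1f}] km")
```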

Comparison to source estimates of other workers
Our solution is consistent with the solution from the SCARDEC catalogue (Vallée, 2012), which puts the depth of this event at 9 km and its STF duration at 2.5 s. Chapman (2013) studied the source process of the 2011 Virginia event in great detail. He argues for three sub-events having occurred within 1.6 s at a depth of 7-8 km, spaced less than 2 km apart. This is compatible with our solution: since teleseismic waveforms contain little energy above frequencies of 1 Hz, we would not expect to resolve three pulses within 1.6 s with the method presented here. Chapman (2013) assumed crustal velocities at the source roughly 50 % higher than ours, which may explain why he estimates the source 1-2 km deeper than our most probable depth of 5.9 km (Fig. 4, bottom left).

Uncertainty propagation into tomographic observables
We are interested in source estimation primarily because we want to account for the prominent signature of the source wavelet in the broadband waveforms that we use for waveform tomography. Input data for the inversion, primarily travel time anomalies ΔT_i, where i is the station index, are generated by cross-correlating observed seismograms with predicted ones. A predicted waveform consists of the convolution of a synthetic Green's function with an estimated source time function (Eq. 2). Thus uncertainty in the STF estimate propagates into the cross-correlation measurements that generate our input data for tomography. Previous experience has led us to believe that the source model plays a large role in the uncertainty of ΔT_i. The probabilistic approach presented here permits the quantification of this influence by calculating ΔT_{i,j} for each ensemble member j. From all values for one station, the ensemble mean ΔT_i and its standard deviation σ_i can then be used as input data for the tomographic inversion. Thus we obtain a new and robust observable: Bayesian travel time anomalies with full uncertainty information.
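As a sketch of this propagation, the snippet below rebuilds a synthetic per ensemble member, measures the lag maximising the cross-correlation, and summarises ΔT_i by its ensemble mean and standard deviation. The Gaussian pulse and the 200-member loop are placeholders for real waveforms and the real 400 000-member ensemble:

```python
# Sketch: propagating source uncertainty into travel time anomalies.
import numpy as np

def best_lag(u_obs, u_syn, lags, dt):
    """Lag (in seconds) that maximises the normalised cross-correlation."""
    norm = np.sqrt(np.sum(u_obs**2) * np.sum(u_syn**2))
    cc = [np.sum(u_obs * np.roll(u_syn, k)) / norm for k in lags]
    return lags[int(np.argmax(cc))] * dt

rng = np.random.default_rng(5)
dt, lags = 0.1, range(-50, 51)
t = np.arange(0.0, 51.2, dt)
u_obs = np.exp(-(t - 20.0)**2)                        # placeholder observed pulse
dT = [best_lag(u_obs, np.exp(-(t - 20.0 - rng.normal(0, 0.3))**2), lags, dt)
      for _ in range(200)]                            # 200 ensemble members
dT_mean, dT_sigma = np.mean(dT), np.std(dT)           # Bayesian travel time datum
```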
Figure 6 shows the standard deviation σ_i of the P-wave ΔT_i at all stations. Comparison with the signal-to-noise ratios in Fig. 4 shows no overall correlation, except for South American stations, where a higher noise level is correlated with a somewhat larger uncertainty on ΔT_i. By contrast, the European stations all have good SNR, but uncertainties in the travel times are large nonetheless, because source uncertainty happens to propagate into the estimates of ΔT_i more severely in this geographical region. This information would not have been available in a deterministic source inversion, and it could strongly affect the results of seismic tomography.

Performance of the empirical orthogonal basis for STF parameterisation
We choose to parameterise the source time function in terms of empirical orthogonal functions (eofs), which by design is the most efficient parameterisation if the characteristics of the STFs are well known. We think that they are, having semi-automatically deconvolved thousands of STFs in prior work (Sigloch and Nolet, 2006; Sigloch, 2011) and compared them with other studies (Tanioka and Ruff, 1997; Houston, 2001; Tocheport et al., 2007). The flip side of this tailored basis is that it might quickly turn inefficient when atypical STFs are encountered. From the appearance of the eofs in Fig. 2a, it is for example obvious that STFs longer than 20 s could not be expressed well as a weighted combination of only 10 eofs. Hence the STFs of the strongest earthquakes considered (around M_W 7.5) might not be fit quite as well as the bulk of smaller events, which contributed more weight to defining the eof basis. For our tomography application, this behaviour is acceptable and even desirable, since the largest events are no more valuable than smaller ones (often quite the opposite, since the point source approximation starts to break down for large events). For a detailed display of a set of potentially problematic STFs, see the electronic supplement to this paper. At first glance it might seem unintuitive that the basis functions have oscillatory character and thus negative parts, rather than resembling a set of non-negative basis functions (a set of triangles would be one such set). However, the training collection to which the principal component analysis was applied did consist of predominantly non-negative functions, which by construction are then represented particularly efficiently, even if the eofs may not give this appearance. On top of this, we explicitly encourage non-negativity of the solution via the prior π_STF (Eq. 18). A rough estimation showed that roughly 90 % of the model space is ruled out by the condition that the source should have a vanishing negative part.
We wanted to know how many basis functions of a more generic basis (e.g., wavelets) would be required to approximate the STF collection as well as the eofs do. A trial with a basis of sinc wavelets showed that 16 basis functions were needed to achieve the same residual misfit that our optimised basis delivers with only 10 eofs. Since the size of the model space grows exponentially with the number of parameters, avoiding 6 additional parameters makes a big difference in terms of sampling efficiency.

Moment tensor parameterisation
The parameterisation of the moment tensor is a technically non-trivial point. We discuss the pros and cons of possible alternatives to our chosen solution:

- Parameterisation in terms of strike φ_f, slip λ and dip δ is problematic for sampling. Strike and dip describe the orientation of the fault plane; an equivalent description would be the unit normal vector n on the fault. All possible normal vectors form a unit sphere. In order to sample uniformly on this unit sphere, samples have to be drawn from a uniform volumetric density (Tarantola, 2005, Sect. 6.1). Since the neighbourhood algorithm (and most other sampling algorithms) implicitly assumes Cartesian coordinates in the model space, the prior density has to be multiplied by the Jacobian of the transformation into the actual coordinate system, in our case 1/sin δ. To our knowledge, this consideration is neglected in most model space studies, but it would be more severe in ensemble sampling than in gradient-based optimisation.

- A different issue with strike-dip parameterisation is the following: the Euclidean distances applied to {φ_f, λ, δ} by the NA and similar Cartesian-based algorithms are in fact a rather poor measure of the similarity of two double-couple sources. A more suitable measure of misfit is the Kagan angle (Kagan, 1991), which is the smallest angle required to rotate the principal axes of one double couple into the corresponding principal axes of the other, or the Tape measure of source similarity (Tape and Tape, 2012). This is an issue in model optimisation with the first stage of the neighbourhood algorithm (Kennett et al., 2000; Sambridge and Kennett, 2001; Vallée et al., 2011). Wathelet (2008) has introduced complex boundaries to the NA, but unfortunately no periodic ones.

- An alternative would be to sample {M_xx, M_yy, M_zz, M_xy, M_yz, M_xz} independently, but this is inefficient because the range of physically sensible parameters spans several orders of magnitude.

- Finally, one might choose not to sample the moment tensor at all. Instead, one might sample only the {a_i, d} model space of STF weights and depth, followed by a direct, linear inversion of the six moment tensor elements for each sample (see the sketch after this list). This would speed up the sampling considerably, since the dimensionality of the model space would be reduced from 16 to 10. Moment tensor inversion is a linear problem (Eq. 2), and hence we would not lose much information about uncertainties. As a potential downside, moment tensor inversion can be unstable in the presence of noise or bad stations, but from our experience with supervised, linear inversions, this is typically not a severe problem in practice. Therefore we are considering this pragmatic approach of reduced dimensionality for production runs.
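The linear moment tensor step contemplated in the last item can be sketched as follows. Given a sampled STF s(t) and depth (which fix the partial Green functions g_l), the six weights M_l follow from least squares on Eq. (2); all inputs below are random placeholders, and the setup is our illustration rather than the production code:

```python
# Sketch: direct linear inversion of the moment tensor elements M_l
# for one fixed sample of STF and depth, via least squares on Eq. (2).
import numpy as np

rng = np.random.default_rng(6)
nt, nsta = 512, 41
s = rng.random(64)                                   # sampled STF (placeholder)
g = rng.standard_normal((nsta, 6, nt))               # placeholder g_l per station

# Columns of A: s(t) convolved with each partial Green function g_l,
# stacked over all stations.
A = np.concatenate(
    [np.stack([np.convolve(g[i, l], s)[:nt] for l in range(6)], axis=1)
     for i in range(nsta)], axis=0)                  # shape (nsta*nt, 6)

u = A @ np.array([1.0, -0.5, -0.5, 0.2, 0.0, 0.1])   # synthetic "data"
M_l, *_ = np.linalg.lstsq(A, u, rcond=None)          # recovered M_l
```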

Neighbourhood algorithm
The neighbourhood algorithm avoids some of the pitfalls of other sampling algorithms. Compared to the popular Metropolis-Hastings algorithm, we see several advantages for our problem:

- The MH is difficult to implement for multivariate distributions. This is especially true when the parameters are different physical quantities and follow different distributions, as is the case in our study.

- As the MH is a random-walk algorithm, the step width is a very delicate tuning parameter. It affects the convergence rate and also the correlation between successive models, which has to be taken into account when estimating probability density functions from the ensemble. This is a bigger problem than for the Gibbs sampler, on which the NA is based.

- The MH is rather bad at crossing valleys of low probability in multimodal probability distributions. We expect such multimodality, especially for the source depth. These problems are less severe for a Gibbs sampler, on which the second stage of the NA is based.

The first stage of the NA could be replaced by a completely separate mapping algorithm, like a genetic algorithm or simulated annealing. Like the first stage of the NA, these only explore the model space for a best-fitting solution. Their results might be used as input for the second stage of the NA. Compared to those, the NA has the advantage of using only two tuning parameters, which control (a) how many new models are generated in each step and (b) in how many different cells these models are generated. As in every optimisation algorithm, they control the tradeoff between exploration of the model space and exploitation of the region around the best models.
There is no hard-and-fast rule for choosing values for these tuning parameters. Since we do not want to optimise for only one "best" solution, we tend towards an explorative strategy and try to map large parts of the model space. In contrast to other source inversion schemes, we are explicitly interested in local minima of the misfit landscape. Local minima are often seen as a nuisance, especially in rather aggressive iterative optimisation frameworks, but in our view they contain valuable information. What may appear as a local minimum to the specific data set that we are using for inversion might turn out to be the preferred solution of another source inversion method (e.g., surface waves, GPS or InSAR).
However, an ensemble that does not resolve the best-fitting model is equally useless. The posterior of all models is normalised after all forward simulations have been done (see Eq. 10). If one peak (the best solution) is missing, the normalisation constant k will be too small, and therefore P(m|d) will be too high for all other models, meaning that the credibility bounds will be too large. It is possible that other sampling schemes, such as parallel tempering, might find better compromises between exploration and exploitation, which could be a topic of further study.

Comparison with other source inversion schemes
Table 4 shows a list of other point source inversion algorithms proposed and applied over the past 15 years. Most widely used is probably the Global Centroid Moment Tensor (CMT) catalogue (Dziewoński et al., 1981; Ekström et al., 2012), which is mostly based on intermediate-period (> 40 s) waveforms to determine a centroid moment tensor solution. Its results are less applicable to short-period body wave studies, since waveforms in the latter are dominated by the hypocentre, which may differ significantly from the centroid. Another classical catalogue is the ISC bulletin (Bondár and Storchak, 2011), which goes back as far as 1960. The ISC catalogue focuses on estimating event times and locations, neither of which are the topic of this study. The ISC recently adopted a global search scheme based on the first stage of the NA, similar to Sambridge and Kennett (2001), followed by an attempt to refine the result by linearised inversion, including inter-station covariances. Garcia et al. (2013) and Tocheport et al. (2007) use simulated annealing to infer depth and moment tensor. An STF is estimated from the P waveforms. By neglecting all crustal contributions and reducing the forward simulation to mantle attenuation, this approach is very efficient.
Similarly, Kolář (2000) used a combination of simulated annealing and bootstrapping to estimate uncertainties of the moment tensor, depth, and source time function. The study was limited to two earthquakes. Kennett et al. (2000) used the first stage of the NA to optimise for hypocentre depth, moment tensor, and the duration of a trapezoidal STF, using essentially the same kind of data as the present study and an advanced reflectivity code for forward modelling. However, no uncertainties were estimated.
Dębski (2008) is, to our knowledge, one of only two studies that obtained source time functions and their uncertainties by Bayesian inference. He studied magnitude 3 events in a copper mine in Poland. By using the empirical Green's function (EGF) method, it was not necessary to run an explicit forward simulation. The study was limited to inverting for the STF, which he parameterised sample-wise. This was possible since the forward problem was computationally very inexpensive to solve.
The second sampling study is Wéber (2006), who used an octree importance sampling algorithm to infer probability density functions for depth and the moment tensor rate function. The resulting ensemble was decomposed into focal mechanisms and source time functions, a non-trivial and non-unique problem (Wéber, 2009). With this algorithm, a catalogue of Hungarian seismicity was produced until 2010, but apparently this promising work was not extended to a global context.
The most recent global source catalogue is produced by the SCARDEC method of Vallée et al. (2011). It uses the first stage of the neighbourhood algorithm to optimise the parameters source depth, strike, dip, and rake. For each model and each station, a relative source time function (RSTF) is calculated. The misfit comprises a waveform misfit and the differences between the RSTFs at different stations. Uncertainties of the parameters are estimated from the variation of the misfit along different parameter directions. The STF catalogue has been used to infer the stress drop of a large set of earthquakes (Vallée, 2013).
The PRISM algorithm presented here is the first to enable Bayesian inference of seismic source parameters on a global scale and in a flexible framework. It allows for sampling of the source time function via a set of optimised, wavelet-like basis functions. Because a whole ensemble of solutions is produced, arbitrary derived quantities, like the uncertainty of travel time misfits, can be estimated from the ensemble afterwards at little additional cost. We showed that routine Bayesian inference of source parameters from teleseismic body waves is possible and provides valuable insights. From clearly stated a priori assumptions, followed by data assimilation, we obtain rigorous uncertainty estimates of the model parameters. The resulting ensemble of a posteriori plausible solutions permits estimating the propagation of uncertainties from the source inversion to other observables of practical interest to us, such as travel time anomalies for seismic tomography.

The Supplement related to this article is available online at doi:10.5194/se-5-1055-2014-supplement.


Figure 1. Source time function solutions for an M_W 5.7 earthquake in Virginia, USA (2011/08/23), obtained from joint inversion for STF and moment tensor M, using the iterative linearised optimisation algorithm of Sigloch and Nolet (2006). Trial source depths ranged from 2 km to 17 km, in increments of 1 km, and each deconvolution was based on the same 86 broadband, teleseismic P waveforms. Note the strong changes in STF and moment tensor as a function of depth. Top left shows the moment tensor solution from the NEIC catalogue for comparison. For every candidate solution, the percentage of "non-negative" energy is given, a proxy for how oscillatory (and thus inherently non-physical) the solution is. The third number gives the average cross-correlation coefficient between observed and predicted waveforms achieved by each solution. At depths between 2 and 7 km, the STF is pulse-like, simple, and non-negative, and waveform cross-correlation attains its maximum, signalling the most likely depth range for this event. The present study offers an approach to quantify these qualitative tradeoffs and judgements.

Figure 3. Principle of the neighbourhood algorithm, demonstrated for a two-dimensional toy problem (the underlying distributions are fictional and chosen for demonstration purposes). Top: in the premapping stage, only the prior distribution is evaluated, resulting in a map of starting models that cluster in regions of high prior probability (marked by lighter shades of red). Middle: next, the NA loads this map, evaluates the posterior probability for every sample, and refines the map only in the best-fitting Voronoi cells. Lighter shades of blue correspond to a higher posterior probability. Bottom: in the sampling or appraisal stage, the value of the posterior is interpolated to the whole Voronoi cell. The Gibbs sampler uses this map to produce an ensemble. This ensemble can be used to calculate integrals over the model space, like the mean or mode of selected parameters.

Figure 4. Waveform data and source estimates for the 2011/08/23 Virginia earthquake (M_W 5.7). Top row: distribution of the 41 and 17 teleseismic broadband stations that recorded the P and SH waveforms, respectively. Station colour corresponds to the signal-to-noise ratio in the relevant waveform window. Middle row: synthetic broadband waveforms (red), compared to the data for the best-fitting model. Black waveforms are P seismograms; blue waveforms are SH seismograms. The time windows are 51.2 s long and start 5 s before the theoretical phase arrival time. The amplitudes of all P and SH waveforms have been normalised. Bottom left: posterior marginal distribution of estimated source depth. Bottom right: posterior marginal distribution of the source time function. Probability densities are marked by colour and are highest in the areas shaded red.


Figure 5. Bayesian beach ball: probabilistic display of focal mechanism solutions for the 2011 Virginia earthquake.

Figure 6. Standard deviations σ_i of P-wave travel times ΔT_i, as calculated from the ensemble of solutions. The travel time estimates are by-products of using waveform cross-correlation as the measure of goodness of fit, and they represent our main input data for tomographic inversions. The unit on the colour scale is seconds.


Table 1. Sampling of the prior probability distribution: ranges of the STF weights a_i that are permitted in the first stage of the neighbourhood algorithm.

Table 2. Crustal model assumed for the source region of the 2011 Virginia earthquake (simplified from CRUST2.0).

Table 3. Credible intervals for source parameters of the Virginia earthquake. The moment tensor components M_kl need to be multiplied by 10^16 N m.
