Revisiting the statistical analysis of pyroclast density and porosity data

Explosive volcanic eruptions are commonly characterized based on a thorough analysis of the generated deposits. Amongst other characteristics in physical volcanology, density and porosity of juvenile clasts are some of the most frequently used to constrain eruptive dynamics. In this study, we evaluate the sensitivity of density and porosity data to statistical methods and introduce a weighting parameter to correct issues raised by the use of frequency analysis. Results of textural investigation can be biased by clast selection. Using statistical tools as presented here, the meaningfulness of a conclusion can be checked for any data set easily. This is necessary to define whether or not a sample has met the requirements for statistical relevance, i.e. whether a data set is large enough to allow for reproducible results. Graphical statistics are used to describe density and porosity distributions, similar to those used for grain-size analysis. This approach helps with the interpretation of volcanic deposits. To illustrate this methodology, we chose two large data sets: (1) directed blast deposits of the 3640–3510 BC eruption of Chachimbiro volcano (Ecuador) and (2) block-and-ash-flow deposits of the 1990–1995 eruption of Unzen volcano (Japan). We propose the incorporation of this analysis into future investigations to check the objectivity of results achieved by different working groups and guarantee the meaningfulness of the interpretation.


Introduction
Pyroclast density and porosity are commonly used to reconstruct eruptive dynamics and feed numerical models.The pyroclast density ρ is defined as The mass of a pyroclast m is easily measured using a precision balance.The measurement of its volume V is much more challenging, as pyroclasts have irregular shapes.According to the Archimedes' principle, V can be calculated using the volume of water displaced by the pyroclast V w that can be directly measured or calculated using the following equation: where the water density ρ w depends on the ambient temperature and m w corresponds to the mass of water displaced by the pyroclast.
If the density DRE (dense rock equivalent, ρ DRE ) is known, either assumed using the rock composition or measured in laboratory (i.e.rock powder density using water or helium pycnometry), it can be used along with the pyroclast density to calculate the pyroclast porosity (ϕ): It is important to note that measuring the density and the porosity of irregularly shaped pyroclasts is not straightforward.In particular, the parameter m w is difficult to constrain accurately due to water infiltration in the pyroclast.
Published by Copernicus Publications on behalf of the European Geosciences Union.

B. Bernard et al.: Revisiting the statistical analysis of pyroclast density and porosity data
The impact on the measurement increases for samples with high porosity and permeability.In any case, the properties of the pore network, such as the pore tortuosity, have to be taken into account because they affect the m w .Over the last decades, several methods have been developed to minimize the effect of intruding water (Houghton and Wilson, 1989;Schiffman and Mayfield, 1998;Polacci et al., 2003;Kueppers et al., 2005).Instead of trying to prevent imbibition, a very recent field study by Farquharson et al. (2015) attempts to correct for it, a method that may become useful in remote/difficult areas where it is a challenge to carry a vacuum pump and car battery.It is worth indicating that there are many different techniques to obtain density and porosity such as water saturation, pycnometry (water or helium), photogrammetry, calliper techniques, and X-ray tomography (Hanes, 1962;Manger, 1966;Giachetti et al., 2011).The increasing use of regularly shaped samples (cores) allows for an easy way to derive average density but provides partial information on the bulk density and porosity of the starting pyroclasts due to 3-D effects such as heterogeneous vesicle size and density distribution.The purpose of this paper is not to compare the different methods used to obtain the density/porosity data but to discuss how they should be treated statistically.
Another important aspect of density/porosity analysis is that pyroclastic deposits commonly present a large range of density values, so sample sets must comprise a large number of clasts.Additionally, the results must be checked for a low amount of bias due to preferential sampling during fieldwork.Then the density and porosity results are generally treated statistically using frequency analysis including average and distribution histograms.These analyses are often used to interpret volcanic structures or explosivity (Kueppers et al., 2005(Kueppers et al., , 2009;;Belousov et al., 2007;Shea et al., 2010;Mueller et al., 2011;Farquharson et al., 2015).The main issue in this approach is that density and porosity are considered thermodynamically as intensive properties that are not additive unlike extensive properties such as mass or volume (White, 2012).As a consequence, if it cannot be added, it should not be possible to average (sum divided by number of measurement) intensive properties.Pyroclast density is size dependent even for samples with a homogeneous bubble distribution (increase in density for particles smaller than the average bubble size, e.g., Eychenne and Le Pennec, 2012).This effect can be even stronger for heterogeneous pyroclastic material that commonly shows bubble gradients.Therefore, the average density ρ a can be estimated as the total mass of the pyroclasts m t divided by their total volume V t : The non-additive property of density and porosity also limits the use of frequency histograms.For statistical analysis on the density/porosity distribution, the measurements must be weighted adequately to be physically meaningful.
The purpose of this paper is to present a simple method to obtain weighted statistics in order to analyse density and porosity data.We also propose a stability analysis that allows the quantification of the quality of the sampling and the relevance of the results.In order to standardize the description of grain-size distribution of sediments, Inman (1952) proposed a set of graphical parameters based on statistical analysis.The new parameters such as graphical standard deviation and graphical skewness allowed quantifying descriptive terms such as poor or good sorting.Few years later Folk and Ward (1957) proposed revised parameters that better describe natural material, in particular polymodal distributions.They also introduced the kurtosis that helps to describe the shape of the mode.These parameters have been used ever since to characterize and distinguish volcanic deposits (Walker, 1971).We propose to adapt those equations to describe density and porosity distribution.This methodology is incorporated in an open-source R script (http://www.r-project.org/).R is a high-functioning freeware with excellent statistical capacities that provide an optimal platform for such an analysis.In order to encourage the use of this analysis, we also provide a similar MatLab numeric code.An Excel spreadsheet is also supplied, but only with basic formulae as most of the protocol cannot be translated to a spreadsheet format.Finally we illustrate and discuss this method using two large data sets.The symbols used in this paper are summarized in Table A1 in the Appendix.

Density and porosity data sets
We chose two large data sets from different pyroclastic deposits in order to assess the validity of our approach.The Chachimbiro data set (Bernard et al., 2014) is made of 32 sample sets from different outcrops of the 3640-3510 BC directed blast from Chachimbiro volcano, Ecuador (Supplement A).Each sample set contains between 15 and 103 clasts of the 16-32 mm fraction measured using the methodology of Houghton and Wilson (1989).The Unzen data set (Kueppers et al., 2005) is made of 31 sample sets from blockand-ash-flow deposits from the 1990-1995 eruption of Unzen volcano, Japan (Supplement B).Each sample set contains 24-33 large pyroclasts (> 64 mm) measured according to the methodology presented in Kueppers et al. (2005).

Weighting measurements
In order to perform a thorough statistical analysis of density and porosity data, each clast measurement in a sample set with a number of measurements n must be weighted.Based on Eq. ( 1), the density/porosity data can be weighted either by the volume or by the mass of the pyroclast as soon as the weighting parameter, here called the representativeness R, is defined as follows: (5) Here we chose to present the weighting by volume but the same resolution can be used to weight by mass.Equation (1) can be reformulated as follows: Then Eq. ( 6) can be inserted in Eq. ( 4): Using Eqs. ( 5) and (7): the representativeness by volume of any pyroclast is defined as the volumetric portion of the pyroclast in the whole sample:

Abundance histograms and cumulative plots
Abundance histograms and cumulative plots are typical graphical representations of density and porosity data (Fig. 1).1a and c).Several studies have used mixed histograms, with the main axis for density and a secondary axis for porosity (Houghton and Wilson, 1989;Formenti and Druitt, 2003;Belousov et al., 2007;Shea et al., 2010;Komorowski et al., 2013).There is no consensus for the histogram representation; nonetheless most studies used bin sizes between 50 and 100 kg m −3 for the density (Cashman and McConnell, 2005;Kueppers et al., 2005;Bernard et al., 2014).In theory, the bin size should be selected depending on the number of measurements and the density or porosity range; nevertheless for the purpose of comparison, we chose a constant bin size (100 kg m −3 and 5 % porosity) that can be changed in the numeric code.Cumulative plots (Fig. 1b and d) are easier to produce and have a unique representation as the data are used directly to produce the plot.The data are sorted by increasing density or porosity and these values are then plotted against the cumulative abundance that is the sum of the representativeness.The density and porosity cumulative plots should have the same shape but rotated 180 • .

Stability analysis
One of the main questions when performing a density and porosity analysis on pyroclastic deposits is the following: how many measurements are required to have a statistically representative sample set?The sample set size, here expressed as the number of measurements n, is primarily dependant on the dispersion of the data.Deposits with a large density range and a large standard deviation require a larger number of measurements.In order to assess the quality of the sampling, we propose a stability analysis based on the comparison between the final density average (including all the measurements) and intermediate density averages (including part of the measurements).To avoid analytical skew, due to intentional or unintentional ordering of the samples during the measurements, the data must be ordered randomly several times.The intermediate average ρ aint is calculated after each measurement and compared with the final average.An absolute error (AE) is calculated using as follows: Each run with random ordering leads to a different AE after a certain number of measurements.We chose to represent the 95th quantile (2σ ) of the AE against the number of measurements (Fig. 2).We found that about 1000 repetitive runs on one sample set are required to achieve identical results.Finally, the slope of the curve is calculated below a 5 % threshold of the absolute error to avoid the large error associated with a very small number of measurements.This slope is a direct indicator of the quality of the sampling with low slopes associated with high quality sampling.The slope of the curve is also calculated below 1 % of AE as an additional quality indicator but it appears less useful in practice.

Graphical statistics
As the frequency analysis is not suitable for density and porosity data, some interesting statistical parameters, such as the standard deviation, are difficult to obtain.Based on the work achieved to characterize better grain-size distribution (Inman, 1952;Folk and Ward, 1957), we propose for the first time a similar approach to calculate the graphical statistics of density and porosity using the cumulative plots (Fig. 1b  and d).The main difference between graphical statistics for grain-size distribution or for density data is not the equations but the data itself.Grain-size data obtained through sieving are partial data as the grain-size distribution inside each size class (1φ, 1 2 φ or 1 4 φ) is unknown.The density data, on the other hand, are continuous through the whole sample set.For informational purpose we present the equations for the density, which are identical to the equations for the porosity.

Inman graphical statistics
Inman (1952) defined three parameters: -the graphical median Md is a proxy of the average: where ρ 50 corresponds to the value of ρ at 50 % of cumulative abundance.Same notation is used for the following equations; -the graphical standard deviation σ describe the dispersion of the data set: -the graphical skewness Sk characterize the asymmetry of the data distribution:

Folk and Ward graphical statistics
Folk and Ward (1957) proposed different parameters supposed to be more representative of natural distributions, in particular for bimodal or polymodal distributions.The main difference with Inman's parameters is the inclusion of a 1-σ Solid Earth, 6, 869-879, 2015 parameter for the mean and a 2-σ parameter for standard deviation and skewness.In addition, Folk and Ward (1957) included the kurtosis, a statistical parameter that helps to characterize the shape of the distribution peak: -the graphical mean Mz: -the inclusive standard deviation σ I: -the inclusive skewness SkI: -the graphical kurtosis K: It is important to note that the values of graphical median and mean should be relatively close to the weighted average.Nevertheless, as the weighted average is physically the most accurate value, we propose to use it for graphical representation.Standard deviation, skewness and kurtosis are yet to be used to characterize density and porosity distributions, but they are useful.

R code
An open-access R code has been created to automate the calculations presented above.Additionally it facilitates the automatic creation of abundance histograms, cumulative plots, and stability curves.The input file must be in the format csv (field separated by comma) and structured as follows: 1. first column: pyroclast mass (in kg or g); 2. second column: pyroclast volume (in m 3 or cm 3 ); 3. third column: pyroclast density (in kg m −3 or g cm −3 ); 4. fourth column: pyroclast porosity (in decimal from 0 to 1).
The columns should have a header.All the values must have the decimal point separator for the R code to run properly.The name of the file should correspond to the name of the sample set to avoid confusion when compiling large data sets.The R code is provided in the Supplement (Supplement C) and to run the code only three commands are required in R: 1. set the working directory where the R code and the input file are located: setwd("∼/"); 2. load the code: source("stats.R"); 3. run the code: results<-stats("Input file name.csv").
For large data sets it is possible to create a list of csv files and treat them with a loop: 4. create the list: l<-list.files(path=".",pattern="csv"); 5. run the code for the list: for (i in 1:length(l)) {a<stats(l[i],plot=FALSE)}).The R code generates a text file with the statistical results and the figures in pdf format.Compiling the Chachimbiro (33 sample sets, 1492 clasts) and Unzen (32 sample sets, 922 clasts) data sets with the R code with 1000 runs for the stability analysis of each sample set take respectively 36 and 22 s on a 4 GB RAM computer (∼ 42 clasts s −1 in both cases).A translation of the R code in MatLab format is also provided in the Supplement C as well as a basic spreadsheet including the formulae required to obtain a weighted average.

Frequency versus weighted analysis
The absolute difference between frequency and weighted density/porosity averages for Chachimbiro and Unzen data sets is up to 4 and 2 % (Fig. 3a, Supplement D), respectively; that is close to the analytical error (< 5 %).This difference is not as important as the relative difference between individual sample sets per volcano.To highlight this we chose two sample sets from the Chachimbiro, 021-B and 089-A.These samples have almost the exact same frequency density average (1961 and 1960 kg m −3 ) but distinct weighted density averages (2039 and 1892 kg m −3 ).In contrast, two other sample sets from Chachimbiro (018-C and 095-A) show similar weighted density averages (2246 and 2242 kg m −3 ) but distinct frequency density averages (2284 and 2154 kg m −3 ).Abundance histograms can also be biased by the use of frequency analysis.We observed significant modification of the histogram shape such as fluctuation of the density/porosity modes (Fig. 3b), variation of the mode fraction, or change of the general density/porosity distribution (unimodal or plurimodal).For both of our study cases, the number of measurements and the number of samples per deposit is large enough for the effect of one method compared to the other to be minimum (few percent of deviation), even though laboratory experiments have shown that porosity is one of the main parameters that controls fragmentation during explosive eruptions under the presence of bubbles with gas overpressure (Alidibirov and Dingwell, 1996;Spieler et al., 2004).Therefore a change of only few percent of porosity might induce a large error on the calculation of pre-eruptive conditions such as overpressure, fragmentation depth, permeability and rock strength (Mueller et al., 2005;Heap et al., 2014).It is difficult to assess the effect of the statistical method based on literature as most of the publications only provide the final density and porosity data sets and not the raw data (mass and volume).

Sample size
The stability analysis (see Sect. 2.3) can be used to assess the quality of the sampling and also to estimate the minimum number of measurements required to obtain meaningful results.When comparing the slope of the stability curve below the 5 % threshold and the number of measurements from the Chachimbiro data set, it appears that sample sets with more than 40 clasts have a high stability (Fig. 4, Supplement D).Below 40 measurements there is scattering in the results (from high to low stability) probably associated with differences in the standard deviation.The Unzen data set exhibits a much smaller spread with a high stability for most of the sample sets.This difference indicates that natural heterogeneity of pyroclasts and eruption, transport and deposition dynamics require a deposit-adapted sampling strategy.Houghton and Wilson (1989) propose a minimum of 30 clasts per sample set.Our analysis shows that the minimum number of measured clasts per sample set must be established according to the characteristics of the deposit itself.When more raw data are available on different deposits, stability analysis results could be used to suggest a minimum number of measurements for future investigations.Moreover, the stability analysis might be used to select only high stability, ergo more representative samples for further analysis such as laboratory experimentation or permeability measurements (Fig. 5).

Distinguishing deposits
Graphical statistics for grain-size analysis have been commonly used to identify the nature of volcanic deposits (Walker, 1971).The same might be applied for density analysis.Figure 5 highlights the differences between the Chachimbiro and Unzen data sets.For similar values of density/porosity averages, the Chachimbiro data set shows almost systematically a higher standard deviation than the Unzen data set (Supplement D).The two data sets also display a small degree of overlap when looking at skewness and kurtosis parameters.The Unzen deposits have principally a symmetric porosity distribution (SkG and SkI around 0), while the Chachimbiro deposits have a clear asymmetric distribution (SkG and SkI mostly positive and up to 0.4).The porosity distribution for Unzen deposits is typically mesokurtic www.solid-earth.net/6/869/2015/Solid Earth, 6, 869-879, 2015 (KG ∼ 1) while it is generally highly leptokurtic (KG > 1) for Chachimbiro deposits, mostly associated with a larger tail of data and wider porosity modes.This might be interpreted as an expression of the outgassing processes in both contexts.The dome collapses, associated with Unzen deposits, probably affected the upper part of the lava dome that has been fairly outgassed while the directed blast, associated with Chachimbiro deposit, removed most of the dome in one event, but also magma from the plumbing system with higher volatile content.There is no major difference between the Inman (1952) and the Folk and Ward (1957) parameters for the Unzen data set while the Chachimbiro data set behaves differently.In particular the inclusive skewness (Fig. 5d) allows for a better distinction between the Unzen and Chachimbiro data sets.As indicated by Folk (1966), the Folk and Ward parameters generally represent polymodal distribution better than the Inman parameters do.Consequently, the bimodal distribution of most samples from the Chachimbiro deposit explains why the former better describes them than the latter.It is possible that the distinction made thanks to the graphical parameters is related to the origin of the deposits (directed blast vs. block-and-ash flow) but more data from different deposits are required to support this hypothesis.

Conclusion
This study presents a new methodology to treat density and porosity measurements from pyroclastic deposits.It presents weighting equations that allow a more robust statistical analysis.The evaluation of Chachimbiro and Unzen density/porosity data sets indicates that frequency analysis alone can lead to misinterpretations and that weighted analysis should be used to avoid analytical bias.The stability analysis provides a tool to assess the quality of the sampling while the graphical parameters allow for a better characterization of the deposits than the classical approach using only averages and histograms.The results obtained show that for small numbers of measurements the Chachimbiro sample sets are less stable than the Unzen ones.This can be interpreted as being due to either the sampling method or due to the deposit density/porosity distribution.Finally we propose the use of graphical statistics to represent density/porosity data.
The differences observed between the two data sets indicate that such representations can be useful to distinguish pyroclastic deposits.

Figure 1 .
Figure 1.Abundance histograms (a and c) and cumulative plots (b and d) for pyroclast density and porosity data.Sample CHA-201-A (n = 103) from Chachimbiro directed blast deposit.

Figure 2 .
Figure 2. Stability curves obtained after 1000 runs for two samples from Chachimbiro and Unzen data sets.Note the constant slope below the 5 % threshold.

Figure 3 .
Figure 3.Comparison between frequency and weighted analyses.(a) Weighted vs. frequency density average for Chachimbiro and Unzen data sets, note the large relative differences highlighted by the red arrows (see Sect. 3.1 for explanation); (b) porosity abundance histogram for one sample from the Chachimbiro data set, note the large difference (10 %) of the main porosity mode between the two statistical methods represented by the red arrow.

Figure 4 .
Figure 4. Results of the stability analyses for the Chachimbiro and Unzen data sets.Note that there is a large scattering for Chachimbiro data set below 40 measurements while the Unzen data set has much less dispersed values.

Figure 5 .
Figure 5. Graphical parameters for the Chachimbiro and Unzen data sets.Only high stability (slope < 0.5 %) sample sets are used in this figure.Note that the two data sets show lower superposition with the Folk and Ward parameters than with the Inman parameters, in particular when using the Skewness (d).

Bernard et al.: Revisiting the statistical analysis of pyroclast density and porosity data weighted
graphs.For the abundance histogram, in each interval we sum the representativeness of the measurements instead of counting the number of measurements and dividing it by n.It is important to note that density and porosity histograms can have different shapes due to the selected bin size (Fig.