Clustering has a meaning: optimization of angular similarity to detect 3D geometric anomalies in geological terrains

. The geological potential of sparse subsurface data is not being fully exploited since the available workflows are not 15 specifically designed to detect and interpret 3D geometric anomalies hidden in the data. We develop a new unsupervised machine learning framework to cluster and analyze the spatial distribution of orientations sampled throughout a geological interface. Our method employs Delaunay triangulation and clustering with the squared Euclidean distance to cluster local unit orientations/attitude which results in minimizing the within-cluster cosine distance. We performed the clustering on two representations of the triangles: normal and dip vectors. The classes resulting from clustering were attached to a geometric 20 centre of a triangle (irregular version). We developed also a regular version of spatial clustering which allows to answer whether points from a grid structure can be affected by anomalies. To illustrate the usefulness of the combination between cosine distance as dissimilarity metric and two cartographic versions, we analyzed subsurface data documenting two horizons: 1) the bottom Jurassic surface from the Central European Basin System (CEBS) and 2) an interface between Middle-Jurassic units within the Kraków-Silesian Homocline (KSH) which is a part of the CEBS. The empirical results suggest that clustering 25 normal vectors may result in near collinear cluster centers and boundaries between clusters of similar trend, thus pointing to axis of a potential megacylinder. Clustering dip vectors resulted on the other hand in near co-circular cluster centers, thus pointing to a potential megacone. We also show that the linear arrangements of the anomalies, their topological relationships and internal structure can provide insights regarding the internal structure of the singularity, e.g. whether it may be due to drilling a nonvertical fault plane or due to a wider deformation zone composed of many smaller faults.

3 used to group geometrically similar observations in one color domain. In the field of subsurface structural geology, it is expected that the recognition of subjectivity will bring about a better understanding of the subsurface and can help in reducing uncertainties (Bond, 2015;Curtis, 2012). Having a method to detect and investigate the spatial configurations of 3D geometric 65 outliers (grouped in homogenous subsets according to optimization criterion) and their topology is expected to be a major step in the understanding of the subsurface architecture. This is in particular valid for environments with sparse data, in which the topology of the fault network may be uncertain or prone to the existence of unknown faults (Schneeberger et al., 2017).

Examples of using machine-learning in solid Earth geoscience
In fields related to solid Earth geosciences, machine-learning methods have been applied in earthquake prediction (Seydoux 70 et al., 2020;Johnson et al., 2021;McLellan and Audet, 2020), the classification of rock units in geological mapping based on lithological or geophysical features (Cracknell and Reading, 2014;Kuhn et al., 2018;Xiong and Zuo, 2021;Wang et al., 2020) and the investigation of the topology of fractured networks (Srinivasan et al., 2018;Valera et al., 2018). Unsupervised techniques have also been successfully used for the problem of finding homogenous subsets of observations representing discontinuities (Hammah and Curran, 1999;Zhan et al., 2017a, b) or portions of geological interfaces to determine the average 75 orientation of regional trends (Michalak et al., 2019). The above studies did not however investigate the role of vector representations (normal and dip vectors) on the clustering results. They also did not attempt to use the characterization of Voronoi diagrams to explain the meaning of the boundaries between obtained clusters.

Spatial clustering
Spatial clustering is a generic term for investigating geometric trends throughout a surface of interest (Fisher, 1993;Fisher et 80 al., 1985). In the first step of the procedure, orientations with spatial components are sampled throughout the surface. Then, the observations are grouped into homogenous subsets. The partition can be achieved by assigning a corresponding class resulting from clustering. This class can be recorded as an integer and then represented with a label (a symbol or a color). From a technical viewpoint, the data frame used for clustering is reduced because only coordinates of normal and dip vectors serve as input for clustering. Thus, the spatial distribution of these classes can be examined by re-assigning a spatial component to 85 the labeled observations. The idea of spatial clustering shares some similarities with color-coding portions of interpreted seismic surfaces with respect to their orientation (Di and Gao, 2017). However, the latter approach does not employ optimization techniques (see section 2.1 Subjectivity) to group individual observations into homogenous subsets. Moreover, color coding is usually performed independently for dip and dip direction (Di and Gao, 2017;Roberts, 2001), which does not allow investigation of the relationship between dip and dip direction anomalies. 90 4 3 Methods

Calculating the orientation
Three noncollinear points in three-dimensional space define a plane whose orientation can be calculated using basic linear algebra (Allmendinger, 2020;Groshong, 2006). When triangulation of the points is applied, the quality of the triangles can be 95 measured using a variety of nonequivalent coefficients for their further removal (Collon et al., 2015;Michalak, 2018;Frey and Borouchaki, 1999). In this study, we used a collinearity coefficient defined as the proportion of the longest triangle edge to the length sum of the remaining edges (Michalak, 2018). The resulting coefficient lies within the interval of [0.5, 1], with lower values pointing at equilateral triangles and higher values representing collinear configurations. In the irregular grid (which is the case for the KSH), we filtered the configurations whose collinearity exceeded 0.90. This restriction resulted primarily in 100 the removal of triangles that lie at the edge of the convex hull (e.g., see the nonconvex shape of the convex hull in Fig. 12).

Assumptions underlying the clustering procedure
Selected clustering algorithms allow dissimilarity metrics to be specified to evaluate the similarity (or dissimilarity) between individual observations. In our case, these observations are normal and dip vectors (Fig. 1). For example, if the k-means algorithm is used (James et al., 2013), the squared Euclidean distance acts as the distance metric between p-dimensional 105 observations and ′ (in our case = 3): . (1) We propose to use a well-known fact (Choi et al., 2014;Zhang et al., 2020) that for unit normal vectors, the above squared distance is proportional to cosine distance ("°" denotes the scalar product between two vectors, and || • || is the Euclidean norm): . 5 Thus, the optimization problem solved in the k-means algorithm for unit vectors can be conceptualized in two ways (using the 115 fact the ( ) takes values on the interval [−1,1]): A. Minimizing the within-cluster cosine distance. 1 − cos (∡{ , ′ }) is minimized (equals zero) when cos(∡{ , ′ }) = 1, i.e. when the angle between and ′ is 0°.

Equivalence of k-means clustering and Voronoi cells
Because the clusters resulting from applying the k-means algorithm can be conceptualized as Voronoi cells (Hastie et al., 2009), we propose to explain some of the clustering results using relevant computational geometry theorems about Voronoi diagrams (De Berg et al., 2008). 130 Theorem 1. Let P be a set of n point sites in the plane. If all the sites are collinear then Vor(P) consists of n−1 parallel lines.
Otherwise, Vor(P) is connected and its edges are either segments or half-lines.
Theorem 2. For the Voronoi diagram Vor(P) of a set of points P the following holds: (i) A point q is a vertex of Vor(P) if and only if its largest empty circle CP(q) contains three or more sites on its boundary. 135 (ii) The bisector between sites pi and pj defines an edge of Vor(P) if and only if there is a point q on the bisector such that CP(q) contains both pi and pj on its boundary but no other site.

Irregular and regular trend maps
We propose here two versions of spatial clustering of geological contacts. The first version is based on taking one representative associated with the interiors of Delaunay triangles (their geometric centers). The vertices corresponding to the 140 Delaunay triangles are given indices that are also included in the resulting data frame, which we denote as Table Y. The clustering methods assign the resulting orientation labels to the geometric centers of the Delaunay triangles (Fig. 2).
The second version tries to evenly represent the geometric trends throughout the deformed surface (Fig. 2). It allows to answer whether a specific 2D point (a geographic location) lying in the domain of a triangulation is affected by a 3D geometric anomaly related to a triangulation model . In this approach, a structured grid defined by points linked with corresponding 145 clustering labels is generated. From a technical viewpoint, it is first necessary to generate the orientations of the Delaunay triangles. Then, we generate a regular point network within a region of interest. Next, we use the locate function offered by the Computational Geometry Algorithms Library (CGAL) library to link the points from the regular network with the corresponding triangles. In its simplest version, this function takes the point query as its argument, and the possible return values are as follows (Yvinec, 2021): 150 • If the point (from the regular network) query lies inside the convex hull of the points, a face that contains the query in its interior or on its boundary is returned.
• If the point query lies outside the convex hull of the triangulation but in the affine hull, it is a face (∞, p, q) such that the query lies to the left of the oriented line (the rest of the triangulation lies to the right of this line).
In the code, we enable only the inclusion of the first group of triplets into the resulting data frame, which we denote here as 155  Table X. This is because not all triangles (e.g., those of small size) may be linked with any points from the user-defined 7 regular grid, and this can especially be the case if the grid is sparse. To observe the differences, we therefore recommend using a SQL-related Right Outer Join (ROJ) (or Left Outer Join with replaced arguments) method rather than Inner Join for 160 merging. This choice allows us to calculate the proportion of the area that is not covered with any orientation label (the coverage presented in Fig. 12). ROJ applied to Tables X and Y returns all rows from the right Table (Y) and any rows with matching keys from the left Table (X). In this study, the keys over which the tables are merged are the indices of the points that build the corresponding Delaunay triangles. If no points can be found for a given triangle, NA values are assigned to the px and py coordinates in the merged Table Z. 165 summary of the regularization method: information about clustering labels of triangles must be attached to points from the regular grid. This transfer of information is possible via CGAL query functions which allow to identify triangles that have 8 points in their interiors (the points are arguments of the query functions). Please note that executing the query functions and clustering are done in separate environments, therefore two datasets (Table X and Y) need to be merged using unique elements model Maystrenko et al., , 2012. LTZ is according to (Medhus et al., 2012). The sedimentary cover of the CEBS can be subdivided into two clearly distinguished structural levels: (1) pre-Permian, (2) Permian and (3) Meso-Cenozoic structural units (Doornenbal et al., 2009;Evans et al., 2003;Maystrenko et al., , 2012185 Ziegler, 1990b). The pre-Permian sedimentary level mainly includes Devonian and Carboniferous sedimentary rocks with Silurian, Ordovician and Cambrian rocks along the north-eastern margin of the CEBS (Doornenbal et al., 2009;Evans et al., 2003;Fossen et al., 2017;Lassen and Thybo, 2012;Usaityte, 2000;Wiest et al., 2020).
During the Late Cretaceous-Early Cenozoic, some parts of the CEBS underwent basin inversion with the strongest compressive deformations and uplift along the Elbe Fault System, the Sorgenfrei-Tornquist and the Teisseyre-Tornquist zones (Hansen et al., 2000;Krzywiec et al., 2021;Mazur et al., 2005;Mogensen and Jensen, 1994;Otto, 2003;Scheck-Wenderoth and 210 Lamarche, 2005;Voigt et al., 2008;Ziegler, 1990a). During the Cenozoic, broad subsidence occurred within the central part of the North Sea, where more than 3 km of sediments were deposited (Evans et al., 2003;Ziegler, 1990b). The investigated Jurassic horizon represents the base of the Jurassic in places where the Jurassic sediments are present (Maystrenko et al., , 2012. Within the rest of the model area, this horizon corresponds to the top of pre-Jurassic sediments or to the top of the crystalline basement. 215

Case study 2: The Kraków Silesian Homocline (KSH)
The KSH in Poland is interpreted as an eastern continuation of a greater geological unit called the Fore-Sudetic Homocline Mesosoic Polish Basin (e.g. Dadlez et al., 1995;Słonka and Krzywiec, 2019). The axial part of the Permian-Mesosoic Polish Basin is called the Mid-Polish Trough and strikes parallel to the TTZ (see Mazur et al., 2021 for the latest interpretation of TTZ). The Mid-Polish Trough was a zone of strong subsidence during the Late Permian and Mesozoic (Kutek and Głazek, 1972). The inversion and uplift of the Polish Basin resulted in the formation of the Mid-Polish Anticlinorium, with two synclinoria symmetrically distributed along the NW-SE trending anticlinorium (Fig. 5, Kutek and Głazek, 1972;Narkiewicz, 230 2020).
The Permo-Mesozoic deposits of the KSH represent predominantly clastic and carbonate series with common hiatuses and lie unconformably on the denuded and morphologically diversified Paleozoic or Precambrian basement (Buła et al., 2015). It is generally assumed that the strata along with accompanying geological contacts dip gently toward the NE, a feature that has been present since the late Cimmerian phase (Górecka, 1993;Krokowski, 1984). The layers were ultimately tilted to the NE 235 (Figs. 5,6), in the direction of the Szczecin-Łódź-Miechów Synclinorium axis, during the inversion of the Mid-Polish Trough in the Maastrichtian and Paleocene (Górecka, 1993;Kutek and Głazek, 1972). Deviations from the preferred direction of the dip to the NE can be observed in the southeastern part of the KSH, where layers may dip to the S (Krokowski, 1984). The latter effect is believed to be caused by the thrusting Carpathians and the corresponding bending of deposits north of the thrusting loads in the Miocene (Jarosiński et al., 2009;Krokowski, 1984). Additionally, locally greater angular dip angles are 240 expected to be found near faults due to fault-related bending of strata (Bednarek et al., 1992;Matyszkiewicz and Krajewski, 1996;Krokowski, 1984). It is important to note that some of the observed or hypothesized faults have received kinematic interpretations. For example, the én echelon arrangement of SW-NE trending faults (Fig. 5) is believed to be formed by the underlying dextral movement of a strike-slip fault (Krokowski, 1984). This hypothesized movement, trending SE-NW, is believed to be related to the parallel Kraków-Lubliniec Fault, which separates blocks of crystalline basement (Buła et al., 245 2015).
For this study, we selected the geological interface that separates older Kościeliska sandstones (Early Bajocian-Late Bathonian) (Kopik, 1998) from the younger Częstochowa Ore-Bearing Clay Formation (Late Bajocian-Late Bathonian) in the area of Nowa Wieś town. The Częstochowa Ore-Bearing Clay Formation (known also as the ore-bearing clays) contains several horizons of siderite concretions, exploited in the past as the iron ore. There is a hiatus between the sediments, evidenced by 250 the lack of a Strenoceras subfurcatum ammonite zone (Zakrzewski, 1976). part of the Kraków-Silesian Homocline (modified after Bardziński et al., 1986) and the location of the study area.

Case study 1 (CEBS) -materials
We used the lithosphere-scale 3D structural model of the CEBS, constructed by Scheck-Wenderoth (Maystrenko et al., 2013, 2012; of the input data is given in Maystrenko et al. (Maystrenko et al., 2012, 2020. Here, we will only mention the largest 275 data sets used during the model construction. The main data source for the 3D model was the North Sea Digital Atlas which covers the entire North Sea (PGS, 2003). The digital version of the Geological Atlas of the Netherlands has been used for the Netherlands (NITG, 2004). To cover the North German Basin, the Geotectonic Atlas of northwest Germany (Baldschuhn et al., 2001), the 3D structural model of the Glueckstadt Graben (Maystrenko et al., 2006) and the 3D model of the NE German Basin (Scheck et al., 2003) have been taken. The 3D crustal-scale model of the Polish Basin 280 2005) has been used for Poland.

Case study 2 (KSH) -materials
We used 810 borehole records (Unpublished, The borehole data set…) that were handed over to the University of Silesia in Katowice by the "Geological Company of Częstochowa" (Częstochowskie Przedsiębiorstwo Geologiczne). The digitized version of these records contains Cartesian coordinates for the studied interface. The uncertainty of the borehole paths was not 285 provided, and the precision of the coordinates was 1 cm. We used Pulkovo 1942(58)/Poland zone V (EPSG: 2175) as the coordinate reference system. Compared to a previous study built upon this dataset (Michalak et al. 2019), we filtered the data to minimize the number of "noisy" boreholes that are most likely related to measurement errors. The assumed errors were manifested as unusual pointwise distributed depressions or elevations of the studied surface.

Determining the optimum number of clusters
An important clustering issue is the selection of the number of clusters. There are many competitive heuristics suggesting the optimal number of clusters (Rousseeuw, 1987;Hastie et al., 2009). Since in this study we only use the k-means algorithm, we followed the idea of the elbow method. It requires to run the k-means algorithm for different k (e.g. from 1 to 10) and to compute the total within-sum of squares for each clustering (denoted as tot_withinss in Fig.7). To locate the optimal number 295 of clusters, one looks for a kink in the sum of squares curve (Hastie et al., 2009). The experimental results usually suggest the optimum number of clusters to be 2 (Fig. 7B, 7C), 3 (Fig. 7A, 7D) or 4 (Fig. 7B). tot_withinss on the y axis denotes the total within-cluster sum of squares: given a clustering partition, for every cluster the sum of within-squares is calculated (s1, s2, …sk), and finally tot.withinss= s1 + s2 +…+ sk. In (Hastie et al., 2009) tot_withinss is 16 denoted as W(C), where C is a classification function. For example C(7)=3 means that the third cluster was assigned to the 305 seventh observation.

Case study 1 (CEBS) -results
The linear, anomalous zones are obtained along the NW-SE trending lithosphere-scale fault zones, such as the Tornquist Zone (the Sorgenfrei-Tornquist and the Teisseyre-Tornquist Zones) in the north-east and the Elbe Fault System in the south-west (Figs. 4, 8-9). This is due to the fact that these fault zones controlled the structure of the crystalline basement and the 310 configuration of the sedimentary cover within the CEBS that is reflected by the spatial clustering in Figs. 9A and 9B. In particular, the structural development of the Permo-Mesozoic Polish Basin is strongly coupled with the Teisseyre-Tornquist Zone, which can be considered as the preexisting lithospheric weak zone beneath this basin (Mazur et al., 2021). The Teisseyre-Tornquist Zone is characterized by two wide zones of low angle dip direction (Fig. 4C, 9B), reflecting the NW-SE strike of the Polish Basin (Fig. 5). In the case of the Sorgenfrei-Tornquist Zone, the north-eastern margin of the Norwegian-Danish 315 Basin was significantly affected by the shape of this fault zone during both the Permo-Mesozoic subsidence and the Late Cretaceous uplift (Erlström et al., 1997). A similar situation is along the Elbe Fault System which is also represented by a zone of weakness at the south-western margin of the North German Basin (Scheck et al., 2002). Another area with a set of the linear anomalous zones is located within the North Sea (Figs. 4, and 9B), where the Moray Firth Basin, the Central and Viking Grabens are located (Fig. 3). There, these linear anomalous zones are mainly caused by the major boundary faults of the graben 320 structures with the mentioned sedimentary basins. The sedimentary cover is faulted and folded along the boundary faults and, therefore, the geometric anomalies follow the configuration of the faults. Graben and outline the NE-SW trending, long and wide salt walls located there. Actually, the large size of the anomalies and high dip angles correlate well with the high intensity of salt movements, which were the strongest within the Glueckstad Graben Trusheim, 1960;Warsitzka et al., 2019). The high dip angle of the small-scale anomalies is obtained within the northern part of the Central Graben, where the salt diapirs pierce and strongly deform the sedimentary layers in the vicinity of salt structures (Davison et al., 2000;Karlo et al., 2014;Rank-Friend and Elders, 2004). The salt 330 movements were also intensive within the Polish Basin (Krzywiec, 2004(Krzywiec, , 2012. However, the salt-induced deformations of the sedimentary cover of the Polish Basin are not clearly reflected by spatial clustering analysis compared to the Glueckstadt Graben or Central Graben. This is mostly related to a relatively low resolution of the data which have been used to construct the 3D structural model of Poland . In contrast, the input data for the Glueckstadt and Central grabens were characterized by higher resolution (Baldschuhn et al., 2001;PGS, 2003), allowing authors to include 335 more details of the basin structures in the grabens.
18 Figure 8. Using k-means (k=3) clustering to normal (a) and dip (b) vectors for the investigated CEBS Jurassic horizon (irregular version). We used 236380 observations for clustering but the visualization is based on a random sample (10 000 340 observations). It can be observed that the normal representation (a) generated two sets of clusters with the less represented (both less than 3%) magenta and green clusters dipping at moderate angles to SW and NE, respectively. Clustering dip vectors (b) resulted in partitions that represent three dip direction domains (NE, W and SSE) with a common vertex near the stereonet origin and cluster centers having similar dip angles. Dip and dip direction of cluster centers are given in Tab. 1. The grid spacing is 10 degrees for dip angles and dip directions. We applied a polar equal-angle stereographic projection. The left 345 stereonet presents the projection of points from the unit lower hemisphere (tips of unit dip vectors) onto the horizontal plane.
The stereonet on the right presents the projection of points from the unit upper hemisphere (tips of unit normal vectors) onto the horizontal plane.

Case study 2 (KSH) -results
Both the normal (Figs.10A, 11A) and dip vector representations (Figs.10B, 11B) reveal similar spatial configurations of 360 geometric anomalies, i.e., observations dissimilar to the subhorizontal dip to the NE. A visible difference between normal and dip vector representation can be attributed to the spatial integrity of W-E trending anomalies (dipping to S) which are not very well preserved in the normal vector representation with three clusters. This effect can be explained by the fact that in the normal representation with three clusters the cluster centers are more or less collinear. This suggest (Theorem 1) that in the normal vector representation boundaries between clusters are more or less parallel. Indeed, in our results the boundaries 365 between clusters seem to have a similar NE-SW trend (Fig. 11A). The implication of this result is that observations dipping to S may be assigned to more than one cluster (magenta and green) when normal vectors are used for clustering.
For the two representations, two distinctly trending sets can be observed: NE-SW and NNE-SSW (locally N-S). The presence of an én echelon arrangement of the NE-SW set is in line with the orientation of faults in the Częstochowa region (Fig. 5B - Bardziński et al., 1986) and with the model of extensional faulting in the northern part of the KSH due to SE-NW-oriented 370 21 dextral strike-slip movement (Krokowski, 1984). However, our results do not support the unimodal distribution of fault strikes (only NE-SW), as proposed by (Bardziński et al., 1986, Fig. 5B).
We see that some of the arrangements are composed of more than one observation in the direction perpendicular to their trend (e.g., the SW-NE trending anomaly in the NW part of the area in Figs.10B). This necessitates discussion about the origin of these forms given that the difference in elevation between a hanging wall and footwall can be consumed by a single triangle 375 (e.g., with two points lying on the hanging wall and the third on the footwall) (Michalak et al., 2021).
The above effect could be explained by several competitive hypotheses. For example, the fault plane could have been drilled, thus broadening the zone of triangles genetically related to the fault (Michalak et al., 2021). Assuming the tectonic origin of the related structures, it can be hypothesized that fault drags on the hanging wall contribute to subsidiary elevation differences that must be consumed by nearby triangles. It could also be argued that an unusual lowering of the contact surface is due to a 380 deformation zone composed of many smaller faults. Another hypothesis could be that the related feature is not a fault but rather a sedimentary slope, which would explain the gradual lowering of the contact surface.
From a topological (Thiele et al., 2016) perspective, some of the NE-SW trending arrangements are paired in that their NWand SE-dipping counterparts are adjacent (e.g., the form composed of blue and magenta SW-NE-trending belts in the S part of the area in Fig. 12C, D). Depending on their relative position, they can be interpreted either as paleovalleys, grabens and 385 synclines (negative forms) or peaks, horsts and anticlines (positive forms). They can also be interpreted in terms of antithetic shear with hanging walls dipping against the main fault, which is often the case for listric faults (Harding and Tuminas, 1989;Fossen, 2006) or reverse drag/rollover anticlines (Fossen, 2006).
We also produced results using the regular version of spatial clustering with different coverage rates (Figs. 12A-D), i.e.
proportions of triangles linked with points comprising the regular grid. These results show that lower coverage may result in 390 the omission of small triangles (Fig. 12A, 12B) which can be misleading for analysis of connectivity between different representatives of the observed anomalies (Fig. 12C, 12D). The regular version allows to better observe the orientation of neighbors of triangles which can be more difficult if the irregular version is applied.
22 Figure 10. Using k-means (k=2) clustering to normal (a) and dip (b) vectors for the investigated KSH Jurassic horizon. This 395 version is irregular: orientation labels are assigned to geometric centers of Delaunay triangles. Both clustering and visualizations are based on 1502 observations. It can be observed that the normal representation (a) generated two sets of clusters with the less represented (about 4.7 %) magenta cluster dipping at moderate angles to NW. Clustering dip vectors (b) resulted in partitions that represent two dip direction domains with NW and ENE centers having similar dip angles. Dip and dip direction of cluster centers are given in Tab. 2. The grid spacing is 10 degrees for dip angles and dip directions. For 400 explanations regarding the projections, see the caption to Fig. 8. 24 25 Figure 11. Using k-means (k=3) clustering to normal (a) and dip (b) vectors for the investigated KSH Jurassic horizon. This version is irregular: orientation labels are assigned to geometric centers of Delaunay triangles. Both clustering and visualizations are based on 1502 observations. It can be observed that the normal representation (a) generated three sets of 405 clusters with the less represented (about 10.7 %) magenta and blue (about 4.5%) clusters dipping to NW and SE, respectively.
The boundaries between clusters have a similar NE-SW trend. Clustering dip vectors (b) resulted in partitions that represent three dip direction domains with NW, NE and ESE centers having similar dip angles. Dip and dip direction of cluster centers are given in Tab. 2. The grid spacing is 10 degrees for dip angles and dip directions. For explanations regarding the projections, see the caption to Fig. 8. 410  collinear) Delaunay triangles linked with the points from the regular grid to all filtered Delaunay triangles. Note that exceptions from the convex shape of the polygon or blank spaces in the interior are due to removed collinear configurations. We recommend to minimize the value of spacing to exhibit potential connectivity patterns. The grid spacing is 10 degrees for dip 420 angles and dip directions. We applied a polar equal-angle stereographic projection. The stereonet presents the projection of points from the unit lower hemisphere (tips of unit dip vectors) onto the horizontal plane.

Method's capabilities
The method's promise to identify geometric anomalies lies in the fact that squared Euclidean distance inherent to the k-means 425 algorithm (Hastie et al., 2009) is equivalent to cosine distance if processed vectors have unit length (Eq. 2). Thus, the resulting clusters have geometric meaning because they represent groups of observations that have small within-cosine distance, a fact 27 often used in the field of text analysis (Choi et al., 2014;Zhang et al., 2020;Hornik et al., 2012). In structural studies, cosine distance as dissimilarity metric was used for detecting fracture sets in outcrops (Zhan et al., 2017a) based on observations with substantial dip. In this study, we analyzed subsurface geological terrains with sub-horizontal or moderate dip and we believe 430 that the method holds promise for providing insights into the subsurface architecture of similar class-imbalanced data, thus preventing the creation of oversimplified models (Caumon et al., 2009).
This study adds knowledge regarding the possibility of using computational geometry theorems to explain the meaning of resulting clusters. As a case in point, we note that clustering results have a repeatable pattern in that the normal representation produced almost collinear cluster centers (Fig. 13A) and the dip representation has always (except k=2 when there is no vertex 435 and the theorem is not applicable) a vertex common for all clusters near the origin of the stereonet (Fig. 13B). These results can be rewritten using the computational geometry theorems. From Theorem 1 it follows that collinear cluster centers imply parallel boundaries between clusters. Indeed, in our results, the approximate boundaries between clusters have a similar trend (Fig. 13A). This suggests that in our results, the boundaries between clusters may be distributed along a potential cylinder axis with cluster centers lying in the same plane which is perpendicular to the direction of a potential cylinder axis. Moreover, from 440 Theorem 2 it follows, that a point q is a vertex in Voronoi diagrams if and only if its largest empty circle CP(q) contains three or more sites on its boundary. Thus, in Fig. 13B, the cluster centers must have a very similar distance to the origin which implies that the cluster centers lie on a common circle and thus have a very similar dip angle (compare dip angles for dip representations in Tabs 1,2). A full explanation of the above effects lies beyond the scope of this research and needs further studies. 445

Regularization
The first version is irregular, which means that we assign an orientation label to the geometric center of a triangle. This arbitrary decision makes the resulting map biased. The second version reduces the arbitrariness by creating a regular point set that is linked with corresponding triangles and their orientation labels. This regularization may serve to solve topological problems, e.g., whether or not the hypothesized faults intersect. However, there are two caveats to the regularization step: 460 1) only the irregular version can provide additional insights into the observed features (e.g., explanatory hypotheses about drilling the fault plane or having a listric fault) 2) The regular version is still sensitive to the initial spatial heterogeneity of the data density (the spatial configuration of the boreholes).
If the analysis of connectivity between anomalies is of interest, our recommendation is to minimize the spacing of the points 465 in the grid so as to maximize the proportion of linked triangles. This should be helpful to better analyse connectivity between observed anomalies. However, a sparse grid could be potentially more useful in the generalization schemes related to upscaling frameworks (Carmichael and Ailleres, 2016).

Limitations
Assuming the spatial homogeneity of the subhorizontal dip to the NE (ideally a flat plane dipping to the NE), the distances 470 between three points taken to construct the plane do not influence its orientation (it is the same plane). However, if this homogenous surface is faulted, then triangles genetically related to the faults have orientations different than that of the underlying faulted surface (Michalak et al., 2021). The directional within-dissimilarity of the triangles genetically related to faults (thus anomalies) may sometimes be high, with unintuitive dip directions of triangles opposite the fault dip direction (Michalak et al., 2021). In addition, the dip angle of these triangles is affected by the density of the borehole network in the 475 vicinity of the fault, with boreholes located closer to the fault resulting in relatively greater dip angles. This interplay between the initial technical conditions (dip angle of triangles genetically related to faults depends on density of boreholes) and tectonics may limit the epistemological value of the resulting interpretation based on dip-angle-informed clustering (e.g. Fig. 11A). We note that dip vectors are not uniquely defined for horizontal observations (the dip direction cannot be specified), so we recommend removing horizontal observations prior to conducting clustering. A similar problem and proposed solution applies 480 to normal vectors of vertical observations, for which two possible dip directions can be given.

Expert-guided partitions
If the faults strike perpendicular to the preferred direction, then setting a threshold to dip angle may be helpful to reduce the interpretational impact of the initial technical conditions (the density of the borehole network). For example, if faults strike perpendicular to the preferred fault direction (NE) and have their hanging wall to the NE, then the related triangles have dip 485 angle values greater than that of the regional trend, irrespective of the spatial heterogeneity of boreholes. Therefore, we suggest that sometimes the assistance of expert knowledge is needed to answer more specific questions. Another argument in support of the expert-guided partition is that although the combination of dip vector representation and the squared Euclidean distance metric can help identify the directional domains, in some cases, the underlying assumption may not be realistic because genetically related observations may dip to opposite directions. Merging the two groups of observations with a dip direction 490 difference of 180 is, however, unlikely if the combination of dip vector representation and the squared Euclidean distance (cosine distance for unit vectors) is applied (Figs.10B-11B). As a case in point, consider a fault striking perpendicular to the preferred dip direction (NE) with the hanging wall lying to the SW. In this case, triangles genetically related to such a fault may dip to the SW when the points from the hanging wall and footwall are relatively close to each other (and/or with a high value of fault throw). However, they may also dip to the NE at smaller angles, which may be the case if the distance between 495 the points from the hanging wall and footwall is relatively high (and/or with a small value of fault throw), thus only flattening the general effect of the dip to the NE.
We note that the space of the geometric hypotheses created by the cartographic results may be high and thus interpretationally challenging. We note that in the case of a lack of other geological knowledge or data, the method is capable only of indicating the strike of the hypothesized structures. This is because the dip direction associated with the triangles related to these faults 500 may also be attributed to reverse faults that have dip directions opposite those of the related triangles (Michalak et al., 2021).

Conclusions
As Bond (Bond, 2015) argues, for much structural geology, it is fair to say that "it's all about geometry". The infinite threedimensional space encountered in structural geology observations points to the need for generalization of geometric information to increase the capabilities of recognizing related structures and their topology (Kania and Szczęch, 2020;Thiele 505 et al., 2016). We believe that the workflow holds promise for identifying the relationships between the effects of the forces that shaped the region and those that caused subareas to deviate from the regional plan (compare Davis, 2002). More detailed conclusions are highlighted below: • Compared to simply color-coding surfaces with respect to either dip angle or dip direction, our method allows investigation of the relationship between dip and dip direction anomalies. Moreover, in simple visualizations of dip 510 angle and dip direction, the boundaries between colors in available default color palettes are established without considering an optimization criterion which may result in the lack of spatial integrity of the existing structures (Fig.   6C).
• In our approach, observations are separated according to an optimization criterion. Our method is capable of detecting geometric anomalies because applying squared Euclidean distance to unit vectors results in minimizing within-cluster 515 cosine distance (Eq. 2). Obviously, the geometric meaning of the proposed optimization procedure can be completely lost if the processed vectors do not have unit length. Thus, we do not recommend scaling vectors according to the size of the related triangles.
• In case of many sub-horizontal observations (which is true for many terrains), we propose two different conceptualization about the optimization procedure for normal and dip representations. For normal vectors 520 representing sub-horizontal terrains, it is better to conceptualize the optimization as minimizing the within-cluster cosine distance. For dip vectors representing sub-horizontal terrains, it is better to conceptualize the optimization task as maximizing the between-cluster cosine distance.
• The correspondence between Voronoi tessellation (Hastie et al., 2009) and clusters resulting from applying the kmeans algorithm as well as computational geometry theorems allow to further explain the meaning of the clusters. 525 Empirical results show that the combination of cosine distance with normal and dip vector representation holds promise for identifying axes of potential megacylinders and slope of potential megacones, respectively (Fig. 13).
These results should not however be extrapolated as a general rule for other study areas.
• The selection of triangulation as a source of collecting data for spatial clustering allows the internal structure of anomalies to be revealed. We created additional yet potentially competitive hypotheses about the nature of the 530 observed anomalies, i.e., whether the internal structure of the singularity may be due to drilling a nonvertical fault plane or due to a wider deformation zone composed of many smaller faults.

Data availability
Datasets for this research (input and processed data) are available in these intext data citation references: (Michalak, 2021a) [available on request]

Author contribution
MM devised the project, wrote the manuscript, performed the computations and discussed the results. LT discussed the results 545 and created several hypotheses regarding the internal structure of the revealed anomalies. FW participated in the study conceptualization (formulating the idea of regularization) and discussed the results (creating hypotheses regarding the observed anomalies). JŻ drafted the regional geological setting and discussed the results regarding the internal structure of the observed anomalies. KG participated in drafting the regional geological setting. MK advocated for the regularization and discussed the limitations regarding the interpretational value of dip-angle-based clustering. YPM described the setting and data from the 550 Central European Basin System and discussed the results. PL participated in drafting the regional geological setting.