ObspyDMT: a Python toolbox for retrieving and processing large seismological data sets

. We present obspyDMT, a free, open-source software toolbox for the query, retrieval, processing and management of seismological data sets, including very large, heterogeneous and/or dynamically growing ones. ObspyDMT sim-pliﬁes and speeds up user interaction with data centers, in more versatile ways than existing tools. The user is shielded from the complexities of interacting with different data centers and data exchange protocols and is provided with powerful diagnostic and plotting tools to check the retrieved data and metadata. While primarily a productivity tool for research seismologists and observatories, easy-to-use syntax and plotting functionality also make obspyDMT an effective teaching aid. Written in the Python programming language, it can be used as a stand-alone command-line tool (requiring no knowledge of Python) or can be integrated as a module with other Python codes. It facilitates data archiving, pre-processing, instrument correction and quality control – routine but nontrivial tasks that can consume much user time. We describe obspyDMT’s functionality, design and technical implementation, accompanied by an overview of its use cases. As an example of a typical problem encountered in seismogram preprocessing, we show how to check for inconsistencies in response ﬁles of two example stations. We also demonstrate the fully automated request, remote computation and retrieval of synthetic seismograms from the Synthet-ics Engine (Syngine) web service of the Data Management Center (DMC) at the Incorporated Research Institutions for Seismology (IRIS).


Introduction
Seismology is a data-rich science, and since the advent of global digital networks in the 1990s, the growth of seismological waveform data holdings at international data centers has constantly accelerated.The data avalanche is a blessing, but also poses challenges to the scientist who needs to find and process these waveforms.Which data are available at the various international data centers?How can subsets of interest be selected, downloaded, organized, preprocessed, instrument-corrected and quality-controlled in a manageable amount of user time?Quality control and instrument corrections are nontrivial tasks, requiring tools that provide adequate diagnostics to verify data integrity.Almost every datadriven workflow in seismology begins with these considerations.As a project progresses, local data holdings often need to be updated, repaired, or extended, including the troubleshooting of earlier failed requests, adding waveforms made available since initial retrieval, adding (meta-)data from other data centers and downloading corrected metadata files.Surgical tasks of this kind can easily require more human supervision than the initial retrieval.
For a sense of data volumes, consider the example of Fig. 1, which arose in our work on global waveform tomography (Hosseini et al., 2014;Hosseini, 2016).Using the obspyDMT software, we queried the Incorporated Research Institutions for Seismology (IRIS) Data Management Center (DMC) about hour-long, broadband waveform segments containing earthquakes exceeding a magnitude of 5. Figure 1a plots the data center's response: since 1990, IRIS' event catalog lists 1000-3000 such events per year, visual- ized in obspyDMT's automatically generated map of Fig. 1b.The number of archived broadband channels has grown to almost 5200 in 2016, and we are offered more than 10 8 waveforms, corresponding more than 20 terabytes of data (and very long download times).Most applications would call for the selection of desirable subset of data before launching an actual request.
Besides large volumes, the hallmark of seismological data is heterogeneity.A culture of data sharing from permanent networks and temporary experiments means that waveforms get archived at many different data centers around the world in different waveform and metadata formats and documented and quality-controlled to varying degrees.Archives receive continuous inflows of data from telemetered stations, but also batchwise contributions from temporary experiments.Many experiments make metadata available immediately but restrict access to actual waveforms for several years.No general mechanism exists for broadcasting updates about data center holdings, which instead need to be actively and repeatedly queried by interested users.Data access mechanisms tend to be specific to each center.Downloading timecontinuous or very long seismograms may be less supported than downloading short segments around earthquake occurrences.
obspyDMT is free, open-source community software that strives to address these access challenges in a more comprehensive, integrated and time-saving manner than existing software, which includes WILBER, WebDC, BREQ_FAST, NetDC, EMERALD (West and Fouch, 2012), IGeoS (Morozov and Pavlis, 2011a, b), SOD (Owens et al., 2004) and ObsPyLoad (Scheingraber et al., 2013).It is an easy-to-use command-line tool for the query, retrieval and management of seismograms.The user is shielded from the complexities of interacting with different data centers and provided with powerful diagnostic tools to check the retrieved data and metadata and to execute most routine preprocessing tasks, including instrument corrections.ObspyDMT is written in the Python programming language and runs on Linux, Mac OS and Windows platforms.
Section 2 gives a high-level overview of obspyDMT's functionality in comparison to existing seismogram retrieval and management tools.Section 3 is a concise but nearcomplete tour that aims to turn the reader into a productive obspyDMT user very quickly while also listing all usage options.Section 4 discusses implementation and performance of features that set obspyDMT apart from existing tools, specifically its communication with data centers, its robustness and its diagnostics for instrument corrections.
All graphics in this paper were generated by obspyDMT.The caption of each figure gives the generating command(s) that handled the data and produced the plot.
2 Overview of software functionality obspyDMT is a stand-alone tool for data retrieval and management that is not associated with any one seismological data center, data exchange protocol, or data format.In a style similar to Unix shells, it issues a single, one-line command obspyDMT which produces a default behavior and can be customized with many different options flags.There are no required options, and the omission of an option flag will trigger default behavior.This makes obspyDMT robust to run and easy to learn.The possibilities for customization are extensive, as will be discussed in Sect.3. To give an idea, the command obspyDMT --datapath iris_events_dir --min_date 1990-01-01 --max_date 2017-01-01 --min_mag 5.0 --event_info --plot_seismicity downloaded a global seismicity catalog from the IRIS DMC, saved the metadata in a predefined directory structure and generated Fig. 1 as a diagnostic display of the result.Invoking obspyDMT without any flags would have requested from the IRIS event catalog metadata for all events since 1970 that exceeded a magnitude of 3.0.
obspyDMT is part of the ObsPy ecosystem (Beyreuther et al., 2010;Megies et al., 2011;Krischer et al., 2015), an open-source community project that develops Python software for seismological observatories under the GNU Lesser General Public License, hosted by the Ludwig-Maximilians-Universität Munich.ObspyDMT uses many of ObsPy's utility functions, as well as functions from Python's numpy, scipy and matplotlib libraries (Hunter, 2007), combining them into a more specialized piece of software.While no knowledge of Python is required to use obspyDMT, a software developer may seamlessly integrate it with other Python code.Python also makes it easy to wrap source codes written in other programming languages.For example, ObsPy wraps evalresp, IRIS' maturely developed software for instrument response corrections.ObspyDMT's functionality can be summarized as follows.
-Query of station metadata: by absolute time or relative to earthquake occurrences; by geographic area (rectangles or circles); by channel or instrument type; wildcarding (*) is supported; simultaneous queries of different data centers.
-Query of earthquake source metadata: from different catalog providers (currently from NEIC, GCMT (Global Centroid Moment Tensor), IRIS DMC, NCEDC, USGS, INGV and ISC); event origin information or full-moment tensors; by time window, region, event magnitude and/or event depth.
-Diagnostic plots to visualize metadata; plots are generated simply by appending an option flag to the datahandling command.
-Retrieval of actual waveform data (seismograms) according to the results of metadata queries.Support for different data exchange protocols (International Federation of Digital Seismograph Networks (FDSN) web services, ArcLink).
-Retrieval of time-continuous series of arbitrary length; generation of diagnostic log files.
-Parallelized retrieval of waveform data from a data center for increased speed.Simultaneous retrieval from different data centers.
-Update mode: identical or modified queries can be relaunched; only new, modified, or previously failed data will be retrieved from the data center(s).
-Tolerant of retrieval errors and missing data (includes diagnostic logs).
-Automatic organization of data, metadata and log files into standardized directory trees.(At present no tie to a database system.) -Processing of retrieved data sets using default or userdefined instructions.ObsPy, SAC (George Helffrich and Bastow, 2013) or any other processing tool can be used to customize the processing unit on the waveform level.Supports processing immediately upon waveform retrieval or later, batch-type processing.Support for parallel processing.
-Application of instrument responses.Support for various instrument formats (e.g., StationXML and dataless SEED).Diagnostic plots of analog and digital "filter stages".Option of parallelized instrument correction, taking advantage of multi-core architectures now common even on desktop processors.
-Automated retrieval of synthetic seismograms from IRIS' data services products (Hutko et al., 2017) for comparison to real data.
Various community software packages exist for achieving these tasks, but to our knowledge no other freely available package achieves them all.Table 1 compares the features of popular seismological community software to those of ob-spyDMT.We consider only tools that include functionality for data retrieval.
All data centers offer such tools, but each is limited to retrieving data from that specific center.For example, both the IRIS DMC in the US and ORFEUS Data Center (ODC) in Europe implement the web-form-based WILBER service for retrieving event-based waveforms, as well as the email-based BREQ_FAST service for time-continuous waveforms.If a user requires data from both centers, they need to be contacted separately.If event-based as well as continuous data   1 that provides access to several data centers (in a single command) and to both types of waveform data (in two separate commands).The demand for continuous time series, often in large quantities, has surged with the rapid rise in cross-correlation methods based on ambient noise (Shapiro and Campillo, 2004).ObspyDMT provides more convenient access than the email-based tools BREQ_FAST or NetDC.
obspyDMT is also the only tool to offer an "update" mode for waveforms, response files and/or metadata information: relaunching a previous request will identify and retrieve only data that could not be retrieved earlier.Like obspyDMT, the SOD, IGeoS and EMERALD tools are stand-alone software that runs on the user's computer rather than a data center server.All four communicate with data centers via the relatively new web services interfaces defined by the FDSN.Queries are formulated as URL strings (uniform resource locators) that point to physical data resources over the internet.We refer to this access method as "direct".Compared to older access methods, it can save much human intervention time by freeing the user from the need to click through web pages (WILBER, WebDC) or manage emails (BREQ_FAST, NetDC).SOD, IGeoS and EMERALD retrieve event-based waveforms only, i.e., queries are based on earthquake occurrences.
The stand-alone tools obspyDMT and EMERALD additionally manage the data download and archiving to a local computer, thus relieving users of additional tedious and timeconsuming steps.Both include certain plotting options (more extensively in obspyDMT).
obspyDMT also offers full instrument correction based on RESP or StationXML station metadata, combined with diagnostic plots of transfer functions for individual filter stages.SOD is the only other tool to offer instrument correction, but this includes gain correction only, and it offers no diagnostic plots.
obspyDMT is the only tool to provide an automated update functionality for a user's existing, local data holdings.

Guided tour of use cases
The purpose of this section is to turn the reader into a proficient user of obspyDMT in the short space of a few pages.We demonstrate the most common use cases for the query, selection, retrieval and management of seismograms, metadata and synthetic waveform.We list obspyDMT's full set of options in Table 2, which should be consulted as a crossreference during the various stops of this guided tour.
obspyDMT is a command-line tool that consists of a single command obspyDMT usually followed by option flags to modify the default behavior.Table 2 lists all available flag options, with explanations.

Querying earthquake metadata
First, we request event information from one of several supported seismicity catalogs, without downloading any waveforms yet.
--plot_seismicity triggers the generation of the global seismicity map plot of Fig. 2. --event_info switches off the retrieval of actual seismograms so that only metadata are downloaded to a local directory named neic_event_dir/ (argument of --data_path).This directory is created if necessary, and it is populated with the following subdirectory and files: Appended to the earlier command, this generates the map inset of Fig. 2b.Note the rendering of colored beach balls (deepest seismicity in the foreground).The global map of Fig. 2 also plots beach balls rather than simple black dots, but they do not become apparent at this zoom level.

Query of station metadata
Let's say we plan to investigate earthquakes exceeding a magnitude of 6.0 that occurred in this Indonesian rectangle at depths above 100 km.We want to know which seismometers in the Global Seismic Network (GSN) were operational to record them from 1 February to 1 December 2014.We issue the following query: obspyDMT --datapath event_based_dir --min_date 2014-02-01 --max_date 2014-12-01 --min_mag 6.0 --max_depth 100 --event_rect 80/135/-15/35 --event_catalog NEIC_USGS --net _GSN --cha BHZ --meta_data The NEIC event catalog returns 16 matching earthquakes, metadata for which are stored in 16 separate subdirectories of a local directory called event_based_dir.Each of the 16 event subdirectories holds a subdirectory called availability.txt to which metadata were written describing the GSN seismometers that were operational during the event.(Refer to Appendix A and Fig. A1 for a graphic depicting the full directory structure created by obspyDMT.)Only station metadata are requested, as specified by the mode flag --meta_data.We want StationXML files for (all) stations in the GSN network (--net _GSN), but only for the broadband, high-gain, vertical components of these stations, as specified by channel flag --cha BHZ.A subset of stations could be specified by the --sta flag, which supports wildcarding *, like many obspyDMT options.Since the option is absent here, it defaults to --sta * , i.e., all stations in the _GSN network.(See Table 2 for defaults for all options.)The underscore in --net _GSN marks this as a virtual network, whereas the two regular networks IU and II would be queried by --net "IU,II".

Requesting and retrieving waveform data in event-based mode
Next, we retrieve the actual BHZ seismograms from the GSN network that were recorded during the 16 Indonesian earthquakes identified in Sect.3.2.In our earlier obspyDMT command, only a few option flags need to be changed:

--tour
Run a quick tour.

--check
Check all basic dependencies and their installed versions on the local machine and exit.

--version
Show the obspyDMT version and exit.

--reset
If the datapath is found, delete it before running obspyDMT.

--continuous
Continuous time series request mode.

--local
Local mode for processing/plotting (no data retrieval).

--print_event_catalogs
Print-supported earthquake catalogs that can be passed as arguments to --event_catalog.

False --force_response
Retrieve response file(s), force override of any preexisting response files in local datapath directory.
In event_based mode, the reference time is the earthquake origin time by default but can be modified by --cut_time_phase.
In continuous mode, the reference time(s) is (are) specified by --interval option, and --preset prepends the specified lead to each interval (default: 0). "300" --offset <in sec> Time interval in seconds to include to the retrieved time series after the time reference.
In event_based mode, the reference time is the earthquake origin time by default but can be modified by --cut_time_phase.
In continuous mode, the reference time(s) are specified by --interval option, and --offset appends the specified offset to each interval (default: 1800). "3600"

--cut_time_phase
In event_based mode, use the first-arriving phase as reference time (i.e., P, Pdiff or PKIKP, determined automatically).Overrides the use of origin time as default reference time.

--resample_method <lanczos/decimate>
Resampling method: "decimate" or "lanczos".Both methods use sharp lowpass filters before resampling in order to avoid aliasing.If the desired sampling rate is 5 times lower than the original one, resampling will be done in several stages (default: "lanczos").
"  --event_catalog LOCAL searches for an existing event catalog on the user's local machine, in the EVENTS-INFO subdirectory of --datapath <PATH>.This is usually a previously retrieved catalog.
--read_catalog <PATH> Read in an existing local event catalog and proceed.Currently supported catalog metadata formats: "CSV", "QUAKEML", "NDK", "ZMAP".Process retrieved data based on processing instructions in the selected processing unit (default: "process_unit"). "process_unit_sac" --force_process Forces running of the processing unit on the local/retrieved data, overwriting any previously processed data in local datapath directory.

--plot_focal
Plot beachballs instead of dots for event locations.

--plot_ray
Plot the ray coverage for all station-event pairs found in the specified directory (--datapath).

--create_kml
Create a KML file for event/station/ray.KML format is readable by Google Earth.

--create_event_vtk
Create a VTK file for event(s).VTK format is readable by Paraview.

--plot_seismicity
Create a seismicity map and some basic statistics on the results.
If not specified, the starting date of the last channel in the stationXML will be used. "2010-01-01" --plotxml_output <DIS/VEL/ACC> Type of transfer function to plot: DIS/VEL/ACC (default: VEL)."DIS"

--plotxml_allstages
Plot all filter stages specified in response file.

--plotxml_paz
Plot only poles and zeros (PAZs) of the response file, i.e., the analog stage.

--plotxml_plotstage12
Plot only stages 1 and 2 of full response file.

--plotxml_start_stage <stage>
First stage in response file to be considered for plotting the transfer function (default: 1). "1"

--plotxml_end_stage <stage>
Final stage in response file to be considered for plotting the transfer function, (default: last stage given in response file or the 100th stage, whichever number is smaller).
"0.001" --plotxml_map_compare Plot all stations for which instrument responses have been compared (PAZ against full response).

--plotxml_percentage <percent>
Percentage of the phase transfer function's frequency range to be used for checking the difference between methods."100" will compare transfer functions across their entire spectral range, i.e., from min_freq (set by --plotxml_min_freq) to Nyquist frequency; "80" compares from min_freq to 0.8 times Nyquist frequency (default: 80).
"100" Others --email <email address> Send an email to the specified address after completing the job (default: false).obspyDMT --datapath neic_event_dir --min_date 1990-01-01 --max_date 2017-01-01 --min_mag 5.0 --event_catalog NEIC_USGS --event_info --plot_seismicity Global seismicity map of archived earthquakes in NEIC catalog of a magnitude of more than 5.0 that occurred between 1990 and 2016.One command queried the NEIC catalog, stored and organized the retrieved information and generated the seismicity map.(No actual waveform data were queried in this example.)The results of some basic statistics (magnitude and depth histograms) are also generated and plotted automatically (a).Note the rendering of colored beach balls in the map inset (deepest seismicity in the foreground).The global map also contains beach balls rather than just simple black dots, but they do not become apparent at this zoom level.
--data_source specifies explicitly that the IRIS DMC should be contacted, although this would also be the default if the flag were omitted.If the user is unsure, it is best to specify --data_source all, which prompts obspy-DMT to contact all 20 supported data centers listed in Table 3 and probably more in the future.(The list can be inspected by invoking obspyDMT --print_data_sources.) --preset 300 and --offset 3600 specify the retrieval of waveform time windows of 300 s before to 3600 s after the reference time.Since we are downloading in event-based mode, i.e., centered around earthquake occurrences, the reference time defaults to the event origin time.This could be changed to the time of P -wave arrival by invoking --cut_time_phase (see Table 3. List of international data centers that can be currently accessed via FDSN and ArcLink interfaces of obspyDMT.This list is growing as more and more data centers can be accessed directly (as opposed to FTP or email-based methods).obspyDMT --print_data_sources lists all available data centers, and --print_event_catalogs lists all available event catalogs.
ArcLink Many European data centers lute start time.ObspyDMT knows that it is downloading in event-based mode because this is its default mode; adding the flag --event_based would have made this explicit.
(--meta_data mode was introduced in Sect.3.2; the alternative modes of --continuous and --local will be demonstrated shortly.)Issuing this single-line command is the only requirement on user time; everything else is done automatically.Specifically, obspyDMT will do the following: 1. Request event information from the NEIC event catalog --event_catalog NEIC_USGS.
2. In the --datapath event_based_dir, create a subdirectory EVENTS-INFO/ containing a local catalog of metadata for the 16 matching events.Also in --datapath, create 16 event subdirectories, each containing a subdirectory tree (info/, resp/, raw/, processed/) as in Appendix A, Fig. A1.
3. Retrieve station metadata for all GSN stations for the 16 events in StationXML format from the IRIS data center and save these to subdirectories resp/.
4. Retrieve BHZ waveforms of 3900 s duration from all matching GSN stations in miniSEED format and save to subdirectories raw/.
5. Run default preprocessing operations on the waveforms, consisting of removing means and trends, tapering, fil-tering, and deconvolving the instrument response (all customizable).The processed seismograms are save to subdirectories processed/.
6. Save additional log files on query success to subdirectories info/.
Note how user time remains limited to issuing a single command no matter how many earthquakes, stations, or waveforms are being requested.Our tests required no human intervention even for very large requests that took weeks to download and encountered various time-outs or missing data issues at the data centers (cf.Sect.4.2).

Update of existing waveform data sets
In the course of working with a waveform data set, it often becomes necessary to update it.This could mean requesting the same data again (because part of the earlier request failed for some reason) or expanding the number of earthquakes, stations or seismograms.ObspyDMT aims to be smart about these various cases and not to retrieve duplicates unless the users explicitly wants it to.We demonstrate typical use cases.They have in common that the local --datapath directory must remain identical to that of any earlier request.
If the user wants to update only certain events, then --min_date, --max_date, --min_mag, --max_mag and/or --event_rect can be adjusted (see Table 2 for other options).Similarly, if the new date-time window is not contained within the old one, then additional events might fit the criteria and their waveforms would be added in new event directories.
If all 16 preexisting event directories are to be updated, an alternative to the above command is to remove all event criteria because obspyDMT will then default to the local, preexisting event catalog in EVENTS-INFO/ for earthquake metadata.
obspyDMT --datapath event_based_dir --net _GSN --cha BHZ --preset 300 --offset 3600 --instrument_correction --data_source IRIS If the user decides they need seismograms for all BHE channels (in addition to BHZ), the update command would be obspyDMT --datapath event_based_dir --net _GSN --cha BHE --preset 300 --offset 3600 --instrument_correction --data_source IRIS Augmenting the existing 16 events with seismograms from additional data centers is also an update operation because the waveform holdings of data centers often overlap to some extent.Again obspyDMT will automatically compare metadata in order to avoid downloading duplicates.To update the data set with all vertical broadband channels of the GFZ and ORFEUS data centers, we would request obspyDMT --datapath event_based_dir --cha BHZ --preset 300 --offset 3600 --instrument_correction --data_source "GFZ,ORFEUS" --datapath event_based_dir is identical to what we defined in the previous command line that specifies the name of the top directory.

Retrieval of waveform data in time-continuous mode (--continuous)
In contrast to the examples thus far, some usage cases require waveforms that are not relative to or centered on specific earthquake occurrences.We refer to this usage mode as "time continuous" (--continuous).For example, studies that cross-correlate ambient noise often require long time series from many stations, often divided into segments of shorter duration (i.e., 1 day).ObspyDMT makes the handling of continuous time series easy, even if the data sets are voluminous.
obspyDMT --continuous --datapath yv_continuous_dir --min_date 2012-12-15 --max_date 2013-01-15 --net YV --sta "RR0 * ,RR1 * ,RR2 * " --cha BHZ --sampling_rate 10 --data_source RESIF --user your_username --pass your_password This command queries the French RESIF data center for time series from 15 December 2012 to 15 January 2013 recorded by the temporary ocean-bottom seismometer network of the RHUM-RUM (Réunion Hotspot and Upper Mantle -Réunions Unterer Mantel) experiment (network code YV) (Barruol and Sigloch, 2013;Stähler et al., 2016).The wildcard "*" is used to specify multiple station names.Since the data are embargoed until the end of 2017, a username and password needed to be passed to the data center (--user, --pass).Here we were interested in noise levels on the ocean floor during the passage of tropical storm Dumile and therefore requested waveforms for the storm period, highlighted by the yellow box in Fig. 3.The storm was clearly recorded by elevated noise levels, whose variable onset times track the storm's diachronous passage across the 1500 km × 1500 km wide network (Davy et al., 2014).
Long time series often need to be downsampled for ease of storage and handling, in this case to 10 Hz from originally 50 Hz (--sampling_rate 10).ObspyDMT uses ObsPy functionality for resampling to any rate; if the frequency ratio is large, antialiasing and downsampling are automatically done in multiple stages.

Speeding up data retrieval by parallelization
obspyDMT uses ObsPy clients to retrieve metadata and actual waveforms from the data centers.Every request consists of three basic steps: (1) connect and send the data request to the data center; (2) download the data; (3) disconnect.By default, obspyDMT executes these steps for every metadata or waveform request separately, e.g., 3 × 1000 steps if 1000 waveforms are requested.For large requests, this can become a serious bottleneck.To increase the efficiency in such cases, a functionality for parallelized data retrieval can be enabled as follows: --req_parallel --req_np 4 The first flag changes the data retrieval mode from serial (default) to parallelized, and the second flag specifies the number of parallel requests.
The parallelization in obspyDMT is implemented on two levels: data center and waveforms.As an example of the former, if waveform data from both ORFEUS and IRIS are requested, obspyDMT sends parallel requests to these data centers.
The other parallelization is at waveform level: if several waveforms are requested from one data center, they are retrieved by --req_np parallel processes.(A good choice for np is the number of CPUs on the retrieving computer, i.e., 4 to 16 for many current laptops or desktops.)The number of requested waveforms or metadata files will be divided into the number of specified processes.Each process then sends and retrieves its set of requests serially, but all processes organize their data into the same --datapath directory.
Further speeding up can be achieved by specifying a bulk request (--bulk flag).Instead of requesting individual items, this will send a list of items (time series or metadata) to the data center, which reduces the number of (dis-)connections.We have, however, noticed occasional instabilities (for very large requests, fewer waveforms are retrieved than in serial mode); hence, serial is set as the conservative default.

Plotting tools
obspyDMT offers various plotting tools for visualizing data sets.Figure 2 demonstrates the plotting of seismic sources (beach balls) on a map, via the --plot_seismicity option.
Figure 4 demonstrates a map plot of ray paths between sources and receivers for the Indonesian example data set of Sect.3.1 to 3.4 in Google Earth:

Processing and instrument correction
obspyDMT can process the waveforms directly after retrieving the data, or it can process an existing data set in a separate step (local mode).By default, obspyDMT follows processing instructions described in the process_unit.pyfile located in the /path/to/my/obspyDMT/obspyDMT directory.This scripting file can be freely edited by the user and may include calls to external waveform processing programs such as ObsPy or SAC.This vastly expands the possibilities for waveform processing and lets users easily adapt and integrate functionality from earlier, non-obspyDMT workflows.Instructions in this file are written at the waveform level, and obspyDMT applies them to all waveforms in the entire data set (in serial or in parallel mode).The default file included in the current distribution, /path/to/my/obspyDMT/obspyDMT/ process_unit.py,can perform routine processing steps such as resampling, data format conversion and instrument correction.These steps can be accessed via dedicated option flags, each of which results in the execution of only the appropriate part of processing script process_unit.py(see --pre_process option flag).Hence, a user requiring only these routine operations need not create or modify a processing script file.The operations include 1. resampling time series, for example, downsampling for ease of storage and handling (refer to Sect.3.5 and --sampling_rate option flag) 2. converting the format of retrieved waveforms to SAC and filling in some headers by the simple inclusion of the --waveform_format sac option flag 3. instrument correction which includes removing means and trends, tapering, prefiltering (customizable by --pre_filt option flag) and deconvolving the instrument response to displacement, velocity or acceleration (all customizable).
The user can also modify the process_unit.pyor write a new script with new processing instructions.Currently, these files need to be located in the /path/to/my/obspyDMT/obspyDMT directory and can be accessed via --pre_process my_proc_unit option flag, replacing my_proc_unit with the name of the Python script.The instructions are written at the waveform level, and obspyDMT automatically applies them to all archived waveforms.The main advantage of this design choice is its flexibility.The user can customize the processing instructions using available tools in ObsPy; moreover, other processing tools can be used or combined to write these instructions.As an example, the following command line calls a processing instruction process_unit_sac.py;this file is located in /path/to/my/obspyDMT/obspyDMT: obspyDMT --datapath event_based_dir --local --force_process --pre_process process_unit_sac Here, SAC (instead of ObsPy) is used to remove the mean, apply a Hanning window, compute the FFT (fast Fourier transform), plot the amplitude spectrum of each waveform on a log-log plot and save the images as PDF files in the processed directory.
The option flags --min_azi, --max_azi, --min_epi and --max_epi specify minimum azimuth, maximum azimuth, minimum distance and maximum distance for station search, respectively.The synthetic waveforms are stored in the syngine_prem_a_2s directory, the contents of which can be plotted by obspyDMT plotting tools (refer to Fig. 5).
obspyDMT stores station information in one ASCII file per event and in the SAC headers (if this waveform format is selected).It automatically updates metadata information and log files of a local data archive if stations are added/removed.Event information is written in QuakeML and ASCII formats.Although basic source and receiver information can be retrieved from most data centers, moment tensor solutions are available only in certain seismicity catalogs, among them the NEIC and GCMT catalogs, which are both supported by obspyDMT (refer to moment tensor retrieval as demonstrated by Fig. 2).
In summary, obspyDMT retrieves, organizes and stores all meta-information required to compute synthetic seismograms using arbitrary forward-modeling tools.Users only need to provide scripts that connect this metadata input to their desired computational engine (other than Syngine), for example, AxiSEM (Nissen-Meyer et al., 2014) or Instaseis (van Driel et al., 2015).
www.solid-earth.net/8/1047/2017/Solid Earth, 8, 1047-1070, 2017 Here we discuss implementation and performance issues, specifically obspyDMT's communication with data centers, its robustness in the case of large and heterogeneous requests, and the usefulness of the instrument correction diagnostics.All three features set obspyDMT apart from existing tools.

Communication with data centers
obspyDMT can retrieve data from a multitude of international data centers (Table 3; a list that is growing).The user is shielded from having to know communication specifics for each data center.Under the hood, the software implements ObsPy clients for two different kinds of data exchange protocols: FDSN web services and ArcLink.
In 2013, the FDSN defined common web service interfaces (http://www.fdsn.org/webservices/),allowing data request tools to work with any of the growing number of FDSN data centers that implement these interfaces (http://www.fdsn.org/webservices/datacenters/).These centers currently include the IRIS DMC, BGR, EMSC, ETH, GEONET, GFZ, INGV, IPGP, ISC, KOERI, LMU, NCEDC, NIEP, NOA, ODC, ORFEUS, RESIF, SCEDC, USGS and USP.Three service interfaces are specified by the FDSN and supported by ObsPy: fdsnws-station for accessing station metadata in StationXML format, fdsnws-dataselect for accessing time series in miniSEED format, and fdsnws-event for accessing earthquake parameters in QuakeML format.ObspyDMT offers conversion to other formats, e.g., SAC for waveforms --waveform_format sac.Requests are sent via the HTTP internet protocol for individual requests and via HTTP-POST for lists of requests, so that data can be requested from any web browser by generating URLs.
ArcLink is an older data request protocol that arose in Europe in order to virtually consolidate distributed seismological data holdings across various European countries.It is a distributed request protocol developed by the German WebDC initiative of GEOFON and BGR (Bundesanstalt für Geowissenschaften und Rohstoffe) as a continuation of the NetDC concept originally developed by the IRIS DMC.ArcLink communicates via TCP/IP rather than via supervision-intensive email or FTP requests required by other access mechanisms at the time.It accesses waveform data in miniSEED or SEED format and associated metainformation as dataless SEED files.At the time we developed ObsPyLoad, a pre-cursor of obspyDMT (Scheingraber et al., 2013), only a few data centers were implementing FDSN web services.Hence, ArcLink clients greatly expanded the reach of ObsPyLoad, to include most European data centers.ObsPyLoad contacts the ORFEUS DMC via ArcLink, which in turn "forwards" ArcLink requests to other data centers across Europe.This ArcLink functionality is retained in obspyDMT, but if a data center implements both interfaces, then obspyDMT accesses it via web services (default), which now includes the European data centers.It seems likely that web services will completely supersede ArcLink.

Robustness of data retrieval
In our research we have used obspyDMT extensively, in order to retrieve several voluminous, event-based data sets for global-scale tomography, from different combinations of data centers.We have also requested large volumes of timecontinuous data ("ambient noise") for cross-correlation studies.In all cases, we observed obspyDMT to work stably, i.e., requiring no user intervention despite the fact that many individual waveform requests encounter errors from the data centers, for various reasons.ObspyDMT caught all exceptions and continued undeterred.
In a demanding test that expanded the scope of the example of Sect.3.3, we retrieved all BHZ channels from all supported data sources, in event-based mode, requesting earthquakes exceeding a magnitude of 6.0 that occurred during 2 years.The idea was to test the most challenging request mode, which includes station and event metadata, and to communicate with all data centers, including some that implemented web services very recently.
This finding is consistent with the performance of obspy-DMT's predecessor ObsPyLoad (Scheingraber et al., 2013).With an event-based request similar to the one above to all data centers available at the time (in 2012 this was IRIS and the European centers via ORFEUS/ArcLink), we retrieved 162 GB of waveform data, consisting of 690 503 miniSEED files for three components (BHZ, BHE and BHN) for 154 events.The retrieval took 45 days because the job slowed down considerably after the first 73 GB (but continued at the old speed after relaunching, i.e., requesting the remaining 89 GB through update mode).The fraction of successfully retrieved waveforms varied strongly between data centers and ranged from 99.8 to 34.8 % (availabilities were verified by spot checks in manual retrieval attempts).The exact reasons for the slowdown remained unclear, but aside from the decision to relaunch, no user intervention was required at either download stage.
For the current test in 2017, no such slowdown was observed, and the retrieval of a comparable data volume (145 GB) took only a 1 / 20 of the time (2.5 days), despite being routed to many more data centers.We conclude that obspyDMT works robustly with all supported data centers, even for large and heterogeneous data and metadata requests.

Instrument correction
If station metadata could be routinely trusted, correcting for instrument responses would amount to a simple series of deconvolutions of a number of impulse responses (analog and digital filter stages from raw waveforms).Unfortunately, it is not uncommon for filter information in station metadata files to be erroneous.Some of the resulting artifacts in the displacement or velocity seismograms are large enough to potentially cause serious geoscientific misinterpretation, such as pronounced travel time delays under an isolated island station where in reality there are none.
Problems with the contents of StationXML or SEED/RESP files may or may not be straightforward to identify, as discussed below.A full visual representation of filter impulse responses can greatly facilitate trouble shooting.ObspyDMT implements several plotting options for this purpose, as demonstrated in  An instrument response typically consists of a first, analog stage (a.k.a."poles and zeros", or PAZ stage), which describes the transfer function of the sensor, and several digital stages, which describe the A/D conversion, antialiasing and downsampling inside the data logger.The PAZ stage is rarely problematic, whereas specifications of the digital stages are error-prone.Our discussion of neuralgic points and their possible diagnosis follows the PhD thesis of Groos (2010).
Coefficients of asymmetric FIR filters are sometimes given in reverse order from that expected by the SEED convention, which can cause erroneous time delays of up to 1 s in the "corrected" waveforms.This issue may not be easy to detect as it requires knowledge of the correct order of filter coefficients, e.g., by comparing it to a trusted StationXML file describing the same data logger in a different location.
A typical, unproblematic response resembles Fig. 6, with PAZ and full response coinciding everywhere except near the Nyquist frequency.By contrast, a plot like Fig. 7   a potential problem.The very different phase responses of PAZ-only versus full response indicate that the digital stages introduce a significant delay (and possibly distortion) of the corrected time series.The user can then question whether this behavior is expected from the data logger.ObspyDMT automatically creates diagnostic reports for stations where PAZ and full response differ significantly.Figure 8 further zooms in on the issue, by indicating that among the digital stages, only Stage 5 has a non-zero phases response, identifying it as the questionable one.If the user decides that the digital stage specifications are suspect, they can choose to apply PAZ-only correction rather than full response -this should give a decent result, except for frequencies very close to Nyquist.Alternatively, if the user is working with lowfrequency data only (below 0.01 Hz), they can conclude that no problem would ever arise because even Stage 5 is almost 0 in that spectral range.
Another recurring problem concerns delay time values specified for the FIR filter stages.According to the SEED manual, corrected filter delay times have to be positive; and yet, negative or 0 values are sometimes encountered in retrieved metadata files which can result in erroneous time shifts of 1 to 2 s in corrected waveforms.This problem is easily spotted, but 7 years after the report by (Groos, 2010), we still encounter such response files delivered by data centers.
obspyDMT also checks for inconsistencies in the "estimated delay" and the "correction applied" of the digital filter stages.In modern data loggers, these two values are usually similar because delay times are removed from the waveforms internally.However, discrepancies have been observed, such as negative or 0 values for the corrected delay time.In the example of Fig. 7, the estimated delay is reported as 0.63 s, and the applied correction is 0.0 s.ObspyDMT collects this information and automatically generates one diagnostic report for the results of all consistency checks.

Conclusions
We presented obspyDMT, a new software for the query, retrieval, processing and management of large seismological data sets.Its functionality, design and technical implementation were described and compared with existing seismological data retrieval and management tools.Through examples we demonstrated its main functionalities, such as a query of station and earthquake source metadata (full-moment tensor and event origin), the retrieval of event-based or timecontinuous waveform data from various data centers in one command line, an update mode, a customizable processing unit, and the automatic organization of (meta-)data and log files into standardized directory trees.The user is provided with powerful diagnostic and plotting tools to check the retrieved data and metadata.For large seismological data sets, data retrieval and processing can be parallelized on multicore architectures by the simple inclusion of an option flag.Using obspyDMT's diagnostic plots of analog and digital filter stages, we checked the spectra (amplitude and phase) of instrument response files.Synthetic seismograms matching an example data set were retrieved from IRIS Syngine.
In all these use cases, issuing a single-line command is the only requirement for the user, everything else is done automatically.
Refer to Appendix C for instructions on how to download and install obspyDMT.

Figure 1 .
Figure 1.obspyDMT --datapath iris_events_dir --min_date 1990-01-01 --max_date 2017-01-01 --min_mag 5.0 --event_info --plot_seismicity Rapid growth of seismological waveform data holdings at international data centers since 1990.Using the obspyDMT command above, we queried the IRIS DMC for hour-long, vertical, broadband (BHZ and HHZ) waveform segments containing earthquakes exceeding a magnitude of 5.0.(a) The data center's response.Red line shows cumulative sum of available event-based waveforms for this request; year y=1990 num_events(y) × num_channels(y) .Number of events and seismograms in each year are shown by dotted and solid blue lines, respectively.(b) Global seismicity map of earthquakes in panel (a) colored by depth.Red: 0-70 km; green: 70-300 km; blue: ≥ 300 km.The generation of this map is triggered by the --plot_seismicity flag.Upon startup of the plotting module, the user can select the map style, "Shadedrelief" in this example.

Figure 4 .
Figure 4. obspyDMT --datapath event_based_dir --local --plot_ev --plot_focal --plot_sta --plot_ray --create_kml Plot of the contents of the --datapath event_based_dir that contains the Indonesian example data set generated in Sects.3.1 to 3.4.--local specifies that the existing, local waveform holdings should be plotted, rather than contacting the data centers anew.Sixteen earthquake locations are plotted as beach balls; stations featuring BHZ channels are indicated by yellow markers.Waveforms were retrieved from three data centers (IRIS, ORFEUS, GFZ).
Figure 6.obspyDMT --plot_stationxml --plotxml_paz --plotxml_min_freq 0.0001 --datapath /path/to/STXML.IC.XAN.00.BHZ Transfer function spectra (amplitude and phase) of a Streckeisen STS-1VBB w/E300 station (IC.XAN) in China.Blue lines show the transfer function components computed for all filter stages in a StationXML file; red lines are for the analog part.The two functions match very well in all frequencies except for the amplitude spectra close to the Nyquist frequency (dashed line).
Figure 7. obspyDMT --plot_stationxml --plotxml_paz --plotxml_min_freq 0.0001 --datapath /path/to/STXML.GT.LBTB.00.BHZ Transfer function spectra (amplitude and phase) of a Geotech KS-54000 borehole seismometer (GT.LBTB) in Botswana.Blue lines show transfer function components computed for all filter stages in the StationXML file; red lines are for the analog part.A large discrepancy exists between the phase spectra of the two transfer functions.The deviation emerges at frequencies around 10 −2 Hz and increases up to the Nyquist frequency.Fig.8shows that this difference is caused by one of the digital stages in the instrument response.

Figure 8 .
Figure 8. obspyDMT --datapath /path/to/STXML.GT.LBTB.00.BHZ --plot_stationxml --plotxml_min_freq 0.0001 --plotxml_allstages Transfer function spectra (amplitude and phase) of each stage in the StationXML file of a Geotech KS-54000 borehole seismometer (GT.LBTB) in Botswana.In the phase response, two stages (1 and 5) have non-zero values.Both stages contribute to the phase spectrum of the complete instrument response ("full-resp") of Fig. 7.However, the effects of Stage 5 on amplitude and phase spectra are not considered in PAZ (analog).

Table 1 .
Comparison of seismological data retrieval and management tools.Abbreviations: E -event-based; C -continuous time series; Uupdate mode.ObspyDMT is the only tool to provide access to both FDSN and ArcLink (in a single command), to retrieve both event-based and time-continuous waveform data, and to offer an "update" mode for waveforms, response files and/or metadata information.Few other tools provide for the management of data download and archiving, instrument correction, or diagnostics plots.EIDA: European Integrated Data Archive.

Table 2 .
Complete list of option flags to customize the default behavior of the obspyDMT command.

Table 2 .
Complete list of option flags to customize the default behavior of the obspyDMT command.

Table 2 .
Complete list of option flags to customize the default behavior of the obspyDMT command.