SAM Data Models
The SAM Data Model (and Database) contains the following entities.
For a complete pictorial representation of the model,
see the Full E/R Diagram
(PostScript, GIF, PDF)
or examine various views of the model.
Runs, Events, and the Data File Catalog
- A Run identifies a set of data acquired sequentially with a single set of
calibration and detector constants. A Run is identified by an increasing
run number and described by details such as its begin/end time,
description, and run conditions. A Run is also classified by a distinct
Run Type, such as "monte carlo", "physics data taking", "detector",
"calibration", etc. Run numbers will be unique per run type, which allows
for the possibility that Monte Carlo data imported from an external
collaboration may use the same run numbering scheme as the D0 Detector.
- Each Run produces many physics Events, which are described by a unique
event number and their defining trigger bit settings and event filters.
The raw data for each event is stored in the raw data tier with one file
containing multiple events. Subsequent processing on the events results
in the creation of additional files for the other data tiers: fully
reconstructed (EDU250), summary reconstructed (EDU50/EDU150), and highly
compact summary physics data (EDU5/Thumbnail).
- The numerous Data Files created for a run each contain information regarding
many events. Data Files are created by the On-Line system as Events are
captured at the detectors. Data Files are also created by Off-Line data
analysis, as physicists find interesting event phenomena that they want to
keep in a file to ease subsequent reference. Data Files are identified by
their filename and described by attributes such as an event range, trigger
information, sequence in the run, create time, data tier (edu type), format
information, and file status. The Data File Status denotes whether the file
is available (in a valid tape or disk location), being imported (in the
process of being written to tape), deleted, lost, or unavailable.
- The Event Catalog provides a means to find related events and files.
Currently this catalog is modeled as a direct event-to-file mapping, but
to reduce the database storage required, alternatives to storing a
separate key entry for each event-to-file pair are being considered.
For example, the start and end event ids for a file might define the map.
The exact implementation of this mapping technique will be determined as
the details regarding the size and access times of the database become known.
Event Catalog Types are also being considered to classify the types of event catalogs.
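The range-based alternative mentioned above can be sketched as follows: rather than one catalog row per event-to-file pair, store only each file's start and end event ids and search the ranges at lookup time. This is a hypothetical sketch, not the SAM implementation:

```python
import bisect

class EventCatalog:
    """Range-based event-to-file map: one entry per file, not per event."""

    def __init__(self):
        # Sorted list of (first_event, last_event, filename) tuples.
        self._ranges = []

    def add_file(self, first_event, last_event, filename):
        bisect.insort(self._ranges, (first_event, last_event, filename))

    def files_for_event(self, event_id):
        # Ranges may overlap (e.g. the same events at different data
        # tiers), so scan every range whose start is <= event_id.
        hi = bisect.bisect_right(self._ranges, (event_id, float("inf"), ""))
        return [name for lo_ev, hi_ev, name in self._ranges[:hi]
                if lo_ev <= event_id <= hi_ev]

catalog = EventCatalog()
catalog.add_file(1, 100, "raw_001")
catalog.add_file(101, 200, "raw_002")
catalog.add_file(1, 200, "edu250_001")   # reconstructed tier spans both
```

This trades per-event lookup rows for a small search at query time, which is the storage/access-time balance the text says is still being evaluated.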
- Each file will exist in one or more Data Storage Locations. A file may
reside on a tape volume, be cached on station disk, or be buffered on other
disk managed by SAM. While the tape volume may be considered the "source"
of the file, during its lifetime in SAM a file may exist in multiple
locations. The Data File Locations track which Data Files are on the
various volumes and disks. The full path to each file is recorded, along
with the following for the various storage location types. For files on
tape, the full Tape Location is recorded, along with the sequence number on
the volume, the volume tape label, and the data format. For files on disk,
the Data Disks (Station Disks or Other Disks) record the node and disk
mount point, and the Data Storage Location (Station Disk Location or Other
Disk Location) records the full path to the file. Station Disks also
include the related station.
- For each data file on tape, the Volume is recorded in SAM. For each volume, the tape label, data format, volume status, volume location, and Volume Type are recorded. Volume types include shelf, tape, ait, etc.
- Each file is classified by a Data Tier, which defines the type of data
content in the file. The Data Tier describes the level of consolidation
that the file represents: raw data,
fully reconstructed (EDU250), summary reconstructed (EDU50/EDU150), highly
compact summary physics data (EDU5/Thumbnail), or unknown/other. More
data tiers can easily be added as needed.
- Physics analysis is performed over physical streams of events.
These Physical Data Streams
are distinguished by a particular combination of trigger and filter bit
settings, which are depicted as Trigger Streams.
It is expected that there will be about 32 data streams of interest. The list of these streams is found in the Trigger List.
Each Physical Data Stream identifies the files that comprise the stream,
whereas a Logical Data Stream
provides a means of grouping Physical Data Streams.
- A Dataset consists of a set of data useful from a physics perspective.
It may be a set of data constructed by an analysis process, or it may be
a set of data used by an analysis process.
Physical Datasets identify
a particular set of files of interest in the Data File Physical Dataset entity.
Logical Event Sets identify
a particular set of events of interest.
- The entities of Person and
Working Group are used for tracking user
activity. The Person Working Group combination records that each Person may belong to one or more Working Groups.
Physics analysis is performed by Persons working on behalf of a particular Working Group, or rather by a Person Working Group combination.
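The Person Working Group combination is a standard many-to-many association; a minimal sketch (class and method names are hypothetical):

```python
class PersonWorkingGroup:
    """Many-to-many association between Persons and Working Groups."""

    def __init__(self):
        self._pairs = set()  # set of (person, working_group) combinations

    def add(self, person, group):
        self._pairs.add((person, group))

    def groups_of(self, person):
        return {g for p, g in self._pairs if p == person}

    def members_of(self, group):
        return {p for p, g in self._pairs if g == group}

pwg = PersonWorkingGroup()
pwg.add("alice", "higgs")
pwg.add("alice", "top")      # a Person may belong to several groups
pwg.add("bob", "higgs")
```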
Defining and Running Analysis Projects
To start a project, first the Project Definition is created. The Project
Definition includes the detailed criteria used to select the files/events
of interest for the physics analysis at hand. When creating a Project
Definition, the user is shown a summary of the Data Files selected by the
definition, including the number of files, average file size, etc.
- After reviewing the Project Definition and deciding it is a worthwhile set of criteria to keep for later use, or maybe even to start a running project now, the user then saves this as a Project Snapshot. When saving a Project Snapshot, the complete list of Data Files which are identified by the Project Definition at that point in time are recorded as the Project Files.
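The distinction between a Project Definition (reusable criteria) and a Project Snapshot (the file list frozen at save time and recorded as the Project Files) can be sketched as follows. The names are hypothetical, and the criteria are simplified to a predicate over file-metadata records:

```python
from datetime import datetime, timezone

class ProjectDefinition:
    """Reusable selection criteria; re-evaluated against the catalog."""

    def __init__(self, name, criteria):
        self.name = name
        self.criteria = criteria  # predicate over a file-metadata dict

    def matching_files(self, catalog):
        return [f["filename"] for f in catalog if self.criteria(f)]

class ProjectSnapshot:
    """Freezes the definition's matching files as the Project Files."""

    def __init__(self, definition, catalog):
        self.definition = definition
        self.taken_at = datetime.now(timezone.utc)
        # Files added to the catalog later will NOT appear here:
        self.project_files = definition.matching_files(catalog)

catalog = [
    {"filename": "f1.raw", "data_tier": "raw"},
    {"filename": "f2.edu250", "data_tier": "EDU250"},
]
definition = ProjectDefinition("raw-only", lambda f: f["data_tier"] == "raw")
snapshot = ProjectSnapshot(definition, catalog)
```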
- Starting an Analysis Project requires a Person working in a particular
Working Group to identify the Project Snapshot from which to start the
Analysis Project, and the Station on which to run it. When running, the
Analysis Project will retrieve the Project Files and serve them to all
Consumers registered with the Project. For running projects, the
initiating person and working group are recorded, along with the station,
node, and operating system process id of the running project. All projects
will reflect their current Analysis Project Status, e.g. running, complete.
Stations and File Caching
The Station is responsible for controlling and monitoring
the currently running projects, including managing the local disk cache, copying
files from the tape vaults to local disk, optimizing concurrent requests per its
known set of resources, etc.
The station manages the file cache by recording the date and time that all
Cached Files were placed into cache and removed from cache. The history of
all Cached File Status changes is maintained, allowing the station to
monitor its own performance. The station also tracks all usages of cached
files by analysis projects as Cached File Project Usages, recording the
start and end time of each usage. Additionally, the station allows groups
to lock files in cache for later use. Tracking these Cached File Locks
allows frequently used files to be kept in cache, minimizing robot arm
activity.
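The interaction between cache eviction and Cached File Locks can be illustrated with a toy least-recently-used cache that never evicts locked files. Class and method names here are hypothetical; the real station's behavior is governed by its configured Cache Policy Types and Station Group Rules:

```python
class StationCache:
    """Toy LRU file cache in which group locks pin files in place."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._files = []   # ordered oldest-first (LRU order)
        self._locks = {}   # filename -> set of groups holding locks

    def lock(self, filename, group):
        self._locks.setdefault(filename, set()).add(group)

    def add(self, filename):
        if filename in self._files:
            self._files.remove(filename)  # refresh LRU position
        self._files.append(filename)
        while len(self._files) > self.capacity:
            # Evict the oldest UNLOCKED file; locked files stay cached,
            # avoiding a re-fetch from tape (and robot arm activity).
            for f in self._files:
                if not self._locks.get(f):
                    self._files.remove(f)
                    break
            else:
                break  # everything is locked; tolerate over-capacity

    def cached(self):
        return list(self._files)

cache = StationCache(capacity=2)
cache.add("a")
cache.lock("a", "higgs")   # the higgs group pins file "a"
cache.add("b")
cache.add("c")             # over capacity: "b" is evicted, not "a"
```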
The station performs further optimization and policy controls by following the rules specified in the Cache Policy Types and the Station Group Rules. These rules are maintained for each group by users identified as Station Group Admins. The list of allowed station group admins is maintained by those users identified as Station Admins. These station admins must first be defined by the SAM administrative team for each station. Contact email@example.com to request the addition of a station admin. These requests will be reviewed with the D0 collaboration before adding any station admins.
Consuming and Producing Files
- The Person Working Group combinations that participate in an active analysis project by consuming the files it delivers are known as Consumers.
As the project runs, additional Consumers may join the project and use the
files it returns. Consumers who have "fallen off" the project, either because
they left or their processes died, may also re-join the project if needed.
For each Consumer, the Process, or more specifically, the
Analysis Process information is recorded,
including the Node name, operating system type (Oper Sys Type), the Hardware Type on which the process is run and its current Process Status. The Application Family is also recorded for each process, which indicates the family, name and version of program run by the process.
As Data Files are consumed, the consumer and analysis process that is accessing each
file is tracked via a Consumed File list.
As a part of this listing, the Consumed File Status of the consumer's file
access is recorded, to indicate whether they are reading the file, done
with the file, etc.
The list of Missed Events tracks events that are missed during file consumption. Events are missed if files are unavailable, or other technical glitches occur.
- As well as acting as consumers of files, analysis processes may actually be Producers of new data files.
As data files are produced, the processes that produce them are recorded in the
relationship between data files and processes.
If new Data Files are produced by the splitting of a data file into multiple
files, or the merging of multiple data files into one new file, the
File Lineage is recorded. This File Lineage provides the ability to know
exactly which files were merged or split into which new files.
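One way to picture the File Lineage is as a set of parent-to-child edges, which covers both merges (many parents, one child) and splits (one parent, many children). A hypothetical sketch:

```python
class FileLineage:
    """Parent/child edges between data files, traversable both ways."""

    def __init__(self):
        self._parents = {}   # child filename  -> set of parent filenames
        self._children = {}  # parent filename -> set of child filenames

    def record(self, parents, children):
        # A merge has many parents and one child; a split has one
        # parent and many children. Both are just sets of edges.
        for c in children:
            self._parents.setdefault(c, set()).update(parents)
        for p in parents:
            self._children.setdefault(p, set()).update(children)

    def parents_of(self, filename):
        return self._parents.get(filename, set())

    def children_of(self, filename):
        return self._children.get(filename, set())

lineage = FileLineage()
# Merge two raw files into one summary file:
lineage.record(parents=["raw_001", "raw_002"], children=["edu50_001"])
# Split that summary file into two stream files:
lineage.record(parents=["edu50_001"], children=["mu_001", "el_001"])
```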
Import Processes are another type of process that will produce data files. Import Processes load data created from external sources into the SAM system. For example, Monte Carlo data created at other institutions is loaded into SAM by Import Processes. The Import Process will denote the Person and Working Group submitting the external data, the Run to which it relates, the Application Family which was run to create this external data, and the actual Data Files being loaded by this Import Process.
Data File Parameters record character large objects (CLOBs) for the Monte Carlo import data. The entire contents of the parameter files are stored in the SAM database.
Analysis Projects may be run for specific Pick Events. The current data model depicts Pick Events related to Analysis Projects. Additional relationships from Pick Events to the actual Events and other entities will be needed as we work out the details on the implementation of Pick Events.
The details on Luminosity are not included in the overall SAM model. These details are available in other documents. They will be incorporated in the SAM model as needed.