Schema Documentation

This document describes the tables that make up the Hive schema. Tables are grouped into categories, and the purpose of each table is explained.

Pipeline structure

hive_meta
pipeline_wide_parameters
analysis_base
analysis_stats
dataflow_rule
analysis_ctrl_rule

Resources

resource_class
resource_description

Job-related

job
job_file
accu
analysis_data

Execution tables

worker
role

Logging and monitoring

worker_resource_usage
log_message
analysis_stats_monitor

Pipeline structure

hive_meta

This table keeps several important hive-specific pipeline-wide key-value pairs such as hive_sql_schema_version, hive_use_triggers and hive_pipeline_name.

Column	Type	Default value	Description	Index
meta_key	VARCHAR(255)	-	the KEY of KEY-VALUE pairs (primary key)
meta_value	TEXT	NULL	the VALUE of KEY-VALUE pairs
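
As a rough illustration, the key-value layout can be read into a dictionary with a single query. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a real Hive database; the table definition is simplified, the helper name load_hive_meta is hypothetical, and the sample values are invented.

```python
import sqlite3

def load_hive_meta(conn):
    """Return all hive_meta rows as a {meta_key: meta_value} dict."""
    rows = conn.execute("SELECT meta_key, meta_value FROM hive_meta")
    return dict(rows)

# In-memory stand-in for a Hive database (assumption: simplified DDL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hive_meta (meta_key VARCHAR(255) PRIMARY KEY, meta_value TEXT)"
)
# Sample values are invented for illustration only.
conn.executemany(
    "INSERT INTO hive_meta VALUES (?, ?)",
    [("hive_pipeline_name", "demo_pipeline"), ("hive_use_triggers", "0")],
)
meta = load_hive_meta(conn)
print(meta["hive_pipeline_name"])
```

The same pattern would apply to pipeline_wide_parameters, which has the same two-column key-value shape.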

pipeline_wide_parameters

This table contains a simple mapping from pipeline-wide parameter names to their values. The same data used to live in the 'meta' table until both the schema and the API were finally separated from Ensembl Core.

Column	Type	Default value	Description	Index
param_name	VARCHAR(255)	-	the KEY of KEY-VALUE pairs (primary key)	key: value_idx
param_value	TEXT	NULL	the VALUE of KEY-VALUE pairs

analysis_base

Each Analysis is a node of the pipeline diagram. It acts both as a "class" to which Jobs belong (and inherit from it certain properties) and as a "container" for them (Jobs of an Analysis can be blocking all Jobs of another Analysis).

Column	Type	Default value	Description	Index
analysis_id	INTEGER	-	a unique ID that is also a foreign key to most of the other tables
logic_name	VARCHAR(255)	-	the name of the Analysis object	unique key: logic_name_idx
module	VARCHAR(255)	NULL	the Perl module name that runs this Analysis
parameters	TEXT	NULL	a stringified hash of parameters common to all jobs of the Analysis
resource_class_id	INTEGER	-	link to the resource_class table
failed_job_tolerance	INTEGER	0	% of tolerated failed Jobs
max_retry_count	INTEGER	3	how many times a job of this Analysis will be retried (unless there is no point)
can_be_empty	SMALLINT	0	if TRUE, this Analysis will not be blocking if/while it doesn't have any jobs
priority	SMALLINT	0	an Analysis with higher priority will be more likely chosen on Worker's specialization
meadow_type	VARCHAR(255)	NULL	if defined, forces this Analysis to be run only on the given Meadow
analysis_capacity	INTEGER	NULL	if defined, limits the number of Workers of this particular Analysis that are allowed to run in parallel

analysis_stats

A table parallel to analysis_base that provides high-level statistics on the state of an analysis and its jobs. It is used to provide a fast overview, and to provide final approval of 'DONE', which is used by the blocking rules to determine when to unblock other analyses. Also provides

Column	Type	Default value	Description	Index
analysis_id	INTEGER	-	foreign-keyed to the corresponding analysis_base entry	primary key
batch_size	INTEGER	1	how many jobs are claimed in one claiming operation before Worker starts executing them
hive_capacity	INTEGER	NULL	a reciprocal limiter on the number of Workers running at the same time (dependent on Workers of other Analyses)
status	ENUM('BLOCKED', 'LOADING', 'SYNCHING', 'EMPTY', 'READY', 'WORKING', 'ALL_CLAIMED', 'DONE', 'FAILED')	'EMPTY'	cached state of the Analysis
total_job_count	INTEGER	0	total number of Jobs of this Analysis
semaphored_job_count	INTEGER	0	number of Jobs of this Analysis that are in SEMAPHORED state
ready_job_count	INTEGER	0	number of Jobs of this Analysis that are in READY state
done_job_count	INTEGER	0	number of Jobs of this Analysis that are in DONE state
failed_job_count	INTEGER	0	number of Jobs of this Analysis that are in FAILED state
num_running_workers	INTEGER	0	number of running Workers of this Analysis
num_required_workers	INTEGER	0	extra number of Workers of this Analysis needed to execute all READY jobs
behaviour	ENUM('STATIC', 'DYNAMIC')	'STATIC'	whether hive_capacity is set or is dynamically calculated based on timers
input_capacity	INTEGER	4	used to compute hive_capacity in DYNAMIC mode
output_capacity	INTEGER	4	used to compute hive_capacity in DYNAMIC mode
avg_msec_per_job	INTEGER	NULL	weighted average used to compute DYNAMIC hive_capacity
avg_input_msec_per_job	INTEGER	NULL	weighted average used to compute DYNAMIC hive_capacity
avg_run_msec_per_job	INTEGER	NULL	weighted average used to compute DYNAMIC hive_capacity
avg_output_msec_per_job	INTEGER	NULL	weighted average used to compute DYNAMIC hive_capacity
last_update	TIMESTAMP	NULL	when this entry was last updated
sync_lock	SMALLINT	0	a binary lock flag to prevent simultaneous updates

dataflow_rule

An extension of the simple_rule design, except that the goal (to) is now given in extended URL format, i.e. the full network address of an analysis, e.g. mysql://ensadmin:@ecs2:3361/compara_hive_test?analysis.logic_name='blast_NCBI34'.

The only requirement is that there are rows in the job, analysis, dataflow_rule and worker tables so that the following join works on the same database:

    WHERE analysis.analysis_id = dataflow_rule.from_analysis_id
      AND analysis.analysis_id = job.analysis_id
      AND analysis.analysis_id = worker.analysis_id

These are the rules used to create entries in the job table, where the input_id (control data) is passed from one analysis to the next to define work. The analysis table will be extended so that it can specify different read and write databases, with the default being the database the analysis is on.

Column	Type	Default value	Description	Index
dataflow_rule_id	INTEGER	-	internal ID
from_analysis_id	INTEGER	-	foreign key to analysis table analysis_id	unique: key
branch_code	INTEGER	1	branch_code of the fan	unique: key
funnel_dataflow_rule_id	INTEGER	NULL	dataflow_rule_id of the semaphored funnel (is NULL by default, which means dataflow is not semaphored)	unique: key
to_analysis_url	VARCHAR(255)	''	foreign key to net distributed analysis logic_name reference	unique: key
input_id_template	TEXT	NULL	a template for generating a new input_id (not necessarily a hashref) in this dataflow; if undefined is kept original	unique: key
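
The extended URL format used for to_analysis_url can be taken apart with a standard URL parser. The following is a minimal sketch using Python's urllib.parse, not the parser Hive itself uses; the function name parse_analysis_url and the keys of the returned dictionary are hypothetical, and the input is the example URL quoted in the description of this table.

```python
from urllib.parse import urlsplit, parse_qs

def parse_analysis_url(url):
    """Split an extended analysis URL into connection parts and target logic_name."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # The example URL quotes the logic_name, so strip the single quotes.
    logic_name = query["analysis.logic_name"][0].strip("'")
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "dbname": parts.path.lstrip("/"),
        "logic_name": logic_name,
    }

info = parse_analysis_url(
    "mysql://ensadmin:@ecs2:3361/compara_hive_test?analysis.logic_name='blast_NCBI34'"
)
print(info["host"], info["port"], info["dbname"], info["logic_name"])
```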

analysis_ctrl_rule

These rules define a higher level of control: they are used to turn whole analysis nodes on and off (READY/BLOCKED). If any of the condition analyses is not 'DONE', the ctrled_analysis is set to BLOCKED. When all conditions become 'DONE', the ctrled_analysis is set to READY. The workers then switch analysis.status to 'WORKING' and 'DONE'. If a condition becomes false at any moment, the analysis is reset to BLOCKED.

Column	Type	Default value	Description	Index
condition_analysis_url	VARCHAR(255)	''	foreign key to net distributed analysis reference	unique: key
ctrled_analysis_id	INTEGER	-	foreign key to analysis table analysis_id	unique: key
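
The blocking behaviour described for this table can be sketched as a small pure function. This is an illustration of the stated rules only, not Hive's implementation; the function name controlled_status is hypothetical, and leaving a non-blocked status untouched once all conditions are met is an assumption based on the statement that workers manage 'WORKING' and 'DONE'.

```python
def controlled_status(condition_statuses, current_status):
    """Return the status a controlled analysis should have.

    condition_statuses: statuses of all condition analyses of this analysis.
    current_status: the present status of the controlled analysis.
    """
    if any(status != "DONE" for status in condition_statuses):
        return "BLOCKED"       # at least one condition is not satisfied
    if current_status == "BLOCKED":
        return "READY"         # all conditions are met: unblock
    return current_status      # assumption: WORKING/DONE are left to the workers

print(controlled_status(["DONE", "WORKING"], "READY"))
print(controlled_status(["DONE", "DONE"], "BLOCKED"))
```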

Resources

resource_class

Maps between resource_class numeric IDs and unique names.

Column	Type	Default value	Description	Index
resource_class_id	INTEGER	-	unique ID of the ResourceClass
name	VARCHAR(255)	-	unique name of the ResourceClass	unique: key

resource_description

Maps (ResourceClass, MeadowType) pair to Meadow-specific resource lines.

Column	Type	Default value	Description	Index
resource_class_id	INTEGER	-	foreign-keyed to the ResourceClass entry	primary key
meadow_type	VARCHAR(255)	-	if the Worker is about to be executed on the given Meadow...	primary key
submission_cmd_args	VARCHAR(255)	''	... these are the resource arguments (queue, memory,...) to give to the submission command
worker_cmd_args	VARCHAR(255)	''	... and these are the arguments that are given to the worker command being submitted

Job-related

job

The job table is the heart of this system. It is the kiosk or blackboard where workers find things to do and then post work for other workers to do. Jobs are created before the work is done, are claimed by workers, and are updated as the work progresses, with a final update on completion.

Column	Type	Default value	Description	Index
job_id	INTEGER	-	autoincrement id
prev_job_id	INTEGER	NULL	previous job which created this one
analysis_id	INTEGER	-	the analysis_id needed to accomplish this job.	unique key: input_id_stacks_analysis key: analysis_status_retry
input_id	CHAR(255)	-	input data passed into Analysis:RunnableDB to control the work	unique key: input_id_stacks_analysis
param_id_stack	CHAR(64)	''	a CSV of job_ids whose input_ids contribute to the stack of local variables for the job	unique key: input_id_stacks_analysis
accu_id_stack	CHAR(64)	''	a CSV of job_ids whose accu's contribute to the stack of local variables for the job	unique key: input_id_stacks_analysis
role_id	INTEGER	NULL	links to the Role that claimed this job (NULL means it has never been claimed)	key: role_status
status	ENUM('SEMAPHORED','READY','CLAIMED','COMPILATION','PRE_CLEANUP','FETCH_INPUT','RUN','WRITE_OUTPUT','POST_CLEANUP','DONE','FAILED','PASSED_ON')	'READY'	state the job is in	key: analysis_status_retry key: role_status
retry_count	INTEGER	0	number of times the job had to be reset when a worker failed to run it	key: analysis_status_retry
completed	TIMESTAMP	NULL	when the job was completed
runtime_msec	INTEGER	NULL	how long did it take to execute the job (or until the moment it failed)
query_count	INTEGER	NULL	how many SQL queries were run during this job
semaphore_count	INTEGER	0	if this count is >0, the job is conditionally blocked (until this count drops to 0 or below). Default=0 means "nothing is blocking me by default".
semaphored_job_id	INTEGER	NULL	the job_id of job S that is waiting for this job to decrease S's semaphore_count. Default=NULL means "I'm not blocking anything by default".
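
The interplay of semaphore_count and semaphored_job_id can be illustrated with an in-memory model. The sketch below is an assumption-laden simplification, not Hive's code: the Job class and finish_job function are hypothetical, and moving the waiting job from SEMAPHORED to READY when its count reaches zero is inferred from the column descriptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    job_id: int
    status: str = "READY"
    semaphore_count: int = 0                  # >0 means the job is blocked
    semaphored_job_id: Optional[int] = None   # job waiting for this one

def finish_job(jobs, job_id):
    """Mark a job DONE and release one unit of the semaphore it was holding."""
    job = jobs[job_id]
    job.status = "DONE"
    if job.semaphored_job_id is not None:
        waiting = jobs[job.semaphored_job_id]
        waiting.semaphore_count -= 1
        # Unblock the waiting job once its count drops to 0 or below.
        if waiting.semaphore_count <= 0 and waiting.status == "SEMAPHORED":
            waiting.status = "READY"

# Jobs 1 and 2 both block job 3, so job 3 starts with a count of 2.
jobs = {
    1: Job(1, semaphored_job_id=3),
    2: Job(2, semaphored_job_id=3),
    3: Job(3, status="SEMAPHORED", semaphore_count=2),
}
finish_job(jobs, 1)
finish_job(jobs, 2)
print(jobs[3].status, jobs[3].semaphore_count)
```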

job_file

For testing/debugging purposes both STDOUT and STDERR streams of each Job can be redirected into a separate log file. This table holds filesystem paths to one or both of those files. There is at most one entry per job_id and retry.

Column	Type	Default value	Description	Index
job_id	INTEGER	-	foreign key	primary key
retry	INTEGER	-	copy of retry_count of job as it was run	primary key
role_id	INTEGER	-	links to the Role that claimed this job	key: role
stdout_file	VARCHAR(255)	NULL	path to the job's STDOUT log
stderr_file	VARCHAR(255)	NULL	path to the job's STDERR log

accu

Accumulator for funneled dataflow.

Column	Type	Default value	Description	Index
sending_job_id	INTEGER	NULL	semaphoring job in the "box"	key: accu_sending_idx
receiving_job_id	INTEGER	-	semaphored job outside the "box"	key: accu_receiving_idx
struct_name	VARCHAR(255)	-	name of the structured parameter
key_signature	VARCHAR(255)	-	locates the part of the structured parameter
value	TEXT	NULL	value of the part

analysis_data

A generic blob-storage hash. Currently the only legitimate use of this table is "overflow" of job.input_ids: when they grow longer than 254 characters the real data is stored in analysis_data instead, and the input_id contains the corresponding analysis_data_id.

Column	Type	Default value	Description	Index
analysis_data_id	INTEGER	-	primary id
data	TEXT	NULL	text blob which holds the data	key: data(100)
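
The overflow mechanism described for this table can be sketched as a store/fetch pair. This is only an illustration using Python's sqlite3 with a simplified in-memory table: the function names are hypothetical, and the "_overflow:" marker is an invented placeholder, since the actual encoding Hive uses to put the analysis_data_id into input_id is not given here.

```python
import sqlite3

MAX_INPUT_ID_LEN = 254          # threshold taken from the table description
OVERFLOW_PREFIX = "_overflow:"  # hypothetical marker, not the real Hive encoding

def store_input_id(conn, input_id):
    """Return the value for job.input_id, spilling long values to analysis_data."""
    if len(input_id) <= MAX_INPUT_ID_LEN:
        return input_id
    cur = conn.execute("INSERT INTO analysis_data (data) VALUES (?)", (input_id,))
    return f"{OVERFLOW_PREFIX}{cur.lastrowid}"

def fetch_input_id(conn, stored):
    """Resolve a stored input_id back to the full text."""
    if not stored.startswith(OVERFLOW_PREFIX):
        return stored
    data_id = int(stored[len(OVERFLOW_PREFIX):])
    row = conn.execute(
        "SELECT data FROM analysis_data WHERE analysis_data_id = ?", (data_id,)
    ).fetchone()
    return row[0]

# In-memory stand-in for a Hive database (assumption: simplified DDL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE analysis_data (analysis_data_id INTEGER PRIMARY KEY, data TEXT)"
)
long_id = "{'seq' => '" + "A" * 300 + "'}"
stored = store_input_id(conn, long_id)
print(len(long_id), len(stored))
```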

Execution tables

worker

Entries of this table correspond to Worker objects of the API. Workers are created by inserting into this table so that there is only one instance of a Worker object in the database. As Workers live and do work, they update this table, and when they die they update again.

Column	Type	Default value	Description	Index
worker_id	INTEGER	-	unique ID of the Worker
meadow_type	VARCHAR(255)	-	type of the Meadow it is running on	key: meadow_process
meadow_name	VARCHAR(255)	-	name of the Meadow it is running on (for 'LOCAL' type is the same as host)	key: meadow_process
host	VARCHAR(255)	-	execution host name
process_id	VARCHAR(255)	-	identifies the Worker process on the Meadow (for 'LOCAL' is the OS PID)	key: meadow_process
resource_class_id	INTEGER	NULL	links to Worker's resource class
work_done	INTEGER	0	how many jobs the Worker has completed successfully
status	ENUM('SPECIALIZATION','COMPILATION','READY','PRE_CLEANUP','FETCH_INPUT','RUN','WRITE_OUTPUT','POST_CLEANUP','DEAD')	'READY'	current status of the Worker
born	TIMESTAMP	CURRENT_TIMESTAMP	when the Worker process was started
last_check_in	TIMESTAMP	NULL	when the Worker last checked into the database
died	TIMESTAMP	NULL	if defined, when the Worker died (or its premature death was first detected by GC)
cause_of_death	ENUM('NO_ROLE', 'NO_WORK', 'JOB_LIMIT', 'HIVE_OVERLOAD', 'LIFESPAN', 'CONTAMINATED', 'RELOCATED', 'KILLED_BY_USER', 'MEMLIMIT', 'RUNLIMIT', 'SEE_MSG', 'UNKNOWN')	NULL	if defined, why did the Worker exit (or why it was killed)
log_dir	VARCHAR(255)	NULL	if defined, a filesystem directory where this Worker's output is logged

role

Entries of this table correspond to Role objects of the API. When a Worker specializes, it acquires a Role, which is a temporary link between the Worker and a resource-compatible Analysis.

Column	Type	Default value	Description	Index
role_id	INTEGER	-	unique ID of the Role
worker_id	INTEGER	-	the specialized Worker	key: worker
analysis_id	INTEGER	-	the Analysis into which the Worker specialized	key: analysis
when_started	TIMESTAMP	CURRENT_TIMESTAMP	when this Role started
when_finished	TIMESTAMP	NULL	when this Role finished. NULL may either indicate it is still running or was killed by an external force.
attempted_jobs	INTEGER	0	counter of the number of attempts
done_jobs	INTEGER	0	counter of the number of successful attempts

Logging and monitoring

worker_resource_usage

A table with post-mortem resource usage statistics of a Worker.

Column	Type	Default value	Description	Index
worker_id	INTEGER	-	links to the worker table	primary key
exit_status	VARCHAR(255)	NULL	meadow-dependent, in case of LSF it's usually 'done' (normal) or 'exit' (abnormal)
mem_megs	FLOAT	NULL	how much memory the Worker process used
swap_megs	FLOAT	NULL	how much swap the Worker process used
pending_sec	FLOAT	NULL	time spent by the process in the queue before it became a Worker
cpu_sec	FLOAT	NULL	cpu time used by the Worker process
lifespan_sec	FLOAT	NULL	walltime used by the Worker process
exception_status	VARCHAR(255)	NULL	meadow-specific flags, in case of LSF it can be 'underrun', 'overrun' or 'idle'

log_message

When a Job or a job-less Worker (job_id=NULL) throws a "die" message for any reason, the message is recorded in this table. The is_error flag indicates whether or not the message means that the job was unsuccessful. Messages issued with $self->warning("...") are also recorded, with is_error=0.

Column	Type	Default value	Description	Index
log_message_id	INTEGER	-	an autoincremented primary id of the message
job_id	INTEGER	NULL	the id of the job that threw the message (or NULL if it was thrown outside of a job)	key: job_id
role_id	INTEGER	NULL	the 'current' role
worker_id	INTEGER	NULL	the 'current' worker	key: worker_id
time	TIMESTAMP	CURRENT_TIMESTAMP	when the message was thrown
retry	INTEGER	NULL	retry_count of the job when the message was thrown (or NULL if no job)
status	ENUM('UNKNOWN','SPECIALIZATION','COMPILATION','CLAIMED','READY','PRE_CLEANUP','FETCH_INPUT','RUN','WRITE_OUTPUT','POST_CLEANUP','PASSED_ON')	'UNKNOWN'	of the job or worker when the message was thrown
msg	TEXT	NULL	string that contains the message
is_error	SMALLINT	NULL	binary flag

analysis_stats_monitor

A regular timestamped snapshot of the analysis_stats table.
