Instruction File Persistence Refactor

Document status	DRAFT
Document owner	@Joshua Lamos-Sweeney + @Luke Cook
Designer	@Luke Cook
Tech lead	@Joshua Lamos-Sweeney
Technical writers	TBD
QA	TBD

Objective

Develop an instruction-file-free process layer for the Digest and Extract layers of GDM (Genomic Data Manager), in which instruction files are no longer ‘passed’ through direct file-system access. This reduces our reliance on the file system and lets us build better process handling (e.g. scheduling) without requiring a file-system-based cron solution.

This will also allow a more direct calling structure, which reduces wait times on short jobs and gives more precise knowledge of the state of each job. (See sub-enhancements for future tasks in that space.)

Concept

The Process layer currently passes the ‘job’ to be completed as a file, which is placed only after all the ancillary files and folders have been created. (For now we’ll look at the Digest, as it and the Extract are functionally similar.) This ‘job file’ moves through the system as a crude status mechanism. Now that we have more complex job status tracking and rely less on manually kicking off jobs internally, this system gives the user very little benefit, while complicating the file system structure and causing confusion in its own right.

It also makes dynamic instances harder to support, since they would all rely on the same file system, and it makes monitoring and inter-process communication difficult.

The proposed system will allow a ‘user’ program, such as the LoaderUI, to call the Digest instance on a specific receiving port. The instance accepts a serialized instruction (currently our instruction file format, serialized) and kicks off the job directly.
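As a rough illustration of the call path described above, the sketch below sends a serialized instruction to a listener on a local port and waits for an acknowledgement. It is not the GDM implementation: Python is used only for brevity, and the JSON encoding, the 4-byte length prefix, the `ACCEPTED` reply, and the instruction fields (`job`, `input`) are all assumptions made for the example.

```python
import json
import socket
import struct
import threading


def send_instruction(host, port, instruction):
    """Serialize an instruction and send it with a 4-byte length prefix."""
    payload = json.dumps(instruction).encode("utf-8")
    with socket.create_connection((host, port)) as conn:
        conn.sendall(struct.pack("!I", len(payload)) + payload)
        # Wait for a one-line acknowledgement from the listener.
        with conn.makefile("rb") as reply:
            return reply.readline().decode("utf-8").strip()


def _recv_exact(conn, n):
    """Read exactly n bytes, since recv() may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf += chunk
    return buf


def serve_once(server_sock, received):
    """Accept one connection, decode the instruction, and acknowledge it."""
    conn, _ = server_sock.accept()
    with conn:
        (length,) = struct.unpack("!I", _recv_exact(conn, 4))
        instruction = json.loads(_recv_exact(conn, length).decode("utf-8"))
        received.append(instruction)  # a real listener would queue the job here
        conn.sendall(b"ACCEPTED\n")


def demo():
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []
    listener = threading.Thread(target=serve_once, args=(server, received))
    listener.start()
    # Hypothetical instruction contents, for illustration only.
    ack = send_instruction("127.0.0.1", port, {"job": "digest", "input": "example_dataset"})
    listener.join()
    server.close()
    return ack, received
```

Because the caller gets an immediate reply on the same connection, it learns whether the job was accepted without polling the file system for a moved job file.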

Implementation Decisions

Currently, the DigestListener receives a job object through the reception port and places the job into a four-thread queue, which limits execution to four concurrent jobs, served FIFO. This reduces the chance of system overutilization while remaining simple, and the logic is isolated in DigestListener so that a more comprehensive scheduler can easily be ‘slotted in’ later. Even as is, this is an improvement over the ‘slow ramping’ cron job ‘pseudo-scheduler’, which can allow many jobs to run at once if all of them are ‘long’.
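The queueing behaviour described above can be sketched with a fixed-size worker pool. This is a minimal illustration, not the actual DigestListener code: the class name `DigestListenerSketch`, its methods, and the concurrency counters are hypothetical and exist only to show that no more than four jobs run at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_JOBS = 4  # the limit described in the text


class DigestListenerSketch:
    """Hypothetical stand-in for the queueing part of DigestListener."""

    def __init__(self, max_jobs=MAX_CONCURRENT_JOBS):
        # The executor keeps pending work in a FIFO queue and runs at most
        # max_jobs of them at the same time.
        self._pool = ThreadPoolExecutor(max_workers=max_jobs)
        self._lock = threading.Lock()
        self._running = 0
        self.peak_running = 0
        self.started = []

    def _run(self, job_id, work):
        with self._lock:
            self._running += 1
            self.peak_running = max(self.peak_running, self._running)
            self.started.append(job_id)
        try:
            return work()
        finally:
            with self._lock:
                self._running -= 1

    def submit(self, job_id, work):
        """Queue a job; it starts as soon as one of the workers is free."""
        return self._pool.submit(self._run, job_id, work)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def demo():
    listener = DigestListenerSketch()
    # Ten short placeholder jobs; only four may run concurrently.
    futures = [listener.submit(i, lambda: time.sleep(0.05)) for i in range(10)]
    for future in futures:
        future.result()
    listener.shutdown()
    return listener
```

Keeping the pool behind a small `submit` interface mirrors the isolation described above: a more comprehensive scheduler could replace the pool without changing the callers.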

Success metrics

Goal	Metric
Integrates into existing GDM UIs	Does everything still work?
Does not fail under high load scenarios	Does it actually limit system load?
