Job Persistence and Instruction Refactor

Questions to be answered:

How does this relate to the new architecture?
- Will the calling service manage load balancing and retries?
- How will the calling service trigger the load (scp/ssh, web calls, TCP, etc.)?
How do instructions relate to jobs (many jobs to one instruction, or one to one)?

Rationale: Storing instructions in the file system as static JSON objects moves state information away from the central repository of information about a job, confuses users, and does not play well with dynamic instances or state queries.

Furthermore, the file system is an unencapsulated stateful location, with stateful permissions and strong uniqueness constraints. Together, these create a volatile environment that imposes strong restrictions on the logic surrounding instructions.

Instructions must be uniquely identified by a name (the file name), the running process must have permission to move the files, and the running process does not have complete ownership of its entities. In short, keeping instructions on the file system imposes business logic and creates unnecessary complexity.

Overall Design

Goals

→ Remove all references to instruction files in the file system from the digester.

→ Create an opening for outside components to call the digester directly (e.g. via a webservice call).

Steps

Phase 1: Proof Of Concept of Direct Instruction Injection

In this phase, we investigate how difficult it is to modify the webservice layer so that it immediately kicks off a process thread, much the same way BERT currently processes jobs directly. This includes passing the job data directly instead of through a JSON file.

Phase 2: Scheduler/Load Balancer

(**** This phase may be marked as obsolete in new architecture ****)
Once the Phase 1 prototype is complete, the difficulty of adding a scheduler can be researched fairly easily. During this phase, we hope to identify a scheduler/load balancer that works well with spot instances and other cloud improvements.

Phase 3: Instruction database table and interface

At this point, we can investigate efficient storage of the instruction data structures, live updates from the scheduled/balanced jobs onto this structure, and how this structure relates to the upcoming job table redesign.

Post-Phase 3: Finish

→ Testing/fixing issues found in prototype.

→ Integration with main codebase.

Process Flow

The user targets the ‘createInstruction’ endpoint in the web services, the same as is done now.
The serialized instruction is redirected to a digester service.
(Optional) A record of the instruction is placed in the database, perhaps coupled with the job.
The digester service resides on the compute node and spins up digesters in a thread pool.

→ Using the createInstruction endpoint means no changes to the loaderUI will be needed.

→ The instruction in the database is for retrospection purposes

→ This means if an instruction fails, the instruction can be inspected

→ The database schema can be a very simple file store; that is, the whole table will be

‘key → instructionJson’
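
A minimal sketch of such a single-table store, assuming SQLite purely for illustration (the design does not name a database). The table name, column names, and helper functions below are hypothetical; only the ‘key → instructionJson’ shape comes from the design.

```python
import json
import sqlite3

# Hypothetical table and column names; the design only specifies
# a single 'key -> instructionJson' mapping.
SCHEMA = """
CREATE TABLE IF NOT EXISTS instruction (
    instruction_key  TEXT PRIMARY KEY,
    instruction_json TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the instruction store and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_instruction(conn, key, instruction):
    """Serialize the instruction to JSON and store it under the given key."""
    conn.execute(
        "INSERT OR REPLACE INTO instruction (instruction_key, instruction_json) "
        "VALUES (?, ?)",
        (key, json.dumps(instruction)),
    )
    conn.commit()


def load_instruction(conn, key):
    """Return the stored instruction as a dict, or None if the key is unknown."""
    row = conn.execute(
        "SELECT instruction_json FROM instruction WHERE instruction_key = ?",
        (key,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Because the stored value is the serialized instruction itself, a failed instruction can be loaded back and inspected without touching the file system.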

Digester Service

The digester service acts as a mediator between the submitted jobs and the digesting instances. It lives on the compute node and accepts Digest instructions. It regulates its own scheduling through an abstracted regulator, so that scheduling strategies can be configured per system.
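
One way the service and its abstracted regulator might fit together is sketched below. This assumes a thread pool per the process flow; the names `FixedPoolRegulator`, `DigesterService`, `worker_count`, and the `digest` callable are all hypothetical and are not part of the existing codebase.

```python
from concurrent.futures import ThreadPoolExecutor


class FixedPoolRegulator:
    """Hypothetical strategy: cap concurrent digesters at a fixed number.

    Other strategies (e.g. sizing by CPU count) could expose the same method.
    """

    def __init__(self, max_workers=4):
        self.max_workers = max_workers

    def worker_count(self):
        return self.max_workers


class DigesterService:
    """Accepts digest instructions and runs each one on a pooled thread."""

    def __init__(self, regulator, digest):
        # 'digest' is a stand-in for whatever actually processes a job.
        self._digest = digest
        self._pool = ThreadPoolExecutor(max_workers=regulator.worker_count())

    def submit(self, job, config):
        """Queue a job for digestion; returns a Future for its result."""
        return self._pool.submit(self._digest, job, config)

    def shutdown(self):
        """Wait for queued and running jobs to finish, then stop the pool."""
        self._pool.shutdown(wait=True)
```

Swapping the regulator object is the only change needed to try a different scheduling strategy on a different system, which is the point of keeping it abstracted.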

When the digester service starts up, it will request all uncompleted jobs from the web service and execute those that are deemed ‘unsettled’. ‘Unsettled’ is intentionally left vague, as there are many situations where a digest job is unfinished but may or may not need to be executed again. The baseline requirement is that all jobs with the ‘submitted’ tag should be executed. This allows the digester to hold onto instructions without executing them, and without the worry that a critical failure will leave these jobs unexecuted.
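
The startup selection can be sketched as a filter with a replaceable predicate. This assumes jobs arrive as dictionaries carrying their tag in a hypothetical `status` field; the function name `select_unsettled` is likewise illustrative.

```python
def is_submitted(job):
    """Baseline rule from the design: jobs tagged 'submitted' are unsettled."""
    # The 'status' field name is an assumption about the job record layout.
    return job.get("status") == "submitted"


def select_unsettled(jobs, is_unsettled=is_submitted):
    """Return the uncompleted jobs that should be executed on startup.

    The predicate is a parameter because 'unsettled' is deliberately
    left open; a deployment can pass a broader rule than the baseline.
    """
    return [job for job in jobs if is_unsettled(job)]
```

A broader rule, for example one that also re-runs jobs left in a running state after a crash, can be passed as `is_unsettled` without changing the startup code.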

It should be noted that, although this design is specific to the digester, little stands in the way of implementing the same system for the extractor. In fact, doing so will start to make a lot of sense as we look to balance the entire compute node’s activities.

API

The core of the digester API:

POST: /job
  Body: {job: Job to be executed,
         config: {Digester Configuration}}

Further additions to the API can be discussed, but this is all that is needed to make it functional.
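
As an illustration of what a caller might send, the sketch below builds a POST /job request using Python's standard library. The host name, port, JSON encoding of the body, and the contents of `job` and `config` are assumptions; only the path and the two top-level body fields come from the API above.

```python
import json
import urllib.request


def build_job_request(base_url, job, config):
    """Build (but do not send) a POST /job request for the digester service."""
    body = json.dumps({"job": job, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/job",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and payload, shown only to illustrate the request shape.
request = build_job_request(
    "http://compute-node.example:8080",
    job={"id": 42},
    config={"threads": 2},
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`; how the service responds is not yet specified by this design.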