
Rationale: Storing instructions in the file system as static JSON files moves state information away from the central repository of information about a job, confuses users, and does not play well with dynamic instances or state queries.

Furthermore, the file system is an unencapsulated stateful location, with stateful permissions and strong uniqueness constraints. All this put together creates a volatile environment with strong imposed restrictions on the logic surrounding instructions.

What this all means is this: instructions must be uniquely identified by a name (the file name), the running process must have permission to move the files, and there must be some guarantee that the files won’t move while the process is running (otherwise a failure may arise when it looks for them during cleanup). All in all, keeping instructions on the file system imposes business logic on storage and creates unneeded complexity.

Overall Design

Goals

→ Decouple the web services and file system by having the web service talk to a scheduler

→ Making the scheduler and database aware of the ‘job’ being requested allows for more dynamic processing

→ Dynamic processing and Postgres storage allow more real-time status updates, facilitating the Job Status redesign

Steps

Phase 1: Proof Of Concept of Direct Instruction Injection

At this phase, we look into the difficulty of modifying the webservice layer in such a way that it immediately kicks off a processing thread, much as BERT currently processes jobs directly. This includes passing the job data directly instead of through a JSON file.
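The direct-injection idea can be sketched as follows. This is a minimal illustration, not the actual webservice code: the names `create_instruction` and `process_job` are hypothetical stand-ins for the endpoint handler and the digest work.

```python
# Sketch of Phase 1: the web service hands the parsed job data straight
# to a worker thread instead of serializing it to a JSON file on disk.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def process_job(job):
    # Stand-in for the digest work BERT would perform on the instruction.
    return {"job_id": job["job_id"], "status": "complete"}

def create_instruction(job):
    # Instead of writing `job` to the file system, submit it directly.
    return executor.submit(process_job, job)

future = create_instruction({"job_id": 42, "payload": "..."})
print(future.result()["status"])  # prints "complete"
```

The returned future gives the caller a handle on the running job, which is what later makes live status callbacks possible.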

Phase 2: Scheduler/Load Balancer

Once the Phase 1 prototype is complete, the difficulties of adding a scheduler into the mix can be researched fairly easily. During this phase, we hope to identify a scheduler/load balancer that works well with spot instances and other cloud improvements.

Phase 3: Instruction database table and interface

At this point, we can investigate efficient storage of the instruction data structures, live updates from the scheduled/balanced jobs onto this structure, and how this structure relates to the upcoming job table redesign.

Post-Phase 3: Finish

→ Interfacing with new Job Status redesign.

→ Testing/fixing issues found in prototype.

→ Integration with main codebase.

Process Flow

  • User targets the ‘createInstruction’ endpoint in the web services, just as it is done now

  • The serialized instruction is redirected to a digester service

  • (Optional) a record of the instruction is placed in the database

  • The digester service resides on the compute node, and spins up digesters in a thread pool

  • The digester has callbacks built into it, updating the calling machine with changes in its status

→ Using the createInstruction endpoint means no changes to the loaderUI will be needed

→ The instruction in the database is for retrospection purposes

→ This means if an instruction fails, the instruction can be inspected

→ The database schema can be a very simple key/value store. That is, the whole table will be

‘key → instructionJson’
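The retrospection table can be sketched as below. This uses sqlite3 purely for a self-contained illustration (the real target is Postgres), and the table and column names are assumptions.

```python
# Sketch of the simple key -> instructionJson store for retrospection.
# sqlite3 is used here only so the example runs standalone.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instruction (key TEXT PRIMARY KEY, instruction_json TEXT)")

def record_instruction(key, instruction):
    conn.execute("INSERT INTO instruction VALUES (?, ?)", (key, json.dumps(instruction)))

def inspect_instruction(key):
    # If a job fails, the stored instruction can be pulled back out and inspected.
    row = conn.execute(
        "SELECT instruction_json FROM instruction WHERE key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else None

record_instruction("job-42", {"source": "input.csv", "target": "warehouse"})
print(inspect_instruction("job-42")["target"])  # prints "warehouse"
```

Because the table is opaque JSON keyed by a single identifier, it imposes none of the schema maintenance burden of a fully normalized instruction table.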

Callback API (Prototype)

The Digester expects a target API for calling the web service on status changes.

Here’s the API:

POST .../job
  Body: {Job JSON}
Creates a Job described by the given body JSON

PATCH .../job/{job_id}
  Body: {Job Prototype JSON}
Updates the job with the given prototype

This way, whenever the pertinent information for a job changes, the relevant data can be updated in the host service
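The two callback calls above can be sketched as simple request builders. The base URL and the `status` field are illustrative assumptions; only the `POST /job` and `PATCH /job/{job_id}` shapes come from the prototype API.

```python
# Sketch of the requests a digester would issue against the callback API.
import json

BASE = "https://host.example/api"  # hypothetical host-service URL

def create_job_request(job):
    # POST .../job with the full Job JSON in the body.
    return ("POST", f"{BASE}/job", json.dumps(job))

def update_job_request(job_id, prototype):
    # PATCH .../job/{job_id}: only the changed fields travel in the prototype.
    return ("PATCH", f"{BASE}/job/{job_id}", json.dumps(prototype))

method, url, body = update_job_request(7, {"status": "running"})
print(method, url)  # prints: PATCH https://host.example/api/job/7
```

Using PATCH with a sparse prototype keeps callback traffic small: a status flip sends one field, not the whole job record.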

Digester Service

The digester service acts as a mediator between the submitted jobs and the digesting instances. It lives on the compute node and accepts Digest instructions. It self-regulates scheduling through an abstracted regulator, allowing scheduling strategies to be configured per system.

When the digester service starts up, it will request all uncompleted jobs from the web service and execute those deemed ‘unsettled'. ‘Unsettled' is intentionally left vague, as there are many situations where a digest job may be unfinished but may or may not need to be executed again. The baseline requirement is that all jobs with the ‘submitted’ tag should be executed. This lets the digester hold onto instructions without executing them, and without the worry that, after a critical failure, those jobs will never be executed.
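The startup sweep described above can be sketched as a filter over the uncompleted jobs. The `tag` field name is an assumption; the baseline rule (anything still tagged ‘submitted’ must run) is taken from the text.

```python
# Sketch of the digester service's startup sweep over uncompleted jobs.
def is_unsettled(job):
    # Baseline rule from the design: 'submitted' jobs must be executed.
    # Deployments could widen this (e.g. jobs left 'running' after a crash)
    # without touching the caller.
    return job["tag"] == "submitted"

def startup_sweep(uncompleted_jobs):
    # Jobs the digester holds but does not re-execute stay queued safely.
    return [job for job in uncompleted_jobs if is_unsettled(job)]

jobs = [{"id": 1, "tag": "submitted"}, {"id": 2, "tag": "paused"}]
print([j["id"] for j in startup_sweep(jobs)])  # prints [1]
```

Keeping the ‘unsettled’ test in one small function is what makes the vagueness safe: the policy can tighten or widen per system without changing the sweep itself.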

It should be noted that, although this design is specific to the digester, there is little standing in the way of implementing the same system for the extractor. In fact, this will make more sense as we move to balance the entire compute node’s activities.

API

The core of the digester API:

POST: /job
  Body: {job: Job to be executed,
         config: {Digester Configuration}}

Further additions to the API can be discussed, but this is all that is needed to make it functional.
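A minimal handler for that single endpoint might look like the sketch below. The handler shape, field names inside `job` and `config`, and the worker-pool wiring are assumptions for illustration; only the `{job, config}` body structure comes from the API above.

```python
# Sketch of the digester service's POST /job handler: validate the body,
# then hand the job and its configuration to the worker pool.
import json
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)

def run_digest(job, config):
    # Stand-in for an actual digest run under the given configuration.
    return {"job_id": job["id"], "threads": config.get("threads", 1)}

def handle_post_job(raw_body):
    body = json.loads(raw_body)
    # Both keys are required by the core API; reject anything else early.
    if "job" not in body or "config" not in body:
        raise ValueError("body must contain 'job' and 'config'")
    return pool.submit(run_digest, body["job"], body["config"])

future = handle_post_job('{"job": {"id": 9}, "config": {"threads": 4}}')
print(future.result())  # prints {'job_id': 9, 'threads': 4}
```

Because the endpoint returns a future-like handle rather than blocking, the service can accept new instructions while earlier digests run, which is the self-regulating behavior described above.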
