Cloud data flow

From Array Suite Wiki
Revision as of 20:52, 27 July 2022 by Joseph (talk | contribs)

Flow of data for cloud-based analyses

OmicSoft Server on the Cloud integration allows users to seamlessly analyze NGS data using on-demand Amazon Cloud resources.

Basic workflow

Nearly all computationally intensive analyses, especially those processing large NGS files (NGS QC/alignment/summarization, variant calling, etc.), can be run on the cloud. Smaller summarization and analysis jobs are carried out directly on the OmicSoft Server machine.

The general "rule" is that if both input and output folders are cloud-based (i.e. mapped S3 bucket folders), then the analysis will be performed on a cloud EC2 virtual machine in the customer AWS environment.
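
The rule above can be sketched as a small check. This is only an illustration, not OmicSoft Server's actual implementation: the function name runs_on_cloud and the cloud_roots parameter are hypothetical, and it assumes a folder counts as cloud-based when its path sits under a mapped S3 bucket folder such as /CloudFolder.

```python
from typing import Sequence


def runs_on_cloud(input_folder: str, output_folder: str,
                  cloud_roots: Sequence[str]) -> bool:
    """Return True only when BOTH folders sit under a mapped cloud root.

    Hypothetical helper: cloud_roots lists the mapped S3 bucket folders,
    e.g. ["/CloudFolder"]. It is not an OmicSoft setting name.
    """
    def is_cloud(path: str) -> bool:
        # Compare with a trailing slash so "/CloudFolderX" does not
        # match the root "/CloudFolder" by accident.
        normalized = path.rstrip("/") + "/"
        return any(normalized.startswith(root.rstrip("/") + "/")
                   for root in cloud_roots)

    return is_cloud(input_folder) and is_cloud(output_folder)
```

With the example folders used on this page, runs_on_cloud("/CloudFolder/CCLE", "/CloudFolder/TestOutput", ["/CloudFolder"]) evaluates to True, while changing either folder to a path outside /CloudFolder makes it False.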

  1. Job submission
    1. User selects input data from a cloud folder in a mapped S3 bucket, such as /CloudFolder/CCLE
      1. If the input data is in a cloud folder, the output folder should also be set to a cloud folder, such as /CloudFolder/TestOutput
  2. Job Launching
    1. OmicSoft Server will transfer the necessary reference files (genomes, gene models, etc.) from OmicSoft Server's OmicsoftDirectory to the OmicsoftCloudDirectory
      1. If the reference files are not available, OmicSoft Server will first retrieve the reference data and then sync it to the customer's cloud bucket.
    2. When the analysis is submitted, OmicSoft Server will launch one EC2 instance per sample; alignment-related jobs use OAlignInstanceType, and other jobs use OSummaryInstanceType.
      1. Cloud instances are launched with OmicSoft software pre-installed, which will be updated to the latest version for analysis
      2. OmicSoft includes several pre-built AMIs to launch instances; custom instances can be built if needed.
      3. If the admin sets MaxInstanceCount=20, at most 20 EC2 machines will be started. If there are more than 20 samples, the extra samples will be queued.
    3. Input files in S3 are copied to the EC2 machines, to which EBS storage is attached (the EBS size is calculated from the input file size)
  3. Job Completion
    1. OmicSoft Server will monitor job progress via SQS
  4. When a job is finished, all results are uploaded to the S3 output folder
    1. Large data (filtered FASTQ files, BAM files, VCF files) will remain on S3 and are not downloaded to OmicSoft Server, but can be streamed on demand in Array Studio
    2. Small files (NgsData: links to S3 BAM files; OmicData: expression with design/annotation; Table reports) are copied to OmicSoft Server, summarized, and saved on the OmicSoft Server machine
    3. When a job is finished, the EC2 machine will wait up to 30 min for new samples in the queue to analyze
    4. An EC2 machine is terminated when there are no jobs in the queue and it has been idle for more than 30 min. No EC2 machines are left running once all samples are finished.
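
The launch and termination rules in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not OmicSoft Server's code: the names LaunchPlan, plan_launch, should_terminate and IDLE_LIMIT_MIN are hypothetical. Only the setting names (OAlignInstanceType, OSummaryInstanceType, MaxInstanceCount), the one-instance-per-sample rule and the 30 min idle limit come from this page.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Sequence

IDLE_LIMIT_MIN = 30  # idle limit stated on this page


@dataclass
class LaunchPlan:
    instance_type_setting: str  # name of the setting that picks the EC2 type
    started: List[str]          # samples that get their own EC2 instance now
    queued: Deque[str]          # samples waiting for a free instance


def plan_launch(samples: Sequence[str], max_instance_count: int,
                alignment_related: bool) -> LaunchPlan:
    """One instance per sample, capped at max_instance_count (hypothetical)."""
    setting = ("OAlignInstanceType" if alignment_related
               else "OSummaryInstanceType")
    started = list(samples[:max_instance_count])
    queued = deque(samples[max_instance_count:])
    return LaunchPlan(setting, started, queued)


def should_terminate(queue_length: int, idle_minutes: float) -> bool:
    """An instance stops only when the queue is empty AND it idled > 30 min."""
    return queue_length == 0 and idle_minutes > IDLE_LIMIT_MIN
```

For example, calling plan_launch with 25 sample names, max_instance_count=20 and alignment_related=True starts 20 samples and leaves 5 in the queue. An instance that has been idle for 45 min is still kept while samples remain queued, because should_terminate requires both conditions.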