Cloud data flow
From Array Suite Wiki
Flow of data for cloud-based analyses
OmicSoft Server on the Cloud integration allows users to seamlessly analyze NGS data using on-demand Amazon Cloud resources.
Basic workflow
Nearly all computationally intensive analyses, especially those processing large NGS files (NGS QC/alignment/summarization, variant calling, etc.), can be run on the cloud. Smaller summarization and analysis jobs are carried out directly on the OmicSoft Server machine.
The general "rule" is that if both the input and output folders are cloud-based (i.e., mapped S3 bucket folders), the analysis will be performed on a cloud EC2 virtual machine in the customer's AWS environment.
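The rule above can be sketched as a small check. This is only an illustration, not OmicSoft Server's actual logic: the function names `is_cloud_path` and `runs_on_cloud` are hypothetical, and the sketch assumes that mapped S3 bucket folders can be recognized by a known path prefix such as /CloudFolder.

```python
# Illustrative sketch only; names and prefix-based detection are assumptions.

def is_cloud_path(path, cloud_roots=("/CloudFolder",)):
    """Return True if path is a mapped cloud folder or lies inside one."""
    normalized = path.rstrip("/")
    return any(
        normalized == root or normalized.startswith(root + "/")
        for root in cloud_roots
    )


def runs_on_cloud(input_folder, output_folder, cloud_roots=("/CloudFolder",)):
    """Apply the rule: cloud execution requires BOTH folders to be cloud-based."""
    return is_cloud_path(input_folder, cloud_roots) and is_cloud_path(
        output_folder, cloud_roots
    )


if __name__ == "__main__":
    print(runs_on_cloud("/CloudFolder/CCLE", "/CloudFolder/TestOutput"))
    print(runs_on_cloud("/CloudFolder/CCLE", "/LocalFolder/TestOutput"))
```

Under these assumptions, a job with cloud input but a local output folder would not qualify for cloud execution, which is why the output folder should also be a cloud folder.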
- Job submission
- User selects input data from a cloud folder in a mapped S3 bucket, such as /CloudFolder/CCLE
- If the input data is in a cloud folder, the output folder should also be set to a cloud folder, such as /CloudFolder/TestOutput
- Job Launching
- OmicSoft Server will transfer the necessary reference files (genomes, gene models, etc.) from OmicSoft Server's OmicsoftDirectory to the OmicsoftCloudDirectory
- If the reference files are not available, OmicSoft Server will first retrieve the reference data from omicsoft.com and then sync it to the customer's cloud bucket.
- When the analysis is submitted, OmicSoft Server will launch one EC2 instance per sample; alignment-related jobs use OAlignInstanceType, and other jobs use OSummaryInstanceType.
- Cloud instances are launched with OmicSoft software pre-installed; the software is updated to the latest version before analysis
- OmicSoft includes several pre-built AMIs to launch instances; custom instances can be built if needed.
- If the admin sets MaxInstanceCount=20, at most 20 EC2 machines will be started. If there are more than 20 samples, the extra samples will be queued.
- Input files in S3 are copied to EC2 machines with attached EBS storage (the EBS size is calculated from the input file size)
- Job Completion
- OmicSoft Server will monitor job progress via SQS
- When a job is finished, all results are uploaded to the S3 output folder
- Large data (filtered FASTQ files, BAM files, VCF files) will remain on S3 and are not downloaded to OmicSoft Server, but can be streamed on demand in Array Studio
- Small files (NgsData: links to S3 BAM files; OmicData: expression with design/annotation; Table reports) are copied to OmicSoft Server, summarized, and saved on the OmicSoft Server machine
- When a job is finished, the machine will wait up to 30 min for new samples in the queue to analyze
- An EC2 machine is terminated when no jobs remain in the queue and it has been idle for more than 30 min. No EC2 machines are left running once all samples are finished.
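The launching and completion steps above can be summarized in a short sketch. It is a simplified model under stated assumptions, not the server's implementation: the helper names (`instance_setting_for`, `plan_launch`, `should_terminate`) and the job-kind labels are hypothetical. Only the setting names OAlignInstanceType, OSummaryInstanceType, and MaxInstanceCount, and the 30-minute idle limit, come from the description above.

```python
from collections import deque

IDLE_LIMIT_MIN = 30  # idle time after which an instance is terminated


def instance_setting_for(job_kind):
    """Pick the server setting that determines the EC2 instance type.

    job_kind is a hypothetical label; "alignment" stands for any
    alignment-related job, anything else is treated as a summary-type job.
    """
    return "OAlignInstanceType" if job_kind == "alignment" else "OSummaryInstanceType"


def plan_launch(samples, max_instance_count):
    """One instance per sample, capped at max_instance_count; the rest wait."""
    running = list(samples[:max_instance_count])
    queued = deque(samples[max_instance_count:])
    return running, queued


def should_terminate(queue_length, idle_minutes, idle_limit=IDLE_LIMIT_MIN):
    """An instance is terminated only when the queue is empty and it has been
    idle for longer than the limit."""
    return queue_length == 0 and idle_minutes > idle_limit


if __name__ == "__main__":
    samples = [f"Sample{i}" for i in range(1, 26)]  # 25 samples
    running, queued = plan_launch(samples, max_instance_count=20)
    print(len(running), "instances started;", len(queued), "samples queued")
    print(instance_setting_for("alignment"))
    print(should_terminate(queue_length=0, idle_minutes=31))
```

In this model, an instance that finishes its sample takes the next one from the queue instead of shutting down, so the instance count never exceeds the configured cap.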