create_instance.sh creates a new Caper server instance in your AWS EC2 region and configures the instance for Cromwell with a PostgreSQL database.
- Sign up for an AWS account.
- Make sure that your account has full permission on two services (S3 and EC2).
- Configure your AWS CLI. Enter key, secret (password) and region (IMPORTANT) obtained from your account's IAM.
$ aws configure
- Click on this to create a new AWS VPC. Make sure that the region in the top right corner of the console page matches your region of interest. Click on Next and then Next again. Agree to Capabilities. Click on Create stack.
- Choose available zones in Availability Zones. For example, if your region is us-west-2, then you will see us-west-2a, us-west-2b and us-west-2c.
- Click on this to create a new AWS Batch. Make sure that the region in the top right corner of the console page matches your region of interest. Click on Next.
- There are several required parameters to be specified on this page:
  - S3 Bucket name: Name of the S3 bucket that will store your pipeline outputs. This is not a full path for the output directory; it is just the bucket's name without the scheme prefix s3://. Make sure that this bucket doesn't exist yet. If it exists, delete it or try a different, non-existing bucket name.
  - VPC ID: Choose the VPC GenomicsVPC that you just created.
  - VPC Subnet IDs: Choose all private subnets created with the above VPC.
  - Max vCPUs for Default Queue: Maximum total number of CPUs for the spot instance queue. The default of 4000 is already very large, but if you use more CPUs than this limit, your jobs will be stuck at RUNNABLE status.
  - Max vCPUs for Priority Queue: Maximum total number of CPUs for the on-demand instance queue. The default of 4000 is already very large, but if you use more CPUs than this limit, your jobs will be stuck at RUNNABLE status.
- Click on Next and then Next again. Agree to Capabilities. Click on Create stack.
- Go to your AWS Batch and click on Job queues in the left sidebar. You will see two job queues (priority-* and default-*). There have been some issues with the default one, which is based on spot instances. Spot instances are interrupted quite often and Cromwell doesn't seem to handle it properly. We recommend using the priority-* queue even though it costs a bit more than spot instances. Click on the chosen job queue and copy its ARN. This ARN will be used later to create the Caper server instance.
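The queue ARN can also be looked up from the command line instead of the console. The following is a minimal sketch, not part of Caper: pick_priority_queue is a hypothetical helper that filters a list of job queue ARNs (one per line) down to the priority-* queues. It assumes the ARNs follow the usual arn:aws:batch:REGION:ACCOUNT:job-queue/NAME layout; the account ID and queue names in the demo are placeholders.

```shell
# Hypothetical helper: keep only the ARNs of priority-* job queues.
# Reads one ARN per line on stdin and prints the matching ones.
pick_priority_queue() {
  grep ':job-queue/priority-' || true   # "|| true": no match is not an error
}

# Demo with made-up ARNs (account ID and queue names are placeholders).
printf '%s\n' \
  'arn:aws:batch:us-west-2:123456789012:job-queue/default-example' \
  'arn:aws:batch:us-west-2:123456789012:job-queue/priority-example' |
  pick_priority_queue
```

With the AWS CLI configured, the real list could be fed in with something like: aws batch describe-job-queues --query 'jobQueues[].jobQueueArn' --output text | tr '\t' '\n' | pick_priority_queue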
Run without parameters to see detailed help.
$ bash create_instance.sh
Try with the positional arguments only first and see if it works.
$ bash create_instance.sh [INSTANCE_NAME] [AWS_REGION] [PUBLIC_SUBNET_ID] [AWS_BATCH_ARN] [KEY_PAIR_NAME] [AWS_OUT_DIR]
- AWS_REGION: Your AWS region, e.g. us-west-2. Make sure that it matches region in your AWS credentials file $HOME/.aws/credentials.
- PUBLIC_SUBNET_ID: Click on Services on the AWS Console and choose VPC. Click on Subnets on the left sidebar and find Public subnet 1 under the VPC created in the above instruction.
- AWS_BATCH_ARN: ARN of the AWS Batch job queue created in the above instruction. Double-quote the whole ARN since it includes :.
- KEY_PAIR_NAME: Click on Services on the AWS Console and choose EC2. Choose Key Pairs on the left sidebar and create a new key pair (in .pem format). Take note of the key name and keep the .pem key file in a secure directory on the machine from which you will SSH to the instance. You will need it later when you SSH to the instance.
- AWS_OUT_DIR: Full output directory path starting with the bucket name you used in the above instruction. This directory should start with s3://, e.g. s3://caper-server-out-bucket/out.
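Since a mismatched region or a malformed ARN is easy to miss, the arguments can be sanity-checked before running the script. The sketch below is only an illustration: check_args is a hypothetical helper, not part of create_instance.sh, and its rules are assumptions (subnet IDs start with subnet-, the ARN is in the standard aws partition and embeds the region, the output directory starts with s3://). All values in the demo call are placeholders.

```shell
# Hypothetical pre-flight check for the six positional arguments of
# create_instance.sh. Prints a message and returns 1 on the first problem.
check_args() {
  if [ "$#" -ne 6 ]; then
    echo "expected 6 positional arguments, got $#"
    return 1
  fi
  region=$2 subnet=$3 arn=$4 out_dir=$6
  case $subnet in
    subnet-?*) ;;
    *) echo "PUBLIC_SUBNET_ID should look like subnet-..."; return 1 ;;
  esac
  case $arn in
    arn:aws:batch:"$region":*) ;;   # region inside the ARN must match AWS_REGION
    *) echo "AWS_BATCH_ARN is not an AWS Batch ARN in region $region"; return 1 ;;
  esac
  case $out_dir in
    s3://?*) ;;
    *) echo "AWS_OUT_DIR should start with s3://"; return 1 ;;
  esac
  echo "ok"
}

# Demo with placeholder values; the ARN is double-quoted because it contains ":".
check_args caper-server us-west-2 subnet-0123456789abcdef0 \
  "arn:aws:batch:us-west-2:123456789012:job-queue/priority-example" \
  my-key-pair s3://caper-server-out-bucket/out
# prints: ok
```

If the check passes, the same six values can be passed to bash create_instance.sh in the same order.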
Go to the AWS Console, click on Services and choose EC2. Click on Instances on the left sidebar and find the created instance. Click on the instance.
Click on Security and find Security groups. Click on the security group. Add an inbound rule. Choose type SSH and define a CIDR for your IP range. Setting it to 0.0.0.0/0 will open the VPC to the world.
IMPORTANT: This is the default security group for the VPC, so use it at your own risk. It's recommended to calculate the CIDR for your computer/company and use it here.
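For a single computer, the narrowest CIDR is the machine's public IPv4 address followed by /32. The following is a minimal sketch under that assumption: to_cidr32 is a hypothetical helper that validates a dotted-quad address and prints it as a single-host CIDR. The address in the demo comes from a documentation-only range and stands in for your own public IP.

```shell
# Hypothetical helper: turn one IPv4 address into a single-host CIDR (/32).
# Returns 1 if the argument is not a dotted-quad IPv4 address.
to_cidr32() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into its four octets
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  printf '%s.%s.%s.%s/32\n' "$1" "$2" "$3" "$4"
}

# Demo with a placeholder address.
to_cidr32 203.0.113.7
# prints: 203.0.113.7/32
```

The resulting value can be typed into the inbound rule in the console, or passed on the command line, e.g. aws ec2 authorize-security-group-ingress --group-id SG_ID --protocol tcp --port 22 --cidr "$(to_cidr32 YOUR_IP)", where SG_ID and YOUR_IP are placeholders for your security group ID and public IP.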
Go back to Instances on the console and find the server instance. Get the command line to SSH to it. Make sure that you have the .pem key file on your local computer.
Connect to the instance and wait until caper -v works. Allow 20-30 minutes for Caper installation.
$ caper -v
Authenticate yourself for AWS services.
$ sudo su
$ aws configure
# enter your AWS credential and region (IMPORTANT)
Run Caper server.
# cd to caper's main directory
$ sudo su
$ cd /opt/caper
$ screen -dmS caper_server bash -c "caper server > caper_server.log 2>&1"
On the instance, attach to the existing screen caper_server and stop it with Ctrl + C.
$ sudo su # log-in as root
$ screen -r caper_server # attach to the screen
# in the screen, press Ctrl + C to send SIGINT to Caper
On the instance, make a new screen caper_server.
$ sudo su
$ cd /opt/caper
$ screen -dmS caper_server bash -c "caper server > caper_server.log 2>&1"
For the first log-in, authenticate yourself to get permission to read/write on the output S3 bucket. This is to localize any external URIs (defined in an input JSON) on the output S3 bucket's directory with suffix .caper_tmp/. Make sure that you have full permission on the output S3 bucket.
$ aws configure
# enter your AWS credential and correct region (IMPORTANT)
Check if caper list works without any network errors.
$ caper list
Submit a workflow.
$ caper submit [WDL] -i input.json ...
Caper will localize big data files (e.g. your FASTQs and reference genome data defined in an input JSON) on an S3 bucket directory --aws-loc-dir (or aws-loc-dir in the Caper conf file), which defaults to [AWS_OUT_DIR]/.caper_tmp/ if not defined.
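The default described above can be written out as a small rule. The sketch below only mirrors that documented behavior for illustration; resolve_loc_dir is a hypothetical helper and not a Caper command. It returns the explicit localization directory if one is given and falls back to [AWS_OUT_DIR]/.caper_tmp/ otherwise.

```shell
# Hypothetical helper mirroring the documented default:
# use the explicit loc dir if given, else [AWS_OUT_DIR]/.caper_tmp/.
resolve_loc_dir() {
  out_dir=${1%/}          # drop one trailing slash, if any
  loc_dir=${2:-}          # optional explicit --aws-loc-dir value
  if [ -n "$loc_dir" ]; then
    printf '%s\n' "$loc_dir"
  else
    printf '%s/.caper_tmp/\n' "$out_dir"
  fi
}

# Demo with the example output directory used earlier in this document.
resolve_loc_dir s3://caper-server-out-bucket/out
# prints: s3://caper-server-out-bucket/out/.caper_tmp/
```

Knowing the resolved directory is useful when checking that your credentials have read/write permission on it.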
VERY IMPORTANT!
Caper localizes input files on the output S3 bucket path + .caper_tmp/ if they are given as non-S3 URIs (e.g. gs://example/ok.txt, http://hello.com/a.txt, /any/absolute/path.txt). If S3 URIs are given in an input JSON, however, Caper will not localize them and will pass them directly to Cromwell, and Cromwell is very picky about region and permission.
First of all, PLEASE DO NOT USE ANY EXTERNAL S3 FILES OUT OF YOUR REGION. Call-caching will not work for those external files. For example, suppose your Caper server resides in us-west-2 and you want to use a Broad reference file s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict. All Broad data are in us-east-1, so call-caching will never work.
Another example is a file from the ENCODE portal. This FASTQ file has a public S3 URI in its metadata, which is s3://encode-public/2017/01/27/92e9bb3b-bc49-43f4-81d9-f51fbc5bb8d5/ENCFF641SFZ.fastq.gz. All ENCODE portal data are in us-west-2, so call-caching will not work in other regions. It's recommended to directly use the URL of this file, https://www.encodeproject.org/files/ENCFF641SFZ/@@download/ENCFF641SFZ.fastq.gz, in the input JSON.
DO NOT USE S3 FILES ON A PRIVATE BUCKET. Job instances will not have access to those private files even though the server instance does (with your credentials configured with aws configure). For example, the ENCODE portal's unreleased files are on a private bucket s3://encode-private. Jobs will always fail if you use these private files.
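Before submitting, it can help to see which S3 URIs an input JSON actually contains, so that each bucket's region and visibility can be reviewed against the rules above. The following is a minimal sketch: list_s3_uris is a hypothetical helper that relies on grep -o (available in GNU and BSD grep), and the JSON keys in the demo are made up.

```shell
# Hypothetical helper: list the distinct s3:// URIs found in an input JSON.
list_s3_uris() {
  grep -oE 's3://[^"[:space:]]+' "$1" | sort -u
}

# Demo on a throwaway input JSON (keys are made up; URIs are the ones
# mentioned in this document).
tmp_json=$(mktemp)
cat > "$tmp_json" <<'EOF'
{
  "example.fastq": "s3://encode-public/2017/01/27/92e9bb3b-bc49-43f4-81d9-f51fbc5bb8d5/ENCFF641SFZ.fastq.gz",
  "example.ref_dict": "s3://broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
  "example.other": "https://www.encodeproject.org/files/ENCFF641SFZ/@@download/ENCFF641SFZ.fastq.gz"
}
EOF
list_s3_uris "$tmp_json"   # lists the two s3:// URIs, not the https:// one
rm -f "$tmp_json"
```

For buckets where you are allowed to call it, aws s3api get-bucket-location --bucket BUCKET reports the bucket's region, which can then be compared with the region of your Caper server.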
If S3 files in an input JSON are public and in the same region, check whether you have the s3:GetObjectAcl permission on each file.
$ aws s3api get-object-acl --bucket encode-public --key 2017/01/27/92e9bb3b-bc49-43f4-81d9-f51fbc5bb8d5/ENCFF641SFZ.fastq.gz
{
    "Owner": {
        "DisplayName": "encode-data",
        "ID": "50fe8c9d2e5e9d4db8f4fd5ff68ec949de9d4ca39756c311840523f208e7591d"
    },
    "Grants": [
        {
            "Grantee": {
                "DisplayName": "encode-aws",
                "ID": "a0dd0872acae5121b64b11c694371e606e28ab2e746e180ec64a2f85709eb0cd",
                "Type": "CanonicalUser"
            },
            "Permission": "FULL_CONTROL"
        },
        {
            "Grantee": {
                "Type": "Group",
                "URI": "http://acs.amazonaws.com/groups/global/AllUsers"
            },
            "Permission": "READ"
        }
    ]
}
If you get 403 Permission denied then call-caching will not work.
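When an input JSON references several S3 files, the same ACL check can be wrapped in a function and repeated per file. The sketch below is an assumption-laden illustration: acl_status is a hypothetical helper, and AWS_CLI is an assumed override variable for the CLI binary (it defaults to aws). The helper cannot tell a 403 apart from other failures such as a missing CLI or a network error, so a failed check still needs a manual look.

```shell
# Hypothetical helper: report whether the object's ACL can be read with the
# current credentials. AWS_CLI is an assumed hook to override the CLI binary.
acl_status() {
  if "${AWS_CLI:-aws}" s3api get-object-acl --bucket "$1" --key "$2" >/dev/null 2>&1; then
    echo "acl-readable"
  else
    echo "acl-check-failed"   # 403, missing CLI, network error, ...
  fi
}

# Usage (not run here, since it needs credentials and network access):
#   acl_status encode-public 2017/01/27/92e9bb3b-bc49-43f4-81d9-f51fbc5bb8d5/ENCFF641SFZ.fastq.gz
```

A result of acl-check-failed for a file means it is safer to pass that file as a non-S3 URL instead.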
To avoid all permission/region problems, please use non-S3 URIs/URLs.
https://docs.opendata.aws/genomics-workflows/orchestration/cromwell/cromwell-overview.html
See [this] for troubleshooting.