Usage

Data Uploader

fq_upload.py -1 <Read 1 FASTQ file> -2 <Read 2 FASTQ file> -o <Output HDFS Folder>
Arguments:
	-h: Usage
	-1: Input Read 1 FASTQ file
	-2: Input Read 2 FASTQ file
	-o: Output tsv list
Example:
	./fq_upload.py -1 ~/NA19240/SRR7782669_same_r1.fq.gz -2 ~/NA19240/SRR7782669_same_r2.fq.gz -o /NA19240/chunk.fq
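Before uploading, it can be worth confirming that the two FASTQ files are in sync. The following is a minimal sketch and not part of `fq_upload.py`; the function names are hypothetical, and it assumes plain 4-line FASTQ records whose read names may end in `/1` and `/2`.

```python
import gzip
import itertools


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz input."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_names(path):
    """Yield the read name from each 4-line FASTQ record."""
    with open_fastq(path) as handle:
        # Ignore blank lines so a trailing newline does not shift the record frame.
        records = (line for line in handle if line.strip())
        for i, line in enumerate(records):
            if i % 4 == 0:
                # Drop '@', any description after whitespace, and a /1 or /2 suffix.
                name = line[1:].split()[0] if line[1:].split() else ""
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def check_pairs(r1_path, r2_path):
    """Return the number of pairs; raise ValueError if the files disagree."""
    count = 0
    pairs = itertools.zip_longest(read_names(r1_path), read_names(r2_path))
    for n1, n2 in pairs:
        if n1 is None or n2 is None:
            raise ValueError(f"files differ in length after {count} records")
        if n1 != n2:
            raise ValueError(f"record {count}: {n1!r} != {n2!r}")
        count += 1
    return count
```

Calling `check_pairs` on the Read 1 and Read 2 paths returns the pair count, or raises `ValueError` at the first record where the files diverge.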

Parallel Data Transformation by Adam

/usr/local/spark/bin/spark-submit \
  --master spark://server:7077 \
  --class org.bdgenomics.adam.cli.ADAMMain \
  /seqslab/adam-assembly-1.0.0-qual.jar transformAlignments
  
Argument "INPUT" is required
 INPUT                                                           : The ADAM, BAM or SAM file to apply the transforms to
 OUTPUT                                                          : Location to write the transformed data in ADAM/Parquet format
 -add_md_tags VAL                                                : Add MD Tags to reads based on the FASTA (or equivalent) file passed to this option.
 -aligned_read_predicate                                         : Only load aligned reads. Only works for Parquet files. Exclusive of region
                                                                   predicate.
 -allow_one_mismatch_for_each N                                  : trim poly G allow one mismatch for each
 -atgx_transform                                                 : Enable Atgenomix alignment transformation.
 -barcode_len N                                                  : barcode length
 -barcode_whitelist VAL                                          : barcode whitelist
 -bin_quality_scores VAL                                         : Rewrites quality scores of reads into bins from a string of bin descriptions, e.g.
                                                                   0,20,10;20,40,30.
 -cache                                                          : Cache data to avoid recomputing between stages.
 -coalesce N                                                     : Set the number of partitions written to the ADAM output directory
 -collapse_dup_reads                                             : collect reads with same sequences
 -compare_req N                                                  : trim poly G compare req
 -concat VAL                                                     : Concatenate this file with <INPUT> and write the result to <OUTPUT>
 -defer_merging                                                  : Defers merging single file output
 -disable_fast_concat                                            : Disables the parallel file concatenation engine.
 -disable_pg                                                     : Disable writing a new @PG line.
 -disable_sv_dup                                                 : Disable duplication of sv calling reads, soft-clip or discordantly.
 -filter_lc_reads                                                : filter low complexity reads
 -filter_lq_reads                                                : filter out reads containing max_lq_base of base with quality < min_quality
 -filter_n                                                       : filter reads with uncalled base count >= max_N_count, defaulted as 2
 -force_load_bam                                                 : Forces TransformAlignments to load from BAM/SAM.
 -force_load_fastq                                               : Forces TransformAlignments to load from unpaired FASTQ.
 -force_load_ifastq                                              : Forces TransformAlignments to load from interleaved FASTQ.
 -force_load_parquet                                             : Forces TransformAlignments to load from Parquet.
 -force_shuffle_coalesce                                         : Even if the repartitioned RDD has fewer partitions, force a shuffle.
 -h (-help, --help, -?)                                          : Print help
 -known_indels VAL                                               : VCF file including locations of known INDELs. If none is provided, default
                                                                   consensus model will be used.
 -known_snps VAL                                                 : Sites-only VCF giving location of known SNPs
 -lc_kmer N                                                      : Low complexity kmer
 -lc_threshold_len_factor N                                      : Low complexity: threshold = length / lc_threshold_len_factor / lc_kmer
 -limit_projection                                               : Only project necessary fields. Only works for Parquet files.
 -log_odds_threshold N                                           : The log-odds threshold for accepting a realignment. Default value is 5.0.
 -mark_duplicate_reads                                           : Mark duplicate reads
 -max_N_count N                                                  : upper limit for uncalled base count
 -max_consensus_number N                                         : The maximum number of consensus to try realigning a target region to. Default
                                                                   value is 30.
 -max_indel_size N                                               : The maximum length of an INDEL to realign to. Default value is 500.
 -max_lq_base N                                                  : max acceptable low quality base for -filter_lq_reads, defaulted as 10
 -max_mismatch N                                                 : trim poly G max mismatch
 -max_reads_per_target N                                         : The maximum number of reads attached to a target considered for realignment.
                                                                   Default is 20000.
 -max_target_size N                                              : The maximum length of a target region to attempt realigning. Default length is
                                                                   3000.
 -md_tag_fragment_size N                                         : When adding MD tags to reads, load the reference in fragments of this size.
 -md_tag_overwrite                                               : When adding MD tags to reads, overwrite existing incorrect tags.
 -min_acceptable_quality N                                       : Minimum acceptable quality for recalibrating a base in a read. Default is 5.
 -min_length N                                                   : read min length
 -min_quality N                                                  : threshold for low quality base, defaulted as '?', representing Q30 for Illumina
                                                                   1.8+ Phred+33
 -n_mer_len N                                                    : N-mer length
 -paired_fastq VAL                                               : When converting two (paired) FASTQ files to ADAM, pass the path to the second file
                                                                   here.
 -parquet_block_size N                                           : Parquet block size (default = 128mb)
 -parquet_compression_codec [UNCOMPRESSED | SNAPPY | GZIP | LZO] : Parquet compression codec
 -parquet_disable_dictionary                                     : Disable dictionary encoding
 -parquet_logging_level VAL                                      : Parquet logging level (default = severe)
 -parquet_page_size N                                            : Parquet page size (default = 1mb)
 -print_metrics                                                  : Print metrics to the log on completion
 -quality_encode                                                 : encode quality with depth
 -rand_assign_n                                                  : randomly assign N to one of the nucleotides A, C, G, and T
 -realign_indels                                                 : Locally realign indels present in reads.
 -recalibrate_base_qualities                                     : Recalibrate the base quality scores (ILLUMINA only)
 -record_group VAL                                               : Set converted FASTQs' record-group names to this value; if empty-string is passed,
                                                                   use the basename of the input file, minus the extension.
 -reference VAL                                                  : Path to a reference file to use for indel realignment.
 -region_predicate VAL                                           : Only load a specific range of regions. Mutually exclusive with aligned read
                                                                   predicate.
 -repartition N                                                  : Set the number of partitions to map data to
 -single                                                         : Saves OUTPUT as single file
 -sort_fastq_output                                              : Sets whether to sort the FASTQ output, if saving as FASTQ. False by default.
                                                                   Ignored if not saving as FASTQ.
 -sort_lexicographically                                         : Sort the reads lexicographically by contig name, instead of by index.
 -sort_reads                                                     : Sort the reads by referenceId and read position
 -storage_level VAL                                              : Set the storage level to use for caching.
 -stringency VAL                                                 : Stringency level for various checks; can be SILENT, LENIENT, or STRICT. Defaults
                                                                   to LENIENT
 -tag_partition_num N                                            : maximum number of partitions supported by -tag_reads option
 -tag_partition_range N                                          : tag number for a partition
 -tag_reads                                                      : tag read name with serial numbers for coding pair-end information
 -ten_x                                                          : transform 10x format
 -trim_adapter                                                   : trim adapter
 -trim_both                                                      : trim both
 -trim_head                                                      : trim head
 -trim_one                                                       : trim one bp in head and tail
 -trim_poly_g                                                    : trim poly G
 -trim_tail                                                      : trim tail
 -unclip_reads                                                   : If true, unclips reads during realignment.  
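The quality-related options above can be illustrated with a small sketch. It is not ADAM's implementation: it assumes Phred+33 encoding, reads each `-bin_quality_scores` entry as `low,high,score` with a half-open interval `[low, high)`, and assumes `-filter_lq_reads` drops a read once it has at least `max_lq_base` bases below `min_quality`. All function names are hypothetical.

```python
def parse_bins(spec):
    """Parse 'low,high,score;...' into a list of (low, high, score) tuples."""
    bins = []
    for part in spec.split(";"):
        low, high, score = (int(x) for x in part.split(","))
        bins.append((low, high, score))
    return bins


def bin_qualities(qual, bins, offset=33):
    """Rewrite each quality character to the score of the bin it falls in."""
    out = []
    for ch in qual:
        q = ord(ch) - offset
        for low, high, score in bins:
            if low <= q < high:  # assumed half-open interval [low, high)
                q = score
                break
        # Qualities outside every bin are left unchanged in this sketch.
        out.append(chr(q + offset))
    return "".join(out)


def is_low_quality(qual, min_quality="?", max_lq_base=10):
    """True if at least max_lq_base bases fall below min_quality.

    '?' is ASCII 63, i.e. Q30 under Phred+33 (63 - 33 = 30).
    The '>=' comparison is an assumption about the filter's boundary.
    """
    low = sum(1 for ch in qual if ch < min_quality)
    return low >= max_lq_base


bins = parse_bins("0,20,10;20,40,30")
print(bin_qualities("#5I", bins))
```

With the example bin string from the help text, a Q2 base is rewritten to Q10 and a Q20 base to Q30, while a Q40 base falls outside both bins and is kept.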

ConnectedReads

Main program

/usr/local/spark/bin/spark-submit \
  --master spark://server:7077 \
  --class com.atgenomix.connectedreads.cli.GraphSeqMain \
  /seqslab/connectedreads-1.0.0.jar

Usage: connectedreads-submit [<spark-args> --] <annot-args>

Choose one of the following commands:

PREPROCESSING OPERATIONS
             overlap : String Graph Generation
             correct : conduct error correction on input reads

ASSEMBLY OPERATIONS
            assemble : assemble Illumina whole reads via read overlap, read pair, barcode index
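When the three subcommands are driven from a script, the `spark-submit` invocation can be assembled programmatically. This is a sketch only: the master URL, class, and jar path are copied from the commands in this document, while the input and output paths are hypothetical, and placing positional arguments before options is an assumption about the argument parser.

```python
import shlex

SPARK_SUBMIT = "/usr/local/spark/bin/spark-submit"
MASTER = "spark://server:7077"
MAIN_CLASS = "com.atgenomix.connectedreads.cli.GraphSeqMain"
JAR = "/seqslab/connectedreads-1.0.0.jar"


def build_command(subcommand, positional, options=None):
    """Return the spark-submit argument list for one ConnectedReads subcommand."""
    if subcommand not in ("correct", "overlap", "assemble"):
        raise ValueError(f"unknown subcommand: {subcommand}")
    argv = [SPARK_SUBMIT, "--master", MASTER, "--class", MAIN_CLASS, JAR, subcommand]
    argv += list(positional)
    for flag, value in (options or {}).items():
        argv.append(flag)
        if value is not None:  # None marks a switch that takes no value
            argv.append(str(value))
    return argv


# Hypothetical paths, for illustration only.
cmd = build_command(
    "overlap",
    ["/in/reads.adam", "/out/graph"],
    {"-pl_batch": 1, "-rmdup": None},
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems in paths.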

Parallel Error Correction

/usr/local/spark/bin/spark-submit \
  --master spark://server:7077 \
  --class com.atgenomix.connectedreads.cli.GraphSeqMain \
  /seqslab/connectedreads-1.0.0.jar correct
             
Option "-pl_batch" is required
 INPUT                      : Input path (generated by Adam transform)
 OUTPUT                     : Output path
 -assign_N                  : whether to randomly replace N bases in reads with A/C/G/T
 -h (-help, --help, -?)     : Print help
 -keep_err                  : whether to keep error report in tsv format
 -max_N_count N             : upper limit for uncalled base count
 -max_correction_ratio N    : maximal error over correction target base ratio for error identification at a certain internal node [default=0.5]
 -max_err_read N            : maximal reads having errors at a certain internal node [default=1]
 -max_read_length N         : Maximal read length [default = 152]
 -mim_err_depth N           : minimal depth of suffix tree where error will be reported [default=40]
 -min_read_support N        : minimal reads support for error identification at a certain internal node [default=3]
 -mlcp N                    : Minimal longest common prefix [default = 45]
 -output_fastq              : whether to dump all reads to fastq file
 -packing_size N            : The number of reads will be packed together [default = 100]
 -pl_batch N                : Prefix length for number of batches [default=2]
 -pl_partition N            : Prefix length for number of partitions [default=6]
 -print_metrics             : Print metrics to the log on completion
 -profiling                 : Enable performance profiling and output to $OUTPUT/STATS
 -raw_err_cutoff N          : minimum threshold of raw error reports
 -raw_err_group_len N       : range where raw error reports should be grouped as a single error event
 -ref_samples_path STRING[] : provide reference sample paths separated by spaces, e.g. path1 path2 ... pathN
 -seperate_err              : separate error [default=true]
 -stats                     : Enable to output statistics of String Graph to $OUTPUT/STATS
 -total_ploidy N            : total ploidy of input fastq file and reference files
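The help text describes `-pl_batch` and `-pl_partition` only as prefix lengths. One plausible reading, which is an assumption and not confirmed by the tool, is that work is split by the leading bases of each sequence, so a prefix length k over A/C/G/T gives 4^k groups. The sketch below illustrates that reading with hypothetical function names.

```python
from itertools import product


def prefixes(length, alphabet="ACGT"):
    """Enumerate every prefix of the given length over the alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=length)]


def group_by_prefix(sequences, length):
    """Bucket sequences by their first `length` bases."""
    groups = {p: [] for p in prefixes(length)}
    for seq in sequences:
        key = seq[:length]
        # Sequences shorter than the prefix, or containing N, match no bucket
        # and are skipped in this sketch.
        if key in groups:
            groups[key].append(seq)
    return groups


print(len(prefixes(2)), len(prefixes(6)))
```

Under this assumption the defaults listed above (`-pl_batch` 2, `-pl_partition` 6) would correspond to 16 batches and 4096 partitions.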

Parallel String Graph Construction

/usr/local/spark/bin/spark-submit \
  --master spark://server:7077 \
  --class com.atgenomix.connectedreads.cli.GraphSeqMain \
  /seqslab/connectedreads-1.0.0.jar overlap

Option "-pl_batch" is required
 INPUT                  : Input path (generated by Adam transform)
 OUTPUT                 : Output path
 -cache                 : Cache the reads in memory to speedup data processing
 -checkpoint_path VAL   : Checkpoint path
 -h (-help, --help, -?) : Print help
 -max_edges N           : Maximal number of edges per read [default = Integer.MAX_VALUE]
 -max_read_length N     : Maximal read length [default = 152]
 -mlcp N                : Minimal longest common prefix [default = 85]
 -numbering             : Assign a unique number to each read automatically
 -output_fastq          : dump fastq from vertices
 -output_format VAL     : output vertices and edges in format of ASQG | PARQUET
 -packing_size N        : The number of reads will be packed together [default = 100]
 -pl_batch N            : Prefix length for number of batches [default=1]
 -pl_partition N        : Prefix length for number of partitions [default=7]
 -print_metrics         : Print metrics to the log on completion
 -profiling             : Enable performance profiling and output to $OUTPUT/STATS
 -rmdup                 : Remove duplication of reads
 -stats                 : Enable to output statistics of String Graph to $OUTPUT/STATS
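For readers new to string graphs, the idea behind the `overlap` step can be shown with a naive sketch: two reads are linked when a suffix of one equals a prefix of the other and the match is long enough. This quadratic version is for illustration only and is not how ConnectedReads computes overlaps; treating `-mlcp` as the minimum overlap length is an assumption, and the function names are hypothetical.

```python
def overlap_length(a, b, min_overlap):
    """Longest suffix of a that equals a prefix of b, or 0 if below min_overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for n in range(max_len, min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_edges(reads, min_overlap):
    """Return (source_index, target_index, overlap) for every qualifying pair."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_overlap)
                if n:
                    edges.append((i, j, n))
    return edges


reads = ["ACGTAC", "GTACGG", "TTTTTT"]
print(build_edges(reads, min_overlap=3))
```

Each tuple is a directed edge between read indices; a real pipeline would write these as ASQG or Parquet records rather than hold them in memory.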

Parallel Haplotype-sensitive Assembly (HSA)

/usr/local/spark/bin/spark-submit \
  --master spark://server:7077 \
  --class com.atgenomix.connectedreads.cli.GraphSeqMain \
  /seqslab/connectedreads-1.0.0.jar assemble

Argument "VERTEX_INPUT" is required
 VERTEX_INPUT           : Vertex input path (generated by overlap pipeline)
 EDGE_INPUT             : Edge input path (generated by overlap pipeline)
 OUTPUT                 : Output path
 CHECKPOINT_PATH        : Checkpoint path
 -big_steps N           : Big steps [default=400]
 -contig_upper_bound N  : Contig upper bound
 -degree_profiling      : Degree profiling
 -denoise_len N         : Denoise by contig length
 -ee N                  : Bloom filter expected elements [default=40]
 -fpp N                 : Bloom filter false positive rate [default=0.001]
 -h (-help, --help, -?) : Print help
 -input_format VAL      : input format of vertices/edges [ASQG | PARQUET]
 -intersection N        : Bloom filter intersection
 -mod N                 : Modulo for choosing vertices with label B [default=3]
 -one_to_one_profiling  : One-to-one profiling
 -overlap N             : Overlap length
 -partition N           : Pairing partition [default=2100]
 -ploidy N              : Ploidy of the sample, 2 for human
 -print_metrics         : Print metrics to the log on completion
 -small_steps N         : Small steps [default=5]
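The `-ee` and `-fpp` options are the usual two inputs for sizing a Bloom filter. The sketch below applies the standard textbook formulas to show how they trade off; whether ConnectedReads sizes its filter this way, and in what units `-ee` is expressed, is not stated in the help text, so treat this as an assumption. The function name is hypothetical.

```python
import math


def bloom_parameters(expected_elements, false_positive_rate):
    """Return (bits, hash_count) for a standard Bloom filter.

    bits   = ceil(-n * ln(p) / (ln 2)^2)
    hashes = round((bits / n) * ln 2), at least 1
    """
    if expected_elements <= 0:
        raise ValueError("expected_elements must be positive")
    if not 0 < false_positive_rate < 1:
        raise ValueError("false_positive_rate must be between 0 and 1")
    ln2 = math.log(2)
    bits = math.ceil(-expected_elements * math.log(false_positive_rate) / ln2 ** 2)
    hashes = max(1, round(bits / expected_elements * ln2))
    return bits, hashes


bits, hashes = bloom_parameters(40, 0.001)
print(bits, hashes)
```

Lowering the false positive rate increases both the bit count and the number of hash functions, which is the cost to weigh when tuning `-fpp`.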

Data Downloader

ConnectedReads leverages ADAM to export the haplotype-sensitive contigs to local disk.

/usr/local/spark/bin/spark-submit \
  --master spark://server:7077 \
  --class org.bdgenomics.adam.cli.ADAMMain \
  /seqslab/adam-assembly-1.0.0-qual.jar transformAlignments \
  -force_load_parquet ${input assembly parquet folder} \
  ${output FASTQ folder}

hadoop fs -text ${output FASTQ folder}/*.snappy | gzip -1 - > ${local FASTQ GZIP file}
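Once the gzipped FASTQ is on local disk, a quick summary of contig lengths is a useful sanity check. This is a minimal sketch with hypothetical function names; it assumes each record occupies exactly four lines, with the sequence on the second.

```python
import gzip


def contig_lengths(fastq_gz_path):
    """Return the sequence length of every record in a gzipped FASTQ file."""
    lengths = []
    with gzip.open(fastq_gz_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                lengths.append(len(line.strip()))
    return lengths


def summarize(lengths):
    """Collect basic counts for a list of contig lengths."""
    return {
        "count": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
    }
```

For example, `summarize(contig_lengths(path))` on the downloaded file gives the contig count, total bases, and the longest contig; an empty file yields zeros rather than an error.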