Created
February 12, 2019 11:14
Glue code to tie NGS data in project to HTS workflow
extension :bam

# First block: declare a :BAM_rescore dependency and fill its inputs from the FASTQ files of the sample
dep HTS, :BAM_rescore,
  :fastq1 => :placeholder, :fastq2 => :placeholder, :reference => :placeholder,
  :sample_name => :placeholder,
  :platform_unit => :placeholder,
  :read_group_name => :placeholder,
  :sequencing_center => "CNAG",
  :platform => 'Illumina',
  :library_name => 'LN',
  :interval_list => Bellmunt.interval_list do |jobname,options|

  # The jobname doubles as the sample name
  sample_name = jobname
  options[:sample_name] = sample_name.gsub('_', '.')
  options[:reference] = Bellmunt.reference

  # Collect the FASTQ files whose third "_"-separated field is the sample name
  sample_fastqs = []
  Bellmunt.path.glob("*.fastq.gz").each do |file|
    basename = File.basename(file)
    sample_fastqs << file if basename.split("_")[2] == sample_name
  end

  case sample_fastqs.length
  when 2
    # Regular case: a single paired-end run feeds :BAM_rescore directly
    machine, lane, sample = File.basename(sample_fastqs.first).split("_")
    options[:read_group_name] = [machine, lane] * "."
    options[:platform_unit] = [machine, lane, sample] * "."
    options[:fastq1] = sample_fastqs.sort.first
    options[:fastq2] = sample_fastqs.sort.last
    {:inputs => options, :jobname => jobname}
  when 4
    # Exceptional case: group the files by machine and lane and return one :BAM job per run
    sample_runs = {}
    sample_fastqs.each do |file|
      machine, lane, sample = File.basename(file).split("_")
      sample_runs[[machine,lane]*"."] ||= []
      sample_runs[[machine,lane]*"."] << file
    end
    jobs = []
    num = 1
    sample_runs.each do |run_code,files|
      run_options = {}
      run_options[:read_group_name] = run_code + "." + sample_name
      run_options[:platform_unit] = run_code
      run_options[:fastq1] = files.sort.first
      run_options[:fastq2] = files.sort.last
      jobs << {:task => :BAM, :inputs => options.merge(run_options), :jobname => jobname + "_multiplex_" + num.to_s}
      num += 1
    end
    jobs
  else
    raise "Number of fastq is not 2 or 4: #{Misc.fingerprint sample_fastqs}"
  end
end

# Second block: when more than one dependency was returned above, merge their BAM files
dep HTS, :BAM_multiplex, :compute => :ignore, :reference => Bellmunt.reference, :bam_files => :placeholder do |jobname,options,dependencies|
  if dependencies.flatten.length > 1
    {:jobname => jobname, :inputs => options.merge(:bam_files => dependencies.flatten.collect{|dep| dep.path})}
  else
    []
  end
end

# Third block: in the multiplexed case, run :BAM_rescore with :BAM_multiplex in place of HTS#BAM_duplicates
dep_task :BAM, HTS, :BAM_rescore do |jobname,options, dependencies|
  if (multiplex = dependencies.flatten.select{|dep| dep.task_name == :BAM_multiplex}.first)
    {:inputs => options.merge("HTS#BAM_duplicates" => multiplex), :jobname => jobname + '_multiplexed'}
  else
    []
  end
end
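The four-FASTQ branch groups files by machine and lane before building one job per run. Here is a minimal standalone sketch of that grouping. It assumes file names of the form `machine_lane_sample_read.fastq.gz`; the file names are made up, and `group_by_run` is a hypothetical helper that is not part of the workflow.

```ruby
# Hypothetical helper: group FASTQ paths by a "machine.lane" run code,
# mirroring how the four-file branch fills sample_runs.
def group_by_run(files)
  files.group_by do |file|
    machine, lane, _sample = File.basename(file).split("_")
    [machine, lane] * "."   # Array#* with a String behaves like join
  end
end

# Made-up file names; the naming layout is an assumption
files = %w[
  /data/M01_L1_S1_1.fastq.gz
  /data/M01_L1_S1_2.fastq.gz
  /data/M01_L2_S1_1.fastq.gz
  /data/M01_L2_S1_2.fastq.gz
]

runs = group_by_run(files)
runs.each do |run_code, run_files|
  puts "#{run_code}: #{run_files.sort.first} + #{run_files.sort.last}"
end
```

Each run code maps to one pair of files, which is what lets the block emit one `:BAM` job per run.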
This code block takes the sample name from the jobname given to the task. This is a nice way to organize results: they all live in folders named after the type of information the files hold, with the sample name as the file name. The sample name is used to find the FASTQ files; the path where these files are located is defined elsewhere and is what changes from project to project. Some other parameters, such as read groups, can also be changed.
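As a rough illustration of the lookup described above, the sketch below selects the FASTQ files that belong to a sample by comparing the third underscore-separated field of each file name with the sample name. The helper `fastqs_for_sample` and the file names are hypothetical; only the matching rule is taken from the code.

```ruby
# Hypothetical helper: keep the files whose third "_"-separated field
# equals the sample name, sorted so that read 1 comes before read 2.
def fastqs_for_sample(files, sample_name)
  files.select { |file| File.basename(file).split("_")[2] == sample_name }.sort
end

# Made-up file names following an assumed machine_lane_sample_read layout
files = %w[
  /data/M01_L1_S2_1.fastq.gz
  /data/M01_L1_S1_2.fastq.gz
  /data/M01_L1_S1_1.fastq.gz
]

fastq1, fastq2 = fastqs_for_sample(files, "S1")
puts fastq1
puts fastq2
```

Note that this rule relies on the sample name itself containing no underscores, since the file name is split on `_`.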
This part of the code does one part of the plumbing we discussed in the previous comment. If only two FASTQ files are found, they are simply set as the corresponding parameters of the job, along with other parameters; the actual dependency is returned as a Hash that defines the inputs of the new task and its jobname. The task to be run is assumed to be the one declared in the `dep` function, `BAM_rescore`. The exceptional part is when 4 FASTQs are found. In this case they are separated into two sets of files that are incorporated into not one but two dependency hashes. In fact, the code block can return not just one dependency but an array of them. Furthermore, these dependencies do not even have to be of the same type that was declared; in this case they are of type `:BAM`, because we will inject these dependencies into the dependency graph.
This part of the code injects a dependency of type multiplex. It is an additional step that is only used when there are two `:BAM` dependencies instead of just one `:BAM_rescore`. If there is only one `:BAM_rescore`, the block of code returns no dependencies and nothing is done. If there are two `:BAM` files, a new dependency is added that takes these files and returns a merged version of them with duplicates marked.
This is the last part of the plumbing. Here, if we find that we have a `:BAM_multiplex` dependency and are thus in the exceptional case, we return a `:BAM_rescore` with an altered dependency tree. This alteration consists of replacing `:BAM_duplicates`, which is normally the step where duplicates are marked, with the `:BAM_multiplex` dependency we just set up. The `:BAM_rescore` we have set up is returned as an additional dependency. Remember that in the exceptional case we didn't actually return a `:BAM_rescore` dependency, but two `:BAM` dependencies. The bottom line is that either a `:BAM_rescore` dependency is set up and returned and nothing else happens, or two `:BAM`, a `:BAM_multiplex`, and a `:BAM_rescore` dependencies are returned with some custom plumbing.
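The two outcomes can be summarized with a small plain-Ruby simulation. It does not use the workflow framework: `plan` is a hypothetical function and dependencies are modeled as bare hashes, so it only shows which dependency types end up in the graph for each case.

```ruby
# Hypothetical simulation of the three blocks; not the real workflow API.
def plan(fastq_count)
  # First block: one :BAM_rescore for 2 files, one :BAM per run for 4 files
  deps = case fastq_count
         when 2 then [{ task: :BAM_rescore }]
         when 4 then [{ task: :BAM }, { task: :BAM }]
         else raise ArgumentError, "Number of fastq is not 2 or 4: #{fastq_count}"
         end

  # Second block: add a merge step only if there was more than one dependency
  if deps.length > 1
    deps << { task: :BAM_multiplex, bam_files: deps.length }
  end

  # Third block: rescore on top of the merged BAM when a multiplex step exists
  multiplex = deps.find { |dep| dep[:task] == :BAM_multiplex }
  if multiplex
    deps << { task: :BAM_rescore, "HTS#BAM_duplicates" => multiplex[:task] }
  end

  deps.map { |dep| dep[:task] }
end

puts plan(2).inspect
puts plan(4).inspect
```

With two files only the declared `:BAM_rescore` remains; with four files the list holds two `:BAM`, one `:BAM_multiplex`, and the re-plumbed `:BAM_rescore`.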