class: center, middle

# Bioinformatics Shell Scripting, Writing Pipelines, and Parallelizing Tasks

## Buffalo Chapter 12

---

## Introduction:

--

* Pipelines involve running a sequence of commands on one or many files

--

* Well-written pipelines are both _**robust**_ and _**reproducible**_

--

* Crafting a robust pipeline can be particularly challenging because it involves several commands applied to different files

--

This chapter introduces the following topics related to pipelines:

--

1. Bash shell scripts

--

2. File processing with `find` and `xargs`

--

3. Running parallel pipelines

---

## Basic Bash Scripting:

--

* Bash, in addition to being the shell through which we send commands to a computer's kernel, is also a full-fledged scripting language

--

* While Bash is more limited as a general-purpose language than something like Python, it's great for "duct taping" commands together into a pipeline

--

* So what is a "Bash script"?

--

* A series of commands organized into a rerunnable script, with built-in checks for file availability and errors

--

* Bash scripts conventionally have the extension _.sh_, and you can create them in a simple text editor

---

## The components of your Bash header:

--

A typical Bash header will look something like this:

```
#!/bin/bash
set -e
set -u
set -o pipefail
```

--

`#!/bin/bash`: known as the _shebang_; it indicates the path to the interpreter used to execute the script

--

`set -e`: prevents the shell script from proceeding if one of its commands fails

--

`set -u`: aborts the script if an unset variable is encountered

--

`set -o pipefail`: by default, `set -e` terminates a script only if the _last_ command in a pipe fails; this option makes _any_ error in a pipe cause the script to exit

---

## Running a Bash script:

--

A Bash script can be run by:

--

* Using the `bash` program directly:

```
$ bash script.sh
```

--

* Calling the script as a program:

```
$ ./script.sh
```

--

Calling the script as a program lets the shebang line choose the interpreter (useful for scripts written for other shells), but the script may first need to be given executable permissions:

```
$ chmod u+x script.sh
```

---

## Passing command-line arguments to your Bash script:

--

Create the following Bash script and save it as `args.sh`:

```
#!/bin/bash
set -e
set -u
set -o pipefail

echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"
```

--

Now try running it as:

```
$ bash args.sh arg1 arg2 arg3
```

--

How is Bash passing the command-line arguments to the script?

---

This can be used to create informative error messages:

--

Modify your `args.sh` script as follows:

```
#!/bin/bash
set -e
set -u
set -o pipefail

if [ "$#" -lt 3 ] # are there fewer than 3 arguments?
then
    echo "error: too few arguments, you provided $#, 3 required"
    echo "usage: args.sh arg1 arg2 arg3"
    exit 1
fi

echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3"
```

--

Try running it with three command-line arguments, and then with fewer than three

---
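Before we use other commands in an "if statement," note that Bash's `if` doesn't test a true/false value directly: it runs a command and branches on that command's _exit status_, where 0 means success and anything else means failure (in fact, the `[` above is itself a command)

--

A minimal sketch using the built-in `true` and `false` commands, which do nothing but succeed and fail, respectively; the special variable `$?` holds the exit status of the last command:

```
$ true; echo $?
0
$ false; echo $?
1
```

---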
Now let's create a Bash script that includes an `else` as part of our "if statement"...

--

Save the following script as `genefind.sh`:

```
#!/bin/bash
set -e
set -u
set -o pipefail

if grep "$1" "${2}.txt" > /dev/null
then
    echo "found $1 in ${2}.txt"
else
    echo "did not find $1 in ${2}.txt"
fi
```

--

Take a look at the file `genes.txt` in the Week_12 folder and see if you can determine how to use this script

--

Thinking back to your UNIX exercises, can you think of how to modify this to search two files for the same pattern?

---

Here you go:

```
#!/bin/bash
set -e
set -u
set -o pipefail

if grep "$1" "${2}.txt" > /dev/null && grep "$1" "${3}.txt" > /dev/null
then
    echo "found $1 in ${2}.txt and ${3}.txt"
else
    echo "did not find $1 in ${2}.txt and ${3}.txt"
fi
```

---

## Processing Files with Bash Using "for Loops" and Globbing:

--

We'll often want to run a Bash script across a number of files (_e.g.,_ files with different chromosomes' data)

--

To complete this iterative process, we'll need to:

1. **Select** which files to apply the commands to
2. **Loop** over the data and apply the commands
3. **Keep track** of the names of any output files created

--

Files can be selected either by using information contained in another "guide" file or by matching files in a directory based on some criteria

--

Let's create some temporary files to see how this works:

```
$ mkdir -p zmays-snps/{data/seqs,stats}
$ touch zmays-snps/data/seqs/zmays{A,B,C}_R{1,2}.fastq
```

---

Here's a script that loops over the FASTQ files listed in a guide file, `samples.txt`:

```
#!/bin/bash
set -e
set -u
set -o pipefail

# specify the input samples file, where the third
# column is the path to each sample FASTQ file
sample_info=samples.txt

# create a Bash array from the third column of $sample_info
sample_files=($(cut -f 3 "$sample_info"))

for fastq_file in ${sample_files[@]}
do
    # strip .fastq from each FASTQ file, and add suffix
    # "-stats.txt" to create an output filename for each FASTQ file
    results_file="$(basename $fastq_file .fastq)-stats.txt"

    # run fastq_stat on a file, writing results to the filename we've
    # indicated above
    fastq_stat $fastq_file > zmays-snps/stats/$results_file
done
```

--

Each iteration of this loop takes a single input file, operates on it, and produces a single output file

--

But notice that selecting the input files required a guide file. What if we'd rather select files directly from a directory?

---

Here's an example using Bash globbing:

```
#!/bin/bash
set -e
set -u
set -o pipefail

for fastq_file in zmays-snps/data/seqs/*.fastq
do
    # strip .fastq from each FASTQ file, and add suffix
    # "-stats.txt" to create an output filename for each FASTQ file
    results_file="$(basename $fastq_file .fastq)-stats.txt"

    # run fastq_stat on a file, writing results to the filename we've
    # indicated above
    fastq_stat $fastq_file > zmays-snps/stats/$results_file
done
```

---

## Processing/Matching Files with `find` and `xargs`:

--

`find` can be used in a Bash script to match particular files, and `xargs` can then be used to pass those files on to particular commands

--

`find` is different from `ls` in that it can match files recursively:

--

```
$ find zmays-snps
```

--

The depth to which `find` searches within your directory can be controlled with the `-maxdepth` option:

--

```
$ find zmays-snps -maxdepth 1
```

--

We can match particular patterns in filenames using the `-name` option of `find`:

--

```
$ find zmays-snps -name "zmaysB*fastq"
```

--

Other than searching recursively, can you see differences in the results of `find` vs. `ls`?

---

We can also control what we match with `find` using the `-type` option (here, `f` restricts matches to regular files):

```
$ find zmays-snps -name "zmaysB*fastq" -type f
```

--

How are these options of `find` being logically interpreted?
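--

By default, `find` joins multiple expressions with a logical AND, so the command above prints only names that satisfy _both_ tests; it's equivalent to writing the `-and` operator explicitly:

```
$ find zmays-snps -name "zmaysB*fastq" -and -type f
```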
--

We can modify this logic with the `-or` operator, which matches files satisfying _either_ expression:

--

```
$ find zmays-snps -name "zmaysB*fastq" -or -type f
```

--

Let's create a temporary file so we have something to exclude:

```
$ touch zmays-snps/data/seqs/zmaysB_R1-temp.fastq
```

--

Negation can also be useful for matching files:

```
$ find zmays-snps -type f
$ find zmays-snps -type f -not -name "zmaysC*fastq"
$ find zmays-snps -type f -not -name "zmaysC*fastq" -not -name "*-temp*"
```

---

Once we've matched files, we can pass them along to a program using `xargs`

--

Suppose we want to remove the temporary file we just created; we can match it using `find`:

```
$ find zmays-snps -name "*-temp.fastq"
```

--

But can you pipe this output directly to `rm`?

--

No: `rm` takes its targets as arguments, not from standard input. Here's where `xargs` comes in handy, because it converts its standard input into command-line arguments:

```
$ find zmays-snps -name "*-temp.fastq" | xargs rm
```

--

And finally, with `xargs` you can launch multiple processes in parallel using `-P` (the `-n 1` matters: it gives each `fastq_stat` process a single file, rather than packing every file into one invocation):

```
$ find zmays-snps -name "*.fastq" -type f | xargs -n 1 -P 6 fastq_stat
```
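--

One last `xargs` trick (a minimal sketch; the `backup/` directory is just for illustration): `-I{}` runs the command once per input file, substituting `{}` wherever you place it, which is handy when the filename isn't the last argument:

```
$ mkdir -p backup
$ find zmays-snps -name "*.fastq" -type f | xargs -I{} cp {} backup/
```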