script_specific.sh is an example for extracting sequences between two anchors.
The output is a three-column tab-delimited table in the format
| anchor1 | insert | anchor2 |
Only lines that match the pattern are included.
min_insert is the minimum number of characters needed to be a match.
the output file name outfile is defined by removing .fastq.gz from the input, then appending ${gene}.inserts.
zcat opens the gzipped file and pipes it into awk, where (NR-2)%4==0 prints only the lines with the sequence. Specifically, only print if the line number (NR, or number record) minus 2 equals 0 or a multiple of 4. In other words, every 4th line starting at line 2.
grep -E prints lines matching the regular expression pattern we are searching for. -o tells it to only print the matching region, excluding what comes before or after the match.
You could stop at this point and you'd get a bunch of lines that match your pattern, including the anchors on both ends.
sed separates the anchors from the insert with tabs.
See output_example.txt for example (first 20 lines) of output.
See script_generalized.sh for how we could turn this into something usable for all parameters. This
generalized bash script takes command-line arguments and saves them as variables that were previously hard-coded.
You could then run this script the following way:
bash script_generalized.sh his2 TGGTTCAG GCATCTGGCA 51_1_30_L001_R1_001.fastq.gz 10