Skip to content

Instantly share code, notes, and snippets.

@cory-weller
Last active September 8, 2023 12:55
Show Gist options
  • Select an option

  • Save cory-weller/dea6285a15b77414645afaae384ff564 to your computer and use it in GitHub Desktop.

Select an option

Save cory-weller/dea6285a15b77414645afaae384ff564 to your computer and use it in GitHub Desktop.
sim_anchors

Overview

script_specific.sh is an example for extracting sequences between two anchors.

The output is a three-column tab-delimited table in the format

| anchor1 | insert | anchor2 |

Only lines that match the pattern are included.

min_insert is the minimum number of characters needed to be a match.

the output file name outfile is defined by removing .fastq.gz from the input, then appending ${gene}.inserts.

zcat opens the gzipped file and pipes it into awk, where (NR-2)%4==0 prints only the lines with the sequence. Specifically, only print if the line number (NR, or number record) minus 2 equals 0 or a multiple of 4. In other words, every 4th line starting at line 2.

grep -E prints lines matching the regular expression pattern we are searching for. -o tells it to only print the matching region, excluding what comes before or after the match.

You could stop at this point and you'd get a bunch of lines that match your pattern, including the anchors on both ends.

sed separates the anchors from the insert with tabs.

See output_example.txt for example (first 20 lines) of output.

Generalization

See script_generalized.sh for how we could turn this into something usable for all parameters. This generalized bash script takes command-line arguments and saves them as variables that were previously hard-coded.

You could then run this script the following way:

bash script_generalized.sh his2 TGGTTCAG GCATCTGGCA 51_1_30_L001_R1_001.fastq.gz 10
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
TGGTTCAG CTTGAGCTTCAGTTTCTTCAGCTGGTCTGTAGTCCTTGACAGATGGAGCAAGAATTGGTTCTTCTTCTTTTGGTTCAATGATGGTGACA GCATCTGGCA
#!/usr/bin/env bash
gene=${1}
anchor1=${2}
anchor2=${3}
fastq=${4}
min_insert=${5}
outfile="${fastq%.fastq.gz}.${gene}.inserts"
zcat ${fastq} | \
awk '(NR-2)%4==0' | \
grep -E -o "$anchor1[ACTG]{$min_insert,}$anchor2" | \
sed "s/\(${anchor1}\)\([ACTG]\{$min_insert,\}\)\(${anchor2}\)/\1\t\2\t\3/g" | \
> ${outfile}
#!/usr/bin/env bash
cd /data/SBGE/simone/9-22-21_rnaseq
gene='geneA'
anchor1='TGGTTCAG'
anchor2='GCATCTGGCA'
fastq='51_1_30_L001_R1_001.fastq.gz'
min_insert='10'
outfile="${fastq%.fastq.gz}.${gene}.inserts"
zcat ${fastq} | \
awk '(NR-2)%4==0' | \
grep -E -o "$anchor1[ACTG]{$min_insert,}$anchor2" | \
sed "s/\(${anchor1}\)\([ACTG]\{$min_insert,\}\)\(${anchor2}\)/\1\t\2\t\3/g" | \
> ${outfile}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment