docker pull ghcr.io/rocker-org/verse:4.3.3
docker run --rm -ti ghcr.io/rocker-org/verse:4.3.3 bash
quarto create project book abook
cd abook
## I turned off PDF rendering in _quarto.yml because it pulls in a lot of dependencies
quarto render
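The comment above mentions turning off PDF rendering in `_quarto.yml`. Here is a minimal sketch of one way to do that from the shell. It assumes the `format:` section lists `html:` and `pdf:` keys indented by two spaces. The sample file is a hypothetical stand-in written to a temporary directory, not the actual generated config, and editing the file by hand works just as well.

```shell
# Hypothetical stand-in for the _quarto.yml of a freshly created book project
tmpdir=$(mktemp -d)
cfg="$tmpdir/_quarto.yml"
cat > "$cfg" <<'EOF'
project:
  type: book

format:
  html:
    theme: cosmo
  pdf:
    documentclass: scrreprt

editor: visual
EOF

# Drop the two-space-indented "pdf:" key and every more deeply indented
# line that directly follows it; all other lines are printed unchanged.
awk '
  /^  pdf:/       { skip = 1; next }
  skip && /^    / { next }
                  { skip = 0; print }
' "$cfg" > "$cfg.new"
mv "$cfg.new" "$cfg"

cat "$cfg"
```

With only `html` left under `format:`, `quarto render` should no longer attempt the PDF output.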
# bcftools filter
# Filter variants per region (in this example, print out only variants mapped to chr1 and chr2).
# Region names must match the contig names in the VCF header (e.g. chr1,chr2 if the file uses the chr prefix).
bcftools filter -r 1,2 ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.hg38.vcf.gz
# Print info for only 2 samples:
bcftools view -s NA20818,NA20819 filename.vcf.gz
# Print only the variants passing the filter:
bcftools view -f PASS filename.vcf.gz
UPDATE: I have baked the ideas in this file inside a Python CLI tool called pyds-cli. Please find it here: https://github.com/ericmjl/pyds-cli
Having done a number of data projects over the years, and having seen a number of them up on GitHub, I've come to see that there's a wide range in terms of how "readable" a project is. I'd like to share some practices that I have come to adopt in my projects, which I hope will bring some organization to your projects.
Disclaimer: I'm hoping nobody takes this to be "the definitive guide" to organizing a data project; rather, I hope you, the reader, find useful tips that you can adapt to your own projects.
Disclaimer 2: What I’m writing below is primarily geared towards Python users. Some ideas may transfer to other languages; others may not. Please feel free to remix whatever you see here!
Code is clean if it can be understood easily – by everyone on the team. Clean code can be read and enhanced by a developer other than its original author. With understandability comes readability, changeability, extensibility and maintainability.
- Follow standard conventions.
- Keep it simple, stupid (KISS). Simpler is always better. Reduce complexity as much as possible.
- Boy scout rule. Leave the campground cleaner than you found it.
- Always find root cause. Always look for the root cause of a problem.
# Create a simple data frame for testing
df <- data.frame(POSIXtime = seq(as.POSIXct('2013-08-02 12:00'),
                                 as.POSIXct('2013-08-06 05:00'), length.out = 45),
                 x = seq(45))
# The Subset Examples
#
# All data on 2013-08-06
sub.1 <- subset(df, format(POSIXtime, '%d') == '06')