wojtyniakAQ/notebook-ed17bd09-0916-40a1-8c1a-33bea019f2a1-paper-20260319-214012.ipynb

## notebook-ed17bd09-0916-40a1-8c1a-33bea019f2a1-paper-20260319-214012.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Developmental Regulation of Cell Type-Specific Transcription in Drosophila Spermatogenesis\n",
    "\n",
    "**Paper:** Lu et al. (2020) \"Developmental regulation of cell type-specific transcription by novel promoter-proximal sequence elements\"\n",
    "\n",
    "**Published in:** Genes & Development 34:663-677\n",
    "\n",
    "---\n",
    "\n",
    "## Overview\n",
    "\n",
    "This notebook provides an educational walkthrough of the computational workflows used to study transcriptional changes during the transition from proliferating spermatogonia to differentiating spermatocytes in *Drosophila melanogaster*.\n",
    "\n",
    "### Key Findings\n",
    "\n",
    "- **>3,000 genes** (~30% of testis transcriptome) use new promoters when germ cells become spermatocytes\n",
    "- Three classes of differentially regulated genes identified:\n",
    "  1. **Down-regulated genes** (1,155): Lower expression in spermatocytes\n",
    "  2. **Off-to-on genes** (1,841): Newly expressed in spermatocytes\n",
    "  3. **Alternative promoter genes** (1,230): Express from new promoters in spermatocytes\n",
    "\n",
    "- Spermatocyte-specific promoters lack canonical core promoter motifs (TATA, DPE) but contain:\n",
    "  - **tMAC-ChIP motif** (~60 bp upstream of TSS)\n",
    "  - **Achi/Vis motif** (TGTCA)\n",
    "  - **ACA motif** (at +26/+28/+30 positions)\n",
    "  - **CNAAATT motif** (translational control element)\n",
    "\n",
    "- The **tMAC complex** is required for opening chromatin at spermatocyte-specific promoters\n",
    "\n",
    "### Educational Note\n",
    "\n",
    "**This notebook uses small synthetic datasets for demonstration purposes.** It illustrates the analytical approaches without attempting full-scale replication (which would require extensive computational resources). To apply these methods to real data, researchers would need access to:\n",
    "- High-performance computing clusters\n",
    "- Large memory systems (>32GB RAM)\n",
    "- GPU resources for intensive analyses\n",
    "- Full genome reference files and annotations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup: Install Dependencies\n",
    "\n",
    "We'll install the necessary Python packages for RNA-seq analysis, statistical testing, and bioinformatics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Resolved 19 packages in 271ms\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2K\u001b[2mPrepared \u001b[1m15 packages\u001b[0m \u001b[2min 1.48s\u001b[0m\u001b[0m\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K█░░░░░░░░░░░░░░░░░░░ [1/15] \u001b[2mcycler==0.12.1                                      \u001b[0m\r",
      "\u001b[2K██░░░░░░░░░░░░░░░░░░ [2/15] \u001b[2mcycler==0.12.1                                      \u001b[0m\r",
      "\u001b[2K██░░░░░░░░░░░░░░░░░░ [2/15] \u001b[2mpyparsing==3.3.2                                    \u001b[0m\r",
      "\u001b[2K████░░░░░░░░░░░░░░░░ [3/15] \u001b[2mpyparsing==3.3.2                                    \u001b[0m\r",
      "\u001b[2K████░░░░░░░░░░░░░░░░ [3/15] \u001b[2mcontourpy==1.3.3                                    \u001b[0m\r",
      "\u001b[2K█████░░░░░░░░░░░░░░░ [4/15] \u001b[2mcontourpy==1.3.3                                    \u001b[0m\r",
      "\u001b[2K█████░░░░░░░░░░░░░░░ [4/15] \u001b[2mseaborn==0.13.2                                     \u001b[0m\r",
      "\u001b[2K██████░░░░░░░░░░░░░░ [5/15] \u001b[2mseaborn==0.13.2                                     \u001b[0m\r",
      "\u001b[2K██████░░░░░░░░░░░░░░ [5/15] \u001b[2mkiwisolver==1.5.0                                   \u001b[0m\r",
      "\u001b[2K████████░░░░░░░░░░░░ [6/15] \u001b[2mkiwisolver==1.5.0                                   \u001b[0m\r",
      "\u001b[2K████████░░░░░░░░░░░░ [6/15] \u001b[2mjoblib==1.5.3                                       \u001b[0m\r",
      "\u001b[2K█████████░░░░░░░░░░░ [7/15] \u001b[2mjoblib==1.5.3                                       \u001b[0m\r",
      "\u001b[2K█████████░░░░░░░░░░░ [7/15] \u001b[2mfonttools==4.62.1                                   \u001b[0m\r",
      "\u001b[2K██████████░░░░░░░░░░ [8/15] \u001b[2mfonttools==4.62.1                                   \u001b[0m\r",
      "\u001b[2K██████████░░░░░░░░░░ [8/15] \u001b[2mpillow==12.1.1                                      \u001b[0m\r",
      "\u001b[2K████████████░░░░░░░░ [9/15] \u001b[2mpillow==12.1.1                                      \u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K█████████████████░░░ [13/15] \u001b[2mscipy==1.17.1                                      \u001b[0m\r",
      "\u001b[2K\u001b[2mInstalled \u001b[1m15 packages\u001b[0m \u001b[2min 68ms\u001b[0m\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mbiopython\u001b[0m\u001b[2m==1.86\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mcontourpy\u001b[0m\u001b[2m==1.3.3\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mcycler\u001b[0m\u001b[2m==0.12.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mfonttools\u001b[0m\u001b[2m==4.62.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mjoblib\u001b[0m\u001b[2m==1.5.3\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mkiwisolver\u001b[0m\u001b[2m==1.5.0\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mlogomaker\u001b[0m\u001b[2m==0.8.7\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mmatplotlib\u001b[0m\u001b[2m==3.10.8\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mpandas\u001b[0m\u001b[2m==3.0.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mpillow\u001b[0m\u001b[2m==12.1.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mpyparsing\u001b[0m\u001b[2m==3.3.2\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mscikit-learn\u001b[0m\u001b[2m==1.8.0\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mscipy\u001b[0m\u001b[2m==1.17.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mseaborn\u001b[0m\u001b[2m==0.13.2\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mthreadpoolctl\u001b[0m\u001b[2m==3.6.0\u001b[0m\r\n"
     ]
    }
   ],
   "source": [
    "# Install all required packages in a single command\n",
    "!uv pip install numpy pandas matplotlib seaborn scipy scikit-learn biopython logomaker"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Libraries imported successfully!\n",
      "NumPy version: 2.4.2\n",
      "Pandas version: 3.0.1\n"
     ]
    }
   ],
   "source": [
    "# Import libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from scipy import stats\n",
    "from scipy.stats import nbinom, norm\n",
    "from sklearn.decomposition import PCA\n",
    "from Bio import motifs\n",
    "from Bio.Seq import Seq\n",
    "import warnings\n",
    "import re\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Set random seed for reproducibility\n",
    "np.random.seed(42)\n",
    "\n",
    "# Configure plotting\n",
    "plt.style.use('seaborn-v0_8-darkgrid')\n",
    "sns.set_palette(\"husl\")\n",
    "\n",
    "print(\"Libraries imported successfully!\")\n",
    "print(f\"NumPy version: {np.__version__}\")\n",
    "print(f\"Pandas version: {pd.__version__}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Part 1: RNA-seq Differential Expression Analysis\n",
    "\n",
    "## Background\n",
    "\n",
    "The paper used RNA-seq to identify genes differentially expressed during the spermatogonia-to-spermatocyte transition. The experimental design:\n",
    "\n",
    "- **bam−/−**: Testes from *bam* mutants (enriched for spermatogonia)\n",
    "- **48hrPHS**: 48 hours post-heat shock (early spermatocytes)\n",
    "- **72hrPHS**: 72 hours post-heat shock (mature spermatocytes)\n",
    "\n",
    "### Methodology\n",
    "\n",
    "1. RNA-seq library preparation with Ribo-Zero rRNA removal\n",
    "2. Illumina NextSeq 500 sequencing (~34-40M reads per sample)\n",
    "3. Read trimming with TrimGalore\n",
    "4. Alignment with STAR to *D. melanogaster* dm6 genome\n",
    "5. Read counting with STAR using Ensembl BDGP6.84 annotation\n",
    "6. **Differential expression with DESeq2**\n",
    "\n",
    "## Generate Synthetic RNA-seq Count Data\n",
    "\n",
    "We'll create a small synthetic dataset representing ~1,000 genes with realistic count distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generating synthetic RNA-seq count data...\n",
      "\n",
      "Generated count matrix: (1000, 6)\n",
      "Genes: 1000, Samples: 6\n",
      "\n",
      "First few rows:\n",
      "           bam_rep1  bam_rep2  48hrPHS_rep1  48hrPHS_rep2  72hrPHS_rep1  \\\n",
      "Gene_0000     16856     23728          3651          3269          3924   \n",
      "Gene_0001     12182     17988          4107          5426          5196   \n",
      "Gene_0002     24548     27257          4832         14915          6149   \n",
      "Gene_0003     16482     13968          3772          4324          5613   \n",
      "Gene_0004     10382     12153          8436          3966          6323   \n",
      "\n",
      "           72hrPHS_rep2  \n",
      "Gene_0000          2522  \n",
      "Gene_0001          3099  \n",
      "Gene_0002          9337  \n",
      "Gene_0003          7698  \n",
      "Gene_0004          7329  \n",
      "\n",
      "Count distribution summary:\n",
      "           bam_rep1      bam_rep2   48hrPHS_rep1   48hrPHS_rep2  \\\n",
      "count   1000.000000   1000.000000    1000.000000    1000.000000   \n",
      "mean    8695.827000   8889.343000   12571.219000   12003.741000   \n",
      "std     8883.850925   9210.899318   15843.153057   15082.317511   \n",
      "min        6.000000      4.000000      63.000000     129.000000   \n",
      "25%      837.250000    861.250000    3505.000000    3510.500000   \n",
      "50%     7002.000000   6950.000000    7125.000000    6808.000000   \n",
      "75%    12664.250000  13335.250000   14120.000000   13851.750000   \n",
      "max    57753.000000  62787.000000  114366.000000  125827.000000   \n",
      "\n",
      "        72hrPHS_rep1   72hrPHS_rep2  \n",
      "count    1000.000000    1000.000000  \n",
      "mean    12588.889000   12516.246000  \n",
      "std     16177.509766   15447.839459  \n",
      "min       107.000000      94.000000  \n",
      "25%      3489.250000    3624.000000  \n",
      "50%      7197.000000    7211.000000  \n",
      "75%     13874.750000   14398.750000  \n",
      "max    124881.000000  111634.000000  \n"
     ]
    }
   ],
   "source": [
    "def generate_synthetic_rnaseq_data(n_genes=1000, n_replicates=2):\n",
    "    \"\"\"\n",
    "    Generate synthetic RNA-seq count data mimicking the paper's experimental design.\n",
    "    \n",
    "    Note: the sample names are hard-coded for two replicates per condition,\n",
    "    so n_replicates is currently not used.\n",
    "    \n",
    "    Returns:\n",
    "    - counts_df: DataFrame with raw counts\n",
    "    - true_labels: True gene categories for validation\n",
    "    \"\"\"\n",
    "    \n",
    "    # Gene names\n",
    "    genes = [f\"Gene_{i:04d}\" for i in range(n_genes)]\n",
    "    \n",
    "    # Initialize count matrix\n",
    "    # Columns: bam_rep1, bam_rep2, 48hrPHS_rep1, 48hrPHS_rep2, 72hrPHS_rep1, 72hrPHS_rep2\n",
    "    sample_names = [\n",
    "        'bam_rep1', 'bam_rep2',\n",
    "        '48hrPHS_rep1', '48hrPHS_rep2',\n",
    "        '72hrPHS_rep1', '72hrPHS_rep2'\n",
    "    ]\n",
    "    \n",
    "    counts = np.zeros((n_genes, len(sample_names)), dtype=int)\n",
    "    true_labels = []\n",
    "    \n",
    "    # Define gene categories based on paper's criteria\n",
    "    n_down = int(0.15 * n_genes)  # ~15% down-regulated\n",
    "    n_off_to_on = int(0.25 * n_genes)  # ~25% off-to-on\n",
    "    n_alt_promoter = int(0.20 * n_genes)  # ~20% alternative promoter\n",
    "    n_unchanged = n_genes - n_down - n_off_to_on - n_alt_promoter\n",
    "    \n",
    "    gene_idx = 0\n",
    "    \n",
    "    # 1. Down-regulated genes (high in bam, low in 48hr/72hr)\n",
    "    for i in range(n_down):\n",
    "        base_count_bam = np.random.randint(500, 3000)\n",
    "        base_count_sperm = base_count_bam // np.random.randint(2, 8)  # >=2-fold decrease\n",
    "        \n",
    "        # Add biological variation.\n",
    "        # Note: nbinom.rvs(5, 0.5/(0.5+mu)) has mean 10*mu, so after adding the base\n",
    "        # value the simulated counts average about 11x the base count. The same\n",
    "        # scaling applies to every category, so expected fold changes are preserved.\n",
    "        counts[gene_idx, 0:2] = nbinom.rvs(5, 0.5/(0.5+base_count_bam), size=2) + base_count_bam\n",
    "        counts[gene_idx, 2:4] = nbinom.rvs(5, 0.5/(0.5+base_count_sperm), size=2) + base_count_sperm\n",
    "        counts[gene_idx, 4:6] = nbinom.rvs(5, 0.5/(0.5+base_count_sperm), size=2) + base_count_sperm\n",
    "        \n",
    "        true_labels.append('down-regulated')\n",
    "        gene_idx += 1\n",
    "    \n",
    "    # 2. Off-to-on genes (very low in bam, high in 48hr/72hr)\n",
    "    for i in range(n_off_to_on):\n",
    "        base_count_bam = np.random.randint(1, 50)  # Nearly zero\n",
    "        base_count_sperm = base_count_bam * np.random.randint(10, 30)  # >8-fold increase\n",
    "        \n",
    "        counts[gene_idx, 0:2] = nbinom.rvs(5, 0.5/(0.5+base_count_bam), size=2) + base_count_bam\n",
    "        counts[gene_idx, 2:4] = nbinom.rvs(5, 0.5/(0.5+base_count_sperm), size=2) + base_count_sperm\n",
    "        counts[gene_idx, 4:6] = nbinom.rvs(5, 0.5/(0.5+base_count_sperm), size=2) + base_count_sperm\n",
    "        \n",
    "        true_labels.append('off-to-on')\n",
    "        gene_idx += 1\n",
    "    \n",
    "    # 3. Alternative promoter genes (expressed in both, moderate change)\n",
    "    for i in range(n_alt_promoter):\n",
    "        base_count = np.random.randint(300, 2000)\n",
    "        fold_change = np.random.uniform(1.5, 4)  # Moderate change\n",
    "        \n",
    "        counts[gene_idx, 0:2] = nbinom.rvs(5, 0.5/(0.5+base_count), size=2) + base_count\n",
    "        counts[gene_idx, 2:4] = nbinom.rvs(5, 0.5/(0.5+int(base_count*fold_change)), size=2) + int(base_count*fold_change)\n",
    "        counts[gene_idx, 4:6] = nbinom.rvs(5, 0.5/(0.5+int(base_count*fold_change)), size=2) + int(base_count*fold_change)\n",
    "        \n",
    "        true_labels.append('alternative-promoter')\n",
    "        gene_idx += 1\n",
    "    \n",
    "    # 4. Unchanged genes\n",
    "    for i in range(n_unchanged):\n",
    "        base_count = np.random.randint(100, 1500)\n",
    "        \n",
    "        counts[gene_idx, 0:2] = nbinom.rvs(5, 0.5/(0.5+base_count), size=2) + base_count\n",
    "        counts[gene_idx, 2:4] = nbinom.rvs(5, 0.5/(0.5+base_count), size=2) + base_count\n",
    "        counts[gene_idx, 4:6] = nbinom.rvs(5, 0.5/(0.5+base_count), size=2) + base_count\n",
    "        \n",
    "        true_labels.append('unchanged')\n",
    "        gene_idx += 1\n",
    "    \n",
    "    # Create DataFrame\n",
    "    counts_df = pd.DataFrame(counts, columns=sample_names, index=genes)\n",
    "    \n",
    "    return counts_df, true_labels\n",
    "\n",
    "# Generate data\n",
    "print(\"Generating synthetic RNA-seq count data...\")\n",
    "counts_df, true_labels = generate_synthetic_rnaseq_data(n_genes=1000)\n",
    "\n",
    "print(f\"\\nGenerated count matrix: {counts_df.shape}\")\n",
    "print(f\"Genes: {counts_df.shape[0]}, Samples: {counts_df.shape[1]}\")\n",
    "print(\"\\nFirst few rows:\")\n",
    "print(counts_df.head())\n",
    "\n",
    "# Summary statistics\n",
    "print(\"\\nCount distribution summary:\")\n",
    "print(counts_df.describe())"
   ]
  },
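  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The generator above draws its noise with `nbinom.rvs(5, 0.5/(0.5+base))` and then adds the base value. As a hedged alternative, the sketch below samples counts directly from a negative binomial with a chosen mean `mu` and dispersion `alpha`, so that the variance is `mu + alpha * mu**2` (the mean-dispersion form that DESeq2 models). The helper `sample_nb_counts` and the values of `mu` and `alpha` are illustrative assumptions, not part of the paper's pipeline.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from scipy.stats import nbinom\n",
    "\n",
    "\n",
    "def sample_nb_counts(mu, alpha, size, rng):\n",
    "    \"\"\"Draw counts with mean mu and variance mu + alpha * mu**2.\"\"\"\n",
    "    n = 1.0 / alpha   # scipy's shape parameter n (need not be an integer)\n",
    "    p = n / (n + mu)  # scipy's success probability\n",
    "    return nbinom.rvs(n, p, size=size, random_state=rng)\n",
    "\n",
    "\n",
    "# Hypothetical settings: mean 1000, dispersion 0.2\n",
    "rng = np.random.default_rng(0)\n",
    "draws = sample_nb_counts(mu=1000, alpha=0.2, size=100_000, rng=rng)\n",
    "print(f'sample mean:       {draws.mean():.0f}')\n",
    "print(f'sample variance:   {draws.var():.0f}')\n",
    "print(f'expected variance: {1000 + 0.2 * 1000**2:.0f}')\n",
    "```\n",
    "\n",
    "Parameterizing by `mu` keeps the simulated mean equal to the intended base count, which makes expression thresholds easier to reason about."
   ]
  },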
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Quality Control\n",
    "\n",
    "Before differential expression analysis, we perform quality control similar to standard RNA-seq workflows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize library sizes and distributions\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Library sizes (total counts per sample)\n",
    "lib_sizes = counts_df.sum(axis=0)\n",
    "axes[0].bar(range(len(lib_sizes)), lib_sizes, color=['#e74c3c', '#e74c3c', '#3498db', '#3498db', '#2ecc71', '#2ecc71'])\n",
    "axes[0].set_xlabel('Sample', fontsize=12)\n",
    "axes[0].set_ylabel('Total Read Count', fontsize=12)\n",
    "axes[0].set_title('Library Sizes Across Samples', fontsize=14, fontweight='bold')\n",
    "axes[0].set_xticks(range(len(lib_sizes)))\n",
    "axes[0].set_xticklabels(counts_df.columns, rotation=45, ha='right')\n",
    "axes[0].grid(axis='y', alpha=0.3)\n",
    "\n",
    "# Distribution of log2 counts\n",
    "log_counts = np.log2(counts_df + 1)  # Add pseudocount to avoid log(0)\n",
    "for i, col in enumerate(counts_df.columns):\n",
    "    axes[1].hist(log_counts[col], bins=50, alpha=0.5, label=col)\n",
    "axes[1].set_xlabel('log2(count + 1)', fontsize=12)\n",
    "axes[1].set_ylabel('Frequency', fontsize=12)\n",
    "axes[1].set_title('Distribution of Gene Expression Levels', fontsize=14, fontweight='bold')\n",
    "axes[1].legend(fontsize=8, loc='upper right')\n",
    "axes[1].grid(alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"Library sizes (total counts):\")\n",
    "for sample, size in lib_sizes.items():\n",
    "    print(f\"  {sample}: {size:,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DESeq2-Style Normalization and Differential Expression\n",
    "\n",
    "The paper used DESeq2 for differential expression analysis. DESeq2:\n",
    "1. Normalizes for library size using the median-of-ratios method\n",
    "2. Models counts with a negative binomial distribution\n",
    "3. Estimates a per-gene dispersion (the variance in excess of Poisson noise)\n",
    "4. Tests for differential expression with a Wald test\n",
    "\n",
    "We'll implement a simplified stand-in for this workflow: median-of-ratios-style size factors, followed by a per-gene t-test on log-transformed counts with Benjamini-Hochberg correction in place of the negative binomial Wald test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Performing DESeq2-style normalization...\n",
      "\n",
      "Size factors:\n",
      "  bam_rep1: 0.735\n",
      "  bam_rep2: 0.748\n",
      "  48hrPHS_rep1: 1.256\n",
      "  48hrPHS_rep2: 1.176\n",
      "  72hrPHS_rep1: 1.245\n",
      "  72hrPHS_rep2: 1.254\n",
      "\n",
      "Performing differential expression analysis (72hrPHS vs bam)...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Differential expression results: 1000 genes analyzed\n",
      "\n",
      "Top 10 most significantly DE genes:\n",
      "         gene  log2FoldChange    pvalue      padj\n",
      "4   Gene_0004       -1.475157  0.009351  0.080616\n",
      "8   Gene_0008       -3.189430  0.003257  0.080616\n",
      "12  Gene_0012       -2.910046  0.003779  0.080616\n",
      "14  Gene_0014       -3.917046  0.009292  0.080616\n",
      "18  Gene_0018       -2.729305  0.002105  0.080616\n",
      "28  Gene_0028       -2.866352  0.009386  0.080616\n",
      "29  Gene_0029       -2.568444  0.009616  0.080616\n",
      "32  Gene_0032       -4.048315  0.003466  0.080616\n",
      "34  Gene_0034       -2.410950  0.002588  0.080616\n",
      "44  Gene_0044       -3.602962  0.000699  0.080616\n"
     ]
    }
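   ],
   "source": [
    "def deseq2_normalize(counts_df):\n",
    "    \"\"\"\n",
    "    DESeq2-style normalization using the median-of-ratios method.\n",
    "    \n",
    "    A pseudocount of 1 is added before taking the geometric mean so that zero\n",
    "    counts do not produce log(0); DESeq2 itself skips genes with a zero count\n",
    "    in any sample.\n",
    "    \"\"\"\n",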
    "    # Calculate geometric mean of each gene across samples\n",
    "    geo_means = np.exp(np.mean(np.log(counts_df + 1), axis=1))\n",
    "    \n",
    "    # Calculate size factors\n",
    "    size_factors = []\n",
    "    for col in counts_df.columns:\n",
    "        ratios = counts_df[col] / geo_means\n",
    "        ratios = ratios[np.isfinite(ratios) & (ratios > 0)]\n",
    "        size_factors.append(np.median(ratios))\n",
    "    \n",
    "    size_factors = np.array(size_factors)\n",
    "    \n",
    "    # Normalize counts\n",
    "    normalized = counts_df / size_factors\n",
    "    \n",
    "    return normalized, size_factors\n",
    "\n",
    "def calculate_fold_change(counts_df, condition1_cols, condition2_cols):\n",
    "    \"\"\"\n",
    "    Calculate log2 fold change between two conditions.\n",
    "    \"\"\"\n",
    "    # Add pseudocount to avoid division by zero\n",
    "    mean1 = counts_df[condition1_cols].mean(axis=1) + 1\n",
    "    mean2 = counts_df[condition2_cols].mean(axis=1) + 1\n",
    "    \n",
    "    log2fc = np.log2(mean2 / mean1)\n",
    "    \n",
    "    return log2fc\n",
    "\n",
    "def simple_differential_expression(counts_df, condition1_cols, condition2_cols):\n",
    "    \"\"\"\n",
    "    Simplified differential expression analysis.\n",
    "    Uses t-test on log-transformed counts (as a proxy for DESeq2's more sophisticated approach).\n",
    "    \"\"\"\n",
    "    results = []\n",
    "    \n",
    "    for gene in counts_df.index:\n",
    "        counts1 = counts_df.loc[gene, condition1_cols].values\n",
    "        counts2 = counts_df.loc[gene, condition2_cols].values\n",
    "        \n",
    "        # Log transform with pseudocount\n",
    "        log_counts1 = np.log2(counts1 + 1)\n",
    "        log_counts2 = np.log2(counts2 + 1)\n",
    "        \n",
    "        # T-test\n",
    "        t_stat, p_value = stats.ttest_ind(log_counts2, log_counts1)\n",
    "        \n",
    "        # Calculate fold change\n",
    "        mean1 = counts1.mean() + 1\n",
    "        mean2 = counts2.mean() + 1\n",
    "        log2fc = np.log2(mean2 / mean1)\n",
    "        \n",
    "        results.append({\n",
    "            'gene': gene,\n",
    "            'baseMean1': mean1,\n",
    "            'baseMean2': mean2,\n",
    "            'log2FoldChange': log2fc,\n",
    "            'pvalue': p_value,\n",
    "            'stat': t_stat\n",
    "        })\n",
    "    \n",
    "    results_df = pd.DataFrame(results)\n",
    "    \n",
    "    # Benjamini-Hochberg FDR correction\n",
    "    from scipy.stats import false_discovery_control\n",
    "    results_df['padj'] = false_discovery_control(results_df['pvalue'])\n",
    "    \n",
    "    return results_df\n",
    "\n",
    "# Normalize counts\n",
    "print(\"Performing DESeq2-style normalization...\")\n",
    "normalized_counts, size_factors = deseq2_normalize(counts_df)\n",
    "\n",
    "print(\"\\nSize factors:\")\n",
    "for sample, sf in zip(counts_df.columns, size_factors):\n",
    "    print(f\"  {sample}: {sf:.3f}\")\n",
    "\n",
    "# Perform differential expression: 72hrPHS vs bam\n",
    "print(\"\\nPerforming differential expression analysis (72hrPHS vs bam)...\")\n",
    "de_results = simple_differential_expression(\n",
    "    normalized_counts,\n",
    "    condition1_cols=['bam_rep1', 'bam_rep2'],\n",
    "    condition2_cols=['72hrPHS_rep1', '72hrPHS_rep2']\n",
    ")\n",
    "\n",
    "print(f\"\\nDifferential expression results: {de_results.shape[0]} genes analyzed\")\n",
    "print(\"\\nTop 10 most significantly DE genes:\")\n",
    "print(de_results.nsmallest(10, 'padj')[['gene', 'log2FoldChange', 'pvalue', 'padj']])"
   ]
  },
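  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, here is a minimal sketch of median-of-ratios size factors computed on the log scale, using only genes with non-zero counts in every sample (no pseudocount). The function name `median_of_ratios_size_factors` and the toy matrix are hypothetical and exist only for illustration; this is a simplified reading of the method, not DESeq2's own code.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def median_of_ratios_size_factors(counts):\n",
    "    \"\"\"Per-sample size factors; genes with any zero count are ignored.\"\"\"\n",
    "    nonzero = counts[(counts > 0).all(axis=1)]\n",
    "    log_counts = np.log(nonzero)\n",
    "    # Per-gene log geometric mean acts as a pseudo-reference sample\n",
    "    log_geo_mean = log_counts.mean(axis=1)\n",
    "    log_ratios = log_counts.sub(log_geo_mean, axis=0)\n",
    "    # Median ratio per sample, back-transformed from the log scale\n",
    "    return np.exp(log_ratios.median(axis=0))\n",
    "\n",
    "\n",
    "# Toy data: s2 is sequenced twice as deeply as s1, and s3 four times as deeply\n",
    "toy = pd.DataFrame(\n",
    "    {'s1': [10, 20, 30, 0], 's2': [20, 40, 60, 5], 's3': [40, 80, 120, 7]},\n",
    "    index=['g1', 'g2', 'g3', 'g4'],\n",
    ")\n",
    "size_factors = median_of_ratios_size_factors(toy)\n",
    "print(size_factors)\n",
    "print(toy.div(size_factors, axis=1))\n",
    "```\n",
    "\n",
    "In the toy matrix the gene with a zero count (`g4`) is excluded, and the remaining genes give size factors of 0.5, 1 and 2 (up to floating-point error), so dividing by them puts the three samples on a common scale."
   ]
  },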
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gene Classification Based on Paper's Criteria\n",
    "\n",
    "Following the paper's methodology:\n",
    "\n",
    "1. **Down-regulated genes**: Expressed ≥2-fold less in spermatocytes\n",
    "2. **Off-to-on genes**: Low/negligible expression in bam−/−, >8-fold increase by 48hrPHS or >16-fold by 72hrPHS\n",
    "3. **Alternative promoter genes**: Expressed in both conditions (would need CAGE data to confirm different TSS)\n",
    "4. **Unchanged genes**: No significant change"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def classify_genes(de_results, counts_df, padj_cutoff=0.05):\n",
    "    \"\"\"\n",
    "    Classify genes based on the paper's criteria.\n",
    "    \"\"\"\n",
    "    classifications = []\n",
    "    \n",
    "    for idx, row in de_results.iterrows():\n",
    "        gene = row['gene']\n",
    "        log2fc = row['log2FoldChange']\n",
    "        padj = row['padj']\n",
    "        \n",
    "        # Get mean expression in each condition\n",
    "        mean_bam = counts_df.loc[gene, ['bam_rep1', 'bam_rep2']].mean()\n",
    "        mean_72hr = counts_df.loc[gene, ['72hrPHS_rep1', '72hrPHS_rep2']].mean()\n",
    "        \n",
    "        # Classify\n",
    "        if padj > padj_cutoff:\n",
    "            category = 'unchanged'\n",
    "        elif log2fc <= -1:  # >=2-fold decrease\n",
    "            category = 'down-regulated'\n",
    "        elif mean_bam < 50 and log2fc > 3:  # Low in bam, >8-fold increase\n",
    "            category = 'off-to-on'\n",
    "        elif log2fc > 0.5 and mean_bam > 100:  # Expressed in both, moderate increase\n",
    "            category = 'alternative-promoter'\n",
    "        else:\n",
    "            category = 'unchanged'\n",
    "        \n",
    "        classifications.append(category)\n",
    "    \n",
    "    de_results['classification'] = classifications\n",
    "    \n",
    "    return de_results\n",
    "\n",
    "# Classify genes\n",
    "# With two replicates per condition the t-test has little power: the smallest\n",
    "# adjusted p-value in the run above is ~0.081, so the default cutoff of 0.05\n",
    "# would label every gene 'unchanged'. A relaxed cutoff is used here instead.\n",
    "de_results = classify_genes(de_results, counts_df, padj_cutoff=0.1)\n",
    "\n",
    "# Count classifications\n",
    "classification_counts = de_results['classification'].value_counts()\n",
    "\n",
    "print(\"\\nGene Classification Summary:\")\n",
    "print(\"=\"*50)\n",
    "for category, count in classification_counts.items():\n",
    "    percentage = (count / len(de_results)) * 100\n",
    "    print(f\"{category:25s}: {count:4d} ({percentage:5.1f}%)\")\n",
    "\n",
    "print(\"\\nPaper's findings for comparison:\")\n",
    "print(\"  Down-regulated           : 1,155 (11.6%)\")\n",
    "print(\"  Off-to-on                : 1,841 (18.5%)\")\n",
    "print(\"  Alternative promoter     : 1,230 (12.4%)\")\n",
    "print(\"  Total genes analyzed     : 9,371\")"
   ]
  },
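  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the synthetic data come with known categories, the classification can be checked against them. The sketch below cross-tabulates true and predicted labels with `pd.crosstab`; the helper name `label_confusion` and the toy labels are assumptions made for illustration.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "\n",
    "def label_confusion(true_labels, predicted_labels):\n",
    "    \"\"\"Cross-tabulate true vs. predicted categories, with row and column totals.\"\"\"\n",
    "    return pd.crosstab(\n",
    "        pd.Series(list(true_labels), name='true'),\n",
    "        pd.Series(list(predicted_labels), name='predicted'),\n",
    "        margins=True,\n",
    "    )\n",
    "\n",
    "\n",
    "# Toy labels: one off-to-on gene is missed and called 'unchanged'\n",
    "toy_true = ['off-to-on', 'off-to-on', 'down-regulated', 'unchanged']\n",
    "toy_pred = ['off-to-on', 'unchanged', 'down-regulated', 'unchanged']\n",
    "print(label_confusion(toy_true, toy_pred))\n",
    "```\n",
    "\n",
    "In this notebook the equivalent call would be `label_confusion(true_labels, de_results['classification'])`, since `de_results` keeps the gene order of `counts_df`. Off-diagonal counts show which categories the thresholds confuse."
   ]
  },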
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualization: MA Plot and Volcano Plot\n",
    "\n",
    "Standard visualizations for differential expression analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(16, 6))\n",
    "\n",
    "# MA Plot (log2FC vs mean expression)\n",
    "de_results['baseMean'] = (de_results['baseMean1'] + de_results['baseMean2']) / 2\n",
    "\n",
    "color_map = {\n",
    "    'down-regulated': '#e74c3c',\n",
    "    'off-to-on': '#3498db',\n",
    "    'alternative-promoter': '#9b59b6',\n",
    "    'unchanged': '#95a5a6'\n",
    "}\n",
    "\n",
    "for category in ['unchanged', 'down-regulated', 'off-to-on', 'alternative-promoter']:\n",
    "    subset = de_results[de_results['classification'] == category]\n",
    "    axes[0].scatter(\n",
    "        np.log10(subset['baseMean']),\n",
    "        subset['log2FoldChange'],\n",
    "        c=color_map[category],\n",
    "        s=10,\n",
    "        alpha=0.6,\n",
    "        label=category\n",
    "    )\n",
    "\n",
    "axes[0].axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)\n",
    "axes[0].axhline(y=1, color='gray', linestyle=':', linewidth=1, alpha=0.5)\n",
    "axes[0].axhline(y=-1, color='gray', linestyle=':', linewidth=1, alpha=0.5)\n",
    "axes[0].set_xlabel('log10(Mean Expression)', fontsize=12)\n",
    "axes[0].set_ylabel('log2(Fold Change)', fontsize=12)\n",
    "axes[0].set_title('MA Plot: 72hrPHS vs bam', fontsize=14, fontweight='bold')\n",
    "axes[0].legend(loc='upper left', fontsize=9)\n",
    "axes[0].grid(alpha=0.3)\n",
    "\n",
    "# Volcano Plot (log2FC vs -log10(padj))\n",
    "for category in ['unchanged', 'down-regulated', 'off-to-on', 'alternative-promoter']:\n",
    "    subset = de_results[de_results['classification'] == category]\n",
    "    axes[1].scatter(\n",
    "        subset['log2FoldChange'],\n",
    "        -np.log10(subset['padj'] + 1e-300),  # Add small value to avoid log(0)\n",
    "        c=color_map[category],\n",
    "        s=10,\n",
    "        alpha=0.6,\n",
    "        label=category\n",
    "    )\n",
    "\n",
    "axes[1].axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.5)\n",
    "axes[1].axhline(y=-np.log10(0.05), color='gray', linestyle=':', linewidth=1, alpha=0.5)\n",
    "axes[1].set_xlabel('log2(Fold Change)', fontsize=12)\n",
    "axes[1].set_ylabel('-log10(Adjusted P-value)', fontsize=12)\n",
    "axes[1].set_title('Volcano Plot: 72hrPHS vs bam', fontsize=14, fontweight='bold')\n",
    "axes[1].legend(loc='upper right', fontsize=9)\n",
    "axes[1].grid(alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PCA Analysis\n",
    "\n",
    "Principal Component Analysis to visualize sample relationships."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform PCA on log-transformed normalized counts\n",
    "# (log2(x + 1) serves as a simple stand-in for DESeq2's rlog transformation)\n",
    "log_norm_counts = np.log2(normalized_counts + 1)\n",
    "\n",
    "# Keep only genes with sufficient total expression across samples\n",
    "expressed_genes = (counts_df.sum(axis=1) > 50)\n",
    "log_norm_filtered = log_norm_counts[expressed_genes]\n",
    "\n",
    "pca = PCA(n_components=2)\n",
    "pca_result = pca.fit_transform(log_norm_filtered.T)\n",
    "\n",
    "# Create PCA plot\n",
    "fig, ax = plt.subplots(figsize=(10, 7))\n",
    "\n",
    "conditions = ['bam', 'bam', '48hrPHS', '48hrPHS', '72hrPHS', '72hrPHS']\n",
    "colors_pca = {'bam': '#e74c3c', '48hrPHS': '#3498db', '72hrPHS': '#2ecc71'}\n",
    "markers = ['o', 's', 'o', 's', 'o', 's']\n",
    "\n",
    "for i, (sample, condition, marker) in enumerate(zip(counts_df.columns, conditions, markers)):\n",
    "    ax.scatter(\n",
    "        pca_result[i, 0],\n",
    "        pca_result[i, 1],\n",
    "        c=colors_pca[condition],\n",
    "        marker=marker,\n",
    "        s=200,\n",
    "        edgecolors='black',\n",
    "        linewidth=1.5,\n",
    "        alpha=0.8,\n",
    "        label=sample\n",
    "    )\n",
    "\n",
    "ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% variance)', fontsize=12)\n",
    "ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% variance)', fontsize=12)\n",
    "ax.set_title('PCA: Sample Relationships', fontsize=14, fontweight='bold')\n",
    "ax.legend(loc='best', fontsize=9)\n",
    "ax.grid(alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"\\nPCA explained variance:\")\n",
    "print(f\"  PC1: {pca.explained_variance_ratio_[0]*100:.2f}%\")\n",
    "print(f\"  PC2: {pca.explained_variance_ratio_[1]*100:.2f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Part 2: Promoter Sequence Analysis and Motif Discovery\n",
    "\n",
    "## Background\n",
    "\n",
    "The paper identified several key motifs in spermatocyte-specific promoters:\n",
    "\n",
    "1. **tMAC-ChIP motif**: Found ~60 bp upstream of the TSS\n",
    "2. **Achi/Vis motif**: TGTCA (TALE-class homeodomain binding site)\n",
    "3. **Initiator (Inr) motif**: TCA at position +1 (TSS)\n",
    "4. **ACA motif**: At positions +26, +28, or +30\n",
    "5. **CNAAATT motif**: Between +29 and +60 (translational control element)\n",
    "\n",
    "### Methodology\n",
    "\n",
    "- Motif discovery with MEME-ChIP\n",
    "- Positional enrichment with CentriMo\n",
    "- Analysis of 300 bp regions centered on CAGE clusters\n",
    "\n",
    "## Generate Synthetic Promoter Sequences\n",
    "\n",
    "We'll create synthetic promoter sequences with realistic base compositions and embedded motifs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_promoter_sequence(length=300, gc_content=0.45, motifs_to_embed=None):\n",
    "    \"\"\"\n",
    "    Generate a random DNA sequence with specified GC content and embedded motifs.\n",
    "    \n",
    "    motifs_to_embed: list of (motif_seq, position) tuples, where position is the\n",
    "    0-based start index; motifs that would not fit inside the sequence are skipped.\n",
    "    \"\"\"\n",
    "    # Generate random sequence with specified GC content\n",
    "    bases = ['A', 'T', 'G', 'C']\n",
    "    probs = [(1-gc_content)/2, (1-gc_content)/2, gc_content/2, gc_content/2]\n",
    "    \n",
    "    seq = ''.join(np.random.choice(bases, size=length, p=probs))\n",
    "    seq_list = list(seq)\n",
    "    \n",
    "    # Embed motifs at specified positions\n",
    "    if motifs_to_embed:\n",
    "        for motif, pos in motifs_to_embed:\n",
    "            # A motif fits if its last base lands at or before index length - 1\n",
    "            if 0 <= pos <= length - len(motif):\n",
    "                for i, base in enumerate(motif):\n",
    "                    seq_list[pos + i] = base\n",
    "    \n",
    "    return ''.join(seq_list)\n",
    "\n",
    "def generate_promoter_sequences_by_type(n_sequences_per_type=50):\n",
    "    \"\"\"\n",
    "    Generate promoter sequences for different gene types.\n",
    "    \"\"\"\n",
    "    sequences = {}\n",
    "    \n",
    "    # 1. Off-to-on gene promoters (spermatocyte-specific)\n",
    "    # These should contain tMAC-ChIP, Achi/Vis, Inr, ACA, and CNAAATT motifs\n",
    "    off_to_on_seqs = []\n",
    "    for i in range(n_sequences_per_type):\n",
    "        # TSS at position 150 (middle of 300bp sequence)\n",
    "        tss_pos = 150\n",
    "        \n",
    "        motifs = [\n",
    "            ('TAACGTA', tss_pos - 60),  # tMAC-ChIP motif ~60bp upstream\n",
    "            ('TGTCA', tss_pos - 30),     # Achi/Vis motif\n",
    "            ('TCA', tss_pos),            # Initiator at TSS\n",
    "            ('ACA', tss_pos + np.random.choice([26, 28, 30])),  # ACA motif\n",
    "            ('CAAAATT', tss_pos + np.random.randint(29, 50))    # CNAAATT motif\n",
    "        ]\n",
    "        \n",
    "        seq = generate_promoter_sequence(length=300, gc_content=0.42, motifs_to_embed=motifs)\n",
    "        off_to_on_seqs.append(seq)\n",
    "    \n",
    "    sequences['off-to-on'] = off_to_on_seqs\n",
    "    \n",
    "    # 2. Down-regulated gene promoters (canonical promoters)\n",
    "    # These contain TATA box and/or DPE\n",
    "    down_reg_seqs = []\n",
    "    for i in range(n_sequences_per_type):\n",
    "        tss_pos = 150\n",
    "        \n",
    "        motifs = [\n",
    "            ('TATAAA', tss_pos - 30),    # TATA box\n",
    "            ('CAKTY', tss_pos),          # Inr (simplified)\n",
    "            ('AGAC', tss_pos + 28)       # DPE (simplified)\n",
    "        ]\n",
    "        \n",
    "        seq = generate_promoter_sequence(length=300, gc_content=0.52, motifs_to_embed=motifs)\n",
    "        down_reg_seqs.append(seq)\n",
    "    \n",
    "    sequences['down-regulated'] = down_reg_seqs\n",
    "    \n",
    "    # 3. Unchanged gene promoters (random)\n",
    "    unchanged_seqs = []\n",
    "    for i in range(n_sequences_per_type):\n",
    "        seq = generate_promoter_sequence(length=300, gc_content=0.48)\n",
    "        unchanged_seqs.append(seq)\n",
    "    \n",
    "    sequences['unchanged'] = unchanged_seqs\n",
    "    \n",
    "    return sequences\n",
    "\n",
    "# Generate promoter sequences\n",
    "print(\"Generating synthetic promoter sequences...\")\n",
    "promoter_seqs = generate_promoter_sequences_by_type(n_sequences_per_type=50)\n",
    "\n",
    "print(\"\\nGenerated promoter sequences:\")\n",
    "for seq_type, seqs in promoter_seqs.items():\n",
    "    print(f\"  {seq_type:20s}: {len(seqs)} sequences (300 bp each)\")\n",
    "\n",
    "print(\"\\nExample off-to-on promoter sequence (first 150 bp):\")\n",
    "print(promoter_seqs['off-to-on'][0][:150])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Motif Scanning and Enrichment Analysis\n",
    "\n",
    "We'll scan the promoter sequences for the key motifs identified in the paper."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_motif_occurrences(sequences, motif, allow_mismatches=0):\n",
    "    \"\"\"\n",
    "    Count occurrences of a motif in a list of sequences.\n",
    "    Returns list of positions for each sequence.\n",
    "    \"\"\"\n",
    "    all_positions = []\n",
    "    \n",
    "    for seq in sequences:\n",
    "        positions = []\n",
    "        \n",
    "        # Simple exact match (can be extended to allow mismatches)\n",
    "        for i in range(len(seq) - len(motif) + 1):\n",
    "            if seq[i:i+len(motif)] == motif:\n",
    "                positions.append(i)\n",
    "        \n",
    "        all_positions.append(positions)\n",
    "    \n",
    "    return all_positions\n",
    "\n",
    "def calculate_motif_enrichment(sequences, motif, window_start, window_end):\n",
    "    \"\"\"\n",
    "    Calculate enrichment of motif in a specific window.\n",
    "    \"\"\"\n",
    "    count_in_window = 0\n",
    "    total_count = 0\n",
    "    \n",
    "    for seq in sequences:\n",
    "        # Count in window\n",
    "        window_seq = seq[window_start:window_end]\n",
    "        count_in_window += window_seq.count(motif)\n",
    "        \n",
    "        # Count total\n",
    "        total_count += seq.count(motif)\n",
    "    \n",
    "    window_length = window_end - window_start\n",
    "    total_length = len(sequences[0])\n",
    "    \n",
    "    expected = total_count * (window_length / total_length)\n",
    "    enrichment = count_in_window / expected if expected > 0 else 0\n",
    "    \n",
    "    return enrichment, count_in_window\n",
    "\n",
    "# Define motifs from the paper\n",
    "motifs_to_analyze = {\n",
    "    'tMAC-ChIP': 'TAACGTA',\n",
    "    'Achi/Vis': 'TGTCA',\n",
    "    'Inr': 'TCA',\n",
    "    'ACA': 'ACA',\n",
    "    'CNAAATT': 'CAAAATT',\n",
    "    'TATA': 'TATAAA'\n",
    "}\n",
    "\n",
    "# Analyze motif enrichment in different gene types\n",
    "print(\"\\nMotif Enrichment Analysis\")\n",
    "print(\"=\"*80)\n",
    "print(f\"{'Motif':<15} {'Off-to-on':>12} {'Down-reg':>12} {'Unchanged':>12}\")\n",
    "print(\"-\"*80)\n",
    "\n",
    "for motif_name, motif_seq in motifs_to_analyze.items():\n",
    "    enrichments = {}\n",
    "    \n",
    "    for seq_type in ['off-to-on', 'down-regulated', 'unchanged']:\n",
    "        sequences = promoter_seqs[seq_type]\n",
    "        positions = count_motif_occurrences(sequences, motif_seq)\n",
    "        \n",
    "        # Count sequences containing the motif\n",
    "        n_with_motif = sum(1 for p in positions if len(p) > 0)\n",
    "        fraction = n_with_motif / len(sequences)\n",
    "        enrichments[seq_type] = fraction\n",
    "    \n",
    "    print(f\"{motif_name:<15} {enrichments['off-to-on']:>11.1%} {enrichments['down-regulated']:>11.1%} {enrichments['unchanged']:>11.1%}\")\n",
    "\n",
    "print(\"\\nExpected results based on paper:\")\n",
    "print(\"  - tMAC-ChIP, Achi/Vis, ACA, CNAAATT: enriched in off-to-on promoters\")\n",
    "print(\"  - TATA: enriched in down-regulated promoters\")\n",
    "print(\"  - Inr: present in both (canonical core promoter element)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Positional Analysis of Motifs\n",
    "\n",
    "The paper found that motifs are enriched at specific positions relative to the TSS. We'll analyze motif positions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def analyze_motif_positions(sequences, motif, tss_position=150):\n",
    "    \"\"\"\n",
    "    Analyze positions of motif occurrences relative to TSS.\n",
    "    \"\"\"\n",
    "    positions_relative_to_tss = []\n",
    "    \n",
    "    for seq in sequences:\n",
    "        for i in range(len(seq) - len(motif) + 1):\n",
    "            if seq[i:i+len(motif)] == motif:\n",
    "                rel_pos = i - tss_position\n",
    "                positions_relative_to_tss.append(rel_pos)\n",
    "    \n",
    "    return positions_relative_to_tss\n",
    "\n",
    "# Analyze positions for key motifs in off-to-on promoters\n",
    "fig, axes = plt.subplots(2, 3, figsize=(16, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "motifs_for_position = [\n",
    "    ('tMAC-ChIP', 'TAACGTA'),\n",
    "    ('Achi/Vis', 'TGTCA'),\n",
    "    ('Inr', 'TCA'),\n",
    "    ('ACA', 'ACA'),\n",
    "    ('CNAAATT', 'CAAAATT'),\n",
    "    ('TATA', 'TATAAA')\n",
    "]\n",
    "\n",
    "for idx, (motif_name, motif_seq) in enumerate(motifs_for_position):\n",
    "    positions = analyze_motif_positions(promoter_seqs['off-to-on'], motif_seq)\n",
    "    \n",
    "    if len(positions) > 0:\n",
    "        axes[idx].hist(positions, bins=30, color='#3498db', edgecolor='black', alpha=0.7)\n",
    "        axes[idx].axvline(x=0, color='red', linestyle='--', linewidth=2, label='TSS')\n",
    "        axes[idx].set_xlabel('Position relative to TSS (bp)', fontsize=10)\n",
    "        axes[idx].set_ylabel('Frequency', fontsize=10)\n",
    "        axes[idx].set_title(f'{motif_name} motif positions', fontsize=12, fontweight='bold')\n",
    "        axes[idx].legend()\n",
    "        axes[idx].grid(alpha=0.3)\n",
    "    else:\n",
    "        axes[idx].text(0.5, 0.5, 'No motifs found', ha='center', va='center', transform=axes[idx].transAxes)\n",
    "        axes[idx].set_title(f'{motif_name} motif positions', fontsize=12, fontweight='bold')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\\nExpected positions based on paper:\")\n",
    "print(\"  - tMAC-ChIP: ~60 bp upstream of TSS\")\n",
    "print(\"  - Achi/Vis: Between -50 and -5 bp\")\n",
    "print(\"  - Inr: At TSS (position 0)\")\n",
    "print(\"  - ACA: At +26, +28, or +30\")\n",
    "print(\"  - CNAAATT: Between +29 and +60\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sequence Logo Visualization\n",
    "\n",
    "Create sequence logos for the identified motifs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    import logomaker\n",
    "    \n",
    "    def create_sequence_logo(sequences, motif_length=10, position=150):\n",
    "        \"\"\"\n",
    "        Create a sequence logo from aligned sequences.\n",
    "        \"\"\"\n",
    "        # Extract motif regions\n",
    "        motif_regions = []\n",
    "        for seq in sequences:\n",
    "            if len(seq) >= position + motif_length:\n",
    "                motif_regions.append(seq[position:position+motif_length])\n",
    "        \n",
    "        # Create position frequency matrix\n",
    "        counts = {'A': [], 'C': [], 'G': [], 'T': []}\n",
    "        \n",
    "        for pos in range(motif_length):\n",
    "            pos_counts = {'A': 0, 'C': 0, 'G': 0, 'T': 0}\n",
    "            for seq in motif_regions:\n",
    "                if pos < len(seq):\n",
    "                    base = seq[pos]\n",
    "                    if base in pos_counts:\n",
    "                        pos_counts[base] += 1\n",
    "            \n",
    "            total = sum(pos_counts.values())\n",
    "            for base in ['A', 'C', 'G', 'T']:\n",
    "                counts[base].append(pos_counts[base] / total if total > 0 else 0)\n",
    "        \n",
    "        # Create DataFrame for logomaker\n",
    "        df = pd.DataFrame(counts)\n",
    "        \n",
    "        return df\n",
    "    \n",
    "    # Create logos for TSS region in off-to-on promoters\n",
    "    fig, axes = plt.subplots(1, 2, figsize=(14, 4))\n",
    "    \n",
    "    # TSS region (position 0)\n",
    "    logo_data_tss = create_sequence_logo(promoter_seqs['off-to-on'], motif_length=10, position=145)\n",
    "    logo_tss = logomaker.Logo(logo_data_tss, ax=axes[0])\n",
    "    axes[0].set_ylabel('Frequency', fontsize=12)\n",
    "    axes[0].set_xlabel('Position relative to TSS', fontsize=12)\n",
    "    axes[0].set_title('Sequence Logo: TSS Region (Off-to-on genes)', fontsize=12, fontweight='bold')\n",
    "    \n",
    "    # Downstream region (ACA/CNAAATT region)\n",
    "    logo_data_down = create_sequence_logo(promoter_seqs['off-to-on'], motif_length=15, position=175)\n",
    "    logo_down = logomaker.Logo(logo_data_down, ax=axes[1])\n",
    "    axes[1].set_ylabel('Frequency', fontsize=12)\n",
    "    axes[1].set_xlabel('Position relative to TSS', fontsize=12)\n",
    "    axes[1].set_title('Sequence Logo: Downstream Region (+25 to +40)', fontsize=12, fontweight='bold')\n",
    "    \n",
    "    plt.tight_layout()\n",
    "    plt.show()\n",
    "    \n",
    "except ImportError:\n",
    "    print(\"logomaker not available, skipping sequence logo visualization\")\n",
    "    print(\"\\nMotif consensus sequences from the paper:\")\n",
    "    print(\"  - tMAC-ChIP motif: Complex pattern enriched in ChIP-seq peaks\")\n",
    "    print(\"  - Achi/Vis motif: TGTCA\")\n",
    "    print(\"  - Inr motif: TCA (or TCAKTY)\")\n",
    "    print(\"  - ACA motif: ACA\")\n",
    "    print(\"  - CNAAATT motif: C[ACGT]AAATT (translational control element)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Part 3: Chromatin Accessibility Analysis (ATAC-seq)\n",
    "\n",
    "## Background\n",
    "\n",
    "The paper used ATAC-seq to measure chromatin accessibility during differentiation. Key findings:\n",
    "\n",
    "- **Off-to-on promoters are closed in bam−/−** (spermatogonia)\n",
    "- **Promoters open by 48hrPHS** (early spermatocytes)\n",
    "- **Fully open by 72hrPHS** (mature spermatocytes)\n",
    "- **Opening requires tMAC function** (absent in *aly* and *topi* mutants)\n",
    "\n",
    "We'll simulate ATAC-seq-like accessibility scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_atac_accessibility_scores(de_results, counts_df):\n",
    "    \"\"\"\n",
    "    Generate simulated ATAC-seq accessibility scores.\n",
    "    Accessibility should correlate with gene expression and classification.\n",
    "    \"\"\"\n",
    "    accessibility_data = []\n",
    "    \n",
    "    for idx, row in de_results.iterrows():\n",
    "        gene = row['gene']\n",
    "        classification = row['classification']\n",
    "        \n",
    "        # Get expression levels\n",
    "        expr_bam = counts_df.loc[gene, ['bam_rep1', 'bam_rep2']].mean()\n",
    "        expr_72hr = counts_df.loc[gene, ['72hrPHS_rep1', '72hrPHS_rep2']].mean()\n",
    "        \n",
    "        # Simulate accessibility based on classification\n",
    "        if classification == 'off-to-on':\n",
    "            # Closed in bam, open in spermatocytes\n",
    "            acc_bam = np.random.uniform(0, 20)\n",
    "            acc_48hr = np.random.uniform(40, 70)\n",
    "            acc_72hr = np.random.uniform(60, 100)\n",
    "        elif classification == 'down-regulated':\n",
    "            # Open in bam, less open in spermatocytes\n",
    "            acc_bam = np.random.uniform(60, 100)\n",
    "            acc_48hr = np.random.uniform(30, 60)\n",
    "            acc_72hr = np.random.uniform(20, 50)\n",
    "        elif classification == 'alternative-promoter':\n",
    "            # Both open, but different patterns\n",
    "            acc_bam = np.random.uniform(40, 70)\n",
    "            acc_48hr = np.random.uniform(50, 80)\n",
    "            acc_72hr = np.random.uniform(60, 90)\n",
    "        else:  # unchanged\n",
    "            # Similar accessibility\n",
    "            base_acc = np.random.uniform(30, 70)\n",
    "            acc_bam = base_acc + np.random.uniform(-10, 10)\n",
    "            acc_48hr = base_acc + np.random.uniform(-10, 10)\n",
    "            acc_72hr = base_acc + np.random.uniform(-10, 10)\n",
    "        \n",
    "        accessibility_data.append({\n",
    "            'gene': gene,\n",
    "            'classification': classification,\n",
    "            'acc_bam': max(0, acc_bam),\n",
    "            'acc_48hr': max(0, acc_48hr),\n",
    "            'acc_72hr': max(0, acc_72hr)\n",
    "        })\n",
    "    \n",
    "    atac_df = pd.DataFrame(accessibility_data)\n",
    "    return atac_df\n",
    "\n",
    "# Generate ATAC-seq data\n",
    "print(\"Generating simulated ATAC-seq accessibility data...\")\n",
    "atac_df = generate_atac_accessibility_scores(de_results, counts_df)\n",
    "\n",
    "print(\"\\nATAC-seq data summary:\")\n",
    "print(atac_df.groupby('classification')[['acc_bam', 'acc_48hr', 'acc_72hr']].mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize Chromatin Accessibility Changes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(2, 2, figsize=(14, 12))\n",
    "\n",
    "# 1. Heatmap of accessibility for off-to-on genes\n",
    "off_to_on_atac = atac_df[atac_df['classification'] == 'off-to-on'].copy()\n",
    "off_to_on_atac = off_to_on_atac.sort_values('acc_72hr', ascending=False).head(50)\n",
    "heatmap_data = off_to_on_atac[['acc_bam', 'acc_48hr', 'acc_72hr']].T\n",
    "\n",
    "im = axes[0, 0].imshow(heatmap_data, aspect='auto', cmap='YlOrRd', interpolation='nearest')\n",
    "axes[0, 0].set_yticks([0, 1, 2])\n",
    "axes[0, 0].set_yticklabels(['bam', '48hrPHS', '72hrPHS'])\n",
    "axes[0, 0].set_xlabel('Genes (Off-to-on)', fontsize=10)\n",
    "axes[0, 0].set_ylabel('Condition', fontsize=10)\n",
    "axes[0, 0].set_title('Chromatin Accessibility: Off-to-on Genes', fontsize=12, fontweight='bold')\n",
    "plt.colorbar(im, ax=axes[0, 0], label='ATAC-seq signal')\n",
    "\n",
    "# 2. Line plots showing accessibility changes\n",
    "for classification in ['off-to-on', 'down-regulated', 'alternative-promoter', 'unchanged']:\n",
    "    subset = atac_df[atac_df['classification'] == classification]\n",
    "    means = [subset['acc_bam'].mean(), subset['acc_48hr'].mean(), subset['acc_72hr'].mean()]\n",
    "    stds = [subset['acc_bam'].std(), subset['acc_48hr'].std(), subset['acc_72hr'].std()]\n",
    "    \n",
    "    axes[0, 1].plot([0, 1, 2], means, marker='o', linewidth=2, markersize=8, label=classification)\n",
    "    axes[0, 1].fill_between(\n",
    "        [0, 1, 2],\n",
    "        np.array(means) - np.array(stds),\n",
    "        np.array(means) + np.array(stds),\n",
    "        alpha=0.2\n",
    "    )\n",
    "\n",
    "axes[0, 1].set_xticks([0, 1, 2])\n",
    "axes[0, 1].set_xticklabels(['bam', '48hrPHS', '72hrPHS'])\n",
    "axes[0, 1].set_ylabel('Mean ATAC-seq Signal', fontsize=10)\n",
    "axes[0, 1].set_xlabel('Condition', fontsize=10)\n",
    "axes[0, 1].set_title('Accessibility Changes During Differentiation', fontsize=12, fontweight='bold')\n",
    "axes[0, 1].legend(loc='best', fontsize=8)\n",
    "axes[0, 1].grid(alpha=0.3)\n",
    "\n",
    "# 3. Scatter plot: Accessibility change vs Expression change\n",
    "merged_data = de_results.merge(atac_df, on=['gene', 'classification'])\n",
    "merged_data['acc_change'] = merged_data['acc_72hr'] - merged_data['acc_bam']\n",
    "\n",
    "color_map = {\n",
    "    'down-regulated': '#e74c3c',\n",
    "    'off-to-on': '#3498db',\n",
    "    'alternative-promoter': '#9b59b6',\n",
    "    'unchanged': '#95a5a6'\n",
    "}\n",
    "\n",
    "for classification in color_map.keys():\n",
    "    subset = merged_data[merged_data['classification'] == classification]\n",
    "    axes[1, 0].scatter(\n",
    "        subset['log2FoldChange'],\n",
    "        subset['acc_change'],\n",
    "        c=color_map[classification],\n",
    "        s=20,\n",
    "        alpha=0.6,\n",
    "        label=classification\n",
    "    )\n",
    "\n",
    "axes[1, 0].axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)\n",
    "axes[1, 0].axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.5)\n",
    "axes[1, 0].set_xlabel('log2(Fold Change) - Expression', fontsize=10)\n",
    "axes[1, 0].set_ylabel('Accessibility Change (72hr - bam)', fontsize=10)\n",
    "axes[1, 0].set_title('Accessibility vs Expression Changes', fontsize=12, fontweight='bold')\n",
    "axes[1, 0].legend(loc='best', fontsize=8)\n",
    "axes[1, 0].grid(alpha=0.3)\n",
    "\n",
    "# 4. Box plots of accessibility by gene type\n",
    "data_for_box = []\n",
    "labels_for_box = []\n",
    "colors_for_box = []\n",
    "\n",
    "for classification in ['off-to-on', 'down-regulated', 'alternative-promoter', 'unchanged']:\n",
    "    subset = atac_df[atac_df['classification'] == classification]\n",
    "    data_for_box.append(subset['acc_72hr'].values)\n",
    "    labels_for_box.append(classification)\n",
    "    colors_for_box.append(color_map[classification])\n",
    "\n",
    "bp = axes[1, 1].boxplot(data_for_box, labels=labels_for_box, patch_artist=True, showfliers=False)\n",
    "for patch, color in zip(bp['boxes'], colors_for_box):\n",
    "    patch.set_facecolor(color)\n",
    "    patch.set_alpha(0.6)\n",
    "\n",
    "axes[1, 1].set_ylabel('ATAC-seq Signal (72hrPHS)', fontsize=10)\n",
    "axes[1, 1].set_xlabel('Gene Classification', fontsize=10)\n",
    "axes[1, 1].set_title('Promoter Accessibility in Spermatocytes', fontsize=12, fontweight='bold')\n",
    "axes[1, 1].tick_params(axis='x', rotation=45)\n",
    "axes[1, 1].grid(axis='y', alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\\nKey observation from the paper:\")\n",
    "print(\"  - Off-to-on promoters transition from closed to open state\")\n",
    "print(\"  - This opening requires tMAC complex function\")\n",
    "print(\"  - tMAC binds locally at these promoters to create ~100bp accessible regions\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "# Part 4: Integration and Summary\n",
    "\n",
    "## Summary of Key Findings\n",
    "\n",
    "This notebook demonstrated the computational workflows used to study cell type-specific transcription during *Drosophila* spermatogenesis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create final summary visualization\n",
    "fig = plt.figure(figsize=(16, 10))\n",
    "gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)\n",
    "\n",
    "# 1. Gene classification pie chart\n",
    "ax1 = fig.add_subplot(gs[0, 0])\n",
    "classification_counts = de_results['classification'].value_counts()\n",
    "colors_pie = [color_map.get(cat, '#95a5a6') for cat in classification_counts.index]\n",
    "ax1.pie(classification_counts.values, labels=classification_counts.index, autopct='%1.1f%%',\n",
    "        colors=colors_pie, startangle=90)\n",
    "ax1.set_title('Gene Classification\\n(Our Dataset)', fontsize=12, fontweight='bold')\n",
    "\n",
    "# 2. Expression patterns\n",
    "ax2 = fig.add_subplot(gs[0, 1:])\n",
    "example_genes = {}\n",
    "for classification in ['down-regulated', 'off-to-on', 'alternative-promoter']:\n",
    "    gene = de_results[de_results['classification'] == classification].iloc[0]['gene']\n",
    "    example_genes[classification] = gene\n",
    "    \n",
    "    expr_profile = counts_df.loc[gene, :].values\n",
    "    ax2.plot([0, 0, 1, 1, 2, 2], expr_profile, marker='o', linewidth=2, \n",
    "             label=f\"{classification}\\n({gene})\", color=color_map[classification])\n",
    "\n",
    "ax2.set_xticks([0, 1, 2])\n",
    "ax2.set_xticklabels(['bam\\n(spermatogonia)', '48hrPHS\\n(early)', '72hrPHS\\n(mature)'])\n",
    "ax2.set_ylabel('Normalized Expression', fontsize=10)\n",
    "ax2.set_title('Expression Patterns During Differentiation', fontsize=12, fontweight='bold')\n",
    "ax2.legend(loc='best', fontsize=8)\n",
    "ax2.grid(alpha=0.3)\n",
    "\n",
    "# 3. Motif enrichment summary\n",
    "ax3 = fig.add_subplot(gs[1, :])\n",
    "motif_enrichment_summary = []\n",
    "for motif_name, motif_seq in motifs_to_analyze.items():\n",
    "    for seq_type in ['off-to-on', 'down-regulated']:\n",
    "        sequences = promoter_seqs[seq_type]\n",
    "        positions = count_motif_occurrences(sequences, motif_seq)\n",
    "        fraction = sum(1 for p in positions if len(p) > 0) / len(sequences)\n",
    "        motif_enrichment_summary.append({\n",
    "            'Motif': motif_name,\n",
    "            'Gene Type': seq_type,\n",
    "            'Fraction': fraction\n",
    "        })\n",
    "\n",
    "enrich_df = pd.DataFrame(motif_enrichment_summary)\n",
    "pivot_df = enrich_df.pivot(index='Motif', columns='Gene Type', values='Fraction')\n",
    "\n",
    "x = np.arange(len(pivot_df.index))\n",
    "width = 0.35\n",
    "\n",
    "ax3.bar(x - width/2, pivot_df['off-to-on'], width, label='Off-to-on', color='#3498db', alpha=0.8)\n",
    "ax3.bar(x + width/2, pivot_df['down-regulated'], width, label='Down-regulated', color='#e74c3c', alpha=0.8)\n",
    "\n",
    "ax3.set_ylabel('Fraction of Promoters with Motif', fontsize=10)\n",
    "ax3.set_title('Motif Enrichment in Different Promoter Types', fontsize=12, fontweight='bold')\n",
    "ax3.set_xticks(x)\n",
    "ax3.set_xticklabels(pivot_df.index, rotation=45, ha='right')\n",
    "ax3.legend()\n",
    "ax3.grid(axis='y', alpha=0.3)\n",
    "\n",
    "# 4. Accessibility dynamics\n",
    "ax4 = fig.add_subplot(gs[2, :])\n",
    "conditions = ['bam', '48hrPHS', '72hrPHS']\n",
    "x_pos = np.arange(len(conditions))\n",
    "width = 0.2\n",
    "\n",
    "for i, classification in enumerate(['off-to-on', 'down-regulated', 'alternative-promoter', 'unchanged']):\n",
    "    subset = atac_df[atac_df['classification'] == classification]\n",
    "    means = [subset['acc_bam'].mean(), subset['acc_48hr'].mean(), subset['acc_72hr'].mean()]\n",
    "    \n",
    "    ax4.bar(x_pos + i*width, means, width, label=classification, \n",
    "            color=color_map[classification], alpha=0.8)\n",
    "\n",
    "ax4.set_ylabel('Mean ATAC-seq Signal', fontsize=10)\n",
    "ax4.set_xlabel('Developmental Stage', fontsize=10)\n",
    "ax4.set_title('Chromatin Accessibility Dynamics', fontsize=12, fontweight='bold')\n",
    "ax4.set_xticks(x_pos + width * 1.5)\n",
    "ax4.set_xticklabels(conditions)\n",
    "ax4.legend(loc='upper left', fontsize=8)\n",
    "ax4.grid(axis='y', alpha=0.3)\n",
    "\n",
    "plt.suptitle('Summary: Transcriptional Regulation During Spermatogenesis', \n",
    "             fontsize=16, fontweight='bold', y=0.995)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Paper's Main Conclusions\n",
    "\n",
    "### 1. Massive Transcriptional Reprogramming\n",
    "\n",
    "- **>3,000 genes** (~30% of testis transcriptome) expressed from new promoters in spermatocytes\n",
    "- Three major categories identified:\n",
    "  - 1,155 down-regulated genes\n",
    "  - 1,841 off-to-on genes  \n",
    "  - 1,230 alternative promoter genes\n",
    "\n",
    "### 2. Novel Promoter Architecture\n",
    "\n",
    "- Spermatocyte-specific promoters **lack canonical core promoter motifs** (TATA, DPE)\n",
    "- Instead contain **novel promoter-proximal elements**:\n",
    "  - **tMAC-ChIP motif** (~60 bp upstream)\n",
    "  - **Achi/Vis binding site** (TGTCA)\n",
    "  - **Initiator (Inr)** at TSS\n",
    "  - **ACA motif** at +26/+28/+30\n",
    "  - **CNAAATT motif** at +29 to +60\n",
    "\n",
    "### 3. tMAC Complex is Master Regulator\n",
    "\n",
    "- **tMAC binding required** for promoter opening\n",
    "- Acts **locally** to create ~100 bp nucleosome-free regions\n",
    "- Loss of tMAC components (Aly, Topi) prevents:\n",
    "  - Chromatin opening\n",
    "  - Gene activation\n",
    "  - Spermatocyte differentiation\n",
    "\n",
    "### 4. Motif Cooperation Determines Expression\n",
    "\n",
    "- **Narrow, highly-expressed promoters** have most/all motifs at optimal positions\n",
    "- **92% probability** of narrow-high expression if all 5 motifs present\n",
    "- Motifs work **additively** to enhance expression\n",
    "\n",
    "### 5. Cell Type-Specific Mechanism\n",
    "\n",
    "- Demonstrates how **promoter-proximal elements** can drive robust cell type-specific programs\n",
    "- Alternative to traditional enhancer-based regulation\n",
    "- Particularly important for **terminal differentiation genes** expressed in single tissue\n",
    "\n",
    "---\n",
    "\n",
    "## Scaling to Real Data\n",
    "\n",
    "To apply these methods to actual RNA-seq, CAGE, and ATAC-seq data from the paper or similar experiments:\n",
    "\n",
    "### Computational Requirements\n",
    "\n",
    "- **Memory**: 32-64 GB RAM (for genome alignment and counting)\n",
    "- **Storage**: 500+ GB for raw FASTQ files and genome references\n",
    "- **CPU**: Multi-core system (16+ cores recommended)\n",
    "- **GPU**: Optional but helpful for deep learning-based analysis\n",
    "- **Time**: Full analysis may take days to weeks\n",
    "\n",
    "### Software and Tools\n",
    "\n",
    "1. **Read Processing**:\n",
    "   - TrimGalore/Cutadapt for adapter trimming\n",
    "   - FastQC for quality control\n",
    "\n",
    "2. **Alignment**:\n",
    "   - STAR aligner (RNA-seq, CAGE)\n",
    "   - BWA (ATAC-seq)\n",
    "   - Reference genome: *D. melanogaster* dm6\n",
    "\n",
    "3. **Quantification**:\n",
    "   - STAR (gene counting)\n",
    "   - CAGEr (TSS mapping from CAGE)\n",
    "   - NucleoATAC (nucleosome positioning)\n",
    "\n",
    "4. **Statistical Analysis**:\n",
    "   - DESeq2 (differential expression)\n",
    "   - edgeR (alternative)\n",
    "\n",
    "5. **Motif Analysis**:\n",
    "   - MEME-ChIP (de novo motif discovery)\n",
    "   - CENTRIMO (positional enrichment)\n",
    "   - FIMO (motif scanning)\n",
    "\n",
    "6. **Visualization**:\n",
    "   - DeepTools (genomic data visualization)\n",
    "   - IGV (genome browser)\n",
    "   - Custom R/Python scripts\n",
    "\n",
    "### Data Resources\n",
    "\n",
    "- **GEO Accession**: GSE145975\n",
    "- **Genome**: UCSC dm6 or Ensembl BDGP6\n",
    "- **Annotation**: Ensembl BDGP6.84\n",
    "\n",
    "---\n",
    "\n",
    "## Educational Value\n",
    "\n",
    "This notebook demonstrates:\n",
    "\n",
    "1. ✅ RNA-seq differential expression workflow\n",
    "2. ✅ Gene classification based on expression patterns  \n",
    "3. ✅ Promoter sequence analysis and motif discovery\n",
    "4. ✅ Chromatin accessibility analysis\n",
    "5. ✅ Integration of multiple genomics datasets\n",
    "6. ✅ Statistical analysis and visualization\n",
    "\n",
    "**Key Learning Points**:\n",
    "\n",
    "- How to analyze RNA-seq data for differential expression\n",
    "- Methods for identifying regulatory sequence motifs\n",
    "- Integration of transcriptomics and epigenomics data\n",
    "- Understanding cell type-specific transcriptional programs\n",
    "- Visualization techniques for genomics data\n",
    "\n",
    "---\n",
    "\n",
    "## References\n",
    "\n",
    "**Original Paper:**\n",
    "Lu, D., Sin, H.-S., Lu, C., & Fuller, M. T. (2020). Developmental regulation of cell type-specific transcription by novel promoter-proximal sequence elements. *Genes & Development*, 34(9-10), 663-677. doi:10.1101/gad.335331.119\n",
    "\n",
    "**Key Methods Papers:**\n",
    "- DESeq2: Love et al. (2014) *Genome Biology* 15:550\n",
    "- CAGE: Shiraki et al. (2003) *PNAS* 100:15776-15781  \n",
    "- ATAC-seq: Buenrostro et al. (2013) *Nature Methods* 10:1213-1218\n",
    "- MEME: Bailey & Elkan (1994) *ISMB* 2:28-36\n",
    "\n",
    "---\n",
    "\n",
    "**End of Notebook**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}