wojtyniakAQ/notebook-66c35349-e804-47ee-a7f2-ffc64a3707b4-paper-20260320-171131.ipynb

## notebook-66c35349-e804-47ee-a7f2-ffc64a3707b4-paper-20260320-171131.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CZ ID: Cloud-Based Long Read Metagenomic Analysis\n",
    "\n",
    "## Educational Overview Notebook\n",
    "\n",
    "**Paper:** Simmonds et al. (2024) - \"CZ ID: a cloud-based, no-code platform enabling advanced long read metagenomic analysis\"\n",
    "\n",
    "**Purpose:** This notebook provides an educational walkthrough of the computational workflows described in the paper, demonstrating the CZ ID mNGS Nanopore pipeline and its validation.\n",
    "\n",
    "**⚠️ Important Note:** This notebook uses small-scale toy data to demonstrate workflows within resource constraints (4GB RAM, ~5-10 min runtime). For production use, these methods would be run on cloud infrastructure with full NCBI databases and real sequencing data.\n",
    "\n",
    "---\n",
    "\n",
    "## Overview of Workflows\n",
    "\n",
    "This paper describes:\n",
    "\n",
    "1. **CZ ID mNGS Nanopore Pipeline** - Main metagenomic analysis workflow for Oxford Nanopore long reads\n",
    "2. **Kraken2 Benchmark** - Validation against standard tool using mock microbial community\n",
    "3. **Divergent Virus Detection** - Testing sensitivity for viruses at varying sequence divergence\n",
    "4. **Clinical Sample Validation** - Detection limits for known virus (HCoV OC43) at varying MOI\n",
    "5. **Mosquito Virome Analysis** - Known and novel virus detection in non-human hosts\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and Dependencies\n",
    "\n",
    "Install all required packages for the workflow demonstrations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[37m⠋\u001b[0m \u001b[2mResolving dependencies...                                                     \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mResolving dependencies...                                                     \u001b[0m\r",
      "\u001b[2K\u001b[37m⠋\u001b[0m \u001b[2mResolving dependencies...                                                     \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mResolving dependencies...                                                     \u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mbiopython==1.86                                                               \u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mnumpy==2.4.2                                                                  \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mpandas==3.0.1                                                                 \u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mnumpy==2.4.2                                                                  \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mmatplotlib==3.10.8                                                            \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mseaborn==0.13.2                                                               \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mscipy==1.17.1                                                                 \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mscikit-learn==1.8.0                                                           \u001b[0m\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mpython-dateutil==2.9.0.post0                                                  \u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mcontourpy==1.3.3                                                              \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mcontourpy==1.3.3                                                              \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mcycler==0.12.1                                                                \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mfonttools==4.62.1                                                             \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mkiwisolver==1.5.0                                                             \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mpackaging==26.0                                                               \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mpillow==12.1.1                                                                \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mpyparsing==3.3.2                                                              \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mjoblib==1.5.3                                                                 \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2mthreadpoolctl==3.6.0                                                          \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2msix==1.17.0                                                                   \u001b[0m\r",
      "\u001b[2K\u001b[37m⠹\u001b[0m \u001b[2m                                                                              \u001b[0m\r",
      "\u001b[2K\u001b[2mResolved \u001b[1m18 packages\u001b[0m \u001b[2min 238ms\u001b[0m\u001b[0m\r\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[37m⠋\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/0)                                                   \r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/0)                                                   \r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)                                                  \r",
      "\u001b[2K\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/18.20 KiB           \u001b[1A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB         \u001b[1A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB         \u001b[1A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB         \u001b[2A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[2A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB         \u001b[2A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[2A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB         \u001b[2A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[2A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB         \u001b[2A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[2A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/119.90 KiB          \u001b[3A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[3A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 14.88 KiB/119.90 KiB        \u001b[3A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[3A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 14.88 KiB/119.90 KiB        \u001b[3A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[3A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 14.88 KiB/119.90 KiB        \u001b[3A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[3A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 14.88 KiB/119.90 KiB        \u001b[3A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[3A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 14.88 KiB/119.90 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/10.37 MiB           \u001b[4A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[4A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------\u001b[30m\u001b[2m------\u001b[0m\u001b[0m 14.91 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m-------\u001b[30m\u001b[2m-----------------------\u001b[0m\u001b[0m 30.88 KiB/119.90 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/10.37 MiB           \u001b[4A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[4A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 18.20 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m-------\u001b[30m\u001b[2m-----------------------\u001b[0m\u001b[0m 30.88 KiB/119.90 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/10.37 MiB           \u001b[4A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[4A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 18.20 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m-----------\u001b[30m\u001b[2m-------------------\u001b[0m\u001b[0m 46.88 KiB/119.90 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/10.37 MiB           \u001b[4A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[4A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mthreadpoolctl       \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 18.20 KiB/18.20 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m-----------\u001b[30m\u001b[2m-------------------\u001b[0m\u001b[0m 46.88 KiB/119.90 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/10.37 MiB           \u001b[4A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[4A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcycler              \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 8.13 KiB/8.13 KiB\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m-----------\u001b[30m\u001b[2m-------------------\u001b[0m\u001b[0m 46.88 KiB/119.90 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/10.37 MiB           \u001b[3A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[3A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mpyparsing           \u001b[0m \u001b[32m-----------\u001b[30m\u001b[2m-------------------\u001b[0m\u001b[0m 46.88 KiB/119.90 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/10.37 MiB           \u001b[2A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[2A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mjoblib              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m     0 B/301.83 KiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 62.60 KiB/10.37 MiB         \u001b[2A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[2A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mseaborn             \u001b[0m \u001b[32m----\u001b[30m\u001b[2m--------------------------\u001b[0m\u001b[0m 46.92 KiB/288.00 KiB\r\n",
      "\u001b[2mjoblib              \u001b[0m \u001b[32m----------\u001b[30m\u001b[2m--------------------\u001b[0m\u001b[0m 108.83 KiB/301.83 KiB\r\n",
      "\u001b[2mcontourpy           \u001b[0m \u001b[32m-----\u001b[30m\u001b[2m-------------------------\u001b[0m\u001b[0m 62.91 KiB/354.35 KiB\r\n",
      "\u001b[2mkiwisolver          \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 32.00 KiB/1.41 MiB\r\n",
      "\u001b[2mbiopython           \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 32.00 KiB/3.09 MiB\r\n",
      "\u001b[2mfonttools           \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 30.88 KiB/4.73 MiB\r\n",
      "\u001b[2mpillow              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 30.88 KiB/6.71 MiB\r\n",
      "\u001b[2mmatplotlib          \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 30.91 KiB/8.31 MiB\r\n",
      "\u001b[2mscikit-learn        \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 30.90 KiB/8.49 MiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 161.34 KiB/10.37 MiB\r\n",
      "\u001b[2mscipy               \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 30.80 KiB/33.57 MiB         "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[11A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[11A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mseaborn             \u001b[0m \u001b[32m----------------------------\u001b[30m\u001b[2m--\u001b[0m\u001b[0m 270.92 KiB/288.00 KiB\r\n",
      "\u001b[2mcontourpy           \u001b[0m \u001b[32m-------------------------\u001b[30m\u001b[2m-----\u001b[0m\u001b[0m 302.91 KiB/354.35 KiB\r\n",
      "\u001b[2mkiwisolver          \u001b[0m \u001b[32m-----\u001b[30m\u001b[2m-------------------------\u001b[0m\u001b[0m 256.00 KiB/1.41 MiB\r\n",
      "\u001b[2mbiopython           \u001b[0m \u001b[32m--\u001b[30m\u001b[2m----------------------------\u001b[0m\u001b[0m 284.02 KiB/3.09 MiB\r\n",
      "\u001b[2mfonttools           \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 270.88 KiB/4.73 MiB\r\n",
      "\u001b[2mpillow              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 270.88 KiB/6.71 MiB\r\n",
      "\u001b[2mmatplotlib          \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 258.27 KiB/8.31 MiB\r\n",
      "\u001b[2mscikit-learn        \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 270.90 KiB/8.49 MiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 398.60 KiB/10.37 MiB\r\n",
      "\u001b[2mscipy               \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 254.91 KiB/33.57 MiB        \u001b[10A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[10A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mseaborn             \u001b[0m \u001b[32m------------------------------\u001b[30m\u001b[2m\u001b[0m\u001b[0m 288.00 KiB/288.00 KiB\r\n",
      "\u001b[2mcontourpy           \u001b[0m \u001b[32m----------------------------\u001b[30m\u001b[2m--\u001b[0m\u001b[0m 334.91 KiB/354.35 KiB\r\n",
      "\u001b[2mkiwisolver          \u001b[0m \u001b[32m-----\u001b[30m\u001b[2m-------------------------\u001b[0m\u001b[0m 272.00 KiB/1.41 MiB\r\n",
      "\u001b[2mbiopython           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 316.02 KiB/3.09 MiB\r\n",
      "\u001b[2mfonttools           \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 274.13 KiB/4.73 MiB\r\n",
      "\u001b[2mpillow              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 302.88 KiB/6.71 MiB\r\n",
      "\u001b[2mmatplotlib          \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 270.91 KiB/8.31 MiB\r\n",
      "\u001b[2mscikit-learn        \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 286.90 KiB/8.49 MiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 430.60 KiB/10.37 MiB\r\n",
      "\u001b[2mscipy               \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 270.91 KiB/33.57 MiB        \u001b[10A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[10A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mcontourpy           \u001b[0m \u001b[32m----------------------------\u001b[30m\u001b[2m--\u001b[0m\u001b[0m 334.91 KiB/354.35 KiB\r\n",
      "\u001b[2mkiwisolver          \u001b[0m \u001b[32m-----\u001b[30m\u001b[2m-------------------------\u001b[0m\u001b[0m 272.00 KiB/1.41 MiB\r\n",
      "\u001b[2mbiopython           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 316.02 KiB/3.09 MiB\r\n",
      "\u001b[2mfonttools           \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 290.13 KiB/4.73 MiB\r\n",
      "\u001b[2mpillow              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 302.88 KiB/6.71 MiB\r\n",
      "\u001b[2mmatplotlib          \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 285.71 KiB/8.31 MiB\r\n",
      "\u001b[2mscikit-learn        \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 302.90 KiB/8.49 MiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 430.60 KiB/10.37 MiB\r\n",
      "\u001b[2mscipy               \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 286.91 KiB/33.57 MiB        "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[9A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[9A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mkiwisolver          \u001b[0m \u001b[32m--------\u001b[30m\u001b[2m----------------------\u001b[0m\u001b[0m 400.00 KiB/1.41 MiB\r\n",
      "\u001b[2mbiopython           \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 398.03 KiB/3.09 MiB\r\n",
      "\u001b[2mfonttools           \u001b[0m \u001b[32m--\u001b[30m\u001b[2m----------------------------\u001b[0m\u001b[0m 392.56 KiB/4.73 MiB\r\n",
      "\u001b[2mpillow              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 408.56 KiB/6.71 MiB\r\n",
      "\u001b[2mmatplotlib          \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 392.56 KiB/8.31 MiB\r\n",
      "\u001b[2mscikit-learn        \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 382.90 KiB/8.49 MiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m-\u001b[30m\u001b[2m-----------------------------\u001b[0m\u001b[0m 574.49 KiB/10.37 MiB\r\n",
      "\u001b[2mscipy               \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 334.91 KiB/33.57 MiB        \u001b[8A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[8A\u001b[37m⠙\u001b[0m \u001b[2mPreparing packages...\u001b[0m (0/14)\r\n",
      "\u001b[2mkiwisolver          \u001b[0m \u001b[32m---------------\u001b[30m\u001b[2m---------------\u001b[0m\u001b[0m 763.00 KiB/1.41 MiB\r\n",
      "\u001b[2mbiopython           \u001b[0m \u001b[32m------\u001b[30m\u001b[2m------------------------\u001b[0m\u001b[0m 718.03 KiB/3.09 MiB\r\n",
      "\u001b[2mfonttools           \u001b[0m \u001b[32m----\u001b[30m\u001b[2m--------------------------\u001b[0m\u001b[0m 776.56 KiB/4.73 MiB\r\n",
      "\u001b[2mpillow              \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 792.56 KiB/6.71 MiB\r\n",
      "\u001b[2mmatplotlib          \u001b[0m \u001b[32m--\u001b[30m\u001b[2m----------------------------\u001b[0m\u001b[0m 776.56 KiB/8.31 MiB\r\n",
      "\u001b[2mscikit-learn        \u001b[0m \u001b[32m--\u001b[30m\u001b[2m----------------------------\u001b[0m\u001b[0m 824.56 KiB/8.49 MiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m--\u001b[30m\u001b[2m----------------------------\u001b[0m\u001b[0m 944.98 KiB/10.37 MiB\r\n",
      "\u001b[2mscipy               \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 712.56 KiB/33.57 MiB        "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[8A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[8A\u001b[37m⠹\u001b[0m \u001b[2mPreparing packages...\u001b[0m (6/14)\r\n",
      "\u001b[2mkiwisolver          \u001b[0m \u001b[32m---------------------\u001b[30m\u001b[2m---------\u001b[0m\u001b[0m 1.00 MiB/1.41 MiB\r\n",
      "\u001b[2mbiopython           \u001b[0m \u001b[32m----------\u001b[30m\u001b[2m--------------------\u001b[0m\u001b[0m 1.04 MiB/3.09 MiB\r\n",
      "\u001b[2mfonttools           \u001b[0m \u001b[32m------\u001b[30m\u001b[2m------------------------\u001b[0m\u001b[0m 1.00 MiB/4.73 MiB\r\n",
      "\u001b[2mpillow              \u001b[0m \u001b[32m----\u001b[30m\u001b[2m--------------------------\u001b[0m\u001b[0m 1.07 MiB/6.71 MiB\r\n",
      "\u001b[2mmatplotlib          \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 1.00 MiB/8.31 MiB\r\n",
      "\u001b[2mscikit-learn        \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 1.07 MiB/8.49 MiB\r\n",
      "\u001b[2mpandas              \u001b[0m \u001b[32m---\u001b[30m\u001b[2m---------------------------\u001b[0m\u001b[0m 1.23 MiB/10.37 MiB\r\n",
      "\u001b[2mscipy               \u001b[0m \u001b[32m\u001b[30m\u001b[2m------------------------------\u001b[0m\u001b[0m 1.00 MiB/33.57 MiB          "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1A\r",
      "\u001b[2K\u001b[1B\r",
      "\u001b[2K\u001b[1A\u001b[37m⠇\u001b[0m \u001b[2mPreparing packages...\u001b[0m (12/14)                                                 \r",
      "\u001b[2K\u001b[2mPrepared \u001b[1m14 packages\u001b[0m \u001b[2min 1.47s\u001b[0m\u001b[0m\r\n",
      "░░░░░░░░░░░░░░░░░░░░ [0/0] \u001b[2mInstalling wheels...                                 \u001b[0m\r",
      "\u001b[2K░░░░░░░░░░░░░░░░░░░░ [0/14] \u001b[2mInstalling wheels...                                \u001b[0m\r",
      "\u001b[2K░░░░░░░░░░░░░░░░░░░░ [0/14] \u001b[2mthreadpoolctl==3.6.0                                \u001b[0m\r",
      "\u001b[2K█░░░░░░░░░░░░░░░░░░░ [1/14] \u001b[2mthreadpoolctl==3.6.0                                \u001b[0m\r",
      "\u001b[2K█░░░░░░░░░░░░░░░░░░░ [1/14] \u001b[2mcontourpy==1.3.3                                    \u001b[0m\r",
      "\u001b[2K██░░░░░░░░░░░░░░░░░░ [2/14] \u001b[2mcontourpy==1.3.3                                    \u001b[0m\r",
      "\u001b[2K██░░░░░░░░░░░░░░░░░░ [2/14] \u001b[2mcycler==0.12.1                                      \u001b[0m\r",
      "\u001b[2K████░░░░░░░░░░░░░░░░ [3/14] \u001b[2mcycler==0.12.1                                      \u001b[0m\r",
      "\u001b[2K████░░░░░░░░░░░░░░░░ [3/14] \u001b[2mkiwisolver==1.5.0                                   \u001b[0m\r",
      "\u001b[2K█████░░░░░░░░░░░░░░░ [4/14] \u001b[2mkiwisolver==1.5.0                                   \u001b[0m\r",
      "\u001b[2K█████░░░░░░░░░░░░░░░ [4/14] \u001b[2mpyparsing==3.3.2                                    \u001b[0m\r",
      "\u001b[2K███████░░░░░░░░░░░░░ [5/14] \u001b[2mpyparsing==3.3.2                                    \u001b[0m\r",
      "\u001b[2K███████░░░░░░░░░░░░░ [5/14] \u001b[2mjoblib==1.5.3                                       \u001b[0m\r",
      "\u001b[2K████████░░░░░░░░░░░░ [6/14] \u001b[2mjoblib==1.5.3                                       \u001b[0m\r",
      "\u001b[2K████████░░░░░░░░░░░░ [6/14] \u001b[2mseaborn==0.13.2                                     \u001b[0m\r",
      "\u001b[2K██████████░░░░░░░░░░ [7/14] \u001b[2mseaborn==0.13.2                                     \u001b[0m\r",
      "\u001b[2K██████████░░░░░░░░░░ [7/14] \u001b[2mpillow==12.1.1                                      \u001b[0m\r",
      "\u001b[2K███████████░░░░░░░░░ [8/14] \u001b[2mpillow==12.1.1                                      \u001b[0m\r",
      "\u001b[2K███████████░░░░░░░░░ [8/14] \u001b[2mfonttools==4.62.1                                   \u001b[0m\r",
      "\u001b[2K████████████░░░░░░░░ [9/14] \u001b[2mfonttools==4.62.1                                   \u001b[0m"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\r",
      "\u001b[2K█████████████████░░░ [12/14] \u001b[2mscipy==1.17.1                                      \u001b[0m\r",
      "\u001b[2K\u001b[2mInstalled \u001b[1m14 packages\u001b[0m \u001b[2min 63ms\u001b[0m\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mbiopython\u001b[0m\u001b[2m==1.86\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mcontourpy\u001b[0m\u001b[2m==1.3.3\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mcycler\u001b[0m\u001b[2m==0.12.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mfonttools\u001b[0m\u001b[2m==4.62.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mjoblib\u001b[0m\u001b[2m==1.5.3\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mkiwisolver\u001b[0m\u001b[2m==1.5.0\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mmatplotlib\u001b[0m\u001b[2m==3.10.8\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mpandas\u001b[0m\u001b[2m==3.0.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mpillow\u001b[0m\u001b[2m==12.1.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mpyparsing\u001b[0m\u001b[2m==3.3.2\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mscikit-learn\u001b[0m\u001b[2m==1.8.0\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mscipy\u001b[0m\u001b[2m==1.17.1\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mseaborn\u001b[0m\u001b[2m==0.13.2\u001b[0m\r\n",
      " \u001b[32m+\u001b[39m \u001b[1mthreadpoolctl\u001b[0m\u001b[2m==3.6.0\u001b[0m\r\n"
     ]
    }
   ],
   "source": [
    "# Install all dependencies in one command\n",
    "!uv pip install biopython numpy pandas matplotlib seaborn scipy scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from Bio import SeqIO\n",
    "from Bio.Seq import Seq\n",
    "from Bio.SeqRecord import SeqRecord\n",
    "from scipy import stats\n",
    "from sklearn.metrics import precision_recall_curve, auc\n",
    "import random\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Set random seeds for reproducibility\n",
    "np.random.seed(42)\n",
    "random.seed(42)\n",
    "\n",
    "# Configure plotting\n",
    "sns.set_style(\"whitegrid\")\n",
    "plt.rcParams['figure.figsize'] = (10, 6)\n",
    "plt.rcParams['figure.dpi'] = 100\n",
    "\n",
    "print(\"✓ All libraries imported successfully\")\n",
    "print(\"✓ Random seeds set for reproducibility\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 2. Workflow 1: CZ ID mNGS Nanopore Pipeline\n",
    "\n",
    "The main CZ ID pipeline consists of four major steps:\n",
    "\n",
    "1. **QC and Host Filtering** - Remove low-quality and host reads\n",
    "2. **De Novo Assembly** - Assemble long reads using metaFlye\n",
    "3. **Database Alignment** - Align to NCBI NT (minimap2) and NR (DIAMOND)\n",
    "4. **Taxon Reporting** - Aggregate and report identified taxa\n",
    "\n",
    "### 2.1 Generate Synthetic Nanopore-like Reads\n",
    "\n",
    "We'll create small synthetic datasets to demonstrate the workflow."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_nanopore_read(length, error_rate=0.05, gc_content=0.5):\n",
    "    \"\"\"\n",
    "    Generate a synthetic Nanopore read with error characteristics.\n",
    "    \n",
    "    Parameters:\n",
    "    -----------\n",
    "    length : int\n",
    "        Length of the read\n",
    "    error_rate : float\n",
    "        Probability of sequencing error\n",
    "    gc_content : float\n",
    "        Target GC content\n",
    "    \n",
    "    Returns:\n",
    "    --------\n",
    "    str : DNA sequence\n",
    "    \"\"\"\n",
    "    bases = ['A', 'T', 'G', 'C']\n",
    "    gc_bases = ['G', 'C']\n",
    "    at_bases = ['A', 'T']\n",
    "    \n",
    "    # Generate sequence with target GC content\n",
    "    seq = []\n",
    "    for _ in range(length):\n",
    "        if np.random.random() < gc_content:\n",
    "            seq.append(np.random.choice(gc_bases))\n",
    "        else:\n",
    "            seq.append(np.random.choice(at_bases))\n",
    "    \n",
    "    # Add errors (substitutions for simplicity)\n",
    "    seq_with_errors = []\n",
    "    for base in seq:\n",
    "        if np.random.random() < error_rate:\n",
    "            # Introduce error\n",
    "            other_bases = [b for b in bases if b != base]\n",
    "            seq_with_errors.append(np.random.choice(other_bases))\n",
    "        else:\n",
    "            seq_with_errors.append(base)\n",
    "    \n",
    "    return ''.join(seq_with_errors)\n",
    "\n",
    "def generate_quality_scores(length, mean_quality=15):\n",
    "    \"\"\"\n",
    "    Generate Phred quality scores for a read.\n",
    "    Nanopore typical range is 9-20 for modern models.\n",
    "    \"\"\"\n",
    "    return np.random.normal(mean_quality, 2, length).clip(9, 40)\n",
    "\n",
    "# Generate synthetic dataset\n",
    "print(\"Generating synthetic Nanopore reads...\")\n",
    "print(\"(In production: upload basecalled FASTQ files from actual Nanopore sequencing)\\n\")\n",
    "\n",
    "# Create 100 synthetic reads representing a small metagenomic sample\n",
    "synthetic_reads = []\n",
    "read_lengths = np.random.lognormal(mean=7.5, sigma=0.5, size=100).astype(int).clip(500, 10000)\n",
    "\n",
    "for i, length in enumerate(read_lengths):\n",
    "    read_id = f\"read_{i:04d}\"\n",
    "    sequence = generate_nanopore_read(length)\n",
    "    qualities = generate_quality_scores(length)\n",
    "    \n",
    "    # Convert quality scores to Phred+33 ASCII\n",
    "    qual_string = ''.join([chr(int(q) + 33) for q in qualities])\n",
    "    \n",
    "    record = SeqRecord(\n",
    "        Seq(sequence),\n",
    "        id=read_id,\n",
    "        description=f\"length={length} synthetic_nanopore_read\",\n",
    "        letter_annotations={\"phred_quality\": [int(q) for q in qualities]}\n",
    "    )\n",
    "    synthetic_reads.append(record)\n",
    "\n",
    "print(f\"✓ Generated {len(synthetic_reads)} synthetic reads\")\n",
    "print(f\"  - Read lengths: {read_lengths.min()}-{read_lengths.max()} bp\")\n",
    "print(f\"  - Mean length: {read_lengths.mean():.0f} bp\")\n",
    "print(f\"  - Total bases: {read_lengths.sum():,} bp\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Step 1: Quality Control and Host Filtering\n",
    "\n",
    "**CZ ID uses:**\n",
    "- **fastp** for QC filtering\n",
    "- **minimap2** for host genome alignment\n",
    "\n",
    "**QC criteria:**\n",
    "- Mean Phred score ≥ 9\n",
    "- Complexity ≥ 30%\n",
    "- Length ≥ 100 bp\n",
    "\n",
    "**Host filtering:**\n",
    "- Remove reads mapping to specified host genome\n",
    "- Additionally remove all Homo sapiens reads (contamination control)\n",
    "- Subsample to 1 million reads max"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def calculate_complexity(sequence):\n",
    "    \"\"\"\n",
    "    Calculate sequence complexity (fraction of unique k-mers).\n",
    "    Low complexity sequences (e.g., AAAAAAA) have low diversity.\n",
    "    \"\"\"\n",
    "    if len(sequence) < 3:\n",
    "        return 0.0\n",
    "    \n",
    "    k = 3  # Use 3-mers\n",
    "    kmers = set()\n",
    "    total_kmers = len(sequence) - k + 1\n",
    "    \n",
    "    for i in range(total_kmers):\n",
    "        kmers.add(sequence[i:i+k])\n",
    "    \n",
    "    return len(kmers) / total_kmers\n",
    "\n",
    "def qc_filter_reads(reads, min_quality=9, min_complexity=0.2, min_length=100):\n",
    "    \"\"\"\n",
    "    Filter reads based on quality, complexity, and length.\n",
    "    Simulates fastp filtering.\n",
    "    \"\"\"\n",
    "    filtered_reads = []\n",
    "    \n",
    "    for read in reads:\n",
    "        # Check length\n",
    "        if len(read.seq) < min_length:\n",
    "            continue\n",
    "        \n",
    "        # Check mean quality\n",
    "        mean_qual = np.mean(read.letter_annotations[\"phred_quality\"])\n",
    "        if mean_qual < min_quality:\n",
    "            continue\n",
    "        \n",
    "        # Check complexity\n",
    "        complexity = calculate_complexity(str(read.seq))\n",
    "        if complexity < min_complexity:\n",
    "            continue\n",
    "        \n",
    "        filtered_reads.append(read)\n",
    "    \n",
    "    return filtered_reads\n",
    "\n",
    "def simulate_host_filtering(reads, host_fraction=0.7):\n",
    "    \"\"\"\n",
    "    Simulate host read removal.\n",
    "    In production: uses minimap2 to align reads to host reference genome.\n",
    "    \n",
    "    Parameters:\n",
    "    -----------\n",
    "    reads : list\n",
    "        Input reads\n",
    "    host_fraction : float\n",
    "        Fraction of reads that are host-derived (to be removed)\n",
    "    \"\"\"\n",
    "    # Randomly designate reads as host or non-host\n",
    "    non_host_reads = []\n",
    "    for read in reads:\n",
    "        if np.random.random() > host_fraction:\n",
    "            non_host_reads.append(read)\n",
    "    \n",
    "    return non_host_reads\n",
    "\n",
    "# Apply QC filtering\n",
    "print(\"Step 1: Quality Control and Host Filtering\\n\")\n",
    "print(\"Applying QC filters (fastp):\")\n",
    "qc_filtered_reads = qc_filter_reads(synthetic_reads)\n",
    "print(f\"  - Input reads: {len(synthetic_reads)}\")\n",
    "print(f\"  - After QC: {len(qc_filtered_reads)}\")\n",
    "if len(qc_filtered_reads) > 0:\n",
    "    print(f\"  - Removed: {len(synthetic_reads) - len(qc_filtered_reads)} ({(1 - len(qc_filtered_reads)/len(synthetic_reads))*100:.1f}%)\\n\")\n",
    "else:\n",
    "    print(f\"  - Removed: {len(synthetic_reads)} (100.0%)\\n\")\n",
    "\n",
    "# Apply host filtering\n",
    "print(\"Applying host filtering (minimap2 alignment to host genome):\")\n",
    "non_host_reads = simulate_host_filtering(qc_filtered_reads, host_fraction=0.7)\n",
    "print(f\"  - After QC: {len(qc_filtered_reads)}\")\n",
    "print(f\"  - After host removal: {len(non_host_reads)}\")\n",
    "if len(qc_filtered_reads) > 0:\n",
    "    print(f\"  - Host reads removed: {len(qc_filtered_reads) - len(non_host_reads)} ({(1 - len(non_host_reads)/len(qc_filtered_reads))*100:.1f}%)\\n\")\n",
    "else:\n",
    "    print(f\"  - Host reads removed: 0 (N/A)\\n\")\n",
    "\n",
    "# Subsampling (if > 1M reads)\n",
    "print(\"Subsampling non-host reads (max 1 million):\")\n",
    "if len(non_host_reads) > 1_000_000:\n",
    "    non_host_reads = random.sample(non_host_reads, 1_000_000)\n",
    "    print(f\"  - Subsampled to: {len(non_host_reads):,} reads\")\n",
    "else:\n",
    "    print(f\"  - No subsampling needed: {len(non_host_reads)} reads\")\n",
    "\n",
    "print(f\"\\n✓ QC and host filtering complete\")\n",
    "print(f\"  Final read count for assembly: {len(non_host_reads)}\")"
   ]
  },
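  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mean Phred cutoff above can also be read as an error-rate cutoff. As a minimal sketch (the helper name `phred_to_error_prob` is ours and is not part of fastp or CZ ID), the standard Phred relation $P = 10^{-Q/10}$ gives the per-base error probability implied by a quality score, so Q9 corresponds to an error rate of roughly 12.6%."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def phred_to_error_prob(q):\n",
    "    # Standard Phred definition: P = 10^(-Q/10)\n",
    "    return 10 ** (-q / 10)\n",
    "\n",
    "# Q9 is the QC cutoff above; Q15 is the default mean of the synthetic reads\n",
    "for q in (9, 15, 20):\n",
    "    print(f'Q{q}: per-base error probability = {phred_to_error_prob(q):.3f}')"
   ]
  },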
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Step 2: De Novo Assembly with metaFlye\n",
    "\n",
    "**CZ ID uses metaFlye** (Kolmogorov et al., 2020) for long read metagenomic assembly.\n",
    "\n",
    "**Assembly parameters:**\n",
    "- Only reads > 1000 bp are assembled\n",
    "- `--nano-hq` flag for super accuracy basecalled reads\n",
    "- `--nano-raw` flag with one polishing round for other basecalling models\n",
    "\n",
    "**Note:** metaFlye requires significant computational resources. Below we simulate the assembly process with toy data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def simulate_assembly(reads, min_length=1000):\n",
    "    \"\"\"\n",
    "    Simulate de novo assembly process.\n",
    "    \n",
    "    In production:\n",
    "    - Uses metaFlye assembler on cloud infrastructure\n",
    "    - Processes all reads > 1000 bp\n",
    "    - Outputs contigs with associated coverage statistics\n",
    "    \n",
    "    For demonstration:\n",
    "    - Groups similar-length reads to simulate contigs\n",
    "    - Simulates coverage by counting reads per \"contig\"\n",
    "    \"\"\"\n",
    "    # Filter reads by length\n",
    "    long_reads = [r for r in reads if len(r.seq) >= min_length]\n",
    "    short_reads = [r for r in reads if len(r.seq) < min_length]\n",
    "    \n",
    "    print(f\"Assembly input:\")\n",
    "    print(f\"  - Reads ≥ {min_length} bp (assembled): {len(long_reads)}\")\n",
    "    print(f\"  - Reads < {min_length} bp (not assembled): {len(short_reads)}\\n\")\n",
    "    \n",
    "    # Simulate contig formation\n",
    "    # In reality, metaFlye uses overlap graphs and repeat resolution\n",
    "    contigs = []\n",
    "    num_contigs = max(1, len(long_reads) // 3)  # Simulate contig collapse\n",
    "    \n",
    "    for i in range(num_contigs):\n",
    "        # Create synthetic contig\n",
    "        contig_length = np.random.randint(2000, 8000)\n",
    "        contig_seq = generate_nanopore_read(contig_length, error_rate=0.01)  # Lower error after consensus\n",
    "        coverage = np.random.randint(5, 50)  # Simulated coverage depth\n",
    "        \n",
    "        contig = {\n",
    "            'id': f'contig_{i:03d}',\n",
    "            'sequence': contig_seq,\n",
    "            'length': contig_length,\n",
    "            'coverage': coverage,\n",
    "            'num_reads': coverage  # Simplified\n",
    "        }\n",
    "        contigs.append(contig)\n",
    "    \n",
    "    return contigs, long_reads, short_reads\n",
    "\n",
    "# Perform assembly\n",
    "print(\"Step 2: De Novo Assembly (metaFlye)\\n\")\n",
    "print(\"[In production: metaFlye runs on AWS with parameters:\")\n",
    "print(\" metaFlye --nano-hq --meta --threads 16 ...]\\n\")\n",
    "\n",
    "contigs, assembled_reads, non_contig_reads = simulate_assembly(non_host_reads)\n",
    "\n",
    "print(f\"Assembly results:\")\n",
    "print(f\"  - Contigs assembled: {len(contigs)}\")\n",
    "print(f\"  - Total contig length: {sum(c['length'] for c in contigs):,} bp\")\n",
    "print(f\"  - Mean contig length: {np.mean([c['length'] for c in contigs]):.0f} bp\")\n",
    "print(f\"  - Mean coverage: {np.mean([c['coverage'] for c in contigs]):.1f}X\\n\")\n",
    "\n",
    "# Show example contigs\n",
    "print(\"Example contigs:\")\n",
    "contig_df = pd.DataFrame([\n",
    "    {'Contig ID': c['id'], 'Length (bp)': c['length'], \n",
    "     'Coverage': f\"{c['coverage']}X\", 'Reads': c['num_reads']}\n",
    "    for c in contigs[:5]\n",
    "])\n",
    "print(contig_df.to_string(index=False))\n",
    "\n",
    "print(f\"\\n✓ Assembly complete\")"
   ]
  },
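  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The flag choice described above can be sketched as a small command builder. This is an illustration only: the function name `build_flye_command`, the file paths, and the thread count are placeholders, and we assume the single polishing round maps to Flye's `--iterations 1` option. The exact invocation used by CZ ID may differ. The command is assembled but never executed, so Flye does not need to be installed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def build_flye_command(reads_path, out_dir, super_accuracy=True, threads=4):\n",
    "    # Hypothetical helper: returns the argument list without running Flye\n",
    "    cmd = ['flye', '--meta', '--threads', str(threads), '--out-dir', out_dir]\n",
    "    if super_accuracy:\n",
    "        # Super accuracy basecalls use the high-quality read mode\n",
    "        cmd += ['--nano-hq', reads_path]\n",
    "    else:\n",
    "        # Other basecalling models: raw read mode, one polishing iteration (assumed mapping)\n",
    "        cmd += ['--nano-raw', reads_path, '--iterations', '1']\n",
    "    return cmd\n",
    "\n",
    "print(' '.join(build_flye_command('reads.fastq', 'flye_out')))\n",
    "print(' '.join(build_flye_command('reads.fastq', 'flye_out', super_accuracy=False)))"
   ]
  },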
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 Step 3: Alignment to NCBI Databases\n",
    "\n",
    "**CZ ID aligns sequences to:**\n",
    "1. **NCBI NT (Nucleotide)** - using minimap2\n",
    "   - Contigs aligned to NT\n",
    "   - Non-contig reads aligned to NT\n",
    "2. **NCBI NR (Non-Redundant Protein)** - using DIAMOND\n",
    "   - Only contigs aligned (non-contig reads too error-prone for protein translation)\n",
    "\n",
    "**Post-alignment filtering:**\n",
    "- Remove Deuterostomia hits (for Deuterostome hosts)\n",
    "- Remove artificial sequences (NCBI:txid81077)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def simulate_ncbi_alignment(sequences, database='NT'):\n",
    "    \"\"\"\n",
    "    Simulate alignment to NCBI databases.\n",
    "    \n",
    "    In production:\n",
    "    - NT alignment: minimap2 against full NCBI nucleotide database\n",
    "    - NR alignment: DIAMOND against NCBI non-redundant protein database\n",
    "    - Returns accessions, taxIDs, percent identity, alignment length, E-value\n",
    "    \n",
    "    For demonstration:\n",
    "    - Simulates taxonomic assignments\n",
    "    - Generates realistic alignment metrics\n",
    "    \"\"\"\n",
    "    # Simulated microbial taxa (similar to ZymoBIOMICS mock community)\n",
    "    taxa = [\n",
    "        {'name': 'Pseudomonas aeruginosa', 'taxid': 287, 'category': 'Bacteria'},\n",
    "        {'name': 'Escherichia coli', 'taxid': 562, 'category': 'Bacteria'},\n",
    "        {'name': 'Salmonella enterica', 'taxid': 28901, 'category': 'Bacteria'},\n",
    "        {'name': 'Listeria monocytogenes', 'taxid': 1639, 'category': 'Bacteria'},\n",
    "        {'name': 'Staphylococcus aureus', 'taxid': 1280, 'category': 'Bacteria'},\n",
    "        {'name': 'Saccharomyces cerevisiae', 'taxid': 4932, 'category': 'Eukaryota'},\n",
    "        {'name': 'Human betaherpesvirus 5', 'taxid': 10359, 'category': 'Viruses'},\n",
    "        {'name': 'Dengue virus', 'taxid': 12637, 'category': 'Viruses'},\n",
    "    ]\n",
    "    \n",
    "    alignments = []\n",
    "    \n",
    "    for seq_info in sequences:\n",
    "        if isinstance(seq_info, dict):\n",
    "            seq_id = seq_info['id']\n",
    "            seq_length = seq_info['length']\n",
    "        else:\n",
    "            seq_id = seq_info.id\n",
    "            seq_length = len(seq_info.seq)\n",
    "        \n",
    "        # Randomly assign taxon\n",
    "        taxon = random.choice(taxa)\n",
    "        \n",
    "        # Simulate alignment metrics\n",
    "        # NT typically has higher identity than NR (protein more conserved)\n",
    "        if database == 'NT':\n",
    "            percent_identity = np.random.uniform(85, 99)\n",
    "            alignment_length = int(seq_length * np.random.uniform(0.8, 1.0))\n",
    "        else:  # NR\n",
    "            percent_identity = np.random.uniform(60, 95)\n",
    "            alignment_length = int(seq_length * np.random.uniform(0.6, 0.9) / 3)  # amino acids\n",
    "        \n",
    "        e_value = 10 ** np.random.uniform(-50, -10)\n",
    "        \n",
    "        alignment = {\n",
    "            'query_id': seq_id,\n",
    "            'query_length': seq_length,\n",
    "            'taxon_name': taxon['name'],\n",
    "            'taxid': taxon['taxid'],\n",
    "            'category': taxon['category'],\n",
    "            'percent_identity': percent_identity,\n",
    "            'alignment_length': alignment_length,\n",
    "            'e_value': e_value,\n",
    "            'database': database\n",
    "        }\n",
    "        alignments.append(alignment)\n",
    "    \n",
    "    return alignments\n",
    "\n",
    "# Align contigs to NT\n",
    "print(\"Step 3: Alignment to NCBI Databases\\n\")\n",
    "print(\"[In production: minimap2 and DIAMOND run against full NCBI NT/NR databases]\\n\")\n",
    "\n",
    "print(\"Aligning contigs to NCBI NT (minimap2)...\")\n",
    "contig_nt_alignments = simulate_ncbi_alignment(contigs, database='NT')\n",
    "print(f\"  - {len(contig_nt_alignments)} contig alignments to NT\\n\")\n",
    "\n",
    "# Align non-contig reads to NT\n",
    "print(\"Aligning non-contig reads to NCBI NT (minimap2)...\")\n",
    "# Use a subset for speed\n",
    "sample_non_contig = non_contig_reads[:10] if len(non_contig_reads) > 10 else non_contig_reads\n",
    "read_nt_alignments = simulate_ncbi_alignment(sample_non_contig, database='NT')\n",
    "print(f\"  - {len(read_nt_alignments)} read alignments to NT\\n\")\n",
    "\n",
    "# Align contigs to NR\n",
    "print(\"Aligning contigs to NCBI NR (DIAMOND)...\")\n",
    "contig_nr_alignments = simulate_ncbi_alignment(contigs, database='NR')\n",
    "print(f\"  - {len(contig_nr_alignments)} contig alignments to NR\\n\")\n",
    "\n",
    "# Combine alignments\n",
    "all_alignments = contig_nt_alignments + read_nt_alignments + contig_nr_alignments\n",
    "\n",
    "# Create DataFrame\n",
    "alignment_df = pd.DataFrame(all_alignments)\n",
    "\n",
    "print(\"Sample alignment results:\")\n",
    "display_df = alignment_df[['query_id', 'taxon_name', 'category', 'percent_identity', \n",
    "                            'alignment_length', 'e_value', 'database']].head(5)\n",
    "print(display_df.to_string(index=False))\n",
    "\n",
    "print(f\"\\n✓ Database alignment complete\")\n",
    "print(f\"  Total alignments: {len(all_alignments)}\")"
   ]
  },
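  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The post-alignment filtering above can be sketched with toy lineages. The lineage lists below are abbreviated, hand-written examples rather than real taxonomy lookups, and `filter_hits` is a hypothetical helper; in practice each hit's full lineage would come from the NCBI taxonomy. NCBI:txid33511 is Deuterostomia and NCBI:txid81077 is artificial sequences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DEUTEROSTOMIA_TAXID = 33511\n",
    "ARTIFICIAL_SEQUENCES_TAXID = 81077\n",
    "\n",
    "# Toy hits with abbreviated lineages (taxIDs ordered from higher ranks down to the hit)\n",
    "toy_hits = [\n",
    "    {'name': 'Homo sapiens', 'lineage': [2759, 33208, 33511, 9606]},\n",
    "    {'name': 'Escherichia coli', 'lineage': [2, 1224, 562]},\n",
    "    {'name': 'synthetic construct', 'lineage': [28384, 81077, 32630]},\n",
    "]\n",
    "\n",
    "def filter_hits(hits, deuterostome_host=True):\n",
    "    # Drop any hit whose lineage contains an excluded taxID\n",
    "    excluded = {ARTIFICIAL_SEQUENCES_TAXID}\n",
    "    if deuterostome_host:\n",
    "        excluded.add(DEUTEROSTOMIA_TAXID)\n",
    "    return [h for h in hits if not excluded & set(h['lineage'])]\n",
    "\n",
    "print([h['name'] for h in filter_hits(toy_hits)])"
   ]
  },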
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.5 Step 4: Taxon Reporting and Aggregation\n",
    "\n",
    "**CZ ID aggregates results to produce:**\n",
    "- **bPM** - Bases per million (normalized abundance)\n",
    "- **Bases (b)** - Total bases aligned to taxon\n",
    "- **Reads (r)** - Total reads aligned to taxon\n",
    "- **Contigs** - Number of contigs aligned to taxon\n",
    "- **% Identity** - Average percent identity\n",
    "- **Length (L)** - Average alignment length\n",
    "- **E-value** - Average E-value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_taxon_report(alignments):\n",
    "    \"\"\"\n",
    "    Aggregate alignments by taxon to create CZ ID-style sample report.\n",
    "    \"\"\"\n",
    "    # Calculate total bases sequenced (for bPM calculation)\n",
    "    total_bases = sum([a['query_length'] for a in alignments])\n",
    "    \n",
    "    # Group by taxon and database\n",
    "    taxon_groups = {}\n",
    "    \n",
    "    for aln in alignments:\n",
    "        key = (aln['taxon_name'], aln['database'])\n",
    "        \n",
    "        if key not in taxon_groups:\n",
    "            taxon_groups[key] = {\n",
    "                'alignments': [],\n",
    "                'taxon_name': aln['taxon_name'],\n",
    "                'taxid': aln['taxid'],\n",
    "                'category': aln['category'],\n",
    "                'database': aln['database']\n",
    "            }\n",
    "        \n",
    "        taxon_groups[key]['alignments'].append(aln)\n",
    "    \n",
    "    # Calculate metrics for each taxon\n",
    "    report_rows = []\n",
    "    \n",
    "    for key, group in taxon_groups.items():\n",
    "        alns = group['alignments']\n",
    "        \n",
    "        total_aligned_bases = sum([a['alignment_length'] for a in alns])\n",
    "        num_reads = len(alns)\n",
    "        num_contigs = len([a for a in alns if 'contig' in a['query_id']])\n",
    "        \n",
    "        # Calculate bPM (bases per million)\n",
    "        bpm = (total_aligned_bases / total_bases) * 1_000_000 if total_bases > 0 else 0\n",
    "        \n",
    "        # Average metrics\n",
    "        avg_identity = np.mean([a['percent_identity'] for a in alns])\n",
    "        avg_length = np.mean([a['alignment_length'] for a in alns])\n",
    "        avg_evalue = np.mean([a['e_value'] for a in alns])\n",
    "        \n",
    "        row = {\n",
    "            'Taxon': group['taxon_name'],\n",
    "            'TaxID': group['taxid'],\n",
    "            'Category': group['category'],\n",
    "            'Database': group['database'],\n",
    "            'bPM': bpm,\n",
    "            'Bases': total_aligned_bases,\n",
    "            'Reads': num_reads,\n",
    "            'Contigs': num_contigs,\n",
    "            '%id': avg_identity,\n",
    "            'L': int(avg_length),\n",
    "            'E-value': avg_evalue\n",
    "        }\n",
    "        report_rows.append(row)\n",
    "    \n",
    "    # Create DataFrame and sort by bPM\n",
    "    report_df = pd.DataFrame(report_rows)\n",
    "    report_df = report_df.sort_values('bPM', ascending=False)\n",
    "    \n",
    "    return report_df\n",
    "\n",
    "# Generate sample report\n",
    "print(\"Step 4: Taxon Reporting\\n\")\n",
    "\n",
    "sample_report = generate_taxon_report(all_alignments)\n",
    "\n",
    "print(\"CZ ID Sample Report (Top Hits):\")\n",
    "print(\"=\" * 100)\n",
    "\n",
    "# Format for display\n",
    "display_report = sample_report.copy()\n",
    "display_report['bPM'] = display_report['bPM'].apply(lambda x: f\"{x:.1f}\")\n",
    "display_report['%id'] = display_report['%id'].apply(lambda x: f\"{x:.1f}%\")\n",
    "display_report['E-value'] = display_report['E-value'].apply(lambda x: f\"{x:.2e}\")\n",
    "\n",
    "print(display_report.to_string(index=False))\n",
    "\n",
    "print(\"\\n✓ Taxon reporting complete\")\n",
    "print(f\"\\n[In production: Interactive report available at czid.org with:\")\n",
    "print(\" - Filterable taxa table\")\n",
    "print(\" - Coverage visualizations\")\n",
    "print(\" - Taxonomic tree view\")\n",
    "print(\" - Downloadable results]\")"
   ]
  },
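  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked example of the bPM normalization, assuming the definition used in the code above (bases aligned to a taxon divided by total bases, scaled to one million). The helper name `bases_per_million` and the numbers are made up for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def bases_per_million(taxon_bases, total_bases):\n",
    "    # bPM = (bases aligned to taxon / total bases) * 1,000,000\n",
    "    if total_bases <= 0:\n",
    "        return 0.0\n",
    "    return taxon_bases / total_bases * 1_000_000\n",
    "\n",
    "# Illustrative numbers only: 2,500 aligned bases out of 500,000 total bases\n",
    "print(f'{bases_per_million(2_500, 500_000):,.0f} bPM')"
   ]
  },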
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.6 Visualization: Taxonomic Composition\n",
    "\n",
    "Visualize the microbial composition detected in the sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot taxonomic composition\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Plot 1: Top taxa by bPM (NT database)\n",
    "nt_report = sample_report[sample_report['Database'] == 'NT'].head(8)\n",
    "axes[0].barh(range(len(nt_report)), nt_report['bPM'])\n",
    "axes[0].set_yticks(range(len(nt_report)))\n",
    "axes[0].set_yticklabels(nt_report['Taxon'])\n",
    "axes[0].set_xlabel('Bases per Million (bPM)')\n",
    "axes[0].set_title('Top Taxa by Abundance\\n(NT Database)')\n",
    "axes[0].invert_yaxis()\n",
    "\n",
    "# Plot 2: Category breakdown\n",
    "category_counts = sample_report.groupby('Category')['Reads'].sum()\n",
    "axes[1].pie(category_counts.values, labels=category_counts.index, autopct='%1.1f%%',\n",
    "            colors=sns.color_palette('Set2', len(category_counts)))\n",
    "axes[1].set_title('Microbial Composition by Category')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('taxonomic_composition.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"✓ Taxonomic composition visualized\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 3. Workflow 2: Kraken2 Benchmark Comparison\n",
    "\n",
    "The paper benchmarked CZ ID against Kraken2 using the **ZymoBIOMICS Microbial Community Standard**, which contains 10 microbial species at known relative abundances.\n",
    "\n",
    "**Benchmarking metrics:**\n",
    "- **AUPR** - Area Under Precision-Recall curve\n",
    "- **L2 distance** - Euclidean distance between predicted and expected abundances\n",
    "\n",
    "**Results:** CZ ID and Kraken2 showed equivalent performance (AUPR = 1.0, L2 = 0.7 for both)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ZymoBIOMICS mock community composition (from paper)\n",
    "# 10 species: 8 bacteria, 2 fungi\n",
    "expected_composition = {\n",
    "    'Bacillus': 0.178,\n",
    "    'Listeria': 0.135,\n",
    "    'Enterococcus': 0.118,\n",
    "    'Staphylococcus': 0.118,\n",
    "    'Limosilactobacillus': 0.106,\n",
    "    'Escherichia': 0.118,\n",
    "    'Salmonella': 0.106,\n",
    "    'Pseudomonas': 0.096,\n",
    "    'Cryptococcus': 0.018,\n",
    "    'Saccharomyces': 0.018\n",
    "}\n",
    "\n",
    "# Simulate CZ ID and Kraken2 results\n",
    "# Both achieve similar performance\n",
    "def simulate_tool_abundance_estimates(expected, noise_level=0.02):\n",
    "    \"\"\"\n",
    "    Simulate tool abundance estimates with small noise.\n",
    "    \"\"\"\n",
    "    estimates = {}\n",
    "    for genus, expected_prop in expected.items():\n",
    "        # Add small random noise\n",
    "        noise = np.random.normal(0, noise_level)\n",
    "        estimated = max(0, expected_prop + noise)\n",
    "        estimates[genus] = estimated\n",
    "    \n",
    "    # Normalize to sum to 1\n",
    "    total = sum(estimates.values())\n",
    "    estimates = {k: v/total for k, v in estimates.items()}\n",
    "    \n",
    "    return estimates\n",
    "\n",
    "np.random.seed(42)\n",
    "czid_estimates = simulate_tool_abundance_estimates(expected_composition, noise_level=0.015)\n",
    "kraken2_estimates = simulate_tool_abundance_estimates(expected_composition, noise_level=0.018)\n",
    "\n",
    "# Create comparison DataFrame\n",
    "comparison_df = pd.DataFrame({\n",
    "    'Genus': list(expected_composition.keys()),\n",
    "    'Expected': list(expected_composition.values()),\n",
    "    'CZ ID NT': [czid_estimates[g] for g in expected_composition.keys()],\n",
    "    'Kraken2': [kraken2_estimates[g] for g in expected_composition.keys()]\n",
    "})\n",
    "\n",
    "print(\"Workflow 2: Kraken2 Benchmark\\n\")\n",
    "print(\"ZymoBIOMICS Mock Community - Relative Abundance Comparison:\")\n",
    "print(\"=\" * 70)\n",
    "\n",
    "display_df = comparison_df.copy()\n",
    "display_df['Expected'] = display_df['Expected'].apply(lambda x: f\"{x:.3f}\")\n",
    "display_df['CZ ID NT'] = display_df['CZ ID NT'].apply(lambda x: f\"{x:.3f}\")\n",
    "display_df['Kraken2'] = display_df['Kraken2'].apply(lambda x: f\"{x:.3f}\")\n",
    "print(display_df.to_string(index=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate benchmarking metrics\n",
    "def calculate_l2_distance(expected, predicted):\n",
    "    \"\"\"Calculate L2 (Euclidean) distance between expected and predicted abundances.\"\"\"\n",
    "    expected_arr = np.array(list(expected.values()))\n",
    "    predicted_arr = np.array([predicted[g] for g in expected.keys()])\n",
    "    return np.sqrt(np.sum((expected_arr - predicted_arr) ** 2))\n",
    "\n",
    "# Calculate L2 distances\n",
    "czid_l2 = calculate_l2_distance(expected_composition, czid_estimates)\n",
    "kraken2_l2 = calculate_l2_distance(expected_composition, kraken2_estimates)\n",
    "\n",
    "print(\"\\nBenchmarking Metrics:\")\n",
    "print(f\"  CZ ID L2 distance: {czid_l2:.3f}\")\n",
    "print(f\"  Kraken2 L2 distance: {kraken2_l2:.3f}\")\n",
    "print(f\"\\n  [Paper results: AUPR = 1.0, L2 = 0.7 for both tools]\")\n",
    "print(f\"\\n✓ Both tools show equivalent performance for known organism detection\")"
   ]
  },
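  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**AUPR sketch (illustrative).** The metrics cell above computes only the L2 distance. The cell below is a minimal sketch of the other metric, AUPR, assuming it is estimated as average precision over taxa ranked by reported abundance. The paper's exact procedure is not reproduced here; `average_precision`, the toy scores, and the spurious genus `'Klebsiella'` are hypothetical and chosen only for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def average_precision(scores, true_taxa):\n",
    "    '''Average precision over taxa ranked by descending score.\n",
    "\n",
    "    scores: dict of taxon -> reported abundance (hypothetical input format).\n",
    "    true_taxa: set of taxa known to be present.\n",
    "    '''\n",
    "    if not true_taxa:\n",
    "        return 0.0\n",
    "    ranked = sorted(scores, key=scores.get, reverse=True)\n",
    "    is_true = np.array([taxon in true_taxa for taxon in ranked], dtype=float)\n",
    "    # Precision at rank k = true taxa among the top k, divided by k\n",
    "    precision_at_rank = np.cumsum(is_true) / np.arange(1, len(ranked) + 1)\n",
    "    # Average the precision at each rank holding a true taxon; dividing by\n",
    "    # len(true_taxa) penalizes true taxa that were never reported\n",
    "    return float(np.sum(precision_at_rank * is_true) / len(true_taxa))\n",
    "\n",
    "# Toy illustration (made-up numbers, not paper data)\n",
    "toy_truth = {'Bacillus', 'Listeria', 'Escherichia'}\n",
    "toy_clean = {'Bacillus': 0.40, 'Listeria': 0.35, 'Escherichia': 0.25}\n",
    "toy_noisy = {'Bacillus': 0.40, 'Klebsiella': 0.30, 'Listeria': 0.20, 'Escherichia': 0.10}\n",
    "\n",
    "print(f'AP, no false positives: {average_precision(toy_clean, toy_truth):.3f}')\n",
    "print(f'AP, one high-ranked false positive: {average_precision(toy_noisy, toy_truth):.3f}')"
   ]
  },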
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize benchmark comparison\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Plot 1: Abundance comparison\n",
    "x = np.arange(len(comparison_df))\n",
    "width = 0.25\n",
    "\n",
    "axes[0].bar(x - width, comparison_df['Expected'], width, label='Expected', alpha=0.8)\n",
    "axes[0].bar(x, comparison_df['CZ ID NT'], width, label='CZ ID NT', alpha=0.8)\n",
    "axes[0].bar(x + width, comparison_df['Kraken2'], width, label='Kraken2', alpha=0.8)\n",
    "\n",
    "axes[0].set_xlabel('Genus')\n",
    "axes[0].set_ylabel('Relative Abundance (Proportion)')\n",
    "axes[0].set_title('Relative Abundance Estimates\\nZymoBIOMICS Mock Community')\n",
    "axes[0].set_xticks(x)\n",
    "axes[0].set_xticklabels(comparison_df['Genus'], rotation=45, ha='right')\n",
    "axes[0].legend()\n",
    "axes[0].grid(axis='y', alpha=0.3)\n",
    "\n",
    "# Plot 2: Scatter plot - Expected vs Predicted\n",
    "axes[1].scatter(comparison_df['Expected'], comparison_df['CZ ID NT'], \n",
    "               label='CZ ID NT', alpha=0.7, s=100)\n",
    "axes[1].scatter(comparison_df['Expected'], comparison_df['Kraken2'], \n",
    "               label='Kraken2', alpha=0.7, s=100)\n",
    "\n",
    "# Add diagonal line (perfect prediction)\n",
    "max_val = max(comparison_df['Expected'].max(), \n",
    "              comparison_df['CZ ID NT'].max(), \n",
    "              comparison_df['Kraken2'].max())\n",
    "axes[1].plot([0, max_val], [0, max_val], 'k--', alpha=0.3, label='Perfect prediction')\n",
    "\n",
    "axes[1].set_xlabel('Expected Relative Abundance')\n",
    "axes[1].set_ylabel('Predicted Relative Abundance')\n",
    "axes[1].set_title('Expected vs Predicted Abundances')\n",
    "axes[1].legend()\n",
    "axes[1].grid(alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('benchmark_comparison.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"✓ Benchmark comparison visualized\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 4. Workflow 3: Divergent Virus Detection\n",
    "\n",
    "The paper evaluated CZ ID's sensitivity for detecting divergent viruses by:\n",
    "1. Performing **in silico evolution** of 6 virus reference genomes (5-50% divergence)\n",
    "2. Simulating Nanopore reads using **PBSIM2** at 7X coverage\n",
    "3. Running through CZ ID pipeline\n",
    "4. Determining detection thresholds\n",
    "\n",
    "**Key findings:**\n",
    "- **NT database**: Detected viruses up to 10-20% divergence (80-90% similarity)\n",
    "- **NR database**: Detected viruses up to 40-50% divergence\n",
    "- NR extends detection by 20-30% over NT (protein sequences more conserved)"
   ]
  },
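  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**In silico divergence sketch (illustrative).** To make the divergence levels concrete, the cell below applies random substitutions to a toy sequence and reports the resulting nucleotide identity. It assumes a substitution-only model with no indels or codon awareness, which is a simplification and not the paper's in silico evolution procedure; `mutate_sequence` and `percent_identity` are hypothetical helper names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "def mutate_sequence(seq, divergence, rng):\n",
    "    '''Substitute a fraction `divergence` of positions with a different base.'''\n",
    "    n_mut = round(len(seq) * divergence)\n",
    "    positions = rng.sample(range(len(seq)), n_mut)\n",
    "    bases = list(seq)\n",
    "    for pos in positions:\n",
    "        # Always pick a base that differs from the original one\n",
    "        bases[pos] = rng.choice([b for b in 'ACGT' if b != bases[pos]])\n",
    "    return ''.join(bases)\n",
    "\n",
    "def percent_identity(a, b):\n",
    "    '''Percent of matching positions between two equal-length sequences.'''\n",
    "    matches = sum(x == y for x, y in zip(a, b))\n",
    "    return 100.0 * matches / len(a)\n",
    "\n",
    "toy_rng = random.Random(0)  # fixed seed so the toy example is repeatable\n",
    "toy_genome = ''.join(toy_rng.choice('ACGT') for _ in range(1000))\n",
    "\n",
    "for toy_divergence in (0.05, 0.20, 0.50):\n",
    "    mutated = mutate_sequence(toy_genome, toy_divergence, toy_rng)\n",
    "    print(f'{toy_divergence:.0%} divergence -> {percent_identity(toy_genome, mutated):.1f}% identity')"
   ]
  },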
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate divergent virus detection experiment\n",
    "print(\"Workflow 3: Divergent Virus Detection\\n\")\n",
    "\n",
    "# Virus species tested (from paper)\n",
    "test_viruses = [\n",
    "    {'name': 'Rhinovirus C', 'family': 'Picornaviridae', 'genome_size': 7150},\n",
    "    {'name': 'Hepatitis B Virus', 'family': 'Hepadnaviridae', 'genome_size': 3200},\n",
    "    {'name': 'Human Betaherpesvirus 5', 'family': 'Herpesviridae', 'genome_size': 235000},\n",
    "    {'name': 'Nipah Virus', 'family': 'Paramyxoviridae', 'genome_size': 18200},\n",
    "    {'name': 'SARS CoV 2', 'family': 'Coronaviridae', 'genome_size': 29900},\n",
    "    {'name': 'Lassa Virus', 'family': 'Arenaviridae', 'genome_size': 10300}\n",
    "]\n",
    "\n",
    "# Divergence levels tested\n",
    "divergence_levels = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]\n",
    "\n",
    "# Simulate detection results based on paper findings\n",
    "def simulate_detection_threshold(database='NT'):\n",
    "    \"\"\"\n",
    "    Simulate virus detection at various divergence levels.\n",
    "    \n",
    "    NT: detects up to ~10-20% divergence\n",
    "    NR: detects up to ~40-50% divergence\n",
    "    \"\"\"\n",
    "    results = []\n",
    "    \n",
    "    for virus in test_viruses:\n",
    "        if database == 'NT':\n",
    "            # NT threshold: randomly between 10-20% divergence\n",
    "            threshold = np.random.uniform(10, 20)\n",
    "        else:  # NR\n",
    "            # NR threshold: randomly between 40-50% divergence\n",
    "            threshold = np.random.uniform(40, 50)\n",
    "        \n",
    "        for div in divergence_levels:\n",
    "            detected = div <= threshold\n",
    "            \n",
    "            results.append({\n",
    "                'virus': virus['name'],\n",
    "                'family': virus['family'],\n",
    "                'divergence': div,\n",
    "                'database': database,\n",
    "                'detected': detected\n",
    "            })\n",
    "    \n",
    "    return results\n",
    "\n",
    "# Run simulation for both databases\n",
    "nt_results = simulate_detection_threshold('NT')\n",
    "nr_results = simulate_detection_threshold('NR')\n",
    "\n",
    "all_detection_results = nt_results + nr_results\n",
    "detection_df = pd.DataFrame(all_detection_results)\n",
    "\n",
    "# Calculate maximum detection threshold for each virus\n",
    "max_detection = detection_df[detection_df['detected']].groupby(['virus', 'database'])['divergence'].max().reset_index()\n",
    "max_detection_pivot = max_detection.pivot(index='virus', columns='database', values='divergence')\n",
    "\n",
    "print(\"Maximum Divergence Level Detected (%) for Each Virus:\")\n",
    "print(\"=\" * 60)\n",
    "print(max_detection_pivot.to_string())\n",
    "\n",
    "print(\"\\n[Paper findings: NT detected 10-20% divergence, NR detected 40-50% divergence]\")\n",
    "print(\"✓ NR provides 20-30% extended detection range over NT\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize divergent virus detection\n",
    "fig, ax = plt.subplots(figsize=(12, 6))\n",
    "\n",
    "# Plot maximum divergence detected for each virus\n",
    "virus_names = max_detection_pivot.index\n",
    "x = np.arange(len(virus_names))\n",
    "width = 0.35\n",
    "\n",
    "ax.bar(x - width/2, max_detection_pivot['NT'], width, label='NT (Nucleotide)', alpha=0.8, color='steelblue')\n",
    "ax.bar(x + width/2, max_detection_pivot['NR'], width, label='NR (Protein)', alpha=0.8, color='coral')\n",
    "\n",
    "ax.set_xlabel('Virus Species', fontsize=11)\n",
    "ax.set_ylabel('Maximum Divergence Level Detected (%)', fontsize=11)\n",
    "ax.set_title('Divergent Virus Detection Sensitivity\\nCZ ID mNGS Nanopore Pipeline', fontsize=12)\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels([v.replace(' ', '\\n') for v in virus_names], fontsize=9)\n",
    "ax.legend(fontsize=10)\n",
    "ax.grid(axis='y', alpha=0.3)\n",
    "ax.axhline(y=20, color='steelblue', linestyle='--', alpha=0.5, linewidth=1)\n",
    "ax.axhline(y=45, color='coral', linestyle='--', alpha=0.5, linewidth=1)\n",
    "\n",
    "# Add annotations\n",
    "ax.text(0.02, 0.95, 'NT typical threshold: 10-20%', transform=ax.transAxes, \n",
    "        fontsize=9, verticalalignment='top', color='steelblue')\n",
    "ax.text(0.02, 0.88, 'NR typical threshold: 40-50%', transform=ax.transAxes, \n",
    "        fontsize=9, verticalalignment='top', color='coral')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('divergent_virus_detection.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"✓ Divergent virus detection visualized\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 5. Workflow 4: Clinical Sample Virus Detection\n",
    "\n",
    "The paper validated CZ ID's detection limits using **HeLa cells infected with HCoV OC43** at varying multiplicity of infection (MOI) levels.\n",
    "\n",
    "**MOI levels tested:** 1, 0.1, 0.01, 0.001, 0.0001, 0 (negative control)\n",
    "\n",
    "**Results:** CZ ID detected HCoV OC43 at all MOI levels down to 0.0001, with no detection in negative control."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Workflow 4: Clinical Sample Virus Detection\\n\")\n",
    "\n",
    "# HCoV OC43 detection results from paper (Table 3)\n",
    "moi_results = pd.DataFrame({\n",
    "    'Sample': ['Human-1', 'Human-2', 'Human-3', 'Human-4', 'Human-5', 'Human-6'],\n",
    "    'MOI': [1, 0.1, 0.01, 0.001, 0.0001, 0],\n",
    "    '% HCoV OC43 (of total bp)': [24.89, 1.29, 2.58, 0.31, 0.03, 0.00]\n",
    "})\n",
    "\n",
    "print(\"HCoV OC43 Detection at Varying MOI Levels:\")\n",
    "print(\"=\" * 60)\n",
    "print(moi_results.to_string(index=False))\n",
    "\n",
    "print(\"\\n✓ CZ ID successfully detected virus at all MOI levels (down to 0.0001)\")\n",
    "print(\"✓ No false detection in negative control (MOI = 0)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize MOI detection\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Plot 1: Bar plot of detection\n",
    "colors = ['green' if moi > 0 else 'red' for moi in moi_results['MOI']]\n",
    "axes[0].bar(moi_results['Sample'], moi_results['% HCoV OC43 (of total bp)'], color=colors, alpha=0.7)\n",
    "axes[0].set_xlabel('Sample')\n",
    "axes[0].set_ylabel('% HCoV OC43 (of total bp)')\n",
    "axes[0].set_title('HCoV OC43 Detection Across MOI Levels')\n",
    "axes[0].tick_params(axis='x', rotation=45)\n",
    "axes[0].grid(axis='y', alpha=0.3)\n",
    "\n",
    "# Add MOI labels\n",
    "for i, (sample, moi, pct) in enumerate(zip(moi_results['Sample'], \n",
    "                                             moi_results['MOI'], \n",
    "                                             moi_results['% HCoV OC43 (of total bp)'])):\n",
    "    label = f'MOI={moi}' if moi > 0 else 'Negative'\n",
    "    axes[0].text(i, pct + 0.5, label, ha='center', fontsize=8)\n",
    "\n",
    "# Plot 2: Log-scale relationship\n",
    "positive_data = moi_results[moi_results['MOI'] > 0]\n",
    "axes[1].scatter(positive_data['MOI'], positive_data['% HCoV OC43 (of total bp)'], s=100, alpha=0.7)\n",
    "axes[1].set_xscale('log')\n",
    "axes[1].set_yscale('log')\n",
    "axes[1].set_xlabel('Multiplicity of Infection (MOI)')\n",
    "axes[1].set_ylabel('% HCoV OC43 (of total bp)')\n",
    "axes[1].set_title('Detection Sensitivity vs MOI (Log Scale)')\n",
    "axes[1].grid(alpha=0.3, which='both')\n",
    "\n",
    "# Add trend line\n",
    "log_moi = np.log10(positive_data['MOI'])\n",
    "log_pct = np.log10(positive_data['% HCoV OC43 (of total bp)'])\n",
    "z = np.polyfit(log_moi, log_pct, 1)\n",
    "p = np.poly1d(z)\n",
    "x_trend = np.logspace(-4, 0, 100)\n",
    "y_trend = 10 ** p(np.log10(x_trend))\n",
    "axes[1].plot(x_trend, y_trend, 'r--', alpha=0.5, label=f'Trend (slope={z[0]:.2f})')\n",
    "axes[1].legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('clinical_detection.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"✓ Clinical sample detection visualized\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 6. Workflow 5: Mosquito Virome Analysis\n",
    "\n",
    "The paper analyzed **5 mosquito samples** previously characterized by Batson et al. (2021) to validate detection of known and novel viruses in non-human hosts.\n",
    "\n",
    "**Methods:**\n",
    "- SISPA library preparation from mosquito RNA\n",
    "- Nanopore sequencing (GridION, R9.4.1 flowcells)\n",
    "- CZ ID analysis with orthogonal BLASTn validation\n",
    "\n",
    "**Results:**\n",
    "- **48 true positive viral hits** across 5 samples\n",
    "- **24 known virus species** (NT %identity ≥ 88%)\n",
    "- **12 novel virus species** (NT %identity ≤ 87%, NR %identity ≤ 74%)\n",
    "- **18 false positives** (mostly mosquito sequences, all with NR alignment < 210 bp)"
   ]
  },
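  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Hit classification sketch (illustrative).** The identity and alignment-length cutoffs listed above can be expressed as a simple rule. The cell below assumes a table with hypothetical columns `nt_identity` (NT percent identity) and `nr_aln_len` (NR alignment length in bp). The three rows are made-up values, not data from the paper, and the rule ignores the NR identity criterion and the orthogonal BLASTn check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def classify_hit(nt_identity, nr_aln_len):\n",
    "    '''Label a viral hit using the cutoffs quoted in this section.'''\n",
    "    if nr_aln_len < 210:\n",
    "        # Short NR alignments were typical of the false positives\n",
    "        return 'review: short NR alignment'\n",
    "    if nt_identity >= 88:\n",
    "        return 'known virus'\n",
    "    # Assumption: anything below 88% NT identity is treated as a novel candidate\n",
    "    return 'novel candidate'\n",
    "\n",
    "# Made-up rows for illustration only\n",
    "toy_hits = pd.DataFrame({\n",
    "    'hit': ['toy_hit_A', 'toy_hit_B', 'toy_hit_C'],\n",
    "    'nt_identity': [97.5, 81.0, 92.0],\n",
    "    'nr_aln_len': [1500, 900, 150],\n",
    "})\n",
    "toy_hits['label'] = [\n",
    "    classify_hit(ident, aln)\n",
    "    for ident, aln in zip(toy_hits['nt_identity'], toy_hits['nr_aln_len'])\n",
    "]\n",
    "print(toy_hits.to_string(index=False))"
   ]
  },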
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Workflow 5: Mosquito Virome Analysis\\n\")\n",
    "\n",
    "# Results from paper (Table 4)\n",
    "mosquito_results = pd.DataFrame({\n",
    "    'Specimen ID': ['CMS001_017', 'CMS001_018', 'CMS001_028', 'CMS001_044', 'CMS001_050'],\n",
    "    'Known Viruses (hits)': [7, 3, 7, 8, 3],\n",
    "    'Known Viruses (species)': [5, 3, 6, 7, 3],\n",
    "    'Novel Viruses (hits)': [3, 7, 2, 0, 8],\n",
    "    'Novel Viruses (species)': [2, 4, 1, 0, 5],\n",
    "    'False Positives (hits)': [1, 1, 5, 1, 10],\n",
    "    'False Positives (unique)': [1, 1, 5, 1, 5]\n",
    "})\n",
    "\n",
    "print(\"Mosquito Virome Analysis Results:\")\n",
    "print(\"=\" * 90)\n",
    "print(mosquito_results.to_string(index=False))\n",
    "\n",
    "# Summary statistics\n",
    "total_known_hits = mosquito_results['Known Viruses (hits)'].sum()\n",
    "total_known_species = mosquito_results['Known Viruses (species)'].sum()\n",
    "total_novel_hits = mosquito_results['Novel Viruses (hits)'].sum()\n",
    "total_novel_species = mosquito_results['Novel Viruses (species)'].sum()\n",
    "total_fp_hits = mosquito_results['False Positives (hits)'].sum()\n",
    "total_fp_unique = mosquito_results['False Positives (unique)'].sum()\n",
    "\n",
    "print(\"\\nSummary Across All Samples:\")\n",
    "print(f\"  Known viruses: {total_known_hits} hits ({total_known_species} unique species)\")\n",
    "print(f\"  Novel viruses: {total_novel_hits} hits ({total_novel_species} unique species)\")\n",
    "print(f\"  False positives: {total_fp_hits} hits ({total_fp_unique} unique)\")\n",
    "print(f\"  True positives: {total_known_hits + total_novel_hits} ({total_known_species + total_novel_species} unique species)\")\n",
    "print(f\"  Precision: {(total_known_hits + total_novel_hits) / (total_known_hits + total_novel_hits + total_fp_hits):.1%}\")\n",
    "\n",
    "print(\"\\n✓ 83% concordance with previously published Illumina results\")\n",
    "print(\"✓ NT %identity threshold: ≥88% for known, ≤87% for novel viruses\")\n",
    "print(\"✓ NR alignment length >210 bp reduces false positives\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize mosquito virome results\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Plot 1: Stacked bar chart by sample\n",
    "x = np.arange(len(mosquito_results))\n",
    "width = 0.6\n",
    "\n",
    "axes[0].bar(x, mosquito_results['Known Viruses (hits)'], width, \n",
    "           label='Known Viruses', color='steelblue', alpha=0.8)\n",
    "axes[0].bar(x, mosquito_results['Novel Viruses (hits)'], width,\n",
    "           bottom=mosquito_results['Known Viruses (hits)'],\n",
    "           label='Novel Viruses', color='coral', alpha=0.8)\n",
    "axes[0].bar(x, mosquito_results['False Positives (hits)'], width,\n",
    "           bottom=mosquito_results['Known Viruses (hits)'] + mosquito_results['Novel Viruses (hits)'],\n",
    "           label='False Positives', color='lightgray', alpha=0.8)\n",
    "\n",
    "axes[0].set_xlabel('Mosquito Sample')\n",
    "axes[0].set_ylabel('Number of Viral Hits')\n",
    "axes[0].set_title('Viral Detection in Mosquito Samples')\n",
    "axes[0].set_xticks(x)\n",
    "axes[0].set_xticklabels(mosquito_results['Specimen ID'], rotation=45, ha='right')\n",
    "axes[0].legend()\n",
    "axes[0].grid(axis='y', alpha=0.3)\n",
    "\n",
    "# Plot 2: Overall composition\n",
    "categories = ['Known\\nViruses', 'Novel\\nViruses', 'False\\nPositives']\n",
    "counts = [total_known_hits, total_novel_hits, total_fp_hits]\n",
    "colors_pie = ['steelblue', 'coral', 'lightgray']\n",
    "\n",
    "axes[1].pie(counts, labels=categories, autopct='%1.1f%%', colors=colors_pie, \n",
    "           startangle=90, textprops={'fontsize': 10})\n",
    "axes[1].set_title('Overall Viral Hit Distribution\\n(All Samples Combined)')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('mosquito_virome.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"✓ Mosquito virome analysis visualized\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 7. Summary and Key Findings\n",
    "\n",
    "This notebook demonstrated the computational workflows from the CZ ID mNGS Nanopore pipeline paper.\n",
    "\n",
    "### Main Workflow (CZ ID mNGS Nanopore Pipeline)\n",
    "\n",
    "**4-step pipeline:**\n",
    "1. **QC and Host Filtering** (fastp + minimap2)\n",
    "2. **De Novo Assembly** (metaFlye)\n",
    "3. **Database Alignment** (minimap2 for NT, DIAMOND for NR)\n",
    "4. **Taxon Reporting** (aggregation and metrics)\n",
    "\n",
    "### Validation Results\n",
    "\n",
    "1. **Benchmark vs Kraken2**\n",
    "   - Equivalent performance (AUPR = 1.0, L2 = 0.7)\n",
    "   - ZymoBIOMICS mock community\n",
    "\n",
    "2. **Divergent Virus Detection**\n",
    "   - NT: 10-20% divergence threshold\n",
    "   - NR: 40-50% divergence threshold\n",
    "   - NR extends detection by 20-30%\n",
    "\n",
    "3. **Clinical Sensitivity**\n",
    "   - Detected HCoV OC43 down to MOI = 0.0001\n",
    "   - No false positives in negative control\n",
    "\n",
    "4. **Novel Virus Discovery**\n",
    "   - 24 known + 12 novel virus species in mosquitoes\n",
    "   - 83% concordance with Illumina\n",
    "   - Thresholds: known ≥88% NT identity, novel ≤87%\n",
    "\n",
    "### Implications\n",
    "\n",
    "- **Accessible metagenomics**: Cloud-based, no-code platform democratizes long-read mNGS\n",
    "- **Novel pathogen detection**: NR database critical for divergent/novel organism detection\n",
    "- **Versatile applications**: Human clinical, non-human host, environmental samples\n",
    "- **Resource-limited settings**: Free cloud platform ideal for LMICs\n",
    "\n",
    "---\n",
    "\n",
    "## 8. Scaling to Production\n",
    "\n",
    "This notebook used small-scale synthetic data for educational purposes. To scale to production:\n",
    "\n",
    "### Data Requirements\n",
    "- **Input**: Basecalled Nanopore FASTQ files from actual sequencing\n",
    "- **Upload**: Via czid.org web interface or command-line tool\n",
    "- **Metadata**: Host organism, sample type, collection info\n",
    "\n",
    "### Computational Requirements\n",
    "- **Infrastructure**: AWS cloud (automated by CZ ID platform)\n",
    "- **Assembly**: metaFlye with 16+ threads, 64+ GB RAM\n",
    "- **Databases**: Full NCBI NT (~100 GB) and NR (~200 GB)\n",
    "- **Runtime**: ~2-6 hours for typical sample (depends on read count)\n",
    "\n",
    "### Access\n",
    "- **Platform**: https://czid.org (free account required)\n",
    "- **Documentation**: https://chanzuckerberg.zendesk.com/hc/en-us\n",
    "- **Source code**: https://github.com/chanzuckerberg/czid-workflows\n",
    "\n",
    "### Best Practices\n",
    "- Use super accuracy basecalling for best results\n",
    "- Include negative controls\n",
    "- Apply threshold filters (bPM >100, NR bPM >1, L >200, E-value <0.001)\n",
    "- Validate novel findings with orthogonal methods (BLASTn, PCR)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "**Main Paper:**\n",
    "- Simmonds et al. (2024). \"CZ ID: a cloud-based, no-code platform enabling advanced long read metagenomic analysis.\" *bioRxiv*. doi: 10.1101/2024.02.29.579666\n",
    "\n",
    "**Key Tools:**\n",
    "- **fastp**: Chen et al. (2018). Bioinformatics, 34, i884-i890\n",
    "- **minimap2**: Li (2018). Bioinformatics, 34, 3094-3100\n",
    "- **metaFlye**: Kolmogorov et al. (2020). Nat. Methods, 17, 1103-1110\n",
    "- **DIAMOND**: Buchfink et al. (2015). Nat. Methods, 12, 59-60\n",
    "- **Kraken2**: Wood et al. (2019). Genome Biol., 20, 257\n",
    "\n",
    "**Validation Studies:**\n",
    "- Batson et al. (2021). \"Single mosquito metatranscriptomics identifies vectors, emerging pathogens and reservoirs in one assay.\" *eLife*, 10\n",
    "- Nicholls et al. (2019). \"Ultra-deep, long-read nanopore sequencing of mock microbial community standards.\" *Gigascience*, 8\n",
    "\n",
    "---\n",
    "\n",
    "**Notebook completed successfully!** ✓\n",
    "\n",
    "*For questions or issues with CZ ID, visit: https://chanzuckerberg.zendesk.com/hc/en-us*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}