@ashokbalaraman
Forked from actsasgeek/starter.md
Created September 6, 2020 09:29

EN685.648 Starter Pack

Infrastructure

  1. You are very strongly encouraged to use a computer upon which you have administrator/superuser privileges. I cannot help you with problems associated with the installation of software and libraries.
  2. You are encouraged to use a "Unix"-style operating system (MacOS or a Linux flavor), either directly or in a virtual environment (Docker or VirtualBox). This is not required to excel in the class, but it is helpful: the examples of command line utilities will be in 'Nix, many platforms are built on Linux, and you should be comfortable working in more than one OS.

Setup

  1. Install Anaconda for Python 3.7 for your operating system: Anaconda. We'll be changing this to 3.8 later using the environment file. It's OK if you already have a Python environment you want to use instead (Anaconda will modify your path and can cause problems, for example, if you use brew on MacOS). Just make sure that all the required libraries listed in the environment.yml file are installed, using pip.
  2. Create a directory/folder for data science and move into it.
  3. Download environment.yml into your directory (or just copy the Raw content, paste it into a file named environment.yml, and save it).
  4. Execute conda env create -f environment.yml
  5. You now have all the libraries needed for the course (as of now).
  6. Execute conda activate en685648 (whenever you work in that environment, for any reason, activate it first!).
  7. You'll have to install pandasql manually: pip install -U pandasql (make sure you activated the environment!).
  8. Set up Jupyter notebook to use this environment: python -m ipykernel install --user --name en685648 --display-name "Python (en685648)"

For now, the only thing in this directory will be the environment.yml file.
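Once the environment is active, a quick sanity check can confirm that the libraries are importable. The following is a minimal sketch, not part of the course materials: the module list is an assumption based only on the libraries named in this starter pack, and environment.yml remains the authoritative list. The file name check_env.py is likewise hypothetical.

```python
# check_env.py (hypothetical file name): run with the en685648 environment active.
import importlib.util
import sys

# Assumed module names: only libraries mentioned in this starter pack.
# The authoritative list lives in environment.yml.
MODULES = ["pandas", "matplotlib", "tabulate", "pandasql", "ipykernel"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "at", sys.executable)
    missing = missing_modules(MODULES)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All listed modules were found.")
```

If anything is reported missing, the most likely causes are that the environment was not activated or that the pip install step was run outside of it.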

Workflow

Once the class has started, you will be able to download the Jupyter notebooks for each module. In the interim, you may want to get a feel for the environment in which you'll be working. Use the following commands:

  1. conda activate en685648 - this will activate the environment. (Use conda env list to see your installed environments).
  2. jupyter notebook - this will start the Jupyter notebook environment with the current directory as the root.
  3. When you create a new Jupyter notebook, you can select "Python (en685648)" as the kernel.
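To confirm that a notebook is really running on the "Python (en685648)" kernel, you can inspect the interpreter from a code cell. This is a small sketch using only the standard library; the helper name describe_interpreter is made up for illustration. With a default conda setup, the prefix path typically ends in the environment name, but that depends on where your environments are installed.

```python
import os
import sys


def describe_interpreter():
    """Collect a few facts that identify which Python is running this code."""
    return {
        "executable": sys.executable,  # path to the running Python binary
        "prefix": sys.prefix,          # root directory of the active environment
        # Set by conda activate in the shell; may be None, or may reflect the
        # shell that launched Jupyter rather than the kernel itself.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_interpreter()
for key, value in info.items():
    print(f"{key}: {value}")
```

Of the three values, the prefix is the most reliable indicator of the kernel's environment.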

Note - jupyter notebook will eventually be "sunsetted" in favor of jupyter lab. If you want to use jupyter lab, that's fine.

When you're done, you can invoke conda deactivate.

Preparations

The students taking this course come from a variety of backgrounds. While the course itself covers a lot of topics, you will have an easier time of it during the semester if you do a bit of preparation on your own (especially if Python is new to you). These links will get you started but feel free to explore other resources using Google.

  1. Python 3
  2. Jupyter notebooks (YouTube)
    1. Jupyter Notebooks for Beginners (blog)
    2. Advanced Jupyter Notebooks (blog)
  3. Markdown
  4. Pandas
  5. Matplotlib

This course is not primarily a coding course and Data Science is not primarily about running code. Data Science is about analysis and communication. Style, usage, and organization matter. You must be equally adept at using the Markdown and Code cells in the Jupyter notebook. If nothing else, learn to use Markdown effectively. Additionally, tabulate has been included to help with the creation of tables.
