@Mlawrence95
Mlawrence95 / goodreads_rss_to_yaml.py
Created January 4, 2026 18:42
Fetches a Goodreads RSS feed, converts it to a list of dictionaries, and dumps the result to a YAML file. Dedupes against earlier runs if the old YAML path is provided.
import feedparser # via conda install anaconda::feedparser
import yaml
from bs4 import BeautifulSoup
_GOODREADS_RSS_STREAM_URL = "https://www.goodreads.com/review/list_rss/<XXXXXXXXXX>?key=<XXXXXXXXXXXXXX>&shelf=<XXXX>"
# Old yaml lives here. We'll use it to ensure our new dump has unique values.
_EXISTING_YAML_PATH = "docs/_data/books.yml"
_NEW_YAML_PATH = "books.yaml"
Mlawrence95 / read_csv_from_aws_s3_targz.python
Created July 27, 2020 22:54
Given a CSV file that's inside a tar.gz file on AWS S3, read it into a Pandas dataframe without downloading or extracting the entire tar file
# checked against python 3.7.3, pandas 0.24.2, s3fs 0.4.2
import tarfile
import io
import s3fs
import pandas as pd
tar_path = "s3://my-bucket/debug.tar.gz" # path in s3
metadata_path = "debug/metadata.csv" # path inside of the tar file
Mlawrence95 / md5_decorator.py
Last active May 20, 2020 22:31
A python decorator that adds a column to your pandas dataframe -- the MD5 hash of the specified column
import pandas as pd
from hashlib import md5
def text_to_hash(text):
    return md5(text.encode("utf8")).hexdigest()

def add_hash(column_name="document"):
    """
    Decorator. Wraps a function that returns a dataframe; the dataframe must have column_name in its columns.
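The preview cuts off inside the docstring, so the body of `add_hash` is not shown. One way such a decorator could be written is sketched below. The `hash_column` parameter, the copy of the dataframe, and the example function `load_documents` are assumptions for illustration, not taken from the gist.

```python
from functools import wraps
from hashlib import md5

import pandas as pd


def text_to_hash(text):
    return md5(text.encode("utf8")).hexdigest()


def add_hash(column_name="document", hash_column="hash"):
    """Decorator factory; hash_column is a hypothetical name for the new column."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Call the wrapped function, then add the hash column to a copy
            # so the caller's original dataframe is left untouched.
            df = func(*args, **kwargs).copy()
            df[hash_column] = df[column_name].map(text_to_hash)
            return df
        return wrapper
    return decorator


@add_hash(column_name="document")
def load_documents():
    # Hypothetical data source used only for this example.
    return pd.DataFrame({"document": ["hello", "world"]})


df = load_documents()
print(df)
```

Mapping `text_to_hash` over the column assumes every value is a string; missing values would need handling first.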
Mlawrence95 / mp3_to_plot.py
Created April 21, 2020 21:30
[python] convert .mp3 file into a .wav, then visualize the sound using a matplotlib plot
import matplotlib.pyplot as plt
import soundfile as sf
from pydub import AudioSegment
# we want to convert source, mp3, into dest, a .wav file
source = "./recordings/test.mp3"
dest = "./recordings/test.wav"
# conversion - check!
Mlawrence95 / get_timestamp.py
Created March 31, 2020 22:18
Use python's time library to print the date as a single string in m/d/y format, GMT. Useful for adding timestamps to filenames
import time
def get_timestamp():
    """
    Print the date in m/d/y format, GMT
    >>> get_timestamp()
    '3_31_2020'
    """
    t = time.gmtime()
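The preview ends right after the `time.gmtime()` call. A minimal sketch of how the rest might look, assuming the underscore-separated, unpadded format shown in the docstring example, is given here. Splitting the formatting into a separate `format_timestamp` helper is an assumption made so the output can be checked against a fixed time.

```python
import time


def format_timestamp(t):
    """Format a time.struct_time as month_day_year, without zero padding."""
    return f"{t.tm_mon}_{t.tm_mday}_{t.tm_year}"


def get_timestamp():
    """Return the current GMT date as a string such as '3_31_2020'."""
    return format_timestamp(time.gmtime())


# Example use: tag a filename with the current date.
filename = f"results_{get_timestamp()}.csv"
print(filename)
```

Underscores rather than slashes keep the string safe to use inside a filename.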
Mlawrence95 / open_files.py
Created March 26, 2020 18:03
Helpers to open file types common in Python data analysis: JSON and pickle. A great addition to your startup.ipy file in ~/.ipython/profile_default/startup/
import json
import pickle
def openJSON(path):
    """
    Safely opens json file at 'path'
    """
    with open(path, 'r') as File:
        data = json.load(File)
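Only the start of the JSON helper is visible in the preview. A minimal sketch of the pair of helpers the description mentions might look like the following. The name `openPickle` and the round-trip demo are assumptions; the gist's own pickle helper is not shown.

```python
import json
import os
import pickle
import tempfile


def openJSON(path):
    """Load and return the contents of the JSON file at 'path'."""
    with open(path, "r") as file:
        return json.load(file)


def openPickle(path):
    """Load and return the object stored in the pickle file at 'path'."""
    # Unpickling can execute arbitrary code, so only open files you trust.
    with open(path, "rb") as file:
        return pickle.load(file)


# Round-trip demo in a temporary directory.
record = {"name": "example", "values": [1, 2, 3]}
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "record.json")
    pickle_path = os.path.join(tmp, "record.pkl")
    with open(json_path, "w") as file:
        json.dump(record, file)
    with open(pickle_path, "wb") as file:
        pickle.dump(record, file)
    from_json = openJSON(json_path)
    from_pickle = openPickle(pickle_path)

print(from_json, from_pickle)
```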
Mlawrence95 / pyplot_set_params.py
Created December 16, 2019 17:24
matplotlib allows you to set plot parameters via a param dict. Here's one such example
import matplotlib.pyplot as plt
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 15),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
plt.rcParams.update(params)
Mlawrence95 / make_old_pickles_openable.py
Created December 5, 2019 23:51
Old pickle files can be a pain to work with. This can make SliceType and ObjectType exceptions go away in certain circumstances.
import pickle
import dill
dill._dill._reverse_typemap['SliceType'] = slice
dill._dill._reverse_typemap['ObjectType'] = object
Mlawrence95 / clone_private_repo.txt
Created December 5, 2019 23:45
Trying to access a private repo? Use this format to pull it down. (Yes, it puts your password on the command line, where it can end up in your shell history. Only do this in low-risk environments.)
git clone https://[insert username]:[insert password]@github.com/[insert organisation name]/[insert repo name].git
Mlawrence95 / get_word_counts.py
Last active November 5, 2019 19:13
Takes a document (string) or iterable of documents and returns a pandas dataframe containing the number of occurrences of each unique word. Note that this is not efficient enough to replace scikit-learn's CountVectorizer class as a bag-of-words transformer.
import numpy as np
import pandas as pd
def get_word_counts(document: str) -> pd.DataFrame:
    """
    Turns a document into a dataframe of word, counts
    Use preprocessing/lowercasing before this step for best results.
    If passing many documents, use document = '\n'.join(iterable_of_documents)
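The function body is not part of the preview. As a rough sketch of the counting step only, the version below uses `collections.Counter` from the standard library rather than the numpy and pandas approach the gist imports. The name `count_words` is hypothetical, and whitespace splitting is an assumed tokenisation.

```python
from collections import Counter


def count_words(document):
    """Return (word, count) pairs for a document, most frequent first."""
    # str.split() with no arguments splits on any whitespace, so documents
    # joined with newlines are handled the same way as a single document.
    return Counter(document.split()).most_common()


documents = ["the cat sat", "the dog sat", "the end"]
counts = count_words("\n".join(documents))
print(counts)
```

If a dataframe is wanted, the pairs can be passed on as `pd.DataFrame(counts, columns=["word", "count"])`.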