@Cannedfood
Last active September 18, 2025 20:59
Script for downloading the Google Books Ngram dataset
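#!/bin/bash -e
# Download the Google Books Ngram export files (version 20200217)
# for each language and n-gram size into <language>/<size>-gram/.
# Requires curl, wget, and GNU grep (for -P).
LANGUAGES=(ger eng)
NGRAM_SIZES=(1 2 3)
for LANGUAGE in "${LANGUAGES[@]}"; do
    echo "Downloading files with wget into $LANGUAGE/"
    for SIZE in "${NGRAM_SIZES[@]}"; do
        # Subshell so the cd does not leak into the next iteration.
        (
            echo "$LANGUAGE/$SIZE-gram"
            mkdir -p "$LANGUAGE/$SIZE-gram"
            cd "$LANGUAGE/$SIZE-gram"
            # Scrape the .gz links from the export index page.
            URLS=$(
                curl -s "https://storage.googleapis.com/books/ngrams/books/20200217/$LANGUAGE/$LANGUAGE-$SIZE-ngrams_exports.html" |
                grep -oP 'href="\K[^"]+\.gz' |
                sort -h
            )
            # --continue resumes partially downloaded files on a rerun.
            for URL in $URLS; do
                wget --continue -q --show-progress "$URL"
            done
        )
    done
done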
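
The link-extraction step of the script can be tried offline before starting a large download. The sketch below runs the same `grep -oP` and `sort -h` pipeline on a made-up HTML snippet. The `SAMPLE_HTML` contents, the `example.invalid` host, and the `extract_urls` helper name are assumptions for illustration only; the real index page markup may differ. It needs GNU grep for `-P`.

```shell
#!/bin/bash
# Hypothetical stand-in for the export index page (not the real markup).
SAMPLE_HTML='<a href="http://example.invalid/eng/1-00001-of-00002.gz">b</a> <a href="http://example.invalid/eng/1-00000-of-00002.gz">a</a>'

# Hypothetical helper wrapping the pipeline used in the script.
# \K drops the already matched 'href="' prefix from the output,
# so grep -o prints only the URL ending in .gz.
extract_urls() {
    grep -oP 'href="\K[^"]+\.gz' | sort -h
}

URLS=$(printf '%s\n' "$SAMPLE_HTML" | extract_urls)
printf '%s\n' "$URLS"
```

Run as is, this should print the two sample URLs, one per line, with the `00000` file first. Swapping the `printf` for the `curl -s` call from the script gives a dry run that lists the real files without downloading them.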