- Go to https://takeout.google.com, making sure to log in (or be logged in already) as the owner of the targeted Gmail account.
- Click "Deselect all", then select only Mail.
- Click "All data included" to open the label picker; if you skip this, you will download the entire mailbox!
- In the resulting popup, deselect all the annoying default labels and select only the ones you want, e.g. "Project X"
- Close the pop-up and click "next".
- Choose:
  - file type for compression (.tgz for Mac and Linux)
  - file size upper limit (I used 2 GB)
  - frequency (once)
  - destination (email link)
- Click "create export". If you chose an email link as your destination, the link will soon arrive in your inbox. Open that email and click the button to download the compressed file to your machine.
- Double-click the .tgz (or .zip) file to extract one or more MBOX files (e.g., project-x.mbox).
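
If you would rather script the extraction than double-click the archive, the same step can be sketched with Python's standard `tarfile` module. This is only a sketch: the function name `extract_mbox_files` and the paths in the usage note are hypothetical, and it assumes you chose the .tgz format.

```python
import tarfile
from pathlib import Path


def extract_mbox_files(archive_path, dest_dir):
    """Extract a .tgz archive and return the .mbox files found inside it."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Newer Python versions offer a safer "data" extraction filter;
        # fall back to the plain call where it is not available.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    # The .mbox files may sit in nested folders, so search recursively.
    return sorted(dest.rglob("*.mbox"))
```

For example, `extract_mbox_files("takeout.tgz", "takeout_out")` would unpack the archive into `takeout_out` and return the paths of any MBOX files it contains.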
Why EML? Simply because Unstructured accepts it as input, and we'll use Unstructured for the second conversion step.
- For this step I paid for MBOX Migrator by Recovery Tools
  - the free trial version is limited to 25 emails per MBOX file
  - at the time of this writing, the paid version is $29 in the US
- Follow the simple instructions to get one folder per MBOX file, inside of which you will find several EML files representing individual emails and attachments
  - confusingly, Recovery Tools gives this folder the same name as the original MBOX file (complete with .mbox extension), so it's easy to think that the conversion didn't work
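
As a free alternative to a paid converter, the split from MBOX to EML can also be sketched with Python's standard `mailbox` module. This is not the method used above, and I have not compared its output with what MBOX Migrator produces; in particular, attachments stay embedded inside each EML file rather than being written out separately. The function name `mbox_to_eml` and the `message-00001.eml` naming scheme are hypothetical.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write every message in an MBOX file to its own .eml file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # create=False avoids silently creating an empty mailbox on a typo.
    box = mailbox.mbox(mbox_path, create=False)
    written = []
    try:
        for index, key in enumerate(box.keys(), start=1):
            target = out / f"message-{index:05d}.eml"
            # get_bytes returns the raw message without the mbox "From " line.
            target.write_bytes(box.get_bytes(key))
            written.append(target)
    finally:
        box.close()
    return written
```

For example, `mbox_to_eml("project-x.mbox", "project-x-eml")` would fill the `project-x-eml` folder with one EML file per message.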
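- Open a Google Colab notebook (the code below relies on Colab-specific paths and helpers, so a plain Jupyter notebook would need adjustments)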
- Upload your EML files into the notebook
- Run the following code blocks (modified as needed or desired) and double-click the downloaded zip file to recover the resulting JSON files. Voilà!!!
# Install system and Python dependencies for partitioning emails and attachments
!apt-get -qq install poppler-utils tesseract-ocr
%pip install -q --user --upgrade pillow
%pip install -q "unstructured[all-docs]"
# Process uploaded files by iterating through them and using partition_email from Unstructured each time
import os
from unstructured.partition.email import partition_email
from unstructured.staging.base import elements_to_json

# List the files in /content/ (Colab's default working directory)
# and keep only the .eml files
for filename in os.listdir('/content/'):
    if filename.endswith(".eml"):
        file_path = os.path.join('/content/', filename)
        print(f"Processing: {file_path}")
        try:
            # Partition email and attachments (PDF etc.)
            elements = partition_email(
                filename=file_path,
                process_attachments=True,
                strategy="hi_res",
            )
            # Save output to JSON next to the source file in /content/
            output_path = os.path.join('/content/', f'{filename}.json')
            elements_to_json(elements, filename=output_path)
            print(f"Saved {output_path}")
        except ValueError as e:
            print(f"Error processing {filename}: {e}. Skipping this file.")
# Download all the JSON files you created to your machine
import os
from google.colab import files
import shlex
# Create a list of all .json files in the /content/ directory
json_files = [f for f in os.listdir('/content/') if f.endswith('.json')]
# Create a zip archive containing all JSON files
zip_filename = 'partitioned_emails.zip'
# Properly quote and join the filenames for the shell command
quoted_files_string = shlex.join(json_files)
# Construct and execute the zip command
zip_command = f"zip -j {shlex.quote(zip_filename)} {quoted_files_string}"
get_ipython().system(zip_command)
# Download the zip file
files.download(zip_filename)
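
Once the zip file is on your machine, a quick sanity check is to unpack it and count the element types in each JSON file. The sketch below uses only the standard library and assumes each JSON file holds a list of objects with a `type` field, which is what I would expect from `elements_to_json`; the function name `summarize_export` is hypothetical.

```python
import json
import zipfile
from collections import Counter
from pathlib import Path


def summarize_export(zip_path, out_dir):
    """Unzip the export and count element types per JSON file."""
    out = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
    summary = {}
    # zip -j stores the files without directories, so they land at the top level.
    for json_path in sorted(out.glob("*.json")):
        elements = json.loads(json_path.read_text(encoding="utf-8"))
        # Assumption: each element is a dict with a "type" key.
        summary[json_path.name] = Counter(
            element.get("type", "unknown") for element in elements
        )
    return summary
```

For example, `summarize_export("partitioned_emails.zip", "partitioned_emails")` returns a dictionary mapping each JSON filename to a counter of its element types, which makes it easy to spot emails that produced no elements at all.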