nydame/GMAIL2json.md

## GMAIL2json.md

      
INSTRUCTIONS FOR PROCESSING EMAILS BY LABEL

Export selected emails from Gmail


Go to https://takeout.google.com, making sure to log in (or be logged in already) as the owner of the targeted Gmail account.
Deselect all, then select only Mail.
Click "All data included" to choose specific labels; otherwise you will download the entire mailbox!
In the resulting pop-up, deselect all the annoying default labels and select only the ones you want, e.g. "Project X".
Close the pop-up and click "next".
Choose the export settings:

file type for compression (.tgz for Mac and Linux)
file size upper limit (I used 2 GB)
frequency (once)
destination (email link)


Click "create export". If you chose an email link as your destination, the link will soon arrive in your inbox; open that email and click the button to download the compressed file to your machine.
Double-click the .tgz (or .zip) file to extract one or more MBOX files (e.g., project-x.mbox).
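
Before converting anything, it can be reassuring to check that the export really contains the emails you expect. Here is a minimal sketch using Python's standard `mailbox` module. `summarize_mbox` is a made-up helper name, and the `X-Gmail-Labels` header is an assumption about how Takeout records labels, so treat the label tally as a hint rather than a guarantee.

```python
import mailbox
from collections import Counter


def summarize_mbox(mbox_path):
    """Return (message_count, Counter of label names) for one MBOX file."""
    # create=False makes a wrong path raise an error instead of creating an empty file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        count = 0
        labels = Counter()
        for message in box:
            count += 1
            # Assumption: labels are stored comma-separated in X-Gmail-Labels.
            raw_labels = str(message.get("X-Gmail-Labels", ""))
            for label in raw_labels.split(","):
                label = label.strip()
                if label:
                    labels[label] += 1
        return count, labels
    finally:
        box.close()
```

For example, `summarize_mbox("project-x.mbox")` returns the number of messages plus a tally of labels, which you can compare against what Gmail shows for that label.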

Convert the MBOX file(s) to EML format

Why EML? Simply because it is accepted by Unstructured, which we'll use for the second conversion step.

For this step I paid for MBOX Migrator by Recovery Tools

the free trial version is limited to 25 emails per MBOX file
at the time of this writing, the paid version is $29 in the US


Follow the tool's simple instructions to get one folder per MBOX file; inside each folder you will find several EML files representing individual emails and attachments

confusingly, Recovery Tools gives this folder the same name as the original MBOX file (complete with .mbox extension), so it's easy to think that the conversion didn't work
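
If you would rather not pay for a converter, the split can also be approximated with Python's standard library. This is not the MBOX Migrator approach described above, just a sketch under assumptions: `mbox_to_eml` and the `message_00001.eml` naming scheme are invented for illustration, attachments stay embedded inside each EML file rather than being written out separately, and body lines that the MBOX format escaped as `>From ` may stay escaped.

```python
import mailbox
import os


def mbox_to_eml(mbox_path, out_dir):
    """Write each message in an MBOX file to its own .eml file; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box, start=1):
            eml_path = os.path.join(out_dir, f"message_{index:05d}.eml")
            with open(eml_path, "wb") as handle:
                # as_bytes() serializes headers, body, and MIME parts;
                # unixfrom=False leaves out the mbox-only "From " separator line.
                handle.write(message.as_bytes(unixfrom=False))
            written.append(eml_path)
    finally:
        box.close()
    return written
```

Calling `mbox_to_eml("project-x.mbox", "project-x-eml")` would leave one numbered `.eml` file per message in the `project-x-eml` folder, ready for the Unstructured step.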


Convert the EML files to JSON using Unstructured's handy and free Python library


Open a Jupyter or Google Colab notebook
Upload your EML files into the notebook
Run the following code blocks (modified as needed or desired) and double-click the downloaded zip file to recover the resulting JSON files. Voilà!!!

!apt-get -qq install poppler-utils tesseract-ocr
%pip install -q --user --upgrade pillow
# Quote the package spec so the square brackets are never interpreted as a shell glob
%pip install -q "unstructured[all-docs]"

# Process uploaded files by iterating through them and using partition_email from Unstructured each time
from google.colab import files
import os
import json
from unstructured.partition.email import partition_email
from unstructured.staging.base import elements_to_json

# List the files in /content/ (Colab's default working directory)
# and process only the ones with an .eml extension
for filename in os.listdir('/content/'):
  if filename.endswith(".eml"):
    file_path = os.path.join('/content/', filename)
    print(f"Processing: {file_path}")

    try:
      # Partition email and attachments (PDF etc.)
      elements = partition_email(
          filename=file_path,
          process_attachments=True,
          strategy="hi_res",
      )

      # Save the output as JSON in /content/, next to the source .eml file
      json_path = f"{file_path}.json"
      elements_to_json(elements, filename=json_path)
      print(f"Saved {json_path}")
    except ValueError as e:
      print(f"Error processing {filename}: {e}. Skipping this file.")

# Download all the JSON files you created to your machine
import os
from google.colab import files
import shlex

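# Create a list of full paths to all .json files in the /content/ directory,
# so the zip command works regardless of the current working directory
json_files = [os.path.join('/content/', f) for f in os.listdir('/content/') if f.endswith('.json')]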

# Create a zip archive containing all JSON files
zip_filename = 'partitioned_emails.zip'

# Properly quote and join the filenames for the shell command
quoted_files_string = shlex.join(json_files)

# Construct and execute the zip command
zip_command = f"zip -j {shlex.quote(zip_filename)} {quoted_files_string}"
get_ipython().system(zip_command)

# Download the zip file
files.download(zip_filename)
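
Once the zip file is unpacked, each JSON file should hold a list of elements for one email. The sketch below is one way to take a quick look at a file. It assumes each element is a dictionary with `type` and `text` keys (open one of your files to confirm), and `load_elements` and `summarize_elements` are hypothetical helper names, not part of Unstructured.

```python
import json
from collections import Counter


def load_elements(json_path):
    """Read one JSON file and return its list of element dictionaries."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_elements(elements):
    """Return (Counter of element types, all non-empty text joined by newlines)."""
    type_counts = Counter(element.get("type", "Unknown") for element in elements)
    text = "\n".join(element["text"] for element in elements if element.get("text"))
    return type_counts, text
```

Running `summarize_elements(load_elements("some-email.eml.json"))` on one of your own files gives a rough picture of how the email and its attachments were split up, plus the plain text for a quick read-through.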