I started by looking at this five-year-old but still quite useful file in this repo: https://github.com/alexmilowski/emr/tree/master/spark
Last active: September 21, 2019 02:31
Spark-word-count-on-aws-emr
awk -F: '{print $1, $2}' output.txt | sort -nk 2
This sorts the output numerically by the second column (the count).
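Here is a small sketch of that sort on sample data. It assumes the word-count output has one `word: count` pair per line, which is a guess at the format; the `sort_by_count` helper name and the sample words are made up for illustration.

```shell
# Assumed input format: one "word: count" pair per line.
# sort_by_count is a hypothetical helper, not part of the original gist.
sort_by_count() {
  # Split on ":", print word and count as two columns,
  # then sort numerically on the second column.
  awk -F: '{print $1, $2}' | sort -n -k 2
}

# Sample data for illustration only.
printf 'spark: 3\nemr: 10\naws: 1\n' | sort_by_count
```

The lines come out ordered by count, lowest first: aws, then spark, then emr. Add `-r` to the `sort` call to put the most frequent words at the top.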
The problem may be the Snappy-compressed files ☝️


This command launches the EMR cluster. Replace the key name and S3 bucket with your own key name and bucket.
EMR then launches a cluster whose instances you can view in the EC2 console.
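As a rough sketch of such a launch with the AWS CLI, something like the following could work. The key name, bucket, cluster name, release label, instance type, and instance count are all placeholder assumptions, not values from the original gist. The script only prints the command by default so it can be reviewed first.

```shell
# Placeholders (assumptions): replace with your own key pair and bucket.
KEY_NAME="my-key-pair"
BUCKET="my-bucket"

# RUN=echo prints the command instead of executing it.
# Set RUN="" to actually launch the cluster (requires a configured AWS CLI).
RUN="echo"

launch_cluster() {
  # Release label, instance type, and count are illustrative choices.
  $RUN aws emr create-cluster \
    --name "spark-word-count" \
    --release-label emr-5.26.0 \
    --applications Name=Spark \
    --ec2-attributes "KeyName=${KEY_NAME}" \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --log-uri "s3://${BUCKET}/logs/"
}

launch_cluster
```

`--use-default-roles` assumes the default EMR roles already exist in the account; `aws emr create-default-roles` creates them if not.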
