@dhruvpathak
Created July 22, 2018 12:08
Lemmatization and POS tag correlation in spaCy
# tested on: Python 3.7.0, spaCy 2.0.12
import urllib.request
import spacy
from collections import defaultdict
from pprint import pprint
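# 'en' is a spaCy 2.x shortcut link; spaCy 3.x requires a full package name such as 'en_core_web_sm'
nlp = spacy.load('en')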
#fetch a long essay text
text_url = "https://pastebin.com/raw/M7RwNi5q"
input_text = urllib.request.urlopen(text_url).read().decode('utf-8','ignore')
parsed = nlp(input_text)
# map each "text_POS" key to the list of lemmas seen for that token text and POS tag
lemma_map = defaultdict(list)
# record the lemma of every token under its text+POS key
for token in parsed:
    hash_key = '{0}_{1}'.format(token.text, token.pos_)
    lemma_map[hash_key].append(token.lemma_)
# print the keys and their lemmas, most frequent text+POS keys first
sorted_items = sorted(lemma_map.items(), key=lambda item: -len(item[1]))
pprint(sorted_items)
# check whether any combination of token text and POS tag maps to more than one lemma
for key, value in lemma_map.items():
    if len(set(value)) == 1:
        print('lemmas SAME for key:{0},lemma:{1}'.format(key, value[0]))
    else:
        print('lemmas DIFF for key:{0},lemmas:{1}'.format(key, value))
# observation: in this data, tokens that share both their text and their
# POS tag always get the same lemma.
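The grouping step above can also be tried without downloading the essay or loading a model. What follows is a minimal sketch, not part of the original script: `find_inconsistent_lemmas` is a hypothetical helper name, and the sample triples are hand-written for illustration rather than real spaCy output. It collects lemmas into a set per text+POS key and returns only the keys with more than one distinct lemma.

```python
from collections import defaultdict


def find_inconsistent_lemmas(triples):
    """Group lemmas by 'text_POS' key; return keys that have more than one distinct lemma."""
    lemma_sets = defaultdict(set)
    for text, pos, lemma in triples:
        lemma_sets['{0}_{1}'.format(text, pos)].add(lemma)
    return {key: sorted(lemmas) for key, lemmas in lemma_sets.items() if len(lemmas) > 1}


# Hypothetical (text, POS, lemma) triples, written by hand for illustration only.
sample = [
    ("supporting", "VERB", "support"),
    ("supporting", "VERB", "support"),
    ("ways", "NOUN", "way"),
    ("left", "VERB", "leave"),
    ("left", "VERB", "left"),  # deliberately inconsistent entry
]

inconsistent = find_inconsistent_lemmas(sample)
print(inconsistent)
```

With a parsed spaCy document, the same helper could presumably be fed `((t.text, t.pos_, t.lemma_) for t in parsed)`. An empty result would then correspond to the observation that every key has a single lemma.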
@dhruvpathak

Result snippets:

('difficult_ADJ',
  ['difficult',
   'difficult',
   'difficult',
   'difficult',
   'difficult',
   'difficult']),
 ('supporting_VERB',
  ['support', 'support', 'support', 'support', 'support', 'support']),
 ('had_VERB', ['have', 'have', 'have', 'have', 'have', 'have']),
 ('case_NOUN', ['case', 'case', 'case', 'case', 'case', 'case']),
 ('person_NOUN', ['person', 'person', 'person', 'person', 'person', 'person']),
lemmas SAME for key:respecting_VERB,lemma:respect
lemmas SAME for key:supporting_VERB,lemma:support
lemmas SAME for key:required_VERB,lemma:require
lemmas SAME for key:generate_VERB,lemma:generate
lemmas SAME for key:ways_NOUN,lemma:way
lemmas SAME for key:working_VERB,lemma:work
lemmas SAME for key:As_ADP,lemma:as
lemmas SAME for key:Sadler_PROPN,lemma:sadler
lemmas SAME for key:1997_NUM,lemma:1997
lemmas SAME for key:puts_VERB,lemma:put
lemmas SAME for key:essence_NOUN,lemma:essence
lemmas SAME for key:said_VERB,lemma:say
lemmas SAME for key:‘_PRON,lemma:‘
lemmas SAME for key:individually_ADV,lemma:individually
lemmas SAME for key:socially_ADV,lemma:socially
lemmas SAME for key:defined_VERB,lemma:define
lemmas SAME for key:creative_ADJ,lemma:creative
lemmas SAME for key:meet_VERB,lemma:meet
lemmas SAME for key:recognised_VERB,lemma:recognise
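
Since the per-key output is long, a tally of SAME versus DIFF keys may be easier to scan. This is a rough sketch under assumptions: `summarize` is a hypothetical helper, the first three entries are copied from the result snippet above, and the last entry is invented to show what a DIFF case would look like.

```python
from collections import Counter


def summarize(lemma_map):
    """Count keys whose lemmas are all identical (SAME) versus keys with several lemmas (DIFF)."""
    return Counter(
        "SAME" if len(set(lemmas)) == 1 else "DIFF"
        for lemmas in lemma_map.values()
    )


lemma_map = {
    # entries taken from the result snippet
    "difficult_ADJ": ["difficult"] * 6,
    "supporting_VERB": ["support"] * 6,
    "had_VERB": ["have"] * 6,
    # hypothetical entry, not from the real output
    "example_NOUN": ["example", "exampl"],
}

summary = summarize(lemma_map)
print(summary["SAME"], summary["DIFF"])
```

On the script's own `lemma_map`, a DIFF count of zero would match the observation stated in the code comments.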
