@gary136
Created November 21, 2019 14:08
vocabulary.py
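import string

import pandas as pd


def vocabulary(sample, start, end):
    """Return the distinct words of `sample` whose rank lies in [start, end).

    Relies on a global DataFrame `df` with columns 'word' and 'dic';
    the index of `df` is used as each word's rank.
    """
    # Split hyphenated words, strip punctuation, lowercase, and deduplicate.
    cleaned = sample.replace('-', ' ').translate(str.maketrans('', '', string.punctuation))
    words = {w.lower() for w in cleaned.split()}
    # Build a word -> (rank, dic) lookup once, keeping the first occurrence of each word.
    first = df.drop_duplicates('word')
    lookup = {w: (r, d) for w, r, d in zip(first['word'], first.index, first['dic'])}
    # Words missing from df get the sentinel rank 99999 and dic 'N'.
    rows = [(w, *lookup.get(w, (99999, 'N'))) for w in words]
    d = pd.DataFrame(rows, columns=['word', 'rank', 'dic'])
    d = d.sort_values(by='rank').reset_index(drop=True)
    return d[(d['rank'] >= start) & (d['rank'] < end)]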
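
The function above reads a global `df` that the gist never defines. The sketch below is one way to try the same logic end to end. It is not the author's code: `vocabulary_from` is a hypothetical variant that takes the word table as an argument, and the `toy` table with its 'dic' values is invented purely for illustration. It assumes, as the original does, that the table's index serves as the word's rank.

```python
import string

import pandas as pd


def vocabulary_from(table, sample, start, end):
    # Hypothetical variant of vocabulary() that receives the word table explicitly.
    cleaned = sample.replace('-', ' ').translate(str.maketrans('', '', string.punctuation))
    words = {w.lower() for w in cleaned.split()}
    # word -> (rank, dic), where rank is the table index of the first occurrence.
    first = table.drop_duplicates('word')
    lookup = {w: (r, d) for w, r, d in zip(first['word'], first.index, first['dic'])}
    # Unknown words get the same sentinel values as the original: rank 99999, dic 'N'.
    rows = [(w, *lookup.get(w, (99999, 'N'))) for w in words]
    out = pd.DataFrame(rows, columns=['word', 'rank', 'dic'])
    out = out.sort_values(by='rank').reset_index(drop=True)
    return out[(out['rank'] >= start) & (out['rank'] < end)]


# Toy table (invented): index 0..2 acts as the rank, 'dic' is a made-up label.
toy = pd.DataFrame({'word': ['the', 'quick', 'fox'], 'dic': ['A', 'B', 'C']})
result = vocabulary_from(toy, "The quick-witted fox!", 1, 100)
print(result)
```

With this toy table, "the" (rank 0) falls below `start` and "witted" (unknown, rank 99999) falls above `end`, so only "quick" and "fox" remain. Passing the table as a parameter also makes the function easier to test than relying on a module-level `df`.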