Last active
February 24, 2022 12:53
| # unwrap the {"S": value} type wrapper around every field | |
| for item in data: | |
|     for key in item: | |
|         item[key] = item[key]["S"] | |
| print(data[:2]) |
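A minimal, standard-library sketch of what the loop above does, run on two made-up records. The sample lines and the helper name `unwrap_item` are assumptions for illustration only; the real `data.json` is not shown in this gist, and the sketch assumes every field is wrapped in a `{"S": value}` string descriptor under an `"Item"` key.

```python
import json

# Hypothetical sample lines, shaped like the records the loop above expects:
# every field is wrapped in a {"S": value} string-type descriptor.
raw_lines = [
    '{"Item": {"topic": {"S": "SCIENCE"}, "title": {"S": "Example headline A"}}}',
    '{"Item": {"topic": {"S": "SPORTS"}, "title": {"S": "Example headline B"}}}',
]


def unwrap_item(item):
    """Return a flat dict, replacing each {"S": value} wrapper with its value."""
    return {key: wrapped["S"] for key, wrapped in item.items()}


# Parse each line, take the record under "Item", and flatten it.
records = [unwrap_item(json.loads(line)["Item"]) for line in raw_lines]
print(records[0])
```

Building new dicts instead of mutating in place leaves the parsed input untouched, which can make it easier to re-run the step in a notebook.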
| # keep only the label and headline columns | |
| df = df[['topic', 'title']] | |
| # drop the 'NATION' and 'WORLD' labels | |
| data = df[~df['topic'].isin(['NATION', 'WORLD'])] |
| counts = data['topic'].value_counts() | |
| counts.plot(kind='bar', legend=False, grid=True, figsize=(8, 5)) |
| df = pd.DataFrame(data) | |
| print(df.head()) |
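| # assumes `topics` lists the labels to keep and `df` was reset to an empty DataFrame | |
| for topic in topics: | |
|     # cap every topic at its first 5000 rows to balance the classes | |
|     temp_df = data[data['topic'] == topic][:5000] | |
|     df = pd.concat([df, temp_df]) |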
| df['vector'] = df['title'].apply(lambda x: nlp(x).vector) |
| import numpy as np | |
| lens = data.title.str.len() | |
| lens.hist(bins = np.arange(0,200,5)) |
| import json | |
| # each line of data.json holds one JSON object with the record under 'Item' | |
| with open('data.json', 'r') as f: | |
|     data = [json.loads(line)['Item'] for line in f] | |
| print(data[:2]) |
| import pandas as pd | |
| df = pd.read_json('data.json', lines=True) | |
| print(df) |
| import spacy_sentence_bert | |
| # load one of the models listed at https://github.com/MartinoMensio/spacy-sentence-bert/ | |
| nlp = spacy_sentence_bert.load_model('en_stsb_distilbert_base') |
| from sklearn.linear_model import LogisticRegression | |
| from sklearn.metrics import accuracy_score | |
| logistic_clf = LogisticRegression() | |
| logistic_clf.fit(X_train, y_train) | |
| y_pred = logistic_clf.predict(X_test) | |
| print(accuracy_score(y_test, y_pred)) | |
| ## OUTPUT | |
| ## 0.8254545454545454 |
| from sklearn.ensemble import RandomForestClassifier | |
| from sklearn.metrics import accuracy_score | |
| clf = RandomForestClassifier(max_depth=9, random_state=0) | |
| clf.fit(X_train, y_train) | |
| y_pred = clf.predict(X_test) | |
| print(accuracy_score(y_test, y_pred)) | |
| ## OUTPUT | |
| ## 0.7471717171717172 |
| print(data.isnull().sum()) | |
| ## OUTPUT | |
| ## topic 0 | |
| ## title 1 | |
| ## dtype: int64 | |
| data = data.dropna() |
| from sklearn.model_selection import train_test_split | |
| X_train, X_test, y_train, y_test = train_test_split(df['vector'].tolist(), df['topic'].tolist(), test_size=0.33, random_state=42) |
| from sklearn.svm import SVC | |
| from sklearn.metrics import accuracy_score | |
| clf = SVC(gamma='auto') | |
| clf.fit(X_train, y_train) | |
| y_pred = clf.predict(X_test) | |
| print(accuracy_score(y_test, y_pred)) | |
| ## OUTPUT | |
| ## 0.8392929292929293 |
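| for headline, topic in zip(headlines, topics): | |
|     # embed the headline and reshape it to a single-sample 2D array for predict() | |
|     predicted = clf.predict(nlp(headline).vector.reshape(1, -1))[0] | |
|     print(headline) | |
|     print(f"True Label: {topic}, Predicted Label: {predicted} \n") |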
| headlines = ["Scientists Figured Out How Much Exercise You Need to 'Offset' a Day of Sitting", | |
| "Marlee Matlin On Her Career And 'CODA' — The Actor's Side – Deadline", | |
| "Increasing mental health issues a symptom of Victoria's lockdown", | |
| 'Philippines polio outbreak over: UN', | |
| "Sophie, Countess of Wessex opens up about menopause: ‘It's like somebody's just gone and taken your brain'", | |
| 'Bill Gates tells why he prefers Android mobile over iPhone', | |
| "'A weight has been lifted' Edinburgh pubs and restaurants react to hospitality rules easing", | |
| "Mysterious Signal Flashing From Galaxy's Core Baffle Scientists; Where Is There Source of This Radio Waves?", | |
| "'Tears in their eyes': World erupts over All Blacks' beautiful Maradona tribute", | |
| "'Packed in like sardines'"] | |
| topics = ['SCIENCE', 'ENTERTAINMENT', 'HEALTH', | |
| 'HEALTH', 'ENTERTAINMENT', 'TECHNOLOGY', 'BUSINESS', | |
| 'SCIENCE', 'SPORTS', 'ENTERTAINMENT'] |
| print(df.topic.unique()) | |
| ## OUTPUT | |
| ## array(['ENTERTAINMENT', 'BUSINESS', 'NATION', 'SPORTS', 'WORLD', | |
| ## 'TECHNOLOGY', 'HEALTH', 'SCIENCE'], dtype=object) |