Amr Keleg AMR-KELEG
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "AMR-KELEG/Sentence-ALDi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)


def compute_score(sentence):
    """Returns a normalized divergence 'distance' score from MSA in [0, 1]"""
    # Warning -- inputs longer than 512 subtokens are truncated
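The preview above stops before the function body. As a rough sketch only, and not the gist's actual code, scoring could look like the following, assuming the model exposes a single regression logit per sentence and that the raw value is clipped into [0, 1]. The names `compute_aldi_score`, `clip_to_unit_interval`, and `load_aldi_model` are hypothetical, and the tokenizer and model are passed in as arguments so the scoring logic can be exercised separately from loading.

```python
def clip_to_unit_interval(value):
    """Clamp a raw model output into the closed interval [0, 1]."""
    return max(0.0, min(1.0, float(value)))


def load_aldi_model(model_name="AMR-KELEG/Sentence-ALDi"):
    """Load the tokenizer and model named in the snippet above.

    The import is kept local so the rest of this sketch can be read and
    tested without downloading any weights.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return tokenizer, model


def compute_aldi_score(sentence, tokenizer, model, max_length=512):
    """Return a score in [0, 1] for one sentence.

    Assumption: the model returns logits of shape (1, 1), i.e. one
    regression value per input sentence.
    """
    # Inputs longer than max_length subtokens are truncated, matching the
    # warning in the original snippet.
    features = tokenizer(
        sentence,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
    )
    logits = model(**features).logits
    raw_score = float(logits[0][0])
    return clip_to_unit_interval(raw_score)
```

With the real model, usage would be `tokenizer, model = load_aldi_model()` followed by `compute_aldi_score(sentence, tokenizer, model)`. Clipping is one simple way to honor the [0, 1] range promised by the docstring; the gist itself may normalize differently.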
@AMR-KELEG
AMR-KELEG / estimate_dialect_and_ALDi.py
Created October 1, 2024 14:17
Automatically estimate the ALDi score and dialect of sentences
import re
import torch
import pandas as pd
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer
DIALECTS = [
    "Algeria",
    "Bahrain",
    "Egypt",
AMR-KELEG / stopwords-removal.ipynb
Created February 26, 2021 14:35
Remove Arabizi stopwords using transliteration
AMR-KELEG / marks-merge.ipynb
Last active February 19, 2021 15:07
marks-merge.ipynb

Unigram Weighting

Precision: 0.84776 +- 0.00871

Recall: 0.83987 +- 0.00888

testing_corpus  precision  recall
kaz.cleaned_0   0.838899   0.829193
kaz.cleaned_1   0.841818   0.834286
kaz.cleaned_2   0.860011   0.84795
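
The summary lines above have the form "mean +- spread" over per-corpus scores. A minimal sketch of that aggregation follows, assuming the spread is a sample standard deviation (the source does not say which estimator was used). Only the three rows visible here are included, so the output is not expected to reproduce the reported figures exactly. The helper names `summarize` and `format_summary` are hypothetical.

```python
import statistics

# Per-corpus scores copied from the rows shown above.
results = {
    "kaz.cleaned_0": {"precision": 0.838899, "recall": 0.829193},
    "kaz.cleaned_1": {"precision": 0.841818, "recall": 0.834286},
    "kaz.cleaned_2": {"precision": 0.860011, "recall": 0.84795},
}


def summarize(values):
    """Return (mean, sample standard deviation) for a list of scores."""
    values = list(values)
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    spread = statistics.stdev(values)
    return mean, spread


def format_summary(name, values):
    """Format a metric in the same 'mean +- spread' style as the source."""
    mean, spread = summarize(values)
    return f"{name}: {mean:.5f} +- {spread:.5f}"


precision_mean, precision_std = summarize(r["precision"] for r in results.values())
recall_mean, recall_std = summarize(r["recall"] for r in results.values())

print(format_summary("Precision", [r["precision"] for r in results.values()]))
print(format_summary("Recall", [r["recall"] for r in results.values()]))
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population standard deviation instead, which is what `numpy.std` computes by default.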