```python
from sklearn.metrics.pairwise import cosine_similarity


def maximal_marginal_relevance(sentence_vector, phrases, embedding_matrix,
                               lambda_constant=0.5, threshold_terms=10):
    """
    Return phrases ranked by Maximal Marginal Relevance (MMR), using cosine
    similarity as the similarity measure.

    :param sentence_vector: query vector
    :param phrases: list of (phrase, score) candidate tuples
    :param embedding_matrix: DataFrame indexed by phrase, values are vectors
    :param lambda_constant: balances accuracy and diversity; a high value
        favours accuracy, a low value favours diversity
    :param threshold_terms: number of terms to include in the result set
    :return: ranked phrases with their MMR scores
    """
    # todo: use a precomputed cosine-similarity matrix for lookups among
    # phrases instead of calling cosine_similarity every time.
    s = []  # selected (phrase, score) pairs, in ranked order
    r = sorted(phrases, key=lambda x: x[1], reverse=True)
    r = [i[0] for i in r]  # keep only the phrase strings
    while len(r) > 0:
        score = 0
        phrase_to_add = ''
        for i in r:
            # similarity of the candidate to the query
            first_part = cosine_similarity([sentence_vector], [embedding_matrix.loc[i]])[0][0]
            # maximum similarity of the candidate to any already-selected phrase
            second_part = 0
            for j in s:
                cos_sim = cosine_similarity([embedding_matrix.loc[i]], [embedding_matrix.loc[j[0]]])[0][0]
                if cos_sim > second_part:
                    second_part = cos_sim
            equation_score = lambda_constant * first_part - (1 - lambda_constant) * second_part
            if equation_score > score:
                score = equation_score
                phrase_to_add = i
        if phrase_to_add == '':
            phrase_to_add = i  # fall back to the last candidate if no score beat 0
        r.remove(phrase_to_add)
        s.append((phrase_to_add, score))
    # Truncate to the requested number of terms. The original
    # `(s, s[:threshold_terms])[threshold_terms > len(s)]` had the condition
    # inverted and returned the full list whenever len(s) >= threshold_terms.
    return s[:threshold_terms]
```
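To see how `lambda_constant` trades relevance against redundancy, here is a minimal sketch of one MMR scoring step using only NumPy. The vectors, phrase names, and the helper `mmr_score` are all invented for illustration; the scoring line mirrors the `equation_score` expression in the function above.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-D embeddings, invented purely for illustration.
query = np.array([1.0, 0.0])
emb = {
    "phrase_a": np.array([0.9, 0.1]),  # near-duplicate of the query
    "phrase_b": np.array([0.8, 0.2]),  # similar to the query AND to phrase_a
    "phrase_c": np.array([0.1, 0.9]),  # off-topic but novel
}
selected = ["phrase_a"]  # suppose phrase_a was selected first

def mmr_score(cand, lam):
    first = cos(query, emb[cand])                           # relevance to the query
    second = max(cos(emb[cand], emb[s]) for s in selected)  # redundancy w.r.t. picks
    return lam * first - (1 - lam) * second

# High lambda favours relevance: the redundant phrase_b outranks phrase_c.
print(mmr_score("phrase_b", 0.7) > mmr_score("phrase_c", 0.7))  # True
# Low lambda favours diversity: the novel phrase_c outranks phrase_b.
print(mmr_score("phrase_c", 0.3) > mmr_score("phrase_b", 0.3))  # True
```

The same candidates rank differently at the two λ values, which is exactly the accuracy/diversity trade-off the docstring describes.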
Hey @AnubhavCR7, yes, in the paper it is written as `equation_score = lambda_constant * (first_part - (1 - lambda_constant) * second_part)`, but I think there is a typo in that equation.
Why?
Because any value of λ should give a mix of diversity and accuracy in the result set; the value of λ can be set based on the use case and your dataset. If you take the equation given in the paper and set λ = 1, it reduces to `equation_score = first_part`; but if you set λ = 0, the whole score collapses to 0, which should not be the case. That is why I modified the equation. Hope this answers your question.
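The difference between the two forms is easy to check numerically. A minimal sketch, with arbitrary similarity values invented for illustration:

```python
first_part, second_part = 0.9, 0.8  # arbitrary similarities, for illustration only

def paper(lam):
    # Equation as printed in the paper.
    return lam * (first_part - (1 - lam) * second_part)

def modified(lam):
    # Equation used in the gist above.
    return lam * first_part - (1 - lam) * second_part

print(paper(1.0), modified(1.0))  # both reduce to first_part: 0.9 0.9
print(paper(0.0), modified(0.0))  # paper collapses to 0.0; modified gives -second_part: 0.0 -0.8
```

At λ = 0 the paper's form ignores both terms entirely, while the modified form still penalises redundancy via `second_part`.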
I have written a blog post on MMR on Medium; here is the link, and don't forget to check the comments section :).
Hello @aditya00kumar,
I referred to the original MMR paper (http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf). According to the formula given in the paper, the code should be:
Please correct me if I am getting it wrong somewhere. Awaiting your response.
Regards.