Created
July 22, 2016 21:18
Basic reader for PARC XML files
from bs4 import BeautifulSoup as Soup


class AnnotatedText(object):
    """Read PARC XML into plain lists and dictionaries."""

    def __init__(self, parc_xml):
        # html.parser lowercases tag and attribute names, which is why the
        # lookups below use 'attributionrole' and 'rolevalue'.
        self.soup = Soup(parc_xml, 'html.parser')
        self.words = []      # every word in the document, in order
        self.sentences = []  # one {'words': [...]} dictionary per sentence

        sentence_tags = self.soup.find_all('sentence')
        for sentence_tag in sentence_tags:
            sentence = {'words': []}
            self.sentences.append(sentence)

            word_tags = sentence_tag.find_all('word')
            for word_tag in word_tags:
                word = {
                    'token': word_tag['text'],
                }

                # A word inside an attribution carries a nested tag giving
                # the attribution id and the word's role within it.
                attribution = word_tag.find('attribution')
                if attribution is not None:
                    word['attribution'] = {
                        'role': attribution.find('attributionrole')['rolevalue'],
                        'id': attribution['id']
                    }
                else:
                    word['attribution'] = None

                # The same dictionary is shared by the flat word list and
                # the sentence's own word list.
                self.words.append(word)
                sentence['words'].append(word)
Once you have an `AnnotatedText` object, it has a `sentences` property, which is a list of sentences. Each sentence is a dictionary with a `words` key, whose value is a list of all the words in the sentence. Words are dictionaries too: each has a `token` key (the original text of the word) and an `attribution` key. The `attribution` value is either `None` or a dictionary with a `role` and an `id`. The role identifies whether the word belongs to a cue, content, or source span.
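As a rough illustration of that structure, the sketch below walks a list shaped like `sentences` and groups tokens by attribution id and role. To stay runnable without a PARC file, it uses a hand-built list in place of a real `AnnotatedText` object. The helper name `collect_spans`, the id `attr_1`, and the role strings `source`, `cue`, and `content` are assumptions made for the example; the actual role values depend on what the XML contains.

```python
from collections import defaultdict

# Hand-built stand-in for AnnotatedText.sentences (assumed sample data).
# The role strings and the id 'attr_1' are illustrative, not taken from PARC.
sentences = [
    {'words': [
        {'token': 'Analysts', 'attribution': {'role': 'source', 'id': 'attr_1'}},
        {'token': 'said', 'attribution': {'role': 'cue', 'id': 'attr_1'}},
        {'token': 'prices', 'attribution': {'role': 'content', 'id': 'attr_1'}},
        {'token': 'rose', 'attribution': {'role': 'content', 'id': 'attr_1'}},
        {'token': '.', 'attribution': None},
    ]},
]


def collect_spans(sentences):
    """Group tokens by attribution id, then by role (hypothetical helper)."""
    spans = defaultdict(lambda: defaultdict(list))
    for sentence in sentences:
        for word in sentence['words']:
            attribution = word['attribution']
            if attribution is None:
                continue  # this word is not part of any attribution
            spans[attribution['id']][attribution['role']].append(word['token'])
    return spans


spans = collect_spans(sentences)
for attribution_id, roles in spans.items():
    for role, tokens in roles.items():
        print(attribution_id, role, ' '.join(tokens))
```

With a real file, the same helper should work on the `sentences` property of an `AnnotatedText` instance, provided the XML follows the tag layout the reader expects.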