@lucaslouca
Last active November 1, 2018 07:10
Get <body>...</body> content from HTML file and write it to a file
import os
import re
import sys


def strip(path: str):
    """Write the <body> content of the HTML file at `path` to `<name>_body.txt`."""
    with open(path, "r") as html_file:
        content = html_file.read()

    # Locate the body; str.find returns -1 when a tag is missing
    body_start_index = content.find('<body>')
    body_end_index = content.find('</body>')
    if body_start_index == -1 or body_end_index == -1:
        raise ValueError("No <body>...</body> section found in %s" % path)
    body = content[body_start_index + len('<body>'):body_end_index]

    # Remove anchor links as well
    anchor = re.compile('<a class="anchor-link" .*?>.*?</a>')
    body = anchor.sub('', body)

    # Write body to new file
    new_file_path, _ = os.path.splitext(path)
    new_file_path += "_body.txt"
    with open(new_file_path, "w") as out:
        out.write(body)


def main():
    if len(sys.argv) <= 1:
        print("Usage: %s file" % sys.argv[0])
        return
    strip(sys.argv[1])


if __name__ == '__main__':
    main()
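
The script above matches the literal string `<body>`, so a tag written as `<BODY>` or carrying attributes (for example `<body class="...">`) is not found. The following is a minimal sketch of one way to loosen that, assuming a regex-based match is acceptable. `extract_body`, `BODY_RE` and `ANCHOR_RE` are hypothetical names that are not part of the original gist, and the pattern is not a full HTML parser: it will misfire if an attribute value itself contains `>`.

```python
import re

# Hypothetical helper, not part of the original gist.
# Matches <body ...> case-insensitively, with or without attributes,
# and captures everything up to the first closing </body> tag.
BODY_RE = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)

# Same anchor-link pattern as the gist.
ANCHOR_RE = re.compile(r'<a class="anchor-link" .*?>.*?</a>')


def extract_body(html: str) -> str:
    """Return the inner content of <body>, with anchor links removed."""
    match = BODY_RE.search(html)
    if match is None:
        raise ValueError("no <body>...</body> section found")
    return ANCHOR_RE.sub("", match.group(1))


if __name__ == "__main__":
    sample = (
        '<html><BODY class="nb"><h1>Title'
        '<a class="anchor-link" href="#Title">&#182;</a></h1></BODY></html>'
    )
    print(extract_body(sample))
```

Because `extract_body` takes and returns a string, it could replace the `find`-based slicing inside `strip` while the file reading and writing stay unchanged.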