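-- Replace the space after a sentence-ending period with a soft line break,
-- so that each sentence can end up on its own line.
local function sentence_lines (el)
  local inlines = el.content
  for i = 2, #inlines do
    -- A Space directly after a Str ending in '.' marks a sentence boundary.
    if inlines[i].t == 'Space' and
       inlines[i-1].t == 'Str' and
       inlines[i-1].text:match '%.$' then
      inlines[i] = pandoc.SoftBreak()
    end
  end
  return el
end

-- The filters run in order: first normalize existing soft breaks to spaces,
-- then re-break paragraphs and plain blocks at sentence ends.
return {
  {SoftBreak = function () return pandoc.Space() end},
  {Para = sentence_lines},
  {Plain = sentence_lines},
}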
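
To see what the loop does without running pandoc, here is a minimal sketch that applies the same logic to plain Lua tables. The `Str`, `Space`, and `SoftBreak` helpers are hypothetical stand-ins for pandoc's constructors, not the real API; they only mimic the `t` and `text` fields the filter reads.

```lua
-- Hypothetical stand-ins for pandoc's inline constructors, so the loop can
-- run in a plain Lua interpreter. Inside pandoc these are provided for you.
local function Str(text) return {t = 'Str', text = text} end
local function Space() return {t = 'Space'} end
local function SoftBreak() return {t = 'SoftBreak'} end

-- Same logic as the filter's sentence_lines, using the stand-ins above.
local function sentence_lines(el)
  local inlines = el.content
  for i = 2, #inlines do
    if inlines[i].t == 'Space'
        and inlines[i-1].t == 'Str'
        and inlines[i-1].text:match('%.$') then
      inlines[i] = SoftBreak()
    end
  end
  return el
end

-- Represents the text "One. Two words."
local para = {t = 'Para', content = {
  Str('One.'), Space(), Str('Two'), Space(), Str('words.'),
}}
sentence_lines(para)

-- Collect the element tags to show which Space was replaced.
local tags = {}
for _, inline in ipairs(para.content) do
  tags[#tags + 1] = inline.t
end
print(table.concat(tags, ' '))  -- prints "Str SoftBreak Str Space Str"
```

Only the space after `One.` becomes a soft break; the space inside the second sentence is left alone. With the real filter, something like `pandoc --lua-filter sentence-lines.lua --wrap=preserve -t markdown input.md` should then write each sentence on its own line (both file names here are placeholders).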
@bpj Yes, of course: with the much more aggressive approach of not leaving sentences on the table, there will be false positives (I know I'll have to tackle some abbreviation issues at some point), but with the original I was getting hundreds of paragraphs in a book that had 2-10 sentences not split up. I'll definitely be looking into lpeg.utfR, because better locale-dependent case detection will be important.
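
One possible way to approach the abbreviation false positives mentioned above is a small skip list that is checked before a period counts as a sentence end. This is only a sketch: the `abbreviations` table and the `ends_sentence` helper are hypothetical names, and the entries are illustrative rather than taken from any of the filters discussed here.

```lua
-- Hypothetical skip list; a real one would be per-language and much longer.
local abbreviations = {
  ['e.g.'] = true, ['i.e.'] = true, ['etc.'] = true, ['Dr.'] = true,
}

-- Returns true if `text` ends in a period and is not a listed abbreviation.
local function ends_sentence(text)
  return text:match('%.$') ~= nil and not abbreviations[text]
end

print(ends_sentence('done.'))  -- true
print(ends_sentence('Dr.'))    -- false
print(ends_sentence('word'))   -- false
```

In the filter, the `text:match '%.$'` test could be swapped for a call like `ends_sentence(inlines[i-1].text)`. The trade-off is the reverse kind of error: an abbreviation such as `etc.` that really does end a sentence would no longer trigger a break.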
@tarleb Yes, definitely, I had that in mind already. At the moment, though, I'm going to be rolling it out to a few dozen book projects in two languages over the next few weeks/month, and it will be easier to iterate on in conjunction with the other normalization stuff I use. When it gets a little more mature and can move at its own pace, it definitely should land in its own project.
@alerque Nice! Maybe this could be extracted into a separate project at some point?