As I’ve mentioned in our discussion/chat, this will be difficult and certainly imperfect.
I’ve tried running your sample PDF through the following automatic extractors:
and they both produced the same text, which completely loses the original structure:
1.
1.1.
1.1.1.
1.1.2.
Lorem ipsum dolor sit amet consectetur adipiscing elit.
Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie
lorem. Ut eleifend sagittis porta.
Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,
porttitor eget egestas in, tristique in urna. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. Etiam eleifend tincidunt volutpat. Curabitur eu enim
viverra, condimentum ex in, elementum est. Integer blandit arcu ex, at interdum orci viverra
in.
Now, the real PDF may be composed differently and the extractors may do better (🤞). But trying to move on…
The best I could do with that LoremIpsumSpecs.pdf sample was to open it in Acrobat Reader, choose Edit → More → Select All, then copy-paste into a text editor to get something like the following:
Specification for Project
1. Lorem ipsum dolor sit amet consectetur adipiscing elit.
1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie
lorem. Ut eleifend sagittis porta.
1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,
...
quis purus. Cras vitae dui fringilla libero posuere varius at et velit.
Specs That Those
Spec 1 High 2.1m
Spec 2 Low 0
Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt
...
3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.
...
which preserves the structure of the section numbers and paragraphs, as well as the table.
Does that resemble the text you’re getting in your R script?
If so, I would avoid trying to write one RegEx to capture a “paragraph”. Instead, iterate over the text line by line and use a little state machine that starts a new group of lines every time a section number is seen.
Here’s what I came up with, in Python:
import re

# Expect that section numbers delimit requirements. Look for a section number to be:
# line-start, followed by one or more groups of digits and a period, followed by an optional space
# e.g.: '1. ', '1.1.2. ', '1.9.9.9.9.9. '
Sect_no = re.compile(r"^(\d+\.)+ ?")

sections = []
with open("copy-pasted.txt") as txt_file:
    section_lines = []  # initialize empty list
    for line in txt_file:
        line = line.strip()
        if line == "":
            continue

        if Sect_no.match(line):
            if section_lines:  # ignore initial empty section_lines
                sections.append(section_lines)  # append last set of section lines
            section_lines = []  # reset for this new section

        section_lines.append(line)

# capture last section
if section_lines:
    sections.append(section_lines)
Running that against the copy-pasted text gives me this two-dimensional array of lines, split up by section:
[['Specification for Project'],
['1. Lorem ipsum dolor sit amet consectetur adipiscing elit.'],
['1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie',
'lorem. Ut eleifend sagittis porta.'],
['1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,',
...
'quis purus. Cras vitae dui fringilla libero posuere varius at et velit.',
'Specs That Those',
'Spec 1 High 2.1m',
'Spec 2 Low 0',
'Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt',
...
['3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.',
...
The state machine could use some work, such as filtering out ‘Specification for Project’; as written, it will also pick up any other stray lines such as headers, footers, or page numbers.
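One way to do that filtering is a small predicate that runs before the state machine sees each line. This is only a sketch: `NOISE_PATTERNS` and `is_noise` are hypothetical names, and the patterns themselves (a bare page number, a "Page N of M" footer, the sample's title line) are assumptions about what the real PDF produces.

```python
import re

# Hypothetical patterns for lines that are not requirement text.
# These are assumptions; adjust them to whatever the real PDF actually emits.
NOISE_PATTERNS = [
    re.compile(r"^\d+$"),                        # a bare page number, e.g. "12"
    re.compile(r"^Page \d+( of \d+)?$", re.I),   # a footer such as "Page 3 of 10"
    re.compile(r"^Specification for Project$"),  # the title line from the sample
]


def is_noise(line):
    """Return True if the (already stripped) line matches any noise pattern."""
    return any(pattern.match(line) for pattern in NOISE_PATTERNS)


sample = [
    "Specification for Project",
    "1. Lorem ipsum dolor sit amet consectetur adipiscing elit.",
    "Page 1 of 3",
    "1.1. Pellentesque a sodales arcu, sed feugiat nibh.",
    "2",
]
kept = [line for line in sample if not is_noise(line)]
print(kept)
```

In the line loop, the check would sit right after the empty-line test, as `if is_noise(line): continue`. Note that the bare-number rule would also drop a legitimate line that consists only of a number, so it may be too aggressive for documents with numeric table cells on their own lines.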
From here I’ll extract the section numbers, “reconstitute” the lines into paragraphs, and save it all to a CSV:
import csv

Row = {"Section No.": None, "Section paragraphs": None}

rows = []
for section_lines in sections:
    line0 = section_lines[0]
    match = Sect_no.match(line0)
    if not match:  # ignore initial header, or other first line that isn't a section
        continue

    sect_no = match.group(0).strip()

    # initialize paragraphs (likely multiple paras) with first line, minus section number
    # (slice off the match instead of str.replace, which would also remove later occurrences)
    paragraphs = line0[match.end():].strip()

    # build up section's paragraphs
    # (still don't know what an actual sentence is, or where one para ends and another (or a table) begins)
    for line in section_lines[1:]:
        paragraphs += "\n" + line

    # copy Row template and save to list of rows
    row = dict(Row)
    row["Section No."] = sect_no
    row["Section paragraphs"] = paragraphs
    rows.append(row)

with open("requirements.csv", "w", newline="") as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=list(Row))
    writer.writeheader()
    writer.writerows(rows)
When I run that, my requirements.csv looks something like the following:
+-------------+----------------------------------------------------+
| Section No. | Section paragraphs                                 |
+-------------+----------------------------------------------------+
| 1.          | Lorem ipsum dolor sit amet consectetur adipisci... |
+-------------+----------------------------------------------------+
| 1.1.        | Pellentesque a sodales arcu, sed feugiat nibh. ... |
|             | lorem. Ut eleifend sagittis porta.                 |
+-------------+----------------------------------------------------+
| 1.1.1.      | Integer sit amet consectetur erat. Duis sit ame... |
|             | porttitor eget egestas in, tristique in urna. C... |
|             | conubia nostra, per inceptos himenaeos. Etiam e... |
|             | viverra, condimentum ex in, elementum est. Inte... |
|             | in.                                                |
+-------------+----------------------------------------------------+
| 1.1.2.      | Interdum et malesuada fames ac ante ipsum primi... |
|             | ante consequat scelerisque. Donec non leo lorem... |
|             | condimentum. Aenean a tellus augue. Nullam veli... |
|             | quis purus. Cras vitae dui fringilla libero pos... |
|             | Specs That Those                                   |
|             | Spec 1 High 2.1m                                   |
|             | Spec 2 Low 0                                       |
...
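Because each "Section paragraphs" cell holds embedded newlines, it may be worth confirming that the file reads back as one row per section rather than one row per line. A minimal sketch, using an in-memory buffer as a stand-in for requirements.csv and two made-up rows modeled on the sample text:

```python
import csv
import io

fieldnames = ["Section No.", "Section paragraphs"]
rows = [
    {
        "Section No.": "1.",
        "Section paragraphs": "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
    },
    {
        "Section No.": "1.1.",
        "Section paragraphs": "Pellentesque a sodales arcu, sed feugiat nibh.\n"
        "lorem. Ut eleifend sagittis porta.",
    },
]

# Write to an in-memory buffer (a stand-in for requirements.csv).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read it back: the csv module quotes any field that contains a newline,
# so a multi-line cell comes back as a single field.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

print(len(read_back))
print(read_back[1]["Section paragraphs"].splitlines())
```

Whatever reads the CSV downstream needs to honor quoted multi-line fields in the same way; a tool that splits on raw line breaks would see the wrapped lines as separate, malformed rows.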