r – Reading a specification document (PDF) with paragraphs and tables into a spreadsheet

As I’ve mentioned in our discussion/chat, this will be difficult and certainly imperfect.

I’ve tried running your sample PDF through the following automatic extractors:

and they both produced the same text, which completely loses the original structure:

 1. 
 1.1. 

 1.1.1. 

 1.1.2. 

 Lorem ipsum dolor sit amet consectetur adipiscing elit. 
 Pellentesque   a   sodales   arcu,   sed  feugiat  nibh.  Pellentesque  at  fermentum  odio,  a  molestie 
 lorem. Ut eleifend sagittis porta. 
 Integer   sit   amet   consectetur   erat.   Duis   sit   amet   urna   quam.   Pellentesque   turpis   tortor, 
 porttitor   eget  egestas  in,  tristique  in  urna.  Class  aptent  taciti  sociosqu  ad  litora  torquent  per 
 conubia  nostra,  per  inceptos  himenaeos.  Etiam  eleifend  tincidunt  volutpat.  Curabitur  eu  enim 
 viverra,  condimentum  ex  in,  elementum  est.  Integer  blandit  arcu  ex,  at  interdum  orci  viverra 
 in.

Now, the real PDF may be composed differently and the extractors may do better (🤞). But trying to move on…

The best I could do with that LoremIpsumSpecs.pdf sample was to just open it in Acrobat Reader, choose Edit → More → Select All, then copy-paste into a text editor to get something like the following:

Specification for Project
1. Lorem ipsum dolor sit amet consectetur adipiscing elit.
1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie
lorem. Ut eleifend sagittis porta.
1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,
...
quis purus. Cras vitae dui fringilla libero posuere varius at et velit.
Specs That Those
Spec 1 High 2.1m
Spec 2 Low 0
Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt
...
3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.
...

which preserves the structure of the section numbers and paragraphs, as well as the table.

Does that resemble the text you’re getting in your R script?

If so, I would avoid trying to write one RegEx to capture a “paragraph”. Instead, try to iterate the text line-by-line and use a little state machine to collect lines for every section number that’s seen.

Here’s what I came up with, in Python:

import re

# Expect that section numbers delimit requirements.  Look for a section number to be:
#  line-start, followed by some number of a digit and a period, followed by an optional space
#  e.g.: '1. ', '1.1.2. ', '1.9.9.9.9.9. '
Sect_no = re.compile(r"^(\d\.){1,} ?")

sections = []
with open("copy-pasted.txt") as txt_file:
    section_lines = []  # initialize empty list

    for line in txt_file:
        line = line.strip()

        if line == "":
            continue

        if Sect_no.match(line):
            if section_lines:  # ignore the initial, empty section_lines
                sections.append(section_lines)  # append last set of section lines
            section_lines = []  # reset for this new section

        section_lines.append(line)

# capture last section
if section_lines:
    sections.append(section_lines)

Running that against the copy-pasted text gives me this two-dimensional array of lines, split up by section:

[['Specification for Project'],
 ['1. Lorem ipsum dolor sit amet consectetur adipiscing elit.'],
 ['1.1. Pellentesque a sodales arcu, sed feugiat nibh. Pellentesque at fermentum odio, a molestie',
  'lorem. Ut eleifend sagittis porta.'],
 ['1.1.1. Integer sit amet consectetur erat. Duis sit amet urna quam. Pellentesque turpis tortor,',
 ...
  'quis purus. Cras vitae dui fringilla libero posuere varius at et velit.',
  'Specs That Those',
  'Spec 1 High 2.1m',
  'Spec 2 Low 0',
  'Nunc magna urna, sagittis sit amet interdum quis, finibus non dui. In pharetra risus tincidunt',
 ...
 ['3. Nunc eget maximus dolor. Integer orci purus, ultrices quis fringilla sit amet, blandit non erat.',
 ...
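To sanity-check the section-number pattern on its own, here is a small sketch that runs it against a few sample lines. The `multi_digit` variant and the `section_no` helper are my own additions, not part of the script above: as written, the pattern expects exactly one digit per level, so a line starting with '10.' would not be recognized and '1.10.' would be cut short at '1.'. If your real spec has ten or more sections at any level, something like the variant may be needed.

```python
import re

# Pattern as described above: one or more "digit + period" groups at line start.
single_digit = re.compile(r"^(\d\.){1,} ?")

# Hypothetical variant (an assumption, not from the original script):
# allow several digits per level, e.g. '12.' or '1.10.'
multi_digit = re.compile(r"^(\d+\.)+ ?")

samples = [
    "1. Lorem ipsum",
    "1.1.2. Integer sit amet",
    "10. Tenth section",
    "1.10. Nested tenth",
    "Spec 1 High 2.1m",
    "lorem. Ut eleifend sagittis porta.",
]


def section_no(pattern, line):
    """Return the matched section number (trailing space removed), or None."""
    m = pattern.match(line)
    return m.group(0).strip() if m else None


for line in samples:
    print(repr(line), section_no(single_digit, line), section_no(multi_digit, line))
```

Continuation lines and table rows that start with a word are left alone by both patterns; only the handling of multi-digit levels differs.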

The machine can use some work, like filtering out ‘Specification for Project’; it will also pick up any other lines like headers, footers, or page counts.
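One way to handle that is to drop known noise lines before they reach the state machine. This is only a sketch: apart from the 'Specification for Project' title, the patterns below are guesses at what headers, footers, and page counts might look like, and `NOISE_PATTERNS`, `is_noise`, and `clean_lines` are names I made up for illustration.

```python
import re

# Hypothetical noise patterns; adjust them to whatever your real PDF repeats.
NOISE_PATTERNS = [
    re.compile(r"^Specification for Project$"),          # document title / running header
    re.compile(r"^Page \d+( of \d+)?$", re.IGNORECASE),  # assumed page-count format
    re.compile(r"^\d+$"),                                # bare page number (assumption)
]


def is_noise(line):
    """True if the stripped line matches any of the noise patterns."""
    return any(p.match(line) for p in NOISE_PATTERNS)


def clean_lines(lines):
    """Yield stripped, non-empty lines that are not recognized as noise."""
    for raw in lines:
        line = raw.strip()
        if line and not is_noise(line):
            yield line
```

With that in place, the loop could iterate `clean_lines(txt_file)` instead of stripping and skipping blank lines itself. Be careful with the bare-number pattern: it would also drop a table cell that happens to sit alone on a line.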

From here I’ll extract the section numbers, “reconstitute” the lines into paragraphs, and save it all to a CSV:

import csv

Row = {"Section No.": None, "Section paragraphs": None}

rows = []
for section_lines in sections:

    line0 = section_lines[0]
    match = Sect_no.match(line0)

    if not match:  # ignore initial header, or other first line that isn't a section
        continue

    sect_no = match.group(0).strip()

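    # initialize paragraphs (likely multiple paras) with first line, minus section number
    # (slice off the match rather than using replace(), which would also remove
    #  the same digits if they appear later in the line)
    paragraphs = line0[match.end():].strip()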

    # build up section's paragraphs
    # (still don't know what an actual sentence is, or where one para ends and another (or a table) begins)
    for line in section_lines[1:]:
        paragraphs += "\n" + line

    # copy Row template and save to list of rows
    row = dict(Row)
    row["Section No."] = sect_no
    row["Section paragraphs"] = paragraphs
    rows.append(row)

with open("requirements.csv", "w", newline="") as csv_out:
    writer = csv.DictWriter(csv_out, fieldnames=Row)
    writer.writeheader()
    writer.writerows(rows)

When I run that, my requirements.csv looks something like the following:

+-------------+----------------------------------------------------+
| Section No. | Section paragraphs                                 |
+-------------+----------------------------------------------------+
| 1.          | Lorem ipsum dolor sit amet consectetur adipisci... |
+-------------+----------------------------------------------------+
| 1.1.        | Pellentesque a sodales arcu, sed feugiat nibh. ... |
|             | lorem. Ut eleifend sagittis porta.                 |
+-------------+----------------------------------------------------+
| 1.1.1.      | Integer sit amet consectetur erat. Duis sit ame... |
|             | porttitor eget egestas in, tristique in urna. C... |
|             | conubia nostra, per inceptos himenaeos. Etiam e... |
|             | viverra, condimentum ex in, elementum est. Inte... |
|             | in.                                                |
+-------------+----------------------------------------------------+
| 1.1.2.      | Interdum et malesuada fames ac ante ipsum primi... |
|             | ante consequat scelerisque. Donec non leo lorem... |
|             | condimentum. Aenean a tellus augue. Nullam veli... |
|             | quis purus. Cras vitae dui fringilla libero pos... |
|             | Specs That Those                                   |
|             | Spec 1 High 2.1m                                   |
|             | Spec 2 Low 0                                       |
...
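The multi-line cells survive because the csv module quotes any field containing a newline. If you want to convince yourself before loading the file elsewhere, a quick round-trip check along these lines should do. It is a sketch that uses an in-memory buffer and two shortened sample rows rather than the real requirements.csv.

```python
import csv
import io

fieldnames = ["Section No.", "Section paragraphs"]

# Two shortened sample rows; the second has an embedded newline.
rows = [
    {"Section No.": "1.",
     "Section paragraphs": "Lorem ipsum dolor sit amet consectetur adipiscing elit."},
    {"Section No.": "1.1.",
     "Section paragraphs": "Pellentesque a sodales arcu, sed feugiat nibh.\n"
                           "lorem. Ut eleifend sagittis porta."},
]

# newline="" mirrors how the csv docs recommend opening files for the csv module
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read the same text back and compare with what was written
buffer.seek(0)
read_back = list(csv.DictReader(buffer))
print(read_back == rows)
```

Whatever reads the CSV afterwards needs to honor quoted newlines as well, otherwise each wrapped line will show up as a separate, broken record.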
