Mail Archives: geda-user/2015/09/04/05:55:35
Igor2:
[ about tables in pdf's ]
It's true that pdf doesn't have a table structure.
I have some experimental code to extract tables from pdf's; it is in:
http://turkos.aspodata.se/git/openhw/pdftosym/Experimental/
///
If you use the "-xml" argument to pdftohtml, you get the positions of
the text. What's missing in the output below is text rotation; the
pdf below has vertical text in the headers. It could be useful to
patch pdftohtml to get that info. It would also be useful to know
the font metrics, so you can tell whether text elements are separated
by a simple space, i.e. belong to the same text, or by more, i.e.
possibly are in different columns.
Example:
pdftohtml -f 40 -l 51 -c -xml ~/Net/http/www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/DATASHEET/CD00237391.pdf a
generates a.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="40" position="absolute" top="0" left="0" height="842" width="595">
<fontspec id="0" size="7" family="Times" color="#000000"/>
<fontspec id="1" size="7" family="Times" color="#000000"/>
<fontspec id="2" size="5" family="Times" color="#000000"/>
<fontspec id="3" size="4" family="Times" color="#000000"/>
<fontspec id="4" size="-1" family="Times" color="#000000"/>
<fontspec id="5" size="3" family="Times" color="#000000"/>
<fontspec id="6" size="2" family="Times" color="#000000"/>
<fontspec id="7" size="5" family="Times" color="#000000"/>
<text top="60" left="67" width="132" height="9" font="0"><b>Pinouts and pin description</b></text>
<text top="60" left="390" width="138" height="9" font="0"><b>STM32F205xx, STM32F207xx</b></text>
...
<text top="708" left="326" width="13" height="7" font="2">BAT</text>
</page>
<page number="41" position="absolute" top="0" left="0" height="842" width="595">
...
<text top="137" left="170" width="0" height="8" font="0"><b>176</b></text>
</page>
</pdf2xml>
///
That is rather simple to parse, so you get one array with fontspecs (to
get the size) and one for the text with page number and position (and
font size).
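A minimal Python sketch of that parsing step, in case it helps the discussion. It is not the pdftosym code; the function name parse_pdf2xml and the dictionary keys are made up for illustration, and it assumes the output looks like the a.xml excerpt above (fontspec and text elements inside page elements, integer coordinates).

```python
import xml.etree.ElementTree as ET

def parse_pdf2xml(xml_data):
    """Return (fontspecs, texts) from 'pdftohtml -xml' style output.

    fontspecs maps font id -> size; texts is a list of dicts with
    page number, position, font size and the text itself.
    """
    root = ET.fromstring(xml_data)
    fontspecs = {}
    texts = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        # Fontspecs are declared per page; keep them all in one table.
        for spec in page.iter("fontspec"):
            fontspecs[spec.get("id")] = int(spec.get("size"))
        for t in page.iter("text"):
            texts.append({
                "page": page_no,
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                "size": fontspecs.get(t.get("font")),
                # itertext() also collects text wrapped in <b>...</b>
                "text": "".join(t.itertext()),
            })
    # Sort by page, then top, then left.
    texts.sort(key=lambda e: (e["page"], e["top"], e["left"]))
    return fontspecs, texts

# Small sample modelled on the a.xml excerpt above.
SAMPLE = b"""<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml>
<page number="40" position="absolute" top="0" left="0" height="842" width="595">
<fontspec id="0" size="7" family="Times" color="#000000"/>
<fontspec id="2" size="5" family="Times" color="#000000"/>
<text top="708" left="326" width="13" height="7" font="2">BAT</text>
<text top="60" left="67" width="132" height="9" font="0"><b>Pinouts and pin description</b></text>
</page>
<page number="41" position="absolute" top="0" left="0" height="842" width="595">
<text top="137" left="170" width="0" height="8" font="0"><b>176</b></text>
</page>
</pdf2xml>
"""

fontspecs, texts = parse_pdf2xml(SAMPLE)
```

On a real file you would read a.xml as bytes and pass its contents to the function.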
1. Sort the "text" elements by top and left.
2. Find the same text in the same position on different pages. That is
   the page header and footer, and it is probably not part of the
   table, so remove it and the page counter (use some heuristics to
   find that).
3. Since you have the top and height of the text elements, you can now
   find text elements that overlap vertically. They are your table
   lines. Sometimes you have to merge several lines, e.g. the last
   column could be multiline.
4. Use basically the same procedure to find the limits of the columns.
5. Possibly identify sub/superscripts and possibly remove them.
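The header removal and the search for table lines can be sketched roughly as below. This is only an illustration under some assumptions, not the Experimental/ code: the function names drop_repeated and group_rows are hypothetical, each text element is a dict with page, top, left, height and text keys, and "same text in same position" is taken as an exact match. A page counter changes from page to page, so it would need a separate heuristic that is not shown here.

```python
from collections import defaultdict

def drop_repeated(texts):
    """Drop elements whose (top, left, text) occurs on more than one page."""
    pages_seen = defaultdict(set)
    for e in texts:
        pages_seen[(e["top"], e["left"], e["text"])].add(e["page"])
    return [e for e in texts
            if len(pages_seen[(e["top"], e["left"], e["text"])]) == 1]

def group_rows(page_texts):
    """Group the elements of ONE page into table lines.

    An element joins the current line if its top lies above the
    bottom (top + height) of what has been collected so far.
    """
    rows = []
    bottom = None
    for e in sorted(page_texts, key=lambda e: (e["top"], e["left"])):
        if bottom is not None and e["top"] < bottom:
            rows[-1].append(e)
            bottom = max(bottom, e["top"] + e["height"])
        else:
            rows.append([e])
            bottom = e["top"] + e["height"]
    # Within a line, order the cells from left to right.
    for row in rows:
        row.sort(key=lambda e: e["left"])
    return rows

# Made-up elements: a header repeated on two pages plus a few cells.
texts = [
    {"page": 1, "top": 60, "left": 67, "height": 9, "text": "Pinouts"},
    {"page": 1, "top": 102, "left": 120, "height": 8, "text": "I/O"},
    {"page": 1, "top": 100, "left": 50, "height": 8, "text": "PA0"},
    {"page": 1, "top": 120, "left": 50, "height": 8, "text": "PA1"},
    {"page": 2, "top": 60, "left": 67, "height": 9, "text": "Pinouts"},
    {"page": 2, "top": 100, "left": 50, "height": 8, "text": "PB0"},
]

body = drop_repeated(texts)
rows = group_rows([e for e in body if e["page"] == 1])
row_texts = [[e["text"] for e in row] for row in rows]
```

The same overlap test applied to left and width instead of top and height gives a first guess at the column limits. Merging multiline cells and handling sub/superscripts are left out.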
///
With that procedure I could generate something resembling:
http://turkos.aspodata.se/git/openhw/pdftosym/stm32f100h.tbl
and then
http://turkos.aspodata.se/git/openhw/pdftosym/stm32f100h.pins
which I could use as input to
http://turkos.aspodata.se/git/openhw/pdftosym/symtopin.pl
to generate footprints with.
===================
It has been some time since I worked on "pdftosym"; maybe we could toss
some ideas around.
Regards,
/Karl Hammar
-----------------------------------------------------------------------
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57