Mail Archives: geda-user/2015/09/04/05:55:35

X-Authentication-Warning: delorie.com: mail set sender to geda-user-bounces using -f

X-Recipient: geda-user AT delorie DOT com

X-Mailer: exmh version 2.8.0 04/21/2012 (debian 1:2.8.0~rc1-2) with nmh-1.5

X-Exmh-Isig-CompType: repl

X-Exmh-Isig-Folder: inbox

From: karl AT aspodata DOT se

To: geda-user AT delorie DOT com

Subject: pdf table extraction (was Re: [geda-user] Interesting blog post from a commercial EDA vendor - pdf)

In-reply-to: <alpine.DEB.2.00.1509040545240.6924@igor2priv>

References: <CAOP4iL3YWQ_MH3HNnyDHMGCGeYFBmazwcw7Af_GATQzAUQJ57g AT mail DOT gmail DOT com> <alpine DOT DEB DOT 2 DOT 00 DOT 1509040545240 DOT 6924 AT igor2priv>

Comments: In-reply-to gedau AT igor2 DOT repo DOT hu

message dated "Fri, 04 Sep 2015 06:00:42 +0200."

Mime-Version: 1.0

Message-Id: <20150904095423.31827809DB80@turkos.aspodata.se>

Date: Fri, 4 Sep 2015 11:54:22 +0200 (CEST)

X-Virus-Scanned: ClamAV using ClamSMTP

Reply-To: geda-user AT delorie DOT com

Errors-To: nobody AT delorie DOT com

X-Mailing-List: geda-user AT delorie DOT com

X-Unsubscribes-To: listserv AT delorie DOT com

Igor2:
[ about tables in pdf's ]

It's true that pdf doesn't have a table structure.

I have some experimental code to extract tables from pdf's; it is in:

  http://turkos.aspodata.se/git/openhw/pdftosym/Experimental/

///

If you use the "-xml" argument to pdftohtml, you get the positions of
the text. What's missing in the output below is text rotation; the
pdf below has vertical text in the headers. It could be useful to
patch pdftohtml to get that info. Also it would be useful to know
the font metrics, so you'll know whether text elements are separated by
a simple space, i.e. belong to the same text, or by more, i.e. possibly
are in different columns.

Example:

pdftohtml -f 40 -l 51 -c -xml ~/Net/http/www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/DATASHEET/CD00237391.pdf a

generates a.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="40" position="absolute" top="0" left="0" height="842" width="595">
        <fontspec id="0" size="7" family="Times" color="#000000"/>
        <fontspec id="1" size="7" family="Times" color="#000000"/>
        <fontspec id="2" size="5" family="Times" color="#000000"/>
        <fontspec id="3" size="4" family="Times" color="#000000"/>
        <fontspec id="4" size="-1" family="Times" color="#000000"/>
        <fontspec id="5" size="3" family="Times" color="#000000"/>
        <fontspec id="6" size="2" family="Times" color="#000000"/>
        <fontspec id="7" size="5" family="Times" color="#000000"/>
<text top="60" left="67" width="132" height="9" font="0"><b>Pinouts and pin description</b></text>
<text top="60" left="390" width="138" height="9" font="0"><b>STM32F205xx, STM32F207xx</b></text>
...
<text top="708" left="326" width="13" height="7" font="2">BAT</text>
</page>
<page number="41" position="absolute" top="0" left="0" height="842" width="595">
...
<text top="137" left="170" width="0" height="8" font="0"><b>176</b></text>
</page>
</pdf2xml>

///

That is rather simple to parse, so you get one array with fontspecs (to 
get the size) and one for the text with page number and position (and 
font size).
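
As a minimal sketch of that parsing step (in Python, and not the code
from the repository above; the function name parse_pdf2xml and the dict
keys are only illustrative), something like this gives you the two
structures. The sample input is cut down from the a.xml shown earlier.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample of pdftohtml -xml output (lines taken from a.xml above).
SAMPLE = """<pdf2xml>
<page number="40" position="absolute" top="0" left="0" height="842" width="595">
<fontspec id="0" size="7" family="Times" color="#000000"/>
<fontspec id="2" size="5" family="Times" color="#000000"/>
<text top="60" left="67" width="132" height="9" font="0"><b>Pinouts and pin description</b></text>
<text top="708" left="326" width="13" height="7" font="2">BAT</text>
</page>
</pdf2xml>"""


def parse_pdf2xml(xml_text):
    """Return (fontspecs, texts) parsed from pdftohtml -xml output."""
    root = ET.fromstring(xml_text)
    fontspecs = {}  # font id (string) -> font size
    texts = []      # one dict per <text> element
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for spec in page.iter("fontspec"):
            fontspecs[spec.get("id")] = int(spec.get("size"))
        for t in page.iter("text"):
            texts.append({
                "page": page_no,
                "top": int(t.get("top")),
                "left": int(t.get("left")),
                "width": int(t.get("width")),
                "height": int(t.get("height")),
                "size": fontspecs.get(t.get("font")),
                # itertext() also collects text nested inside <b>, <i>, ...
                "text": "".join(t.itertext()).strip(),
            })
    return fontspecs, texts


fontspecs, texts = parse_pdf2xml(SAMPLE)
for t in texts:
    print(t["page"], t["top"], t["left"], t["size"], t["text"])
```

For a real file you would read a.xml and pass its contents to the
function instead of SAMPLE.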

sort the "text" elements by top and left

find the same text in the same positions on different pages, that's the
page header and footer, and it's probably not part of the table, so
remove that and the page counter (use some heuristics to find that)
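
A rough sketch of the header/footer part, assuming the text elements are
dicts with "page", "top", "left" and "text" keys (those names, and the
function drop_repeated, are made up for this example). It does not try
to find the page counter, since that text differs from page to page and
needs its own heuristic.

```python
from collections import defaultdict


def drop_repeated(texts, min_pages=2):
    """Remove elements whose (top, left, text) recurs on min_pages or more pages."""
    pages_seen = defaultdict(set)
    for t in texts:
        pages_seen[(t["top"], t["left"], t["text"])].add(t["page"])
    return [
        t for t in texts
        if len(pages_seen[(t["top"], t["left"], t["text"])]) < min_pages
    ]


# The header line appears at the same place on two pages, "BAT" only once.
example = [
    {"page": 40, "top": 60, "left": 67, "text": "Pinouts and pin description"},
    {"page": 41, "top": 60, "left": 67, "text": "Pinouts and pin description"},
    {"page": 40, "top": 708, "left": 326, "text": "BAT"},
]
for t in drop_repeated(example):
    print(t["page"], t["text"])
```

If headers alternate between left and right on odd and even pages, they
still repeat every second page, so min_pages=2 should catch them as long
as you process more than two pages.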

since you have top and height of the text elements, you now can find
text elements that overlap vertically - they are your table lines
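
One way to sketch this, assuming the elements of a single page are dicts
with "top", "left", "height" and "text" keys (the function name
group_rows and the sample values are made up): sort by top and left,
then keep adding elements to the current line while their top lies above
the lower edge of that line.

```python
def group_rows(texts):
    """Group the text elements of one page into lines by vertical overlap."""
    rows = []
    bottom = None  # lower edge of the line currently being built
    for t in sorted(texts, key=lambda t: (t["top"], t["left"])):
        if bottom is not None and t["top"] < bottom:
            rows[-1].append(t)
            bottom = max(bottom, t["top"] + t["height"])
        else:
            rows.append([t])
            bottom = t["top"] + t["height"]
    for row in rows:
        row.sort(key=lambda t: t["left"])  # left-to-right inside each line
    return rows


# Made-up elements: two table lines whose tops differ by a pixel or so.
example = [
    {"top": 100, "left": 50, "height": 8, "text": "1"},
    {"top": 101, "left": 120, "height": 8, "text": "PE2"},
    {"top": 100, "left": 200, "height": 8, "text": "I/O"},
    {"top": 112, "left": 50, "height": 8, "text": "2"},
    {"top": 113, "left": 120, "height": 8, "text": "PE3"},
    {"top": 112, "left": 200, "height": 8, "text": "I/O"},
]
for row in group_rows(example):
    print([t["text"] for t in row])
```

Note that the overlap test chains: one tall element can pull several
real lines into one, so you may want a stricter test for dense tables.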

sometimes you have to merge more lines, e.g. the last col. could be 
multiline.

basically use the same procedure to find the limits of the columns.
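
A sketch of the column version, assuming dicts with "left" and "width"
keys (column_spans and column_index are hypothetical names): merge the
horizontal extents of all elements in the table, and what remains are
the column limits. Since pdftohtml sometimes reports width="0", as in
the last text element of the a.xml example, the width is clamped to at
least 1 here.

```python
def column_spans(texts):
    """Merge horizontally overlapping [left, right) extents into column spans."""
    spans = []
    for t in sorted(texts, key=lambda t: t["left"]):
        left = t["left"]
        right = left + max(t["width"], 1)  # guard against width="0"
        if spans and left < spans[-1][1]:
            spans[-1][1] = max(spans[-1][1], right)
        else:
            spans.append([left, right])
    return [tuple(s) for s in spans]


def column_index(t, spans):
    """Index of the column span that contains the element's left edge."""
    for i, (left, right) in enumerate(spans):
        if left <= t["left"] < right:
            return i
    raise ValueError("element lies outside all column spans")


# Made-up extents: three columns, the first two with slightly ragged edges.
example = [
    {"left": 50, "width": 5}, {"left": 52, "width": 10},
    {"left": 120, "width": 15}, {"left": 118, "width": 20},
    {"left": 200, "width": 12},
]
spans = column_spans(example)
print(spans)
print([column_index(t, spans) for t in example])
```

A wide element that spans several columns (a merged header cell, say)
will glue those columns together, so such elements are better left out
when computing the spans.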

possibly identify sub/superscripts and possibly remove them
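
One possible heuristic, and it is only a guess at what works: treat
elements whose font size is clearly smaller than the most common size in
the same line as sub/superscripts. The sketch assumes each element is a
dict with "size" and "text" keys; the name strip_small_text and the 0.8
ratio are arbitrary choices.

```python
from collections import Counter


def strip_small_text(row, ratio=0.8):
    """Drop elements much smaller than the most common font size in the line."""
    if not row:
        return []
    main_size = Counter(t["size"] for t in row).most_common(1)[0][0]
    # Anything below ratio * main_size is assumed to be a sub/superscript.
    return [t for t in row if t["size"] >= ratio * main_size]


# Made-up line: a footnote marker set in a smaller font than the rest.
example = [
    {"text": "PA0", "size": 7},
    {"text": "(1)", "size": 4},
    {"text": "I/O", "size": 7},
]
print([t["text"] for t in strip_small_text(example)])
```

Elements that use a fontspec with size="-1", like id 4 in the example
above, would also be dropped by this test, so check those separately.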

///

With that procedure I could generate something resembling:

 http://turkos.aspodata.se/git/openhw/pdftosym/stm32f100h.tbl
and then
 http://turkos.aspodata.se/git/openhw/pdftosym/stm32f100h.pins

which I could use as input to

 http://turkos.aspodata.se/git/openhw/pdftosym/symtopin.pl

to generate footprints with.

===================

It has been some time since I worked on "pdftosym"; maybe we could toss
some ideas around.

Regards,
/Karl Hammar

-----------------------------------------------------------------------
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57
