X-Authentication-Warning: delorie.com: mail set sender to geda-user-bounces using -f
X-Recipient: geda-user AT delorie DOT com
X-Mailer: exmh version 2.8.0 04/21/2012 (debian 1:2.8.0~rc1-2) with nmh-1.5
X-Exmh-Isig-CompType: repl
X-Exmh-Isig-Folder: inbox
From: karl AT aspodata DOT se
To: geda-user AT delorie DOT com
Subject: Re: [geda-user] pdf -> sym generator
In-reply-to: <14cbf107-7e21-3c6c-2810-850143387449@ecosensory.com>
References: <20170320161202 DOT 965CD8106DC1 AT turkos DOT aspodata DOT se> <14cbf107-7e21-3c6c-2810-850143387449 AT ecosensory DOT com>
Comments: In-reply-to "John Griessen (john AT ecosensory DOT com) [via geda-user AT delorie DOT com]" <geda-user AT delorie DOT com>
   message dated "Mon, 20 Mar 2017 19:12:56 -0600."
Mime-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Message-Id: <20170321103608.0B39A8106DCB@turkos.aspodata.se>
Date: Tue, 21 Mar 2017 11:36:07 +0100 (CET)
X-Virus-Scanned: ClamAV using ClamSMTP
Reply-To: geda-user AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: geda-user AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com
Precedence: bulk

John:
> On 03/20/2017 10:12 AM, karl AT aspodata DOT se wrote:
> > As a proof of concept I have made pdfextr.pl [1], with which, using
> > [2] as the input file, I can produce [3]:
> >
> > ./pdfextr.pl run=stm32 table=27,31 stm32f105r8.pdf > stm32f105r8.table
> 
> So, the table=27,31 tells it which pages to use to extract the text from.

Yes. I could potentially search for the "list of tables" and some line
containing "table 5. Pin definition" or the like, follow the page
reference to the correct page, and have something that identifies the
end of the table.
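
A minimal sketch of that lookup, in Python rather than the Perl of
pdfextr.pl. It assumes the "list of tables" has already been extracted
as plain text lines of the form "Table N. Title . . . page". The sample
lines and the name find_table_page are made up for illustration, and
real datasheets may lay this out differently.

```python
import re

# Assumed layout of a "list of tables" entry: number, title, dot leaders, page.
TOC_LINE = re.compile(r"^\s*Table\s+(\d+)\.\s+(.*?)[\s.]+(\d+)\s*$", re.IGNORECASE)


def find_table_page(toc_lines, title_pattern):
    """Return (table number, page number) of the first entry whose title
    matches title_pattern, or None if nothing matches."""
    want = re.compile(title_pattern, re.IGNORECASE)
    for line in toc_lines:
        m = TOC_LINE.match(line)
        if m and want.search(m.group(2)):
            return int(m.group(1)), int(m.group(3))
    return None


# Hypothetical list-of-tables lines, not copied from any real datasheet.
toc = [
    "Table 4. Timer feature comparison . . . . . . . 18",
    "Table 5. Pin definitions . . . . . . . . . . . 27",
    "Table 6. Voltage characteristics. . . . . . . . 35",
]
hit = find_table_page(toc, r"pin definition")
```

Finding the end of the table is not covered here; that would still need
its own heuristic.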

> Sounds like a great start for making a symbol.

:)

> What kind of tables does it work on?

Mind you, it probably only works on the table in the file given above.
But for tables with a similar layout, I could identify package names
(LQFP100 etc.) in headers and adjust the logic to that. I could also
include logic to identify other header names with some kind of
dictionary.

The core code basically works on any table, though the program
pdftohtml, which provides a dump of the text together with bounding
boxes, sometimes groups together tokens that belong to different
columns, and it doesn't provide bounding boxes for rotated text.
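
For illustration, here is a small Python sketch of reading such a dump.
It assumes pdftohtml is run in its XML output mode (the -xml option),
where each text fragment is a <text> element with top, left, width and
height attributes. Whether pdfextr.pl uses that mode is not stated
above, and the sample XML and the name read_boxes are invented for the
example.

```python
import xml.etree.ElementTree as ET

# Invented sample in the shape of "pdftohtml -xml" output (an assumption).
SAMPLE = """<pdf2xml>
<page number="27" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="50" width="30" height="12" font="0">PA0</text>
<text top="101" left="200" width="40" height="12" font="0"><b>I/O</b></text>
</page>
</pdf2xml>"""


def read_boxes(xml_text):
    """Return one dict per text fragment with its page and bounding box."""
    boxes = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        for el in page.iter("text"):
            left, top = int(el.get("left")), int(el.get("top"))
            boxes.append({
                "page": int(page.get("number")),
                "left": left,
                "top": top,
                "right": left + int(el.get("width")),
                "bottom": top + int(el.get("height")),
                # itertext() also picks up text inside <b>, <i> and similar.
                "text": "".join(el.itertext()).strip(),
            })
    return boxes


boxes = read_boxes(SAMPLE)
```

Fragments that span two columns would still come out as a single box
here, so the guessing step is not avoided by this.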

So I would like to find another program, or adjust pdftohtml, so that
the cell-finding process would be easier; currently I have to include
some guessing code. Finding out where the lines go would also be
helpful in assigning stray text to cells.

And I have the argument run=xxx, so you can switch the final
recognition and editing code.

>  How do you recognize them from the pdf appearance
> in a pdf reader?

The program finds lines in the text, i.e. sequences of text that overlap
vertically, and then finds columns, i.e. parts of the lines that
overlap horizontally. That basically gives me the table cells; then I
just have to add heuristics for how the real world behaves...
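
The overlap idea can be sketched roughly as follows. This is a
simplified Python illustration, not the code in pdfextr.pl: it assumes
every text fragment is a dict with left/right/top/bottom coordinates,
and it computes the columns over all boxes at once rather than per
line. The names group_by_overlap and to_grid, and the sample boxes, are
made up.

```python
def group_by_overlap(boxes, lo, hi):
    """Merge boxes whose [lo, hi) intervals overlap into groups.
    Each group is [start, end, members]."""
    groups = []
    for box in sorted(boxes, key=lambda b: b[lo]):
        if groups and box[lo] < groups[-1][1]:
            # Overlaps the current group: extend it.
            groups[-1][1] = max(groups[-1][1], box[hi])
            groups[-1][2].append(box)
        else:
            groups.append([box[lo], box[hi], [box]])
    return groups


def to_grid(boxes):
    """Rows are groups that overlap vertically, columns are groups that
    overlap horizontally; a cell holds the boxes that fall in both."""
    rows = group_by_overlap(boxes, "top", "bottom")
    cols = group_by_overlap(boxes, "left", "right")
    grid = []
    for _, _, members in rows:
        row = []
        for c0, c1, _ in cols:
            cell = [b["text"] for b in members if c0 <= b["left"] < c1]
            row.append(" ".join(cell))
        grid.append(row)
    return grid


# Made-up boxes: two rows, two columns, slightly misaligned.
boxes = [
    {"text": "PA0", "left": 50, "right": 80, "top": 100, "bottom": 112},
    {"text": "I/O", "left": 200, "right": 240, "top": 101, "bottom": 113},
    {"text": "PA1", "left": 52, "right": 82, "top": 120, "bottom": 132},
    {"text": "I/O", "left": 198, "right": 238, "top": 121, "bottom": 133},
]
grid = to_grid(boxes)
```

A header that spans several columns would merge those columns into one
in this sketch, which is one place where the extra heuristics come in.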

Regards,
/Karl Hammar

-----------------------------------------------------------------------
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57