From: karl AT aspodata DOT se
To: geda-user AT delorie DOT com
Subject: Re: [geda-user] pdf -> sym generator
Date: Tue, 21 Mar 2017 11:36:07 +0100 (CET)

John:
> On 03/20/2017 10:12 AM, karl AT aspodata DOT se wrote:
> > As a proof of concept I have made pdfextr.pl [1], which, with [2] as
> > the input data file, can produce [3]:
> >
> >  ./pdfextr.pl run=stm32 table=27,31 stm32f105r8.pdf > stm32f105r8.table
>
> So, the table=27,31 tells it which pages to use to extract the text from.

Yes. I could potentially search for the "list of tables" and a line
containing "table 5. Pin definition" or the like, follow the page
reference to the correct page, and use something that identifies the
end of the table.

> Sounds like a great start for making a symbol.

:)

> What kind of tables does it work on?

Mind you, it probably only works on the table in the file given above.
But for tables with a similar layout, I could identify package names
(LQFP100 etc.) in headers and adjust the logic to that.
Also, I could include logic to identify other header names with some
kind of dictionary.

The core code basically works on any table, though the program
pdftohtml, which provides a dump of the text together with bounding
boxes, sometimes groups together tokens that belong to different
columns, and it doesn't provide bounding boxes for rotated text.
So I would like to find another program, or adjust pdftohtml, to make
the cell-finding process easier; currently I have to include some
guessing code. Finding out where the table's lines go would also help
in assigning stray text to cells.

And I have the argument run=xxx, so you can switch the final
recognition and editing code.

> How do you recognize them from the pdf appearance
> in a pdf reader?

The program finds lines in the text, i.e. sequences of text that overlap
vertically, and then finds columns, i.e. parts of the lines that overlap
horizontally. That basically gives me the table cells; then I just have
to add heuristics for how the real world behaves...

Regards,
/Karl Hammar

-----------------------------------------------------------------------
Aspö Data
Lilla Aspö 148
S-742 94 Östhammar
Sweden
+46 173 140 57
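
P.S. The line/column finding described above could be sketched roughly
as follows. This is not the code in pdfextr.pl (which is Perl), only a
minimal Python illustration of the overlap idea. The Box class, its
field names and the helper functions are hypothetical; the fields are
loosely modeled on the left/top/width/height values that a text dump
with bounding boxes provides. It leaves out all the guessing code and
real-world heuristics mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One text fragment with its bounding box (hypothetical structure)."""
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self):
        return self.left + self.width

    @property
    def bottom(self):
        return self.top + self.height


def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into disjoint, sorted spans."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def find_span(spans, start, end):
    """Return the index of the span that overlaps the interval (start, end)."""
    for i, (s, e) in enumerate(spans):
        if start < e and end > s:
            return i
    raise ValueError("interval is not covered by any span")


def boxes_to_table(boxes):
    """Group boxes into rows (vertical overlap) and columns (horizontal overlap)."""
    rows = merge_intervals([(b.top, b.bottom) for b in boxes])
    cols = merge_intervals([(b.left, b.right) for b in boxes])
    table = [["" for _ in cols] for _ in rows]
    # Fill cells in reading order so text sharing a cell is joined sensibly.
    for b in sorted(boxes, key=lambda b: (b.top, b.left)):
        r = find_span(rows, b.top, b.bottom)
        c = find_span(cols, b.left, b.right)
        table[r][c] = (table[r][c] + " " + b.text).strip()
    return table
```

Calling boxes_to_table() on a list of Box objects returns a list of
rows, each a list of cell strings. A fragment that spans two columns
would merge those columns in this sketch, which is exactly the kind of
case where extra heuristics are needed.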