X-Authentication-Warning: | delorie.com: mail set sender to geda-user-bounces using -f |
X-Recipient: | geda-user AT delorie DOT com |
MIME-Version: | 1.0 |
X-Received: | by 10.60.65.68 with SMTP id v4mr2758698oes.84.1441369660405; Fri, |
04 Sep 2015 05:27:40 -0700 (PDT) | |
In-Reply-To: | <20150904112133.85560809DB82@turkos.aspodata.se> |
References: | <CAOP4iL3YWQ_MH3HNnyDHMGCGeYFBmazwcw7Af_GATQzAUQJ57g AT mail DOT gmail DOT com> |
<alpine DOT DEB DOT 2 DOT 00 DOT 1509040545240 DOT 6924 AT igor2priv> | |
<20150904095423 DOT 31827809DB80 AT turkos DOT aspodata DOT se> | |
<alpine DOT DEB DOT 2 DOT 00 DOT 1509041305230 DOT 6924 AT igor2priv> | |
<20150904112133 DOT 85560809DB82 AT turkos DOT aspodata DOT se> | |
Date: | Fri, 4 Sep 2015 08:27:40 -0400 |
Message-ID: | <CAOFvGD4rf8e_4DCF8fjS5i3zXebjM_PiR3ebRhdfZPZ5LmrBsw@mail.gmail.com> |
Subject: | Re: [geda-user] Re: pdf table extraction |
From: | "Jason White (whitewaterssoftwareinfo AT gmail DOT com) [via geda-user AT delorie DOT com]" <geda-user AT delorie DOT com> |
To: | geda-user AT delorie DOT com |
Reply-To: | geda-user AT delorie DOT com |
Errors-To: | nobody AT delorie DOT com |
X-Mailing-List: | geda-user AT delorie DOT com |
X-Unsubscribes-To: | listserv AT delorie DOT com |
My absolute favorite for extracting data from tables in PDF datasheets
is Tabula (http://tabula.technology/); it has a nice interface.

On Fri, Sep 4, 2015 at 7:21 AM, <karl AT aspodata DOT se> wrote:
> Igor2:
> > On Fri, 4 Sep 2015, karl AT aspodata DOT se wrote:
> > > Igor2:
> > > [ about tables in pdf's ]
> > >
> > > It's true that pdf doesn't have a table structure.
> > >
> > > I have some experimental code to extract tables from pdf's, it is in:
> > >
> > >  http://turkos.aspodata.se/git/openhw/pdftosym/Experimental/
> >
> > Thanx, will check it out. What you wrote suggests your script works
> > similar to mine.
>
> Yes, but I got the impression you used the graphical elements in the
> file and that you possibly used pdftohtml in "html" mode, which doesn't
> give you the text positions. I have been working purely on the textual
> part.
>
> And beware that the code above is a big mess. Perhaps you can have a
> look at:
>
>  http://turkos.aspodata.se/computing/pdfextr.pl
>
> which is a little less unpolished, it extracts things from an invoice
> (sorry can't provide you with the input data example).
>
> Regards,
> /Karl Hammar
>
> -----------------------------------------------------------------------
> Aspö Data
> Lilla Aspö 148
> S-742 94 Östhammar
> Sweden
> +46 173 140 57

-- 
Jason White