Mail Archives: geda-user/2015/09/04/08:27:56

MIME-Version: 1.0
X-Received: by 10.60.65.68 with SMTP id v4mr2758698oes.84.1441369660405; Fri,
04 Sep 2015 05:27:40 -0700 (PDT)
In-Reply-To: <20150904112133.85560809DB82@turkos.aspodata.se>
References: <CAOP4iL3YWQ_MH3HNnyDHMGCGeYFBmazwcw7Af_GATQzAUQJ57g AT mail DOT gmail DOT com>
<alpine DOT DEB DOT 2 DOT 00 DOT 1509040545240 DOT 6924 AT igor2priv>
<20150904095423 DOT 31827809DB80 AT turkos DOT aspodata DOT se>
<alpine DOT DEB DOT 2 DOT 00 DOT 1509041305230 DOT 6924 AT igor2priv>
<20150904112133 DOT 85560809DB82 AT turkos DOT aspodata DOT se>
Date: Fri, 4 Sep 2015 08:27:40 -0400
Message-ID: <CAOFvGD4rf8e_4DCF8fjS5i3zXebjM_PiR3ebRhdfZPZ5LmrBsw@mail.gmail.com>
Subject: Re: [geda-user] Re: pdf table extraction
From: "Jason White (whitewaterssoftwareinfo AT gmail DOT com) [via geda-user AT delorie DOT com]" <geda-user AT delorie DOT com>
To: geda-user AT delorie DOT com
Reply-To: geda-user AT delorie DOT com
Errors-To: nobody AT delorie DOT com
X-Mailing-List: geda-user AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

--001a11c1cb68993c37051eeb0682
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

My absolute favorite tool for extracting data from tables in PDF datasheets
is Tabula (http://tabula.technology/); it has a nice interface.

On Fri, Sep 4, 2015 at 7:21 AM, <karl AT aspodata DOT se> wrote:

> Igor2:
> > On Fri, 4 Sep 2015, karl AT aspodata DOT se wrote:
> > > Igor2:
> > > [ about tables in pdf's ]
> > >
> > > It's true that pdf doesn't have a table structure.
> > >
> > > I have some experimental code to extract tables from pdf's; it is in:
> > >
> > >  http://turkos.aspodata.se/git/openhw/pdftosym/Experimental/
> >
> > Thanx, will check it out. What you wrote suggests your script works
> > similar to mine.
>
> Yes, but I got the impression you used the graphical elements in the
> file and that you possibly used pdftohtml in "html" mode, which doesn't
> give you the text positions. I have been working purely on the textual
> part.
>
> And beware that the code above is a big mess. Perhaps you can have a
> look at:
>
>  http://turkos.aspodata.se/computing/pdfextr.pl
>
> which is a little more polished; it extracts things from an invoice
> (sorry, I can't provide you with an example of the input data).
>
> Regards,
> /Karl Hammar
>
> -----------------------------------------------------------------------
> Aspö Data
> Lilla Aspö 148
> S-742 94 Östhammar
> Sweden
> +46 173 140 57
>
>
>
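
For anyone following along, the position-based idea Karl describes (working
from where each piece of text sits on the page) can be sketched roughly as
below. This is only an illustration under stated assumptions, not Karl's
script and not how Tabula works internally. It assumes input shaped like the
output of `pdftohtml -xml`, where each <text> element carries `top` and
`left` attributes. The sample data, the function name `rows_from_xml`, and
the 3-unit row tolerance are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `pdftohtml -xml` output (invented data).
SAMPLE_XML = """<pdf2xml>
<page number="1" top="0" left="0" height="1188" width="918">
<text top="100" left="50" width="30" height="12" font="0">Pin</text>
<text top="100" left="150" width="40" height="12" font="0">Name</text>
<text top="120" left="150" width="30" height="12" font="0">VCC</text>
<text top="121" left="50" width="10" height="12" font="0">1</text>
<text top="140" left="50" width="10" height="12" font="0">2</text>
<text top="140" left="150" width="30" height="12" font="0">GND</text>
</page>
</pdf2xml>"""


def rows_from_xml(xml_text, tolerance=3):
    """Group <text> elements into rows by vertical position.

    Elements whose `top` lies within `tolerance` units of the first element
    of the current row are treated as the same row (the tolerance is an
    assumed value, not taken from any tool). Cells are ordered by `left`.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        # Collect (top, left, text); itertext() also picks up nested <b>/<i>.
        items = sorted(
            (int(t.get("top")), int(t.get("left")), "".join(t.itertext()).strip())
            for t in page.iter("text")
        )
        current, current_top = [], None
        for top, left, text in items:
            # A jump in `top` larger than the tolerance starts a new row.
            if current_top is not None and top - current_top > tolerance:
                rows.append([s for _, s in sorted(current)])
                current, current_top = [], None
            if current_top is None:
                current_top = top
            current.append((left, text))
        if current:
            rows.append([s for _, s in sorted(current)])
    return rows


if __name__ == "__main__":
    for row in rows_from_xml(SAMPLE_XML):
        print("\t".join(row))
```

Real datasheets are messier (multi-line cells, merged columns, rotated
text), so treat this as a starting point rather than a working extractor.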


-- 
Jason White

--001a11c1cb68993c37051eeb0682--
