
Four Tools for Extracting Text from Word and PDF Files in Java

火烧 2023-01-16 06:48:02


By chris, a graduate of the School of Information, Renmin University of China

Many people who work with documents in Java run into the same question: how do you get at the content of Word, Excel, PDF and similar files? I looked into it, and this article summarizes several ways to extract text from Word and PDF files.

Using JACOB

JACOB is really a bridge: middleware that connects Java to COM or Win32 functions. JACOB cannot extract Word or Excel files by itself; it needs a native DLL. You do not have to write one, though, because the JACOB author ships a ready-made DLL with the library.

Download the JACOB jar and DLL: ?id=

After downloading JACOB, put the DLL on the PATH and the jar on the classpath. Then you can write your own extraction program. Here is a simple example:

    import com.jacob.com.*;
    import com.jacob.activeX.*;

    /**
     * Title: pdf extraction
     * Description: email:
     * Copyright: Matrix Copyright (c)
     * Company:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class FileExtracter {
        public static void main(String[] args) {
            // Start Word through COM
            ActiveXComponent component = new ActiveXComponent("Word.Application");
            String inFile = "c:\\test.doc";
            String tpFile = "c:\\"; // the target file name is missing in the source; supply a full path
            // The format code is missing in the source; 8 (Word's wdFormatHTML) is an assumption
            int saveFormat = 8;
            try {
                component.setProperty("Visible", new Variant(false));
                Dispatch documents = component.getProperty("Documents").toDispatch();
                // Open(fileName, confirmConversions = false, readOnly = true)
                Dispatch wordFile = Dispatch.invoke(documents, "Open", Dispatch.Method,
                        new Object[]{inFile, new Variant(false), new Variant(true)},
                        new int[1]).toDispatch();
                // Save the document in the chosen format, then close it without saving changes
                Dispatch.invoke(wordFile, "SaveAs", Dispatch.Method,
                        new Object[]{tpFile, new Variant(saveFormat)}, new int[1]);
                Dispatch.call(wordFile, "Close", new Variant(false));
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Always shut Word down, even after a failure
                component.invoke("Quit", new Variant[]{});
            }
        }
    }

Using Apache POI to extract Word and Excel

POI is an Apache project, but even with POI you may find the work tedious. No matter: a simpler interface wrapped around it is available.

Download the wrapped POI package: ?id=

Once downloaded, just put it on your classpath. Here is an example of how to use it:

    import java.io.*;
    import org.textmining.text.extraction.WordExtractor;

    /**
     * Title: word extraction
     * Description: email:
     * Company:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class PdfExtractor {
        public PdfExtractor() {
        }

        public static void main(String args[]) throws Exception {
            FileInputStream in = new FileInputStream("c:\\a.doc");
            try {
                WordExtractor extractor = new WordExtractor();
                String str = extractor.extractText(in);
                System.out.println("the result length is " + str.length());
                System.out.println("the result is " + str);
            } finally {
                in.close();
            }
        }
    }

Using PDFBox to extract PDF files

PDFBox does not yet handle Chinese well, however. First download PDFBox: ?id=

Here is an example of extracting a PDF file with PDFBox:

    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.pdfparser.PDFParser;
    import org.pdfbox.util.PDFTextStripper;
    import java.io.*;

    /**
     * Title: pdf extraction
     * Description: email:
     * Company:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class PdfExtracter {
        public PdfExtracter() {
        }

        public String GetTextFromPdf(String filename) throws Exception {
            PDDocument pdfDocument = null;
            FileInputStream is = new FileInputStream(filename);
            try {
                // Parse the file with the old org.pdfbox API used in the source
                PDFParser parser = new PDFParser(is);
                parser.parse();
                pdfDocument = parser.getPDDocument();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                OutputStreamWriter writer = new OutputStreamWriter(out);
                PDFTextStripper stripper = new PDFTextStripper();
                stripper.writeText(pdfDocument.getDocument(), writer);
                writer.close();
                byte[] contents = out.toByteArray();
                String ts = new String(contents);
                System.out.println("the string length is " + contents.length + "\n");
                return ts;
            } finally {
                // Release the document and the input stream
                if (pdfDocument != null) {
                    pdfDocument.close();
                }
                is.close();
            }
        }

        public static void main(String args[]) {
            PdfExtracter pf = new PdfExtracter();
            try {
                String ts = pf.GetTextFromPdf("c:\\a.pdf");
                System.out.println(ts);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Extracting PDF files that contain Chinese: xpdf

xpdf is an open-source project. We can call its native program to extract text from Chinese PDF files.

Download the xpdf package: ?id=

You also need the Chinese support patch: ?id=

Place the Chinese patch as the README describes, and you can start writing the Java program that calls the native tool.

Here is an example of how to call it:

    import java.io.*;

    /**
     * Title: pdf extraction
     * Description: email:
     * @author chris
     * @version who use this example pls remain the declare
     */
    public class PdfWin {
        public PdfWin() {
        }

        public static void main(String args[]) throws Exception {
            String PATH_TO_XPDF = "C:\\Program Files\\xpdf\\pdftotext.exe";
            String filename = "c:\\a.pdf";
            // The trailing "-" makes pdftotext write to stdout instead of a .txt file
            String[] cmd = new String[]{PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
            Process p = Runtime.getRuntime().exec(cmd);
            BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
            InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
            StringWriter out = new StringWriter();
            char[] buf = new char[8192]; // the size is missing in the source; any reasonable size works
            int len;
            while ((len = reader.read(buf)) >= 0) {
                // Accumulate every chunk; reading buf after the loop would keep only the last one
                out.write(buf, 0, len);
                System.out.println("the length is " + len);
            }
            reader.close();
            String ts = out.toString();
            System.out.println("the str is " + ts);
        }
    }

About the author

chris graduated from the School of Information at Renmin University of China and now develops financial-analysis software in Hong Kong. The author is also active in the JXTA P2P open-source development community and is keen on network security, AI, search-engine technology, and Java-based game-engine technology.

If anyone knows a better approach, please tell the author.

lishixinzhi/Article/program/Java/JSP/201311/19681
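
As a supplement to the xpdf approach above, the following is a minimal sketch of one way to wrap the pdftotext call so that the whole output is collected and a failed run is noticed. The class name PdfToText and its methods are hypothetical and are not part of the original article. It assumes a pdftotext binary that accepts "-enc", "-q", and "-" (write to stdout), and uses ProcessBuilder from the standard library in place of Runtime.exec.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not from the original article) around the pdftotext command line. */
final class PdfToText {

    /** Builds the argument list; the trailing "-" asks pdftotext to write to stdout. */
    static List<String> buildCommand(String pdftotextPath, String pdfPath) {
        return Arrays.asList(pdftotextPath, "-enc", "UTF-8", "-q", pdfPath, "-");
    }

    /** Reads everything from the reader, appending each chunk so nothing is overwritten. */
    static String readAll(Reader reader) {
        StringBuilder text = new StringBuilder();
        char[] buf = new char[4096]; // chunk size is arbitrary
        try {
            int len;
            while ((len = reader.read(buf)) >= 0) {
                text.append(buf, 0, len);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    /** Runs pdftotext on one file and returns the extracted text. */
    static String extract(String pdftotextPath, String pdfPath)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(pdftotextPath, pdfPath));
        // Send the child's stderr to our own stderr so an unread pipe cannot stall it
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        String text;
        try (Reader reader = new InputStreamReader(
                process.getInputStream(), StandardCharsets.UTF_8)) {
            text = readAll(reader);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToText <path-to-pdftotext> <file.pdf>");
            return;
        }
        System.out.println(extract(args[0], args[1]));
    }
}
```

It could be run, for instance, with the pdftotext path and a PDF path as its two arguments. Keeping the command construction and the stream reading in separate methods makes both easy to check without having xpdf installed.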
