Introduction to Lucene
Lucene's Core API
Basic Usage of Lucene
Viewing Tokenization Results with Luke
Lucene is an open-source full-text search engine library written in Java. It can be embedded into your own project to add indexing and search features; in essence it is a high-performance, scalable information-retrieval library. Like other open-source software it comes with inherent advantages: transparent functionality and structure, powerful features with strong extensibility, and solid community support that makes technical exchange easy.
Lucene is only a search core library and does not provide a complete application, but it is used very widely: Solr, Elasticsearch, Katta, and others are all built on Lucene. Its API is simple and easy to learn (although the API differs considerably between versions).
Lucene's underlying principle is the inverted index.
So what exactly is an inverted index, and what is a forward index?
My understanding:
Inverted index: after Lucene tokenizes the text, it maintains a mapping of roughly the form "term -> document IDs"; when we search for a term, we get the matching document IDs directly.
Forward index: a search scans the full content of every document, with no "term -> document IDs" mapping maintained; a match yields the corresponding document ID. This is obviously slow, much like querying a database table that has no index.
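The difference can be sketched in plain Java. This is a toy illustration of the two approaches, not Lucene's actual on-disk data structures:

```java
import java.util.*;

public class InvertedIndexDemo {

    // build an inverted index: term -> sorted set of document IDs
    static Map<String, Set<Integer>> build(Map<Integer, String> docs) {
        Map<String, Set<Integer>> inverted = new HashMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            for (String term : e.getValue().split(" ")) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(e.getKey());
            }
        }
        return inverted;
    }

    // "forward" search: scan the full content of every document
    static List<Integer> scan(Map<Integer, String> docs, String term) {
        List<Integer> matches = new ArrayList<>();
        for (Map.Entry<Integer, String> e : docs.entrySet()) {
            if (e.getValue().contains(term)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new HashMap<>();
        docs.put(1, "lucene is a search library");
        docs.put(2, "solr is built on lucene");

        // inverted lookup: a single map lookup, no document scan
        System.out.println(build(docs).get("lucene"));   // [1, 2]

        // forward search: every document is scanned on every query
        System.out.println(scan(docs, "lucene"));        // [1, 2]
    }
}
```

Lucene's real inverted index also records term frequencies, positions, and norms for each term, which is what makes scoring and phrase queries possible.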
Document, Field, IndexWriter, Directory, Analyzer
Document: a document represents a collection of fields (Field). It is the entity that carries the data, an abstract concept rather than a Word or TXT file. A Document is the basic unit of indexing.
Constructor:
org.apache.lucene.document.Document.Document()
Common APIs:
org.apache.lucene.document.Document.add(IndexableField) // add a field
Field: every Document object in the index contains one or more fields. A field is a name/value pair, and each field holds one piece of the document's data.
Common constructor:
org.apache.lucene.document.Field.Field(String, String, IndexableFieldType)
IndexWriter is the central component of the indexing process. This class creates a new index and adds documents to an existing one. It gives you write access to the index, but cannot read or search it.
Constructor:
org.apache.lucene.index.IndexWriter.IndexWriter(Directory, IndexWriterConfig)
Its core APIs include:
org.apache.lucene.index.IndexWriter.addDocument(Iterable<? extends IndexableField>) // add a document
org.apache.lucene.index.IndexWriter.updateDocuments(Term, Iterable<? extends Iterable<? extends IndexableField>>) // update documents
org.apache.lucene.index.IndexWriter.tryDeleteDocument(IndexReader, int) // try to delete a single document by ID
org.apache.lucene.index.IndexWriter.deleteDocuments(Term...) // delete documents containing the given terms
Directory is where the index is stored; it is an abstract class whose concrete subclasses provide specific storage locations. FSDirectory stores the index on disk at a given path; RAMDirectory stores it in memory.
Creating an FSDirectory:
org.apache.lucene.store.FSDirectory.open(Path) // open an index directory
org.apache.lucene.store.FSDirectory.listAll(Path) // list the files in an index directory
Creating a RAMDirectory:
org.apache.lucene.store.RAMDirectory.RAMDirectory()
Analyzer, the tokenizer: before text is indexed it must pass through an analyzer, which extracts the tokens to be indexed from the document and discards the useless parts (stop words). The analyzer is critical, because different analyzers can produce very different results for the same document. Analyzer is an abstract class and the base class of all analyzers; through TokenStream it breaks the text into a stream of tokens.
Commonly used analyzers:
org.apache.lucene.analysis.standard.StandardAnalyzer // standard analyzer
org.apache.lucene.analysis.core.SimpleAnalyzer // simple analyzer
org.wltea.analyzer.lucene.IKAnalyzer // IK analyzer (for Chinese)
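What an analyzer does can be illustrated with a deliberately simplified pure-Java tokenizer (lowercasing, splitting on whitespace, dropping stop words). Real Lucene analyzers are far more sophisticated and expose their tokens through TokenStream; this sketch only shows the concept:

```java
import java.util.*;

public class SimpleAnalyzerDemo {
    // a toy stop-word list; real analyzers ship much larger ones
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "is", "the", "of"));

    // lowercase, split on whitespace, drop stop words
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Lucene is a Search Library"));
        // → [lucene, search, library]
    }
}
```

This is roughly the behavior of StandardAnalyzer on English text; for Chinese, an analyzer such as IKAnalyzer is needed because words are not separated by spaces.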
IndexSearcher, Term, Query, TermQuery, TopDocs
IndexSearcher: call its search method to search the index created by IndexWriter.
Constructor:
org.apache.lucene.search.IndexSearcher.IndexSearcher(IndexReader)
Common APIs:
org.apache.lucene.search.IndexSearcher.search(Query, int) // search, returning the top n highest-scoring documents
org.apache.lucene.search.IndexSearcher.searchAfter(ScoreDoc, Query, int) // continue a search after the given hit, used for pagination
org.apache.lucene.search.IndexSearcher.search(Query, int, Sort) // search with a custom sort order
Term: the basic unit used in searches.
Query: Lucene provides many Query subclasses, for example TermQuery (single-term query), BooleanQuery (boolean query), PhraseQuery (phrase search), PrefixQuery (prefix search), and so on; they express the constraints of a search. TermQuery is the most basic and simplest query type Lucene provides; it matches documents that contain a specific term in a specified field.
TopDocs is a simple container of pointers to the ranked search results, where the results are the documents matching a query.
POM dependencies:
<properties>
<lucene.version>7.1.0</lucene.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-core</artifactId>
<version>${lucene.version}</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-analyzers-smartcn</artifactId>
<version>${lucene.version}</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-queryparser</artifactId>
<version>${lucene.version}</version>
</dependency>
<dependency>
<groupId>org.apache.lucene</groupId>
<artifactId>lucene-highlighter</artifactId>
<version>${lucene.version}</version>
</dependency>
</dependencies>
Utility class:
package com.bj58.wuxian.lucene.utils;
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryTermScorer;
import org.apache.lucene.search.highlight.Scorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
public class LuceneUtil {
public static Directory getDirectory(String path) {
Directory directory = null;
try {
directory = FSDirectory.open(Paths.get(path));
} catch (IOException e) {
e.printStackTrace();
}
return directory;
}
public static Directory getRAMDirectory() {
Directory directory = new RAMDirectory();
return directory;
}
public static DirectoryReader getDirectoryReader(Directory directory) {
DirectoryReader reader = null;
try {
reader = DirectoryReader.open(directory);
} catch (IOException e) {
e.printStackTrace();
}
return reader;
}
public static IndexSearcher getIndexSearcher(DirectoryReader reader) {
IndexSearcher indexSearcher = new IndexSearcher(reader);
return indexSearcher;
}
public static IndexWriter getIndexWriter(Directory directory, Analyzer analyzer) {
IndexWriter iwriter = null;
try {
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
// Sort sort=new Sort(new SortField("content", Type.STRING));
// config.setIndexSort(sort); // index-time sorting
config.setCommitOnClose(true); // commit automatically on close
// config.setMergeScheduler(new ConcurrentMergeScheduler());
// config.setIndexDeletionPolicy(new
// SnapshotDeletionPolicy(NoDeletionPolicy.INSTANCE));
iwriter = new IndexWriter(directory, config);
} catch (IOException e) {
e.printStackTrace();
}
return iwriter;
}
public static void close(IndexWriter indexWriter,Directory directory){
if(indexWriter!=null){
try {
indexWriter.close();
} catch (IOException e) {
indexWriter=null;
}
}
if(directory!=null){
try {
directory.close();
} catch (IOException e) {
directory=null;
}
}
}
public static void close(DirectoryReader reader,Directory directory){
if(reader!=null){
try {
reader.close();
} catch (IOException e) {
reader=null;
}
}
if(directory!=null){
try {
directory.close();
} catch (IOException e) {
directory=null;
}
}
}
/**
 * Build a Highlighter that wraps query matches in highlight tags.
 * @param query
 * @param fieldName
 * @return
 */
public static Highlighter getHighlighter(Query query, String fieldName) {
Formatter formatter = new SimpleHTMLFormatter("<span style='color:red'>", "</span>");
Scorer fragmentScorer = new QueryTermScorer(query, fieldName);
Highlighter highlighter = new Highlighter(formatter, fragmentScorer);
highlighter.setTextFragmenter(new SimpleFragmenter(200));
return highlighter;
}
}
Creating the index:
package com.bj58.wuxian.lucene.index;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.junit.Before;
import org.junit.Test;
import com.bj58.wuxian.lucene.utils.LuceneUtil;
public class Indexer {
private Analyzer analyzer;
private Directory directory;
@Before
public void init() throws IOException {
analyzer = new StandardAnalyzer();
// create the index in memory
// directory = new RAMDirectory();
// create the index on disk
directory = LuceneUtil.getDirectory("D:\\index");
}
@Test
public void index() {
IndexWriter indexWriter = null;
try {
indexWriter = LuceneUtil.getIndexWriter(directory, analyzer);
Document document = new Document();
document.add(new StringField("title", "西游记", Store.YES));
document.add(new StringField("author", "孙悟空", Store.YES));
document.add(new TextField("desc", "《西游记》是中国古典四大名著之一", Store.YES));
indexWriter.addDocument(document);
} catch (IOException e) {
e.printStackTrace();
} finally {
LuceneUtil.close(indexWriter, directory);
}
}
@Test
public void addIndex() {
IndexWriter indexWriter = null;
try {
indexWriter = LuceneUtil.getIndexWriter(directory, analyzer);
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexOptions(IndexOptions.DOCS);
Random random = new Random();
List<Document> documents = new ArrayList<Document>();
for (int i = 0; i < 10; i++) {
Document document = new Document();
document.add(new Field("id", i + "", fieldType));
document.add(new StringField("title", "三国演义"+i, Store.YES));
document.add(new StringField("author", "罗贯中", Store.YES));
document.add(new TextField("desc", "《三国演义》是中国古典四大名著之一", Store.YES));
int price = random.nextInt(100) + 1;
document.add(new NumericDocValuesField("price", price));
document.add(new StoredField("price", price));
documents.add(document);
}
indexWriter.addDocuments(documents);
} catch (IOException e) {
e.printStackTrace();
} finally {
LuceneUtil.close(indexWriter, directory);
}
}
@Test
public void delIndex() throws ParseException {
IndexWriter indexWriter = null;
try {
indexWriter = LuceneUtil.getIndexWriter(directory, analyzer);
indexWriter.deleteDocuments(new Term("title", "三国演义1"));
// indexWriter.deleteDocuments(new Term("id","9"));
// QueryParser queryParser=new QueryParser("id", analyzer);
// Query query=queryParser.parse("2");
// indexWriter.deleteDocuments(query);
// force-merge to purge deleted documents
// indexWriter.forceMergeDeletes();
// remember to commit, or enable commit-on-close
// indexWriter.commit();
System.out.println("deletions:" + indexWriter.hasDeletions() + " maxDoc:" + indexWriter.maxDoc()
+ " num:" + indexWriter.numDocs());
} catch (IOException e) {
e.printStackTrace();
} finally {
LuceneUtil.close(indexWriter, directory);
}
}
@Test
public void delAllIndex() throws ParseException {
IndexWriter indexWriter = null;
try {
indexWriter = LuceneUtil.getIndexWriter(directory, analyzer);
indexWriter.deleteAll();
System.out.println("deletions:" + indexWriter.hasDeletions() + " maxDoc:" + indexWriter.maxDoc()
+ " num:" + indexWriter.numDocs());
} catch (IOException e) {
e.printStackTrace();
} finally {
LuceneUtil.close(indexWriter, directory);
}
}
@Test
public void update() throws ParseException {
IndexWriter indexWriter = null;
try {
indexWriter = LuceneUtil.getIndexWriter(directory, analyzer);
Document doc = new Document();
FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexOptions(IndexOptions.DOCS);
doc.add(new Field("id", "9", fieldType));
doc.add(new StringField("title", "水浒传", Store.YES));
doc.add(new StringField("author", "施耐庵", Store.YES));
doc.add(new TextField("desc", "《水浒传》,是中国四大名著之一,是一部描写宋江起义的长篇小说", Store.YES));
indexWriter.updateDocument(new Term("id", "9"), doc);
// force-merge to purge deleted documents
// indexWriter.forceMergeDeletes();
// remember to commit, or enable commit-on-close
// indexWriter.commit();
System.out.println("deletions:" + indexWriter.hasDeletions() + " maxDoc:" + indexWriter.maxDoc()
+ " num:" + indexWriter.numDocs());
} catch (IOException e) {
e.printStackTrace();
} finally {
LuceneUtil.close(indexWriter, directory);
}
}
// search
@Test
public void search() {
// open the index reader
DirectoryReader ireader = null;
IndexSearcher isearcher = null;
try {
ireader = LuceneUtil.getDirectoryReader(directory);
isearcher = LuceneUtil.getIndexSearcher(ireader);
// Analyzer analyzer1=new IKAnalyzer(true);
// QueryParser parser = new QueryParser("desc", analyzer);
// Query query = parser.parse("三国演义");
TermQuery query = new TermQuery(new Term("title", "三国演义1"));
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
System.out.println("id: " + hitDoc.get("id") + " title:" + hitDoc.get("title") + " author:"
+ hitDoc.get("author") + " price:" + hitDoc.get("price") + " desc:" + hitDoc.get("desc")
+ " score:" + hits[i].score);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
LuceneUtil.close(ireader, directory);
}
}
}
Searching the index:
package com.bj58.wuxian.lucene.search;
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.search.SortField.Type;
import org.apache.lucene.store.Directory;
import org.junit.Before;
import org.junit.Test;
import com.bj58.wuxian.lucene.constants.Constants;
import com.bj58.wuxian.lucene.utils.LuceneUtil;
public class Searcher{
private Analyzer analyzer;
private Directory directory;
@Before
public void init() throws IOException {
analyzer = new StandardAnalyzer();
// create the index in memory
// directory = new RAMDirectory();
// create the index on disk
directory = LuceneUtil.getDirectory("D:\\index");
}
// search
@Test
public void search() throws IOException, ParseException {
// open the index
DirectoryReader ireader = LuceneUtil.getDirectoryReader(directory);
IndexSearcher isearcher = LuceneUtil.getIndexSearcher(ireader);
QueryParser parser = new QueryParser("desc", analyzer);
Query query = parser.parse("三国演义");
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
System.out.println("id: " + hitDoc.get("id") + " title:" + hitDoc.get("title") + " author:"
+ hitDoc.get("author") + " price:" + hitDoc.get("price") + " desc:" + hitDoc.get("desc")
+ " score:" + hits[i].score);
}
LuceneUtil.close(ireader, directory);
}
// search with sorting
@Test
public void searchBySort() throws IOException, ParseException {
// open the index
DirectoryReader ireader = LuceneUtil.getDirectoryReader(directory);
IndexSearcher isearcher = LuceneUtil.getIndexSearcher(ireader);
QueryParser parser = new QueryParser("desc", analyzer);
Query query = parser.parse("三国演义");
ScoreDoc[] hits = isearcher.search(query, 10, new Sort(new SortField("price", Type.INT, true))).scoreDocs;
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
System.out.println("id: " + hitDoc.get("id") + " title:" + hitDoc.get("title") + " author:"
+ hitDoc.get("author") + " price:" + hitDoc.get("price") + " desc:" + hitDoc.get("desc")
+ " score:" + hits[i].score);
}
LuceneUtil.close(ireader, directory);
}
@Test
public void testSearchByPage() {
try {
searchByPage("title", "三国演义", 2);
} catch (Exception e) {
e.printStackTrace();
}
}
// paginated search
public void searchByPage(String fieldName, String value, int pageNum) throws Exception{
// open the index
DirectoryReader reader=LuceneUtil.getDirectoryReader(directory);
IndexSearcher isearcher =LuceneUtil.getIndexSearcher(reader);
WildcardQuery query = new WildcardQuery(new Term(fieldName,"*"+value+"*"));
Sort sort = new Sort(new SortField("price", Type.INT, true));
// compute the offset of the first hit on this page
int start = (pageNum - 1) * Constants.PAGE_SIZE;
ScoreDoc[] hits = null;
if (start == 0) {
hits = isearcher.search(query, Constants.PAGE_SIZE,
new Sort(new SortField("price", Type.INT, true))).scoreDocs;
} else {
ScoreDoc[] hitsPres = isearcher.search(query, start, sort).scoreDocs;
ScoreDoc preHit = hitsPres[start - 1];
hits = isearcher.searchAfter(preHit, query, Constants.PAGE_SIZE, sort).scoreDocs;
}
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
System.out.println("id: " + hitDoc.get("id") + " title:" + hitDoc.get("title") + " author:"
+ hitDoc.get("author") + " price:" + hitDoc.get("price") + " desc:" + hitDoc.get("desc"));
}
LuceneUtil.close(reader, directory);
}
}
Highlighting:
package com.bj58.wuxian.lucene.highlighter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.store.Directory;
import org.wltea.analyzer.lucene.IKAnalyzer;
import com.bj58.wuxian.lucene.constants.Constants;
import com.bj58.wuxian.lucene.utils.LuceneUtil;
public class HighlighterWrapper {
public static void search(Directory directory, Analyzer analyzer, Query query, String fieldName) {
DirectoryReader reader=null;
try {
Highlighter highlighter = LuceneUtil.getHighlighter(query, fieldName);
reader = LuceneUtil.getDirectoryReader(directory);
IndexSearcher isearcher = LuceneUtil.getIndexSearcher(reader);
ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
for (int i = 0; i < hits.length; i++) {
Document hitDoc = isearcher.doc(hits[i].doc);
String result = highlighter.getBestFragment(analyzer, fieldName, hitDoc.get(fieldName));
System.out.println(result);
}
} catch (Exception e) {
e.printStackTrace();
}finally{
LuceneUtil.close(reader, directory);
}
}
public static void main(String[] args) throws ParseException {
Directory directory = LuceneUtil.getDirectory(Constants.DIRECTROY_PATH);
Analyzer analyzer = new StandardAnalyzer();
//Analyzer analyzer = new IKAnalyzer(true);
QueryParser parser = new QueryParser("desc", analyzer);
Query query = parser.parse("三国演义");
//TermQuery query=new TermQuery(new Term("desc", "三国演义"));
search(directory, analyzer, query, "desc");
}
}
Download links for each Luke release: https://github.com/DmitryKey/luke/releases
Typical usage:
Open the index directory:
Browse terms:
Run a search:
References:
"Lucene in Action", by Michael McCandless / Erik Hatcher / Otis Gospodnetic; Chinese edition published by Posts & Telecom Press (人民邮电出版社), translated by 牛长流 / 肖宇, June 2011.
http://lucene.apache.org/