annotate MoinMoin/search/Xapian.py @ 5018:67578c72e2d9

Xapian2009: create_page works right no need to check if revision is '99999999'.
author Dmitrijs Milajevs <dimazest@gmail.com>
date Sun, 16 Aug 2009 09:52:24 +0200
parents bc42755b5820
children a93283d1f827
rev   line source
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
1 # -*- coding: iso-8859-1 -*-
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
2 """
1497
ed3845759431 update comments/docstrings
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1496
diff changeset
3 MoinMoin - xapian search engine
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
4
3128
9213b197d1cb Xapian: use own logger instead of request.log
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2286
diff changeset
5 @copyright: 2006-2008 MoinMoin:ThomasWaldmann,
823
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
6 2006 MoinMoin:FranzPletz
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
7 @license: GNU GPL, see COPYING for details.
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
8 """
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
9
1791
6dd2e29acffe Eclipse PyDev Check: fixed lots of its errors and warnings
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1522
diff changeset
10 import os, re
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
11
823
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
12 import xapian
1915
ed85c389031d xapian.Query needs to be imported
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1880
diff changeset
13 from xapian import Query
1793
2a4caa295346 Eclipse PyDev Check: fixed lots of its errors and warnings
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1792
diff changeset
14
3128
9213b197d1cb Xapian: use own logger instead of request.log
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2286
diff changeset
15 from MoinMoin import log
9213b197d1cb Xapian: use own logger instead of request.log
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2286
diff changeset
16 logging = log.getLogger(__name__)
9213b197d1cb Xapian: use own logger instead of request.log
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2286
diff changeset
17
4971
21bc8092a009 Xapian2009: Search is done using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
18 from MoinMoin.support import xappy
801
1f8976e01c3a fix wrong import, make more use of MimeType class
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 797
diff changeset
19 from MoinMoin.parser.text_moin_wiki import Parser as WikiParser
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
20
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
21 from MoinMoin.Page import Page
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
22 from MoinMoin import config, wikiutil
921
45e286183872 abstraction work on search engine index & cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 919
diff changeset
23 from MoinMoin.search.builtin import BaseIndex
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
24
1475
925bbdfe0ab9 check if the correct version of xapian is installed
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1473
diff changeset
25
1915
ed85c389031d xapian.Query needs to be imported
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1880
diff changeset
26 class UnicodeQuery(Query):
1473
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
27 """ Xapian query object which automatically encodes unicode strings """
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
28
823
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
29 def __init__(self, *args, **kwargs):
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
30 """
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
31 @keyword encoding: specify the encoding manually (default: value of config.charset)
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
32 """
823
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
33 self.encoding = kwargs.get('encoding', config.charset)
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
34
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
35 nargs = []
843
11a9d77e92d3 stemming works.. in english
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 824
diff changeset
36 for term in args:
11a9d77e92d3 stemming works.. in english
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 824
diff changeset
37 if isinstance(term, unicode):
11a9d77e92d3 stemming works.. in english
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 824
diff changeset
38 term = term.encode(self.encoding)
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
39 elif isinstance(term, list) or isinstance(term, tuple):
852
0ccd65be5656 some more code and thinking on matching stemmed words
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 851
diff changeset
40 term = [t.encode(self.encoding) for t in term]
843
11a9d77e92d3 stemming works.. in english
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 824
diff changeset
41 nargs.append(term)
823
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
42
1915
ed85c389031d xapian.Query needs to be imported
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1880
diff changeset
43 Query.__init__(self, *nargs, **kwargs)
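The argument handling in UnicodeQuery.__init__ above can be tried without a xapian installation. The following is a minimal Python 3 sketch (the annotated source is Python 2, where unicode is a separate type). The helper name encode_terms is hypothetical and not part of MoinMoin or xapian; it only mirrors the loop that builds nargs and does not construct a query.

```python
def encode_terms(args, encoding="utf-8"):
    """Mirror the nargs loop: encode text terms, pass everything else through."""
    nargs = []
    for term in args:
        if isinstance(term, str):
            # Python 3 str plays the role of Python 2 unicode here.
            term = term.encode(encoding)
        elif isinstance(term, (list, tuple)):
            # Sequences of terms are encoded element by element.
            term = [t.encode(encoding) for t in term]
        nargs.append(term)
    return nargs


if __name__ == "__main__":
    # The integer stands in for a non-text argument such as a query operator.
    print(encode_terms([0, "caf\u00e9", ("wiki", "page")]))
```

Arguments that are neither text nor a list or tuple are appended unchanged, as in the original loop.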
823
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
44
4979
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
45
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
46 class MoinSearchConnection(xappy.SearchConnection):
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
47
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
48 def get_all_documents(self):
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
49 """
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
50 Return all the documents in the xapian index.
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
51 """
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
52 document_count = self.get_doccount()
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
53 query = self.query_all()
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
54 hits = self.search(query, 0, document_count)
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
55 return hits
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
56
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
57 def get_all_documents_with_field(self, field, field_value):
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
58 document_count = self.get_doccount()
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
59 query = self.query_field(field, field_value)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
60 hits = self.search(query, 0, document_count)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
61 return hits
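Both methods above follow the same pattern: read the document count, build a query, then search the rank window from 0 to that count so that every match fits in the result. The sketch below reproduces that pattern against an in-memory stand-in. FakeSearchConnection is hypothetical and is not the real xappy.SearchConnection; only the method names get_doccount, query_all, query_field and search are taken from the source, and their behaviour here is an assumption made for illustration.

```python
class FakeSearchConnection(object):
    """Hypothetical in-memory stand-in; not the real xappy.SearchConnection."""

    def __init__(self, documents):
        self._documents = list(documents)  # each document is a dict of fields

    def get_doccount(self):
        return len(self._documents)

    def query_all(self):
        # A "query" in this sketch is just a predicate over a document.
        return lambda doc: True

    def query_field(self, field, value):
        return lambda doc: doc.get(field) == value

    def search(self, query, startrank, endrank):
        matches = [doc for doc in self._documents if query(doc)]
        return matches[startrank:endrank]

    def get_all_documents(self):
        # Ask for ranks 0..doccount so no match falls outside the window.
        return self.search(self.query_all(), 0, self.get_doccount())

    def get_all_documents_with_field(self, field, field_value):
        query = self.query_field(field, field_value)
        return self.search(query, 0, self.get_doccount())


if __name__ == "__main__":
    conn = FakeSearchConnection([
        {"pagename": "FrontPage", "lang": "en"},
        {"pagename": "StartSeite", "lang": "de"},
    ])
    print(conn.get_all_documents_with_field("lang", "de"))
```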
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
62
4979
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
63
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
64 class MoinIndexerConnection(xappy.IndexerConnection):
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
65
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
66 def __init__(self, *args, **kwargs):
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
67
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
68 super(MoinIndexerConnection, self).__init__(*args, **kwargs)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
69
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
70 self._define_fields_actions()
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
71
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
72 def _define_fields_actions(self):
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
73 SORTABLE = xappy.FieldActions.SORTABLE
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
74 INDEX_EXACT = xappy.FieldActions.INDEX_EXACT
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
75 INDEX_FREETEXT = xappy.FieldActions.INDEX_FREETEXT
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
76 STORE_CONTENT = xappy.FieldActions.STORE_CONTENT
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
77
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
78 self.add_field_action('wikiname', INDEX_EXACT)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
79 self.add_field_action('wikiname', STORE_CONTENT)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
80 self.add_field_action('pagename', INDEX_EXACT)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
81 self.add_field_action('pagename', STORE_CONTENT)
4974
eb5644419456 Xapian2009: pagename field now is sortable. test_search.py pep8 fixes, TestSearch is done for both Moin and Xapian searchers.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4973
diff changeset
82 self.add_field_action('pagename', SORTABLE)
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
83 self.add_field_action('attachment', INDEX_EXACT)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
84 self.add_field_action('attachment', STORE_CONTENT)
4971
21bc8092a009 Xapian2009: Search is done using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
85 self.add_field_action('mtime', INDEX_EXACT)
5004
bb2317d4984b Xapian2009: Index field actions and names were updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 5002
diff changeset
86 self.add_field_action('mtime', STORE_CONTENT)
4971
21bc8092a009 Xapian2009: Search is done using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
87 self.add_field_action('revision', STORE_CONTENT)
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
88 self.add_field_action('revision', INDEX_EXACT)
5004
bb2317d4984b Xapian2009: Index field actions and names were updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 5002
diff changeset
89 self.add_field_action('mimetype', INDEX_EXACT)
4976
8df5d749cf2d Xapian2009: xapian_term() was refactored. Code repetition was reduced by introducing BaseFieldSearch class. Field action definitions was updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4974
diff changeset
90 self.add_field_action('mimetype', STORE_CONTENT)
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
91 self.add_field_action('title', INDEX_FREETEXT, weight=5)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
92 self.add_field_action('content', INDEX_FREETEXT, spell=True)
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
93 self.add_field_action('fulltitle', INDEX_EXACT)
4976
8df5d749cf2d Xapian2009: xapian_term() was refactored. Code repetition was reduced by introducing BaseFieldSearch class. Field action definitions was updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4974
diff changeset
94 self.add_field_action('fulltitle', STORE_CONTENT)
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
95 self.add_field_action('domain', INDEX_EXACT)
4976
8df5d749cf2d Xapian2009: xapian_term() was refactored. Code repetition was reduced by introducing BaseFieldSearch class. Field action definitions was updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4974
diff changeset
96 self.add_field_action('domain', STORE_CONTENT)
5004
bb2317d4984b Xapian2009: Index field actions and names were updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 5002
diff changeset
97 self.add_field_action('lang', INDEX_EXACT)
4976
8df5d749cf2d Xapian2009: xapian_term() was refactored. Code repetition was reduced by introducing BaseFieldSearch class. Field action definitions was updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4974
diff changeset
98 self.add_field_action('lang', STORE_CONTENT)
5004
bb2317d4984b Xapian2009: Index field actions and names were updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 5002
diff changeset
99 self.add_field_action('stem_lang', INDEX_EXACT)
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
100 self.add_field_action('author', INDEX_EXACT)
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
101 self.add_field_action('linkto', INDEX_EXACT)
4976
8df5d749cf2d Xapian2009: xapian_term() was refactored. Code repetition was reduced by introducing BaseFieldSearch class. Field action definitions was updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4974
diff changeset
102 self.add_field_action('linkto', STORE_CONTENT)
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
103 self.add_field_action('category', INDEX_EXACT)
4976
8df5d749cf2d Xapian2009: xapian_term() was refactored. Code repetition was reduced by introducing BaseFieldSearch class. Field action definitions was updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4974
diff changeset
104 self.add_field_action('category', STORE_CONTENT)
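The field declarations above are a flat sequence of add_field_action calls. As an illustration only, the same information can be kept in a table and applied in a loop. This is a sketch of an alternative layout, not how the annotated revision is written: the string constants stand in for the xappy.FieldActions values, RecordingConnection is a hypothetical stub, and only a subset of the fields is shown.

```python
# Hypothetical action names standing in for xappy.FieldActions constants.
INDEX_EXACT, INDEX_FREETEXT, STORE_CONTENT, SORTABLE = (
    "INDEX_EXACT", "INDEX_FREETEXT", "STORE_CONTENT", "SORTABLE")

# A subset of the fields declared above: (field, action, keyword arguments).
FIELD_ACTIONS = [
    ("pagename", INDEX_EXACT, {}),
    ("pagename", STORE_CONTENT, {}),
    ("pagename", SORTABLE, {}),
    ("title", INDEX_FREETEXT, {"weight": 5}),
    ("content", INDEX_FREETEXT, {"spell": True}),
]


class RecordingConnection(object):
    """Hypothetical stub that records add_field_action calls."""

    def __init__(self):
        self.calls = []

    def add_field_action(self, fieldname, action, **kwargs):
        self.calls.append((fieldname, action, kwargs))


def define_field_actions(connection, table=FIELD_ACTIONS):
    """Apply every (field, action, kwargs) row to the connection."""
    for fieldname, action, kwargs in table:
        connection.add_field_action(fieldname, action, **kwargs)
    return connection


if __name__ == "__main__":
    for call in define_field_actions(RecordingConnection()).calls:
        print(call)
```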
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
105
823
17d66aec432c add Xapian.UnicodeQuery, small cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 822
diff changeset
106
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
107 ##############################################################################
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
108 ### Tokenizer
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
109 ##############################################################################
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
110
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
111
916
d0af8dce4d0e Xapian.use_stemming -> request.cfg.xapian_stemming and stemming lang bugfix
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 868
diff changeset
112 def getWikiAnalyzerFactory(request=None, language='en'):
1496
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
113 """ Returns a factory (a callable) that creates a WikiAnalyzer instance
1473
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
114
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
115 @keyword request: current request object
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
116 @keyword language: stemming language iso code, defaults to 'en'
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
117 """
916
d0af8dce4d0e Xapian.use_stemming -> request.cfg.xapian_stemming and stemming lang bugfix
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 868
diff changeset
118 return (lambda: WikiAnalyzer(request, language))
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
119
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
120
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
121 class WikiAnalyzer:
1497
ed3845759431 update comments/docstrings
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1496
diff changeset
122 """ A text analyzer for wiki syntax
2286
01f05e74aa9c Big PEP8 and whitespace cleanup
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2229
diff changeset
123
1497
ed3845759431 update comments/docstrings
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1496
diff changeset
124 The purpose of this class is to analyze texts/pages in wiki syntax
ed3845759431 update comments/docstrings
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1496
diff changeset
125 and yield single terms for xapwrap to feed into the xapian
ed3845759431 update comments/docstrings
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1496
diff changeset
126 database.
ed3845759431 update comments/docstrings
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1496
diff changeset
127 """
ed3845759431 update comments/docstrings
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1496
diff changeset
128
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
129 singleword = r"[%(u)s][%(l)s]+" % {
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
130 'u': config.chars_upper,
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
131 'l': config.chars_lower,
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
132 }
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
133
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
134 singleword_re = re.compile(singleword, re.U)
3558
44d1cd70e74c fix usage of WikiParser.word_rule (use re.VERBOSE) - fixes xapian indexing of WikiWords (index "WikiWords", "Wiki" and "Words"), fixes detection of WikiWords for the docbook parser
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3552
diff changeset
135 wikiword_re = re.compile(WikiParser.word_rule, re.UNICODE|re.VERBOSE)
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
136
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
137 token_re = re.compile(
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
138 r"(?P<company>\w+[&@]\w+)|" + # company names like AT&T and Excite@Home.
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
139 r"(?P<email>\w+([.-]\w+)*@\w+([.-]\w+)*)|" + # email addresses
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
140 r"(?P<acronym>(\w\.)+)|" + # acronyms: U.S.A., I.B.M., etc.
795
1735cad0cd6c add wikiname to search results, some cleanup
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 794
diff changeset
141 r"(?P<word>\w+)", # words (including WikiWords)
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
142 re.U)
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
143
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
144 dot_re = re.compile(r"[-_/,.]")
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
145 mail_re = re.compile(r"[-_/,.]|(@)")
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
146 alpha_num_re = re.compile(r"\d+|\D+")
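The regular expressions above can be exercised on their own. The sketch below copies token_re and alpha_num_re from the class body and wraps them in two helper functions; classify and split_alpha_num are hypothetical names used only for this illustration and are not methods of WikiAnalyzer. It is written for Python 3 and needs only the standard re module.

```python
import re

# Patterns copied from the class body above.
token_re = re.compile(
    r"(?P<company>\w+[&@]\w+)|"                  # company names like AT&T
    r"(?P<email>\w+([.-]\w+)*@\w+([.-]\w+)*)|"   # email addresses
    r"(?P<acronym>(\w\.)+)|"                     # acronyms: U.S.A., I.B.M.
    r"(?P<word>\w+)",                            # words (including WikiWords)
    re.U)
alpha_num_re = re.compile(r"\d+|\D+")


def classify(text):
    """Return (kind, token) pairs; kind is the named group that matched."""
    pairs = []
    for m in token_re.finditer(text):
        # groupdict() holds only the named groups; exactly one is not None.
        for kind, value in m.groupdict().items():
            if value is not None:
                pairs.append((kind, value))
    return pairs


def split_alpha_num(word):
    """Split a word into runs of digits and runs of non-digits."""
    return alpha_num_re.findall(word)


if __name__ == "__main__":
    print(classify("AT&T mailed john.doe@example.org about U.S.A. WikiWords"))
    print(split_alpha_num("Foo42Bar23"))
```

The alternatives are tried in order, so the company pattern takes precedence over the email pattern. With these patterns, an address with a plain local part such as tw@example.org is matched as the company token tw@example followed by the word org, while john.doe@example.org is matched as a single email token.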
1496
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
147
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
148 # XXX limit stuff above to xapdoc.MAX_KEY_LEN
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
149 # WORD_RE = re.compile('\\w{1,%i}' % MAX_KEY_LEN, re.U)
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
150
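The three helper patterns above can be tried in isolation. The following is a minimal sketch that copies them from the listing and applies them to made-up sample strings; the sample inputs and the result variable names are assumptions for illustration only.

```python
import re

# Patterns copied from the listing above.
dot_re = re.compile(r"[-_/,.]")
mail_re = re.compile(r"[-_/,.]|(@)")
alpha_num_re = re.compile(r"\d+|\D+")

# alpha_num_re alternates between runs of digits and runs of non-digits.
alpha_num_parts = alpha_num_re.findall("Foo42Bar23")

# dot_re splits on the usual separator characters.
dot_parts = dot_re.split("a-b_c/d,e.f")

# mail_re has a capturing group around "@", so re.split() keeps "@" in the
# result and inserts None wherever one of the other separators matched.
mail_parts = mail_re.split("john.doe@example.com")
```

The `None` entries in `mail_parts` are the reason the email branch of `raw_tokenize` tests `if word:` before yielding.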
916
d0af8dce4d0e Xapian.use_stemming -> request.cfg.xapian_stemming and stemming lang bugfix
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 868
diff changeset
151 def __init__(self, request=None, language=None):
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
152 """
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
153 @param request: current request
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
154 @param language: if given, the language in which to stem words
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
155 """
3552
7d9b8040e3be Xapian search / stemming: fix crash if default language is un-stemmable
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3550
diff changeset
156 self.stemmer = None
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
157 if request and request.cfg.xapian_stemming and language:
3552
7d9b8040e3be Xapian search / stemming: fix crash if default language is un-stemmable
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3550
diff changeset
158 try:
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
159 stemmer = xapian.Stem(language)
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
160 # we need this wrapper because the stemmer returns a utf-8
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
161 # encoded string even when it gets fed with unicode objects:
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
162 self.stemmer = lambda word: stemmer(word).decode('utf-8')
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
163 except xapian.InvalidArgumentError:
3552
7d9b8040e3be Xapian search / stemming: fix crash if default language is un-stemmable
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3550
diff changeset
164 # lang is not stemmable or not available
7d9b8040e3be Xapian search / stemming: fix crash if default language is un-stemmable
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3550
diff changeset
165 pass
843
11a9d77e92d3 stemming works.. in english
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 824
diff changeset
166
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
167 def raw_tokenize_word(self, word, pos):
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
168 """ try to further tokenize some word starting at pos """
3856
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
169 yield (word, pos)
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
170 if self.wikiword_re.match(word):
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
171 # if it is a CamelCaseWord, we additionally try to tokenize Camel, Case and Word
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
172 for m in re.finditer(self.singleword_re, word):
3856
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
173 mw, mp = m.group(), pos + m.start()
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
174 for w, p in self.raw_tokenize_word(mw, mp):
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
175 yield (w, p)
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
176 else:
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
177 # if we have Foo42, yield Foo and 42
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
178 for m in re.finditer(self.alpha_num_re, word):
3856
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
179 mw, mp = m.group(), pos + m.start()
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
180 if mw != word:
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
181 for w, p in self.raw_tokenize_word(mw, mp):
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
182 yield (w, p)
6aeb3f0af92c Xapian indexer/tokenizer: tokenize CamelCase parts of non-wikiwords
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3838
diff changeset
183
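The recursion in `raw_tokenize_word` may be easier to follow as a standalone function. The sketch below keeps the control flow of the method but uses simplified ASCII-only stand-ins for `wikiword_re` and `singleword_re`, which are defined elsewhere in the class; those two patterns are assumptions and are not the real MoinMoin rules.

```python
import re

# Assumed stand-ins (ASCII only); the real patterns live elsewhere in the class.
singleword_re = re.compile(r"[A-Z][a-z]+")
wikiword_re = re.compile(r"(?:[A-Z][a-z]+){2,}$")
# Copied from the listing.
alpha_num_re = re.compile(r"\d+|\D+")


def raw_tokenize_word(word, pos):
    """Yield the word itself, then any finer-grained parts, with positions."""
    yield (word, pos)
    if wikiword_re.match(word):
        # CamelCase: additionally yield Camel and Case.
        for m in singleword_re.finditer(word):
            for w, p in raw_tokenize_word(m.group(), pos + m.start()):
                yield (w, p)
    else:
        # Foo42: additionally yield Foo and 42.
        for m in alpha_num_re.finditer(word):
            mw, mp = m.group(), pos + m.start()
            if mw != word:  # stop recursing once nothing is left to split
                for w, p in raw_tokenize_word(mw, mp):
                    yield (w, p)


mixed = list(raw_tokenize_word("Foo42Bar23", 0))
camel = list(raw_tokenize_word("CamelCase", 10))
```

The `mw != word` guard is what ends the recursion: a part that cannot be split further matches itself in full and is not descended into again.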
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
184 def raw_tokenize(self, value):
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
185 """ Yield a stream of words from a string.
1473
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
186
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
187 @param value: string to split, must be a unicode object or a list of
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
188 unicode objects
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
189 """
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
190 if isinstance(value, list): # used for page links
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
191 for v in value:
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
192 yield (v, 0)
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
193 else:
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
194 tokenstream = re.finditer(self.token_re, value)
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
195 for m in tokenstream:
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
196 if m.group("acronym"):
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
197 yield (m.group("acronym").replace('.', ''), m.start())
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
198 elif m.group("company"):
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
199 yield (m.group("company"), m.start())
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
200 elif m.group("email"):
925
4508fc92fcb1 index exact positions of terms (postings)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 924
diff changeset
201 displ = 0
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
202 for word in self.mail_re.split(m.group("email")):
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
203 if word:
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
204 yield (word, m.start() + displ)
925
4508fc92fcb1 index exact positions of terms (postings)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 924
diff changeset
205 displ += len(word) + 1
793
a465544cff9a use WikiAnalyzer, make analyzers yield unicode
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 792
diff changeset
206 elif m.group("word"):
3647
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
207 for word, pos in self.raw_tokenize_word(m.group("word"), m.start()):
b3747c0e81ae Xapian search: improve analyzer to tokenize Foo42Bar23 into Foo, 42, Bar, 23
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3640
diff changeset
208 yield word, pos
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
209
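The email branch of `raw_tokenize` derives each token position from a running displacement. The sketch below isolates that branch as a hypothetical helper, `email_tokens`, to show which positions it produces; the helper name and the sample address are assumptions.

```python
import re

mail_re = re.compile(r"[-_/,.]|(@)")  # copied from the listing


def email_tokens(email, start):
    """Mirror of the email branch: yield (word, position) pairs."""
    displ = 0
    for word in mail_re.split(email):
        if word:  # skips None and empty strings produced by re.split()
            yield (word, start + displ)
            displ += len(word) + 1


tokens = list(email_tokens("john.doe@example.com", 0))
```

In this sketch the "@" comes back as a token of its own, and every token advances the offset by `len(word) + 1`. Positions after the "@" therefore run ahead of the true character offsets: "example" begins at character 9 of the sample address but is reported at 11. Whether this matters for the real index has not been checked here.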
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
210 def tokenize(self, value, flat_stemming=True):
1473
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
211 """ Yield a stream of lower cased raw and stemmed words from a string.
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
212
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
213 @param value: string to split, must be a unicode object or a list of
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
214 unicode objects
1496
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
215 @keyword flat_stemming: whether to yield stemmed terms automatically
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
216 with the natural forms (True) or
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
217 yield both at once as a tuple (False)
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
218 """
925
4508fc92fcb1 index exact positions of terms (postings)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 924
diff changeset
219 for word, pos in self.raw_tokenize(value):
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
220 if flat_stemming:
925
4508fc92fcb1 index exact positions of terms (postings)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 924
diff changeset
221 yield (word, pos)
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
222 if self.stemmer:
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
223 yield (self.stemmer(word), pos)
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
224 else:
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
225 yield (word, self.stemmer(word), pos)
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
226
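The two output shapes selected by `flat_stemming` can be compared with a toy stemmer. This is a minimal sketch, not the class method: `toy_stemmer` is a hypothetical stand-in for the `xapian.Stem` wrapper, the function takes an already tokenized list instead of calling `raw_tokenize`, and the fallback to the raw word when no stemmer is set is an assumption of the sketch.

```python
def toy_stemmer(word):
    """Hypothetical stemmer: strips one trailing 's'. Not xapian.Stem."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word


def tokenize(raw_tokens, stemmer=None, flat_stemming=True):
    """Yield (word, pos) pairs, or (word, stemmed, pos) triples."""
    for word, pos in raw_tokens:
        if flat_stemming:
            # Flat mode: raw and stemmed forms are separate items
            # that share one position.
            yield (word, pos)
            if stemmer:
                yield (stemmer(word), pos)
        else:
            # Tuple mode: both forms travel together.
            # Assumption of this sketch: reuse the raw word without a stemmer.
            stemmed = stemmer(word) if stemmer else word
            yield (word, stemmed, pos)


raw = [("pages", 0), ("wiki", 6)]
flat = list(tokenize(raw, toy_stemmer, flat_stemming=True))
paired = list(tokenize(raw, toy_stemmer, flat_stemming=False))
```

Flat mode suits indexing, where both forms become terms at the same position; tuple mode lets a caller decide per word how to combine the raw and stemmed forms.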
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
227
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
228 #############################################################################
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
229 ### Indexing
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
230 #############################################################################
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
231
4981
f1d1d8105d52 Xapian2009: pep8 fixes.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4979
diff changeset
232
921
45e286183872 abstraction work on search engine index & cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 919
diff changeset
233 class Index(BaseIndex):
4971
21bc8092a009 Xapian2009: Search is done using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
234
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
235 def __init__(self, request):
1475
925bbdfe0ab9 check if the correct version of xapian is installed
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1473
diff changeset
236 self._check_version()
921
45e286183872 abstraction work on search engine index & cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 919
diff changeset
237 BaseIndex.__init__(self, request)
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
238
1475
925bbdfe0ab9 check if the correct version of xapian is installed
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1473
diff changeset
239 def _check_version(self):
925bbdfe0ab9 check if the correct version of xapian is installed
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1473
diff changeset
240 """ Checks if the correct version of Xapian is installed """
4972
af698a181b01 Xapian2009: AndExpression and TextSearch xapian_term() refactoring. It does not receive allterms parameter, but xappy.SearchConnection.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4971
diff changeset
241 # XXX xappy checks version on import!
2221
f5b9f51e67a9 fix xapian version check, use non-deprecated functions for it with fallback to depracated functions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1915
diff changeset
242 # every version greater than or equal to XAPIAN_MIN_VERSION is allowed
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
243 XAPIAN_MIN_VERSION = (1, 0, 0)
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
244 major, minor, revision = xapian.major_version(), xapian.minor_version(), xapian.revision()
2221
f5b9f51e67a9 fix xapian version check, use non-deprecated functions for it with fallback to depracated functions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1915
diff changeset
245 if (major, minor, revision) >= XAPIAN_MIN_VERSION:
1475
925bbdfe0ab9 check if the correct version of xapian is installed
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1473
diff changeset
246 return
1496
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
247
1475
925bbdfe0ab9 check if the correct version of xapian is installed
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1473
diff changeset
248 from MoinMoin.error import ConfigurationError
2221
f5b9f51e67a9 fix xapian version check, use non-deprecated functions for it with fallback to depracated functions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1915
diff changeset
249 raise ConfigurationError(('MoinMoin needs at least Xapian version '
f5b9f51e67a9 fix xapian version check, use non-deprecated functions for it with fallback to depracated functions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1915
diff changeset
250 '%d.%d.%d to work correctly. Either disable Xapian '
f5b9f51e67a9 fix xapian version check, use non-deprecated functions for it with fallback to depracated functions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1915
diff changeset
251 'completely in your wikiconfig or upgrade your Xapian %d.%d.%d '
f5b9f51e67a9 fix xapian version check, use non-deprecated functions for it with fallback to depracated functions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1915
diff changeset
252 'installation!') % (XAPIAN_MIN_VERSION + (major, minor, revision)))
1475
925bbdfe0ab9 check if the correct version of xapian is installed
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1473
diff changeset
253
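The version check relies on Python comparing tuples element by element. The sketch below shows that comparison and the way the error message is formatted, without importing xapian; `version_ok` and `version_message` are hypothetical helper names, and the version numbers passed in are made up.

```python
XAPIAN_MIN_VERSION = (1, 0, 0)


def version_ok(installed, minimum=XAPIAN_MIN_VERSION):
    """Tuples compare element by element, so (1, 10, 0) > (1, 9, 0)."""
    return tuple(installed) >= tuple(minimum)


def version_message(installed, minimum=XAPIAN_MIN_VERSION):
    """Format the required and installed versions with one six-value tuple."""
    template = "needs at least %d.%d.%d, found %d.%d.%d"
    return template % (tuple(minimum) + tuple(installed))


ok_new = version_ok((1, 0, 13))
ok_old = version_ok((0, 9, 10))
message = version_message((0, 9, 10))
```

Comparing integer tuples avoids the pitfall of comparing version strings, where "1.10.0" would sort before "1.9.0".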
855
481c72d4a181 support for common indices directory cfg.xapian_index_dir
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 852
diff changeset
254 def _main_dir(self):
1473
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
255 """ Get the directory of the xapian index """
855
481c72d4a181 support for common indices directory cfg.xapian_index_dir
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 852
diff changeset
256 if self.request.cfg.xapian_index_dir:
481c72d4a181 support for common indices directory cfg.xapian_index_dir
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 852
diff changeset
257 return os.path.join(self.request.cfg.xapian_index_dir,
481c72d4a181 support for common indices directory cfg.xapian_index_dir
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 852
diff changeset
258 self.request.cfg.siteid)
481c72d4a181 support for common indices directory cfg.xapian_index_dir
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 852
diff changeset
259 else:
868
01750f3c867c fix backtrace when not using xapian_index_dir
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 857
diff changeset
260 return os.path.join(self.request.cfg.cache_dir, 'xapian')
855
481c72d4a181 support for common indices directory cfg.xapian_index_dir
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 852
diff changeset
261
980
f472ddeba121 SystemInfo macro extended with the state of the index, ensure fallback to moinSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 946
diff changeset
262 def exists(self):
f472ddeba121 SystemInfo macro extended with the state of the index, ensure fallback to moinSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 946
diff changeset
263 """ Check if the Xapian index exists """
f472ddeba121 SystemInfo macro extended with the state of the index, ensure fallback to moinSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 946
diff changeset
264 return BaseIndex.exists(self) and os.listdir(self.dir)
f472ddeba121 SystemInfo macro extended with the state of the index, ensure fallback to moinSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 946
diff changeset
265
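`_main_dir` and `exists` together decide where the index lives and whether it counts as present. The following sketch reproduces that logic with plain functions and a temporary directory. It assumes that `BaseIndex.exists` amounts to a directory check, which is not shown in this listing; the function names and sample paths are hypothetical.

```python
import os
import tempfile


def index_dir(xapian_index_dir, siteid, cache_dir):
    """A shared index directory gets one subdirectory per site."""
    if xapian_index_dir:
        return os.path.join(xapian_index_dir, siteid)
    return os.path.join(cache_dir, "xapian")


def index_exists(path):
    """Assumed base check (directory present) plus 'directory is not empty'."""
    return os.path.isdir(path) and bool(os.listdir(path))


shared = index_dir("/srv/indices", "mywiki", "/var/cache/moin")
private = index_dir(None, "mywiki", "/var/cache/moin")

tmp = tempfile.mkdtemp()
empty_result = index_exists(tmp)
with open(os.path.join(tmp, "record.DB"), "w") as f:
    f.write("placeholder")
filled_result = index_exists(tmp)
```

Note that the method in the listing returns the directory listing itself rather than a boolean; the sketch wraps it in `bool()` only to make the result explicit.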
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
266 def _search(self, query, sort='weight', historysearch=0):
4971
21bc8092a009 Xapian2009: Search is done using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
267 """
21bc8092a009 Xapian2009: Search is done using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
268 Perform the search using xapian (read-lock acquired)
2286
01f05e74aa9c Big PEP8 and whitespace cleanup
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2229
diff changeset
269
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
270 @param query: the search query object
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
271 @keyword sort: the sorting of the results (default: 'weight')
1794
c3288587c552 Eclipse PyDev Check: fixed some more errors and warnings
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1793
diff changeset
272 @keyword historysearch: whether to search in all page revisions (default: 0) TODO: use/implement this
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
273 """
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
274 while True:
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
275 try:
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
276 searcher, timestamp = self.request.cfg.xapian_searchers.pop()
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
277 if timestamp != self.mtime():
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
278 searcher.close()
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
279 else:
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
280 break
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
281 except IndexError:
4979
d2cda24a5475 Xapian2009: MoinSearchConnection class was introduced which inherits from xappy.SearchConnection and provides get_all_documents method. Various cahnges to the queryparser.py now tests fail on asserts, but not in the errors in the code.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4976
diff changeset
282 searcher = MoinSearchConnection(self.dir)
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
283 timestamp = self.mtime()
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
284 break
1496
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
285
1237
0a947454dec7 use xapian for sorting search results
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1226
diff changeset
286 kw = {}
0a947454dec7 use xapian for sorting search results
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1226
diff changeset
287 if sort == 'page_name':
4969
75cbff83e907 Xapian2009:Part of xapian search code was refactored. Search._xapianMatch is broken must be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
288 kw['sortby'] = 'pagename'
1237
0a947454dec7 use xapian for sorting search results
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1226
diff changeset
289
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
290 # Refresh connection, since it may be outdated.
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
291 searcher.reopen()
4972
af698a181b01 Xapian2009: AndExpression and TextSearch xapian_term() refactoring. It does not receive allterms parameter, but xappy.SearchConnection.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4971
diff changeset
292 query = query.xapian_term(self.request, searcher)
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
293
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
294 # Get the maximum possible number of hits from xappy, which is the number of documents in the index.
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
295 document_count = searcher.get_doccount()
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
296 hits = searcher.search(query, 0, document_count, **kw)
4971
21bc8092a009 Xapian2009: Search is done using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4968
diff changeset
297
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
298 self.request.cfg.xapian_searchers.append((searcher, timestamp))
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
299 return hits
1496
70e94a679c47 cleanup whitespace, add/fix comments
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1480
diff changeset
300
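The `while True` loop at the top of `_search` implements a small searcher pool: cached connections are popped until one matches the current index modification time, stale ones are closed, and a new connection is opened once the pool is empty. The sketch below shows that pattern on its own. `FakeSearcher`, `acquire_searcher` and the integer timestamps are hypothetical stand-ins for `MoinSearchConnection`, the inline loop and `self.mtime()`.

```python
class FakeSearcher(object):
    """Hypothetical stand-in for a search connection."""

    def __init__(self, label):
        self.label = label
        self.closed = False

    def close(self):
        self.closed = True


def acquire_searcher(pool, current_mtime, open_searcher):
    """Pop cached searchers until a fresh one is found, else open a new one."""
    while True:
        try:
            searcher, timestamp = pool.pop()
        except IndexError:
            # Pool exhausted: open a new connection stamped with the current mtime.
            return open_searcher(), current_mtime
        if timestamp == current_mtime:
            return searcher, timestamp
        # The index changed after this searcher was cached: discard it.
        searcher.close()


stale = FakeSearcher("stale")
fresh = FakeSearcher("fresh")
pool = [(fresh, 200), (stale, 100)]

searcher, timestamp = acquire_searcher(pool, 200, lambda: FakeSearcher("new"))
# After the search the connection goes back into the pool, as in the listing.
pool.append((searcher, timestamp))
```

Because `list.pop()` takes from the end, the most recently returned searcher is tried first.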
921
45e286183872 abstraction work on search engine index & cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 919
diff changeset
301 def _do_queued_updates(self, request, amount=5):
45e286183872 abstraction work on search engine index & cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 919
diff changeset
302 """ Assumes that the write lock is acquired """
1205
73f576c4bca3 fix multiconfig merge and more informative SystemInfo macro
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1201
diff changeset
303 self.touch()
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
304 connection = MoinIndexerConnection(self.dir)
1478
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
305 # do all page updates
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
306 pages = self.update_queue.pages()[:amount]
921
45e286183872 abstraction work on search engine index & cleanups
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 919
diff changeset
307 for name in pages:
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
308 self._index_page(request, connection, name, mode='update')
1478
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
309 self.update_queue.remove([name])
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
310
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
311 # do page/attachment removals
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
312 items = self.remove_queue.pages()[:amount]
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
313 for item in items:
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
314 assert len(item.split('//')) == 2
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
315 pagename, attachment = item.split('//')
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
316 page = Page(request, pagename)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
317 self._remove_item(request, connection, page, attachment)
1478
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
318 self.remove_queue.remove([item])
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
319
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
320 connection.close()
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
321
926
134b5ee99046 basic fetching of matches for terms with xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 925
diff changeset
322 def termpositions(self, uid, term):
1473
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
323 """ Fetches all positions of a term in a document
2286
01f05e74aa9c Big PEP8 and whitespace cleanup
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2229
diff changeset
324
1473
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
325 @param uid: document id of the item in the xapian index
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
326 @param term: the term as a string
b5864c9492fb ensure new attachments trigger an index update, doc update for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1465
diff changeset
327 """
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
328 raise NotImplementedError("XXX xappy doesn't require this")
926
134b5ee99046 basic fetching of matches for terms with xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 925
diff changeset
329
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
330 def _index_file(self, request, connection, filename, mode='update'):
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
331 """ Index a file as if it were a page named pagename
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
332 Assumes that the write lock is acquired
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
333 """
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
334 fs_rootpage = 'FS' # XXX FS hardcoded
946
72aeb2ba133d support complete rebuild of the index
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 926
diff changeset
335
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
336 try:
4480
af8cea9bfcda made cfg.interwikiname a unicode object (str only worked for ascii)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3856
diff changeset
337 wikiname = request.cfg.interwikiname or u'Self'
797
89c724b4de15 make mtime dependant updating work again
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 796
diff changeset
338 itemid = "%s:%s" % (wikiname, os.path.join(fs_rootpage, filename))
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
339 mtime = os.path.getmtime(filename)
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
340 mtime = wikiutil.timestamp2version(mtime)
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
341 if mode == 'update':
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
342 try:
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
343 doc = connection.get_document(itemid)
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
344 docmtime = long(doc.data['mtime'])
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
345 updated = mtime > docmtime
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
346 logging.debug("itemid %r: mtime %r > docmtime %r == updated %r" % (itemid, mtime, docmtime, updated))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
347 except KeyError:
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
348 updated = True
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
349 doc = xappy.UnprocessedDocument()
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
350 doc.id = itemid
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
351 # docmtime is unbound in this branch, so the item is always (re)indexed
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
352
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
353 elif mode == 'add':
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
354 updated = True
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
355 doc = xappy.UnprocessedDocument()
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
356 doc.id = itemid
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
357
3128
9213b197d1cb Xapian: use own logger instead of request.log
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2286
diff changeset
358 logging.debug("%s %r" % (filename, updated))
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
359
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
360 if updated:
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
361 doc.fields.append(xappy.Field('wikiname', wikiname))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
362 doc.fields.append(xappy.Field('pagename', fs_rootpage))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
363 doc.fields.append(xappy.Field('attachment', filename)) # XXX we should treat files like real pages, not attachments
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
364
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
365 doc.fields.append(xappy.Field('mtime', str(mtime)))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
366 doc.fields.append(xappy.Field('revision', '0'))
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
367 title = " ".join(os.path.join(fs_rootpage, filename).split("/"))
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
368 doc.fields.append(xappy.Field('title', title))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
369
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
370 mimetype, file_content = self.contentfilter(filename)
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
371 doc.fields.extend([xappy.Field('mimetype', mt) for mt in [mimetype, ] + mimetype.split('/')])
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
372 doc.fields.append(xappy.Field('content', file_content))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
373
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
374 # Stemming
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
375 # doc.analyzerFactory = getWikiAnalyzerFactory()
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
376
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
377 connection.replace(doc)
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
378
1805
ebcebba1afb3 removed some unused attributes, used 'dummy' for dummies
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1794
diff changeset
379 except (OSError, IOError):
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
380 pass
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
381
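The update check used by `_index_file` (and repeated in `_index_attachments`) can be sketched in isolation. This is a minimal illustration, not the module's code: `needs_reindex` and the `stored_docs` dict are hypothetical stand-ins for the `connection.get_document(itemid).data` lookup, and returning False for an unknown mode is an assumption made here.

```python
def needs_reindex(stored_docs, itemid, mtime, mode='update'):
    """Decide whether an item should be (re)indexed.

    stored_docs is a hypothetical dict mapping item ids to {'mtime': '<int>'};
    it stands in for the data of a document fetched from the index.
    """
    if mode == 'add':
        # 'add' means: index unconditionally, no checks
        return True
    if mode != 'update':
        # assumption of this sketch: unknown modes never trigger indexing
        return False
    try:
        docmtime = int(stored_docs[itemid]['mtime'])
    except KeyError:
        # item not in the index yet (or no mtime stored): index it
        return True
    # reindex only if the file is newer than the indexed copy
    return mtime > docmtime


index = {'Self:FS/notes.txt': {'mtime': '100'}}
print(needs_reindex(index, 'Self:FS/notes.txt', 101))
print(needs_reindex(index, 'Self:FS/notes.txt', 100))
print(needs_reindex(index, 'Self:FS/new.txt', 50))
```

Keeping the decision in one small function avoids touching a stored mtime in the branch where no document was found.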
847
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
382 def _get_languages(self, page):
1477
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
383 """ Get language of a page and the language to stem it in
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
384
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
385 @param page: the page instance
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
386 """
1880
b07b4c102d3d began refactoring send_page(): processing instruction extraction, getting meta/data part of page only, fixed related problems with language detection. Removed lots of duplicate or unused code.
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1805
diff changeset
387 lang = None
847
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
388 default_lang = page.request.cfg.language_default
846
04703997eb66 added language indexing
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 844
diff changeset
389
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
390 # if we should stem, we check whether a stemmer is available for the language
916
d0af8dce4d0e Xapian.use_stemming -> request.cfg.xapian_stemming and stemming lang bugfix
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 868
diff changeset
391 if page.request.cfg.xapian_stemming:
1880
b07b4c102d3d began refactoring send_page(): processing instruction extraction, getting meta/data part of page only, fixed related problems with language detection. Removed lots of duplicate or unused code.
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 1805
diff changeset
392 lang = page.pi['language']
2229
c1ef587208c0 Xapain: raise exception TypeError if Stemmer fails
Reimar Bauer <rb.proj AT googlemail DOT com>
parents: 2228
diff changeset
393 try:
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
394 xapian.Stem(lang)
2229
c1ef587208c0 Xapain: raise exception TypeError if Stemmer fails
Reimar Bauer <rb.proj AT googlemail DOT com>
parents: 2228
diff changeset
395 # if there is no exception, lang is stemmable
c1ef587208c0 Xapain: raise exception TypeError if Stemmer fails
Reimar Bauer <rb.proj AT googlemail DOT com>
parents: 2228
diff changeset
396 return (lang, lang)
3676
8dc2c2fc64ef removed PyStemmer dependency by just using xapian.Stem and requiring xapian >= 1.0.0, also remove code handling older xapian versions
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3648
diff changeset
397 except xapian.InvalidArgumentError:
2229
c1ef587208c0 Xapain: raise exception TypeError if Stemmer fails
Reimar Bauer <rb.proj AT googlemail DOT com>
parents: 2228
diff changeset
398 # lang is not stemmable
c1ef587208c0 Xapain: raise exception TypeError if Stemmer fails
Reimar Bauer <rb.proj AT googlemail DOT com>
parents: 2228
diff changeset
399 pass
2286
01f05e74aa9c Big PEP8 and whitespace cleanup
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2229
diff changeset
400
847
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
401 if not lang:
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
402 # no lang found at all, fall back to the default language
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
403 lang = default_lang
846
04703997eb66 added language indexing
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 844
diff changeset
404
847
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
405 # return actual lang and lang to stem in
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
406 return (lang, default_lang)
846
04703997eb66 added language indexing
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 844
diff changeset
407
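The fallback logic of `_get_languages` can be restated without a Xapian dependency. In this sketch `choose_languages` and the `can_stem` callable are hypothetical; `can_stem` stands in for the probe that constructs `xapian.Stem(lang)` and catches `xapian.InvalidArgumentError`.

```python
def choose_languages(page_lang, default_lang, stemming_enabled, can_stem):
    """Return (language, stem_language) following the method above.

    can_stem is a hypothetical callable: it returns True if a stemmer
    exists for the given language code.
    """
    lang = None
    if stemming_enabled:
        lang = page_lang
        if can_stem(lang):
            # the page language is stemmable: stem in that language
            return (lang, lang)
    if not lang:
        # no language found at all, fall back to the default
        lang = default_lang
    # actual language, and the language to stem in
    return (lang, default_lang)


available = ('en', 'de')
print(choose_languages('de', 'en', True, lambda l: l in available))
print(choose_languages('xx', 'en', True, lambda l: l in available))
print(choose_languages('de', 'en', False, lambda l: l in available))
```

Note that with stemming disabled the page language is never read, so both values are the default language; that mirrors the control flow of the method as listed.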
1199
5ce3bea2e66c index categories
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1195
diff changeset
408 def _get_categories(self, page):
1477
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
409 """ Get all categories the page belongs to through the old
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
410 regular expression
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
411
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
412 @param page: the page instance
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
413 """
1199
5ce3bea2e66c index categories
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1195
diff changeset
414 body = page.get_raw_body()
5ce3bea2e66c index categories
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1195
diff changeset
415
1200
b953b5ff4877 CategorySearch is live
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1199
diff changeset
416 prev, next = (0, 1)
b953b5ff4877 CategorySearch is live
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1199
diff changeset
417 pos = 0
b953b5ff4877 CategorySearch is live
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1199
diff changeset
418 while next:
b953b5ff4877 CategorySearch is live
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1199
diff changeset
419 if next != 1:
b953b5ff4877 CategorySearch is live
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1199
diff changeset
420 pos += next.end()
3813
a3cf0aa7bf97 category search: ignore traling whitespace after ----
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3676
diff changeset
421 prev, next = next, re.search(r'-----*\s*\r?\n', body[pos:])
1200
b953b5ff4877 CategorySearch is live
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1199
diff changeset
422
b953b5ff4877 CategorySearch is live
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1199
diff changeset
423 if not prev or prev == 1:
1199
5ce3bea2e66c index categories
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1195
diff changeset
424 return []
3573
124d0ef138aa change page_*_regex processing, see docs/CHANGES (fixes Xapian category search for non-english)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3569
diff changeset
425 # for CategoryFoo, group 'all' matched CategoryFoo, group 'key' matched just Foo
5005
dbec2f99c0dc Xapian2009: BaseExpression._get_query_for_search_re() now checks all field values, not only the first one. xapian_term() returns not empty queries for link:re: and link:re:case:. Categories are stored as they are, they are not lowercased. Comment why xapian_term for category:re: does not work is added.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 5004
diff changeset
426 return [m.group('all') for m in self.request.cfg.cache.page_category_regex.finditer(body[pos:])]
1199
5ce3bea2e66c index categories
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1195
diff changeset
427
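The category lookup in `_get_categories` amounts to "find the last `----` separator line, then match category names after it". A minimal standalone sketch follows, assuming a simplified pattern: `CATEGORY_RE` is a hypothetical stand-in for `request.cfg.cache.page_category_regex`, which in a real wiki comes from the configuration, and `extract_categories` is a name invented here.

```python
import re

# Hypothetical stand-in for the configured page_category_regex:
# group 'all' matches CategoryFoo, group 'key' matches just Foo.
CATEGORY_RE = re.compile(r'(?P<all>Category(?P<key>\S+))', re.UNICODE)

# Same separator pattern as in the method above: four or more dashes,
# optional trailing whitespace, then a line break.
SEPARATOR_RE = re.compile(r'-----*\s*\r?\n')


def extract_categories(body):
    """Return category names found after the last separator line in body."""
    last_end = None
    for match in SEPARATOR_RE.finditer(body):
        last_end = match.end()
    if last_end is None:
        # no separator line, so there is no category footer
        return []
    return [m.group('all') for m in CATEGORY_RE.finditer(body[last_end:])]


print(extract_categories(u"Some text\n----\nCategoryFoo CategoryBar\n"))
```

Iterating with `finditer` replaces the manual `prev`/`next` bookkeeping and avoids shadowing the built-in `next`; it is intended to find the same final separator for this pattern.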
1226
9b101f696445 index domains of a page (standard, underlay)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1206
diff changeset
428 def _get_domains(self, page):
1477
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
429 """ Returns a generator with all the domains the page belongs to
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
430
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
431 @param page: the page instance
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
432 """
1226
9b101f696445 index domains of a page (standard, underlay)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1206
diff changeset
433 if page.isUnderlayPage():
9b101f696445 index domains of a page (standard, underlay)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1206
diff changeset
434 yield 'underlay'
9b101f696445 index domains of a page (standard, underlay)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1206
diff changeset
435 if page.isStandardPage():
9b101f696445 index domains of a page (standard, underlay)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1206
diff changeset
436 yield 'standard'
1377
bb37beca7545 fixed system pages search, added underlay search, started with mtime filtering
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1374
diff changeset
437 if wikiutil.isSystemPage(self.request, page.page_name):
bb37beca7545 fixed system pages search, added underlay search, started with mtime filtering
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1374
diff changeset
438 yield 'system'
1226
9b101f696445 index domains of a page (standard, underlay)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1206
diff changeset
439
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
440 def _index_page(self, request, connection, pagename, mode='update'):
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
441 """ Index a page - assumes that the write lock is acquired
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
442
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
443 @arg connection: the Indexer connection object
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
444 @arg pagename: a page name
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
445 @arg mode: 'add' = just add, no checks
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
446 'update' = check if already in index and update if needed (mtime)
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
447 """
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
448 p = Page(request, pagename)
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
449 if request.cfg.xapian_index_history:
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
450 for rev in p.getRevList():
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
451 updated = self._index_page_rev(request, connection, Page(request, pagename, rev=rev), mode=mode)
4541
38110c49d0a6 Xapian indexing: in update mode, do not try to re-index old revisions again
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4514
diff changeset
452 logging.debug("updated page %r rev %d (updated==%r)" % (pagename, rev, updated))
38110c49d0a6 Xapian indexing: in update mode, do not try to re-index old revisions again
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4514
diff changeset
453 if not updated:
38110c49d0a6 Xapian indexing: in update mode, do not try to re-index old revisions again
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4514
diff changeset
454 # we reached the revisions that are already present in the index
38110c49d0a6 Xapian indexing: in update mode, do not try to re-index old revisions again
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4514
diff changeset
455 break
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
456 else:
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
457 self._index_page_rev(request, connection, p, mode=mode)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
458
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
459 self._index_attachments(request, connection, pagename, mode)
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
460
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
461 def _index_attachments(self, request, connection, pagename, mode='update'):
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
462 from MoinMoin.action import AttachFile
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
463
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
464 wikiname = request.cfg.interwikiname or u"Self"
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
465 # XXX: Hack until we get proper metadata
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
466 p = Page(request, pagename)
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
467 language, stem_language = self._get_languages(p)
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
468 domains = tuple(self._get_domains(p))
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
469 updated = False
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
470
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
471 attachments = AttachFile._get_files(request, pagename)
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
472 for att in attachments:
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
473 filename = AttachFile.getFilename(request, pagename, att)
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
474 itemid = "%s:%s//%s" % (wikiname, pagename, att)
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
475 mtime = wikiutil.timestamp2version(os.path.getmtime(filename))
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
476 if mode == 'update':
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
477 try:
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
478 doc = connection.get_document(itemid)
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
479 docmtime = long(doc.data['mtime'])
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
480 updated = mtime > docmtime
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
481 except KeyError:
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
482 updated = True
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
483 doc = xappy.UnprocessedDocument()
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
484 doc.id = itemid
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
485 elif mode == 'add':
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
486 updated = True
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
487 doc = xappy.UnprocessedDocument()
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
488 doc.id = itemid
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
489
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
490 logging.debug("%s %s %r" % (pagename, att, updated))
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
491
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
492 if updated:
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
493 doc.fields.append(xappy.Field('wikiname', wikiname))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
494 doc.fields.append(xappy.Field('pagename', pagename))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
495 doc.fields.append(xappy.Field('attachment', att))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
496
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
497 doc.fields.append(xappy.Field('mtime', str(mtime)))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
498 doc.fields.append(xappy.Field('revision', '0'))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
499 doc.fields.append(xappy.Field('title', '%s/%s' % (pagename, att)))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
500
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
501 doc.fields.append(xappy.Field('lang', language))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
502 doc.fields.append(xappy.Field('stem_lang', stem_language))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
503 doc.fields.append(xappy.Field('fulltitle', pagename))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
504
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
505 mimetype, att_content = self.contentfilter(filename)
4973
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
506 doc.fields.extend([xappy.Field('mimetype', mt) for mt in [mimetype, ] + mimetype.split('/')])
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
507 doc.fields.append(xappy.Field('content', att_content))
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
508 doc.fields.extend([xappy.Field('domain', domain) for domain in domains])
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
509
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
510 # XXX Stemming
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
511 # doc.analyzerFactory = getWikiAnalyzerFactory(request, stem_language)
2f16bd87444d Xapian2009: Files and attachments are indexed using xappy.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4972
diff changeset
512 connection.replace(doc)
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
513
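The attachment branch above decides whether to re-index by comparing the file's mtime with the mtime stored in the index. The following is a minimal sketch of that decision only. It assumes a plain dict in place of the xappy connection, and `needs_reindex` is a hypothetical helper name that does not appear in Xapian.py. It is written for Python 3, so `int` replaces the `long` used in the listing.

```python
def needs_reindex(index, itemid, mtime, mode="update"):
    """Return True if the item should be (re)indexed.

    `index` is a plain dict mapping item ids to stored data dicts. It is a
    stand-in for the indexer connection (an assumption made for this sketch).
    """
    if mode == "add":
        # 'add' = just add, no checks
        return True
    if mode == "update":
        try:
            docmtime = int(index[itemid]["mtime"])
        except KeyError:
            # The item is not in the index yet, so it has to be added.
            return True
        # Re-index only if the item changed after it was last indexed.
        return mtime > docmtime
    # Any other mode leaves the index untouched.
    return False


index = {"Self:FrontPage//logo.png": {"mtime": "100"}}
print(needs_reindex(index, "Self:FrontPage//logo.png", 150))
print(needs_reindex(index, "Self:FrontPage//logo.png", 100))
print(needs_reindex(index, "Self:FrontPage//new.txt", 50))
```

Keeping the decision separate from the field-building code makes the 'update' and 'add' paths easy to test without a real index.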
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
514 def _index_page_rev(self, request, connection, page, mode='update'):
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
515 """ Index a page revision - assumes that the write lock is acquired
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
516
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
517 @arg connection: the Indexer connection object
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
518 @arg page: a page object
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
519 @arg mode: 'add' = just add, no checks
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
520 'update' = check if already in index and update if needed (mtime)
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
521 """
4514
af09c1b3a153 fixed search (see details below)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4480
diff changeset
522 request.page = page
4480
af8cea9bfcda made cfg.interwikiname a unicode object (str only worked for ascii)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3856
diff changeset
523 wikiname = request.cfg.interwikiname or u"Self"
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
524 pagename = page.page_name
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
525 mtime = page.mtime_usecs()
1368
949341c1c5ed index author and revision number
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1241
diff changeset
526 revision = str(page.get_real_rev())
1441
05482b439f89 optional history indexing and search is working
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1434
diff changeset
527 itemid = "%s:%s:%s" % (wikiname, pagename, revision)
3550
65eac5f65a11 Page.edit_info: better return empty dict than None when no edit-log entry is found
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3546
diff changeset
528 author = page.edit_info().get('editor', '?')
847
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
529 # XXX: Hack until we get proper metadata
813125ff0d74 Introducing LanguageSearch
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 846
diff changeset
530 language, stem_language = self._get_languages(page)
1199
5ce3bea2e66c index categories
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1195
diff changeset
531 categories = self._get_categories(page)
1226
9b101f696445 index domains of a page (standard, underlay)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1206
diff changeset
532 domains = tuple(self._get_domains(page))
810
413cc62c6ec4 fix for the xapian indexer
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 801
diff changeset
533 updated = False
413cc62c6ec4 fix for the xapian indexer
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 801
diff changeset
534
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
535 if mode == 'update':
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
536 try:
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
537 doc = connection.get_document(itemid)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
538 docmtime = long(doc.data['mtime'])
1499
ffa0d1f81059 final polishing round adding docstrings, comments and fixing small issues
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1497
diff changeset
539 updated = mtime > docmtime
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
540 logging.debug("itemid %r: mtime %r > docmtime %r == updated %r" % (itemid, mtime, docmtime, updated))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
541 except KeyError:
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
542 updated = True
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
543 doc = xappy.UnprocessedDocument()
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
544 doc.id = itemid
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
545
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
546 elif mode == 'add':
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
547 updated = True
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
548 doc = xappy.UnprocessedDocument()
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
549 doc.id = itemid
3648
8352dcd5a282 Xapian search: for mimetypes also index major and minor separately, so you can search for 'text' or 'plain'
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3647
diff changeset
550
849
02d6697b000d basic searching using stemmed and unstemmed terms
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 848
diff changeset
551
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
552 logging.debug("%s %r" % (pagename, updated))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
553
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
554 if updated:
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
555 doc.fields.append(xappy.Field('wikiname', wikiname))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
556 doc.fields.append(xappy.Field('pagename', pagename))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
557 doc.fields.append(xappy.Field('attachment', '')) # this is a real page, not an attachment
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
558 doc.fields.append(xappy.Field('mtime', str(mtime)))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
559 doc.fields.append(xappy.Field('revision', revision))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
560 doc.fields.append(xappy.Field('title', pagename))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
561
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
562 doc.fields.append(xappy.Field('lang', language))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
563 doc.fields.append(xappy.Field('stem_lang', stem_language))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
564 doc.fields.append(xappy.Field('fulltitle', pagename))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
565 doc.fields.append(xappy.Field('author', author))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
566
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
567 mimetype = 'text/%s' % page.pi['format'] # XXX improve this
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
568 doc.fields.extend([xappy.Field('mimetype', mt) for mt in [mimetype, ] + mimetype.split('/')])
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
569
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
570 doc.fields.extend([xappy.Field('linkto', pagelink) for pagelink in page.getPageLinks(request)])
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
571 doc.fields.extend([xappy.Field('category', category) for category in categories])
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
572 doc.fields.extend([xappy.Field('domain', domain) for domain in domains])
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
573
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
574 doc.fields.append(xappy.Field('content', page.get_raw_body()))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
575
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
576 # XXX Stemming
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
577 # doc.analyzerFactory = getWikiAnalyzerFactory(request, stem_language)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
578
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
579 logging.debug("%s (replace %r)" % (pagename, itemid))
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
580 connection.replace(doc)
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
581
4541
38110c49d0a6 Xapian indexing: in update mode, do not try to re-index old revisions again
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 4514
diff changeset
582 return updated
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
583
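The field-building part of `_index_page_rev` can be illustrated without xappy. The sketch below assumes plain `(name, value)` tuples in place of `xappy.Field` objects. The helper names `page_item_id`, `mimetype_terms` and `page_fields` are hypothetical and only mirror the expressions in the listing. It covers a subset of the fields, not the full set.

```python
def page_item_id(wikiname, pagename, revision):
    """Build the index id for one page revision ("wiki:page:revision")."""
    return "%s:%s:%s" % (wikiname, pagename, revision)


def mimetype_terms(mimetype):
    """Return the full mimetype plus its major and minor parts.

    Indexing all three lets a search match 'text/wiki', 'text' or 'wiki'.
    """
    return [mimetype] + mimetype.split("/")


def page_fields(wikiname, pagename, revision, mtime, page_format, categories):
    """Collect (name, value) pairs for a page, as a stand-in for xappy.Field."""
    mimetype = "text/%s" % page_format
    fields = [
        ("wikiname", wikiname),
        ("pagename", pagename),
        ("attachment", ""),  # a real page, not an attachment
        ("mtime", str(mtime)),
        ("revision", revision),
        ("title", pagename),
    ]
    # Multi-valued fields are added once per value.
    fields.extend(("mimetype", mt) for mt in mimetype_terms(mimetype))
    fields.extend(("category", c) for c in categories)
    return fields


print(page_item_id("Self", "FrontPage", "3"))
for name, value in page_fields("Self", "FrontPage", "3", 1250000000, "wiki", ["CategoryHomepage"]):
    print(name, repr(value))
```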
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
584 def _remove_item(self, request, connection, page, attachment=None):
4480
af8cea9bfcda made cfg.interwikiname a unicode object (str only worked for ascii)
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 3856
diff changeset
585 wikiname = request.cfg.interwikiname or u'Self'
1478
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
586 pagename = page.page_name
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
587
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
588 if not attachment:
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
589
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
590 search_connection = MoinSearchConnection(self.dir)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
591 docs_to_delete = search_connection.get_all_documents_with_field('fulltitle', pagename)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
592 ids_to_delete = [d.id for d in docs_to_delete]
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
593 search_connection.close()
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
594
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
595 for id in ids_to_delete:
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
596 connection.delete(id)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
597 logging.debug('%s removed from xapian index' % pagename)
1478
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
598 else:
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
599 # Only remove a single attachment
5002
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
600 id = "%s:%s//%s" % (wikiname, pagename, attachment)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
601 connection.delete(id)
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
602
e64a9fe80f6d Xapian2009: Xapian search queries an index once, not twice to get all hits for a query. Xappy is used for the index update.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4991
diff changeset
603 logging.debug('attachment %s from %s removed from index' % (attachment, pagename))
1478
53e9c1db5ace support for page/attachment removal and renaming (preliminary commit to show activity, needs more testing)
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1477
diff changeset
604
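`_remove_item` has two paths: removing every document whose `fulltitle` matches the page, or removing a single attachment by its id. The sketch below shows both paths under the assumption that a plain dict stands in for the index. `remove_item` here is a hypothetical free function, not the method from the listing. Because attachments are indexed with `fulltitle` set to the page name, the page path in this sketch removes the page's attachments as well.

```python
def remove_item(index, wikiname, pagename, attachment=None):
    """Remove a page (all matching documents) or a single attachment.

    `index` is a dict mapping item ids to data dicts (an assumption for
    this sketch). Returns the list of ids that were targeted for deletion.
    """
    if not attachment:
        # Collect the ids first, then delete, so the index is not
        # modified while it is being searched.
        ids_to_delete = sorted(
            item_id for item_id, data in index.items()
            if data.get("fulltitle") == pagename
        )
        for item_id in ids_to_delete:
            del index[item_id]
        return ids_to_delete

    # Only remove a single attachment; its id uses the "//" separator.
    item_id = "%s:%s//%s" % (wikiname, pagename, attachment)
    index.pop(item_id, None)
    return [item_id]


index = {
    "Self:FrontPage:1": {"fulltitle": "FrontPage"},
    "Self:FrontPage:2": {"fulltitle": "FrontPage"},
    "Self:FrontPage//logo.png": {"fulltitle": "FrontPage"},
    "Self:HelpContents:1": {"fulltitle": "HelpContents"},
}
print(remove_item(index, "Self", "FrontPage", attachment="logo.png"))
print(remove_item(index, "Self", "FrontPage"))
print(sorted(index))
```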
4991
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
605 def _index_pages(self, request, files=None, mode='update', pages=None):
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
606 """ Index pages (and all given files)
2286
01f05e74aa9c Big PEP8 and whitespace cleanup
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2229
diff changeset
607
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
608 This should be called from indexPages or indexPagesInNewThread only!
2286
01f05e74aa9c Big PEP8 and whitespace cleanup
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2229
diff changeset
609
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
610 This may take some time, depending on the size of the wiki and speed
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
611 of the machine.
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
612
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
613 When called in a new thread, lock is acquired before the call,
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
614 and this method must release it when it finishes or fails.
1477
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
615
7f5a6374e0e1 finish code docs for MoinMoin.search.Xapian
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 1475
diff changeset
616 @param request: the current request
4991
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
617 @param files: an optional list of files to index
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
618 @param mode: how to index the files, either 'add', 'update' or 'rebuild'
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
619 @param pages: list of pages to index, if not given, all pages are indexed
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
620
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
621 """
4991
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
622 if pages is None:
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
623 # Index all pages
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
624 pages = request.rootpage.getPageList(user='', exists=1)
981
dbb3bf01ae19 the index rebuild code was in the wrong spot
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 980
diff changeset
625
dbb3bf01ae19 the index rebuild code was in the wrong spot
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 980
diff changeset
626 # rebuilding the DB: delete it and add everything
dbb3bf01ae19 the index rebuild code was in the wrong spot
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 980
diff changeset
627 if mode == 'rebuild':
dbb3bf01ae19 the index rebuild code was in the wrong spot
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 980
diff changeset
628 for f in os.listdir(self.dir):
982
541271bb8a56 fix for rebuilding the index again, get the full path for each file
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 981
diff changeset
629 os.unlink(os.path.join(self.dir, f))
981
dbb3bf01ae19 the index rebuild code was in the wrong spot
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 980
diff changeset
630 mode = 'add'
dbb3bf01ae19 the index rebuild code was in the wrong spot
Franz Pletz <fpletz AT franz-pletz DOT org>
parents: 980
diff changeset
631
5015
bc42755b5820 Xapian2009: AndExpression.xapian_term() was refactored. Tests for "and" were updated.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 5005
diff changeset
632 connection = MoinIndexerConnection(self.dir)
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
633 try:
4991
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
634 self.touch()
3128
9213b197d1cb Xapian: use own logger instead of request.log
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2286
diff changeset
635 logging.debug("indexing %d pages...", len(pages))
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
636 for pagename in pages:
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
637 self._index_page(request, connection, pagename, mode=mode)
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
638 if files:
3128
9213b197d1cb Xapian: use own logger instead of request.log
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents: 2286
diff changeset
639 logging.debug("indexing all files...")
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
640 for fname in files:
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
641 fname = fname.strip()
4991
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
642 self._index_file(request, connection, fname, mode)
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
643 connection.flush()
4991
d39bdb239da4 Xapian2009: py.test.importorskip in tests was removed, tests try import Xapian, and on ImportError skip a test. Index.indexPages now takes a pages parameter - list of pages which must be indexed.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4985
diff changeset
644 finally:
4968
b0afbf750a24 Xapian2009: Xapian Index._index_page_rev() was refactored and now uses xappy. Files and attachments are not indexed, indexing methods are still needed to be refactored.
Dmitrijs Milajevs <dimazest@gmail.com>
parents: 4541
diff changeset
645 connection.close()
788
4840926790f5 indexed search: added Xapian support (needs more work), removed Lupy support
Thomas Waldmann <tw AT waldmann-edv DOT de>
parents:
diff changeset
646
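The control flow of the method above (empty the index directory on 'rebuild', fall through to 'add', and close the connection in a `finally` block) can be sketched in isolation. This is only an illustrative sketch, not the MoinMoin implementation: `FakeConnection` and `index_pages` are hypothetical names standing in for `MoinIndexerConnection` and the real method, and the per-page indexing call is replaced by a simple record of what would have been indexed.

```python
import os
import tempfile


class FakeConnection(object):
    """Hypothetical stand-in for MoinIndexerConnection; it only records calls."""

    def __init__(self, index_dir):
        self.index_dir = index_dir
        self.indexed = []
        self.flushed = False
        self.closed = False

    def flush(self):
        self.flushed = True

    def close(self):
        self.closed = True


def index_pages(index_dir, pages, mode='update'):
    """Index the given page names; 'rebuild' empties index_dir first."""
    if mode == 'rebuild':
        # Delete every file of the old index, then proceed as a plain 'add'.
        for name in os.listdir(index_dir):
            os.unlink(os.path.join(index_dir, name))
        mode = 'add'

    connection = FakeConnection(index_dir)
    try:
        for pagename in pages:
            # The real code calls self._index_page(...) here (assumption: we
            # only record the page name and the effective mode).
            connection.indexed.append((pagename, mode))
        connection.flush()
    finally:
        # Runs even if indexing raises, so the connection is always closed.
        connection.close()
    return connection


# Usage: a stale file in the index directory is removed by 'rebuild'.
index_dir = tempfile.mkdtemp()
open(os.path.join(index_dir, 'stale.db'), 'w').close()
conn = index_pages(index_dir, ['FrontPage', 'HelpContents'], mode='rebuild')
```

Resetting `mode` to 'add' after the wipe means the per-page code never needs a separate 'rebuild' branch: it only ever sees 'add' or 'update'.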