What GIN Indexes Do in PostgreSQL

Published: 2021-11-09 10:49:17 | Source: 億速云 | Views: 260 | Author: iii | Category: Relational Databases


The main use of a GIN index is to speed up full-text search.

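Full-text search
The goal of full-text search is to find, in a set of documents, those that match a search condition. A search engine that finds many matching documents has to pick out the best matches, but a database query only needs to return the documents that satisfy the condition.

In PostgreSQL, a document is converted for search purposes into the special type tsvector, which holds lexemes and their positions in the document. Lexemes are words normalized into a form suitable for searching (typically lowercased and reduced to a stem). For example: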

testdb=# select to_tsvector('There was a crooked man, and he walked a crooked mile');
               to_tsvector               
-----------------------------------------
 'crook':4,10 'man':5 'mile':11 'walk':8
(1 row)

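The output shows that parsing produced the lexemes crook, man, mile and walk, at positions 4 and 10, 5, 11, and 8 respectively. Words such as "there" are missing because they are stop words: from a search engine's point of view they are too common to be worth recording. Which words count as stop words is configurable.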
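
The lexeme-and-position structure can be sketched in a few lines of Python. This is a toy model only: the stop-word list and the "strip a trailing ed" stemmer are assumptions made for this one sentence, not PostgreSQL's dictionaries or stemming rules.

```python
import re

# Assumption: a hand-picked stop-word list, not PostgreSQL's english dictionary.
STOP_WORDS = {"there", "was", "a", "and", "he"}

def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a trailing 'ed' from longer words."""
    return word[:-2] if word.endswith("ed") and len(word) > 4 else word

def toy_tsvector(text):
    """Map each lexeme to the 1-based positions at which it occurs."""
    vector = {}
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words, start=1):
        if word in STOP_WORDS:
            continue  # a stop word still consumes a position but is not stored
        vector.setdefault(toy_stem(word), []).append(position)
    return vector

vec = toy_tsvector("There was a crooked man, and he walked a crooked mile")
print(sorted(vec.items()))
# [('crook', [4, 10]), ('man', [5]), ('mile', [11]), ('walk', [8])]
```

Note that skipped stop words still advance the position counter, which is why crook sits at positions 4 and 10 rather than 1 and 4.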

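A full-text query in PostgreSQL is expressed as a tsquery. The search condition consists of one or more lexemes joined by the operators AND (&), OR (|) and NOT (!); parentheses can be used to make operator precedence explicit.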

testdb=# select to_tsquery('man & (walking | running)');
         to_tsquery         
----------------------------
 'man' & ( 'walk' | 'run' )
(1 row)

The @@ operator tests whether a tsvector matches a tsquery:

testdb=# select to_tsvector('There was a crooked man, and he walked a crooked mile') @@ to_tsquery('man & (walking | running)');
 ?column? 
----------
 t
(1 row)
testdb=# select to_tsvector('There was a crooked man, and he walked a crooked mile') @@ to_tsquery('man & (going | running)');
 ?column? 
----------
 f
(1 row)
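
The boolean matching behind these two results can be illustrated with a small Python sketch. It assumes the query has already been parsed into a nested tuple of lexemes (a representation invented here for illustration, with "going" normalized to "go") and ignores positions, weights and phrase search.

```python
def matches(lexemes, query):
    """Evaluate a boolean query tree against a set of lexemes.

    A query is either a lexeme string or a tuple of the form
    ("and", q1, q2, ...), ("or", q1, q2, ...) or ("not", q).
    """
    if isinstance(query, str):
        return query in lexemes
    op, *args = query
    if op == "and":
        return all(matches(lexemes, arg) for arg in args)
    if op == "or":
        return any(matches(lexemes, arg) for arg in args)
    if op == "not":
        return not matches(lexemes, args[0])
    raise ValueError(f"unknown operator: {op}")

# Lexemes of the example sentence, as produced by to_tsvector above.
doc = {"crook", "man", "mile", "walk"}

walking_or_running = ("and", "man", ("or", "walk", "run"))
going_or_running = ("and", "man", ("or", "go", "run"))

print(matches(doc, walking_or_running))  # True
print(matches(doc, going_or_running))    # False
```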

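Introduction to GIN
GIN stands for Generalized Inverted Index; anyone familiar with search engines will recognize the idea. It works with data types whose values are not atomic but consist of elements. Such types are called compound types, and it is the elements inside the values that get indexed, not the values themselves.
A good analogy is the index at the back of a book, which lists for each term the pages where that term appears. The access method (AM) has to make lookups of indexed elements fast, so the elements are stored in a structure similar to a B-tree. Linked to each element is an ordered set of references to the table rows whose compound values contain that element. The ordering is of little importance for data retrieval (the sort order of TIDs carries no meaning), but it matters for the internal structure of the index.
Elements are never deleted from a GIN index. The reasoning is that values containing elements may appear, disappear or change, but the set of elements they are built from stays mostly stable. This design greatly simplifies the algorithms that let multiple processes work with the index concurrently.

If the list of TIDs is small, it is stored in the same page as its element (a posting list). If the list is large, it is stored as a B-tree, a more efficient structure, on separate data pages (a posting tree).
A GIN index therefore consists of a B-tree of elements, with a B-tree or a flat list of TIDs linked to each leaf row of that B-tree.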
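
The shape of this structure can be sketched as a toy inverted index in Python. Everything here is an illustrative assumption: a dict stands in for the element B-tree, a sorted Python list stands in for both posting lists and posting trees, the sample rows are made up, and POSTING_LIST_MAX is a hypothetical entry-count threshold (PostgreSQL decides by the space available on the page, not by a fixed count).

```python
import bisect

POSTING_LIST_MAX = 3  # hypothetical threshold, for illustration only

class ToyGin:
    def __init__(self):
        # element -> sorted list of TIDs; the dict stands in for the element B-tree
        self.entries = {}

    def insert(self, tid, elements):
        """Index one row: add its TID under every distinct element it contains."""
        for element in set(elements):
            tids = self.entries.setdefault(element, [])
            if tid not in tids:
                bisect.insort(tids, tid)  # keep the TIDs ordered

    def lookup(self, element):
        return list(self.entries.get(element, []))

    def storage_kind(self, element):
        """Report which structure the toy model would use for this element."""
        if len(self.entries[element]) <= POSTING_LIST_MAX:
            return "posting list"
        return "posting tree"

index = ToyGin()
rows = {  # made-up rows: TID (page, offset) -> elements of the compound value
    (0, 1): ["sheet", "slit"],
    (0, 2): ["sheet", "sit"],
    (0, 3): ["sheet", "slitter"],
    (0, 4): ["sheet", "slit", "upon"],
}
for tid, elements in rows.items():
    index.insert(tid, elements)

print(index.lookup("slit"))         # [(0, 1), (0, 4)]
print(index.storage_kind("slit"))   # posting list
print(index.storage_kind("sheet"))  # posting tree
```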

Like GiST and SP-GiST indexes, GIN gives application developers an interface for supporting a variety of operations on compound data types.

As an example, here is a table ts with a GIN index on it:

testdb=# drop table if exists ts;
psql: NOTICE:  table "ts" does not exist, skipping
DROP TABLE
testdb=# create table ts(doc text, doc_tsv tsvector);
CREATE TABLE
testdb=# truncate table ts;
TRUNCATE TABLE
testdb=# insert into ts(doc) values
testdb-#   ('Can a sheet slitter slit sheets?'), 
testdb-#   ('How many sheets could a sheet slitter slit?'),
testdb-#   ('I slit a sheet, a sheet I slit.'),
testdb-#   ('Upon a slitted sheet I sit.'), 
testdb-#   ('Whoever slit the sheets is a good sheet slitter.'), 
testdb-#   ('I am a sheet slitter.'),
testdb-#   ('I slit sheets.'),
testdb-#   ('I am the sleekest sheet slitter that ever slit sheets.'),
testdb-#   ('She slits the sheet she sits on.');
INSERT 0 9
testdb=# 
testdb=# update ts set doc_tsv = to_tsvector(doc);
UPDATE 9
testdb=# 
testdb=# create index on ts using gin(doc_tsv);
CREATE INDEX

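In the diagram of this index, references to table rows (TIDs) are drawn as numbers on a dark background (page number plus offset within the page) rather than as arrows.
Unlike a regular B-tree, the pages of a GIN index are connected by a singly linked list rather than a doubly linked one, because the tree is only ever traversed in one direction.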

testdb=#  select ctid, left(doc,20), doc_tsv from ts;
  ctid  |         left         |                         doc_tsv                         
--------+----------------------+---------------------------------------------------------
 (0,10) | Can a sheet slitter  | 'sheet':3,6 'slit':5 'slitter':4
 (0,11) | How many sheets coul | 'could':4 'mani':2 'sheet':3,6 'slit':8 'slitter':7
 (0,12) | I slit a sheet, a sh | 'sheet':4,6 'slit':2,8
 (0,13) | Upon a slitted sheet | 'sheet':4 'sit':6 'slit':3 'upon':1
 (0,14) | Whoever slit the she | 'good':7 'sheet':4,8 'slit':2 'slitter':9 'whoever':1
 (0,15) | I am a sheet slitter | 'sheet':4 'slitter':5
 (0,16) | I slit sheets.       | 'sheet':3 'slit':2
 (0,17) | I am the sleekest sh | 'ever':8 'sheet':5,10 'sleekest':4 'slit':9 'slitter':6
 (0,18) | She slits the sheet  | 'sheet':4 'sit':6 'slit':2
(9 rows)

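In this illustration, the TIDs for sheet, slit and slitter are stored in B-trees (posting trees), while the other elements use plain lists (posting lists).

How can we find out how many documents contain each lexeme?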

testdb=# select (unnest(doc_tsv)).lexeme, count(*) from ts
testdb-# group by 1 order by 2 desc;
  lexeme  | count 
----------+-------
 sheet    |     9
 slit     |     8
 slitter  |     5
 sit      |     2
 upon     |     1
 mani     |     1
 whoever  |     1
 sleekest |     1
 good     |     1
 could    |     1
 ever     |     1
(11 rows)

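The following example shows how a scan through a GIN index works. The planner first picks a sequential scan because the table is tiny; once enable_seqscan is turned off, it uses the index: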

testdb=# explain(costs off)
testdb-# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                        QUERY PLAN                         
-----------------------------------------------------------
 Seq Scan on ts
   Filter: (doc_tsv @@ to_tsquery('many & slitter'::text))
(2 rows)
testdb=# set enable_seqscan=off;
SET
testdb=# explain(costs off)
select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                             QUERY PLAN                              
---------------------------------------------------------------------
 Bitmap Heap Scan on ts
   Recheck Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
   ->  Bitmap Index Scan on ts_doc_tsv_idx
         Index Cond: (doc_tsv @@ to_tsquery('many & slitter'::text))
(4 rows)

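To execute this query, the individual lexemes (the search keys) are first extracted from the query: mani and slitter. A dedicated API function does this, taking into account the data type and the strategy determined by the operator class. The following query lists the operators and strategy numbers that the tsvector_ops operator class supports: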

testdb=# select amop.amopopr::regoperator, amop.amopstrategy
testdb-# from pg_opclass opc, pg_opfamily opf, pg_am am, pg_amop amop
testdb-# where opc.opcname = 'tsvector_ops'
testdb-# and opf.oid = opc.opcfamily
testdb-# and am.oid = opf.opfmethod
testdb-# and amop.amopfamily = opc.opcfamily
testdb-# and am.amname = 'gin'
testdb-# and amop.amoplefttype = opc.opcintype;
        amopopr        | amopstrategy 
-----------------------+--------------
 @@(tsvector,tsquery)  |            1
 @@@(tsvector,tsquery) |            2
(2 rows)

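Returning to the example: the next step is to look up both keys in the B-tree of lexemes and walk their lists of TIDs. With the ctid values listed earlier, this gives:
mani: (0,11)
slitter: (0,10), (0,11), (0,14), (0,15), (0,17)

For each TID found, the consistency function of the API is called to decide whether that row matches the search query. Because the lexemes in the query are joined by AND, only (0,11) is returned.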
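
The lookup-then-check sequence can be sketched in Python as follows. The posting lists are taken from the ctid and doc_tsv columns of the table listing above; the function names gin_scan and consistent_and are hypothetical, and the real consistency function receives more information than this toy one.

```python
# TIDs per lexeme, read off the ctid / doc_tsv listing of table ts.
posting_lists = {
    "mani": [(0, 11)],
    "slitter": [(0, 10), (0, 11), (0, 14), (0, 15), (0, 17)],
}

def consistent_and(present):
    """Toy consistency check for an AND query: every key must be present."""
    return all(present.values())

def gin_scan(keys, lists, consistent):
    """Collect candidate TIDs from all keys, then keep those that pass the check."""
    candidates = sorted(set().union(*(lists.get(key, []) for key in keys)))
    result = []
    for tid in candidates:
        # For each candidate row, record which search keys it contains.
        present = {key: tid in lists.get(key, []) for key in keys}
        if consistent(present):
            result.append(tid)
    return result

hits = gin_scan(["mani", "slitter"], posting_lists, consistent_and)
print(hits)  # [(0, 11)]
```

Swapping in a different consistency function (for example, one that returns any(present.values()) for an OR query) changes the result without touching the scan itself, which is the point of keeping the check behind an interface.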

testdb=# select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                     doc                     
---------------------------------------------
 How many sheets could a sheet slitter slit?
(1 row)

Slow Update
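Inserting into or updating a column with a GIN index (INSERT and UPDATE in particular) is fairly slow. A document usually contains many lexemes that need indexing, so adding or updating even a single document touches many places in the index tree. On the other hand, when several documents are updated together, some of their lexemes are likely to coincide, so the total work can be less than updating the documents one by one.
PostgreSQL provides the fastupdate storage parameter for this. When it is on, updates are accumulated in a separate unordered list, and the index itself is updated only when that list exceeds a threshold (the gin_pending_list_limit configuration parameter, or the index storage parameter of the same name). The technique has drawbacks: first, searches become slower because the unordered list has to be scanned in addition to the tree; second, an update that happens to trigger the merge into the index takes noticeably longer than the others.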
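
The trade-off can be made concrete with a toy Python model of a pending list. The class and its fields are hypothetical; in particular, pending_limit counts entries here, whereas gin_pending_list_limit is a size in kilobytes.

```python
class ToyFastUpdateIndex:
    def __init__(self, pending_limit):
        self.pending_limit = pending_limit  # entry count; the real setting is a size in kB
        self.pending = []   # unordered (tid, lexeme) pairs awaiting the merge
        self.main = {}      # lexeme -> set of TIDs; stands in for the main index
        self.flushes = 0

    def insert(self, tid, lexemes):
        """Cheap path: append to the pending list, merging only on overflow."""
        self.pending.extend((tid, lexeme) for lexeme in lexemes)
        if len(self.pending) > self.pending_limit:
            self.flush()  # this one insert pays for the whole batch

    def flush(self):
        """Merge every accumulated entry into the main structure in one go."""
        for tid, lexeme in self.pending:
            self.main.setdefault(lexeme, set()).add(tid)
        self.pending.clear()
        self.flushes += 1

    def search(self, lexeme):
        """Searches must consult the main structure and scan the pending list."""
        found = set(self.main.get(lexeme, ()))
        found.update(tid for tid, lx in self.pending if lx == lexeme)
        return sorted(found)

idx = ToyFastUpdateIndex(pending_limit=4)
idx.insert((0, 1), ["sheet", "slit"])
idx.insert((0, 2), ["sheet", "sit"])
print(idx.search("sheet"), idx.flushes)  # [(0, 1), (0, 2)] 0

idx.insert((0, 3), ["sheet"])            # fifth entry exceeds the limit
print(idx.search("sheet"), idx.flushes)  # [(0, 1), (0, 2), (0, 3)] 1
```

Both drawbacks named above are visible here: search always scans the pending list linearly, and the third insert is the one that absorbs the cost of merging all five entries.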

Limiting the query result
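One characteristic of the GIN access method is that it always returns results as a bitmap rather than TID by TID, which is why every plan shown here is a bitmap scan.
As a result, the LIMIT clause is not very effective: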

testdb=# explain verbose   
select doc from ts where doc_tsv @@ to_tsquery('many & slitter');
                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Bitmap Heap Scan on public.ts  (cost=12.25..16.51 rows=1 width=32)
   Output: doc
   Recheck Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
   ->  Bitmap Index Scan on ts_doc_tsv_idx  (cost=0.00..12.25 rows=1 width=0)
         Index Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
(5 rows)
testdb=# explain verbose
select doc from ts where doc_tsv @@ to_tsquery('many & slitter') limit 1;
                                     QUERY PLAN                                     
------------------------------------------------------------------------------------
 Limit  (cost=12.25..16.51 rows=1 width=32)
   Output: doc
   ->  Bitmap Heap Scan on public.ts  (cost=12.25..16.51 rows=1 width=32)
         Output: doc
         Recheck Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
         ->  Bitmap Index Scan on ts_doc_tsv_idx  (cost=0.00..12.25 rows=1 width=0)
               Index Cond: (ts.doc_tsv @@ to_tsquery('many & slitter'::text))
(7 rows)

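The reason is that the startup cost of the Bitmap Heap Scan is almost the same as the total cost of the Bitmap Index Scan (12.25 in both plans above): the whole bitmap has to be built before the first row can be returned.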
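
A toy Python model shows why the limit does not reduce the index-side work. The function below is a sketch written for this illustration, not PostgreSQL's executor: it simply counts how many TIDs are added to the bitmap before any row is handed back.

```python
def bitmap_scan(posting_list, limit=None):
    """Return (rows, tids_visited); the bitmap is completed before any row is emitted."""
    bitmap = set()
    visited = 0
    for tid in posting_list:  # index side: every matching TID goes into the bitmap
        bitmap.add(tid)
        visited += 1
    rows = sorted(bitmap)     # heap side: rows are then read in physical order
    if limit is not None:
        rows = rows[:limit]   # the limit only trims what happens after the bitmap exists
    return rows, visited

posting_list = [(0, n) for n in range(1, 1001)]  # 1000 made-up matching TIDs

all_rows, work_all = bitmap_scan(posting_list)
one_row, work_one = bitmap_scan(posting_list, limit=1)
print(len(all_rows), work_all)  # 1000 1000
print(len(one_row), work_one)   # 1 1000
```

In this model, asking for one row saves heap reads but none of the work spent building the bitmap.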

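For such cases, PostgreSQL provides the gin_fuzzy_search_limit parameter, a soft upper limit on the number of rows a GIN index scan returns (the default is 0, meaning no limit).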

testdb=# show gin_fuzzy_search_limit ;
 gin_fuzzy_search_limit 
------------------------
 0
(1 row)
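
What a "soft" limit means can be illustrated with a short Python sketch. It assumes the limit works by keeping each candidate with a probability chosen so that roughly the requested number survive; the function name and the sampling rule are assumptions made for illustration, not PostgreSQL's actual code.

```python
import random

def fuzzy_limit_filter(tids, fuzzy_limit, rng):
    """Keep roughly fuzzy_limit TIDs, chosen at random; 0 disables the limit."""
    if fuzzy_limit == 0 or len(tids) <= fuzzy_limit:
        return list(tids)
    keep_probability = fuzzy_limit / len(tids)
    return [tid for tid in tids if rng.random() < keep_probability]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
tids = [(page, offset) for page in range(100) for offset in range(1, 101)]  # 10,000 TIDs

unlimited = fuzzy_limit_filter(tids, 0, rng)
limited = fuzzy_limit_filter(tids, 100, rng)
print(len(unlimited))  # 10000
print(len(limited))    # close to 100, but not exactly 100
```

The result is a random subset of the full answer whose size only approximates the limit, so a setting like this suits queries where any reasonable sample of matches will do.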
