%%=====================================================================
%% Transactions on the MVBT
%%=====================================================================
\chapter{Transactions on the MVBT}
\label{chapter:tmvbt}
\label{def:tmvbt}
As shown in the previous chapter, there are efficient multiversion index
structures available, but there is no single structure that is both optimal
and usable in a concurrent transactional environment.
We reviewed the result by Becker
et~al.~\cite{becker:1993:optimal,becker:1996:mvbt} that showed that the
multiversion \Btree~(MVBT) is an optimal multiversion index structure,
% if the root page of the queried version is known
% (\secref{sec:tsbmvbt:mvbt})
but it follows a single-update model,
% in which a transaction can only consist of a single update that
and the update cannot be rolled back.
In this chapter, we present our redesigned MVBT, called the
\emph{transactional multiversion \Btree} (TMVBT)\@.
The TMVBT adds transactions to the MVBT by redesigning the
structure-modification operations (SMOs) so that multiple data-item updates
can be performed within a single transaction, and the updates can be rolled
back.
% We show that the optimality of the index structure is preserved, and that
% each page still contains at least \minlive\ entries that are alive
% at version~$v$ for all versions in which the page is part of the search
% tree~$S_v$.
% The extended structure-modification operations do not need to access any more
% pages than the MVBT structure-modification operations.
% We also show how concurrency-control and recovery algorithms can be used to
% allow a single updating transaction to operate on the index concurrently with
% multiple read-only transactions.
% The TMVBT is thus not yet an optimal general-purpose
% multiversion database index, because multiple updating transaction cannot
% operate of the index concurrently.
% Multiple updating transactions are discussed on the next chapter.
The TMVBT structure was first introduced in our previous
article~\cite{haapasalo:2009:tmvbt}.
In the discussion here, we explain the algorithms in more detail, and provide
detailed proofs for the properties of the structure.
We begin the chapter by describing the implementation of the transaction
model of Sections~\ref{sec:mv-data:read-only-tx}
and~\ref{sec:mv-data:updating-tx} for the TMVBT in
\secref{sec:tmvbt:multi-action-tx}.
After that, \secref{sec:tmvbt:active-entries} defines the concept of active
entries, which is needed for maintaining the optimality in the presence of
multi-action transactions, and
\secref{sec:tmvbt:structure} describes the structure of the TMVBT\@.
In \secref{sec:tmvbt:actions}, we show how the user actions are performed,
and in \secref{sec:tmvbt:smo}, we describe the structure-modification
operations triggered by the user actions.
Finally, in \secref{sec:tmvbt:multiupdate}, we illustrate why we cannot
allow multiple updating transactions to operate on the index concurrently,
and in \secref{sec:tmvbt:summary}, we summarize the discussion on the TMVBT
index.
%% Multi-Action Transactions
%%---------------------------------------------------------------------
\section{Multi-Action Transactions}
\label{sec:tmvbt:multi-action-tx}
We allow two kinds of transactions to operate on the TMVBT concurrently:
any number of read-only transactions (as defined in
\secref{sec:mv-data:read-only-tx}) and at most one updating transaction (as
defined in \secref{sec:mv-data:updating-tx}) at a time.
For reasons explained in \secref{sec:tmvbt:multiupdate}, we cannot allow more
than one updating transaction to operate on the TMVBT at a time.
Concurrent updating transactions are discussed in the next chapter.
In contrast to the MVBT (\secref{sec:tsbmvbt:mvbt}), each updating
transaction operating on the TMVBT can perform any number of updates, and the
updates all receive the same version.
Because only one updating transaction can operate on the TMVBT at a time, the
commit order of the transactions is known during the execution of the
transactions, and each data-item update can be directly performed with the
correct transaction-time version.
In the context of the TMVBT, we use the term \emph{version} to denote these
transaction-time instants.
This does mean, however, that the version assigned to an updating
transaction cannot be based on the real time instant of the commit action,
because that is not known at the beginning of the transaction.
We thus assume that the versions used in the TMVBT are increasing
integer numbers that are assigned at the beginning of the updating
transaction.
The versions can be based on an increasing counter value or they can be based
on the real-time instant of the begin action of the transaction, as long as
they are increasing values.
For simplicity, in the discussions in this chapter, we assume that the
active version variable (defined below) is based on an integer value that is
incremented by one each time a new updating transaction begins.
Like the MVBT, the TMVBT also maintains a commit version variable~\comver\
that records the version of the latest committed transaction.
The TMVBT also maintains an active version
variable~\actver\phantomsection\label{def:actver} that holds the version
of the current updating transaction.
If there is no active updating transaction operating on the index, $\actver =
\comver$.
When an updating transaction starts, the active version variable is
incremented, $\actver \leftarrow \actver + 1$.
When the single updating transaction commits, the commit version is
set to the value of the active version variable: $\comver \leftarrow
\actver$.
These version variables therefore tell whether there is an active updating
transaction running on the TMVBT index: if the active version variable is
larger than the commit version variable, then there is an updating
transaction running, and no other updating transaction can begin.
Read-only transactions can always target any version that is less than
or equal to the commit version~\comver, unless purging of old versions is
implemented (see discussion in \secref{sec:tsbmvbt:mvbt} and in the MVBT
articles~\cite{becker:1993:optimal,becker:1996:mvbt}).
In that case, the minimum version that can be accessed must also be
maintained in a separate variable.
The transaction model used for the TMVBT is the transaction model
explained in Sections~\ref{sec:mv-data:read-only-tx}
and~\ref{sec:mv-data:updating-tx}.
The log records written by a transaction~$T$ for the actions presented here
must also contain the transaction identifier \txid{T}.
Because the common form of log records
\lrb{T}, \logact{action}, \lre{\ldots} already contains the identifier~$T$,
we use $T$ to mean that the transaction identifier $\txid{T}$ is written in
the log records.
The control actions of the transaction model are implemented as shown below;
the query and update actions are discussed in \secref{sec:tmvbt:actions}.
\begin{itemize}
\setlength{\itemsep}{0pt}
% Begin
\item \action{begin-read-only}$($version~$v)$: begins a new read-only
transaction; this action takes a short-duration read lock on the~\comver\
variable, checks if $v \leq \comver$, and records the value $\snapver{T}
\leftarrow v$ for the transaction.
If $v > \comver$, the transaction is aborted.
% Begin
\item \action{begin-update}: begins a new updating transaction~$T$; this
action takes a commit-duration write lock on the active version
variable~\actver, increments the variable $\actver \leftarrow \actver +
1$, and assigns it to the transaction: $\txid{T} \leftarrow \actver$.
A redo-undo log record \lrb{T}, \logact{begin}, $v$, \lre{\actver} is
written, with $v$ denoting the previous value of the variable~\actver,
but the log is not forced to disk.
% Commit
\item \action{commit-update}: commits the active updating transaction~$T$
by
(1)~taking a commit-duration write lock on the commit version
variable~\comver;
(2)~updating the variable $\comver \leftarrow \actver$;
(3)~writing a log record \lrb{T}, \logact{commit}, \lre{\comver};
(4)~forcing the log onto disk; and
(5)~calling the release-version action.
% Version release
\item \action{release-version}: this action does nothing, because the
updating transaction has already assigned correct versions to each updated
data item entry.
If the versions of the TMVBT index should be timestamps that are based on
the time of the commit action (which is not known at the beginning of the
updating transaction), then this action can perform the necessary changes
to update the entries.
% Abort
\item \action{abort}: labels the updating transaction as aborted and
starts the backward-rolling phase.
This action writes the log record \lrb{T}, \lre{\logact{abort}}.
% Finish-rollback
\item \action{finish-rollback}: finishes the rollback of an aborting
transaction by decrementing the active version variable $\actver \leftarrow
\actver - 1$, writing a log record
\lrb{T}, \logact{finish-rollback}, \lre{\actver}, and
forcing the log to disk.
\end{itemize}
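The control actions listed above can be sketched in Python as follows.
This is only an illustrative sketch under simplifying assumptions: the class
\texttt{VersionControl}, its method names, and the in-memory list that stands
in for the log are hypothetical, and locking and forcing the log to disk are
reduced to comments.
The explicit check in \texttt{begin\_update} stands in for the
commit-duration write lock on~\actver.

```python
class VersionControl:
    """Toy model of the TMVBT version variables (hypothetical names)."""

    def __init__(self):
        self.comver = 0   # version of the latest committed transaction
        self.actver = 0   # version of the current updating transaction
        self.log = []     # in-memory stand-in for the write-ahead log

    def updater_active(self):
        # An updating transaction is running iff actver > comver.
        return self.actver > self.comver

    def begin_read_only(self, v):
        # Assumed: a short-duration read lock on comver is held here.
        if v > self.comver:
            raise ValueError("version not committed; transaction aborted")
        return v  # snapshot version recorded for the transaction

    def begin_update(self):
        # Assumed: this check replaces the commit-duration write lock.
        if self.updater_active():
            raise RuntimeError("another updating transaction is active")
        previous = self.actver
        self.actver += 1
        self.log.append((self.actver, "begin", previous, self.actver))
        return self.actver  # transaction identifier

    def commit_update(self):
        # Assumed: a commit-duration write lock on comver is taken here.
        t = self.actver
        self.comver = self.actver
        self.log.append((t, "commit", self.comver))
        # Assumed: the log is forced to disk here; release-version is a no-op.

    def abort(self):
        self.log.append((self.actver, "abort"))

    def finish_rollback(self):
        t = self.actver
        self.actver -= 1
        self.log.append((t, "finish-rollback", self.actver))
        # Assumed: the log is forced to disk here.
```

After a commit both variables are equal again, and after a finished rollback
the active version returns to the commit version, so in either case a new
updating transaction may begin.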
% (1)~when an updating transaction~$T$ begins, the \actver\ variable is
% incremented and assigned to the transaction, $\txid{T} \to
% \actver$;
% (2)~when the updating transaction commits, the same version \actver\ is used
% as the commit-time version of the transaction; and
% (3)~the version release method does nothing, because all the items already
% have the correct version.
All the update actions of an updating transaction are logged using
the write-ahead logging protocol as in
\abbr{ARIES}~\cite{mohan:1992:aries}\@.
In addition to the log records described above, redo-undo log records are
written for an \action{insert} action and a \action{delete} action,
while redo-only log records are written for an \action{undo-insert} action
and an \action{undo-delete} action.
These log records are described in \secref{sec:tmvbt:actions}.
A read-only transaction does not create any log records; it only stores
transient control information in the active-transactions table when it
begins, and removes that information when it commits.
%% Active Entries
%%---------------------------------------------------------------------
\section{Active Entries}
\label{sec:tmvbt:active-entries}
Recall from \figref{fig:mvbt-invalid-split} on
page~\pageref{fig:mvbt-invalid-split} the problem of inserting multiple
data-item entries into the MVBT index with the same version.
When performing a version-split operation on page~$p$, a new copy~$p'$ of the
page is created and the life span of the original page~$p$ is truncated to
the current version and the page is left as it is for use in historical
queries.
If the page~$p$ was created by the same transaction that triggers the
version split, the life span of the page will degenerate into an empty range,
and the page will thus not be part of any search tree in the database.
In these situations, the key split can be performed directly on the
page~$p$, without applying the version-split operation first.
Let us now define the concepts of active entries and active pages to
classify the situations where a version-split operation is not required and
in fact must not be performed.
Remember that the single updating transaction always has the version~\actver\
as its identifier and uses that version to stamp the data-item updates
and structure-modification operations.
\thmskip
\begin{definition}
\label{def:active-entries-pages}
An \emph{active entry} (or \emph{active page}, respectively) in the TMVBT
index is an entry (page) that has a life span of $[\actver, \infty)$.
An active entry (page) has thus been created by the currently active
updating transaction.
Entries (pages) that are not active are called \emph{inactive entries}
(\emph{inactive pages}).
\end{definition}
\thmskip
As stated in the previous chapter, read-only transactions may only read
committed versions, that is, versions $v \leq \comver$.
This leads to the following observation:
\thmskip
\begin{invariant}
\label{inv:tmvbt-read-only-inactive}
Read-only transactions in the TMVBT index only read inactive entries and
pages.
Active entries and pages are only seen by the single active updating
transaction.
\end{invariant}
\thmskip
If the active updating transaction is deleting an active entry, the entry can
be physically removed from the index, instead of changing its life span.
This does not invalidate partial persistence, because the active entry was
created by the same transaction, and thus did not exist before the updating
transaction first inserted it.
Updates that are internal to the transaction are not visible outside
the transaction and must not consume space in the index.
\thmskip
\begin{invariant}
\label{inv:active-entries-physical-delete}
When a single updating transaction~$T$ deletes an active entry (created by
$T$), the entry is physically removed from the TMVBT index.
Similarly, if $T$ updates an active entry it physically removes the old entry
and creates a new active entry to replace the old one.
\end{invariant}
\thmskip
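The difference between deleting an active entry and deleting an inactive
entry can be sketched in Python as follows.
The page layout, a list of (key, life span, data) tuples, and the function
name \texttt{delete\_entry} are hypothetical simplifications used only for
illustration.

```python
import math

INF = math.inf  # open upper bound of a life span


def delete_entry(page, key, actver):
    """Delete the live entry with the given key from a leaf page.

    `page` is a list of (key, (start, end), data) tuples; this layout is
    an assumed simplification of a TMVBT leaf page.
    """
    for i, (k, (start, end), data) in enumerate(page):
        if k == key and end == INF:
            if start == actver:
                # Active entry: created by this transaction,
                # so it is removed physically.
                del page[i]
            else:
                # Inactive entry: kept for historical queries;
                # its life span is ended at the active version.
                page[i] = (k, (start, actver), data)
            return True
    return False  # no live entry with this key
```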
When performing a version-split operation on a page~$p$ at version~$v$ in the
original MVBT index, Becker et~al.\ suggested that the
entries that are left in page~$p$ may be left
unmodified~\cite{becker:1996:mvbt}, so that the life spans of the
entries~$e_i$ that were alive at version~$v$ remain unbounded from above;
that is, of the form $[v_i, \infty)$, where $v_i < v$.
This does not affect any queries, because only historical queries targeting
versions $v' < v$ will ever end up in the historical page~$p$; thus, even if
an entry~$e_i$ is deleted by a transaction with a version $v'' > v$, the
queries targeting those newer versions will never encounter the now-outdated
entry~$e_i$ on the historical page~$p$.
However, if a transaction~$T$ stores a previously used path as a saved path
(see p.~\pageref{def:saved-path}) and reuses the path later on, it is
possible that the pages in the saved path are no longer valid.
The transaction~$T$ cannot ascertain the validity of the pages unless the
consistency of the life spans of all the entries and pages is maintained, so
that the deletion times of entries in historical pages are set to the
deletion time of the historical page.
In the TMVBT index, we explicitly require that the life spans of entries that
are left on a historical page are cropped so that they end at the version
during which the page was split:
\thmskip
\begin{invariant}
\label{inv:tmvbt-live-entry-copy}
When a page~$p$ in the TMVBT index is version-split into a new page~$p'$ at
version~\actver, all live entries $(k, [v, \infty), w)$ such that $v <
\actver$ are processed as follows:
a live copy $(k, [\actver, \infty), w)$ is created and inserted into the new
live page~$p'$, and the live entry at page~$p$ is changed to the historical
entry $(k, [v, \actver), w)$.
All active live entries of the form $(k, [\actver, \infty), w)$ are physically
moved to the new page~$p'$.
\end{invariant}
\thmskip
This invariant is required so that the key-version regions of all
the entries of a given level of the TMVBT index do not overlap, as shown in
\figref{fig:mvbt-space-partition} on page~\pageref{fig:mvbt-space-partition}.
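The entry processing required by \invref{inv:tmvbt-live-entry-copy} can be
sketched in Python as follows.
The tuple layout and the function name \texttt{version\_split} are
hypothetical, and the sketch ignores page headers, logging, and the parent
update.

```python
import math

INF = math.inf  # open upper bound of a life span


def version_split(page, actver):
    """Split the entries of a page at version actver.

    Returns (historical_entries, live_entries), where each entry is a
    (key, (start, end), data) tuple (an assumed, simplified layout).
    """
    historical, live = [], []
    for key, (start, end), data in page:
        if end != INF:
            # Already dead: stays on the historical page unchanged.
            historical.append((key, (start, end), data))
        elif start < actver:
            # Inactive live entry: crop the original, create an active copy.
            historical.append((key, (start, actver), data))
            live.append((key, (actver, INF), data))
        else:
            # Active entry: physically moved to the new page.
            live.append((key, (start, end), data))
    return historical, live
```

Every entry in the returned live list has the life span
$[\actver, \infty)$, and no entry on the historical page extends beyond
version~\actver.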
By adhering to these rules, we can also obtain the following lemmata:
\thmskip
\begin{lemma}
\label{lemma:active-pages-active-entries}
Active pages in the TMVBT only contain active entries.
\end{lemma}
\begin{proof}
An active page is a page that was created by the active updating transaction.
When the transaction commits, the page immediately becomes inactive.
When an active page~$p$ was created, all the live entries that were copied to
it were changed so that their life spans start at the split boundary
(i.e., at version~\actver), thus making the copies of the entries active.
If the active entries are changed in any way by the same transaction, they
will be physically deleted or replaced by new active entries, as per
\invref{inv:active-entries-physical-delete}.
\end{proof}
\thmskip
\thmskip
\begin{lemma}
\label{lemma:active-pages-single-parent}
Active pages have at most one parent.
\end{lemma}
\begin{proof}
Multiple parents in a multiversion index are caused by multiple routers to
the same page~$p$ in the index pages above the page~$p$.
When an active page~$p$ is created, a new index entry $i_p$ is inserted into the
parent page~$p'$.
Note that the index entry~$i_p$ is also active, and will remain active until
the current active transaction commits.
When the current transaction commits, both the entry~$i_p$ and the page~$p$
will immediately become inactive.
By \invref{inv:tmvbt-live-entry-copy}, active entries are physically moved
during a version-split operation.
If the parent page~$p'$ is version-split before the active transaction
commits, the index entry $i_p$ is physically moved to the new page,
thereby preventing the creation of new copies of~$i_p$.
Because there can be only a single index entry~$i_p$ pointing to an
active page~$p$, active pages can only have a single parent.
\end{proof}
\thmskip
All the entries of active TMVBT pages have the same life span of $[\actver,
\infty)$.
This holds for both leaf pages and index pages, and is illustrated in
\figref{fig:tmvbt-active-entries}.
Because of this, we can disregard the life spans of entries when
performing an operation on active pages: in effect, we can treat active pages
as if they were pages in a non-versioned \Btree\ index.
The extended TMVBT algorithms are based on this observation.
The algorithms themselves are explained in detail in \secref{sec:tmvbt:smo}.
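To illustrate the observation, a key split of a full active page can be
sketched in Python without any reference to life spans.
The sketch assumes a split at the median key and models entries as plain
(key, data) pairs; the function name \texttt{key\_split\_active} is
hypothetical, and the actual split policy of the TMVBT algorithms may differ.

```python
def key_split_active(entries, new_entry):
    """Insert new_entry into a full active page and key-split the result.

    Every entry of an active page has the same life span, so entries are
    modelled as plain (key, data) pairs. Returns the entries of the left
    page, the entries of the right page, and the separator key.
    """
    merged = sorted(entries + [new_entry], key=lambda e: e[0])
    mid = len(merged) // 2          # assumed policy: split at the median
    separator = merged[mid][0]      # smallest key of the right-hand page
    return merged[:mid], merged[mid:], separator
```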
\begin{figure}[!htb]
\begin{center}
\subfigure[Active index entries]{\input{images/tmvbt-active-index-entries.tex}
\label{fig:tmvbt-active-entries:index}}
\subfigure[Active leaf entries]{\input{images/tmvbt-active-leaf-entries.tex}
\label{fig:tmvbt-active-entries:leaf}}
\figcaption{Active entries in the TMVBT index}%
{The index page contains three index entries with routers to pages $p_1$,
$p_2$, and $p_3$.}
\label{fig:tmvbt-active-entries}
\end{center}
\end{figure}
As an example, let us review the problem scenario in the MVBT as depicted in
\figref{fig:mvbt-invalid-split} on page~\pageref{fig:mvbt-invalid-split}.
In the TMVBT, the page~$p_1$ is active, and thus it can be key-split directly
without version-splitting it first.
The operation of the same transaction, executing on the TMVBT index, is shown
in \figref{fig:mvbt-invalid-split-solved}.
\begin{figure}[htb]
\begin{center}
\input{images/mvbt-problem-solved}
\figcaption{Key split without version split in the TMVBT}%
{A key-split is triggered by the insertion of key~\num{4}.
The leaf page contains three entries, namely $e_1$, $e_2$, and $e_3$.
The format of the page header is key range, life span;
and the format of the entries is (key, life span, data).}
\label{fig:mvbt-invalid-split-solved}
\end{center}
\end{figure}
%% Transactional Multiversion B-tree
%%---------------------------------------------------------------------
\section{Transactional Multiversion \Btree}
\label{sec:tmvbt:structure}
As we explained in the previous chapter, only the MVBT
index~\cite{becker:1993:optimal,becker:1996:mvbt} can be
considered optimal when updating transactions follow a single-update model,
although the MVAS of Varman and Verma has access cost guarantees that are close
to optimal ($m$-optimal, see \secref{sec:tsbmvbt:mvas}).
We have chosen the MVBT as the basis of our work, instead of the MVAS,
because
(1)~the page-reuse rules of the MVAS make the structure of the pages more
complicated without improving the space complexity bounds,
(2)~the lack of a separate \rootstar\ structure makes history queries less
efficient, and
(3)~the access list incurs a high maintenance cost.
Nevertheless, the improvements presented in this chapter could also be
implemented on the MVAS index structure.
The \emph{transactional multiversion \Btree} (TMVBT) index, which was first
introduced in our previous article~\cite{haapasalo:2009:tmvbt}, is a directed
acyclic graph with multiple root pages that is based on the multiversion
\Btree\ of Becker et~al.~\cite{becker:1993:optimal,becker:1996:mvbt}.
The original MVBT structure was reviewed in \secref{sec:tsbmvbt:mvbt}.
The different roots of the TMVBT index are stored in a \rootstar\ structure,
exactly as in the MVBT index.
The page format in the \abbr{TMVBT} is identical to that of the MVBT, with
the addition of recovery information required for our \abbr{ARIES}-based
recovery algorithm, such as a Page-LSN field that stores the log
sequence number (LSN) of the log record
of the latest update on the page.
We assume that each page~$p$ explicitly stores the life span $\vr{p}$,
the key range $\kr{p}$, and also the height of the page.
%, denoted $\height{p}$.
The height of a page is one for all leaf pages, and greater
% $\height{p} > 1$
for index pages.
As discussed in \secref{sec:tsbmvbt:mvbt}, the MVBT has three variables that
determine how many live entries there are in each page and how often pages
are split or merged.
These variables are used in the TMVBT with the same meaning.
The variable \minlive\ determines the minimum number of live entries that
must be present in each live page (see \invref{inv:mvbt-live-count} on
page~\pageref{inv:mvbt-live-count}), and variables \minsplit\ and \maxsplit\
%, $\minsplit < \maxsplit$,
control how many live entries must be present in each live page created by a
structure-modification operation.
Becker et~al.\ use the term \emph{weak version condition} to refer to the
first requirement, and the term \emph{strong version
condition}\phantomsection\label{def:strong-version} to refer to the
second~\cite{becker:1993:optimal,becker:1996:mvbt}.
The variables \minsplit\ and \maxsplit\ are defined as
$\minsplit = \minlive + s$ and $\maxsplit = \capacity - s$, where $s$~is
a \emph{split tolerance variable} that determines the minimum number of
actions that must be performed on the page before a new
structure-modification operation is required.
If the strong version condition holds, then at least $s$~entries can be
deleted from the page before the number of live entries falls below \minlive,
and similarly at least~$s$ entries can be inserted into the page before the page
becomes full.
In effect, $s$~is used to prevent thrashing.
When a page has more than \maxsplit\ entries immediately after a
version-split, it will be key-split into two pages.
We thus require that $\maxsplit \geq 2 \times \minsplit$ so that the two new
pages will have at least \minsplit\ entries each.
\thmskip
\begin{invariant}
\label{inv:tmvbt-minsplit-maxsplit}
All the live pages at level~$l$ that are involved in a structure-modification
operation that targets a page~$p$ at level~$l$ must contain
from \minsplit\ to \maxsplit\ live entries immediately after the \abbr{SMO}\@.
\end{invariant}
\thmskip
Note that these requirements do not need to hold for the parent page~$q$ at
level $l+1$, because only a small constant number of updates is applied to it
during any SMO\@.
The router entries in a parent page are only updated by the SMOs at a lower
level, so the updates performed on the parent page~$q$ correspond to inserting
or deleting entries from a leaf page.
The values chosen for the variables affect the size of the index structure and
the frequency of structure-modification operations.
It is theoretically possible to set \minlive\ as high as $\capacity/2$, if $s
= 0$, but this means that thrashing is not prevented.
The upper limit of the value of $s$ is $\capacity/3$, but with this setting
$\minlive = 0$ and thus the optimality constraints are lost.
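Both limits follow from substituting the definitions of \minsplit\ and
\maxsplit\ into the key-split requirement; spelled out as a short derivation:
\[
  \maxsplit \geq 2 \times \minsplit
  \iff \capacity - s \geq 2\,(\minlive + s)
  \iff 2\,\minlive + 3s \leq \capacity .
\]
Setting $s = 0$ yields $\minlive \leq \capacity/2$, and setting
$\minlive = 0$ yields $s \leq \capacity/3$.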
For the discussion in this dissertation, we assume that the following values
are used: $\minlive = \nicefrac{1}{5}\, \capacity$, $s = \nicefrac{1}{5}\,
\capacity$, $\minsplit = \minlive + s = \nicefrac{2}{5}\, \capacity$, and
$\maxsplit = \capacity - s = \nicefrac{4}{5}\, \capacity$.
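The relationship between the variables can be checked mechanically.
The following Python sketch computes \minsplit\ and \maxsplit\ from given
values and rejects combinations that violate the key-split requirement; the
function name \texttt{split\_parameters} and the capacity of \num{100} entries
used in the usage note are hypothetical.

```python
def split_parameters(capacity, min_live, s):
    """Return (min_split, max_split) for the given page parameters.

    Raises ValueError if a key split could produce a page with fewer
    than min_split live entries, i.e. if max_split < 2 * min_split.
    """
    min_split = min_live + s
    max_split = capacity - s
    if max_split < 2 * min_split:
        raise ValueError("max_split must be at least 2 * min_split")
    return min_split, max_split
```

For a capacity of \num{100} entries with $\minlive = 20$ and $s = 20$, the
function returns $(40, 80)$, so the requirement
$\maxsplit \geq 2 \times \minsplit$ holds with equality.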
Although the representations of the variables differ from the
definition used by Becker et~al., we can show that the variables are the
same as in the \abbr{MVBT} article~\cite{becker:1996:mvbt}.
Becker et~al.\ require that $\minlive = d = \capacity/k$, $\minsplit =
(1 + \epsilon) \times d$ and $\maxsplit = (k - \epsilon) \times d$, where
$k$ and $\epsilon$ are variables that can be selected.
If we assign $s = \epsilon d$, we obtain $\minsplit = d + \epsilon d =
\minlive + \epsilon d = \minlive + s$, and $\maxsplit = k d - \epsilon
d = \capacity - s$, which are the definitions used here.
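As a worked instance of this correspondence, taking $k = 5$ and
$\epsilon = 1$ in the formulas above gives
\[
  d = \nicefrac{1}{5}\, \capacity, \qquad
  s = \epsilon d = \nicefrac{1}{5}\, \capacity, \qquad
  \minsplit = (1 + 1) \times d = \nicefrac{2}{5}\, \capacity, \qquad
  \maxsplit = (5 - 1) \times d = \nicefrac{4}{5}\, \capacity ,
\]
which coincide with the values assumed in this dissertation.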
For optimality of the index structure, we wish to keep the structure of
the TMVBT index as close to the MVBT index as possible.
Most importantly, we wish to maintain \invref{inv:mvbt-live-count}, so that
all pages of each search tree~$S_v$ contain at least \minlive\ entries that
are alive at version~$v$, for all versions~$v$.
Let us first restate \invref{inv:mvbt-static-entries} for the TMVBT:
\thmskip
\begin{invariant}
\label{inv:tmvbt-static-inactive-entries}
All inactive entries in the TMVBT pages remain in place.
They are never moved to another page.
Only the deletion time of an inactive entry may be changed, always to the
current active version~\actver.
Active entries in the TMVBT pages may be physically deleted,
updated, or moved to another page (see
Invariants~\ref{inv:active-entries-physical-delete}
and~\ref{inv:tmvbt-live-entry-copy}).
\end{invariant}
\thmskip
This means, in practice, that the structure of the search tree~$S_v$ of a
version~$v$ can only change when $v = \actver$.
After the active updating transaction~$T$ commits, version~$v$ becomes inactive, and the structure of
the search tree~$S_v$ becomes static.
By this we mean that the set of pages that forms the search tree~$S_v$ can
no longer change, and the entries that are alive at version~$v$ are
never physically deleted or moved to another page.
If we design the algorithms in such a way that the search tree of the active
version is balanced (\defref{def:consistent-balanced}) in all situations,
this implies that search trees of all versions are balanced, and thus
optimal.
This follows from the fact that the search tree~$S_v$ of the active
version is balanced immediately before the active transaction commits and thus
also at the moment version~$v$ becomes inactive.
Furthermore, because inactive search trees are static, $S_v$ will always remain
balanced.
We will show in the next sections that the TMVBT algorithms maintain
\invref{inv:mvbt-live-count} for the TMVBT\@.
This, together with the observation that all the root-to-leaf paths in the
search tree~$S_v$ are always of the same length, implies that the balance
conditions of the active-version search tree are also maintained.
% TMVBT live count
\thmskip
\begin{invariant}
\label{inv:tmvbt-live-count}
\invref{inv:mvbt-live-count} holds for the TMVBT index.
That is, for all versions~$v$ and all pages~$p$, page~$p$ contains at least
\minlive\ entries that are alive at version~$v$; or $p$ is a root page of
$S_v$, in which case it contains at least \num{2}~entries that are alive at
version~$v$; or $p$ is the only page of $S_v$ and contains at least one entry
that is alive at version~$v$; or $p$ is not part of the search tree~$S_v$ and
therefore contains no entries that are alive at version~$v$.
\end{invariant}
\thmskip
Figures~\ref{fig:example-oper-1}--\ref{fig:example-oper-3} show
an example of the TMVBT page operations.
In this illustrative example, the index is structurally consistent and
balanced, with suboptimal settings of $\minlive = 1$ and $s = 1$ for a page
capacity of $\capacity = 5$.
All the following examples have been generated by our visualization
software \TreeLib\ (see \chapref{chapter:performance}).
Pages $p_1$, $p_2$, and $p_4$ are not shown in the figures, because
$p_1$ and $p_2$ are used as database information pages, and page
$p_4$ is the root page of the \rootstar\ index.
\begin{figure}[!hbt]
\begin{center}
\input{images/tmvbt-oper-1}
\figcaption{Example of a TMVBT index after insertions}
{
The page header shows the page identifier followed by the key range
and version of the page;
the format of index-page entries is (key range, life span, page
identifier); and
the format of leaf-page entries is (key, life span, data), but
the associated data has been left out for clarity.
This TMVBT has been created by transaction~$T_1$
inserting keys~\range{1}{6} and transaction~$T_2$ inserting
keys~\num{7} and~\num{8}.
}
\label{fig:example-oper-1}
\end{center}
\end{figure}
In \figref{fig:example-oper-1}, the index contains six inactive live
entries inserted by transaction~$T_1$ (entries with keys \range{1}{6}), and
two active entries inserted by transaction~$T_2$ (entries with keys \num{7}
and \num{8}).
During the execution of~$T_1$, the leaf page~$p_3$ was key-split into
pages~$p_3$ and~$p_5$, and a new root page~$p_6$ was created, thus
incrementing the height of the search tree~$S_1$ by one.
\begin{figure}[!htb]
\begin{center}
\input{images/tmvbt-oper-2}
\figcaption{\abbr{TMVBT} after inserting a data item with key~\num{9}}
{
The format of the figure is the same as in \figref{fig:example-oper-1}.
White rectangles denote live pages, and gray rectangles denote dead pages.
Transaction~$T_2$ has caused a version-split on~$p_5$ by inserting
key~\num{9}.}
\label{fig:example-oper-2}
\end{center}
\end{figure}
\figref{fig:example-oper-2} shows the result of a version split
after transaction~$T_2$ tried to insert key~\num{9} into the full page~$p_5$.
The page~$p_5$ was version-split into pages $p_7$~and~$p_8$.
The historical entries remain stored in the dead page~$p_5$, and active
copies of the entries have been created in pages~$p_7$ and~$p_8$.
Note that all the active entries have been physically moved away from
page~$p_5$.
\begin{figure}[!htb]
\begin{center}
\input{images/tmvbt-oper-3}
\figcaption{\abbr{TMVBT} after deleting most of the entries}
{Transaction~$T_2$ deleted keys~\range{4}{9}, thus shrinking
the current-version search tree to a single page.}
\label{fig:example-oper-3}
\end{center}
\end{figure}
\figref{fig:example-oper-3} shows the status of the database after
transaction~$T_2$ has deleted entries \range{4}{9}.
Deleting the active entries has caused the number of live entries in
pages~$p_7$ and~$p_8$ to fall below \minlive, so the pages have been
consolidated by merging them.
In more detail, first~$p_7$ was merged with~$p_8$ by moving the active
entries of~$p_7$ to~$p_8$, which caused~$p_7$ to be deallocated.
Note that the active router to~$p_7$ was also deleted from the parent
page~$p_6$.
When the rest of the entries in~$p_8$ were deleted, page~$p_8$ was further
merged with $p_3$ by killing page~$p_3$ and creating active copies of
its live entries in~$p_8$.
As we will show in~\secref{sec:tmvbt:smo}, the algorithms actually created
a new live copy of~$p_3$ when killing it (call it $p_9$), and the active live
copy~$p_9$ was then merged with~$p_8$, causing $p_9$ to be deallocated.
At this point~$p_8$ was the only live page at level~\num{1}, so the height of
the current-version search tree~$S_2$ was decremented by making
$p_8$ the root page of version~\num{2}.
The auxiliary structure \rootstar\ now contains page identifiers of root
pages $p_6$ (for version~\num{1}) and $p_8$ (for version~\num{2}).
A more diverse example of a TMVBT index is shown in
Figures~\ref{fig:tmvbt-example:1}--\ref{fig:tmvbt-example:3}, with the same
settings used as in the previous examples.
This example has been generated by our visualization software with
the action sequence given below:
\begin{itemize}
\setlength{\itemsep}{0pt}
\item Transaction~$T_1$: insert data items with keys~\range{1}{9}
(\figref{fig:tmvbt-example:1}).
\item Transaction~$T_2$: delete data items with keys~\range{7}{9}
(\figref{fig:tmvbt-example:2}); insert data items with keys~\range{10}{15}
(\figref{fig:tmvbt-example:3}).
\end{itemize}
\begin{figure*}[htb]
\begin{center}
\input{images/tmvbt-example-1}
\figcaption{Example of a TMVBT index after insertions}{In this
figure, transaction $T_1$ has inserted keys \range{1}{9}.}
\label{fig:tmvbt-example:1}
\end{center}
\end{figure*}
\begin{figure*}[htb]
\begin{center}
\input{images/tmvbt-example-2}
\figcaption{Example of a TMVBT index after deletions}{In this
figure, transaction $T_2$ has deleted keys \range{7}{9}.}
\label{fig:tmvbt-example:2}
\end{center}
\end{figure*}
\begin{figure*}[htb]
\begin{center}
\input{images/tmvbt-example-3}
\figcaption{Example of a TMVBT index after more insertions}{In this
figure, transaction $T_2$ has inserted keys \range{10}{15}.}
\label{fig:tmvbt-example:3}
\end{center}
\end{figure*}
The transactions on this TMVBT index have induced the following
structure-modification operations:
\begin{itemize}
\setlength{\itemsep}{0pt}
\item The first six insertions by transaction~$T_1$ have triggered a
key-split, splitting page~$p_3$ into~$p_3$ and~$p_5$.
At this point, the root page~$p_6$ was created to hold the routers to these
pages, and \rootstar\ was updated by changing the page
identifier stored for version~\num{1} from $p_3$ to $p_6$.
\item The further three insertions by~$T_1$ have triggered another
key-split on~$p_5$, creating the new leaf page~$p_7$.
The situation after these SMOs is depicted in \figref{fig:tmvbt-example:1}.
\item After~$T_2$ has deleted the entries with keys \range{7}{9}, a
page-merge operation was triggered on~$p_7$ to merge the page with~$p_5$.
Because both of these pages were inactive, they were first killed, creating
two new active pages.
These were then merged into the active leaf page~$p_8$.
The situation after this SMO is shown in \figref{fig:tmvbt-example:2}.
\item The insertions by~$T_2$ further induced two page splits; first on the
active page~$p_8$, creating the active page~$p_9$; and then on~$p_9$,
thus creating page~$p_{11}$.
\item Insertion of the router to~$p_{11}$ to the parent page~$p_6$ caused a
split operation on the parent page~$p_6$.
Because~$p_6$ was inactive, it was first version-split into~$p_{10}$.
At this point~$p_{10}$ had enough space to hold the router to $p_{11}$,
so~$p_{10}$ was not further key-split into two pages.
The page identifier $p_{10}$ was inserted into \rootstar\ to mark that the
root page of version~\num{2} differs from the root page of version~\num{1}.
The situation after these SMOs is shown in \figref{fig:tmvbt-example:3}.
\end{itemize}
\figref{fig:tmvbt-example:3} shows that the page~$p_3$ containing entries
with keys \range{1}{3} is shared by both roots of the TMVBT index.
Note that page~$p_3$ is alive but not active, because $\actver = 2$
(assuming that transaction $T_2$ has not yet committed), and $p_3$ has a
life span other than $[2, \infty)$.
It is thus possible for this page to have more than one parent.
The pages $p_8$ to $p_{11}$ are active and only contain entries of
the most recent version.
Also note that the index page~$p_{10}$ is active even though it contains a
router to the inactive page~$p_3$, because the router itself is active.
% LinkRef cannot be used with TMVBT
In the previous chapter, we briefly discussed efficient version-range queries
(i.e., \qtype{$x$/$-$/range} queries) on the MVBT index structure.
These were introduced by van~den~Bercken and
Seeger~\cite{bercken:1996:multiversion}.
Even though the TMVBT is based on the MVBT index, the most efficient
\LinkRef\ technique cannot be used with the TMVBT index.
This is because the technique relies on storing links to historical pages
that temporally precede a page~$p$.
In the MVBT, each page can have at most two temporal predecessors (see the
discussion at the end of \secref{sec:tsbmvbt:mvbt}), and the links to those
pages can therefore be tracked.
In the TMVBT, pages can have an unlimited number of temporal predecessors,
because merging active pages combines the temporal predecessors of the
merged pages.
%% User Actions
%%---------------------------------------------------------------------
\section{User Actions}
\label{sec:tmvbt:actions}
Having defined the transactions and the structure of the TMVBT index, we
now describe the implementation of the user actions.
As a general rule, we assume that the physical consistency of the database
during normal processing is maintained by short-duration
latching~\cite{mohan:1992:aries} of pages, so that the server process or
thread that executes a transaction keeps a page~$p$ read-latched for the time
a read action is performed on~$p$, and write-latched for the time an update
action is performed.
We also assume that the buffer manager applies the standard
steal-and-no-force buffering policy~\cite{gray:1993:transactionprocessing}.
These assumptions are in accordance with the \abbr{ARIES} recovery
algorithm~\cite{mohan:1992:aries}.
No logical key-level locking is required for the TMVBT, because
(1)~for read-only transactions, the historical versions that the read-only
transactions read are never deleted from the index; and
(2)~for updating transactions, there can be only one updating transaction
operating on the index at a time.
% We also assume that only the single active updating transaction can perform
% any updates on the index structure.
% The undo actions performed during restart recovery are thus performed by an
% updating system transaction which prevents new updating transactions from
% beginning before restart recovery is finished.
% Furthermore, this also means that historical version
% purging~\cite{becker:1996:mvbt} cannot be performed concurrently with the
% updating transactions.
% If historical version purging is desired as a background process that can be
% run concurrently with the updating transaction, then the latching policy
% described in this section is not sufficient.
% Rather, the updating transaction must also read-latch pages to
% prevent collisions with the concurrent version purging process.
The global version variables~\comver\ and~\actver\ are maintained
in the permanent database and their reading and writing is protected
by locking.
A \action{begin-read-only} action acquires a short-duration read lock
on~\comver\ for reading its value, and a \action{commit-update}
action acquires a commit-duration write lock on it for incrementing
its value.
A \action{begin-update} action acquires a commit-duration write lock
on~\actver, thus guaranteeing that at most one updating
transaction is active at a time.
The decrement of \actver\ in a \action{finish-rollback} action
is performed under the protection of that lock.
The \action{begin-read-only} and \action{commit-read-only} actions do not
write any log records, because read-only transactions do not involve any
logging.
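The locking protocol for \comver\ and \actver\ described above can be sketched as follows.
This is only an illustration under simplifying assumptions: plain mutexes stand in for the read and write locks, both variables start at zero, and the class and method names are hypothetical.

```python
import threading


class VersionManager:
    """Illustrative model of the global version variables comver and actver."""

    def __init__(self):
        self.comver = 0   # most recent committed version (initial value assumed)
        self.actver = 0   # version of the active updating transaction
        self._comver_lock = threading.Lock()
        self._actver_lock = threading.Lock()

    def begin_read_only(self):
        # Short-duration lock: held only while the snapshot version is read.
        with self._comver_lock:
            return self.comver

    def begin_update(self):
        # Held until commit or rollback, so at most one updating
        # transaction can be active at a time.
        self._actver_lock.acquire()
        self.actver += 1
        return self.actver

    def commit_update(self):
        with self._comver_lock:
            self.comver += 1
        self._actver_lock.release()

    def finish_rollback(self):
        # The decrement is performed under the protection of the actver lock.
        self.actver -= 1
        self._actver_lock.release()
```

A second call to \texttt{begin\_update} from another thread would block until the first updating transaction commits or finishes its rollback.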
In a fully dynamic index structure in which any inserted data can be
physically deleted at any time,
\emph{latch-coupling} (called \emph{crabbing} by Gray and
Reuter~\cite{gray:1993:transactionprocessing}) is the standard way to
guarantee the validity of traversed search paths in all circumstances.
In a general situation, the validity of the traversed path can be
ascertained by releasing the latch on the parent page only after
a latch on a child page has been acquired.
Latch-coupling is deadlock-free if the latches are acquired in a
predefined order, such as first top-down, then left-to-right.
However, in the case of the TMVBT index, the fact that
inactive data always remains in place
(\invref{inv:tmvbt-static-inactive-entries}), together with our assumption
that a read-only transaction only reads inactive data
(\invref{inv:tmvbt-read-only-inactive}), implies that the \action{query} and
\action{range-query} actions of read-only transactions do not need to perform
latch-coupling, and a parent page may be unlatched during tree traversal
before acquiring a latch on the child page.
Accordingly, an action \action{query}$(k)$ in a read-only transaction
that is reading the version~\snapver{T} can be implemented as follows.
First, the root page for version~\snapver{T} is located from \rootstar\ and
read-latched.
Then the TMVBT is traversed using read latches without latch-coupling until
the leaf page~$p$ is found that covers key~$k$ and version~\snapver{T}; that
is, $k \in \kr{p}$ and $\snapver{T} \in \vr{p}$.
At each index page~$p'$ on the traversed path, the next page on the
path is the child page~$p''$ of $p'$ with $k \in \kr{p''}$ and $\snapver{T} \in
\vr{p''}$.
Once the identifier of the child page~$p''$ has been determined, the read latch
on the parent page~$p'$ is released and the child page~$p''$ is read-latched.
When the correct leaf page~$p$ has been found, the proper entry $(k, [v_1,v_2),
w)$ with $v_1 \leq \snapver{T} < v_2$ is located, and page~$p$ is unlatched.
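The traversal just described can be sketched as follows.
This is a minimal illustration, not the actual implementation: the page layout, the function \texttt{query}, and the sample index are hypothetical, a plain mutex stands in for a read latch, and the root page identifier is assumed to have been looked up from \rootstar\ already.

```python
import math
import threading
from dataclasses import dataclass, field

INF = math.inf  # open-ended upper bound of a key range or life span


@dataclass
class Page:
    is_leaf: bool
    # Leaf entries:  (key, v1, v2, data)
    # Index entries: (k_lo, k_hi, v1, v2, child_id)
    entries: list
    # A plain mutex stands in for a read latch in this sketch.
    latch: threading.Lock = field(default_factory=threading.Lock)


def query(pages, root_id, k, snap_ver):
    """Return the data stored for key k at version snap_ver, or None."""
    page = pages[root_id]
    page.latch.acquire()
    while not page.is_leaf:
        # Determine the child that covers both k and snap_ver.
        child_id = next(
            (cid for (k_lo, k_hi, v1, v2, cid) in page.entries
             if k_lo <= k < k_hi and v1 <= snap_ver < v2),
            None,
        )
        # No latch-coupling: the parent is unlatched before the child is latched.
        page.latch.release()
        if child_id is None:
            return None
        page = pages[child_id]
        page.latch.acquire()
    try:
        for key, v1, v2, data in page.entries:
            if key == k and v1 <= snap_ver < v2:
                return data
        return None
    finally:
        page.latch.release()


# Hypothetical two-level index: root page 6 with leaf pages 3 and 5.
pages = {
    6: Page(False, [(-INF, 4, 1, INF, 3), (4, INF, 1, INF, 5)]),
    3: Page(True, [(1, 1, INF, "a"), (2, 1, INF, "b"), (3, 1, 2, "c")]),
    5: Page(True, [(4, 1, INF, "d"), (7, 2, INF, "e")]),
}
```

In the sample index, key~3 is alive only at version~1 and key~7 only from version~2 onwards, so the result of a query depends on the snapshot version passed in.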
An action \action{range-query}$([k_1,k_2))$ is implemented similarly,
except that for each index page~$p'$ in the search path we need to
traverse all subtrees rooted at each child page~$p''$ such that
$[k_1,k_2) \cap \kr{p''} \neq \emptymark$ and $\snapver{T} \in \vr{p''}$.
If there is more than one such child page~$p''$, then the page identifiers
of all but the first child page are pushed onto a stack, and the traversal
proceeds to the subtree rooted at the first child.
When a subtree has been searched, a page identifier (if any) is popped from
the stack, the corresponding page is read-latched, and the search is
continued at the subtree rooted at that page.
Because the inactive entries and pages are static
(\invref{inv:tmvbt-static-inactive-entries}), the pages do not need to be
latched while their page identifiers are held in the stack.
Latching is used only to prevent inconsistent reads if the updating
transaction needs to modify a page at the same time the read-only transaction
is reading it.
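The stack-based range traversal can be sketched in the same style.
Again, the page layout, the function \texttt{range\_query}, and the sample index are hypothetical, and a mutex stands in for a read latch; note that only page identifiers are kept in the stack, and that each page is latched only while it is being read.

```python
import math
import threading
from dataclasses import dataclass, field

INF = math.inf


@dataclass
class Page:
    is_leaf: bool
    # Leaf entries:  (key, v1, v2, data)
    # Index entries: (k_lo, k_hi, v1, v2, child_id)
    entries: list
    latch: threading.Lock = field(default_factory=threading.Lock)


def range_query(pages, root_id, k1, k2, snap_ver):
    """Return (key, data) pairs with k1 <= key < k2 that are alive at snap_ver."""
    results = []
    stack = []            # identifiers of subtrees still to be searched
    page_id = root_id
    while page_id is not None:
        page = pages[page_id]
        children = []
        with page.latch:  # latched only while this page is read
            if page.is_leaf:
                results.extend(
                    (key, data) for (key, v1, v2, data) in page.entries
                    if k1 <= key < k2 and v1 <= snap_ver < v2)
            else:
                children = [
                    cid for (k_lo, k_hi, v1, v2, cid) in page.entries
                    if k_lo < k2 and k1 < k_hi and v1 <= snap_ver < v2]
        # Push all but the first qualifying child; the pages themselves
        # stay unlatched while their identifiers wait in the stack.
        stack.extend(reversed(children[1:]))
        if children:
            page_id = children[0]
        elif stack:
            page_id = stack.pop()
        else:
            page_id = None
    return results


# Hypothetical two-level index: root page 6 with leaf pages 3 and 5.
pages = {
    6: Page(False, [(-INF, 4, 1, INF, 3), (4, INF, 1, INF, 5)]),
    3: Page(True, [(1, 1, INF, "a"), (2, 1, INF, "b"), (3, 1, 2, "c")]),
    5: Page(True, [(4, 1, INF, "d"), (7, 2, INF, "e")]),
}
```

The sketch assumes that the entries within a page are stored in key order, so that the results are produced in key order as well.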
The following theorem follows directly from the definitions of the query
actions of the read-only transactions and from the fact that only one
updating transaction can be active at a time:
\thmskip
\begin{theorem}
\label{theorem:tmvbt:serializable}
The TMVBT algorithms produce a
snapshot-isolated schedule~\cite{berenson:1995:sql-critique} for the
transactions.
\end{theorem}
\begin{proof}
Firstly, because there can only be a single active updating transaction that
operates on the TMVBT at a time, the updating transactions are processed in a
fully serialized manner, thus fulfilling the requirements for
snapshot-isolated transactions.
Secondly, read-only transactions only read committed data that is never deleted,
so they also form snapshot-isolated schedules.
\end{proof}
\thmskip
An updating transaction begins with the \action{begin-update} action and ends
with the \action{commit-update} action, as described in
\secref{sec:tmvbt:multi-action-tx}, unless the transaction is aborted and
rolled back.
% The action takes a commit-duration write lock on~\actver, increments
% it, and assigns it to the transaction as described in
% \secref{sec:tmvbt:multi-action-tx}.
% A transaction~$T$ executing this action writes a redo-undo log record
% \lrb{T}, \logact{begin}, $v$, \lre{\actver}, where $v$ is the previous value
% of the active version variable, and \actver\ is the incremented value.
% The \action{commit-update} action, in addition to taking the lock and updating
% the commit version variable~\comver, as explained in
% \secref{sec:tmvbt:multi-action-tx}, writes a redo-only log record
% \lrb{T}, \logact{commit}, \lre{\comver}.
% Finally, this action forces the database log to disk.
The \action{query} and \action{range-query} actions are the same as for
read-only transactions, except that they now target the version~\actver, and
the actions may read active entries and pages.
As with read-only transactions, these actions in an updating transaction do
not write log records, because they do not create changes to the database
that would have to be redone or undone during restart recovery.
For efficiency, we assume that the TMVBT index records the page identifier of
the root page of version~\actver\ separately so that the queries in updating
transactions do not need to use the \rootstar\ structure to find it.
Similarly, because read-only transactions reading the most recent committed
version always target the version~\comver, the page identifier of the root
page of that version is also maintained separately.
\thmskip
\begin{theorem}
\label{thm:tmvbt-query-cost}
When the root of the search tree of version~$v$ is known, the cost of a
single-key query action in the TMVBT targeting version~$v$ is
\OhT{\log_\capacity \entries{v}} pages, and the cost of the key-range query
action for version~$v$ is \OhT{\log_\capacity \entries{v} + r/\capacity}
pages of the TMVBT structure, where $\entries{v}$ denotes the number of data
items that are alive at version~$v$, $r$ is the number of entries returned by
the range query and \capacity\ is the page capacity.
\end{theorem}
\begin{proof}
Assuming that \invref{inv:tmvbt-live-count} holds, each page of the TMVBT
that is part of the search tree~$S_v$ has at least \minlive\ entries that are
alive at version~$v$.
The proof is therefore the same as the proof of \thmref{thm:mvbt-cost}.
We will show later on in
Lemmas~\ref{lemma:kill-page-split-counts},
\ref{lemma:split-page-split-counts}, and~\ref{lemma:merge-page-split-counts}
that all the SMOs maintain \invref{inv:tmvbt-live-count}, thereby confirming
this result.
\end{proof}
\thmskip
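As a sketch of why \invref{inv:tmvbt-live-count} yields the logarithmic bound, let $h \geq 2$ be the height of the search tree~$S_v$.
The root page of~$S_v$ has at least two entries that are alive at version~$v$, and every other page of~$S_v$ has at least \minlive\ such entries, so the leaf level consists of at least $2\,{\minlive}^{h-2}$ pages, and therefore
\[
  \entries{v} \;\geq\; 2\,{\minlive}^{h-1}
  \quad\Longrightarrow\quad
  h \;\leq\; 1 + \log_{\minlive} \frac{\entries{v}}{2} .
\]
Under the additional assumptions (made here only for this sketch) that $\minlive \geq 2$ and that \minlive\ is a constant fraction of \capacity, the height of~$S_v$ is thus \OhT{\log_\capacity \entries{v}}.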
We assume that all TMVBT page traversals maintain a \emph{saved
path}\phantomsection\label{def:saved-path}~\cite{lomet:1992:conc-rec,lomet:1997:concurrency};
that is, an array \emph{path} local to the server process or thread in
question and indexed by the height of pages.
An entry \emph{path}$[i]$ holds the page identifier, key range, life span,
and Page-LSN of the page that was located at level~$i$ when traversing the
root-to-leaf path.
The saved-path concept can be used to accelerate the user actions by starting
the traversal at the lowest-level page in the saved path that, according to
the saved information, covers the queried search space.
This page is known to be the correct page to start the tree
traversal, because
(1)~for read-only transactions, the inactive data is never moved away
from the pages; and
(2)~for updating transactions, there can be no other updating
transaction that would invalidate the data in the saved path of the
current updating transaction.
This holds regardless of whether a concurrent purging process is allowed,
because the purging process only deletes pages that are part of
historical versions that are no longer queried.
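The use of the saved path can be sketched as follows.
The record type \texttt{PathEntry} and the function \texttt{start\_level} are hypothetical names; the sketch only shows how the lowest-level page whose saved key range and life span cover the queried key and version is selected, and it does not use the saved Page-LSN.

```python
import math
from dataclasses import dataclass

INF = math.inf


@dataclass
class PathEntry:
    page_id: int
    k_lo: float     # saved key range [k_lo, k_hi)
    k_hi: float
    v_from: int     # saved life span [v_from, v_to)
    v_to: float
    page_lsn: int   # saved Page-LSN (stored, but unused in this sketch)


def start_level(path, k, v):
    """Return the lowest level whose saved page covers key k and version v.

    `path` maps a level (1 = leaf level) to the PathEntry saved for it.
    None means that the traversal must start from the root page.
    """
    for level in sorted(path):
        e = path[level]
        if e.k_lo <= k < e.k_hi and e.v_from <= v < e.v_to:
            return level
    return None
```

For example, if the saved leaf page covers keys $[4, 7)$, a subsequent action on key~5 at the same version can start directly at the leaf level, whereas an action on key~2 falls back to the first higher-level page that covers it.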
% Write and delete action general idea
For the \action{write}$(k,w)$ and \action{delete}$(k)$ actions of the
updating transaction, the TMVBT is traversed using read latches without
latch-coupling as for the \action{query}$(k)$ action of the updating
transaction, except that the target leaf page~$p$ is write-latched.
% As in \action{query}$(k)$ and \action{range-query}$([k_1, k_2))$, no
% latch-coupling is needed.
If the target leaf page~$p$ can accommodate the update, then the update is
applied on page~$p$ directly; otherwise a structure-modification operation
is performed before the action can proceed.
After the update has been applied, a redo-undo log record for the action
is generated, its LSN is stamped in the Page-LSN field of~$p$, and the
write latch on~$p$ is released.
% Write action specifics
In the \action{write}$(k,w)$ action, if the index contains a live entry
of the form $(k,[v,\infty),w')$, then that entry is logically deleted by
either replacing it with a new entry $(k,[v,\actver),w')$, if $v \neq
\actver$; or by physically removing the old entry, if $v = \actver$.
After the existing entry has been deleted, a new entry
$(k,[\actver,\infty),w)$ is inserted into the page~$p$.
The page~$p$ can accommodate this update action, if the operations explained
above can be carried out without the page overflowing.
The redo-undo log record written for this action contains the version and data
of the replaced entry, in addition to the version and data of the inserted
entry.
The log record written by an updating transaction~$T$ is thus
\lrb{T}, \logact{write}, $p$, $k$, \actver, $w$, $v$, $w'$, \lre{n}, where
$n$ is the log sequence number of the previous not-yet-undone action
of~$T$, and $v$ and $w'$ are null if the index contained no live entry with
the key~$k$.
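The entry manipulation performed by the \action{write} action on the target leaf page can be sketched as follows.
The function \texttt{apply\_write} is a hypothetical name; the sketch assumes that the page can accommodate the update (no overflow check is made), and the returned tuple omits the transaction identifier, the page identifier, and the LSN~$n$ of the real log record.

```python
import math

INF = math.inf


def apply_write(entries, k, w, actver):
    """Apply write(k, w) to a leaf page given as a list of (key, v1, v2, data).

    Returns a simplified log record (action, k, actver, w, v, w_old), where
    v and w_old are None if the page held no live entry with key k.
    """
    old_v, old_w = None, None
    for i, (key, v1, v2, data) in enumerate(entries):
        if key == k and v2 == INF:                   # live entry with key k
            old_v, old_w = v1, data
            if v1 != actver:
                entries[i] = (key, v1, actver, data)  # end its life span
            else:
                del entries[i]                        # created by this transaction
            break
    entries.append((k, actver, INF, w))
    return ("write", k, actver, w, old_v, old_w)
```

Overwriting a key twice within the same transaction therefore leaves only one entry of version~\actver\ in the page, while the entry of the earlier version is kept with a closed life span.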
% Delete action specifics
In the case of the \action{delete}$(k)$ action, page~$p$ can
accommodate the update if replacing the entry $(k,[v,\infty),w)$ by
$(k,[v,\actver),w)$ (in the case $v \neq \actver$), or physically removing
the entry $(k,[\actver,\infty),w)$ (otherwise) does not decrease the number
of live entries in the page below the required minimum number of live entries,
$\minlive$.
% The redo-undo log record written for this action contains the version and
% data of the deleted entry.
An updating transaction~$T$ writes a redo-undo log record
\lrb{T}, \logact{delete}, $p$, $k$, \actver, $v$, $w$, \lre{n} for this
action.
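The corresponding check and update for the \action{delete} action can be sketched in the same way.
The names \texttt{apply\_delete} and \texttt{NeedsSMO} are hypothetical; the sketch uses the single threshold \minlive\ and ignores the relaxed bounds for root pages, and the returned tuple again omits the transaction identifier, the page identifier, and the LSN~$n$.

```python
import math

INF = math.inf


class NeedsSMO(Exception):
    """The page cannot accommodate the delete; a structure modification is needed."""


def apply_delete(entries, k, actver, min_live):
    """Apply delete(k) to a leaf page given as a list of (key, v1, v2, data).

    Returns a simplified log record (action, k, actver, v, w).
    """
    live = [i for i, e in enumerate(entries) if e[2] == INF]
    target = next((i for i in live if entries[i][0] == k), None)
    if target is None:
        raise KeyError(k)              # no live entry with key k in this page
    if len(live) - 1 < min_live:
        raise NeedsSMO(k)              # live count would fall below min_live
    key, v1, _, data = entries[target]
    if v1 != actver:
        entries[target] = (key, v1, actver, data)  # logical deletion
    else:
        del entries[target]                        # physical removal
    return ("delete", k, actver, v1, data)
```

When \texttt{NeedsSMO} is raised, the page is left unchanged, which mirrors the requirement that the structure modification is performed before the action proceeds.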
When the target leaf page~$p$ cannot accommodate the update,
structure modifications are needed.
These operations are explained in \secref{sec:tmvbt:smo}.
For writes, the operation \alg{split-page} is called before