TY - JOUR
T1 - A data structure for a sequence of string accesses in external memory
AU - Ciriani, Valentina
AU - Ferragina, Paolo
AU - Luccio, Fabrizio
AU - Muthukrishnan, S.
PY - 2007/2/1
Y1 - 2007/2/1
N2 - We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of n strings S1, . . ., Sn, we aim at supporting a search sequence for m not necessarily distinct strings T1, T2, . . ., Tm, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches B items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones. Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, that is also interesting per se. The search for the whole sequence T1, T2, . . ., Tm can be done in an expected number of I/Os: O (∑ m j=1 |Tj|/B + ∑ni=1 (ni log B m/ni)), where each Tj may or may not be present in the dictionary, and ni is the number of times Si is queried (i.e., the number of Tj s equal to Si ). Moreover, inserting or deleting a string Si takes an expected amortized number O( |Si |/ B + logB n) of I/Os. The term ∑m j =1 |Tj |/ B in the search formula is a lower bound for reading the input, and the term ∑ni=1 ni logB m/ni (entropy of the query sequence) is a standard information-theoretic lower bound.We regard this result as the static optimality theorem for external-memory string access, as compared to Sleator and Tarjan's classical theorem for numerical dictionaries [Sleator and Tarjan 1985]. Finally,we reformulate the search bound if a cache is available, taking advantage of common prefixes among the strings examined in the search.
AB - We introduce a new paradigm for querying strings in external memory, suited to the execution of sequences of operations. Formally, given a dictionary of n strings S1, . . ., Sn, we aim at supporting a search sequence for m not necessarily distinct strings T1, T2, . . ., Tm, as well as inserting and deleting individual strings. The dictionary is stored on disk, where each access to a disk page fetches B items, the cost of an operation is the number of pages accessed (I/Os), and efficiency must be attained on entire sequences of string operations rather than on individual ones. Our approach relies on a novel and conceptually simple self-adjusting data structure (SASL) based on skip lists, that is also interesting per se. The search for the whole sequence T1, T2, . . ., Tm can be done in an expected number of I/Os: O (∑ m j=1 |Tj|/B + ∑ni=1 (ni log B m/ni)), where each Tj may or may not be present in the dictionary, and ni is the number of times Si is queried (i.e., the number of Tj s equal to Si ). Moreover, inserting or deleting a string Si takes an expected amortized number O( |Si |/ B + logB n) of I/Os. The term ∑m j =1 |Tj |/ B in the search formula is a lower bound for reading the input, and the term ∑ni=1 ni logB m/ni (entropy of the query sequence) is a standard information-theoretic lower bound.We regard this result as the static optimality theorem for external-memory string access, as compared to Sleator and Tarjan's classical theorem for numerical dictionaries [Sleator and Tarjan 1985]. Finally,we reformulate the search bound if a cache is available, taking advantage of common prefixes among the strings examined in the search.
KW - Caching
KW - External-memory data structure
KW - Sequence of string searches and updates
KW - Skip list
UR - http://www.scopus.com/inward/record.url?scp=33847316545&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33847316545&partnerID=8YFLogxK
U2 - 10.1145/1186810.1186816
DO - 10.1145/1186810.1186816
M3 - Article
AN - SCOPUS:33847316545
SN - 1549-6325
VL - 3
JO - ACM Transactions on Algorithms
JF - ACM Transactions on Algorithms
IS - 1
M1 - 1219952
ER -