
The Trie Data Structure: A Neglected Gem


From the very first days in our lives as programmers, we’ve all dealt with data structures: Arrays, linked lists, trees, sets, stacks and queues are our everyday companions, and the experienced programmer knows when and why to use them. In this article we’ll see how an oft-neglected data structure, the trie, really shines in application domains with specific features, like word games.

Word Games as Trie Example

For starters, let’s consider a simple word puzzle: find all the valid words in a 4×4 letter board, connecting adjacent letters horizontally, vertically, or diagonally. For example, in the following board, we see the letters ‘W’, ‘A’, ‘I’, and ‘T’ connecting to form the word “WAIT”.

[Figure: simple word puzzle]

The naive solution to finding all valid words would be to explore the board starting from the upper-left corner and then moving depth-first to longer sequences, starting again from the second letter in the first row, and so on. In a 4×4 board, allowing vertical, horizontal and diagonal moves, there are 12,029,640 sequences, ranging in length from one to sixteen characters.
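The exhaustive exploration described above can be sketched as a small depth-first counter. This is only an illustration, not the article's code: the class and method names are made up here, and it assumes that a cell may be used at most once per sequence, which the text implies but does not state.

```java
class BoardPaths {
    /** Counts every sequence of distinct, adjacent cells on a rows x cols board. */
    static long countSequences(int rows, int cols) {
        boolean[][] used = new boolean[rows][cols];
        long total = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                total += extend(used, r, c);
            }
        }
        return total;
    }

    // Counts the sequences that continue the current path through cell (r, c).
    private static long extend(boolean[][] used, int r, int c) {
        used[r][c] = true;
        long count = 1; // the sequence that ends exactly at (r, c)
        for (int dr = -1; dr <= 1; dr++) {
            for (int dc = -1; dc <= 1; dc++) {
                int nr = r + dr, nc = c + dc;
                boolean inside = nr >= 0 && nr < used.length && nc >= 0 && nc < used[0].length;
                // (r, c) itself is already marked as used, so dr == dc == 0 is skipped here
                if (inside && !used[nr][nc]) {
                    count += extend(used, nr, nc);
                }
            }
        }
        used[r][c] = false; // backtrack
        return count;
    }
}
```

On a 2×2 board every cell touches every other cell, so the count is 4 + 12 + 24 + 24 = 64, which makes a handy sanity check before calling countSequences(4, 4).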

Now, our goal is to find the best data structure to implement this valid-word checker, i.e., our vocabulary. A few points to keep in mind:

  • We only need a single copy of each word, i.e., our vocabulary is a set, from a logical point of view.
  • We need to answer the following questions for any given word:
    • Does the current character sequence comprise a valid word?
    • Are there longer words that begin with this sequence? If not, we can abandon our depth-first exploration, as going deeper will not yield any valid words.

To illustrate the second point, consider the following board: There’s no point in exploring subsequent moves, since there are no words in the dictionary that start with “ASF”.

[Figure: nothing starts with “ASF”]
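To make the pruning idea concrete, here is a minimal sketch of a board solver that abandons a path as soon as no word starts with the current sequence. The class BoardSolver and its helper startsSomeWord are hypothetical names, and a java.util.TreeSet stands in for the vocabulary; the article's own solver may be organized differently.

```java
import java.util.Set;
import java.util.TreeSet;

class BoardSolver {
    private final TreeSet<String> words;              // stand-in vocabulary (assumption)
    private final char[][] board;
    private final Set<String> found = new TreeSet<>();

    BoardSolver(char[][] board, Set<String> vocabulary) {
        this.board = board;
        this.words = new TreeSet<>(vocabulary);
    }

    Set<String> solve() {
        boolean[][] used = new boolean[board.length][board[0].length];
        for (int r = 0; r < board.length; r++) {
            for (int c = 0; c < board[0].length; c++) {
                explore(r, c, new StringBuilder(), used);
            }
        }
        return found;
    }

    // True if the sequence is a word or the beginning of at least one word.
    private boolean startsSomeWord(String sequence) {
        String next = words.ceiling(sequence);
        return next != null && next.startsWith(sequence);
    }

    private void explore(int r, int c, StringBuilder path, boolean[][] used) {
        path.append(board[r][c]);
        used[r][c] = true;
        String sequence = path.toString();
        if (startsSomeWord(sequence)) {               // otherwise prune this branch
            if (words.contains(sequence)) {
                found.add(sequence);
            }
            for (int dr = -1; dr <= 1; dr++) {
                for (int dc = -1; dc <= 1; dc++) {
                    int nr = r + dr, nc = c + dc;
                    if (nr >= 0 && nr < board.length && nc >= 0 && nc < board[0].length
                            && !used[nr][nc]) {
                        explore(nr, nc, path, used);
                    }
                }
            }
        }
        used[r][c] = false;                           // backtrack
        path.setLength(path.length() - 1);
    }
}
```

The two vocabulary questions from the list above map onto words.contains(sequence) and startsSomeWord(sequence); the faster those two calls are, the faster the whole search runs.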

We’d love our data structure to answer these questions as quickly as possible. ~O(1) access time (for checking a sequence) would be ideal!

We can define the Vocabulary interface like this (the full code is in the article's GitHub repo):

public interface Vocabulary {
    boolean add(String word);
    boolean isPrefix(String prefix);
    boolean contains(String word);
}

Trie Data Structure vs. Alternatives

Implementing the contains() method requires a backing data structure that lets you find elements efficiently, while the isPrefix() method requires us to find the “next greater element”, i.e. we need to keep the vocabulary sorted in some way.

We can easily exclude hash-based sets from our list of candidates: while such a structure would give us constant-time checking for contains(), it would perform quite poorly on isPrefix(), in the worst case requiring that we scan the whole set.

For quite the opposite reason, we can also exclude sorted linked-lists, as they require scanning the list at least up to the first element that is greater than or equal to the searched word or prefix.

Two valid options are using a sorted array-backed list or a binary tree.
On the sorted array-backed list we can use binary search to find the current sequence if present or the next greater element at a cost of O(log2(n)), where n is the number of words in the dictionary.

We can implement an array-backed vocabulary that always keeps its words sorted like this, using the standard java.util.ArrayList and java.util.Collections.binarySearch:

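import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class ListVocabulary implements Vocabulary {
    private List<String> words = new ArrayList<String>();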

    /**
     * Constructor that adds all the words and then sorts the underlying list
     */
    public ListVocabulary(Collection<String> words) {
        this.words.addAll(words);
        Collections.sort(this.words);
    }

    public boolean add(String word) {
        int pos = Collections.binarySearch(words, word);
        // pos >= 0 means the word is already in the list. Insert only
        // if it's not there yet (a negative pos encodes the insertion point)
        if (pos < 0) {
            words.add(-(pos+1), word);
            return true;
        }
        return false;
    }

    public boolean isPrefix(String prefix) {
        int pos = Collections.binarySearch(words, prefix) ;
        if (pos >= 0) {
            // The prefix is a word. Check the following word, because we are looking 
            // for words that are longer than the prefix
            if (pos +1 < words.size()) {
                String nextWord = words.get(pos+1);
                return nextWord.startsWith(prefix);
            }
            return false;
        }
        pos = -(pos+1);
        // The prefix is not a word. Check where it would be inserted and get the next word.
        // If it starts with prefix, return true.
        if (pos == words.size()) {
            return false;
        }
        String nextWord = words.get(pos);
        return nextWord.startsWith(prefix);
    }

    public boolean contains(String word) {
        int pos = Collections.binarySearch(words, word);
        return pos >= 0;
    }
}

If we decided to use a binary tree, the implementation could be even shorter and more elegant (again, the code is in the same repo):

import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

public class TreeVocabulary extends TreeSet<String> implements Vocabulary {

    public TreeVocabulary(Collection<String> c) {
        super(c);
    }

    public boolean isPrefix(String prefix) {
        String nextWord = ceiling(prefix);
        if (nextWord == null) {
            return false;
        }
        if (nextWord.equals(prefix)) {
            Set<String> tail = tailSet(nextWord, false);
            if (tail.isEmpty()) {
                return false;
            }
            nextWord = tail.iterator().next();
        }
        return nextWord.startsWith(prefix);
    }

    /**
     * There is a mismatch between the parameter types of vocabulary and TreeSet, so
     * force call to the upper-class method
     */
    public boolean contains(String word) {
        return super.contains(word);
    }
}

In both cases, we can expect O(log n) performance for each access method (contains() and isPrefix()). As for space requirements, both the array-backed implementation and the tree-backed implementation require O(n+M) where n is the number of words in the dictionary and M is the bytesize of the dictionary, i.e. the sum of the length of the strings in the dictionary.

Trie Applications: When and Why Use Tries

Logarithmic performance and linear memory aren’t bad. But there are a few more characteristics of our application domain that can lead us to better performance:

  • We can safely assume that all words are lowercase.
  • We accept only a-z letters—no punctuation, no hyphens, no accents, etc.
  • The dictionary contains many inflected forms: plurals, conjugated verbs, composite words (e.g., house –> housekeeper). Therefore, many words share the same stem.
  • Words have a limited length. For example, if we are working on a 4×4 board, all words longer than 16 chars can be discarded.

This is where the trie (pronounced “try”) comes in. But what exactly is a trie? Tries are neglected data structures, found in books but rarely in standard libraries.

For motivation, let’s first consider Computer Science’s poster child: the binary tree. Now, when we analyze the performance of a binary tree and say operation x is O(log(n)), we’re constantly talking log base 2. But what if, instead of a binary tree, we used a ternary tree, where every node has three children (or, a fan-out of three). Then, we’d be talking log base 3. (That’s a performance improvement, albeit only by a constant factor.) Essentially, our trees would become wider but shorter, and we could perform fewer lookups as we don’t need to descend quite so deep.

Taking things a step further, what if we had a tree with fan-out equal to the number of possible values of our datatype?

This is the motivation behind the trie. And as you may have guessed, a trie is indeed a tree, a trie tree so to speak!

But, contrary to most binary trees that you’d use for sorting strings, which store entire words in their nodes, each node of a trie holds a single character (and not even that, as we’ll see soon) and has a maximum fan-out equal to the size of the alphabet. In our case, the alphabet has 26 letters; therefore the nodes of the trie have a maximum fan-out of 26. And, while a balanced binary tree has log2(n) depth, the maximum depth of the trie is equal to the maximum length of a word! (Again, wider but shorter.)

Within a trie, words with the same stem (prefix) share the memory area that corresponds to the stem.

To visualize the difference, let’s consider a small dictionary made of five words. Assume that the Greek letters indicate pointers, and note that in the trie, red characters indicate nodes holding valid words.

[Figure: visualizing the trie]

Java Trie Implementation

As we know, in a binary tree the pointers to the child elements are usually implemented with a left and a right variable, because the maximum fan-out is fixed at two.

In a trie indexing an alphabet of 26 letters, each node has 26 possible children and, therefore, 26 possible pointers. Each node thus features an array of 26 (pointers to) sub-trees, where each value could either be null (if there is no such child) or another node.

How, then, do we look up a word in a trie? Here is the method that, given a String s, will identify the node that corresponds to the last letter of the word, if it exists in the trie:

public LowercaseTrieVocabulary getNode(String s) {
	LowercaseTrieVocabulary node = this;
	for (int i = 0; i < s.length(); i++) {
		int index = LOWERCASE.getIndex(s.charAt(i));
		LowercaseTrieVocabulary child = node.children[index];
		if (child == null) {
			// There is no such word
			return null;
		}
		node = child;
	}
	return node;
}

The LOWERCASE.getIndex(s.charAt(i)) method simply returns the position of the ith character in the alphabet. On the returned node, a Boolean property node indicates that the node corresponds to the last letter of a word, i.e. a letter marked in red in the previous example. Since each node keeps a counter of the number of children, if this counter is positive then there are longer words in the dictionary that have the current string as a prefix. Note: the node does not really need to keep a reference to the character that it corresponds to, because it’s implicit in its position in the trie.
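Putting the pieces together, a minimal trie with the three vocabulary operations might look like the sketch below. It is not the article's LowercaseTrieVocabulary: the class name SimpleTrie and the fields isWord and childCount are assumptions chosen for illustration, and the index is computed directly as ch - 'a', which only works for lowercase a-z.

```java
class SimpleTrie {
    private static final int ALPHABET_SIZE = 26;

    private final SimpleTrie[] children = new SimpleTrie[ALPHABET_SIZE];
    private boolean isWord;   // true if the path from the root to this node spells a word
    private int childCount;   // number of non-null entries in children

    /** Adds a word; returns false if it was already present. */
    boolean add(String word) {
        SimpleTrie node = this;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';   // assumes only 'a'..'z'
            if (node.children[index] == null) {
                node.children[index] = new SimpleTrie();
                node.childCount++;
            }
            node = node.children[index];
        }
        boolean added = !node.isWord;
        node.isWord = true;
        return added;
    }

    // Follows the characters of s from this node; null if the path does not exist.
    private SimpleTrie getNode(String s) {
        SimpleTrie node = this;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children[s.charAt(i) - 'a'];
        }
        return node;
    }

    boolean contains(String word) {
        SimpleTrie node = getNode(word);
        return node != null && node.isWord;
    }

    /** True if at least one longer word starts with the given prefix. */
    boolean isPrefix(String prefix) {
        SimpleTrie node = getNode(prefix);
        return node != null && node.childCount > 0;
    }
}
```

After adding "house" and "housekeeper", contains("hous") is false because that node is not marked as a word, while isPrefix("house") is true because the node for the final "e" still has a child.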

Analyzing Performance

What makes the trie structure really perform well in these situations is that the cost of looking up a word or prefix is fixed and dependent only on the number of characters in the word and not on the size of the vocabulary.

In our specific domain, since we have strings that are at most 16 characters, at most 16 steps (one per character) are necessary to find a word that is in the vocabulary, while any negative answer, i.e. the word or prefix is not in the trie, can be obtained in at most 16 steps as well! Considering that we have previously ignored the length of the string when calculating running time complexity for both the array-backed sorted list and the tree, which is hidden in the string comparisons, we can as well ignore it here and safely state that lookup is done in O(1) time.

Considering space requirements (and remembering that we have indicated with M the bytesize of the dictionary), the trie could have M nodes in the worst case, if no two strings shared a prefix. But since we have observed that there is a high degree of redundancy in the dictionary, there is a lot of compression to be done. The English dictionary that is used in the example code is 935,017 bytes and requires 250,264 nodes, i.e. about 73% fewer nodes than bytes.

However, despite this, even a compressed trie will usually require more memory than a tree or array. This is because, for each node, at least 26 x sizeof(pointer) bytes are necessary, plus some overhead for the object and additional attributes. On a 64-bit machine, each node requires more than 200 bytes, whereas a string character requires a single byte, or two if we consider UTF-16 strings.
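The figures quoted above can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration; in particular, the 8-byte pointer is an assumption (a JVM may use smaller compressed references), and object headers and other fields are ignored.

```java
class TrieMemoryEstimate {
    /** Fraction of nodes saved compared with one node per dictionary byte. */
    static double nodeSavings(long dictionaryBytes, long trieNodes) {
        return 1.0 - (double) trieNodes / dictionaryBytes;
    }

    /** Bytes taken by the child-pointer arrays alone (headers and flags excluded). */
    static long childArrayBytes(long trieNodes, int alphabetSize, int pointerBytes) {
        return trieNodes * alphabetSize * pointerBytes;
    }
}
```

With the article's numbers, nodeSavings(935017, 250264) is about 0.73, and childArrayBytes(250264, 26, 8) is 52,054,912 bytes: roughly 52 MB of pointers for a dictionary that occupies less than 1 MB as plain text.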

Tries and Performance Tests

So, what about performance? The vocabulary implementations were tested in two different situations: checking for 20,000,000 random strings and finding all the words in 15,000 boards randomly generated from the same word list.

Four data structures were analyzed: an array-backed sorted list, a binary tree, the trie described above, and a trie using arrays of bytes corresponding to the alphabet-index of the characters themselves (a minor and easily implemented performance optimization). Here are the results, in ms:

[Figure: performance results]

The average number of moves made to solve the board is 2,188. For each move, a word lookup and a prefix lookup are done, i.e., for checking all the boards, more than 32M word lookups and 32M prefix lookups were performed. Note: these could be done in a single step; I kept them separate for clarity of exposition. Combining them into a single step would cut the time for solving the boards almost in half, and would probably favour the trie even more.

As can be seen above, word lookups perform better with the trie even when using strings, and are even faster when using alphabet indexes, with the latter performing more than twice as fast as a standard binary tree. The difference in solving the boards is even more evident, with the fast trie-alphabet-index solution being more than four times as fast as the list and the tree.

Wrapping Up

The trie is a very specialized data structure that requires much more memory than trees and lists. However, when specific domain characteristics apply, like a limited alphabet and high redundancy in the first part of the strings, it can be very effective in addressing performance optimization.

References

An extensive explanation of tries and alphabets can be found in chapter 5 of Robert Sedgewick’s book “Algorithms, 4th edition”. The companion website at Princeton has the code for an implementation of Alphabet and TrieST that is more extensive than my example.

Description of the trie and implementations for various languages can also be found on Wikipedia and you can take a look at this Boston University trie resource as well.


5.2   Tries

This section under major construction.

Symbol tables with string keys.

  • Could use standard symbol table implementation.
  • Instead, exploit additional structure of string keys.
  • Customized searching algorithms for strings (and other keys represented as digits).
  • Goal: as fast as hashing, more flexible than binary search trees.
  • Can efficiently support additional operations including prefix and wildcard matching, e.g., IP routing table wants to forward to 128.112.136.12, instead forwards to 128.112 which is longest matching prefix that it knows about.
  • Side benefit: fast and space-efficient string searching.
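The longest-prefix lookup mentioned in the routing example can be sketched as follows. This is a simplified illustration, not the booksite's TrieST: the class PrefixTable is hypothetical, it stores children in a HashMap rather than an R-way array, and it matches character by character on dotted strings, whereas a real router matches on address bits.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixTable {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean isKey;   // true if the path from the root to this node is a stored key
    }

    private final Node root = new Node();

    void put(String key) {
        Node node = root;
        for (char ch : key.toCharArray()) {
            node = node.next.computeIfAbsent(ch, k -> new Node());
        }
        node.isKey = true;
    }

    /** Returns the longest stored key that is a prefix of query, or null if none is. */
    String longestPrefixOf(String query) {
        Node node = root;
        int longest = root.isKey ? 0 : -1;
        for (int i = 0; i < query.length(); i++) {
            node = node.next.get(query.charAt(i));
            if (node == null) {
                break;               // the query leaves the trie here
            }
            if (node.isKey) {
                longest = i + 1;     // remember the deepest key seen so far
            }
        }
        return longest < 0 ? null : query.substring(0, longest);
    }
}
```

With the keys "128", "128.112" and "128.112.055" stored, looking up "128.112.136.12" walks down the trie until the path runs out and reports "128.112", the deepest key it passed on the way.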


R-way tries. Program TrieST.java implements a string symbol table using a multiway trie.

Ternary search tries. Program TST.java implements a string symbol table using a ternary search trie.

Reference: Fast Algorithms for Sorting and Searching Strings by Bentley and Sedgewick.

Property A. (Bentley-Sedgewick) Given an input set, the number of nodes in its TST is the same, regardless of the order in which the strings are inserted.

Pf. There is a unique node in the TST for each distinct string prefix in the set. The relative positions of the nodes within the TST can change depending on the insertion order, but the number of nodes is invariant.
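Property A is easy to observe experimentally. The sketch below is a stripped-down ternary search trie written for this illustration (it is not the booksite's TST.java, and the nodeCount field is an addition made here); it stores keys only, without values, and rejects the empty string.

```java
class TernarySearchTrie {
    private static final class Node {
        final char ch;
        Node left, mid, right;
        boolean isKey;
        Node(char ch) { this.ch = ch; }
    }

    private Node root;
    private int nodeCount;

    void put(String key) {
        if (key.isEmpty()) {
            throw new IllegalArgumentException("empty keys are not supported");
        }
        root = put(root, key, 0);
    }

    private Node put(Node node, String key, int d) {
        char ch = key.charAt(d);
        if (node == null) {
            node = new Node(ch);
            nodeCount++;
        }
        if (ch < node.ch) {
            node.left = put(node.left, key, d);
        } else if (ch > node.ch) {
            node.right = put(node.right, key, d);
        } else if (d < key.length() - 1) {
            node.mid = put(node.mid, key, d + 1);   // character matched, move to the next one
        } else {
            node.isKey = true;                      // last character of the key
        }
        return node;
    }

    boolean contains(String key) {
        if (key.isEmpty()) {
            return false;
        }
        Node node = root;
        int d = 0;
        while (node != null) {
            char ch = key.charAt(d);
            if (ch < node.ch) {
                node = node.left;
            } else if (ch > node.ch) {
                node = node.right;
            } else if (d < key.length() - 1) {
                node = node.mid;
                d++;
            } else {
                return node.isKey;
            }
        }
        return false;
    }

    int nodeCount() {
        return nodeCount;
    }
}
```

Inserting she, sells, sea, shells in any order yields 11 nodes, one for each distinct non-empty prefix (s, sh, she, shel, shell, shells, se, sea, sel, sell, sells); only the shape of the left and right links depends on the order.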

Advanced operations.

Wildcard search, prefix match. The r-way trie and TST implementations include code for wildcard matching and prefix matching.


Lazy delete = change the word boundary bit.
Eager delete = clean up any dead parent links.
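The difference between the two deletion strategies can be sketched on a small 26-way trie. The class DeletableTrie and its method names are hypothetical and assume lowercase a-z keys; the point is only that lazy deletion flips a flag while eager deletion also unlinks nodes that no longer lead to any word.

```java
class DeletableTrie {
    private static final int R = 26;

    private static final class Node {
        final Node[] next = new Node[R];
        boolean isWord;   // the "word boundary bit"
    }

    private Node root = new Node();

    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int index = word.charAt(i) - 'a';
            if (node.next[index] == null) {
                node.next[index] = new Node();
            }
            node = node.next[index];
        }
        node.isWord = true;
    }

    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.next[s.charAt(i) - 'a'];
        }
        return node;
    }

    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.isWord;
    }

    /** Lazy delete: clear the word flag and leave every node in place. */
    void lazyDelete(String word) {
        Node node = find(word);
        if (node != null) {
            node.isWord = false;
        }
    }

    /** Eager delete: clear the flag, then unlink nodes that lead to no word. */
    void eagerDelete(String word) {
        root = eagerDelete(root, word, 0);
        if (root == null) {
            root = new Node();   // keep an empty root for later insertions
        }
    }

    private Node eagerDelete(Node node, String word, int d) {
        if (node == null) {
            return null;
        }
        if (d == word.length()) {
            node.isWord = false;
        } else {
            int index = word.charAt(d) - 'a';
            node.next[index] = eagerDelete(node.next[index], word, d + 1);
        }
        if (node.isWord) {
            return node;         // still marks a word, keep it
        }
        for (Node child : node.next) {
            if (child != null) {
                return node;     // still on the path to another word, keep it
            }
        }
        return null;             // dead node, let the parent drop the link
    }

    int nodeCount() {
        return count(root);
    }

    private int count(Node node) {
        if (node == null) {
            return 0;
        }
        int total = 1;
        for (Node child : node.next) {
            total += count(child);
        }
        return total;
    }
}
```

Lazy deletion is cheaper per call but leaves dead branches that later lookups still walk through; eager deletion pays for the cleanup once and keeps the trie as small as its current key set allows.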

Application: T9 text input for cell phones. User types using phone pad keys; system displays all words that correspond (and auto-completes as soon as it is unique). If user types 0, system displays all possible auto-completions.
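As a rough illustration of the key-to-word mapping behind T9, the sketch below encodes each word as its digit sequence and groups words by that sequence. It is deliberately simplified: the class T9Dictionary is hypothetical, it uses a hash map instead of a trie, it assumes lowercase a-z input, and it only returns exact matches. Auto-completion would need prefix queries over the digit strings, which is where a trie keyed on digits comes in.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class T9Dictionary {
    // Letters printed on phone keys 2..9
    private static final String[] KEYS = {"abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};

    private final Map<String, List<String>> wordsByDigits = new HashMap<>();

    /** Converts a lowercase word to the key presses that produce it, e.g. "home" -> "4663". */
    static String toDigits(String word) {
        StringBuilder digits = new StringBuilder();
        for (char ch : word.toCharArray()) {
            for (int i = 0; i < KEYS.length; i++) {
                if (KEYS[i].indexOf(ch) >= 0) {
                    digits.append((char) ('2' + i));
                }
            }
        }
        return digits.toString();
    }

    void add(String word) {
        wordsByDigits.computeIfAbsent(toDigits(word), k -> new ArrayList<>()).add(word);
    }

    /** Returns the words typed by exactly this digit sequence, in insertion order. */
    List<String> wordsFor(String digits) {
        return wordsByDigits.getOrDefault(digits, Collections.emptyList());
    }
}
```

Both "home" and "good" map to 4663, which is exactly the ambiguity the phone has to resolve by showing all corresponding words.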

Q+A

Exercises

  1. Write nonrecursive versions of an R-way trie string set and a TST.
  2. Unique substrings of length L.
    Write a program that reads in text from standard input and calculates
    the number of unique substrings of length L that it contains.
    For example, if the input is cgcgggcgcg then
    there are 5 unique substrings of length 3:
    cgc, cgg, gcg, ggc, and ggg.
    Applications to data compression.
    Hint: use the string method substring(i, i + L)
    to extract the ith substring and insert it into a symbol table. Alternative solution:
    compute the hash of the (i+1)st substring using the hash of the ith
    substring. Test it out on the first million digits of π
    or the first 10 million digits of π.

  3. Unique substrings.
    Write a program that reads in text from standard input and calculates
    the number of distinct substrings of any length.
    (Can do very efficiently with a suffix tree.)

  4. Document similarity.
    To determine the similarity of two documents, calculate
    the number of occurrences of each trigram (3 consecutive letters).
    Two documents are similar if the Euclidean distance between the
    frequency vector of trigrams is small.

  5. Spell checking.
    Write a program SpellChecker.java
    that takes the name of a file
    containing a dictionary of words in the English language, and then
    reads strings from standard input and prints out any word that
    is not in the dictionary. Use a string set.

  6. Spam blocklist.
    Insert known spam email addresses into an existence table
    and use it to blocklist spam.

  7. IP lookup by country.
    Use the data file ip-to-country.csv
    to determine what country a given IP address is coming from. The data file
    has five fields (beginning of IP address range, end of IP address range,
    two-character country code, three-character country code, and country name).
    See The IP-to-country website.
    The IP address ranges are non-overlapping.
    Such a database tool can be used for: credit card fraud detection,
    spam filtering, auto-selection of language on a web site, and web server log
    analysis.

  8. Inverted index of web.
    Given a list of web pages, create a symbol table of words contained in
    the web pages. Associate with each word a list of web pages in which that
    word appears. Write a program that reads in a list of web pages, creates
    the symbol table, and supports single-word queries by returning the list
    of web pages in which that query word appears.

  9. Inverted index of web.
    Extend the previous exercise so that it supports multi-word queries.
    In this case, output the list of web pages that contain at least one occurrence
    of each of the query words.

  10. Symbol table with duplicates.
  11. Password checker.
    Write a program that reads in a string from the command line and
    a dictionary of words from standard input, and
    checks whether it is a “good” password. Here, assume “good” means
    that it (i) is at least 8 characters long, (ii) is not a word in
    the dictionary, (iii) is not a word in the dictionary followed
    by a digit 0-9 (e.g., hello5), (iv) is not two words separated by a
    digit (e.g., hello2world)

  12. Reverse password checker.
    Modify the previous problem so that (ii) – (iv) are also satisfied for
    reverses of words in the dictionary (e.g., olleh and olleh2world).
    Clever solution: insert each word and its reverse into the
    symbol table.

  13. Random phone numbers.
    Write a program that takes a command-line input N and prints N
    random phone numbers of the form (xxx) xxx-xxxx. Use a symbol
    table to avoid choosing the same number more than once.
    Use this list of area codes
    to avoid printing out bogus area codes.
    Use an R-way trie.

  14. Contains prefix.
    Add a method containsPrefix() to StringSET that
    takes a string s as input and returns true if there is a
    string in the set that contains s as a prefix.

  15. Substring matches.
    Given a list of (short) strings, your goal is to support queries
    where the user looks up a string s and your job is to report
    back all strings in the list that contain s.
    Hint: if you only want prefix matches (where the
    strings have to start with s), use a TST as described in the text.
    To support substring matches, insert the suffixes
    of each word (e.g., string, tring, ring, ing, ng, g)
    into the TST.

  16. Zipf’s law.
    Harvard linguist George Zipf observed that the frequency of the ith most
    common word in an English text containing N words
    is roughly proportional to 1/i, where the normalizing constant
    is 1 + 1/2 + 1/3 + … + 1/N.
    Test Zipf’s law
    by reading in a sequence of words from standard input,
    tabulating their frequencies, and comparing against the predicted
    frequencies.

  17. Typing monkeys and power laws.
    (Michael Mitzenmacher)
    Suppose that a
    typing monkey
    creates random words by
    appending each of 26 possible letters with probability p to the
    current word, and finishes the word with probability
    1 – 26p. Write a program to estimate the frequency distribution
    of the lengths of words produced. If “abc” is produced more than
    once, only count it once.

  18. Typing monkeys and power laws.
    Repeat the previous exercise, but assume that the letters a-z occur
    proportional to the following probabilities, which are typical of
    English text.

    CHAR  FREQ     CHAR  FREQ     CHAR  FREQ     CHAR  FREQ     CHAR  FREQ
    A      8.04    G     1.96     L     4.14     Q     0.11     V     0.99
    B      1.54    H     5.49     M     2.53     R     6.12     W     1.92
    C      3.06    I     7.26     N     7.09     S     6.54     X     0.19
    D      3.99    J     0.16     O     7.60     T     9.25     Y     1.73
    E     12.51    K     0.67     P     2.00     U     2.71     Z     0.09
    F      2.30

  19. Indexing a book.
    Write a program that reads in a text file from standard input and compiles
    an alphabetical index of which words appear on which lines, as in
    the following example (input first, then the index). Ignore case and punctuation.

    It was the best of times,
    it was the worst of times,
    it was the age of wisdom,
    it was the age of foolishness,
    
    age 3-4
    best 1
    foolishness 4
    it 1-4
    of 1-4
    the 1-4
    times 1-2
    was 1-4
    wisdom 3
    worst 2
    
  20. Entropy.
    We define the relative entropy of a text corpus with N words, k of which
    are distinct, as E = 1 / (N log N) * sum_{i = 1..k} (p_i log(k) – log(p_i)),
    where p_i is the fraction of times that word i appears.
    Write a program that reads in a text corpus and prints out the relative
    entropy. Convert all letters to lowercase and treat punctuation marks
    as whitespace.

  21. Longest prefix.
    True or false. The longest prefix of a binary string x that is a key in
    a symbol table is either the floor of x
    or the ceiling of x (or both if x is in the set).

    False. The longest prefix of 1100 in { 1, 10, 1011, 1111 } is 1, not 1011 or 1111.
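The counterexample in exercise 21 can be checked with java.util.TreeSet, whose floor() and ceiling() methods follow the usual lexicographic string order. The helper class LongestPrefixCheck and its brute-force scan are written only for this illustration; a trie would find the longest prefix without scanning every key.

```java
import java.util.TreeSet;

class LongestPrefixCheck {
    /** Scans all keys and returns the longest one that is a prefix of x, or null. */
    static String longestPrefixKey(TreeSet<String> keys, String x) {
        String best = null;
        for (String key : keys) {
            if (x.startsWith(key) && (best == null || key.length() > best.length())) {
                best = key;
            }
        }
        return best;
    }
}
```

For the set { 1, 10, 1011, 1111 } and x = 1100, floor(x) is 1011 and ceiling(x) is 1111, yet the only key that is a prefix of x is 1, so neither neighbor in sorted order gives the answer.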

Creative Exercises

Web Exercises

