Current web page describes the Fast WordNet project.
If you are looking for original WordNet website, or other details about the very popular WordNet lexical database , please click here.
Current project was developed as a part of master program in Human Language Technology and Interfaces from University of Trento, Italy
In search through huge vocabularies, usually it is taken into account the first or first and second letters of a word as a reference key to hash table. Mainly, they are processed in order to have a uniform distribution of the size of the lists, and this is due to have the same complexities among searches in lists.
In this project we try to minimize the search algorithm by including some new features. Any feature can be included in search on which we can define the order operator, or features which are comparable. Having an order on our data, it is possible to apply Binary Search.
Humans are able to recognize words by shapes of the word, with no need in reading each letters.
So called mixed up letters suggests that words first and last characters includes important information about word.
occdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.
For more information about studies from Cambridge University about Mixed Up Letters, please click here, or search it on GOOGLE
First letters of the word are widely exploited as a feature in search algorithm and hash table indexing.
Therefore, starting from this evidence, we try to extract new features and exploit them with the same aim, reducing the complexity of searching. Below, we exploit these geometrical (orthographical) features and show the statistics and the words distribution among WordNet, and they are: first character, last character, and word length. We will start in analyzing the distribution of words by first character and by last character. As distribution of words by (last/first) character is meant the number of words that start with c character, where c takes value from A to Z as ASCII code.
Figure 1. Words distribution by first character (left),and by last character (right)
It seems that English prefers a list of 6 characters to put at the beginning of the each word, and they are: C13127 tokens, I-8493 tokens, O- 7479 tokens, R- 10468 tokens, U- 14635 tokens, Y 3599 tokens. To mention that the maximum number of words organized by first character is 14635, and the complexity in this case would be Log2(14635)=14. The distribution is not uniform (we have variety among number of words per character), this being one of the reasons that first character is not used as a primary key in hash tables alone. The ASCII code of the first character is divided by a number, and the rest of the division is taken as a primary key for hash table indexing. Having a not uniform distribution of lists over hash table will give us a not equilibrated search complexity. In this case, if we index in the hash table with indexing key as first character, we would have a worst case of complexity O(log2 (14635)) and the best case of complexity as O(log2(161)), which is a big difference. The distribution of words by last character is similar to the distribution of words by first character (figure 1, we have 6 preferred letters), with the difference between the local maxima and global maxima. 24.542 of words in English end with G character. It is excluded to have a similar distribution for others languages.
Figure 2. Words distribution by length
It is not important which is the character, but it is important that we have local maximum and minimum far from each others. The distribution of the words by length instead, is different then what we sow before, and it is that we have the Gaussian distribution (figure 2), with maximum point reached in 10, with 12224 terms. Gaussian distribution of words by lengths suggests that language prefers to have few short words, and few long words, while the number of words of length around 10 characters is the highest. After analyzing all these distribution, we try to exploit them due tot build a hash table, based on first, last character and length. In figure 3 these distributions are shown, in the left image we have the distribution of words having the starting with the same character, having the same length, while in the right image we have words starting with the same character, ending with the same character and having the same length. In a hash table, the vertical axis numbers would be the length of the lists, while the horizontal axis would be the order number of the list. In a hash table we could refer to these lists by having first character and word length for left image, and first character, last character plus the length for the right image.
Figure 3. Words distribution by length and first characters (left),and by length, first, and last character (right)
Using geometrical features such as first, last character, and the length, will reduce the size of hash tables to maximum 300, while when using only first character and the length of the word it will reduce the number of possible words to 2403. It shows that language tends to solve minimization problem, where the minimization problem is to reduce the number of possible word geometrical features: word length, first and last characters. In this statistics are not included others characters from the words. Short length words are less, and this is due to the number of possible combinations can be done with 3 characters. Long words are also few, and this is due to the memory required for remembering. This statistics are language dependent, and every language has its own preference for letters. English indeed, we have these distributions.
For search algorithm such as binary search, the difference is between these two distribution is not considered as being relevant, as the worst cases of complexities would be: log2(2403)= 11.2306, log2(300)=8.2288, and the difference between them two is only 3. Also, it is much simpler to implement and represent hash tables by taking as primary key only the first character and the length, and especially because the last character does not give much relevant information about the possible position. In a dictionary words are organized by order of characters and word lengths, while the last character will depend on the length and the ending of the previous word. Also, we save much memory space if we dont index last character in hash table. Therefore, the hash table in our experiments is built based only on first character and words lengths.
Results on complexity
The experiments to compare these two methods were done directly on array of words with no hash table and on hashed table by first words character and words lengths. In the search applied directly on the array of words the worst complexity is 18, while on hashed table the worst case complexity is 12. The experiments for comparing these methods do not aim to show the differences between worst case complexities, instead we aim at showing the difference between medium complexities of searches. As mentioned before, an indexing by first character would have complexity 14, which is not far from our method, where we use hashed table with orthographical features for search. Without a qualitative analysis of complexities would not be possible to see the improvements. In figure 4 we have distributions of complexity for both cases.
Figure 4. Complexities using non hashed table (left), and complexities using hashed table by first character and length of word (right)
In both images from figure 4, the points on vertical axis close to zero are the best case complexities, and the points close to maximum are the worst cases of complexities. In the right image, we have more points close to zero and less close to 12, in comparison with the left image, where we have few points closer to zero and many points closer to 18. That means that applying the search based only on the first character of the word will lead us to a complexity close to maximum and will exclude many best cases of complexity. The median of these two cases will show us the medium complexity, as we assign the same probability for each word being searched.
In first case (only first character, the left image) we have the median equal to 16, while the worst case complexity is 18. Though, the medium complexity for this case is less only by 11.11 % then worst complexity. The median in second case is equal to 8, with the worst case complexity equal to 12, and the medium complexity is less then worst case complexity by 33.33 %, which is an improvement. In table 1 this results are shown, in terms of complexity and number of words having these complexities in search. For example, for search using only the first character as feature we have 44091 of words with worst case complexity, while in search using first character and word length we have only 516 of worst case complexity, while for the medium complexity we have the maximum number of cases. The best case complexity also improves in this case, with a big difference.
Author: Vitalie Scurtu
Contact:scurtu19 (at) yahoo.de