For each document - whether it be a page from an Issue Crawler network or text submitted by the user - the Issue Discovery Tool does the following:

  1. Build a phrase list of noun phrases and capitalized sequences (yielding proper nouns, acronyms, ...).
  2. Add to the phrase list the significant words or phrases that the Yahoo Term Extraction Web Service extracts from a larger source set of content.
  3. Adjust the resulting list as follows:
  • Lowercase all phrases in the list (for easy comparison)
  • Remove phrases that have a length less than 3
  • Weight each phrase found in the previous steps: start with the number of times the phrase appears in the document. If the phrase comes from Yahoo, add 1 to that count (this favors Yahoo's presumed robustness). If the phrase does not come from Yahoo but contains multiple terms, add 2 to that count (this prefers multi-term phrases over single terms when they did not come from Yahoo).
  • Remove phrases that are on the stop word list.
  • Remove phrases that are also part of a longer phrase in the list.
  • Sum the weights of each phrase across all documents into one combined list.
  • Rank the list.
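
The adjustment, summing and ranking steps above can be sketched in Python roughly as follows. This is an illustration only, not the tool's actual code: the function names, the sample stop word list, the reading of "length less than 3" as a character count, the substring-based occurrence count, and the word-boundary test for "part of a longer phrase" are all assumptions. Phrase extraction (steps 1 and 2) is taken as given, so the two phrase lists are passed in as plain arguments.

```python
from collections import Counter

# Hypothetical stop word list; the tool's real list is not given in the text.
STOP_WORDS = {"the", "and", "this"}
MIN_LENGTH = 3  # assumption: length is measured in characters


def contains_phrase(longer, shorter):
    """True if `shorter` occurs in `longer` on word boundaries (assumed test)."""
    return shorter != longer and f" {shorter} " in f" {longer} "


def weigh_document(text, local_phrases, yahoo_phrases):
    """Return {phrase: weight} for a single document.

    local_phrases: noun phrases and capitalized sequences (step 1).
    yahoo_phrases: phrases returned by the term extraction service (step 2).
    """
    text = text.lower()
    local = {p.lower() for p in local_phrases}
    yahoo = {p.lower() for p in yahoo_phrases}

    # Lowercase everything, then drop phrases that are too short.
    candidates = {p for p in local | yahoo if len(p) >= MIN_LENGTH}

    weights = {}
    for phrase in candidates:
        weight = text.count(phrase)       # occurrences in the document
        if phrase in yahoo:
            weight += 1                   # bonus for Yahoo phrases
        elif len(phrase.split()) > 1:
            weight += 2                   # bonus for non-Yahoo multi-term phrases
        weights[phrase] = weight

    # Remove stop words, then phrases contained in a longer phrase.
    weights = {p: w for p, w in weights.items() if p not in STOP_WORDS}
    return {
        p: w for p, w in weights.items()
        if not any(contains_phrase(q, p) for q in weights)
    }


def rank_phrases(documents):
    """Sum weights over (text, local_phrases, yahoo_phrases) tuples and rank."""
    totals = Counter()
    for text, local_phrases, yahoo_phrases in documents:
        totals.update(weigh_document(text, local_phrases, yahoo_phrases))
    # Highest weight first; ties are broken alphabetically (an arbitrary choice).
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))
```

Calling `rank_phrases` with one tuple per document returns a list of `(phrase, weight)` pairs ordered from highest to lowest summed weight.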

The Issue Discovery Tool is not designed to 'give proper weight' to items. It is a heuristic for data exploration rather than an empirical tool.
Topic revision: r3 - 01 Dec 2009, ErikBorra