Overview

Query Term Analyzer analyzes attributes of query terms, such as their importance level and synonym pairs. To get the latest source code, go to the [SOURCE CODE] section and download the project.

Term Weight Analyzer

The analyzer mainly uses user behavior data and statistical machine learning algorithms, such as CRF, GBDT, and language models, to predict each term's weight and importance level in a given query. The core algorithm is independent of language and word breaker, so it can be applied to many languages, with different word breakers or with none. The main workflow for training the term weight analyzer is as follows:

[Figure: Query Term Weight Analyzer Workflow]

Currently, the ranker model contains the following features:

| Feature Name | Description | Importance |
| --- | --- | --- |
| WordFormation | Term weight level generated by the CRF model with N-gram features. This feature contains the knowledge of query patterns | 1.0 |
| LanguageModel | Value = {perplexity (entire query) / perplexity (entire query without current term)}. This feature reflects the query quality with/without the current term | 0.5054572 |
| TermRank4Percent | The percentage of the current term as level 4 in global statistics | 0.4699351 |
| IsEndTerm | Whether the current term is at the end of a specified query | 0.2596452 |
| TermRank2Percent | The percentage of the current term as level 2 in global statistics | 0.06063229 |
| TermRank1Percent | The percentage of the current term as level 1 in global statistics | 0.06028711 |
| TermLength | The length of the current term in characters | 0.04571865 |
| TermOffset | The offset of the current term in a specified query | 0.02643343 |
| TermRank3Percent | The percentage of the current term as level 3 in global statistics | 0.02244079 |
| TermRank0Percent | The percentage of the current term as level 0 in global statistics | 0.0065783 |
| IsPunct | Whether the current term is a punctuation mark | 0.002716962 |
| IsBeginTerm | Whether the current term is at the beginning of a specified query | 0.001061089 |
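
The LanguageModel feature is a ratio of two perplexities, which can be illustrated with a toy model. The following is a minimal sketch, assuming an add-one-smoothed unigram model as a stand-in for the real language model (the project uses LMSharp, whose API is not shown here). The class ToyUnigramModel and its methods are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for a language model: add-one-smoothed unigram counts.
// Hypothetical class for illustration only; the real analyzer uses LMSharp.
public class ToyUnigramModel
{
    private readonly Dictionary<string, int> counts = new Dictionary<string, int>();
    private int total;

    public ToyUnigramModel(IEnumerable<string> corpusTokens)
    {
        foreach (string token in corpusTokens)
        {
            int c;
            counts.TryGetValue(token, out c);
            counts[token] = c + 1;
            total++;
        }
    }

    private double Probability(string term)
    {
        int c;
        counts.TryGetValue(term, out c);
        // The extra +1 in the denominator reserves probability mass for unseen terms.
        return (c + 1.0) / (total + counts.Count + 1.0);
    }

    // Perplexity = exp(-(1/N) * sum of log p(term)) over the N terms.
    public double Perplexity(IList<string> terms)
    {
        double logSum = terms.Sum(t => Math.Log(Probability(t)));
        return Math.Exp(-logSum / terms.Count);
    }

    // Feature value as described in the table:
    // perplexity(entire query) / perplexity(entire query without the term at index i).
    public double LanguageModelFeature(IList<string> terms, int i)
    {
        if (terms.Count < 2)
        {
            throw new ArgumentException("The query needs at least two terms.");
        }
        List<string> without = terms.Where((t, idx) => idx != i).ToList();
        return Perplexity(terms) / Perplexity(without);
    }
}
```

In this toy setup, a value above 1 means the query is less predictable with the term than without it, and a value below 1 means the opposite. How the trained ranker actually uses the value is not specified here.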

Here are some examples in different languages:

Raw English query = {example of a retail resume}
Analyzed result = {example[RANK_0, 0.78] of[RANK_3, 0.30] a[RANK_4, 0.18] retail[RANK_2, 0.50] resume[RANK_0, 0.86]}

RANK_X is each term's importance level, and the floating-point number that follows is the weight score, ranging from 0.0 to 1.0. For the query above, the analyzed result shows that "example" and "resume" are core terms, "of" and "retail" are normal terms, and "a" is an optional term.

Raw Chinese query = {2011年重庆社保缴费基数}
Analyzed result = {2011[RANK_2, 0.48] 年[RANK_3, 0.17] 重庆[RANK_1, 0.71] 社保[RANK_0, 0.89] 缴费[RANK_3, 0.18] 基数[RANK_0, 0.81]}

The analyzed result shows that "社保" and "基数" are core terms, "重庆" (RANK_1) is an important term, "2011" is a normal term, and "年" and "缴费" are optional terms.

Raw Japanese query = {高齢者の医療の確保に関する法律解説}
Analyzed result = {高齢[RANK_0, 0.81] 者[RANK_0, 0.79] の[RANK_3, 0.22] 医療[RANK_0, 0.76] の[RANK_0, 0.76] 確保[RANK_0, 0.81] に[RANK_2, 0.39] 関する[RANK_2, 0.29] 法律[RANK_2, 0.43] 解説[RANK_3, 0.23]}

The analyzed result shows that "高齢者" and "医療の確保" are core terms, "に関する法律" are normal terms, and the first "の" and "解説" are optional terms.

Raw Korean query = {재미있는스포츠동영상}
Analyzed result = {재[RANK_2, 0.28] 미[RANK_2, 0.30] 있[RANK_2, 0.26] 는[RANK_2, 0.27] 스[RANK_0, 0.78] 포[RANK_0, 0.80] 츠[RANK_0, 0.80] 동[RANK_3, 0.11] 영[RANK_3, 0.11] 상[RANK_3, 0.10]}

The analyzed result shows that "스포츠" is a core term, "재미있는" is a normal term, and "동영상" is an optional term.
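
In the Japanese and Korean examples, the prose reads neighbouring tokens with the same rank as one phrase (for example "스", "포", "츠" as "스포츠"). The following is a minimal sketch of that reading, not part of the analyzer itself. RankGrouping and GroupByRank are hypothetical names, and the tokens are joined without a separator, which is an assumption that suits unsegmented scripts but not space-delimited languages.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of QueryTermWeightAnalyzer.dll.
public static class RankGrouping
{
    // Merges neighbouring terms that share the same rank into one phrase.
    // Input: (term, rank) pairs in query order. Output: (phrase, rank) pairs.
    public static List<KeyValuePair<string, int>> GroupByRank(
        IList<KeyValuePair<string, int>> tokens)
    {
        var groups = new List<KeyValuePair<string, int>>();
        foreach (KeyValuePair<string, int> token in tokens)
        {
            int last = groups.Count - 1;
            if (last >= 0 && groups[last].Value == token.Value)
            {
                // Same rank as the previous token: extend the current phrase.
                // Assumption: no separator is inserted between the terms.
                groups[last] = new KeyValuePair<string, int>(
                    groups[last].Key + token.Key, token.Value);
            }
            else
            {
                // Rank changed: start a new phrase.
                groups.Add(token);
            }
        }
        return groups;
    }
}
```

Applied to the Korean result above, this yields three phrases: "재미있는" at rank 2, "스포츠" at rank 0, and "동영상" at rank 3.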

From the examples above, we can see that the analyzer labels each term with RANK_X to show its importance level and calculates a weight score ranging from 0.0 to 1.0. X is a non-negative integer; the smaller X is, the more important the term. X is interpreted as follows:

1) If X is 0 or 1, the term is a core term or an important term in the query. Without such a term, the meaning of the query is changed or incomplete.

2) If X is larger than 1 and is also the largest rank among all labels in the result, the term should be considered an optional term in the query. Dropping one or more optional terms should not affect the query's semantics.

3) For any other value of X, the term can be considered a normal term in the query. Dropping one or more normal terms does not obviously change the query's main meaning, but it may make the meaning fuzzy.
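
The three rules above can be sketched in code. This is a minimal illustration, assuming the textual result format shown in the examples on this page (term[RANK_X, score]). TermKind, LabeledTerm, and RankInterpreter are hypothetical names and are not part of QueryTermWeightAnalyzer.dll.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

public enum TermKind { Core, Normal, Optional }

// Hypothetical container for one labeled term.
public class LabeledTerm
{
    public string Term;
    public int Rank;
    public double Score;
    public TermKind Kind;
}

// Hypothetical helper for illustration only.
public static class RankInterpreter
{
    // Matches items such as "retail[RANK_2, 0.50]"; braces and brackets
    // are excluded from the term so "{example[...]" parses as "example".
    private static readonly Regex TokenPattern = new Regex(
        @"(?<term>[^\s\[\]{}]+)\[RANK_(?<rank>\d+),\s*(?<score>[0-9.]+)\]");

    public static TermKind Classify(int rank, int maxRank)
    {
        if (rank <= 1) return TermKind.Core;            // rule 1: X is 0 or 1
        if (rank == maxRank) return TermKind.Optional;  // rule 2: X > 1 and largest
        return TermKind.Normal;                         // rule 3: everything else
    }

    public static List<LabeledTerm> Parse(string analyzed)
    {
        var terms = new List<LabeledTerm>();
        foreach (Match m in TokenPattern.Matches(analyzed))
        {
            terms.Add(new LabeledTerm
            {
                Term = m.Groups["term"].Value,
                Rank = int.Parse(m.Groups["rank"].Value, CultureInfo.InvariantCulture),
                Score = double.Parse(m.Groups["score"].Value, CultureInfo.InvariantCulture)
            });
        }
        if (terms.Count == 0) return terms;

        // The largest rank is taken over the labels of this one query.
        int maxRank = terms.Max(t => t.Rank);
        foreach (LabeledTerm t in terms)
        {
            t.Kind = Classify(t.Rank, maxRank);
        }
        return terms;
    }
}
```

For the English example, this classifies "example" and "resume" as core, "of" and "retail" as normal, and "a" as optional, which matches the reading given earlier.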

How To Get Demo Package

Because of capacity limits on CodePlex, the full query term weight analysis demo package is shared on OneDrive. The [DOWNLOADS] section contains only the CRF model for labeling query term importance levels, without the language model and ranker.

Query Term Weight Analyzing Full Demo Package in Chinese: link

Console Tool

After downloading the full demo package above, you can run the query term weight analyzer console tool. The tool predicts each term's weight level and score in a given query. Its usage is as follows:

QueryTermWeightAnalyzerConsole.exe [configuration file name] <input file name> <output file name>
[configuration file name] : the name of the file that contains the configuration items for analysis
<input/output file name> : the name of the file that contains the input queries and the name of the file that receives the output. If these parameters are omitted, input and output use the console.

Here are some examples of how to use the tool:

#1. Load queries from input.txt file, analyze and save result into output.txt file
QueryTermWeightAnalyzerConsole.exe qt_analyzer.ini input.txt output.txt

#2. Load queries from console, analyze and output result to console
QueryTermWeightAnalyzerConsole.exe qt_analyzer.ini

Using Query Term Weight Analyzer API in your project

Besides using the console tool above to analyze query term weights, you can also call the APIs from your own project. The following steps show how.

1. Add QueryTermWeightAnalyzer.dll as a reference to the project and import the namespace as follows:

using QueryTermWeightAnalyzer;

2. Create a query term weight analyzer instance and initialize it. The QueryTermWeightAnalyzer.Initialize method takes a configuration file name as its input parameter. It returns true if the analyzer is initialized successfully and false otherwise. This method initializes global data structures, such as loading the model and ranker, so it needs to be run only once, even in a multi-threaded environment.

QueryTermWeightAnalyzer analyzer = new QueryTermWeightAnalyzer();
if (analyzer.Initialize(strConfigurationFileName) == false)
{
    Console.WriteLine("Initialize the analyzer failed.");
    return;
}

3. For each thread, create a separate working instance. If the CreateInstance method returns null, the instance could not be created.

Instance instance = analyzer.CreateInstance();

4. Given a query, predict the weight level and score of each term. The QueryTermWeightAnalyzer.Analyze method takes two parameters. The first is the working instance created in the previous step; since it is not thread-safe, each thread should create its own. The second is the text whose term weights the analyzer will predict. The method returns a token list containing the weight level and score of each term, or null if the analysis fails.

List<Token> tknList = analyzer.Analyze(instance, strLine);
if (tknList == null)
{
    // Term weight analysis failed.
    Console.WriteLine("Failed to analyze {0}", strLine);
    return;
}

5. Use the term weight level and score in your project. Each token contains the term string (strTerm), the term weight level (rankId), and the weight score (rankingscore). You can access them as in the example below.

foreach (Token token in tknList)
{
    Console.WriteLine("{0}\t{1}\t{2}", token.strTerm, token.rankId, token.rankingscore);
}

The steps above introduce the main APIs for predicting query term weights in your project. For other APIs, please refer to the QueryTermWeightAnalyzerConsole project source code.

How To Get Source Code

You can sync the latest source code from the [SOURCE CODE] section. In this project, the CRF model is powered by CRFSharp and the language model by LMSharp. Both are also open source projects, and you can visit their home pages for more detailed information.

Term Synonym Pairs Analyzer

This analyzer finds the best synonym for each term in a given query. It leverages users' query-click data to mine synonym pairs with their contexts, and uses a language model and some rules to determine which term is the best synonym of a given term. With these techniques, we can find the best synonym of a specified term in different contexts. So far, although the core algorithm is language independent, we have built the demo package only for Chinese.

Solution

The solution contains the following projects:
Pipeline projects : these projects are used to generate the training corpus from raw data. The pipeline runs automatically.
Pipeline\BuildQueryTermWeightCorpus : automatically builds the term importance level training corpus from the raw term weight score corpus.
Pipeline\StatTermWeightInQuery : generates the term weight score corpus from users' query-click data.
Pipeline\ConvertRawTrainingCorpusToCRFSharp : converts the raw training corpus generated by the pipeline to CRFSharp format for model encoding.

Core projects : core algorithms for the query term analyzer.
Core\QueryTermWeightAnalyzer : online logic to analyze a term's importance level according to its context.
Core\QueryTermSynonymAnalyzer : online logic to analyze a term's synonyms according to its context.

QueryTermWeightAnalyzerConsole : the console program to predict terms' importance levels in a given query.

Last edited Mar 18 at 5:19 AM by monkeyfu, version 44