

The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record; documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities, using (i) the index terms, (ii) the density value, and (iii) the grouping value of each training document. The relevance probability of each test document is then interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of the training documents are discrete.

We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback–Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
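The pointwise Kullback–Leibler divergence method mentioned above combines an informativeness score (the term's foreground probability weighted against its background probability) with a phraseness score (the phrase probability weighted against the product of its unigram probabilities). The sketch below is a minimal illustration of that commonly cited formulation, not the implementation evaluated in these experiments; the toy foreground/background token lists and the smoothing constant are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list as tuples."""
    return list(zip(*[tokens[i:] for i in range(n)]))

def pointwise_kl(fg_tokens, bg_tokens, phrase, eps=1e-9):
    """Score one candidate term (a tuple of words) by pointwise KL divergence.

    informativeness: p_fg(t) * log(p_fg(t) / p_bg(t))
    phraseness:      p_fg(t) * log(p_fg(t) / prod_i p_fg(w_i))
    The combined score is their sum; eps smooths zero counts.
    """
    n = len(phrase)
    fg_n = Counter(ngrams(fg_tokens, n))
    bg_n = Counter(ngrams(bg_tokens, n))
    fg_uni = Counter(fg_tokens)

    p_fg = fg_n[phrase] / max(len(fg_tokens) - n + 1, 1)
    p_bg = bg_n[phrase] / max(len(bg_tokens) - n + 1, 1)
    informativeness = p_fg * math.log((p_fg + eps) / (p_bg + eps))

    phraseness = 0.0
    if n > 1:  # phraseness only applies to multi-word terms
        p_unigrams = 1.0
        for w in phrase:
            p_unigrams *= fg_uni[w] / max(len(fg_tokens), 1)
        phraseness = p_fg * math.log((p_fg + eps) / (p_unigrams + eps))

    return informativeness + phraseness

# Toy usage with hypothetical foreground (domain) and background corpora.
fg = "term extraction from patient forum posts and discharge summaries".split()
bg = "general background corpus text about many unrelated everyday topics".split()
print(pointwise_kl(fg, bg, ("term", "extraction")))
```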
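The retrieval model described at the start of this section fits relevance probabilities with logistic regression over index-term frequencies plus the density and grouping heuristics, and then reads off the probability for a test document from the fitted curve. The following is a minimal sketch of that kind of fit, assuming scikit-learn and a hypothetical hand-made feature matrix; it is not the feature set or fitting procedure of the proposed model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row is one training document, represented
# by index-term frequencies followed by the density and grouping heuristics.
# Labels are the discrete relevance judgments (1 = relevant, 0 = not relevant).
X_train = np.array([
    # tf(term1), tf(term2), tf(term3), density, grouping
    [4, 0, 2, 0.81, 0.70],
    [0, 3, 0, 0.15, 0.20],
    [5, 1, 3, 0.77, 0.65],
    [1, 2, 0, 0.25, 0.10],
])
y_train = np.array([1, 0, 1, 0])

# Fit the logistic curve P(relevant | x) = 1 / (1 + exp(-(b0 + b . x))).
model = LogisticRegression()
model.fit(X_train, y_train)

# For a test document, the relevance probability is read off the fitted curve.
x_test = np.array([[3, 0, 1, 0.72, 0.60]])
print(model.predict_proba(x_test)[0, 1])  # estimated P(relevant)
```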
#White pages data extractor series
We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query expansion. The methods for term scoring that have been proposed in the literature were designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different from what they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of these term scoring methods for the collections at hand.
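One way to compare the output of several term scoring methods on a given collection is to rank the candidate terms under each method and measure the overlap between the tops of the rankings. The sketch below illustrates that kind of comparison with two deliberately simple scoring functions (raw foreground frequency and a frequency ratio against a background corpus); the scoring functions, corpora and cut-off are hypothetical and are not the methods or data used in the experiments described here.

```python
from collections import Counter

def rank_terms(score_fn, vocabulary):
    """Rank candidate terms by a scoring function, highest score first."""
    return sorted(vocabulary, key=score_fn, reverse=True)

def topk_overlap(ranking_a, ranking_b, k):
    """Fraction of shared terms in the top-k of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

# Hypothetical foreground (domain) and background (reference) token counts.
fg = Counter("patient reports severe back pain pain treatment plan".split())
bg = Counter("general news text about sports politics and weather".split())
vocab = list(fg)

freq_ranking = rank_terms(lambda t: fg[t], vocab)             # raw frequency
ratio_ranking = rank_terms(lambda t: fg[t] / (bg[t] + 1), vocab)  # fg/bg ratio

print(topk_overlap(freq_ranking, ratio_ranking, k=5))
```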

A wrapper verifier would check whether a new page from a site complies with the detected schema, in which case the extractor would use the wrapper to get instances of the schema types. If the wrapper failed to work with the new page, a new wrapper/schema would be regenerated by calling an unsupervised wrapper induction system. In this paper, a new data extractor called GenDE is proposed. It verifies the site schema and extracts data from Web pages using Conditional Random Fields (CRFs). The problem is solved by breaking down an observation sequence (a Web page) into simpler subsequences that are labeled using the CRF. Moreover, the system solves the problem of automatic data extraction from modern JavaScript sites, in which data and schema are attached (on the client side) in a JSON format. The experiments show an encouraging result, as it outperforms the CSP-based extractor algorithm (95% recall and 96% precision, respectively). Moreover, it gives a high performance result when tested on the SWDE benchmark dataset (84.91%).
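The paragraph above describes a verify-then-extract loop: check whether a new page still matches the known site schema, re-induce the wrapper when it does not, and label the page's record-like subsequences with a CRF to pull out field values. The sketch below is a hypothetical outline of that control flow, assuming the sklearn_crfsuite library for the CRF step; the schema check, feature functions, labels and training data are illustrative placeholders, not GenDE's actual implementation.

```python
import sklearn_crfsuite

def verify_schema(page_subsequences, schema_fields):
    """Placeholder wrapper verification: accept the page if every record-like
    subsequence has at least one token per schema field. A real verifier would
    compare the page structure against the stored site schema."""
    return all(len(seq) >= len(schema_fields) for seq in page_subsequences)

def induce_wrapper(sample_pages):
    """Placeholder for the unsupervised wrapper induction step that would
    regenerate the wrapper/schema when verification fails."""
    raise NotImplementedError("unsupervised wrapper induction not sketched here")

def token_features(tokens, i):
    """Toy per-token features for the CRF; the real feature set is not known here."""
    t = tokens[i]
    return {
        "lower": t.lower(),
        "is_digit": t.isdigit(),
        "is_title": t.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }

def extract_with_crf(crf, page_subsequences):
    """Label each record-like subsequence of the page with the trained CRF."""
    X = [[token_features(seq, i) for i in range(len(seq))] for seq in page_subsequences]
    return crf.predict(X)

# Training on previously labeled record subsequences (labels are illustrative).
train_seqs = [["John", "Smith", "555-1234"], ["Mary", "Jones", "555-9876"]]
train_labels = [["NAME", "NAME", "PHONE"], ["NAME", "NAME", "PHONE"]]
X_train = [[token_features(seq, i) for i in range(len(seq))] for seq in train_seqs]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, train_labels)

# New page: verify against the known schema, re-induce if needed, then extract.
page_subsequences = [["Bob", "Brown", "555-0000"]]
schema_fields = ["NAME", "PHONE"]
if not verify_schema(page_subsequences, schema_fields):
    induce_wrapper(page_subsequences)
print(extract_with_crf(crf, page_subsequences))
```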
#White pages data extractor verification
Web site schema detection and data extraction from the Deep Web have been studied extensively. However, few studies have focused on the more challenging tasks of wrapper verification and extractor generation.
