
Distilling Trustworthy Knowledge from Crowdsourced Data



Files in this item

Xie_Sihong.pdf (PDF, 2 MB)
Title: Distilling Trustworthy Knowledge from Crowdsourced Data
Author(s): Xie, Sihong
Advisor(s): Yu, Philip S
Contributor(s): Liu, Bing; Ziebart, Brian; Fan, Wei; Hu, Yuheng
Department / Program: Computer Science
Graduate Major: Computer Science
Degree Granting Institution: University of Illinois at Chicago
Degree: PhD, Doctor of Philosophy
Genre: Doctoral thesis
Subject(s): crowdsourcing; trustworthiness; data mining; machine learning
Abstract: Crowdsourcing, a technique for sourcing data from a large crowd of human workers, has become an effective, efficient, and scalable data collection paradigm in domains such as text and image tagging, spam detection, and product rating and ranking, i.e., tasks that are easier for humans than for computers. However, crowdsourced data are often noisy, incomplete, or erroneous due to the incompetence of workers, the malicious injection of false information, and other factors, leading to trustworthiness issues. In this thesis, I explore these issues in two crowdsourcing settings: 1) crowdsourcing with a panel and 2) crowdsourcing in the wild. Under the first setting, I study the problem of worker competence estimation, so as to discount less accurate workers and emphasize the input of more reliable ones. I then address label correlation in multi-label crowdsourcing and propose models that jointly infer the label correlations and more trustworthy multi-label annotations. I also propose a large-margin framework to find the best parameter space for distilling trustworthy information from crowdsourced data. The situation is quite different when sourcing information from a crowd in the wild, as in rating and ranking systems where a large number of unknown workers contribute their opinions. Here the challenges come mainly from malicious workers, and the goal is to detect and remove them. I propose a time-series pattern-mining approach to collectively detect singleton spamming attacks, which attackers widely adopt because of the significant financial incentive and the difficulty of tracing such attacks. I then study biases in crowdsourced ratings arising from sample selection and subjectivity, and propose a transfer-learning-based iterative bias correction method that is efficient in its use of human supervision. Lastly, I propose a dimension-reduction-based framework to detect irrelevant text comments crowdsourced on social media.
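
For illustration only, the sketch below shows the general flavor of the worker competence estimation problem mentioned in the abstract: a simplified, Dawid-Skene-style iterative scheme that alternates between estimating item labels from accuracy-weighted votes and re-estimating each worker's accuracy against the consensus. It is not the model developed in the thesis; the function name, the update rule, and the toy data are all hypothetical.

# Illustrative sketch only: simplified iterative worker-reliability estimation
# over redundant binary labels. Not the thesis's actual model.
from collections import defaultdict

def estimate_worker_accuracy(labels, n_iters=20):
    """labels: list of (worker_id, item_id, label) tuples with label in {0, 1}.
    Returns (worker_accuracy, item_estimates)."""
    workers = {w for w, _, _ in labels}
    items = {i for _, i, _ in labels}
    accuracy = {w: 0.8 for w in workers}          # optimistic prior on each worker
    estimate = {i: 0.5 for i in items}            # P(item's true label == 1)

    for _ in range(n_iters):
        # Step 1: re-estimate each item's label from accuracy-weighted votes.
        votes = defaultdict(lambda: [0.0, 0.0])   # item -> [weight for 0, weight for 1]
        for w, i, y in labels:
            votes[i][y] += accuracy[w]
        for i, (w0, w1) in votes.items():
            estimate[i] = w1 / (w0 + w1) if (w0 + w1) > 0 else 0.5

        # Step 2: re-estimate each worker's accuracy against the soft consensus.
        agreement = defaultdict(list)
        for w, i, y in labels:
            p1 = estimate[i]
            agreement[w].append(p1 if y == 1 else 1.0 - p1)
        for w, scores in agreement.items():
            accuracy[w] = sum(scores) / len(scores)

    return accuracy, estimate

if __name__ == "__main__":
    # Hypothetical toy data: worker "c" disagrees with the consensus on most items
    # and should end up with a lower estimated accuracy than "a" and "b".
    toy = [("a", 1, 1), ("b", 1, 1), ("c", 1, 0),
           ("a", 2, 0), ("b", 2, 0), ("c", 2, 1),
           ("a", 3, 1), ("b", 3, 1), ("c", 3, 1)]
    acc, est = estimate_worker_accuracy(toy)
    print({w: round(v, 2) for w, v in acc.items()})
    print({i: round(v, 2) for i, v in est.items()})
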
Issue Date: 2016-10-18
URI: http://hdl.handle.net/10027/21247
Rights Information: Copyright 2016 Sihong Xie
Date Available in INDIGO: 2016-10-18
Date Deposited: 2016-08
 


Statistics

Views by country:
United States of America: 84
Russian Federation: 40
China: 34
Ukraine: 9
Germany: 3
