I was born in
I came to the U.S. in 2006 to pursue graduate study in computer science and received my master's degree in August
I am interested in work related to statistical data analysis, data mining, SAS programming, and data analysis
software development. I am currently looking for full-time or internship positions in these areas.
My resume can be found [here].
Abstract: Analyzing/tracking weblogs by given communities (ATWC) is increasingly important for sociologists and government agencies, etc. This paper introduces an approach to address the needs of ATWC by using concise discriminative weblog collection representatives (DCRs), which are constructed from large collections of blogs by communities of interest. DCRs are aimed at helping users to quickly identify the major themes/trends in such collections, and to quickly identify important shifts/differences in major themes and trends of blogs by given communities over time and space. We propose to use the quality of DCR-based classifiers to measure DCRs' quality. We present algorithms for constructing DCRs, report experimental results to evaluate the efficiency of the algorithms and the quality of the DCRs they construct, and provide real-data examples to demonstrate the usefulness of DCRs for ATWC.
Abstract: Given a pair of objects, it is of interest to know how they are related to each other and the strength of their similarity. Many previous studies focused on two types of similarity measures: The first type is based on closeness of attribute values of two given objects, and the second type is based on how often the two objects co-occur in transactions/tuples. In this thesis we study a new "behavior-based" similarity measure, which evaluates similarity between two objects by considering how similar their correlated "third-party" object sets are. Behavior-based similarity can help us find pairs of objects that have similar external functions but do not have very similar attribute values or do not co-occur quite often. After introducing and formalizing behavior-based similarity, we give an algorithm to mine pairs of similar objects under this measure. We demonstrate the usefulness of our algorithm and this measure using experiments on several news and medical datasets.
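The idea of comparing two objects through their correlated "third-party" object sets can be illustrated with a minimal sketch. This is not the algorithm from the thesis: the Jaccard overlap, the co-occurrence threshold `min_cooccur`, and the function names `correlated_sets` and `behavior_similarity` are all assumptions made for illustration, and the transactions are toy data.

```python
from collections import defaultdict
from itertools import combinations


def correlated_sets(transactions, min_cooccur=1):
    """Map each object to the set of objects it co-occurs with
    in at least `min_cooccur` transactions (assumed notion of correlation)."""
    counts = defaultdict(lambda: defaultdict(int))
    for transaction in transactions:
        for a, b in combinations(sorted(set(transaction)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return {
        obj: {other for other, n in partners.items() if n >= min_cooccur}
        for obj, partners in counts.items()
    }


def behavior_similarity(a, b, correlated):
    """Jaccard overlap of the third-party sets of `a` and `b`.
    This is an assumed stand-in, not the measure defined in the thesis."""
    set_a = correlated.get(a, set()) - {a, b}
    set_b = correlated.get(b, set()) - {a, b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Toy transactions: A and B never co-occur, but both co-occur with x and y.
transactions = [["A", "x"], ["B", "x"], ["A", "y"], ["B", "y"], ["A", "z"]]
correlated = correlated_sets(transactions)
similarity = behavior_similarity("A", "B", correlated)
```

In this toy data, A and B share two of three third-party objects even though they never appear in the same transaction, which is the kind of pair a co-occurrence measure would miss.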
(1) While the dataset loads, a progress bar indicates the loading progress: (2) A progress monitor shows the process of mining similar object pairs in real time:
(3) The results are displayed in a Java table; all columns can be sorted and extended: (4) The final results are also automatically generated as an HTML file:
Projects @ Statistical Consulting Center,
1. Help the institutional research department of WSU identify important factors that affect students' retention and graduation.
2. Help the internal audit department perform various statistical analyses of school credit card usage.
1. Help non-profit organizations identify potential donors, new donors, and influential donors.
2. Help the
OLAP-style Entity Correlation Analysis on Events Data, Lexis-Nexis,
2006-2007. [Video Demo]
In this project I designed and developed tools to perform OLAP-style entity correlation analysis on events data contained in news reports. The aim of the tools is to extract interesting correlations among entities.
The source data is metadata extracted from news reports. The metadata contains a number of attributes such as "company", "organization", "ticker", "person", "city", "country", etc. Each specific event contains a number of attributes, and it contains a number of values for each of those attributes. From each event, each pair of attribute values, for two (possibly identical) attributes, is considered a correlation instance. A user can provide any specific set of events as input to this program.
The frequent correlations are computed from the given set of events. They are displayed through a user-friendly interface. Users can navigate the display to drill down and roll up correlations.
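The pair extraction and frequency counting described above can be sketched as follows. This is a hedged illustration, not the project's code: the event layout (a dict from attribute name to a list of values), the function names, the `min_count` threshold, and the sample events are all assumptions.

```python
from collections import Counter
from itertools import combinations


def correlation_instances(event):
    """Yield every pair of distinct (attribute, value) items in one event.
    The two items may come from the same attribute or different ones."""
    items = sorted(
        {(attr, value) for attr, values in event.items() for value in values}
    )
    return combinations(items, 2)


def frequent_correlations(events, min_count=2):
    """Count correlation instances over all events and keep the frequent ones."""
    counts = Counter()
    for event in events:
        counts.update(correlation_instances(event))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical events; the names are placeholders, not real data.
events = [
    {"company": ["Acme", "Globex"], "city": ["Dayton"]},
    {"company": ["Acme"], "city": ["Dayton"], "person": ["J. Doe"]},
    {"company": ["Globex"], "city": ["Columbus"]},
]
frequent = frequent_correlations(events, min_count=2)
```

Sorting the items before pairing gives each pair a single canonical order, so the same two entities are always counted under one key.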
At each level of the display, the user interface first provides a list of attributes in order to give the users a schema description of the data. When a user clicks any of the attributes, she/he will see the top-K most frequent entities for the clicked attribute. The default value for K is 100. When the user clicks any of the displayed entities, the list of attributes will again appear, allowing the user to drill down another time. This process can repeat many times, allowing the user to drill down the correlation to a number of levels. At any time, the path from the root to the current attribute value is highlighted to allow the user to see the history/context of the correlations associated with the current path.
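One way to picture the drill-down step is as a filter followed by a top-K count. The sketch below is an assumption about how such a step could work, not the tool's implementation: the function name `top_k_entities`, the representation of the path as a list of (attribute, value) pairs, and the sample events are hypothetical. Only the default K of 100 comes from the description above.

```python
from collections import Counter


def top_k_entities(events, path, attribute, k=100):
    """Return the k most frequent values of `attribute` among the events
    that contain every (attribute, value) pair on the drill-down `path`."""
    matching = [
        event
        for event in events
        if all(value in event.get(attr, []) for attr, value in path)
    ]
    counts = Counter(
        value for event in matching for value in set(event.get(attribute, []))
    )
    # Sort by descending count, then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical events; the names are placeholders, not real data.
events = [
    {"company": ["Acme", "Globex"], "city": ["Dayton"]},
    {"company": ["Acme"], "city": ["Dayton"]},
    {"company": ["Globex"], "city": ["Columbus"]},
    {"company": ["Acme"], "city": ["Columbus"]},
]

# Root level: click "company" with an empty path.
root_level = top_k_entities(events, [], "company")
# Drill down: click the entity "Acme", then the attribute "city".
drilled = top_k_entities(events, [("company", "Acme")], "city")
```

Each further click appends one (attribute, value) pair to the path, and rolling up removes the last pair, so the path itself doubles as the highlighted history.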
(1) Root Level Display for Correlation Analysis: (2) Expanded display for each attribute:
(3) A detailed level of the display: (4) A more detailed low level display:
Projects @ TechEdge, Wright Brothers Institute:
1. Open Layer Sensing Test bed project:
(1) Traffic web cameras in Dayton: (2) Traffic web cameras in Cincinnati, Dayton and Columbus:
(3) Connect to the traffic web camera located on I-75/3rd street: (4) Start to play the real-time traffic video:
2. PocketLST project:
Worked as the team leader of the Android phone development group and developed Android phone applications that could send and receive text and image messages to and from Google Maps in real time. [Technical Report] [Slides] [Video Demo]
The screenshots below show my part of the Android implementation for the PocketLST project:
Computer Science Courses:
--CS516: Survey of Computer Science Numerical Methods
--CS605: Introduction to Data Management System
--CS634: Concurrent Software Design
--CS666: Introduction to Formal Language
--CEG66: Matrix Computation
--CS680: Comparative Languages
--CS701: Database System & Design
--CS702: Advanced Computer Networks
--CEG720: Computer Architecture
--CS740: Natural Language Processing Techniques
--CS790: Advanced Data Mining
Applied Statistical Courses:
--STT611: Applied Time Series
--STT646: Statistical Methods for Engineers
--STT661: Statistical Theory I
--STT662: Statistical Theory II
--STT666: Statistical Methods I
--STT667: Statistical Methods II
--STT669: Introduction to Experimental Design
--STT740: Categorical Data Analysis
--STT761: Theory of Linear Model
--STT767: Applied Regression Analysis
My unofficial WSU transcript can be found [here].
TA courses and labs:
CS240 (lab): Programming Language I (Java)
CS241 (lab): Programming Language II (Advanced Java)
CS242 (lab): Programming Language III (C++)
STT264 (lab): Elementary Statistics
STT265 (lab): Elementary Statistics II
MTH126: Intermediate Algebra