Research
NSF Grant
Description
High-dimensional data is ubiquitous in real-world applications, from text categorization and image processing to Web search. The shortage of labeled data, caused by high labeling costs, makes it necessary to explore machine learning approaches beyond the classic classification and clustering paradigms. Semi-supervised learning is one such approach; it has demonstrated its potential for handling data with few labeled samples and for reducing the need for expensive labeled data. However, high-dimensional data with few labeled samples permits too large a hypothesis space with too few constraints (labeled instances), and the combination of these two characteristics poses a new research challenge. Employing computational and statistical learning theory, we analyze the specific challenges presented by such data, present preliminary studies, delineate the need to integrate feature selection and extraction in a novel framework that reduces the hypothesis space, propose efficient and novel algorithms, and conduct theoretical and empirical studies to understand the complex relationships between high-dimensional data and classification performance.
Publications
- Journal Articles
- Z. Zhao and H. Liu. "Multi-Source Feature Selection via Geometry-Dependent Covariance Analysis", JMLR Workshop and Conference Proceedings Volume 4: New challenges for feature selection in data mining and knowledge discovery, 4:36-47, 2008
- Z. Zhao and H. Liu. "Searching for Interacting Features in Subset Selection", Intelligent Data Analysis - An International Journal, 13:207-228, 2009.
- M. Berens, H. Liu, L. Parsons, L. Yu, and Z. Zhao. "Fostering Biological Relevance in Feature Selection for Microarray Data", Trends and Controversies, IEEE Intelligent Systems, pp. 71-73, November/December 2005. [PDF]
- H. Liu and L. Yu. "Toward Integrating Feature Selection Algorithms for Classification and Clustering", IEEE Transactions on Knowledge and Data Engineering, 17(4):491-502, 2005. [PDF]
- J. Ye, J. Chen, R. Janardan, and S. Kumar. "Developmental Stage Annotation of Drosophila Gene Expression Pattern Images via an Entire Solution Path for LDA", ACM Transactions on Knowledge Discovery from Data, Special Issue on Bioinformatics, 2(1):1-21, 2008. [PDF]
- Conferences and Workshops
- Z. Zhao, L. Wang, and H. Liu. Efficient Spectral Feature Selection with Minimum Redundancy. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI), 2010. [PDF, Supplementary]
- Z. Zhao, J. Wang, S. Sharma, N. Agarwal, H. Liu, and Y. Chang. An Integrative Approach to Identifying Biologically Relevant Genes. In Proceedings of SIAM International Conference on Data Mining (SDM), 2010. [PDF]
- Z. Zhao, J. Wang, H. Liu, and Y. Chang. Biological relevance detection via network dynamic analysis. In Proceedings of 2nd International Conference on Bioinformatics and Computational Biology (BICoB), 2010. BEST PAPER AWARD [PDF]
- J. Liu, L. Yuan, and J. Ye. An Efficient Algorithm for a Class of Fused Lasso Problems. The Sixteenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2010). [PDF]
- L. Sun, B. Ceran, and J. Ye. A Scalable Two-Stage Approach for a Class of Dimensionality Reduction Techniques. The Sixteenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2010).
- J. Chen, J. Liu, and J. Ye. Learning Incoherent Sparse and Low-Rank Patterns from Multiple Tasks. The Sixteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2010).
- H. Liu, H. Motoda, R. Setiono, and Z. Zhao. Feature Selection: An Ever Evolving Frontier in Data Mining. Journal of Machine Learning Research, Workshop and Conference Proceedings Volume 10, 10:4-13, 2010. [PDF]
- L. Sun, J. Liu, J. Chen, and J. Ye. Efficient Recovery of Jointly Sparse Vectors. The Twenty-Third Annual Conference on Neural Information Processing Systems (NIPS 2009). [PDF]
- J. Liu, S. Ji, and J. Ye. Multi-task Feature Learning via Efficient L2,1-Norm Minimization. The Twenty-fifth Conference on Uncertainty in Artificial Intelligence (UAI 2009). [PDF]
- J. Liu, J. Chen, and J. Ye. Large-Scale Sparse Logistic Regression. The Fifteenth ACM SIGKDD International Conference On Knowledge Discovery and Data Mining (SIGKDD 2009), pp. 547-556.
- L. Sun, S. Ji, and J. Ye. A Least Squares Formulation for a Class of Generalized Eigenvalue Problems in Machine Learning. The Twenty-Sixth International Conference on Machine Learning (ICML 2009). [PDF]
- S. Ji and J. Ye. Linear Dimensionality Reduction for Multi-label Classification. The Twenty-first International Joint Conference on Artificial Intelligence (IJCAI 2009). [PDF]
- Z. Zhao, J. Wang, S. Sharma, N. Agarwal, H. Liu, and Y. Chang. "A Knowledge-Oriented Framework for Gene Selection", Poster, RECOMB'09, Tucson, Arizona, May 18-21, 2009.
- Z. Zhao, L. Sun, S. Yu, H. Liu, and J. Ye. "Multiclass Probabilistic Kernel Discriminant Analysis", IJCAI'09. [PDF]
- Z. Zhao, J. Wang, H. Liu, J. Ye, and Y. Chang. "Identifying Biologically Relevant Genes via Multiple Heterogeneous Data Sources", KDD'08: 839 - 847. [PDF]
- Z. Zhao and H. Liu. "Spectral Feature Selection for Supervised and Unsupervised Learning", International Conference on Machine Learning (ICML-07), June 20-24, 2007, Corvallis, Oregon. [PDF]
- Z. Zhao and H. Liu. "Semi-supervised Feature Selection via Spectral Analysis", SIAM International Conference on Data Mining (SDM-07), April 26-28, 2007, Minneapolis, Minnesota. [PDF]
- Z. Zhao and H. Liu. "Searching for Interacting Features", The 20th International Joint Conference on AI (IJCAI-07), January 6-12, 2007, Hyderabad, India. [PDF]. Software available.
- J. Ye. Least Squares Linear Discriminant Analysis. The Twenty-Fourth International Conference on Machine Learning (ICML 2007), pp. 1087-1093. Also Technical Report TR-06-003, Department of Computer Science and Engineering, Arizona State University, March 2006. [PDF]
- Books or Chapters
- H. Liu and H. Motoda. "Feature Selection for Knowledge Discovery and Data Mining", Kluwer Academic Publishers, July 1998. ISBN 0-7923-8198-X.
- H. Liu and H. Motoda (editors). "Computational Methods of Feature Selection", Chapman and Hall/CRC Press, 2008.
- H. Liu and Z. Zhao. "Manipulating Data and Dimensionality Reduction Methods: Feature Selection", in Encyclopedia of Complexity and Systems Science, Robert Meyers (Ed.), Springer, 2009.
- H. Liu. "Feature Selection: An Overview", in Encyclopedia of Machine Learning, Claude Sammut (Ed.), Springer. Forthcoming.
- Z. Zhao and H. Liu. "On Interacting Features in Subset Selection", in Encyclopedia of Data Warehousing and Mining, 2nd Edition, Idea Group, Inc., pp. 1079-1084, September 2008.
- Technical Reports
- Z. Zhao and H. Liu. "Semi-supervised Feature Selection via Spectral Analysis", Technical Report TR-06-022, Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, 2006.
- Y. Ye, L. Yu, and H. Liu. "Sparse Linear Discriminant Analysis", Technical Report TR-06-010, Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287, 2006.
- Thesis
- Z. Zhao. Spectral Feature Selection for Mining Ultrahigh Dimensional Data [PDF]
- Resources
Related Activities
- Workshop on Feature Selection in Data Mining (FSDM 10) [link]
  The proceedings of FSDM 2010 have been published in the JMLR Workshop and Conference Proceedings.
- Tutorial at SDM10: Mining Sparse Representations: Formulations, Algorithms, and Applications
- SIAM Data Mining SDM 2007 Tutorial: Dimensionality Reduction for Data Mining - Techniques, Applications, and Trends
- AAAI 2005 Tutorial: Notes on Downsizing Data for High Performance in Learning - Feature Selection Methods. [pdf.zip]
Project Members
- Huan Liu (PI)
- Jieping Ye (Co-PI)
- Zheng Zhao (Successfully defended his PhD dissertation; joined SAS Institute)
- Salem Alelyani (Graduate, PhD)
- Lei Yuan (Graduate, PhD)
- Shashvata Sharma (Completed her master's degree; joined Microsoft)
- Fred Morstatter (Undergraduate)
- Aneeth Anand (Graduate, Master)
Acknowledgments
This project is sponsored by NSF (#0812551), 9/2008 - 8/2011.
Description of Research
Feature selection aims to choose a subset of the original features according to a selection criterion. It is an important technique widely used in pattern analysis. By removing irrelevant and redundant features, feature selection brings many benefits: more reliable parameter estimates, lower computational cost and memory usage, improved learning performance, and better comprehensibility of results [Guyo-Elis03, Liu-Moto98c]. Depending on how label information is used, feature selection algorithms can be categorized as supervised [West-etal03, Robn-Kono03], unsupervised [Dy-Brod04, He-etal05], or semi-supervised [zhao-sdm07, xu-ijcai-09]. From the perspective of selection strategy, feature selection algorithms broadly fall into three models: filter, wrapper, or embedded [Guyo-Elis03]. The filter model evaluates features without involving any learning algorithm. The wrapper model requires a learning algorithm and uses its performance to evaluate the goodness of features. Algorithms of the embedded model, e.g., C4.5 [Quin93] and LARS [efro-etal04], incorporate feature selection as part of the learning process and use the objective function of the learning model to guide the search for relevant features. In addition, feature selection algorithms may return either a subset of features [Yu-Liu03, Hall00] or weights measuring the utility of all features [Aha98, Robn-Kono03]; accordingly, they can also be categorized as subset selection algorithms or feature weighting algorithms. Feature selection has been applied in many areas, including computer vision [Dy-etal03], text mining [Geor03], and bioinformatics [Saeys2007].
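To make the filter model concrete, here is a minimal sketch in Python/NumPy (the function name and the synthetic data are illustrative, not part of this project's software): each feature is scored by its absolute Pearson correlation with the label and the top-k features are kept, without involving any learning algorithm in the evaluation.

```python
import numpy as np

def filter_select(X, y, k):
    """Filter-model feature selection: score each feature by its
    absolute Pearson correlation with the label y and return the
    indices of the top-k features."""
    # Center the data matrix and the label vector.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Correlation of each column of X with y.
    num = Xc.T @ yc
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    scores = np.abs(num / den)
    # Indices of the k highest-scoring features.
    return np.argsort(scores)[::-1][:k]

# Tiny synthetic example: features 0 and 1 track the label, feature 2 is noise.
rng = np.random.default_rng(0)
y = rng.normal(size=100)
X = np.column_stack([y + 0.01 * rng.normal(size=100),   # highly relevant
                     -y + 0.01 * rng.normal(size=100),  # relevant (negated)
                     rng.normal(size=100)])             # irrelevant
selected = filter_select(X, y, k=2)
print(sorted(selected.tolist()))  # features 0 and 1 should be chosen
```

A wrapper method would instead train a classifier on each candidate subset and compare validation accuracies; an embedded method, such as LARS or L1-regularized logistic regression, would obtain the subset directly from a sparse solution of its own objective.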
This repository collects the most popular algorithms developed in feature selection research, serving as a platform that facilitates their application, comparison, and joint study. You are encouraged to donate your algorithms and data sets to the repository.
References

[Guyo-Elis03] I. Guyon and A. Elisseeff. "An Introduction to Variable and Feature Selection", Journal of Machine Learning Research, 3:1157-1182, 2003.
[Liu-Moto98c] H. Liu and H. Motoda. "Feature Selection for Knowledge Discovery and Data Mining", Kluwer Academic Publishers, Boston, 1998.
[West-etal03] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. "Use of the Zero-Norm with Linear Models and Kernel Methods", Journal of Machine Learning Research, 3:1439-1461, 2003.
[Robn-Kono03] M. Robnik-Šikonja and I. Kononenko. "Theoretical and Empirical Analysis of ReliefF and RReliefF", Machine Learning, 53:23-69, 2003.
[Dy-Brod04] J. G. Dy and C. E. Brodley. "Feature Selection for Unsupervised Learning", Journal of Machine Learning Research, 5:845-889, 2004.
[He-etal05] X. He, D. Cai, and P. Niyogi. "Laplacian Score for Feature Selection", in Advances in Neural Information Processing Systems 18, MIT Press, 2005.
[zhao-sdm07] Z. Zhao and H. Liu. "Semi-supervised Feature Selection via Spectral Analysis", Proceedings of the SIAM International Conference on Data Mining (SDM), 2007.
[xu-ijcai-09] Z. Xu, R. Jin, J. Ye, M. R. Lyu, and I. King. "Discriminative Semi-supervised Feature Selection via Manifold Regularization", Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009.
[Quin93] J. R. Quinlan. "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
[efro-etal04] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. "Least Angle Regression", Annals of Statistics, 32:407-499, 2004.
[Yu-Liu03] L. Yu and H. Liu. "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution", Proceedings of the 20th International Conference on Machine Learning (ICML), pp. 856-863, 2003.
[Hall00] M. A. Hall. "Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning", Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 359-366, 2000.
[Aha98] D. W. Aha. "Feature Weighting for Lazy Learning Algorithms", in Feature Extraction, Construction and Selection: A Data Mining Perspective, pp. 13-32, 1998.
[Dy-etal03] J. G. Dy, C. E. Brodley, A. C. Kak, L. S. Broderick, and M. A. Aisen. "Unsupervised Feature Selection Applied to Content-Based Retrieval of Lung Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:373-378, 2003.
[Geor03] G. Forman. "An Extensive Empirical Study of Feature Selection Metrics for Text Classification", Journal of Machine Learning Research, 3:1289-1305, 2003.
[Saeys2007] Y. Saeys, I. Inza, and P. Larrañaga. "A Review of Feature Selection Techniques in Bioinformatics", Bioinformatics, 23:2507-2517, 2007.