Pattern Discovery in Combinatorial Databases: Algorithms, Applications, and Software for the Scientific Community
| Dennis Shasha, Professor | Jason T. L. Wang, Associate Professor |
| Department of Computer Science | Data and Knowledge Engineering Lab |
| Courant Institute of Mathematical Sciences | Department of Computer and Information Science |
| New York University | New Jersey Institute of Technology |
Contact Information
| Dennis Shasha, Professor | Jason T. L. Wang, Associate Professor |
| Department of Computer Science | Data and Knowledge Engineering Lab |
| Courant Institute of Mathematical Sciences | Department of Computer and Information Science |
| New York University | New Jersey Institute of Technology |
| 251 Mercer Street | University Heights |
| New York, New York 10012 | Newark, New Jersey 07102 |
| Phone: (212) 998-3086 | Phone: (973) 596-3396 |
| Fax: (212) 995-4123 | Fax: (973) 596-5777 |
| E-mail: shasha@cs.nyu.edu | E-mail: jason@cis.njit.edu |
| URL: http://cs.nyu.edu/cs/faculty/shasha | URL: http://www.cis.njit.edu/~jason |
WWW Page
http://cs.nyu.edu/cs/faculty/shasha/papers/agm.htmlSupported Students
PhD: Chia-Yo Chang, Gung-Wei Chirn, Karpjoo Jeong, Bin Li, Qicheng Ma, Peter Piatko, Steve Rozen, Xiong Wang, Peter Wyckoff
MS: Philip Johnson, Naresh Kannan, Girish Patel, Qiang Wan, Xinhuan Zheng
Project Award Information
Keywords
Scientific data mining, pattern matching and discovery, fault-tolerant parallel processing, molecular biology and chemistry, data structures and algorithms, query language and retrieval
Project Summary
Combinatorial data consisting of sequences, trees, and graphs arise in many scientific disciplines. For example, the primary structure of proteins is a sequence, whereas the tertiary structure is a graph. Comparing such data to find similarities entails the use of a "distance metric" that measures the difference between two data items. Numerous distance metrics are possible. This work consists primarily of (i) inventing efficient ways to compute known distance metrics; (ii) developing a data structure to decide which of a set of data items is "closest" (according to a given distance metric) to a new data item; (iii) techniques and software for discovering patterns with minimum or near-minimum distance to a given set of data items with respect to a given distance metric; and (iv) software to solve such discovery problems on networks of occasionally idle workstations. This work will help every field in which approximate matching is important. Significant applications are expected to molecular biology and rational drug design, but also to finding patterns in linguistic strings.
Publications and Products
J. T. L. Wang, B. A. Shapiro and D. Shasha, editors, Pattern Discovery in Biomolecular Data: Tools, Techniques and Applications, Oxford University Press, New York, 1999, ISBN 0-19-511940-1.
X. Wang and J. T. L. Wang, "Fast Similarity Search in Three-Dimensional Structure Databases," Journal of Chemical Information and Computer Sciences, in press.
X. Wang, J. T. L. Wang, K.-I. Lin, D. Shasha, B. A. Shapiro and K. Zhang, "An Index Structure for Data Mining and Clustering," Knowledge and Information Systems, in press.
J. T. L. Wang, S. Rozen, B. A. Shapiro, D. Shasha, Z. Wang and M. Yin, "New Techniques for DNA Sequence Classification," Journal of Computational Biology, Vol. 6, No. 2, 1999, pp. 209-218.
J. T. L. Wang, B. A. Shapiro, D. Shasha, K. Zhang and K. M. Currey, "An Algorithm for Finding the Largest Approximately Common Substructures of Two Trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 8, August 1998, pp. 889-895.
J. T. L. Wang, T. G. Marr, D. Shasha, B. A. Shapiro, G.-W. Chirn and T. Y. Lee, "Complementary Classification Approaches for Protein Sequences," Protein Engineering, Vol. 9, No. 5, May 1996, pp. 381-386.
K. Zhang, J. T. L. Wang and D. Shasha, "On the Editing Distance between Undirected Acyclic Graphs," International Journal of Foundations of Computer Science, Special Issue on Computational Biology, Vol. 7, No. 1, March 1996, pp. 43-57.
J. T. L. Wang, X. Wang, D. Shasha, B. A. Shapiro, K. Zhang, Q. Ma and Z. Weinberg, "An Approximate Search Engine for Structural Databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, May 2000.
J. T. L. Wang, X. Wang, K.-I. Lin, D. Shasha, B. A. Shapiro and K. Zhang, "Evaluating a Class of Distance-Mapping Algorithms for Data Mining and Clustering," Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, August 1999, pp. 307-311.
Y. Yang, K. Zhang, X. Wang, J. T. L. Wang and D. Shasha, "An Approximate Oracle for Distance in Metric Spaces," Combinatorial Pattern Matching, (ed. M. Farach-Colton), Lecture Notes in Computer Science, Springer-Verlag, 1998, Vol. 1448, pp. 104-117.
X. Wang, J. T. L. Wang, D. Shasha, B. A. Shapiro, S. Dikshitulu, I. Rigoutsos and K. Zhang, "Automated Discovery of Active Motifs in Three Dimensional Molecules," Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, Newport Beach, California, August 1997, pp. 89-95.
K. Jeong, D. Shasha, S. Talla, and P. Wyckoff, "An Approach to Fault-Tolerant Parallel Processing on Intermittently Idle, Heterogeneous Workstations," Proceedings of the 27th International Symposium on Fault Tolerant Computing, Seattle, Washington, June 1997, pp. 11-20.
J. T. L. Wang, D. Shasha, G. J. S. Chang, L. Relihan and K. Zhang, "Structural Matching and Discovery in Document Databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997, pp. 560-563.
J. T. L. Wang, B. A. Shapiro, D. Shasha, K. Zhang and C.-Y. Chang, "Automated Discovery of Active Motifs in Multiple RNA Secondary Structures," Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, August 1996, pp. 70-75.
Project Impact
Five Ph.D. students have graduated with Dennis Shasha as advisor (in order of completion): Steve Rozen (now a research scientist in the Whitehead Institute for Biomedical Research, MIT), Karpjoo Jeong (now an assistant professor at Konkuk University, Korea), Bin Li (now a vice-president at Citibank), Peter Wyckoff (a research scientist at Telcordia Research), and Peter Piatko (also a research scientist at Telcordia Research). Jason Wang has graduated three Ph.D. students, Gung-Wei Chirn (now a research scientist at Novartis Pharmaceuticals Corporation), Chia-Yo Chang (a senior systems specialist at Prudential) and Xiong Wang (now a faculty member at NJIT), and has advised seven master's theses by Jennifer Cerequas, Wei-Jen Chuang, Philip Johnson, Tom Tien-Hua Shih, Girish Patel and Joyce Ye Lu, Xinhuan Zheng, respectively.
Our tree matching algorithms have seen actual use in the Unisys Lab for comparing parse trees in natural language processing. Our graph matching algorithms are being tested in the drug database of Novartis Pharmaceuticals Corporation for flexible chemical searching. Jack Collins of the Advanced Biomedical Computing Center of the National Cancer Institute is interested in problems related to drug discovery, such as small molecule - protein docking, which is greatly enhanced by preconditioned databases, and optimizing combinatorial libraries.
Goals, Objectives, and Targeted Activities
Our long-term goal is to provide scientists with powerful, testable, efficient, usable, and versatile tools to discover structural patterns in a variety of combinatorial data sets and test the patterns as classifiers. Our main applications and users so far have been in biology and medicine, but we are currently exploring possible applications in chemistry and structural document comparison. Specifically, in the coming year, we will focus on the design and implementation of algorithms for approximate graph matching and mining as well as their applications to managing protein and drug molecules.
Project References
We have established three web sites for the project. Descriptions of the
Area Background
Our project is concerned with data mining and pattern matching. Most of our work concentrates on structural patterns in sequences, trees, and graphs. Typical questions are "Given a set of the combinatorial data, which patterns occur frequently (albeit, perhaps approximately) in the set?" We have developed efficient data structures and algorithms to answer these questions.
Area References
The National Cancer Institute 3D Structure Database, http://dtp.nci.nih.gov/docs/3d_database/dis3d.html
The Protein Data Bank, http://www.rcsb.org/pdb/
The Structure Group at the National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/Structure
TreeBase, http://herbaria.harvard.edu/treebase/
Potential Related Projects
Related projects on scientific databases and molecular biology conducted in Wisconsin, Washington, Pennsylvania, RPI, Michigan State, etc. within the NSF IDM program can be found at the web site http://www.interact.nsf.gov/cise/descriptions.nsf/PD/idm.