Discovering Networks of Interdependent Features in High-Dimensional Problems
PBN-AR
Institution
Instytut Podstaw Informatyki Polskiej Akademii Nauk
Book
Book title
Big Data Analysis: New Algorithms for a New Society
Publication date
2016
ISBN
978-3-319-26987-0
Publisher
Springer
Publication
Main language of publication
English
Chapter title
Discovering Networks of Interdependent Features in High-Dimensional Problems
Year of publication
2016
Pages (from-to)
285-304
Chapter number
12
DOI identifier
Number of publisher's sheets
1.35
Encyclopedia entry
Keywords
English
MCFS-ID
ROSETTA
Ciruvis
High-dimensional problems
Gene expression data
Abstracts
Language
English
Content
The availability of very large data sets in Life Sciences, provided earlier by technological breakthroughs such as microarrays and more recently by various forms of sequencing, has created both challenges in analyzing these data and new opportunities. A promising, yet underdeveloped approach to Big Data, not limited to Life Sciences, is the use of feature selection and classification to discover interdependent features. Traditionally, classifiers have been developed for the best quality of supervised classification. In our experience, more often than not, rather than obtaining the best possible supervised classifier, the Life Scientist needs to know which features contribute best to classifying observations (objects, samples) into distinct classes and what the interdependencies are between the features that describe the observations. Our underlying hypothesis is that the interdependent features and rule networks not only reflect some syntactical properties of the data and classifiers but may also convey meaningful clues about true interactions in the modeled biological system. In this chapter we develop further our method of Monte Carlo Feature Selection and Interdependency Discovery (MCFS and MCFS-ID, respectively), which are particularly well suited for high-dimensional problems, i.e., those where each observation is described by very many features, often many more features than the number of observations. Such problems are abundant in Life Science applications. Specifically, we define Inter-Dependency Graphs (termed, somewhat confusingly, ID Graphs) that are directed graphs of interactions between features extracted by aggregation of information from the classification trees constructed by the MCFS algorithm. We then proceed with modeling interactions on a finer level with rule networks. We discuss some of the properties of the ID graphs and make a first attempt at validating our hypothesis on a large gene expression data set for CD4+ T-cells.
MCFS-ID and ROSETTA, including the Ciruvis approach, offer a new methodology for analyzing Big Data, from feature selection, through identification of feature interdependencies, to classification with rules according to decision classes, to construction of rule networks. Our preliminary results confirm that MCFS-ID is applicable to the identification of interacting features that are functionally relevant, while rule networks offer a complementary picture with finer resolution of the interdependencies on the level of feature-value pairs.
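As a rough illustration of the idea of building a directed graph of feature interactions by aggregating information from many classification trees, the sketch below counts how often one feature's split sits directly above another's. It is not the chapter's algorithm: the tuple-based tree representation, the function names `split_edges` and `id_graph`, the toy gene names, and the use of plain counts as edge weights are all assumptions made for this example, since the abstract does not specify how edges are weighted.

```python
from collections import Counter

# Hypothetical toy representation (an assumption for this sketch):
# a tree is either a class label (a str leaf) or a tuple
# (feature_name, left_subtree, right_subtree).

def split_edges(tree):
    """Yield (parent_feature, child_feature) for each directly nested pair of splits."""
    if not isinstance(tree, tuple):
        return  # a leaf contributes no edges
    feature, left, right = tree
    for child in (left, right):
        if isinstance(child, tuple):
            yield (feature, child[0])
            yield from split_edges(child)

def id_graph(trees):
    """Aggregate directed edge counts over a collection of trees.

    Plain counts stand in for whatever edge weighting the actual method uses.
    """
    edges = Counter()
    for tree in trees:
        edges.update(split_edges(tree))
    return edges

# Three made-up trees over made-up features.
trees = [
    ("geneA", ("geneB", "c1", "c2"), "c2"),
    ("geneA", ("geneB", "c1", "c2"), ("geneC", "c1", "c2")),
    ("geneB", "c1", ("geneC", "c2", "c1")),
]

graph = id_graph(trees)
for (src, dst), count in graph.most_common():
    print(f"{src} -> {dst}: {count}")
```

In this toy setting, an edge that recurs across many trees (here `geneA -> geneB`) would be the kind of candidate interdependency that the abstract describes; in practice the trees would come from classifiers trained on many random subsets of features and samples.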
Other
System-identifier
PX-57ac4232c2dce0b031a61956
Citations
Number of works citing this work
No data
References
Number of works cited by this work
No data