J/A+A/645/A87 Star-galaxy classification feature importance (Baqui+, 2021)
The miniJPAS survey: star-galaxy classification using machine learning.
Baqui P.O., Marra V., Casarini L., Angulo R., Diaz-Garcia L.A.,
Hernandez-Monteagudo C., Lopes P.A.A., Lopez-Sanjuan C., Muniesa D.,
Placco V.M., Quartin M., Queiroz C., Sobral D., Solano E., Tempel E.,
Varela J., Vilchez J.M., Abramo R., Alcaniz J., Benitez N., Bonoli S.,
Carneiro S., Cenarro A.J., Cristobal-Hornillos D., de Amorim A.L.,
de Oliveira C.M., Dupke R., Ederoclite A., Gonzalez Delgado R.M.,
Marin-Franch A., Moles M., Vazquez Ramio H., Sodre L., Taylor K.
<Astron. Astrophys. 645, A87 (2021)>
=2021A&A...645A..87B        (SIMBAD/NED BibCode)
ADC_Keywords: Galaxies, photometry ; Galaxy catalogs ; Photometry, narrow-band
Keywords: methods: data analysis - catalogs - galaxies: statistics -
stars: statistics
Abstract:
Future astrophysical surveys such as J-PAS will produce very large
datasets, the so-called "big data", which will require the deployment
of accurate and efficient machine-learning (ML) methods. In this work,
we analyze the miniJPAS survey, which observed about 1deg^2^ of the
AEGIS field with 56 narrow-band filters and 4 ugri broad-band filters.
The miniJPAS primary catalog contains approximately 64 000 objects in
the r detection band (magAB≲24), with forced-photometry in all
other filters.
We discuss the classification of miniJPAS sources into extended
(galaxies) and point-like (e.g., stars) objects, which is a step
required for the subsequent scientific analyses. We aim at developing
an ML classifier that is complementary to traditional tools that are
based on explicit modeling. In particular, our goal is to release a
value-added catalog with our best classification.
In order to train and test our classifiers, we cross-matched the
miniJPAS dataset with SDSS and HSC-SSP data, whose classification is
trustworthy within the intervals 15≤r≤20 and 18.5≤r≤23.5,
respectively. We trained and tested six different ML algorithms on the
two cross-matched catalogs: K-nearest neighbors, decision trees,
random forest (RF), artificial neural networks, extremely randomized
trees (ERT), and an ensemble classifier. This last is a hybrid
algorithm that combines artificial neural networks and RF with the
J-PAS stellar and galactic loci classifier. As input for the ML
algorithms we used the magnitudes from the 60 filters together with
their errors, with and without the morphological parameters. We also
used the mean point spread function in the r detection band for each
pointing.
We find that the RF and ERT algorithms perform best in all scenarios.
When the full magnitude range of 15≤r≤23.5 is analyzed, we find an
area under the curve AUC=0.957 with RF when photometric information
alone is used, and AUC=0.986 with ERT when photometric and
morphological information is used together. When morphological
parameters are used, the full width at half maximum is the most
important feature. When photometric information is used alone, we
observe that broad bands are not necessarily more important than
narrow bands, and errors (the width of the distribution) are as
important as the measurements (central value of the distribution). In
other words, it is apparently important to fully characterize the
measurement.
ML algorithms can compete with traditional star and galaxy
classifiers; they outperform the latter at fainter magnitudes (r≳21).
We use our best classifiers, with and without morphology, in order to
produce a value-added catalog.
Description:
Relative importance of the features used for the classification of
sources into extended (galaxies) and point-like (e.g. stars) objects.
Table 2a (first and second columns of Table 2 in the paper) gives the
feature importance for the analysis that uses only photometric bands
while Table 2b (third and fourth columns of Table 2 in the paper)
gives the importance for the analysis that also uses morphological
information. Both analyses refer to the cross-matched catalog
XMATCH.
File Summary:
--------------------------------------------------------------------------------
 FileName      Lrecl  Records   Explanations
--------------------------------------------------------------------------------
ReadMe            80        .   This file
table2a.dat       39      121   Feature importance using only photometry
table2b.dat       39      125   Importance using photometry and morphology
--------------------------------------------------------------------------------
Byte-by-byte Description of file: table2a.dat table2b.dat
--------------------------------------------------------------------------------
   Bytes Format Units   Label      Explanations
--------------------------------------------------------------------------------
   1- 17  A17   ---     Feature    Name of feature
  32- 39  F8.6  ---     Importance Relative importance
--------------------------------------------------------------------------------
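Reading example:
    The byte-by-byte description above can be turned into a small
    fixed-width parser. The following is a minimal sketch in Python, not
    part of the original catalog. The function names and the sample
    feature names are hypothetical and serve only to illustrate the
    record layout; the actual labels are those found in table2a.dat and
    table2b.dat.

```python
from typing import Iterable, List, Tuple


def make_record(feature: str, importance: float) -> str:
    """Build one 39-byte record (hypothetical helper, for illustration only)."""
    # Bytes 1-17: feature name (A17); bytes 18-31: blank; bytes 32-39: F8.6.
    return f"{feature:<17}{'':14}{importance:8.6f}"


def parse_feature_table(lines: Iterable[str]) -> List[Tuple[str, float]]:
    """Parse records that follow the byte-by-byte description above."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        feature = line[0:17].strip()      # bytes 1-17, format A17
        importance = float(line[31:39])   # bytes 32-39, format F8.6
        records.append((feature, importance))
    return records


# Hypothetical sample records; these names and values are NOT taken
# from the catalog.
sample = [
    make_record("example_band_a", 0.012345),
    make_record("example_morph", 0.250000),
    make_record("example_band_b", 0.003000),
]

parsed = parse_feature_table(sample)

# Rank features from most to least important.
ranked = sorted(parsed, key=lambda rec: rec[1], reverse=True)
for name, value in ranked:
    print(f"{name:<17} {value:.6f}")
```

    To read the real files, the same function can be given an open file
    handle, for example parse_feature_table(open("table2a.dat")),
    assuming the file is in the current directory.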
Acknowledgements:
Valerio Marra, valerio.marra(at)me.com
(End) Valerio Marra [UFES], Patricia Vannier [CDS] 12-Nov-2020