J/A+A/633/A154 HDBSCAN star, galaxy, QSO classification (Logan+, 2020)
Unsupervised star, galaxy, QSO classification. Application of HDBSCAN.
Logan C.H.A., Fotopoulou S.
<Astron. Astrophys. 633, A154 (2020)>
=2020A&A...633A.154L (SIMBAD/NED BibCode)
ADC_Keywords: Surveys ; MK spectral classification ; Redshifts ; Photometry
Keywords: stars: general - galaxies: general - galaxies: active -
methods: data analysis - surveys
Abstract:
Classification will be an important first step for upcoming surveys
that will detect billions of new sources such as LSST and Euclid, as
well as DESI, 4MOST and MOONS. The application of traditional methods
of model fitting and colour-colour selections will face significant
computational constraints, while machine-learning (ML) methods offer a
viable approach to tackle datasets of that volume. While supervised
learning methods can perform very well for classification tasks, the
creation of representative and accurate training sets is a task that
consumes both resources and time. We present a viable alternative using an
unsupervised ML method to separate stars, galaxies and QSOs using
photometric data. The heart of our work uses HDBSCAN to find the star,
galaxy and QSO clusters in a multidimensional colour space. We
optimized the hyperparameters and input attributes of three separate
HDBSCAN runs, each to select a particular object class, and thus treat
the output of each separate run as a binary classifier. We
subsequently consolidate the output to give our final classifications,
optimized on their F1 scores. We explore the use of Random Forest and
PCA as part of the pre-processing stage for feature selection and
dimensionality reduction. Using our dataset of ∼50000
spectroscopically labelled objects we obtain an F1 score of 98.9, 98.9
and 93.13 respectively for star, galaxy and QSO selection using our
unsupervised learning method. We find that careful attribute selection
is a vital part of accurate classification with HDBSCAN. We applied
our classification to a subset of the SDSS spectroscopic catalogue and
demonstrate the potential of our approach in correcting misclassified
spectra useful for DESI and 4MOST. Finally, we create a
multiwavelength catalogue of 2.7 million sources using the KiDS,
VIKING and ALLWISE surveys and publish corresponding classifications
and photometric redshifts.
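
The abstract states that each HDBSCAN run is treated as a binary classifier
and that the final classifications are optimized on their F1 scores. As a
reminder of what that metric measures, the following is a minimal Python
sketch of the standard F1 definition (harmonic mean of precision and
recall). The function name and the example counts are illustrative
assumptions and are not taken from the paper. The values quoted in the
abstract (e.g. 98.9) appear to be on a 0-100 scale, whereas this sketch
returns a fraction between 0 and 1.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 score of one binary classifier.

    tp, fp, fn are the numbers of true positives, false positives and
    false negatives for the class being selected (e.g. "star vs. rest").
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the paper).
example = f1_score(tp=90, fp=10, fn=10)
```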
Description:
Photometric data and classifications of sources and other outputs from
the star, galaxy, QSO classification method presented in the paper. We
note that the column descriptions are also given in the Appendix of the
paper, in some cases with references to the sections that detail the
output of particular columns.
File Summary:
--------------------------------------------------------------------------------
 FileName      Lrecl  Records   Explanations
--------------------------------------------------------------------------------
ReadMe            80        .   This file
cpz.dat          692    48686   CPz catalogue with object classifications
klabels.dat      535  2728329   KiDSVW catalogue with object classifications
--------------------------------------------------------------------------------
See also:
J/A+A/619/A14 : Classification-aided zph estimation (Fotopoulou+, 2018)
Byte-by-byte Description of file: cpz.dat
--------------------------------------------------------------------------------
Bytes Format Units Label Explanations
--------------------------------------------------------------------------------
1- 24 A24 --- id Spectroscopic redshift ID (G1)
26- 34 F9.5 deg RAdeg Spectroscopic redshift
right ascension (J2000) (G1)
36- 44 F9.5 deg DEdeg Spectroscopic redshift
declination (J2000) (G1)
46- 54 F9.6 --- z Spectroscopic redshift value (G1)
56 I1 --- Hclass [0/3] Spectroscopic redshift
classification
(0=star, 1=galaxy, 3=QSO) (2)
58- 64 F7.4 mag umag u band total magnitude
66- 74 F9.4 mag e_umag u band total magnitude error
76- 82 F7.4 mag gmag g band total magnitude
84- 93 F10.4 mag e_gmag ? g band total magnitude error
95-101 F7.4 mag rmag r band total magnitude
103-113 F11.4 mag e_rmag r band total magnitude error
115-121 F7.4 mag imag i band total magnitude
123-133 F11.4 mag e_imag i band total magnitude error
135-141 F7.4 mag zmag z band total magnitude
143-153 F11.4 mag e_zmag z band total magnitude error
155-161 F7.4 mag Ymag Y band total magnitude
163-168 F6.4 mag e_Ymag Y band total magnitude error
170-176 F7.4 mag Jmag J band total magnitude
178-183 F6.4 mag e_Jmag J band total magnitude error
185-191 F7.4 mag Hmag H band total magnitude
193-198 F6.4 mag e_Hmag H band total magnitude error
200-206 F7.4 mag Kmag K band total magnitude
208-213 F6.4 mag e_Kmag K band total magnitude error
215-221 F7.4 mag W1mag W1 band total magnitude
223-228 F6.4 mag e_W1mag W1 band total magnitude error
230-236 F7.4 mag W2mag W2 band total magnitude
238-243 F6.4 mag e_W2mag W2 band total magnitude error
245-251 F7.4 mag u3mag u band 3 arcsecond magnitude
253-262 F10.4 mag e_u3mag ? u band 3 arcsecond magnitude error
264-270 F7.4 mag g3mag g band 3 arcsecond magnitude
272-278 F7.4 mag e_g3mag g band 3 arcsecond magnitude error
280-286 F7.4 mag r3mag r band 3 arcsecond magnitude
288-293 F6.4 mag e_r3mag r band 3 arcsecond magnitude error
295-301 F7.4 mag i3mag i band 3 arcsecond magnitude
303-308 F6.4 mag e_i3mag i band 3 arcsecond magnitude error
310-316 F7.4 mag Z3mag z band 3 arcsecond magnitude
318-323 F6.4 mag e_Z3mag z band 3 arcsecond magnitude error
325-331 F7.4 mag Y3mag Y band 3 arcsecond magnitude
333-338 F6.4 mag e_Y3mag Y band 3 arcsecond magnitude error
340-346 F7.4 mag J3mag J band 3 arcsecond magnitude
348-353 F6.4 mag e_J3mag J band 3 arcsecond magnitude error
355-361 F7.4 mag H3mag H band 3 arcsecond magnitude
363-369 F7.4 mag e_H3mag H band 3 arcsecond magnitude error
371-377 F7.4 mag K3mag K band 3 arcsecond magnitude
379-384 F6.4 mag e_K3mag K band 3 arcsecond magnitude error
386-397 F12.5 arcsec Yhlr ?=-99 Y band half light radius (HLR)
399-410 F12.5 arcsec Jhlr ?=-99 J band half light radius (HLR)
412-423 F12.5 arcsec Hhlr ?=-99 H band half light radius (HLR)
425-436 F12.5 arcsec Khlr ?=-99 K band half light radius (HLR)
438-445 F8.5 --- PCAs1c PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcastar1_colours)
447-454 F8.5 --- PCAs2c PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcastar2_colours)
456-463 F8.5 --- PCAs3c PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcastar3_colours)
465-472 F8.5 --- PCAg1c PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcagal1_colours)
474-481 F8.5 --- PCAg2c PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcagal2_colours)
483-490 F8.5 --- PCAg3c PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcagal3_colours)
492-499 F8.5 --- PCAq1c PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcaqso1_colours)
501-508 F8.5 --- PCAq2c PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcaqso2_colours)
510-517 F8.5 --- PCAq3c PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup,
colours used (pcaqso3_colours)
519 I1 --- ClasscO [0/3] Optimal consolidation method
classification
(0=outlier, 1=star, 2=gal, 3=QSO), colours
used (hdbscanclassoptimalmethod_colours)
521 I1 --- CassscA [0/3] Alternative consolidation method
classification
(0=outlier, 1=star, 2=gal, 3=QSO),
colours used
(hdbscanclassalternativemethod_colours)
523 I1 --- dpc [0/1] Doubly positively classified objects,
colours used (doublepositivescolours)
525-529 F5.3 --- Poutc Outlier probability, colours used
(outlierprobabilitycolours)
531-535 F5.3 --- Pstarc Star probability, colours used
(starprobabilitycolours)
537-541 F5.3 --- Pgalc Galaxy probability, colours used
(galprobabilitycolours)
543-547 F5.3 --- Pqsoc QSO probability, colours used
(QSOprobabilitycolours)
549 I1 --- Labelc [0/3] Final labels, 'highest probability'
method (highestprobabilitylabels)
551-559 F9.5 --- PCAs1cHLR ?=-99 PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcastar1_colours+HLR)
561-569 F9.5 --- PCAs2cHLR ?=-99 PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcastar2_colours+HLR)
571-579 F9.5 --- PCAs3cHLR ?=-99 PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcastar3_colours+HLR)
581-589 F9.5 --- PCAg1cHLR ?=-99 PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcagal1_colours+HLR)
591-599 F9.5 --- PCAg2cHLR ?=-99 PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcagal2_colours+HLR)
601-609 F9.5 --- PCAg3cHLR ?=-99 PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcagal3_colours+HLR)
611-619 F9.5 --- PCAq1cHLR ?=-99 PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcaqso1_colours+HLR)
621-629 F9.5 --- PCAq2cHLR ?=-99 PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcaqso2_colours+HLR)
631-639 F9.5 --- PCAq3cHLR ?=-99 PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup,
colours & HLR used (pcaqso3_colours+HLR)
641-643 I3 --- ClasscHLRO [0/3]?=-99 Optimal consolidation method
classification
(0=outlier, 1=star, 2=gal, 3=QSO),
colours & HLR used
(hdbscanclassoptimalmethod_colours+HLR)
645-647 I3 --- ClasscHLRA [0/3]?=-99 Alternative consolidation method
classification
(0=outlier, 1=star, 2=gal, 3=QSO),
colours & HLR used
(hdbscanclassalternativemethod_colours+HLR)
649-651 I3 --- dpcHLR [0/1]?=-99 Doubly positively classified
objects, colours & HLR used
(doublepositivescolours+HLR)
653-660 F8.3 --- PoutcHLR ?=-99 Outlier probability, colours & HLR
used (outlierprobabilitycolours+HLRs)
662-669 F8.3 --- PstarcHLR ?=-99 Star probability, colours & HLR
used (starprobabilitycolours+HLRs)
671-678 F8.3 --- PgalcHLR ?=-99 Galaxy probability, colours & HLR
used (galprobabilitycolours+HLRs)
680-687 F8.3 --- PqsocHLR ?=-99 QSO probability, colours & HLR used
(QSOprobabilitycolours+HLRs)
690-692 I3 --- LabelcHLR [0/3]?=-99 Final labels for when the
'highest probability' method is used,
colours & HLR used
(highestprobabilitylabels_colours+HLRs)
--------------------------------------------------------------------------------
Note (2): same as column 5 in the CPz catalogue presented in Fotopoulou and
Paltani, 2018, Cat. J/A+A/619/A14, with the change as described in Sect.2.2
in the paper for AGN and UNKNOWN.
--------------------------------------------------------------------------------
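
The byte ranges in the table above are 1-based and inclusive, so a record of
cpz.dat can be sliced directly. The following is a minimal Python sketch,
using only the standard library, that reads a handful of the columns listed
above and maps the documented -99 null value ("?=-99") to None. The names
CPZ_FIELDS and parse_record are hypothetical helpers introduced for
illustration; only a subset of the columns is included, and the remaining
ones would follow the same pattern.

```python
# (label, first byte, last byte, converter, null value or None)
# Byte ranges are copied from the cpz.dat description; only a few columns
# are listed to keep the sketch short.
CPZ_FIELDS = [
    ("id",     1,   24,  str,   None),
    ("RAdeg",  26,  34,  float, None),
    ("DEdeg",  36,  44,  float, None),
    ("z",      46,  54,  float, None),
    ("Hclass", 56,  56,  int,   None),
    ("Yhlr",   386, 397, float, -99),   # flagged "?=-99" in the table
]


def parse_record(line, fields=CPZ_FIELDS):
    """Slice one fixed-width record into a dict of typed values."""
    record = {}
    for label, first, last, convert, null_value in fields:
        # Bytes are 1-based and inclusive; Python slices are 0-based, end-exclusive.
        raw = line[first - 1:last].strip()
        if raw == "":
            record[label] = None          # blank field: no value given
            continue
        value = convert(raw)
        if null_value is not None and value == null_value:
            value = None                  # documented null sentinel
        record[label] = value
    return record
```

Typical use would be `rows = [parse_record(l) for l in open("cpz.dat")]`.
Readers who use astropy should also be able to read the file through its CDS
reader by pointing it at this ReadMe, which avoids copying byte ranges by
hand.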
Byte-by-byte Description of file: klabels.dat
--------------------------------------------------------------------------------
Bytes Format Units Label Explanations
--------------------------------------------------------------------------------
1- 8 I8 --- KiDSu KiDSVW identifier (serialidkidsu_dr4v3)
10- 18 F9.5 deg RAdeg Spectroscopic redshift right ascension (J2000)
20- 28 F9.5 deg DEdeg Spectroscopic redshift declination (J2000)
30- 35 A6 --- zClass14 Spectroscopic redshift classification
37- 45 F9.6 --- zSDSS14 ? Spectroscopic redshift value from SDSS DR14
47- 55 F9.6 --- zSDSSQ14 ? Spectroscopic redshift value from DR14Q (1)
57- 80 A24 --- id Spectroscopic redshift ID (G1)
82- 88 F7.4 mag umag u band total magnitude
90- 96 F7.4 mag e_umag u band total magnitude error
98-104 F7.4 mag gmag g band total magnitude
106-112 F7.4 mag e_gmag g band total magnitude error
114-120 F7.4 mag rmag r band total magnitude
122-128 F7.4 mag e_rmag r band total magnitude error
130-136 F7.4 mag imag i band total magnitude
138-143 F6.4 mag e_imag i band total magnitude error
145-151 F7.4 mag zmag z band total magnitude
153-161 F9.4 mag e_zmag z band total magnitude error
163-169 F7.4 mag Ymag Y band total magnitude
171-178 F8.4 mag e_Ymag Y band total magnitude error
180-186 F7.4 mag Jmag J band total magnitude
188-195 F8.4 mag e_Jmag J band total magnitude error
197-203 F7.4 mag Hmag H band total magnitude
205-213 F9.4 mag e_Hmag H band total magnitude error
215-221 F7.4 mag Kmag K band total magnitude
223-230 F8.4 mag e_Kmag K band total magnitude error
232-238 F7.4 mag W1mag W1 band total magnitude
240-245 F6.4 mag e_W1mag W1 band total magnitude error
247-253 F7.4 mag W2mag W2 band total magnitude
255-260 F6.4 mag e_W2mag W2 band total magnitude error
262-268 F7.4 mag u3mag u band 3 arcsecond magnitude
270-278 F9.4 mag e_u3mag ? u band 3 arcsecond magnitude error
280-286 F7.4 mag g3mag g band 3 arcsecond magnitude
288-293 F6.4 mag e_g3mag g band 3 arcsecond magnitude error
295-301 F7.4 mag r3mag r band 3 arcsecond magnitude
303-309 F7.4 mag e_r3mag r band 3 arcsecond magnitude error
311-317 F7.4 mag i3mag i band 3 arcsecond magnitude
319-324 F6.4 mag e_i3mag i band 3 arcsecond magnitude error
326-332 F7.4 mag z3mag z band 3 arcsecond magnitude
334-339 F6.4 mag e_z3mag z band 3 arcsecond magnitude error
341-347 F7.4 mag Y3mag Y band 3 arcsecond magnitude
349-354 F6.4 mag e_Y3mag Y band 3 arcsecond magnitude error
356-362 F7.4 mag J3mag J band 3 arcsecond magnitude
364-369 F6.4 mag e_J3mag J band 3 arcsecond magnitude error
371-377 F7.4 mag H3mag H band 3 arcsecond magnitude
379-385 F7.4 mag e_H3mag H band 3 arcsecond magnitude error
387-393 F7.4 mag K3mag K band 3 arcsecond magnitude
395-400 F6.4 mag e_K3mag K band 3 arcsecond magnitude error
402-409 F8.5 --- PCAs1 PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup
(pcastar1)
411-418 F8.5 --- PCAs2 PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup
(pcastar2)
420-427 F8.5 --- PCAs3 PCA components, STAR HDBSCAN binary
classifiers, 'optimal' method setup
(pcastar3)
429-436 F8.5 --- PCAg1 PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup
(pcagal1)
438-445 F8.5 --- PCAg2 PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup
(pcagal2)
447-454 F8.5 --- PCAg3 PCA components, GAL HDBSCAN binary
classifiers, 'optimal' method setup
(pcagal3)
456-463 F8.5 --- PCAq1 PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup
(pcaqso1)
465-472 F8.5 --- PCAq2 PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup
(pcaqso2)
474-481 F8.5 --- PCAq3 PCA components, QSO HDBSCAN binary
classifiers, 'optimal' method setup
(pcaqso3)
483 I1 --- ClassO [0/3] Optimal consolidation method
classification
(0=outlier, 1=star, 2=gal, 3=QSO), colours
used (hdbscanclassoptimalmethod)
485 I1 --- CasssA [0/3] Alternative consolidation method
classification
(0=outlier, 1=star, 2=gal, 3=QSO), colours
used (hdbscanclassalternativemethod)
487 I1 --- dpc [0/3] Doubly positively classified objects,
colours used (double_positives)
489-493 F5.3 --- Poutc Outlier probability, colours used
(outlier_probability)
495-499 F5.3 --- Pstarc Star probability, colours used
(star_probability)
501-505 F5.3 --- Pgalc Galaxy probability, colours used
(gal_probability)
507-511 F5.3 --- Pqsoc QSO probability, colours used
(QSO_probability)
513 I1 --- Labelc [0/3] Final labels, 'highest probability'
method (highestprobabilitylabels)
515-522 F8.6 --- zPredG Photometric redshift predictions
(GALPHOTOZPREDICTOR)
524-531 F8.6 --- zPredQ Photometric redshift predictions
(QSOPHOTOZPREDICTOR)
533-535 I3 --- phztrain [0/10]?=-99 Training/validation/test set
labels for the photometric redshift
predictions (see Sect. 7.3) (2)
--------------------------------------------------------------------------------
Note (1): column zSDSSQ14 can be used as a flag for membership of the DR14Q
catalogue: a source with a value in this column is in DR14Q.
Note (2): If the source is in SDSS DR14 it has a value from 1-10. The training
set has values 1-6, validation 7-8 and test 9-10.
-99 values are for sources not in SDSS DR14.
--------------------------------------------------------------------------------
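
Notes (1) and (2) can be applied mechanically once the zSDSSQ14 and phztrain
fields have been extracted from a klabels.dat record. The following is a
minimal Python sketch; the function names are hypothetical. It assumes that
a missing zSDSSQ14 value is stored as a blank field (the usual meaning of
the "?" marker in the byte-by-byte description). The value 0 lies inside the
declared [0/10] range of phztrain but is not described in Note (2), so the
sketch rejects it rather than guessing its meaning.

```python
from typing import Optional


def phz_subset(phztrain: int) -> Optional[str]:
    """Map a phztrain code to the subset named in Note (2)."""
    if phztrain == -99:
        return None                 # source not in SDSS DR14
    if 1 <= phztrain <= 6:
        return "training"
    if 7 <= phztrain <= 8:
        return "validation"
    if 9 <= phztrain <= 10:
        return "test"
    # Any other value (including 0) is not described in Note (2).
    raise ValueError(f"undocumented phztrain value: {phztrain}")


def in_dr14q(zsdssq14_field: str) -> bool:
    """Note (1): a non-blank zSDSSQ14 field means the source is in DR14Q.

    The argument is the raw text of bytes 47-55 of a klabels.dat record.
    """
    return zsdssq14_field.strip() != ""
```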
Global notes:
Note (G1): same as in the CPz catalogue presented in Fotopoulou and Paltani,
2018, Cat. J/A+A/619/A14.
--------------------------------------------------------------------------------
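
The Labelc columns in both files are described as final labels from the
'highest probability' method, alongside the four probability columns
(Poutc, Pstarc, Pgalc, Pqsoc). One plausible reading, sketched below in
Python, is that the label is the class with the largest of the four
probabilities. This is an assumption made for illustration, not the
authors' documented procedure: the sketch further assumes that Labelc uses
the same coding as the consolidation columns (0=outlier, 1=star, 2=gal,
3=QSO), and the tie-breaking rule (lowest code wins) is arbitrary. The
paper should be consulted for the actual definition.

```python
# Assumed coding, borrowed from the consolidation-method columns.
CLASS_NAMES = {0: "outlier", 1: "star", 2: "galaxy", 3: "QSO"}


def highest_probability_label(p_out, p_star, p_gal, p_qso):
    """Return the class code with the largest probability.

    Ties are resolved in favour of the lowest code, because max() returns
    the first maximal element; this choice is an assumption.
    """
    probs = [p_out, p_star, p_gal, p_qso]
    return max(range(len(probs)), key=lambda code: probs[code])


# Hypothetical probabilities, for illustration only.
label = highest_probability_label(0.05, 0.10, 0.80, 0.05)
label_name = CLASS_NAMES[label]
```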
Acknowledgements:
Crispin Logan, crispin.logan(at)bristol.ac.uk
(End) Crispin Logan [Bristol University, UK], Patricia Vannier [CDS] 21-Nov-2019