MANCompiled | Galaxy Classification with Deep Learning

Galaxy Classification with Deep Learning

July 9, 2023

🧠 Building a CNN to classify galaxy morphologies using the Galaxy10 dataset

🌀 Project Overview

This project implements a convolutional neural network (CNN) to classify galaxy images into 10 different morphological categories. Using the Galaxy10 dataset from DECaLS (Dark Energy Camera Legacy Survey), I trained a deep learning model to automatically identify galaxy types based on their visual characteristics.

🌟 Key Results

Achieved 73.98% test accuracy on galaxy classification across 10 distinct morphological categories.

🗂️ Dataset & Methodology

The Galaxy10 DECaLS dataset contains over 17,000 galaxy images labeled by citizen scientists through the Galaxy Zoo project. The images are resized to 69×69 pixels for training, and each one belongs to one of ten galaxy morphologies:

Completely round smooth galaxy
In-between smooth galaxy
Cigar-shaped smooth galaxy
Edge-on galaxy (no bulge)
Edge-on galaxy (with bulge)
Spiral galaxy
Galaxy with bar
Galaxy with no bulge
Galaxy with just noticeable bulge
Galaxy with obvious bulge

🧱 Model Architecture

The CNN architecture consists of four convolutional layers with 3×3 kernels and progressively increasing filter counts (32, 64, 128, 256), followed by global average pooling and dense layers:


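import tensorflow as tf

model = tf.keras.Sequential([
    # Block 1: 32 filters on the 69x69 RGB input
    tf.keras.layers.Conv2D(32, (3, 3),
                          activation='relu',
                          input_shape=(69, 69, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Block 2: 64 filters
    tf.keras.layers.Conv2D(64, (3, 3),
                          activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Block 3: 128 filters
    tf.keras.layers.Conv2D(128, (3, 3),
                          activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Block 4: 256 filters, no pooling before the global average
    tf.keras.layers.Conv2D(256, (3, 3),
                          activation='relu'),
    # Collapse each feature map to a single value (256 features)
    tf.keras.layers.GlobalAveragePooling2D(),
    # Classifier head with dropout for regularization
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(10, activation='softmax')
])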
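
To see how the spatial resolution shrinks through the network, here is a small sketch that traces the feature-map side length layer by layer. It assumes Keras defaults for the layers above ('valid' padding, stride 1 for convolutions, 2×2 pooling with stride 2); the helper names `conv_out` and `pool_out` are made up for illustration.

```python
def conv_out(size, kernel=3, stride=1):
    """Output side length of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Output side length of non-overlapping max pooling."""
    return size // pool


size = 69          # input images are 69x69
trace = [size]
n_conv_layers = 4  # filter counts 32, 64, 128, 256

for i in range(n_conv_layers):
    size = conv_out(size)
    trace.append(size)
    # The first three conv layers are each followed by 2x2 max pooling
    if i < n_conv_layers - 1:
        size = pool_out(size)
        trace.append(size)

print(trace)  # [69, 67, 33, 31, 15, 13, 6, 4]
```

Under these assumptions the last convolution produces 4×4 feature maps with 256 channels, which global average pooling reduces to a 256-dimensional vector.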

🧮 Total Parameters: ~0.42M (422,602 for the layers listed above)

⏱️ Training Time: ~45 minutes

📈 Final Accuracy: 73.98%
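
As a sanity check on the parameter count, the total can be worked out by hand from the layer shapes. The sketch below does the arithmetic for the architecture exactly as listed in this post (3×3 kernels, biases included, pooling and dropout contributing no parameters); the helper names are illustrative only.

```python
def conv2d_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Channel progression: RGB input, then 32, 64, 128, 256 filters
channels = [3, 32, 64, 128, 256]
conv_total = sum(conv2d_params(a, b) for a, b in zip(channels, channels[1:]))

# Global average pooling yields 256 features, then Dense(128) and Dense(10)
dense_total = dense_params(256, 128) + dense_params(128, 10)

total = conv_total + dense_total
print(conv_total, dense_total, total)  # 388416 34186 422602
```

Most of the parameters sit in the last two convolutional layers; the dense head is small because global average pooling feeds it only 256 features.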

🏋️ Training Process

The model was trained for 20 epochs with the following configuration:

Optimizer: Adam
Loss Function: Categorical Crossentropy
Batch Size: 64
Train/Validation Split: 90/10
Test Split: 20% of total data
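
The split figures above can be sketched with index arrays. This is a minimal illustration, assuming the 20% test set is held out first and the remainder is then divided 90/10 into training and validation (the order is an assumption, not something stated above). The function name, seed, and sample count are hypothetical.

```python
import numpy as np


def split_indices(n_samples, test_frac=0.20, val_frac=0.10, seed=0):
    """Shuffle indices, hold out a test set, then split the rest into train/val."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)

    n_test = int(round(n_samples * test_frac))
    test_idx = order[:n_test]
    remaining = order[n_test:]

    # val_frac applies to what is left after removing the test set
    n_val = int(round(len(remaining) * val_frac))
    val_idx = remaining[:n_val]
    train_idx = remaining[n_val:]
    return train_idx, val_idx, test_idx


# Toy example with 1,000 samples
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 720 80 200
```

Under this reading, the effective proportions are roughly 72% training, 8% validation, and 20% test.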

📊 Training Results

The model showed steady improvement throughout training:

Initial Training Accuracy: 76.04%
Final Training Accuracy: 88.87%
Best Validation Accuracy: 72.66%
Final Test Accuracy: 73.98%

📉 Validation accuracy stabilized around 72% while training accuracy kept climbing to 88.87%. The roughly 16-point gap points to some overfitting, although the 73.98% test accuracy is consistent with the validation figure, so the model does generalize at that level.

🧪 Implementation Details

🔄 Data Preprocessing

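import cv2
import numpy as np
import tensorflow as tf
from astroNN.datasets import galaxy10  # assumed import, matching the load_data() call
# Load the Galaxy10 images and their integer class labels
X, y = galaxy10.load_data()
# Resize every image to 69x69 and scale pixel values to [0, 1]
X = np.array([cv2.resize(img, (69, 69))
              for img in X],
             dtype='float32') / 255.0
# One-hot encode the labels for categorical crossentropy
y_cat = tf.keras.utils.to_categorical(y, 10)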
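
For readers who want to see what the normalization and one-hot steps do without loading the dataset, here is a NumPy-only sketch on random stand-in data. The helper names and the fake images are hypothetical; the one-hot helper is meant to mirror the effect of `to_categorical`, not reproduce its implementation.

```python
import numpy as np


def normalize_images(images_uint8):
    """Scale 8-bit pixel values into the [0, 1] range as float32."""
    return images_uint8.astype("float32") / 255.0


def one_hot(labels, num_classes=10):
    """Turn integer class labels into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype="float32")[labels]


# Stand-in data: four random 69x69 RGB images with made-up labels
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 69, 69, 3), dtype=np.uint8)
fake_labels = [0, 3, 9, 3]

X_demo = normalize_images(fake_images)
y_demo = one_hot(fake_labels)

print(X_demo.shape, X_demo.dtype)  # (4, 69, 69, 3) float32
print(y_demo.shape)                # (4, 10)
```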

🖼️ Model Testing Results

Test Image Analysis

Here's an example of the model in action on a real galaxy image:

Original galaxy image used for testing

📋 Prediction Results

Model prediction output showing classification results

Model Prediction:

Predicted Class: Spiral galaxy
Confidence: 46%
Processing Time: <0.1 seconds

While the confidence is moderate, this reflects the inherent difficulty in galaxy classification, where morphological boundaries can be subtle and subjective even for human experts.
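
A confidence value like this is presumably the largest entry of the softmax output. The sketch below ranks classes by probability to show how a 46% top score can coexist with plausible runner-up classes. The probability vector is invented for illustration (only the 0.46 matches the result above), and the index order of `CLASS_NAMES` simply follows the list earlier in this post, which is an assumption about the label encoding.

```python
import numpy as np

# Class names in the order listed in this post (index mapping is assumed)
CLASS_NAMES = [
    "Completely round smooth galaxy",
    "In-between smooth galaxy",
    "Cigar-shaped smooth galaxy",
    "Edge-on galaxy (no bulge)",
    "Edge-on galaxy (with bulge)",
    "Spiral galaxy",
    "Galaxy with bar",
    "Galaxy with no bulge",
    "Galaxy with just noticeable bulge",
    "Galaxy with obvious bulge",
]


def top_k(probs, k=3):
    """Return the k most probable (class name, probability) pairs."""
    probs = np.asarray(probs, dtype="float64")
    order = np.argsort(probs)[::-1][:k]
    return [(CLASS_NAMES[i], float(probs[i])) for i in order]


# Hypothetical softmax output for one image
example_probs = [0.02, 0.05, 0.01, 0.03, 0.04, 0.46, 0.21, 0.06, 0.08, 0.04]

for name, p in top_k(example_probs):
    print(f"{name}: {p:.0%}")
```

Looking at the top few classes rather than only the winner makes it easier to tell a confident mistake from a close call between neighboring morphologies.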

⚙️ Technical Stack

Deep Learning: TensorFlow/Keras

Data Processing: NumPy, OpenCV

Dataset: astroNN Galaxy10

Environment: Google Colab

💡 Key Learnings

Morphological Classification Complexity: Galaxy classification is inherently challenging due to the continuous nature of morphological features and subjective classification boundaries.
Data Augmentation Potential: The model could benefit from data augmentation techniques to improve generalization and handle orientation variations.
Transfer Learning Opportunities: Pre-trained models could potentially improve performance, especially given the limited dataset size.
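Validation Strategy: Validation accuracy stayed stable near 72% and closely matched the 73.98% test accuracy, so the validation set was a reliable proxy for held-out performance. The gap to the 88.87% training accuracy leaves room for stronger regularization.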
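
Because a galaxy has no preferred orientation on the sky, simple geometric augmentation is a natural fit. The following is a minimal NumPy sketch, not the method used in this project: it applies a random multiple of 90° rotation, a random horizontal flip, and a small brightness change. Arbitrary-angle rotation and scaling are left out to keep it dependency-free, and the function name and ranges are illustrative choices.

```python
import numpy as np


def augment(image, rng):
    """Randomly rotate (multiples of 90 degrees), flip, and rescale brightness.

    Expects a float image of shape (H, W, C) with values in [0, 1].
    """
    # Rotate in the image plane by 0, 90, 180, or 270 degrees
    k = int(rng.integers(0, 4))
    out = np.rot90(image, k=k, axes=(0, 1))

    # Mirror left-right half of the time
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)

    # Brightness jitter of up to +/-10%, clipped back into range
    scale = rng.uniform(0.9, 1.1)
    out = np.clip(out * scale, 0.0, 1.0)
    return out.astype("float32")


rng = np.random.default_rng(42)
fake_image = rng.random((69, 69, 3)).astype("float32")  # stand-in galaxy image
augmented = augment(fake_image, rng)
print(augmented.shape, augmented.dtype)  # (69, 69, 3) float32
```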

🚀 Future Improvements

Data Augmentation: Implement rotation, scaling, and brightness variations
Transfer Learning: Experiment with pre-trained CNN backbones
Ensemble Methods: Combine multiple models for improved accuracy
Attention Mechanisms: Incorporate attention layers to focus on relevant morphological features
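
One simple form of ensembling is to average the softmax outputs of several models and take the most probable class. The sketch below shows that idea on toy numbers (two hypothetical models, two samples, three classes instead of ten); it is one possible approach, and the function name is made up.

```python
import numpy as np


def ensemble_predict(prob_list):
    """Average per-model class probabilities and pick the top class.

    prob_list: list of arrays, each of shape (n_samples, n_classes).
    Returns (predicted labels, averaged probabilities).
    """
    stacked = np.stack(prob_list, axis=0)  # (n_models, n_samples, n_classes)
    mean_probs = stacked.mean(axis=0)
    return mean_probs.argmax(axis=1), mean_probs


# Toy softmax outputs from two hypothetical models
model_a = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3]])
model_b = np.array([[0.2, 0.7, 0.1],
                    [0.1, 0.3, 0.6]])

labels, mean_probs = ensemble_predict([model_a, model_b])
print(labels.tolist())  # [1, 2]
```

In this toy case the averaged prediction differs from model A on both samples, which illustrates how a second model can overturn a single model's choice.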

This project was developed using Google Colab and leverages the Galaxy10 dataset, which combines high-quality DECaLS imaging with Galaxy Zoo classifications originally derived from the Sloan Digital Sky Survey (SDSS) project.