Text Recognition Model Architecture

Taeyang's Learning Lab 2025. 5. 19. 18:43


With data preprocessing complete, this post documents the architecture of the model I built.

Model Overview

The model used for text-based emotion classification follows a CNN + BiLSTM + Attention architecture.

This structure was chosen because it captures not only the sequential characteristics of a sentence but also local patterns, making it well-suited for emotion analysis.

CNN (Convolutional Neural Network)
- The Conv1D layer extracts local features by capturing consecutive word patterns in a sentence (essentially n-gram information), such as emotion-related expressions that span two or three words.
- Setting kernel_size=3 makes each filter look at a window of three consecutive tokens, so the model learns to detect patterns at the 3-gram level.
BiLSTM (Bidirectional Long Short-Term Memory)
- A bidirectional LSTM is used to capture both the forward and backward context of a sentence.
- With return_sequences=True, the output at each time step is preserved and passed to the next layer, allowing the Attention mechanism to make use of the full sequence information.
Attention Layer
- This is not a built-in Keras layer, but a custom Attention layer that I implemented myself.
- It learns attention weights based on the word vectors at each time step and generates a context vector that focuses on the most important parts of the sentence.
- Internally, it uses two Dense layers (W and V) to compute attention scores, which are then normalized using a softmax function.
- When a mask is provided, a very large negative value is assigned to the scores at padding positions before the softmax, so their attention weights become effectively zero and the model does not attend to them.
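
To make the attention steps above concrete, here is a minimal pure-Python sketch of the computation: score each time step, mask the padding, apply softmax, and take the weighted sum. The function name `attention_pool`, the tanh activation, the missing bias terms, and the -1e9 fill value are my assumptions for illustration; the post only states that two Dense layers (W and V) and a softmax are used.

```python
import math

def attention_pool(hidden_states, w, v, mask):
    """Additive attention over time steps (illustrative sketch).

    hidden_states: list of T vectors, each a list of D floats
    w: U x D matrix (list of U rows), standing in for the first Dense layer
    v: list of U floats, standing in for the second Dense layer (one output unit)
    mask: list of T booleans, False marks a padding position
    """
    scores = []
    for h, keep in zip(hidden_states, mask):
        # First projection with tanh (assumed activation), then a scalar score.
        hidden = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h))) for row in w]
        score = sum(v_i * u_i for v_i, u_i in zip(v, hidden))
        # Padding positions get a very large negative score.
        scores.append(score if keep else -1e9)

    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [sum(a * h[d] for a, h in zip(weights, hidden_states)) for d in range(dim)]
    return weights, context

# Three time steps; the last one is padding and should receive no attention.
states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
weights, context = attention_pool(states, w, v, mask=[True, True, False])
print(weights, context)
```

In this toy input the two real time steps have identical scores, so they share the attention equally, and the padded step contributes nothing to the context vector.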

With this combination, I aimed to improve emotion classification performance, especially for Korean, a language with relatively flexible word order.

Model Implementation and Design

The model was implemented using TensorFlow and Keras.
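
As a rough illustration of how the CNN + BiLSTM + Attention stack might be assembled in Keras, here is a minimal sketch. It is not the exact code from this project: the class name `AttentionPooling`, the function `build_model`, all layer sizes, the vocabulary size, the sequence length, the number of classes, and the convention that token id 0 means padding are assumptions. For simplicity the padding mask is derived from the token ids inside the attention layer, and the BiLSTM itself still processes padded positions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

class AttentionPooling(layers.Layer):
    """Custom attention: two Dense layers (W, V), masked softmax, weighted sum."""

    def __init__(self, units=64, **kwargs):
        super().__init__(**kwargs)
        self.W = layers.Dense(units, activation="tanh")  # tanh is an assumption
        self.V = layers.Dense(1)

    def call(self, inputs):
        states, token_ids = inputs              # (batch, T, D), (batch, T)
        scores = self.V(self.W(states))         # (batch, T, 1)
        # Assumed convention: token id 0 marks padding.
        keep = tf.cast(tf.not_equal(token_ids, 0), scores.dtype)
        scores = scores + (1.0 - keep[:, :, tf.newaxis]) * -1e9
        weights = tf.nn.softmax(scores, axis=1)  # normalize over time steps
        return tf.reduce_sum(weights * states, axis=1)  # (batch, D)

def build_model(vocab_size=20000, max_len=100, num_classes=7):
    """Hypothetical sizes; only kernel_size=3 and return_sequences=True come from the post."""
    token_ids = layers.Input(shape=(max_len,), dtype="int32")
    x = layers.Embedding(vocab_size, 128)(token_ids)
    # padding="same" keeps the sequence length, so positions stay aligned with token_ids.
    x = layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    x = AttentionPooling(units=64)([x, token_ids])
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(token_ids, outputs)

model = build_model()
model.summary()
```

Using `padding="same"` in the Conv1D layer is a design choice of this sketch: it keeps one output position per input token, which lets the attention layer line up its mask with the original token ids.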

The model was trained with the following configuration:

Loss Function: Sparse Categorical Crossentropy (well-suited for integer-encoded labels)
Optimizer: Adam (learning_rate=0.0003)
Batch Size: 64
Epochs: 50
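Callbacks: EarlyStopping (stops training once the monitored metric stops improving, which limits overfitting) and ReduceLROnPlateau (lowers the learning rate when the monitored metric plateaus)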
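
The configuration listed above could be expressed in Keras roughly as follows. The loss, optimizer, and learning rate come from the list; the helper name `compile_for_training`, the monitored metric, the patience values, the reduction factor, and `restore_best_weights=True` are assumptions chosen for illustration.

```python
import tensorflow as tf

def compile_for_training(model):
    """Compile a Keras model with the settings listed in this post and return callbacks."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0003),
        loss="sparse_categorical_crossentropy",  # expects integer-encoded labels
        metrics=["accuracy"],
    )
    callbacks = [
        # Assumed settings: stop after 5 epochs without val_loss improvement.
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        ),
        # Assumed settings: halve the learning rate after 2 stagnant epochs.
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.5, patience=2
        ),
    ]
    return callbacks

# Usage with hypothetical arrays x_train, y_train, x_val, y_val:
# callbacks = compile_for_training(model)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=64, epochs=50, callbacks=callbacks)
```

With EarlyStopping in place, the 50 epochs act as an upper bound rather than a fixed training length.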

In addition, to address class imbalance in the dataset, class_weight was used to assign appropriate weights during training.
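
One common way to build such a class_weight dictionary is the inverse-frequency ("balanced") heuristic sketched below. This post does not state how the weights were actually computed, so the formula, the helper name `balanced_class_weights`, and the toy labels are assumptions.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy integer-encoded labels: class 0 is frequent, class 2 is rare.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
class_weight = balanced_class_weights(labels)
print(class_weight)

# The dictionary can then be passed to Keras, e.g.
# model.fit(..., class_weight=class_weight)
```

Rare classes receive weights above 1 and frequent classes receive weights below 1, so mistakes on underrepresented emotions contribute more to the loss.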

An ensemble approach was also applied, selecting the best-performing model based on validation accuracy.

Detailed training parameters and performance results will be covered in the next post.

Lessons Learned from Building a Text Classification Model

Designing and implementing the model architecture was a process filled with important decisions and challenges.

One of the biggest difficulties was balancing model complexity with training stability—especially when combining convolutional and recurrent layers with a custom attention mechanism.

It required careful experimentation to ensure that each layer added meaningful value without introducing unnecessary overhead.

In particular, handling the flexible word order of the Korean language posed unique modeling challenges, which led me to combine the CNN with a BiLSTM + Attention structure so that the model captures both local and contextual features.

Through this experience, I realized how crucial it is to design models not only for accuracy, but also for robustness, scalability, and relevance to the linguistic structure of the target domain—principles that are essential in any real-world AI application.
