All Projects
NLPFeatured

English → French Neural Machine Translation

Memory-safe NMT on a 6 GB dataset without RAM crashes. Custom Seq2Seq + HuggingFace mBART/Helsinki-NLP fine-tuning. Fixed 5 critical upstream bugs (GradientTape, tokenizer overflow, deprecated API).

Dataset

6 GB English-French parallel corpus

Approach

Memory-safe chunked loading + custom Seq2Seq baseline + HuggingFace pretrained fine-tuning

Tech Stack
PythonTensorFlow 2.19PyTorch 2.9HuggingFace Transformers 4.36+mBART
Keywords
Seq2SeqmBARTMarianMTHuggingFaceNMTTensorFlowPyTorch
Visualizations4 Charts
Deep Dive

Memory-safe NMT that handles a 6 GB parallel corpus on Kaggle's 33 GB RAM limit without crashing.

Memory Strategy Chunked reading → sample → delete raw data → train on subset → clear between models.

5 Critical Bugs Fixed From Upstream Notebooks

BugRoot CauseFix Applied
None gradients crashGradientTape consumed twice in train_stepRestructure tape scope
TypeError on callEncoder/Decoder.call() missing training= kwargAdd explicit kwarg
AttributeErroras_target_tokenizer() removed in transformers≥4.36Use context manager API
Deprecated argumentevaluation_strategy renamed eval_strategyUpdate arg name
Integer overflowint16 array in tokenizer exceeded max valueCast to int32

Models Implemented

ModelFrameworkApproach
Custom Seq2SeqTF 2.19LSTM encoder-decoder + Bahdanau attention
mBARTPyTorch 2.9facebook/mbart-large-cc25 fine-tuning
Helsinki-NLPPyTorch 2.9opus-mt-en-fr fine-tuning
MarianMTPyTorch 2.9Alternative MarianMT strategy

Key Insight Pretrained multilingual models (mBART trained on 25 languages) dramatically outperform from-scratch Seq2Seq. The LSTM baseline validates the pipeline architecture; pretrained models show the transfer learning gap.