English → French Neural Machine Translation
Memory-safe NMT on a 6 GB dataset without RAM crashes. Custom Seq2Seq + HuggingFace mBART/Helsinki-NLP fine-tuning. Fixed 5 critical upstream bugs (GradientTape, tokenizer overflow, deprecated API).
6 GB English-French parallel corpus
Memory-safe chunked loading + custom Seq2Seq baseline + HuggingFace pretrained fine-tuning
Memory-safe NMT that handles a 6 GB parallel corpus on Kaggle's 33 GB RAM limit without crashing.
Memory Strategy Chunked reading → sample → delete raw data → train on subset → clear between models.
5 Critical Bugs Fixed From Upstream Notebooks
| Bug | Root Cause | Fix Applied |
|---|---|---|
| None gradients crash | GradientTape consumed twice in train_step | Restructure tape scope |
| TypeError on call | Encoder/Decoder.call() missing training= kwarg | Add explicit kwarg |
| AttributeError | as_target_tokenizer() removed in transformers≥4.36 | Use context manager API |
| Deprecated argument | evaluation_strategy renamed eval_strategy | Update arg name |
| Integer overflow | int16 array in tokenizer exceeded max value | Cast to int32 |
Models Implemented
| Model | Framework | Approach |
|---|---|---|
| Custom Seq2Seq | TF 2.19 | LSTM encoder-decoder + Bahdanau attention |
| mBART | PyTorch 2.9 | facebook/mbart-large-cc25 fine-tuning |
| Helsinki-NLP | PyTorch 2.9 | opus-mt-en-fr fine-tuning |
| MarianMT | PyTorch 2.9 | Alternative MarianMT strategy |
Key Insight Pretrained multilingual models (mBART trained on 25 languages) dramatically outperform from-scratch Seq2Seq. The LSTM baseline validates the pipeline architecture; pretrained models show the transfer learning gap.