The Transition: From LSTM to Transformer

A Deep Dive into Attention, Memory, and Parallelization

The Two Key Limitations of LSTMs

Long Short-Term Memory (LSTM) networks dominated sequence modeling from roughly 2014 to 2017. However, their underlying conveyor-belt architecture ran into a serious bottleneck.

1. Information Bottleneck (Amnesia): Information about the first word has to be passed along sequentially, step by step, through every later word. In a 2,000-word article, the signal from the first word has largely faded by the time it reaches the end.
2. Sequential Training (Speed): You cannot compute the context for the 50th word until the 49th word has been computed. It is a for loop over positions, so the work along the sequence cannot be trained in parallel on modern GPUs.
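
The sequential dependency in point 2 can be sketched as a plain loop. This is a minimal illustration, assuming a scalar tanh recurrence rather than a full LSTM cell; the weights `w_h` and `w_x` are hypothetical values chosen only for the example.

```python
import math

def run_recurrence(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t).

    Not a real LSTM cell (no gates, no cell state); it only shows
    that step t cannot start before step t-1 has finished.
    """
    h = 0.0
    states = []
    for x in inputs:  # each iteration needs the h produced by the previous one
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

states = run_recurrence([0.1] * 50)
print(len(states))  # one hidden state per input position
```

The loop body is cheap, but its iterations cannot be reordered or run side by side, which is the property the slide calls the "for loop" problem.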

LSTMs can only read words one after another, which creates a major bottleneck.

The Amnesia Test: An Empirical Demonstration

The `amnesia_test_demo.py` script was written to demonstrate this limitation in practice.

A random string of 20 letters is generated. Both the LSTM and the Transformer are asked to remember and output only the first letter of the sequence.

Because the LSTM has to carry the letter 'K' through 19 successive hidden states, its gradient vanishes and it forgets the context entirely.
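
A rough way to see why 19 steps hurt: along the LSTM cell state, the gradient that reaches the first position is scaled by the product of the forget-gate activations on the path. The sketch below assumes one constant gate value per step, which is a simplification (real gates vary per step and per unit), and prints how much of the signal survives.

```python
def surviving_gradient(gate_value, steps):
    """Product of `steps` identical per-step factors (assumed constant gate)."""
    factor = 1.0
    for _ in range(steps):
        factor *= gate_value
    return factor

# Hypothetical gate values; 19 is the number of steps from the slide.
for gate in (0.9, 0.7, 0.5):
    print(f"gate={gate}: {surviving_gradient(gate, 19):.6f}")
```

Even a gate that keeps 90% per step leaves only a small fraction of the original signal after 19 steps, and smaller gate values shrink it by orders of magnitude.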

The Amnesia Test Output

Theoretical Takeaway: On long sequences, LSTMs suffer from the vanishing gradient problem, whereas Transformers use attention to retrieve the context directly.

    =====================================================
    THE AMNESIA TEST: LSTM vs Transformer (Context Memory)
    =====================================================
    Unseen Test String: 'K N M B E G J S V L N J T Y H O O A Y S'
    Target Answer: 'K'
    -----------------------------------------------------
    LSTM Predicted: 'A' (Failed! The Vanishing Gradient destroyed memory)
    Transformer Predicted: 'K' (Success! Attention instantly retrieved the first letter)

The Transformer Revolution

In 2017, researchers at Google published the paper "Attention Is All You Need". They abandoned the sequential conveyor-belt approach entirely.

Instead of passing a hidden state along in sequence, the Transformer builds a web of connections. Every word is connected directly to every other word at the same time.

The "path length" from word 1 to word 10,000 is O(1). It is a direct, immediate mathematical connection.

Self-attention lets every word look at every other word directly.

How Attention Works: The Cocktail Party Effect

Imagine you are at a noisy party. While talking with one person, you can tune out all the other noise and focus entirely on that person's voice. The Transformer does the same thing using three matrices: Query, Key, and Value (Q, K, V).

Query (Q): "What am I looking for?"
(e.g., the word 'The' shouts: "I am looking for a noun")
Key (K): "What am I?"
(e.g., the word 'Bank' shouts: "I am a noun related to rivers/money")
Value (V): "What is my actual meaning?"

When a Query and a Key match (as measured by their dot product), the model absorbs the corresponding Value. If 'The' matches 'Bank' strongly, it takes, say, 99% of its attention-weighted mix from the Value of 'Bank' and uses that to understand its own context.
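
The match-then-absorb step can be sketched for a single query in pure Python. This is a minimal illustration with hypothetical 2-dimensional vectors for three words; the scores are divided by the square root of the key dimension, as in the "Attention Is All You Need" formulation.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (max subtracted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query scaled dot-product attention."""
    d_k = len(query)
    # How well the query matches each key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    # Weighted mix of the value vectors.
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical toy vectors: the first key points the same way as the query.
query = [4.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, mixed = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in mixed])
```

The key that aligns with the query receives nearly all the weight, so the output is almost a copy of that word's value vector.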

The Math of Attention

The `transformer_deep_dive.py` script shows this matrix math in detail.

Multiplying the Query matrix by the transpose of the Key matrix, scaling the scores, and applying a softmax function to each row gives us a grid of percentages.

Notice how, to build the context for the word "propose", the model assigns 14% of its attention to the word "network".

The Attention Matrix Output

Theoretical Takeaway: The softmax function distributes attention as fractions that sum to 1.0, so the model spreads a fixed attention budget across the entire sentence.

    Attention Weights (Softmax applied to scaled scores)
    Observe how every row perfectly sums to 1.0 (100% distribution)

                   We  propose        a      new   simple  network architec      the Transfor
    We           0.04     0.07     0.05     0.17     0.06     0.14     0.06     0.20     0.19 | Σ=1.00
    propose      0.06     0.05     0.05     0.23     0.06     0.14     0.08     0.14     0.21 | Σ=1.00
    a            0.04     0.05     0.04     0.21     0.05     0.14     0.05     0.24     0.18 | Σ=1.00
    new          0.11     0.10     0.06     0.12     0.12     0.11     0.19     0.04     0.15 | Σ=1.00
    simple       0.03     0.04     0.03     0.22     0.04     0.14     0.07     0.13     0.28 | Σ=1.00
    network      0.21     0.11     0.18     0.07     0.12     0.08     0.12     0.06     0.05 | Σ=1.00
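
For reference, the scaled dot-product attention defined in "Attention Is All You Need" is written as follows, where $d_k$ is the dimension of the key vectors. Whether the script uses exactly this form is an assumption based on the "scaled scores" wording in its output.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
```

The softmax is applied row by row, which is why each row of the weight grid sums to 1 before it is multiplied by $V$.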

Watching the Model Learn

In the `transformer_training_demo.py` script, backpropagation is run through this same matrix math.

At Epoch 0, the Q, K, and V matrices are completely random, so the model outputs meaningless words.

During training, these matrices are adjusted automatically to fit the patterns of English grammar, driving the loss down toward zero (from 5.8054 to 0.4626 in the run shown in this section).

The Training Progression

Theoretical Takeaway: As the loss falls, the attention matrices learn the relationships between words (for example, adjectives to nouns), and the random words turn into a meaningful sentence.

    ► EPOCH 0 (Untrained Model)
    Explanation: The weights are random. The model is guessing blindly.
    Loss: 5.8054
    Target: propose a new simple network architecture, the Transformer
    Output: squeeze parsing MHA ALIGN machine SSD Peephole WFM
    ------------------------------------------------------------------
    [Running Backpropagation... Fine-tuning the weights]
    ► EPOCH 160
    Loss: 0.4844
    Target: propose a new simple network architecture, the Transformer
    Output: propose a deep simple network architecture, the Transformer
    ------------------------------------------------------------------
    [Running Backpropagation... Fine-tuning the weights]
    ► EPOCH 240
    Loss: 0.4626
    Target: propose a new simple network architecture, the Transformer
    Output: propose a new simple network architecture, the Transformer
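
One way to sanity-check the Epoch 0 loss above: if the loss is the usual cross-entropy measured in nats (an assumption, since the slides do not name the loss function), a model that guesses roughly uniformly over a vocabulary of V words scores about ln V. The sketch below inverts that relation for the reported value of 5.8054. The implied vocabulary size is an inference, not a figure taken from the script.

```python
import math

def uniform_guess_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over `vocab_size` words."""
    return math.log(vocab_size)

def implied_vocab_size(loss):
    """Vocabulary size at which a uniform guess would produce `loss`."""
    return math.exp(loss)

epoch0_loss = 5.8054  # value reported in the training output
print(round(implied_vocab_size(epoch0_loss)))
print(round(uniform_guess_loss(332), 4))
```

Under that assumption, a loss near 5.8 is consistent with blind guessing over a few hundred words, which matches the "guessing blindly" note in the log.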

The Speed Benchmark

How do Transformers scale to trillions of parameters? A large part of the answer is that they removed the for loop over sequence positions entirely.

The `lstm_vs_transformer_race.py` script was written to test this in pure Python on a large, 1,000-word document.

The LSTM is forced to wait for the previous word 1,000 times, whereas the Transformer computes all 1,000 words at once through matrix math.
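
The difference in dependency structure can be sketched in pure Python. This toy uses scalar "embeddings" and an unscaled softmax over products, both hypothetical simplifications. Each attention output depends only on the inputs, never on another output, so the rows can be computed in any order, and that is what allows hardware to compute them at once.

```python
import math
import random

def attention_row(i, xs):
    """Output for position i: softmax-weighted average of all inputs.

    Toy version with scalar inputs (hypothetical simplification).
    """
    scores = [xs[i] * x for x in xs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, xs))

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(100)]

# Rows computed front to back.
in_order = [attention_row(i, xs) for i in range(len(xs))]

# The same rows computed in a shuffled order: no row needs another row's result.
order = list(range(len(xs)))
random.shuffle(order)
shuffled = [0.0] * len(xs)
for i in order:
    shuffled[i] = attention_row(i, xs)

print(in_order == shuffled)
```

A recurrent state cannot be filled in this way, because each state needs the previous one first. Note that the total work in attention still grows with the square of the sequence length; what shrinks is the number of steps that must run one after another.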

The Speed Benchmark Output

Theoretical Takeaway: The Transformer needs only $O(1)$ sequential steps per layer, against $O(N)$ for the LSTM, so its parallel structure uses hardware accelerators far more effectively.

    THE SPEED BENCHMARK: SEQUENTIAL vs PARALLEL
    ► Testing LSTM (O(N) Sequential Process)...
    Time taken: 0.4053 seconds
    (The LSTM was forced to pause and wait for the previous word 1,000 times!)
    ► Testing Transformer (O(1) Parallel Process)...
    Time taken: 0.2198 seconds
    (The Transformer calculated all 1,000 words simultaneously using Matrices!)
    ============================================================
    CONCLUSION: The Transformer is 1.8x faster at training time!
    ============================================================

The Grand Finale: The Generative Showdown

Finally, both fully trained models are given the prompt "We propose a". Because the LSTM struggles to capture long-range dependencies, it hallucinates with grammatical errors. The Transformer, with its $O(1)$ path to every word of context, generates accurate, domain-specific sentences. ChatGPT is built on this same architecture!

    ► LSTM GENERATION:
    LSTM thinks next word is: global (Context: We propose a)
    LSTM thinks next word is: context (Context: We propose a global)
    LSTM thinks next word is: network (Context: We propose a global context)
    LSTM Final Output: We propose a global context network architecture, the LSTM
    ------------------------------------------------------------
    ► TRANSFORMER GENERATION:
    Transformer thinks next word is: channel (Context: We propose a)
    Transformer thinks next word is: attention (Context: We propose a channel)
    Transformer thinks next word is: neural (Context: We propose a channel attention)
    Transformer Final Output: We propose a channel attention neural model, the SENet
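
Both traces above follow the same autoregressive loop: predict one word, append it to the context, repeat. A minimal sketch, assuming a hypothetical `toy_next_word` lookup table in place of a trained model. The three table entries are copied from the Transformer trace and are not produced by a real network.

```python
# Hypothetical stand-in for a trained model: maps a context to the next word.
TOY_TABLE = {
    "We propose a": "channel",
    "We propose a channel": "attention",
    "We propose a channel attention": "neural",
}

def toy_next_word(context):
    """Return the next word, or None when the toy table has no entry."""
    return TOY_TABLE.get(context)

def generate(prompt, max_new_words=10):
    """Greedy generation: append one predicted word at a time."""
    context = prompt
    for _ in range(max_new_words):
        word = toy_next_word(context)
        if word is None:  # the toy table ran out of entries
            break
        print(f"next word: {word} (Context: {context})")
        context = f"{context} {word}"
    return context

result = generate("We propose a")
print(result)
```

Generation stays sequential for both architectures, since each new word depends on the words before it. The parallelism discussed earlier applies to processing a known sequence during training.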
