diff --git a/docs/_config.yml b/docs/_config.yml
index c2fd8a3e5..032ccb16e 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -610,6 +610,7 @@ fa:
     - path: fa/week12/12.md
       sections:
         - path: fa/week12/12-1.md
+        - path: fa/week12/12-2.md
     - path: fa/week13/13.md
       sections:
         - path: fa/week13/13-1.md
diff --git a/docs/fa/week12/12-1.md b/docs/fa/week12/12-1.md
index 265a15982..9e9dcc268 100644
--- a/docs/fa/week12/12-1.md
+++ b/docs/fa/week12/12-1.md
@@ -6,9 +6,7 @@ lecturer: Mike Lewis
 authors: Jiayu Qiu, Yuhong Zhu, Lyuang Fu, Ian Leefmans
 date: 20 Apr 2020
 translator: Tayeb Pourebrahim
-translation-date: 10 Oct 2020
 ---
-
 <!-----
 ## [Overview](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=44s)
 
@@ -214,7 +212,7 @@ We compute the same thing with different queries, values, and keys multiple time
 <!---
 One big advantage about the multi-headed attention is that it is very parallelisable. Unlike RNNs, it computes all heads of the multi-head attention modules and all the time-steps at once. One problem of computing all time-steps at once is that it could look at futures words too, while we only want to condition on previous words. One solution to that is what is called **self-attention masking**. The mask is an upper triangular matrix that have zeros in the lower triangle and negative infinity in the upper triangle. The effect of adding this mask to the output of the attention module is that every word  to the left has a much higher attention score than words to the right, so the model in practice only focuses on previous words. The application of the mask is crucial in language model because it makes it mathematically correct, however, in text encoders, bidirectional context can be helpful.
 --->
-یک مزیت بزرگ در مورد توجه چند سر این است که به صورت موازی بسیار قابل محاسبه است. برخلاف RNNها، که سرهای ماژول های توجه چند سر و همه مراحل گام را به طور همزمان محاسبه می کند. یک مشکل محاسبه همزمان همه مراحل زمانی این است که می تواند کلمات آینده را نیز بررسی کند، در حالی که ما فقط می خواهیم به کلمات قبلی شرط بگذاریم. یک راه حل برای آن چیزی است که ** پوشش خود-توجه ای ** نامیده می شود. پوشش یک ماتریس مثلثی فوقانی است که در مثلث پایین صفر و در مثلث بالایی بی نهایت منفی دارد. تأثیر افزودن این پوشش به خروجی ماژول توجه این است که هر کلمه به سمت چپ دارای نمره توجه بسیار بیشتری نسبت به کلمات به سمت راست است، بنابراین مدل در عمل فقط بر روی کلمات قبلی تمرکز دارد. استفاده از پوشش در مدل زبان بسیار مهم است زیرا آن را از نظر ریاضی صحیح می کند، با این حال، در رمزگذارهای متن، متن دو زبانه می تواند مفید باشد.
+یک مزیت بزرگ توجه چندسر این است که محاسبات آن را به راحتی می‌توان به صورت موازی انجام داد. برخلاف RNNها، همه سرهای ماژول‌های توجه چندسر و همه گام‌های زمانی به طور همزمان محاسبه می‌شوند. یک مشکل محاسبه همزمان همه گام‌های زمانی این است که مدل می‌تواند به کلمات آینده نیز نگاه کند، در حالی که ما فقط می‌خواهیم به کلمات قبلی شرط بگذاریم. یک راه حل برای آن چیزی است که **پوشش خود-توجه‌ای** نامیده می‌شود. پوشش یک ماتریس مثلثی فوقانی است که در مثلث پایین صفر و در مثلث بالایی بی‌نهایت منفی دارد. تأثیر افزودن این پوشش به خروجی ماژول توجه این است که هر کلمه در سمت چپ نمره توجه بسیار بیشتری نسبت به کلمات سمت راست دارد، بنابراین مدل در عمل فقط بر روی کلمات قبلی تمرکز می‌کند. استفاده از پوشش در مدل زبان بسیار مهم است زیرا آن را از نظر ریاضی صحیح می‌کند، با این حال، در رمزگذارهای متن، زمینه دوطرفه می‌تواند مفید باشد.
 
 
 <!--
@@ -240,7 +238,7 @@ One detail to make the transformer language model work is to add the positional
 --->
 
 **چرا این مدل خوب است؟**
-۱. ارتباط مستقیمی بین هر جفت کلمه ایجاد می کند. هر کلمه می تواند مستقیماً به حالت های پنهان کلمات قبلی دسترسی پیدا کند و گرادیان های ناپدید شده را کاهش دهد. تابع پرهزینه ای را به راحتی یاد می‌گیرد.
+۱. ارتباط مستقیمی بین هر جفت کلمه ایجاد می کند. هر کلمه می تواند مستقیماً به حالت های پنهان کلمات قبلی دسترسی پیدا کند و مشکل گرادیان‌های ناپدیدشونده را برطرف می‌کند. تابع پرهزینه ای را به راحتی یاد می‌گیرد.
 ۲. تمام گام‌های زمانی به صورت موازی محاسبه می‌شود.
 ۳. خود-توجه‌ای درجه دوم است (تمام گام‌های زمانی می تواند به همه موارد دیگر مربوط شود)، محدود به حداکثر طول دنباله.
 
@@ -284,7 +282,7 @@ You could see that when transformers were introduced, the performance was greatl
 - برای دگرگون ساز بسیار مهم است
  
  
-### ترفند ۲: دست گرمی + برنامه زمانی آموزش ریشه مربع معکوس
+### ترفند ۲: دست گرمی + زمان بندی آموزش ریشه مربع معکوس
 
 - از برنامه زمانی نرخ یادگیری استفاده کنید: برای اینکه دگرگون سازها به خوبی کار کنند، باید سرعت یادگیری خود را از صفر تا هزار مرحله به صورت خطی کاهش دهید
 
@@ -369,7 +367,7 @@ It requires computing all possible sequences and because of the complexity of $O
 
 ### رمزگشایی حریص کار نمی کند
 
-ما محتمل ترین کلمه را در هر مرحله زمان می گیریم. با این وجود، هیچ تضمینی این محتمل ترین توالی ممکن باشد، زیرا اگر مجبور باشید در مرحله ای آن مرحله را انجام دهید، دیگر هیچ راهی برای پیگیری جستجوی خود برای پس‌گرد سایر نشست‌های قبلی ندارید.
+ما محتمل‌ترین کلمه را در هر گام زمانی انتخاب می‌کنیم. با این وجود، هیچ تضمینی برای این‌که این محتمل‌ترین توالی ممکن باشد وجود ندارد، زیرا اگر در مرحله‌ای آن گام را بردارید، دیگر هیچ راهی برای پس‌گرد در جستجوی خود و لغو انتخاب‌های قبلی ندارید.
 
 ### جستجوی خسته کننده نیز امکان پذیر نیست
 
diff --git a/docs/fa/week12/12-2.md b/docs/fa/week12/12-2.md
new file mode 100644
index 000000000..2a696718b
--- /dev/null
+++ b/docs/fa/week12/12-2.md
@@ -0,0 +1,614 @@
+---
+lang: fa
+lang-ref: ch.12-2
+title: مدل های زبان رمزگشا
+lecturer: Mike Lewis
+authors: Trevor Mitchell, Andrii Dobroshynskyi, Shreyas Chandrakaladharan, Ben Wolfson
+date: 20 Apr 2020
+translator: Tayeb Pourebrahim
+---
+
+
+<!--
+## [Beam Search](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=2732s)
+
+
+Beam search is another technique for decoding a language model and producing text. At every step, the algorithm keeps track of the $k$ most probable (best) partial translations (hypotheses). The score of each hypothesis is equal to its log probability.
+The algorithm selects the best scoring hypothesis.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Beam_Decoding.png" width="60%"/><br>
+<b>Fig. 1</b>: Beam Decoding
+</center>
+--->
+## [الگوریتم جستجو پرتو محلی](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=2732s)
+الگوریتم جستجو پرتو محلی یک تکنیک دیگر برای رمزگشایی از مدل زبانی و ایجاد متن است. در هر مرحله، الگوریتم $k$ تا از محتمل‌ترین (بهترین) ترجمه‌های جزئی (فرضیه‌ها) را دنبال می‌کند. امتیاز هر فرضیه برابر با لگاریتم احتمال آن است.
+الگوریتم فرضیه‌ای را که بالاترین امتیاز را دارد انتخاب می‌کند.
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Beam_Decoding.png" width="60%"/><br>
+<b>تصویر اول</b>: رمزگشایی پرتو محلی
+</center>
+
+
+<!-----
+How deep does the beam tree branch out ?
+
+The beam tree continues until it reaches the end of sentence token. Upon outputting the end of sentence token, the hypothesis is finished.
+
+Why (in NMT) do very large beam sizes often results in empty translations?
+
+At training time, the algorithm often does not use a beam, because it is very expensive. Instead it uses auto-regressive factorization (given previous correct outputs, predict the $n+1$ first words). The model is not exposed to its own mistakes during training, so it is possible for “nonsense” to show up in the beam.
+
+Summary: Continue beam search until all $k$ hypotheses produce end token or until the maximum decoding limit T is reached.
+---->
+درخت پرتو تا چه عمقی منشعب می شود؟
+
+درخت پرتو تا زمانی که به ژتون انتهای جمله برسد، ادامه می یابد. به محض خروجی ژتون انتهای جمله، فرضیه تمام می شود.
+
+چرا (در NMT) اندازه های بسیار بزرگ پرتو اغلب منجر به ترجمه های خالی می شود؟
+
+در زمان آموزش، الگوریتم اغلب از پرتو استفاده نمی‌کند، زیرا بسیار گران است. در عوض از تجزیه خودهمبسته استفاده می‌کند (با داشتن خروجی‌های درست قبلی، $n+1$ کلمه اول را پیش‌بینی می‌کند). این مدل در هنگام آموزش در معرض اشتباهات خود قرار ندارد، بنابراین ممکن است «مُهملاتی» در پرتو ظاهر شود.
+
+خلاصه: جستجوی پرتو تا زمانی ادامه پیدا می‌کند که تمام $k$ فرضیه ژتون پایانی را تولید کنند یا به حد بیشینه رمزگشایی T برسیم.
+
+
+<!---
+### Sampling
+
+We may not want the most likely sequence. Instead we can sample from the model distribution.
+
+However, sampling from the model distribution poses its own problem. Once a "bad" choice is sampled, the model is in a state it never faced during training, increasing the likelihood of continued "bad" evaluation. The algorithm can therefore get stuck in horrible feedback loops.
+--->
+
+### نمونه گیری
+ما ممکن است محتمل‌ترین دنباله را نخواهیم. در عوض می توانیم از توزیع مدل نمونه بگیریم.
+
+با این حال، نمونه‌گیری از توزیع مدل مشکل خاص خود را ایجاد می‌کند. هنگامی که یک انتخاب «بد» نمونه‌گیری شود، مدل در وضعیتی قرار می‌گیرد که در طول آموزش هرگز با آن روبرو نشده است و احتمال ادامه ارزیابی «بد» افزایش می‌یابد. بنابراین الگوریتم می‌تواند در حلقه‌های بازخورد وحشتناک گیر کند.
+
+<!--
+### Top-K Sampling
+
+A pure sampling technique where you truncate the distribution to the $k$ best and then renormalise and sample from the distribution.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Top_K_Sampling.png" width="60%"/><br>
+<b>Fig. 2</b>: Top K Sampling
+</center>
+--->
+### نمونه گیری کی-بالا
+یک تکنیک نمونه‌گیری خالص که در آن توزیع را به $k$ گزینه برتر محدود می‌کنیم، سپس آن را دوباره نرمال‌سازی می‌کنیم و از توزیع حاصل نمونه می‌گیریم.
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/Top_K_Sampling.png" width="60%"/><br>
+<b>تصویر دوم</b>: نمونه گیری کی-بالا
+</center>
+
+<!---
+#### Question: Why does Top-K sampling work so well?
+
+This technique works well because it essentially tries to prevent falling off of the manifold of good language when we sample something bad by only using the head of the distribution and chopping off the tail.
+--->
+
+### سوال: چرا نمونه گیری کی-بالا به خوبی کار می‌کند؟
+
+این تکنیک به این دلیل خوب کار می‌کند که اساساً با استفاده از فقط سر توزیع و حذف دم آن، سعی می‌کند مانع شود که با نمونه‌گیری یک انتخاب بد از منیفولد زبان خوب بیرون بیفتیم.
+<!---
+## Evaluating Text Generation
+
+Evaluating the language model requires simply log likelihood of the held-out data. However, it is difficult to evaluate text. Commonly word overlap metrics with a reference (BLEU, ROUGE etc.) are used, but they have their own issues.
+--->
+## ارزیابی تولید متن
+
+ارزیابی مدل زبان به سادگی نیاز به لگاریتم درست‌نمایی داده‌های بیرون نگه داشته (یا داده‌های تست) دارد. با این حال، ارزیابی متن دشوار است. معمولاً از معیارهای هم‌پوشانی کلمات با یک متن مرجع (مانند BLEU، ROUGE و ...) استفاده می‌شود، اما آنها مشکلات خاص خود را دارند.
+
+
+<!---
+## Sequence-To-Sequence Models
+
+
+### Conditional Language Models
+
+Conditional Language Models are not useful for generating random samples of English, but they are useful for generating a text given an input.
+
+Examples:
+
+- Given a French sentence, generate the English translation
+- Given a document, generate a summary
+- Given a dialogue, generate the next response
+- Given a question, generate the answer
+---->
+
+## مدل‌های ترتیب به ترتیب
+
+
+### مدل‌های زبان مشروط
+
+مدل های زبان مشروط برای تولید نمونه های تصادفی انگلیسی مفید نیستند، اما برای تولید متنی با یک ورودی، مفید هستند.
+
+مثال‌ها:
+
+- با توجه به یک جمله فرانسه، ترجمه انگلیسی را تولید کنیم.
+- با توجه به یک سند، خلاصه را تولید کنیم.
+- با توجه به یک دیالوگ، پاسخ بعدی را تولید کنیم.
+- با توجه به یک سوال، جواب را تولید کنیم. 
+
+<!---
+### Sequence-To-Sequence Models
+
+Generally, the input text is encoded. This resulting embedding is known as a "thought vector", which is then passed to the decoder to generate tokens word by word.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_Models.png" width="60%"/><br>
+<b>Fig. 3</b>: Thought Vector
+</center>
+---->
+### مدل‌های ترتیب به ترتیب
+
+به طور کلی، متن ورودی رمزگذاری می‌شود. جاسازی حاصل به عنوان "بردار فکر" شناخته می‌شود، که سپس به رمزگشا منتقل می‌شود تا ژتون‌ها را کلمه به کلمه تولید کند.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_Models.png" width="60%"/><br>
+<b>تصویر سوم</b>: بردار فکر
+</center>
+
+<!-----
+### Sequence-To-Sequence Transformer
+
+The sequence-to-sequence variation of transformers has 2 stacks:
+
+1. Encoder Stack – Self-attention isn't masked so every token in the input can look at every other token in the input
+
+2. Decoder Stack – Apart from using attention over itself, it also uses attention over the complete inputs
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_transformers.png" width="60%"/><br>
+<b>Fig. 4</b>: Sequence to Sequence Transformer
+</center>
+
+Every token in the output has direct connection to every previous token in the output, and also to every word in the input. The connections make the models very expressive and powerful. These transformers have made improvements in translation score over previous recurrent and convolutional models.
+--->
+### دگرگون ساز ترتیب به ترتیب
+
+نوع ترتیب به ترتیب دگرگون‌ساز دو پشته دارد:
+
+۱. پشته رمزگذار - خود-توجه‌ای پوشیده نیست، بنابراین هر ژتون ورودی می‌تواند به همه ژتون‌های دیگر ورودی نگاه کند
+
+۲. پشته رمزگشا - جدا از استفاده از «توجه» روی خودش، از «توجه» روی تمام ورودی‌ها هم استفاده می‌کند
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/s2s_transformers.png" width="60%"/><br>
+<b>تصویر چهارم</b>: دگرگون ساز ترتیب به ترتیب
+</center>
+
+هر ژتون در خروجی با هر ژتون قبلی در خروجی و همچنین با هر کلمه در ورودی ارتباط مستقیم دارد. اتصالات این مدل ها را بسیار رسا و قدرتمند می کند. این دگرگون‌سازها بهبودهایی در امتیاز ترجمه نسبت به مدل‌های بازگشتی و کانولوشنی ایجاد کردند.
+
+<!----
+## [Back-translation](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=3811s)
+
+When training these models, we typically rely on large amounts of labelled text. A good source of data is from European Parliament proceedings - the text is manually translated into different languages which we then can use as inputs and outputs of the model.
+
+--->
+## [ترجمه مجدد](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=3811s)
+هنگام آموزش این مدل ها، ما معمولاً به مقدار زیادی متن برچسب خورده وابسته‌ایم. متن اقدامات پارلمان اروپا منبع خوبی از دیتا است - این متن به صورت دستی به زبان‌های مختلفی ترجمه شده است که می‌توانیم به عنوان ورودی‌ها و خروجی‌‌های مدل استفاده کنیم.
+<!---
+### Issues
+
+- Not all languages are represented in the European parliament, meaning that we will not get translation pair for all languages we might be interested in. How do we find text for training in a language we can't necessarily get the data for?
+- Since models like transformers do much better with more data, how do we use monolingual text efficiently, *i.e.* no input / output pairs?
+
+Assume we want to train a model to translate German into English. The idea of back-translation is to first train a reverse model of English to German
+
+- Using some limited bi-text we can acquire same sentences in 2 different languages
+- Once we have an English to German model, translate a lot of monolingual words from English to German.
+
+Finally, train the German to English model using the German words that have been 'back-translated' in the previous step. We note that:
+
+- It doesn't matter how good the reverse model is - we might have noisy German translations but end up translating to clean English.
+- We need to learn to understand English well beyond the data of English / German pairs (already translated) - use large amounts of monolingual English
+--->
+
+### مسائل
+
+- همه زبانها در پارلمان اروپا نمایندگی ندارند، به این معنی که الزاماً برای همه زبان هایی که ممکن است به آنها علاقه داشته باشیم جفت ترجمه نخواهیم داشت. چگونه متنی را برای آموزش به زبانی پیدا کنیم که لزوماً نتوانیم داده ها را برای آن بدست آوریم؟
+
+- از آنجا که مدل هایی مانند دگرگون‌سازها با داده های بیشتر عملکرد بهتری دارند، چگونه می‌توان از متن تک زبانه به طور کارآمد استفاده کرد ، * یعنی * بدون جفت ورودی / خروجی؟
+
+فرض کنید می‌خواهیم مدلی را آموزش دهیم که آلمانی را به انگلیسی ترجمه کند. ایده ترجمه مجدد این است که ابتدا یک مدل معکوس انگلیسی به آلمانی آموزش دهیم
+
+- با استفاده از برخی  داده های دو متنی محدود می توانیم جملات مشابه را به دو زبان مختلف بدست آوریم
+- هنگامی که ما مدل انگلیسی به آلمانی را داریم، بسیاری از کلمات یک زبانه را از انگلیسی به آلمانی ترجمه کنید.
+
+در آخر، مدل آلمانی به انگلیسی را با استفاده از کلمات آلمانی‌ای که در مرحله قبلی «ترجمه مجدد» شده‌اند، آموزش می‌دهیم. توجه داریم که:
+
+- مهم نیست که مدل معکوس چقدر خوب باشد - ما ممکن است ترجمه های آلمانی نویزی داشته باشیم اما در نهایت ترجمه دقیق انگلیسی داشته باشیم.
+- ما باید یاد بگیریم که انگلیسی را فراتر از داده های جفت های انگلیسی / آلمانی (قبلاً ترجمه شده) درک کنیم - از مقادیر زیادی [متن] انگلیسی تک زبانه استفاده کنید
+
+<!----
+### Iterated Back-translation
+
+- We can iterate the procedure of back-translation in order to generate even more bi-text data and reach much better performance - just keep training using monolingual data.
+- Helps a lot when not a lot of parallel data
+--->
+### ترجمه مجدد تکرار شده
+
+- ما می توانیم روش ترجمه مجدد را تکرار کنیم تا حتی داده های دو متنی بیشتری تولید کنیم و عملکرد بسیار بهتری داشته باشیم - فقط با استفاده از داده های تک زبانه آموزش را ادامه دهید
+- وقتی داده موازی زیادی نباشد، خیلی کمک می کند
+<!----
+## Massive multilingual MT
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-language-mt.png" width="60%"/><br>
+<b>Fig. 5</b>: Multilingual MT
+</center>
+
+- Instead of trying to learn a translation from one language to another, try to build a neural net to learn multiple language translations.
+- Model is learning some general language-independent information.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-mt-results.gif" width="60%"/><br>
+<b>Fig. 6</b>: Multilingual NN Results
+</center>
+
+Great results especially if we want to train a model to translate to a language that does not have a lot of available data for us (low resource language).
+--->
+## ترجمه ماشینی چند زبانه عظیم
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/multi-language-mt.png" width="60%"/><br>
+<b>تصویر پنجم</b>: ترجمه ماشینی چند زبانه
+</center>
+
+<!----
+## Unsupervised Learning for NLP
+
+There are huge amounts of text without any labels and little of supervised data. How much can we learn about the language by just reading unlabelled text?
+---->
+## یادگیری بدون نظارت پردازش زبان‌های طبیعی
+
+مقادیر زیادی از متن بدون هیچ برچسب وجود دارد و داده های نظارت شده کمی وجود دارد. چقدر می توان فقط با خواندن متن بدون برچسب در مورد زبان یاد گرفت؟
+
+<!---
+### `word2vec`
+
+Intuition - if words appear close together in the text, they are likely to be related, so we hope that by just looking at unlabelled English text, we can learn what they mean.
+
+- Goal is to learn vector space representations for words (learn embeddings)
+
+Pretraining task - mask some word and use neighbouring words to fill in the blanks.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-masking.gif" width="60%"/><br>
+<b>Fig. 7</b>: word2vec masking visual
+</center>
+
+For instance, here, the idea is that "horned" and "silver-haired" are more likely to appear in the context of "unicorn" than some other animal.
+
+Take the words and apply a linear projection
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-embeddings.png" width="60%"/><br>
+<b>Fig. 8</b>:  word2vec embeddings
+</center>
+
+Want to know
+
+$$
+p(\texttt{unicorn} \mid \texttt{These silver-haired ??? were previously unknown})
+$$
+
+$$
+p(x_n \mid x_{-n}) = \text{softmax}(\text{E}f(x_{-n}))
+$$
+
+Word embeddings hold some structure
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/embeddings-structure.png" width="60%"/><br>
+<b>Fig. 9</b>: Embedding structure example
+</center>
+
+- The idea is if we take the embedding for "king" after training and add the embedding for "female" we will get an embedding very close to that of "queen"
+- Shows some meaningful differences between vectors
+--->
+### `word2vec`
+
+شهود - اگر کلمات در متن نزدیک به هم ظاهر شوند، احتمالاً با هم مرتبط هستند، بنابراین امیدواریم که فقط با نگاه کردن به متن انگلیسی بدون برچسب، معنی آنها را یاد بگیریم.
+
+- هدف، یادگیری بازنمایی کلمات در فضای برداری است (یادگیری جاسازی‌ها)
+
+وظیفه پیش آموزش - یک کلمه را بپوشانید و از کلمات همسایه برای پر کردن جای خالی استفاده کنید.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-masking.gif" width="60%"/><br>
+<b>تصویر هفتم</b>: word2vec پوشاندن بصری
+</center>
+
+
+
+
+به عنوان مثال، در اینجا ایده این است که «شاخدار» و «مو نقره‌ای» به احتمال بیشتری در زمینه «تک‌شاخ» ظاهر می‌شوند تا در زمینه حیوانی دیگر.
+
+کلمات را بگیرید و یک برون‌فکنی خطی اعمال کنید
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/word2vec-embeddings.png" width="60%"/><br>
+<b>تصویر هشتم</b>:  word2vec جاسازی ها
+</center>
+
+می خواهیم بدانیم
+
+$$
+p(\texttt{unicorn} \mid \texttt{These silver-haired ??? were previously unknown})
+$$
+
+$$
+p(x_n \mid x_{-n}) = \text{softmax}(\text{E}f(x_{-n}))
+$$
+
+جاسازی کلمه‌ها دارای ساختار است.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-2/embeddings-structure.png" width="60%"/><br>
+<b>تصویر نهم</b>: مثالی برای ساختار جاسازی
+</center>
+
+- ایده این است که اگر ما جاسازی را برای "پادشاه" پس از آموزش بگیریم و جاسازی را برای "زن" اضافه کنیم ، یک جاسازی بسیار نزدیک به "ملکه" بدست خواهیم آورد
+- برخی اختلافات معنی دار بین بردارها را نشان می دهد
+
+<!---
+#### Question: Are the word representation dependent or independent of context?
+
+Independent and have no idea how they relate to other words
+
+
+#### Question: What would be an example of a situation that this model would struggle in?
+
+Interpretation of words depends strongly on context. So in the instance of ambiguous words - words that may have multiple meanings - the model will struggle since the embeddings vectors won't capture the context needed to correctly understand the word.
+
+-->
+### سوال: آیا بازنمایی کلمات مستقل از زمینه است یا وابسته به زمینه؟
+
+مستقل هستند و هیچ اطلاعی از نحوه ارتباطشان با کلمات دیگر ندارند
+
+
+### سوال: نمونه ای از وضعیتی که این مدل در آن گیر می کند چه خواهد بود؟
+
+تفسیر کلمات بستگی زیادی به زمینه دارد. بنابراین در مورد کلمات مبهم - کلماتی که ممکن است چندین معنی داشته باشند - این مدل با مشکل روبرو خواهد شد زیرا بردارهای جاسازی، زمینه مورد نیاز برای درک صحیح کلمه را در بر نمی‌گیرند.
+
+
+<!--- 
+### GPT
+
+To add context, we can train a conditional language model. Then given this language model, which predicts a word at every time step, replace each output of model with some other feature.
+
+- Pretraining - predict next word
+- Fine-tuning - change to a specific task. Examples:
+  - Predict whether noun or adjective
+  - Given some text comprising an Amazon review, predict the sentiment score for the review
+
+This approach is good because we can reuse the model. We pretrain one large model and can fine tune to other tasks.
+--->
+### GPT
+
+برای اضافه کردن زمینه به مدل، می توانیم یک مدل زبان مشروط را آموزش دهیم. سپس با توجه به این مدل زبان، که کلمه را در هر مرحله پیش بینی می کند، هر یک از خروجی های مدل را با ویژگی دیگری جایگزین کنید.
+
+- پیش آموزش- کلمه بعدی را پیش بینی کنید
+- تنظیم دقیق - تغییر به یک کار خاص. مثال ها:
+  - پیش بینی کنید اسم است و یا صفت
+  - با توجه به متنی که یک نقد در سایت آمازون است، امتیاز احساسات آن نقد را پیش‌بینی کنید
+
+این روش خوب است زیرا ما می‌توانیم مدل را مجدداً استفاده کنیم. ما یک مدل بزرگ را پیش آموزش می‌دهیم و می‌توانیم آن را برای کارهای دیگر تنظیم دقیق کنیم.
+<!---
+### ELMo
+
+GPT only considers leftward context, which means the model can't depend on any future words - this limits what the model can do quite a lot.
+
+Here the approach is to train _two_ language models
+
+- One on the text left to right
+- One on the text right to left
+- Concatenate the output of the two models in order to get the word representation. Now can condition on both the rightward and leftward context.
+
+This is still a "shallow" combination, and we want some more complex interaction between the left and right context.
+--->
+### ELMo
+
+GPT فقط زمینه سمت چپ را در نظر می گیرد، این بدان معناست که مدل نمی تواند به هیچ کلمه ای در آینده وابسته باشد - این کاری را که مدل می تواند انجام دهد بسیار محدود می کند.
+
+در اینجا رویکرد آموزش دو مدل زبان است
+
+- یکی در متن چپ به راست
+- یکی در متن راست به چپ
+- برای به دست آوردن بازنمایی کلمه، خروجی دو مدل را به هم می‌چسبانیم. اکنون می‌توان هم به زمینه سمت راست و هم به زمینه سمت چپ شرط گذاشت.
+
+این هنوز یک ترکیب «کم عمق» است و ما می خواهیم تعامل پیچیده تری بین زمینه چپ و راست داشته باشیم.
+
+<!---
+### BERT
+
+BERT is similar to word2vec in the sense that we also have a fill-in-a-blank task. However, in word2vec we had linear projections, while in BERT there is a large transformer that is able to look at more context. To train, we mask 15% of the tokens and try to predict the blank.
+
+Can scale up BERT (RoBERTa):
+
+- Simplify BERT pre-training objective
+- Scale up the batch size
+- Train on large amounts of GPUs
+- Train on even more text
+
+Even larger improvements on top of BERT performance - on question answering task performance is superhuman now.
+--->
+
+### BERT
+
+
+مدل BERT از این نظر شبیه word2vec است که در آن هم یک وظیفه «پر کردن جای خالی» داریم. با این حال، در word2vec ما برون‌فکنی‌های خطی داشتیم، در حالی که در BERT یک دگرگون‌ساز بزرگ وجود دارد که می‌تواند زمینه بیشتری را مشاهده کند. برای آموزش، ما ۱۵٪ از ژتون‌ها را می‌پوشانیم و سعی می‌کنیم جای خالی را پیش‌بینی کنیم.
+
+می‌توان BERT را در مقیاس بزرگتر به کار برد (RoBERTa):
+
+- هدف پیش آموزش BERT را ساده کنید
+- مقیاس اندازه دسته را افزایش دهید
+- روی مقادیر زیادی GPU آموزش دهید
+- حتی روی متن بیشتر آموزش دهید
+
+حتی پیشرفت‌های بزرگتری روی عملکرد BERT - امروزه عملکرد در وظیفه «سوال و پاسخ» فوق انسانی است.
+
+<!---
+## [Pre-training for NLP](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=4963s)
+
+Let us take a quick look at different self-supervised pre training approaches that have been researched for NLP.
+
+- XLNet:
+
+  Instead of predicting all the masked tokens conditionally independently, XLNet predicts masked tokens auto-regressively in random order
+
+- SpanBERT
+
+   Mask spans (sequence of consecutive words) instead of tokens
+
+- ELECTRA:
+
+  Rather than masking words we substitute tokens with similar ones.  Then we solve a binary classification problem by trying to predict whether the tokens have been substituted or not.
+
+- ALBERT:
+
+  A Lite Bert: We modify BERT and make it lighter by tying the weights across layers. This reduces the parameters of the model and the computations involved. Interestingly, the authors of ALBERT did not have to compromise much on accuracy.
+
+- XLM:
+
+  Multilingual BERT: Instead of feeding such English text, we feed in text from multiple languages. As expected, it learned cross lingual connections better.
+
+The key takeaways from the different models mentioned above are
+
+- Lot of different pre-training objectives work well!
+
+- Crucial to model deep, bidirectional interactions between words
+
+- Large gains from scaling up pre-training, with no clear limits yet
+
+
+Most of the models discussed above are engineered towards solving the text classification problem. However, in order to solve text generation problem, where we generate output sequentially much like the `seq2seq` model, we need a slightly different approach to pre training.
+---->
+## [پیش آموزش برای پردازش زبان طبیعی](https://www.youtube.com/watch?v=6D4EWKJgNn0&t=4963s)
+
+بیایید نگاهی سریع به رویکردهای مختلف پیش آموزش خودنظارتی بیندازیم که برای پردازش زبان طبیعی بررسی شده‌اند.
+
+- XLNet: 
+   به جای پیشبینی تمام ژتون‌های پوشیده شده به صورت مشروط مستقل، XLNet به صورت خود همبسته به ترتیب تصادفی ژتون‌های پوشیده شده را پیشبینی می‌کند.
+
+- SpanBERT
+    پوشاندن اسپن‌ها (توالی کلمات متوالی) به جای ژتون‌ها
+- ELECTRA:
+    ما به جای پوشاندن کلمات، ژتون‌ها را با ژتون‌های مشابه جایگزین می‌کنیم. سپس با تلاش برای پیش‌بینی اینکه آیا ژتون‌ها جایگزین شده‌اند یا نه، یک مسئله طبقه‌بندی باینری را حل می‌کنیم.
+- ALBERT:
+   مدل سبک‌تر BERT: ما BERT را اصلاح می‌کنیم و با مشترک کردن وزن‌ها بین لایه‌ها، آن را سبک‌تر می‌کنیم. این کار پارامترهای مدل و محاسبات مربوطه را کاهش می‌دهد. جالب اینجاست که نویسندگان ALBERT مجبور نبودند از «دقت» چندان چشم‌پوشی کنند.
+
+- XLM:
+
+
+   مدل چند زبانه BERT: به جای تغذیه فقط متن انگلیسی، متن‌هایی از چند زبان به مدل می‌دهیم. همانطور که انتظار می‌رفت، این مدل ارتباطات بین زبانی را بهتر یاد گرفت.
+   
+مهمترین نقاط کلیدی مدلهای ذکر شده در بالا عبارتند از:
+
+- اهداف پیش آموزش مختلف زیادی به خوبی کار می‌کنند!
+
+- مدل‌سازی تعاملات عمیق و دوطرفه بین کلمات حیاتی است
+
+- دستاورد های بزرگی از پیش آموزش در مقیاس بزرگ به دست می‌آید، هنوز بدون هیچ محدودیت مشخصی
+
+
+بیشتر مدل‌هایی که در بالا بحث شد برای حل مسئله طبقه‌بندی متن مهندسی شده‌اند. با این حال، برای حل مسئله تولید متن، جایی که ما مانند مدل `seq2seq` خروجی را به صورت ترتیبی تولید می‌کنیم، به رویکرد کمی متفاوتی برای پیش آموزش نیاز داریم.
+
+   
+<!----
+#### Pre-training for Conditional Generation: BART and T5
+
+BART: pre-training `seq2seq` models by de-noising text
+
+In BART, for pretraining we take a sentence and corrupt it by masking tokens randomly. Instead of predicting the masking tokens (like in the BERT objective), we feed the entire corrupted sequence and try to predict the entire correct sequence.
+
+This `seq2seq` pretraining approach give us flexibility in designing our corruption schemes. We can shuffle the sentences, remove phrases, introduce new phrases, etc.
+
+BART was able to match RoBERTa on SQUAD and GLUE tasks. However, it was the new SOTA on summarization, dialogue and abstractive QA datasets. These results reinforce our motivation for BART, being better at text generation tasks than BERT/RoBERTa.
+---->
+### پیش آموزش برای تولید مشروط: BART و T5
+
+مدل BART: پیش آموزش مدل‌های `seq2seq` با نویززدایی از متن
+
+در مدل BART، برای پیش آموزش، ما یک جمله را می‌گیریم و آن را با پوشاندن تصادفی ژتون‌ها خراب می‌کنیم. به جای پیش‌بینی ژتون‌های پوشیده شده (مانند هدف BERT)، کل توالی خراب‌شده را به مدل می‌دهیم و سعی می‌کنیم کل توالی صحیح را پیش‌بینی کنیم.
+
+این رویکرد پیش آموزش `seq2seq` به ما در طراحی روش‌های خراب کردن متن انعطاف‌پذیری می‌دهد. ما می‌توانیم جملات را بُر بزنیم، عبارات را حذف کنیم، عبارات جدیدی معرفی کنیم، و غیره.
+
+مدل BART توانست در وظایف SQUAD و GLUE با RoBERTa برابری کند. افزون بر این، در مجموعه داده‌های خلاصه‌سازی، گفتگو و پرسش و پاسخ انتزاعی به SOTA جدید تبدیل شد. این نتایج انگیزه ما برای BART را تقویت می‌کند، یعنی بهتر بودن در کارهای تولید متن نسبت به BERT/RoBERTa.
+
+
+<!----
+### Some open questions in NLP
+
+- How should we integrate world knowledge
+- How do we model long documents?  (BERT-based models typically use 512 tokens)
+- How do we best do multi-task learning?
+- Can we fine-tune with less data?
+- Are these models really understanding language?
+---->
+
+### بعضی از سوالات باز در پردازش زبان طبیعی
+
+- چگونه باید دانش مربوط به جهان را در مدل ادغام کنیم؟
+- چگونه اسناد طولانی را مدل‌سازی کنیم؟ (مدل‌های بر پایه BERT معمولاً از ۵۱۲ ژتون استفاده می‌کنند)
+- چگونه یادگیری چند وظیفه ای را به بهترین وجه انجام می دهیم؟
+- آیا می توانیم با داده کمتری تنظیم دقیق کنیم؟
+- آیا این مدل ها واقعاً زبان را درک می کنند؟
+<!----
+
+### Summary
+
+- Training models on lots of data beats explicitly modelling linguistic structure.
+
+From a bias variance perspective, Transformers are low bias (very expressive) models. Feeding these models lots of text is better than explicitly modelling linguistic structure (high bias). Architectures should be compressing sequences through bottlenecks
+
+- Models can learn a lot about language by predicting words in unlabelled text. This turns out to be a great unsupervised learning objective. Fine tuning for specific tasks is then easy
+
+- Bidirectional context is crucial
+---->
+### خلاصه
+
+- آموزش مدل‌ها روی حجم زیادی از داده، بر مدل‌سازی صریح ساختار زبانی برتری دارد.
+
+از دیدگاه اریبی-واریانس، دگرگون‌سازها مدل‌هایی با اریبی کم (بسیار رسا) هستند. تغذیه این مدل‌ها با مقدار زیادی متن بهتر از مدل‌سازی صریح ساختار زبانی (اریبی زیاد) است. معماری‌ها باید فشرده‌سازی توالی‌ها را از طریق گلوگاه‌ها انجام دهند.
+
+- مدل‌ها می‌توانند با پیش‌بینی کلمات در متن بدون برچسب، چیزهای زیادی در مورد زبان یاد بگیرند. معلوم شده است که این یک هدف یادگیری بدون نظارت عالی است. تنظیم دقیق برای کارهای خاص پس از آن آسان است
+
+- متن دو زبانه بسیار مهم است
+
+<!----
+### Additional Insights from questions after class:
+
+What are some ways to quantify 'understanding language’? How do we know that these models are really understanding language?
+
+"The trophy did not fit into the suitcase because it was too big”: Resolving the reference of ‘it’ in this sentence is tricky for machines. Humans are good at this task. There is a dataset consisting of such difficult examples and humans achieved 95% performance on that dataset. Computer programs were able to achieve only around 60% before the revolution brought about by Transformers. The modern Transformer models are able to achieve more than 90% on that dataset. This suggests that these models are not just memorizing / exploiting the data but learning concepts and objects through the statistical patterns in the data.
+
+Moreover, BERT and RoBERTa achieve superhuman performance on SQUAD and Glue. The textual summaries generated by BART look very real to humans (high BLEU scores). These facts are evidence that the models do understand language in some way.
+
+--->
+
+### Additional insights from questions after class
+
+What are some ways to quantify "understanding language"? How do we know that these models really understand language?
+
+"The trophy did not fit into the suitcase because it was too big": resolving the reference of "it" in this sentence is tricky for machines, while humans are good at this task. There is a dataset consisting of such difficult examples, on which humans achieve 95% performance. Before the revolution brought about by Transformers, computer programs achieved only around 60%. Modern Transformer models achieve more than 90% on that dataset. This suggests that these models are not just memorizing or exploiting the data but are learning concepts and objects through the statistical patterns in the data.
+
+Moreover, BERT and RoBERTa achieve superhuman performance on SQuAD and GLUE, and the textual summaries generated by BART look very real to humans (high BLEU scores). These facts are evidence that the models do understand language in some way.
+
+<!---
+#### Grounded Language
+
+Interestingly, the lecturer (Mike Lewis, Research Scientist, FAIR) is working on a concept called ‘Grounded Language’. The aim of that field of research is to build conversational agents that are able to chit-chat or negotiate. Chit-chatting and negotiating are abstract tasks with unclear objectives as compared to text classification or text summarization.
+-->
+#### Grounded Language
+
+Interestingly, the lecturer (Mike Lewis, Research Scientist, FAIR) is working on a concept called "Grounded Language". The aim of that field of research is to build conversational agents that are able to chit-chat or negotiate. Chit-chatting and negotiating are abstract tasks with unclear objectives, compared to text classification or text summarization.
+
+<!----
+#### Can we evaluate whether the model already has world knowledge?
+
+‘World Knowledge’ is an abstract concept. We can test models, at the very basic level, for their world knowledge by asking them simple questions about the concepts we are interested in.  Models like BERT, RoBERTa and T5 have billions of parameters. Considering these models are trained on a huge corpus of informational text like Wikipedia, they would have memorized facts using their parameters and would be able to answer our questions. Additionally, we can also think of conducting the same knowledge test before and after fine-tuning a model on some task. This would give us a sense of how much information the model has ‘forgotten’.
+----> 
+#### Can we evaluate whether the model already has world knowledge?
+
+"World knowledge" is an abstract concept. At a very basic level, we can test models for their world knowledge by asking them simple questions about the concepts we are interested in. Models like BERT, RoBERTa and T5 have hundreds of millions to billions of parameters. Since these models are trained on a huge corpus of informational text like Wikipedia, they would have memorized facts in their parameters and would be able to answer our questions. We can also run the same knowledge test before and after fine-tuning a model on some task, which would give us a sense of how much information the model has "forgotten".
diff --git a/docs/fa/week12/12-3.md b/docs/fa/week12/12-3.md
new file mode 100644
index 000000000..0f78eda86
--- /dev/null
+++ b/docs/fa/week12/12-3.md
@@ -0,0 +1,862 @@
+---
+lang: fa
+lang-ref: ch.12-3
+title: Attention and the Transformer
+lecturer: Alfredo Canziani
+authors: Francesca Guiso, Annika Brundyn, Noah Kasmanoff, and Luke Martin
+date: 21 Apr 2020
+translator: Tayeb Pourebrahim
+---
+<!---
+
+## [Attention](https://www.youtube.com/watch?v=f01J0Dri-6k&t=69s)
+
+We introduce the concept of attention before talking about the Transformer architecture. There are two main types of attention: self attention *vs.* cross attention, within those categories, we can have hard *vs.* soft attention.
+
+As we will later see, transformers are made up of attention modules, which are mappings between sets, rather than sequences, which means we do not impose an ordering to our inputs/outputs.
+--->
+## [Attention](https://www.youtube.com/watch?v=f01J0Dri-6k&t=69s)
+
+We introduce the concept of attention before talking about the Transformer architecture. There are two main types of attention: self-attention *vs.* cross-attention. Within each category, we can have hard *vs.* soft attention.
+
+As we will see later, transformers are made up of attention modules, which are mappings between sets rather than sequences. This means we do not impose an ordering on our inputs or outputs.
+
+<!---
+### Self Attention (I)
+
+Consider a set of $t$ input $\boldsymbol{x}$'s:
+
+$$
+\lbrace\boldsymbol{x}_i\rbrace_{i=1}^t = \lbrace\boldsymbol{x}_1,\cdots,\boldsymbol{x}_t\rbrace
+$$
+
+where each $\boldsymbol{x}_i$ is an $n$-dimensional vector. Since the set has $t$ elements, each of which belongs to $\mathbb{R}^n$, we can represent the set as a matrix $\boldsymbol{X}\in\mathbb{R}^{n \times t}$.
+
+With self-attention, the hidden representation $h$ is a linear combination of the inputs:
+
+$$
+\boldsymbol{h} = \alpha_1 \boldsymbol{x}_1 + \alpha_2 \boldsymbol{x}_2 + \cdots +  \alpha_t \boldsymbol{x}_t
+$$
+
+Using the matrix representation described above, we can write the hidden layer as the matrix product:
+
+$$
+\boldsymbol{h} = \boldsymbol{X} \boldsymbol{a}
+$$
+
+where $\boldsymbol{a} \in \mathbb{R}^t$ is a column vector with components $\alpha_i$.
+
+Note that this differs from the hidden representation we have seen so far, where the inputs are multiplied by a matrix of weights.
+
+Depending on the constraints we impose on the vector $\vect{a}$, we can achieve hard or soft attention.
+--->
+
+### Self Attention (I)
+
+Consider a set of $t$ inputs $\boldsymbol{x}$:
+
+$$
+\lbrace\boldsymbol{x}_i\rbrace_{i=1}^t = \lbrace\boldsymbol{x}_1,\cdots,\boldsymbol{x}_t\rbrace
+$$
+
+where each $\boldsymbol{x}_i$ is an $n$-dimensional vector. Since the set has $t$ elements, each of which belongs to $\mathbb{R}^n$, we can represent the set as a matrix $\boldsymbol{X}\in\mathbb{R}^{n \times t}$ whose columns are the $\boldsymbol{x}_i$.
+
+With self-attention, the hidden representation $\boldsymbol{h}$ is a linear combination of the inputs:
+
+$$
+\boldsymbol{h} = \alpha_1 \boldsymbol{x}_1 + \alpha_2 \boldsymbol{x}_2 + \cdots +  \alpha_t \boldsymbol{x}_t
+$$
+
+Using the matrix representation described above, we can write the hidden representation as the matrix product:
+
+$$
+\boldsymbol{h} = \boldsymbol{X} \boldsymbol{a}
+$$
+
+where $\boldsymbol{a} \in \mathbb{R}^t$ is a column vector with components $\alpha_i$.
+
+Note that this differs from the hidden representations we have seen so far, where the inputs are multiplied by a matrix of weights.
+
+Depending on the constraints we impose on the vector $\vect{a}$, we can achieve hard or soft attention.
+<!----
+#### Hard Attention
+
+With hard-attention, we impose the following constraint on the alphas: $\Vert\vect{a}\Vert_0 = 1$. This means $\vect{a}$ is a one-hot vector. Therefore, all but one of the coefficients in the linear combination of the inputs equals zero, and the hidden representation reduces to the input $\boldsymbol{x}_i$ corresponding to the element $\alpha_i=1$.
+--->
+#### Hard Attention
+
+With hard attention, we impose the following constraint on the alphas: $\Vert\vect{a}\Vert_0 = 1$. This means $\vect{a}$ has a single non-zero component and, with that component equal to 1, is a one-hot vector. Therefore, all but one of the coefficients in the linear combination of the inputs equal zero, and the hidden representation reduces to the input $\boldsymbol{x}_i$ corresponding to the element $\alpha_i=1$.
+
+<!---
+#### Soft Attention
+
+With soft attention, we impose that $\Vert\vect{a}\Vert_1 = 1$. The hidden representation is a linear combination of the inputs where the coefficients sum up to 1.
+---->
+#### Soft Attention
+
+With soft attention, we impose that $\Vert\vect{a}\Vert_1 = 1$. The hidden representation is a linear combination of the inputs whose coefficients sum to 1.
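+
+As a small worked example of the two constraints (the numbers below are made up for illustration and do not come from the lecture), take $n = 2$ and $t = 3$, with the inputs stored as the columns of $\boldsymbol{X}$:
+
+$$
+\boldsymbol{X} = \begin{bmatrix} 1 & 0 & 2 \\ 0 & 1 & 2 \end{bmatrix}
+$$
+
+A hard attention vector picks out exactly one input, while a soft attention vector with non-negative components blends all of them:
+
+$$
+\begin{aligned}
+\vect{a}_\text{hard} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} &\;\Rightarrow\; \boldsymbol{h} = \boldsymbol{X}\vect{a}_\text{hard} = \begin{bmatrix} 2 \\ 2 \end{bmatrix} = \boldsymbol{x}_3 \\
+\vect{a}_\text{soft} = \begin{bmatrix} 1/2 \\ 1/4 \\ 1/4 \end{bmatrix} &\;\Rightarrow\; \boldsymbol{h} = \boldsymbol{X}\vect{a}_\text{soft} = \begin{bmatrix} 1 \\ 3/4 \end{bmatrix}
+\end{aligned}
+$$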
+
+
+
+<!---
+### Self Attention (II)
+
+Where do the $\alpha_i$ come from?
+
+We obtain the vector $\vect{a} \in \mathbb{R}^t$ in the following way:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\boldsymbol{X}^{\top}\boldsymbol{x})
+$$
+
+Where $\beta$ represents the inverse temperature parameter of the $\text{soft(arg)max}(\cdot)$. $\boldsymbol{X}^{\top}\in\mathbb{R}^{t \times n}$ is the transposed matrix representation of the set $\lbrace\boldsymbol{x}_i \rbrace\_{i=1}^t$, and $\boldsymbol{x}$ represents a generic $\boldsymbol{x}_i$ from the set. Note that the $j$-th row of $X^{\top}$ corresponds to an element $\boldsymbol{x}_j\in\mathbb{R}^n$, so the $j$-th row of $\boldsymbol{X}^{\top}\boldsymbol{x}$ is the scalar product of $\boldsymbol{x}_j$ with each $\boldsymbol{x}_i$ in $\lbrace \boldsymbol{x}_i \rbrace\_{i=1}^t$.
+
+The components of the vector $\vect{a}$ are also called "scores" because the scalar product between two vectors tells us how aligned or similar two vectors are. Therefore, the elements of $\vect{a}$ provide information about the similarity of the overall set to a particular $\boldsymbol{x}_i$.
+
+The square brackets represent an optional argument. Note that if $\arg\max(\cdot)$ is used, we get a one-hot vector of alphas, resulting in hard attention. On the other hand, $\text{soft(arg)max}(\cdot)$ leads to soft attention. In each case, the components of the resulting vector $\vect{a}$ sum to 1.
+
+Generating $\vect{a}$ this way gives a set of them, one for each $\boldsymbol{x}_i$. Moreover, each $\vect{a}_i \in \mathbb{R}^t$ so we can stack the alphas in a matrix $\boldsymbol{A}\in \mathbb{R}^{t \times t}$.
+
+Since each hidden state is a linear combination of the inputs $\boldsymbol{X}$ and a vector $\vect{a}$, we obtain a set of $t$ hidden states, which we can stack into a matrix $\boldsymbol{H}\in \mathbb{R}^{n \times t}$.
+
+$$
+\boldsymbol{H}=\boldsymbol{XA}
+$$
+--->
+
+### Self Attention (II)
+
+Where do the $\alpha_i$ come from?
+
+We obtain the vector $\vect{a} \in \mathbb{R}^t$ in the following way:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\boldsymbol{X}^{\top}\boldsymbol{x})
+$$
+
+where $\beta$ is the inverse temperature parameter of the $\text{soft(arg)max}(\cdot)$. $\boldsymbol{X}^{\top}\in\mathbb{R}^{t \times n}$ is the transposed matrix representation of the set $\lbrace\boldsymbol{x}_i \rbrace\_{i=1}^t$, and $\boldsymbol{x}$ represents a generic $\boldsymbol{x}_i$ from the set. Note that the $j$-th row of $\boldsymbol{X}^{\top}$ corresponds to an element $\boldsymbol{x}_j\in\mathbb{R}^n$, so the $j$-th component of $\boldsymbol{X}^{\top}\boldsymbol{x}$ is the scalar product of $\boldsymbol{x}_j$ with $\boldsymbol{x}$.
+
+The components of the vector $\vect{a}$ are also called "scores" because the scalar product between two vectors tells us how aligned or similar the two vectors are. Therefore, the elements of $\vect{a}$ tell us how similar each element of the set is to a particular $\boldsymbol{x}_i$.
+
+The square brackets denote an optional part of the name. If $\arg\max(\cdot)$ is used, we get a one-hot vector of alphas, resulting in hard attention. On the other hand, $\text{soft(arg)max}(\cdot)$ leads to soft attention. In each case, the components of the resulting vector $\vect{a}$ sum to 1.
+
+Generating $\vect{a}$ this way gives a set of them, one for each $\boldsymbol{x}_i$. Moreover, each $\vect{a}_i \in \mathbb{R}^t$, so we can stack the alphas as the columns of a matrix $\boldsymbol{A}\in \mathbb{R}^{t \times t}$.
+
+Since each hidden state is a linear combination of the columns of $\boldsymbol{X}$ weighted by a vector $\vect{a}$, we obtain a set of $t$ hidden states, which we can stack into a matrix $\boldsymbol{H}\in \mathbb{R}^{n \times t}$:
+
+$$
+\boldsymbol{H}=\boldsymbol{XA}
+$$
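+
+To make the recipe concrete, here is a small worked example. The numbers are illustrative assumptions, not from the lecture, and the soft(arg)max is read in its usual form, with component $j$ equal to $\exp(\beta s\_j) / \sum\_k \exp(\beta s\_k)$. Take $\beta = 1$, inputs $(1, 0)^\top$, $(0, 1)^\top$ and $(2, 2)^\top$, and compute the attention vector for the first input:
+
+$$
+\begin{aligned}
+\boldsymbol{X}^{\top}\boldsymbol{x}_1 &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 2 \end{bmatrix} \\
+\vect{a}_1 &= \frac{1}{e^1 + e^0 + e^2} \begin{bmatrix} e^1 \\ e^0 \\ e^2 \end{bmatrix} \approx \begin{bmatrix} 0.245 \\ 0.090 \\ 0.665 \end{bmatrix} \\
+\boldsymbol{h}_1 &= \boldsymbol{X}\vect{a}_1 \approx \begin{bmatrix} 1.58 \\ 1.42 \end{bmatrix}
+\end{aligned}
+$$
+
+With $\arg\max(\cdot)$ instead, the attention vector would be the one-hot vector selecting the third input, the one with the largest score.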
+
+
+
+
+<!---
+## [Key-value store](https://www.youtube.com/watch?v=f01J0Dri-6k&t=1056s)
+
+A key-value store is a paradigm designed for storing (saving), retrieving (querying) and managing associative arrays (dictionaries / hash tables).
+
+For example, say we wanted to find a recipe to make lasagne. We have a recipe book and search for "lasagne" - this is the query. This query is checked against all possible keys in your dataset - in this case, this could be the titles of all the recipes in the book. We check how aligned the query is with each title to find the maximum matching score between the query and all the respective keys. If our output is the argmax function - we retrieve the single recipe with the highest score. Otherwise, if we use a soft argmax function, we would get a probability distribution and can retrieve in order from the most similar content to less and less relevant recipes matching the query.
+
+Basically, the query is the question. Given one query, we check this query against every key and retrieve all matching content.
+----->
+## [Key-value store](https://www.youtube.com/watch?v=f01J0Dri-6k&t=1056s)
+
+A key-value store is a paradigm designed for storing (saving), retrieving (querying) and managing associative arrays (dictionaries / hash tables).
+
+For example, say we want to find a recipe to make lasagne. We have a recipe book and search for "lasagne": this is the query. The query is checked against all possible keys in the dataset, in this case the titles of all the recipes in the book. We check how aligned the query is with each title to find the maximum matching score between the query and all the respective keys. If we use the argmax function, we retrieve the single recipe with the highest score. If we use a soft argmax function instead, we get a probability distribution and can retrieve recipes in order, from the most similar content to less and less relevant recipes.
+
+Basically, the query is the question. Given one query, we check it against every key and retrieve all matching content.
+
+<!----
+### Queries, keys and values
+
+$$
+\begin{aligned}
+\vect{q} &= \vect{W_q x} \\
+\vect{k} &= \vect{W_k x} \\
+\vect{v} &= \vect{W_v x}
+\end{aligned}
+$$
+
+Each of the vectors $\vect{q}, \vect{k}, \vect{v}$ can simply be viewed as rotations of the specific input $\vect{x}$. Where $\vect{q}$ is just $\vect{x}$ rotated by $\vect{W_q}$, $\vect{k}$ is just $\vect{x}$ rotated by $\vect{W_k}$ and similarly for $\vect{v}$. Note that this is the first time we are introducing "learnable" parameters. We also do not include any non-linearities since attention is completely based on orientation.
+
+In order to compare the query against all possible keys, $\vect{q}$ and $\vect{k}$ must have the same dimensionality, *i.e.* $\vect{q}, \vect{k} \in \mathbb{R}^{d'}$.
+
+However, $\vect{v}$ can be of any dimension. If we continue with our lasagne recipe example - we need the query to have the dimension as the keys, *i.e.* the titles of the different recipes that we're searching through. The dimension of the corresponding recipe retrieved, $\vect{v}$, can be arbitrarily long though. So we have that $\vect{v} \in \mathbb{R}^{d''}$.
+
+For simplicity, here we will make the assumption that everything has dimension $d$, i.e.
+
+$$
+d' = d'' = d
+$$
+
+So now we have a set of $\vect{x}$'s, a set of queries, a set of keys and a set of values. We can stack these sets into matrices each with $t$ columns since we stacked $t$ vectors; each vector has height $d$.
+
+$$
+\{ \vect{x}_i \}_{i=1}^t \rightsquigarrow \{ \vect{q}_i \}_{i=1}^t, \, \{ \vect{k}_i \}_{i=1}^t, \, \, \{ \vect{v}_i \}_{i=1}^t \rightsquigarrow \vect{Q}, \vect{K}, \vect{V} \in \mathbb{R}^{d \times t}
+$$
+
+We compare one query $\vect{q}$ against the matrix of all keys $\vect{K}$:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\vect{K}^{\top} \vect{q}) \in \mathbb{R}^t
+$$
+
+Then the hidden layer is going to be the linear combination of the columns of $\vect{V}$ weighted by the coefficients in $\vect{a}$:
+
+$$
+\vect{h} = \vect{V} \vect{a} \in \mathbb{R}^d
+$$
+
+Since we have $t$ queries, we'll get $t$ corresponding $\vect{a}$ weights and therefore a matrix $\vect{A}$ of dimension $t \times t$.
+
+$$
+\{ \vect{q}_i \}_{i=1}^t \rightsquigarrow \{ \vect{a}_i \}_{i=1}^t, \rightsquigarrow \vect{A} \in \mathbb{R}^{t \times t}
+$$
+
+Therefore in matrix notation we have:
+
+$$
+\vect{H} = \vect{VA} \in \mathbb{R}^{d \times t}
+$$
+
+As an aside, we typically set $\beta$ to:
+
+$$
+\beta = \frac{1}{\sqrt{d}}
+$$
+
+This is done to keep the temperature constant across different choices of dimension $d$ and so we divide by the square root of the number of dimensions $d$. (Think what is the length of the vector $\vect{1} \in \R^d$.)
+
+For implementation, we can speed up computation by stacking all the $\vect{W}$'s into one tall $\vect{W}$ and then calculate $\vect{q}, \vect{k}, \vect{v}$ in one go:
+
+$$
+\begin{bmatrix}
+\vect{q} \\
+\vect{k} \\
+\vect{v}
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q} \\
+\vect{W_k} \\
+\vect{W_v}
+\end{bmatrix} \vect{x} \in \mathbb{R}^{3d}
+$$
+
+There is also the concept of "heads". Above we have seen an example with one head but we could have multiple heads. For example, say we have $h$ heads, then we have $h$ $\vect{q}$'s, $h$ $\vect{k}$'s and $h$ $\vect{v}$'s and we end up with a vector in $\mathbb{R}^{3hd}$:
+
+$$
+\begin{bmatrix}
+\vect{q}^1 \\
+\vect{q}^2 \\
+\vdots \\
+\vect{q}^h \\
+\vect{k}^1 \\
+\vect{k}^2 \\
+\vdots \\
+\vect{k}^h \\
+\vect{v}^1 \\
+\vect{v}^2 \\
+\vdots \\
+\vect{v}^h
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q}^1 \\
+\vect{W_q}^2 \\
+\vdots \\
+\vect{W_q}^h \\
+\vect{W_k}^1 \\
+\vect{W_k}^2 \\
+\vdots \\
+\vect{W_k}^h \\
+\vect{W_v}^1 \\
+\vect{W_v}^2 \\
+\vdots \\
+\vect{W_v}^h
+\end{bmatrix} \vect{x} \in \R^{3hd}
+$$
+
+However, we can still transform the multi-headed values to have the original dimension $\R^d$ by using a $\vect{W_h} \in \mathbb{R}^{d \times hd}$. This is just one possible way to implement the key-value store.
+----->
+### Queries, keys and values
+
+
+$$
+\begin{aligned}
+\vect{q} &= \vect{W_q x} \\
+\vect{k} &= \vect{W_k x} \\
+\vect{v} &= \vect{W_v x}
+\end{aligned}
+$$
+
+Each of the vectors $\vect{q}, \vect{k}, \vect{v}$ can be viewed as a rotation of the specific input $\vect{x}$: $\vect{q}$ is just $\vect{x}$ rotated by $\vect{W_q}$, $\vect{k}$ is just $\vect{x}$ rotated by $\vect{W_k}$, and similarly for $\vect{v}$. Note that this is the first time we are introducing "learnable" parameters. We also do not include any non-linearities, since attention is completely based on orientation.
+
+In order to compare the query against all possible keys, $\vect{q}$ and $\vect{k}$ must have the same dimensionality, *i.e.* $\vect{q}, \vect{k} \in \mathbb{R}^{d'}$.
+
+However, $\vect{v}$ can be of any dimension. Continuing with our lasagne recipe example, the query needs to have the same dimension as the keys, *i.e.* the titles of the different recipes that we are searching through. The retrieved recipe itself, $\vect{v}$, can be arbitrarily long, so we have $\vect{v} \in \mathbb{R}^{d''}$.
+
+For simplicity, here we will assume that everything has dimension $d$, *i.e.*
+
+$$
+d' = d'' = d
+$$
+
+So now we have a set of $\vect{x}$'s, a set of queries, a set of keys and a set of values. We can stack each of these sets into a matrix with $t$ columns, since we stacked $t$ vectors; each vector has height $d$.
+
+
+$$
+\{ \vect{x}_i \}_{i=1}^t \rightsquigarrow \{ \vect{q}_i \}_{i=1}^t, \, \{ \vect{k}_i \}_{i=1}^t, \, \, \{ \vect{v}_i \}_{i=1}^t \rightsquigarrow \vect{Q}, \vect{K}, \vect{V} \in \mathbb{R}^{d \times t}
+$$
+
+We compare one query $\vect{q}$ against the matrix of all keys $\vect{K}$:
+
+$$
+\vect{a} = \text{[soft](arg)max}_{\beta} (\vect{K}^{\top} \vect{q}) \in \mathbb{R}^t
+$$
+
+Then the hidden representation is the linear combination of the columns of $\vect{V}$ weighted by the coefficients in $\vect{a}$:
+
+$$
+\vect{h} = \vect{V} \vect{a} \in \mathbb{R}^d
+$$
+
+
+Since we have $t$ queries, we get $t$ corresponding weight vectors $\vect{a}$, and therefore a matrix $\vect{A}$ of dimension $t \times t$.
+
+
+$$
+\{ \vect{q}_i \}_{i=1}^t \rightsquigarrow \{ \vect{a}_i \}_{i=1}^t, \rightsquigarrow \vect{A} \in \mathbb{R}^{t \times t}
+$$
+
+Therefore, in matrix notation we have:
+
+$$
+\vect{H} = \vect{VA} \in \mathbb{R}^{d \times t}
+$$
+
+As an aside, we typically set $\beta$ to:
+
+$$
+\beta = \frac{1}{\sqrt{d}}
+$$
+
+This keeps the temperature constant across different choices of the dimension $d$, which is why we divide by the square root of the number of dimensions $d$. (Think about what the length of the vector $\vect{1} \in \R^d$ is.)
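+
+One way to see where the $\sqrt{d}$ comes from is the following short calculation. It rests on an assumption that is not stated in the lecture: the components of $\vect{q}$ and $\vect{k}$ are independent with zero mean and unit variance.
+
+$$
+\begin{aligned}
+\Vert \vect{1} \Vert_2 &= \sqrt{\underbrace{1^2 + \cdots + 1^2}_{d \text{ terms}}} = \sqrt{d} \\
+\operatorname{Var}(\vect{k}^{\top}\vect{q}) &= \sum_{i=1}^{d} \operatorname{Var}(k_i q_i) = d \\
+\operatorname{Var}(\beta \, \vect{k}^{\top}\vect{q}) &= \beta^2 d = 1 \quad \text{for } \beta = \frac{1}{\sqrt{d}}
+\end{aligned}
+$$
+
+Under that assumption, the scores fed to the $\text{soft(arg)max}(\cdot)$ have the same spread regardless of $d$.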
+
+For implementation, we can speed up computation by stacking all the $\vect{W}$'s into one tall $\vect{W}$ and then computing $\vect{q}, \vect{k}, \vect{v}$ in one go:
+
+$$
+\begin{bmatrix}
+\vect{q} \\
+\vect{k} \\
+\vect{v}
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q} \\
+\vect{W_k} \\
+\vect{W_v}
+\end{bmatrix} \vect{x} \in \mathbb{R}^{3d}
+$$
+
+
+
+There is also the concept of "heads". Above we have seen an example with one head, but we could have multiple heads. For example, say we have $h$ heads; then we have $h$ $\vect{q}$'s, $h$ $\vect{k}$'s and $h$ $\vect{v}$'s, and we end up with a vector in $\mathbb{R}^{3hd}$:
+
+$$
+\begin{bmatrix}
+\vect{q}^1 \\
+\vect{q}^2 \\
+\vdots \\
+\vect{q}^h \\
+\vect{k}^1 \\
+\vect{k}^2 \\
+\vdots \\
+\vect{k}^h \\
+\vect{v}^1 \\
+\vect{v}^2 \\
+\vdots \\
+\vect{v}^h
+\end{bmatrix} =
+\begin{bmatrix}
+\vect{W_q}^1 \\
+\vect{W_q}^2 \\
+\vdots \\
+\vect{W_q}^h \\
+\vect{W_k}^1 \\
+\vect{W_k}^2 \\
+\vdots \\
+\vect{W_k}^h \\
+\vect{W_v}^1 \\
+\vect{W_v}^2 \\
+\vdots \\
+\vect{W_v}^h
+\end{bmatrix} \vect{x} \in \R^{3hd}
+$$
+
+
+
+However, we can still transform the multi-headed values back to the original dimension $\R^d$ by using a $\vect{W_h} \in \mathbb{R}^{d \times hd}$. This is just one possible way to implement the key-value store.
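+
+The shapes involved can be checked with the following sketch. The head index $j$ and the per-head outputs $\vect{h}^j$ are notation introduced here for illustration; the lecture only states the shape of $\vect{W_h}$.
+
+$$
+\begin{aligned}
+\vect{a}^j &= \text{[soft](arg)max}_{\beta} ({\vect{K}^j}^{\top} \vect{q}^j) \in \mathbb{R}^t, \quad j = 1, \cdots, h \\
+\vect{h}^j &= \vect{V}^j \vect{a}^j \in \mathbb{R}^d \\
+\vect{h} &= \vect{W_h} \begin{bmatrix} \vect{h}^1 \\ \vdots \\ \vect{h}^h \end{bmatrix} \in \mathbb{R}^d, \quad \text{since } (d \times hd)(hd \times 1) = d \times 1
+\end{aligned}
+$$
+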
+<!----
+## [The Transformer](https://www.youtube.com/watch?v=f01J0Dri-6k&t=2114s)
+
+Expanding on our knowledge of attention in particular, we now interpret the fundamental building blocks of the transformer. In particular, we will take a forward pass through a basic transformer, and see how attention is used in the standard encoder-decoder paradigm and compares to the sequential architectures of RNNs.
+
+
+### Encoder-Decoder Architecture
+
+We should be familiar with this terminology. It is shown most prominently during autoencoder demonstrations, and is prerequisite understanding up to this point. To summarize, an input is fed through an encoder and decoder which impose some sort of bottleneck on the data, forcing only the most important information through. This information is stored in the output of the encoder block, and can be used for a variety of unrelated tasks.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure1.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 1:</b> Two example diagrams of an autoencoder. The model on the left shows how an autoencoder can be designed with two affine transformations + activations, where the image on the right replaces this single "layer" with an arbitrary module of operations.
+</center>
+
+Our "attention" is drawn to the autoencoder layout as shown in the model on the right and will now take a look inside, in the context of transformers.
+
+
+### Encoder Module
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure2.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 2:</b> The transformer encoder, which accepts a set of inputs $\vect{x}$, and outputs a set of hidden representations $\vect{h}^\text{Enc}$.
+</center>
+
+The encoder module accepts a set of inputs, which are fed through the self-attention block and, in parallel, bypass it to reach the `Add, Norm` block. They are then passed through the 1D-convolution and another `Add, Norm` block in the same way, and finally output as the set of hidden representations. This set of hidden representations is then sent either through further encoder modules (*i.e.* more layers) or to the decoder. We shall now discuss these blocks in more detail.
+
+
+### Self-attention
+
+The self-attention model is a normal attention model. The query, key, and value are generated from the same item of the sequential input. In tasks that try to model sequential data, positional encodings are added prior to this input. The output of this block is the attention-weighted values. The self-attention block accepts a set of inputs, from $1, \cdots , t$, and outputs $1, \cdots, t$ attention weighted values which are fed through the rest of the encoder.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure3.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 3:</b> The self-attention block. The sequence of inputs is shown as a set along the 3rd dimension, and concatenated.
+</center>
+--->
+## [The Transformer](https://www.youtube.com/watch?v=f01J0Dri-6k&t=2114s)
+
+Building on our knowledge of attention, we now interpret the fundamental building blocks of the transformer. In particular, we will take a forward pass through a basic transformer, and see how attention is used in the standard encoder-decoder paradigm and how it compares to the sequential architectures of RNNs.
+
+
+### Encoder-Decoder Architecture
+
+We should be familiar with this terminology. It appears most prominently in the autoencoder demonstrations, and is assumed as a prerequisite at this point. To summarize, an input is fed through an encoder and a decoder, which impose some sort of bottleneck on the data, forcing only the most important information through. This information is stored in the output of the encoder block, and can be used for a variety of unrelated tasks.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure1.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 1:</b> Two example diagrams of an autoencoder. The model on the left shows how an autoencoder can be designed with two affine transformations + activations, whereas the image on the right replaces this single "layer" with an arbitrary module of operations.
+</center>
+
+Our "attention" is drawn to the autoencoder layout shown in the model on the right; we will now take a look inside it, in the context of transformers.
+
+### Encoder Module
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure2.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 2:</b> The transformer encoder, which accepts a set of inputs $\vect{x}$ and outputs a set of hidden representations $\vect{h}^\text{Enc}$.
+</center>
+
+The encoder module accepts a set of inputs, which are fed through the self-attention block and, in parallel, bypass it to reach the `Add, Norm` block. They are then passed through the 1D-convolution and another `Add, Norm` block in the same way, and finally output as the set of hidden representations. This set of hidden representations is then sent either through further encoder modules (*i.e.* more layers) or to the decoder. We shall now discuss these blocks in more detail.
+
+### Self-attention
+
+The self-attention model is a normal attention model in which the query, key, and value are generated from the same item of the sequential input. In tasks that model sequential data, positional encodings are added to the input before this block. The output of this block is the attention-weighted values: the self-attention block accepts a set of inputs, from $1, \cdots , t$, and outputs $1, \cdots, t$ attention-weighted values, which are fed through the rest of the encoder.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure3.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 3:</b> The self-attention block. The sequence of inputs is shown as a set along the 3rd dimension, and concatenated.
+</center>
+
+
+<!----
+#### Add, Norm
+
+The add norm block has two components. First is the add block, which is a residual connection, and layer normalization.
+
+
+#### 1D-convolution
+
+Following this step, a 1D-convolution (aka a position-wise feed forward network) is applied. This block consists of two dense layers. Depending on what values are set, this block allows you to adjust the dimensions of the output $\vect{h}^\text{Enc}$.
+
+
+### Decoder Module
+
+The transformer decoder follows a similar procedure as the encoder. However, there is one additional sub-block to take into account. Additionally, the inputs to this module are different.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure5.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 4:</b> A friendlier explanation of the decoder.
+</center>
+--->
+#### Add, Norm
+
+The `Add, Norm` block has two components: the add block, which is a residual connection, followed by layer normalization.
+
+#### 1D-convolution
+
+Following this step, a 1D-convolution (also known as a position-wise feed-forward network) is applied. This block consists of two dense layers. Depending on the values chosen, this block allows you to adjust the dimensions of the output $\vect{h}^\text{Enc}$.
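+
+Putting the sub-blocks together, one encoder module can be summarized as follows. This is a sketch under assumptions: the names SelfAttn and FFN, the intermediate vector $\vect{z}$, the weights and biases of the two dense layers, and the non-linearity $\sigma$ are all introduced here for illustration. The lecture only says that the feed-forward block has two dense layers.
+
+$$
+\begin{aligned}
+\vect{z}_i &= \text{LayerNorm}\big(\vect{x}_i + \text{SelfAttn}(\vect{x}_i; \lbrace \vect{x}_j \rbrace_{j=1}^t)\big) \\
+\text{FFN}(\vect{z}_i) &= \vect{W}_2 \, \sigma(\vect{W}_1 \vect{z}_i + \vect{b}_1) + \vect{b}_2 \\
+\vect{h}^\text{Enc}_i &= \text{LayerNorm}\big(\vect{z}_i + \text{FFN}(\vect{z}_i)\big)
+\end{aligned}
+$$
+
+The same FFN is applied to every position $i$ independently, which is why it can be read as a 1D-convolution with kernel size 1.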
+
+### Decoder Module
+
+The transformer decoder follows a similar procedure to the encoder. However, there is one additional sub-block to take into account, and the inputs to this module are different.
+
+<center>
+<img src="{{site.baseurl}}/images/week12/12-3/figure5.png" style="zoom: 60%; background-color:#DCDCDC;" /><br>
+<b>Figure 4:</b> A friendlier explanation of the decoder.
+</center>
+<!---
+#### Cross-attention
+
+The cross attention follows the query, key, and value setup used for the self-attention blocks.  However, the inputs are a little more complicated. The input to the decoder is a data point $\vect{y}\_i$, which is then passed through the self attention and add norm blocks, and finally ends up at the cross-attention block. This serves as the query for cross-attention, where the key and value pairs are the output $\vect{h}^\text{Enc}$, where this output was calculated with all past inputs $\vect{x}\_1, \cdots, \vect{x}\_{t}$.
+---->
+#### Cross-attention
+
+Cross-attention follows the query, key, and value setup used for the self-attention blocks. However, the inputs are a little more complicated. The input to the decoder is a data point $\vect{y}\_i$, which is passed through the self-attention and `Add, Norm` blocks, and finally ends up at the cross-attention block. This serves as the query for cross-attention, while the key and value pairs come from the encoder output $\vect{h}^\text{Enc}$, which was calculated from all past inputs $\vect{x}\_1, \cdots, \vect{x}\_{t}$.
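+
+In the notation of the queries, keys and values section, this can be sketched as follows. Two symbols are introduced here for illustration: $\tilde{\vect{y}}\_i$, the decoder representation that reaches the cross-attention block, and $\vect{H}^\text{Enc}$, the matrix that stacks the encoder outputs as columns.
+
+$$
+\begin{aligned}
+\vect{q}_i &= \vect{W_q} \tilde{\vect{y}}_i \\
+\vect{K} &= \vect{W_k} \vect{H}^\text{Enc}, \quad \vect{V} = \vect{W_v} \vect{H}^\text{Enc} \\
+\vect{a}_i &= \text{[soft](arg)max}_{\beta} (\vect{K}^{\top} \vect{q}_i) \in \mathbb{R}^t \\
+\vect{h}_i &= \vect{V} \vect{a}_i
+\end{aligned}
+$$
+
+The only difference from self-attention is where the query comes from: the decoder side, rather than the same set that provides the keys and values.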
+
+<!----
+## Summary
+
+A set, $\vect{x}\_1$ to $\vect{x}\_{t}$ is fed through the encoder. Using self-attention and some more blocks, an output representation, $\lbrace\vect{h}^\text{Enc}\rbrace_{i=1}^t$ is obtained, which is fed to the decoder. After applying self-attention to it, cross attention is applied. In this block, the query corresponds to a representation of a symbol in the target language $\vect{y}\_i$, and the key and values are from the source language sentence ($\vect{x}\_1$ to $\vect{x}\_{t}$). Intuitively, cross attention finds which values in the input sequence are most relevant to constructing $\vect{y}\_t$, and therefore deserve the highest attention coefficients. The output of this cross attention is then fed through another 1D-convolution sub-block, and we have $\vect{h}^\text{Dec}$. For the specified target language, it is straightforward from here to see how training will commence, by comparing $\lbrace\vect{h}^\text{Dec}\rbrace_{i=1}^t$ to some target data.
+---->
+## Summary
+
+A set, $\vect{x}\_1$ to $\vect{x}\_{t}$, is fed through the encoder. Using self-attention and some more blocks, an output representation $\lbrace\vect{h}^\text{Enc}\rbrace_{i=1}^t$ is obtained, which is fed to the decoder. After self-attention is applied to the decoder input, cross-attention is applied. In this block, the query corresponds to a representation of a symbol in the target language $\vect{y}\_i$, and the keys and values come from the source language sentence ($\vect{x}\_1$ to $\vect{x}\_{t}$). Intuitively, cross-attention finds which values in the input sequence are most relevant to constructing $\vect{y}\_i$, and therefore deserve the highest attention coefficients. The output of this cross-attention is then fed through another 1D-convolution sub-block, and we have $\vect{h}^\text{Dec}$. For the specified target language, it is straightforward to see from here how training proceeds: by comparing $\lbrace\vect{h}^\text{Dec}\rbrace_{i=1}^t$ to some target data.
+
+<!----
+### Word Language Models
+
+There are a few important facts we left out before to explain the most important modules of a transformer, but will need to discuss them now to understand how transformers can achieve state-of-the-art results in language tasks.
+---->
+### Word Language Models
+
+We left out a few important facts earlier in order to explain the most important modules of a transformer, but we need to discuss them now to understand how transformers achieve state-of-the-art results on language tasks.
+
+<!---
+#### Positional encoding
+
+Attention mechanisms allow us to parallelize the operations and greatly accelerate a model's training time, but they lose sequential information. Positional encoding allows us to capture this context.
+---->
+#### Positional encoding
+
+Attention mechanisms let us parallelize operations and greatly reduce a model's training time, but they discard sequential (word-order) information. Positional encoding lets us recover that context.
+
+<!----
+#### Semantic Representations
+
+Throughout the training of a transformer, many hidden representations are generated. To create an embedding space similar to the one used by the word-language model example in PyTorch, the output of the cross-attention, will provide a semantic representation of the word $x_i$, at which point further experimentation can be performed over this dataset.
+---->
+#### Semantic Representations
+
+Throughout the training of a transformer, many hidden representations are generated. To create an embedding space similar to the one used by the word-language model example in PyTorch, the output of the cross-attention provides a semantic representation of the word $x_i$, at which point further experimentation can be performed over this dataset.
+
+
+<!----
+### Code Summary
+
+We will now see the blocks of transformers discussed above in a far more understandable format, code!
+
+The first module we will look at is the multi-headed attention block. Depending on the query, key, and values entered into this block, it can be used for either self-attention or cross-attention.
+
+
+```python
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_model, num_heads, p, d_input=None):
+        super().__init__()
+        self.num_heads = num_heads
+        self.d_model = d_model
+        if d_input is None:
+            d_xq = d_xk = d_xv = d_model
+        else:
+            d_xq, d_xk, d_xv = d_input
+        # Embedding dimension of model is a multiple of number of heads
+        assert d_model % self.num_heads == 0
+        self.d_k = d_model // self.num_heads
+        # These are still of dimension d_model. To split into number of heads
+        self.W_q = nn.Linear(d_xq, d_model, bias=False)
+        self.W_k = nn.Linear(d_xk, d_model, bias=False)
+        self.W_v = nn.Linear(d_xv, d_model, bias=False)
+        # Outputs of all sub-layers need to be of dimension d_model
+        self.W_h = nn.Linear(d_model, d_model)
+```
+
+
+Initialization of multi-headed attention class. If a `d_input` is provided, this becomes cross attention. Otherwise, self-attention. The query, key, value setup is constructed as a linear transformation of the input `d_model`.
+
+
+```python
+def scaled_dot_product_attention(self, Q, K, V):
+    batch_size = Q.size(0)
+    k_length = K.size(-2)
+
+    # Scaling by d_k so that the soft(arg)max doesnt saturate
+    Q = Q / np.sqrt(self.d_k)  # (bs, n_heads, q_length, dim_per_head)
+    scores = torch.matmul(Q, K.transpose(2,3))  # (bs, n_heads, q_length, k_length)
+
+    A = nn_Softargmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)
+
+    # Get the weighted average of the values
+    H = torch.matmul(A, V)  # (bs, n_heads, q_length, dim_per_head)
+
+    return H, A
+```
+
+Return hidden layer corresponding to encodings of values after scaled by the attention vector. For book-keeping purposes (which values in the sequence were masked out by attention?) A is also returned.
+
+```python
+def split_heads(self, x, batch_size):
+    return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
+```
+
+Split the last dimension into (`heads` × `depth`). Return after transpose to put in shape (`batch_size` × `num_heads` × `seq_length` × `d_k`)
+
+```python
+def group_heads(self, x, batch_size):
+    return (x.transpose(1, 2).contiguous()
+            .view(batch_size, -1, self.num_heads * self.d_k))
+```
+
+Combines the attention heads together, to get correct shape consistent with batch size and sequence length.
+
+```python
+def forward(self, X_q, X_k, X_v):
+    batch_size, seq_length, dim = X_q.size()
+    # After transforming, split into num_heads
+    Q = self.split_heads(self.W_q(X_q), batch_size)
+    K = self.split_heads(self.W_k(X_k), batch_size)
+    V = self.split_heads(self.W_v(X_v), batch_size)
+    # Calculate the attention weights for each of the heads
+    H_cat, A = self.scaled_dot_product_attention(Q, K, V)
+    # Put all the heads back together by concat
+    H_cat = self.group_heads(H_cat, batch_size)  # (bs, q_length, dim)
+    # Final linear layer
+    H = self.W_h(H_cat)  # (bs, q_length, dim)
+    return H, A
+```
+
+The forward pass of multi headed attention.
+
+The inputs are projected and split into q, k, and v, fed through a scaled dot-product attention mechanism, concatenated, and fed through a final linear layer. The attention block outputs the attention weights it found and the hidden representation that is passed through the remaining blocks.
+
+The next block shown in the transformer encoder is Add, Norm, whose normalization is already built into PyTorch. As such, it is extremely simple to implement and does not need its own class. Next is the 1-D convolution block. Please refer to previous sections for more details.
+
+Now that we have all of our main classes built (or built for us), we now turn to an encoder module.
+
+```python
+class EncoderLayer(nn.Module):
+    def __init__(self, d_model, num_heads, conv_hidden_dim, p=0.1):
+        self.mha = MultiHeadAttention(d_model, num_heads, p)
+        self.layernorm1 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+        self.layernorm2 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+
+    def forward(self, x):
+        attn_output, _ = self.mha(x, x, x)
+        out1 = self.layernorm1(x + attn_output)
+        cnn_output = self.cnn(out1)
+        out2 = self.layernorm2(out1 + cnn_output)
+        return out2
+```
+
+In the most powerful transformers, an arbitrarily large number of these encoders are stacked on top of one another.
+
+Recall that self attention by itself does not have any recurrence or convolutions, but that's what allows it to run so quickly. To make it sensitive to position we provide positional encodings. These are calculated as follows:
+
+
+$$
+\begin{aligned}
+E(p, 2i)   &= \sin(p / 10000^{2i / d}) \\
+E(p, 2i+1) &= \cos(p / 10000^{2i / d})
+\end{aligned}
+$$
+
+
+As to not take up too much room on the finer details, we will point you to https://github.com/Atcold/pytorch-Deep-Learning/blob/master/15-transformer.ipynb for the full code used here.
+
+
+An entire encoder, with N stacked encoder layers, as well as position embeddings, is written out as
+
+
+```python
+class Encoder(nn.Module):
+    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim,
+            input_vocab_size, maximum_position_encoding, p=0.1):
+        self.embedding = Embeddings(d_model, input_vocab_size,
+                                    maximum_position_encoding, p)
+        self.enc_layers = nn.ModuleList()
+        for _ in range(num_layers):
+            self.enc_layers.append(EncoderLayer(d_model, num_heads,
+                                                ff_hidden_dim, p))
+    def forward(self, x):
+        x = self.embedding(x) # Transform to (batch_size, input_seq_length, d_model)
+        for i in range(self.num_layers):
+            x = self.enc_layers[i](x)
+        return x  # (batch_size, input_seq_len, d_model)
+```
+--->
+### Code Summary
+
+We will now see the transformer blocks discussed above in a far more understandable format: code!
+
+The first module we will look at is the multi-headed attention block. Depending on the query, key, and values passed into this block, it can be used for either self-attention or cross-attention.
+
+```python
+import numpy as np
+import torch
+from torch import nn
+
+# nn.Softmax computes a soft(arg)max; alias it under the name used below
+nn_Softargmax = nn.Softmax
+
+
+class MultiHeadAttention(nn.Module):
+    def __init__(self, d_model, num_heads, p, d_input=None):
+        super().__init__()
+        self.num_heads = num_heads
+        self.d_model = d_model
+        # d_input optionally gives separate input sizes for query, key, value
+        if d_input is None:
+            d_xq = d_xk = d_xv = d_model
+        else:
+            d_xq, d_xk, d_xv = d_input
+        # Embedding dimension of model must be a multiple of number of heads
+        assert d_model % self.num_heads == 0
+        self.d_k = d_model // self.num_heads  # dimension per head
+        # Projections keep dimension d_model; they are split into heads later
+        self.W_q = nn.Linear(d_xq, d_model, bias=False)
+        self.W_k = nn.Linear(d_xk, d_model, bias=False)
+        self.W_v = nn.Linear(d_xv, d_model, bias=False)
+        # Outputs of all sub-layers need to be of dimension d_model
+        self.W_h = nn.Linear(d_model, d_model)
+```
+Initialization of the multi-headed attention class. If `d_input` is provided, the module is used for cross-attention; otherwise, for self-attention. The query, key, and value projections are linear transformations of the input into dimension `d_model`.
+
+```python
+def scaled_dot_product_attention(self, Q, K, V):
+    # Scaling by sqrt(d_k) so that the soft(arg)max doesn't saturate
+    Q = Q / np.sqrt(self.d_k)  # (bs, n_heads, q_length, dim_per_head)
+    scores = torch.matmul(Q, K.transpose(2, 3))  # (bs, n_heads, q_length, k_length)
+
+    # Normalise the scores over the key dimension
+    A = nn_Softargmax(dim=-1)(scores)  # (bs, n_heads, q_length, k_length)
+
+    # Get the weighted average of the values
+    H = torch.matmul(A, V)  # (bs, n_heads, q_length, dim_per_head)
+
+    return H, A
+```
+
+Returns the hidden representation, that is, the values weighted by the attention matrix. For book-keeping purposes (which values in the sequence were masked out by attention?), `A` is also returned.
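+
+In equation form, and using the same names as the code (`Q`, `K`, `V`, `A`, `H`, and `d_k`), the function above can be summarised as follows. This is a sketch of what the code computes per head, with the soft(arg)max taken over the key dimension:
+
+$$
+\begin{aligned}
+A &= \text{soft(arg)max}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) \\
+H &= A V
+\end{aligned}
+$$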
+
+```python
+def split_heads(self, x, batch_size):
+    return x.view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
+```
+Splits the last dimension into (`heads` × `depth`), then transposes so that the result has shape (`batch_size` × `num_heads` × `seq_length` × `d_k`).
+
+
+```python
+def group_heads(self, x, batch_size):
+    return (x.transpose(1, 2).contiguous()
+            .view(batch_size, -1, self.num_heads * self.d_k))
+```
+Combines the attention heads back together to get a shape consistent with the batch size and sequence length.
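+
+The shape bookkeeping of `split_heads` and `group_heads` can be traced as below. The symbols $B$ (batch size), $T$ (sequence length), and $h$ (number of heads) are shorthand introduced here for illustration, not notation from the lecture; $d_k = d_\text{model} / h$ as in the constructor.
+
+$$
+\begin{aligned}
+\texttt{split\_heads}: \quad & (B, T, d_\text{model}) \to (B, T, h, d_k) \to (B, h, T, d_k) \\
+\texttt{group\_heads}: \quad & (B, h, T, d_k) \to (B, T, h, d_k) \to (B, T, h \cdot d_k) = (B, T, d_\text{model})
+\end{aligned}
+$$
+
+For example, with $d_\text{model} = 32$ and $h = 2$, each head works with $d_k = 16$ dimensions.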
+
+
+```python
+def forward(self, X_q, X_k, X_v):
+    batch_size, seq_length, dim = X_q.size()
+    # After transforming, split into num_heads
+    Q = self.split_heads(self.W_q(X_q), batch_size)
+    K = self.split_heads(self.W_k(X_k), batch_size)
+    V = self.split_heads(self.W_v(X_v), batch_size)
+    # Calculate the attention weights for each of the heads
+    H_cat, A = self.scaled_dot_product_attention(Q, K, V)
+    # Put all the heads back together by concat
+    H_cat = self.group_heads(H_cat, batch_size)  # (bs, q_length, dim)
+    # Final linear layer
+    H = self.W_h(H_cat)  # (bs, q_length, dim)
+    return H, A
+```
+
+The forward pass of multi-headed attention.
+
+The inputs are projected and split into q, k, and v, fed through the scaled dot-product attention mechanism, concatenated, and passed through a final linear layer. The attention block outputs the attention weights it found and the hidden representation that is passed on to the remaining blocks.
+
+The next block shown in the transformer encoder is Add, Norm, whose normalization is already built into PyTorch. As such, it is extremely simple to implement and does not need its own class. Next is the 1-D convolution block. Please refer to previous sections for more details.
+
+Now that we have all of our main classes built (or built for us), we turn to an encoder module.
+
+```python
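+class EncoderLayer(nn.Module):
+    def __init__(self, d_model, num_heads, conv_hidden_dim, p=0.1):
+        super().__init__()  # required before registering submodules
+        self.mha = MultiHeadAttention(d_model, num_heads, p)
+        # Assumption: the 1-D convolution block has kernel size 1, i.e. a
+        # position-wise two-layer network; see the notebook for the original
+        self.cnn = nn.Sequential(
+            nn.Linear(d_model, conv_hidden_dim),
+            nn.ReLU(),
+            nn.Linear(conv_hidden_dim, d_model),
+        )
+        self.layernorm1 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+        self.layernorm2 = nn.LayerNorm(normalized_shape=d_model, eps=1e-6)
+
+    def forward(self, x):
+        attn_output, _ = self.mha(x, x, x)  # self-attention: q = k = v = x
+        out1 = self.layernorm1(x + attn_output)  # add & norm
+        cnn_output = self.cnn(out1)
+        out2 = self.layernorm2(out1 + cnn_output)  # add & norm
+        return out2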
+```
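+
+Written as equations, the forward pass above amounts to two residual "add and norm" steps. Here $\text{MHA}$ and $\text{CNN}$ are shorthand for the `mha` and `cnn` sub-modules and $\text{out}_1$, $\text{out}_2$ mirror the variable names in the code; this is only a restatement of the code, not additional architecture:
+
+$$
+\begin{aligned}
+\text{out}_1 &= \text{LayerNorm}\big(x + \text{MHA}(x, x, x)\big) \\
+\text{out}_2 &= \text{LayerNorm}\big(\text{out}_1 + \text{CNN}(\text{out}_1)\big)
+\end{aligned}
+$$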
+
+In the most powerful transformers, an arbitrarily large number of these encoder layers are stacked on top of one another.
+
+Recall that self-attention by itself has no recurrence or convolutions, which is what allows it to run so quickly. To make it sensitive to position, we provide positional encodings. These are calculated as follows:
+
+
+
+$$
+\begin{aligned}
+E(p, 2i)   &= \sin(p / 10000^{2i / d}) \\
+E(p, 2i+1) &= \cos(p / 10000^{2i / d})
+\end{aligned}
+$$
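+
+As a small worked example, take $d = 4$ (an illustrative value, not one from the lecture). Even components use the sine and odd components the cosine, with $i \in \lbrace 0, 1 \rbrace$, so the encoding of position $p$ has four components:
+
+$$
+\begin{aligned}
+E(p, 0) &= \sin(p / 10000^{0}) = \sin(p) \\
+E(p, 1) &= \cos(p / 10000^{0}) = \cos(p) \\
+E(p, 2) &= \sin(p / 10000^{2/4}) = \sin(p / 100) \\
+E(p, 3) &= \cos(p / 10000^{2/4}) = \cos(p / 100)
+\end{aligned}
+$$
+
+The first pair oscillates quickly with $p$ and the second pair slowly, so different components encode position at different scales.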
+
+To avoid taking up too much room with the finer details, we point you to https://github.com/Atcold/pytorch-Deep-Learning/blob/master/15-transformer.ipynb for the full code used here.
+
+An entire encoder, with N stacked encoder layers as well as position embeddings, is written out as:
+
+```python
+class Encoder(nn.Module):
+    def __init__(self, num_layers, d_model, num_heads, ff_hidden_dim,
+                 input_vocab_size, maximum_position_encoding, p=0.1):
+        super().__init__()  # required before registering submodules
+        self.num_layers = num_layers  # used by forward
+        # Embeddings is not defined in this section; see the linked notebook
+        self.embedding = Embeddings(d_model, input_vocab_size,
+                                    maximum_position_encoding, p)
+        self.enc_layers = nn.ModuleList()
+        for _ in range(num_layers):
+            self.enc_layers.append(EncoderLayer(d_model, num_heads,
+                                                ff_hidden_dim, p))
+
+    def forward(self, x):
+        x = self.embedding(x)  # Transform to (batch_size, input_seq_length, d_model)
+        for i in range(self.num_layers):
+            x = self.enc_layers[i](x)
+        return x  # (batch_size, input_seq_len, d_model)
+```
+<!----
+## Example Use
+
+There are many tasks for which you can use just an Encoder. In the accompanying notebook, we see how an encoder can be used for sentiment analysis.
+
+Using the imdb review dataset, we can output from the encoder a latent representation of a sequence of text, and train this encoding process with binary cross entropy, corresponding to a positive or negative movie review.
+
+Again we leave out the nuts and bolts and direct you to the notebook, but here are the most important architectural components used in the transformer:
+
+
+
+```python
+class TransformerClassifier(nn.Module):
+    def forward(self, x):
+        x = Encoder()(x)
+        x = nn.Linear(d_model, num_answers)(x)
+        return torch.max(x, dim=1)
+
+model = TransformerClassifier(num_layers=1, d_model=32, num_heads=2,
+                         conv_hidden_dim=128, input_vocab_size=50002, num_answers=2)
+```
+Where this model is trained in typical fashion.
+----->
+## Example Use
+
+There are many tasks for which you can use just an Encoder. In the accompanying notebook, we see how an encoder can be used for sentiment analysis.
+
+Using the imdb review dataset, we can output from the encoder a latent representation of a sequence of text, and train this encoding process with binary cross entropy, corresponding to a positive or negative movie review.
+
+Again we leave out the nuts and bolts and direct you to the notebook, but here are the most important architectural components used in the transformer:
+
+
+```python
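+class TransformerClassifier(nn.Module):
+    def __init__(self, num_layers, d_model, num_heads, conv_hidden_dim,
+                 input_vocab_size, num_answers):
+        super().__init__()
+        # Assumption: maximum_position_encoding=10000 is an illustrative value
+        self.encoder = Encoder(num_layers, d_model, num_heads, conv_hidden_dim,
+                               input_vocab_size, maximum_position_encoding=10000)
+        self.dense = nn.Linear(d_model, num_answers)
+
+    def forward(self, x):
+        x = self.encoder(x)  # (batch_size, seq_len, d_model)
+        x = self.dense(x)  # (batch_size, seq_len, num_answers)
+        x, _ = torch.max(x, dim=1)  # keep max values over the sequence, drop indices
+        return x  # (batch_size, num_answers)
+
+model = TransformerClassifier(num_layers=1, d_model=32, num_heads=2,
+                         conv_hidden_dim=128, input_vocab_size=50002, num_answers=2)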
+```
+This model is then trained in the typical fashion.