Fine tuning pre-trained LLM for language translation and to build ChatGPT like application #334
-
Thank you so much for delivering a great book. I just completed Chapter 7 (yet to review the bonus material). Two prominent questions in my mind (out of many) are:

1. How feasible is it to finetune the pretrained LLM for language translation?
2. What would it take to build a ChatGPT-like application on top of it?

Do you recommend any blog posts or discussion threads that address these topics? Thank you again for writing a great book!
Replies: 1 comment
-
Glad you liked the book!

Regarding your first question: it depends on the LLM and the languages involved, but I'd say this is relatively straightforward. The key here is that the languages must have been present in the pretraining dataset used to create the tokenizer and the pretrained LLM. Then, finetuning it for language translation (using the technique from Ch07) is relatively easy. The reason is that otherwise the tokenizer will break up words into too many subtokens -- it will still work, but it is not ideal. Base models that support multiple languages include, for example, Qwen 2 (~20 languages) and Llama 3.1 (~8 languages). (It's also possible to extend existing tokenizers with new tokens, but that is a separate topic I will write about some time.)

Regarding the second question: that depends on how many customers you want to serve, and it can be a bit more effort. First, I suggest doing an alignment step -- that's usually done for safety reasons. A popular technique for this these days is DPO (see the DPO bonus material here).
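To see the subtoken issue concretely, here is a minimal sketch using tiktoken's GPT-2 encoding (the BPE tokenizer used in the book's earlier chapters); the example sentences are just illustrative:

```python
import tiktoken  # BPE tokenizer library used in the book

tokenizer = tiktoken.get_encoding("gpt2")

english = "The weather is nice today."
german = "Das Wetter ist heute schön."

# A tokenizer trained mostly on English tends to split non-English
# words into many more subtokens, which hurts translation finetuning.
for text in (english, german):
    token_ids = tokenizer.encode(text)
    pieces = [tokenizer.decode([t]) for t in token_ids]
    print(f"{len(token_ids):>2} tokens: {pieces}")
```

Finetuning for translation with the Chapter 7 technique then essentially means phrasing each translation pair as an instruction-style example. A rough sketch, assuming the Alpaca-style prompt format from Chapter 7 (the `entry` dictionary and its field names are made up for illustration):

```python
def format_translation_example(entry):
    # Wrap a translation pair in an Alpaca-style instruction prompt,
    # analogous to the formatting used for instruction finetuning in Ch07
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\nTranslate the following sentence "
        f"into {entry['target_lang']}."
        f"\n\n### Input:\n{entry['source']}"
    )
    response_text = f"\n\n### Response:\n{entry['target']}"
    return instruction_text + response_text


example = {
    "source": "Das Wetter ist heute schön.",
    "target_lang": "English",
    "target": "The weather is nice today.",
}
print(format_translation_example(example))
```

For the alignment step, the core of DPO is a simple loss computed from the log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the policy and a frozen reference model. A minimal PyTorch sketch (function and argument names are mine, not taken from the bonus material):

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Direct Preference Optimization (Rafailov et al., 2023):
    # increase the log-prob margin between chosen and rejected responses
    # relative to the frozen reference model, scaled by beta.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()
```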