1. Introduction
This paper describes gft (general fine-tuning),Footnote 1 a little languageFootnote 2 for deep nets that was introduced at an ACL-2022 tutorial.Footnote 3 There are two parts to the tutorial:
1. Glass is half-full: make deep nets accessible to a mass audience, including nonprogrammers, and
2. Glass is half-empty: given the successes of the first part on so many benchmarks, one might come away with the mistaken impression that deep nets are more successful than they are. There are always opportunities for improvement. We advocate an interdisciplinary approach that combines the successes of the first part with decades of work on representation in AI and centuries of work in linguistics and philosophy.
This paper will use gft to discuss the first part. It is amazing how much can be done with so little. gft demystifies deep nets. No one would suggest that regression-like methods are “intelligent.”
There are two main functions in gft: fit and predict. Fit takes a pretrained model, $f_{pre}$, as input and fine-tunes it on data to produce a post-trained model, $f_{post}$, as output. Predict takes a novel input, x, and outputs a prediction, $\hat{y}=f(x)$. Hopefully, the prediction, $\hat{y}$, will be close to the gold label, y.
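The two functions can be sketched in a few lines of Python. This is a toy illustration, not gft's implementation: a linear model stands in for a deep net, and plain SGD stands in for fine-tuning.

```python
def fit(f_pre, data, lr=0.01, epochs=500):
    """Toy 'fine-tuning': start from pretrained weights and nudge them toward the data."""
    w, b = f_pre
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error on one example
            w -= lr * err * x       # gradient step on the slope
            b -= lr * err           # gradient step on the intercept
    return (w, b)                   # the post-trained model, f_post

def predict(f, x):
    """y_hat = f(x)"""
    w, b = f
    return w * x + b

# f_pre starts near, but not at, the gold relationship y = 2x + 1.
f_post = fit((1.0, 0.0), [(0, 1), (1, 3), (2, 5), (3, 7)])
y_hat = predict(f_post, 4)          # hopefully close to the gold label, y = 9
```

The point of the analogy is that fine-tuning, like regression, is an optimization that adjusts parameters to fit data; nothing about the pattern requires hundreds of lines of code.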
We discussed deep nets in two previous articles in this journal: (Church et al., Reference Church, Chen and Ma2021a, b). gft makes it possible to do much of that in short (1-line) programs. 1-line programs are easier to read, write, understand, and port from one environment to another than examples on hubs (typically hundreds of lines of Python, PyTorch,Footnote 4 TensorFlow,Footnote 5 JaxFootnote 6, and/or PaddlePaddle).Footnote 7
gft is designed to make much of this functionality accessible to nonprogrammers. Just as one does not need to know Python and Machine Learning to use an off-the-shelf regression package, so too, deep nets should not require much (if any) programming skills.
Following the advice in “Crossing the Chasm” (Moore and McKenna, Reference Moore and McKenna1999), the long-term success of deep nets will depend on finding ways to cross the chasm from the current set of loyal users (so-called early adopters) to a much larger set of users. Early adopters may be willing to invest in machine learning and programming, but most users have other priorities.
The gft interpreter is based on examples from hubs.Footnote 8, Footnote 9 Hubs encourage users to modify hundreds of lines of Python code as necessary if they want to change models, data sets, and/or tasks. gft generalizes these examples so users can do much of that in a single line of gft code (with comparable performance).
gft supports most of the arguments in the examples on the hubs, so it is possible to tune hyper-parameters such as batch size, learning rate, and stopping rules. Tuning matters for SOTA (state-of-the-art) chasing, though default settings are recommended for most users, who prefer results that are easy to replicate and reasonably competitive.
There is already too much SOTA-chasing in the literature (Church and Kordoni, Reference Church and Kordoni2022). Users should avoid wasting time on hyper-parameter tuning unless they are about to ship a model to a large number of users for an application where small improvements in performance are worth the effort.
2. gft Cheatsheet
gft supports the following functions:Footnote 10
1. fit (also known as fine-tuning): $f_{pre} + data \rightarrow f_{post}$
2. predict (also known as inference): $f(x) = \hat{y}$, where x is an input from stdin or from a data set
3. eval: $f + data \rightarrow score$ (produce a single score for a data set split, as opposed to a prediction, $\hat{y}$, for each input row, x, in the split)
4. summary: Search hubs for popular data sets, models, and tasks and provide snippets. Popularity is estimated from metrics on downloads.
5. cat_data: Output a data set on stdout
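The difference between predict and eval can be sketched in Python (a toy keyword classifier stands in for a deep net): predict returns one $\hat{y}$ per input, while eval reduces a whole split to a single score.

```python
def predict(f, x):
    """One prediction, y_hat = f(x), for one input."""
    return f(x)

def eval(f, data):
    """Reduce a whole split to one score (here: accuracy), not one y_hat per row."""
    correct = sum(1 for x, y in data if predict(f, x) == y)
    return correct / len(data)

# Toy model: classify a text as positive iff it mentions "love".
f = lambda x: "positive" if "love" in x else "negative"

# Toy split of (input, gold label) rows.
split = [("i love you", "positive"),
         ("i am sad", "negative"),
         ("no love lost", "negative")]   # a case the keyword model gets wrong

score = eval(f, split)                   # accuracy over the split
```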
There are four major arguments:
1. --data: a data set on a hub, or a local file
2. --model: a model on a hub, or a local file
3. --task: for example, classify, regressFootnote 11
4. --eqn (e.g., classify: $y \sim x_1 + x_2$), where a task appears before the colon, and variables refer to columns in the data set
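A hypothetical sketch of how an eqn string might be decomposed into a task and column names (gft's own parsing may differ):

```python
def parse_eqn(eqn):
    """Split e.g. 'classify: y ~ x1 + x2' into a task and the columns it mentions."""
    task, formula = eqn.split(":", 1)
    lhs, rhs = formula.split("~", 1)
    return {
        "task": task.strip(),
        "y": lhs.strip(),                          # left-hand side: the gold label column
        "x": [v.strip() for v in rhs.split("+")],  # right-hand side: input columns
    }

parsed = parse_eqn("classify: label ~ text")
```

The variables on either side of the tilde are names of columns in the data set, just as in R's formula notation.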
3. The standard recipe
Following (Howard and Ruder, Reference Howard and Ruder2018; Devlin et al., Reference Devlin, Chang, Lee and Toutanova2019), it has become standard practice to use the 3-step recipe in Table 1. We prefer the terms fit and predict to fine-tuning and inference. The proposed terminology has a long tradition in statistics and predates relatively recent work on deep nets.Footnote 12
Fit and predict were discussed in two previous Emerging Trends articles in this journal (Church et al. Reference Church, Chen and Ma2021a, b). This paper will unify much of that discussion into a single GitHub repository (see footnote 1) with hundreds of examples of short (1-line) programs.Footnote 13
gft makes it easy to use models and data sets on hubs (HuggingFaceFootnote 14 and PaddleHub/PaddleNLP).Footnote 15 The hubs are large ($\sim$40k models and $\sim$4k data sets) and growing quickly ($\sim$3x/year). The challenge is to make these amazing resources more accessible to as many users as possible. The target audience has diverse interests and skills. It should not be necessary for them to know much (if any) programming to join in on the fun.
The 40k models include both pretrained and post-trained models, $f_{pre}$ and $f_{post}$ . gft provides tools to make it easy to find popular models, as well as popular data sets. We recommend users make as much use as possible of these resources and resist the temptation to pretrain their own models from scratch, for reasons that will be discussed in Appendix A.1.
3.1. An example of fit and predict in R
As mentioned above, gft is inspired by glm (general linear models) (Guisan et al., Reference Guisan, Edwards and Hastie2002) in R.Footnote 16 Listing 1 illustrates the use of fit and predict in R. The R environment provides a number of standard data sets such as cars, a data table with two columns, speed and dist, shown as black points in Figure 1. The model, g, fits dist as a quadratic function of speed. Predictions from this model are shown in red in Figure 1.
The summary function in R is applied to both the data table cars as well as the model g. The R summary function can be applied to almost any object and provides some useful description of its argument.
3.2. An example of fit (aka fine-tuning)
Listing 2 shows an example of gft_fit. Listing 2 is similar to Listing 1 in a number of ways. Fit takes a pretrained model, $f_{pre}$ , and uses a data set to output a post-trained model, $f_{post}$ . In Listing 2, $f_{pre}$ is a BERT model, and the data table is the emotion data set on HuggingFace. The model in Listing 1, g, is analogous to $f_{post}=\$outdir$ in Listing 2. The variables in both equations, line 7 of Listing 1 and line 3 of Listing 2, refer to columns in the relevant data table.
Many gft programs take four arguments:
1. --data specifies the use of the emotion data set on HuggingFace.Footnote 17
2. --model specifies the use of a BERT model on HuggingFaceFootnote 18 as $f_{pre}$.
3. --eqn specifies a task (classification), plus a formula expressed in terms of columns in the data set.
4. --task specifies a task (not necessary when the task is specified by the --eqn argument).
Fit takes most of these (except for --task); in addition, fit requires --output_dir to specify a location for the output post-trained model, $f_{post}$.
3.3. An example of predict (aka inference)
Listings 3 and 4 show two examples of gft_predict. Predict takes a novel input, x, and applies x to a model, f, to produce a prediction, $\hat{y}=f(x)$ . The default model (for the classification task) performs sentiment analysis; other models output other labels. In particular, the f in Listing 4 outputs emotion classes: anger, fear, joy, love, sadness, surprise. To see the set of classes for a model, we recommend the use of gft_summary, as illustrated in Listing 5. gft_summary outputs the set of classes, among other things.
Some more classifications of $x =$ “I love you” are shown in Tables 2 and 3, using a number of different models from HuggingFace. Most of these models agree that x is positive, though many of them classify x as fake news and some classify x as spam. One can use other models to classify x in many other ways, such as offensive or not and hate speech or not.
Many of these classifiers were trained on corpora that may not be appropriate for this task. In particular, we really should not apply a Spanish classifier to English inputs, but mistakes like that are likely to happen, given how easy they are to make.
Most of the models on the hubs were created by the community. The hubs do not vet models for quality. The best models on the hubs are very good, though maybe not state of the art (SOTA). We rarely see results that are as good as PWCFootnote 19 and leaderboards.Footnote 20 Some models produce poor results, or no results (using standard mechanisms in gft). The most popular models (in terms of downloads) often produce competitive results, though the most popular models rarely produce the best results.
3.4. Embarrassment of riches
As mentioned at the beginning of this section, there are a huge number of models and data sets on the hubs. There are currently 40k models and 4k data sets, and these numbers are increasing rapidly ( $\sim$ 3x/year). How do we find the good stuff? And how do we use it?
The hubs provide a number of useful tools to answer these questions. There are GUI interfaces (as illustrated by footnotesFootnote 17 and Footnote 18), as well as APIs. gft_summary uses the APIs to provide much of this functionality, as illustrated in Listing 6, which finds the five most popular data sets (or models) that contain the substring: “emotion.” Popularity is estimated from downloads.
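The ranking can be sketched as follows (the names and download counts below are made-up placeholders, not actual hub metrics):

```python
# Hypothetical (name, downloads) pairs standing in for hub metadata from the API.
datasets = [
    ("emotion", 120_000),
    ("tweet_eval/emotion", 45_000),
    ("squad", 300_000),
    ("glue", 500_000),
    ("go_emotions", 80_000),
]

def most_popular(items, substring, k=5):
    """Rank items whose name contains the substring by download count."""
    hits = [item for item in items if substring in item[0]]
    return sorted(hits, key=lambda item: item[1], reverse=True)[:k]

top = most_popular(datasets, "emotion")
```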
Listing 7 finds the most popular data sets and models by searching for data sets and models that contain the null string:
There are a few common naming conventions. Models containing the string “base” are likely to be base models, $f_{pre}$ (also known as pretrained models or foundation models). Models containing the string “distil” are likely to be distilled (compressed) models. Models containing the names of popular tasks such as “squad” and GLUE subtasks are likely to be post-trained models, $f_{post}$.
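The naming conventions above can be sketched as a heuristic (a hypothetical function; model cards are a more reliable source of this information):

```python
def guess_kind(model_name):
    """Heuristic guess at a model's kind from its name alone."""
    name = model_name.lower()
    if "distil" in name:
        return "distilled"
    # Task names often appear in fine-tuned model names, sometimes alongside "base".
    if any(task in name for task in ("squad", "cola", "sst2", "mnli")):
        return "post-trained (f_post)"
    if "base" in name:
        return "pretrained (f_pre)"
    return "unknown"
```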
gft_summary can also be used to summarize data sets, models, tasks, etc. As mentioned in Section 3.1, these summaries are modeled after the summary function in R, which takes many different types of objects and produces useful descriptions.
3.5. Portability across hubs and frameworks
3.5.1. Portability $\rightarrow$ stability over time
The code in the listings above takes a dependency on HuggingFace, a small start-up company that has done very well recently. There are also dependencies on a number of Python packages that are constantly changing. We have seen many hardware and software platforms come and go. Many companies do well for a while, but success rarely lasts for long (decades). Deep nets will be more likely to survive the test of time if they are written in high-level languages such as gft that can be ported from one environment to another, as necessary.
Consider the example of operating systems. Unix survived the test of time better than alternatives such as VMSFootnote 21 because Unix was designed to port easily across suppliers. There was a time when Unix was mostly running on DEC machines,Footnote 22 and then there was a time when Unix was mostly running on Sun computers.Footnote 23 These days, Unix has moved on to other platforms. If programs are written in a relatively stable higher level environment like Unix (and gft), then old programs are more likely to continue to work for decades, despite instabilities at lower levels in the hardware and software stacks.
Too many deep nets are taking dependencies on Python packages that are updated very frequently (almost daily), often in incompatible ways. Many of these resources are supported by companies that could go out of business, or could decide to sunset support at any time. Given recent events, there is a risk that support could also be cut off by sanctions and other instabilities in international relations. Because of these realities, gft is designed to make it easy to port from one hub to another.
3.5.2. H is for HuggingFace and P is for PaddleNLP/PaddleHub
Listing 9 is similar to Listing 2, though dependencies on one company (H $\rightarrow$ HuggingFace) are replaced by dependencies on another company (P $\rightarrow$ Baidu’s PaddleNLP/PaddleHub). gft supports mixing and matching models and data sets from different suppliers. “H:” uses resources from HuggingFace, and “P:” uses resources from PaddleNLP/PaddleHub. gft also supports “C:” for custom resources on the local file system.
Note that most of the models on HuggingFace are based on PyTorch, whereas models on PaddleNLP and PaddleHub use a different framework called PaddlePaddle. gft hides much of this complexity.
Listing 9 uses the chnsenticorp data set,Footnote 24 which is different from the emotion data set in Listing 2. The chnsenticorp data set specifies a sentiment analysis task in Chinese, whereas the emotion data set specifies an emotion classification task in English.
Listing 9 uses the ernie-tiny model (Su et al., Reference Su, Chen, Feng, Liu, Liu, Sun, Tian, Wu and Wang2021), a compressed version of an ERNIE model. ERNIE models are similar to BERT models, though ERNIE models may be more appropriate for Chinese applications. Distillation (Hinton et al., Reference Hinton, Vinyals and Dean2015) is a popular method to compress models. Compressed models tend to trade off a little performance (accuracy) in order to save a substantial amount of space and time when making predictions at inference time (Ganesh et al., Reference Ganesh, Chen, Lou, Khan, Yang, Sajjad, Nakov, Chen and Winslett2021). Distillation can be important for commercial applications.
4. Data sets and equations
4.1. Data sets
As mentioned in Section 3.4, there are currently more than 4000 data sets on the hubs. We have already mentioned the emotion data set. Many data sets provide splits for training, validation, and test, though different data sets may name these splits differently. Each split provides a data table with columns and rows. The emotion data set, for example, contains two columns, named text and label. As can be seen in HuggingFace’s data set viewer,Footnote 25 each row specifies a text field (e.g., “i didnt feel humiliated”) and a label field (e.g., “sadness”). We will refer to the label field as a gold label. The task is to predict the gold labels.
SQuADFootnote 26,Footnote 27 (Rajpurkar et al., Reference Rajpurkar, Zhang, Lopyrev and Liang2016, Reference Rajpurkar, Jia and Liang2018) is a popular data set for question answering. This data set has five columns: id, title, context, question, answers. The answers are substrings of the context, which makes this task considerably easier than the general case of Q&A (question answering), where the answer could be almost anything and need not be mentioned in any of the other columns.
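The substring property can be checked directly. A minimal sketch, assuming SQuAD-style rows that store the answer text together with its character offset into the context (the example row below is made up):

```python
# A SQuAD-style row: the answer is a span of the context, marked by a character offset.
context = "Super Bowl 50 was an American football game played on February 7, 2016."
question = "When was Super Bowl 50 played?"
answer_text = "February 7, 2016"
answer_start = context.index(answer_text)   # SQuAD stores this offset with the answer

# Extracting the span recovers the answer exactly -- the property that makes
# SQuAD easier than open-ended Q&A, where the answer need not appear in the context.
span = context[answer_start:answer_start + len(answer_text)]
```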
In Section 2.1 of (Church and Kordoni, Reference Church and Kordoni2022), there is a discussion of constructed queries like SQuAD. The TREC QA trackFootnote 28 started with “constructed” questions in 1999, but quickly moved to “real” questions from query logs for subsequent TREC QA tracks (2000–2007) because constructed questions are too easy for systems and unrealistic (Voorhees, Reference Voorhees2001).
Another popular data set is GLUEFootnote 29,Footnote 30 (Wang et al., Reference Wang, Singh, Michael, Hill, Levy and Bowman2018). GLUE contains a number of subsets: cola, sst2, wnli, mrpc, rte, qnli, qqp, stsb, and mnli. Each subset contains three splits (train, validation, test). Different subsets have different columns.
GLUE has since been extended with another benchmark, SUPERGLUE (Wang et al., Reference Wang, Pruksachatkun, Nangia, Singh, Michael, Hill, Levy and Bowman2019). Both GLUE and SUPERGLUE are popular on HuggingFace (in terms of downloads), though there are currently more downloads for GLUE.Footnote 31
4.2. Examples of –data and –eqn
Short (1-line) gft programs can fit (fine-tune) many benchmarks, as illustrated in Table 4, which shows --data and --eqn arguments for a number of popular benchmarks.
- --data arguments start with a supplier, for example, H, P, or C. After the colon, there can be one or two substrings, delimited by a comma. For example, for the cola subtask of GLUE, the --data argument is H:glue,cola.
- --eqn arguments consist of a task, plus a formula expressed in terms of columns in the data set. See Table 5 for examples of some tasks. For a more comprehensive list of tasks, see footnote 11.
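A hypothetical sketch of how a data argument such as H:glue,cola might be decomposed (gft's own parsing may differ):

```python
def parse_data_arg(arg):
    """Split e.g. 'H:glue,cola' into supplier, data set, and optional subset."""
    supplier, rest = arg.split(":", 1)
    parts = rest.split(",")
    return {
        "supplier": supplier,          # H = HuggingFace, P = PaddleNLP/PaddleHub, C = custom
        "dataset": parts[0],
        "subset": parts[1] if len(parts) > 1 else None,
    }

cola = parse_data_arg("H:glue,cola")
```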
5. More examples and more tasks
As mentioned in footnote 13, there are hundreds of examples of gft in the GitHub repository: fit,Footnote 32 predict,Footnote 33 summary,Footnote 34 and eval.Footnote 35 A few examples have already been discussed in Sections 3.2 and 3.3. Many more will be discussed in the next few subsections:
1. Predict (Section 5.1): token-classification, fill-mask, MT, ASR, etc.
2. Input from data sets (as opposed to stdin) (Section 5.2).
3. gft_predict $\rightarrow$ gft_eval (Section 5.3).
5.1. Predict
A few examples of predict were shown in Listing 3. The gft documentation has many more examples of predict.Footnote 36
5.1.1. Token classification
Some examples of token classification with PaddleNLP are shown in Listing 11.
Many of these tasks have been in the literature for a long time. Fill-mask is similar to the cloze task (Taylor, Reference Taylor1953), as illustrated in Listing 12.
Text generation is one of the more popular use cases for GPT-3, though Listing 13 uses a different model.
5.1.2. MT, ASR and more
There are translation models for many language pairs, as illustrated in Listing 14.Footnote 37
5.2. Input from data sets (as opposed to stdin)
Listing 17 shows an example of input from a data set.
5.4. Debugging, confusion matrices, and error analysis
In addition to producing a score with gft_eval, suppose we want to do some deep dives to look at particular errors. The code in Listing 19 will create a confusion matrix based on the validation split.
gft_predict outputs TSV (tab-separated values) with four columns:
1. Input, x
2. Gold label, y
3. Predicted label, $\hat{y}$
4. Score
The cut statement on line 4 in Listing 19 selects y and $\hat{y}$ . The sort and uniq statements count the number of confusions, producing the confusion matrix shown in Table 6. Standard Unix tools such as grep (or AWK) can be used to find more details for particular confusions.
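The counting that the cut/sort/uniq pipeline performs can be sketched in Python (the TSV rows below are made-up examples, not actual gft output):

```python
from collections import Counter

# Hypothetical gft_predict output: TSV rows of (input x, gold y, predicted y_hat, score).
tsv = """i didnt feel humiliated\tsadness\tsadness\t0.93
i am feeling grouchy\tanger\tsadness\t0.51
i love you\tjoy\tlove\t0.71
i love you\tjoy\tjoy\t0.88
"""

# Equivalent of `cut -f2,3 | sort | uniq -c`: count (gold, predicted) pairs.
confusions = Counter()
for line in tsv.strip().splitlines():
    x, y, y_hat, score = line.split("\t")
    confusions[(y, y_hat)] += 1
```

Off-diagonal pairs such as (anger, sadness) are the confusions worth a deep dive; grep on the TSV recovers the rows behind any particular cell.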
5.5. Vectors on the left hand side (LHS)
With regression and classification, the left-hand side (lhs) of the equation is typically a scalar, but gft has been generalized so the lhs can also be a point in a vector space, as shown in Listing 20. This example fine-tunes BERT with the NRC-VAD lexiconFootnote 38 (Mohammad, Reference Mohammad2018). Words are assigned to points in $\mathbb{R}^3$, Valence, Arousal, and Dominance, based on VAD norms in psychology (Osgood et al., Reference Osgood, Suci and Tannenbaum1957).
Listing 20 is our first example of a custom data set. There are three CSV files on the local filesystem:
1. train split: $gft/datasets/VAD/VAD.train
2. validation split: $gft/datasets/VAD/VAD.val
3. test split: $gft/datasets/VAD/VAD.test
The three CSV files start with a header row that specifies the names of the columns. The variables in the equation refer to these columns in the CSV files.
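A minimal sketch of reading such a split, assuming hypothetical column names (word, V, A, D) and illustrative values; the actual files may differ:

```python
import csv, io

# A miniature stand-in for a VAD split. The header names the columns that the
# equation's variables would refer to; the rows and values are illustrative.
train_csv = """word,V,A,D
love,1.000,0.519,0.673
hate,0.052,0.840,0.545
"""

rows = list(csv.DictReader(io.StringIO(train_csv)))

# Each row supplies an input (the word) and a 3-dimensional gold label in R^3.
x = rows[0]["word"]
y = (float(rows[0]["V"]), float(rows[0]["A"]), float(rows[0]["D"]))
```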
In addition to illustrating the use of custom data sets, Listing 20 introduces two new features. First, we normally train models on corpora, but Listing 20 trains a model on a lexicon, the NRC-VAD lexicon. Second, regression usually takes scalar values on the left-hand side (lhs), but in this case, the lhs is a point in $\mathbb{R}^3$ .
Listing 20 produces a post-trained model, $f_{post}$. A few results with $f_{post}$ are shown in Table 7: some predictions, $\hat{y}$, for some inputs, x. These predictions, $\hat{y}$, can be compared with the gold labels, y, the VAD scores from NRC-VAD (last three columns).
Although the model was trained on words (lemmas in the NRC Lexicon), the inputs, x, in Table 7 include a number of words, phrases, and texts, many of which are not in the NRC-VAD Lexicon (by construction). That is, $f_{post}$ can be applied to any input text (up to 512 subword units). Table 7 shows predictions, $\hat{V}$, $\hat{A}$, and $\hat{D}$, as well as gold values, V, A, and D. When the input, x, is not in the NRC-Lexicon, the gold value, y, is NA (not available). Since NRC-VAD is based on lemmas, NAs are to be expected for inflected forms, OOV (out-of-vocabulary) words such as unlovable, MWEs (multiword expressions) such as ugly duckling, sentences, and documents.
6. Conclusions
This paper proposed gft, a little language for fine-tuning pretrained base (foundation) models. Little languages make it easier for a broader audience (including non-programmers) to join in on the fun. Just as most users of regression do not need to know how to solve the regression optimization, so too users of deep nets should not need to understand hundreds of lines of Python and PyTorch. Higher level environments offer a number of advantages: ease of use, transparency, portability. gft removes much of the complexity, and much of the magic (and the alchemy) in deep nets, reducing fine-tuning to an optimization similar to regression. No one would suggest that regression-like methods are “intelligent.”
A Appendix
A.1 Pretraining ( $f_{pre}$ ): Don’t do it (yourself)
Recent work on foundation modelsFootnote 39 (Bommasani et al., Reference Bommasani, Hudson, Adeli, Altman, Arora, von Arx, Bernstein, Bohg, Bosselut, Brunskill, Brynjolfsson, Buch, Card, Castellon, Chatterji, Chen, Creel, Davis, Demszky, Donahue, Doumbouya, Durmus, Ermon, Etchemendy, Ethayarajh, Fei-Fei, Finn, Gale, Gillespie, Goel, Goodman, Grossman, Guha, Hashimoto, Henderson, Hewitt, Ho, Hong, Hsu, Huang, Icard, Jain, Jurafsky, Kalluri, Karamcheti, Keeling, Khani, Khattab, Kohd, Krass, Krishna, Kuditipudi, Kumar, Ladhak, Lee, Lee, Leskovec, Levent, Li, Li, Ma, Malik, Manning, Mirchandani, Mitchell, Munyikwa, Nair, Narayan, Narayanan, Newman, Nie, Niebles, Nilforoshan, Nyarko, Ogut, Orr, Papadimitriou, Park, Piech, Portelance, Potts, Raghunathan, Reich, Ren, Rong, Roohani, Ruiz, Ryan, Ré, Sadigh, Sagawa, Santhanam, Shih, Srinivasan, Tamkin, Taori, Thomas, Tramèr, Wang, Wang, Wu, Wu, Wu, Xie, Yasunaga, You, Zaharia, Zhang, Zhang, Zhang, Zhang, Zheng, Zhou and Liang2021) attempts to compete with industry on what industry does best. We think this is a mistake. Industry has “unfair” advantagesFootnote 40 on tasks like pretraining $f_{pre}$ , which require large investments in people and machines, as shown in Table 8.
We recommend that academics focus on fit and predict, which are much more affordable than pretraining $f_{pre}$. The last two columns in Table 8, time and hardware, obviously depend on many factors such as the size of the model. One of the motivations behind distillation (Hinton et al., Reference Hinton, Vinyals and Dean2015; Ganesh et al., Reference Ganesh, Chen, Lou, Khan, Yang, Sajjad, Nakov, Chen and Winslett2021) is to reduce the size of the model. Smaller models tend to run faster at inference time. Although inference is much faster than training, inference time is often a bottleneck for commercial applications, since training is a one-time investment, whereas inference is a recurring cost. For successful applications with millions or billions of users, recurring costs can easily dominate one-time training costs.
As for training costs, pretraining is much more expensive than fine-tuning, especially for large models. Pretraining is already very expensive and will become even more expensive in the future as models become larger and larger. Pretraining large models will be beyond the means of academics (and governments).
Consider the pretrained models in Table 9, and especially the largest model, PaLM (Chowdhery et al., Reference Chowdhery, Narang, Devlin, Bosma, Mishra, Roberts, Barham, Chung, Sutton and Gehrmann2022). PaLM produces impressive results, using a huge model (540B parameters). That said, the size of the investment is even more impressive: the paper has dozens of authors using thousands of TPUs (distributed over multiple data centers).
When the investments are this large, projects become risk averse. Projects of this size cannot afford to fail. Academics should focus on projects that reward creativity and avoid projects that are too big to fail.
We like to think of $f_{pre}$ as analogous to Intel CPU chips. Universities can afford to program CPUs, but they cannot afford to compete with Intel and fabricate their own CPUs. So too, we argue that universities can afford to fit and predict deep nets, but they cannot afford to compete with industry on $f_{pre}$. When the first author was a student at MIT, his thesis advisor, Jon Allen, urged the university to make large investments in VLSI fabrication. In retrospect, it was probably a mistake for a university to invest in VLSI fabrication, though others may disagree with that assessment.Footnote 41
In short, we recommend users start by downloading $f_{pre}$ from hubs and focus on steps 2 (fit) and 3 (predict) of the standard recipe. Some examples of $f_{pre}$ are shown in Table 9. Many of these models can be downloaded from hubs, with a few exceptions, especially for larger models such as ERNIE 3.0, GPT-3, PaLM. Most models are trained on corpora, as shown in Table 10.