Text Analysis in R - Take Home

Elio Bolliger

2021-03-31

1 Preliminaries

In this exercise, we use a variety of text analysis and machine learning tools to predict the sector of a loan application and the loan amount. For this, we use the data from the KIVA crowdfunding company. KIVA is a loan platform that allows financially excluded parties to access fair and affordable sources of credit. The data includes detailed loan descriptions, the loan amount and the sector the loan is used for.

Regarding the prediction of our binary variable, the sector, we decided to merge the agriculture and food sectors as they are closely intertwined. To predict the outcomes, we exploit the text information in the loan description for both the binary and the continuous outcome variable.

Section 1.1 describes the data selection and cleaning process. Section 2 provides some descriptive statistics. Section 3 briefly discusses the modelling approaches and methods used in the analysis, and section 4 presents the results. Section 5 concludes.

1.1 Cleaning

Due to the large database, we decided to focus on the first 100’000 observations of the loan data. We perform some conventional cleaning, which includes standardizing the variable names and removing HTML tags and similar chunks from the description data as well as redundant white space. Further, we focus on English loan descriptions only in this exercise. To ensure that we only use English loan descriptions (or their English translations), we detect the language automatically and keep only those that are in English.[^There have been some loans where the original description was mixed with the translated loan description. However, the number of such loans is negligible.]
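The tag and white-space clean-up can be sketched in base R as below; the function name is our own, and the language-detection step (done with a package such as cld2) is assumed to happen separately.

```r
# Minimal sketch of the description clean-up: strip HTML tags and
# collapse redundant white space with base-R regexes.
clean_description <- function(x) {
  x <- gsub("<[^>]+>", " ", x)   # drop HTML tags such as <br />
  x <- gsub("\\s+", " ", x)      # collapse runs of white space
  trimws(x)
}

clean_description("Susan sells   frozen foods.<br /><br />She works hard.")
# "Susan sells frozen foods. She works hard."
```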

Further, we harmonize the time stamps of the different data indicators. We have a time variable for when the loan was posted on the website (postedtime), an indicator for the planned expiration time of the loan (the deadline for the funding, plannedexpirationtime), the disbursetime, which is the time when the money was disbursed to the borrowers, as well as the time at which the funds were raised (raisedtime). For all time variables we use the UTC time zone.1
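The conversion to UTC can be done with base-R date-time handling; the example below is a sketch with a made-up time stamp and time zone.

```r
# Parse a local time stamp and express it in UTC (illustrative values).
posted     <- as.POSIXct("2014-01-15 08:30:00", tz = "America/New_York")
posted_utc <- format(posted, tz = "UTC", usetz = TRUE)
posted_utc
# "2014-01-15 13:30:00 UTC"
```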

Finally, we discard some of the variables that we do not use in this exercise. To get a first impression of our main variables of interest, we show the distribution of the loan amount data and the number of loans for each sector. Note that there are a few loans above 10’000 dollars but the histogram is truncated. As mentioned, we will group the food and agricultural sectors together due to their natural linkage.


Figure 1.1: Distribution of loan amount and number of loans per sector

2 Descriptive Statistics

To give some first impressions of the text data, we create a so-called corpus which consists of an id corresponding to the loan id and the text, which is in our case the description of the loan. We can have a look at the first two entries of the corpus below.

$`1336583`
[1] "As married parent of six children, Susan works hard to support her family. She has frozen foods business in the Philippines. Susan is borrowing PHP 10,000 through NWTF in order to buy more frozen foods to sell to customers. Susan has been sustaining her business activities through her past loans from NWTF. She hopes that her hard work will help her attain her dream: to save enough money so she can afford to send her children to college."

$`290654`
[1] "Dinah, 42, is married and has five children. She owns house in Kasese, Uganda. Dinah has been in the bar and hotel business for eleven years, and it’ still booming. Her problem is drunk customers who break her glasses. Dinah wants loan to buy seats and more drinks to sell. Her vision is to move to bigger bar, build house and educate her children."

To get some meaningful descriptive statistics, we further clean the corpus. More specifically, we will remove punctuation, numbers, stopwords (words like me, her, she, etc. that do not contain any interesting information for this exercise) and a small list of specific words (for example, kiva, loan, etc.). Moreover, we will try to get rid of the names in the description using a predefined list of common names. Finally, all text will be transformed to lower case.
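The cleaning steps can be sketched in base R as below (the report itself uses the tm package); the stop-word and custom-word lists here are small illustrative stand-ins for the real ones.

```r
# Base-R sketch of the corpus cleaning: lower-casing, removing
# punctuation, numbers, stop words and a custom word list.
stop_words <- c("me", "her", "she", "has", "a", "to")   # illustrative
custom     <- c("kiva", "loan")                          # illustrative

clean_tokens <- function(text) {
  text   <- tolower(text)
  text   <- gsub("[[:punct:]]|[0-9]", " ", text)  # punctuation and numbers
  tokens <- strsplit(trimws(gsub("\\s+", " ", text)), " ")[[1]]
  tokens[!tokens %in% c(stop_words, custom)]
}

clean_tokens("She has a Kiva loan of PHP 10,000 to buy frozen foods.")
# "of" "php" "buy" "frozen" "foods"
```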

We will also create a corpus that contains the stemmed words only. Sometimes, this leads to more meaningful representations. For example, stemming reduces words like fishing, fished or fisher to the common stem fish. Hence, a standard corpus would represent each of these words separately, whereas the stemmed version represents only the stem fish.

2.1 Frequency of words and Word Clouds

To get a first impression of the loan description data, we can have a look at the most frequent words in the text as shown below. We show both the frequency plots for standard non-stemmed and the stemmed corpus.


Figure 2.1: Most frequent Words

The frequency plots show, in general, a similar picture. For this reason, we will only show the word clouds for the non-stemmed corpus but keep the stemmed corpus for further analysis. We display the 100 most frequent words in a word cloud, where more frequent words appear in a bigger font size.

Wordclouds

Figure 2.2: Wordclouds

2.2 Document-Term-Matrix

Some of our modelling approaches require that the text information of the documents is represented in a so-called document-term matrix (DTM). This matrix records, for each document (rows), how often it contains each term (columns).
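A toy DTM can be built in base R to illustrate the structure (the report itself uses the tm package; documents and terms below are made up):

```r
# Rows are documents, columns are terms, entries are term counts.
docs  <- list(d1 = c("buy", "frozen", "foods", "buy"),
              d2 = c("build", "house"))
terms <- sort(unique(unlist(docs)))
dtm   <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
dtm
```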

Regarding the DTM using unigrams, we have in total 69043 terms that appeared at least once. Below, we get a first impression of the document-term matrix.

<<DocumentTermMatrix (documents: 96544, terms: 69043)>>
Non-/sparse entries: 4798978/6660888414
Sparsity           : 100%
Maximal term length: 67
Weighting          : term frequency (tf)
Sample             :
         Terms
Docs      business buy children family income lives married old school years
  1078304        7   0        2      3      0     1       3   3      3     1
  1078614        0   0        3      0      0     0       2   3      2     5
  1084424        9   1        0      1      0     0       0   1      0     3
  1150438        3   0        0      3      0     0       0   2      2     4
  268424        14   2        4      1      0     3       0   1      4     2
  398174         2   9        2      2      0     1       3   8      2     9
  4591           0   2        0      0      0     1       0   1      0     0
 [ reached getOption("max.print") -- omitted 3 rows ]

Digression

Besides focusing on so-called 1-grams, which correspond to single tokens (in our case words), we can also consider combinations of two words. A 2-gram takes into account the combination of one word and its subsequent neighbor. As an example, the sentence “I go home.” in the basic DTM case would simply contain all single tokens \(\{ \text{I,go,home}\}\), whereas a 2-gram representation always takes into account the combination of two words, hence \(\{ \text{I.go, go.home}\}\). In practice, 1-grams, 2-grams or even 3-grams are used, but here we focus on 1-grams (unigrams) and 2-grams (bigrams).
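The bigram construction described above can be sketched in base R; the function name is our own.

```r
# Pair each token with its successor, joined by a dot as in the text.
bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1), sep = ".")
}

bigrams(c("I", "go", "home"))
# "I.go" "go.home"
```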

The wordcloud representation of the bigram DTM is shown in figure 2.3.
Wordcloud with 2gram

Figure 2.3: Wordcloud with 2gram


The representation of text in a DTM allows us to look at correlations between certain words, since every word corresponds to a column with count information for each document. For example, we can look at the correlation of the word “business” with the other columns of the matrix. Below, we report the words that correlate at least 0.2 with the vector corresponding to the word “business”.

$business
       exp   operates        php activities additional    ability     engage 
      0.32       0.28       0.25       0.21       0.21       0.21       0.21 
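These associations can be recomputed from the count matrix directly, since each term is a column vector of per-document counts (the report uses findAssocs() from the tm package; the counts below are a made-up toy example):

```r
# Toy count matrix: correlation between the "business" column and others.
dtm <- cbind(business = c(7, 0, 9, 3, 14),
             php      = c(3, 0, 4, 1, 6),
             school   = c(0, 2, 0, 1, 0))
cor_business <- cor(dtm[, "business"], dtm[, "php"])
cor_business   # high, since the toy counts move together
```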

At this point, the DTM has large dimensions of 96544 \(\times\) 69043. We have different methods at hand to reduce the dimensionality of our DTM. On the one hand, we can simply remove words that have a low frequency across documents (with the function removeSparseTerms()), or we can calculate the so-called term frequency-inverse document frequency (tf-idf). Moreover, we will discuss a third option (GloVe) in more detail in section 3.

In this report, we will focus on the sparse DTM. More precisely, we use a sparsity of 0.975, which means that a term \(j\) is retained if its \(\text{document-frequency}_j > N*(1-0.975)\), with \(N\) being the total number of documents. Hence, the higher our sparsity parameter, the lower the right-hand side of the threshold above and the more words (or columns) will be kept.

This approach reduced the document-term matrix dimensions to 96544 \(\times\) 362.
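The retention rule behind removeSparseTerms() can be sketched in base R; the document frequencies below are made up.

```r
# With sparsity s, a term is kept if it appears in more than
# N * (1 - s) documents.
keep_terms <- function(doc_freq, n_docs, sparse = 0.975) {
  doc_freq > n_docs * (1 - sparse)
}

# With 96544 documents and sparsity 0.975, the cut-off is about 2414.
keep_terms(c(rare = 10, common = 5000), n_docs = 96544)
# rare: FALSE, common: TRUE
```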

For the sake of completeness, we also show an example of a DTM using tf-idf weights. tf-idf weights account for the fact that some words are in general very frequent. Hence, the weight increases proportionally to the number of times a word appears in the document, but it is offset by the number of documents in the corpus that contain the word. Formally, tf-idf is

\[ \text{tf}_{ij} \times \text{idf}_{i} = \frac{c_{ij}}{\sum_i c_{ij}} \times \ln \left(\frac{N}{\sum_j \mathbf{1}\left\{ c_{ij} > 0\right\}} \right)\] with \(i\) referring to the token and \(j\) to the document, and \(c_{ij}\) referring to the count of token \(i\) in document \(j\). The first factor normalizes the count by the total number of tokens in document \(j\); the second decreases in the number of documents containing token \(i\).
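The weighting can be computed in base R for a toy count matrix with documents in rows and tokens in columns (the report itself relies on the tm package's weighting functions):

```r
# tf-idf: term share within a document times log inverse document frequency.
tfidf <- function(counts) {
  tf  <- counts / rowSums(counts)                 # term share within document
  idf <- log(nrow(counts) / colSums(counts > 0))  # rarity across documents
  sweep(tf, 2, idf, `*`)
}

counts <- rbind(d1 = c(business = 2, fish = 0),
                d2 = c(business = 1, fish = 3))
tfidf(counts)   # "business" gets weight 0 since it appears everywhere
```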

After having calculated the tf-idf for all terms, we can use these weights to filter for the words that have at least a tf-idf of 1.4.

This procedure reduced the dimensions of the DTM to 96544 \(\times\) 309, compared to the original DTM of 96544 \(\times\) 69043. We can have a look at some of the most frequent words,

[1] "business" "buy"      "children" "family"   "married"  "years"    "old"     
[8] "income"  

compared to the tf-idf weighted DTM, which are

[1] "leony"   "merilou" "racel"   "niña"    "shohruh" "chella"  "jelita" 

For further analysis we will use two different DTMs, namely the standard DTM and the DTM with stemmed words. Due to the high dimensionality of these matrices, we will use the sparse versions of both.

3 Modelling approaches

We briefly discuss the two models used to predict the binary and continuous outcome variables, namely the Lasso regression and the Random Forest, as well as Global Vectors for Word Representation (GloVe) as an alternative way to reduce the dimensionality of the text data.

3.1 Lasso

Lasso shrinks coefficients that do not have enough predictive power. More precisely, the Lasso method adds a penalty equal to the sum of the absolute values of the coefficients. In the case of text, this reduces the number of tokens from our DTM taken into account as regressors. In contrast to Ridge regression, Lasso can shrink coefficients exactly to zero.

For the binary outcome variable, we will use a penalized logit regression whereas for the continuous outcome variable we use a penalized linear regression. The latter can formally be written as

\[\text{argmin}_{\beta} \sum_i (Y_i - X_i' \beta)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \] with \(Y_i\) being the outcome (in the continuous case the loanamount of loan application \(i\)), \(X_i\) a vector containing the predictors, and \(\lambda\) the mentioned penalty parameter.
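The absolute-value penalty is what produces exact zeros: in the special case of an orthonormal design, the Lasso solution is the OLS estimate passed through a soft-thresholding operator. A minimal base-R sketch:

```r
# Soft-thresholding: shrink each coefficient toward zero by lambda and
# clip at zero, so coefficients smaller than lambda drop out entirely.
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

soft_threshold(c(2.5, -0.3, 0.1), lambda = 0.5)
# 2 0 0
```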

3.2 Random Forest

The Random Forest is based on decision trees which partition the data according to a given variable. Each partition, in turn, corresponds to a node. The underlying rule of the split relies on a certain criterion that is usually minimized. In the case of a binary outcome, this would be the Gini impurity index and in the case of a continuous outcome, it would be the sum of squared residuals. However, decision trees are sensitive to the data they are trained on. Small changes can lead to significant changes in the outcome.

Random Forests build on decision trees but let each individual tree randomly sample from the dataset with replacement, resulting in different trees (so-called bagging or bootstrap aggregation). Random Forest combines bagging with feature randomness, which means that before each split, a tree can only pick from a random subset of features. At the end, the prediction results from averaging over the bootstrapped trees.
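The bagging idea can be shown in miniature with base R; the "trees" below are just majority-class stumps on toy data, and real forests additionally randomize the features considered at each split.

```r
# Bagging sketch: each "tree" is fitted on a bootstrap sample of the
# rows; predictions are aggregated by majority vote.
set.seed(1)
y <- c(0, 0, 1, 1, 1, 0, 1, 1, 0, 1)   # toy binary outcome
n <- length(y)

tree_preds <- replicate(3, {
  boot <- sample(n, n, replace = TRUE)  # bootstrap sample (bagging)
  as.integer(mean(y[boot]) >= 0.5)      # stump: majority class of sample
})
majority_vote <- as.integer(mean(tree_preds) >= 0.5)
```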

3.3 GloVe

GloVe is an unsupervised learning algorithm that operates on global co-occurrence statistics, meaning that it evaluates words by their ratios of co-occurrence probabilities with other words. Using the co-occurrence ratios, we learn word vectors, which we then average over the documents.

4 Results

4.1 Binary Outcome - Lasso

First, we estimate the lasso regression for our binary outcome. We predict whether the loan belongs to the food or agriculture sector. This outcome variable takes the value 1 if the loan is from the agricultural or food sector, and zero otherwise. To proceed, we first have to define a training and a validation set. We train the model using the training data, and validate the model using our hold-out set of test data.

Among others, one reason to use two different datasets for this process is to avoid overfitting. We don’t want the model to fit patterns that are only particular to a selected sample but are not present in other samples. In brief, the test dataset provides an unbiased evaluation of the model fit estimated with the training data.

To start with, we have to convert the sparse DTM into a dataframe which indicates, for each word, the number of times it appeared in the corresponding document. Using this dataframe, we can also analyze the distribution of the frequency of words over the documents. As an example, we show in figure 4.1 the distribution of the word “woman” among the documents.

Figure 4.1: Distribution of the word “woman” in the data matrix

We construct an alternative DTM that transforms the count information from the document-term matrix into a simple binary matrix. In most applications, the count data contains little information beyond a simple binary indicator. However, we will estimate most models using both the count and the binary information. To all new dataframes, we add the dummy information about the loan sector (agriculture or food versus other) and the continuous variable loanamount.

For the basic DTM, we have 44748 agricultural or food loans and 51796 from other sectors. Next, we define the training and test datasets. For reproducibility, we will use a seed of 100 and then select a random sample of document ids. The size of the test sample will then be 19308.
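The split can be sketched as follows, assuming a 20% test share (which matches the reported sample sizes); the variable names are our own.

```r
# Draw the test ids with the seed from the text; the remaining
# documents form the training set.
set.seed(100)
n_docs   <- 96544
test_ids <- sample(seq_len(n_docs), size = floor(0.2 * n_docs))
length(test_ids)
# 19308
```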


Basic DTM

For the first few models, we will discuss the outcomes in a bit more detail.

The final size of our training data is 77236 \(\times\) 362 and of the test sample 19308 \(\times\) 362. Before estimating the model, we justify the parameter choices for our Lasso set-up.

We chose \(\alpha = 1\) to indicate that we estimate a Lasso regression (and not a Ridge regression, nor a mix of the two). In empirical applications, Lasso has proven to work well. Second, we will use the minimum \(\lambda\) for prediction. As explained before, the Lasso regression shrinks the coefficients by adding a penalty. This penalty can vary. For this reason, we use 10-fold cross-validation to choose the value of \(\lambda\).

We can plot the binomial deviance with respect to the value of \(\lambda\) in the figure 4.2. The intervals in the plot estimate the variance of the loss metric (binomial deviance in this case, red dots), which is a measure of goodness of fit. The vertical lines show the locations of the minimum \(\lambda\) and \(\lambda\) 1se. The numbers across the top are the number of nonzero coefficient estimates for the different values of \(\lambda\). The smaller the penalty is, the more coefficients we will take into account.

Lasso Regression

Figure 4.2: Lasso Regression

Note that the \(\lambda\) 1se stands for the largest value of lambda such that the error is within 1 standard error of the minimum. To evaluate the performance of the model, we will use both \(\lambda\)’s. The confusion matrix shows us the accuracy of the overall predictions (of all predictions, how many were correct). Furthermore, we have the sensitivity and the specificity, as well as other indicators. In our case, the sensitivity shows the share of correct predictions of non-agricultural sectors of all observations with reference category being 0 (hence, non-agricultural sectors) whereas the specificity indicates the share of correct predictions of the agricultural sectors of all observations with reference category being 1 (hence, agricultural sector). We present the confusion matrix that summarizes the predictions having used the minimum \(\lambda\) and the \(\lambda\) 1se, respectively.

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 9350 1619
         1 1099 7240
                                          
               Accuracy : 0.8592          
                 95% CI : (0.8542, 0.8641)
    No Information Rate : 0.5412          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7153          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8948          
            Specificity : 0.8172          
         Pos Pred Value : 0.8524          
         Neg Pred Value : 0.8682          
             Prevalence : 0.5412          
         Detection Rate : 0.4843          
   Detection Prevalence : 0.5681          
      Balanced Accuracy : 0.8560          
                                          
       'Positive' Class : 0               
                                          
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 9390 1656
         1 1059 7203
                                          
               Accuracy : 0.8594          
                 95% CI : (0.8544, 0.8643)
    No Information Rate : 0.5412          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7154          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.8987          
            Specificity : 0.8131          
         Pos Pred Value : 0.8501          
         Neg Pred Value : 0.8718          
             Prevalence : 0.5412          
         Detection Rate : 0.4863          
   Detection Prevalence : 0.5721          
      Balanced Accuracy : 0.8559          
                                          
       'Positive' Class : 0               
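The headline statistics of the first confusion matrix above can be recomputed directly from its cell counts:

```r
# Predictions in rows, reference in columns; the 'positive' class is 0.
n00 <- 9350; n01 <- 1619   # predicted 0: reference 0 / reference 1
n10 <- 1099; n11 <- 7240   # predicted 1: reference 0 / reference 1

accuracy    <- (n00 + n11) / (n00 + n01 + n10 + n11)
sensitivity <- n00 / (n00 + n10)   # reference-0 cases predicted as 0
specificity <- n11 / (n01 + n11)   # reference-1 cases predicted as 1
round(c(accuracy, sensitivity, specificity), 4)
# 0.8592 0.8948 0.8172
```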
                                          

In what follows, we will not separately show the entire output of the confusion matrix for every model. For readability, we will present a summary table. In the text, we will briefly discuss the different modeling approaches and argue why they might be better or worse than other models. Note that we will use the sparse DTM in all modeling approaches to save computation time. As there is no big difference between using the minimum \(\lambda\) or 1se, we will focus on the minimum \(\lambda\).


Below we show a summary table of all models. Note that the table shows only performance outputs of the lasso regressions regarding the binary outcome variables. Binary refers to the dummy indicator of a token, whether it has appeared in a document or not. Count refers to the number of times the word has been counted in a certain document. Additional variables refer to the variables we added to the text information matrix. In more detail, we added information about the funded amount, the country, the number of lenders and the gender of the borrower.

Table 4.1: Lasso Regression Models - Binary Dependent Variable
                                      Accuracy  Sensitivity  Specificity
Baseline DTM, binary, lambda min          85.9         89.5         81.7
Baseline DTM, binary, lambda 1se          85.9         89.9         81.3
Baseline DTM, count                       86.5         90.3         82.1
Stemmed DTM, binary                       87.1         90.0         83.7
Stemmed DTM, count                        88.5         91.3         85.3
2gram DTM, binary                         66.9         79.7         51.7
2gram DTM, count                          66.7         79.8         51.1
Baseline DTM, additional variables        86.0         89.6         81.9
Stemmed DTM, additional variables         88.5         91.2         85.4
2gram DTM, additional variables           67.3         78.3         54.5

The first model in the table corresponds to the Lasso regression with the sparse DTM where we used binary token indicators and the minimum \(\lambda\), whereas for the second model we used the \(\lambda\) 1se. The differences are marginal, which is also why we continued to use the minimum \(\lambda\) in all subsequent Lasso models. The small difference suggests that the regressors dropped at \(\lambda\) 1se did not contain much additional distinct information. From the third model, we also see that for the standard DTM, count information does not add much over simple binary token indicators.

The DTM that are stemmed do not seem to contain much more useful information to predict the sector than the baseline DTM. Also for models with the stemmed DTM, count information does not contain a lot more information than the binary tokens.

The approach focusing on 2-grams rather than 1-grams only is clearly less precise than the other approaches. Even though the literature has shown that 2-gram approaches can produce good results, in this particular exercise with Lasso they do not perform well. This may partially be due to the standardized questionnaires, which could make 2-gram information less precise and less distinguishable.

Finally, adding some additional variables has not really improved the performance of the models. Arguably, the funded amount, the country, the number of lenders and the gender of the borrower are already mirrored by the text information. Information such as weather, political risk in the country or GDP would certainly add more information to the context.

4.2 Continuous Outcome - Lasso

We apply the methods also to the continuous outcome where we predict the loan amount. Below, we display a short summary of the loan amount distribution.
               Mean  Std.Dev  Min     Max  N.Valid
Loan Amount  812.82  1105.72   25  100000    96544

Contrary to the binary outcome, we cannot create a confusion matrix. Hence, we will focus on other indicators to evaluate the precision of the model. More precisely, we will calculate the Root Mean Squared Error (RMSE), the \(R^2\) and the Mean Absolute Error (MAE) to compare the fit of the models.
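The three fit measures can be written in base R as below; the \(R^2\) definition shown is the usual one-minus-residual-variance version, and the toy values are made up.

```r
# Fit measures for continuous predictions.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
mae  <- function(y, yhat) mean(abs(y - yhat))
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

y    <- c(500, 1000, 1500)   # toy loan amounts
yhat <- c(600,  900, 1400)   # toy predictions
c(RMSE = rmse(y, yhat), R2 = r2(y, yhat), MAE = mae(y, yhat))
# RMSE 100, R2 0.94, MAE 100
```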

Table 4.2: Lasso Regression Models - Continuous Dependent Variable
                                    Root Mean Squared Error  \(R^2\)  Mean Absolute Error
Baseline DTM, binary                                  941.9      0.3                518.7
Baseline DTM, count                                   941.8      0.3                521.1
Stemmed DTM, binary                                   946.8      0.3                518.2
Stemmed DTM, count                                    942.7      0.3                518.4
2gram DTM, binary                                    1068.0      0.1                593.2
2gram DTM, count                                     1067.1      0.1                591.4
Baseline DTM, additional variables                    663.0      0.7                298.9
Stemmed DTM, additional variables                     665.3      0.7                298.5
2gram DTM, additional variables                       704.5      0.6                305.8

Clearly, estimating the amount of the loan using only textual input is very difficult. This is partially due to our cleaning process, in which we excluded all numbers from the corpus. However, the idea of this exercise is to use the text input and not necessarily the numbers in the corpus. Otherwise, we could simply search every loan description for numbers with a regex and use them as a proxy for the loan amount. We think it is more interesting to ask whether certain characteristics of the text give some idea of the loan amount.

Contrary to what we observed in the binary outcome prediction, additional variables about the funded amount, the gender, the country and the number of lenders substantially increased the prediction quality of the model. The \(R^2\) increased strongly for models with additional variables. At the same time, the Root Mean Squared Error and the Mean Absolute Error decreased with additional variables. Similar to the binary outcome variable, we could certainly improve the performance by including more informative variables.

Overall, the model estimated using the stemmed DTM shows marginally better predictions than the standard DTM or 2-gram DTM. Nevertheless, the magnitude does not allow us to conclude that the stemmed model performs significantly better than the standard DTM. Also, differences between count and binary token information are marginal.

4.3 Binary Outcome - Random Forest

In this section, we apply a Random Forest to the binary outcome. We briefly motivate the chosen options below. We use 200 trees to save computation time. To tune the model, we consider between 10 and 30 randomly selected variables per split. Also, we set the resampling method to “oob” (out-of-bag). In exercises not shown in this report, this method has proven to perform similarly well as cross-validation but with less computation time. For the sake of simplicity, we will stick to this method. Further, we set the splitrule to “gini” for the binary outcome and to “variance” for the continuous outcome.

A generic plot for the model where we used the baseline DTM with binary tokens and count tokens is displayed below. It shows the accuracy of the random forest model depending on the number of randomly selected predictors.


Figure 4.3: Random Forest Plots

Looking at the output of the Random Forest model, we see different accuracies depending on the tuning parameters.

Random Forest 

77236 samples
  362 predictor
    2 classes: '0', '1' 

No pre-processing
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
  10    0.8705396  0.7389295
  11    0.8718991  0.7417801
  12    0.8727795  0.7436191
  13    0.8735564  0.7452363
  14    0.8747605  0.7476740
  15    0.8754467  0.7490885
  16    0.8740872  0.7463513
  17    0.8742814  0.7467747
  18    0.8751748  0.7485681
  19    0.8750065  0.7482518
  20    0.8753043  0.7488400
  21    0.8755244  0.7493087
  22    0.8750971  0.7485062
  23    0.8754596  0.7492435
  24    0.8746181  0.7475132
  25    0.8751877  0.7486947
  26    0.8760682  0.7504527
  27    0.8752525  0.7488301
  28    0.8756927  0.7496998
  29    0.8761070  0.7505819
  30    0.8761458  0.7506523

Tuning parameter 'splitrule' was held constant at a value of gini

Tuning parameter 'min.node.size' was held constant at a value of 5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were mtry = 30, splitrule = gini
 and min.node.size = 5.

For all models, we will select the best tune and evaluate the model using the confusion matrix. We display the results of all models estimated in the table below and discuss them briefly.

Table 4.3: Random Forest Models - Binary Dependent Variable
                                      Accuracy  Sensitivity  Specificity
Baseline DTM, binary                      88.6         90.3         86.6
Baseline DTM, count                       88.9         90.7         86.8
Stemmed DTM, binary                       89.9         90.2         89.5
Stemmed DTM, count                        90.7         91.3         90.0
2gram DTM, binary                         68.4         76.9         58.5
2gram DTM, count                          68.4         76.3         59.1
Baseline DTM, additional variables        88.5         89.9         86.7
Stemmed DTM, additional variables         90.6         91.1         89.9
2gram DTM, additional variables           71.3         74.8         67.2

Without additional variables, the standard DTM and the stemmed DTM perform better than the 2-gram DTM, likely for similar reasons as in the Lasso models. Adding variables improved the performance mainly for the 2-gram models; for the baseline and stemmed DTM, the gains are marginal. Also here, more meaningful additional variables could certainly improve these results.

4.4 Continuous Outcome - Random Forest

We also apply the Random Forest to the continuous outcome. Besides the split rule, where we use the option variance, the parameters are the same.

Table 4.4: Random Forest Models - Continuous Dependent Variable
                                    Root Mean Squared Error  \(R^2\)  Mean Absolute Error
Baseline DTM, binary                                  870.5      0.4                442.0
Baseline DTM, count                                   871.2      0.4                436.5
Stemmed DTM, binary                                   871.5      0.4                444.0
Stemmed DTM, count                                    867.7      0.4                435.4
2gram DTM, binary                                    1029.2      0.1                549.4
2gram DTM, count                                     1026.7      0.2                547.1
Baseline DTM, additional variables                    587.0      0.8                270.4
Stemmed DTM, additional variables                     578.2      0.8                259.6
2gram DTM, additional variables                       572.1      0.7                250.2

First, comparing the results to the Lasso regressions, we observe that the Random Forest performs slightly better. This might be due to the high flexibility of this class of models. However, the differences are only marginal.

Second, additional variables again add a lot of information to the predictions. They increase the precision substantially. Similar to the Lasso regressions and to the binary outcome variable, the 2-gram DTM does not seem to contain much specific information for loan amount predictions, whereas the stemmed DTM again performs slightly better than the standard DTM. Once more, there is not much difference between count and binary token information.

4.5 GloVe

Our last approach uses Global Vectors for Word Representation to reduce the dimensionality of the DTM and then uses the document embeddings in the regression. Since we are looking at single words and their co-occurrence probabilities, we do not need to clean the text as extensively as we did for the DTM. Hence, we only filter out some of the HTML tags and special characters. We then tokenize the entire loan description data and create an iterator over the tokens. This allows us to define a vocabulary, which also contains the counts of the words in the loan applications.
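The term co-occurrence matrix that GloVe is fitted on can be illustrated in base R with a symmetric window of one word (the report builds it with a dedicated package; the tokens below are made up):

```r
# Count how often each pair of vocabulary words appears side by side.
tokens <- c("buy", "frozen", "foods", "buy", "frozen")
vocab  <- sort(unique(tokens))
tcm    <- matrix(0, length(vocab), length(vocab),
                 dimnames = list(vocab, vocab))
for (i in seq_len(length(tokens) - 1)) {
  a <- tokens[i]; b <- tokens[i + 1]
  tcm[a, b] <- tcm[a, b] + 1   # symmetric window of one word
  tcm[b, a] <- tcm[b, a] + 1
}
tcm
```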

We exclude words that occur 5 times or less over all documents to eliminate very sparse words. Using that vocabulary, we then create a term co-occurrence matrix. We fit the GloVe model (with a rank of 50) and create the word-vector contexts. We search for common terms between the word-vector contexts and the original document-term matrix. Finally, we calculate average document vectors and use these document embeddings in our prediction models. For simplicity, we only estimate the Lasso regression and the Random Forest using the sparse DTM (no count token information and no 2-gram DTM). As usual, we predict the continuous and binary outcome variables. For Lasso, we again use the minimum \(\lambda\), and for the Random Forest we use the same options as in the previous models. We summarize the results for both outcome variables in the tables below.

Table 4.5: Average Document Vectors with GloVe - Binary Dependent Variable
                          Accuracy  Sensitivity  Specificity
GloVe with Lasso              66.8         74.4         57.8
GloVe with Random Forest      65.3         74.7         54.2

Table 4.6: Average Document Vectors with GloVe - Continuous Dependent Variable
                          Root Mean Squared Error  \(R^2\)  Mean Absolute Error
GloVe with Lasso                            811.8      0.1                616.2
GloVe with Random Forest                    820.4      0.1                596.1

We observe that for both the binary and the continuous outcome, using only the average document vectors derived with GloVe does not deliver very precise predictions, neither with Lasso nor with Random Forest. This might be due to the DTM we used: given the large number of observations, we had to rely on the sparse DTM, in which a lot of information is discarded. Another reason for the rather low precision relative to the other methods presented in this report is the quite standardized set-up of the loan descriptions. If documents are very similar to each other, so are their word embeddings, and hence the documents are not easily distinguishable. To summarize, it would be interesting to compare the outcome with pre-trained GloVe word embeddings and to use a less sparse DTM.

5 Conclusion

In this project, we tested several text analysis tools using the loan descriptions of the Kiva dataset. Our goal was twofold: first, to predict the binary outcome variable of the sector (agricultural and food sector versus others), and second, the continuous outcome variable of the loan amount. We used several methods to reduce the dimensionality of the text information. First, we used different sparse document-term matrices (DTM) to represent the information in the loan description. The different DTM approaches included using the standard body of the description (focusing on single tokens, hence 1-grams), a stemmed version, as well as 2-grams. Moreover, we used Global Vectors for Word Representation (GloVe) and averaged the word vectors over the documents. Also, we used additional variables in the dataset, such as the gender of the borrower and the number of lenders, to complement the text analysis.

To predict the outcomes, we focused on Lasso regressions and Random Forest models. For the binary outcome variable, we reached a maximum of about 90% overall accuracy, with marginally better accuracy rates for the Random Forest models. For the continuous outcome variable, the Random Forest also proved to be slightly more accurate in its predictions, likely due to its inherent flexibility.

In general, we found the text to contain important information for both the binary and the continuous outcome variable. Also, the Random Forest model, on average, predicts the outcomes more precisely than the Lasso regression, even though the differences are marginal in most cases.

Finally, note that we could extend the methods in several ways. First, we could increase the number of loans considered; due to limited computation power, we had to focus on only 10% of all loans in the database. Second, we could have increased the dimensionality of the DTM matrices. Third, it would have been interesting to include more variables, such as GDP, meteorological information, or political risk in the corresponding countries or regions. Fourth, we could have used pre-trained embeddings for GloVe instead of training on our text descriptions only; pre-trained word embeddings might carry better information, as they are not trained on such standardized loan description data as in this report. Fifth, we could have increased the number of words taken into account in the DTM. Using more powerful computers, the amount of text information could have been expanded, which would arguably have led to better results.


  1. The funds may be given to each borrower before, during or after the individual loan is posted on Kiva. Most partners give the funds out before the loan is posted (pre-disbursal).↩︎