Created: 7 Jun 2016, last rebuilt: 20 May 2017

Finding repetitions in a Russian novel with R

This was an attempt to analyze the Russian classic Голый год (Naked Year) in R, looking for interesting occurrences of repeated text. Since the novel takes considerable artistic liberties, blocks of text much larger than single sentences are also repeated, which this script will not detect.

As is common with these things, most of the work went into making R (and its HTML output, word cloud, and so on) handle UTF-8 text without producing garbage.
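In practice the UTF-8 handling boils down to a few lines like these (a minimal sketch: the locale name is platform-dependent, and the local filename is only illustrative; the actual script below reads the text from a URL):

```r
# Force a UTF-8 locale so string operations and printed output keep Cyrillic intact.
# "en_US.UTF-8" works on Linux/macOS; Windows needs a different locale name.
Sys.setlocale("LC_ALL", "en_US.UTF-8")

# Read the text and explicitly mark it as UTF-8 so tm, wordcloud, etc. treat it correctly.
txt <- readLines("goly_god.txt")   # illustrative filename
Encoding(txt) <- "UTF-8"
```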

Source text was carefully massaged into shape and is available here.

View script output.

---
title: "Boris Pilnyak, _Naked Year_"
output: html_document
---

```{r setup, include=FALSE}
# Use a UTF-8 locale so Cyrillic text survives processing and output.
Sys.setlocale("LC_ALL", "en_US.UTF-8")
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
library(tm)
library(SnowballC)
library(wordcloud)
library(gdata)
library(stringi)
library(ggplot2)
```

```{r init, echo=FALSE}
# Download the prepared plain-text version of the novel and mark it as UTF-8.
goly_god_url <- "http://www.urbansedlar.com/files/pilnyak/goly_god.txt"
data_goly_god <- readLines(goly_god_url)
Encoding(data_goly_god) <- "UTF-8"
```

```{r link, echo=FALSE, warning=FALSE, results='asis'}

cat(paste("Source text available here [", goly_god_url, "](", goly_god_url, ")" ))

```


# Word cloud

```{r wordcloud, echo=FALSE, warning=FALSE}

corpus <- Corpus(VectorSource(data_goly_god),
                 readerControl = list(reader = readPlain, language = "ru"))

# Strip punctuation, numbers and common Russian stop words before counting.
corpus <- tm_map(corpus, removePunctuation)
#corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))

# Build a term-document matrix and collapse it into per-word frequencies.
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
pal2 <- brewer.pal(8, "Dark2")

wordcloud(words = d$word, freq = d$freq, min.freq = 2, max.words = 200,
          random.order = FALSE, rot.per = 0, colors = pal2, scale = c(.9, .9))

```


# Sentence lengths
To get sentences and text fragments, the text is split at the following punctuation (a short illustration of the split follows the list):

* period (.)
* exclamation mark (!)
* dash (–)
* newline
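
For instance, a single made-up line splits into four fragments under this pattern (a toy illustration, not a line from the novel; whitespace is trimmed afterwards):

```r
# Toy illustration of the split pattern used in the chunk below.
strsplit("Первая фраза. Вторая! Третья – и всё", "[.!\n\r–]")[[1]]
# -> "Первая фраза", " Вторая", " Третья ", " и всё"
```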

```{r sentence_len, echo=F, warning=F}

# Split each line at sentence-ending punctuation, flatten, and trim whitespace.
sentences <- strsplit(data_goly_god, "[.!\n\r–]")
sentences <- trim(unlist(sentences))
freqs <- nchar(sentences)

hist(freqs, main="Histogram of sentence lengths")

plot(density(freqs), log="y", main="Density of sentence lengths (log scale)")

```


# Repeating sentences
The sentences obtained by splitting the text were grouped and sorted by repetition count. Sentences that do not repeat were discarded, and of those remaining, 1-character sentences were also dropped.
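
As an aside, the same group-and-count step can be written more compactly with base R's `table()`. The sketch below is an alternative formulation, not the code the chunk below actually runs:

```r
# Split, trim, and count identical fragments in one pass with table().
fragments <- trimws(unlist(strsplit(data_goly_god, "[.!\n\r–]")))
counts    <- sort(table(fragments), decreasing = TRUE)

# Keep only fragments that occur more than once and are longer than one character.
repeated  <- counts[counts > 1 & nchar(names(counts)) > 1]
head(repeated)
```

The chunk below keeps the original explicit loop.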

``` {r repeating_1, echo=FALSE, warning=FALSE,  results='asis'}

sentences <- strsplit(data_goly_god, "[.!\n\r–]")
sentences <- trim(sentences)

sentence_counts <- list()
simple_s <- unlist(sentences)

single_repetition <- 0
single_char <- 0

# Count how often each sentence occurs; keep repeated multi-character sentences,
# and tally how many sentences are rejected and why.
for (i in seq_along(simple_s)) {
  cur_sentence <- simple_s[[i]]
  occurrences <- sum(simple_s == cur_sentence)
  if (occurrences > 1 && nchar(cur_sentence) > 1) {
    sentence_counts[[cur_sentence]] <- occurrences
  }
  if (occurrences == 1) {
    single_repetition <- single_repetition + 1
  } else {
    if (nchar(cur_sentence) == 1) {
      single_char <- single_char + 1
    }
  }
}

cat(paste0("* ", single_repetition, " sentences were rejected because no repetitions were found.\n"))
cat(paste0("* ", single_char, " sentences were excluded because they were only 1 character long, e.g., _И_.\n"))
cat("\n")
cat("\n")
cat("The remaining sentences are sorted in descending order of repetition count:")
cat("\n")
cat("\n")

# Sort the surviving sentences by how often they repeat, most frequent first.
list_names <- names(sentence_counts)
list_vals  <- unlist(sentence_counts, use.names = FALSE)
ordered_val_idxs <- order(list_vals, decreasing = TRUE)
ordered_vals  <- list_vals[ordered_val_idxs]
ordered_names <- list_names[ordered_val_idxs]

for (i in seq_along(ordered_names)) {
  cat(paste0("* ", ordered_vals[i], " repetitions \"", ordered_names[i], "\"\n"))
}

```


# Repeating parts of sentences (also considering commas)
The text was again split into sentences and fragments, this time also splitting at commas. There is a lot of repetition in dependent clauses, e.g.:

* Знойное небо льет знойное марево, и вечером долго будут желтые сумерки.
* Знойное небо льет знойное марево, вечером будут желтые сумерки, – и вечером под холмом вспыхнут костры: это будут голодные варить похлебку, те, что тысячами ползут в степь, за хлебом, и из-под холма понесутся тоскливые песни.
* Знойное небо льет знойное марево.
* Знойное небо льет знойное марево, знойное небо залито голубым и бездонным, цветет день солнцем и зноем, – а вечером будут желтые сумерки, и бьют колокола в соборе: – дон-дон-дон!..


The resulting fragments were again grouped and sorted by repetition count. Fragments that do not repeat were again rejected, as were 1-character fragments.
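
Because this step repeats the previous chunk with only the split pattern changed, the `table()` sketch above also generalises to a small helper that takes the pattern as an argument. Again, this is only a sketch of a possible refactoring, not what the script does:

```r
# Hypothetical helper: split on a given pattern, trim, and keep repeated fragments.
count_repeats <- function(text, pattern) {
  fragments <- trimws(unlist(strsplit(text, pattern)))
  counts <- sort(table(fragments), decreasing = TRUE)
  counts[counts > 1 & nchar(names(counts)) > 1]
}

# Without and with the comma in the split pattern:
count_repeats(data_goly_god, "[.!\n\r–]")
count_repeats(data_goly_god, "[.!\n\r–,]")
```

The actual chunk below simply repeats the earlier loop with the extended split pattern.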


``` {r repeating_2, echo=FALSE, warning=FALSE,  results='asis'}

# Same procedure as above, but the split pattern now also includes the comma.
sentences <- strsplit(data_goly_god, "[.!\n\r–,]")
sentences <- trim(sentences)

sentence_counts <- list()
simple_s <- unlist(sentences)

single_repetition <- 0
single_char <- 0

for (i in seq_along(simple_s)) {
  cur_sentence <- simple_s[[i]]
  occurrences <- sum(simple_s == cur_sentence)
  if (occurrences > 1 && nchar(cur_sentence) > 1) {
    sentence_counts[[cur_sentence]] <- occurrences
  }
  if (occurrences == 1) {
    single_repetition <- single_repetition + 1
  } else {
    if (nchar(cur_sentence) == 1) {
      single_char <- single_char + 1
    }
  }
}

cat("\n")
cat("Below are the repeating parts of sentences, sorted in descending order of repetition count:")
cat("\n")
cat("\n")

# Sort the surviving fragments by how often they repeat, most frequent first.
list_names <- names(sentence_counts)
list_vals  <- unlist(sentence_counts, use.names = FALSE)
ordered_val_idxs <- order(list_vals, decreasing = TRUE)
ordered_vals  <- list_vals[ordered_val_idxs]
ordered_names <- list_names[ordered_val_idxs]

for (i in seq_along(ordered_names)) {
  cat(paste0("* ", ordered_vals[i], " repetitions \"", ordered_names[i], "\"\n"))
}

```