Created: 7 Jun 2016, last rebuilt: 20 May 2017

Finding repetitions in a Russian novel with R

This was an attempt to analyze the Russian classic Голый год (Naked Year) in R, looking for interesting occurrences of repeated text. Since the novel takes considerable artistic liberties, blocks of text much larger than single sentences are also repeated, which this script will not detect.

As is common with these things, most of the work went into making R (and its HTML output, word cloud, and so on) handle UTF-8 text without producing garbage.
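In practice the UTF-8 handling boils down to a few lines like these (a minimal sketch: the locale name is platform-dependent, and the local filename is only illustrative; the actual script below reads the text from a URL):

```r
# Force a UTF-8 locale so string operations and printed output keep Cyrillic intact.
# "en_US.UTF-8" works on Linux/macOS; Windows needs a different locale name.
Sys.setlocale("LC_ALL", "en_US.UTF-8")

# Read the text and explicitly mark it as UTF-8 so tm, wordcloud, etc. treat it correctly.
txt <- readLines("goly_god.txt")   # illustrative filename
Encoding(txt) <- "UTF-8"
```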

Source text was carefully massaged into shape and is available here.

View script output.

---
title: "Boris Pilnyak, _Naked Year_"
output: html_document
---

```{r setup, include=FALSE}
# Use a UTF-8 locale so Cyrillic text survives processing and output.
Sys.setlocale("LC_ALL", "en_US.UTF-8")
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
library(tm)
library(SnowballC)
library(wordcloud)
library(gdata)
library(stringi)
library(ggplot2)
```

```{r init, echo=FALSE}
# Download the prepared plain-text version of the novel and mark it as UTF-8.
goly_god_url <- "http://www.urbansedlar.com/files/pilnyak/goly_god.txt"
data_goly_god <- readLines(goly_god_url)
Encoding(data_goly_god) <- "UTF-8"
```

```{r link, echo=FALSE, warning=FALSE, results='asis'}

cat(paste("Source text available here [", goly_god_url, "](", goly_god_url, ")" ))

```


# Word cloud

```{r wordcloud, echo=FALSE, warning=FALSE}

corpus <- Corpus(VectorSource(data_goly_god),
                 readerControl = list(reader = readPlain, language = "ru"))

# Strip punctuation, numbers and common Russian stop words before counting.
corpus <- tm_map(corpus, removePunctuation)
#corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("russian"))

# Build a term-document matrix and collapse it into per-word frequencies.
dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
pal2 <- brewer.pal(8, "Dark2")

wordcloud(words = d$word, freq = d$freq, min.freq = 2, max.words = 200,
          random.order = FALSE, rot.per = 0, colors = pal2, scale = c(.9, .9))

```


# Sentence lengths
To get sentences and text fragments, the text is split at the following punctuation (a short illustration of the split follows the list):

* period (.)
* exclamation mark (!)
* dash (–)
* newline
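
For instance, a single made-up line splits into four fragments under this pattern (a toy illustration, not a line from the novel; whitespace is trimmed afterwards):

```r
# Toy illustration of the split pattern used in the chunk below.
strsplit("Первая фраза. Вторая! Третья – и всё", "[.!\n\r–]")[[1]]
# -> "Первая фраза", " Вторая", " Третья ", " и всё"
```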

```{r sentence_len, echo=F, warning=F}

# Split each line at sentence-ending punctuation, flatten, and trim whitespace.
sentences <- strsplit(data_goly_god, "[.!\n\r–]")
sentences <- trim(unlist(sentences))
freqs <- nchar(sentences)

hist(freqs, main="Histogram of sentence lengths")

plot(density(freqs), log="y", main="Density of sentence lengths (log scale)")

```


# Repeating sentences
The sentences obtained by splitting the text were grouped and sorted by repetition count. Sentences that do not repeat were discarded, and of those remaining, 1-character sentences were also dropped.
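
As an aside, the same group-and-count step can be written more compactly with base R's `table()`. The sketch below is an alternative formulation, not the code the chunk below actually runs:

```r
# Split, trim, and count identical fragments in one pass with table().
fragments <- trimws(unlist(strsplit(data_goly_god, "[.!\n\r–]")))
counts    <- sort(table(fragments), decreasing = TRUE)

# Keep only fragments that occur more than once and are longer than one character.
repeated  <- counts[counts > 1 & nchar(names(counts)) > 1]
head(repeated)
```

The chunk below keeps the original explicit loop.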

``` {r repeating_1, echo=FALSE, warning=FALSE,  results='asis'}

sentences <- strsplit(data_goly_god, "[.!\n\r–]")
sentences <- trim(sentences)

sentence_counts <- list()
simple_s <- unlist(sentences)

single_repetition <- 0
single_char <- 0

# Count how often each sentence occurs; keep repeated multi-character sentences,
# and tally how many sentences are rejected and why.
for (i in seq_along(simple_s)) {
  cur_sentence <- simple_s[[i]]
  occurrences <- sum(simple_s == cur_sentence)
  if (occurrences > 1 && nchar(cur_sentence) > 1) {
    sentence_counts[[cur_sentence]] <- occurrences
  }
  if (occurrences == 1) {
    single_repetition <- single_repetition + 1
  } else {
    if (nchar(cur_sentence) == 1) {
      single_char <- single_char + 1
    }
  }
}

cat(paste0("* ", single_repetition, " sentences were rejected because no repetitions were found.\n"))
cat(paste0("* ", single_char, " sentences were excluded because they were only 1 character long, e.g., _И_.\n"))
cat("\n")
cat("\n")
cat("The remaining sentences are sorted in descending order of repetition count:")
cat("\n")
cat("\n")

# Sort the surviving sentences by how often they repeat, most frequent first.
list_names <- names(sentence_counts)
list_vals  <- unlist(sentence_counts, use.names = FALSE)
ordered_val_idxs <- order(list_vals, decreasing = TRUE)
ordered_vals  <- list_vals[ordered_val_idxs]
ordered_names <- list_names[ordered_val_idxs]

for (i in seq_along(ordered_names)) {
  cat(paste0("* ", ordered_vals[i], " repetitions \"", ordered_names[i], "\"\n"))
}

```


# Repeating parts of sentences (also considering commas)
The text was again split into sentences and fragments, this time also splitting at commas. There is a lot of repetition in dependent clauses, e.g.:

* Знойное небо льет знойное марево, и вечером долго будут желтые сумерки.
* Знойное небо льет знойное марево, вечером будут желтые сумерки, – и вечером под холмом вспыхнут костры: это будут голодные варить похлебку, те, что тысячами ползут в степь, за хлебом, и из-под холма понесутся тоскливые песни.
* Знойное небо льет знойное марево.
* Знойное небо льет знойное марево, знойное небо залито голубым и бездонным, цветет день солнцем и зноем, – а вечером будут желтые сумерки, и бьют колокола в соборе: – дон-дон-дон!..


The resulting fragments were again grouped and sorted by repetition count. Fragments that do not repeat were again rejected, as were 1-character fragments.
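
Because this step repeats the previous chunk with only the split pattern changed, the `table()` sketch above also generalises to a small helper that takes the pattern as an argument. Again, this is only a sketch of a possible refactoring, not what the script does:

```r
# Hypothetical helper: split on a given pattern, trim, and keep repeated fragments.
count_repeats <- function(text, pattern) {
  fragments <- trimws(unlist(strsplit(text, pattern)))
  counts <- sort(table(fragments), decreasing = TRUE)
  counts[counts > 1 & nchar(names(counts)) > 1]
}

# Without and with the comma in the split pattern:
count_repeats(data_goly_god, "[.!\n\r–]")
count_repeats(data_goly_god, "[.!\n\r–,]")
```

The actual chunk below simply repeats the earlier loop with the extended split pattern.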


``` {r repeating_2, echo=FALSE, warning=FALSE,  results='asis'}

# Same procedure as above, but the split pattern now also includes the comma.
sentences <- strsplit(data_goly_god, "[.!\n\r–,]")
sentences <- trim(sentences)

sentence_counts <- list()
simple_s <- unlist(sentences)

single_repetition <- 0
single_char <- 0

for (i in seq_along(simple_s)) {
  cur_sentence <- simple_s[[i]]
  occurrences <- sum(simple_s == cur_sentence)
  if (occurrences > 1 && nchar(cur_sentence) > 1) {
    sentence_counts[[cur_sentence]] <- occurrences
  }
  if (occurrences == 1) {
    single_repetition <- single_repetition + 1
  } else {
    if (nchar(cur_sentence) == 1) {
      single_char <- single_char + 1
    }
  }
}

cat("\n")
cat("Below are the repeating parts of sentences, sorted in descending order of repetition count:")
cat("\n")
cat("\n")

# Sort the surviving fragments by how often they repeat, most frequent first.
list_names <- names(sentence_counts)
list_vals  <- unlist(sentence_counts, use.names = FALSE)
ordered_val_idxs <- order(list_vals, decreasing = TRUE)
ordered_vals  <- list_vals[ordered_val_idxs]
ordered_names <- list_names[ordered_val_idxs]

for (i in seq_along(ordered_names)) {
  cat(paste0("* ", ordered_vals[i], " repetitions \"", ordered_names[i], "\"\n"))
}

```