Preparing your own text corpus – Data Visualization in DH

As you work on preparing your corpus for visualization in Voyant, here are a few things to bear in mind.

a) What kinds of texts are you selecting?

Literary?
Philosophical?
Journalistic?
What was the original format of those texts?
Are they digitized printed books or articles?
Are they born-digital?
Are they transcriptions of the spoken word?

b) How many texts will give you a representative sample? According to the literature on corpus design, about 10 text samples from each register should give you usable data. As you select those 10, bear in mind the following:

Are you including texts that show the full range of variability?
If you are constructing a corpus which contains a range of texts, what is that range? For example, if you are sampling journalistic prose, are you choosing from both “highbrow” sources (NYT, BBC) and popular media (CNN, U.S. News & World Report, Huffington Post)?
If you are comparing political speeches, are you comparing the two ends of the political spectrum?
If philosophical, are you comparing schools of thought? Time periods? Cultural groups?
If literary, are you comparing literary genres, periods, authors?

c) How are you selecting these texts? You should document your process and your sampling decisions.
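
One way to document sampling decisions is to keep a small manifest alongside the corpus. The sketch below is only an illustration: the column names, the placeholder row, and the idea of storing the manifest as CSV are all assumptions, not something Voyant requires.

```python
import csv
import io

# Hypothetical manifest columns; adapt them to your own project.
FIELDS = ["filename", "register", "source", "original_format", "selection_reason"]

def write_manifest(rows, handle):
    """Write one manifest row per corpus text to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Placeholder entry for illustration only; replace with your real texts.
example_rows = [
    {
        "filename": "text_01.txt",
        "register": "journalistic",
        "source": "placeholder source",
        "original_format": "born-digital",
        "selection_reason": "placeholder reason",
    },
]

# Write to an in-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
write_manifest(example_rows, buffer)
manifest_text = buffer.getvalue()
```

To save the manifest to a file instead, pass a handle opened with open("corpus_manifest.csv", "w", newline="", encoding="utf-8"), where the file name is again just an example.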

d) How long are your texts? Voyant can manage large document collections well.
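
If you want a quick sense of how long your texts are before uploading them, a rough word count per document is usually enough. The sketch below assumes a simple regular-expression tokenizer, which will not match Voyant's own tokenization exactly, and the two sample documents are invented for illustration.

```python
import re

def word_count(text):
    """Count word tokens with a simple regex (a rough measure only)."""
    return len(re.findall(r"\w+", text))

def length_report(texts):
    """Map each document name to its approximate word count."""
    return {name: word_count(body) for name, body in texts.items()}

# Invented sample documents; in practice, read these from your corpus files.
sample = {
    "a.txt": "The cat sat on the mat.",
    "b.txt": "A much longer text would go here.",
}
report = length_report(sample)
```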

e) What are you looking for? Are you looking for dominant terms? Are you looking for vocabulary density? Are you looking for stylistic patterns? Repeated phrases? Are you looking for connections and collocations of terms?
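
To make these questions concrete, here is a minimal sketch of three of the measures mentioned above: dominant terms, vocabulary density, and a crude stand-in for collocation. It assumes that vocabulary density means the ratio of unique words to total words (the type-token ratio) and that counting adjacent word pairs is an acceptable first approximation of collocation. The function names and the sample sentence are hypothetical, and Voyant's own tools may tokenize and count differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplification)."""
    return re.findall(r"[a-z']+", text.lower())

def dominant_terms(tokens, n=5):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)

def vocabulary_density(tokens):
    """Type-token ratio: unique words divided by total words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def adjacent_pairs(tokens, n=5):
    """Most frequent adjacent word pairs, a crude proxy for collocation."""
    return Counter(zip(tokens, tokens[1:])).most_common(n)

# Invented sample sentence for illustration only.
text = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(text)
top = dominant_terms(tokens, 1)
density = vocabulary_density(tokens)
pairs = adjacent_pairs(tokens, 2)
```

In a real project you would probably remove stopwords such as "the" before looking at dominant terms, and remember that the type-token ratio falls as texts get longer, so compare texts of similar length.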

Remember that this is a sequential and iterative process:

Initial formulation of research question --> Corpus design --> Compilation of corpus --> Empirical investigation --> repeat