Categories
Assignment 6

A Study on Aliens

Link to Visualization: https://public.tableau.com/profile/aung.pyae.phyo#!/vizhome/APPDataVizFInal/3?publish=yes

Website Link

dataviz2019phyo.blogs.bucknell.edu

Ever since my first year at Bucknell, I’ve been really interested in the patterns of international students pursuing higher education in the United States. I wanted to know why people would travel all the way to the middle of Pennsylvania for their future. I knew why I came to Bucknell: it was a decision made in consultation with my high school, after research into my field of study and the uniqueness of the program I wanted to pursue. Curious about this, I took a look at the information available online, but all I found was the Fact Book, which had the information yet was not interactive in any way and was very hard to get through because it was all numbers. The information was also focused on a long-term institutional outlook rather than a student-focused study. During in-class consultations, Agnes suggested looking at the University’s interactive dashboards for data. Looking at these sources, they were not as specific about the international community at Bucknell as I wanted.

So for my final project, I wanted to take a look at five years’ worth of international student data, from the class of 2019 to the class of 2023. My initial idea when I started thinking about this project was to make a Bucknell international student-specific infographic showing where we were from and how the community has grown over this five-year period. The data that I used for the project was obtained from International Student Services at Bucknell. It contains each student’s gender, class year, college within Bucknell, major, country of origin, high school, and the category of their field of study, for 279 students in total. Data plumbing was minimal, as all I had to do was append the data of all international students in the academic year 2019-2020 to the data of seniors from the academic year before.
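Since both files shared the same columns, the append step amounts to a simple concatenation. A minimal sketch in Python, using hypothetical field names and toy records (the real files came from International Student Services):

```python
# Toy stand-ins for the two exports; field names are hypothetical.
cohort_2019_2020 = [
    {"name": "Student A", "class_year": 2023, "country": "Myanmar"},
    {"name": "Student B", "class_year": 2022, "country": "Ghana"},
]
seniors_2018_2019 = [
    {"name": "Student C", "class_year": 2019, "country": "Vietnam"},
]

# Because the columns match, appending is just list concatenation,
# giving one table covering the classes of 2019 through 2023.
all_students = cohort_2019_2020 + seniors_2018_2019
```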

To interpret the data, I wanted to use Tableau to look at the demographics and make the results interactive, with the user able to change various settings to explore their own specific research questions. Furthermore, I also wanted to look into the connections between high school and enrollment, to see how much secondary education mattered when choosing a college. To do this, I chose to use Gephi and map out connections between the students. The final tool I wanted to use was Timeline JS, to build a timeline about the international students at Bucknell in the context of global events. By employing all three tools, I hope to give a more complete look at the international student population, with more interesting insights.

For the Tableau visualizations, I thought about what people might be interested in when they think about international students. In a lot of demographic infographics, we are represented as a count of countries or a percentage statistic. There is nothing wrong with that; it makes for fascinating data. I just wanted to make sure that my infographic would convey the same information in a way that makes me feel personally visible. This is where Tableau’s symbol map comes into play. Throughout my survey of infographics, it was the most visually engaging method of communicating the international demographic. It is also a great way to highlight the countries that are not as well represented.

Tableau is also the final part of my infographic. By ending on an interactive visualization that gives you tools to play around and explore the data, I want to encourage others to become knowledge generators themselves, to have their small research questions easily explorable. By using the martini glass structure, I hope to take the viewers through my narrative to inform them about what the dataset is and inspire them on what can be done with the final visualization, allowing for free exploration on a path of their choice.

My first foray into the project was the Gephi component, which I was most interested in. My decision to look at Bucknell as an option was mainly because I knew students from my high school who went to Bucknell. To that end, I set out to connect people who came from the same high school, and gave the connection more weight if they belonged to the same class year. After building the edge table, I used the Radial Axis Layout, grouping the nodes by class year. This produced a visualization that, in my opinion, makes a pretty strong argument that the high school an international student attended plays a significant part in whether future students apply. Of course, this dataset only contains the students who enrolled, not those who applied, so this factor might carry more meaning in a college application context than in a college enrollment context.

For the final part of the project, I wanted to put in a timeline that shows some important events that influenced higher education in the US, Bucknell University, and international students who want to pursue a degree in the US. These events required more research (and, as of the time of writing, more events that actually impacted international students). For now, major events have been highlighted, and I hope to continue this project and add more later on.

Working with this project was an interesting exercise in the release of data. I spent a lot of time thinking about how not to single people out, given how unique individuals can be in a small dataset. Due to the low number of international students at Bucknell, it became a scramble to find ways to hide information that was unique enough to identify an individual student. Around half of the international students are the only one from their country, and while having their country represented is not a problem, the other data that comes with the dataset would not be appropriate to release.

Through this project, I think that I have been able to dive into the datasets that I received and produce some interesting visualizations. In particular, Gephi has really helped in my search for an answer about the correlation between enrollment and high school. Tableau has also opened up a path to interact with the data and let readers decide what they want to explore. By employing these tools, I believe that I have created an interface that will help people make their own visualizations, creating more knowledge generators.

Works Cited

Bucknell University, “2018-2019 Fact Book”

https://www.bucknell.edu/sites/default/files/2019-05/fact_book_2018-19.pdf

Bucknell University “Diversity Dashboards”

https://tableau.bucknell.edu/views/EnrollmentGeographicDistribution-InternationalUpdate/InternationalMap?:iid=5&:isGuestRedirectFromVizportal=y&:embed=y

Segel, E, and J Heer. “Narrative Visualization: Telling Stories with Data.” IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, 2010, pp. 1139–1148.

Dataset from Bucknell University’s International Student Services


Assignment 5 – APP

For this project, I decided to work with my own dataset, one on the demographics of the people involved in International Orientation 2019. The dataset included their names, where they came from, why they were participating (first year, teaching assistant, transfer, or staff), the team they were put into for the program, their class year, and other special tags (athlete, etc.). My goal with this dataset was to figure out whether the connections made during International Orientation fostered friendships within certain ‘small worlds,’ and to see if Gephi could, as Lima puts it, “translate structural complexity to perceptible visual insights aimed at a clearer understanding” (Lima, 79).

Building the dataset, however, was a process. I first went with connecting every node to every other node to see what information might come up, but a visualization of that dataset did not yield much. All of the nodes had the same weight, and while the different categories that I had assigned made for pretty colors, no useful information could be determined.

Continuing to work with the data, I had to redefine what an edge meant in this dataset, and I decided that two people were connected if they had matching entries in the categories mentioned above. The weight increased as the similarities increased. This would represent the opportunities given for them to create connections with each other, following Graham’s example of phone call networks: “A network of people connected by whom they call on the phone can be weighted by total number or length of phone call” (Graham, 207).

I went about this using the Excel spreadsheet method, filtering the data and creating edges for all the nodes in the same categories. By making sure that every one of the nodes was connected in some way, I was able to make a dataset where each node was connected to at least one other node.
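The edge-building rule described above — connect two people if any category matches, with the weight equal to the number of matching categories — can be sketched in Python. This uses toy records and hypothetical attribute names; the actual work was done by filtering in Excel:

```python
from itertools import combinations

# Toy participant records; the real attributes included origin, role,
# team color, class year, and special tags.
students = {
    "A": {"high_school": "HS1", "class_year": 2023},
    "B": {"high_school": "HS1", "class_year": 2023},
    "C": {"high_school": "HS1", "class_year": 2021},
    "D": {"high_school": "HS2", "class_year": 2023},
}

# Connect every pair that shares at least one attribute; the edge weight
# counts how many attributes the two records have in common.
edges = []
for (a, attrs_a), (b, attrs_b) in combinations(students.items(), 2):
    weight = sum(attrs_a[k] == attrs_b[k] for k in attrs_a)
    if weight:
        edges.append((a, b, weight))
```

With these toy records, A and B share both school and year (weight 2), C and D share nothing and get no edge, and every other pair gets weight 1.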

For my first visualization with the dataset, I decided to try the ForceAtlas 2 layout and, from there, figure out where there were clusters and whether I could find where the connections were most central. The results were not that surprising, as a large cluster formed around the nodes with the two attributes of Chinese citizenship and the class of 2023. In the following image, the pink nodes that dominate the screen are under the category of first year, while the other large categories, the teaching assistants and the orientation staff, grouped in small worlds away from the first years.

With the second visualization that I produced, I wanted to see if team colors played a part in connecting people together and how strong the connections were across teams. To achieve this, I used the palette table to assign each team its own color and used the dragging tool to group the nodes together. In the resulting visualization, I saw that there was a strong link between members of teams brown and purple. To double-check this, I unscaled the weights to see if the link was still strong. The result wasn’t too apparent, but I think that might have happened due to the manual node placement.

In the third visualization, I wanted to use the modularity class to see where the connections were more complex. The average modularity was 0.059, which signaled that the dataset was not very complex. Graham states that “Modularity is successful when there’s a high ratio of edges connecting nodes within any given community compared to edges linking out to other communities” (Graham, 229). With the dataset being small and most of the first years connected to each other through their class year and a country of origin, a high modularity score was not likely. For the visualization, I used the filter to separate out nodes with different modularity classes. As shown below in the visualizations, different modularity classes correspond to different color teams, for some reason that I don’t really understand.

I also wanted to see which nodes had the highest degree in my network. On average, each node was connected to fifty other nodes. This figure is probably skewed by the number of first years connected to each other through class year. The following visualization shows the most connected nodes, those with a degree of 65 and above. It is primarily composed of Chinese first-years. Again, this result is most likely skewed by the demographics of the program.
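Degree here is simply the number of edges touching a node. A minimal sketch of the average-degree calculation and the threshold filter, with a toy edge list and a threshold of 2 standing in for Gephi’s cutoff of 65:

```python
from collections import Counter

# Toy edge list; each tuple is (source, target).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Each edge contributes one degree to each of its two endpoints.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

avg_degree = sum(degree.values()) / len(degree)

# Keep only nodes at or above a threshold, like Gephi's degree range filter.
top = sorted(node for node, d in degree.items() if d >= 2)
```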

Going through the process of creating the dataset, I learned two things: Gephi is amazing at creating visuals, and it is terrible at being user-friendly when it comes to data. To start with, it is capable of taking in an enormous amount of connection data, and on the Overview page it is very easy to drill down into the connections and find new patterns in the data that might not have been seen before. Lima notes that systems such as Gephi “expose causality in patterns in relationships, contributing decisively to the holistic understanding of the depicted topology” (Lima, 83), and I believe that I have been able to look at some interesting connections and clusters that I did not expect. However, I realize that my network is not a natural network but rather one constructed, by a single person, to be relatively diverse in participant categories, and this limitation should be acknowledged when working with datasets that do not occur naturally.

While Gephi could take in a lot of the information that I provided, it could not create edges on its own, and it was very user-unfriendly when it came to data. In this regard, Gephi is much like Tableau: while it is a powerful tool for showing the connections in the data, the underlying information must be carefully curated in order to produce accurate visualizations. Overall, Gephi is a very powerful visual tool, able to manipulate parts of the data through filtering, coloring, and positioning. Yet these tools, however powerful, are useless without knowledge of the underlying dataset.


Assignment 3

For the third assignment, I chose to look at the Trans-Atlantic Slave Trade Database to see if I could play around with the visualizations, hoping that I would find connections that I would be able to write about.

For my first visualization, I wanted to see the connections between the places that the enslaved people were sent. This involved setting two visual layers on the map: one representing the route from where enslaved people were bought to where their journeys began, and the other from where their journeys began to where they disembarked. To achieve this, I had to select three variables: the places where the enslaved people were bought, where they embarked on their journey, and where they disembarked. There was still a lot of data to interpret for the set of all voyages, so I narrowed it down to the voyages of the three most frequently named ships. The resulting visualization was interesting, and I interpreted it to infer that many enslaved people whose final destination was South America were first transported to the East Coast rather than making a direct voyage there.

I also wanted to look for links between the three ships that I had filtered for in the data. This time, I wanted to see how many ports they had in common and whether there was any correlation between them. I pulled up the ships and the ports that they were associated with on the graph to see if there were multiple common ports between them. To make the comparison clearer, I anchored the names of the ships in a triangular arrangement, as shown below. We can see that while the Nancy and the Mary had the most ports in common, there was much less correlation between them and the NS da Conceição Antonio e Almas.

The results make sense in the context of the service periods and origins of the ships: the Mary and the Nancy were both affiliated with the United Kingdom and were launched within twenty years of each other, the Nancy in 1789 and the Mary in 1806. The NS da Conceição Antonio e Almas was launched much earlier, with records indicating that it was in service from 1691 to 1782 (ShipIndex). Another possible explanation is that the countries that owned the ships respected each other’s trade routes and did not infringe on each other’s paths. The Timeline JS view below shows how much overlap there was between the three ships and how they may have influenced each other.

The results that I have shown here display data that has been filtered and represented differently from how it was given to me, which was as an Excel sheet. By selectively using parts of this data and making a visualization out of it, I am effectively bending the dataset to my own ends and drawing new conclusions from data that I already have. By using timelines, I added more meaning to the years that the ships were active, graphically improving their meaning to suit my narrative. I think that this makes me what Johanna Drucker describes as a ‘knowledge generator.’


Assignment 2

I started my visualizations with Voyant Tools, to visualize the narratives of enslaved people. First, I prepared the corpus by arranging the texts based on the date written, and then filtered additional stopwords out of the 25 most frequent words, removing the following: day, told, soon, people, men, man, thought, said, saw, mr, heard, went, come, came, knew, know, like.
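The stopword step boils down to dropping chosen words before counting frequencies, which Voyant handles internally through its editable stopword list. A minimal sketch of the same idea with a toy sentence:

```python
from collections import Counter

# Toy text; the real corpus was the set of slave narratives loaded into Voyant.
text = "the children came and the children went and people saw the children"

# Extra stopwords added on top of the default list (a small subset shown here).
extra_stopwords = {"the", "and", "people", "saw", "came", "went"}

# Drop stopwords, then count what remains.
words = [w for w in text.lower().split() if w not in extra_stopwords]
top_words = Counter(words).most_common(3)  # [('children', 3)]
```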

Among the remaining words, I was interested to see that the word ‘children’ was mentioned many times, so I focused on it to see if there were any meaningful connections. Looking at the trendline, Harriet Jacobs was the only one who wrote significantly about this, and the bubblelines visualization showed that she used the word ‘children’ quite often throughout her narrative. Referring to the title of the memoir, “Incidents in the Life of a Slave Girl. Written by Herself,” it could refer to her own childhood.

Interested in exploring this further, I looked into different visualizations and found that a word tree yielded the most information about what Jacobs was writing about. In the following visualization, we can see that the connections made with the word ‘children’ are varied, and through most of them we can tell that she was talking about the children she encountered in her life, e.g. connections such as “master’s, grandmother’s, mother’s.”

Aiming to deconstruct the text further, I shifted to a more analytical collocate view of the word. Reading through the list, I discovered that terms with more negative connotations, such as suspicious, jail, or unhappy, had significantly fewer mentions. For me, this visualization raised the question of how much editing was done for a white audience back then, and hence how true the narratives are to the authors’ real feelings.

Shifting to Tableau, I used the African Names Database to construct some visualizations. Preparing the data involved correcting the categories, such as setting the arrival year as a date rather than a number. The first thing I wanted to find out was whether there were any patterns in the data for enslaved children. I set up time versus count of names to visualize the enslaved people over the years, and added an age filter to see how much the data would change between two age ranges, 1-18 and 19-77. The graphs for the two ranges stayed relatively similar, leading to my hypothesis that children under 19 made up about half the dataset.
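The age-range comparison amounts to a grouped count per year under each filter. A minimal sketch with hypothetical toy records (the real filter was set up in Tableau on the African Names Database fields):

```python
# Toy records: (arrival_year, age). Values are hypothetical.
records = [(1819, 10), (1819, 25), (1820, 7), (1820, 30), (1820, 12)]

def counts_by_year(recs, lo, hi):
    """Count records per arrival year whose age falls in [lo, hi]."""
    out = {}
    for year, age in recs:
        if lo <= age <= hi:
            out[year] = out.get(year, 0) + 1
    return out

# The two age-range filters compared in the post.
children = counts_by_year(records, 1, 18)   # {1819: 1, 1820: 2}
adults = counts_by_year(records, 19, 77)    # {1819: 1, 1820: 1}
```

Plotting the two series against each other is what revealed that the curves stayed roughly in step.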

The next topic I wanted to explore was the ships themselves and how many people were usually on them. For this, I made a tree map visualizing how many people each ship carried. The numbers ranged from 1,116 enslaved people on the Maria down to just one on board. The tree map shows a very large number of ships, which hints at how many people were taken from their homelands.

The final topic I looked into was the distribution of gender. Playing around with the data, I managed to put it into a packed bubble visualization categorized by sex. The data shows some clear information: men made up the largest share of the enslaved, and there are a surprising number of records with no sex recorded.

Using both Voyant and Tableau, I found a stark difference between the two. Voyant, being more suited to qualitative analysis, gave me visualization upon visualization, no matter what I wanted to focus on, or even with no specific focus at all. The avenues of exploration really let the user find more possible connections. However, Voyant’s results are mostly connections that need to be built on with other, different views, relying on the user to make these assumptions. When it came to using Tableau, I needed to be very specific about what I wanted from the data. Unless I supplied the data types that Tableau needed, there would be no meaningful visualizations. This brings some frustration in setting up the data for success, but the results are, therefore, more concrete than Voyant’s. A commonality between the two tools, however, is that they show connections that we might not have seen by looking at the data without them. They also save time, either by pulling out metadata to analyze or by building correlations with numerical data.

From this assignment, it was using Voyant that most strongly verified Tanya Clement’s observation that a visualization platform combines multiple views to create a multidimensional standpoint. Using a single view of the Voyant tools did not give a meaningful view into the slave narratives, but using them together, with the viewpoint focused on the word ‘children,’ helped layer more meaning onto the visualizations, resulting in a stronger argument based on ‘plausible complexities,’ as Clement states. For quantitative information, it is harder not to end up with simple answers, due to the defined fields that we must put the data in to get the desired visualizations. However, as with the Voyant visualizations, putting together more data in Tableau allows for a more complete picture, as we pick and select what data to highlight in each of our visualizations, potentially leaving some data unexplored.


Assignment 1

The first visualization I chose was “The Essence of Rabbit,” compiled by the design firm Pictoplasma. It explores the art of around 600 artists on the topic of rabbits and their interpretations of the animal, and showcases them in what is effectively summed up as “a bunny mandala,” a color-coded formation that requires you to zoom in to view the art clearly. Looking back on my decision to view this through the three-stage model of perception in Meirelles’ book, I think that I was attracted to it during Stage 2, “slow serial processing for extraction of patterns and structures”: it was the only one that had cartoons on it.


This visualization showcases the differences in how a rabbit can appear in people’s heads and through the hands of artists, explored in different visual features. In essence, it also touches on data feminism, in the sense that this design firm is the one organizing the data and thus has a lot of power in choosing which designs fit its aesthetic. The description of the project says that Pictoplasma asked the artists to contribute; however, there is not much more explanation of the process than this. That opens up the question: was anyone left behind?

Although this visualization is all static data, its complexity, which requires you to look at only one portion at a time, creates a pseudo-dynamism, allowing you to choose what part of the mandala you want to see. However, the data is still static, and once you are done with the visualization, there is no way to compare the parts of the dataset you want, for example looking only at the brown rabbits of the mandala. This visualization, while interesting, was no more than what they promised: “a bunny overdose.”

The second visualization I chose was “Visualizing Reddit Discussions.” It visualizes the comments and reactions in Reddit threads from the front page and is an interesting tool for seeing how discussions flow from topic to topic. The sizes of the bubbles denote engagement, while colors other than black denote the same author. This visualization is interesting because different conversations have different flows. Some stay anchored in the main thread, whereas others start sub-discussions in a popular comment subthread.

This visualization is dynamic, changing from post to post and in real-time. The visualizations pull data from the main page, and it is possible to look into the threads you are interested in by pasting the link in the search bar. However, since the data is real-time, tracking it requires saving the post’s link to be able to see it later as it might not come up on its own again.

This tool, I think, serves to open up data interpretation a bit more. Compiled over time, a single post’s growth could be an interesting thing to see, but in the current state of untracked change, this clearly dynamic visualization becomes static, in a way. These visualizations do create a new understanding of how people interact online, and having the choice of dataset puts power in the users’ hands. This does, however, raise the question of pre-set algorithms: many users might not be familiar with coding, and the data that they end up with might therefore be biased after all.

From the DH Sample book, I first chose Native Land, which visualizes the territories, languages, and treaties of native people. By looking at this data and filtering through the selected categories, you can find more data about a specific language or topic and isolate it for a better representation of your subject. It does allow the data to be viewed from different, though possibly skewed, perspectives, due to the data not being complete. Nonetheless, the project definitely provides ways to interact with the data and draw new conclusions.

The second example I chose from the DH Book was “Queering the Map.” This allows people to share their own stories about queerness and contribute to a dataset the way that they see fit. It puts a lot of power into the hands of the visitor, which results in strange locations, but interesting stories nonetheless. This representation of data is simple, but the ease of being able to put your own narrative there and effectively changing the dataset might lead to new personal understandings about how and why people share their data.