For a data scientist, history is the most important piece of information for showing the present and predicting the future. However, when that data is flawed or biased, it creates a funhouse-mirror effect that warps perceptions and reinforces stereotypes. These stereotypes could have implications for generations to come.
Or if data is truly transparent, it could instigate healing and rebirth.
In commemoration of this weekend’s celebration of Juneteenth, allow me to introduce three datasets: one built with bias baked in, one built to separate, and one built for truth and reconciliation.
The Origin of ‘Juneteenth’
Juneteenth marks our country’s second independence day.
The word ‘Juneteenth’ is a combination of June and nineteenth, in honor of the date Union General Gordon Granger formally announced the end of slavery enacted by the Emancipation Proclamation.
Even though the Emancipation Proclamation declared an end to slavery in the Confederate States on September 22, 1862, it only became effective on January 1, 1863.
However, thousands of slaves had been moved into Texas by slaveholders to escape the war. Although most lived in rural areas, many resided in both Galveston and Houston. By 1865, there were an estimated 250,000 enslaved people in Texas.
Despite the surrender of General Robert E. Lee on April 9, 1865, the western Army of the Trans-Mississippi did not surrender until June 2.
General Gordon Granger was a career U.S. Army officer and a Union general during the American Civil War, where he distinguished himself at the Battle of Chickamauga and the Battle of Chattanooga and lifted the siege at Knoxville, Tennessee (a Union stronghold in the Confederate South).
On the morning of June 19, 1865, Union Major General Gordon Granger arrived on the island of Galveston, Texas to take command of the more than 2,000 federal troops recently landed to enforce the emancipation of its slaves and oversee a peaceful transition of power.
Granger read aloud the official handwritten record issued as General Order №3 informing the people of Texas that all enslaved people were now free.
Although this event has come to be celebrated as the end of slavery, slavery remained legal in the two Union border states — Delaware and Kentucky — until December 18, 1865.
This past Friday, we in the United States officially celebrated Juneteenth as a federal holiday, one hundred and fifty-six (156) years after Granger read the official decree.
The End of Slavery Around The World
The United States was not the first to end slavery.
Haiti (then Saint-Domingue) formally declared independence from France in 1804 and became the first sovereign nation in the Western Hemisphere to unconditionally abolish slavery in the modern era.
Because of the slave uprising in Haiti, France completely abandoned slavery by 1848. France commemorates the national day of the abolition of slavery on May 10.
Britain officially abolished slavery in 1833, 32 years ahead of the US. However, supposedly freed slaves were in fact committed to six to twelve years of further service as unpaid ‘apprentices’, so slave owners were compensated to the tune of millions and still continued to get free labor. It wasn’t until 1838 that these apprenticeships were ended, and slaves in the British Empire were truly emancipated. Britain celebrates Anti-Slavery Day on 18 October.
And the United States was not the last country to abolish slavery.
That title goes to Mauritania. Mauritania ended slavery in 1981, nearly 120 years after Abraham Lincoln issued the Emancipation Proclamation in the United States. Yet, the country didn’t make slavery a crime until 2007.
But the end of slavery does not end bias and racism. Ending slavery simply tries to create a baseline that regardless of the color of our skin, we are all human: flesh and bone; dreams and hopes.
Until humans begin to use technology with a bias to put a heavier mathematical weight on one demographic as opposed to everyone else.
The Origin of The ‘Boston Housing Dataset’
It is hard to learn how to fit a linear regression model without coming across the Boston Housing Dataset. Python’s scikit-learn long let you import it directly from sklearn.datasets, along with other classic datasets, although the loader (load_boston) was deprecated in version 1.0 and removed in version 1.2. Containing information collected in 1970 by the U.S. Census Service around Boston, Massachusetts, its 506 data samples initially had 20 variables, two of which were of primary interest: the median value of homes in a given census tract (MEDV) and air pollution values represented by the concentration of nitrogen oxides in the area (NOX).
Fourteen variables have survived in common versions of the data set available from the University of Toronto, Carnegie Mellon StatLib repository, and Kaggle. These variables include proximity to the Charles River, pupil-teacher ratios, the average number of rooms per house, crime rates, and the racial makeup of the population.
However, one variable stands out: B, calculated with the formula B = 1000(Bk - 0.63)^2, where Bk is the proportion of Black/African American residents per town according to Census statistics.
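To make the transformation concrete, here is a minimal Python sketch. It assumes the formula reads B = 1000(Bk - 0.63)^2, with the trailing 2 taken as an exponent; the function names are hypothetical, chosen only for illustration. It shows one reason the variable is so troubling: the transformation is not one-to-one, so a published B value cannot always be traced back to a single proportion.

```python
import math


def b_from_bk(bk: float) -> float:
    """Apply the dataset's transformation to a proportion bk in [0, 1].

    Assumption: the formula is B = 1000 * (Bk - 0.63) ** 2.
    """
    if not 0.0 <= bk <= 1.0:
        raise ValueError("bk must be a proportion between 0 and 1")
    return 1000 * (bk - 0.63) ** 2


def bk_candidates(b: float) -> list[float]:
    """Return every proportion in [0, 1] that would produce the value b.

    The parabola is symmetric around 0.63, so some values of b have
    two valid pre-images and the original proportion is ambiguous.
    """
    if b < 0:
        raise ValueError("b cannot be negative")
    root = math.sqrt(b / 1000)
    # Try both branches of the square root, keep only valid proportions.
    return [c for c in (0.63 - root, 0.63 + root) if 0.0 <= c <= 1.0]
```

Under this reading, `bk_candidates(10.0)` yields two candidates, about 0.53 and 0.73, while `bk_candidates(250.0)` yields only one, about 0.13, because the other branch falls above 1. The largest possible value, reached at a proportion of zero, is about 396.9.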
With this curious variable in place, even simple data explorations such as measuring correlations between the variables suggest there might be a relation between house price and the proportion of African Americans in a given area or, more disturbingly, a relation between crime and the proportion of African Americans in an area.
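The kind of simple exploration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it expects a CSV copy of the dataset with a header row and purely numeric columns, the helper names `pearson` and `correlations_with` are hypothetical, and it uses only the standard library rather than any particular notebook's code.

```python
import csv
import math
from typing import Iterable


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length, at least 2 rows")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlations_with(lines: Iterable[str], target: str = "B") -> dict[str, float]:
    """Correlate every other column with the target column.

    `lines` is an open CSV file (or any iterable of CSV lines) whose
    first line is the header. Assumption: every cell parses as a float.
    """
    rows = list(csv.DictReader(lines))
    if not rows:
        raise ValueError("no data rows found")
    columns = {name: [float(row[name]) for row in rows] for name in rows[0]}
    return {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
```

With a local copy of the data (say, a hypothetical file named boston.csv), calling `correlations_with(open("boston.csv", newline=""))` would return one coefficient per column, including MEDV and the crime-rate column. A nonzero coefficient only signals association, not cause, and because B is a nonlinear transformation of the underlying proportion, a correlation with B does not translate directly into a statement about that proportion.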
But it’s important to note that this dataset was not created by the Census. It was actually derived in 1978 by researchers David Harrison, Jr. and Daniel Rubinfeld (or the grad students and/or assistants of the researchers) in pursuit of cleaner air. See source.
In every data science class I teach, I use this dataset to warn my students about the power of bias in machine learning and artificial intelligence. When my students open the Jupyter Notebook, I ask them to browse the dataset variables and to point out any that might seem strange. Flawed. Or a little out of the ordinary.
“What the heck is variable B?” they ask right away.
Or “Why are Blacks targeted in this dataset when there must be different minorities living in Boston at the same time?”
One answer is that 1970 was in fact the first year the U.S. Census asked about Hispanic/Latinx ethnicity. The 1870 Census is the first to include African Americans by name along with the rest of the population and is often the first official record of a surname for former slaves. Five years after the initial Juneteenth.
There are arguments to be made that the US Census, and the ability to quickly collect, store, and electronically manipulate data on minority populations, are perpetually detrimental to those populations.
I do not subscribe to this belief. I believe that “white-washing data” and putting everyone who is not white in an ‘other’ column is dangerous and detrimental to the services, needs, and demands of all communities of color. In fact, not only do we need data about all races, we need more data about sexual orientation for our cities, communities, and towns. The needs and concerns of our LGBTQ+ family need to be considered.
I wrote about this in an article entitled “The Analytics of Hate.”
But what we do not need is to use this data to discriminate. We need this data to create a more inclusive society. We need to realize that our cities are made from a diversity of people, a multitude of cultures, and the freedom to love whom you love. This inclusion of everyone in the data helps us all.
The Separation of ‘District Six’
I lived in Cape Town, South Africa from 2004–2006 and one of the stories that moved me deeply was the government policy to break apart and raze ‘District Six’ during the early years of the apartheid era.
After World War II, District Six, a neighborhood of Cape Town, was relatively cosmopolitan. Situated within sight of the docks, it was made up largely of coloured residents, including a substantial number of coloured Muslims, called Cape Malays. There were also a number of black Xhosa residents and smaller numbers of Afrikaners, English-speaking whites, and Indians. Friends and family members told me that ‘District Six’ was a true melting pot of people, full of friendship, creativity, and inclusivity.
On 11 February 1966, the government declared District Six a whites-only area under the Group Areas Act, with removals starting in 1968. Some 33,446 people living in the specific group area were affected, 31,248 of them people of color. The government’s plan for District Six, finally unveiled in 1971, was considered excessive. Most of the approximately 20,000 people removed from their homes were moved to townships on the wastes of the Cape Flats.
Government officials gave four primary reasons for the removals. In accordance with apartheid philosophy, they stated that 1) interracial interaction bred conflict, necessitating the separation of the races. They deemed District Six 2) a slum, fit only for clearance, not rehabilitation. They also portrayed the area as 3) crime-ridden and dangerous, and they claimed that the district was 4) a vice den, full of immoral activities like gambling, drinking, and prostitution. Though these were the official reasons, most residents believed that the government sought the land because of its proximity to the city center, Table Mountain, and the harbor.
Friends of mine in Cape Town explained that the government enforced this ‘separation’ of peoples by threatening fines and/or prison time if you talked to or interacted with your neighbors of a different race. Those who defied the laws were thrown in jail or worse, so families were forced to cut off relationships with decades-old friends in fear of losing everything.
The Group Areas Act was built on data: counting the population and subsetting by racial variables. One of my close friends explained that, to create this ‘racial divide’ data, families had to report to a government office where officials performed a ‘hair comb’ test. If they could not get a comb through your hair, you were deemed ‘coloured’ unless you specified Indian, Xhosa, etc.
But even though data was used to separate and destroy, a new movement using data also became the way South Africa healed.
Truth And Reconciliation Data
Near the end of my data science class, I always show my students the data that was created during the ‘Truth and Reconciliation Commission’.
The Truth and Reconciliation Commission (TRC) was created to investigate gross human rights violations perpetrated during the period of the apartheid regime from 1960 to 1994, including abductions, killings, and torture. Trials occurred across South Africa, and those who perpetrated the crimes could be given amnesty if they confessed and asked forgiveness of the families they had inflicted violence on, as long as the crimes were politically motivated and proportionate and there was full disclosure. No side was exempt from appearing before the commission.
A movie with Juliette Binoche and Samuel L. Jackson called “In My Country” depicted the trials based on the book “Country of My Skull”, a 1998 nonfiction book by Antjie Krog primarily about the findings of the South African Truth and Reconciliation Commission (TRC).
The data for TRC is located here via Africa Open Data.
This dataset depicting violence, rape, and destruction on both sides is a sobering account of bias and racism going unchecked and rampant. It also reveals the retaliation and spiral into violence. As I show my students, the data doesn’t lie. It lists hundreds, thousands of those who died. Those who fought for freedom. Those who fought for separation.
From Africa, where millions were sold into slavery around the world, to District Six, where families lived in harmony until apartheid, to the election of Nelson Mandela as the first democratically chosen Black President of South Africa, Juneteenth is a celebration of humanity.
It is an attempt to bring attention to the bad and make it a learning moment.
But isn’t that the role of every developer and data scientist? Bringing attention to every mistake and learning from it? We are all doing our part as technologists and data scientists when we ask: is the data I am using free from bias? From manipulation? Will this data separate us into factions? Or will this data allow all of us to understand each other more?
Juneteenth marks our country’s second independence day.
Data is the most important piece of information to show this present and predict our future.
But more importantly, data is the most important piece of information to reveal our hearts.
Do your little bit of good where you are; it’s those little bits of good put together that overwhelm the world. — Desmond Tutu