Passengers in History – A Data Story

Three weeks ago the South Australian Maritime Museum launched their newest online project. Passengers in History is a family history research website containing over 328,000 passenger records along with numerous ship, voyage, port and organisational records. The website is an evolution of the Passengers Database that has been maintained by museum staff for many years. While the Passengers Database was only accessible at the museum, Passengers in History is available to users all over the world.

The process of converting this local database into such a powerful online tool presented many technical challenges. Visit the website if you’d like to jump straight into researching your family history through Passengers in History; if you’d like to read about the development challenges presented by working with such varied sources of data, read on.

Why did we build Passengers in History?

The challenge of building this website was creating one normalised data source from an unwieldy legacy database and an Excel file of transcribed ships’ logs and diaries. The project was prompted by the success of the South Australian Maritime Museum’s Passengers Database, a Microsoft Access database designed, built and maintained by non-technical museum staff. Over 25 years the database grew to become an unwieldy but much used and much loved beast. It was installed on only three computer terminals inside the Maritime Museum; our job was to take that database, build upon it and make the entire resource more widely accessible by taking it online.

How did we do it?

So, how did I go about this? The first step was to evaluate the structure of the legacy database and export the data into a custom MySQL database (using the MS Access export tools and an ODBC database connection). I chose the open source MySQL database because of its ease of use and because it is the default database for Drupal CMS (the selected platform for the website). After evaluating the Access database (16 tables that included 320,684 passenger records, 57,623 voyage records, 13,477 ship records and 21,567 photographic records), it was clear I needed to write a number of SQL queries to extract and import the data.

Once I had this data in a stable relational database, the next challenge was to develop a script to import and merge the Log of Logs with the original Passengers database (download a snippet of the Log of Logs). I chose Python for the script because of its simplicity and ease of use. The script used some basic methods to identify existing vessels and voyages between the two data sets and only inserted new records into the MySQL database if none already existed. For example, vessels were matched by name and tonnage; the code snippet below illustrates how this is done (the script starts with the Log of Logs CSV file and then compares it with the database).

cursor.execute("SELECT * FROM vessel WHERE LOWER(vesselName) = LOWER('" + row[1].replace("'","\\'") + "') AND tonnage != '' AND tonnage = '" + str(int4) + "'")

if cursor.rowcount:

# Update vessel (ship) record

for vesselrow in cursor:

# Code to update vessel record

...

else:

# Code to add vessel record

...

Code 1: Python script to match vessels

On a simple level, a cursor is a way to get the results of a database query. The query attempts to match the vessel name and tonnage fields obtained from the CSV file (download the complete script to see how this is done). If matching records are found (indicated by rowcount), the ship record is updated; otherwise (the else branch) a new ship record is added.
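
For context, here is a minimal sketch of the loop that would surround the snippet above, reading the Log of Logs CSV file and handing each row to the matching logic. The driver (pymysql), connection details and column positions are assumptions for illustration only; download the complete script for the real thing.

import csv
import pymysql  # assumption: any MySQL driver with a DB-API cursor would do

# Placeholder connection details, not the project's real credentials
conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="passengers")
cursor = conn.cursor()

with open("log_of_logs.csv", newline="") as csvfile:
    reader = csv.reader(csvfile)
    next(reader)  # skip the header row
    for row in reader:
        # row[1] is assumed to hold the vessel name; parse the tonnage
        # defensively because transcribed data is inconsistent
        try:
            int4 = int(row[4])  # assumed tonnage column
        except ValueError:
            continue
        # ... vessel matching from Code 1 goes here ...

conn.commit()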

Not all of the legacy data was relational, and to ensure a sound end user experience I had to establish some new data relationships. For example, a voyage required a mapping to both its origin and destination ports, and a vessel record required a mapping to a port record. These were normalised by extracting the distinct (unique) values into a separate table and then adding a relationship to the parent record in a separate mapping table. The mappings were achieved using a few SQL queries. The four queries below show how a table of unique ports was created and linked back to voyages: the first query creates a table called port; the second selects unique ports (obtained from the voyage origin and destination fields in the merged data source) and inserts them into the table; the third and fourth populate the mapping tables that connect each voyage to its origin and destination ports.

CREATE TABLE `port`(
  `id` INT NOT NULL AUTO_INCREMENT,
  `portName` VARCHAR(255) NULL,
  PRIMARY KEY (`id`));

Query 1: Create the port table

INSERT INTO `port`(portName) SELECT DISTINCT place
FROM (
SELECT DISTINCT origin AS place FROM voyage
UNION SELECT DISTINCT destination AS place FROM voyage
) AS derived
ORDER BY place;

Query 2: Select unique ports from voyage origin and destination 

INSERT INTO voyageorigins(voyageId,portId)
SELECT v.id, p.id
FROM voyage v
INNER JOIN `port` p ON p.portName = v.origin;

Query 3: Insert voyage origin mappings

INSERT INTO voyagedestinations(voyageId,portId)
SELECT v.id, p.id
FROM voyage v
INNER JOIN `port` p ON p.portName = v.destination;

Query 4: Insert voyage destination mappings

The same approach was applied to vessel and voyage sources, voyage masters, the port where a vessel was built and the organisation that built it, creating a database of individual record types instead of a traditional flat record of dissociated information.
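
Because the same normalise-and-map pattern was repeated for each of these fields, it can be captured in a small helper. The sketch below is illustrative only: the table and column names are assumptions, and the actual migration used hand-written queries like Queries 1 to 4 above.

# Hypothetical helper that repeats the pattern of Queries 1-4 for any
# field: build a lookup table of distinct values, then fill a mapping
# table (assumed to already exist with <src>Id and <lookup>Id columns).
def normalise_column(cursor, src_table, src_column, lookup, map_table):
    # The table and column names come from our own code, never from
    # user input, so interpolating them into the SQL is acceptable.
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS {lk}("
        " `id` INT NOT NULL AUTO_INCREMENT,"
        " `name` VARCHAR(255) NULL,"
        " PRIMARY KEY (`id`))".format(lk=lookup))
    cursor.execute(
        "INSERT INTO {lk}(name) "
        "SELECT DISTINCT {col} FROM {src} "
        "WHERE {col} IS NOT NULL AND {col} != ''".format(
            lk=lookup, col=src_column, src=src_table))
    cursor.execute(
        "INSERT INTO {mp}({src}Id, {lk}Id) "
        "SELECT s.id, l.id FROM {src} s "
        "INNER JOIN {lk} l ON l.name = s.{col}".format(
            mp=map_table, src=src_table, lk=lookup, col=src_column))

# For example (hypothetical names):
# normalise_column(cursor, "voyage", "master", "master", "voyagemasters")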

The final technical challenge was to get the data into an open source Drupal content management system (CMS) so that our museum staff could manage and grow the website over time. At History SA we use Drupal for the majority of our websites, so it was the natural choice for the front end of Passengers in History too. I identified and created a number of content types, including master, organisation, passenger, port, source, vessel and voyage; the Drupal CMS could then manage and present this data in an organised and user friendly way. I then developed a series of scripts to migrate the data into these new content types using the Drupal Migrate module. All data for the master, organisation, passenger, port, source and vessel content types was imported before the voyages. The voyage import tied all the pieces together: the passengers, masters, destination port and origin port were attached to each voyage using the Drupal Entity Reference module (which allows Drupal content instances, or nodes, to be connected with other content). In a similar manner the related port (Built In) and organisation (Built By) fields were attached to the vessel records. These references created a rich, interconnected set of data whose records could also be displayed individually.

Did we succeed?

When all was said and done, the migration process took 70% of the development time to refine, but the results speak for themselves. Investing in the structure and fidelity of the data has turned the content into a rich and easy to manage source of information. The Passengers in History website hosts some extremely diverse sources of content, which enables us to present the information in a more interconnected way and has allowed visitors to begin contributing to and enriching the content. The diversity and separation of the information provides a platform for future growth of the website and for connecting this information with other data on the internet.

Please download the Data Migration Resources used to create the Passengers in History data source. This download includes the Python script used to merge the two original sources, a snippet of the Log of Logs CSV file, the MySQL database schema (no data) used to merge the two sources and the custom Drupal module (using the Drupal Migrate module) that imports the final merged data into Drupal.

The next challenge (and perhaps the foundation of a new blog post) was to render this data within the framework of the website design by the fabulous people at Simple. I’ll leave that for next time…
