Find My Past Blog - Behind the scenes: fully indexing the birth records with data man


  • Find My Past Blog - Behind the scenes: fully indexing the birth records with data man

    I joined findmypast.co.uk at the start of the year and walked straight into my baptism of fire: reindexing the birth, marriage and death records.

    As head of the data team, I am responsible for the quality of the data we receive from two separate transcription companies. I have to ensure that they meet their guarantees of quality so that everything falls into place in time for the launch.

    Richard Jackson


    We received the files in quarterly batches, usually with one file per quarter. For the births alone this amounted to more than 600 files and 113 million records. The challenges lay in verifying that no records were missing, ensuring that all 800,000 images were in place and of high enough quality, and identifying and standardising any fields that had been transcribed in error.

    We shared knowledge with the transcription companies, supplied lookup tables of valid districts and common first and last names, and gave regular feedback so that they could validate their transcriptions before delivering them to us. This ensured that we were as close as possible to our desired accuracy levels before we handled the data ourselves.

    That said, there was still plenty of work to be done. By programmatically checking the files we received for gaps and inconsistencies, for example by comparing how the first letters of surnames were represented from one quarter to the next, we were able to identify and plug plenty of gaps well before the project neared completion.
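
The kind of cross-quarter check described above can be sketched roughly as follows. This is only an illustration, not the code the data team used: the function names, the `min_ratio` threshold and the sample surnames are all assumptions made for the example. It computes the share of surnames beginning with each letter in every quarter and flags any quarter where a letter is badly under-represented compared with the whole set.

```python
from collections import Counter


def initial_shares(surnames):
    """Return the share of surnames starting with each initial letter."""
    counts = Counter(s.strip()[0].upper() for s in surnames if s.strip())
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()} if total else {}


def find_gaps(quarters, min_ratio=0.5):
    """Flag (quarter, letter) pairs that look under-represented.

    quarters: dict mapping a quarter label to a list of transcribed surnames.
    min_ratio: hypothetical threshold; a letter is flagged when its share in
    one quarter is below min_ratio times its share across all quarters.
    """
    all_names = [s for names in quarters.values() for s in names]
    baseline = initial_shares(all_names)
    flagged = []
    for label, names in sorted(quarters.items()):
        shares = initial_shares(names)
        for letter, expected in sorted(baseline.items()):
            if shares.get(letter, 0.0) < min_ratio * expected:
                flagged.append((label, letter))
    return flagged


# Made-up sample data: the third quarter has no surnames beginning with S.
sample = {
    "1900Q1": ["Adams", "Brown", "Clark", "Smith"],
    "1900Q2": ["Allen", "Baker", "Cook", "Shaw"],
    "1900Q3": ["Abbott", "Bell", "Carter"],
}
print(find_gaps(sample))  # [('1900Q3', 'S')]
```

A flagged pair is only a prompt for a closer look; a real gap would still need to be confirmed against the images.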

    One of the most time-consuming parts of the births project was cleaning up the registration districts, turning incorrectly transcribed values into something that could be used in a search. Our list of invalid districts included over 60,000 incorrect values, all of which had to be standardised. My colleague Francesca Aiyeola and I spent many hours trying to work out whether that ‘B’ should have been an ‘R’ or an ‘H’, and acquired a fine appreciation for the transcribers’ skills and patience in the process!
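
As a rough illustration of how a lookup table of valid districts could help with this kind of clean-up, here is a minimal sketch using fuzzy matching from Python's standard `difflib` module. It is not the method described in the post, which involved a great deal of manual judgement; the district list, the `standardise_district` name and the `cutoff` value are assumptions chosen for the example.

```python
import difflib

# Illustrative subset only; a real lookup table would hold every valid district.
VALID_DISTRICTS = ("Bromley", "Hackney", "Romford", "Romsey")


def standardise_district(raw, valid=VALID_DISTRICTS, cutoff=0.8):
    """Map a transcribed district to a valid one, or None if unsure.

    Exact matches (ignoring case and surrounding spaces) are returned at once.
    Otherwise a value is accepted only when exactly one valid district is a
    close match; anything ambiguous or unrecognised is left for manual review.
    """
    cleaned = raw.strip().title()
    if cleaned in valid:
        return cleaned
    matches = difflib.get_close_matches(cleaned, valid, n=2, cutoff=cutoff)
    return matches[0] if len(matches) == 1 else None


for value in ["Bomford", "Backney", " hackney ", "Xyz"]:
    print(repr(value), "->", standardise_district(value))
```

Returning `None` rather than the best guess keeps doubtful values, such as a ‘B’ that could be an ‘R’ or an ‘H’, in front of a person instead of silently changing them.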

    We hope that you enjoy the birth records and that you’re looking forward to the fully indexed marriages and deaths, coming soon.


