Data science is a multidisciplinary field that uses statistics, scientific methods and algorithms to extract knowledge and insights from data. A data scientist should follow the data mining implementation process: understand the business, understand the data, prepare the data, perform the modelling, evaluate the results and then deploy. Contrary to what many people think, the most crucial and time-demanding part of the process is understanding the business and then the data. The most important skills for a data scientist are statistics, Python coding, machine learning, SQL databases, intellectual curiosity, business acumen and communication.

Personally, I started learning statistics at a young age, obtaining an A in A-Level Statistics. At university, while pursuing my bachelor's degree in Physics, I had laboratory work in all three years that required data analysis, and I achieved a first-class mark each year. In addition, I have used machine learning algorithms from the scikit-learn Python package to address multiple issues with the large dataset I was using for my PhD. During my MSc in Computer Science I took a Database module, which included an SQL project, scoring 66%. As I describe in the interests section, I like reading the news, which has helped me to understand economic globalisation.


In particular, I have employed the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to improve the signal-to-noise (S/N) ratio of the dataset. In the image here (top panels) we can see a seismometer (blue triangle) recording earthquakes (red stars) from various locations in Southeast Asia. We used a clustering approach to group together raypaths that are close in space and time, increasing the signal and reducing the noise in our dataset. The middle panels show the effect of this approach on the seismic velocity at 200 km beneath the Earth's surface. The bottom panels show checkerboard tests (resolution tests used in seismic tomography), which demonstrate the resolution increase after using this approach. The left-hand panels show the results before applying the DBSCAN method. More information can be found in section 3.2 of my paper.
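
As a rough illustration of the clustering idea described above, the sketch below groups synthetic raypaths with scikit-learn's DBSCAN and then averages a per-ray measurement within each cluster. It is not the code used in the paper: the feature choice (earthquake latitude, longitude and depth), the `eps` and `min_samples` values, the helper names `cluster_raypaths` and `stack_clusters`, and the averaging step are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def cluster_raypaths(features, eps=0.5, min_samples=3):
    """Return a cluster label per raypath; DBSCAN marks noise with -1."""
    # Standardise so that degrees and kilometres are comparable (assumed choice).
    scaled = StandardScaler().fit_transform(features)
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)


def stack_clusters(values, labels):
    """Average a per-ray measurement within each cluster, dropping noise rays."""
    return {
        int(k): float(values[labels == k].mean())
        for k in np.unique(labels)
        if k != -1
    }


# Synthetic data: two tight groups of earthquakes (lat, lon, depth_km)
# plus one isolated event that should be flagged as noise.
rng = np.random.default_rng(0)
group_a = rng.normal([5.0, 100.0, 50.0], [0.1, 0.1, 2.0], size=(20, 3))
group_b = rng.normal([-5.0, 120.0, 300.0], [0.1, 0.1, 2.0], size=(20, 3))
outlier = np.array([[20.0, 140.0, 600.0]])
features = np.vstack([group_a, group_b, outlier])

# Hypothetical noisy measurement attached to each ray (e.g. a travel-time residual).
residuals = rng.normal(0.0, 1.0, size=len(features))

labels = cluster_raypaths(features)
stacked = stack_clusters(residuals, labels)
print(labels)
print(stacked)
```

Averaging within a cluster is one simple way to trade many noisy rays for fewer, cleaner summary rays; in practice the features and `eps` would need tuning to the geometry of the real dataset.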

Moreover, I have taken part in the Kaggle "Quora Insincere Questions Classification" competition, where I used NumPy, pandas and Keras to predict whether a question on Quora is sincere or not.



©2020 by Aristides Zenonos