Monday, 12 August 2024 08:20

Ensuring data quality - key to successful asset management and performance

In an Expert Focus article for WaterBriefing, Dr Atai Winkler, Principal Consultant at PAM Analytics, takes an in-depth look at why ensuring data quality is key to effective asset management in order to improve and maximise asset performance. 


Dr Atai Winkler: In common with many organisations, water companies require high quality data to help them plan their next actions with confidence. This includes planning for unwanted events by reducing the risk of such events occurring and minimising their consequences. This article discusses data quality, shows how data quality can be improved and gives examples of poor quality asset management data.

The availability of large quantities of data says nothing about their quality. If poor quality data are modelled, incorrect decisions that result in unexpected or undesirable consequences, including financial loss, serious accidents and loss of life, may be taken.

Using poor quality data to develop models is analogous to building on ground that has not been prepared properly for the structure to be built on it.

It cannot be assumed that data in corporate databases are always of high quality. Data quality is an analytics issue, not an IT issue: IT is concerned with how data are recorded and stored, while analytics is concerned with how data are prepared, analysed and modelled. Data quality can be improved by having good, comprehensive data governance procedures and by storing data in databases, where data formats are well defined, rather than in spreadsheets, where these conditions are not always satisfied.

Definition of Data Quality

There isn’t a single definition of data quality, probably because it covers many different aspects of data. Figure 1 shows the six dimensions of data quality. Data are of high quality if they satisfy all of them.

Figure 1: The Six Dimensions of Data Quality


Improving Data Quality

Improving data quality is a painstaking, unglamorous but very important task because successful analytics is contingent on good quality data. Time spent at the start of a project improving the quality of the data is time well spent, because the effects of poor quality data otherwise become apparent later in the project, requiring the data quality work to be redone. Improving data quality requires programmable and graphical software rather than menu-driven software, with its limited functionality, because each set of data is different and so requires its own transformations. Unfortunately, checking and improving data quality is often ignored when using menu-driven software because the software assumes that the data are in the correct form for modelling, a big and dangerous assumption.

Data Quality and AI


AI can help improve some aspects of data quality, for example removing duplicate records and handling missing values. However, AI models only work well if the current data have the same characteristics as the data used to train them. This cannot be assumed, and there are some aspects of data quality that AI cannot improve.

A significant problem with AI is that users do not know how the answers were obtained because the models are not visible. This problem is exacerbated if additional work is required to further improve the quality of the data after AI has been applied, and can lead to inconsistencies between results from the AI models and results from the additional work. To avoid this problem, and control and understand all the transformations used to improve the data quality, AI should only be used when the range and extent of the issues to be addressed are known and it can address all of them. For example, if a date is recorded incorrectly, results from using it may be unusually large or small. If the training data do not have such extreme results, AI may not identify them in the current data.

Aspects of Data Quality

Data are of poor quality if they have outliers, missing values, missing records, duplicate records or inconsistencies. These issues relate to the dimensions shown in Figure 1.

Outliers

Outliers are observations that cannot be explained or are unexpected. Whatever their cause, outliers should be investigated to see if they are valid observations or incorrect observations. If they are due to human error or machine malfunction, they should be treated as missing observations (see below) and imputed to obtain reasonable estimates of their true values (which may never be known). If they are valid observations, they should not be classified as outliers but analysed to understand why they occurred. Outliers that are deemed to be valid observations may require special consideration when the data are modelled.

Outliers can be identified from the frequency distribution of the data. Figure 2 shows the distribution of the number of pump failures at wastewater pumping stations over a number of years (Count on the Y axis is the number of pumps). It suggests that the pump with about 80 failures is an outlier that should be investigated to reveal why it had so many failures. It may then be appropriate to remove it from the data but outliers should not be removed without first establishing why they occurred.

Since this analysis is at the data exploration stage, rules for defining outliers are not required - data exploration is about understanding the data. The criticality of the assets is an important factor to consider when deciding if an asset is an outlier - tighter criteria are required for critical assets even though this may lead to more assets being investigated. Rules for defining outliers can be established when the data are modelled.

Figure 2: Identifying Outliers

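The kind of screen used in Figure 2 can be sketched in Python with pandas. The pump identifiers, failure counts and the interquartile-range rule below are assumptions for the example, not part of the original analysis:

```python
import pandas as pd

# Hypothetical failure counts per pump (one pump with about 80 failures).
failures = pd.Series(
    [3, 5, 4, 6, 2, 7, 5, 80],
    index=["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"],
    name="failure_count",
)

# Flag observations far outside the interquartile range as candidate
# outliers; they are candidates for investigation, not automatic removal.
q1, q3 = failures.quantile([0.25, 0.75])
iqr = q3 - q1
candidates = failures[failures > q3 + 1.5 * iqr]

print(candidates)
```

As the text stresses, a flagged pump is only a candidate for investigation; whether it is removed from the data depends on why it failed so often.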

Outliers in models are observations with large differences between the observed values and the predicted values. The models should be analysed to establish the reasons for the differences. If an outlier is caused by an incorrect or spurious observation, it should be removed from the dataset and the model rerun. A particular type of outlier that requires detailed study is influential observations: omitting them from a model changes the model significantly.
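One simple way to look for influential observations is a leave-one-out check: refit the model with each observation omitted and see how much the fitted coefficients move. A minimal sketch with numpy, using made-up data in which the last point is deliberately influential:

```python
import numpy as np

# Hypothetical (x, y) data; the last observation sits far off the trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 20.0])

slope_all = np.polyfit(x, y, 1)[0]

# Refit with each observation omitted and record the change in slope;
# a large change flags an influential observation.
changes = []
for i in range(len(x)):
    mask = np.arange(len(x)) != i
    slope_i = np.polyfit(x[mask], y[mask], 1)[0]
    changes.append(abs(slope_i - slope_all))

most_influential = int(np.argmax(changes))
print(most_influential)
```

In practice, measures such as Cook's distance formalise the same idea, but the leave-one-out loop makes the definition concrete.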

Missing Data in Fields

Missing data in fields are a common problem, particularly for data that are recorded manually. If a field has missing data, records with the missing data will not be used when the data are modelled. When missing data are distributed across all the records rather than being concentrated in a few records, the effective size of a database is reduced significantly. The number of missing values in a field is shown when the frequency distribution is tabulated.

It may sometimes be possible to impute missing data to obtain estimates of their true values. As the proportion of missing data in a database increases, the reasons why so much data are missing should be investigated and the validity of the available data questioned. Furthermore, there is less confidence in the results and conclusions obtained from modelling sparse databases.
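The counting and imputation steps above can be sketched with pandas. The field names, values and the median rule are assumptions for illustration; as the text notes, whether any imputation is appropriate depends on why the data are missing:

```python
import pandas as pd
import numpy as np

# Hypothetical asset records with missing values in two fields.
df = pd.DataFrame({
    "asset_id": ["A1", "A2", "A3", "A4"],
    "age_years": [12.0, np.nan, 8.0, 20.0],
    "pump_type": ["centrifugal", "submersible", None, "centrifugal"],
})

# Count missing values per field before deciding how to treat them.
missing_per_field = df.isna().sum()

# A simple imputation: replace missing ages with the median age.
df["age_years"] = df["age_years"].fillna(df["age_years"].median())

print(missing_per_field)
```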

Missing Records

If data that differ in key respects from data in the database are missing, the resulting models will not be representative of all the data. This can lead to bad decisions with possibly serious consequences when the models developed from the limited data are applied to data that include records with the missing characteristics.

Care must be taken when analysing a database that has a number of different object types, for example pumps, and some types occur much less frequently than other types. If a feature of the pumps, for example failure time, is being studied, it may be best to develop one model for all pump types and another model for the pump types with many records and compare the models. If they differ significantly, the pumps with fewer records should be modelled separately. Models developed using relatively few records have larger errors than models developed using many records.

Duplicate Records

Duplicate records, i.e. identical records for the same instance, can occur quite easily, particularly when databases are merged. Duplicate records in models bias the results towards them and so must be removed before the data are modelled. Procedures for identifying duplicate records and finding the first and last duplicates in each group of duplicate records are available in many software products.
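The procedures mentioned above are built into most data-handling software. A minimal pandas sketch, with made-up records, of removing exact duplicates while keeping the first record in each group:

```python
import pandas as pd

# Hypothetical merged database containing one exact duplicate record.
df = pd.DataFrame({
    "asset_id": ["A1", "A2", "A2", "A3"],
    "site": ["North", "South", "South", "East"],
})

# keep="first" retains the first record in each group of duplicates;
# keep=False would instead drop every record that has a duplicate.
deduplicated = df.drop_duplicates(keep="first")

print(len(deduplicated))
```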

A very common duplicate record problem in work order databases, which must be addressed before the data are used, is overlapping and nested records. Figure 3 shows an example of duplicate work order records for one asset.

Figure 3: Example of Duplicate Work Orders


 

The data have a number of inconsistencies:

site_name: String functions are required to correct the incorrect postcode, either by removing the space in ‘ABC 1’ or inserting a space in ‘ABC1’.

installation_date: Date functions are required to correct the incorrect year.

date_completed: The completion dates and work types (preventive and corrective) are different. The two records must be reduced to one record and decision rules applied to work out the completion date and work type of the new record (this is an example of overlapping work orders).
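The site_name correction above can be sketched with Python string functions. The normalisation rule (remove internal spaces, then reinsert one space before the final digit group) is a simplified assumption chosen to make the two variants in Figure 3 agree, not a general postcode parser:

```python
import re

# Hypothetical postcode variants, as in the site_name example.
postcodes = ["ABC 1", "ABC1"]

def normalise(pc: str) -> str:
    # Remove internal spaces, then reinsert one space before the
    # trailing digit group (simplified rule for this example only).
    compact = pc.replace(" ", "")
    return re.sub(r"(\d+)$", r" \1", compact)

print([normalise(pc) for pc in postcodes])
```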

Inconsistent Data

A good example of inconsistent data is a work order whose start date is after its close date. Inconsistent data are usually due to poor procedures, for example manual data entry without thorough checking. In other cases the reasons may be less clear in which case further work is required to establish the causes. In all cases new data quality procedures are required to ensure that all the data are consistent. Inconsistent data can be identified by carrying out simple analytical and graphical checks.
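The start-date/close-date check above is one of the simple analytical checks the text refers to. A pandas sketch with hypothetical work orders (field names are assumptions for the example):

```python
import pandas as pd

# Hypothetical work orders; W2 has a start date after its close date.
wo = pd.DataFrame({
    "work_order": ["W1", "W2", "W3"],
    "date_started": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-03-01"]),
    "date_closed": pd.to_datetime(["2024-01-07", "2024-02-08", "2024-03-04"]),
})

# Flag records whose start date is after their close date for investigation.
inconsistent = wo[wo["date_started"] > wo["date_closed"]]

print(inconsistent["work_order"].tolist())
```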

Table 1 shows aspects of data that are not usually associated with data quality but must be checked for before data are used.

Table 1: Other Aspects of Data Quality


 

Preparing Asset Management Data for Modelling


The format (scale, ordinal, categorical, date/time) of the data determines how they should be prepared. The first thing to do in all cases is to improve their quality as discussed above.

Since scale data have the richest numerical properties and satisfy the rules of arithmetic, poor quality scale data can be easily identified and improved by carrying out exploratory data and graphical analyses. If a field that should be numeric contains a non-numeric character, it will be treated as a text field and so cannot be used for arithmetic operations.
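The stray-character problem above can be detected by attempting a numeric conversion. A pandas sketch, with made-up readings, in which unparseable values become missing values that can then be investigated or imputed:

```python
import pandas as pd

# Hypothetical flow readings; one value contains a stray character, so
# the whole field would otherwise be treated as text.
raw = pd.Series(["12.5", "13.1", "12..7", "14.0"])

# errors="coerce" turns values that cannot be parsed into NaN, which
# can then be handled like any other missing value.
flows = pd.to_numeric(raw, errors="coerce")

print(flows.isna().sum())
```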

Ordinal data are grouped (banded) scale data. A common example of grouped data is age (for example <=20, 21-30, 31-40, etc.). Ordinal fields must be defined carefully to ensure that the number used to represent each group reflects the values in the group. They are usually defined to make modelling and visualisation clearer for data that have a large range. An example of asset management data that are better modelled when grouped is the number of maintenance interventions.
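Banding scale data into ordinal groups can be sketched with pandas, mirroring the <=20, 21-30, 31-40 example above (the ages and band edges are made up for illustration):

```python
import pandas as pd

# Hypothetical asset ages banded into ordinal groups.
ages = pd.Series([5, 18, 22, 35, 40, 27])

# pd.cut uses right-closed intervals by default, so 20 falls in <=20
# and 30 falls in 21-30, matching the bands in the text.
bands = pd.cut(
    ages,
    bins=[0, 20, 30, 40],
    labels=["<=20", "21-30", "31-40"],
)

print(bands.tolist())
```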

Categorical data can be represented by numbers or text. The numbers are labels and so do not have numerical significance and arithmetic operations cannot be carried out on them (they are numeric strings). An example of numeric strings is assets identified by numbers - they do not have numerical significance but are just labels.

Frequency distributions of categorical data often show many more values than the field actually has. Consider the ways in which the value mytown pumping station can be represented: ‘Mytown pumping station’, ‘Mytown Pumping Station’, ‘Mytown Pumping station’, ‘Mytown Pumping_Station’, ‘Mytown_Pumping_Station’, ‘ Mytown_Pumping_Station’. They differ, if only slightly, and so unless they are all mapped to one value, software will treat them as six different values.
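Mapping such variants to one value can be sketched with Python string functions. The rule below (strip blanks, replace underscores with spaces, standardise case) is an assumption chosen to cover the six variants above; real data may need further rules:

```python
# The six hypothetical variants of the same pumping station name.
variants = [
    "Mytown pumping station",
    "Mytown Pumping Station",
    "Mytown Pumping station",
    "Mytown Pumping_Station",
    "Mytown_Pumping_Station",
    " Mytown_Pumping_Station",
]

def normalise(value: str) -> str:
    # Strip leading/trailing blanks, replace underscores with spaces
    # and standardise the case, so all variants map to one value.
    return value.strip().replace("_", " ").title()

unique_values = {normalise(v) for v in variants}
print(unique_values)
```

Note that the strip() call also handles the leading and trailing blanks discussed below.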

Another problem with categorical data is leading blanks and trailing blanks. Both types of blank are part of the value and so must be removed. Leading blanks can be identified by studying the frequency distribution but trailing blanks (rarer) are harder to identify.

Date and time fields have their own formats that allow arithmetic to be carried out on them, for example to calculate the interval between two dates. If a format other than a date or time format is used for dates or times, the dates or times may be specified as the number of date or time units since a base date, for example 1 January 1900, and so it is important that the correct formats are used.
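The interval calculation above only works once the fields are parsed into a genuine date format. A minimal pandas sketch with made-up dates:

```python
import pandas as pd

# Parse text into proper date values; arithmetic then works directly.
installed = pd.to_datetime("2024-01-01")
failed = pd.to_datetime("2024-03-01")

# Interval between the two dates in days (2024 is a leap year).
age_days = (failed - installed).days
print(age_days)
```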

Conclusion

This article has discussed data quality with particular reference to asset management. It does not discuss every aspect of data quality, because the quality of the data depends on many factors, including how the data were recorded. Nevertheless, it provides useful guidance on how to improve the quality of asset management data.

 

Readers interested in finding out more can contact Dr Atai Winkler at:


m: 07817 263016

pamanalytics.com
