Tuesday, 4 December 2018

Week 4 [3-9.12.18], An introduction to modern missing data analyses


Hi,
Today I would like to present an article on data analysis. Some of us are researchers who base their doctoral theses on the results of their own experiments, and the conclusions we draw rest on the analysis of the collected data. Unfortunately, it is very common that some of that data is missing. This may be the result of a poorly designed experiment; sometimes a measuring device fails or the data acquisition software has an error. Most often, however, it is the result of human error, of work that was not done accurately and reliably. The consequence of such mistakes is the loss of important data, so missing data is practically inevitable in research. The problem concerns not only the social sciences but also related areas of computer science, especially computational social science, where human behaviour is the dependent variable.

Various data replacement methods have been proposed as a solution, yet the literature has often passed over how missing data can undermine the credibility of research results. This is partly because the statistical methods that can deal with missing data were, until recently, not readily available to researchers. I would therefore like to present different approaches to the problem of replacing lost or missing data. I am enthusiastic about algorithms that can improve the reliability of research results and reduce the waste of resources caused by missing data; the cost of running these algorithms is small compared to the cost of data collection. As a result, there is no longer any excuse for sweeping missing values under the carpet, nor for treating a potentially misleading and inefficient complete-case analysis as adequate.
Questions
1. Does the method of replacing empty data make sense?
2. Does the method of deleting whole lines with incomplete data give better results than substitution?
3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
4. How can researchers use missing data to improve their research projects?
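
To make these options concrete, here is a minimal sketch in Python (my own illustration on synthetic data, not code from the article; it assumes pandas and a recent scikit-learn). It compares listwise deletion, single mean imputation, and a simple multiple-imputation loop that pools one estimate over several completed datasets.

```python
# Minimal sketch (synthetic data): listwise deletion vs. single mean
# imputation vs. a simple multiple-imputation loop with pooled estimates.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.choice(200, size=40, replace=False), "x"] = np.nan   # ~20% missing

# 1) Listwise deletion: keep only complete rows.
complete_cases = df.dropna()

# 2) Single mean imputation: every gap gets the column mean.
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

# 3) Multiple imputation (sketch): create several plausible completed
#    datasets, estimate the slope of y on x in each, then pool the estimates.
slopes = []
for m in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    slopes.append(np.polyfit(completed["x"], completed["y"], 1)[0])

print("complete-case slope:", np.polyfit(complete_cases["x"],
                                         complete_cases["y"], 1)[0])
print("pooled MI slope:    ", np.mean(slopes))
```

In a real analysis the pooling step would also combine the uncertainty across the imputations (Rubin's rules); this toy example only averages the point estimates.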

27 comments:

  1. 1. Does the method of replacing empty data make sense?

    I am not enough of a specialist in this field to be sure I am right, and I have mixed feelings. On the one hand, filling in missing data is important; on the other, the supplemented values may not be appropriate, may effectively introduce false data, and may lead to incorrect conclusions. I also see an ethical problem here, because algorithms that fill in missing data can be manipulated. As we know, studies are often carried out on a very small group, or on one completely unsuited to the subject of the study. I think this leaves room for manoeuvre to tilt the balance of the results.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    I personally think that deleting the line is the better option. If the existing data was incorrect, or an error has crept into it, the algorithm will suggest untrue values, and all of this can lead to a false result. Deleting the line and acknowledging that certain data is simply not there seems the more logical and safer step.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    As I mentioned, I am not a fan of this solution, but if I had to answer, I would choose maximum likelihood.

    4. How can researchers use missing data to improve their research projects?

    I think it is important to work out why the data is missing; perhaps another research method should be used, a better observation set found, or other changes made.

    Replies
    1. If there is not enough data from some categories in the study, we can over-sample those categories. Of course, the distribution of the remaining data must be preserved.
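
    A minimal sketch of this idea (my own illustration with a hypothetical two-column table, using scikit-learn's resample helper):

    ```python
    # Over-sample an under-represented category while leaving the rest of the
    # data untouched, so the majority distribution is preserved.
    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({"group": ["A"] * 95 + ["B"] * 5,
                       "value": range(100)})

    majority = df[df["group"] == "A"]
    minority = df[df["group"] == "B"]

    # Draw the minority rows with replacement until the classes are balanced.
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)

    balanced = pd.concat([majority, minority_up])
    print(balanced["group"].value_counts())
    ```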

  2. 1. Does the method of replacing empty data make sense?

    I think the method of filling in empty data isn't very good for scientific research. For years people have been accounting for errors just to estimate the missing data or to normalize the results. Every dataset nowadays, I think, is burdened with missing data or errors. I wouldn't like to use such algorithms.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    Yes, removing lines with missing data is a much better solution. A complementing algorithm can fill the empty fields with random data that may distort the results of the actual tests.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    In my opinion, maximum likelihood is the better method if data has to be added at all.

    4. How can researchers use missing data to improve their research projects?

    I don't know how exactly, but I think that a change in the research method, a different scope of research, or a change of thinking can easily replace algorithms for filling in empty data. Personally, I'm really opposed to such procedures.

    Replies
    1. Data substitution is better than deleting entire rows, but only if the gaps represent up to about 30% of the total dataset.
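
    As a small illustration of that rule of thumb (the 30% figure is the commenter's heuristic, not a universal threshold; the data and column names below are made up):

    ```python
    # Check the share of missing values per column before deciding between
    # substitution and deletion.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0, np.nan],
                       "y": [2.0, 5.0, np.nan, 8.0, 9.0]})

    missing_share = df.isna().mean()          # fraction of gaps per column
    print(missing_share)
    if (missing_share <= 0.30).all():
        print("gaps are moderate - imputation is an option")
    else:
        print("too many gaps - reconsider the data collection itself")
    ```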

  3. 1. Does the method of replacing empty data make sense?

    I'm interested in data wrangling and related areas, so for me this was really interesting. As in every data analysis, the method of replacing empty elements depends on what we are researching. If the data are critical and change over a short period of time, we cannot simply replace or remove empty or wrong values from a row. We have to check how the data behave and then decide whether replacing the empty values is necessary or whether we should leave them alone.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    As above, it depends on what we are researching. If the line contains information about anomalous behaviour, then this data could be a flag for us that something bad happened at that moment.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    I usually use the method of inserting data based on maximum likelihood.

    4. How can researchers use missing data to improve their research projects?

    If researchers start to log not only the data but also peripheral information (e.g. from objects outside the data flow, or precise timestamps), then they could easily find out what caused the errors generated inside the project.

    Replies
    1. Very good answer. It depends on what data we analyze and what we will do with it.

  4. 1. Does the method of replacing empty data make sense?
    From my experience, it is always a separate and important part of the research to find out why we have outliers and missing data. Especially in the social sciences, missing data can be produced intentionally, e.g. by respondents of a survey, because they do not want to answer a particular question or there is no right answer for them. In that case it is really important to find out why they do not want to answer.
    If you are sure that the issue was only a problem with the equipment or some other random cause, and your dataset is really small, then you can try to replace the empty values. But you should be careful and run the same analysis on the raw data as well, to check that your action does not create results that were not there before the replacement.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    As I wrote in section 1, it always depends on the data you are analysing. If, for example, you are planning to replace values in a column that is the key point of your research, then deleting the whole line will probably give better and less biased results.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    Usually I prefer maximum likelihood, but you can always try both and test the difference in the results to judge which one is better in your case.

    4. How can researchers use missing data to improve their research projects?
    Based on the missing data they can diagnose the issue and perhaps prepare a better way of obtaining some values, or fix some technical issues.

    Replies
    1. These methods don't differ significantly. The most important part is the last phase, where the final prediction is chosen by majority voting.

  5. 1. Does the method of replacing empty data make sense?
    I think not, because it may lead to a false result. Also, with the help of replacement it is possible to manipulate the data.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    I'm sure that deleting lines is much better than filling in empty fields. In this case, it is more difficult to manipulate data.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    Whichever method gives the more truthful result. The most important thing in our situation is reliable information.

    4. How can researchers use missing data to improve their research projects?
    In this case, the most important factor will be the accuracy of the data. I'm not sure that this algorithm will give us the necessary accuracy. Perhaps in the research field, a different algorithm is needed, but unfortunately I am not an expert.

    Replies
    1. Missing data has a huge impact on our knowledge of the field. It influences the values of descriptive statistics such as the mean, median and mode, but also the graphical representation of the data. Missing data disturbs the interpretation of the analysed set.
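
    A tiny illustration of that effect (my own made-up numbers): when the missingness is not random, the statistics computed from the remaining complete cases are distorted.

    ```python
    # When the largest values go missing, the observed mean, median and mode
    # all understate the true ones.
    import numpy as np
    import pandas as pd

    full = pd.Series([2, 3, 3, 4, 5, 7, 8, 9, 9, 10])
    observed = full.astype(float)
    observed[full >= 8] = np.nan              # high values are lost

    print("full:    ", full.mean(), full.median(), list(full.mode()))
    print("observed:", observed.mean(), observed.median(), list(observed.mode()))
    ```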

  6. 1. Does the method of replacing empty data make sense?
    If it is statistically valid, then yes. Often in order to get something from such incomplete data we have to use such methods.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    I think it depends on how much data we have at our disposal. If the set is very large, I would bet on deleting the entire line with the missing data, otherwise I would try to fill in the missing data.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    Neither is inherently better than the other; in fact, when implemented in comparable ways, the two approaches tend to produce nearly identical results. With either maximum likelihood or multiple imputation, the information used to model the missing data comes, naturally, from the set of variables available to the procedure. With maximum likelihood, this set of variables is typically confined to those included in the particular analysis at hand, even if this means omitting one or more auxiliary variables that carry information needed by the missing data model (see the sketch at the end of this comment).

    4. How can researchers use missing data to improve their research projects?
    First of all, they can work out what data is missing, which gives them an opportunity to change the way they design the experiment. Thanks to this, their experiment can capture data that has not been obtained in other datasets.
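
    A rough sketch of the auxiliary-variable point from answer 3 (synthetic data and variable names are my own, and scikit-learn's IterativeImputer stands in here for a full multiple-imputation routine): 'aux' is not part of the final analysis, but letting the imputation model see it helps fill in the missing 'x' values.

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(3)
    aux = rng.normal(size=500)                     # auxiliary variable
    x = aux + rng.normal(scale=0.2, size=500)      # aux is informative about x
    y = 2.0 * x + rng.normal(scale=0.5, size=500)
    df = pd.DataFrame({"x": x, "y": y, "aux": aux})
    df.loc[rng.choice(500, size=150, replace=False), "x"] = np.nan

    # Impute using all three columns, then analyse only x and y.
    completed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                             columns=df.columns)
    print("slope of y on x:", np.polyfit(completed["x"], completed["y"], 1)[0])
    ```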

  7. 1. Does the method of replacing empty data make sense?
    I think it makes sense, but the answer may vary depending on the data and the background knowledge. Sometimes it is enough to calculate the median to complete such cases, sometimes not. What is interesting is that missing data can bring us new information: for example, if we store data about people and someone does not provide their age, it may indicate that this person is older and wants to hide it (a small sketch of both ideas follows this comment's answers).
    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    It depends on the data. If it is just a few lines from a huge dataset, then it should not influence the results. If we want to delete rows from a small dataset, and would probably delete some very unique attributes, then we have to be careful. We can always run some tests to make sure it is safe to delete.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    I think it depends on the data and the problem we want to solve. Again, we can test which one is better on the dataset we are working with.

    4. How can researchers use missing data to improve their research projects?
    As I mentioned, missing data can bring us new information, and that is something we can add to our research. If missing data is an important problem in a study, we can also add a comparison of different missing data replacement methods. If not, we can simply describe how we dealt with the missing data.
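
    A small sketch of both ideas from answer 1 (hypothetical 'age' column, my own illustration): fill the gaps with the median, but also keep a flag recording that the value was missing, since the missingness itself may carry information.

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [23, 35, np.nan, 41, np.nan, 52]})
    df["age_missing"] = df["age"].isna()           # keep the signal
    df["age"] = df["age"].fillna(df["age"].median())
    print(df)
    ```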

    1. Does the method of replacing empty data make sense?
    Of course it does. Every method of enriching our knowledge makes sense, in this case through extending the dataset. But it has to be done in an appropriate way, fitted to each case.
    We should keep in mind that there will be cases where this method is not the best solution.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    Removing whole lines of data is just the simplest way of dealing with missing data, but it's not the smartest one. It only leads us to throwing away valuable information. This approach will not give us better results, because we start by shrinking our dataset and wiping out part of the observations, and the significance tests will lack power. And we don't have to do it, because we have great tools to take care of the missing data problem.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    Both methods are pretty good, especially when compared with more traditional methods like listwise deletion or conventional imputation. Maximum likelihood (ML) and multiple imputation (MI) make similar assumptions, and they have similar statistical properties. From the implementation point of view, ML is the better choice: it is easier to use, and the results come from a single model, whereas MI gives you a slightly different result every time you run it, because random draws are a crucial part of the process (see the short illustration after this comment's answers).

    4. How can researchers use missing data to improve their research projects?
    Researchers frequently face decisions about handling missing data that may influence their results. Missing data potentially affects the validity and reliability of findings, as well as the generalizability of results. I think they can compare results with and without filling in the missing data. They can also try to identify the mechanism that causes the data to be missing.
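
    A quick illustration of the run-to-run variation mentioned in answer 3 (toy data, my own sketch; scikit-learn's IterativeImputer with sample_posterior=True plays the role of one MI draw): the same missing cell gets a different plausible value on each run, which is exactly why MI results are pooled rather than taken from a single run.

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(2)
    a = rng.normal(size=100)
    b = a + rng.normal(scale=0.3, size=100)        # correlated second column
    X = np.column_stack([a, b])
    X[::10, 1] = np.nan                            # every tenth b is missing

    for seed in (0, 1):
        imputed = IterativeImputer(sample_posterior=True,
                                   random_state=seed).fit_transform(X)
        print("seed", seed, "-> imputed X[0, 1] =", round(imputed[0, 1], 3))
    ```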

  9. 1. Does the method of replacing empty data make sense?
    In my opinion it doesn’t make sense. Going in this direction, we rely on forecasts and not facts, which reduces the whole experiment to the level of assumptions, not the analysis of facts.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    Yes. As the presented article states, mean substitution as a missing data technique never works well; its use may result in biased estimates, incorrect standard errors, or both. I think that the method of deleting whole lines is better, but the deleted data should be reviewed to check that it is not a significant anomaly.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    It depends on the level of missing data. I'm not a specialist, but I would choose maximum likelihood.

    4. How can researchers use missing data to improve their research projects?
    To my mind, they should not use such methods, or perhaps only as a first step in approaching the problem, to understand it better. I agree with Cezary that researchers should focus on real and complete data. If it is impossible to gather it, they should try to find another area where it is possible.

    Replies
    1. With this approach, if we do not focus on cases that are in the minority, or of which there are only a few, we lose a huge research potential. For example, in fraud detection in transactions or detection of rare diseases, such data makes up a small per mille of the whole, but it is of great importance.

  10. Does the method of replacing empty data make sense?
    Yes, this method seems to be promising.

    Does the method of deleting whole lines with incomplete data give better results than substitution?
    I have not done any tests of this hypothesis, so it is hard to tell without proper investigation.


    Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    As with the previous question, I have not done any research in this field, so it is hard to determine the correct answer.


    How can researchers use missing data to improve their research projects?
    Once again, not being a subject matter expert, it is difficult for me to propose anything in this field.

  11. 1. Yes, because empty data in our dataset matters for the result of the final analysis. Of course, it is best simply to get rid of the empty entries, but when there is a lot of empty data in our set, the dataset may become too small for any analysis. That's why it's good to replace them with other values.

    2. In my opinion, it depends on the specific problem. If removing the line with incomplete data does not significantly affect the results, we should remove it. However, if the data describes an important, specific factor, then it is better to replace it with generated values of the kind we would expect.

    3. As mentioned in the article you presented, the choice of method depends on the individual preferences of the researcher and the problem being investigated. Both methods produce similar results and deviations, although multiple imputation is the more flexible method.

    4. Missing data can have a big impact on the final results of the experiment and research. Missing data may also indicate an error in our experiment; in that case, we should check the correctness of all factors and eliminate any errors. The missing data can also prompt us to use better methods to validate the final results, because it may turn out that the extent of the missingness is not significant for our results.

  12. 1. Does the method of replacing empty data make sense?

    I think the answer to this question is ambiguous. A method based on replacing empty data may change the sense of the data, so it is important to know why the data is empty. We have to know whether the empty data would be used for analysis purposes or is only needed to obtain the same dimensions during processing. When we apply a method of replacing empty data, we should try to perform the analysis both on the dataset with the empty data and on the one with replaced data, and compare the results. The comparison will show whether the replacement method influences the outcome. In some cases it is necessary to interpolate the data in order to obtain a continuous signal.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    As I wrote above, it depends on what data we have. I can express my opinion from the point of view of EEG data. If we have a significant amount of data and only a few lines contain incomplete data, we can delete them; this does not usually influence other parts of the data, and mainly happens when a recording is started or finished. An EEG recording contains the signal value for each electrode used in the experiment, sampled at a defined frequency, so deleting a whole line from the recording also throws away the information gathered from the other electrodes (see the interpolation sketch after this comment's answers).

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    I do not know which method is more efficient; I do not think we can say that one of them is always the best. If we test the two methods, we can say which one is better for the analysed case. For EEG data I would choose multiple imputation.


    4. How can researchers use missing data to improve their research projects?

    If researchers find missing data, they should diagnose the problem and try to solve it to improve their research projects. There may be many potential causes, from the way the research is done, through problems with hardware and the applied algorithms, to other factors. Explaining why the data was lost leads to better results in further research.
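
    As a rough sketch of the interpolation idea for a regularly sampled signal (synthetic sine wave, not real EEG; my own illustration):

    ```python
    # Fill a short dropout in a sampled channel from its neighbours instead of
    # deleting the affected samples across all channels.
    import numpy as np
    import pandas as pd

    t = np.arange(0, 1, 1 / 250)                       # assume 250 Hz sampling
    channel = pd.Series(np.sin(2 * np.pi * 10 * t))    # toy 10 Hz "channel"
    channel.iloc[40:45] = np.nan                       # a short dropout

    filled = channel.interpolate(method="linear")
    print("remaining gaps:", filled.isna().sum())
    ```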

  13. 1. Does the method of replacing empty data make sense?
    Only if the missing data is tiny compared to the whole, and if the experiment cannot be easily and cheaply redone.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    For a small number of incomplete rows, I'd go straight for deleting. For more than, let's say, 1%, I'd reconsider substitution, but I would still prefer deleting.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?
    I think multiple imputation has the nicer interpretation: we assume what could have been and gather multiple outcomes. In terms of computational efficiency, maximum likelihood probably wins.

    4. How can researchers use missing data to improve their research projects?
    If the missing data can be traced back to some cause, the data gathering process can be improved for the next iteration.

  14. Thank you for raising an interesting topic. Data that comes from real-world datasets very often contains missing values. A basic strategy in this case is to discard entire rows and/or columns that contain missing data; however, that throws away values that could be useful in the research. So, from my perspective and experience, a good strategy is to impute the missing values with a constant, or with a statistic (e.g. the mean or median) of each column in which the gaps are located. Current ML toolkits, such as scikit-learn, allow the user to choose the imputation strategy, so there is no gold standard for every case and it should be adjusted to the particular problem.
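
    A minimal sketch of those strategies with scikit-learn's SimpleImputer (column names and values are made up for illustration):

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                       "income": [3200.0, 4100.0, np.nan, 3900.0]})

    # Fill 'age' with the column median and 'income' with a constant.
    df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()
    df["income"] = SimpleImputer(strategy="constant",
                                 fill_value=0.0).fit_transform(df[["income"]]).ravel()
    print(df)
    ```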

  15. 1. Does the method of replacing empty data make sense?

    I agree with the previous speakers: in my opinion, the method of filling in empty data is not a good solution. It can lead to a wrong understanding of certain things, and in the case of critical data it can change the very idea and results of the research and projects.
    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    If I am to choose, I prefer this method over the one presented in the question above.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    I'm not an expert in this field, but I would choose maximum likelihood.

    4. How can researchers use missing data to improve their research projects?

    I'm not sure, but maybe they could change the angle or the scope of their research projects.

  16. Filling in empty values might help in some statistical, multivariate analyses where having the same number of observations for each variable is necessary.
    But if we have enough observations, or filling in the empty values simply makes no sense (for example because of the high variance of a given variable), then it is better to delete the whole sample.

  17. 1. Does the method of replacing empty data make sense?

    I think that depends on what we want to find. We need to see how the data changes and choose the best option.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    Yes, but only in some situations. We need to check the data, because sometimes deleting a whole line of data will give a better result. It depends on what we are researching and on the kind of data we have.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    It depends on the kind of problem we need to investigate and on the preferences of the researcher.

    4. How can researchers use missing data to improve their research projects?

    I think we need to analyze the data and choose better methods of obtaining data for the next research, e.g. logging additional data. Sometimes we can use methods that fix the missing data, but we need to be careful.

  18. This is a hard-core statistics / ML topic, and I may be wrong in some intuitions.
    Apart from that, I'll do my best.

    1. Does the method of replacing empty data make sense?

    That depends, but sparse datasets are a part of (many a) scientist's life. Throwing out all data points with missing values would be either wasteful or would completely defeat the purpose of the research. For a small dataset, a human could reason about the data points with some robustness against missing features; why not try to force a method to do it as well?

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    Probably sometimes, especially if they were outliers to begin with. This is a trade-off between overfitting (a model perfectly fitted not only to the training data but even to the imputations of the training data) and producing a model too weak to be of any practical use. Sometimes deletion will give us a too simplistic, underfitted model. It is useful to check both ways in a practical case and, for example, cross-validate to pick the better approach for the specific task (a sketch of this follows this comment's answers).

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    I have no idea. Maximum likelihood sounds a bit like a greedy algorithm, so it may produce "too uniform" results and may actually not help our model. Not that I know much about this kind of statistical learning, but I would probably go for a mix - for example an ensemble model (like AdaBoost) using subsets of features for the base learners, or a method which can handle missing values natively (like some decision tree and Random Forest implementations), where a split can also happen on the missing/present status of a decision variable.

    4. How can researchers use missing data to improve their research projects?

    They can't use the missing data itself, because it's missing ;) But I guess you mean the methods of data imputation - then quite likely yes, provided they have real-world data, which sometimes has blanks in it. If they have full information, these methods won't help them.
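
    A sketch of the cross-validation idea from answer 2 (toy data and hypothetical column names; mean imputation stands in for whatever imputation method is being considered): score complete-case deletion against imputation and keep whichever generalises better.

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(1)
    X = pd.DataFrame({"a": rng.normal(size=300), "b": rng.normal(size=300)})
    y = X["a"] + 0.5 * X["b"] + rng.normal(scale=0.3, size=300)
    X.loc[rng.choice(300, size=60, replace=False), "b"] = np.nan

    # Option 1: listwise deletion.
    mask = X.notna().all(axis=1)
    score_drop = cross_val_score(LinearRegression(), X[mask], y[mask], cv=5).mean()

    # Option 2: imputation inside a pipeline, so it is refitted per fold.
    pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
    score_impute = cross_val_score(pipe, X, y, cv=5).mean()

    print("complete cases R^2: ", round(score_drop, 3))
    print("with imputation R^2:", round(score_impute, 3))
    ```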

  19. 1. Does the method of replacing empty data make sense?

    It depends on the data structure - how big the dataset is and how many missing values it contains.
    The type of data, and what the data represents, is also very important for this question.
    But generally - yes, it does.
    2. Does the method of deleting whole lines with incomplete data give better results than substitution?
    In general no, it doesn't. It is an old-school, brute-force method.
    We once had a situation - during our studies - when people who participated in the study didn't want to fill in the form.
    Neither of these methods would have worked, because we didn't have any meaningful data at all.
    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    These two algorithms work quite differently and both work well. Again, as I said before, the choice between them depends on our study and our datasets. You can try both and compare the results on your own data.

    4. How can researchers use missing data to improve their research projects?
    Statistical analysis with missing data is a very big topic.
    If we want to study it, I think the first step should be this book:
    Little, Roderick J. A., and Donald B. Rubin. Statistical Analysis with Missing Data. Vol. 333. John Wiley & Sons, 2014.

  20. 1. Does the method of replacing empty data make sense?

    I am afraid that the method of replacing empty data is not a good one. What happens if the values we substitute for the relevant data are wrong? The results obtained may be distorted. In addition, depending on the problem, each value has a specific effect on the others.

    2. Does the method of deleting whole lines with incomplete data give better results than substitution?

    In my opinion, deleting whole lines would be much better than replacing incomplete data with randomly selected values. We don't know what information the line would be filled with, or whether the added data will affect the end result.

    3. Which method of data replacement is more efficient: maximum likelihood or multiple imputation?

    To check the efficiency of these methods, you would have to run the calculations on the same data and compare both. If I had to choose, I would prefer maximum likelihood, because the model is simpler.

    4. How can researchers use missing data to improve their research projects?

    I agree with Damian. A good research project is based on complete and true data. Adding information via algorithms introduces errors into the calculations and the final result. It would be worth taking a closer look at the data to find the cause of the gaps or inconsistencies.

  21. Missing data has long plagued those conducting applied research in the social, behavioural and health sciences. As for replacement methods, I in fact consider them a form of artificial content creation, leading us to unreal information.
    If I had to choose between substitution and deleting the whole line, I would rather go for the second method. It is better to have no information than to have unreliable information.
    Choosing between maximum likelihood and multiple imputation, I guess the first one is more reliable. Likelihood-based algorithms are still being developed and provide better and better accuracy. Of course, there are many issues to discuss, like attrition, non-Monte-Carlo techniques for simulations involving missing data, evaluation of the benefits of auxiliary variables, and highly cost-effective planned missing data designs.
    Researchers can use missing data as a chance to find new scientific problems to be solved. I see it as a possibility to create new hypotheses, which can be checked in further steps, or as inspiration for other fields. What is interesting is that investigators who are not statisticians are able to implement modern missing data procedures properly in their research, and reap the benefits in terms of improved accuracy and statistical power.
