---
title: Essay
published: true
---

Your Task

By Oct 23, plan your Oct/Nov project work. To that end, write an essay.

Setting up

Person1 has not done this subject and uses the most popular, complex, slowest, most thorough state-of-the-art tool.

Person2 has done this subject and is willing to make compromises in order to support:

  • model readability
  • actionable conclusions
  • learnability and repeatability of the results
    • i.e. using less CPU and less disk/RAM
  • multi-goal reasoning
  • anomaly detection (a little bell that goes "ping!" when new questions are not in the old data)
  • incremental (knows how to forget old knowledge and learn new stuff, when required)
    • many use an anomaly detector
  • sharable
    • can succinctly describe the training data (somehow)
    • and/or that data can be mutated to lower the odds of detecting individuals within the data
    • and/or that summarized/mutated data still works well for learning a model
  • context aware: generates different conclusions for different regions of the data
  • self-tuning (knows how to tune its own parameters)

How much does Person2's approach "cost", compared to Person1's, in terms of reduced functionality measured on some criteria?

Sections

Your essay should have the sections:

  • CRITERIA:
    • Briefly expand on the above criteria. Explain why each is useful/useless or easy/hard.
  • KEY CRITERIA:
    • Expand in detail on one (or more) of the above criteria that you most care about (explaining why). Hint: the fewer you care about, the less you will have to do in this project.
  • CRITIQUE:
    • For any domain (see below), what is the most popular, complex, slowest, most thorough state-of-the-art (SOA) tool?
    • On which of the above criteria does that SOA fail?
  • REVIEW:
    • What are simple engineering approaches that enable those criteria?
  • PLANNING: which of the REVIEWed technologies do you plan to implement and assess?
    • What is your plan for building it and comparing the performance of your simpler tool with the SOA, in order to determine what is won and lost between the simplest and the most complex?
    • For that comparison, consider using the stats.py tool

Domains

What to hand in


Hints

Note that you cannot read all of the following. Best to focus on one or two areas.

Methods for model readability

Measures of success here are, e.g., model size vs. model performance. Perhaps bigger models are better, but by how much? And is it really worth it?
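For example, a minimal sketch of that trade-off, assuming scikit-learn and a synthetic dataset (both assumptions; swap in your own learner and data):

```python
# Minimal sketch: sweep decision-tree depth and report model size
# (node count) against cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

for depth in [1, 2, 4, 8, 16, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    acc = cross_val_score(tree, X, y, cv=5).mean()
    size = tree.fit(X, y).tree_.node_count
    print(f"depth={depth} nodes={size} accuracy={acc:.3f}")
```

If accuracy barely moves while node count explodes, the bigger model is probably not worth it.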

Methods for Actionable Conclusions

  • Planners

Methods for learnability and repeatability of the results

i.e. using less CPU and less disk/RAM

  • Cluster the data, e.g. with mini-batch k-means or some of its more recent variants (https://goo.gl/uTBFu7); see the sketch after this list
    • Keep only a small number N of examples per cluster (see reservoir sampling: https://goo.gl/cppkuI)
    • Learn one model per cluster
  • Naive Bayes is pretty fast, low memory
    • See also NB variants that handle low-frequency data sets (see the TWCNB pseudocode in section 4.4 of https://goo.gl/bM6b87)
    • Use some fast clustering method to divide the data, then build one NB classifier per leaf (https://goo.gl/9PmLLB)
  • Work on very small models: models so small that it is cool to just blow them away every time you get new data
  • Data carving:
    • Just generate a tiny table of data from the data, and let people look at that: https://goo.gl/nAdJcb (username=password=guest)
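A minimal sketch of the cluster-then-sample recipe above, where `assign_cluster` is a hypothetical stand-in for, e.g., a trained mini-batch k-means predictor:

```python
# Minimal sketch: keep at most N examples per cluster using
# reservoir sampling over a stream of rows.
import random

N = 32              # examples kept per cluster
reservoirs = {}     # cluster id -> kept rows
seen = {}           # cluster id -> rows seen so far

def keep(row, assign_cluster):
    c = assign_cluster(row)
    pool = reservoirs.setdefault(c, [])
    seen[c] = seen.get(c, 0) + 1
    if len(pool) < N:
        pool.append(row)               # reservoir not yet full
    else:
        j = random.randrange(seen[c])  # classic reservoir-sampling step
        if j < N:
            pool[j] = row              # keep with probability N/seen
```

One model can then be trained per `reservoirs[c]`, which also serves the context-aware criterion below.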

Methods for multi-goal reasoning

  • Replace class with some aggregated number
    • e.g. the domination count in the assignment (sketched below)
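A minimal sketch of that aggregation, assuming all objectives are to be minimized:

```python
# Minimal sketch: replace several objective columns with one number,
# the count of how many other rows each row dominates.

def dominates(a, b):
    """a dominates b: no worse on every objective, better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def domination_counts(rows):
    return [sum(dominates(a, b) for b in rows if b is not a) for a in rows]

rows = [(0.1, 0.9), (0.2, 0.2), (0.9, 0.8)]
print(domination_counts(rows))  # -> [0, 1, 0]
```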

Rule mutation algorithms: each rule is a vector of size A*R+1 (A attributes, R ranges per attribute, 1 class variable) that is mutated via DE or GAs.
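A minimal sketch of such a mutator, simplified from textbook DE (which also applies crossover and picks three distinct donors); the names and ranges here are illustrative:

```python
# Minimal sketch: DE-style mutation of flat rule vectors of length A*R+1.
import random

def de_mutate(pop, f=0.8):
    """For each rule, propose rule + f*(a - b), clipped to [0, 1]."""
    out = []
    for rule in pop:
        a, b = random.sample(pop, 2)
        out.append([max(0.0, min(1.0, x + f * (y - z)))
                    for x, y, z in zip(rule, a, b)])
    return out

pop = [[random.random() for _ in range(7)] for _ in range(10)]  # A*R+1 = 7, say
pop = de_mutate(pop)
```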

Methods for anomaly detection

(A little bell that goes "ping!" when new questions are not in the old data.)

Learner-based:

  • If learner performance plateaus faster than the data generator changes states, then just use any learner and report anomalies when performance drops
  • e.g. SAWTOOTH http://menzies.us/pdf/05sawtooth.pdf
  • (caution: this paper was never accepted, sigh, even though it is cool++).
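A minimal sketch of that learner-based trigger (window size and drop threshold are made-up numbers, and the real SAWTOOTH details differ):

```python
# Minimal sketch: "ping" when windowed accuracy falls well below
# the best plateau seen so far.
from collections import deque

class Ping:
    def __init__(self, window=100, drop=0.1):
        self.recent = deque(maxlen=window)
        self.best = 0.0
        self.drop = drop

    def update(self, correct):
        self.recent.append(1 if correct else 0)
        acc = sum(self.recent) / len(self.recent)
        if len(self.recent) == self.recent.maxlen:
            if acc < self.best - self.drop:
                print(f"ping! accuracy fell from {self.best:.2f} to {acc:.2f}")
            self.best = max(self.best, acc)
```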

Cluster-based:

  • e.g. non-parametric: cluster the data and find the 20th to 80th percentile of the distances from items to their cluster's centroid; something is alien if, in its nearest cluster, its distance falls outside that 20-80 range (sketched below)
  • e.g. parametric: http://ai-at-wvu.blogspot.com/search?q=elkan
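A minimal sketch of the non-parametric version, assuming `centroids` maps cluster ids to centroid points and `members` maps cluster ids to that cluster's training rows:

```python
# Minimal sketch: flag an item as alien if its distance to its nearest
# centroid falls outside the 20th-80th percentile of that cluster's
# training distances.
import math

def percentile(xs, p):
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]

def is_alien(x, centroids, members):
    c = min(centroids, key=lambda k: math.dist(x, centroids[k]))
    ds = [math.dist(m, centroids[c]) for m in members[c]]
    d = math.dist(x, centroids[c])
    return not (percentile(ds, 20) <= d <= percentile(ds, 80))
```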

Methods for Incremental

(Knows how to forget old knowledge and learn new stuff, when required.) Many use an anomaly detector.

Mining over infinite streams

Active learning:

Naive Bayes is a pretty fast, low-memory, incremental algorithm:

  • See also NB variants that handle low-frequency data sets (see the TWCNB pseudocode in section 4.4 of https://goo.gl/bM6b87)
  • Use some fast clustering method to divide the data, then build one NB classifier per leaf (https://goo.gl/9PmLLB)
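A minimal sketch of such an incremental NB for discrete attributes (the Laplace-style smoothing here is a simple assumption, not TWCNB):

```python
# Minimal sketch: Naive Bayes whose counts update one row at a time,
# so forgetting can be as blunt as resetting the counts.
from collections import defaultdict
import math

class IncrementalNB:
    def __init__(self):
        self.n = 0
        self.klass = defaultdict(int)   # class -> count
        self.freq = defaultdict(int)    # (class, column, value) -> count

    def train(self, row, k):
        self.n += 1
        self.klass[k] += 1
        for col, value in enumerate(row):
            self.freq[(k, col, value)] += 1

    def classify(self, row):
        def loglike(k):
            s = math.log((self.klass[k] + 1) / (self.n + len(self.klass)))
            for col, value in enumerate(row):
                s += math.log((self.freq[(k, col, value)] + 1) /
                              (self.klass[k] + 2))
            return s
        return max(self.klass, key=loglike)
```

When the anomaly detector pings, `nb = IncrementalNB()` is the crudest possible forgetting step.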

If learning over an effectively infinite stream of examples, how soon before the learning stops improving?

Methods for Sharable

Can succinctly describe the training data (somehow)

  • Data carving (feature and instance selection, just share what is left): https://goo.gl/nAdJcb (username=password=guest)

And/or that data can be mutated to lower the odds of detecting individuals within the data

And/or that summarized/mutated data still works well for learning a model
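A minimal sketch of the mutation idea, just blurring numeric columns with noise scaled to each column's range; note this is naive noising for illustration, not a formal guarantee like differential privacy:

```python
# Minimal sketch: nudge every numeric value by a little range-scaled
# Gaussian noise before sharing the table.
import random

def blur(rows, scale=0.05):
    cols = list(zip(*rows))
    spans = [max(c) - min(c) or 1 for c in cols]   # per-column range
    return [[x + random.gauss(0, scale * s) for x, s in zip(row, spans)]
            for row in rows]
```

Whether a model learned from `blur(rows)` performs as well as one learned from `rows` is exactly the empirical question raised by the last bullet above.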

Methods for context awareness

  • Cluster the data, e.g. with mini-batch k-means or some of its more recent variants (https://goo.gl/uTBFu7); see the sketch after this list
    • Keep only a small number N of examples per cluster (see reservoir sampling: https://goo.gl/cppkuI)
    • Learn one model per cluster
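A minimal sketch of the per-region recipe, where `fit` is a hypothetical stand-in for any learner that returns a callable model:

```python
# Minimal sketch: one model per cluster, with queries routed to the
# model of the nearest centroid.
import math

def centroid(rows):
    return [sum(col) / len(col) for col in zip(*rows)]

def per_cluster_models(clusters, fit):
    # clusters: cluster id -> rows in that region of the data
    return {c: (centroid(rows), fit(rows)) for c, rows in clusters.items()}

def predict(x, models):
    _, model = min(models.values(), key=lambda cm: math.dist(x, cm[0]))
    return model(x)
```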

Methods for self-tuning
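One simple starting point is to let a tool search its own parameter space, e.g. with random search; a minimal sketch, where `score` is a hypothetical stand-in for, e.g., cross-validated performance and the grid values are made up:

```python
# Minimal sketch: self-tuning as random search over a parameter grid.
import random

def self_tune(space, score, budget=30):
    best, best_score = None, float("-inf")
    for _ in range(budget):
        params = {k: random.choice(vs) for k, vs in space.items()}
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best

space = {"bins": [3, 5, 7], "min_support": [2, 4, 8], "k": [1, 2, 3]}
```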