Benford's Law [Archive] - Chuyên Hoàng Văn Thụ - Hòa Bình Student Forum

lnhoa

14-02-2009, 06:19 PM

Does everyone think the probability of the digit 1 appearing in a table of data is 10%?

In fact, as a leading digit it turns up far more often than that (more than 30% of the time). This is what Benford's Law describes. It is one of the basic results that help auditors detect fraud in a financial statement.

Let's translate this together and learn more about Benford's Law:

Following Benford's Law, or Looking Out for No. 1

By Malcolm W. Browne

(From The New York Times, Tuesday, August 4, 1998)

Dr. Theodore P. Hill asks his mathematics students at the Georgia Institute of Technology to go home and either flip a coin 200 times and record the results, or merely pretend to flip a coin and fake 200 results. The following day he runs his eye over the homework data, and to the students' amazement, he easily fingers nearly all those who faked their tosses. "The truth is," he said in an interview, "most people don't know the real odds of such an exercise, so they can't fake data convincingly."
There is more to this than a classroom trick.
Dr. Hill is one of a growing number of statisticians, accountants and mathematicians who are convinced that an astonishing mathematical theorem known as Benford's Law is a powerful and relatively simple tool for pointing suspicion at frauds, embezzlers, tax evaders, sloppy accountants and even computer bugs.
The income tax agencies of several nations and several states, including California, are using detection software based on Benford's Law, as are a score of large companies and accounting businesses.
Benford's Law is named for the late Dr. Frank Benford, a physicist at the General Electric Company. In 1938 he noticed that pages of logarithms corresponding to numbers starting with the numeral 1 were much dirtier and more worn than other pages.
(A logarithm is an exponent. Any number can be expressed as the fractional exponent -- the logarithm -- of some base number, such as 10. Published tables permit users to look up logarithms corresponding to numbers, or numbers corresponding to logarithms.)
Logarithm tables (and the slide rules derived from them) are not much used for routine calculating anymore; electronic calculators and computers are simpler and faster. But logarithms remain important in many scientific and technical applications, and they were a key element in Dr. Benford's discovery.
Dr. Benford concluded that it was unlikely that physicists and engineers had some special preference for logarithms starting with 1. He therefore embarked on a mathematical analysis of 20,229 sets of numbers, including such wildly disparate categories as the areas of rivers, baseball statistics, numbers in magazine articles and the street addresses of the first 342 people listed in the book "American Men of Science." All these seemingly unrelated sets of numbers followed the same first-digit probability pattern as the worn pages of logarithm tables suggested. In all cases, the number 1 turned up as the first digit about 30 percent of the time, more often than any other.
http://www.rexswain.com/benford1.jpg
(From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998)
Benford's law predicts a decreasing frequency of first digits, from 1 through 9. Every entry in the data sets developed by Benford for numbers appearing on the front pages of newspapers, by Mark Nigrini for 3,141 county populations in the 1990 U.S. Census, and by Eduardo Ley for the Dow Jones Industrial Average from 1990-93 follows Benford's law within 2 percent.

Dr. Benford derived a formula to explain this. If absolute certainty is defined as 1 and absolute impossibility as 0, then the probability of any number "d" from 1 through 9 being the first digit is log to the base 10 of (1 + 1/d). This formula predicts the frequencies of numbers found in many categories of statistics.
Probability predictions are often surprising. In the case of the coin-tossing experiment, Dr. Hill wrote in the current issue of the magazine American Scientist, a "quite involved calculation" revealed a surprising probability. It showed, he said, that the overwhelming odds are that at some point in a series of 200 tosses, either heads or tails will come up six or more times in a row.
Most fakers don't know this and avoid guessing long runs of heads or tails, which they mistakenly believe to be improbable. At just a glance, Dr. Hill can see whether or not a student's 200 coin-toss results contain a run of six heads or tails; if they don't, the student is branded a fake.
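The article does not reproduce the "quite involved calculation". As a rough sketch only, the chance of at least one run of six or more identical outcomes in 200 fair tosses can be worked out with a small dynamic program that tracks the length of the current run. The function name and state layout are illustrative assumptions, not Dr. Hill's own method.

```python
def prob_run_at_least(n_tosses, run_len):
    """Probability that n fair tosses contain a run of >= run_len equal outcomes.

    Heads and tails are symmetric, so only the length of the current run
    needs to be tracked, not which side it is made of.
    """
    if run_len < 2:
        raise ValueError("run_len must be at least 2")
    if n_tosses < run_len:
        return 0.0
    # alive[r] = probability that no long run has occurred yet and the
    # current run has length r (index 0 is unused).
    alive = [0.0] * run_len
    alive[1] = 1.0  # after the first toss the run length is 1
    for _ in range(n_tosses - 1):
        nxt = [0.0] * run_len
        nxt[1] = 0.5 * sum(alive)          # outcome changes, run restarts
        for r in range(1, run_len - 1):
            nxt[r + 1] = 0.5 * alive[r]    # outcome repeats, run grows
        # Runs that grow from run_len - 1 to run_len leave the "alive" set.
        alive = nxt
    return 1.0 - sum(alive)


if __name__ == "__main__":
    p = prob_run_at_least(200, 6)
    print(f"P(run of 6 or more in 200 tosses) = {p:.3f}")
```

The result is well above 90 percent, which is why a faked sheet with no long run stands out.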
Even more astonishing are the effects of Benford's Law on number sequences. Intuitively, most people assume that in a string of numbers sampled randomly from some body of data, the first non-zero digit could be any number from 1 through 9. All nine numbers would be regarded as equally probable.
But, as Dr. Benford discovered, in a huge assortment of number sequences -- random samples from a day's stock quotations, a tournament's tennis scores, the numbers on the front page of The New York Times, the populations of towns, electricity bills in the Solomon Islands, the molecular weights of compounds, the half-lives of radioactive atoms and much more -- this is not so.
Given a string of at least four numbers sampled from one or more of these sets of data, the chance that the first digit will be 1 is not one in nine, as many people would imagine; according to Benford's Law, it is 30.1 percent, or nearly one in three. The chance that the first number in the string will be 2 is only 17.6 percent, and the probabilities that successive numbers will be the first digit decline smoothly up to 9, which has only a 4.6 percent chance.
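The percentages quoted above follow directly from the formula log10(1 + 1/d). A minimal sketch (the function name is just for illustration):

```python
import math


def benford_first_digit_probs():
    """Return {d: P(first digit = d)} for d = 1..9, using log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


if __name__ == "__main__":
    for digit, p in benford_first_digit_probs().items():
        print(f"first digit {digit}: {100 * p:.1f}%")
```

The nine probabilities add up to 1 because the sum of log10((d + 1)/d) for d = 1..9 telescopes to log10(10).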
A strange feature of these probabilities is that they are "scale invariant" and "base invariant." For example, it doesn't matter whether the numbers are based on the dollar prices of stocks or their prices in yen or marks, nor does it matter if the numbers are in terms of stocks per dollar; provided there are enough numbers in the sample, the first digit of the sequence is more likely to be 1 than any other.
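Scale invariance can be checked numerically. The sketch below uses made-up data (values spread evenly on a logarithmic scale over six powers of ten, an assumption chosen because such data follow Benford's Law) and a hypothetical exchange rate of 137.5. It illustrates scale invariance only, not base invariance.

```python
import random
from collections import Counter


def first_digit(x):
    """Leading non-zero digit of a positive number."""
    # Scientific notation always starts with the leading digit.
    return int(f"{x:.15e}"[0])


def first_digit_freqs(values):
    """Relative frequency of each leading digit 1..9."""
    counts = Counter(first_digit(v) for v in values)
    total = len(values)
    return {d: counts[d] / total for d in range(1, 10)}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
prices = [10 ** rng.uniform(0, 6) for _ in range(50_000)]  # assumed "dollar" prices
rescaled = [p * 137.5 for p in prices]  # same prices in a hypothetical other currency
inverted = [1 / p for p in prices]      # "stocks per dollar"

if __name__ == "__main__":
    for name, data in (("original", prices), ("rescaled", rescaled), ("inverted", inverted)):
        freqs = first_digit_freqs(data)
        print(name, " ".join(f"{100 * freqs[d]:.1f}" for d in range(1, 10)))
```

All three rows should show nearly the same decreasing pattern, with 1 as the most common first digit.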
The larger and more varied the sampling of numbers from different data sets, mathematicians have found, the more closely the distribution of numbers approaches what Benford's Law predicted.
One of the experts putting this discovery to practical use is Dr. Mark J. Nigrini, an accounting consultant affiliated with the University of Kansas who this month joins the faculty of Southern Methodist University in Dallas.
Dr. Nigrini gained recognition a few years ago by applying a system he devised based on Benford's Law to some fraud cases in Brooklyn. The idea underlying his system is that if the numbers in a set of data like a tax return more or less match the frequencies and ratios predicted by Benford's Law, the data are probably honest. But if a graph of such numbers is markedly different from the one predicted by Benford's Law, he said, "I think I'd call someone in for a detailed audit."
http://www.rexswain.com/benford2.jpg
(From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998)
Benford's law can be used to test for fraudulent or random-guess data in income tax returns and other financial reports. Here the first significant digits of true tax data taken by Mark Nigrini from the lines of 169,662 IRS model files follow Benford's law closely. Fraudulent data taken from a 1995 Kings County, New York, District Attorney's Office study of cash disbursement and payroll in business do not follow Benford's law. Likewise, data taken from the author's study of 743 freshmen's responses to a request to write down a six-digit number at random do not follow the law. Although these are very specific examples, in general, fraudulent or concocted data appear to have far fewer numbers starting with 1 and many more starting with 6 than do true data.
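The article does not describe how Dr. Nigrini's system works internally. One generic way to compare a data set with Benford's Law is a chi-square goodness-of-fit statistic on the first digits, sketched below. This is a textbook test, not his program, and the two sample data sets are illustrative assumptions.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRIT_8DF_5PCT = 15.507


def leading_digit(n):
    """First digit of a positive integer."""
    return int(str(n)[0])


def benford_chi2(values):
    """Chi-square statistic of observed first digits against Benford's Law."""
    counts = Counter(leading_digit(v) for v in values)
    total = len(values)
    chi2 = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        chi2 += (counts[d] - expected) ** 2 / expected
    return chi2


# Powers of two are a standard example of a sequence that follows the law.
powers_of_two = [2 ** k for k in range(1, 1001)]
# Every three-digit number once: first digits are uniform, so the law fails.
three_digit = list(range(100, 1000))

if __name__ == "__main__":
    for name, data in (("powers of two", powers_of_two), ("100..999", three_digit)):
        stat = benford_chi2(data)
        verdict = "consistent" if stat < CHI2_CRIT_8DF_5PCT else "suspicious"
        print(f"{name}: chi2 = {stat:.1f} -> {verdict}")
```

A large statistic only says the digits do not look like Benford's distribution; as the article notes later, that is a reason to look closer, not proof of fraud.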

Some of the tests based on Benford's Law are so complex that they require a computer to carry out. Others are surprisingly simple; just finding too few ones and too many sixes in a sequence of data to be consistent with Benford's Law is sometimes enough to arouse suspicion of fraud.
Robert Burton, the chief financial investigator for the Brooklyn District Attorney, recalled in an interview that he had read an article by Dr. Nigrini that fascinated him.
"He had done his Ph.D. dissertation on the potential use of Benford's Law to detect tax evasion, and I got in touch with him in what turned out to be a mutually beneficial relationship," Mr. Burton said. "Our office had handled seven cases of admitted fraud, and we used them as a test of Dr. Nigrini's computer program. It correctly spotted all seven cases as 'involving probable fraud.'"
One of the earliest experiments Dr. Nigrini conducted with his Benford's Law program was an analysis of President Clinton's tax return. Dr. Nigrini found that it probably contained some rounded-off estimates rather than precise numbers, but he concluded that his test did not reveal any fraud.
The fit of number sets with Benford's Law is not infallible.
"You can't use it to improve your chances in a lottery," Dr. Nigrini said. "In a lottery someone simply pulls a series of balls out of a jar, or something like that. The balls are not really numbers; they are labeled with numbers, but they could just as easily be labeled with the names of animals. The numbers they represent are uniformly distributed, every number has an equal chance, and Benford's Law does not apply to uniform distributions."
Another problem Dr. Nigrini acknowledges is that some of his tests may turn up too many false positives. Various anomalies having nothing to do with fraud can appear for innocent reasons.
For example, the double digit 24 often turns up in analyses of corporate accounting, biasing the data, causing it to diverge from Benford's Law patterns and sometimes arousing suspicion wrongly, Dr. Nigrini said. "But the cause is not real fraud, just a little shaving. People who travel on business often have to submit receipts for any meal costing $25 or more, so they put in lots of claims for $24.90, just under the limit. That's why we see so many 24's."
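A spike such as the 24's can be looked for with the two-digit form of Benford's Law, in which the first two digits n (10 through 99) occur with probability log10(1 + 1/n). The sketch below is only an illustration: the expense data are invented, and the `factor` and `min_count` thresholds are arbitrary assumptions rather than anything Dr. Nigrini describes.

```python
import math
from collections import Counter


def first_two_digits(amount):
    """First two significant digits of a positive amount, e.g. 24.90 -> 24."""
    return int(f"{amount:.15e}".replace(".", "")[:2])


def overused_pairs(amounts, factor=3.0, min_count=5):
    """Digit pairs seen at least `factor` times more often than Benford predicts."""
    counts = Counter(first_two_digits(a) for a in amounts)
    total = len(amounts)
    flagged = {}
    for pair, seen in counts.items():
        expected = total * math.log10(1 + 1 / pair)
        if seen >= min_count and seen >= factor * expected:
            flagged[pair] = (seen, round(expected, 1))
    return flagged


# Invented data: a Benford-like background plus many claims just under $25.
background = [float(2 ** k) for k in range(1, 501)]
claims = background + [24.90] * 60

if __name__ == "__main__":
    for pair, (seen, expected) in sorted(overused_pairs(claims).items()):
        print(f"first two digits {pair}: seen {seen}, expected about {expected}")
```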
Dr. Nigrini said he believes that conformity with Benford's Law will make it possible to validate procedures developed to fix the Year 2000 problem -- the expectation that many computer systems will go awry because of their inability to distinguish the year 2000 from the year 1900. A variant of his Benford's Law software already in use, he said, could spot any significant change in a company's accounting figures between 1999 and 2000, thereby detecting a computer problem that might otherwise go unnoticed.
"I foresee lots of uses for this stuff, but for me it's just fascinating in itself," Dr. Nigrini said. "For me, Benford is a great hero. His law is not magic, but sometimes it seems like it."
Dow Illustrates Benford's Law

To illustrate Benford's Law, Dr. Mark J. Nigrini offered this example: "If we think of the Dow Jones stock average as 1,000, our first digit would be 1.
"To get to a Dow Jones average with a first digit of 2, the average must increase to 2,000, and getting from 1,000 to 2,000 is a 100 percent increase.
"Let's say that the Dow goes up at a rate of about 20 percent a year. That means that it would take five years to get from 1 to 2 as a first digit.
"But suppose we start with a first digit 5. It only requires a 20 percent increase to get from 5,000 to 6,000, and that is achieved in one year.
"When the Dow reaches 9,000, it takes only an 11 percent increase and just seven months to reach the 10,000 mark, which starts with the number 1. At that point you start over with the first digit a 1, once again. Once again, you must double the number -- 10,000 -- to 20,000 before reaching 2 as the first digit.
"As you can see, the number 1 predominates at every step of the progression, as it does in logarithmic sequences."

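The Dow example can be turned into a small calculation. The sketch below assumes steady growth of 20 percent a year with annual compounding, which is an assumption added here; the quote uses rounder figures, so the individual times come out shorter (about 3.8 years rather than five for the first doubling), but the pattern is the same. Under this assumption, the share of time spent on each first digit is exactly log10(1 + 1/d).

```python
import math


def years_with_leading_digit(d, annual_growth=0.20):
    """Years a steadily compounding index spends with leading digit d (1..9)."""
    # Time to grow from d * 10^k to (d + 1) * 10^k at the given yearly rate.
    return math.log((d + 1) / d) / math.log(1 + annual_growth)


years = {d: years_with_leading_digit(d) for d in range(1, 10)}
cycle = sum(years.values())  # time to climb one full decade, e.g. 1,000 -> 10,000
shares = {d: years[d] / cycle for d in years}

if __name__ == "__main__":
    for d in range(1, 10):
        print(f"digit {d}: {years[d]:.2f} years, {100 * shares[d]:.1f}% of the time")
```
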
Benford's Law Part 2 - The 80/20 Rule or Pareto Principle
Benford's law is useful in detecting fraudulent accounting data but may also have a wider meaning if the digits it evaluates are considered ranks or places. For example, the digits 1, 2, 3, ... 9 could be considered as representing first through ninth place in a contest. The digit's probability of occurring could be considered the relative share of total winnings for each place. In other words, 1st place would win 30.1%, 2nd place 17.6%, 3rd 12.5%, ... 9th place 4.6% of the available rewards.
Benford's law enables fraud detection in accounting data because the probability of getting a 1 for the first digit of a number is 30.1% instead of 11% or 1/9 as would normally be expected. The probability of obtaining any of the possible first digits 1 through 9 is calculated as follows:

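$$P = \log_{10}\left(1 + \frac{1}{n}\right) \qquad \text{(eqn 1)}$$
where: n = digit
For other number bases Benford's law becomes:

$$P = \frac{\log_{10}(n+1) - \log_{10} n}{\log_{10} B} = \frac{\log_{10}\left(1 + \frac{1}{n}\right)}{\log_{10} B} \qquad \text{(eqn 2)}$$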

http://www.intuitor.com/statistics/images/Benfor1.gif
where: B = number base
Figure 1 shows the probability of obtaining various first digits for different number bases. This could potentially be used as a model for ranked data sets of different sizes. In this case each digit would represent a ranked data point. For instance, we could model contributions to a charity such as the Red Cross. If 3 contributions are made to the Red Cross we would use the base 4 curve. Presumably the number one or largest donation would be about 50% of the total. The second highest would be 29.2% of the total and the third 20.8% (see Table 1). If 9 donations are made, we would use the base 10 curve: the highest one should be about 30.1% of the total, the second 17.6% and so on. This is interesting food for thought but not likely to be useful as a predictive model, at least for small data sets. In these cases random errors could easily overwhelm a correlation between the data points and the model.
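The base-B form of the law (eqn 2) is easy to evaluate. The sketch below uses illustrative function names, follows the text in using base count + 1 for a set of `count` ranked items, and prints the share of the total for each rank.

```python
import math


def benford_base(n, base):
    """Eqn 2: probability that the leading digit is n in the given number base."""
    if not 1 <= n < base:
        raise ValueError("n must satisfy 1 <= n < base")
    return math.log(1 + 1 / n) / math.log(base)


def shares_for_ranked_items(count):
    """Percent of the total for each rank, using the base (count + 1) curve."""
    return [round(100 * benford_base(n, count + 1), 1) for n in range(1, count + 1)]


if __name__ == "__main__":
    print("3 donations:", shares_for_ranked_items(3))
    print("9 donations:", shares_for_ranked_items(9))
```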

If we normalize the curve for each number base by dividing the individual values by the first value, something amazing happens. The various curves merge into a single number base independent curve. For example the first value of the base 4 curve is 50% and the second is 29.2%.

(29.25%)/(50%) = 58.5%
The second value divided by the first value gives 58.5 % for every curve regardless of the base! Table 1 shows the original values. Table 2 shows normalized data in which each value for a given base has been divided by the first value. Again, notice that the different curves have merged into a number base independent curve. This curve indicates that each ranked value has a defined percentage of the first or largest value.
We can derive an equation for this curve by dividing eqn 2 by the first value:

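$$P = \frac{\log_{10}\left(1 + \frac{1}{n}\right) / \log_{10} B}{\log_{10}\left(1 + \frac{1}{1}\right) / \log_{10} B}$$

$$P = \frac{\log_{10}\left(1 + \frac{1}{n}\right)}{\log_{10} 2} \qquad \text{(eqn 3)}$$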

The normalized Benford curve (see Figure 2) could be used as a model for ranked data such as the wealth of individuals in a country. In this case the second richest person in a country should have about 58.5% of the first person's wealth. The third richest person would have about 41.5% of the first person's wealth and so on. Since n can be any size, the normalized Benford curve could model a nation of any size, even a nation with billions of people. This model obviously indicates that most of a country's wealth would be controlled by a few individuals.
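Because the base cancels out, the normalized curve is simply log2(1 + 1/n). A short sketch (function names are illustrative) that evaluates it and checks numerically that the ratio does not depend on the base:

```python
import math


def percent_of_first(rank):
    """Eqn 3: value at `rank` as a percent of the rank-1 value."""
    return 100 * math.log2(1 + 1 / rank)


def percent_of_first_in_base(rank, base):
    """Same ratio computed from the base-B probabilities of eqn 2."""
    p_rank = math.log(1 + 1 / rank) / math.log(base)
    p_first = math.log(2) / math.log(base)
    return 100 * p_rank / p_first


if __name__ == "__main__":
    for rank in range(1, 10):
        print(f"rank {rank}: {percent_of_first(rank):.1f}% of the first value")
```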

Table 1: Percent of Total by Rank Using Benford's Law

        Base
Rank    2      3      4      5      6      7      8      9      10
1       100    63.1   50.0   43.1   38.7   35.6   33.3   31.5   30.1
2       -      36.9   29.2   25.2   22.6   20.8   19.5   18.5   17.6
3       -      -      20.8   17.9   16.1   14.8   13.8   13.1   12.5
4       -      -      -      13.9   12.5   11.5   10.7   10.2   9.7
5       -      -      -      -      10.2   9.4    8.8    8.3    7.9
6       -      -      -      -      -      7.9    7.4    7.0    6.7
7       -      -      -      -      -      -      6.4    6.1    5.8
8       -      -      -      -      -      -      -      5.4    5.1
9       -      -      -      -      -      -      -      -      4.6

Table 2: Percent of First Rank by Rank Using Benford's Law

        Base
Rank    2      3      4      5      6      7      8      9      10
1       100    100    100    100    100    100    100    100    100
2       -      58.5   58.5   58.5   58.5   58.5   58.5   58.5   58.5
3       -      -      41.5   41.5   41.5   41.5   41.5   41.5   41.5
4       -      -      -      32.2   32.2   32.2   32.2   32.2   32.2
5       -      -      -      -      26.3   26.3   26.3   26.3   26.3
6       -      -      -      -      -      22.2   22.2   22.2   22.2
7       -      -      -      -      -      -      19.3   19.3   19.3
8       -      -      -      -      -      -      -      17.0   17.0
9       -      -      -      -      -      -      -      -      15.2

In 1906 the Italian economist Vilfredo Pareto (1848-1923) determined that about 80% of Italy's wealth was controlled by about 20% of the people. This has evolved into the 80/20 Rule or Pareto principle, which is frequently applied to business or quality control problems. For example: 20% of the employees do 80% of the work, or 20% of the quality problems account for 80% of the rejects. This is only a rule of thumb since actual proportions are rarely exactly 20% and 80%. However, it's a very useful rule of thumb.
To determine if the Benford model gives results similar to those of the Pareto principle we use the normalized Benford equation in a computer program. This calculates the percent of the total wealth controlled by the top 20% of the most wealthy individuals in hypothetical countries of various sizes. Figure 3 shows the results. Based on this figure we would derive a 90/20 rule instead of an 80/20 rule. However, the result is still strikingly similar to Pareto's findings.
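The author's program is not shown, so the following is only a guess at what such a calculation might look like. It gives person n a wealth weight of log2(1 + 1/n) (the normalized Benford equation) and reports the share of the total held by the richest 20%. The population sizes are arbitrary examples, and since Figure 3 is not reproduced here, agreement with it is not claimed.

```python
import math


def top_share(population, top_fraction=0.20):
    """Share of total wealth held by the richest `top_fraction` of people."""
    top_count = max(1, int(population * top_fraction))
    # The sum of log2(1 + 1/n) for n = 1..m telescopes to log2(m + 1).
    return math.log2(top_count + 1) / math.log2(population + 1)


def top_share_by_summation(population, top_fraction=0.20):
    """Same quantity by direct summation, as a cross-check for small populations."""
    weights = [math.log2(1 + 1 / n) for n in range(1, population + 1)]
    top_count = max(1, int(population * top_fraction))
    return sum(weights[:top_count]) / sum(weights)


if __name__ == "__main__":
    for size in (100, 10_000, 1_000_000, 1_000_000_000):
        print(f"population {size:>13,}: top 20% hold {100 * top_share(size):.1f}%")
```

In this sketch the share grows slowly with population size and sits near 90% only for very large populations.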
If we total the GDPs (a measure of wealth) of the nations in the top 20% of the world's per capita GDPs (the most wealthy people) we get a similar finding: 20% of the world's people control 85% of the wealth. This is based on countries in which the GDP can be estimated. These countries account for about 5.8 billion people. The 85% figure is probably low since there is no data for some of the poorest countries. Also, using per capita GDP ignores the fact that some of the world's richest people live in poor countries and some poor people live in wealthy countries. Again the agreement with the normalized Benford model is surprisingly good and suggests that a 90/20 rule may describe the distribution of wealth in the world better than the 80/20 rule.
The correlation between the Benford model and wealth raises interesting questions. For example it implies that an increase in the discrepancy between the rich and poor may be a natural outcome of an increase in population. The model casts doubt on whether anything can realistically be done to eliminate the huge variability in wealth. According to the model, the wealth of the poor moves up and down with the wealth of the rich. Certainly, attempts to redistribute wealth using various communist systems have not been overly successful. They have tended to impoverish rich and poor alike.

http://www.intuitor.com/statistics/images/Benfor2.gif
http://www.intuitor.com/statistics/images/Benfor3.gif
The key weakness of the Benford model lies in the fact that it is dependent on a single data point, namely the first or highest value. All other values are divided by the first one. Random error in this number could give a poor fit between the Benford model and a set of ranked data. A constant can be used to compensate for errors in the first value. Using this method, the Benford model would become:

P = k(Log10 (1+1/n))

where: k = an experimentally developed constant

Unlike some correlation models, the Benford model is derived from basic principles. Its apparent correlation with a particular data set may still be pure coincidence, but it certainly raises interesting questions about the possibility of an underlying order.

http://www.rexswain.com/benford.html
http://www.intuitor.com/statistics/Benford%27s%20Law2.html