The Lancaster Los Angeles Spoken Chinese Corpus (http://www.lancaster.ac.uk/fass/projects/corpus/LLSCC/) has about 1 million words. If that's still too small, you could try to create your own from existing data, such as TV shows, movie transcripts, etc. If you're interested in natural spoken interaction, the dataset will necessarily be small, considering the amount of work involved in transcription.
Thank you very much, Dr. Ai. May I follow up with two more questions regarding your advice:
1. If I use TV shows to create my own corpus, should I be concerned about the copyright of those shows, since the transcripts, audio, and video are all protected?
2. The link to LLSCC is valid, but I cannot see where to log in to the corpus. As I remember, the data in this corpus is not freely open to the public. It is fine even if there is a charge. Do you know how to get access to it?
And last, do you know what CHA files are and how to open such files?
Many thanks. Wishing you a happy Year of the Monkey.
To answer your first question, if you're only using the corpus you created for your own private use, then you should be fine. However, if you intend to publish or distribute the corpus, then you need to deal with the copyright issue. As for the LLSCC corpus, you might contact Professor Tao for details regarding its availability. I believe CHA files can be opened with the CLAN software (http://childes.psy.cmu.edu/clan/).
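In case it helps with the CHA question: these are plain-text transcripts in the CHAT format, so besides CLAN they can also be read with a short script. Below is only a minimal sketch, not a full CHAT parser. It assumes the usual layout, where header lines start with "@", speaker lines start with "*", dependent tiers start with "%", and continuation lines start with a tab. The function names `parse_chat` and `load_chat` are made up for this illustration, and the file name in the usage note is hypothetical.

```python
def parse_chat(text):
    """Split CHAT-format text into header lines and (speaker, utterance) pairs.

    Illustrative sketch only: dependent tiers (lines starting with "%")
    are skipped, and no CHAT annotation codes are interpreted.
    """
    headers, utterances = [], []
    last = None  # kind of the most recent non-continuation line
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("@"):
            headers.append(line)
            last = "header"
        elif line.startswith("*"):
            # Main tier: "*SPK:<tab>utterance"
            speaker, _, utterance = line[1:].partition(":")
            utterances.append((speaker.strip(), utterance.strip()))
            last = "utterance"
        elif line.startswith("%"):
            last = "dependent"
        elif line[0] in "\t " and last == "utterance":
            # Continuation of the previous speaker line
            speaker, utterance = utterances[-1]
            utterances[-1] = (speaker, utterance + " " + line.strip())
    return headers, utterances


def load_chat(path):
    """Read a .cha file from disk (assumed to be UTF-8) and parse it."""
    with open(path, encoding="utf-8") as f:
        return parse_chat(f.read())
```

For example, `headers, utterances = load_chat("sample.cha")` would give you the header lines and a list of speaker/utterance pairs that you can then count or search. For anything beyond a quick look, CLAN itself is the safer choice, since it understands the full set of CHAT conventions.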