Member Avatar for Malaoshi

Hi,

I am still new to C# and am working on an annotator:

1. Input some Chinese Text.
2. Click Button.
3. Output the Chinese text, with each word that is found in the dictionary wrapped in an HTML span (with a class), so I can use it on a website later.

Here is the code for the button click:

private void annobutton_Click(object sender, EventArgs e)
        {
            while (pos < inputBox.TextLength)
            {
                if (inputBox.Text.Substring(pos).Length < len)
                {
                    textpart = inputBox.Text.Substring(pos, inputBox.Text.Substring(pos).Length);
                }
                else
                {
                    textpart = inputBox.Text.Substring(pos, len);
                }

                if (!dict.ContainsKey(textpart) && len >= 1)
                {
                    len--;
                }

                else if (dict.ContainsKey(textpart) && len >= 1)
                {
                    str = str + "<span class=\"tttword\">" + textpart + "<span class=\"ttt\">" + textpart + " - " + dict[textpart]["py"] + "<br>" + dict[textpart]["de"] + "</span></span>";
                    pos = pos + len;
                    len = 5;
                }

                else if (len <= 0)
                {
                    textpart = inputBox.Text.Substring(pos, 1);
                    str = str + textpart;
                    pos++;
                    len = 5;
                }

            }
            outputBox.Text = str;
        }
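The loop above is a greedy longest-match: try the longest candidate at the current position, shrink it until the dictionary has it, and copy a single character through if nothing matches. As a rough sketch only, the same logic could live in a plain function that does not touch the form controls. The names `Annotator`, `Annotate`, and `maxLen` are made up for illustration; the HTML is the same as in the handler.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original project.
public static class Annotator
{
    // Greedy longest-match over the input text.
    // maxLen plays the role of the "len = 5" field in the form.
    public static string Annotate(
        string text,
        IDictionary<string, Dictionary<string, string>> dict,
        int maxLen)
    {
        StringBuilder sb = new StringBuilder();
        int pos = 0;

        while (pos < text.Length)
        {
            // Never read past the end of the text.
            int len = Math.Min(maxLen, text.Length - pos);

            // Shrink the candidate until it is a dictionary key (or nothing is left).
            while (len >= 1 && !dict.ContainsKey(text.Substring(pos, len)))
            {
                len--;
            }

            if (len >= 1)
            {
                string word = text.Substring(pos, len);
                sb.Append("<span class=\"tttword\">").Append(word)
                  .Append("<span class=\"ttt\">").Append(word).Append(" - ")
                  .Append(dict[word]["py"]).Append("<br>")
                  .Append(dict[word]["de"]).Append("</span></span>");
                pos += len;
            }
            else
            {
                // No entry starts here: copy one character unchanged.
                sb.Append(text[pos]);
                pos++;
            }
        }

        return sb.ToString();
    }
}
```

The click handler would then shrink to something like `outputBox.Text = Annotator.Annotate(inputBox.Text, dict, 5);`, and since no state is kept in fields, clicking twice should not append to the old output.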

And here's my problem: the dictionary file is too big (about 300,000 lines). How do I include it properly and make the program work? (It only needs to work on my computer, since I generate the output for my website and upload it manually.)

public partial class Form1 : Form
    {
        private string str = "";
        private string textpart = "";
        private int pos = 0;
        private int len = 5;
        private Dictionary<string, Dictionary<string, string>> dict = new Dictionary<string, Dictionary<string, string>>();


        public Form1()
        {
            InitializeComponent();
            
            dict["我"] = new Dictionary<string, string>();
            dict["我"]["py"] = "wo";
            dict["我"]["de"] = "wo";
            dict["你"] = new Dictionary<string, string>();
            dict["你"]["py"] = "ni";
            dict["你"]["de"] = "ni";
/* And a lot more of these lines - is there a better way of including the dictionary file? */

        }

Can you help me make that work? What is the best way of including such a large dictionary file? (I get the dictionary file here: http://www.handedict.de/chinesisch_deutsch.php?mode=dl )

If you have any further questions that would help you help me, just ask ;-)

One thing I would suggest is to keep your Dictionary processing completely separate from your UI. The loading of the dictionary can be triggered from the form, but should not be part of the form.

Can these be stored in a file or database?
If so, you can keep control of that data without having to recompile.
You will still load the data into the dictionary, but the values won't be hard-coded in your app.
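As a minimal sketch of that idea, assuming a purely hypothetical tab-separated file with one entry per line (`word<TAB>pinyin<TAB>meaning`), the loading could sit in its own small class. `DictionaryFile`, `Parse`, and `Load` are illustrative names, and the format is an assumption, not the real dictionary layout.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical loader, kept separate from the form.
public static class DictionaryFile
{
    // Builds the nested dictionary from already-read lines.
    // Assumed format per line: word<TAB>pinyin<TAB>meaning
    public static Dictionary<string, Dictionary<string, string>> Parse(IEnumerable<string> lines)
    {
        var dict = new Dictionary<string, Dictionary<string, string>>();

        foreach (string line in lines)
        {
            string[] parts = line.Split('\t');
            if (parts.Length < 3)
            {
                continue; // skip blank or malformed lines
            }

            var entry = new Dictionary<string, string>();
            entry["py"] = parts[1].Trim();
            entry["de"] = parts[2].Trim();
            dict[parts[0].Trim()] = entry; // a later duplicate overwrites an earlier one
        }

        return dict;
    }

    // Reads the file as UTF-8 and parses it.
    public static Dictionary<string, Dictionary<string, string>> Load(string path)
    {
        return Parse(File.ReadAllLines(path, Encoding.UTF8));
    }
}
```

The form would only call `DictionaryFile.Load(...)` once and keep the result, so the data can change without recompiling.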

Member Avatar for Malaoshi

That is a good idea, thank you thines01 - can you tell me how to do that in C#?

How to open a file? And what would be the best format for the data in the file? Can I just "include" the file or do I need to do something special?

I found "FileStream" and "StreamReader", are those the right things to work with?

It would be great if you could give me another hint on how to do this fairly easily ;-) I'll keep on googling. Thanks so far for pushing me in the right direction.

Member Avatar for Malaoshi

Okay, how far have I come?

I've added a load button that, when pressed, loads the file with StreamReader.

private void loadbutton_Click(object sender, EventArgs e)
        {
            StreamReader file = new StreamReader(@"C:\handedict_nb.u8");
            string strAllFile = file.ReadToEnd().Replace("\r\n", "\n").Replace("\n\r", "\n");
            string[] arrLines = strAllFile.Split(new char[] { '\n' });
            inputBox.Text = arrLines[1];
            file.Close();
        }

But I am not sure how to turn that file into this form: Dictionary<string, Dictionary<string, string>> dict = new Dictionary<string, Dictionary<string, string>>();

The file format is now: ChineseWord [Pronunciation] /Meaning/

我 [wo3] /I, me/
你 [ni3] /you/
...

How can I turn that into the dictionary I need? Here is how it might be done in PHP (someone showed me this, and as far as I can tell it should work):

$py = '';
$de = '';
$word = '';
if(preg_match('/\[(.*)\]/s',$line[$i],$r))
{
$py = trim($r[1]);
}
 
if(preg_match('/\/(.*)\//s',$line[$i],$r))
{
$de = trim($r[1]);
}
 
if(preg_match('/^(.*)\[/s',$line[$i],$r))
{
$word = trim($r[1]);
}
 
if($word == true && $py == true && $de == true)
{
$a[] = array($word => array('py' => $py,'de' => $de)); // here I would then simply put that into my dict Dictionary.
$b[] = sha1($word); # hashing
}

Can you tell me how to do that with C#? I hope it should work then, right? Or will the program still crash due to memory problems?
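For reference, a rough C# counterpart of the PHP snippet could look like the sketch below. It assumes every entry has exactly the `Word [pinyin] /meaning/` shape shown above and uses one combined pattern with named groups instead of three separate matches. `LineParser` and `TryParse` are illustrative names only.

```csharp
using System.Text.RegularExpressions;

// Hypothetical parser for one "Word [pinyin] /meaning/" line.
public static class LineParser
{
    // word: everything before the first '['
    // py:   everything between '[' and ']'
    // de:   everything between the first '/' after ']' and the last '/'
    private static readonly Regex EntryPattern =
        new Regex(@"^(?<word>[^\[]+)\[(?<py>[^\]]*)\]\s*/(?<de>.*)/\s*$");

    // Returns false (and empty strings) when the line does not look like an entry.
    public static bool TryParse(string line, out string word, out string py, out string de)
    {
        word = "";
        py = "";
        de = "";

        Match m = EntryPattern.Match(line);
        if (!m.Success)
        {
            return false;
        }

        word = m.Groups["word"].Value.Trim();
        py = m.Groups["py"].Value.Trim();
        de = m.Groups["de"].Value.Trim();
        return word.Length > 0;
    }
}
```

Lines that do not match (blank lines, a header, the `...` placeholder) simply return `false`, which mirrors the final `if` in the PHP version.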

The handedict_nb is big, but not that big (12 MB).
You could load that as static data when your web page is loaded (or see if it can be static pre-loaded data to the entire website).

1) Make a class library that does NOTHING but load the file into the Dictionary
- then add that as a reference to your current project
(example coming shortly)

2) Use the Regex class for holding the regular expressions

3) When parsing by CR/LF, you can do it without the Replace command
like: .Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
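To make point 3 concrete, here is a small sketch of splitting on both line-ending characters in one call. `LineSplitter` is just an illustrative name; `new[] { '\r', '\n' }` is the same separator set as `"\r\n".ToCharArray()`.

```csharp
using System;

// Hypothetical helper showing the Split overload from point 3.
public static class LineSplitter
{
    // Splits on '\r' and '\n' individually, so "\r\n", "\n" and "\n\r"
    // all work without a Replace pass. Empty entries are dropped,
    // which also removes blank lines from the result.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

One side effect worth knowing: because blank lines disappear, the array index no longer matches the line number in the file.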

Member Avatar for Malaoshi

Okay - but don't get me wrong, I will not upload the whole data to the webpage. I will annotate the Text offline and then put it online separately.

Right now I am trying to load that whole thing into a Dictionary. Trying to get the needed parts by searching each line with the Regex stuff. If that does not work I will prepare the file so that I can do the parsing via CR/LF.

Happily waiting for the example ;-)

Looking at it again, you can do this without the RegEx. The only difficulty (for me) would be the handling of the actual Chinese characters.

The file has a header, so the loader of the file would need to strip it.

In your class library (to be called ChineseToGerman), I would create three classes -- one for the object (representing the parsed row of data), another for a "master" of those objects (like a Dictionary) and one for loading that master.

I would first create a class for holding the individual (parsed) pieces of each input string. There is a constructor for handling new CChineseToGerman objects when fed a single string.

Examine this (yes, it's overkill):

//ChineseToGerman.cs (inside the ChineseToGerman class library)
using System;
using System.Data;

namespace ChineseToGerman
{
   public class CChineseToGerman
   {
      public string strChineseChars { get; set; }
      public string strStuffInBrackets { get; set; }
      public string strDescription { get; set; }

      public CChineseToGerman()
      {
         strChineseChars = "";
         strStuffInBrackets = "";
         strDescription = "";
      }

      public CChineseToGerman(CChineseToGerman copy)
      {
         strChineseChars = copy.strChineseChars;
         strStuffInBrackets = copy.strStuffInBrackets;
         strDescription = copy.strDescription;
      }

      public CChineseToGerman(string strData)
      {
         string[] arr_strData = strData.Split("\r\n[]/".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
         strChineseChars = arr_strData[0].Trim();
         strStuffInBrackets = arr_strData[1].Trim();
         // arr_strData[2] // There will be a space here, ignore it
         strDescription = arr_strData[3].Trim();
      }

      public CChineseToGerman(IDataReader rdr)
      {
         strChineseChars = rdr["CHINESE_CHARS"].ToString().Trim();
         strStuffInBrackets = rdr["STUFF_IN_BRACKETS"].ToString().Trim();
         strDescription = rdr["DESCRIPTION"].ToString().Trim();
      }

      public override string ToString()
      {
         return
            strChineseChars + '-' +
            strStuffInBrackets + '-' +
            strDescription;
      }

      public override int GetHashCode()
      {
         return this.ToString().GetHashCode();
      }

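      public override bool Equals(object obj)
      {
         // Guard against null and other types instead of casting blindly
         CChineseToGerman other = obj as CChineseToGerman;
         return other != null && this.ToString().Equals(other.ToString());
      }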
   }
}

Remember: YOU will need to decide how to handle the Chinese characters, but this class will parse the incoming string.
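One caveat with the string constructor above: it indexes `arr_strData[3]` directly, so a header line or any line with fewer pieces would throw, and an entry with several meanings (`/a/b/`) keeps only the first. A guarded variant might look like the sketch below. `EntrySplitter` and `TrySplit` are hypothetical names, and it still assumes there is a space between `]` and the first `/`.

```csharp
using System;

// Hypothetical guarded version of the Split-based parsing.
public static class EntrySplitter
{
    public static bool TrySplit(string line, out string chars, out string brackets, out string description)
    {
        chars = "";
        brackets = "";
        description = "";

        string[] parts = line.Split("\r\n[]/".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

        // Expected pieces: word, pinyin, the space after ']', then one or more meanings.
        if (parts.Length < 4)
        {
            return false;
        }

        chars = parts[0].Trim();
        brackets = parts[1].Trim();
        // parts[2] is the space between ']' and '/', ignored as in the original.
        // Re-join everything after it so that multiple meanings are all kept.
        description = string.Join("/", parts, 3, parts.Length - 3).Trim();
        return true;
    }
}
```

A caller can then skip lines where `TrySplit` returns `false` rather than wrapping every line in a try/catch.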

Member Avatar for Malaoshi

Hi ;-)

I finished it and it actually works ;-)

private void loadbutton_Click(object sender, EventArgs e)
        {
            StreamReader file = new StreamReader(@"C:\Users\Klingbeil\Documents\Visual Studio 2008\Projects\Annotator\Annotator\bin\Debug\handedict_nb.u8");
            string strAllFile = file.ReadToEnd().Replace("\r\n", "\n").Replace("\n\r", "\n");
            string[] arrLines = strAllFile.Split(new char[] { '\n' });
            Regex pinyin = new Regex(@"\[.*\]");
            Regex deutsch = new Regex(@"\/(.*)\/");
            Regex chinese = new Regex(@"^(.*)\[");
            /*if (Regex.IsMatch(arrLines[0], @"\[.*\]"))*/
            for(int i=0;i<=50000;i++)
            {
                Match cn = chinese.Match(arrLines[i]);
                Match py = pinyin.Match(arrLines[i]);
                Match de = deutsch.Match(arrLines[i]);
                string cnKey=Convert.ToString(cn).Trim('[').Trim();
                dict[cnKey] = new Dictionary<string, string>();
                dict[cnKey]["py"] = Convert.ToString(py).Trim('[').Trim(']');
                dict[cnKey]["de"] = Convert.ToString(de).Trim('/');
            }

            inputBox.Text = "Part 1 geladen";

            for (int i = 50000; i <= 100000; i++)
            {
                Match cn = chinese.Match(arrLines[i]);
                Match py = pinyin.Match(arrLines[i]);
                Match de = deutsch.Match(arrLines[i]);
                string cnKey = Convert.ToString(cn).Trim('[').Trim();
                dict[cnKey] = new Dictionary<string, string>();
                dict[cnKey]["py"] = Convert.ToString(py).Trim('[').Trim(']');
                dict[cnKey]["de"] = Convert.ToString(de).Trim('/');
            }

            inputBox.Text += "Part 2 geladen";

            for (int i = 100000; i <= arrLines.Length-1; i++)
            {
                Match cn = chinese.Match(arrLines[i]);
                Match py = pinyin.Match(arrLines[i]);
                Match de = deutsch.Match(arrLines[i]);
                string cnKey = Convert.ToString(cn).Trim('[').Trim();
                dict[cnKey] = new Dictionary<string, string>();
                dict[cnKey]["py"] = Convert.ToString(py).Trim('[').Trim(']');
                dict[cnKey]["de"] = Convert.ToString(de).Trim('/');
            }

            inputBox.Text += "Part 3 geladen";

            file.Close();
        }

Thanks for your help so far. If you have any suggestions, or if anything in that code is badly written, please let me know ;-)
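A few observations on the handler, offered as suggestions rather than a definitive fix: the three loops are identical apart from their bounds, the fixed indices 50000 and 100000 would throw if the file ever had fewer lines, and lines without a match all end up under the empty key. A possible single-loop sketch, using capture groups instead of `Trim('[')`, is below. `DictBuilder` and `Build` are made-up names; the patterns are the ones from the post, with lazy quantifiers added for the word and pinyin parts.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical replacement for the three copy-pasted loops.
public static class DictBuilder
{
    private static readonly Regex Chinese = new Regex(@"^(.*?)\[");
    private static readonly Regex Pinyin = new Regex(@"\[(.*?)\]");
    private static readonly Regex Deutsch = new Regex(@"/(.*)/");

    public static Dictionary<string, Dictionary<string, string>> Build(IEnumerable<string> lines)
    {
        var dict = new Dictionary<string, Dictionary<string, string>>();

        // One loop over however many lines there are: no fixed bounds.
        foreach (string line in lines)
        {
            Match cn = Chinese.Match(line);
            Match py = Pinyin.Match(line);
            Match de = Deutsch.Match(line);

            // Skip blank lines and anything that is not a complete entry.
            if (!cn.Success || !py.Success || !de.Success)
            {
                continue;
            }

            string key = cn.Groups[1].Value.Trim();
            if (key.Length == 0)
            {
                continue;
            }

            // Groups[1] is the text inside the delimiters, so no Trim('[') is needed.
            var entry = new Dictionary<string, string>();
            entry["py"] = py.Groups[1].Value.Trim();
            entry["de"] = de.Groups[1].Value.Trim();
            dict[key] = entry;
        }

        return dict;
    }
}
```

In the handler this would be called with `arrLines`, and wrapping the `StreamReader` in a `using` block would close the file even if something throws.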

And here again - the whole code (not sure if that'll help anyone but maybe it will)

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Windows.Forms;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        private string str = "";
        private string textpart = "";
        private int pos = 0;
        private int len = 5;
        private Dictionary<string, Dictionary<string, string>> dict = new Dictionary<string, Dictionary<string, string>>();
        

        public Form1()
        {
            InitializeComponent();
            
            
        }

        private void annobutton_Click(object sender, EventArgs e)
        {

           
            while (pos < inputBox.TextLength)
            {
                if (inputBox.Text.Substring(pos).Length < len)
                {
                    textpart = inputBox.Text.Substring(pos, inputBox.Text.Substring(pos).Length);
                }
                else
                {
                    textpart = inputBox.Text.Substring(pos, len);
                }

                if (!dict.ContainsKey(textpart) && len >= 1)
                {
                    len--;
                }

                else if (dict.ContainsKey(textpart) && len >= 1)
                {
                    str = str + "<span class=\"tttword\">" + textpart + "<span class=\"ttt\">" + textpart + " - " + dict[textpart]["py"] + "<br>" + dict[textpart]["de"] + "</span></span>";
                    pos = pos + len;
                    len = 5;
                }

                else if (len <= 0)
                {
                    textpart = inputBox.Text.Substring(pos, 1);
                    str = str + textpart;
                    pos++;
                    len = 5;
                }

            }
            outputBox.Text = str;
        }
        

        private void resetbutton_Click(object sender, EventArgs e)
        {
            outputBox.Text = "";
            str = "";
            len = 5;
            pos = 0;
        }

        private void loadbutton_Click(object sender, EventArgs e)
        {
            StreamReader file = new StreamReader(@"C:\Users\Klingbeil\Documents\Visual Studio 2008\Projects\Annotator\Annotator\bin\Debug\handedict_nb.u8");
            string strAllFile = file.ReadToEnd().Replace("\r\n", "\n").Replace("\n\r", "\n");
            string[] arrLines = strAllFile.Split(new char[] { '\n' });
            Regex pinyin = new Regex(@"\[.*\]");
            Regex deutsch = new Regex(@"\/(.*)\/");
            Regex chinese = new Regex(@"^(.*)\[");
            /*if (Regex.IsMatch(arrLines[0], @"\[.*\]"))*/
            for(int i=0;i<=50000;i++)
            {
                Match cn = chinese.Match(arrLines[i]);
                Match py = pinyin.Match(arrLines[i]);
                Match de = deutsch.Match(arrLines[i]);
                string cnKey=Convert.ToString(cn).Trim('[').Trim();
                dict[cnKey] = new Dictionary<string, string>();
                dict[cnKey]["py"] = Convert.ToString(py).Trim('[').Trim(']');
                dict[cnKey]["de"] = Convert.ToString(de).Trim('/');
            }

            inputBox.Text = "Part 1 geladen";

            for (int i = 50000; i <= 100000; i++)
            {
                Match cn = chinese.Match(arrLines[i]);
                Match py = pinyin.Match(arrLines[i]);
                Match de = deutsch.Match(arrLines[i]);
                string cnKey = Convert.ToString(cn).Trim('[').Trim();
                dict[cnKey] = new Dictionary<string, string>();
                dict[cnKey]["py"] = Convert.ToString(py).Trim('[').Trim(']');
                dict[cnKey]["de"] = Convert.ToString(de).Trim('/');
            }

            inputBox.Text += "Part 2 geladen";

            for (int i = 100000; i <= arrLines.Length-1; i++)
            {
                Match cn = chinese.Match(arrLines[i]);
                Match py = pinyin.Match(arrLines[i]);
                Match de = deutsch.Match(arrLines[i]);
                string cnKey = Convert.ToString(cn).Trim('[').Trim();
                dict[cnKey] = new Dictionary<string, string>();
                dict[cnKey]["py"] = Convert.ToString(py).Trim('[').Trim(']');
                dict[cnKey]["de"] = Convert.ToString(de).Trim('/');
            }

            inputBox.Text += "Part 3 geladen";

            file.Close();
        }
    }
}

After that, you will need a "Master" that will handle the collection of those objects. My "master" and my "loader" are based on an interface that you can ignore for now (comment it out).
Examine this:

//contained in ChineseToGermanMaster.cs (inside the ChineseToGerman class library)
using System.Collections.Generic;

namespace ChineseToGerman
{
   using IMaster;
   public class CChineseToGermanMaster : Dictionary<string, CChineseToGerman>, IMaster
   {
      private string _strFileName { get; set; }
      public CChineseToGermanMaster()
      {
         _strFileName = "";
      }

      public CChineseToGermanMaster(string strFileName)
      {
         _strFileName = strFileName;
      }

      public bool Load(ref string strError)
      {
         return (new CChineseToGermanLoader(_strFileName)).Load(this, ref strError);
      }
   }
}
Member Avatar for Malaoshi

Thanks again thines01 ;-) I will look into your solution. Hopefully I can use it to improve mine.

About the dictionary file: I manually deleted the first line, but actually I could have just started from the second line. I also manually deleted all the traditional characters.
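Both manual steps could probably be done in code instead. The sketch below rests on two assumptions that should be checked against the actual download: that header lines start with `#`, and that unedited entries have the form `Traditional Simplified [pinyin] /meaning/`, so the simplified form is the last word before the bracket. `HeadwordPicker` is a hypothetical name.

```csharp
using System;

// Hypothetical helper; the file layout described above is an assumption.
public static class HeadwordPicker
{
    // Returns the last word before '[' (assumed to be the simplified form),
    // or null for header, blank, or malformed lines.
    public static string SimplifiedHeadword(string line)
    {
        if (string.IsNullOrEmpty(line) || line[0] == '#')
        {
            return null; // assumed header/comment line
        }

        int bracket = line.IndexOf('[');
        if (bracket <= 0)
        {
            return null; // no pinyin block, not an entry
        }

        string[] words = line.Substring(0, bracket)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        return words.Length == 0 ? null : words[words.Length - 1];
    }
}
```

This also works on the already-edited file, since a line with only one word before the bracket returns that word.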

Glad it works. Let me finish the punch line before you go...

Keep in mind: What I'm describing can always be used for loading data of any type -- especially to KEEP IT AWAY FROM YOUR DISPLAY MECHANISM.

The Loader (the last piece of the trio for loading the data):

using System;
using System.IO;
using System.Text;

namespace ChineseToGerman
{
   using IMaster;
   public class CChineseToGermanLoader : IMasterLoader<CChineseToGermanMaster>
   {
      private string _strFileName { get; set; }
      public CChineseToGermanLoader()
      {
         _strFileName = "";
      }

      public CChineseToGermanLoader(string strFileName)
      {
         _strFileName = strFileName;
      }

      public bool LoadFromFile(CChineseToGermanMaster master, ref string strError)
      {
         bool blnRetVal = false;
         try
         {
            using (StreamReader fileIn = new StreamReader(_strFileName, Encoding.UTF8))
            {
               CChineseToGerman tempC2G = null;

               if(!fileIn.EndOfStream)
               {
                  blnRetVal = true;
                  fileIn.ReadLine();
               }

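               while (!fileIn.EndOfStream)
               {
                  // Parse each line exactly once and store that same object;
                  // calling ReadLine() again here would skip a line and file
                  // the next entry under the wrong key
                  tempC2G = new CChineseToGerman(fileIn.ReadLine());
                  if (!master.ContainsKey(tempC2G.strChineseChars))
                  {
                     master.Add(tempC2G.strChineseChars, tempC2G);
                  }
               }
               // No explicit Close() needed: the using block disposes the reader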
            }
         }
         catch (Exception exc)
         {
            blnRetVal = false;
            strError = exc.Message;
         }

         return blnRetVal;
      }

      public bool LoadFromDatabase(CChineseToGermanMaster master, ref string strError)
      {
         bool blnRetVal = true;

         try
         {
            throw new NotImplementedException("This feature is not yet implemented");
         }
         catch (Exception exc)
         {
            blnRetVal = false;
            strError = exc.Message;
         }

         return blnRetVal;
      }

      public bool Load(CChineseToGermanMaster master, ref string strError)
      {
         return LoadFromFile(master, ref strError);
      }
   }
}

And of course, a sample usage of the class library:

With this technique, I can load various forms of data from multiple sources and will always know how it is handled.

Also, with my IMaster interface, I can load multiple masters with one command.
I hope you can use this stuff some time.

using System;

namespace DW_397785
{
   using ChineseToGerman;

   class CDW_397785
   {
      static void Main(string[] args)
      {
         string strError = "";
         CChineseToGermanMaster masterC2G = new CChineseToGermanMaster("c:/science/DaniWeb/DW_397785/data/handedict_nb.u8");
         if (!masterC2G.Load(ref strError))
         {
            Console.WriteLine("Could not load Chinese to German master: " + strError);
            return;
         }

         Console.WriteLine("Finished");
      }
   }
}