Handle Numeric features

This is an experimental feature.

https://github.com/BrikerMan/Kashgari/issues/90

Sometimes, besides the text itself, we have additional features such as text formatting (italic, bold, centered), position in the text, and more. Kashgari provides NumericFeaturesEmbedding and StackedEmbedding for this kind of data. Here are the details.

If you have a dataset like this:

```
token=NLP       start_of_p=True   bold=True   center=True    B-Category
token=Projects  start_of_p=False  bold=True   center=True    I-Category
token=Project   start_of_p=True   bold=True   center=False   B-Project-name
token=Name      start_of_p=False  bold=True   center=False   I-Project-name
token=:         start_of_p=False  bold=False  center=False   I-Project-name
```

First, numerize your additional features and convert your data to the format below. Remember to reserve 0 for padding.

```python
text = ['NLP', 'Projects', 'Project', 'Name', ':']
start_of_p = [1, 2, 1, 2, 2]
bold = [1, 1, 1, 1, 2]
center = [1, 1, 2, 2, 2]
label = ['B-Category', 'I-Category', 'B-Project-name', 'I-Project-name', 'I-Project-name']
```
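A minimal sketch of that conversion might look like the following. The helper name `to_feature_ids` is made up for illustration; it maps True to 1 and False to 2 so that 0 stays free for padding.

```python
# Illustrative helper (not part of Kashgari): turn boolean flags into
# integer ids, keeping 0 reserved for padding.
def to_feature_ids(values):
    return [1 if v else 2 for v in values]

start_of_p = to_feature_ids([True, False, True, False, False])
bold = to_feature_ids([True, True, True, True, False])
center = to_feature_ids([True, True, False, False, False])
print(start_of_p)  # [1, 2, 1, 2, 2]
```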

Now you have four input sequences and one output sequence. Prepare your embedding layers:

```python
import kashgari
from kashgari.embeddings import NumericFeaturesEmbedding, BareEmbedding, StackedEmbedding
import logging

logging.basicConfig(level='DEBUG')

text = ['NLP', 'Projects', 'Project', 'Name', ':']
start_of_p = [1, 2, 1, 2, 2]
bold = [1, 1, 1, 1, 2]
center = [1, 1, 2, 2, 2]
label = ['B-Category', 'I-Category', 'B-ProjectName', 'I-ProjectName', 'I-ProjectName']

text_list = [text] * 100
start_of_p_list = [start_of_p] * 100
bold_list = [bold] * 100
center_list = [center] * 100
label_list = [label] * 100

SEQUENCE_LEN = 100

# You can use WordEmbedding or BERTEmbedding for your text embedding
text_embedding = BareEmbedding(task=kashgari.LABELING, sequence_length=SEQUENCE_LEN)

start_of_p_embedding = NumericFeaturesEmbedding(feature_count=2,
                                                feature_name='start_of_p',
                                                sequence_length=SEQUENCE_LEN)

bold_embedding = NumericFeaturesEmbedding(feature_count=2,
                                          feature_name='bold',
                                          sequence_length=SEQUENCE_LEN)

center_embedding = NumericFeaturesEmbedding(feature_count=2,
                                            feature_name='center',
                                            sequence_length=SEQUENCE_LEN)

# The first embedding must be the text embedding
stack_embedding = StackedEmbedding([
    text_embedding,
    start_of_p_embedding,
    bold_embedding,
    center_embedding
])

x = (text_list, start_of_p_list, bold_list, center_list)
y = label_list
stack_embedding.analyze_corpus(x, y)

# Now we can embed with this stacked embedding layer
print(stack_embedding.embed(x))
```

Once the embedding layer is prepared, you can use it with all of the classification and labeling models:

```python
# We can build any labeling model with this embedding
from kashgari.tasks.labeling import BLSTMModel

model = BLSTMModel(embedding=stack_embedding)
model.fit(x, y)

print(model.predict(x))
print(model.predict_entities(x))
```
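Note that the model now expects multi-input data: x stays a tuple of parallel sequences, one per embedding, in the same order as the StackedEmbedding, so new samples for prediction have to be packed the same way. A minimal sketch (the sample values below are made up for illustration):

```python
# Hypothetical new sample: one token sequence plus one parallel feature
# sequence per NumericFeaturesEmbedding, in the same order as the stack.
new_text = [['Kashgari', 'Docs']]
new_start_of_p = [[1, 2]]
new_bold = [[2, 2]]
new_center = [[2, 2]]

new_x = (new_text, new_start_of_p, new_bold, new_center)
print(model.predict(new_x))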

This is the structure of this model:

(Figure 1: model structure)
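You can also print the structure yourself after fitting. A minimal sketch, assuming the underlying tf.keras model is exposed as the model's tf_model attribute (the attribute name is an assumption and may differ between Kashgari versions):

```python
# Print the layer-by-layer summary of the fitted model; `tf_model` is assumed
# to be the attribute holding the underlying tf.keras model.
model.tf_model.summary()
```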